Dilated Residual Networks - CVF Open Access

Dilated Residual NetworksFisher YuPrinceton UniversityVladlen KoltunIntel LabsThomas FunkhouserPrinceton UniversityAbstractConvolutional Networks for image classification progres-sively reduce resolution until the image is represented bytiny feature maps in which the spatial structure of the sceneis no longer discernible. Such loss of spatial acuity can limitimage classification accuracy and complicate the transferof the model to downstream applications that require de-tailed scene understanding. These problems can be allevi-ated by dilation, which increases the resolution of outputfeature maps without reducing the receptive field of indi-vidual neurons. We show that Dilated Residual Networks (DRNs) outperform their non- Dilated counterparts in im-age classification without increasing the model s depth orcomplexity. We then study gridding artifacts introduced bydilation, develop an approach to removing these artifacts( degridding ), and show that this further increases the per-formance of DRNs.

In addition, we show that the accuracyadvantage of DRNs is further magnified in downstream ap-plications such as object localization and semantic IntroductionConvolutional Networks were originally developed forclassifying hand-written digits [9]. More recently, convolu-tional network architectures have evolved to classify muchmore complex images [8,13,14,6]. Yet a central aspect ofnetwork architecture has remained largely in place. Convo-lutional Networks for image classification progressively re-duce resolution until the image is represented by tiny featuremaps that retain little spatial information (7 7is typical).While convolutional Networks have done well, the al-most complete elimination of spatial acuity may be prevent-ing these models from achieving even higher accuracy, forexample by preserving the contribution of small and thinobjects that may be important for correctly understandingthe image. Such preservation may not have been importantin the context of hand-written digit classification, in whicha single object dominated the image, but may help in theanalysis of complex natural scenes where multiple objectsand their relative configurations must be taken into , image classification is rarely a convolu-tional network s raison d etre.

Image classification is mostoften a proxy task that is used to pretrain a model beforeit is transferred to other applications that involve more de-tailed scene understanding [4,10]. In such tasks, severe lossof spatial acuity is a significant handicap. Existing tech-niques compensate for the lost resolution by introducingup-convolutions [10,11], skip connections [5], and otherpost-hoc convolutional Networks crush the image in order toclassify it? In this paper, we show that this is not neces-sary, or even desirable. Starting with the Residual networkarchitecture, the current state of the art for image classifica-tion [6], we increase the resolution of the network s outputby replacing a subset of interior subsampling layers by di-lation [18]. We show that Dilated Residual Networks (DRNs)yield improved image classification performance. Specifi-cally, DRNs yield higher accuracy in ImageNet classifica-tion than their non- Dilated counterparts, with no increase indepth or model output resolution of a DRN on typical ImageNet in-put is28 28, comparable to small thumbnails that conveythe structure of the image when examined by a human [15].

While it may not be clear a priori that average pooling canproperly handle such high-resolution output, we show thatit can, yielding a notable accuracy gain. We then study grid-ding artifacts introduced by dilation, propose a scheme forremoving these artifacts, and show that such degridding further improves the accuracy of also show that DRNs yield improved accuracy ondownstream applications such as weakly-supervised objectlocalization and semantic segmentation. With a remarkablysimple approach, involving no fine-tuning at all, we obtainstate-of-the-art top-1 accuracy in weakly-supervised local-ization on ImageNet. We also study the performance ofDRNs on semantic segmentation and show, for example,that a 42-layer DRN outperforms a ResNet-101 baseline onthe Cityscapes dataset by more than 4 percentage points,despite lower depth by a factor of Dilated Residual NetworksOur key idea is to preserve spatial resolution in convolu-tional Networks for image classification.

Although progres-sive downsampling has been very successful in classifyingdigits or iconic views of objects, the loss of spatial infor-mation may be harmful for classifying natural images andcan significantly hamper transfer to other tasks that involvespatially detailed image understanding. Natural images of-ten feature many objects whose identities and relative con-figurations are important for understanding the scene. Theclassification task becomes difficult when a key object is notspatially dominant for example, when the labeled objectis thin ( , a tripod) or when there is a big backgroundobject such as a mountain. In these cases, the backgroundresponse may suppress the signal from the object of inter-est. What s worse, if the object s signal is lost due to down-sampling, there is little hope to recover it during , if we retain high spatial resolution throughout themodel and provide output signals that densely cover the in-put field, backpropagation can learn to preserve importantinformation about smaller and less salient starting point of our construction is the set of net-work architectures presented by He et al.

[6]. Each of thesearchitectures consists of five groups of convolutional lay-ers. The first layer in each group performs downsamplingby striding: that is, the convolutional filter is only evaluatedat even rows and columns. Let each group of layers be de-noted byG , for = 1, .. ,5. Denote theithlayer in group byG i. For simplicity of exposition, consider an idealizedmodel in which each layer consists of a single feature map:the extension to multiple feature maps is ibe the filter associated with layerG i. In the originalmodel, the output ofG iis(G i f i)(p) = a+b=pG i(a)f i(b),(1)where the domain ofpis the feature map inG i. This is fol-lowed by a nonlinearity, which does not affect the naive approach to increasing resolution in higher lay-ers of the network would be to simply remove subsampling(striding) from some of the interior layers. This does in-crease downstream resolution, but has a detrimental side ef-fect that negates the benefits: removing subsampling corre-spondingly reduces the receptive field in subsequent removing striding such that the resolution of the out-put layer is increased by a factor of 4 also reduces the recep-tive field of each output unit by a factor of 4.

This severelyreduces the amount of context that can inform the predictionproduced by each unit. Since contextual information is im-portant in disambiguating local cues [3], such reduction inreceptive field is an unacceptable price to pay for higher res-olution. For this reason, we use Dilated convolutions [18] toincrease the receptive field of the higher layers, compensat -ing for the reduction in receptive field induced by removingsubsampling. The effect is that units in the Dilated layershave the same receptive field as corresponding units in theoriginal focus on the two final groups of convolutional layers:G4andG5. In the original ResNet, the first layer in eachgroup (G41andG51) is strided: the convolution is evaluated ateven rows and columns, which reduces the output resolutionof these layers by a factor of 2 in each dimension. Thefirst step in the conversion to DRN is to remove the stridingin bothG41andG51. Note that the receptive field of eachunit inG41remains unaffected: we just doubled the outputresolution ofG41without affecting the receptive field of itsunits.

However, subsequent layers are all affected: theirreceptive fields have been reduced by a factor of 2 in eachdimension. We therefore replace the convolution operatorsin those layers by 2- Dilated convolutions [18]:(G4i 2f4i)(p) = a+2b=pG4i(a)f4i(b)(2)for alli 2. The same transformation is applied toG51:(G51 2f51)(p) = a+2b=pG51(a)f51(b).(3)Subsequent layers inG5follow two striding layers that havebeen eliminated. The elimination of striding has reducedtheir receptive fields by a factor of 4 in each convolutions need to be Dilated by a factor of 4 tocompensate for the loss:(G5i 4f5i)(p) = a+4b=pG5i(a)f5i(b)(4)for alli 2. Finally, as in the original architecture,G5is followed by global average pooling, which reduces theoutput feature maps to a vector, and a1 1convolution thatmaps this vector to a vector that comprises the predictionscores for all classes. The transformation of a ResNet intoa DRN is illustrated in converted DRN has the same number of layers andparameters as the original ResNet.

The key difference isthat the original ResNet downsamples the input image by afactor of 32 in each dimension (a thousand-fold reductionin area), while the DRN downsamples the input by a factorof 8. For example, when the input resolution is224 224,the output resolution ofG5in the original ResNet is7 7,which is not sufficient for the spatial structure of the input tobe discernable. The output ofG5in a DRN is28 28. Globalaverage pooling therefore takes in24times more values,which can help the classifier recognize objects that covera smaller number of pixels in the input image and take suchobjects into account in its d=1d=1 d=12ch/2w/2d=1 d=1h/4d=1 d=1 Group 4 Group 5(a) ResNetc2c4chwwwhd=1 d=1 d=2d=2 d=4hhwd=2 d=2d=4 d=4 Group 4 Group 5(b) DRNF igure 1: Converting a ResNet into a DRN. The originalResNet is shown in (a), the resulting DRN is shown in (b).Striding inG41andG51is removed, bringing the resolutionof all layers inG4andG5to the resolution ofG3.

To com-pensate for the consequent shrinkage of the receptive field,G4iandG51are Dilated by a factor of 2 andG5iare Dilated bya factor of 4, for alli ,2c, and4cdenote the num-ber of feature maps in a layer,wandhdenote feature mapresolution, anddis the dilation presented construction could also be applied to ear-lier groups of layers (G1,G2, orG3), in the limit retainingthe full resolution of the input. We chose not to do this be-cause a downsampling factor of 8 is known to preserve mostof the information necessary to correctly parse the originalimage at pixel level [10]. Furthermore, a28 28thumbnail,while small, is sufficiently resolved for humans to discernthe structure of the scene [15]. Additional increase in res-olution has costs and should not be pursued without com-mensurate gains: when feature map resolution is increasedby a factor of 2 in each dimension, the memory consump-tion of that feature map increases by a factor of 4. Operatingat full resolution throughout, with no downsampling at all,is beyond the capabilities of current LocalizationGiven a DRN trained for image classification, we can di-rectly produce dense pixel-level class activation maps with-out any additional training or parameter tuning.

Dilated Residual Networks - CVF Open Access

Tags:

Information

Advertisement

Transcription of Dilated Residual Networks - CVF Open Access

Related search queries

Dilated Residual Networks - CVF Open Access

Tags:

Information

Advertisement

Documents from same domain

Related documents

Related search queries