Abstract arXiv:1706.05587v3 [cs.CV] 5 Dec 2017

Rethinking Atrous Convolution for Semantic Image Segmentation

Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam
Google Inc.
{lcchen, gpapan, fschroff, …}

Abstract

In this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance.

We also elaborate on implementation details and share our experience on training our system. The proposed DeepLabv3 system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.

Introduction

For the task of semantic segmentation [20, 63, 14, 97, 7], we consider two challenges in applying Deep Convolutional Neural Networks (DCNNs) [50]. The first one is the reduced feature resolution caused by consecutive pooling operations or convolution striding, which allows DCNNs to learn increasingly abstract feature representations. However, this invariance to local image transformation may impede dense prediction tasks, where detailed spatial information is desired. To overcome this problem, we advocate the use of atrous convolution [36, 26, 74, 66], which has been shown to be effective for semantic image segmentation [10, 90, 11].
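To make the resolution loss concrete, the sketch below walks through a stack of strided layers and tracks the accumulated stride. Once a target stride is reached, it keeps later layers at stride 1 and grows an atrous rate instead. The function name `plan_atrous`, the list of strides, and the target value are all illustrative assumptions, not details taken from the paper.

```python
def plan_atrous(strides, target_output_stride):
    """Assign a (stride, atrous_rate) pair to each layer.

    Layers keep their stride until the accumulated stride reaches
    target_output_stride. After that, each layer runs at stride 1 with the
    current atrous rate, and the stride it would have applied is folded
    into the rate used by the layers that follow.
    """
    current_stride = 1
    rate = 1
    plan = []
    for s in strides:
        if current_stride >= target_output_stride:
            plan.append((1, rate))   # resolution is preserved here
            rate *= s                # later layers see a larger rate
        else:
            plan.append((s, 1))      # ordinary strided layer
            current_stride *= s
    return plan, current_stride


# Hypothetical network with five stride-2 stages.
plan, output_stride = plan_atrous([2, 2, 2, 2, 2], target_output_stride=8)
print(plan)           # [(2, 1), (2, 1), (2, 1), (1, 1), (1, 2)]
print(output_stride)  # 8
```

With no cap (a target of 32 here), every stage keeps its stride and the feature map ends up 32 times smaller than the input along each side.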

Atrous convolution, also known as dilated convolution, allows us to repurpose ImageNet [72] pretrained networks to extract denser feature maps by removing the downsampling operations from the last few layers and upsampling the corresponding filter kernels, equivalent to inserting holes ("trous" in French) between filter weights. With atrous convolution, one is able to control the resolution at which feature responses are computed within DCNNs without requiring learning extra parameters.

[Figure 1: three panels, each applying a 3 × 3 convolution kernel to a feature map, with rate = 1, rate = 6, and rate = 24.]
Figure 1. Atrous convolution with kernel size 3 × 3 and different rates. Standard convolution corresponds to atrous convolution with rate = 1. Employing large value of atrous rate enlarges the model's field-of-view, enabling object encoding at multiple scales.

Another difficulty comes from the existence of objects at multiple scales.
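The "inserting holes" description can be checked on a small 1-D example. The sketch below is a minimal pure-Python illustration, not the authors' implementation: the helper names are hypothetical, and zero padding at the borders is an assumption. It shows that sampling the input with a stride of `rate` between taps gives the same result as a standard convolution with a kernel that has `rate - 1` zeros inserted between its weights.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """1-D atrous convolution (correlation form), zero padded, same length.

    Tap j of the kernel reads the input at offset rate * (j - center).
    rate=1 reduces to a standard convolution.
    """
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + rate * (j - center)
            if 0 <= idx < len(signal):   # taps outside the input read zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def dilate_kernel(kernel, rate):
    """Insert rate - 1 zeros (the "holes") between consecutive weights."""
    dilated = []
    for j, w in enumerate(kernel):
        dilated.append(w)
        if j < len(kernel) - 1:
            dilated.extend([0.0] * (rate - 1))
    return dilated


def effective_kernel_size(k, rate):
    """Extent covered by a k-tap kernel with the given atrous rate."""
    return k + (k - 1) * (rate - 1)


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [1.0, -2.0, 1.0]
same = atrous_conv1d(signal, kernel, rate=2) == atrous_conv1d(
    signal, dilate_kernel(kernel, 2), rate=1
)
print(same)                                               # True
print([effective_kernel_size(3, r) for r in (1, 6, 24)])  # [3, 13, 49]
```

The number of learned weights stays at three in every case. Only the spacing between the taps, and therefore the field-of-view, changes with the rate.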

Several methods have been proposed to handle the problem and we mainly consider four categories in this work, as illustrated in Fig. 2. First, the DCNN is applied to an image pyramid to extract features for each scale input [22, 19, 69, 55, 12, 11] where objects at different scales become prominent at different feature maps. Second, the encoder-decoder structure [3, 71, 25, 54, 70, 68, 39] exploits multi-scale features from the encoder part and recovers the spatial resolution from the decoder part. Third, extra modules are cascaded on top of the original network for capturing long range information. In particular, DenseCRF [45] is employed to encode pixel-level pairwise similarities [10, 96, 55, 73], while [59, 90] develop several extra convolutional layers in cascade to gradually capture long range context. Fourth, spatial pyramid pooling [11, 95] probes an incoming feature map with filters or pooling operations at multiple rates and multiple effective field-of-views, thus capturing objects at multiple scales.

In this work, we revisit applying atrous convolution, which allows us to effectively enlarge the field of view of filters to incorporate multi-scale context, in the framework of both cascaded modules and spatial pyramid pooling.

Figure 2. Alternative architectures to capture multi-scale context: (a) Image Pyramid, (b) Encoder-Decoder, (c) Deeper w. Atrous Convolution, (d) Spatial Pyramid Pooling.

In particular, our proposed module consists of atrous convolution with various rates and batch normalization layers which we found important to be trained as well. We experiment with laying out the modules in cascade or in parallel (specifically, the Atrous Spatial Pyramid Pooling (ASPP) method [11]). We discuss an important practical issue when applying a 3 × 3 atrous convolution with an extremely large rate: it fails to capture long range information due to image boundary effects and effectively degenerates to a 1 × 1 convolution. We therefore propose to incorporate image-level features into the ASPP module.
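The boundary effect can be seen by counting, for every position on a feature map, how many of the nine taps of a 3 × 3 atrous filter land inside the map rather than in the zero padding. The sketch below is an illustration under assumed sizes: the function name and the 65 × 65 map used in the example are choices made here, not values stated in this section.

```python
def valid_tap_counts(size, rate, k=3):
    """Histogram of in-bounds taps for a k x k atrous filter.

    Returns a dict mapping "number of taps that fall inside a size x size
    map" to "number of positions with that count".
    """
    center = k // 2
    counts = {}
    for y in range(size):
        for x in range(size):
            n = 0
            for dy in range(k):
                for dx in range(k):
                    yy = y + rate * (dy - center)
                    xx = x + rate * (dx - center)
                    if 0 <= yy < size and 0 <= xx < size:
                        n += 1
            counts[n] = counts.get(n, 0) + 1
    return counts


# Small rate on a small map: interior positions use all 9 weights.
print(valid_tap_counts(size=5, rate=1))    # {4: 4, 6: 12, 9: 9}

# Rate as large as the map: only the centre weight ever touches real data,
# so the filter behaves like a 1 x 1 convolution at every position.
print(valid_tap_counts(size=65, rate=65))  # {1: 4225}
```

As the rate approaches the feature map size, fewer positions see more than the centre weight, which is why a very large rate stops adding long range context.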

Furthermore, we elaborate on implementation details and share experience on training the proposed models, including a simple yet effective bootstrapping method for handling rare and finely annotated objects. In the end, our proposed model, DeepLabv3, improves over our previous works [10, 11] and attains performance of […] on the PASCAL VOC 2012 test set without DenseCRF post-processing.

Related Work

It has been shown that global features or contextual interactions [33, 76, 43, 48, 27, 89] are beneficial in correctly classifying pixels for semantic segmentation. In this work, we discuss four types of Fully Convolutional Networks (FCNs) [74, 60] (see Fig. 2 for illustration) that exploit context information for semantic segmentation [30, 15, 62, 9, 96, 55, 65, 73, 87].

Image pyramid: The same model, typically with shared weights, is applied to multi-scale inputs.

Feature responses from the small scale inputs encode the long-range context, while the large scale inputs preserve the small object details. Typical examples include Farabet et al. [22], who transform the input image through a Laplacian pyramid, feed each scale input to a DCNN and merge the feature maps from all the scales. [19, 69] apply multi-scale inputs sequentially from coarse-to-fine, while [55, 12, 11] directly resize the input for several scales and fuse the features from all the scales. The main drawback of this type of model is that it does not scale well for larger/deeper DCNNs (e.g., networks like [32, 91, 86]) due to limited GPU memory and thus it is usually applied during the inference stage [16].

Encoder-decoder: This model consists of two parts: (a) the encoder, where the spatial dimension of feature maps is gradually reduced and thus longer range information is more easily captured in the deeper encoder output, and (b) the decoder, where object details and spatial dimension are gradually recovered.

For example, [60, 64] employ deconvolution [92] to learn the upsampling of low resolution feature responses. SegNet [3] reuses the pooling indices from the encoder and learns extra convolutional layers to densify the feature responses, while U-Net [71] adds skip connections from the encoder features to the corresponding decoder activations, and [25] employs a Laplacian pyramid reconstruction network. More recently, RefineNet [54] and [70, 68, 39] have demonstrated the effectiveness of models based on encoder-decoder structure on several semantic segmentation benchmarks. This type of model is also explored in the context of object detection [56, 77].

Context module: This model contains extra modules laid out in cascade to encode long-range context. One effective method is to incorporate DenseCRF [45] (with efficient high-dimensional filtering algorithms [2]) to DCNNs [10, 11].

Furthermore, [96, 55, 73] propose to jointly train both the CRF and DCNN components, while [59, 90] employ several extra convolutional layers on top of the belief maps of DCNNs (belief maps are the final DCNN feature maps that contain output channels equal to the number of predicted classes) to capture context information. Recently, [41] proposes to learn a general and sparse high-dimensional convolution (bilateral convolution), and [82, 8] combine Gaussian Conditional Random Fields and DCNNs for semantic segmentation.

Spatial pyramid pooling: This model employs spatial pyramid pooling [28, 49] to capture context at several ranges. The image-level features are exploited in ParseNet [58] for global context information. DeepLabv2 [11] proposes atrous spatial pyramid pooling (ASPP), where parallel atrous convolution layers with different rates capture multi-scale information. Recently, Pyramid Scene Parsing Net (PSP) [95] performs spatial pooling at several grid scales and demonstrates outstanding performance on several semantic segmentation benchmarks.
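Pooling "at several grid scales" can be sketched as averaging one feature map into progressively finer grids. The code below is a toy illustration only: the function names and the grid sizes (1, 2, 4) are assumptions chosen for the example, and it handles a single-channel map stored as a list of rows whose sides are at least as large as the finest grid.

```python
def grid_average_pool(fmap, grid):
    """Average a 2-D map (list of equal-length rows) into grid x grid bins."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for gy in range(grid):
        y0, y1 = gy * h // grid, (gy + 1) * h // grid
        row = []
        for gx in range(grid):
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            cells = [fmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pyramid_pool(fmap, grids=(1, 2, 4)):
    """Pool the same map at every grid size; coarse grids carry more context."""
    return {g: grid_average_pool(fmap, g) for g in grids}


fmap = [[float(4 * y + x) for x in range(4)] for y in range(4)]
pyramid = pyramid_pool(fmap)
print(pyramid[1])  # [[7.5]]
print(pyramid[2])  # [[2.5, 4.5], [10.5, 12.5]]
```

The 1 × 1 level is a global average and corresponds to an image-level feature, while finer grids retain more spatial layout.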

There are other methods based on LSTM [35] to aggregate global context [53, 6, 88]. Spatial pyramid pooling has also been applied in object detection [31].

In this work, we mainly explore atrous convolution [36, 26, 74, 66, 10, 90, 11] as a context module and tool for spatial pyramid pooling. Our proposed framework is general in the sense that it could be applied to any network. To be concrete, we duplicate several copies of the original last block in ResNet [32] and arrange them in cascade, and also revisit the ASPP module [11] which contains several atrous convolutions in parallel. Note that our cascaded modules are applied directly on the feature maps instead of belief maps. For the proposed modules, we experimentally find it important to train with batch normalization [38]. To further capture global context, we propose to augment ASPP with image-level features, similar to [58, 95].
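The idea of parallel atrous branches augmented with an image-level feature can be sketched on a 1-D signal. This is a toy under stated assumptions, not the DeepLabv3 module: the function names, the shared kernel, and the rates are hypothetical, and details such as batch normalization or any convolution applied after concatenation are left out.

```python
def atrous_conv1d(signal, kernel, rate):
    """1-D atrous convolution (correlation form), zero padded, same length."""
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + rate * (j - center)
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def toy_aspp(signal, kernel, rates=(1, 2, 4)):
    """Run parallel atrous branches and append an image-level channel.

    Returns one tuple per position: one value per rate, followed by the
    global average of the input broadcast to every position.
    """
    branches = [atrous_conv1d(signal, kernel, r) for r in rates]
    global_mean = sum(signal) / len(signal)
    branches.append([global_mean] * len(signal))  # image-level feature
    return [tuple(b[i] for b in branches) for i in range(len(signal))]


features = toy_aspp([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0], rates=(1, 2))
print(features[0])  # (3.0, 4.0, 2.5)
print(features[3])  # (7.0, 6.0, 2.5)
```

The last channel is identical at every position, so it supplies global context even when a large-rate branch reads mostly padding near the borders.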
