RefineNet: Multi-Path Refinement Networks for High ...

RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation

Guosheng Lin1, Anton Milan2, Chunhua Shen2,3, Ian Reid2,3
1 Nanyang Technological University, 2 University of Adelaide, 3 Australian Centre for Robotic Vision

Abstract

Recently, very deep convolutional neural networks (CNNs) have shown outstanding performance in object recognition and have also been the first choice for dense classification problems such as semantic segmentation. However, repeated subsampling operations like pooling or convolution striding in deep CNNs lead to a significant decrease in the initial image resolution. Here, we present RefineNet, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. In this way, the deeper layers that capture high-level semantic features can be directly refined using fine-grained features from earlier convolutions.

The individual components of RefineNet employ residual connections following the identity mapping mindset, which allows for effective end-to-end training. Further, we introduce chained residual pooling, which captures rich background context in an efficient manner. We carry out comprehensive experiments and set new state-of-the-art results on seven public datasets. In particular, we achieve an intersection-over-union score on the challenging PASCAL VOC 2012 dataset that is the best reported result to date.

Introduction

Semantic segmentation is a crucial component in image understanding. The task here is to assign a unique label (or category) to every single pixel in the image, which can be considered as a dense classification problem. The related problem of so-called object parsing can usually be cast as semantic segmentation. Recently, deep learning methods, and in particular convolutional neural networks (CNNs), e.g., VGG [42] and Residual Net [24], have shown remarkable results in recognition tasks.
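To make the intersection-over-union (IoU) score mentioned above concrete, the following is a minimal sketch in plain Python. It assumes label maps are given as nested lists of integer class indices and averages per-class IoU over a single image; the function name `mean_iou` and the toy inputs are illustrative only, and benchmark evaluations such as PASCAL VOC accumulate the counts over a whole dataset rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in either label map.

    pred, target: nested lists (H x W) of integer class indices.
    """
    ious = []
    for c in range(num_classes):
        inter = 0
        union = 0
        for p_row, t_row in zip(pred, target):
            for p, t in zip(p_row, t_row):
                in_pred = (p == c)
                in_target = (t == c)
                inter += int(in_pred and in_target)
                union += int(in_pred or in_target)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy 2 x 3 example with two classes (hypothetical data).
pred = [[0, 0, 1],
        [0, 1, 1]]
target = [[0, 0, 1],
          [0, 0, 1]]
score = mean_iou(pred, target, num_classes=2)
print(round(score, 4))  # class 0: 3/4, class 1: 2/3
```

Because every pixel carries exactly one label, a single misclassified pixel lowers the IoU of two classes at once: the one it was wrongly assigned to and the one it was taken from.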

However, these approaches exhibit clear limitations when it comes to dense prediction in tasks like dense depth or normal estimation [13,33,34] and semantic segmentation [36,5]. Multiple stages of spatial pooling and convolution strides reduce the final image prediction typically by a factor of 32 in each dimension, thereby losing much of the finer image structure. One way to address this limitation is to learn deconvolutional filters as an up-sampling operation [38,36] to generate high-resolution feature maps. However, the deconvolution operations are not able to recover the low-level visual features which are lost after the downsampling operation in the convolution forward stage. Therefore, they are unable to output accurate high-resolution predictions. Low-level visual information is essential for accurate prediction on boundaries and details.

Figure 1. Example results of our method on the task of object parsing (left) and semantic segmentation (right).

Footnote: This work was done when G. Lin was with The University of Adelaide and Australian Centre for Robotic Vision.
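The factor of 32 quoted above can be reproduced with a little arithmetic. The sketch below assumes, purely for illustration, a network with five stride-2 stages and ceiling rounding at each stage; the helper name `output_size` and the input size of 512 are not taken from the paper.

```python
def output_size(input_size, strides):
    """Spatial size along one dimension after a sequence of strided stages.

    Assumes each stage rounds up, as with 'same'-style padding.
    """
    size = input_size
    for s in strides:
        size = -(-size // s)  # ceiling division
    return size


strides = [2, 2, 2, 2, 2]  # assumed: five stride-2 pooling/convolution stages

total_stride = 1
for s in strides:
    total_stride *= s

print(total_stride)               # 32
print(output_size(512, strides))  # 16
```

Under these assumptions a 512 x 512 input yields a 16 x 16 prediction, so each output cell summarises a 32 x 32 patch of input pixels, which is why thin structures and boundaries are hard to recover from the final feature map alone.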

The DeepLab method recently proposed by Chen et al. [6] employs atrous (or dilated) convolutions to account for larger receptive fields without downscaling the image. DeepLab is widely applied and represents state-of-the-art performance on semantic segmentation. This strategy, although successful, has at least two limitations. First, it needs to perform convolutions on a large number of detailed (high-resolution) feature maps that usually have high-dimensional features, which is computationally expensive. Moreover, a large number of high-dimensional and high-resolution feature maps also require huge GPU memory resources, especially in the training stage. This hampers the computation of high-resolution predictions and usually limits the output size to 1/8 of the original input. Second, dilated convolutions introduce a coarse sub-sampling of features, which potentially leads to a loss of important details.

Another type of method exploits features from intermediate layers for generating high-resolution predictions, e.g., the FCN method in [36] and Hypercolumns in [22].
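As a rough illustration of the atrous (dilated) convolution idea discussed above, here is a one-dimensional sketch in plain Python. It is not DeepLab's implementation: the function names and the toy signal are assumptions, and the operation is the unflipped "valid" cross-correlation that deep learning frameworks usually call convolution.

```python
def effective_kernel_size(kernel_size, dilation):
    """Number of input samples spanned by a dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D cross-correlation with taps spaced `dilation` samples apart."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        acc = 0
        for i, w in enumerate(kernel):
            acc += w * signal[start + i * dilation]
        out.append(acc)
    return out


signal = list(range(10))  # 0, 1, ..., 9
kernel = [1, 1, 1]

print(effective_kernel_size(3, 2))        # 5
print(dilated_conv1d(signal, kernel, 1))  # [3, 6, 9, 12, 15, 18, 21, 24]
print(dilated_conv1d(signal, kernel, 2))  # [6, 9, 12, 15, 18, 21]
```

With dilation 2 the three-tap kernel covers five input samples without any extra weights, but it skips every other sample inside that window. This is the coarse sub-sampling of features that the second limitation refers to.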

The intuition behind these works is that features from middle layers are expected to describe mid-level representations for object parts, while retaining spatial information. This information is thought to be complementary to the features from early convolution layers, which encode low-level spatial visual information like edges, corners, circles, etc., and also complementary to high-level features from deeper layers, which encode high-level semantic information, including object- or category-level evidence, but which lack strong spatial information.

We argue that features from all levels are helpful for semantic segmentation. High-level semantic features help the category recognition of image regions, while low-level visual features help to generate sharp, detailed boundaries for high-resolution prediction. How to effectively exploit middle layer features remains an open question and deserves more attention. To this end, we propose a novel network architecture which effectively exploits multi-level features for generating high-resolution predictions.

Our main contributions are as follows:

1. We propose a multi-path refinement network (RefineNet) which exploits features at multiple levels of abstraction for high-resolution semantic segmentation. RefineNet refines low-resolution (coarse) semantic features with fine-grained low-level features in a recursive manner to generate high-resolution semantic feature maps. Our model is flexible in that it can be cascaded and modified in various ways.

2. Our cascaded RefineNets can be effectively trained end-to-end, which is crucial for best prediction performance. More specifically, all components in RefineNet employ residual connections [24] with identity mappings [25], such that gradients can be directly propagated through short-range and long-range residual connections, allowing for both effective and efficient end-to-end training.

3. We propose a new network component we call chained residual pooling, which is able to capture background context from a large image region. It does so by efficiently pooling features with multiple window sizes and fusing them together with residual connections and learnable weights.

4. The proposed RefineNet achieves new state-of-the-art performance on 7 public datasets, including PASCAL VOC 2012, PASCAL-Context, NYUDv2, SUN-RGBD, Cityscapes, ADE20K, and the object parsing Person-Parts dataset.

In particular, we achieve an IoU score on the PASCAL VOC 2012 dataset that outperforms the currently best approach, DeepLab, by a large margin.

To facilitate future research, we release both source code and trained models for our method.

Related Work

CNNs have become the most successful methods for semantic segmentation in recent years. The early methods in [18,23] are region-proposal-based methods which classify region proposals to generate segmentation results. Recently, fully convolutional networks (FCNNs) [36,5,10] have shown effective feature generation and end-to-end training, and have thus become the most popular choice for semantic segmentation. FCNNs have also been widely applied in other dense-prediction tasks, e.g., depth estimation [15,13,33], image restoration [14], and image super-resolution [12]. The proposed method here is also based on fully convolutional-style networks.

FCNN-based methods usually have the limitation of low-resolution prediction. There are a number of proposed techniques which address this limitation and aim to generate high-resolution predictions.

The atrous convolution approach DeepLab-CRF in [5] directly outputs a middle-resolution score map and then applies a dense CRF method [27] to refine boundaries by leveraging color contrast information. CRF-RNN [47] extends this approach by implementing recurrent layers for end-to-end learning of the dense CRF and FCNN. Deconvolution methods [38,2] learn deconvolution layers to upsample the low-resolution predictions. The depth estimation method in [34] employs super-pixel pooling to output high-resolution predictions.

There exist several methods which exploit middle layer features for segmentation. The FCN method of Long et al. [36] adds prediction layers to middle layers to generate prediction scores at multiple resolutions. They average the multi-resolution scores to generate the final prediction mask. Their system is trained in a stage-wise manner rather than end-to-end. The Hypercolumn approach [22] merges features from middle layers and learns dense classification layers. That method also employs a stage-wise training strategy instead of end-to-end training.
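The idea of averaging prediction scores produced at several resolutions, as described above for the FCN method, can be sketched as follows. This is only an illustration under simplifying assumptions: a single-class score map, nearest-neighbour upsampling in place of the interpolation an actual network would use, and hypothetical helper names and toy values.

```python
def upsample_nearest(score_map, factor):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out


def average_maps(maps):
    """Element-wise mean of equally sized 2-D score maps."""
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) / len(maps) for x in range(width)]
            for y in range(height)]


# Hypothetical scores for one class at two resolutions.
coarse = [[1.0, 0.0]]             # 1 x 2, from a deep layer
fine = [[1.0, 1.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 0.0]]     # 2 x 4, from a middle layer

fused = average_maps([upsample_nearest(coarse, 2), fine])
print(fused)  # [[1.0, 1.0, 0.0, 0.5], [0.5, 1.0, 0.0, 0.0]]
```

In the fused map the coarse scores fix the overall region while the finer scores adjust individual pixels; where the two disagree the average lands at 0.5.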

Both SegNet [2] and U-Net [40] apply skip-connections in the deconvolution architecture to exploit the features from middle layers.

Although several related works exist, it still remains an open question how to effectively exploit middle layer features. We propose a novel network architecture, RefineNet, to address this question. The network architecture of RefineNet is distinct from existing methods. It consists of a number of specially designed components which are able to refine the coarse high-level semantic features by exploiting low-level visual features. In particular, RefineNet employs short-range and long-range residual connections with identity mappings, which enable effective end-to-end training of the whole system and thus help to achieve superior performance. Comprehensive empirical results clearly verify the effectiveness of our novel network architecture for exploiting middle layer features.

Background

Before presenting our approach, we first review the structure of fully convolutional networks for semantic segmentation [36] in more detail and also discuss the recent dilated convolution technique [6], which is specifically designed to generate high-resolution predictions.

Very deep CNNs have shown outstanding performance on object recognition problems.

Specifically, the recently proposed residual network (ResNet) [24] has shown step-change improvements over earlier architectures, and ResNet models pre-trained for ImageNet recognition tasks are publicly available. Because of this, in the following we adopt ResNet as our fundamental building block for semantic segmentation. Note, however, that replacing it with any other deep network is straightforward.

Since semantic segmentation can be cast as a dense classification problem, the ResNet model can be easily modified for this task. This is achieved by replacing the single-label prediction layer with a dense prediction layer that outputs the classification confidence for each class at every pixel. This approach is illustrated in panel (a) of the accompanying figure. As can be seen, during the forward pass in ResNet, the resolution of the feature maps (layer outputs) is decreased, while the feature depth, i.e. the number of feature maps per layer (or channels), is increased. The former is caused by striding during convolutional and pooling operations.

The ResNet layers can be naturally divided into 4 blocks according to the resolution of the output feature maps, as shown in panel (a).
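A dense prediction layer of the kind described above can be viewed as the same linear classifier applied independently at every spatial position, followed by a softmax over classes. The sketch below illustrates this with nested Python lists; the function name `dense_prediction`, the feature values, and the weights are all hypothetical and only meant to show the shape of the computation, not the layer used in the paper.

```python
import math


def dense_prediction(features, weights, biases):
    """Per-pixel class confidences.

    features: H x W x C nested lists (one C-dimensional vector per pixel).
    weights:  K x C, biases: K (one linear classifier per class).
    Returns an H x W x K nested list of softmax probabilities.
    """
    out = []
    for row in features:
        out_row = []
        for pixel in row:
            scores = [sum(w * f for w, f in zip(w_k, pixel)) + b
                      for w_k, b in zip(weights, biases)]
            top = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            out_row.append([e / total for e in exps])
        out.append(out_row)
    return out


# Toy 1 x 2 feature map with 2 channels and 2 classes (illustrative values).
features = [[[1.0, 0.0], [0.0, 1.0]]]
weights = [[2.0, 0.0],
           [0.0, 2.0]]
biases = [0.0, 0.0]

probs = dense_prediction(features, weights, biases)
labels = [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]
print(labels)  # [[0, 1]]
```

Taking the arg-max over the class dimension at each pixel turns the confidence maps into a label map of the same spatial size as the feature map, which is why the resolution of the last feature map bounds the resolution of the prediction.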
