Abstract arXiv:1411.4038v2 [cs.CV] 8 Mar 2015

Fully Convolutional Networks for Semantic Segmentation

Jonathan Long, Evan Shelhamer, Trevor Darrell
UC Berkeley

Abstract

Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [19], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [4] to the segmentation task.

We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement in mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.

Introduction

Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [19, 31, 32], but also making progress on local tasks with structured output. These include advances in bounding box object detection [29, 12, 17], part and keypoint prediction [39, 24], and local correspondence [24, 9].

The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel.

Prior approaches have used convnets for semantic segmentation [27, 2, 8, 28, 16, 14, 11], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.

(Authors contributed equally.)

[Figure: Fully convolutional networks can efficiently learn to make dense predictions for per-pixel tasks like semantic segmentation. Diagram labels: forward/inference, backward/learning, pixelwise prediction, segmentation.]

We show that a fully convolutional network (FCN), trained end-to-end, pixels-to-pixels on semantic segmentation, exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation.
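The claim that a fully convolutional version of a classifier predicts dense outputs from arbitrary-sized inputs can be illustrated with a toy one-dimensional case. The sketch below is not the paper's implementation: the names `fully_connected` and `as_convolution`, the weights, and the 1D setting are all assumptions chosen for illustration. It treats a fully connected layer as a kernel that spans its whole original input, then slides that kernel over a longer input.

```python
def fully_connected(x, weights, bias):
    """Fixed-size classifier head: one score per class for an input of fixed length."""
    assert len(x) == len(weights[0]), "a fully connected layer needs a fixed input size"
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def as_convolution(x, weights, bias, stride=1):
    """Reuse the same weights as a kernel covering the original input size,
    slid over an input of any length >= kernel size (no padding).
    Returns one score vector per spatial position."""
    k = len(weights[0])
    return [
        fully_connected(x[i:i + k], weights, bias)
        for i in range(0, len(x) - k + 1, stride)
    ]


# Hypothetical two-class head over 4 inputs (illustrative values only).
W = [[1, 0, -1, 2],
     [1, 1, 1, 1]]
b = [0, 1]

# On the original input size the convolutional view yields a single position
# whose scores match the classifier.
single = as_convolution([1, 2, 3, 4], W, b)

# On a longer input it yields a map of scores, one per position.
dense = as_convolution([1, 2, 3, 4, 5, 6], W, b)
```

This version calls the classifier once per window, so it shows only that the outputs agree. The efficiency argument in the text comes from sharing computation across overlapping windows, which a real convolution implementation does and this sketch does not.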

In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.

This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [27, 2, 8, 28, 11], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [8, 16], proposals [16, 14], or post-hoc refinement by random fields or local classifiers [8, 16]. Our model transfers recent success in classification [19, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [8, 28, 27].

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where.

Deep feature hierarchies jointly encode location and semantics in a local-to-global pyramid. We define a novel "skip" architecture to combine deep, coarse, semantic information and shallow, fine, appearance information in a later section (see Figure 3).

In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multi-layer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.

Related work

Our approach draws on recent successes of deep nets for image classification [19, 31, 32] and transfer learning [4, 38]. Transfer was first demonstrated on various visual recognition tasks [4, 38], then on detection, and on both instance and semantic segmentation in hybrid proposal-classifier models [12, 16, 14].

We now re-architect and fine-tune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework.

Fully convolutional networks. To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. [25], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. [27] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference.

Fully convolutional computation has also been exploited in the present era of many-layered nets.

Sliding window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [5] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.

Alternatively, He et al. [17] discard the non-convolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.

Dense prediction with convnets. Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al. [8], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al.

[2] and for natural images by a hybrid neural net/nearest neighbor model by Ganin and Lempitsky [11]; and image restoration and depth estimation by Eigen et al. [5, 6]. Common elements of these approaches include small models restricting capacity and receptive fields; patchwise training [27, 2, 8, 28, 11]; post-processing by superpixel projection, random field regularization, filtering, or local classification [8, 2, 11]; input shifting and output interlacing for dense output [28, 11] as introduced by OverFeat [29]; multi-scale pyramid processing [8, 28, 11]; saturating tanh nonlinearities [8, 5, 28]; and ensembles [2, 11], whereas our method does without this machinery. However, we do study patchwise training and "shift-and-stitch" dense output from the perspective of FCNs. We also discuss in-network upsampling, of which the fully connected prediction by Eigen et al. [6] is a special case.

Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.

Hariharan et al. [16] and Gupta et al. [14] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [12] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end. They achieve state-of-the-art results on PASCAL VOC segmentation and NYUDv2 segmentation respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in the results section.

Fully convolutional networks

Each layer of data in a convnet is a three-dimensional array of size $h \times w \times d$, where $h$ and $w$ are spatial dimensions, and $d$ is the feature or channel dimension. The first layer is the image, with pixel size $h \times w$, and $d$ color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.

Convnets are built on translation invariance.

Their basic components (convolution, pooling, and activation functions) operate on local input regions, and depend only on relative spatial coordinates. Writing $x_{ij}$ for the data vector at location $(i, j)$ in a particular layer, and $y_{ij}$ for the following layer, these functions compute outputs $y_{ij}$ by

$$y_{ij} = f_{ks}\left(\{x_{si+\delta i,\, sj+\delta j}\}_{0 \le \delta i,\, \delta j \le k}\right)$$

where $k$ is called the kernel size, $s$ is the stride or subsampling factor, and $f_{ks}$ determines the layer type: a matrix multiplication for convolution or average pooling, a spatial max for max pooling, or an elementwise nonlinearity for an activation function, and so on for other types of layers.

This functional form is maintained under composition, with kernel size and stride obeying the transformation rule

$$f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\; ss'}.$$

While a general deep net computes a general nonlinear function, a net with only layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional network.
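The composition rule can be checked numerically on a small case. The following is a minimal sketch under stated assumptions, not code from the paper: it works in one dimension, uses no padding, and takes each window to hold exactly $k$ entries (offsets $0$ to $k-1$), which is the convention under which the kernel size $k' + (k-1)s'$ counts inputs. The helper names `layer_1d`, `compose_params`, and `compose_fn` are hypothetical.

```python
def layer_1d(x, k, s, f):
    """Apply f to every length-k window of x, stepping by stride s (no padding)."""
    return [f(x[s * i: s * i + k]) for i in range((len(x) - k) // s + 1)]


def compose_params(f_params, g_params):
    """Kernel size and stride of f after g: (k' + (k - 1) * s', s * s')."""
    k, s = f_params
    k_g, s_g = g_params
    return (k_g + (k - 1) * s_g, s * s_g)


def compose_fn(f, g, g_params):
    """Window function of f after g: run g over the composite window, then f."""
    k_g, s_g = g_params
    return lambda window: f(layer_1d(window, k_g, s_g, g))


# g runs first: a sum over 3 inputs with stride 2 (an unnormalized box filter).
# f runs second: a max over 2 inputs with stride 2 (max pooling).
g, g_params = sum, (3, 2)
f, f_params = max, (2, 2)

k_c, s_c = compose_params(f_params, g_params)  # kernel 3 + (2 - 1) * 2, stride 2 * 2
h = compose_fn(f, g, g_params)


def stacked(x):
    """Two layers applied one after the other."""
    return layer_1d(layer_1d(x, g_params[0], g_params[1], g), f_params[0], f_params[1], f)


def composite(x):
    """One layer with the composed kernel size, stride, and window function."""
    return layer_1d(x, k_c, s_c, h)


x = list(range(13))
two_layer_output = stacked(x)
one_layer_output = composite(x)
```

Because both versions are defined for any input length, the same check can be repeated on inputs of different sizes, which is the sense in which a stack of such layers acts as a single filter.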
