Tech report (v5) - arXiv

Rich feature hierarchies for accurate object detection and semantic segmentation
Tech report (v5)
Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
UC [...]
22 Oct 2014

Abstract

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of [...]. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at [...].

Introduction

Features matter. The last decade of progress on various visual recognition tasks has been based considerably on the use of SIFT [29] and HOG [7]. But if we look at performance on the canonical visual recognition task, PASCAL VOC object detection [15], it is generally acknowledged that progress has been slow during 2010-2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods.

SIFT and HOG are blockwise orientation histograms, a representation we could associate roughly with complex cells in V1, the first cortical area in the primate visual pathway. But we also know that recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.

[Figure 1 panels: R-CNN: Regions with CNN features. 1. Input image; 2. Extract region proposals (~2k); 3. Compute CNN features (warped region); 4. Classify regions (e.g., aeroplane?)]

Figure 1: Object detection system. The system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of [...] on PASCAL VOC 2010. For comparison, [39] reports [...] mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at [...]. On the 200-class ILSVRC2013 detection dataset, R-CNN's mAP is [...], a large improvement over OverFeat [34], which had the previous best result at [...].

The neocognitron [19], a biologically-inspired hierarchical and shift-invariant model for pattern recognition, was an early attempt at just such a process. The neocognitron, however, lacked a supervised training algorithm. Building on Rumelhart et al. [33], LeCun et al. [26] showed that stochastic gradient descent via backpropagation was effective for training convolutional neural networks (CNNs), a class of models that extend the neocognitron.

CNNs saw heavy use in the 1990s (e.g., [27]), but then fell out of fashion with the rise of support vector machines. In 2012, Krizhevsky et al. [25] rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9, 10]. Their success resulted from training a large CNN on [...] million labeled images, together with a few twists on LeCun's CNN (e.g., max(x, 0) rectifying non-linearities and "dropout" regularization).

The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?

We answer this question by bridging the gap between image classification and object detection.

This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we focused on two problems: localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data.

Unlike image classification, detection requires localizing (likely many) objects within an image. One approach frames localization as a regression problem. However, work from Szegedy et al. [38], concurrent with our own, indicates that this strategy may not fare well in practice (they report a mAP of [...] on VOC 2007 compared to [...] achieved by our method).

An alternative is to build a sliding-window detector. CNNs have been used in this way for at least two decades, typically on constrained object categories, such as faces [32, 40] and pedestrians [35]. In order to maintain high spatial resolution, these CNNs typically only have two convolutional and pooling layers. We also considered adopting a sliding-window approach. However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 × 195 pixels) and strides (32 × 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

Instead, we solve the CNN localization problem by operating within the "recognition using regions" paradigm [21], which has been successful for both object detection [39] and semantic segmentation [5].
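To make the receptive-field and stride figures quoted above concrete, the following minimal sketch accumulates both quantities layer by layer. The kernel sizes and strides in `ASSUMED_LAYERS` are an assumption: they follow the commonly described Krizhevsky et al. [25] configuration and are not stated in this section, which gives only the resulting 195 and 32. Padding is ignored because it does not change either quantity.

```python
from typing import List, Tuple

# (name, kernel size, stride) for each layer up to the last pooling layer.
# ASSUMPTION: values taken from the commonly described Krizhevsky et al. [25]
# architecture; the text above reports only the final 195 and 32.
ASSUMED_LAYERS: List[Tuple[str, int, int]] = [
    ("conv1", 11, 4), ("pool1", 3, 2),
    ("conv2", 5, 1), ("pool2", 3, 2),
    ("conv3", 3, 1), ("conv4", 3, 1),
    ("conv5", 3, 1), ("pool5", 3, 2),
]


def receptive_field_and_stride(layers: List[Tuple[str, int, int]]) -> Tuple[int, int]:
    """Return (receptive field, stride) in input pixels for the top layer's units."""
    rf, stride = 1, 1
    for _name, kernel, layer_stride in layers:
        # Each extra kernel tap widens the field by the current cumulative stride.
        rf += (kernel - 1) * stride
        # Strides multiply as we move up the network.
        stride *= layer_stride
    return rf, stride


rf, stride = receptive_field_and_stride(ASSUMED_LAYERS)
print(rf, stride)  # prints: 195 32
```

Under these assumed layer settings, neighbouring units in the top layer are 32 pixels apart in the input while each one sees a 195-pixel-wide window, which is why a naive sliding-window readout localizes coarsely.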

At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region's shape. Figure 1 presents an overview of our method and highlights some of our results. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.

In this updated version of this paper, we provide a head-to-head comparison of R-CNN and the recently proposed OverFeat [34] detection system by running R-CNN on the 200-class ILSVRC2013 detection dataset.
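The test-time steps described above (warp each proposal to a fixed size, extract a fixed-length feature vector, score it with per-class linear SVMs) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `warp_region`, `flatten_features`, and `score_regions` are hypothetical names, the feature extractor is a stand-in that merely flattens pixels where a real system would run the CNN, the warp uses nearest-neighbour sampling, and the tiny `size` is a placeholder for the network's actual input size.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), with x2 and y2 exclusive


def warp_region(image: Image, box: Box, size: int) -> Image:
    """Stretch the box contents to size x size, ignoring aspect ratio."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1  # both must be at least 1
    return [
        [image[y1 + (r * h) // size][x1 + (c * w) // size] for c in range(size)]
        for r in range(size)
    ]


def flatten_features(patch: Image) -> List[float]:
    """HYPOTHETICAL stand-in for the CNN: returns the pixels as a vector."""
    return [value for row in patch for value in row]


def score_regions(
    image: Image,
    boxes: Sequence[Box],
    extract_features: Callable[[Image], List[float]],
    svm_weights: Sequence[Sequence[float]],  # one weight vector per class
    svm_biases: Sequence[float],             # one bias per class
    size: int = 2,                           # placeholder input size
) -> List[Tuple[Box, List[float]]]:
    """Return (box, per-class linear SVM scores) for every proposal."""
    results = []
    for box in boxes:
        feat = extract_features(warp_region(image, box, size))
        scores = [
            sum(w * f for w, f in zip(weights, feat)) + bias
            for weights, bias in zip(svm_weights, svm_biases)
        ]
        results.append((box, scores))
    return results


image = [[1.0, 2.0], [3.0, 4.0]]
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
print(score_regions(image, [(0, 0, 2, 2)], flatten_features, weights, [0.0, 0.5]))
# prints: [((0, 0, 2, 2), [1.0, 4.5])]
```

Because every proposal is warped to the same size, the feature vector has the same length for tall, wide, and square regions alike, so one set of SVM weights per class can score all of them.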

OverFeat uses a sliding-window CNN for detection and until now was the best performing method on ILSVRC2013 detection. We show that R-CNN significantly outperforms OverFeat, with a mAP of [...] versus [...].

A second challenge faced in detection is that labeled data is scarce and the amount currently available is insufficient for training a large CNN. The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning (e.g., [35]). The second principal contribution of this paper is to show that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce.

In our experiments, fine-tuning for detection improves mAP performance by 8 percentage points. After fine-tuning, our system achieves a mAP of 54% on VOC 2010 compared to 33% for the highly-tuned, HOG-based deformable part model (DPM) [17, 20]. We also point readers to contemporaneous work by Donahue et al. [12], who show that Krizhevsky's CNN can be used (without fine-tuning) as a blackbox feature extractor, yielding excellent performance on several recognition tasks including scene classification, fine-grained sub-categorization, and [...].

Our system is also quite efficient. The only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression.
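As an illustration of the greedy non-maximum suppression step mentioned above, here is a minimal sketch applied to the scored boxes of a single class. It is a generic version of the technique rather than the authors' code: the names `iou` and `greedy_nms` are hypothetical, and the `iou_threshold` default is an assumed value, since this section does not state one.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union overlap of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.3,  # ASSUMED value, not taken from the text
) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that has already been kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # prints: [0, 2]
```

In the example, the second box overlaps the top-scoring one with an IoU of about 0.68 and is suppressed, while the third box is disjoint from both and survives. In a multi-class detector this routine would be run once per class on that class's scores.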
