
Jonathan Huang Vivek Rathod Chen Sun Menglong Zhu …






Speed/accuracy trade-offs for modern convolutional object detectors

Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy
Google Research

Abstract

The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end, we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN [31], R-FCN [6] and SSD [26] systems, which we view as "meta-architectures", and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that achieves real-time speeds and can be deployed on a mobile device.

On the opposite end, in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.

1. Introduction

A lot of progress has been made in recent years on object detection due to the use of convolutional neural networks (CNNs). Modern object detectors based on these networks, such as Faster R-CNN [31], R-FCN [6], Multibox [40], SSD [26] and YOLO [29], are now good enough to be deployed in consumer products (e.g., Google Photos, Pinterest Visual Search), and some have been shown to be fast enough to run on mobile devices. However, it can be difficult for practitioners to decide what architecture is best suited to their application. Standard accuracy metrics, such as mean average precision (mAP), do not tell the entire story, since for real deployments of computer vision systems, running time and memory usage are also critical. For example, mobile devices often require a small memory footprint, and self-driving cars require real-time performance.

Server-side production systems, like those used in Google, Facebook or Snapchat, have more leeway to optimize for accuracy, but are still subject to throughput constraints. While the methods that win competitions, such as the COCO challenge [25], are optimized for accuracy, they often rely on model ensembling and multicrop methods which are too slow for practical usage. Unfortunately, only a small subset of papers (e.g., R-FCN [6], SSD [26], YOLO [29]) discuss running time in any detail. Furthermore, these papers typically only state that they achieve some frame-rate, but do not give a full picture of the speed/accuracy trade-off, which depends on many other factors, such as which feature extractor is used, input image sizes, etc. In this paper, we seek to explore the speed/accuracy trade-off of modern detection systems in an exhaustive and fair way. While this has been studied for full image classification (e.g., [3]), detection models tend to be significantly more complex.

We primarily investigate single-model/single-pass detectors, by which we mean models that do not use ensembling, multi-crop methods, or other "tricks" such as horizontal flipping. In other words, we only pass a single image through a single network. For simplicity (and because it is more important for users of this technology), we focus only on test-time performance and not on how long these models take to train. While it is impractical to compare every recently proposed detection system, we are fortunate that many of the leading state-of-the-art approaches have converged on a common methodology (at least at a high level). This has allowed us to implement and compare a large number of detection systems in a unified manner. In particular, we have created implementations of the Faster R-CNN, R-FCN and SSD meta-architectures, which at a high level consist of a single convolutional network, trained with a mixed regression and classification objective, and use sliding window style predictions. To summarize, our main contributions are as follows:

- We provide a concise survey of modern convolutional detection systems, and describe how the leading ones follow very similar designs.

- We describe our flexible and unified implementation of three meta-architectures (Faster R-CNN, R-FCN and SSD) in Tensorflow, which we use to do extensive experiments that trace the accuracy/speed trade-off curve for different detection systems, varying meta-architecture, feature extractor, image resolution, etc.

- Our findings show that using fewer proposals for Faster R-CNN can speed it up significantly without a big loss in accuracy, making it competitive with its faster cousins, SSD and R-FCN. We show that SSD's performance is less sensitive to the quality of the feature extractor than Faster R-CNN and R-FCN. And we identify "sweet spots" on the accuracy/speed trade-off curve where gains in accuracy are only possible by sacrificing speed (within the family of detectors presented here).

- Several of the meta-architecture and feature-extractor combinations that we report have never appeared before in literature. We discuss how we used some of these novel combinations to train the winning entry of the 2016 COCO object detection challenge.

2. Meta-architectures

Neural nets have become the leading method for high quality object detection in recent years.

In this section we survey some of the highlights of this literature. The R-CNN paper by Girshick et al. [11] was among the first modern incarnations of convolutional network based detection. Inspired by recent successes on image classification [20], the R-CNN method took the straightforward approach of cropping externally computed box proposals out of an input image and running a neural net classifier on these crops. This approach can be expensive, however, because many crops are necessary, leading to significant duplicated computation from overlapping crops. Fast R-CNN [10] alleviated this problem by pushing the entire image once through a feature extractor, then cropping from an intermediate layer so that crops share the computation load of feature extraction. While both R-CNN and Fast R-CNN relied on an external proposal generator, recent works have shown that it is possible to generate box proposals using neural networks as well [41, 40, 8, 31]. In these works, it is typical to have a collection of boxes overlaid on the image at different spatial locations, scales and aspect ratios that act as "anchors" (sometimes called "priors" or "default boxes").
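The shared-computation idea behind Fast R-CNN can be made concrete with a small sketch: run the backbone once, then crop each proposal from the resulting feature map rather than from the image. The function below is only an illustration, not the paper's implementation; it substitutes nearest-neighbour resampling for RoI max-pooling, and the stride and output size are assumed example values.

```python
import numpy as np

def crop_proposal_features(feature_map, box, stride=16, output_size=7):
    """Crop one proposal's features from a shared backbone feature map.

    feature_map: (H, W, C) array produced by a single pass over the image.
    box: (x0, y0, x1, y1) proposal in image-pixel coordinates.
    stride: image pixels covered by one feature-map cell (assumed value).
    """
    # Project the box from image coordinates onto the feature grid.
    x0, y0, x1, y1 = (int(round(c / stride)) for c in box)
    roi = feature_map[y0:y1 + 1, x0:x1 + 1, :]
    # Resample the variable-sized crop to a fixed grid so a classifier head
    # can consume it (Fast R-CNN pools over bins; we take nearest cells).
    rows = np.linspace(0, roi.shape[0] - 1, output_size).round().astype(int)
    cols = np.linspace(0, roi.shape[1] - 1, output_size).round().astype(int)
    return roi[np.ix_(rows, cols)]
```

Every proposal reuses the same `feature_map`, which is the source of the speedup over per-crop R-CNN inference.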

A model is then trained to make two predictions for each anchor: (1) a discrete class prediction for each anchor, and (2) a continuous prediction of an offset by which the anchor needs to be shifted to fit the groundtruth bounding box. Papers that follow this anchors methodology then minimize a combined classification and regression loss that we now describe. For each anchor a, we first find the best matching groundtruth box b (if one exists). If such a match can be found, we call a a "positive anchor", and assign it (1) a class label y_a ∈ {1, ..., K} and (2) a vector encoding of box b with respect to anchor a (called the box encoding φ(b_a; a)). If no match is found, we call a a "negative anchor" and we set the class label to be y_a = 0. If for the anchor a we predict box encoding f_loc(I; a, θ) and corresponding class f_cls(I; a, θ), where I is the image and θ the model parameters, then the loss for a is measured as a weighted sum of a location-based loss and a classification loss:

    L(a, I; θ) = α · 1[a is positive] · ℓ_loc(φ(b_a; a) − f_loc(I; a, θ)) + β · ℓ_cls(y_a, f_cls(I; a, θ)),    (1)

where α, β are weights balancing localization and classification losses.
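Equation 1 for a single anchor can be sketched in NumPy. The paper leaves ℓ_loc and ℓ_cls generic; the sketch below assumes Smooth L1 for localization and softmax cross-entropy for classification (the choices Table 1 lists for Faster R-CNN), so the numbers it produces are illustrative only.

```python
import numpy as np

def smooth_l1(x):
    """Elementwise Smooth L1: quadratic near zero, linear elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def anchor_loss(is_positive, phi_target, loc_pred, y_a, cls_logits,
                alpha=1.0, beta=1.0):
    """Per-anchor loss L(a, I; theta) from Equation 1.

    phi_target: box encoding phi(b_a; a) of the matched groundtruth box.
    loc_pred:   predicted encoding f_loc(I; a, theta).
    y_a:        class label (0 = negative/background anchor).
    cls_logits: raw class scores f_cls(I; a, theta), length K + 1.
    """
    # The localization term is gated by the indicator 1[a is positive].
    loc_loss = smooth_l1(phi_target - loc_pred).sum() if is_positive else 0.0
    # Softmax cross-entropy, computed stably via a max shift.
    shifted = cls_logits - cls_logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    cls_loss = -log_probs[y_a]
    return alpha * loc_loss + beta * cls_loss
```

As the text goes on to note, training averages this quantity over all anchors and minimizes it with respect to θ.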

To train the model, Equation 1 is averaged over anchors and minimized with respect to parameters θ. The choice of anchors has significant implications both for accuracy and computation. In the (first) Multibox paper [8], these anchors (called "box priors" by the authors) were generated by clustering groundtruth boxes in the dataset. In more recent works, anchors are generated by tiling a collection of boxes at different scales and aspect ratios regularly across the image. The advantage of having a regular grid of anchors is that predictions for these boxes can be written as tiled predictors on the image with shared parameters (i.e., convolutions) and are reminiscent of traditional sliding window methods, e.g., [44]. The Faster R-CNN [31] paper and the (second) Multibox paper [40] (which called these tiled anchors "convolutional priors") were the first papers to take this new approach.

2.1. Meta-architectures

In our paper we focus primarily on three recent (meta)-architectures: SSD (Single Shot Multibox Detector [26]), Faster R-CNN [31] and R-FCN (Region-based Fully Convolutional Networks [6]).
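The regular tiling described above takes only a few lines: one anchor of every scale/aspect-ratio combination is placed at each cell of a feature grid. The stride, scales and ratios below are assumed example values, not the settings of any of the cited papers.

```python
import numpy as np

def tile_anchors(grid_h, grid_w, stride=16,
                 scales=(64, 128, 256), aspect_ratios=(0.5, 1.0, 2.0)):
    """Tile anchors (xc, yc, w, h) on a regular grid over the image."""
    anchors = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            # Anchor centers sit at the center of each grid cell.
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride
            for s in scales:
                for ar in aspect_ratios:
                    # Keep area ~ s^2 while varying the aspect ratio w/h = ar.
                    w, h = s * np.sqrt(ar), s / np.sqrt(ar)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)  # (grid_h * grid_w * #scales * #ratios, 4)
```

Because the grid is regular, the predictions attached to these anchors can be produced by convolutions with shared weights, which is exactly the sliding-window connection the text draws.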

While these papers were originally presented with a particular feature extractor (e.g., VGG, Resnet, etc.), we now review these three methods, decoupling the choice of meta-architecture from feature extractor so that conceptually, any feature extractor can be used with SSD, Faster R-CNN or R-FCN.

Single Shot Detector (SSD). Though the SSD paper was published only recently (Liu et al., [26]), we use the term SSD to refer broadly to architectures that use a single feed-forward convolutional network to directly predict classes and anchor offsets without requiring a second stage per-proposal classification operation (Figure 1a). Under this definition, the SSD meta-architecture has been explored in a number of precursors to [26]. Both Multibox and the Region Proposal Network …

Table 1: Convolutional detection models that use one of the meta-architectures described in Section 2.

| Paper                    | Meta-architecture | Feature Extractor            | Matching   | Box Encoding φ(b_a, a)          | Location Loss |
|--------------------------|-------------------|------------------------------|------------|---------------------------------|---------------|
| Szegedy et al. [40]      | SSD               | InceptionV3                  | Bipartite  | [x0, y0, x1, y1]                | L2            |
| Redmon et al. [29]       | SSD               | Custom (GoogLeNet inspired)  | Box Center | [xc, yc, √w, √h]                | L2            |
| Ren et al. [31]          | Faster R-CNN      | VGG                          | Argmax     | [xc/wa, yc/ha, log w, log h]    | SmoothL1      |
| He et al. [13]           | Faster R-CNN      | ResNet-101                   | Argmax     | [xc/wa, yc/ha, log w, log h]    | SmoothL1      |
| Liu et al. [26] (v1)     | SSD               | InceptionV3                  | Argmax     | [x0, y0, x1, y1]                | L2            |
| Liu et al. [26] (v2, v3) | SSD               | VGG                          | Argmax     | [xc/wa, yc/ha, log w, log h]    | SmoothL1      |
| Dai et al. [6]           | R-FCN             | ResNet-101                   | Argmax     | [xc/wa, yc/ha, log w, log h]    | SmoothL1      |

Boxes are encoded with respect to a matching anchor a via a function φ (Equation 1), where [x0, y0, x1, y1] are the min/max coordinates of a box, xc, yc are its center coordinates, and w, h its width and height. In some cases wa, ha, the width and height of the matching anchor, are also used. Notes: (1) we include an early arXiv version of [26], which used a different configuration from that published at ECCV 2016; (2) [29] uses a fast feature extractor described as being inspired by GoogLeNet [39], which we do not compare to; (3) YOLO matches a groundtruth box to an anchor if its center falls inside the anchor (we refer to this as BoxCenter).
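The [xc/wa, yc/ha, log w, log h] encoding that most rows of Table 1 share can be written out directly. The sketch below assumes the standard Faster R-CNN-style parameterization (center offsets normalized by anchor size, width and height in log space); its virtue is that it is exactly invertible, so predicted encodings decode back into boxes.

```python
import numpy as np

def encode_box(box, anchor):
    """phi(b; a): encode a groundtruth box relative to its matched anchor.
    Both arguments are (xc, yc, w, h)."""
    bx, by, bw, bh = box
    ax, ay, aw, ah = anchor
    return np.array([(bx - ax) / aw, (by - ay) / ah,
                     np.log(bw / aw), np.log(bh / ah)])

def decode_box(enc, anchor):
    """Invert encode_box: recover a box from a predicted encoding."""
    tx, ty, tw, th = enc
    ax, ay, aw, ah = anchor
    return np.array([ax + tx * aw, ay + ty * ah,
                     aw * np.exp(tw), ah * np.exp(th)])
```

Normalizing by the anchor's width and height makes the regression targets roughly scale-invariant, which is why this form recurs across the models in Table 1.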

