
YOLOv3: An Incremental Improvement


Joseph Redmon, Ali Farhadi
University of Washington

Abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320×320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.

1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn't do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better.

I also helped out with other people's research a little. Actually, that's what brings us here today. We have a camera-ready deadline [4] and we need to cite some of the random updates I made to YOLO but we don't have a source. So get ready for a TECH REPORT!

The great thing about tech reports is that they don't need intros, y'all know why we're here. So the end of this introduction will signpost for the rest of the paper. First we'll tell you what the deal is with YOLOv3. Then we'll tell you how we do. We'll also tell you about some things we tried that didn't work. Finally we'll contemplate what this all means.

2. The Deal

So here's the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that's better than the other ones. We'll just take you through the whole system from scratch so you can understand it all.

Figure 1: Inference time (ms) versus COCO AP for RetinaNet-50, RetinaNet-101, YOLOv3, and methods [B] SSD321, [C] DSSD321, [D] R-FCN, [E] SSD513, [F] DSSD513, [G] FPN FRCN.

We adapt this figure from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detection methods with comparable performance. Times are from either an M40 or Titan X; they are basically the same GPU.

2.1. Bounding Box Prediction

Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts 4 coordinates for each bounding box, t_x, t_y, t_w, t_h. If the cell is offset from the top left corner of the image by (c_x, c_y) and the bounding box prior has width and height p_w, p_h, then the predictions correspond to:

    b_x = σ(t_x) + c_x
    b_y = σ(t_y) + c_y
    b_w = p_w e^{t_w}
    b_h = p_h e^{t_h}

During training we use sum of squared error loss. If the ground truth for some coordinate prediction is t̂_* our gradient is the ground truth value (computed from the ground truth box) minus our prediction: t̂_* − t_*. This ground truth value can be easily computed by inverting the equations above. A minimal decoding sketch follows.

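To make these four equations concrete, here is a minimal decoding sketch in NumPy; the function name and argument layout are our own illustration, not Darknet's API:

import numpy as np

def decode_box(t, cell_xy, prior_wh):
    # t = (tx, ty, tw, th): raw network outputs for one box.
    # cell_xy = (cx, cy): offset of the grid cell from the image's top-left.
    # prior_wh = (pw, ph): width/height of the matched dimension-cluster prior.
    tx, ty, tw, th = t
    cx, cy = cell_xy
    pw, ph = prior_wh
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    bx = sigmoid(tx) + cx    # center x, in grid-cell units
    by = sigmoid(ty) + cy    # center y, in grid-cell units
    bw = pw * np.exp(tw)     # width scales the prior
    bh = ph * np.exp(th)     # height scales the prior
    return bx, by, bw, bh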
Figure 2: Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].

YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold, we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17], our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness. A sketch of this assignment rule follows.

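Here is a minimal sketch of the assignment rule, assuming we already have each prior's IOU with a given ground truth box; the helper and its None-means-ignore convention are our own, not from [17] or Darknet:

import numpy as np

def objectness_targets(ious, ignore_thresh=0.5):
    # ious[i]: IOU of bounding box prior i with one ground truth box.
    # Returns 1.0 for the single best prior, None for non-best priors
    # above the ignore threshold (no loss incurred), and 0.0 elsewhere.
    targets = [0.0] * len(ious)
    best = int(np.argmax(ious))
    for i, iou_i in enumerate(ious):
        if i == best:
            targets[i] = 1.0      # best-overlapping prior: objectness target 1
        elif iou_i > ignore_thresh:
            targets[i] = None     # overlaps enough: prediction is ignored
    return targets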
2.2. Class Prediction

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance; instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.

This formulation helps when we move to more complex domains like the Open Images Dataset [7]. In this dataset there are many overlapping labels (e.g. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class, which is often not the case. A multilabel approach better models the data. A minimal sketch of this loss follows.

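Here is a minimal sketch of the resulting per-box class loss (independent sigmoids plus binary cross-entropy); this is our own illustration rather than the Darknet implementation:

import numpy as np

def class_loss(logits, labels):
    # logits: raw class scores for one box; labels: multi-hot 0/1 vector,
    # so overlapping labels like Woman AND Person can both be 1.
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    p = 1.0 / (1.0 + np.exp(-logits))   # independent sigmoid per class, no softmax
    p = np.clip(p, 1e-7, 1.0 - 1e-7)    # avoid log(0)
    bce = -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))
    return float(bce.sum())             # binary cross-entropy over all classes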
2.3. Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is N × N × [3 × (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.

Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.

We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.

We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326). A sketch of the resulting output shapes follows.

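To make the output shapes concrete, here is the arithmetic for COCO; the 416×416 input, the 52/26/13 grids, and the prior-to-scale assignment are our assumptions for illustration, since the text above does not pin them down:

# Each detection head predicts 3 boxes per cell, each with
# 4 offsets + 1 objectness + 80 COCO classes.
num_anchors, num_coords, num_classes = 3, 4, 80
depth = num_anchors * (num_coords + 1 + num_classes)  # 3 * 85 = 255

# The 9 priors, divided evenly across the 3 scales
# (assumed: the finest grid gets the smallest priors).
priors = [(10, 13), (16, 30), (33, 23),       # assumed finest scale
          (30, 61), (62, 45), (59, 119),      # assumed middle scale
          (116, 90), (156, 198), (373, 326)]  # assumed coarsest scale

for n in (52, 26, 13):       # assumed grid sizes for a 416x416 input
    print((n, n, depth))     # (52, 52, 255), (26, 26, 255), (13, 13, 255)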
2.4. Feature Extractor

We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3×3 and 1×1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it.... wait for it..... Darknet-53!

        Type            Filters   Size      Output
        Convolutional   32        3×3       256×256
        Convolutional   64        3×3 / 2   128×128
   1×   Convolutional   32        1×1
        Convolutional   64        3×3
        Residual                            128×128
        Convolutional   128       3×3 / 2   64×64
   2×   Convolutional   64        1×1
        Convolutional   128       3×3
        Residual                            64×64
        Convolutional   256       3×3 / 2   32×32
   8×   Convolutional   128       1×1
        Convolutional   256       3×3
        Residual                            32×32
        Convolutional   512       3×3 / 2   16×16
   8×   Convolutional   256       1×1
        Convolutional   512       3×3
        Residual                            16×16
        Convolutional   1024      3×3 / 2   8×8
   4×   Convolutional   512       1×1
        Convolutional   1024      3×3
        Residual                            8×8
        Avgpool                   Global
        Connected                 1000
        Softmax

Table 1: Darknet-53.

This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

Backbone          Top-1   Top-5   Bn Ops   BFLOP/s   FPS
Darknet-19 [15]   74.1    91.8    7.29     1246      171
ResNet-101 [5]    77.1    93.7    19.7     1039      53
ResNet-152 [5]    77.6    93.8    29.4     1090      37
Darknet-53        77.2    93.8    18.7     1457      78

Table 2: Comparison of backbones. Accuracy, billions of operations, billion floating point operations per second, and FPS for various networks.

Each network is trained with identical settings and tested at 256×256, single crop accuracy. Run times are measured on a Titan X at 256×256. A sketch of the repeated residual unit from Table 1 follows.

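For concreteness, here is one of the repeated units from Table 1 written as a PyTorch module: a 1×1 convolution that halves the channels, a 3×3 convolution that restores them, and a shortcut connection. This is our sketch of the pattern; details like the LeakyReLU slope and batch norm placement are assumptions, not taken from the paper:

import torch.nn as nn

class DarknetResidual(nn.Module):
    # One repeated 1x1 -> 3x3 block with a shortcut connection.
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),   # assumed activation details
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)  # the shortcut ("Residual" rows in Table 1)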
Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster.

Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That's mostly because ResNets have just way too many layers and aren't very efficient.

2.5. Training

We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [14].

3. How We Do

YOLOv3 is pretty good! See table 3. In terms of COCO's weird average mean AP metric it is on par with the SSD variants but is 3× faster.

It is still quite a bit behind other models like RetinaNet in this metric though.

                            Backbone                   AP     AP50   AP75   APS    APM    APL
Two-stage methods
Faster R-CNN+++ [5]         ResNet-101-C4              34.9   55.7   37.4   15.6   38.7   50.9
Faster R-CNN w FPN [8]      ResNet-101-FPN             36.2   59.1   39.0   18.2   39.0   48.2
Faster R-CNN by G-RMI [6]   Inception-ResNet-v2 [21]   34.7   55.5   36.7   13.5   38.1   52.0
Faster R-CNN w TDM [20]     Inception-ResNet-v2-TDM    36.8   57.7   39.2   16.2   39.8   52.1
One-stage methods
YOLOv2 [15]                 DarkNet-19 [15]            21.6   44.0   19.2   5.0    22.4   35.5
SSD513 [11, 3]              ResNet-101-SSD             31.2   50.4   33.3   10.2   34.5   49.8
DSSD513 [3]                 ResNet-101-DSSD            33.2   53.3   35.2   13.0   35.4   51.1
RetinaNet [9]               ResNet-101-FPN             39.1   59.1   42.3   21.8   42.7   50.2
RetinaNet [9]               ResNeXt-101-FPN            40.8   61.1   44.1   24.1   44.2   51.2
YOLOv3 608×608              Darknet-53                 33.0   57.9   34.4   18.3   35.4   41.9

Table 3: I'm seriously just stealing all these tables from [9], they take soooo long to make from scratch. Ok, YOLOv3 is doing alright. Keep in mind that RetinaNet has like 3.8× longer to process an image. YOLOv3 is much better than the SSD variants and comparable to state-of-the-art models on the AP50 metric.

However, when we look at the old detection metric of mAP at IOU=.5 (or AP50 in the chart) YOLOv3 is very strong. It is almost on par with RetinaNet and far above the SSD variants. This indicates that YOLOv3 is a very strong detector that excels at producing decent boxes for objects. However, performance drops significantly as the IOU threshold increases, indicating YOLOv3 struggles to get the boxes perfectly aligned with the object. (A refresher on the underlying IOU computation follows.)

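Since the AP50 vs AP75 gap is purely a question of the IOU threshold, here is the standard IOU computation for reference; the corner-coordinate box format is our assumption:

def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) corners. The same predicted box can
    # count as a hit at IOU >= .5 (AP50) and a miss at IOU >= .75 (AP75).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)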
In the past YOLO struggled with small objects. However, now we see a reversal in that trend. With the new multi-scale predictions we see YOLOv3 has relatively high APS performance. However, it has comparatively worse performance on medium and larger size objects. More investigation is needed to get to the bottom of this.

When we plot accuracy vs speed on the AP50 metric (see figure 5) we see YOLOv3 has significant benefits over other detection systems. Namely, it's faster and better.

4. Things We Tried That Didn't Work

We tried lots of stuff while we were working on YOLOv3. A lot of it didn't work. Here's the stuff we can remember.

Anchor box x, y offset predictions. We tried using the normal anchor box prediction mechanism where you predict the x, y offset as a multiple of the box width or height using a linear activation. We found this formulation decreased model stability and didn't work very well.

Linear x, y predictions instead of logistic. We tried using a linear activation to directly predict the x, y offset instead of the logistic activation.

