Abstract arXiv:1506.02640v5 [cs.CV] 9 May 2016

You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
University of Washington, Allen Institute for AI, Facebook AI Research

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.

Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection.

To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to $448 \times 448$, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model's confidence.

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13].

These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline. We simply run our neural network on a new image at test time to predict detections.

Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can't see the larger context.

YOLO makes less than half the number of background errors compared to Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.

2. Unified Detection

We unify the separate components of object detection into a single neural network.

Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

Our system divides the input image into an $S \times S$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Each grid cell predicts $B$ bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the predicted box is. Formally we define confidence as $\Pr(\text{Object}) * \text{IOU}^{\text{truth}}_{\text{pred}}$. If no object exists in that cell, the confidence scores should be zero.
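The grid assignment and the confidence definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `responsible_cell`, `iou`, and `confidence_target` are hypothetical, boxes are assumed to be in corner form `(x_min, y_min, x_max, y_max)`, and coordinates are assumed to be normalized to [0, 1] by the image width and height.

```python
def responsible_cell(cx, cy, S):
    """Return (row, col) of the S x S grid cell containing the point (cx, cy).

    cx and cy are assumed to be normalized to [0, 1] by image width and height.
    """
    col = min(int(cx * S), S - 1)  # clamp so cx == 1.0 stays inside the grid
    row = min(int(cy * S), S - 1)
    return row, col


def iou(a, b):
    """Intersection over union of two boxes in (x_min, y_min, x_max, y_max) form."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(object_in_cell, pred_box, truth_box):
    """Target confidence Pr(Object) * IOU, treating Pr(Object) as 0 or 1."""
    if not object_in_cell:
        return 0.0  # no object in the cell: confidence should be zero
    return iou(pred_box, truth_box)
```

For example, with `S = 7` an object centered at `(0.5, 0.5)` is assigned to the cell returned by `responsible_cell(0.5, 0.5, 7)`, and only that cell receives a non-zero confidence target for the object.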

When an object is present, we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

Each bounding box consists of 5 predictions: $x$, $y$, $w$, $h$, and confidence. The $(x, y)$ coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

Each grid cell also predicts $C$ conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes $B$.

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

$$\Pr(\text{Class}_i \mid \text{Object}) * \Pr(\text{Object}) * \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) * \text{IOU}^{\text{truth}}_{\text{pred}} \tag{1}$$

which gives us class-specific confidence scores for each box.
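As a hedged sketch of the box parameterization and of Equation (1), the snippet below encodes a box center relative to its grid cell and multiplies each box confidence by the cell's shared class probabilities. The names `encode_box` and `class_specific_scores` are hypothetical, the input box is assumed to be in center form `(cx, cy, w, h)` normalized to [0, 1] by image size, and the numbers in the usage note are toy values chosen only for illustration.

```python
def encode_box(cx, cy, w, h, S):
    """Encode a normalized center-form box as per-cell targets.

    Returns (row, col, x, y, w, h), where x and y are offsets of the box
    center within the responsible grid cell (in [0, 1]) and w and h remain
    relative to the whole image.
    """
    col = min(int(cx * S), S - 1)  # clamp so a center on the right edge stays in the grid
    row = min(int(cy * S), S - 1)
    x = cx * S - col  # horizontal offset inside the cell
    y = cy * S - row  # vertical offset inside the cell
    return row, col, x, y, w, h


def class_specific_scores(box_confidences, class_probs):
    """Equation (1): one score per (box, class) pair in a single grid cell.

    box_confidences: B values, each an estimate of Pr(Object) * IOU.
    class_probs: C values of Pr(Class_i | Object), shared by all B boxes.
    """
    return [[conf * p for p in class_probs] for conf in box_confidences]
```

Calling `class_specific_scores([0.8, 0.1], [0.5, 0.25, 0.25])` returns a 2 × 3 nested list, one row per box and one column per class, which reflects that the class probabilities are predicted once per cell and reused for every box in it.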

These class-specific scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an $S \times S$ grid and for each grid cell predicts $B$ bounding boxes, confidence for those boxes, and $C$ class probabilities. These predictions are encoded as an $S \times S \times (B * 5 + C)$ tensor. [Panels: $S \times S$ grid on input; Bounding boxes + confidence; Class probability map; Final detections.]

For evaluating YOLO on PASCAL VOC, we use $S = 7$, $B = 2$. PASCAL VOC has 20 labelled classes so $C = 20$. Our final prediction is a $7 \times 7 \times 30$ tensor.

2.1. Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [34].

Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use $1 \times 1$ reduction layers followed by $3 \times 3$ convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

[Figure 3: network architecture diagram. Recoverable layer labels, in order: Conv. Layer 7x7x64-s-2, Maxpool Layer 2x2-s-2; Conv. Layer 3x3x192, Maxpool Layer 2x2-s-2; Conv. Layers 1x1x128, 3x3x256, 1x1x256, 3x3x512, Maxpool Layer 2x2-s-2; Conv. Layers 1x1x256, 3x3x512 (pair repeated ×4), 1x1x512, 3x3x1024, Maxpool Layer 2x2-s-2; Conv. Layers 1x1x512, 3x3x1024 (pair repeated ×2), 3x3x1024, 3x3x1024-s-2; Conn. Layer 4096; Conn. Layer. Recoverable feature-map sizes: 112 × 112 × 192, 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024, 7 × 7 × 1024, 7 × 7 × 1024, 7 × 7 × 30. The rest of the figure text is truncated in the source.]
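As a quick consistency check on the sizes quoted in this section, the sketch below computes the $S \times S \times (B * 5 + C)$ output layout for the PASCAL VOC settings and traces how a $448 \times 448$ input shrinks to a $7 \times 7$ grid through the stride-2 layers labelled in Figure 3. The helper names `output_layout` and `spatial_sizes` are hypothetical, and the trace assumes each stride-2 layer exactly halves the feature-map side (that is, padding is chosen so sizes divide evenly), which the source does not state explicitly.

```python
def output_layout(S, B, C):
    """Return the prediction tensor shape and its flattened length."""
    depth = B * 5 + C  # 5 values (x, y, w, h, confidence) per box, plus C class probabilities
    return (S, S, depth), S * S * depth


def spatial_sizes(input_size, strides):
    """Feature-map side length after each downsampling layer.

    Assumes every layer with stride s divides the side length by exactly s.
    """
    sizes = []
    size = input_size
    for s in strides:
        size //= s
        sizes.append(size)
    return sizes


# Stride-2 stages read off the Figure 3 labels: the 7x7 conv (s-2),
# four 2x2 maxpool layers (s-2), and the 3x3x1024 conv (s-2).
STRIDE_2_STAGES = [2, 2, 2, 2, 2, 2]

shape, flat_length = output_layout(S=7, B=2, C=20)
sides = spatial_sizes(448, STRIDE_2_STAGES)
```

Under these assumptions the final side length matches the grid size $S = 7$, and the last fully connected layer would need to emit `flat_length` values to fill the prediction tensor.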
