Focal Loss for Dense Object Detection - arXiv

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár
Facebook AI Research (FAIR)

[Figure 1 plot: loss versus probability of ground truth class, with curves for several values of γ, including 0, 1, 2, and 5, and the region of well-classified examples marked. Curve definitions: $\mathrm{CE}(p_t) = -\log(p_t)$ and $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$.]

Figure 1. We propose a novel loss we term the Focal Loss that adds a factor $(1 - p_t)^{\gamma}$ to the standard cross entropy criterion. Setting $\gamma > 0$ reduces the relative loss for well-classified examples ($p_t > .5$), putting more focus on hard, misclassified examples. As our experiments will demonstrate, the proposed focal loss enables training highly accurate dense object detectors in the presence of vast numbers of easy background examples.

Abstract

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations.

In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.

To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: [...]

[Figure 2 plot: COCO AP versus inference time (ms) for RetinaNet-50 and RetinaNet-101, with an inset table of AP and time for [A] YOLOv2 [27], [B] SSD321 [22], [C] DSSD321 [9], [D] R-FCN [3], [E] SSD513 [22], [F] DSSD513 [9], and [G] FPN FRCN [20]. Table notes: not plotted; extrapolated time.]

Figure 2. Speed (ms) versus accuracy (AP) on COCO test-dev. Enabled by the focal loss, our simple one-stage RetinaNet detector outperforms all previous one-stage and two-stage detectors, including the best reported Faster R-CNN [28] system from [20]. We show variants of RetinaNet with ResNet-50-FPN (blue circles) and ResNet-101-FPN (orange diamonds) at five scales (400-800 pixels). Ignoring the low-accuracy regime (AP<25), RetinaNet forms an upper envelope of all current detectors, and an improved variant (not shown) achieves [...] AP. Details are given in [...].

Introduction

Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism. As popularized in the R-CNN framework [11], the first stage generates a sparse set of candidate object locations and the second stage classifies each candidate location as one of the foreground classes or as background using a convolutional neural network.

Through a sequence of advances [10, 28, 20, 14], this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark [21].

Despite the success of two-stage detectors, a natural question to ask is: could a simple one-stage detector achieve similar accuracy? One-stage detectors are applied over a regular, dense sampling of object locations, scales, and aspect ratios. Recent work on one-stage detectors, such as YOLO [26, 27] and SSD [22, 9], demonstrates promising results, yielding faster detectors with accuracy within 10-40% relative to state-of-the-art two-stage methods.

This paper pushes the envelope further: we present a one-stage object detector that, for the first time, matches the state-of-the-art COCO AP of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) [20] or Mask R-CNN [14] variants of Faster R-CNN [28].

[arXiv stamp: 7 Feb 2018]

To achieve this result, we identify class imbalance during training as the main obstacle impeding one-stage detectors from achieving state-of-the-art accuracy and propose a new loss function that eliminates this barrier.

Class imbalance is addressed in R-CNN-like detectors by a two-stage cascade and sampling heuristics. The proposal stage (e.g., Selective Search [35], EdgeBoxes [39], DeepMask [24, 25], RPN [28]) rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples. In the second classification stage, sampling heuristics, such as a fixed foreground-to-background ratio (1:3), or online hard example mining (OHEM) [31], are performed to maintain a manageable balance between foreground and background.

In contrast, a one-stage detector must process a much larger set of candidate object locations regularly sampled across an image.
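To give a rough sense of scale, the sketch below counts the candidate boxes a dense detector would score on a single image. The stride set, the nine anchors per grid cell, and the 640x800 input size are illustrative assumptions rather than values stated above, and `count_dense_locations` is a hypothetical helper, not part of any released code.

```python
import math


def count_dense_locations(height, width,
                          strides=(8, 16, 32, 64, 128),
                          anchors_per_cell=9):
    """Count candidate boxes on a regular multi-scale grid.

    Each stride defines one grid over the image; every grid cell carries
    `anchors_per_cell` boxes (scales x aspect ratios). All defaults are
    assumptions chosen for illustration.
    """
    total_cells = sum(
        math.ceil(height / s) * math.ceil(width / s) for s in strides
    )
    return total_cells * anchors_per_cell


print(count_dense_locations(640, 800))  # 95985
```

Under these assumptions one image yields close to 10^5 candidates, roughly fifty to a hundred times the 1-2k proposals that a two-stage detector passes to its classification stage.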

In practice this often amounts to enumerating 100k locations that densely cover spatial positions, scales, and aspect ratios. While similar sampling heuristics may also be applied, they are inefficient as the training procedure is still dominated by easily classified background examples. This inefficiency is a classic problem in object detection that is typically addressed via techniques such as bootstrapping [33, 29] or hard example mining [37, 8, 31].

In this paper, we propose a new loss function that acts as a more effective alternative to previous approaches for dealing with class imbalance. The loss function is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases, see Figure 1.
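A minimal sketch of this scaling, assuming the form given in Figure 1, FL(p_t) = -(1 - p_t)^γ log(p_t), where p_t is the probability the model assigns to the ground truth class. The batch composition below (100,000 easy negatives at p_t = 0.99 and 10 hard examples at p_t = 0.10) is invented purely for illustration, and the default γ = 2 is simply one of the values plotted in Figure 1.

```python
import math


def cross_entropy(p_t):
    """CE(p_t) = -log(p_t) for the probability of the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)**gamma * log(p_t); gamma = 0 gives plain CE."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)


# Hypothetical batch: many easy background examples, a few hard ones.
easy = [0.99] * 100_000
hard = [0.10] * 10


def hard_share(gamma):
    """Fraction of the summed loss that comes from the hard examples."""
    easy_total = sum(focal_loss(p, gamma) for p in easy)
    hard_total = sum(focal_loss(p, gamma) for p in hard)
    return hard_total / (easy_total + hard_total)


for gamma in (0.0, 1.0, 2.0):
    print(gamma, round(hard_share(gamma), 4))
```

With γ = 0 (plain cross entropy) the hard examples contribute only about 2% of the summed loss in this toy batch; with γ = 2 they contribute over 99%, because each easy example's loss is multiplied by (1 - 0.99)^2 = 10^-4.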

Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. Experiments show that our proposed Focal Loss enables us to train a high-accuracy, one-stage detector that significantly outperforms the alternatives of training with the sampling heuristics or hard example mining, the previous state-of-the-art techniques for training one-stage detectors. Finally, we note that the exact form of the focal loss is not crucial, and we show other instantiations can achieve similar results.

To demonstrate the effectiveness of the proposed focal loss, we design a simple one-stage object detector called RetinaNet, named for its dense sampling of object locations in an input image.

Its design features an efficient in-network feature pyramid and use of anchor boxes. It draws on a variety of recent ideas from [22, 6, 28, 20]. RetinaNet is efficient and accurate; our best model, based on a ResNet-101-FPN backbone, achieves a COCO test-dev AP of [...] while running at 5 fps, surpassing the previously best published single-model results from both one and two-stage detectors, see Figure 2.

Related Work

Classic Object Detectors: The sliding-window paradigm, in which a classifier is applied on a dense image grid, has a long and rich history. One of the earliest successes is the classic work of LeCun et al. who applied convolutional neural networks to handwritten digit recognition [19, 36].

Viola and Jones [37] used boosted object detectors for face detection, leading to widespread adoption of such models. The introduction of HOG [4] and integral channel features [5] gave rise to effective methods for pedestrian detection. The work of [8] helped extend dense detectors to more general object categories and had top results on PASCAL [7] for many years. While the sliding-window approach was the leading detection paradigm in classic computer vision, with the resurgence of deep learning [18], two-stage detectors, described next, quickly came to dominate object detection.

Two-stage Detectors: The dominant paradigm in modern object detection is based on a two-stage approach.
