Transcription of Objects as Points
Objects as Points
Xingyi Zhou (UT), Wang (UC), Krähenbühl (UT)

Abstract
Detection identifies objects as axis-aligned boxes in an image. Most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each. This is wasteful, inefficient, and requires additional post-processing. In this paper, we take a different approach. We model an object as a single point, the center point of its bounding box. Our detector uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, and even pose. Our center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors. CenterNet achieves the best speed-accuracy trade-off on the MS COCO dataset, at 142 FPS, at 52 FPS, and with multi-scale testing at [...] FPS.
We use the same approach to estimate 3D bounding boxes in the KITTI benchmark and human pose on the COCO keypoint dataset. Our method performs competitively with sophisticated multi-stage methods and runs in real time.

Introduction
Object detection powers many vision tasks like instance segmentation [7, 21, 32], pose estimation [3, 15, 39], tracking [24, 27], and action recognition [5]. It has downstream applications in surveillance [57], autonomous driving [53], and visual question answering [1]. Current object detectors represent each object through an axis-aligned bounding box that tightly encompasses the object [18, 19, 33, 43, 46]. They then reduce object detection to image classification of an extensive number of potential object bounding boxes. For each bounding box, the classifier determines if the image content is a specific object or background.
One-stage detectors [33, 43] slide a complex arrangement of possible bounding boxes, called anchors, over the image and classify them directly without specifying the box content. Two-stage detectors [18, 19, 46] recompute image features for each potential box, then classify those features. Post-processing, namely non-maxima suppression, then removes duplicated detections for the same instance by computing bounding box IoU. This post-processing is hard to differentiate and train [23], hence most current detectors are not end-to-end trainable.

Figure 1: Speed-accuracy trade-off on COCO validation for real-time detectors (x-axis: inference time in ms; y-axis: COCO AP; methods shown: CenterNet (ours), FasterRCNN, RetinaNet, YOLOv3). The proposed CenterNet outperforms a range of state-of-the-art algorithms.
Nonetheless, over the past five years [19], this idea has achieved good empirical success [12, 21, 25, 26, 31, 35, 47, 48, 56, 62, 63]. Sliding window based object detectors are however a bit wasteful, as they need to enumerate all possible object locations and dimensions.

In this paper, we provide a much simpler and more efficient alternative. We represent objects by a single point at their bounding box center (see Figure 2). Other properties, such as object size, dimension, 3D extent, orientation, and pose are then regressed directly from image features at the center location. Object detection is then a standard keypoint estimation problem [3, 39, 60]. We simply feed the input image to a fully convolutional network [37, 40] that generates a heatmap. Peaks in this heatmap correspond to object centers.
Image features at each peak predict the object's bounding box height and width. The model trains using standard dense supervised learning [39, 60]. Inference is a single network forward pass, without non-maximal suppression.

25 Apr 2019

Figure 2: We model an object as the center point of its bounding box. The bounding box size and other object properties are inferred from the keypoint feature at the center. Best viewed in color.

Our method is general and can be extended to other tasks with minor effort. We provide experiments on 3D object detection [17] and multi-person human pose estimation [4], by predicting additional outputs at each center point (see Figure 4). For 3D bounding box estimation, we regress to the object absolute depth, 3D bounding box dimensions, and object orientation [38].
For human pose estimation, we consider the 2D joint locations as offsets from the center and directly regress to them at the center point.

The simplicity of our method, CenterNet, allows it to run at a very high speed (Figure 1). With a simple Resnet-18 and up-convolutional layers [55], our network runs at 142 FPS with [...] bounding box AP. With a carefully designed keypoint detection network, DLA-34 [58], our network achieves [...] AP at 52 FPS. With the state-of-the-art keypoint estimation network, Hourglass-104 [30, 40], and multi-scale testing, our network achieves [...] AP at [...] FPS. On 3D bounding box estimation and human pose estimation, we perform competitively with state-of-the-art at a higher inference speed. Code is available at [...].

Related work
Object detection by region classification. One of the first successful deep object detectors, RCNN [19], enumerates object location from a large set of region candidates [52], crops them, and classifies each using a deep network.
Fast-RCNN [18] crops image features instead, to save computation. However, both methods rely on slow low-level region proposal methods.

Faster-RCNN [46] generates region proposals within the detection network. It samples fixed-shape bounding boxes (anchors) around a low-resolution image grid and classifies each into foreground or not. An anchor is labeled foreground with a > [...] overlap with any ground truth object, background with a < [...] overlap, or ignored otherwise. Each generated region proposal is again classified [18]. Changing the proposal classifier to a multi-class classification forms the basis of one-stage detectors. Several improvements to one-stage detectors include anchor shape priors [44, 45], different feature resolution [36], and loss re-weighting among different samples [33].
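The overlap-based anchor labelling described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation of any cited detector: the names box_iou and label_anchors, the (x1, y1, x2, y2) box layout, and the label codes 1 / 0 / -1 are hypothetical choices made here, and the two thresholds are left as parameters that the caller must supply.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    # Corners of the intersection rectangle
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Clip negative extents to zero so disjoint boxes get zero intersection
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def label_anchors(anchors, gt_boxes, pos_thresh, neg_thresh):
    """Label each anchor 1 (foreground), 0 (background), or -1 (ignored).

    pos_thresh and neg_thresh are caller-supplied (assumed) IoU thresholds.
    """
    labels = np.full(len(anchors), -1, dtype=int)
    for i, anchor in enumerate(anchors):
        # Best overlap of this anchor with any ground truth box
        best = box_iou(anchor, gt_boxes).max() if len(gt_boxes) else 0.0
        if best > pos_thresh:
            labels[i] = 1
        elif best < neg_thresh:
            labels[i] = 0
    return labels

# Usage with illustrative threshold values chosen for this example only
anchors = [[0, 0, 10, 10], [0, 0, 10, 5], [20, 20, 30, 30]]
gt_boxes = [[0, 0, 10, 10]]
labels = label_anchors(anchors, gt_boxes, pos_thresh=0.6, neg_thresh=0.4)
```

The sketch makes the dependence on two hand-picked thresholds explicit, and shows that several anchors can be positive for the same object.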
Our approach is closely related to anchor-based one-stage approaches [33, 36, 43]. A center point can be seen as a single shape-agnostic anchor (see Figure 3). However, there are a few important differences. First, our CenterNet assigns the anchor based solely on location, not box overlap [18]. We have no manual thresholds [18] for foreground and background classification. Second, we only have one positive anchor per object, and hence do not need Non-Maximum Suppression (NMS) [2]. We simply extract local peaks in the keypoint heatmap [4, 39]. Third, CenterNet uses a larger output resolution (output stride of 4) compared to traditional object detectors [21, 22] (output stride of 16). This eliminates the need for multiple anchors [47].

Object detection by keypoint estimation. We are not the first to use keypoint estimation for object detection.
CornerNet [30] detects two bounding box corners as keypoints, while ExtremeNet [61] detects the top-, left-, bottom-, right-most, and center points of all objects. Both these methods build on the same robust keypoint estimation network as our CenterNet. However, they require a combinatorial grouping stage after keypoint detection, which significantly slows down each algorithm. Our CenterNet, on the other hand, simply extracts a single center point per object without the need for grouping or post-processing.

3D object bounding box estimation powers autonomous driving [17]. Deep3Dbox [38] uses a slow-RCNN [19] style framework, by first detecting 2D objects [46] and then feeding each object into a 3D estimation network. 3D RCNN [29] adds an additional head to Faster-RCNN [46] followed by a 3D projection.
DeepManta [6] uses a coarse-to-fine Faster-RCNN [46] trained on many tasks. Our method is similar to a one-stage version of Deep3Dbox [38] or 3D RCNN [29]. As such, CenterNet is much simpler and faster than competing methods.

Figure 3: Differences between anchor-based detectors (a) and our center point detector (b). (a) Standard anchor based detection. Anchors count as positive with an overlap IoU > [...] with any object, negative with an overlap IoU < [...], or are ignored otherwise. (b) Center point based detection. The center pixel is assigned to the object. Nearby points have a reduced negative loss. Object size is regressed. Best viewed on screen.

Preliminary
Let $I \in R^{W \times H \times 3}$ be an input image of width $W$ and height $H$. Our aim is to produce a keypoint heatmap $\hat{Y} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, where $R$ is the output stride and $C$ is the number of keypoint types.
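To make the heatmap shape and the idea of reading object centers off its peaks concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' code: the names heatmap_shape and extract_peaks, the 3x3 neighborhood used to define a local peak, and the score threshold of 0.5 are all hypothetical choices made for this example.

```python
import numpy as np

def heatmap_shape(width, height, stride, num_classes):
    """Shape (W/R, H/R, C) of the keypoint heatmap for a W x H input image."""
    return (width // stride, height // stride, num_classes)

def extract_peaks(heatmap, threshold=0.5):
    """Return (x, y, class, score) for each local maximum above threshold.

    A location is a peak if no value in its 3x3 spatial neighborhood
    (same class channel) is larger. Both the neighborhood size and the
    threshold are assumptions of this sketch. Ties on a plateau are all kept.
    """
    w, h, _ = heatmap.shape
    # Pad the two spatial axes with -inf so border cells compare correctly
    padded = np.pad(heatmap, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    neigh_max = np.full_like(heatmap, -np.inf)
    for dx in range(3):
        for dy in range(3):
            neigh_max = np.maximum(neigh_max, padded[dx:dx + w, dy:dy + h, :])
    is_peak = (heatmap >= neigh_max) & (heatmap > threshold)
    xs, ys, cs = np.nonzero(is_peak)
    return [(int(x), int(y), int(c), float(heatmap[x, y, c]))
            for x, y, c in zip(xs, ys, cs)]

# Usage: a 32 x 16 image, output stride R = 4, C = 2 keypoint types
hm = np.zeros(heatmap_shape(32, 16, 4, 2))
hm[2, 1, 0] = 0.9   # a center of class 0
hm[3, 1, 0] = 0.6   # weaker neighbor, not a local maximum
hm[6, 3, 1] = 0.8   # a center of class 1
peaks = extract_peaks(hm)
```

Each returned peak lives on the low-resolution grid, so multiplying its x and y by the stride R gives an approximate location in the input image.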