Objects as Points

Xingyi Zhou, UT Austin, zhouxy@cs.utexas.edu
Dequan Wang, UC Berkeley, dqwang@cs.berkeley.edu
Philipp Krähenbühl, UT Austin, philkr@cs.utexas.edu

Abstract

Detection identifies objects as axis-aligned boxes in an image. Most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each. This is wasteful, inefficient, and requires additional post-processing. In this paper, we take a different approach. We model an object as a single point, the center point of its bounding box. Our detector uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, and even pose.

Our center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors. CenterNet achieves the best speed-accuracy trade-off on the MS COCO dataset, with [...] AP at 142 FPS, [...] AP at 52 FPS, and [...] AP with multi-scale testing at [...] FPS. We use the same approach to estimate 3D bounding boxes in the KITTI benchmark and human pose on the COCO keypoint dataset. Our method performs competitively with sophisticated multi-stage methods and runs in real time.

Introduction

Object detection powers many vision tasks like instance segmentation [7, 21, 32], pose estimation [3, 15, 39], tracking [24, 27], and action recognition [5].

It has downstream applications in surveillance [57], autonomous driving [53], and visual question answering [1]. Current object detectors represent each object through an axis-aligned bounding box that tightly encompasses the object [18, 19, 33, 43, 46]. They then reduce object detection to image classification of an extensive number of potential object bounding boxes. For each bounding box, the classifier determines if the image content is a specific object or background. One-stage detectors [33, 43] slide a complex arrangement of possible bounding boxes, called anchors, over the image and classify them directly without specifying the box content.

Figure 1: Speed-accuracy trade-off on COCO validation for real-time detectors (x-axis: inference time in ms; y-axis: COCO AP; methods shown: CenterNet (ours), FasterRCNN, RetinaNet, YOLOv3). The proposed CenterNet outperforms a range of state-of-the-art algorithms.

Two-stage detectors [18, 19, 46] recompute image features for each potential box, then classify those features. Post-processing, namely non-maxima suppression, then removes duplicated detections for the same instance by computing bounding box IoU. This post-processing is hard to differentiate and train [23], hence most current detectors are not end-to-end trainable.
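To make the post-processing step concrete, here is a minimal sketch of bounding box IoU and a greedy form of non-maxima suppression. It assumes boxes are given as (x1, y1, x2, y2) tuples. The function names and the 0.5 threshold are illustrative choices, not values taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    iou_threshold=0.5 is an assumed, illustrative default.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap strongly with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)  # [0, 2]: the second box overlaps the first (IoU = 81/119)
```

The loop compares every candidate against all boxes kept so far, and its output depends on a hand-set threshold. This is the step that a detector with a single positive location per object can skip.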

Nonetheless, over the past five years [19], this idea has achieved good empirical success [12, 21, 25, 26, 31, 35, 47, 48, 56, 62, 63]. Sliding window based object detectors are, however, a bit wasteful, as they need to enumerate all possible object locations and dimensions.

In this paper, we provide a much simpler and more efficient alternative. We represent objects by a single point at their bounding box center (see Figure 2). Other properties, such as object size, dimension, 3D extent, orientation, and pose, are then regressed directly from image features at the center location.

Object detection is then a standard keypoint estimation problem [3, 39, 60]. We simply feed the input image to a fully convolutional network [37, 40] that generates a heatmap. Peaks in this heatmap correspond to object centers. Image features at each peak predict the object's bounding box height and width. The model trains using standard dense supervised learning [39, 60]. Inference is a single network forward pass, without non-maximal suppression.

(Version dated 25 Apr 2019.)

Figure 2: We model an object as the center point of its bounding box. The bounding box size and other object properties are inferred from the keypoint feature at the center. Best viewed in color.

Our method is general and can be extended to other tasks with minor effort. We provide experiments on 3D object detection [17] and multi-person human pose estimation [4], by predicting additional outputs at each center point (see Figure 4). For 3D bounding box estimation, we regress to the object absolute depth, 3D bounding box dimensions, and object orientation [38]. For human pose estimation, we consider the 2D joint locations as offsets from the center and directly regress to them at the center point location.

The simplicity of our method, CenterNet, allows it to run at a very high speed (Figure 1).
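The inference procedure described above (heatmap peaks give object centers, and the prediction at each peak gives the box size) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation. The 8-neighbour peak test, the score threshold of 0.3, the output stride of 4, and a size map that stores (width, height) in input pixels are all assumptions made for the example.

```python
def find_peaks(heatmap, threshold=0.3):
    """Return (score, row, col) for cells that are >= all 8 neighbours."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            v = heatmap[r][c]
            if v < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(v >= n for n in neighbours):
                peaks.append((v, r, c))
    return sorted(peaks, reverse=True)  # highest score first


def decode_boxes(heatmap, size_map, stride=4, threshold=0.3):
    """Turn heatmap peaks into (score, x1, y1, x2, y2) in input pixels."""
    boxes = []
    for score, r, c in find_peaks(heatmap, threshold):
        w, h = size_map[r][c]            # size read at the peak location
        cx, cy = c * stride, r * stride  # map grid cell back to the image
        boxes.append((score, cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.2],
    [0.0, 0.0, 0.2, 0.8],
]
size_map = [[(0, 0)] * 4 for _ in range(4)]
size_map[1][1] = (8, 4)
size_map[3][3] = (4, 4)

detections = decode_boxes(heatmap, size_map)
print(detections)
# [(0.9, 0.0, 2.0, 8.0, 6.0), (0.8, 10.0, 10.0, 14.0, 14.0)]
```

Each detection comes from exactly one grid cell, so no pairwise box comparison is needed. Other per-object outputs, such as joint offsets for pose, could be read from the same peak location in the same way.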

With a simple ResNet-18 and up-convolutional layers [55], our network runs at 142 FPS with [...] bounding box AP. With a carefully designed keypoint detection network, DLA-34 [58], our network achieves [...] AP at 52 FPS. Equipped with the state-of-the-art keypoint estimation network, Hourglass-104 [30, 40], and multi-scale testing, our network achieves [...] AP at [...] FPS. On 3D bounding box estimation and human pose estimation, we perform competitively with the state of the art at a higher inference speed. Code is available at [...].

Related work

Object detection by region classification. One of the first successful deep object detectors, RCNN [19], enumerates object locations from a large set of region candidates [52], crops them, and classifies each using a deep network.

Fast-RCNN [18] crops image features instead, to save computation. However, both methods rely on slow low-level region proposal methods.

Faster RCNN [46] generates region proposals within the detection network. It samples fixed-shape bounding boxes (anchors) around a low-resolution image grid and classifies each into "foreground or not". An anchor is labeled foreground with an overlap > [...] with any ground truth object, background with an overlap < [...], or ignored otherwise. Each generated region proposal is again classified [18]. Changing the proposal classifier to a multi-class classification forms the basis of one-stage detectors.
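The overlap-based anchor labeling described above can be illustrated with a short sketch. The thresholds are passed in as parameters, and the values 0.7 and 0.3 used in the example call are placeholders chosen for illustration only, because the threshold values are missing from the text above. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, fg_thresh, bg_thresh):
    """Label one anchor by its best overlap with any ground truth box."""
    best = max((iou(anchor, g) for g in gt_boxes), default=0.0)
    if best > fg_thresh:
        return "foreground"
    if best < bg_thresh:
        return "background"
    return "ignored"  # overlap falls between the two thresholds


gt_boxes = [(0, 0, 10, 10)]
anchors = [(0, 0, 10, 10), (5, 0, 15, 10), (20, 20, 30, 30)]
# 0.7 and 0.3 are assumed placeholder thresholds, not values from the text.
labels = [label_anchor(a, gt_boxes, fg_thresh=0.7, bg_thresh=0.3)
          for a in anchors]
print(labels)  # ['foreground', 'ignored', 'background']
```

The middle anchor has an IoU of 1/3 with the ground truth box, so with these placeholder thresholds it is neither positive nor negative. The label of every anchor depends on both its shape and the two hand-set thresholds.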

Several improvements to one-stage detectors include anchor shape priors [44, 45], different feature resolution [36], and loss re-weighting among different samples [33].

Our approach is closely related to anchor-based one-stage approaches [33, 36, 43]. A center point can be seen as a single shape-agnostic anchor (see Figure 3). However, there are a few important differences. First, our CenterNet assigns the "anchor" based solely on location, not box overlap [18]. We have no manual thresholds [18] for foreground and background classification. Second, we only have one positive "anchor" per object, and hence do not need Non-Maximum Suppression (NMS) [2].
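As a rough illustration of location-only assignment, the sketch below marks exactly one positive cell per ground truth box, at the grid cell containing the box center. It is a simplified assumption-laden example rather than the paper's training target: the output stride of 4, the binary 0/1 target, and the function name are all choices made here for illustration.

```python
def center_targets(gt_boxes, height, width, stride=4):
    """Build a low-resolution 0/1 map with one positive cell per box.

    Boxes are (x1, y1, x2, y2) in input pixels with non-negative coordinates.
    stride=4 is an assumed output stride.
    """
    out_h, out_w = height // stride, width // stride
    target = [[0] * out_w for _ in range(out_h)]
    for x1, y1, x2, y2 in gt_boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        # The positive location depends only on the center, not on box shape.
        col = min(int(cx // stride), out_w - 1)
        row = min(int(cy // stride), out_h - 1)
        target[row][col] = 1
    return target


gt_boxes = [(0, 0, 16, 8), (20, 12, 28, 28)]
target = center_targets(gt_boxes, height=32, width=32)
positives = [(r, c) for r, row in enumerate(target)
             for c, v in enumerate(row) if v == 1]
print(positives)  # [(1, 2), (5, 6)]
```

No overlap computation or foreground and background threshold appears anywhere in the assignment, and each object contributes a single positive location.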

