Objects as Points

Xingyi Zhou (UT), Wang (UC), Krähenbühl (UT)

Detection identifies objects as axis-aligned boxes in an image. Most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each. This is wasteful, inefficient, and requires additional post-processing. In this paper, we take a different approach. We model an object as a single point: the center point of its bounding box. Our detector uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, and even pose.

Our center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors. CenterNet achieves the best speed-accuracy trade-off on the MS COCO dataset, with [...] AP at 142 FPS, [...] AP at 52 FPS, and [...] AP with multi-scale testing at [...] FPS. We use the same approach to estimate 3D bounding box in the KITTI benchmark and human pose on the COCO keypoint dataset. Our method performs competitively with sophisticated multi-stage methods and runs in real-time.

Introduction

Object detection powers many vision tasks like instance segmentation [7, 21, 32], pose estimation [3, 15, 39], tracking [24, 27], and action recognition [5].

It has down-stream applications in surveillance [57], autonomous driving [53], and visual question answering [1]. Current object detectors represent each object through an axis-aligned bounding box that tightly encompasses the object [18, 19, 33, 43, 46]. They then reduce object detection to image classification of an extensive number of potential object bounding boxes. For each bounding box, the classifier determines if the image content is a specific object or background. One-stage detectors [33, 43] slide a complex arrangement of possible bounding boxes, called anchors, over the image and classify them directly without specifying the box content.

Two-stage detectors [18, 19, 46] recompute image features for each potential box, then classify those features. Post-processing, namely non-maxima suppression, then removes duplicated detections for the same instance by computing bounding box IoU. This post-processing is hard to differentiate and train [23], hence most current detectors are not end-to-end trainable.

Figure 1: Speed-accuracy trade-off on COCO validation for real-time detectors (x-axis: inference time in ms; y-axis: COCO AP; methods shown: CenterNet (ours), Faster-RCNN, RetinaNet, YOLOv3). The proposed CenterNet outperforms a range of state-of-the-art algorithms.
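The duplicate-removal step described above can be illustrated with a minimal sketch of greedy non-maxima suppression. This is a generic textbook formulation written for illustration, not the implementation used by any of the cited detectors; the box layout `(x1, y1, x2, y2)`, the function names `iou` and `nms`, and the threshold value in the example are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float) -> List[int]:
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if it does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes for one instance, plus one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)  # illustrative threshold
print(kept)
```

The loop is sequential and contains a hard keep-or-drop decision per box, which gives some intuition for why this step is awkward to differentiate through.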

Nonetheless, over the past five years [19], this idea has achieved good empirical success [12, 21, 25, 26, 31, 35, 47, 48, 56, 62, 63]. Sliding window based object detectors are however a bit wasteful, as they need to enumerate all possible object locations and dimensions.

In this paper, we provide a much simpler and more efficient alternative. We represent objects by a single point at their bounding box center (see Figure 2). Other properties, such as object size, dimension, 3D extent, orientation, and pose are then regressed directly from image features at the center location.

Object detection is then a standard keypoint estimation problem [3, 39, 60]. We simply feed the input image to a fully convolutional network [37, 40] that generates a heatmap. Peaks in this heatmap correspond to object centers. Image features at each peak predict the object's bounding box height and width. The model trains using standard dense supervised learning [39, 60]. Inference is a single network forward-pass, without non-maximal suppression.

25 Apr 2019

Figure 2: We model an object as the center point of its bounding box. The bounding box size and other object properties are inferred from the keypoint feature at the center.
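To make the heatmap-to-box step concrete, the following is a small sketch of how peaks in a center heatmap could be turned into boxes. It assumes details the passage above does not state: a peak is taken to be a cell that is at least as large as its 8 neighbours and above a score threshold, the size map stores `(width, height)` in input pixels, and `stride` is a hypothetical ratio between input and heatmap resolution. The names `find_peaks` and `decode` are illustrative.

```python
from typing import List, Tuple

Detection = Tuple[float, float, float, float, float]  # (x1, y1, x2, y2, score)


def find_peaks(heatmap: List[List[float]], threshold: float) -> List[Tuple[int, int]]:
    """Return (row, col) cells that score at least `threshold` and are
    greater than or equal to all of their (up to 8) neighbours."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heatmap[r][c]
            if v < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(v >= n for n in neighbours):
                peaks.append((r, c))
    return peaks


def decode(heatmap, sizes, threshold: float, stride: int) -> List[Detection]:
    """Build one box per peak from the size predicted at that peak."""
    detections = []
    for r, c in find_peaks(heatmap, threshold):
        box_w, box_h = sizes[r][c]          # assumed (width, height) in input pixels
        cx, cy = c * stride, r * stride     # peak location mapped back to the input
        detections.append(
            (cx - box_w / 2, cy - box_h / 2, cx + box_w / 2, cy + box_h / 2, heatmap[r][c])
        )
    return detections


# Toy 4x4 heatmap with a single confident center at row 1, column 2.
heatmap = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
sizes = [[(0.0, 0.0)] * 4 for _ in range(4)]
sizes[1][2] = (8.0, 4.0)
detections = decode(heatmap, sizes, threshold=0.5, stride=4)
print(detections)
```

Each peak yields exactly one box, so no pairwise overlap test between detections is needed in this sketch. Cells on a flat plateau of equal values would all count as peaks under the `>=` comparison used here.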

Best viewed in color.

Our method is general and can be extended to other tasks with minor effort. We provide experiments on 3D object detection [17] and multi-person human pose estimation [4], by predicting additional outputs at each center point (see Figure 4). For 3D bounding box estimation, we regress to the object absolute depth, 3D bounding box dimensions, and object orientation [38]. For human pose estimation, we consider the 2D joint locations as offsets from the center and directly regress to them at the center point location.

The simplicity of our method, CenterNet, allows it to run at a very high speed (Figure 1).
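The pose parameterization described above, joints expressed as 2D offsets from the object center, can be sketched as a simple encode and decode pair. The function names, the `(x, y)` point layout, and the toy coordinates are assumptions made for illustration; how the offsets are actually predicted and trained is not covered by this sketch.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # assumed layout: (x, y) in image pixels


def encode_joints(center: Point, joints: List[Point]) -> List[Point]:
    """Regression targets: each joint expressed relative to the center point."""
    cx, cy = center
    return [(x - cx, y - cy) for x, y in joints]


def decode_joints(center: Point, offsets: List[Point]) -> List[Point]:
    """Recover absolute joint locations from a center and its offsets."""
    cx, cy = center
    return [(cx + dx, cy + dy) for dx, dy in offsets]


# Toy person with a center point and two joints.
center = (50.0, 40.0)
joints = [(48.0, 20.0), (60.0, 45.0)]
offsets = encode_joints(center, joints)
recovered = decode_joints(center, offsets)
print(offsets, recovered)
```

Because every joint is tied to one center, joints from different people are grouped by construction in this representation.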

With a simple Resnet-18 and up-convolutional layers [55], our network runs at 142 FPS with [...] bounding box AP. With a carefully designed keypoint detection network, DLA-34 [58], our network achieves [...] AP at 52 FPS. With the state-of-the-art keypoint estimation network, Hourglass-104 [30, 40], and multi-scale testing, our network achieves [...] AP at [...] FPS. On 3D bounding box estimation and human pose estimation, we perform competitively with state-of-the-art at a higher inference speed. Code is available at [...].

Related work

Object detection by region classification. One of the first successful deep object detectors, RCNN [19], enumerates object location from a large set of region candidates [52], crops them, and classifies each using a deep network.

Fast-RCNN [18] crops image features instead, to save computation. However, both methods rely on slow low-level region proposal methods.

Faster RCNN [46] generates region proposal within the detection network. It samples fixed-shape bounding boxes (anchors) around a low-resolution image grid and classifies each into foreground or not. An anchor is labeled foreground with a > [...] overlap with any ground truth object, background with a < [...] overlap, or ignored otherwise. Each generated region proposal is again classified [18]. Changing the proposal classifier to a multi-class classification forms the basis of one-stage detectors.
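The overlap-based anchor labeling rule described above can be sketched as follows. This is an illustrative reading of the rule, not code from any cited detector: the function names and box layout are hypothetical, and the two thresholds are passed in as parameters. The values 0.7 and 0.3 used in the example are assumptions of this sketch.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor: Box, gt_boxes: List[Box],
                 fg_threshold: float, bg_threshold: float) -> str:
    """Label an anchor by its best overlap with any ground truth box."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > fg_threshold:
        return "foreground"
    if best < bg_threshold:
        return "background"
    return "ignored"


gt_boxes = [(0, 0, 10, 10)]
anchors = [(0, 0, 10, 10), (5, 0, 15, 10), (20, 20, 30, 30)]
# Thresholds below are illustrative assumptions, not values taken from the text.
labels = [label_anchor(a, gt_boxes, fg_threshold=0.7, bg_threshold=0.3) for a in anchors]
print(labels)
```

The middle anchor overlaps the object by one third, which falls between the two example thresholds, so it contributes to neither class.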

Several improvements to one-stage detectors include anchor shape priors [44, 45], different feature resolution [36], and loss re-weighting among different samples [33].

Our approach is closely related to anchor-based one-stage approaches [33, 36, 43]. A center point can be seen as a single shape-agnostic anchor (see Figure 3). However, there are a few important differences. First, our CenterNet assigns the anchor based solely on location, not box overlap [18]. We have no manual thresholds [18] for foreground and background classification. Second, we only have one positive anchor per object, and hence do not need Non-Maximum Suppression (NMS) [2].
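The two differences listed above, assignment by location only and one positive per object, can be illustrated with a short sketch. It makes several assumptions that this passage does not specify: the output grid is `stride` times coarser than the input, the center is mapped to a cell by integer division, and the target is a hard 0/1 grid. The function name `assign_center_targets` is hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def assign_center_targets(boxes: List[Box], height: int, width: int,
                          stride: int) -> List[List[int]]:
    """Mark one positive cell per object at its box center.

    No overlap is computed and no threshold is involved: the assignment
    depends only on where the center falls on the output grid.
    """
    target = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        col = int(((x1 + x2) / 2) // stride)
        row = int(((y1 + y2) / 2) // stride)
        if 0 <= row < height and 0 <= col < width:
            target[row][col] = 1
    return target


# Two objects on a 32x32 input with a hypothetical stride of 4 (8x8 grid).
boxes = [(4, 2, 12, 6), (20, 20, 28, 28)]
target = assign_center_targets(boxes, height=8, width=8, stride=4)
num_positives = sum(sum(row) for row in target)
print(num_positives)
```

In this toy case the number of positive cells equals the number of objects. Two objects whose centers fall into the same cell would share a single positive in this simplified version.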
