DetCo: Unsupervised Contrastive Learning for Object Detection

Enze Xie1, Jian Ding3*, Wenhai Wang4, Xiaohang Zhan5, Hang Xu2, Peize Sun1, Zhenguo Li2, Ping Luo1

1 The University of Hong Kong, 2 Huawei Noah's Ark Lab, 3 Wuhan University, 4 Nanjing University, 5 Chinese University of Hong Kong

Abstract

We present DetCo, a simple yet effective self-supervised approach for object detection. Unsupervised pre-training methods have recently been designed for object detection, but they are usually deficient in image classification, or the opposite.



Unlike them, DetCo transfers well to downstream instance-level dense prediction tasks while maintaining competitive image-level classification accuracy. The advantages derive from (1) multi-level supervision of intermediate representations and (2) contrastive learning between the global image and local patches. These two designs facilitate discriminative and consistent global and local representations at each level of the feature pyramid, improving both detection and classification. Experiments on VOC, COCO, Cityscapes, and ImageNet demonstrate that DetCo not only outperforms recent methods on a series of 2D and 3D instance-level detection tasks but is also competitive on image classification. For example, on ImageNet classification, DetCo achieves higher top-1 accuracy than InsLoc and DenseCL, two contemporary works designed for object detection. Moreover, on COCO detection, DetCo achieves higher AP than SwAV with Mask R-CNN C4. Notably, DetCo substantially improves the AP of Sparse R-CNN, a recent strong detector, establishing a new state of the art.

Introduction

Self-supervised learning of visual representations is an essential problem in computer vision, facilitating many downstream tasks such as image classification, object detection, and semantic segmentation [23, 35, 43]. It aims to provide models pre-trained on large-scale unlabeled data for downstream tasks.

Previous methods focus on designing different pretext tasks. One of the most promising directions among them is contrastive learning [32], which transforms one image into multiple views, minimizes the distance between views from the same image, and maximizes the distance between views from different images in a feature space.

Figure 1. Accuracy on classification and detection. DetCo achieves the best performance trade-off on both classification and detection. For example, DetCo outperforms its strong baseline, MoCo v2 [5], in AP on COCO detection. Moreover, DetCo is significantly better than the recent DenseCL [39], InsLoc [41], and PatchReID [8] on ImageNet classification while also having advantages on object detection. Note that these three methods are concurrent work specially designed for object detection (marked in green). The yellow asterisk indicates that a desired method should have high performance in both detection and classification.

*Equal contribution.

In the past two years, some methods based on contrastive learning and online clustering, such as MoCo v1/v2 [19, 5], BYOL [18], and SwAV [3], have made great progress in bridging the performance gap between unsupervised and fully supervised methods for image classification. However, their transfer ability on object detection is not satisfactory. Concurrent to our work, DenseCL [39], InsLoc [41], and PatchReID [8] also adopt contrastive learning to design detection-friendly pretext tasks. Nonetheless, these methods transfer well only on object detection and sacrifice image classification performance, as shown in Figure 1 and Table 1.
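The contrastive objective described above (pull views of the same image together, push views of different images apart) is commonly written as an InfoNCE loss. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: it uses in-batch negatives rather than a MoCo-style queue, and the function names and the temperature value are illustrative choices.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def info_nce_loss(queries, keys, temperature=0.2):
    """InfoNCE loss for a batch of paired views.

    queries[i] and keys[i] embed two views of image i (a positive pair);
    keys[j] with j != i serve as negatives. The temperature is an assumed,
    illustrative value.
    """
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    logits = q @ k.T / temperature                       # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives on the diagonal

rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 16))
aligned = anchor + 0.01 * rng.normal(size=(8, 16))  # views of the same images
shuffled = aligned[::-1].copy()                     # every pair is mismatched

loss_aligned = info_nce_loss(anchor, aligned)
loss_shuffled = info_nce_loss(anchor, shuffled)
```

In this toy setup the loss is small when each query is paired with a view of its own image and much larger when the pairs are mismatched, which is the behavior the pretext task rewards.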

It is therefore challenging to design a pretext task that can reconcile instance-level detection and image classification.

Table 1. Classification and detection trade-off for recent detection-friendly self-supervised methods. Compared with the concurrent InsLoc [41], DenseCL [39], and PatchReID [8], DetCo is significantly better on ImageNet classification. Moreover, DetCo is on par with these methods on dense prediction tasks, achieving the best trade-off. (Table body not reproduced; recoverable column headers: Method, Place, ImageNet.)

We hypothesize that there is no unbridgeable gap between image-level classification and instance-level detection.

Intuitively, image classification recognizes a global instance from a single high-level feature map, while object detection recognizes local instances from multi-level feature pyramids. From this perspective, it is desirable to build instance representations that are (1) discriminative at each level of the feature pyramid and (2) consistent for both the global image and local patches (windows). However, existing unsupervised methods overlook these two aspects. Therefore, detection and classification cannot mutually improve.

In this work, we present DetCo, a contrastive learning framework that benefits instance-level detection tasks while maintaining competitive image classification transfer accuracy.
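To make the distinction between the global image and local patches concrete, the sketch below cuts one image into a grid of non-overlapping windows. It is only an illustration: the helper name `split_into_patches` and the grid size are hypothetical and not taken from the text, and any augmentation or shuffling of the patches is left out.

```python
import numpy as np

def split_into_patches(image, grid=3):
    """Split an (H, W, C) image into grid * grid non-overlapping local patches.

    The grid size is an assumed, illustrative parameter. Rows and columns
    that do not divide evenly are cropped from the bottom and right edges.
    """
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    return [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(grid)
        for c in range(grid)
    ]

# A synthetic 96 x 96 RGB image stands in for a real training sample.
image = np.arange(96 * 96 * 3, dtype=np.float32).reshape(96, 96, 3)

global_view = image                                # the whole image
local_patches = split_into_patches(image, grid=3)  # its local windows
```

Both `global_view` and `local_patches` would then be encoded by the same backbone, so that a consistency objective can relate the image-level and patch-level representations.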

DetCo contains (1) multi-level supervision on features from different stages of the backbone network and (2) contrastive learning between the global image and local patches. Specifically, the multi-level supervision directly optimizes the features from each stage of the backbone network, ensuring strong discrimination at each level of the pyramid features. This supervision leads to better performance for dense object detectors through multi-scale prediction. The global and local contrastive learning guides the network to learn consistent representations at both the image level and the patch level, which not only keeps each local patch highly discriminative but also promotes the whole-image representation, benefiting both object detection and image classification.

DetCo achieves state-of-the-art transfer performance on various 2D and 3D instance-level detection tasks, such as VOC and COCO object detection, semantic segmentation, and DensePose. Moreover, the performance of DetCo on ImageNet classification and VOC SVM classification remains very competitive. For example, as shown in Figure 1 and Table 1, DetCo improves on MoCo v2 in both classification and dense prediction tasks. DetCo is significantly better than DenseCL [39], InsLoc [41], and PatchReID [8] on ImageNet classification and slightly better on object detection and semantic segmentation. Note that DenseCL, InsLoc, and PatchReID are three concurrent works designed for object detection that sacrifice classification.
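The two components described above, multi-level supervision and global/local contrastive learning, can be combined into a single training objective. The sketch below is one plausible reading rather than the paper's exact formulation. It assumes that each backbone stage yields one embedding per image for the global view and one aggregated embedding for its local patches, that three contrastive terms (global-global, local-local, global-local) are summed per stage, and that the stage weights shown are hypothetical.

```python
import numpy as np

def info_nce(q, k, temperature=0.2):
    """In-batch InfoNCE: q[i] and k[i] are positives, other rows are negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def multi_level_loss(global_q, global_k, local_q, local_k, stage_weights):
    """Weighted sum over backbone stages of three contrastive terms.

    Each argument except stage_weights is a list with one (N, D) array per
    stage. The choice of terms and weights is an assumption for illustration.
    """
    total = 0.0
    for w, gq, gk, lq, lk in zip(stage_weights, global_q, global_k, local_q, local_k):
        total += w * (
            info_nce(gq, gk)    # global view vs. global view
            + info_nce(lq, lk)  # local patches vs. local patches
            + info_nce(gq, lk)  # global view vs. local patches
        )
    return total

# Random features stand in for the outputs of four backbone stages.
rng = np.random.default_rng(0)
num_stages, batch, dim = 4, 8, 16
global_q = [rng.normal(size=(batch, dim)) for _ in range(num_stages)]
global_k = [g + 0.01 * rng.normal(size=g.shape) for g in global_q]
local_q = [g + 0.01 * rng.normal(size=g.shape) for g in global_q]
local_k = [g + 0.01 * rng.normal(size=g.shape) for g in global_q]
stage_weights = [0.25, 0.5, 0.75, 1.0]  # hypothetical per-stage weights

loss = multi_level_loss(global_q, global_k, local_q, local_k, stage_weights)
```

Supervising every stage, instead of only the final one, is what pushes each level of the feature pyramid to be discriminative on its own, while the global-local term ties the patch-level and image-level representations together.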

