
Dense Contrastive Learning for Self-Supervised Visual Pre-Training


Xinlong Wang1, Rufeng Zhang2, Chunhua Shen1*, Tao Kong3, Lei Li3
1 The University of Adelaide, Australia; 2 Tongji University, China; 3 ByteDance AI Lab
* Corresponding author.

Abstract

To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only <1% slower), but demonstrates consistently superior performance when transferring to downstream dense prediction tasks, including object detection, semantic segmentation and instance segmentation, and it outperforms the state-of-the-art methods by a large margin.

Specifically, over the strong MoCo-v2 baseline, our method achieves significant improvements in AP on PASCAL VOC object detection, on COCO object detection and on COCO instance segmentation, and in mIoU on PASCAL VOC semantic segmentation and on Cityscapes. Code and models are available online.

Figure 1: Comparisons of pre-trained models by fine-tuning on object detection and semantic segmentation datasets. Sup. IN denotes supervised pre-training on ImageNet; COCO and ImageNet indicate pre-training models trained on COCO and ImageNet respectively. (a) Object detection results of a Faster R-CNN detector fine-tuned on VOC trainval07+12 for 24k iterations and evaluated on VOC test2007. (b) Semantic segmentation results of an FCN model fine-tuned on VOC train_aug2012 for 20k iterations and evaluated on val2012. The results are averaged over 5 independent trials.

Introduction

Pre-training has become a well-established paradigm in many computer vision tasks. In a typical pre-training paradigm, models are first pre-trained on large-scale datasets and then fine-tuned on target tasks with less training data. Specifically, supervised ImageNet pre-training has been dominant for years, where models are pre-trained to solve image classification and then transferred to downstream tasks. However, there is a gap between image-level classification pre-training and target dense prediction tasks, such as object detection [9,25] and semantic segmentation [5].

The former focuses on assigning a category to an input image, while the latter needs to perform dense classification or regression over the whole image. For example, semantic segmentation aims to assign a category to each pixel, and object detection aims to predict the categories and bounding boxes for all object instances of interest. A straightforward solution would be to pre-train on dense prediction tasks directly. However, annotation for these tasks is notoriously time-consuming compared to image-level labeling, making it hard to collect data at a massive scale to pre-train a universal feature representation. Recently, unsupervised visual pre-training has attracted much research attention; it aims to learn a proper visual representation from a large set of unlabeled images. A few methods [17,2,3,14] show their effectiveness in downstream tasks, achieving comparable or better results than supervised ImageNet pre-training.

However, the gap between image classification pre-training and target dense prediction tasks still exists. First, almost all recent self-supervised learning methods formulate the learning as image-level prediction using global features. They can all be thought of as classifying each image into its own category, i.e., instance discrimination [41]. Moreover, existing approaches are usually evaluated and optimized on image classification benchmarks. Nevertheless, better image classification does not guarantee more accurate object detection, as shown in [18]. Thus, self-supervised learning that is customized for dense prediction tasks is in demand. As for unsupervised pre-training, dense annotation is no longer needed. A clear approach would be to pre-train as a dense prediction task directly, thus removing the gap between pre-training and the target dense prediction tasks. Inspired by supervised dense prediction tasks, e.g., semantic segmentation, which performs dense per-pixel classification, we propose dense contrastive learning (DenseCL) for self-supervised visual pre-training. DenseCL views the self-supervised learning task as dense pairwise contrastive learning rather than global image classification.

First, we introduce a dense projection head that takes the features from backbone networks as input and generates dense feature vectors. Compared to the existing global projection head, which applies global pooling to the backbone features and outputs a single global feature vector per image, our method naturally preserves the spatial information and constructs a dense output format. Second, we define the positive sample of each local feature vector by extracting the correspondence across views. To construct an unsupervised objective function, we further design a dense contrastive loss, which extends the conventional InfoNCE loss [29] to a dense paradigm. With the above components, we perform contrastive learning densely using a fully convolutional network (FCN) [26], similar to target dense prediction tasks.
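To make the two components above concrete, the following is a minimal PyTorch sketch of a dense projection head and a dense InfoNCE-style loss. The class and function names, the layer sizes, and the argmax cosine-similarity rule for matching each query location to its positive key are illustrative assumptions of this sketch, not the paper's reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseProjectionHead(nn.Module):
    # 1x1-conv head that keeps the spatial grid instead of global pooling.
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_dim, hidden_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, out_dim, kernel_size=1),
        )

    def forward(self, x):                     # x: (B, C, H, W) backbone features
        z = self.net(x)                       # (B, D, H, W) dense feature vectors
        return F.normalize(z, dim=1)          # unit norm at each spatial location

def dense_info_nce(q, k, backbone_q, backbone_k, queue, tau=0.2):
    # q, k: dense projections (B, D, H, W) of two views of the same images.
    # backbone_q, backbone_k: backbone features (B, C, H, W), used only to
    # match each query location to its positive key in the other view.
    # queue: (K, D) L2-normalized negative keys, e.g. from past batches.
    B, D, H, W = q.shape
    S = H * W
    q = q.flatten(2).transpose(1, 2)                        # (B, S, D)
    k = k.flatten(2).transpose(1, 2)                        # (B, S, D)
    bq = F.normalize(backbone_q.flatten(2), dim=1).transpose(1, 2)  # (B, S, C)
    bk = F.normalize(backbone_k.flatten(2), dim=1).transpose(1, 2)  # (B, S, C)
    match = (bq @ bk.transpose(1, 2)).argmax(dim=2)         # (B, S) positive index
    k_pos = torch.gather(k, 1, match.unsqueeze(-1).expand(-1, -1, D))  # (B, S, D)
    l_pos = (q * k_pos).sum(dim=-1, keepdim=True)           # (B, S, 1)
    l_neg = q @ queue.t()                                   # (B, S, K)
    logits = torch.cat([l_pos, l_neg], dim=2) / tau         # positive at index 0
    labels = torch.zeros(B, S, dtype=torch.long, device=q.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

In a MoCo-style setup, k and backbone_k would come from a momentum encoder, and queue would hold dense feature vectors collected from previous batches; those details are omitted here.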

Our main contributions are thus summarized as follows.
- We propose a new contrastive learning paradigm, i.e., dense contrastive learning, which performs dense pairwise contrastive learning at the level of pixels (or local features).
- With the proposed dense contrastive learning, we design a simple and effective self-supervised learning method tailored for dense prediction tasks, termed DenseCL, which fills the gap between self-supervised pre-training and dense prediction tasks.
- DenseCL significantly outperforms the state-of-the-art MoCo-v2 [3] when transferring the pre-trained model to downstream dense prediction tasks, including object detection, instance segmentation and semantic segmentation, and far surpasses the supervised ImageNet pre-training.

Related Work

Self-supervised learning. Generally speaking, the success of self-supervised learning [41,17,42,47,16,14] can be attributed to two important aspects, namely contrastive learning and pretext tasks.

The objective functions used to train visual representations in many methods are either reconstruction-based loss functions [7,30,12] or contrastive losses that measure the co-occurrence of multiple views [38]. Contrastive learning holds the key to most state-of-the-art methods [41,17,2,42], in which the positive pair is usually formed with two augmented views of the same image (or other visual patterns), while negative pairs are formed with different images. A wide range of pretext tasks have been explored to learn a good representation. Examples include colorization [46], context autoencoders [7], inpainting [30], spatial jigsaw puzzles [28] and discriminating orientation [11]. These methods achieved only limited success in computer vision. The breakthrough approach is SimCLR [2], which follows an instance discrimination pretext task, similar to [41], where the features of each instance are pulled away from those of all other instances in the training set.
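As a minimal illustration of this instance-discrimination objective, the sketch below computes a simplified, SimCLR-style contrastive loss in which the two augmented views of each image form the positive pair and the other images in the batch serve as negatives. The function name, the temperature value, and the batch-only negatives are simplifying assumptions of this sketch.

import torch
import torch.nn.functional as F

def global_contrastive_loss(z1, z2, tau=0.1):
    # z1, z2: (B, D) global embeddings of two augmented views of the same B images.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau               # (B, B); diagonal entries are positives
    labels = torch.arange(z1.size(0), device=z1.device)
    # Cross-entropy pulls each image's two views together and pushes them
    # away from the embeddings of all other images in the batch.
    return F.cross_entropy(logits, labels)

# Usage sketch: aug is a random augmentation, encoder maps images to (B, D).
# loss = global_contrastive_loss(encoder(aug(x)), encoder(aug(x)))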

Invariances are encoded from low-level image transformations such as cropping, scaling, and color jittering. Contrastive learning and pretext tasks are often combined to form a representation learning framework. DenseCL belongs to the self-supervised pre-training paradigm, and we naturally make the framework friendly to dense prediction tasks such as semantic segmentation and object detection.

Pre-training for dense prediction. Pre-training has enabled surprising results on many dense prediction tasks, including object detection [34,32] and semantic segmentation [26]. These models are usually fine-tuned from an ImageNet pre-trained model, which is designed for image-level recognition tasks. Some previous studies have shown the gap between ImageNet pre-training and dense prediction tasks in the context of network architecture [24,22,37,36].

YOLO9000 [33] proposes to jointly train the object detector on both classification and detection data. He et al. [18] demonstrate that even if we pre-train on a much larger classification dataset (e.g., Instagram [27], which is 3000x larger than ImageNet), the transfer improvements on object detection are relatively small. Recent works [23,48] show that pre-trained models utilizing object detection data and annotations (e.g., MS COCO [25]) can achieve on-par performance on object detection and semantic segmentation compared with the ImageNet pre-trained model. While supervised pre-training for dense prediction tasks had been explored before DenseCL, there are few works on designing an unsupervised paradigm for dense prediction tasks. Concurrent and independent works [31,1] also find that contrastive learning at the level of local features matters.

