
An Empirical Study of Training Self-Supervised Vision Transformers


This reveals that self-supervised ViT can learn strong representations without the positional inductive bias, but it also implies that the positional information has not been sufficiently exploited. In summary, we believe that the evidence, challenges, and open questions in this study are worth knowing, if self-supervised Transformers will close the gap in pre-training between vision and language.

Transcription of An Empirical Study of Training Self-Supervised Vision Transformers

Xinlei Chen  Saining Xie  Kaiming He
Facebook AI Research (FAIR). Code:
*: equal contribution.

Abstract

This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failure, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.

Table 1. State-of-the-art self-supervised Transformers in ImageNet classification, evaluated by linear probing (top panel) or end-to-end fine-tuning (bottom panel). Both iGPT [9] and masked patch prediction [16] belong to the masked auto-encoding paradigm. MoCo v3 is a contrastive learning method that compares two (224×224) crops. ViT-B, -L, -H are the Vision Transformers proposed in [16]. ViT-BN is modified with BatchNorm, and /7 denotes a patch size of 7×7. †: pre-trained in JFT-300M.

framework                  model        params    acc. (%)
linear probing:
iGPT [9]                   iGPT-L       1362M
iGPT [9]                   iGPT-XL      6801M
MoCo v3                    ViT-B        86M
MoCo v3                    ViT-L        304M
MoCo v3                    ViT-H        632M
MoCo v3                    ViT-BN-H     632M
MoCo v3                    ViT-BN-L/7   304M
end-to-end fine-tuning:
masked patch pred. [16]†   ViT-B        86M
MoCo v3                    ViT-B        86M
MoCo v3                    ViT-L        304M
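
As a rough illustration of the contrastive side of this comparison, the sketch below shows a symmetrized InfoNCE-style loss over two augmented 224×224 crops of the same images, the kind of objective MoCo v3 builds on. It is a minimal PyTorch sketch: the encoder and momentum-encoder structure, projection/prediction heads, temperature, and function names are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, tau=0.2):
    # Each query q[i] should match key k[i] (the other crop of the same
    # image); all other keys in the batch act as negatives.
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau                            # [N, N] similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)

def contrastive_step(f_q, f_k, x1, x2, tau=0.2):
    # x1, x2: two augmented 224x224 crops of the same minibatch of images.
    # f_q: trainable query encoder; f_k: key encoder (e.g. a momentum copy
    # updated outside this function). The loss is symmetrized over the crops.
    q1, q2 = f_q(x1), f_q(x2)
    with torch.no_grad():
        k1, k2 = f_k(x1), f_k(x2)
    return info_nce(q1, k2, tau) + info_nce(q2, k1, tau)
```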

1. Introduction

Unsupervised pre-training has revolutionized natural language processing (NLP) [37, 15, 38, 4]. In computer vision, the un-/self-supervised pre-training paradigms differ from their NLP counterparts in at least two aspects: (i) the learners in NLP are masked auto-encoders, while in vision the recently popular choices are Siamese networks (e.g., [20, 10, 18, 7]); (ii) the backbone architectures in NLP are self-attentional Transformers [43], while in vision the common choice is convolutional [28], yet non-attentional, deep residual networks (ResNets) [21]. To complete the big picture of self-supervised learning in vision, and towards closing the gap of pre-training methodology between vision and language, it is of scientific merit to investigate these differences.

This work focuses on training Transformers with the leading self-supervised frameworks in vision. This investigation is a straightforward extension given the recent progress on Vision Transformers (ViT) [16]. In contrast to prior works [9, 16] that train self-supervised Transformers with masked auto-encoding, we study the frameworks that are based on Siamese networks, including MoCo [20] and others [10, 18, 7].

Unlike standard convolutional networks, whose training practice has been extensively studied thanks to continuous community effort, ViT models are new and their recipes are yet to be established. In this work, we go back to basics and investigate the fundamental components of training deep neural networks: the batch size, learning rate, and optimizer. We find that under various cases, instability is a major issue that impacts self-supervised ViT training.

Interestingly, we observe that unstable ViT training may not result in catastrophic failure (e.g., divergence); instead, it can cause mild degradation in accuracy (e.g., 1-3%). Such a degree of degradation may not be noticeable unless a more stable counterpart is available for comparison. To the best of our knowledge, this phenomenon is rare in the literature of training convolutional networks¹, and we believe this problem and its hidden degradation are worth noticing.

¹ See also postscript on a related discussion.

To demonstrate the possible harm of instability, we investigate a simple trick that can improve stability in practice. Based on an empirical observation on gradient changes, we freeze the patch projection layer in ViT, i.e., we use fixed random patch projection. We empirically show that this trick alleviates the instability issue in several scenarios and consistently increases accuracy.
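
To make the trick concrete: freezing the patch projection layer means leaving the patch embedding at its random initialization and excluding it from gradient updates. Below is a minimal PyTorch sketch, assuming a timm-style ViT that exposes this layer as `patch_embed.proj`; the attribute path and function name are illustrative, not taken from the paper's released code.

```python
import torch.nn as nn

def freeze_patch_projection(vit: nn.Module) -> None:
    # Keep the patch projection (patch embedding) layer at its random
    # initialization and stop gradients from updating it, i.e. use a
    # fixed random patch projection.
    for p in vit.patch_embed.proj.parameters():
        p.requires_grad = False

# Usage sketch: freeze before building the optimizer, so the frozen
# parameters are simply excluded from optimization.
#   freeze_patch_projection(vit)
#   optimizer = torch.optim.AdamW([p for p in vit.parameters() if p.requires_grad])
```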

We benchmark and ablate self-supervised ViT in a variety of cases. We provide ViT results in several self-supervised frameworks. We conduct ablations on architecture designs and discuss the implications. Furthermore, we explore scaling up the ViT models, including the non-trivial ViT-Large and ViT-Huge [16]; the latter has 40× more computation than ResNet-50 [21]. Based on these experimental results, we discuss both the currently positive evidence as well as the challenges and open questions.

We report that self-supervised Transformers can achieve strong results using a contrastive learning framework, compared against masked auto-encoding (Table 1). This behavior of Transformers differs from the existing trend in NLP. Moreover, as a promising signal, our bigger self-supervised ViT can achieve better accuracy, unlike the ImageNet-supervised ViT in [16], whose accuracy degrades if getting bigger. For instance, for the very big ViT-Large, our self-supervised pre-training can outperform its supervised pre-training counterpart for transfer learning in certain cases. This presents a proof-of-concept scenario where self-supervised pre-training is needed.

In addition, we report that our self-supervised ViT models have competitive results vs. the big convolutional ResNets in prior art [11, 18]. On one hand, this comparison shows the potential of ViT, especially considering that ...

... methods suggest that it is of central importance to learn invariant features by matching positive samples.

Transformers. Transformers [43] were originally introduced for machine translation and later became a dominant backbone in NLP [37, 15, 38, 4]. The long-range, self-attentional behavior makes Transformers an effective tool given the non-local, relational nature of languages.

There have been continuous efforts on generalizing Transformers to computer vision [44, 5, 39, 49, 6, 16]. The recent work on Vision Transformers (ViT) [16] greatly pushes this frontier. ViT is purely Transformer-based, rather than interlaced with non-degenerated (i.e., non-1×1) convolutions. This largely closes the architectural gap between NLP and vision. ViT achieves compelling accuracy in supervised learning, especially with large-scale data and high-capacity models. Given these properties, we believe ViT is a must-study baseline for self-supervised learning in computer vision.

Self-supervised Transformers for vision. In pioneering works [9, 16], training self-supervised Transformers for vision problems in general follows the masked auto-encoding paradigm in NLP [37, 15] (Table 1). iGPT [9] masks and reconstructs pixels, and the self-supervised variant of ViT in [16] masks and reconstructs patches.
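
For contrast with the contrastive sketch above, the masked auto-encoding paradigm described here can be caricatured as: hide a random subset of input patches, predict them, and score the prediction against the original patch content. The following PyTorch sketch is a generic illustration of that idea, not the exact objective of iGPT [9] or the masked patch prediction variant in [16]; the masking scheme, regression target, and loss are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def masked_patch_loss(model, patches, mask_ratio=0.5):
    # patches: [N, L, D] flattened image patches (L patches of dim D per image).
    # model:   maps a corrupted patch sequence [N, L, D] to predictions [N, L, D].
    N, L, D = patches.shape
    mask = torch.rand(N, L, device=patches.device) < mask_ratio  # True = hidden
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)     # zero out hidden patches
    pred = model(corrupted)
    # Only the hidden positions contribute to the reconstruction loss.
    return F.mse_loss(pred[mask], patches[mask])
```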

