Generative Pretraining from Pixels - OpenAI

Mark Chen 1, Alec Radford 1, Rewon Child 1, Jeff Wu 1, Heewoo Jun 1, Prafulla Dhariwal 1, David Luan 1, Ilya Sutskever 1
1 OpenAI, San Francisco, CA, USA.

Abstract

Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.

1. Introduction

Unsupervised pre-training played a central role in the resurgence of deep learning.

Starting in the mid 2000s, approaches such as the Deep Belief Network (Hinton et al., 2006) and Denoising Autoencoder (Vincent et al., 2008) were commonly used in neural networks for computer vision (Lee et al., 2009) and speech recognition (Mohamed et al., 2009). It was believed that a model which learned the data distribution P(X) would also learn beneficial features for the subsequent supervised modeling of P(Y|X) (Lasserre et al., 2006; Erhan et al., 2010). However, advancements such as piecewise linear activation functions (Nair & Hinton, 2010), improved initializations (Glorot & Bengio, 2010), and normalization strategies (Ioffe & Szegedy, 2015; Ba et al., 2016) removed the need for pre-training in order to achieve strong results. Other research cast doubt on the benefits of deep unsupervised representations and reported strong results using a single layer of learned features (Coates et al., 2011), or even random features (Huang et al., 2014; May et al., 2017). The approach fell out of favor as the state of the art increasingly relied on directly encoding prior structure into the model and utilizing abundant supervised data to directly learn representations (Krizhevsky et al., 2012; Graves & Jaitly, 2014). Retrospective study of unsupervised pre-training demonstrated that it could even hurt performance in modern settings (Paine et al., 2014). Instead, unsupervised pre-training flourished in a different domain. After initial strong results for word vectors (Mikolov et al., 2013), it has pushed the state of the art forward in Natural Language Processing on most tasks (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018).

Interestingly, the training objective of a dominant approach like BERT, the prediction of corrupted inputs, closely resembles that of the Denoising Autoencoder, which was originally developed for images. As a higher dimensional, noisier, and more redundant modality than text, images are believed to be difficult for generative modeling. Here, self-supervised approaches designed to encourage the modeling of more global structure (Doersch et al., 2015) have shown significant promise. A combination of new training objectives (Oord et al., 2018), more recent architectures (Gomez et al., 2017), and increased model capacity (Kolesnikov et al., 2019) has allowed these methods to achieve state of the art performance in low data settings (Hénaff et al., 2019) and sometimes even outperform supervised representations in transfer learning settings (He et al., 2019; Misra & van der Maaten, 2019; Chen et al., 2020).

Given that it has been a decade since the original wave of generative pre-training methods for images, and considering their substantial impact in NLP, this class of methods is due for a modern re-examination and comparison with the recent progress of self-supervised methods. We re-evaluate generative pre-training on images and demonstrate that when using a flexible architecture (Vaswani et al., 2017), a tractable and efficient likelihood based training objective (Larochelle & Murray, 2011; Oord et al., 2016), and significant compute resources (2048 TPU cores), generative pre-training is competitive with other self-supervised approaches and learns representations that significantly improve the state of the art in low-resolution unsupervised representation learning.

[Figure 1. An overview of our approach. First, we pre-process raw images by resizing to a low resolution and reshaping into a 1D sequence. We then choose one of two pre-training objectives, auto-regressive next pixel prediction or masked pixel prediction. Finally, we evaluate the representations learned by these objectives with linear probes or fine-tuning.]

This is especially promising as our architecture uses a dense connectivity pattern which does not encode the 2D spatial structure of images yet is able to match and even outperform approaches which do. We report a set of experiments characterizing the performance of our approach on many datasets and in several different evaluation settings (low data, linear evaluation, full fine-tuning). We also conduct several experiments designed to better understand the achieved performance of these models. We investigate how representations are computed inside our model via the performance of linear probes as a function of model depth, as well as studying how scaling the resolution and parameter count of the approach affects representation quality.

2. Approach

Our approach consists of a pre-training stage followed by a fine-tuning stage.
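The first step summarized in Figure 1, resizing raw images to a low resolution and reshaping them into a 1D sequence in raster order, can be sketched as follows. This is only an illustrative sketch: the 32x32 target resolution, the use of PIL and NumPy, and the helper name image_to_sequence are assumptions, and the paper's color-palette quantization of pixel values is not shown.

    import numpy as np
    from PIL import Image

    def image_to_sequence(path, resolution=32):
        # Resize to a low resolution and flatten in raster (row-major) order.
        # resolution=32 and the use of PIL/NumPy are illustrative assumptions;
        # the color quantization used in the paper is omitted here.
        img = Image.open(path).convert("RGB").resize((resolution, resolution))
        pixels = np.asarray(img)          # shape: (resolution, resolution, 3)
        return pixels.reshape(-1, 3)      # shape: (resolution * resolution, 3)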

In pre-training, we explore both the auto-regressive and BERT objectives. We also apply the sequence Transformer architecture to predict pixels instead of language tokens.

One way to measure representation quality is to fine-tune for image classification. Fine-tuning adds a small classification head to the model, used to optimize a classification objective, and adapts all weights. Pre-training can be viewed as a favorable initialization or as a regularizer when used in combination with early stopping (Erhan et al., 2010).

Another approach for measuring representation quality uses the pre-trained model as a feature extractor. In particular, given labeled examples (X, Y), the model is applied to X to produce features f_X. Then, a linear classifier is trained on (f_X, Y). Linear probing captures the intuition that good features should linearly separate the classes of transfer tasks. Furthermore, linear probes help disentangle feature quality from model architecture: in fine-tuning, one model may outperform another because its architecture is more suited for the downstream task rather than because of better features.
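As a rough sketch of this linear probing protocol, the snippet below trains a linear classifier on frozen features f_X. The extract_features function is a hypothetical stand-in for the pre-trained model used as a feature extractor, and scikit-learn's LogisticRegression stands in for the linear classifier.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def linear_probe(extract_features, X_train, y_train, X_test, y_test):
        # extract_features is a hypothetical stand-in for the frozen
        # pre-trained model applied to each input x to produce f_x.
        f_train = np.stack([extract_features(x) for x in X_train])
        f_test = np.stack([extract_features(x) for x in X_test])
        clf = LogisticRegression(max_iter=1000).fit(f_train, y_train)
        return clf.score(f_test, y_test)  # linear separability of the classes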

We begin this section by defining the auto-regressive and BERT objectives in the context of images. Next, we outline implementation details for our transformer decoder. Finally, we describe how the transformer is used for fine-tuning and how features are extracted for linear probes.

2.1. Pre-training

Given an unlabeled dataset X consisting of high dimensional data x = (x_1, ..., x_n), we can pick a permutation \pi of the set [1, n] and model the density p(x) auto-regressively as follows:

p(x) = \prod_{i=1}^{n} p(x_{\pi_i} \mid x_{\pi_1}, \ldots, x_{\pi_{i-1}}, \theta)

When working with images, we pick the identity permutation \pi_i = i for 1 \le i \le n, also known as raster order. We train our model by minimizing the negative log-likelihood of the data:

L_{AR} = \mathbb{E}_{x \sim X}[-\log p(x)]

We also consider the BERT objective, which samples a sub-sequence M \subset [1, n] such that each index i independently has probability 0.15 of appearing in M. We call M the BERT mask, and we train our model by minimizing the negative log-likelihood of the masked elements x_M conditioned on the unmasked ones x_{[1,n] \setminus M}:

L_{BERT} = \mathbb{E}_{x \sim X}\, \mathbb{E}_{M} \sum_{i \in M} [-\log p(x_i \mid x_{[1,n] \setminus M})]

In pre-training, we pick one of L_{AR} or L_{BERT} and minimize the loss over our pre-training dataset.
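The two objectives can be written compactly in code. The sketch below is a non-authoritative PyTorch rendering of L_AR and L_BERT, assuming a model that returns per-position logits over the pixel vocabulary; for the BERT case the model is assumed to take the mask M and hide the masked pixels internally (the paper does this by zeroing their content embeddings, as described in the Architecture subsection).

    import torch
    import torch.nn.functional as F

    def ar_loss(model, x):
        # L_AR: negative log-likelihood of each pixel given all previous
        # pixels in raster order. model(inputs) is assumed to return logits
        # of shape (batch, seq_len, vocab) for next-pixel prediction.
        logits = model(x[:, :-1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               x[:, 1:].reshape(-1))

    def bert_loss(model, x, mask_prob=0.15):
        # L_BERT: negative log-likelihood of the masked pixels x_M given the
        # unmasked ones; each index enters M independently with probability 0.15.
        # model(x, M) is assumed to hide the pixels indexed by M internally.
        M = torch.rand(x.shape, device=x.device) < mask_prob
        logits = model(x, M)              # (batch, seq_len, vocab)
        return F.cross_entropy(logits[M], x[M])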

2.2. Architecture

The transformer decoder takes an input sequence x_1, ..., x_n of discrete tokens and produces a d-dimensional embedding for each position. The decoder is realized as a stack of L blocks, the l-th of which produces an intermediate embedding h_1^l, ..., h_n^l also of dimension d. We use the GPT-2 (Radford et al., 2019) formulation of the transformer decoder block, which acts on an input tensor h^l as follows:

n^l = \mathrm{layer\_norm}(h^l)
a^l = h^l + \mathrm{multihead\_attention}(n^l)
h^{l+1} = a^l + \mathrm{mlp}(\mathrm{layer\_norm}(a^l))

In particular, layer norms precede both the attention and mlp operations, and all operations lie strictly on residual paths. We find that such a formulation allows us to scale the transformer with ease.

The only mixing across sequence elements occurs in the attention operation, and to ensure proper conditioning when training the AR objective, we apply the standard upper triangular mask to the n \times n matrix of attention logits.
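A minimal sketch of this pre-norm block in PyTorch is shown below. The hidden size, head count, MLP expansion factor, and the use of nn.MultiheadAttention are illustrative assumptions rather than the paper's exact implementation; the boolean attn_mask reproduces the upper triangular masking of the attention logits used for the AR objective.

    import torch
    import torch.nn as nn

    class DecoderBlock(nn.Module):
        # Pre-norm GPT-2-style block:
        #   n^l     = layer_norm(h^l)
        #   a^l     = h^l + multihead_attention(n^l)
        #   h^{l+1} = a^l + mlp(layer_norm(a^l))
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                     nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, h, causal=True):
            n = self.ln1(h)
            mask = None
            if causal:  # upper triangular mask for the AR objective
                L = h.size(1)
                mask = torch.triu(torch.ones(L, L, dtype=torch.bool,
                                             device=h.device), diagonal=1)
            a = h + self.attn(n, n, n, attn_mask=mask, need_weights=False)[0]
            return a + self.mlp(self.ln2(a))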

When using the BERT objective, no attention logit masking is required: after applying content embeddings to the input sequence, we zero out the positions in M.

Additionally, since we learn independent position embeddings for each sequence element, our BERT model has no positional inductive biases (i.e. it is permutation invariant). Put another way, any spatial relationships between positions must be learned by the model at train time. This is not entirely true for the AR model, as choosing the raster order also fixes a prespecified ordering of the conditionals. Nevertheless, permutation invariance is a property in strong contrast to convolutional neural networks, which incorporate the inductive bias that features should arise from spatially proximate elements.

Following the final transformer layer, we apply a layer norm n^L = \mathrm{layer\_norm}(h^L), and learn a projection from n^L to logits parameterizing the conditional distributions at each sequence element.
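The input and output ends of the model described above can be sketched as follows: content and position embeddings are both learned, content embeddings at positions in M are zeroed out for the BERT objective, and after the final block a layer norm and a learned projection produce per-position logits. All sizes here are illustrative assumptions.

    import torch
    import torch.nn as nn

    class PixelEmbedding(nn.Module):
        # Learned content and per-position embeddings. For the BERT objective,
        # content embeddings at positions in M are zeroed so the model must
        # predict them from the unmasked context.
        def __init__(self, vocab_size=512, seq_len=1024, d_model=512):
            super().__init__()
            self.content = nn.Embedding(vocab_size, d_model)
            self.position = nn.Embedding(seq_len, d_model)  # independent per position

        def forward(self, x, bert_mask=None):
            h = self.content(x)                              # (batch, seq_len, d)
            if bert_mask is not None:
                h = h.masked_fill(bert_mask.unsqueeze(-1), 0.0)
            pos = torch.arange(x.size(1), device=x.device)
            return h + self.position(pos)

    class OutputHead(nn.Module):
        # n^L = layer_norm(h^L), followed by a learned projection to logits
        # parameterizing the conditional distribution at each position.
        def __init__(self, d_model=512, vocab_size=512):
            super().__init__()
            self.ln = nn.LayerNorm(d_model)
            self.proj = nn.Linear(d_model, vocab_size)

        def forward(self, hL):
            return self.proj(self.ln(hL))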

