
Learning Transferable Visual Models From Natural Language Supervision



Alec Radford*, Jong Wook Kim*, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever (OpenAI, San Francisco, CA). *Equal contribution.

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.

1. Introduction and Motivating Work

Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2019). Task-agnostic objectives such as autoregressive and masked language modeling have scaled across many orders of magnitude in compute, model capacity, and data, steadily improving capabilities. The development of text-to-text as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets, removing the need for specialized output heads or dataset specific customization.

Flagship systems like GPT-3 (Brown et al., 2020) are now competitive across many tasks with bespoke models while requiring little to no dataset specific training data. These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. However, in other fields such as computer vision it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet (Deng et al., 2009). Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision?

Prior work is encouraging. Over 20 years ago Mori et al. (1999) explored improving content based image retrieval by training a model to predict the nouns and adjectives in text documents paired with images. Quattoni et al. (2007) demonstrated it was possible to learn more data efficient image representations via manifold learning in the weight space of classifiers trained to predict words in captions associated with images. Srivastava & Salakhutdinov (2012) explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features.

Joulin et al. (2016) modernized this line of work and demonstrated that CNNs trained to predict words in image captions learn useful image representations. They converted the title, description, and hashtag metadata of images in the YFCC100M dataset (Thomee et al., 2016) into a bag-of-words multi-label classification task and showed that pre-training AlexNet (Krizhevsky et al., 2012) to predict these labels learned representations which performed similarly to ImageNet-based pre-training on transfer tasks. Li et al. (2017) then extended this approach to predicting phrase n-grams in addition to individual words and demonstrated the ability of their system to zero-shot transfer to other image classification datasets by scoring target classes based on their dictionary of learned visual n-grams and then predicting the one with the highest score.
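As a rough illustration of the bag-of-words multi-label setup described above (the vocabulary, tokenization, and metadata fields below are simplified placeholders rather than the preprocessing used by Joulin et al.), image metadata can be reduced to multi-hot target vectors as follows:

```python
# Minimal sketch of turning image metadata (title, description, hashtags) into
# bag-of-words multi-label targets, in the spirit of Joulin et al. (2016).
# The vocabulary and tokenization are simplified placeholders, not the
# preprocessing used in the original work.
import re
import numpy as np

vocab = ["dog", "beach", "sunset", "pizza", "aussie", "pup"]  # hypothetical vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}

def metadata_to_multihot(title, description, hashtags):
    """Return a |V|-dimensional multi-hot vector marking which vocab words appear."""
    text = " ".join([title, description] + list(hashtags)).lower()
    tokens = re.findall(r"[a-z]+", text)
    target = np.zeros(len(vocab), dtype=np.float32)
    for tok in tokens:
        if tok in word_to_idx:
            target[word_to_idx[tok]] = 1.0
    return target

# Example: an image whose metadata mentions a dog on the beach.
y = metadata_to_multihot("Aussie pup at the beach", "our dog enjoying the sunset", ["#dog", "#beach"])
# A CNN would then be trained against vectors like `y` with a multi-label loss,
# e.g. a per-class sigmoid followed by binary cross-entropy.
```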

Figure 1. Summary of our approach. While standard image models jointly train an image feature extractor and a linear classifier to predict some label, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes. (Diagram panels: (1) contrastive pre-training; (2) create dataset classifier from label text; (3) use for zero-shot prediction.)
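The following is a minimal PyTorch sketch of the contrastive objective summarized in Figure 1; the encoder modules, projection width, and temperature handling are stand-in assumptions rather than the exact configuration used in the paper:

```python
# Sketch of a CLIP-style contrastive training step: embed a batch of images and
# texts, compare all pairings by cosine similarity, and apply a symmetric
# cross-entropy so that matching (image, text) pairs score highest.
# `image_encoder`, `text_encoder`, and `logit_scale` are placeholders.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_encoder, text_encoder, images, texts, logit_scale):
    img_emb = F.normalize(image_encoder(images), dim=-1)  # [N, d], unit norm
    txt_emb = F.normalize(text_encoder(texts), dim=-1)    # [N, d], unit norm

    # Pairwise cosine similarities, scaled by a learnable temperature parameter.
    logits_per_image = logit_scale.exp() * img_emb @ txt_emb.t()  # [N, N]
    logits_per_text = logits_per_image.t()

    # The i-th image matches the i-th text, so the targets are the diagonal.
    targets = torch.arange(images.shape[0], device=images.device)
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2
```

In this sketch the scaling factor is stored in log space and exponentiated so the similarity scale stays positive as it is learned; treat this as one reasonable choice rather than a prescribed detail of the method.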

Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.

While exciting as proofs of concept, using natural language supervision for image representation learning is still rare. This is likely because demonstrated performance on common benchmarks is much lower than alternative approaches. For example, Li et al. (2017) reach only 11.5% accuracy on ImageNet in a zero-shot setting. This is well below the 88.4% accuracy of the current state of the art (Xie et al., 2020). It is even below the 50% accuracy of classic computer vision approaches (Deng et al., 2012). Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance. Mahajan et al. (2018) showed that predicting ImageNet related hashtags on Instagram images is an effective pre-training task. When fine-tuned to ImageNet these pre-trained models increased accuracy by over 5% and improved the overall state of the art at the time. Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) have also demonstrated large gains on a broader set of transfer benchmarks by pre-training models to predict the classes of the noisily labeled JFT-300M dataset.

This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised gold-labels and learning from practically unlimited amounts of raw text. However, it is not without compromises. Both works carefully design, and in the process limit, their supervision to 1000 and 18291 classes respectively. Natural language is able to express, and therefore supervise, a much wider set of visual concepts through its generality.
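To make the preceding point concrete, a classifier over an arbitrary set of concepts can be synthesized at test time simply by embedding their names. The snippet below is an illustrative sketch of such zero-shot prediction; the prompt template, tokenizer, and encoders are assumed placeholders, not the exact setup evaluated in the paper.

```python
# Sketch of zero-shot classification with a trained CLIP-style model: build a
# "classifier" by embedding one prompt per class name, then pick the class whose
# text embedding is most similar to the image embedding.
# `image_encoder`, `text_encoder`, and `tokenize` are assumed to come from a
# trained model; the prompt template is an illustrative choice.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    prompts = [f"a photo of a {name}." for name in class_names]
    txt_emb = F.normalize(text_encoder(tokenize(prompts)), dim=-1)    # [C, d]
    img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)  # [1, d]
    similarity = img_emb @ txt_emb.t()                                # [1, C]
    return class_names[similarity.argmax(dim=-1).item()]

# Hypothetical usage with an unbatched image tensor:
# zero_shot_classify(img, ["plane", "car", "dog", "bird"], image_enc, text_enc, tokenize)
```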

