Transcription of Learning Transferable Visual Models From Natural Language ...
Learning Transferable Visual Models From Natural Language Supervision

Alec Radford* 1, Jong Wook Kim* 1, Chris Hallacy 1, Aditya Ramesh 1, Gabriel Goh 1, Sandhini Agarwal 1, Girish Sastry 1, Amanda Askell 1, Pamela Mishkin 1, Jack Clark 1, Gretchen Krueger 1, Ilya Sutskever 1

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training.
For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the training examples it was trained on. We release our code and pre-trained model weights.

*Equal contribution. 1 OpenAI, San Francisco, CA 94110, to: <{alec,

Introduction and Motivating Work

Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2019). Task-agnostic objectives such as autoregressive and masked language modeling have scaled across many orders of magnitude in compute, model capacity, and data, steadily improving capabilities.
The development of text-to-text as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets, removing the need for specialized output heads or dataset specific customization. Flagship systems like GPT-3 (Brown et al., 2020) are now competitive across many tasks with bespoke models while requiring little to no dataset specific training data.

These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets.
However, in other fields such as computer vision it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet (Deng et al., 2009). Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision? Prior work is encouraging.

Over 20 years ago Mori et al. (1999) explored improving content based image retrieval by training a model to predict the nouns and adjectives in text documents paired with images. Quattoni et al. (2007) demonstrated it was possible to learn more data efficient image representations via manifold learning in the weight space of classifiers trained to predict words in captions associated with images.
Srivastava & Salakhutdinov (2012) explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features.

Joulin et al. (2016) modernized this line of work and demonstrated that CNNs trained to predict words in image captions learn useful image representations. They converted the title, description, and hashtag metadata of images in the YFCC100M dataset (Thomee et al., 2016) into a bag-of-words multi-label classification task and showed that pre-training AlexNet (Krizhevsky et al., 2012) to predict these labels learned representations which performed similarly to ImageNet-based pre-training on transfer tasks. Li et al. (2017) then extended this approach to predicting phrase n-grams in addition to individual words and demonstrated the ability of their system to zero-shot transfer to other standard image classification datasets by scoring target classes based on their dictionary of learned visual n-grams and predicting the one with the highest score.

26 Feb 2021

[Figure 1 panels: (1) Contrastive pre-training; (2) Create dataset classifier from label text; (3) Use for zero-shot prediction. Example prompt shown in the figure: "A photo of a {object}."]

Figure 1. Summary of our approach. While standard image models jointly train an image feature extractor and a linear classifier to predict some label, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes.
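The approach summarized in the figure caption above (an image encoder and a text encoder trained to predict the correct pairings within a batch, then reused as a zero-shot classifier) can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the encoders are assumed to have already produced embeddings, the random arrays stand in for those embeddings, and the function names and the temperature value of 0.07 are hypothetical choices made here, not details taken from the text above or from the released code.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a true pair.
    The temperature value is an illustrative assumption.
    """
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    diag = np.arange(len(logits))
    # For each image, the correct text is the one with the same index.
    loss_images = -log_softmax(logits, axis=1)[diag, diag].mean()
    # For each text, the correct image is the one with the same index.
    loss_texts = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_images + loss_texts) / 2


def zero_shot_predict(image_emb, class_text_emb):
    """Pick, for each image, the class whose text embedding is most similar."""
    scores = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return scores.argmax(axis=1)


# Toy stand-ins for encoder outputs (hypothetical data, not real features).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 8))
shuffled_images = aligned_images[[1, 2, 3, 0]]

loss_aligned = contrastive_loss(aligned_images, text_emb)
loss_shuffled = contrastive_loss(shuffled_images, text_emb)
predictions = zero_shot_predict(aligned_images, text_emb)
```

In this toy setup the loss for correctly paired embeddings comes out lower than the loss for the shuffled pairing, which is the signal the pairing objective optimizes. At test time, `class_text_emb` would hold one embedding per class name or description (for example, text built from a prompt such as "A photo of a {object}."), so the same similarity computation acts as a linear classifier over the image embedding.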
Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.

While exciting as proofs of concept, using natural language supervision for image representation learning is still rare. This is likely because demonstrated performance on common benchmarks is much lower than alternative approaches. For example, the accuracy Li et al. (2017) reach on ImageNet in a zero-shot setting is well below that of the current state of the art (Xie et al., 2020). It is even below the 50% accuracy of classic computer vision approaches (Deng et al., 2012). Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance. Mahajan et al. (2018) showed that predicting ImageNet-related hashtags on Instagram images is an effective pre-training task. When fine-tuned to ImageNet these pre-trained models increased accuracy by over 5% and improved the overall state of the art at the time. Kolesnikov et al.