
Language Models are Unsupervised Multitask Learners


Transcription of Language Models are Unsupervised Multitask Learners

Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei**, Ilya Sutskever** (*, ** equal contribution; OpenAI, San Francisco, California, United States)

Abstract

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset, matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer, and increasing it improves performance in a log-linear fashion across tasks.

Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

1. Introduction

Machine learning systems now excel (in expectation) at tasks they are trained for by using a combination of large datasets, high-capacity models, and supervised learning (Krizhevsky et al., 2012) (Sutskever et al., 2014) (Amodei et al., 2016).

Yet these systems are brittle and sensitive to slight changes in the data distribution (Recht et al., 2018) and task specification (Kirkpatrick et al., 2017). Current systems are better characterized as narrow experts rather than competent generalists. We would like to move towards more general systems which can perform many tasks, eventually without the need to manually create and label a training dataset for each one. The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples.

This has served well to make progress on narrow experts. But the often erratic behavior of captioning models (Lake et al., 2017), reading comprehension systems (Jia & Liang, 2017), and image classifiers (Alcorn et al., 2018) on the diversity and variety of possible inputs highlights some of the shortcomings of this approach. Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems. Progress towards robust systems with current architectures is likely to require training and measuring performance on a wide range of domains and tasks.

Recently, several benchmarks have been proposed, such as GLUE (Wang et al., 2018) and decaNLP (McCann et al., 2018), to begin studying this. Multitask learning (Caruana, 1997) is a promising framework for improving general performance. However, multitask training in NLP is still nascent. Recent work reports modest performance improvements (Yogatama et al., 2019), and the two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively (McCann et al., 2018) (Bowman et al., 2018). From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives.

Current ML systems need hundreds to thousands of examples to induce functions which generalize well. This suggests that multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques. This motivates exploring additional setups for performing multitask learning.

Figure 1. Zero-shot task performance of WebText LMs as a function of model size on many NLP tasks. Reading comprehension results are on CoQA (Reddy et al., 2018), translation on WMT-14 Fr-En (Artetxe et al., 2017), summarization on CNN and Daily Mail (See et al., 2017), and question answering on Natural Questions (Kwiatkowski et al., 2019). Section 3 contains detailed descriptions of each result.

The current best performing systems on language tasks utilize a combination of pre-training and supervised fine-tuning. This approach has a long history with a trend towards more flexible forms of transfer. First, word vectors were learned and used as inputs to task-specific architectures (Mikolov et al., 2013) (Collobert et al., 2011), then the contextual representations of recurrent networks were transferred (Dai & Le, 2015) (Peters et al., 2018), and recent work suggests that task-specific architectures are no longer necessary and transferring many self-attention blocks is sufficient (Radford et al., 2018) (Devlin et al., 2018).

These methods still require supervised training in order to perform a task. When only minimal or no supervised data is available, another line of work has demonstrated the promise of language models to perform specific tasks, such as commonsense reasoning (Schwartz et al., 2017) and sentiment analysis (Radford et al., 2017). In this paper, we connect these two lines of work and continue the trend of more general methods of transfer.

We demonstrate language models can perform down-stream tasks in a zero-shot setting, without any parameter or architecture modification. We demonstrate this approach shows potential by highlighting the ability of language models to perform a wide range of tasks in a zero-shot setting. We achieve promising, competitive, and state of the art results depending on the task.
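The zero-shot claim above amounts to specifying the task entirely in the conditioning text (for example, a document plus a question) and reading the answer off the model's continuation, with no parameter updates. The sketch below is a hedged illustration of that idea rather than the paper's evaluation code: the Hugging Face `gpt2` checkpoint, the document and question strings, the Q:/A: prompt template, and the stopping rule are assumptions chosen for illustration.

```python
# Illustrative zero-shot question answering by prompting an unmodified LM.
# The prompt template and checkpoint are assumptions, not the paper's protocol.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical document and question used purely as an example.
document = "WebText is a dataset of millions of webpages scraped from outbound Reddit links."
question = "What is WebText?"
prompt = f"{document}\nQ: {question}\nA:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,                      # greedy decoding
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
    )

# Keep only the newly generated tokens and stop at the first newline.
answer = tokenizer.decode(output_ids[0, inputs.input_ids.shape[1]:])
print(answer.split("\n")[0].strip())
```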

2. Approach

At the core of our approach is language modeling. Language modeling is usually framed as unsupervised distribution estimation from a set of examples (x_1, x_2, ..., x_n), each composed of variable length sequences of symbols (s_1, s_2, ..., s_n). Since language has a natural sequential ordering, it is common to factorize the joint probabilities over symbols as the product of conditional probabilities (Jelinek & Mercer, 1980) (Bengio et al., 2003):

p(x) = \prod_{i=1}^{n} p(s_i | s_1, ..., s_{i-1})    (1)

This approach allows for tractable sampling from and estimation of p(x), as well as any conditionals of the form p(s_{n-k}, ..., s_n | s_1, ..., s_{n-k-1}). In recent years, there have been significant improvements in the expressiveness of models that can compute these conditional probabilities, such as self-attention architectures like the Transformer (Vaswani et al., 2017).
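As a concrete reading of equation (1), the sketch below scores a sequence under an off-the-shelf autoregressive LM by summing per-token conditional log-probabilities. The Hugging Face `gpt2` checkpoint, the helper name `sequence_log_prob`, and the example sentence are illustrative assumptions, not part of the paper.

```python
# Minimal sketch of equation (1): log p(x) = sum_i log p(s_i | s_1, ..., s_{i-1}),
# using an off-the-shelf GPT-2 checkpoint as a stand-in for a WebText-trained LM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text: str) -> float:
    """Sum the conditional log-probabilities of each token given its prefix."""
    ids = tokenizer(text, return_tensors="pt").input_ids  # shape (1, n)
    with torch.no_grad():
        logits = model(ids).logits                        # shape (1, n, vocab)
    # Logits at position i-1 predict token i, so align by shifting one step.
    # (The first token is treated as given, since GPT-2 uses no BOS token.)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

print(sequence_log_prob("Language models are unsupervised multitask learners."))
```

The same per-token conditionals also support tractable sampling: drawing s_i from p(s_i | s_1, ..., s_{i-1}) one symbol at a time reproduces the factorization in equation (1).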

