Transcription of Deep Contextualized Word Representations
Proceedings of NAACL-HLT 2018, pages 2227-2237, New Orleans, Louisiana, June 1-6, 2018. Association for Computational Linguistics

Deep contextualized word representations

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer
Allen Institute for Artificial Intelligence
Paul G. Allen School of Computer Science & Engineering, University of Washington

Abstract
We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.
We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

Introduction
Pre-trained word representations (Mikolov et al., 2013; Pennington et al., 2014) are a key component in many neural language understanding models. However, learning high quality representations can be challenging. They should ideally model both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy).
In this paper, we introduce a new type of deep contextualized word representation that directly addresses both challenges, can be easily integrated into existing models, and significantly improves the state of the art in every considered case across a range of challenging language understanding problems.

Our representations differ from traditional word type embeddings in that each token is assigned a representation that is a function of the entire input sentence. We use vectors derived from a bidirectional LSTM that is trained with a coupled language model (LM) objective on a large text corpus. For this reason, we call them ELMo (Embeddings from Language Models) representations. Unlike previous approaches for learning contextualized word vectors (Peters et al., 2017; McCann et al., 2017), ELMo representations are deep, in the sense that they are a function of all of the internal layers of the biLM.
More specifically, we learn a linear combination of the vectors stacked above each input word for each end task, which markedly improves performance over just using the top LSTM layer.

Combining the internal states in this manner allows for very rich word representations. Using intrinsic evaluations, we show that the higher-level LSTM states capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower-level states model aspects of syntax (e.g., they can be used to do part-of-speech tagging). Simultaneously exposing all of these signals is highly beneficial, allowing the learned models to select the types of semi-supervision that are most useful for each end task.

Extensive experiments demonstrate that ELMo representations work extremely well in practice. We first show that they can be easily added to existing models for six diverse and challenging language understanding problems, including textual entailment, question answering and sentiment analysis.
The addition of ELMo representations alone significantly improves the state of the art in every case, including up to 20% relative error reductions. For tasks where direct comparisons are possible, ELMo outperforms CoVe (McCann et al., 2017), which computes contextualized representations using a neural machine translation encoder. Finally, an analysis of both ELMo and CoVe reveals that deep representations outperform those derived from just the top layer of an LSTM. Our trained models and code are publicly available, and we expect that ELMo will provide similar gains for many other NLP problems.

Related work
Due to their ability to capture syntactic and semantic information of words from large scale unlabeled text, pretrained word vectors (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014) are a standard component of most state-of-the-art NLP architectures, including for question answering (Liu et al., 2017), textual entailment (Chen et al., 2017) and semantic role labeling (He et al., 2017). However, these approaches for learning word vectors only allow a single context-independent representation for each word.

Previously proposed methods overcome some of the shortcomings of traditional word vectors by either enriching them with subword information (e.g., Wieting et al., 2016; Bojanowski et al., 2017) or learning separate vectors for each word sense (e.g., Neelakantan et al., 2014). Our approach also benefits from subword units through the use of character convolutions, and we seamlessly incorporate multi-sense information into downstream tasks without explicitly training to predict predefined sense classes.

Other recent work has also focused on learning context-dependent representations. context2vec (Melamud et al., 2016) uses a bidirectional Long Short Term Memory (LSTM; Hochreiter and Schmidhuber, 1997) to encode the context around a pivot word. Other approaches for learning contextual embeddings include the pivot word itself in the representation and are computed with the encoder of either a supervised neural machine translation (MT) system (CoVe; McCann et al., 2017) or an unsupervised language model (Peters et al., 2017). Both of these approaches benefit from large datasets, although the MT approach is limited by the size of parallel corpora. In this paper, we take full advantage of access to plentiful monolingual data, and train our biLM on a corpus with approximately 30 million sentences (Chelba et al., 2014). We also generalize these approaches to deep contextual representations, which we show work well across a broad range of diverse NLP tasks.

Previous work has also shown that different layers of deep biRNNs encode different types of information.
For example, introducing multi-task syntactic supervision (e.g., part-of-speech tags) at the lower levels of a deep LSTM can improve overall performance of higher level tasks such as dependency parsing (Hashimoto et al., 2017) or CCG super tagging (Søgaard and Goldberg, 2016). In an RNN-based encoder-decoder machine translation system, Belinkov et al. (2017) showed that the representations learned at the first layer in a 2-layer LSTM encoder are better at predicting POS tags than the second layer. Finally, the top layer of an LSTM for encoding word context (Melamud et al., 2016) has been shown to learn representations of word sense. We show that similar signals are also induced by the modified language model objective of our ELMo representations, and it can be very beneficial to learn models for downstream tasks that mix these different types of semi-supervision.

Dai and Le (2015) and Ramachandran et al. (2017) pretrain encoder-decoder pairs using language models and sequence autoencoders and then fine tune with task specific supervision. In contrast, after pretraining the biLM with unlabeled data, we fix the weights and add additional task-specific model capacity, allowing us to leverage large, rich and universal biLM representations for cases where downstream training data size dictates a smaller supervised model.

ELMo: Embeddings from Language Models
Unlike most widely used word embeddings (Pennington et al., 2014), ELMo word representations are functions of the entire input sentence, as described in this section. They are computed on top of two-layer biLMs with character convolutions, as a linear function of the internal network states. This setup allows us to do semi-supervised learning, where the biLM is pretrained at a large scale and easily incorporated into a wide range of existing neural NLP architectures.
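To make "a linear function of the internal network states" concrete, the following is a minimal sketch of collapsing the per-layer biLM vectors for one token into a single vector. It is illustrative only: the function names (`softmax`, `mix_layers`), the softmax normalization of the weights, and the scalar `scale` are assumptions of this sketch, since the text above only says that a linear combination is learned per end task. The layer states are treated as fixed inputs, mirroring the frozen biLM weights.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_states: Sequence[Vector],
               raw_weights: Sequence[float],
               scale: float = 1.0) -> Vector:
    """Collapse the per-layer vectors for ONE token into a single vector.

    layer_states[j] is the (frozen) biLM state of layer j for this token.
    raw_weights and scale stand in for the task-specific parameters; the
    softmax and the scalar scale are assumptions of this sketch.
    """
    if len(layer_states) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    weights = softmax(raw_weights)
    dim = len(layer_states[0])
    mixed = [0.0] * dim
    for weight, state in zip(weights, layer_states):
        if len(state) != dim:
            raise ValueError("all layer states must share one dimension")
        for i, value in enumerate(state):
            mixed[i] += weight * value
    return [scale * value for value in mixed]


# Toy example: three stacked vectors of dimension 2 for a single token.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal raw weights give a plain average of the three layers.
uniform = mix_layers(states, raw_weights=[0.0, 0.0, 0.0])

# A large weight on the last layer approximates "top layer only".
top_heavy = mix_layers(states, raw_weights=[0.0, 0.0, 50.0])
```

In a real model the raw weights would be trained jointly with the downstream task while the layer states stay fixed; here they are set by hand purely to show the two extremes.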
Bidirectional language models
Given a sequence of $N$ tokens, $(t_1, t_2, \ldots, t_N)$, a forward language model computes the probability of the sequence by modeling the probability of token $t_k$ given the history $(t_1, \ldots, t_{k-1})$:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1}).$$

Recent state-of-the-art neural language models (Józefowicz et al., 2016; Melis et al., 2017; Merity et al., 2017) compute a context-independent token representation $\mathbf{x}_k^{LM}$ (via token embeddings or a CNN over characters) then pass it through $L$ layers of forward LSTMs. At each position $k$, each LSTM layer outputs a context-dependent representation $\overrightarrow{\mathbf{h}}_{k,j}^{LM}$ where $j = 1, \ldots, L$. The top layer LSTM output, $\overrightarrow{\mathbf{h}}_{k,L}^{LM}$, is used to predict the next token $t_{k+1}$ with a Softmax layer.

A backward LM is similar to a forward LM, except it runs over the sequence in reverse, predicting the previous token given the future context: p(t1,t2.)