
arXiv:1810.04805v2 [cs.CL] 24 May 2019

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Google AI

Abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful.




It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score ( point absolute improvement), MultiNLI accuracy to ( absolute improvement), SQuAD question answering Test F1 to ( point absolute improvement) and SQuAD Test F1 to ( point absolute improvement).

1 Introduction

Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models are required to produce fine-grained output at the token level (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016).

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches.
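To make the distinction between the two strategies concrete, the following minimal sketch applies one gradient step under each regime. It assumes that the feature-based setting keeps the pre-trained weights frozen and trains only the task-specific weights, whereas fine-tuning updates every parameter. The parameter names (`encoder.w`, `task_head.w`), the gradients and the learning rate are hypothetical toy values, not taken from any of the cited systems.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Return new parameters after one SGD step on the names in `trainable`.

    Parameters whose names are not in `trainable` are copied unchanged,
    which models frozen (non-updated) weights.
    """
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }


# Hypothetical toy model: one pre-trained weight and one task-specific weight.
params = {"encoder.w": 1.0, "task_head.w": 0.5}
grads = {"encoder.w": 0.2, "task_head.w": -0.4}

# Feature-based (assumed frozen encoder): only the task head is updated.
feature_based = sgd_step(params, grads, trainable={"task_head.w"})

# Fine-tuning: all pre-trained parameters are updated along with the head.
fine_tuned = sgd_step(params, grads, trainable={"encoder.w", "task_head.w"})

print(feature_based)
print(fine_tuned)
```

In the first call the pre-trained weight is left untouched, while in the second it moves together with the task-specific weight.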

The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidirectionality constraint by using a masked language model (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953).
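The difference between a left-to-right architecture and a bidirectional one can be illustrated through the attention mask alone. The sketch below is an illustration under simple assumptions, not the implementation of any cited model: the function names are hypothetical, and the left-to-right mask is assumed to let a token attend to itself as well as to earlier tokens.

```python
def causal_mask(n):
    """Left-to-right mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def visible_context(tokens, mask, i):
    """Return the tokens that position i is allowed to attend to."""
    return [tok for j, tok in enumerate(tokens) if mask[i][j]]


tokens = ["the", "man", "went", "to", "the", "store"]
n = len(tokens)

# Under the left-to-right mask, "went" sees only itself and what precedes it.
print(visible_context(tokens, causal_mask(n), 2))
# ['the', 'man', 'went']

# Under the bidirectional mask, "went" sees the whole sentence.
print(visible_context(tokens, bidirectional_mask(n), 2))
# ['the', 'man', 'went', 'to', 'the', 'store']
```

For a token-level task such as question answering, the right-hand context "to the store" is unavailable to the representation of "went" under the first mask, which is the restriction the text describes.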

The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also use a next sentence prediction task that jointly pre-trains text-pair representations. The contributions of our paper are as follows:

• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.

• We show that pre-trained representations reduce the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.

• BERT advances the state of the art for eleven NLP tasks. The code and pre-trained models are available.

2 Related Work

There is a long history of pre-training general language representations, and we briefly review the most widely-used approaches in this section.

2.1 Unsupervised Feature-based Approaches

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010). To pre-train word embedding vectors, left-to-right language modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to discriminate correct from incorrect words in left and right context (Mikolov et al., 2013).

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sentence words given a representation of the previous sentence (Kiros et al., 2015), or denoising auto-encoder derived objectives (Hill et al., 2016).

ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding research along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including question answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003).
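The concatenation of left-to-right and right-to-left representations described for ELMo can be sketched as follows. The two "language models" here are hypothetical stand-ins (running sums over scalar token embeddings), chosen only so that each state depends on one side of the context; they are not ELMo's actual networks, and all function names are illustrative.

```python
from itertools import accumulate


def left_to_right_states(embeddings):
    """State i summarizes tokens 0..i only (toy stand-in for a forward LM)."""
    return list(accumulate(embeddings))


def right_to_left_states(embeddings):
    """State i summarizes tokens i..n-1 only (toy stand-in for a backward LM)."""
    return list(accumulate(reversed(embeddings)))[::-1]


def concatenate_states(embeddings):
    """Pair the two independently computed states for every token."""
    forward = left_to_right_states(embeddings)
    backward = right_to_left_states(embeddings)
    return list(zip(forward, backward))


embeddings = [1.0, 2.0, 4.0]  # hypothetical scalar embeddings, one per token
features = concatenate_states(embeddings)
print(features)
# [(1.0, 7.0), (3.0, 6.0), (7.0, 4.0)]
```

Each half of a pair is computed without access to the other side of the context; the two views meet only at the final concatenation, which is the sense in which such a combination is shallow.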

Melamud et al. (2016) proposed learning contextual representations through a task to predict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation models.

2.2 Unsupervised Fine-tuning Approaches

As with the feature-based approaches, the first works in this direction only pre-trained word embedding parameters from unlabeled text (Collobert and Weston, 2008).

More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch.

At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018a). Left-to-right language model-

[Figure 1 panels: Pre-training (NSP, Mask LM; input: unlabeled sentence A and B pair, masked) and Fine-Tuning (MNLI, NER, SQuAD; input: question answer pair)]

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different downstream tasks. During fine-tuning, all parameters are fine-tuned.
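The masked sentence-pair input shown in the pre-training panel of Figure 1 can be sketched as follows. This is a simplified illustration under stated assumptions: the 15% masking rate, the whitespace tokenization, the trailing [SEP] token and all function names are choices made here rather than details given in this section, and prediction targets are kept as token strings rather than vocabulary ids.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}


def pack_pair(sentence_a, sentence_b):
    """Join two token lists into one sequence with [CLS] and [SEP] markers."""
    return ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace randomly chosen non-special tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    holding the prediction targets. mask_prob is an assumed default.
    """
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if tok in SPECIAL:
            continue  # never mask the structural markers
        if rng.random() < mask_prob:
            targets[i] = tok
            masked[i] = MASK
    return masked, targets


tokens = pack_pair("the man went to the store".split(),
                   "he bought a gallon of milk".split())
masked, targets = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
print(masked)
print(targets)
```

A model trained on such inputs has to recover each entry of `targets` from the unmasked tokens on both sides of the masked position.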

