Language Models as Knowledge Bases?


Fabio Petroni (1), Tim Rocktäschel (1,2), Patrick Lewis (1,2), Anton Bakhtin (1), Yuxiang Wu (1,2), Alexander H. Miller (1), Sebastian Riedel (1,2)
(1) Facebook AI Research, (2) University College London
{fabiopetroni, rockt, plewis, yolo, yuxiangwu, ahm, ...}

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2463–2473, Hong Kong, China, November 3–7, 2019. (c) 2019 Association for Computational Linguistics

Abstract

Recent progress in pretraining language models on large textual corpora led to a surge of improvements for downstream NLP tasks. Whilst learning linguistic knowledge, these models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as "fill-in-the-blank" cloze statements. Language models have many advantages over structured knowledge bases: they require no schema engineering, allow practitioners to query about an open class of relations, are easy to extend to more data, and require no human supervision to train. We present an in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models. We find that (i) without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge, (ii) BERT also does remarkably well on open-domain question answering against a supervised baseline, and (iii) certain types of factual knowledge are learned much more readily than others by standard language model pretraining approaches. The surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain QA systems. The code to reproduce our analysis is available at https:

[Figure 1: Querying knowledge bases (KB) and language models (LM) for factual knowledge. A symbolic KB answers the query (Dante, born-in, X) with "Florence" through a memory access over the stored triple (Dante, born-in, Florence); a neural LM (ELMo/BERT) answers the cloze statement "Dante was born in [Mask]." with "Florence".]

1 Introduction

Recently, pretrained high-capacity language models such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018a) have become increasingly important in NLP. They are optimised to either predict the next word in a sequence or some masked word anywhere in a given sequence (e.g. "Dante was born in [Mask] in the year 1265."). The parameters of these models appear to store vast amounts of linguistic knowledge (Peters et al., 2018b; Goldberg, 2019; Tenney et al., 2019) useful for downstream tasks. This knowledge is usually accessed either by conditioning on latent context representations produced by the original model or by using the original model weights to initialize a task-specific model which is then further fine-tuned. This type of knowledge transfer is crucial for current state-of-the-art results on a wide range of tasks.

In contrast, knowledge bases are effective solutions for accessing annotated gold-standard relational data by enabling queries such as (Dante, born-in, X). However, in practice we often need to extract relational data from text or other modalities to populate these knowledge bases. This requires complex NLP pipelines involving entity extraction, coreference resolution, entity linking and relation extraction (Surdeanu and Ji, 2014), components that often need supervised data and fixed schemas. Moreover, errors can easily propagate and accumulate throughout the pipeline. Instead, we could attempt to query neural language models for relational data by asking them to fill in masked tokens in sequences like "Dante was born in [Mask]", as illustrated in Figure 1. In this setting, language models come with various attractive properties: they require no schema engineering, do not need human annotations, and they support an open set of queries.

Given the above qualities of language models as potential representations of relational knowledge, we are interested in the relational knowledge already present in pretrained off-the-shelf language models such as ELMo and BERT. How much relational knowledge do they store? How does this differ for different types of knowledge such as facts about entities, common sense, and general question answering? How does their performance without fine-tuning compare to symbolic knowledge bases automatically extracted from text?

Beyond gathering a better general understanding of these models, we believe that answers to these questions can help us design better unsupervised knowledge representations that could transfer factual and commonsense knowledge reliably to downstream tasks such as commonsense (visual) question answering (Zellers et al., 2018; Talmor et al., 2019) or reinforcement learning (Branavan et al., 2011; Chevalier-Boisvert et al., 2018; Bahdanau et al., 2019; Luketina et al., 2019).

For the purpose of answering the above questions we introduce the LAMA (LAnguage Model Analysis) probe, consisting of a set of knowledge sources, each comprised of a set of facts. We define that a pretrained language model knows a fact (subject, relation, object) such as (Dante, born-in, Florence) if it can successfully predict masked objects in cloze sentences such as "Dante was born in ___" expressing that fact. We test for a variety of types of knowledge: relations between entities stored in Wikidata, common sense relations between concepts from ConceptNet, and knowledge necessary to answer natural language questions in SQuAD. In the latter case we manually map a subset of SQuAD questions to cloze sentences.

Our investigation reveals that (i) the largest BERT model from Devlin et al. (2018b) (BERT-large) captures (accurate) relational knowledge comparable to that of a knowledge base extracted with an off-the-shelf relation extractor and an oracle-based entity linker from a corpus known to express the relevant knowledge, (ii) factual knowledge can be recovered surprisingly well from pretrained language models, however, for some relations (particularly N-to-M relations) performance is very poor, (iii) BERT-large consistently outperforms other language models in recovering factual and commonsense knowledge while at the same time being more robust to the phrasing of a query, and (iv) BERT-large achieves remarkable results for open-domain QA, reaching [...] compared to [...] of a knowledge base constructed using a task-specific supervised relation extraction system.

2 Background

In this section we provide background on language models. Statistics for the models that we include in our investigation are summarized in Table 1.

Unidirectional Language Models

Given an input sequence of tokens $w = [w_1, w_2, \ldots, w_N]$, unidirectional language models commonly assign a probability $p(w)$ to the sequence by factorizing it as follows

$$p(w) = \prod_t p(w_t \mid w_{t-1}, \ldots, w_1). \tag{1}$$

A common way to estimate this probability is using neural language models (Mikolov and Zweig, 2012; Melis et al., 2017; Bengio et al., 2003) with

$$p(w_t \mid w_{t-1}, \ldots, w_1) = \mathrm{softmax}(W h_t + b) \tag{2}$$

where $h_t \in \mathbb{R}^k$ is the output vector of a neural network at position $t$ and $W \in \mathbb{R}^{|V| \times k}$ is a learned parameter matrix that maps $h_t$ to unnormalized scores for every word in the vocabulary $V$. Various neural language models then mainly differ in how they compute $h_t$ given the word history, e.g., by using a multi-layer perceptron (Bengio et al., 2003; Mikolov and Zweig, 2012), convolutional layers (Dauphin et al., 2017), recurrent neural networks (Zaremba et al., 2014; Merity et al., 2016; Melis et al., 2017) or self-attention mechanisms (Radford et al., 2018; Dai et al., 2019; Radford et al., 2019).

fairseq-fconv: Instead of commonly used recurrent neural networks, Dauphin et al. (2017) use multiple layers of gated convolutions. We use the pretrained model in the fairseq library in our study. It has been trained on the WikiText-103 corpus introduced by Merity et al. (2016).
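Equations (1) and (2) can be made concrete with a small sketch. The vocabulary, the matrix W, the bias b and the hidden states below are made-up toy values, not anything taken from the models in the paper; in a real language model each h_t would be computed by the network from the word history.

```python
import math

def softmax(scores):
    """Turn unnormalized scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(W, b, h_t):
    """Equation (2): softmax(W h_t + b), with W of shape |V| x k."""
    scores = [sum(w_ij * h_j for w_ij, h_j in zip(row, h_t)) + b_i
              for row, b_i in zip(W, b)]
    return softmax(scores)

def sequence_log_prob(token_ids, hidden_states, W, b):
    """Equation (1) in log space: sum over t of log p(w_t | w_{t-1}, ..., w_1).

    hidden_states[t] stands in for h_t, the network output that summarizes
    the history before position t (toy numbers in this sketch).
    """
    total = 0.0
    for w_t, h_t in zip(token_ids, hidden_states):
        total += math.log(next_word_distribution(W, b, h_t)[w_t])
    return total

# Toy setup (assumed values): vocabulary of 3 words, hidden size k = 2.
VOCAB = ["dante", "florence", "born"]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.0]
hidden_states = [[1.0, 0.0], [0.2, 0.3], [0.0, 2.0]]
token_ids = [0, 2, 1]  # "dante born florence"

log_p = sequence_log_prob(token_ids, hidden_states, W, b)
print(f"log p(w) = {log_p:.4f}, p(w) = {math.exp(log_p):.6f}")
```

Summing log-probabilities rather than multiplying probabilities is a common choice because the product in equation (1) underflows quickly for long sequences.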

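The contrast drawn in the introduction, between querying a symbolic knowledge base with (Dante, born-in, X) and asking a language model to fill a cloze statement, can be sketched as follows. Everything here is a hypothetical illustration: the template dictionary, the `toy_fill_mask` scorer with its made-up scores, and the top-k criterion in `knows_fact` are assumptions for the sketch and are not taken from the paper's code.

```python
# Symbolic KB: facts stored as (subject, relation) -> object.
KB = {("Dante", "born-in"): "Florence"}

# Hypothetical cloze templates, one per relation.
TEMPLATES = {"born-in": "{subject} was born in [Mask]."}

def kb_query(subject, relation):
    """Answer (subject, relation, X) by direct lookup; None if unknown."""
    return KB.get((subject, relation))

def make_cloze(subject, relation):
    """Express a (subject, relation, X) query as a cloze sentence."""
    return TEMPLATES[relation].format(subject=subject)

def toy_fill_mask(cloze):
    """Stand-in for a pretrained LM: made-up scores for candidate fillers."""
    table = {
        "Dante was born in [Mask].": {"Florence": 0.6, "Rome": 0.3, "Italy": 0.1},
    }
    return table.get(cloze, {})

def knows_fact(fill_mask, subject, relation, obj, k=1):
    """True if obj is among the k highest-scoring fillers for the cloze.

    Using a top-k cutoff is an assumption of this sketch.
    """
    scores = fill_mask(make_cloze(subject, relation))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return obj in ranked[:k]

print(kb_query("Dante", "born-in"))
print(make_cloze("Dante", "born-in"))
print(knows_fact(toy_fill_mask, "Dante", "born-in", "Florence"))
```

Passing the scorer in as a function keeps the probing logic independent of any particular model, so a real masked language model could be substituted for `toy_fill_mask` without changing `knows_fact`.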
Table 1: Language models considered in this study.

Model                                       Base Model    #Parameters   Training Corpus                  Corpus Size
fairseq-fconv (Dauphin et al., 2017)        ConvNet       324M          WikiText-103                     103M Words
Transformer-XL (large) (Dai et al., 2019)   Transformer   257M          WikiText-103                     103M Words
ELMo (original) (Peters et al., 2018a)      BiLSTM        [...]         Google Billion Word              800M Words
ELMo (Peters et al., 2018a)                 BiLSTM        [...]         Wikipedia (en) & WMT 2008-2012   [...] Words
BERT (base) (Devlin et al., 2018a)          Transformer   110M          Wikipedia (en) & BookCorpus      [...] Words
BERT (large) (Devlin et al., 2018a)         Transformer   340M          Wikipedia (en) & BookCorpus      [...] Words

Transformer-XL: Dai et al. (2019) introduce a large-scale language model based on the Transformer (Vaswani et al., 2017). Transformer-XL can take into account a longer history by caching previous outputs and by using relative instead of absolute positional encoding. It achieves a test perplexity of [...] on the WikiText-103 corpus.

Bidirectional Language Models

So far, we have looked at language models that predict the next word given a history of words. However, in many downstream applications we mostly care about having access to contextual representations of words, i.e., word representations that are a function of the entire context of a unit of text such as a sentence or paragraph, and not only conditioned on previous words.

[...] In addition to this pseudo language model objective, they use an auxiliary binary classification objective to predict whether a particular sentence follows the given sequence of words.

3 Related Work

Many studies have investigated pretrained word representations, sentence representations, and language models. Existing work focuses on understanding linguistic and semantic properties of word representations or how well pretrained sentence representations and language models transfer linguistic knowledge to downstream tasks. In contrast, our investigation seeks to answer to what extent pretrained language models store factual

