
Cross-lingual Language Model Pretraining


Guillaume Lample* (Facebook AI Research, Sorbonne Universités) glample@fb.com
Alexis Conneau* (Facebook AI Research, Université Le Mans)
* Equal contribution.


Transcription of Cross-lingual Language Model Pretraining

Abstract

Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised method that only relies on monolingual data, and one supervised method that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised machine translation, and supervised machine translation. On XNLI, our approach pushes the state of the art by a significant absolute gain in accuracy. On unsupervised machine translation, we improve the previous state of the art on WMT'16 German-English by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art on WMT'16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.

1 Introduction

Generative pretraining of sentence encoders (Radford et al., 2018; Howard and Ruder, 2018; Devlin et al., 2018) has led to strong improvements on numerous natural language understanding benchmarks (Wang et al., 2018). In this context, a Transformer (Vaswani et al., 2017) language model is learned on a large unsupervised text corpus, and then fine-tuned on natural language understanding (NLU) tasks such as classification (Socher et al., 2013) or natural language inference (Bowman et al., 2015; Williams et al., 2017). Although there has been a surge of interest in learning general-purpose sentence representations, research in that area has been essentially monolingual, and largely focused around English benchmarks (Conneau and Kiela, 2018; Wang et al., 2018). Recent developments in learning and evaluating cross-lingual sentence representations in many languages (Conneau et al., 2018b) aim at mitigating the English-centric bias and suggest that it is possible to build universal cross-lingual encoders that can encode any sentence into a shared embedding space.

In this work, we demonstrate the effectiveness of cross-lingual language model pretraining on multiple cross-lingual understanding (XLU) benchmarks. Precisely, we make the following contributions:

1. We introduce a new unsupervised method for learning cross-lingual representations using cross-lingual language modeling and investigate two monolingual pretraining objectives.

2. We introduce a new supervised learning objective that improves cross-lingual pretraining when parallel data is available.

3. We significantly outperform the previous state of the art on cross-lingual classification, unsupervised machine translation and supervised machine translation.

4. We show that cross-lingual language models can provide significant improvements on the perplexity of low-resource languages.

5. We will make our code and pretrained models publicly available.
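The following is a minimal sketch of the pretrain/fine-tune paradigm described at the start of the introduction: a Transformer language model is first trained on unlabeled text, then reused as a sentence encoder with a small classification head for an NLU task. This is not the paper's architecture; all class names, sizes, and the choice of classifying from the first position are illustrative assumptions.

```python
# Illustrative sketch of "pretrain a Transformer LM, then fine-tune on NLU".
# Not the paper's model; names and dimensions are made up for the example.
import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)   # used only during pretraining

    def forward(self, tokens):                          # tokens: (batch, seq_len)
        hidden = self.encoder(self.embed(tokens))       # contextual representations
        return self.lm_head(hidden)                     # token prediction logits

class SentenceClassifier(nn.Module):
    """Fine-tuning wrapper: reuse the pretrained encoder, add a task head."""
    def __init__(self, pretrained_lm, n_classes=3):
        super().__init__()
        self.embed = pretrained_lm.embed
        self.encoder = pretrained_lm.encoder
        self.head = nn.Linear(self.embed.embedding_dim, n_classes)

    def forward(self, tokens):
        hidden = self.encoder(self.embed(tokens))
        return self.head(hidden[:, 0])                  # classify from the first position

# Usage: pretrain the LM on unlabeled token streams, then fine-tune end to end.
lm = TinyTransformerLM()
lm_logits = lm(torch.randint(0, 32000, (2, 16)))        # pretraining forward pass
clf = SentenceClassifier(lm)
class_scores = clf(torch.randint(0, 32000, (2, 16)))    # fine-tuning forward pass
```

In practice the language model would be trained on a large corpus before the classifier head is attached; the sketch only shows how the pretrained encoder is reused.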

2 Related Work

Our work builds on top of Radford et al. (2018); Howard and Ruder (2018); Devlin et al. (2018), who investigate language modeling for pretraining Transformer encoders. Their approaches lead to drastic improvements on several classification tasks from the GLUE benchmark (Wang et al., 2018). Ramachandran et al. (2016) show that language modeling pretraining can also provide significant improvements on machine translation tasks, even for high-resource language pairs such as English-German, where there exists a significant amount of parallel data. Concurrent to our work, results on cross-lingual classification using a cross-lingual language modeling approach were showcased on the BERT repository. We compare those results to our approach in Section 5.

Aligning distributions of text representations has a long tradition, starting from word embedding alignment and the work of Mikolov et al. (2013a) that leverages small dictionaries to align word representations from different languages. A series of follow-up studies show that cross-lingual representations can be used to improve the quality of monolingual representations (Faruqui and Dyer, 2014), that orthogonal transformations are sufficient to align these word distributions (Xing et al., 2015), and that all these techniques can be applied to an arbitrary number of languages (Ammar et al., 2016). Following this line of work, the need for cross-lingual supervision was further reduced (Smith et al., 2017) until it was completely removed (Conneau et al., 2018a). In this work, we take these ideas one step further by aligning distributions of sentences and also reducing the need for parallel data.

There is a large body of work on aligning sentence representations from multiple languages. By using parallel data, Hermann and Blunsom (2014); Conneau et al. (2018b); Eriguchi et al. (2018) investigated zero-shot cross-lingual sentence classification. But the most successful recent approach of cross-lingual encoders is probably the one of Johnson et al. (2017) for multilingual machine translation. They show that a single sequence-to-sequence model can be used to perform machine translation for many language pairs, by using a single shared LSTM encoder and decoder. Their multilingual model outperformed the state of the art on low-resource language pairs, and enabled zero-shot translation. Following this approach, Artetxe and Schwenk (2018) show that the resulting encoder can be used to produce cross-lingual sentence embeddings. Their approach leverages more than 200 million parallel sentences. They obtained a new state of the art on the XNLI cross-lingual classification benchmark (Conneau et al., 2018b) by learning a classifier on top of the fixed sentence representations. While these methods require a significant amount of parallel data, recent work in unsupervised machine translation shows that sentence representations can be aligned in a completely unsupervised way (Lample et al., 2018a; Artetxe et al., 2018). For instance, Lample et al. (2018b) obtained strong BLEU scores on WMT'16 German-English without using parallel sentences. Similar to this work, we show that we can align distributions of sentences in a completely unsupervised way, and that our cross-lingual models can be used for a broad set of natural language understanding tasks, including machine translation.

The most similar work to ours is probably the one of Wada and Iwata (2018), where the authors train an LSTM (Hochreiter and Schmidhuber, 1997) language model with sentences from different languages. They share the LSTM parameters, but use different lookup tables to represent the words in each language. They focus on aligning word representations and show that their approach works well on word translation tasks.
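The shared-parameter idea in the paragraph above can be made concrete with a minimal sketch. This is not Wada and Iwata's implementation; the module names, dimensions, and the choice of per-language output projections are assumptions made for illustration. The recurrent weights are shared across languages, while each language keeps its own lookup table.

```python
# Sketch (illustrative, not Wada and Iwata's code): one LSTM whose parameters
# are shared across languages, with a separate embedding table (and output
# projection) per language.
import torch
import torch.nn as nn

class SharedLSTMMultilingualLM(nn.Module):
    def __init__(self, vocab_sizes, d_emb=128, d_hidden=256):
        super().__init__()
        # One lookup table and output layer per language (illustrative choice).
        self.embeds = nn.ModuleDict({lang: nn.Embedding(v, d_emb)
                                     for lang, v in vocab_sizes.items()})
        self.heads = nn.ModuleDict({lang: nn.Linear(d_hidden, v)
                                    for lang, v in vocab_sizes.items()})
        # The LSTM itself is shared: the same weights process every language.
        self.lstm = nn.LSTM(d_emb, d_hidden, batch_first=True)

    def forward(self, tokens, lang):
        hidden, _ = self.lstm(self.embeds[lang](tokens))
        return self.heads[lang](hidden)          # next-word logits for that language

model = SharedLSTMMultilingualLM({"en": 20000, "fr": 22000})
en_logits = model(torch.randint(0, 20000, (4, 12)), lang="en")
fr_logits = model(torch.randint(0, 22000, (4, 12)), lang="fr")
```

Training such a model jointly on monolingual text from several languages forces the per-language tables to feed a common recurrent model, which is the mechanism the authors rely on to align word representations.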

3 Cross-lingual language models

In this section, we present the three language modeling objectives we consider throughout this work. Two of them only require monolingual data (unsupervised), while the third one requires parallel sentences (supervised). We consider N languages. Unless stated otherwise, we suppose that we have N monolingual corpora {C_i}_{i=1...N}, and we denote by n_i the number of sentences in C_i.

Shared sub-word vocabulary

In all our experiments, we process all languages with the same shared vocabulary created through Byte Pair Encoding (BPE) (Sennrich et al., 2015). As shown in Lample et al. (2018a), this greatly improves the alignment of embedding spaces across languages that share either the same alphabet or anchor tokens such as digits (Smith et al., 2017) or proper nouns. We learn the BPE splits on the concatenation of sentences sampled randomly from the monolingual corpora. Sentences are sampled according to a multinomial distribution with probabilities {q_i}_{i=1...N}, where:

q_i = p_i^α / Σ_{j=1...N} p_j^α,   with   p_i = n_i / Σ_{k=1...N} n_k.
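The sampling step above can be sketched in a few lines. This is an illustration rather than the authors' released code: the exponent alpha is an assumption (0.5 here, which up-weights low-resource languages), the corpus and function names are invented for the example, and the actual BPE training call is left out.

```python
# Sketch of the multinomial sentence sampling described above.
# alpha = 0.5 is an illustrative assumption; only the sampling is shown,
# not the BPE learning itself.
import numpy as np

def sampling_probs(n_sentences, alpha=0.5):
    """q_i = p_i**alpha / sum_j p_j**alpha, with p_i = n_i / sum_k n_k."""
    p = np.array(n_sentences, dtype=np.float64)
    p /= p.sum()
    q = p ** alpha
    return q / q.sum()

def sample_sentences(corpora, n_samples, alpha=0.5, seed=0):
    """Draw sentences across monolingual corpora according to q_i."""
    rng = np.random.default_rng(seed)
    langs = list(corpora)
    q = sampling_probs([len(corpora[lang]) for lang in langs], alpha)
    picks = rng.choice(len(langs), size=n_samples, p=q)    # language index per draw
    return [rng.choice(corpora[langs[i]]) for i in picks]  # then a sentence from it

# Toy example: a "high-resource" and a "low-resource" corpus.
corpora = {"en": [f"en sentence {i}" for i in range(900)],
           "sw": [f"sw sentence {i}" for i in range(100)]}
print(sampling_probs([900, 100]))       # ~[0.75, 0.25] instead of raw [0.9, 0.1]
sample = sample_sentences(corpora, 10)  # text on which the BPE splits would be learned
```

With alpha < 1 the sampling distribution is flatter than the raw corpus proportions, so low-resource languages contribute more sentences to the BPE training data than their share of the raw data would suggest.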

