arXiv:2012.15671v4 [cs.CL] 16 Aug 2021

Vocabulary Learning via Optimal Transport for Neural Machine Translation

Jingjing Xu1, Hao Zhou1, Chun Gan1,2, Zaixiang Zheng1,3, Lei Li1
1 ByteDance AI Lab
2 Math Department, University of Wisconsin-Madison
3 Nanjing
This work was done during an internship at ByteDance

Abstract

The choice of token vocabulary affects the performance of machine translation. This paper aims to figure out what is a good vocabulary and whether one can find the optimal vocabulary without trial training. To answer these questions, we first provide an alternative understanding of the role of vocabulary from the perspective of information theory. Motivated by this, we formulate the quest of vocabularization (finding the best token dictionary with a proper size) as an optimal transport (OT) problem. We propose VOLT, a simple and efficient solution without trial training. Empirical results show that VOLT outperforms widely-used vocabularies in diverse scenarios, including WMT-14 English-German and TED multilingual translation. For example, VOLT achieves almost 70% vocabulary size reduction and a BLEU gain on English-German translation. Also, compared to BPE-search, VOLT reduces the search time from 384 GPU hours to 30 GPU hours on English-German translation. Codes are available.

Introduction

Due to the discreteness of text, vocabulary construction (vocabularization for short) is a prerequisite for neural machine translation (NMT) and many other natural language processing (NLP) tasks using neural networks (Mikolov et al., 2013; Vaswani et al., 2017; Gehrmann et al., 2018; Zhang et al., 2018; Devlin et al., 2019). Currently, sub-word approaches like Byte-Pair Encoding (BPE) are widely used in the community (Ott et al., 2018; Ding et al., 2019; Liu et al., 2020), and achieve quite promising results in practice (Sennrich et al., 2016; Costa-jussà and Fonollosa, 2016; Lee et al., 2017; Kudo and Richardson, 2018; Al-Rfou et al., 2019; Wang et al., 2020). The key idea of these approaches is selecting the most frequent sub-words (or word pieces with higher probabilities) as the vocabulary. In information theory, these frequency-based approaches are simple forms of data compression to reduce entropy (Gage, 1994), which makes the resulting corpus easy to learn and predict (Martin and England, 2011; Bentz and Alikaniotis, 2016).

However, the effects of vocabulary size are not sufficiently taken into account, since current approaches only consider frequency (or entropy) as the main criterion. Many previous studies (Sennrich and Zhang, 2019; Ding et al., 2019; Provilkov et al., 2020; Salesky et al., 2020) show that vocabulary size also affects downstream performance, especially on low-resource tasks. Due to the lack of an appropriate inductive bias about size, trial training (namely traversing all possible sizes) is usually required to search for the optimal size, which incurs high computation costs. For convenience, most existing studies only adopt the widely-used settings in implementation.

For example, 30K-40K is the most popular size setting in all 42 papers of the Conference on Machine Translation (WMT) through 2017 and 2018 (Ding et al., 2019).

In this paper, we propose to explore automatic vocabularization by simultaneously considering entropy and vocabulary size without expensive trial training. Designing such a vocabularization approach is non-trivial for two main reasons. First, it is challenging to find an appropriate objective function to optimize them at the same time. Roughly speaking, the corpus entropy decreases with the increase of vocabulary size, which benefits model learning (Martin and England, 2011). On the other side, too many tokens cause token sparsity, which hurts model learning (Allison et al., 2006). Second, supposing that an appropriate measurement is given, it is still challenging to solve such a discrete optimization problem due to the exponential search space.

Figure 1: An illustration of marginal utility. We sample BPE-generated vocabularies with different sizes from Eo-En translation and draw their entropy and BLEU lines. Star represents the vocabulary with the maximum marginal utility. Marginal utility evaluates the increase of benefit (entropy decrease) from an increase of cost (size).

To address the above problems, we propose a VOcabulary Learning approach via optimal Transport, VOLT for short.

It can give an appropriate vocabulary in polynomial time by considering corpus entropy and vocabulary size. Specifically, given the above insight of contradiction between entropy and size, we first borrow the concept of Marginal Utility in economics (Samuelson, 1937) and propose to use Marginal Utility of Vocabularization (MUV) as the measurement. The insight is quite simple: in economics, marginal utility is used to balance the benefit and the cost, and we use MUV to balance the entropy (benefit) and vocabulary size (cost). Higher MUV is expected for Pareto optimality. Formally, MUV is defined as the negative derivative of entropy with respect to vocabulary size.
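The definition of MUV as a negative derivative can be made concrete with a finite-difference approximation over a few candidate vocabulary sizes. The sketch below is only illustrative: it assumes plain Shannon entropy over token frequencies (the exact entropy definition is not given in this section), and the `candidates` measurements and the helper names `token_entropy` and `marginal_utility` are hypothetical.

```python
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy (in nats) of the empirical token distribution.

    Assumption: plain token-frequency entropy, used here only for illustration.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def marginal_utility(entropy_small, entropy_large, size_small, size_large):
    """Negative finite-difference slope of entropy with respect to vocabulary size."""
    return -(entropy_large - entropy_small) / (size_large - size_small)


# Hypothetical (vocabulary size, corpus entropy) pairs, sorted by size.
candidates = [(1000, 4.0), (2000, 3.5), (3000, 3.3), (4000, 3.25)]

# One MUV value per pair of neighbouring sizes, attributed to the larger size.
muv_by_size = [
    (s2, marginal_utility(h1, h2, s1, s2))
    for (s1, h1), (s2, h2) in zip(candidates, candidates[1:])
]

# The candidate with the largest entropy drop per added token.
best_size, best_muv = max(muv_by_size, key=lambda item: item[1])
```

With these toy numbers the entropy drops fastest between 1000 and 2000 tokens, so the 2000-token candidate has the highest marginal utility; larger vocabularies keep lowering entropy, but at a diminishing rate per added token.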

Figure 1 gives an example of marginal utility. Preliminary results verify that MUV correlates with downstream performance on two-thirds of tasks (see Figure 2).

Then our goal turns to maximizing MUV in tractable time complexity. We reformulate our discrete optimization objective into an optimal transport problem (Cuturi, 2013) that can be solved in polynomial time by linear programming. Intuitively, the vocabularization process can be regarded as finding the optimal transport matrix from the character distribution to the vocabulary token distribution. Finally, our proposed VOLT will yield a vocabulary from the optimal transport matrix.

We evaluate our approach on multiple machine translation tasks, including WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation.

Empirical results show that VOLT beats widely-used vocabularies in diverse scenarios. Furthermore, VOLT is a lightweight solution and does not require expensive computation resources. On English-German translation, VOLT only takes 30 GPU hours to find vocabularies, while the traditional BPE-Search solution takes 384 GPU hours.

Related Work

Initially, most neural models were built upon word-level vocabularies (Costa-jussà and Fonollosa, 2016; Vaswani et al., 2017; Zhao et al., 2019). While these models achieve promising results, word-level vocabularies commonly fail to handle rare words under limited vocabulary sizes. Researchers have recently proposed several advanced vocabularization approaches, like byte-level approaches (Wang et al., 2020), character-level approaches (Costa-jussà and Fonollosa, 2016; Lee et al., 2017; Al-Rfou et al., 2019), and sub-word approaches (Sennrich et al., 2016; Kudo and Richardson, 2018). Byte-Pair Encoding (BPE) (Sennrich et al., 2016) is proposed to get subword-level vocabularies. The general idea is to merge pairs of frequent character sequences to create sub-word units. Sub-word vocabularies can be regarded as a trade-off between character-level vocabularies and word-level vocabularies. Compared to word-level vocabularies, they can decrease the sparsity of tokens and increase the shared features between similar words, which probably have similar semantic meanings, like happy and happier.
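The merge idea behind BPE can be illustrated with a minimal sketch. It is a simplified toy version written for this passage, not the reference implementation of Sennrich et al. (2016): end-of-word markers are omitted, ties are broken by first occurrence, and the function names and the two-word corpus are hypothetical.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Hypothetical toy corpus: word -> frequency.
merges, segmented = learn_bpe({"happy": 5, "happier": 3}, num_merges=3)
```

After three merges on this toy corpus, both words are segmented with the shared sub-word unit `happ`, which is the kind of feature sharing between similar words described above.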
