arXiv:2012.15671v4 [cs.CL] 16 Aug 2021

Vocabulary Learning via Optimal Transport for Neural Machine Translation

Jingjing Xu1, Hao Zhou1, Chun Gan1,2, Zaixiang Zheng1,3, Lei Li1
1 ByteDance AI Lab
2 Math Department, University of Wisconsin-Madison
3 Nanjing

Abstract

The choice of token vocabulary affects the performance of machine translation. This paper aims to figure out what a good vocabulary is and whether one can find the optimal vocabulary without trial training. To answer these questions, we first provide an alternative understanding of the role of vocabulary from the perspective of information theory. Motivated by this, we formulate the quest of vocabularization (finding the best token dictionary with a proper size) as an optimal transport (OT) problem. We propose VOLT, a simple and efficient solution without trial training. Empirical results show that VOLT outperforms widely-used vocabularies in diverse scenarios, including WMT-14 English-German and TED multilingual translation.

For example, VOLT achieves almost 70% vocabulary size reduction and a BLEU gain on English-German translation. Also, compared to BPE-search, VOLT reduces the search time from 384 GPU hours to 30 GPU hours on English-German translation. Code is available.

Introduction

Due to the discreteness of text, vocabulary construction (vocabularization for short) is a prerequisite for neural machine translation (NMT) and many other natural language processing (NLP) tasks using neural networks (Mikolov et al., 2013; Vaswani et al., 2017; Gehrmann et al., 2018; Zhang et al., 2018; Devlin et al., 2019). Currently, sub-word approaches like Byte-Pair Encoding (BPE) are widely used in the community (Ott et al., 2018; Ding et al., 2019; Liu et al., 2020), and achieve quite promising results in practice (Sennrich et al., 2016; Costa-jussà and Fonollosa, 2016; Lee et al., 2017; Kudo and Richardson, 2018; Al-Rfou et al., 2019; Wang et al., 2020). The key idea of these approaches is selecting the most frequent sub-words (or word pieces with higher probabilities) as the vocabulary. In information theory, these frequency-based approaches are simple forms of data compression to reduce entropy (Gage, 1994), which makes the resulting corpus easy to learn and predict (Martin and England, 2011; Bentz and Alikaniotis, 2016).

However, the effects of vocabulary size are not sufficiently taken into account, since current approaches only consider frequency (or entropy) as the main criterion. Many previous studies (Sennrich and Zhang, 2019; Ding et al., 2019; Provilkov et al., 2020; Salesky et al., 2020) show that vocabulary size also affects downstream performance, especially on low-resource tasks. Due to the lack of an appropriate inductive bias about size, trial training (namely, traversing all possible sizes) is usually required to search for the optimal size, which incurs high computation costs.

(Author footnote: This work is done during the internship at ByteDance.)

For convenience, most existing studies only adopt the widely-used settings in implementation. For example, 30K-40K is the most popular size setting in all 42 papers of the Conference on Machine Translation (WMT) through 2017 and 2018 (Ding et al., 2019).

In this paper, we propose to explore automatic vocabularization by simultaneously considering entropy and vocabulary size without expensive trial training. Designing such a vocabularization approach is non-trivial for two main reasons. First, it is challenging to find an appropriate objective function to optimize them at the same time. Roughly speaking, the corpus entropy decreases with the increase of vocabulary size, which benefits model learning (Martin and England, 2011). On the other side, too many tokens cause token sparsity, which hurts model learning (Allison et al., 2006). Second, supposing that an appropriate measurement is given, it is still challenging to solve such a discrete optimization problem due to the exponential search space.

Figure 1: An illustration of marginal utility. We sample BPE-generated vocabularies with different sizes from Eo-En translation and draw their entropy (see Eq. 2) and BLEU lines. "Star" represents the vocabulary with the maximum marginal utility. Marginal utility (see Eq. 1) evaluates the increase of benefit (entropy decrease) from an increase of cost (size).

To address the above problems, we propose a VOcabulary Learning approach via optimal Transport, VOLT for short. It can give an appropriate vocabulary in polynomial time by considering corpus entropy and vocabulary size. Specifically, given the above insight of contradiction between entropy and size, we first borrow the concept of Marginal Utility in economics (Samuelson, 1937) and propose to use Marginal Utility of Vocabularization (MUV) as the measurement. The insight is quite simple: in economics, marginal utility is used to balance the benefit and the cost, and we use MUV to balance the entropy (benefit) and vocabulary size (cost).

Higher MUV is expected for Pareto optimality. Formally, MUV is defined as the negative derivative of entropy with respect to vocabulary size. Figure 1 gives an example of marginal utility. Preliminary results verify that MUV correlates with downstream performance on two-thirds of tasks (see Figure 2).

Then our goal turns to maximizing MUV in tractable time complexity. We reformulate our discrete optimization objective into an optimal transport problem (Cuturi, 2013) that can be solved in polynomial time by linear programming. Intuitively, the vocabularization process can be regarded as finding the optimal transport matrix from the character distribution to the vocabulary token distribution. Finally, our proposed VOLT will yield a vocabulary from the optimal transport matrix.

We evaluate our approach on multiple machine translation tasks, including WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation.
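To make the optimal transport view above more concrete, the following is a minimal sketch of a generic Sinkhorn iteration for entropy-regularized OT, the technique introduced in the cited Cuturi (2013). It is not VOLT's exact formulation: the function name `sinkhorn`, the toy marginals, the cost matrix, and the `reg` and `n_iters` parameters are all illustrative assumptions, and the character-to-token constraints of the actual method are not modeled here.

```python
import math


def sinkhorn(a, b, cost, reg=0.1, n_iters=500):
    """Approximate an entropy-regularized transport plan between two histograms.

    a, b  : source and target marginals (lists of positive floats, same total mass)
    cost  : cost[i][j] is the cost of moving mass from source i to target j
    reg   : entropic regularization strength (hypothetical default)
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-cost / reg).
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns so the plan matches both marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy usage: two "source" bins and two "target" bins (illustrative numbers only).
plan = sinkhorn([0.5, 0.5], [0.25, 0.75], [[0.0, 1.0], [1.0, 0.0]], reg=0.5)
for row in plan:
    print([round(x, 4) for x in row])
```

Each row of the returned plan sums (approximately) to the corresponding source mass and each column to the corresponding target mass, which is the marginal constraint any transport matrix must satisfy.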

Empirical results show that VOLT beats widely-used vocabularies in diverse scenarios. Furthermore, VOLT is a lightweight solution and does not require expensive computation resources. On English-German translation, VOLT only takes 30 GPU hours to find vocabularies, while the traditional BPE-Search solution takes 384 GPU hours.

Related Work

Initially, most neural models were built upon word-level vocabularies (Costa-jussà and Fonollosa, 2016; Vaswani et al., 2017; Zhao et al., 2019). While achieving promising results, word-level vocabularies commonly fail to handle rare words under limited vocabulary sizes. Researchers have recently proposed several advanced vocabularization approaches, like byte-level approaches (Wang et al., 2020), character-level approaches (Costa-jussà and Fonollosa, 2016; Lee et al., 2017; Al-Rfou et al., 2019), and sub-word approaches (Sennrich et al., 2016; Kudo and Richardson, 2018).
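As an illustration of the sub-word family cited above, here is a minimal sketch of the greedy pair-merging idea behind Byte-Pair Encoding. It is a simplified toy, not the reference implementation of Sennrich et al. (2016): it omits end-of-word markers, the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) are hypothetical, and ties are broken by first occurrence as an assumption.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties are broken by first occurrence (an assumption of this sketch).
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Start from characters and apply up to `num_merges` greedy merges."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Toy usage with made-up frequencies.
merges, segmented = learn_bpe({"happy": 5, "happier": 3}, num_merges=3)
print(merges)
print(segmented)
```

In this toy run the two words end up sharing a merged prefix symbol, which is the kind of shared feature between similar words that sub-word vocabularies are meant to expose. The number of merges plays the role of the vocabulary size that otherwise has to be chosen by trial training.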

Byte-Pair Encoding (BPE) (Sennrich et al., 2016) is proposed to get subword-level vocabularies. The general idea is to merge pairs of frequent character sequences to create sub-word units. Sub-word vocabularies can be regarded as a trade-off between character-level vocabularies and word-level vocabularies. Compared to word-level vocabularies, they decrease the sparsity of tokens and increase the shared features between similar words, which probably have similar semantic meanings, like "happy" and "happier". Compared to character-level vocabularies, they yield shorter sentences without rare words. Following BPE, some variants have recently been proposed, like BPE-dropout (Provilkov et al., 2020), SentencePiece (Kudo and Richardson, 2018), and so on.

Despite promising results, most existing sub-word approaches only consider frequency, while the effects of vocabulary size are neglected. Thus, trial training is required to find the optimal size, which brings high computation costs. Recently, some studies have noticed this problem and proposed some practical solutions (Kreutzer and Sokolov, 2018; Cherry et al., 2018; Chen et al., 2019; Salesky et al., 2020).

Figure 2: MUV and downstream performance are positively correlated on two-thirds of tasks. The X-axis (Score) classifies Spearman scores into different groups. The Y-axis (Count) shows the number of tasks in each group. The middle Spearman score is [...].

Marginal Utility of Vocabularization

In this section, we propose to find a good vocabulary measurement by considering entropy and size. As introduced in Section 1, it is non-trivial to find an appropriate objective function to optimize them simultaneously. On one side, with the increase of vocabulary size, the corpus entropy decreases, which benefits model learning (Bentz and Alikaniotis, 2016). On the other side, a large vocabulary causes parameter explosion and token sparsity problems, which hurt model learning (Allison et al., 2006).

To address this problem, we borrow the concept of Marginal Utility in economics (Samuelson, 1937) and propose to use Marginal Utility of Vocabularization (MUV) as the optimization objective.

MUV evaluates the benefits (entropy) a corpus can get from an increase of cost (size). Higher MUV is expected for a higher benefit-cost ratio. Preliminary results verify that MUV correlates with downstream performance on two-thirds of translation tasks (see Figure 2). According to this feature, our goal turns to maximizing MUV in tractable time complexity.

Definition of MUV

Formally, MUV represents the negative derivative of entropy with respect to size. For simplification, we leverage a smaller vocabulary to estimate MUV in implementation. Specifically, MUV is calculated as:

$$ M_{v(k+m)} = \frac{-\left(H_{v(k+m)} - H_{v(k)}\right)}{m}, \tag{1} $$

where $v(k)$ and $v(k+m)$ are two vocabularies with $k$ and $k+m$ tokens, respectively. $H_v$ represents the corpus entropy with the vocabulary $v$, which is defined by the sum of token entropy. To avoid the effects of token length, here we normalize entropy with the average length of tokens, and the final entropy is defined as:

$$ H_v = -\frac{1}{l_v} \sum_{j \in v} P(j) \log P(j), \tag{2} $$

where $P(j)$ is the relative frequency of token $j$ from the training corpus and $l_v$ is the average length of tokens in vocabulary $v$.

Preliminary Results

To verify the effectiveness of MUV as the vocabulary measurement, we conduct experiments on 45 language pairs from TED and calculate the Spearman correlation score between MUV and BLEU scores.
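The two formulas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, token counts are assumed to come from a corpus already segmented with each vocabulary, the average token length is taken as an unweighted mean over vocabulary entries, and zero-count tokens contribute nothing to the entropy sum.

```python
import math


def normalized_entropy(token_counts):
    """Eq. 2: token entropy divided by the average token length l_v."""
    total = sum(token_counts.values())
    # P(j) is the relative frequency of token j; zero-count tokens are skipped.
    entropy = -sum(
        (count / total) * math.log(count / total)
        for count in token_counts.values()
        if count > 0
    )
    # Assumption: l_v is the unweighted mean length over vocabulary entries.
    avg_len = sum(len(token) for token in token_counts) / len(token_counts)
    return entropy / avg_len


def muv(counts_small, counts_large):
    """Eq. 1: negative change in normalized entropy per added token."""
    m = len(counts_large) - len(counts_small)
    if m <= 0:
        raise ValueError("the larger vocabulary must contain more tokens")
    return -(normalized_entropy(counts_large) - normalized_entropy(counts_small)) / m


# Toy corpus "ab ab ab ab c", segmented with two hypothetical vocabularies.
char_vocab = {"a": 4, "b": 4, "c": 1}               # k = 3 tokens
merged_vocab = {"a": 0, "b": 0, "c": 1, "ab": 4}    # k + m = 4 tokens
print(muv(char_vocab, merged_vocab))
```

In this toy case adding the token "ab" lowers the normalized entropy, so the estimated MUV is positive. Comparing MUV across candidate sizes is what lets a vocabulary be scored without training a translation model for each size.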
