
Massive Exploration of Neural Machine Translation Architectures
Denny Britz†, Anna Goldie†, Minh-Thang Luong, Quoc Le
{dennybritz,agoldie,thangluong,qvl}@google.com
Google Brain


Abstract

Neural Machine Translation (NMT) has shown remarkable progress over the past few years, with production systems now being deployed to end-users. One major drawback of current architectures is that they are expensive to train, typically requiring days to weeks of GPU time to converge. This makes exhaustive hyperparameter search, as is commonly done with other neural network architectures, prohibitively expensive. In this work, we present the first large-scale analysis of NMT architecture hyperparameters. We report empirical results and variance numbers for several hundred experimental runs, corresponding to over 250,000 GPU hours on the standard WMT English to German translation task.

Our experiments lead to novel insights and practical advice for building and extending NMT architectures. As part of this contribution, we release an open-source NMT framework that enables researchers to easily experiment with novel techniques and reproduce state of the art results.

† Both authors contributed equally to this work. Work done as a member of the Google Brain Residency program.

1 Introduction

Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014) is an end-to-end approach to automated translation. NMT has shown impressive results (Jean et al., 2015; Luong et al., 2015b; Sennrich et al., 2016a; Wu et al., 2016), surpassing those of phrase-based systems while addressing shortcomings such as the need for hand-engineered features.

The most popular approaches to NMT are based on an encoder-decoder architecture consisting of two recurrent neural networks (RNNs) and an attention mechanism that aligns target with source tokens (Bahdanau et al., 2015; Luong et al., 2015a).

One shortcoming of current NMT architectures is the amount of compute required to train them. Training on real-world datasets of several million examples typically requires dozens of GPUs, and convergence time is on the order of days to weeks (Wu et al., 2016). While sweeping across large hyperparameter spaces is common in Computer Vision (Huang et al., 2016b), such exploration would be prohibitively expensive for NMT models, limiting researchers to well-established architectures and hyperparameter choices.

Furthermore, there have been no large-scale studies of how architectural hyperparameters affect the performance of NMT systems. As a result, it remains unclear why these models perform as well as they do, as well as how we might improve them. In this work, we present the first comprehensive analysis of architectural hyperparameters for Neural Machine Translation systems. Using a total of more than 250,000 GPU hours, we explore common variations of NMT architectures and provide insight into which architectural choices matter most. We report BLEU scores, perplexities, model sizes, and convergence time for all experiments, including variance numbers calculated across several runs of each experiment. In addition, we release to the public a new software framework that was used to run the experiments.

In summary, the main contributions of this work are as follows:

We provide immediately applicable insights into the optimization of Neural Machine Translation models, as well as promising directions for future research.

For example, we found that deep encoders are more difficult to optimize than decoders, that dense residual connections yield better performance than regular residual connections, that LSTMs outperform GRUs, and that a well-tuned beam search is crucial to obtaining state of the art results. By presenting practical advice for choosing baseline architectures, we help researchers avoid wasting time on unpromising model variations.

We also establish the extent to which metrics such as BLEU are influenced by random initialization and slight hyperparameter variation, helping researchers to distinguish statistically significant results from random noise.

Finally, we release an open source package based on TensorFlow, specifically designed for implementing reproducible state of the art sequence-to-sequence models.

All experiments were run using this framework, and we hope to accelerate future research by releasing it to the public. We also release all configuration files and processing scripts needed to reproduce the experiments in this paper.

2 Background and Preliminaries

2.1 Neural Machine Translation

Our models are based on an encoder-decoder architecture with attention mechanism (Bahdanau et al., 2015; Luong et al., 2015a), as shown in Figure 1. An encoder function f_enc takes as input a sequence of source tokens x = (x_1, ..., x_m) and produces a sequence of states h = (h_1, ..., h_m). In our base model, f_enc is a bi-directional RNN and the state h_i corresponds to the concatenation of the states produced by the backward and forward RNNs, h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i].
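To make the bi-directional encoder concrete, the following is a minimal NumPy sketch of f_enc: a forward and a backward pass over the source sequence whose states are concatenated into h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]. It is illustrative only and not the released framework; a vanilla tanh RNN cell stands in for the LSTM/GRU cells studied in the paper, and the dimensions and variable names are assumptions.

    import numpy as np

    def rnn_states(X, W_x, W_h, b):
        """Run a vanilla tanh RNN over input vectors X of shape (m, d_in)
        and return all hidden states, shape (m, d_hid)."""
        h = np.zeros(W_h.shape[0])
        states = []
        for x_t in X:
            h = np.tanh(W_x @ x_t + W_h @ h + b)
            states.append(h)
        return np.stack(states)

    def bidirectional_encoder(X, fwd_params, bwd_params):
        """f_enc: concatenate forward and backward states, h_i = [h_fwd_i; h_bwd_i]."""
        h_fwd = rnn_states(X, *fwd_params)              # left-to-right pass
        h_bwd = rnn_states(X[::-1], *bwd_params)[::-1]  # right-to-left pass, re-aligned
        return np.concatenate([h_fwd, h_bwd], axis=-1)  # shape (m, 2 * d_hid)

    # Illustrative sizes: m source tokens, embedding size d_in, cell size d_hid.
    m, d_in, d_hid = 6, 8, 16
    rng = np.random.default_rng(0)
    X = rng.normal(size=(m, d_in))
    params = lambda: (rng.normal(size=(d_hid, d_in)) * 0.1,   # W_x
                      rng.normal(size=(d_hid, d_hid)) * 0.1,  # W_h
                      np.zeros(d_hid))                        # b
    H = bidirectional_encoder(X, params(), params())
    print(H.shape)  # (6, 32)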

The decoder f_dec is an RNN that predicts the probability of a target sequence y = (y_1, ..., y_k) based on h. The probability of each target token y_i ∈ {1, ..., V} is predicted based on the recurrent state in the decoder RNN s_i, the previous words y_<i, and a context vector c_i. The context vector c_i is also called the attention vector and is calculated as a weighted average of the source states:

c_i = \sum_j a_{ij} h_j    (1)

a_{ij} = \hat{a}_{ij} / \sum_j \hat{a}_{ij}    (2)

\hat{a}_{ij} = att(s_i, h_j)    (3)

Here, att(s_i, h_j) is an attention function that calculates an unnormalized alignment score between the encoder state h_j and the decoder state s_i. In our base model, we use a function of the form att(s_i, h_j) = \langle W_h h_j, W_s s_i \rangle, where the matrices W are used to transform the source and target states into a representation of the same size.

The decoder outputs a distribution over a vocabulary of fixed size V:

P(y_i | y_1, ..., y_{i-1}, x) = softmax(W[s_i; c_i] + b)

The whole model is trained end-to-end by minimizing the negative log likelihood of the target words using stochastic gradient descent.
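Continuing the sketch above, a minimal NumPy illustration of Equations (1)-(3) and of the output distribution P(y_i | y_<i, x) = softmax(W[s_i; c_i] + b). The normalization in (2) is implemented here as a softmax over the alignment scores, the standard practice; all shapes and names are assumptions rather than the released framework's API.

    import numpy as np

    def attention_context(s_i, H, W_s, W_h):
        """Multiplicative attention for one decoder step.

        s_i : decoder state, shape (d_dec,)
        H   : encoder states h_1..h_m stacked row-wise, shape (m, d_enc)
        W_s, W_h : projections into a shared attention space (Eq. 3).
        """
        scores = (H @ W_h.T) @ (W_s @ s_i)       # a_hat_ij = <W_h h_j, W_s s_i>  (Eq. 3)
        weights = np.exp(scores - scores.max())  # normalize the scores           (Eq. 2)
        weights /= weights.sum()
        c_i = weights @ H                        # c_i = sum_j a_ij h_j           (Eq. 1)
        return c_i, weights

    def output_distribution(s_i, c_i, W_out, b_out):
        """P(y_i | y_<i, x) = softmax(W [s_i; c_i] + b)."""
        logits = W_out @ np.concatenate([s_i, c_i]) + b_out
        z = np.exp(logits - logits.max())
        return z / z.sum()

    # Toy shapes: m source positions, encoder states of size d_enc, decoder state d_dec, vocab V.
    m, d_enc, d_dec, d_att, V = 6, 32, 16, 16, 50
    rng = np.random.default_rng(0)
    H = rng.normal(size=(m, d_enc))
    s_i = rng.normal(size=(d_dec,))
    W_h = rng.normal(size=(d_att, d_enc)) * 0.1
    W_s = rng.normal(size=(d_att, d_dec)) * 0.1
    c_i, a_i = attention_context(s_i, H, W_s, W_h)
    p = output_distribution(s_i, c_i, rng.normal(size=(V, d_dec + d_enc)) * 0.1, np.zeros(V))
    print(a_i.sum(), p.sum())  # both sum to ~1.0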

3 Experimental Setup

3.1 Datasets and Preprocessing

We run all experiments on the WMT'15 English→German task consisting of 4.5M sentence pairs, obtained by combining the Europarl v7, News Commentary v10, and Common Crawl corpora. We use newstest2013 as our validation set and newstest2014 and newstest2015 as our test sets. To test for generality, we also ran a small number of experiments on English→French translation, and we found that the performance was highly correlated with that of English→German, but that it took much longer to train models on the larger English→French dataset. Given that translation from the morphologically richer German is also considered a more challenging task, we felt justified in using the English→German translation task for this hyperparameter sweep.

We tokenize and clean all datasets with the scripts in Moses and learn shared subword units using Byte Pair Encoding (BPE) (Sennrich et al., 2016b) with 32,000 merge operations for a final vocabulary size of approximately 37k.
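As an illustration of the BPE procedure (Sennrich et al., 2016b), here is a toy, self-contained version of the merge-learning loop. The actual preprocessing uses the released scripts with 32,000 merge operations on the shared English-German data; the function name and toy corpus below are assumptions for illustration only.

    import collections

    def learn_bpe_merges(words, num_merges):
        """Toy BPE learner (after Sennrich et al., 2016b): repeatedly merge the
        most frequent adjacent symbol pair. Words are tuples of symbols ending
        in the end-of-word marker '</w>'."""
        vocab = collections.Counter(tuple(w) + ('</w>',) for w in words)
        merges = []
        for _ in range(num_merges):
            pairs = collections.Counter()
            for word, freq in vocab.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            new_vocab = collections.Counter()
            for word, freq in vocab.items():
                merged, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        merged.append(word[i] + word[i + 1])
                        i += 2
                    else:
                        merged.append(word[i])
                        i += 1
                new_vocab[tuple(merged)] += freq
            vocab = new_vocab
        return merges

    # Toy corpus; the paper's setting corresponds to num_merges = 32000 on the
    # concatenated, tokenized English and German training data.
    corpus = "low low lower lowest new newer newest newest".split()
    print(learn_bpe_merges(corpus, num_merges=10))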

We discovered that data preprocessing can have a large impact on final numbers, and since we wish to enable reproducibility, we release our data preprocessing scripts together with the NMT framework to the public. For more details on data preprocessing parameters, we refer the reader to the code.

[Figure 1: Encoder-Decoder architecture with attention module. Section numbers reference the corresponding experiments.]

3.2 Training Setup and Software

All of the following experiments are run using our own software framework based on TensorFlow (Abadi et al., 2016). We purposely built this framework to enable reproducible state-of-the-art implementations of Neural Machine Translation architectures. As part of our contribution, we are releasing the framework and all configuration files needed to reproduce our results.

Training is performed on Nvidia Tesla K40m and Tesla K80 GPUs, distributed over 8 parallel workers and 6 parameter servers per experiment. We use a batch size of 128 and decode using beam search with a beam width of 10 and the length normalization penalty of 0.6 described in (Wu et al., 2016). BLEU scores are calculated on tokenized data using the multi-bleu.perl script in Moses. Each experiment is run for a maximum of 2.5M steps and replicated 4 times with different initializations. We save model checkpoints every 30 minutes and choose the best checkpoint based on the validation set BLEU score. We report mean and standard deviation as well as highest scores (as per cross validation) for each experiment.
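For reference, a small sketch of the length normalization from Wu et al. (2016) that is used to rescore beam-search hypotheses, with the alpha value 0.6 mentioned above; the hypothesis log-probabilities and lengths are made up for illustration.

    def length_penalty(length, alpha=0.6):
        """Length normalization of Wu et al. (2016): lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha."""
        return ((5.0 + length) ** alpha) / (6.0 ** alpha)

    def normalized_score(log_prob, length, alpha=0.6):
        """Score used to rank finished beam hypotheses: log P(Y|X) / lp(Y)."""
        return log_prob / length_penalty(length, alpha)

    # Made-up scores: the raw log-probability favors the 2-token hypothesis, but
    # with alpha = 0.6 the 8-token hypothesis ranks higher after normalization.
    print(normalized_score(-4.2, 2))  # ~ -3.83
    print(normalized_score(-6.0, 8))  # ~ -3.77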


3.3 Baseline Model

Based on a review of previous literature, we chose a baseline model that we knew would perform reasonably well.