1 effective approaches to attention - based Neural Machine Translation Minh-Thang Luong Hieu Pham Christopher D. Manning Computer Science Department, Stanford University, Stanford, CA 94305. Abstract X Y Z <eos>. An attentional mechanism has lately been [ ] 20 Sep 2015. used to improve neural machine transla- tion (NMT) by selectively focusing on parts of the source sentence during trans- lation. However, there has been little work exploring useful architectures for attention - based NMT. This paper exam- A B C D <eos> X Y Z. ines two simple and effective classes of at- tentional mechanism: a global approach Figure 1: Neural machine translation a stack- which always attends to all source words ing recurrent architecture for translating a source and a local one that only looks at a subset sequence A B C D into a target sequence X Y.
2 Of source words at a time. We demonstrate Z. Here, <eos> marks the end of a sentence. the effectiveness of both approaches on the WMT translation tasks between English emitting one target word at a time, as illustrated in and German in both directions. With local Figure 1. NMT is often a large neural network that attention , we achieve a significant gain of is trained in an end-to-end fashion and has the abil- BLEU points over non-attentional sys- ity to generalize well to very long word sequences. tems that already incorporate known tech- This means the model does not have to explicitly niques such as dropout. Our ensemble store gigantic phrase tables and language models model using different attention architec- as in the case of standard MT; hence, NMT has tures yields a new state-of-the-art result in a small memory footprint.
3 Lastly, implementing the WMT'15 English to German transla- NMT decoders is easy unlike the highly intricate tion task with BLEU points, an im- decoders in standard MT (Koehn et al., 2003). provement of BLEU points over the In parallel, the concept of attention has existing best system backed by NMT and gained popularity recently in training neural net- an n-gram works, allowing models to learn alignments be- 1 Introduction tween different modalities, , between image objects and agent actions in the dynamic con- Neural Machine Translation (NMT) achieved trol problem (Mnih et al., 2014), between speech state-of-the-art performances in large-scale trans- frames and text in the speech recognition task lation tasks such as from English to French (?)
4 , or between visual features of a picture and (Luong et al., 2015) and English to German its text description in the image caption gener- (Jean et al., 2015). NMT is appealing since it re- ation task (Xu et al., 2015). In the context of quires minimal domain knowledge and is concep- NMT, Bahdanau et al. (2015) has successfully ap- tually simple. The model by Luong et al. (2015) plied such attentional mechanism to jointly trans- reads through all the source words until the end-of- late and align words. To the best of our knowl- sentence symbol <eos> is reached. It then starts edge, there has not been any other work exploring 1. All our code and models are publicly available at the use of attention - based architectures for NMT.
5 In this work, we design, with simplicity and ef- fectiveness in mind, two novel types of attention - recurrent neural network (RNN) architec- based models: a global approach in which all ture, which most of the recent NMT work source words are attended and a local one whereby such as (Kalchbrenner and Blunsom, 2013;. only a subset of source words are considered at a Sutskever et al., 2014; Cho et al., 2014;. time. The former approach resembles the model Bahdanau et al., 2015; Luong et al., 2015;. of (Bahdanau et al., 2015) but is simpler architec- Jean et al., 2015) have in common. They, how- turally. The latter can be viewed as an interesting ever, differ in terms of which RNN architectures blend between the hard and soft attention models are used for the decoder and how the encoder proposed in (Xu et al.)
6 , 2015): it is computation- computes the source sentence representation s. ally less expensive than the global model or the Kalchbrenner and Blunsom (2013) used an soft attention ; at the same time, unlike the hard at- RNN with the standard hidden unit for the tention, the local attention is differentiable almost decoder and a convolutional neural network for everywhere, making it easier to implement and encoding the source sentence representation. On Besides, we also examine various align- the other hand, both Sutskever et al. (2014) and ment functions for our attention - based models. Luong et al. (2015) stacked multiple layers of an Experimentally, we demonstrate that both of RNN with a Long Short-Term Memory (LSTM).
7 Our approaches are effective in the WMT trans- hidden unit for both the encoder and the decoder. lation tasks between English and German in both Cho et al. (2014), Bahdanau et al. (2015), and directions. Our attentional models yield a boost Jean et al. (2015) all adopted a different version of of up to BLEU over non-attentional systems the RNN with an LSTM-inspired hidden unit, the which already incorporate known techniques such gated recurrent unit (GRU), for both as dropout. For English to German translation, In more detail, one can parameterize the proba- we achieve new state-of-the-art (SOTA) results bility of decoding each word yj as: for both WMT'14 and WMT'15, outperforming previous SOTA systems, backed by NMT mod- p (yj |y<j , s) = softmax (g (hj )) (2).
8 Els and n-gram LM rerankers, by more than with g being the transformation function that out- BLEU. We conduct extensive analysis to evaluate puts a vocabulary-sized Here, hj is the our models in terms of learning, the ability to han- RNN hidden unit, abstractly computed as: dle long sentences, choices of attentional architec- tures, alignment quality, and translation outputs. hj = f (hj 1 , s), (3). 2 Neural Machine Translation where f computes the current hidden state A neural machine translation system is a neural given the previous hidden state and can be network that directly models the conditional prob- either a vanilla RNN unit, a GRU, or an LSTM. ability p(y|x) of translating a source sentence, unit.
9 In (Kalchbrenner and Blunsom, 2013;. x1 , .. , xn , to a target sentence, y1 , .. , ym .3 A Sutskever et al., 2014; Cho et al., 2014;. basic form of NMT consists of two components: Luong et al., 2015), the source representa- (a) an encoder which computes a representation s tion s is only used once to initialize the for each source sentence and (b) a decoder which decoder hidden state. On the other hand, in generates one target word at a time and hence de- (Bahdanau et al., 2015; Jean et al., 2015) and composes the conditional probability as: this work, s, in fact, implies a set of source hidden states which are consulted throughout the Xm log p(y|x) = log p (yj |y<j , s) (1) entire course of the translation process.
10 Such an j=1 approach is referred to as an attention mechanism, which we will discuss next. A natural choice to model such a de- In this work, following (Sutskever et al., 2014;. composition in the decoder is to use a Luong et al., 2015), we use the stacking LSTM. 2. There is a recent work by Gregor et al. (2015), which is architecture for our NMT systems, as illustrated very similar to our local attention and applied to the image 4. generation task. However, as we detail later, our model is They all used a single RNN layer except for the latter two much simpler and can achieve good performance for NMT. works which utilized a bidirectional RNN for the encoder. 3 5. All sentences are assumed to terminate with a special One can provide g with other inputs such as the currently end-of-sentence token <eos>.