
Effective Approaches to Attention-based Neural Machine Translation

Transcription of Effective Approaches to Attention-based Neural Machine Translation

Effective Approaches to Attention-based Neural Machine Translation
Minh-Thang Luong, Hieu Pham, Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305
{lmthang,hyhieu,manning}@stanford.edu

Abstract

An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches on the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain in BLEU over non-attentional systems that already incorporate known techniques such as dropout. Our ensemble model using different attention architectures yields a new state-of-the-art result in the WMT'15 English to German translation task, an improvement in BLEU over the existing best system backed by NMT and an n-gram reranker.

[Figure 1: Neural machine translation, a stacking recurrent architecture for translating a source sequence A B C D into a target sequence X Y Z. Here, <eos> marks the end of a sentence.]

1 Introduction

Neural Machine Translation (NMT) achieved state-of-the-art performances in large-scale translation tasks such as from English to French (Luong et al., 2015) and English to German (Jean et al., 2015). NMT is appealing since it requires minimal domain knowledge and is conceptually simple. The model by Luong et al. (2015) reads through all the source words until the end-of-sentence symbol <eos> is reached. It then starts emitting one target word at a time, as illustrated in Figure 1. NMT is often a large neural network that is trained in an end-to-end fashion and has the ability to generalize well to very long word sequences. This means the model does not have to explicitly store gigantic phrase tables and language models as in the case of standard MT; hence, NMT has a small memory footprint. Lastly, implementing NMT decoders is easy, unlike the highly intricate decoders in standard MT (Koehn et al., 2003).
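To make that description concrete, the following is a minimal sketch, in plain NumPy, of the read-then-emit process shown in Figure 1: a recurrent encoder consumes the source tokens up to <eos>, and the decoder, starting from the encoder's final state, emits one target token at a time. The tanh update, the random weight matrices, and the toy vocabularies are placeholders standing in for the stacked LSTM layers and trained parameters of a real NMT system; this is not the authors' code.

import numpy as np

rng = np.random.default_rng(0)
H, V_SRC, V_TGT = 8, 5, 5                # hidden size and toy vocabulary sizes
E_src = rng.normal(size=(V_SRC, H))      # source embeddings (placeholder)
E_tgt = rng.normal(size=(V_TGT, H))      # target embeddings (placeholder)
W = rng.normal(size=(2 * H, H))          # recurrent weights (placeholder)
W_out = rng.normal(size=(H, V_TGT))      # output projection (placeholder)

def cell(x, h):
    # Stand-in for a stacked LSTM step: new hidden state from input and old state.
    return np.tanh(np.concatenate([x, h]) @ W)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

EOS = 0
source = [3, 1, 4, EOS]                  # toy ids for "A B C D <eos>"

# Encoding: read the whole source sentence up to <eos>.
h = np.zeros(H)
for tok in source:
    h = cell(E_src[tok], h)

# Decoding: emit one target word at a time until <eos> (greedy, length-capped).
prev, output = EOS, []
for _ in range(10):
    h = cell(E_tgt[prev], h)
    prev = int(np.argmax(softmax(h @ W_out)))
    output.append(prev)
    if prev == EOS:
        break
print(output)                            # a short sequence of target ids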

In parallel, the concept of attention has gained popularity recently in training neural networks, allowing models to learn alignments between different modalities, e.g., between image objects and agent actions in the dynamic control problem (Mnih et al., 2014), between speech frames and text in the speech recognition task (Chorowski et al., 2014), or between visual features of a picture and its text description in the image caption generation task (Xu et al., 2015). In the context of NMT, Bahdanau et al. (2015) has successfully applied such attentional mechanism to jointly translate and align words. To the best of our knowledge, there has not been any other work exploring the use of attention-based architectures for NMT.

In this work, we design, with simplicity and effectiveness in mind, two novel types of attention-based models: a global approach in which all source words are attended and a local one whereby only a subset of source words are considered at a time. The former approach resembles the model of (Bahdanau et al., 2015) but is simpler architecturally. The latter can be viewed as an interesting blend between the hard and soft attention models proposed in (Xu et al., 2015): it is computationally less expensive than the global model or the soft attention; at the same time, unlike the hard attention, the local attention is differentiable almost everywhere, making it easier to implement and train. Besides, we also examine various alignment functions for our attention-based models.
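As a rough illustration of the distinction just drawn, the snippet below contrasts how the two classes distribute attention weight over source positions: the global variant normalizes alignment scores over every position, while the local variant normalizes only over a small subset. The fixed window around an assumed position p_t is purely illustrative; how that subset is actually chosen is specified later in the paper.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([0.1, 2.0, 0.3, -1.0, 0.5, 0.2])  # one alignment score per source position

# Global: every source position receives some weight.
global_weights = softmax(scores)

# Local: only a subset receives weight, here a window of width D around an assumed p_t.
p_t, D = 1, 1
window = slice(max(0, p_t - D), p_t + D + 1)
local_weights = np.zeros_like(scores)
local_weights[window] = softmax(scores[window])

print(global_weights)
print(local_weights)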

Experimentally, we demonstrate that both of our approaches are effective in the WMT translation tasks between English and German in both directions. Our attentional models yield a sizable boost in BLEU over non-attentional systems which already incorporate known techniques such as dropout. For English to German translation, we achieve new state-of-the-art (SOTA) results for both WMT'14 and WMT'15, outperforming previous SOTA systems, backed by NMT models and n-gram LM rerankers, by a clear BLEU margin. We conduct extensive analysis to evaluate our models in terms of learning, the ability to handle long sentences, choices of attentional architectures, alignment quality, and translation outputs.

Footnote 1: All our code and models are publicly available online.
Footnote 2: There is a recent work by Gregor et al. (2015), which is very similar to our local attention and applied to the image generation task. However, as we detail later, our model is much simpler and can achieve good performance for NMT.

2 Neural Machine Translation

A neural machine translation system is a neural network that directly models the conditional probability p(y|x) of translating a source sentence, x_1, ..., x_n, to a target sentence, y_1, ..., y_m. A basic form of NMT consists of two components: (a) an encoder which computes a representation s for each source sentence and (b) a decoder which generates one target word at a time and hence decomposes the conditional probability as:

\log p(y|x) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, s)    (1)

A natural choice to model such a decomposition in the decoder is to use a recurrent neural network (RNN) architecture, which most of the recent NMT work such as (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015; Luong et al., 2015; Jean et al., 2015) have in common. They, however, differ in terms of which RNN architectures are used for the decoder and how the encoder computes the source sentence representation s.

Kalchbrenner and Blunsom (2013) used an RNN with the standard hidden unit for the decoder and a convolutional neural network for encoding the source sentence representation. On the other hand, both Sutskever et al. (2014) and Luong et al. (2015) stacked multiple layers of an RNN with a Long Short-Term Memory (LSTM) hidden unit for both the encoder and the decoder. Cho et al. (2014), Bahdanau et al. (2015), and Jean et al. (2015) all adopted a different version of the RNN with an LSTM-inspired hidden unit, the gated recurrent unit (GRU), for both components.

In more detail, one can parameterize the probability of decoding each word y_j as:

p(y_j \mid y_{<j}, s) = \mathrm{softmax}(g(h_j))    (2)

with g being the transformation function that outputs a vocabulary-sized vector. Here, h_j is the RNN hidden unit, abstractly computed as:

h_j = f(h_{j-1}, s),    (3)

where f computes the current hidden state given the previous hidden state and can be either a vanilla RNN unit, a GRU, or an LSTM unit. In (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Luong et al., 2015), the source representation s is only used once to initialize the decoder hidden state. On the other hand, in (Bahdanau et al., 2015; Jean et al., 2015) and this work, s, in fact, implies a set of source hidden states which are consulted throughout the entire course of the translation process. Such an approach is referred to as an attention mechanism, which we will discuss next.
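The sketch below ties equations (1)-(3) together for the basic, non-attentional form in which s only initializes the decoder state. The tanh update stands in for f (a vanilla RNN, GRU, or LSTM unit), the linear projection for g, and all weights are random placeholders rather than trained parameters.

import numpy as np

rng = np.random.default_rng(0)
H, V = 8, 6                                  # hidden size and toy target vocabulary size
E = rng.normal(size=(V, H))                  # target word embeddings (placeholder)
W_f = rng.normal(size=(2 * H, H))            # parameters of f (placeholder)
W_g = rng.normal(size=(H, V))                # parameters of g (placeholder)

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def f(y_prev, h_prev):
    # Eq. (3): compute h_j from the previous hidden state (and the previous word).
    return np.tanh(np.concatenate([E[y_prev], h_prev]) @ W_f)

def sentence_log_prob(y, s):
    # Eq. (1): log p(y|x) = sum_j log p(y_j | y_<j, s).
    h, y_prev, total = s, 0, 0.0             # token 0 plays the role of <eos>/start
    for y_j in y:
        h = f(y_prev, h)                     # Eq. (3)
        total += log_softmax(h @ W_g)[y_j]   # Eq. (2): softmax(g(h_j))
        y_prev = y_j
    return total

s = rng.normal(size=H)                       # source representation s (placeholder)
print(sentence_log_prob([2, 4, 1, 0], s))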

In this work, following (Sutskever et al., 2014; Luong et al., 2015), we use the stacking LSTM architecture for our NMT systems, as illustrated in Figure 1. We use the LSTM unit defined in (Zaremba et al., 2015). Our training objective is formulated as follows:

J_t = \sum_{(x,y) \in D} -\log p(y|x)    (4)

with D being our parallel training corpus.
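Equation (4) is just a sum of per-sentence negative log-likelihoods over the parallel corpus D, as in the toy sketch below. The sentence_log_prob argument is a stand-in for whatever scorer implements equations (1)-(3); the batching and gradient-based optimization used in practice are omitted.

def corpus_objective(corpus, sentence_log_prob):
    # Eq. (4): J_t = sum over (x, y) in D of -log p(y|x).
    return sum(-sentence_log_prob(x, y) for x, y in corpus)

# Usage with a stand-in scorer; a real system would differentiate this objective
# with respect to the network parameters and train end to end.
toy_corpus = [([3, 1, 4, 0], [2, 4, 1, 0]), ([2, 2, 0], [1, 0])]
print(corpus_objective(toy_corpus, lambda x, y: -0.5 * len(y)))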

Footnote 3: All sentences are assumed to terminate with a special end-of-sentence token <eos>.
Footnote 4: They all used a single RNN layer except for the latter two works, which utilized a bidirectional RNN for the encoder.
Footnote 5: One can provide g with other inputs such as the currently predicted word y_j, as in (Bahdanau et al., 2015).

3 Attention-based Models

Our various attention-based models are classified into two broad categories, global and local. These classes differ in terms of whether the attention is placed on all source positions or on only a few source positions. We illustrate these two model types in Figures 2 and 3 respectively.

Common to these two types of models is the fact that at each time step t in the decoding phase, both approaches first take as input the hidden state h_t at the top layer of a stacking LSTM.

[Figure 2: Global attentional model. Diagram labels: attention layer, global align weights a_t, context vector c_t, source and target hidden states, output y_t.]
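A minimal sketch of the global attention computation that Figure 2 depicts: the top-layer decoder state h_t is scored against each source hidden state, the scores are normalized into alignment weights a_t, and the context vector c_t is the weighted average of the source states. The dot-product score is an assumption here; the paper examines several alignment functions.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def global_attention(h_t, source_states):
    # h_t: (H,) top-layer decoder state; source_states: (S, H) source hidden states.
    scores = source_states @ h_t      # one alignment score per source position (dot score)
    a_t = softmax(scores)             # global align weights
    c_t = a_t @ source_states         # context vector: weighted average of source states
    return a_t, c_t

rng = np.random.default_rng(0)
a_t, c_t = global_attention(rng.normal(size=8), rng.normal(size=(5, 8)))
print(a_t.shape, c_t.shape)           # (5,) (8,)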

