Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724-1734, October 25-29, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre (Université de Montréal)
Dzmitry Bahdanau (Jacobs University, Germany)
Fethi Bougares, Holger Schwenk (Université du Maine, France)
Yoshua Bengio (Université de Montréal, CIFAR Senior Fellow)

Abstract

In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

Introduction

Deep neural networks have shown great success in various applications such as object recognition (see, e.g., (Krizhevsky et al., 2012)) and speech recognition (see, e.g., (Dahl et al., 2012)). Furthermore, many recent works showed that neural networks can be successfully used in a number of tasks in natural language processing (NLP). These include, but are not limited to, language modeling (Bengio et al., 2003), paraphrase detection (Socher et al., 2011) and word embedding extraction (Mikolov et al., 2013).
In the field of statistical machine translation (SMT), deep neural networks have begun to show promising results. (Schwenk, 2012) summarizes a successful usage of feedforward neural networks in the framework of phrase-based SMT systems.

Along this line of research on using neural networks for SMT, this paper focuses on a novel neural network architecture that can be used as a part of the conventional phrase-based SMT system. The proposed neural network architecture, which we will refer to as an RNN Encoder-Decoder, consists of two recurrent neural networks (RNN) that act as an encoder and a decoder pair. The encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps the vector representation back to a variable-length target sequence. The two networks are trained jointly to maximize the conditional probability of the target sequence given a source sequence. Additionally, we propose to use a rather sophisticated hidden unit in order to improve both the memory capacity and the ease of training.

The proposed RNN Encoder-Decoder with a novel hidden unit is empirically evaluated on the task of translating from English to French.
We train the model to learn the translation probability of an English phrase to a corresponding French phrase. The model is then used as a part of a standard phrase-based SMT system by scoring each phrase pair in the phrase table. The empirical evaluation reveals that this approach of scoring phrase pairs with an RNN Encoder-Decoder improves the translation performance.

We qualitatively analyze the trained RNN Encoder-Decoder by comparing its phrase scores with those given by the existing translation model. The qualitative analysis shows that the RNN Encoder-Decoder is better at capturing the linguistic regularities in the phrase table, indirectly explaining the quantitative improvements in the overall translation performance. The further analysis of the model reveals that the RNN Encoder-Decoder learns a continuous space representation of a phrase that preserves both the semantic and syntactic structure of the phrase.

RNN Encoder-Decoder

Preliminary: Recurrent Neural Networks

A recurrent neural network (RNN) is a neural network that consists of a hidden state $h$ and an optional output $y$ which operates on a variable-length sequence $x = (x_1, \ldots, x_T)$. At each time step $t$, the hidden state $h_{\langle t \rangle}$ of the RNN is updated by

$$h_{\langle t \rangle} = f\left(h_{\langle t-1 \rangle}, x_t\right), \tag{1}$$

where $f$ is a non-linear activation function. $f$ may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997).

An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol in a sequence. In that case, the output at each time step $t$ is the conditional distribution $p(x_t \mid x_{t-1}, \ldots, x_1)$. For example, a multinomial distribution (1-of-$K$ coding) can be output using a softmax activation function

$$p(x_{t,j} = 1 \mid x_{t-1}, \ldots, x_1) = \frac{\exp\left(w_j h_{\langle t \rangle}\right)}{\sum_{j'=1}^{K} \exp\left(w_{j'} h_{\langle t \rangle}\right)}, \tag{2}$$

for all possible symbols $j = 1, \ldots, K$, where $w_j$ are the rows of a weight matrix $W$. By combining these probabilities, we can compute the probability of the sequence $x$ using

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_1). \tag{3}$$

From this learned distribution, it is straightforward to sample a new sequence by iteratively sampling a symbol at each time step.

RNN Encoder-Decoder

In this paper, we propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. $p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$, where one should note that the input and output sequence lengths $T$ and $T'$ may differ.

Figure 1: An illustration of the proposed RNN Encoder-Decoder.

The encoder is an RNN that reads each symbol of an input sequence $x$ sequentially. As it reads each symbol, the hidden state of the RNN changes according to Eq. (1). After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary $c$ of the whole input sequence.

The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol $y_t$ given the hidden state $h_{\langle t \rangle}$. However, unlike the RNN described in the preliminary section on recurrent neural networks, both $y_t$ and $h_{\langle t \rangle}$ are also conditioned on $y_{t-1}$ and on the summary $c$ of the input sequence. Hence, the hidden state of the decoder at time $t$ is computed by

$$h_{\langle t \rangle} = f\left(h_{\langle t-1 \rangle}, y_{t-1}, c\right),$$

and similarly, the conditional distribution of the next symbol is

$$P(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, c) = g\left(h_{\langle t \rangle}, y_{t-1}, c\right),$$

for given activation functions $f$ and $g$ (the latter must produce valid probabilities, e.g. with a softmax).

See Fig. 1 for a graphical depiction of the proposed model architecture.

The two components of the proposed RNN Encoder-Decoder are jointly trained to maximize the conditional log-likelihood

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n), \tag{4}$$

where $\theta$ is the set of the model parameters and each $(x_n, y_n)$ is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, we can use a gradient-based algorithm to estimate the model parameters.

Once the RNN Encoder-Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences, where the score is simply a probability $p_{\theta}(y \mid x)$ from Eqs. (3) and (4).

Hidden Unit that Adaptively Remembers and Forgets

In addition to a novel model architecture, we also propose a new type of hidden unit ($f$ in Eq. (1)) that has been motivated by the LSTM unit but is much simpler to compute and implement.[1] Fig. 2 shows the graphical depiction of the proposed hidden unit.

Figure 2: An illustration of the proposed hidden activation function. The update gate $z$ selects whether the hidden state is to be updated with a new hidden state $\tilde{h}$. The reset gate $r$ decides whether the previous hidden state is ignored. See Eqs. (5)-(8) for the detailed equations of $r$, $z$, $h$ and $\tilde{h}$.

[1] The LSTM unit, which has shown impressive results in several applications such as speech recognition, has a memory cell and four gating units that adaptively control the information flow inside the unit, compared to only two gating units in the proposed hidden unit. For details on LSTM networks, see, e.g., (Graves, 2012).

Let us describe how the activation of the $j$-th hidden unit is computed. First, the reset gate $r_j$ is computed by

$$r_j = \sigma\left([W_r x]_j + [U_r h_{\langle t-1 \rangle}]_j\right), \tag{5}$$

where $\sigma$ is the logistic sigmoid function, and $[\cdot]_j$ denotes the $j$-th element of a vector. $x$ and $h_{\langle t-1 \rangle}$ are the input and the previous hidden state, respectively. $W_r$ and $U_r$ are weight matrices which are learned.

Similarly, the update gate $z_j$ is computed by

$$z_j = \sigma\left([W_z x]_j + [U_z h_{\langle t-1 \rangle}]_j\right). \tag{6}$$

The actual activation of the proposed unit $h_j$ is then computed by

$$h_j^{\langle t \rangle} = z_j h_j^{\langle t-1 \rangle} + (1 - z_j)\, \tilde{h}_j^{\langle t \rangle}, \tag{7}$$

where

$$\tilde{h}_j^{\langle t \rangle} = \phi\left([W x]_j + [U (r \odot h_{\langle t-1 \rangle})]_j\right). \tag{8}$$

In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus allowing a more compact representation.

On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit (Bengio et al., 2013).

As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales.
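The gated update of Eqs. (5)-(8), and its use for encoding a source sequence and scoring a target sequence, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: tanh is assumed for the activation in Eq. (8), bias terms are omitted, initial hidden states are zero, the decoder is conditioned on the previous symbol and the summary c by simple concatenation, and the helper names (`gated_step`, `init_params`, `encode`, `log_score`) and all sizes are hypothetical.

```python
import math
import random

def sigmoid(a):
    # Numerically stable logistic sigmoid.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def matvec(M, v):
    # Matrix-vector product on plain Python lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gated_step(x, h_prev, p):
    """One hidden-state update following Eqs. (5)-(8); tanh is an assumed activation."""
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]  # reset gate, Eq. (5)
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]  # update gate, Eq. (6)
    reset_h = [rj * hj for rj, hj in zip(r, h_prev)]  # element-wise product of r and h_prev
    h_new = [math.tanh(a + b) for a, b in zip(matvec(p["W"], x), matvec(p["U"], reset_h))]  # candidate, Eq. (8)
    return [zj * hp + (1.0 - zj) * hn for zj, hp, hn in zip(z, h_prev, h_new)]  # Eq. (7)

def init_params(n_in, n_hid, rng):
    # Hypothetical helper: small random weights, no bias terms.
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"Wr": mat(n_hid, n_in), "Ur": mat(n_hid, n_hid),
            "Wz": mat(n_hid, n_in), "Uz": mat(n_hid, n_hid),
            "W": mat(n_hid, n_in), "U": mat(n_hid, n_hid)}

def one_hot(k, size):
    return [1.0 if i == k else 0.0 for i in range(size)]

def encode(symbols, p_enc, vocab, n_hid):
    # Apply the gated update over the source; the final state is the summary c.
    h = [0.0] * n_hid
    for s in symbols:
        h = gated_step(one_hot(s, vocab), h, p_enc)
    return h

def log_score(target, c, p_dec, O, vocab, n_hid):
    # log p(y | x) as a sum of log-softmax terms.
    # Assumption: the decoder input at each step is the concatenation [y_prev; c].
    h, y_prev, total = [0.0] * n_hid, [0.0] * vocab, 0.0
    for y in target:
        inp = y_prev + c
        h = gated_step(inp, h, p_dec)
        logits = matvec(O, h + inp)
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
        total += logits[y] - log_norm
        y_prev = one_hot(y, vocab)
    return total

rng = random.Random(0)
V, H = 4, 3                              # toy vocabulary and hidden sizes
enc = init_params(V, H, rng)
dec = init_params(V + H, H, rng)
O = [[rng.uniform(-0.5, 0.5) for _ in range(H + V + H)] for _ in range(V)]
c = encode([0, 2, 1], enc, V, H)         # summary of a toy source sequence
print(log_score([3, 1], c, dec, O, V, H))
```

Because each step is a normalized softmax, the scores of all target sequences of a fixed length sum to one, which is a quick sanity check on the scoring function. Setting the update gate close to 1 copies the previous state, and setting both gates close to 0 leaves only the input-driven candidate state, matching the behaviour described in the text.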
Those units that learn to capture short-term dependencies will tend to have reset gates that are frequently active, but those that capture longer-term dependencies will have update gates that are mostly active.

In our preliminary experiments, we found that it is crucial to use this new unit with gating units. We were not able to get meaningful results with an oft-used tanh unit without any gating.

Statistical Machine Translation

In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation $f$ given a source sentence $e$, which maximizes

$$p(f \mid e) \propto p(e \mid f)\, p(f),$$

where the first term on the right-hand side is called the translation model and the latter the language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model $\log p(f \mid e)$ as a log-linear model with additional features and corresponding weights:

$$\log p(f \mid e) = \sum_{n=1}^{N} w_n f_n(f, e) + \log Z(e), \tag{9}$$

where $f_n$ and $w_n$ are the $n$-th feature and weight, respectively. $Z(e)$ is a normalization constant that does not depend on the weights.
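As a rough illustration of Eq. (9), the sketch below ranks candidate translations by their weighted feature sum. The feature names, weights, and feature values are made up for the example and do not come from the paper. The term log Z(e) is left out because it is the same for every candidate translation of a given source sentence, so it does not change the ranking.

```python
def loglinear_score(features, weights):
    # Weighted feature sum from Eq. (9), without log Z(e),
    # which is constant across candidates for a fixed source sentence e.
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical weights w_n, one per feature f_n.
weights = {"translation_model": 1.0, "language_model": 0.5, "rnn_encdec": 0.8}

# Hypothetical log-domain feature values for two candidate translations.
candidates = {
    "candidate A": {"translation_model": -2.0, "language_model": -3.0, "rnn_encdec": -1.0},
    "candidate B": {"translation_model": -1.5, "language_model": -4.0, "rnn_encdec": -2.5},
}

scores = {name: loglinear_score(feats, weights) for name, feats in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this setup a phrase-pair score from the RNN Encoder-Decoder would enter simply as one more feature (here the assumed `rnn_encdec` entry) with its own weight.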