Transcription of Convolutional Sequence to Sequence Learning - arXiv
Convolutional Sequence to Sequence Learning

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin
Facebook AI Research

Abstract

The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks. Compared to recurrent models, computations over all elements can be fully parallelized during training to better exploit the GPU hardware, and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.

1. Introduction

Sequence to sequence learning has been successful in many tasks such as machine translation, speech recognition (Sutskever et al., 2014; Chorowski et al., 2015) and text summarization (Rush et al., 2015; Nallapati et al., 2016; Shen et al., 2016) amongst others. The dominant approach to date encodes the input sequence with a series of bi-directional recurrent neural networks (RNN) and generates a variable length output with another set of decoder RNNs, both of which interface via a soft-attention mechanism (Bahdanau et al., 2014; Luong et al., 2015). In machine translation, this architecture has been demonstrated to outperform traditional phrase-based models by large margins (Sennrich et al., 2016b; Zhou et al., 2016; Wu et al., 2016; §2).[1]

[1] The source code and models are available.

Convolutional neural networks are less common for sequence modeling, despite several advantages (Waibel et al., 1989; LeCun & Bengio, 1995). Compared to recurrent layers, convolutions create representations for fixed size contexts; however, the effective context size of the network can easily be made larger by stacking several layers on top of each other.
This allows precise control of the maximum length of dependencies to be modeled. Convolutional networks do not depend on the computations of the previous time step and therefore allow parallelization over every element in a sequence. This contrasts with RNNs, which maintain a hidden state of the entire past that prevents parallel computation within a sequence.

Convolutional neural networks create hierarchical representations over the input sequence in which nearby input elements interact at lower layers while distant elements interact at higher layers. Hierarchical structure provides a shorter path to capture long-range dependencies compared to the chain structure modeled by recurrent networks: we can obtain a feature representation capturing relationships within a window of $n$ words by applying only $O(n/k)$ convolutional operations for kernels of width $k$, compared to a linear number $O(n)$ for recurrent neural networks.
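As a rough illustration of the shorter path, the sketch below counts how many stacked convolutions of width k are needed before a single output position can see n input words. It assumes stride 1 and no dilation, so each layer widens the field by k - 1 positions; the function names are hypothetical and do not come from the paper.

```python
import math

def receptive_field(num_layers: int, k: int) -> int:
    """Input positions visible to one output after stacking stride-1 convolutions of width k."""
    return 1 + num_layers * (k - 1)

def layers_to_cover(n: int, k: int) -> int:
    """Smallest number of stacked layers whose receptive field spans n inputs (requires k >= 2)."""
    if n <= 1:
        return 0
    return math.ceil((n - 1) / (k - 1))

# A recurrent network needs on the order of n sequential steps to relate the
# first and last word of a window; the convolutional stack needs roughly n / k layers.
for n in (25, 100):
    print(n, layers_to_cover(n, k=5))
```

The layer count grows like n / (k - 1), which is the same order as the O(n/k) figure quoted above.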
Inputs to a convolutional network are fed through a constant number of kernels and non-linearities, whereas recurrent networks apply up to $n$ operations and non-linearities to the first word and only a single set of operations to the last word. Fixing the number of non-linearities applied to the inputs also eases learning.

Recent work has applied convolutional neural networks to sequence modeling, such as Bradbury et al. (2016), who introduce recurrent pooling between a succession of convolutional layers, or Kalchbrenner et al. (2016), who tackle neural translation without attention. However, none of these approaches has demonstrated improvements over state of the art results on large benchmark datasets. Convolutions have been previously explored for machine translation by Meng et al. (2015), but their evaluation was restricted to a small dataset and the model was used in tandem with a traditional count-based model.
Architectures which are partially convolutional have shown strong performance on larger tasks but their decoder is still recurrent (Gehring et al., 2016).

arXiv, 25 Jul 2017

In this paper we propose an architecture for sequence to sequence modeling that is entirely convolutional. Our model is equipped with gated linear units (Dauphin et al., 2016) and residual connections (He et al., 2015a). We also use attention in every decoder layer and demonstrate that each attention layer only adds a negligible amount of overhead. The combination of these choices enables us to tackle large scale problems (§3).

We evaluate our approach on several large datasets for machine translation as well as summarization and compare to the current best architectures reported in the literature. On WMT'16 English-Romanian translation we achieve a new state of the art, outperforming the previous best result in BLEU.
On WMT'14 English-German we outperform the strong LSTM setup of Wu et al. (2016) in BLEU, and on WMT'14 English-French we outperform the likelihood trained system of Wu et al. (2016). Our model can also translate unseen sentences at an order of magnitude faster speed than Wu et al. (2016) on GPU and CPU hardware (§4, §5).

2. Recurrent Sequence to Sequence Learning

Sequence to sequence modeling has been synonymous with recurrent neural network based encoder-decoder architectures (Sutskever et al., 2014; Bahdanau et al., 2014). The encoder RNN processes an input sequence $x = (x_1, \ldots, x_m)$ of $m$ elements and returns state representations $z = (z_1, \ldots, z_m)$. The decoder RNN takes $z$ and generates the output sequence $y = (y_1, \ldots, y_n)$ left to right, one element at a time. To generate output $y_{i+1}$, the decoder computes a new hidden state $h_{i+1}$ based on the previous state $h_i$, an embedding $g_i$ of the previous target language word $y_i$, as well as a conditional input $c_i$ derived from the encoder output $z$.
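In symbols, the decoder step just described can be summarized as follows, where $f$ is a placeholder name (an assumption, not notation from the text) for whichever recurrent cell is used:

```latex
h_{i+1} = f\left(h_i,\; g_i,\; c_i\right)
```

Here $h_i$ is the previous decoder state, $g_i$ the embedding of the previous target word $y_i$, and $c_i$ the conditional input derived from $z$.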
Based on this generic formulation, various encoder-decoder architectures have been proposed, which differ mainly in the conditional input and the type of RNN.

Models without attention consider only the final encoder state $z_m$ by setting $c_i = z_m$ for all $i$ (Cho et al., 2014), or simply initialize the first decoder state with $z_m$ (Sutskever et al., 2014), in which case $c_i$ is not used. Architectures with attention (Bahdanau et al., 2014; Luong et al., 2015) compute $c_i$ as a weighted sum of $(z_1, \ldots, z_m)$ at each time step. The weights of the sum are referred to as attention scores and allow the network to focus on different parts of the input sequence as it generates the output sequences. Attention scores are computed by essentially comparing each encoder state $z_j$ to a combination of the previous decoder state $h_i$ and the last prediction $y_i$; the result is normalized to be a distribution over input elements.

Popular choices for recurrent networks in encoder-decoder models are long short term memory networks (LSTM; Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU; Cho et al., 2014). Both extend Elman RNNs (Elman, 1990) with a gating mechanism that allows the memorization of information from previous time steps in order to model long-term dependencies. Most recent approaches also rely on bi-directional encoders to build representations of both past and future contexts (Bahdanau et al., 2014; Zhou et al., 2016; Wu et al., 2016). Models with many layers often rely on shortcut or residual connections (He et al., 2015a; Zhou et al., 2016; Wu et al., 2016).

3. A Convolutional Architecture

Next we introduce a fully convolutional architecture for sequence to sequence modeling. Instead of relying on RNNs to compute intermediate encoder states $z$ and decoder states $h$, we use convolutional neural networks (CNN).

3.1. Position Embeddings

First, we embed input elements $x = (x_1, \ldots, x_m)$ in distributional space as $w = (w_1, \ldots, w_m)$, where $w_j \in \mathbb{R}^f$ is a column in an embedding matrix $D \in \mathbb{R}^{V \times f}$.
We also equip our model with a sense of order by embedding the absolute position of input elements $p = (p_1, \ldots, p_m)$ where $p_j \in \mathbb{R}^f$. Both are combined to obtain input element representations $e = (w_1 + p_1, \ldots, w_m + p_m)$. We proceed similarly for output elements that were already generated by the decoder network to yield output element representations $g = (g_1, \ldots, g_n)$ that are fed back into the decoder network. Position embeddings are useful in our architecture since they give our model a sense of which portion of the sequence in the input or output it is currently dealing with.

3.2. Convolutional Block Structure

Both encoder and decoder networks share a simple block structure that computes intermediate states based on a fixed number of input elements. We denote the output of the $l$-th block as $h^l = (h^l_1, \ldots, h^l_n)$ for the decoder network, and $z^l = (z^l_1, \ldots, z^l_m)$ for the encoder network; we refer to blocks and layers interchangeably.
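The input representation $e = (w_1 + p_1, \ldots, w_m + p_m)$ from the position embedding step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sizes, the random initialisation, and the names `random_matrix` and `embed` are hypothetical, and the embeddings are stored one row per entry for convenience.

```python
import random

random.seed(0)
V, f, max_len = 100, 4, 16  # hypothetical vocabulary size, embedding size, max positions

def random_matrix(rows, cols):
    """Small random matrix as a list of rows (illustrative initialisation only)."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

D = random_matrix(V, f)        # word embeddings, one row per vocabulary entry
P = random_matrix(max_len, f)  # absolute position embeddings, one row per position

def embed(token_ids):
    """Return e = (w_1 + p_1, .., w_m + p_m) as a list of f-dimensional vectors."""
    return [
        [w + p for w, p in zip(D[tok], P[pos])]
        for pos, tok in enumerate(token_ids)
    ]

e = embed([5, 42, 5])
print(len(e), len(e[0]))  # 3 4
# Token id 5 appears at positions 0 and 2; its two representations differ
# because a different position embedding is added at each position.
print(e[0] != e[2])
```

Adding rather than concatenating the two embeddings keeps the representation size at f, matching the definition of e above.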
Each block contains a one dimensional convolution followed by a non-linearity. For a decoder network with a single block and kernel width $k$, each resulting state $h^1_i$ contains information over $k$ input elements. Stacking several blocks on top of each other increases the number of input elements represented in a state. For instance, stacking 6 blocks with $k = 5$ results in an input field of 25 elements, i.e. each output depends on 25 inputs. Non-linearities allow the networks to exploit the full input field, or to focus on fewer elements if needed.

Each convolution kernel is parameterized as $W \in \mathbb{R}^{2d \times kd}$, $b_w \in \mathbb{R}^{2d}$ and takes as input $X \in \mathbb{R}^{k \times d}$, which is a concatenation of $k$ input elements embedded in $d$ dimensions, and maps them to a single output element $Y \in \mathbb{R}^{2d}$ that has twice the dimensionality of the input elements; subsequent layers operate over the $k$ output elements of the previous layer.
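To make the kernel shapes concrete, the following plain-Python sketch applies a kernel of shape 2d x kd to windows of k input elements. It is a minimal illustration, not the authors' implementation: the sizes, the random weights, the absence of padding, and the names `conv_step` and `conv_layer` are all assumptions, and the non-linearity that follows the convolution is left out.

```python
import random

random.seed(0)
d, k = 4, 3  # hypothetical embedding size and kernel width

# Kernel W in R^{2d x kd} and bias b_w in R^{2d}, randomly initialised for illustration.
W = [[random.gauss(0.0, 0.1) for _ in range(k * d)] for _ in range(2 * d)]
b_w = [0.0] * (2 * d)

def conv_step(window):
    """Map k input elements (each of size d) to one output element of size 2d."""
    x = [value for element in window for value in element]  # concatenation, length k*d
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, b_w)]

def conv_layer(inputs):
    """Slide the kernel over the sequence with stride 1 and no padding."""
    return [conv_step(inputs[i:i + k]) for i in range(len(inputs) - k + 1)]

sequence = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(10)]
outputs = conv_layer(sequence)
print(len(outputs), len(outputs[0]))  # 8 8
```

Each output element has 2d = 8 entries, twice the input dimensionality, and without padding a sequence of length 10 yields 10 - k + 1 = 8 outputs.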