
Self-Attention with Relative Position Representations

Transcription of Self-Attention with Relative Position Representations

Self-Attention with Relative Position Representations

Peter Shaw (Google), Jakob Uszkoreit (Google), Ashish Vaswani (Google)

Abstract

Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, we observe that combining relative and absolute position representations yields no further improvement in translation quality.

We describe an efficient implementation of our method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs.

1 Introduction

Recent approaches to sequence to sequence learning typically leverage recurrence (Sutskever et al., 2014), convolution (Gehring et al., 2017; Kalchbrenner et al., 2016), attention (Vaswani et al., 2017), or a combination of recurrence and attention (Bahdanau et al., 2014; Cho et al., 2014; Luong et al., 2015; Wu et al., 2016) as basic building blocks. These approaches incorporate information about the sequential position of elements differently. Recurrent neural networks (RNNs) typically compute a hidden state h_t as a function of their input at time t and a previous hidden state h_{t-1}, capturing relative and absolute positions along the time dimension directly through their sequential structure. Non-recurrent models do not necessarily consider input elements sequentially and may hence require explicitly encoding position information to be able to use sequence order. One common approach is to use position encodings which are combined with input elements to expose position information to the model.

These position encodings can be a deterministic function of position (Sukhbaatar et al., 2015; Vaswani et al., 2017) or learned representations. Convolutional neural networks inherently capture relative positions within the kernel size of each convolution. They have been shown to still benefit from position encodings (Gehring et al., 2017), however.

For the Transformer, which employs neither convolution nor recurrence, incorporating explicit representations of position information is an especially important consideration since the model is otherwise entirely invariant to sequence ordering. Attention-based models have therefore used position encodings or biased attention weights based on distance (Parikh et al., 2016).

In this work we present an efficient way of incorporating relative position representations in the self-attention mechanism of the Transformer. Even when entirely replacing its absolute position encodings, we demonstrate significant improvements in translation quality on two machine translation tasks.

Our approach can be cast as a special case of extending the self-attention mechanism of the Transformer to considering arbitrary relations between any two elements of the input, a direction we plan to explore in future work on modeling labeled, directed graphs.

2 Transformer

The Transformer (Vaswani et al., 2017) employs an encoder-decoder structure, consisting of stacked encoder and decoder layers.

Encoder layers consist of two sublayers: self-attention followed by a position-wise feed-forward layer. Decoder layers consist of three sublayers: self-attention followed by encoder-decoder attention, followed by a position-wise feed-forward layer. It uses residual connections around each of the sublayers, followed by layer normalization (Ba et al., 2016). The decoder uses masking in its self-attention to prevent a given output position from incorporating information about future output positions during training.

Position encodings based on sinusoids of varying frequency are added to encoder and decoder input elements prior to the first layer. In contrast to learned, absolute position representations, the authors hypothesized that sinusoidal position encodings would help the model to generalize to sequence lengths unseen during training by allowing it to learn to attend also by relative position.
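For concreteness, here is a minimal NumPy sketch of sinusoidal position encodings of the kind described above, following Vaswani et al. (2017): even dimensions use a sine and odd dimensions a cosine of position-dependent frequencies. The function name and the assumption of an even d_model are ours, for illustration only.

```python
import numpy as np

def sinusoidal_position_encodings(n_positions, d_model):
    """Sinusoidal position encodings as described in Vaswani et al. (2017).

    Dimension 2i of position pos is sin(pos / 10000**(2i / d_model)) and
    dimension 2i + 1 is the corresponding cosine. Assumes d_model is even.
    """
    positions = np.arange(n_positions)[:, None]              # (n_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)   # (n_positions, d_model / 2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encodings are added to the input embeddings before the first layer, e.g.:
# x = token_embeddings + sinusoidal_position_encodings(seq_len, d_model)
```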

This property is shared by our relative position representations which, in contrast to absolute position representations, are invariant to the total sequence length. Residual connections help propagate position information to higher layers.

Self-Attention

Self-attention sublayers employ h attention heads. To form the sublayer output, results from each head are concatenated and a parameterized linear transformation is applied.

Each attention head operates on an input sequence x = (x_1, ..., x_n) of n elements where x_i ∈ R^{d_x}, and computes a new sequence z = (z_1, ..., z_n) of the same length where z_i ∈ R^{d_z}.

Each output element z_i is computed as a weighted sum of linearly transformed input elements:

z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V)    (1)

Each weight coefficient α_{ij} is computed using a softmax function:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}

And e_{ij} is computed using a compatibility function that compares two input elements:

e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_z}}    (2)

Scaled dot product was chosen for the compatibility function, which enables efficient computation.
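The following is a small, self-contained NumPy sketch of a single attention head implementing eqs. (1) and (2). The function and variable names are illustrative; in practice W^Q, W^K, W^V are learned parameters and the computation is batched across heads and sequences.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, W_Q, W_K, W_V):
    """One attention head. x: (n, d_x); W_Q, W_K, W_V: (d_x, d_z)."""
    d_z = W_Q.shape[1]
    q, k, v = x @ W_Q, x @ W_K, x @ W_V        # projected queries, keys, values: (n, d_z)
    e = (q @ k.T) / np.sqrt(d_z)               # compatibilities e_ij, eq. (2)
    alpha = softmax(e, axis=-1)                # attention weights alpha_ij
    return alpha @ v                           # z_i = sum_j alpha_ij (x_j W^V), eq. (1)
```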

Linear transformations of the inputs add sufficient expressive power. W^Q, W^K, W^V ∈ R^{d_x × d_z} are parameter matrices. These parameter matrices are unique per layer and attention head.

3 Proposed Relation-aware Self-Attention

We propose an extension to self-attention to consider the pairwise relationships between input elements. In this sense, we model the input as a labeled, directed, fully-connected graph.

The edge between input elements x_i and x_j is represented by vectors a^V_{ij}, a^K_{ij} ∈ R^{d_a}. The motivation for learning two distinct edge representations is that a^V_{ij} and a^K_{ij} are suitable for use in eq. (3) and eq. (4), respectively, without requiring additional linear transformations. These representations can be shared across attention heads. We use d_a = d_z.

We modify eq. (1) to propagate edge information to the sublayer output:

z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V + a^V_{ij})    (3)

This extension is presumably important for tasks where information about the edge types selected by a given attention head is useful to downstream encoder or decoder layers.
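A minimal NumPy sketch of the modified output computation in eq. (3), assuming the attention weights α_{ij}, the projected values x_j W^V, and the edge tensor a^V have already been computed; the names and tensor layout are ours, for illustration.

```python
import numpy as np

def values_with_edges(alpha, v, a_V):
    """Eq. (3): z_i = sum_j alpha_ij * (x_j W^V + a^V_ij).

    alpha: (n, n) attention weights; v: (n, d_z) projected values (x W^V);
    a_V:   (n, n, d_z) edge vectors a^V_ij (using d_a = d_z).
    """
    # Broadcast the values over the query index i, add the per-edge vectors,
    # then take the attention-weighted sum over j.
    return np.einsum('ij,ijd->id', alpha, v[None, :, :] + a_V)
```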

However, as explored in the experiments, this may not be necessary for machine translation.

We also, importantly, modify eq. (2) to consider edges when determining compatibility:

e_{ij} = \frac{(x_i W^Q)(x_j W^K + a^K_{ij})^T}{\sqrt{d_z}}    (4)

The primary motivation for using simple addition to incorporate edge representations in eq. (3) and eq. (4) is to enable an efficient implementation, described below.

[Figure 1: Example edges representing relative positions, or the distance between elements. We learn representations for each relative position within a clipping distance k. The figure assumes 2 <= k <= n - 4. Note that not all edges are shown. Edge labels shown include a^V_{2,1} = w^V_{-1}, a^K_{2,1} = w^K_{-1}, a^V_{2,4} = w^V_{2}, a^K_{2,4} = w^K_{2}, a^V_{4,n} = w^V_{k}, a^K_{4,n} = w^K_{k}.]

Relative Position Representations

For linear sequences, edges can capture information about the relative position differences between input elements. The maximum relative position we consider is clipped to a maximum absolute value of k. We hypothesized that precise relative position information is not useful beyond a certain distance.
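For reference, a direct NumPy sketch of the edge-aware compatibility function in eq. (4), assuming projected queries and keys for a single head and an (n, n, d_z) tensor of a^K_{ij} vectors; the names are ours. This naive form materialises all per-pair key sums, which the efficient implementation described below avoids.

```python
import numpy as np

def compatibility_with_edges(q, k, a_K):
    """Eq. (4): e_ij = (x_i W^Q)(x_j W^K + a^K_ij)^T / sqrt(d_z).

    q, k: (n, d_z) projected queries and keys; a_K: (n, n, d_z) edge vectors a^K_ij.
    """
    d_z = q.shape[-1]
    # Builds the full (n, n, d_z) tensor of key-plus-edge sums before reducing.
    return np.einsum('id,ijd->ij', q, k[None, :, :] + a_K) / np.sqrt(d_z)
```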

Clipping the maximum distance also enables the model to generalize to sequence lengths not seen during training. Therefore, we consider 2k + 1 unique edge labels:

a^K_{ij} = w^K_{clip(j - i, k)}
a^V_{ij} = w^V_{clip(j - i, k)}
clip(x, k) = max(-k, min(k, x))

We then learn relative position representations w^K = (w^K_{-k}, ..., w^K_k) and w^V = (w^V_{-k}, ..., w^V_k) where w^K_i, w^V_i ∈ R^{d_a}.

Efficient Implementation

There are practical space complexity concerns when considering edges between input elements, as noted by Veličković et al. (2017), which considers unlabeled graph inputs to an attention model.

For a sequence of length n and h attention heads, we reduce the space complexity of storing relative position representations from O(hn^2 d_a) to O(n^2 d_a) by sharing them across each head. Additionally, relative position representations can be shared across sequences. Therefore, the overall self-attention space complexity increases from O(bhnd_z) to O(bhnd_z + n^2 d_a). Given d_a = d_z, the size of the relative increase depends on n / (bh).

The Transformer computes self-attention efficiently for all sequences, heads, and positions in a batch using parallel matrix multiplication operations (Vaswani et al., 2017).
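A short NumPy sketch of the clipped lookup defined above, assuming the learned tables w^K and w^V are stored as (2k + 1) × d_a arrays holding w_{-k} through w_k; the indexing scheme and names are ours.

```python
import numpy as np

def relative_position_representations(n, k, w_K, w_V):
    """Build a^K_ij and a^V_ij from the learned tables via clipping.

    w_K, w_V: (2k + 1, d_a) arrays holding w_{-k}, ..., w_k.
    Returns a_K, a_V of shape (n, n, d_a) with a[i, j] = w_{clip(j - i, k)}.
    """
    rel = np.arange(n)[None, :] - np.arange(n)[:, None]  # rel[i, j] = j - i
    rel = np.clip(rel, -k, k) + k                        # shift into the range 0 .. 2k
    return w_K[rel], w_V[rel]
```

Sharing the returned a_K and a_V across attention heads, and across sequences in a batch, is what yields the O(n^2 d_a) storage discussed above.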

Without relative position representations, each e_{ij} can be computed using bh parallel multiplications of n × d_z and d_z × n matrices. Each matrix multiplication computes e_{ij} for all sequence positions, for a particular head and sequence. For any sequence and head, this requires sharing the same representation for each position across all compatibility function applications (dot products) with other positions.

When we consider relative positions the representations differ with different pairs of positions. This prevents us from computing all e_{ij} for all pairs of positions in a single matrix multiplication. We also want to avoid broadcasting relative position representations. However, both issues can be resolved by splitting the computation of eq. (4) into two terms:

e_{ij} = \frac{x_i W^Q (x_j W^K)^T + x_i W^Q (a^K_{ij})^T}{\sqrt{d_z}}    (5)

The first term is identical to eq. (2), and can be computed as described above. For the second term involving relative position representations, tensor reshaping can be used to compute n parallel multiplications of bh × d_z and d_z × n matrices.
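As a sketch of how the two terms of eq. (5) map onto parallel matrix multiplications, assume queries and keys are stored as (b, h, n, d_z) tensors and a_K is the shared (n, n, d_z) tensor of relative position representations. The layout, reshapes, and names here are our illustration, not the paper's tensor2tensor code.

```python
import numpy as np

def relative_attention_logits(q, k, a_K):
    """Compute eq. (5) as two terms. q, k: (b, h, n, d_z); a_K: (n, n, d_z)."""
    b, h, n, d_z = q.shape
    # First term: bh parallel (n, d_z) x (d_z, n) multiplications, as in eq. (2).
    term1 = q @ k.transpose(0, 1, 3, 2)                      # (b, h, n, n)
    # Second term: reshape so that each query position i is multiplied with its
    # own (d_z, n) slice of a_K -- n parallel (b*h, d_z) x (d_z, n) multiplications.
    q_r = q.transpose(2, 0, 1, 3).reshape(n, b * h, d_z)     # (n, b*h, d_z)
    term2 = q_r @ a_K.transpose(0, 2, 1)                     # (n, b*h, n)
    term2 = term2.reshape(n, b, h, n).transpose(1, 2, 0, 3)  # (b, h, n, n)
    return (term1 + term2) / np.sqrt(d_z)
```

Because a_K carries no batch or head dimension, the second term reuses the same relative position representations for every sequence and head, which is what keeps the extra storage at O(n^2 d_a).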

Each matrix multiplication computes contributions to e_{ij} for all heads and batches, corresponding to a particular sequence position. Further reshaping allows adding the two terms. The same approach can be used to efficiently compute eq. (3).

For our machine translation experiments, the result was a modest 7% decrease in steps per second, but we were able to maintain the same model and batch sizes on P100 GPUs as Vaswani et al. (2017).

4 Experimental Setup

We use the tensor2tensor library (available at https://github.com/tensorflow/tensor2tensor) for training and evaluating our model. We evaluated our model on the WMT 2014 machine translation task, using the WMT 2014 English-German dataset consisting of approximately 4.5M sentence pairs and the 2014 WMT English-French dataset consisting of approximately 36M sentence pairs.

[Table 1: Experimental results for the WMT 2014 English-to-German (EN-DE) and English-to-French (EN-FR) translation tasks, using the newstest2014 test set. Rows compare the Transformer (base) and Transformer (big) models with absolute versus relative position information; columns report EN-DE BLEU and EN-FR BLEU.]

For all experiments, we split tokens into a 32,768 word-piece vocabulary (Wu et al., 2016).

