
On the difficulty of training Recurrent Neural Networks



Razvan Pascanu, Université de Montréal
Tomas Mikolov, Brno University of Technology
Yoshua Bengio, Université de Montréal

Abstract

There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.
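As a rough preview of the gradient norm clipping strategy mentioned in the abstract, the sketch below (a minimal NumPy version with an arbitrary threshold of 1.0 and a hypothetical function name) rescales a gradient whenever its norm exceeds the threshold, leaving its direction unchanged.

    import numpy as np

    def clip_gradient_norm(grad, threshold=1.0):
        # If the gradient norm exceeds the threshold, rescale the gradient so
        # its norm equals the threshold; the direction is preserved.
        # (The threshold value here is illustrative, not taken from the paper.)
        norm = np.linalg.norm(grad)
        if norm > threshold:
            grad = grad * (threshold / norm)
        return grad

    # A gradient of norm 5.0 is rescaled to norm 1.0: [0.6, 0.8]
    print(clip_gradient_norm(np.array([3.0, 4.0])))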

1. Introduction

A recurrent neural network (RNN), e.g. Fig. 1, is a neural network model proposed in the '80s (Rumelhart et al., 1986; Elman, 1990; Werbos, 1988) for modeling time series. The structure of the network is similar to that of a standard multilayer perceptron, with the distinction that we allow connections among hidden units associated with a time delay. Through these connections the model can retain information about the past inputs, enabling it to discover temporal correlations between events that are possibly far away from each other in the data (a crucial property for proper learning of time series). While in principle the recurrent network is a simple and powerful model, in practice it is unfortunately hard to train properly.

Among the main reasons why this model is so unwieldy are the vanishing gradient and exploding gradient problems described in Bengio et al. (1994).

Figure 1. Schematic of a recurrent neural network. The recurrent connections in the hidden layer allow information to persist from one input to another.

2. Training recurrent networks

A generic recurrent neural network, with input u_t and state x_t for time step t, is given by equation (1). In the theoretical section of this paper we will sometimes make use of the specific parametrization given by equation (2)¹ in order to provide more precise conditions and intuitions about the everyday use-case.

    x_t = F(x_{t-1}, u_t, \theta)                          (1)

    x_t = W_{rec} \sigma(x_{t-1}) + W_{in} u_t + b          (2)

The parameters of the model are given by the recurrent weight matrix W_rec, the biases b and the input weight matrix W_in, collected in \theta for the general case. x_0 is provided by the user, set to zero or learned, and \sigma is an element-wise function (usually tanh or sigmoid).
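To make equation (2) concrete, here is a minimal NumPy sketch of a single recurrent step; the dimensions, the random initialization and the choice of tanh for \sigma are illustrative assumptions rather than settings from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n_hidden, n_input = 4, 3

    # Parameters theta = {W_rec, W_in, b}; sigma is tanh here.
    W_rec = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
    W_in = rng.normal(scale=0.5, size=(n_hidden, n_input))
    b = np.zeros(n_hidden)
    sigma = np.tanh

    def step(x_prev, u_t):
        # Equation (2): x_t = W_rec sigma(x_{t-1}) + W_in u_t + b
        return W_rec @ sigma(x_prev) + W_in @ u_t + b

    x0 = np.zeros(n_hidden)                  # x_0 set to zero, as mentioned in the text
    x1 = step(x0, rng.normal(size=n_input))  # one recurrent step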

A cost E measures the performance of the network on some given task, and it can be broken apart into individual costs for each step, E = \sum_{1 \le t \le T} E_t, where E_t = L(x_t).

One approach that can be used to compute the necessary gradients is Backpropagation Through Time (BPTT), where the recurrent model is represented as a deep multi-layer one (with an unbounded number of layers) and backpropagation is applied on the unrolled model (see Fig. 2).

¹ This formulation is equivalent to the more widely known equation x_t = \sigma(W_{rec} x_{t-1} + W_{in} u_t + b), and it was chosen for convenience.

Figure 2. Unrolling recurrent neural networks in time by creating a copy of the model for each time step. We denote by x_t the hidden state of the network at time t, by u_t the input of the network at time t and by E_t the error obtained from the output at time t.
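The unrolled view of Fig. 2 amounts to a plain loop over time that accumulates a per-step cost E_t. In the sketch below a squared error against random placeholder targets stands in for the unspecified loss L(x_t), and all sizes and seeds are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    T, n_hidden, n_input = 5, 4, 3
    W_rec = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
    W_in = rng.normal(scale=0.5, size=(n_hidden, n_input))
    b = np.zeros(n_hidden)

    U = rng.normal(size=(T, n_input))          # inputs u_1 .. u_T
    targets = rng.normal(size=(T, n_hidden))   # placeholders defining L(x_t)

    x = np.zeros(n_hidden)                     # x_0
    E = 0.0
    for t in range(T):
        x = W_rec @ np.tanh(x) + W_in @ U[t] + b   # one unrolled copy per time step
        E_t = 0.5 * np.sum((x - targets[t]) ** 2)  # E_t = L(x_t), a squared error here
        E += E_t                                   # E = sum over t of E_t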

We will diverge from the classical BPTT equations at this point and re-write the gradients (see equations (3), (4) and (5)) in order to better highlight the exploding gradients problem. These equations were obtained by writing the gradients in a sum-of-products form.

    \frac{\partial E}{\partial \theta} = \sum_{1 \le t \le T} \frac{\partial E_t}{\partial \theta}        (3)

    \frac{\partial E_t}{\partial \theta} = \sum_{1 \le k \le t} \frac{\partial E_t}{\partial x_t} \frac{\partial x_t}{\partial x_k} \frac{\partial^+ x_k}{\partial \theta}        (4)

    \frac{\partial x_t}{\partial x_k} = \prod_{t \ge i > k} \frac{\partial x_i}{\partial x_{i-1}} = \prod_{t \ge i > k} W_{rec}^T \, \mathrm{diag}(\sigma'(x_{i-1}))        (5)

\partial^+ x_k / \partial \theta refers to the "immediate" partial derivative of the state x_k with respect to \theta, i.e. where x_{k-1} is taken as a constant with respect to \theta. Specifically, considering equation (2), the value of any row i of the matrix \partial^+ x_k / \partial W_{rec} is just \sigma(x_{k-1}).

Equation (5) also provides the form of the Jacobian matrix \partial x_i / \partial x_{i-1} for the specific parametrization given in equation (2), where diag converts a vector into a diagonal matrix and \sigma' computes the derivative of \sigma element-wise.

Note that each term \partial E_t / \partial \theta from equation (3) has the same form, and the behaviour of these individual terms determines the behaviour of the sum. Henceforth we will focus on one such generic term, calling it simply the gradient when there is no confusion. Any gradient component \partial E_t / \partial \theta is also a sum (see equation (4)), whose terms we refer to as temporal contributions or temporal components. One can see that each such temporal contribution \frac{\partial E_t}{\partial x_t} \frac{\partial x_t}{\partial x_k} \frac{\partial^+ x_k}{\partial \theta} measures how \theta at step k affects the cost at step t > k.
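The sketch below illustrates equations (4)-(5) by materializing the factor \partial x_t / \partial x_k as an explicit product of one-step Jacobians along a stored trajectory and applying it to an error signal \partial E_t / \partial x_t. It uses the standard Jacobian layout W_rec diag(\sigma'(x_{i-1})) (the transpose placement in equation (5) reflects the paper's row/column convention); sizes, seeds and the error signal are arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    T, n = 8, 4
    W_rec = rng.normal(scale=0.5, size=(n, n))
    W_in = rng.normal(scale=0.5, size=(n, 2))

    # Forward pass, storing the states x_0 .. x_T.
    U = rng.normal(size=(T, 2))
    xs = [np.zeros(n)]
    for t in range(T):
        xs.append(W_rec @ np.tanh(xs[-1]) + W_in @ U[t])

    def jacobian_product(k, t):
        # Product of one-step Jacobians dx_i/dx_{i-1} for i = t, t-1, ..., k+1
        # (equation (5)); each factor is W_rec diag(sigma'(x_{i-1})) for tanh.
        P = np.eye(n)
        for i in range(t, k, -1):
            P = P @ (W_rec @ np.diag(1.0 - np.tanh(xs[i - 1]) ** 2))
        return P

    dEt_dxt = rng.normal(size=n)    # stands for dE_t/dx_t at some step t
    k, t = 2, 7
    dEt_dxk = jacobian_product(k, t).T @ dEt_dxt
    # Per equation (4), this term would be combined with the immediate partial
    # derivative d+ x_k / d theta to give one temporal contribution to dE_t/d theta.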

The factors \partial x_t / \partial x_k (equation (5)) transport the error "in time" from step t back to step k. We would further loosely distinguish between long term and short term contributions, where long term refers to components for which k \ll t and short term to everything else.

2.1. Exploding and Vanishing Gradients

As introduced in Bengio et al. (1994), the exploding gradients problem refers to the large increase in the norm of the gradient during training. Such events are caused by the explosion of the long term components, which can grow exponentially more than short term ones. The vanishing gradients problem refers to the opposite behaviour, when long term components go exponentially fast to norm 0, making it impossible for the model to learn correlations between temporally distant events.

2.1.1. The mechanics

To understand this phenomenon we need to look at the form of each temporal component, and in particular at the matrix factors \partial x_t / \partial x_k (see equation (5)) that take the form of a product of t - k Jacobian matrices. In the same way a product of t - k real numbers can shrink to zero or explode to infinity, so does this product of matrices (along some direction v).

In what follows we will try to formalize these intuitions (extending a similar derivation done in Bengio et al. (1994), where only the single hidden unit case was considered).

If we consider a linear version of the model (i.e. \sigma set to the identity function in equation (2)) we can use the power iteration method to formally analyze this product of Jacobian matrices and obtain tight conditions for when the gradients explode or vanish (see the supplementary materials for a detailed derivation of these conditions). It is sufficient for the largest eigenvalue \lambda_1 of the recurrent weight matrix to be smaller than 1 for long term components to vanish (as t \to \infty) and necessary for it to be larger than 1 for gradients to explode.

We can generalize these results to nonlinear functions \sigma where the absolute value of \sigma'(x) is bounded (say by a value \gamma \in \mathbb{R}), and therefore \|\mathrm{diag}(\sigma'(x_k))\| \le \gamma.
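A small numerical experiment in the linear case (\sigma equal to the identity) illustrates these conditions: rescaling W_rec so that its spectral radius sits just below or just above 1 makes the norm of the repeated Jacobian product shrink towards zero or blow up. The scaling factors 0.9 and 1.1 and the number of steps are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(3)
    n, steps = 4, 50
    W = rng.normal(size=(n, n))
    rho = np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius of W

    for scale in (0.9, 1.1):                     # largest |eigenvalue| below / above 1
        W_rec = (scale / rho) * W
        P = np.eye(n)
        for _ in range(steps):                   # product of t - k identical factors
            P = W_rec @ P                        # linear model: dx_i/dx_{i-1} = W_rec
        # 2-norm of the product: tiny for scale = 0.9, very large for scale = 1.1
        print(scale, np.linalg.norm(P, 2))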

We first prove that it is sufficient for \lambda_1 < 1/\gamma, where \lambda_1 is the absolute value of the largest eigenvalue of the recurrent weight matrix W_rec, for the vanishing gradient problem to occur. Note that we assume the parametrization given by equation (2). The Jacobian matrix \partial x_{k+1} / \partial x_k is given by W_{rec}^T \mathrm{diag}(\sigma'(x_k)). The 2-norm of this Jacobian is bounded by the product of the norms of the two matrices (see equation (6)). Due to our assumption, this implies that it is smaller than 1.

    \forall k, \; \left\| \frac{\partial x_{k+1}}{\partial x_k} \right\| \le \left\| W_{rec}^T \right\| \left\| \mathrm{diag}(\sigma'(x_k)) \right\| < \frac{1}{\gamma} \gamma < 1        (6)

Let \eta \in \mathbb{R} be such that \forall k, \|\partial x_{k+1} / \partial x_k\| \le \eta < 1. The existence of \eta is given by equation (6).
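The bound of equation (6) can be checked numerically: if W_rec is rescaled so that its 2-norm is below 1/\gamma (with \gamma = 1 for tanh), then every one-step Jacobian along a simulated trajectory has 2-norm below 1. The rescaling target of 0.9 and the random input sequence are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(4)
    n, T = 4, 20
    gamma = 1.0                                # bound on |tanh'(x)|; 1/4 for sigmoid

    W = rng.normal(size=(n, n))
    W_rec = 0.9 / np.linalg.norm(W, 2) * W     # enforce ||W_rec|| = 0.9 < 1/gamma
    W_in = rng.normal(scale=0.5, size=(n, 2))
    U = rng.normal(size=(T, 2))

    x = np.zeros(n)
    max_jacobian_norm = 0.0
    for t in range(T):
        J = W_rec @ np.diag(1.0 - np.tanh(x) ** 2)   # one-step Jacobian at x_t
        max_jacobian_norm = max(max_jacobian_norm, np.linalg.norm(J, 2))
        x = W_rec @ np.tanh(x) + W_in @ U[t]

    # Equation (6): every one-step Jacobian norm is at most ||W_rec|| * gamma < 1.
    print(max_jacobian_norm, np.linalg.norm(W_rec, 2) * gamma)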

By induction over i, we can show that

    \left\| \frac{\partial E_t}{\partial x_t} \prod_{i=k}^{t-1} \frac{\partial x_{i+1}}{\partial x_i} \right\| \le \eta^{t-k} \left\| \frac{\partial E_t}{\partial x_t} \right\|        (7)

As \eta < 1, it follows that, according to equation (7), long term contributions (for which t - k is large) go to 0 exponentially fast with t - k. By inverting this proof we get the necessary condition for exploding gradients, namely that the largest eigenvalue \lambda_1 is larger than 1/\gamma (otherwise the long term components would vanish instead of exploding). For tanh we have \gamma = 1, while for sigmoid we have \gamma = 1/4 (a numerical sketch of this bound follows below).

2.2. Drawing similarities with Dynamical Systems

We can improve our understanding of the exploding gradients and vanishing gradients problems by employing a dynamical systems perspective, as it was done before in Doya (1993); Bengio et al. (1994).
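Returning to equations (6) and (7): with \|W_{rec}\| \gamma = \eta < 1 (here \gamma = 1 because \sigma is tanh), the norm of the backpropagated error stays below \eta^{t-k} times \|\partial E_t / \partial x_t\|, so long term contributions decay exponentially. The sketch below compares the two sides of the bound along one simulated trajectory; the rescaling factor, seeds and error signal are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(5)
    n, T = 4, 30
    gamma = 1.0                                # for tanh; gamma = 1/4 for sigmoid
    W = rng.normal(size=(n, n))
    W_rec = 0.9 / np.linalg.norm(W, 2) * W     # eta = ||W_rec|| * gamma = 0.9 < 1
    W_in = rng.normal(scale=0.5, size=(n, 2))
    eta = np.linalg.norm(W_rec, 2) * gamma

    # Forward pass, storing the states x_0 .. x_T.
    U = rng.normal(size=(T, 2))
    xs = [np.zeros(n)]
    for t in range(T):
        xs.append(W_rec @ np.tanh(xs[-1]) + W_in @ U[t])

    # Backpropagate an error signal from step t = T down to step k and compare
    # its norm with the eta^(t-k) bound of equation (7).
    g = rng.normal(size=n)                     # stands for dE_t/dx_t
    g0_norm = np.linalg.norm(g)
    t = T
    for k in range(t - 1, -1, -1):
        g = np.diag(1.0 - np.tanh(xs[k]) ** 2) @ W_rec.T @ g   # one backprop step
        print(k, np.linalg.norm(g), eta ** (t - k) * g0_norm)  # left side <= right side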

