
Incorporating Nesterov Momentum into Adam
Timothy Dozat
Stanford University





1 Introduction

When attempting to improve the performance of a deep learning system, there are more or less three approaches one can take: the first is to improve the structure of the model, perhaps adding another layer, switching from simple recurrent units to LSTM cells [4], or, in the realm of NLP, taking advantage of syntactic parses (as in [13, et seq.]); another approach is to improve the initialization of the model, guaranteeing that the early-stage gradients have certain beneficial properties [3], building in large amounts of sparsity [6], or taking advantage of principles of linear algebra [15]; the final approach is to try a more powerful learning algorithm, such as including a decaying sum over the previous gradients in the update [12], dividing each parameter update by the $L_2$ norm of the previous updates for that parameter [2], or even forgoing first-order algorithms for more powerful but more computationally costly second-order algorithms [9].

This paper has as its goal the third option: improving the quality of the final solution by using a faster, more powerful learning algorithm.

2 Related Work

Momentum-based algorithms

Gradient descent is a simple, well-known, and generally very robust optimization algorithm in which the gradient of the function to be minimized with respect to the parameters ($\nabla_{\theta_{t-1}} f(\theta_{t-1})$) is computed, and a portion of that gradient is subtracted off of the parameters:

Algorithm 1 Gradient Descent
  $g_t \gets \nabla_{\theta_{t-1}} f(\theta_{t-1})$
  $\theta_t \gets \theta_{t-1} - \eta g_t$

Classical momentum [12] accumulates a decaying sum (with decay constant $\mu$) of the previous gradients into a momentum vector $m$ and uses that in place of the true gradient.
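For concreteness, Algorithm 1 and the classical momentum rule just described can be sketched in NumPy as follows; this is an illustrative sketch rather than code from the paper, with grad_f standing in for the gradient of the objective.

    import numpy as np

    def sgd_step(theta, grad_f, lr=0.01):
        # Algorithm 1: subtract a portion of the gradient from the parameters.
        return theta - lr * grad_f(theta)

    def classical_momentum_step(theta, m, grad_f, lr=0.01, mu=0.9):
        # Accumulate a decaying sum of past gradients into m
        # and step along m instead of the raw gradient.
        g = grad_f(theta)
        m = mu * m + g
        return theta - lr * m, m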

This has the advantage of accelerating gradient descent learning along dimensions where the gradient remains relatively consistent across training steps and slowing it along turbulent dimensions where the gradient oscillates significantly.

Algorithm 2 Classical Momentum
  $g_t \gets \nabla_{\theta_{t-1}} f(\theta_{t-1})$
  $m_t \gets \mu m_{t-1} + g_t$
  $\theta_t \gets \theta_{t-1} - \eta m_t$

[14] show that Nesterov's accelerated gradient (NAG) [11], which has a provably better bound than gradient descent, can be rewritten as a kind of improved momentum. If we substitute the definition of $m_t$ in place of the symbol $m_t$ in the parameter update, as in (2),

  $\theta_t \gets \theta_{t-1} - \eta m_t$  (1)
  $\theta_t \gets \theta_{t-1} - \eta \mu m_{t-1} - \eta g_t$  (2)

we can see that the term $\eta \mu m_{t-1}$ does not depend on the current gradient $g_t$, so, in principle, we can get a superior step direction by applying the momentum vector to the parameters before computing the gradient.
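The following toy example (mine, not the paper's) makes this concrete on a simple quadratic: the only change relative to classical momentum is that the gradient is evaluated at the lookahead point $\theta_{t-1} - \eta \mu m_{t-1}$ rather than at $\theta_{t-1}$.

    import numpy as np

    def momentum_step(theta, m, grad_f, lr=0.1, mu=0.9, nesterov=False):
        # With nesterov=True, apply the momentum portion of the step
        # before computing the gradient (Nesterov's accelerated gradient).
        lookahead = theta - lr * mu * m if nesterov else theta
        g = grad_f(lookahead)
        m = mu * m + g
        return theta - lr * m, m

    # Toy quadratic f(theta) = 0.5 * theta^T A theta, with gradient A @ theta.
    A = np.diag([1.0, 10.0])
    grad_f = lambda th: A @ th

    theta, m = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(20):
        theta, m = momentum_step(theta, m, grad_f, nesterov=True)
    print(theta)  # both components head toward the minimizer at the origin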

The authors provide empirical evidence that this algorithm is superior to the gradient descent, classical momentum, and Hessian-Free [9] algorithms for conventionally difficult optimization problems.

Algorithm 3 Nesterov's accelerated gradient
  $g_t \gets \nabla_{\theta_{t-1}} f(\theta_{t-1} - \eta \mu m_{t-1})$
  $m_t \gets \mu m_{t-1} + g_t$
  $\theta_t \gets \theta_{t-1} - \eta m_t$

[2] present adaptive subgradient descent (AdaGrad), which divides the $\eta$ of every step by the $L_2$ norm of all previous gradients; this slows down learning along dimensions that have already changed significantly and speeds up learning along dimensions that have only changed slightly, stabilizing the model's representation of common features and allowing it to rapidly catch up its representation of rare features.

Algorithm 4 AdaGrad
  $g_t \gets \nabla_{\theta_{t-1}} f(\theta_{t-1})$
  $n_t \gets n_{t-1} + g_t^2$
  $\theta_t \gets \theta_{t-1} - \eta \frac{g_t}{\sqrt{n_t} + \epsilon}$

One notable problem with AdaGrad is that the norm vector $n$ eventually becomes so large that training slows to a halt, preventing the model from reaching the local minimum; [16] go on to motivate RMSProp, an alternative to AdaGrad that replaces the sum in $n_t$ with a decaying mean, parameterized here by $\nu$.
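A minimal NumPy sketch of the AdaGrad update (illustrative only; variable names are mine) makes the problem plain: n only ever grows, so the effective step size shrinks monotonically.

    import numpy as np

    def adagrad_step(theta, n, g, lr=0.01, eps=1e-8):
        # n accumulates squared gradients and never shrinks,
        # so the per-dimension step size decays over training.
        n = n + g**2
        theta = theta - lr * g / (np.sqrt(n) + eps)
        return theta, n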

This allows the model to continue to learn.

Algorithm 5 RMSProp
  $g_t \gets \nabla_{\theta_{t-1}} f(\theta_{t-1})$
  $n_t \gets \nu n_{t-1} + (1 - \nu) g_t^2$
  $\theta_t \gets \theta_{t-1} - \eta \frac{g_t}{\sqrt{n_t} + \epsilon}$

Combination

One might ask if combining the momentum-based and norm-based methods might provide the advantages of both. In fact, [5] successfully do so¹ with adaptive moment estimation (Adam), combining classical momentum (using a decaying mean instead of a decaying sum) with RMSProp to improve performance on a number of benchmarks. In their algorithm, they include initialization bias correction terms, which offset some of the instability that initializing $m$ and $n$ to $0$ can create.

¹ Most implementations of this kind of algorithm include an $\epsilon$ parameter to keep the denominator from being too small and resulting in an irrecoverably large step.

Algorithm 6 Adam
  $g_t \gets \nabla_{\theta_{t-1}} f(\theta_{t-1})$
  $m_t \gets \mu m_{t-1} + (1 - \mu) g_t$
  $\hat{m}_t \gets \frac{m_t}{1 - \mu^t}$
  $n_t \gets \nu n_{t-1} + (1 - \nu) g_t^2$
  $\hat{n}_t \gets \frac{n_t}{1 - \nu^t}$
  $\theta_t \gets \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{n}_t} + \epsilon}$

[5] also include an algorithm, AdaMax, that replaces the $L_2$ norm with the $L_\infty$ norm, removing the need for $\hat{n}_t$ and replacing $n_t$ and $\theta_t$ with the following updates:

  $n_t \gets \max(\nu n_{t-1}, |g_t|)$
  $\theta_t \gets \theta_{t-1} - \eta \frac{\hat{m}_t}{n_t + \epsilon}$

We can generalize this to RMSProp as well, using the $L_\infty$ norm in the denominator instead of the $L_2$ norm, giving what might be called MaxaProp.

NAG revisited

Adam combines RMSProp with classical momentum.
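The Adam update can be sketched as follows; this is an illustrative paraphrase of Algorithm 6 rather than reference code, and the default constants are placeholders (the experiments below use $\nu = .999$ and $\epsilon = 10^{-8}$). The AdaMax variant is noted in a comment.

    import numpy as np

    def adam_step(theta, m, n, g, t, lr=0.001, mu=0.9, nu=0.999, eps=1e-8):
        m = mu * m + (1 - mu) * g        # decaying mean of gradients
        n = nu * n + (1 - nu) * g**2     # decaying mean of squared gradients
        m_hat = m / (1 - mu**t)          # initialization bias corrections
        n_hat = n / (1 - nu**t)
        theta = theta - lr * m_hat / (np.sqrt(n_hat) + eps)
        return theta, m, n

    # AdaMax variant: n = np.maximum(nu * n, np.abs(g)) and
    # theta = theta - lr * m_hat / (n + eps), with no bias correction on n.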

But as [14] show, NAG is in general superior to classical momentum, so how would we go about modifying Adam to use NAG instead? First, we rewrite the NAG algorithm to be more straightforward and efficient to implement, at the cost of some intuitive readability. Momentum is most effective with a warming schedule, so for completeness we parameterize $\mu$ by $t$ as well.

Algorithm 7 NAG rewritten
  $g_t \gets \nabla_{\theta_{t-1}} f(\theta_{t-1})$
  $m_t \gets \mu_t m_{t-1} + g_t$
  $\bar{m}_t \gets g_t + \mu_{t+1} m_t$
  $\theta_t \gets \theta_{t-1} - \eta \bar{m}_t$

Here, the vector $\bar{m}_t$ contains the gradient update for the current timestep, $g_t$, in addition to the momentum vector update for the next timestep, $\mu_{t+1} m_t$, which needs to be applied before taking the gradient at the next timestep.
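Algorithm 7 translates directly into the following sketch; it is illustrative only, and mu_schedule is a hypothetical callable returning $\mu_t$ for a given timestep.

    def nag_rewritten_step(theta, m, g, t, mu_schedule, lr=0.01):
        mu_t, mu_next = mu_schedule(t), mu_schedule(t + 1)
        m = mu_t * m + g             # momentum vector for this timestep
        m_bar = g + mu_next * m      # current gradient + next step's momentum
        return theta - lr * m_bar, m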

We don't need to apply the momentum vector for the current timestep anymore because we already applied it in the last update of the parameters, at timestep $t-1$.

3 Applying NAG to Adam

Ignoring the initialization bias correction terms for the moment, Adam's update rule can be written in terms of the previous momentum/norm vectors and the current gradient update as in (3):

  $\theta_t \gets \theta_{t-1} - \eta \frac{\mu m_{t-1}}{\sqrt{\nu n_{t-1} + (1-\nu) g_t^2} + \epsilon} - \eta \frac{(1-\mu) g_t}{\sqrt{\nu n_{t-1} + (1-\nu) g_t^2} + \epsilon}$  (3)

In rewritten NAG, we would take the first part of the step and apply it before taking the gradient of the cost function $f$; however, the denominator depends on $g_t$, so we can't take advantage of the trick used in NAG for this equation.

However, $\nu$ is generally chosen to be very large (normally $> .9$), so the difference between $n_{t-1}$ and $n_t$ will in general be very small. We can then replace $n_t$ with $n_{t-1}$ in the first term without losing too much accuracy:

  $\theta_t \gets \theta_{t-1} - \eta \frac{\mu m_{t-1}}{\sqrt{\nu n_{t-1}} + \epsilon} - \eta \frac{(1-\mu) g_t}{\sqrt{\nu n_{t-1} + (1-\nu) g_t^2} + \epsilon}$  (4)

The first term in the expression in (4) no longer depends on $g_t$, meaning that here we can use the Nesterov trick; this gives us the following expressions for $\bar{m}_t$ and $\theta_t$:

  $\bar{m}_t \gets (1 - \mu_t) g_t + \mu_{t+1} m_t$
  $\theta_t \gets \theta_{t-1} - \eta \frac{\bar{m}_t}{\sqrt{n_t} + \epsilon}$

All that's left is to determine how to include the initialization bias correction terms, taking into consideration that $g_t$ comes from the current timestep but $m_t$ comes from the subsequent timestep. This gives us the following, final form of the Nesterov-accelerated adaptive moment estimation (Nadam) algorithm.
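Before the bias corrections are reintroduced, the update just derived can be sketched as follows; this is my own illustrative rendering, with mu_schedule again a hypothetical momentum-schedule callable.

    import numpy as np

    def nadam_step_no_bias_correction(theta, m, n, g, t, mu_schedule,
                                      lr=0.001, nu=0.999, eps=1e-8):
        mu_t, mu_next = mu_schedule(t), mu_schedule(t + 1)
        m = mu_t * m + (1 - mu_t) * g
        n = nu * n + (1 - nu) * g**2
        m_bar = (1 - mu_t) * g + mu_next * m   # Nesterov trick applied to Adam
        theta = theta - lr * m_bar / (np.sqrt(n) + eps)
        return theta, m, n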

AdaMax can make use of the Nesterov acceleration trick identically (NadaMax).

Algorithm 8 Nesterov-accelerated adaptive moment estimation
  $g_t \gets \nabla_{\theta_{t-1}} f(\theta_{t-1})$
  $\hat{g}_t \gets \frac{g_t}{1 - \prod_{i=1}^{t} \mu_i}$
  $m_t \gets \mu_t m_{t-1} + (1 - \mu_t) g_t$
  $\hat{m}_t \gets \frac{m_t}{1 - \prod_{i=1}^{t+1} \mu_i}$
  $n_t \gets \nu n_{t-1} + (1 - \nu) g_t^2$
  $\hat{n}_t \gets \frac{n_t}{1 - \nu^t}$
  $\bar{m}_t \gets (1 - \mu_t) \hat{g}_t + \mu_{t+1} \hat{m}_t$
  $\theta_t \gets \theta_{t-1} - \eta \frac{\bar{m}_t}{\sqrt{\hat{n}_t} + \epsilon}$

4 Experiments

To test this algorithm, we compared the performance of nine algorithms (GD, Momentum, NAG, RMSProp, Adam, Nadam, MaxaProp, AdaMax, and NadaMax) on three benchmarks: word2vec [10], MNIST image classification [7], and LSTM language models [17]. All algorithms used $\nu = .999$ and $\epsilon = 10^{-8}$ as suggested in [5], with a momentum schedule given by $\mu_t = \mu\,(1 - .5 \times .96^{t/250})$ with $\mu = .99$, similar to the recommendation in [14].
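As a concrete illustration (not the authors' code), the momentum schedule just described and Algorithm 8 can be combined as follows; the running product over the $\mu_i$ needed for the bias corrections is threaded through explicitly.

    import numpy as np

    def mu_schedule(t, mu=0.99):
        # mu_t = mu * (1 - 0.5 * 0.96^(t / 250)), as stated above
        return mu * (1.0 - 0.5 * 0.96 ** (t / 250.0))

    def nadam_step(theta, m, n, g, t, mu_prod_prev,
                   lr=0.001, nu=0.999, eps=1e-8):
        # mu_prod_prev is prod_{i=1}^{t-1} mu_i (pass 1.0 at t = 1).
        mu_t, mu_next = mu_schedule(t), mu_schedule(t + 1)
        mu_prod_t = mu_prod_prev * mu_t        # prod_{i=1}^{t} mu_i
        mu_prod_next = mu_prod_t * mu_next     # prod_{i=1}^{t+1} mu_i
        g_hat = g / (1 - mu_prod_t)
        m = mu_t * m + (1 - mu_t) * g
        m_hat = m / (1 - mu_prod_next)
        n = nu * n + (1 - nu) * g**2
        n_hat = n / (1 - nu**t)
        m_bar = (1 - mu_t) * g_hat + mu_next * m_hat
        theta = theta - lr * m_bar / (np.sqrt(n_hat) + eps)
        return theta, m, n, mu_prod_t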

Only the learning rate differed across algorithms and experiments. The algorithms were all coded using Google's TensorFlow [1] API and the experiments were done using the built-in TensorFlow models, making only small edits to the defaults; algorithms used initialization bias correction where applicable.

Word2Vec

Word2vec [10] word embeddings were trained using each of the nine algorithms. Approximately 100MB of cleaned text from Wikipedia were used as the source text, and any word not in the top 50,000 words was replaced with UNK. 128-dimensional vectors with a left and right context size of 1 were trained using noise-contrastive estimation with 64 negative samples.
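The vocabulary restriction described above can be sketched as follows; this is an illustrative preprocessing snippet, not the code used in the experiments.

    from collections import Counter

    def restrict_vocab(tokens, vocab_size=50000, unk="UNK"):
        # Keep the vocab_size most frequent words and map the rest to UNK.
        keep = {w for w, _ in Counter(tokens).most_common(vocab_size)}
        return [w if w in keep else unk for w in tokens]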

