Adam: A Method for Stochastic Optimization

Published as a conference paper at ICLR 2015 ADAM: A Method FORSTOCHASTICOPTIMIZATIOND iederik P. Kingma*University of Amsterdam, Lei Ba University of introduceAdam, an algorithm for first-order gradient-based Optimization ofstochastic objective functions, based on adaptive estimates of lower-order mo-ments. The Method is straightforward to implement, is computationally efficient,has little memory requirements, is invariant to diagonal rescaling of the gradients,and is well suited for problems that are large in terms of data and/or Method is also appropriate for non-stationary objectives and problems withvery noisy and/or sparse gradients. The hyper-parameters have intuitive interpre-tations and typically require little tuning. Some connections to related algorithms,on whichAdamwas inspired, are discussed.

We also analyze the theoretical con-vergence properties of the algorithm and provide a regret bound on the conver-gence rate that is comparable to the best known results under the online convexoptimization framework. Empirical results demonstrate that Adam works well inpractice and compares favorably to other Stochastic Optimization methods. Finally,we discussAdaMax, a variant ofAdambased on the infinity gradient-based Optimization is of core practical importance in many fields of science andengineering. Many problems in these fields can be cast as the Optimization of some scalar parameter-ized objective function requiring maximization or minimization with respect to its parameters. If thefunction is differentiable its parameters, gradient descent is a relatively efficient optimizationmethod, since the computation of first-order partial derivatives all the parameters is of the samecomputational complexity as just evaluating the function.

Often, objective functions are example, many objective functions are composed of a sum of subfunctions evaluated at differentsubsamples of data; in this case Optimization can be made more efficient by taking gradient individual subfunctions, Stochastic gradient descent (SGD) or ascent . SGD proved itselfas an efficient and effective Optimization Method that was central in many machine learning successstories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton& Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013). Objectives may also have othersources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization. Forall such noisy objectives, efficient Stochastic Optimization techniques are required. The focus of thispaper is on the Optimization of Stochastic objectives with high-dimensional parameters spaces.

Inthese cases, higher-order Optimization methods are ill-suited, and discussion in this paper will berestricted to first-order proposeAdam, a Method for efficient Stochastic Optimization that only requires first-order gra-dients with little memory requirement. The Method computes individual adaptive learning rates fordifferent parameters from estimates of first and second moments of the gradients; the nameAdamis derived from adaptive moment estimation. Our Method is designed to combine the advantagesof two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gra-dients, and RMSProp (Tieleman & Hinton, 2012), which works well in on-line and non-stationarysettings; important connections to these and other Stochastic Optimization methods are clarified insection 5. Some of Adam s advantages are that the magnitudes of parameter updates are invariant torescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter,it does not require a stationary objective, it works with sparse gradients, and it naturally performs aform of step size annealing.

Equal contribution. Author ordering determined by coin flip over a Google [ ] 30 Jan 2017 Published as a conference paper at ICLR 2015 Algorithm 1:Adam, our proposed algorithm for Stochastic Optimization . See section 2 for details,and for a slightly more efficient (but less clear) order of the elementwisesquaregt gt. Good default settings for the tested machine learning problems are = , 1= , 2= = 10 8. All operations on vectors are element-wise. With t1and t2we denote 1and 2to the : : StepsizeRequire: 1, 2 [0,1): Exponential decay rates for the moment estimatesRequire:f( ): Stochastic objective function with parameters Require: 0: Initial parameter vectorm0 0(Initialize 1stmoment vector)v0 0(Initialize 2ndmoment vector)t 0(Initialize timestep)while tnot convergeddot t+ 1gt ft( t 1)(Get gradients Stochastic objective at timestept)mt 1 mt 1+ (1 1) gt(Update biased first moment estimate)vt 2 vt 1+ (1 2) g2t(Update biased second raw moment estimate) mt mt/(1 t1)(Compute bias-corrected first moment estimate) vt vt/(1 t2)(Compute bias-corrected second raw moment estimate) t t 1 mt/( vt+ )(Update parameters)end whilereturn t(Resulting parameters)In section 2 we describe the algorithm and the properties of its update rule.]

Section 3 explainsour initialization bias correction technique, and section 4 provides a theoretical analysis of Adam sconvergence in online convex programming. Empirically, our Method consistently outperforms othermethods for a variety of models and datasets, as shown in section 6. Overall, we show that Adam isa versatile algorithm that scales to large-scale high-dimensional machine learning algorithm 1 for pseudo-code of our proposed algorithmAdam. Letf( )be a noisy objec-tive function: a Stochastic scalar function that is differentiable parameters . We are in-terested in minimizing the expected value of this function,E[f( )] its parameters . Withf1( ),..,,fT( )we denote the realisations of the Stochastic function at subsequent timesteps1,..,T. The stochasticity might come from the evaluation at random subsamples (minibatches)of datapoints, or arise from inherent function noise.

Withgt= ft( )we denote the gradient, vector of partial derivatives offt, evaluated at algorithm updates exponential moving averages of the gradient (mt) and the squared gradient(vt) where the hyper-parameters 1, 2 [0,1)control the exponential decay rates of these movingaverages. The moving averages themselves are estimates of the 1stmoment (the mean) and the2ndraw moment (the uncentered variance) of the gradient. However, these moving averages areinitialized as (vectors of) 0 s, leading to moment estimates that are biased towards zero, especiallyduring the initial timesteps, and especially when the decay rates are small ( the s are close to 1).The good news is that this initialization bias can be easily counteracted, resulting in bias-correctedestimates mtand vt. See section 3 for more that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changingthe order of computation, by replacing the last three lines in the loop with the following lines: t= 1 t2/(1 t1)and t t 1 t mt/( vt+ ).]

S UPDATE RULEAn important property of Adam s update rule is its careful choice of stepsizes. Assuming = 0, theeffective step taken in parameter space at timesteptis t= mt/ vt. The effective stepsize hastwo upper bounds:| t| (1 1)/ 1 2in the case(1 1)> 1 2, and| t| 2 Published as a conference paper at ICLR 2015otherwise. The first case only happens in the most severe case of sparsity: when a gradient hasbeen zero at all timesteps except at the current timestep. For less sparse cases, the effective stepsizewill be smaller. When(1 1) = 1 2we have that| mt/ vt|<1therefore| t|< . Inmore common scenarios, we will have that mt/ vt 1since|E[g]/ E[g2]| 1. The effectivemagnitude of the steps taken in parameter space at each timestep are approximately bounded bythe stepsize setting , ,| t|/.

This can be understood as establishing atrust regionaroundthe current parameter value, beyond which the current gradient estimate does not provide sufficientinformation. This typically makes it relatively easy to know the right scale of in advance. Formany machine learning models, for instance, we often know in advance that good optima are withhigh probability within some set region in parameter space; it is not uncommon, for example, tohave a prior distribution over the parameters. Since sets (an upper bound of) the magnitude ofsteps in parameter space, we can often deduce the right order of magnitude of such that optimacan be reached from 0within some number of iterations. With a slight abuse of terminology,we will call the ratio mt/ vtthesignal-to-noiseratio (SNR). With a smaller SNR the effectivestepsize twill be closer to zero.

This is a desirable property, since a smaller SNR means thatthere is greater uncertainty about whether the direction of mtcorresponds to the direction of the truegradient. For example, the SNR value typically becomes closer to 0 towards an optimum, leadingto smaller effective steps in parameter space: a form of automatic annealing. The effective stepsize tis also invariant to the scale of the gradients; rescaling the gradientsgwith factorcwill scale mtwith a factorcand vtwith a factorc2, which cancel out:(c mt)/( c2 vt) = mt/ BIAS CORRECTIONAs explained in section 2, Adam utilizes initialization bias correction terms. We will here derivethe term for the second moment estimate; the derivation for the first moment estimate is completelyanalogous. Letgbe the gradient of the Stochastic objectivef, and we wish to estimate its secondraw moment (uncentered variance) using an exponential moving average of the squared gradient,with decay rate 2.

Adam: A Method for Stochastic Optimization

Tags:

Information

Advertisement

Transcription of Adam: A Method for Stochastic Optimization

Related search queries

Adam: A Method for Stochastic Optimization

Tags:

Information

Advertisement

Documents from same domain

Related documents

Related search queries