Transcription of Gradient Descent - CMU Statistics
1 Gradient DescentRyan TibshiraniConvex Optimization 10-725 Last time: canonical convex programs Linear program (LP): takes the formminxcTxsubject toDx dAx=b Quadratic program (QP): like LP, but with quadratic criterion Semidefinite program (SDP): like LP, but with matrices Conic program: the most general form of all2 Gradient descentConsider unconstrained, smooth convex optimizationminxf(x)That is,fis convex and differentiable withdom(f) =Rn. Denoteoptimal criterion value byf?= minxf(x), and a solution byx? Gradient Descent : choose initial pointx(0) Rn, repeat:x(k)=x(k 1) tk f(x(k 1)), k= 1,2,3,..Stop at some point3lllll4lllll5 Gradient Descent interpretationAt each iteration, consider the expansionf(y) f(x) + f(x)T(y x) +12t y x 22 Quadratic approximation, replacing usual Hessian 2f(x)by1tIf(x) + f(x)T(y x)linear approximation tof12t y x 22proximity term tox, with weight1/(2t)Choose next pointy=x+to minimize quadratic approximation:x+=x t f(x)6llBlue point isx, red point isx+= argminyf(x) + f(x)T(y x) +12t y x 227 OutlineToday: How to choose step sizes Convergence analysis Nonconvex functions Gradient boosting8 Fixed step sizeSimply taketk=tfor allk= 1,2,3.
2 , can diverge iftis too (x) = (10x21+x22)/2, Gradient Descent after 8 steps: 20 1001020 20 1001020lll*9 Can be slow iftis too small. Same example, Gradient Descent after100 steps: 20 1001020 20 1001020lllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllll*10 Converges nicely whentis just right . Same example, 40 steps: 20 1001020 20 1001020lllllllllllllllllllllllllllllllll llllll*Convergence analysis later will give us a precise idea of just right 11 Backtracking line searchOne way to adaptively choose the step size is to use backtrackingline search: First fix parameters0< <1and0< 1/2 At each iteration, start witht=tinit, and whilef(x t f(x))> f(x) t f(x) 22shrinkt= t. Else perform Gradient Descent updatex+=x t f(x)Simple and tends to work well in practice (further simplification:just take = 1/2)12 Backtracking Descent methods465tf(x+t x)t=0t0f(x)+ t f(x)T xf(x)+t f(x)T xFigure line curve showsf,restrictedtothelineover which we search.
3 The lower dashed line shows the linear extrapolationoff, and the upper dashed line has a slope a factor of smaller. Thebacktracking condition is thatflies below the upper dashed line, ,0 t line search is called backtracking because it starts withunitstepsizeandthen reduces it by the factor until the stopping conditionf(x+t x) f(x)+ t f(x)T xholds. Since xis a Descent direction, we have f(x)T x<0, sofor small enoughtwe havef(x+t x) f(x)+t f(x)T x<f(x)+ t f(x)T x,which shows that the backtracking line search eventually terminates. The constant can be interpreted as the fraction of the decrease infpredicted by linear extrap-olation that we will accept. (The reason for requiring to be smaller than clear later.)The backtracking condition is illustrated in figure Thisfiguresuggests,and it can be shown, that the backtracking exit inequalityf(x+t x) f(x)+ t f(x)T xholds fort 0inaninterval(0,t0].)
4 It follows that the backtrackingline search stops with a step lengthtthat satisfiest=1,ort ( t0,t0].The first case occurs when the step lengtht=1satisfiesthebacktrackingconditi on, ,1 ,wecansaythatthesteplengthobtainedbyback trackingline search satisfiest min{1, t0}.Whendomfis not all ofRn,theconditionf(x+t x) f(x)+ t f(x)T xin the backtracking line search must be interpreted carefully. By our conventionthatfis infinite outside its domain, the inequality implies thatx+t x a practical implementation, we first multiplytby untilx+t x domf;For us x= f(x)13 Setting = = , backtracking picks up roughly the right stepsize (12 outer steps, 40 steps total), 20 1001020 20 1001020llllllllllll*14 Exact line searchWe could also choose step to do the best we can along direction ofnegative Gradient , called exact line search:t= argmins 0f(x s f(x))Usually not possible to do this minimization exactlyApproximations to exact line search are typically not as efficient asbacktracking, and it s typically not worth it15 Convergence analysisAssume thatfconvex and differentiable, withdom(f) =Rn, andadditionally that fis Lipschitz continuous with constantL >0, f(x) f(y) 2 L x y 2for anyx,y(Or when twice differentiable: 2f(x) LI)Theorem: Gradient Descent with fixed step sizet 1/Lsatisfiesf(x(k)) f?)
5 X(0) x? 222tkand same result holds for backtracking, withtreplaced by /LWe say Gradient Descent has convergence rateO(1/k). That is, itfinds -suboptimal point inO(1/ )iterations16 Analysis for strong convexityReminder: strong convexity offmeansf(x) m2 x 22is convexfor somem >0(when twice differentiable: 2f(x) mI)Assuming Lipschitz Gradient as before, and also strong convexity:Theorem: Gradient Descent with fixed step sizet 2/(m+L)or with backtracking line search search satisfiesf(x(k)) f? kL2 x(0) x? 22where0< <1 Rate under strong convexity isO( k), exponentially fast! That is,it finds -suboptimal point inO(log(1/ ))iterations17 Called linear convergenceObjective versus iterationcurve looks linear on semi-log Descent method473kf(x(k)) p exact 410 2100102104 Figure (x(k)) p versus iterationkfor the Gradient method withbacktracking and exact line search, for a problem experiments suggest that the effect of the backtrackingparametersontheconvergence is not large, no more than a factor of two or method and condition numberOur last experiment will illustrate the importance of the condition number of 2f(x)(orthesublevelsets)ontherateofconve rgenceofthegradient start with the function given by ( ), but replace the variablexbyx=T x,whereT=diag((1, 1/n, 2/n.))
6 , (n 1)/n)), ,weminimize f( x)=cTT x m i=1log(bi aTiT x).( )This gives us a family of optimization problems, indexed by ,whichaffects theproblem condition shows the number of iterations required to achieve f( x(k)) p <10 5as a function of ,usingabacktrackinglinesearchwith = = Thisplot shows that for diagonal scaling as small as 10 : 1 ( , =10),thenumberofiterations grows to more than a thousand; for a diagonal scaling of 20 or more, thegradient method slows to essentially condition number of the Hessian 2 f( x )attheoptimumisshowninfigure For large and small ,theconditionnumberincreasesroughlyasmax { 2,1/ 2},inaverysimilarwayasthenumberofiterati onsdependson .This shows again that the relation between conditioning andconvergence speed isarealphenomenon,andnotjustanartifactof ouranalysis.(From B & V page 487)Important note: =O(1 m/L).
7 Thus we can write convergencerate asO(Lmlog(1/ ))Higher condition numberL/m slower rate. This is not only trueof in theory .. very apparent in practice too18A look at the conditionsA look at the conditions for a simple problem,f( ) =12 y X 22 Lipschitz continuity of f: Recall this means 2f(x) LI As 2f( ) =XTX, we haveL= max(XTX)Strong convexity off: Recall this means 2f(x) mI As 2f( ) =XTX, we havem= min(XTX) IfXis wide (Xisn pwithp > n), then min(XTX) = 0,andfcan t be strongly convex Even if min(X)>0, can have a very large condition numberL/m= max(XTX)/ min(XTX)19 PracticalitiesStopping rule: stop when f(x) 2is small Recall f(x?) = 0at solutionx? Iffis strongly convex with parameterm, then f(x) 2 2m = f(x) f? Pros and cons of Gradient Descent : Pro: simple idea, and each iteration is cheap (usually) Pro: fast for well-conditioned, strongly convex problems Con: can often be slow, because many interesting problemsaren t strongly convex or well-conditioned Con: can t handle nondifferentiable functions20 Can we do better?
8 Gradient Descent hasO(1/ )convergence rate over problem classof convex, differentiable functions with Lipschitz gradientsFirst-order method: iterative method, which updatesx(k)inx(0)+ span{ f(x(0)), f(x(1)),.. f(x(k 1))}Theorem (Nesterov):For anyk (n 1)/2and any startingpointx(0), there is a functionfin the problem class such thatany first-order method satisfiesf(x(k)) f? 3L x(0) x? 2232(k+ 1)2 Can attain rateO(1/k2), orO(1/ )? Answer: yes (we ll see)!21 Analysis for nonconvex caseAssumefis differentiable with Lipschitz Gradient , now for optimality is too much. Let s settle for a -substationarypointx, which means f(x) 2 Theorem: Gradient Descent with fixed step sizet 1/Lsatisfiesmini=0,..,k f(x(i)) 2 2(f(x(0)) f?)t(k+ 1)Thus Gradient Descent has rateO(1/ k), orO(1/ 2), even in thenonconvex case for finding stationary pointsThis rate cannot be improved (over class of differentiable functionswith Lipschitz gradients) by any deterministic algorithm11 Carmon et al.
9 (2017), Lower bounds for finding stationary points I 22 Gradient boosting23 Given responsesyi Rand featuresxi Rp,i= 1,..,nWant to construct a flexible (nonlinear) model for response basedon features. Weighted sum of trees:ui=m j=1 j Tj(xi), i= 1,..,nEach treeTjinputsxi, outputs predicted response. Typically treesare pretty a loss functionLto reflect setting. For continuous responses, , could takeL(yi,ui) = (yi ui)2 Want to solvemin n i=1L(yi,M j=1 j Tj(xi))Indexes all trees of a fixed size ( , depth = 5), soMis is simply too big to optimizeGradient boosting: basically a version of Gradient Descent that isforced to work with treesFirst think of optimization asminuf(u), over predicted valuesu,subject toucoming from trees25 Start with initial model, a single treeu(0)=T0. Repeat: Compute negative gradientdat latest predictionu(k 1),di= [ L(yi,ui) ui] ui=u(k 1)i, i= 1.
10 ,n Find a treeTkthat is close toa, , according tomintreesTn i=1(di T(xi))2 Not hard to (approximately) solve for a single tree Compute step size k, and update our prediction:u(k)=u(k 1)+ k TkNote: predictions are weighted sums of trees, as desired26 References and further reading S. Boyd and L. Vandenberghe (2004), Convex optimization ,Chapter 9 T. Hastie, R. tibshirani and J. Friedman (2009), Theelements of statistical learning , Chapters 10 and 16 Y. Nesterov (1998), Introductory lectures on convexoptimization: a basic course , Chapter 2 L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring2011-201227