Gradient Descent - CMU Statistics

Gradient DescentRyan TibshiraniConvex Optimization 10-725 Last time: canonical convex programs Linear program (LP): takes the formminxcTxsubject toDx dAx=b Quadratic program (QP): like LP, but with quadratic criterion Semidefinite program (SDP): like LP, but with matrices Conic program: the most general form of all2 Gradient descentConsider unconstrained, smooth convex optimizationminxf(x)That is,fis convex and differentiable withdom(f) =Rn. Denoteoptimal criterion value byf?= minxf(x), and a solution byx? Gradient Descent : choose initial pointx(0) Rn, repeat:x(k)=x(k 1) tk f(x(k 1)), k= 1,2,3.

Stop at some point3lllll4lllll5 Gradient Descent interpretationAt each iteration, consider the expansionf(y) f(x) + f(x)T(y x) +12t y x 22 Quadratic approximation, replacing usual Hessian 2f(x)by1tIf(x) + f(x)T(y x)linear approximation tof12t y x 22proximity term tox, with weight1/(2t)Choose next pointy=x+to minimize quadratic approximation:x+=x t f(x)6llBlue point isx, red point isx+= argminyf(x) + f(x)T(y x) +12t y x 227 OutlineToday: How to choose step sizes Convergence analysis Nonconvex functions Gradient boosting8 Fixed step sizeSimply taketk=tfor allk= 1,2,3.

, can diverge iftis too (x) = (10x21+x22)/2, Gradient Descent after 8 steps: 20 1001020 20 1001020lll*9 Can be slow iftis too small. Same example, Gradient Descent after100 steps: 20 1001020 20 1001020lllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllll*10 Converges nicely whentis just right . Same example, 40 steps: 20 1001020 20 1001020lllllllllllllllllllllllllllllllll llllll*Convergence analysis later will give us a precise idea of just right 11 Backtracking line searchOne way to adaptively choose the step size is to use backtrackingline search: First fix parameters0< <1and0< 1/2 At each iteration, start witht=tinit, and whilef(x t f(x))> f(x) t f(x) 22shrinkt= t.

Else perform Gradient Descent updatex+=x t f(x)Simple and tends to work well in practice (further simplification:just take = 1/2)12 Backtracking Descent methods465tf(x+t x)t=0t0f(x)+ t f(x)T xf(x)+t f(x)T xFigure line curve showsf,restrictedtothelineover which we search. The lower dashed line shows the linear extrapolationoff, and the upper dashed line has a slope a factor of smaller. Thebacktracking condition is thatflies below the upper dashed line, ,0 t line search is called backtracking because it starts withunitstepsizeandthen reduces it by the factor until the stopping conditionf(x+t x) f(x)+ t f(x)T xholds.

Since xis a Descent direction, we have f(x)T x<0, sofor small enoughtwe havef(x+t x) f(x)+t f(x)T x<f(x)+ t f(x)T x,which shows that the backtracking line search eventually terminates. The constant can be interpreted as the fraction of the decrease infpredicted by linear extrap-olation that we will accept. (The reason for requiring to be smaller than clear later.)The backtracking condition is illustrated in figure Thisfiguresuggests,and it can be shown, that the backtracking exit inequalityf(x+t x) f(x)+ t f(x)T xholds fort 0inaninterval(0,t0]. It follows that the backtrackingline search stops with a step lengthtthat satisfiest=1,ort ( t0,t0].))

The first case occurs when the step lengtht=1satisfiesthebacktrackingconditi on, ,1 ,wecansaythatthesteplengthobtainedbyback trackingline search satisfiest min{1, t0}.Whendomfis not all ofRn,theconditionf(x+t x) f(x)+ t f(x)T xin the backtracking line search must be interpreted carefully. By our conventionthatfis infinite outside its domain, the inequality implies thatx+t x a practical implementation, we first multiplytby untilx+t x domf;For us x= f(x)13 Setting = = , backtracking picks up roughly the right stepsize (12 outer steps, 40 steps total), 20 1001020 20 1001020llllllllllll*14 Exact line searchWe could also choose step to do the best we can along direction ofnegative Gradient , called exact line search.

T= argmins 0f(x s f(x))Usually not possible to do this minimization exactlyApproximations to exact line search are typically not as efficient asbacktracking, and it s typically not worth it15 Convergence analysisAssume thatfconvex and differentiable, withdom(f) =Rn, andadditionally that fis Lipschitz continuous with constantL >0, f(x) f(y) 2 L x y 2for anyx,y(Or when twice differentiable: 2f(x) LI)Theorem: Gradient Descent with fixed step sizet 1/Lsatisfiesf(x(k)) f? x(0) x? 222tkand same result holds for backtracking, withtreplaced by /LWe say Gradient Descent has convergence rateO(1/k).

That is, itfinds -suboptimal point inO(1/ )iterations16 Analysis for strong convexityReminder: strong convexity offmeansf(x) m2 x 22is convexfor somem >0(when twice differentiable: 2f(x) mI)Assuming Lipschitz Gradient as before, and also strong convexity:Theorem: Gradient Descent with fixed step sizet 2/(m+L)or with backtracking line search search satisfiesf(x(k)) f? kL2 x(0) x? 22where0< <1 Rate under strong convexity isO( k), exponentially fast! That is,it finds -suboptimal point inO(log(1/ ))iterations17 Called linear convergenceObjective versus iterationcurve looks linear on semi-log Descent method473kf(x(k)) p exact 410 2100102104 Figure (x(k))

P versus iterationkfor the Gradient method withbacktracking and exact line search, for a problem experiments suggest that the effect of the backtrackingparametersontheconvergence is not large, no more than a factor of two or method and condition numberOur last experiment will illustrate the importance of the condition number of 2f(x)(orthesublevelsets)ontherateofconve rgenceofthegradient start with the function given by ( ), but replace the variablexbyx=T x,whereT=diag((1, 1/n, 2/n,.., (n 1)/n)), ,weminimize f( x)=cTT x m i=1log(bi aTiT x).

( )This gives us a family of optimization problems, indexed by ,whichaffects theproblem condition shows the number of iterations required to achieve f( x(k)) p <10 5as a function of ,usingabacktrackinglinesearchwith = = Thisplot shows that for diagonal scaling as small as 10 : 1 ( , =10),thenumberofiterations grows to more than a thousand; for a diagonal scaling of 20 or more, thegradient method slows to essentially condition number of the Hessian 2 f( x )attheoptimumisshowninfigure For large and small ,theconditionnumberincreasesroughlyasmax { 2,1/ 2},inaverysimilarwayasthenumberofiterati onsdependson.

Gradient Descent - CMU Statistics

Tags:

Information

Advertisement

Transcription of Gradient Descent - CMU Statistics

Related search queries

Gradient Descent - CMU Statistics

Tags:

Information

Advertisement

Documents from same domain

Related documents

Related search queries