Transcription of Subgradient Methods
1 Subgradient MethodsStephenBoyd, Lin Xiao,and AlmirMutapcicNotesfor EE392o,StanfordUniversity, Autumn,2003 October 1, 2003 Thesubgradientmethodis a simplealgorithmfor minimizinga nondi looks very much like the ordinarygradient method for di erentiablefunctions,but withseveral example,the Subgradient method usessteplengthsthatare xedaheadof time,insteadof an exactor approximateline search asin the gradient method. Unlike the ordinarygradient method, the Subgradient method isnota descent method; the functionvaluecan (andoftendoes) method is far slower thanNewton'smethod, but is much simplerandcan be appliedto a far widervariety of combiningthe Subgradient methodwithprimalor dualdecompositiontechniques,it is sometimespossibleto developa simpledistributedalgorithmfor a method was originallydeveloped by Shorin the SovietUnionin Subgradient Methods is his book [Sho85].
2 Anotherbook onthe topicis Akgul[Akg84]. Bertsekas [Ber99]is a good referenceon the Subgradient method,combinedwithprimalor Thesubgradient methodSupposef:Rn!Ris convex. To minimizef, the Subgradient method usesthe iterationx(k+1)=x(k) kg(k):Herex(k)is thekth iterate,g(k)isanysubgradient offatx(k), and k>0 is thekth , at each iterationof the Subgradient method, we take a stepin the directionofa negative Subgradient . Recallthata Subgradient offatxis any vectorgthatsatis esthe inequalityf(y) f(x) +gT(y x) for ally. Whenfis di erentiable,the onlypossiblechoiceforg(k)isrf(x(k)), and the Subgradient method thenreducesto the gradient method(except,as we'll see below, for the choiceof stepsize).Sincethe Subgradient method is not a descent method, it is commonto keep track of thebest point foundso far, , the one each step,we setf(k)best= minff(k 1)best; f(x(k))g;1and seti(k)best=kiff(x(k)) =f(k)best, , ifx(k)is the best point foundso far.
3 (In a descentmethod thereis no needto do this | the current point is always the best one so far.)Thenwe havef(k)best= minff(x(1)); : : : ; f(x(k))g; ,f(k)bestis the best objective (k)bestis decreasing,it hasa limit(which can be 1). (For convenienceof laternotations,we label the initialpointwith1 insteadof 0.) rulesSeveral di erent types of stepsize rulesare used. Constantstep size. k=his a constant, independent ofk. Constantstep length. k=h=kg(k)k2. Thismeansthatkx(k+1) x(k)k2=h. Square summablebut not 2k<1;1Xk=1 k=1:Onetypicalexampleis k=a=(b+k), wherea >0 andb 0. !1 k= 0;1Xk=1 k=1:Stepsizesthatsatisfythis conditionare calleddiminishingstep size rules. A typicalexampleis k=a=pk, wherea > many resultson convergenceof the Subgradient method. For constant stepsizeandconstant steplength,the Subgradient algorithmis guaranteedto convergeto withinsomerangeof the optimalvalue, , we havelimk!
4 1f(k)best f?< ;wheref?denotesthe optimalvalueof the problem, ,f?= infxf(x). (Thisimpliesthatthe Subgradient method ndsan -suboptimalpoint withina nitenumber of steps.)Thenumber is a functionof the stepsize parameterh, and the diminishingstepsizerule(andthereforealso the squaresummablebut notsummablestepsize rule),the algorithmis guaranteedto convergeto the optimalvalue, ,we have limk!1f(x(k)) =f?.Whenthe functionfis di erentiable,we can say a bit moreabout the case,the Subgradient method withconstant stepsize yieldsconvergenceto the optimalvalue,providedthe parameterhis ConvergenceproofHerewe give a proof of sometypicalconvergenceresultsfor the Subgradient method. Weassumethatthereis a minimizeroff, sayx?. We alsomake one otherassumptiononf:We will assumethatthe normof the subgradients is bounded, , thereis aGsuch thatkg(k)k2 Gfor allk.
5 Thiswill be the caseif, for example,fsatis esthe Lipschitz conditionjf(u) f(v)j Gku vk2;for allu; v, becausethenkgk2 Gfor anyg2@f(x), and anyx. In fact,somevariants ofthe Subgradient method work whenthis assumptiondoesn'thold;see [Sho85].Recallthatfor the standardgradient descent method, the convergenceproof is basedonthe functionvaluedecreasingat each the Subgradient method, the key quantity isnot the functionvalue(which oftenincreases);it is theEuclidean distance to the a point thatminimizesf, , it is an arbitraryoptimalpoint. We havekx(k+1) x?k22=kx(k) kg(k) x?k22=kx(k) x?k22 2 kg(k)T(x(k) x?) + 2kkg(k)k22 kx(k) x?k22 2 k(f(x(k)) f?) + 2kkg(k)k22;wheref?=f(x?). Thelast line follows fromthe de nitionof Subgradient , which givesf(x?) f(x(k)) +g(k)T(x? x(k)):Applyingthe inequality above recursively, we havekx(k+1) x?k22 kx(1) x?
6 K22 2kXi=1 i(f(x(i)) f?) +kXi=1 2ikg(i)k22:Usingkx(k+1) x?k22 0 we have2kXi=1 i(f(x(i)) f?) kx(1) x?k22+kXi=1 2ikg(i)k22:Combiningthis withkXi=1 i(f(x(i)) f?) kXi=1 i!mini=1;:::;k(f(x(i)) f?);we have the inequalityf(k)best f?= mini=1;:::;kf(x(i)) f? kx(1) x?k22+Pki=1 2ikg(i)k222 Pki=1 i:(1)Finally, usingthe assumptionkg(k)k2 G, we obtainthe basicinequalityf(k)best f?= mini=1;:::;kf(x(i)) f? kx(1) x?k22+G2 Pki=1 2i2 Pki=1 i:(2)3 Fromthis inequality we can reado any minimizeroff, we can statethatf(k)best f? dist(x(1); X?)2+G2 Pki=1 2i2 Pki=1 i;whereX?denotesthe optimalset, anddist(x(1); X?) is the (Euclidean)distanceofx(1)tothe k=h, we havef(k)best f? dist(x(1); X?)2+G2h2k2hk:Therighthandsideconvergest oG2h=2 ask! 1. Thus, for the Subgradient method with xedstepsizeh,f(k)bestthatconvergesto withinG2h=2 of can also say, for example,thatwe havef(x(k)) f?
7 G2hwithina nitenumber ofsteps.(Indeed,we can easilygive a speci cupper boundon the number of stepsneeded.)Constant k=h=kg(k)k2, theninequality (1) becomesf(k)best f? dist(x(1); X?)2+h2k2 Pki=1 i:By assmpution,we have k h=G. Applyingthis to the denominatorof the above inequalitygivesf(k)best f? dist(x(1); X?)2+h2k2hk=G:Therighthandsideconvergest oGh=2 ask! 1, so in thiscasethe Subgradient methodconvergesto withinGh=2 of not supposek k22=1Xk=1 2k<1;1Xk=1 k=1:Thenwe havef(k)best f? dist(x(1); X?)2+G2k k222 Pki=1 i;which convergesto zeroask! 1. In otherwords,the Subgradient method converges(inthe sensef(k)best!f?).4 Diminishingstepsize the sequence kconvergesto zeroand is nonsummable,thenthe righthandside of the inequality (2) convergesto zero,which impliesthe subgradientmethod show this,let >0. Thenthereexistsan integerN1such that i =G2, for alli > N1.
8 Therealsoexistsan integerN2such thatkXi=1 i 1 0@kx(1) x?k22+G2N1Xi=1 2i1 Afor allk > N2(becauseP1i=1 i=1). LetN= maxfN1; N2g. Thenfor allk > N, we havemini=1;:::;kf(x(i)) f? kx(1) x?k22+G2PN1i=1 2i2 Pki=1 i+G2 Pki=N1+1 2i2PN1i=1 i+ 2 Pki=N1+1 i kx(1) x?k22+G2PN1i=1 2i2 kx(1) x?k22+G2PN1i=1 2i +G2 Pki=N1+1 G2 i2 Pki=N1+1 i= 2+ 2= :3 Projectedsubgradient methodOneextensionof the Subgradient method is theprojected subgradientmethod, which solvesthe constrainedconvex optimizationproblemminimizef(x)subject tox2C;whereCis a convex set. Theprojectedsubgradient method is given byx(k+1)=P x(k) kg(k) ;wherePis (Euclidean)projectiononC, andg(k)is any Subgradient offatx(k). Thestepsize rulesdescribed beforecan be usedhere, for the Subgradient method are readilyextendedto handletheprojectedsubgradient method. Letz(k+1)=x(k) kg(k), , a standardsubgradient update,beforethe projectionback ontoC.
9 As in the Subgradient method, we havekz(k+1) x?k22=kx(k) kg(k) x?k22=kx(k) x?k22 2 kg(k)T(x(k) x?) + 2kkg(k)k22 kx(k) x?k22 2 k(f(x(k)) f?) + 2kkg(k)k22:Now we observe thatkx(k+1) x?k2=kP(z(k+1)) x?k2 kz(k+1) x?k2; , whenwe project a point ontoC, we move closerto every point inC, and in particular,any optimalpoint. Combiningthis withthe inequality above we getkx(k+1) x?k22 kx(k) x?k22 2 k(f(x(k)) f?) + 2kkg(k)k22;and the proof proceedsexactlyas in the ordinarysubgradient methodThisproof reveals thatthe method still workseven whenPis not the thatmattersis thatPshouldsatisfyP(u)2C, and the inequalitykP(u) zk2 ku zk2;for anyz2C. In otherwords,Pmust returna point inCthatis no fartherfromzthanuis. In otherwords,the operatorPcan be anapproximateprojectionoperator; it onlyneedsto not increasethe distanceto the whenCcan be expressedasC1 C2 Ck=C;whereCiare convex sets,and we have a simplemethod for computingPi(z), the EuclideanprojectiononCi, providedz2 Ci 1.
10 Thenwe can constructan approximateprojectionoperatoronCasP(u) =Pk Pk 1 P1(u); , by rstprojectinguontoC1, thenprojectingthe resultontoC2, and so on. To see thatthis satis esthe requiredcondition,letz2C. Sincez2C1, we havekP1(u) zk2 ku zk2:Sincez2C2, we havekP2(P1(u)) zk2 kP1(u) zk2;which, combinedwiththe inequality above yieldskP2(P1(u)) zk2 ku zk2:Continuingin this way we ndthatkP(u) zk2 ku PiecewiselinearminimizationOur rstexampleis minimizinga piecewiselinearconvex function:minimizef(x) = maxi=1;:::;m(aTix+bi):Of coursethis problemis readily(ande ciently) solved via offis easy:givenx, we rst ndan indexjfor whichaTjx+bj= maxi=1;:::;m(aTix+bi):Thenwe can take as subgradientg=aj. Thesubgradient method updatehas the formx(k+1)=x(k) kaj;6wherejis chosenso thataTjx(k)+bj= maxi=1;:::;m(aTix(k)+bi).Notethatto applythe Subgradient method, all we needis a way to evaluatemaxi(aTix+bi)(alongwithan indexcorrespondingto one of the maximizers)and the ability to carryoutthe simpleupdateabove.