
Alternating Direction Method of Multipliers


Prof. S. Boyd, EE364b, Stanford University

Source: Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers (Boyd, Parikh, Chu, Peleato, Eckstein)

Goals

- robust methods for arbitrary-scale optimization: machine learning/statistics with huge data sets; dynamic optimization on a large-scale network
- decentralized optimization: devices/processors/agents coordinate to solve a large problem by passing relatively small messages

Outline

- Dual decomposition
- Method of multipliers
- Alternating direction method of multipliers
- Common patterns
- Examples
- Consensus and exchange
- Conclusions

Dual decomposition

Dual problem

- convex equality constrained optimization problem:

      minimize    f(x)
      subject to  Ax = b

- Lagrangian: L(x, y) = f(x) + y^T (Ax - b)
- dual function: g(y) = inf_x L(x, y)
- dual problem: maximize g(y)
- recover x* = argmin_x L(x, y*)
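As a small worked illustration (not in the slides), take f(x) = (1/2)\|x\|_2^2; then the minimization defining the dual function has a closed form:

      \nabla_x L(x, y) = x + A^T y = 0  \;\Rightarrow\;  \tilde x = -A^T y,
      g(y) = L(\tilde x, y) = (1/2)\|A^T y\|_2^2 - y^T (A A^T y + b) = -(1/2)\|A^T y\|_2^2 - b^T y,

so the dual problem is the unconstrained concave maximization of g(y), and (assuming A has full row rank) x* = -A^T y* recovers the primal solution.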

Dual ascent

- gradient method for the dual problem: y^{k+1} = y^k + \alpha^k \nabla g(y^k)
- \nabla g(y^k) = A \tilde x - b, where \tilde x = argmin_x L(x, y^k)
- dual ascent method:

      x^{k+1} := argmin_x L(x, y^k)                 // x-minimization
      y^{k+1} := y^k + \alpha^k (A x^{k+1} - b)     // dual update

- works, with lots of strong assumptions

Dual decomposition

- suppose f is separable: f(x) = f_1(x_1) + ... + f_N(x_N), x = (x_1, ..., x_N)
- then L is separable in x: L(x, y) = L_1(x_1, y) + ... + L_N(x_N, y) - y^T b, with L_i(x_i, y) = f_i(x_i) + y^T A_i x_i
- x-minimization in dual ascent splits into N separate minimizations

      x_i^{k+1} := argmin_{x_i} L_i(x_i, y^k)

  which can be carried out in parallel

Dual decomposition

- dual decomposition (Everett, Dantzig, Wolfe, Benders 1960-65):

      x_i^{k+1} := argmin_{x_i} L_i(x_i, y^k),   i = 1, ..., N
      y^{k+1}   := y^k + \alpha^k ( \sum_{i=1}^N A_i x_i^{k+1} - b )

- scatter y^k; update the x_i in parallel; gather the A_i x_i^{k+1}
- solve a large problem by iteratively solving subproblems (in parallel); the dual variable update provides coordination
- works, with lots of assumptions; often slow
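A minimal sketch of dual decomposition, assuming separable quadratic terms f_i(x_i) = (1/2)\|x_i - c_i\|_2^2 so that each x_i-update has a closed form; the problem data and step-size rule are illustrative assumptions, not from the slides.

    import numpy as np

    # Toy dual decomposition: minimize sum_i (1/2)||x_i - c_i||^2  s.t.  sum_i A_i x_i = b.
    # Each L_i(x_i, y) = (1/2)||x_i - c_i||^2 + y^T A_i x_i is minimized in closed form:
    # x_i = c_i - A_i^T y.  Sizes and data below are made up for illustration.
    rng = np.random.default_rng(0)
    N, n, m = 4, 3, 2                          # blocks, block size, number of coupling constraints
    A = [rng.standard_normal((m, n)) for _ in range(N)]
    c = [rng.standard_normal(n) for _ in range(N)]
    b = rng.standard_normal(m)

    G = sum(Ai @ Ai.T for Ai in A)             # dual objective is a concave quadratic with Hessian -G
    alpha = 1.0 / np.linalg.eigvalsh(G)[-1]    # safe fixed step size for gradient ascent
    y = np.zeros(m)
    for k in range(500):
        x = [c[i] - A[i].T @ y for i in range(N)]       # x_i-updates, independent -> parallelizable
        r = sum(A[i] @ x[i] for i in range(N)) - b      # gather residual of coupling constraint
        y = y + alpha * r                               # dual (price) update
    print("constraint residual:", np.linalg.norm(r))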

Method of multipliers

- a method to robustify dual ascent
- use the augmented Lagrangian (Hestenes, Powell 1969), with \rho > 0:

      L_\rho(x, y) = f(x) + y^T (Ax - b) + (\rho/2) \|Ax - b\|_2^2

- method of multipliers (Hestenes, Powell; analysis in Bertsekas 1982):

      x^{k+1} := argmin_x L_\rho(x, y^k)
      y^{k+1} := y^k + \rho (A x^{k+1} - b)

  (note the specific dual update step length \rho)

Method of multipliers dual update step

- optimality conditions (for differentiable f): Ax* - b = 0, \nabla f(x*) + A^T y* = 0 (primal and dual feasibility)
- since x^{k+1} minimizes L_\rho(x, y^k),

      0 = \nabla_x L_\rho(x^{k+1}, y^k)
        = \nabla f(x^{k+1}) + A^T ( y^k + \rho (A x^{k+1} - b) )
        = \nabla f(x^{k+1}) + A^T y^{k+1}

- so the dual update y^{k+1} = y^k + \rho (A x^{k+1} - b) makes (x^{k+1}, y^{k+1}) dual feasible
- primal feasibility is achieved in the limit: A x^{k+1} - b -> 0

Method of multipliers (compared to dual decomposition)

- good news: converges under much more relaxed conditions (f can be nondifferentiable, take on the value +\infty, ...)
- bad news: the quadratic penalty destroys splitting of the x-update, so we can't do decomposition
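A minimal numerical sketch of the method of multipliers for an equality-constrained quadratic f(x) = (1/2) x^T P x + q^T x, assuming NumPy and made-up problem data; the x-update is computed exactly by solving \nabla_x L_\rho = 0.

    import numpy as np

    # Method of multipliers for: minimize (1/2) x^T P x + q^T x  s.t.  Ax = b.
    # The x-update solves grad_x L_rho = 0, i.e. (P + rho A^T A) x = -q - A^T y + rho A^T b.
    # P, q, A, b and rho are illustrative assumptions.
    rng = np.random.default_rng(1)
    n, m = 10, 4
    M = rng.standard_normal((n, n)); P = M.T @ M + np.eye(n)    # positive definite
    q = rng.standard_normal(n)
    A = rng.standard_normal((m, n)); b = rng.standard_normal(m)

    rho, y = 1.0, np.zeros(m)
    L = np.linalg.cholesky(P + rho * A.T @ A)                   # cache factorization (rho fixed)
    for k in range(100):
        rhs = -q - A.T @ y + rho * A.T @ b
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))       # x-update
        y = y + rho * (A @ x - b)                               # dual update with step length rho
    print("||Ax - b|| =", np.linalg.norm(A @ x - b))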

Alternating direction method of multipliers

- a method with the good robustness of the method of multipliers, which can support decomposition
- "robust dual decomposition" or "decomposable method of multipliers"
- proposed by Gabay, Mercier, Glowinski, Marrocco in 1976

ADMM problem form (with f, g convex)

      minimize    f(x) + g(z)
      subject to  Ax + Bz = c

- two sets of variables, with separable objective
- augmented Lagrangian:

      L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + (\rho/2) \|Ax + Bz - c\|_2^2

- ADMM:

      x^{k+1} := argmin_x L_\rho(x, z^k, y^k)              // x-minimization
      z^{k+1} := argmin_z L_\rho(x^{k+1}, z, y^k)          // z-minimization
      y^{k+1} := y^k + \rho (A x^{k+1} + B z^{k+1} - c)    // dual update

- if we minimized over x and z jointly, this would reduce to the method of multipliers
- instead, we do one pass of a Gauss-Seidel method
- we get splitting since we minimize over x with z fixed, and vice versa

ADMM and optimality conditions

- optimality conditions (for the differentiable case):
  - primal feasibility: Ax + Bz - c = 0
  - dual feasibility: \nabla f(x) + A^T y = 0, \nabla g(z) + B^T y = 0
- since z^{k+1} minimizes L_\rho(x^{k+1}, z, y^k), we have

      0 = \nabla g(z^{k+1}) + B^T y^k + \rho B^T (A x^{k+1} + B z^{k+1} - c)
        = \nabla g(z^{k+1}) + B^T y^{k+1}

- so with the ADMM dual variable update, (x^{k+1}, z^{k+1}, y^{k+1}) satisfies the second dual feasibility condition
- primal feasibility and the first dual feasibility condition are achieved as k -> \infty

ADMM with scaled dual variables

- combine the linear and quadratic terms in the augmented Lagrangian:

      L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + (\rho/2) \|Ax + Bz - c\|_2^2
                      = f(x) + g(z) + (\rho/2) \|Ax + Bz - c + u\|_2^2 + const.

  with scaled dual variable u = (1/\rho) y
- ADMM (scaled dual form):

      x^{k+1} := argmin_x ( f(x) + (\rho/2) \|A x + B z^k - c + u^k\|_2^2 )
      z^{k+1} := argmin_z ( g(z) + (\rho/2) \|A x^{k+1} + B z - c + u^k\|_2^2 )
      u^{k+1} := u^k + (A x^{k+1} + B z^{k+1} - c)
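Here is a minimal sketch of the scaled-form iteration for the common special case A = I, B = -I, c = 0 (minimize f(x) + g(z) subject to x = z), written in terms of user-supplied proximal operators; the function names and the small test problem are assumptions for illustration.

    import numpy as np

    def admm_scaled(prox_f, prox_g, n, rho=1.0, iters=200):
        """Scaled-form ADMM for: minimize f(x) + g(z)  s.t.  x - z = 0.
        prox_f(v, rho) should return argmin_x f(x) + (rho/2)||x - v||^2, likewise prox_g."""
        x = z = u = np.zeros(n)
        for _ in range(iters):
            x = prox_f(z - u, rho)          # x-update
            z = prox_g(x + u, rho)          # z-update
            u = u + (x - z)                 # scaled dual update
        return x, z

    # Example use: f(x) = (1/2)||x - a||^2, g = indicator of the nonnegative orthant.
    a = np.array([1.0, -2.0, 0.5])
    prox_f = lambda v, rho: (a + rho * v) / (1.0 + rho)   # closed-form prox of f
    prox_g = lambda v, rho: np.maximum(v, 0.0)            # projection onto C = R^n_+
    x, z = admm_scaled(prox_f, prox_g, a.size)
    print(x, z)   # both approach the projection of a onto R^n_+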

Convergence

- assume (very little!):
  - f, g convex, closed, proper
  - L_0 has a saddle point
- then ADMM converges:
  - iterates approach feasibility: A x^k + B z^k - c -> 0
  - objective approaches the optimal value: f(x^k) + g(z^k) -> p*

Related algorithms

- operator splitting methods (Douglas, Peaceman, Rachford, Lions, Mercier, ... 1950s, 1979)
- proximal point algorithm (Rockafellar 1976)
- Dykstra's alternating projections algorithm (1983)
- Spingarn's method of partial inverses (1985)
- Rockafellar-Wets progressive hedging (1991)
- proximal methods (Rockafellar, many others, 1976-present)
- Bregman iterative methods (2008-present)
- most of these are special cases of the proximal point algorithm

Common patterns

- the x-update step requires minimizing f(x) + (\rho/2) \|Ax - v\|_2^2 (with v = -B z^k + c - u^k, which is constant during the x-update)
- similar for the z-update
- several special cases come up often; the update can be simplified by exploiting structure in these cases

Decomposition

- suppose f is block-separable: f(x) = f_1(x_1) + ... + f_N(x_N), x = (x_1, ..., x_N)
- and A is conformably block-separable, i.e., A^T A is block diagonal
- then the x-update splits into N parallel updates of the x_i

Proximal operator

- consider the x-update when A = I:

      x^+ = argmin_x ( f(x) + (\rho/2) \|x - v\|_2^2 ) = prox_{f,\rho}(v)

- some special cases:
  - f = I_C (indicator function of a set C):   x^+ := \Pi_C(v)   (projection onto C)
  - f = \lambda \|\cdot\|_1 (\ell_1 norm):   x_i^+ := S_{\lambda/\rho}(v_i)   (soft thresholding), where S_a(v) = (v - a)_+ - (-v - a)_+

Quadratic objective

- f(x) = (1/2) x^T P x + q^T x + r
- x^+ := (P + \rho A^T A)^{-1} (\rho A^T v - q)
- use the matrix inversion lemma when computationally advantageous:

      (P + \rho A^T A)^{-1} = P^{-1} - \rho P^{-1} A^T (I + \rho A P^{-1} A^T)^{-1} A P^{-1}

- (direct method) cache the factorization of P + \rho A^T A (or I + \rho A P^{-1} A^T)
- (iterative method) warm start, early stopping, reducing tolerances

Smooth objective

- f smooth: can use standard methods for smooth minimization
  - gradient, Newton, or quasi-Newton
  - preconditioned CG, limited-memory BFGS (scale to very large problems)
- can exploit warm start, and early stopping with tolerances decreasing as ADMM proceeds
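A small sketch of two of these patterns, assuming NumPy and illustrative problem data: the soft-thresholding prox for the \ell_1 norm, and a quadratic x-update with a cached Cholesky factorization.

    import numpy as np

    # Soft thresholding S_a(v) = (v - a)_+ - (-v - a)_+, the prox of a*||.||_1.
    def soft_threshold(v, a):
        return np.maximum(v - a, 0.0) - np.maximum(-v - a, 0.0)

    # Quadratic x-update x^+ = (P + rho A^T A)^{-1} (rho A^T v - q), with the
    # factorization cached so repeated updates (v changes, rho fixed) are cheap.
    # Sizes and data are illustrative assumptions.
    rng = np.random.default_rng(2)
    n, m, rho = 50, 20, 1.0
    P = np.eye(n); q = rng.standard_normal(n)
    A = rng.standard_normal((m, n))
    L = np.linalg.cholesky(P + rho * A.T @ A)     # factor once, reuse every iteration

    def x_update(v):
        rhs = rho * A.T @ v - q
        return np.linalg.solve(L.T, np.linalg.solve(L, rhs))

    print(soft_threshold(np.array([-2.0, 0.3, 1.5]), 1.0))   # -> [-1.  0.  0.5]
    print(x_update(rng.standard_normal(m)).shape)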

Examples

Constrained convex optimization

- consider ADMM for the generic problem

      minimize    f(x)
      subject to  x \in C

- ADMM form: take g to be the indicator function of C

      minimize    f(x) + g(z)
      subject to  x - z = 0

- algorithm:

      x^{k+1} := argmin_x ( f(x) + (\rho/2) \|x - z^k + u^k\|_2^2 )
      z^{k+1} := \Pi_C(x^{k+1} + u^k)
      u^{k+1} := u^k + x^{k+1} - z^{k+1}

Lasso

- lasso problem:

      minimize  (1/2) \|Ax - b\|_2^2 + \lambda \|x\|_1

- ADMM form:

      minimize    (1/2) \|Ax - b\|_2^2 + \lambda \|z\|_1
      subject to  x - z = 0

- ADMM:

      x^{k+1} := (A^T A + \rho I)^{-1} (A^T b + \rho z^k - y^k)
      z^{k+1} := S_{\lambda/\rho}(x^{k+1} + y^k/\rho)
      y^{k+1} := y^k + \rho (x^{k+1} - z^{k+1})

Lasso example

- example with dense A \in R^{1500 x 5000} (1500 measurements; 5000 regressors)
- computation times: factorization (same as ridge regression); lasso solve (about 50 ADMM iterations); regularization path (30 values of \lambda)
- not bad for a very short Matlab script
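A compact sketch of the lasso iteration above, with made-up problem data; as noted under common patterns, the factorization of A^T A + \rho I is computed once and reused.

    import numpy as np

    # ADMM for the lasso: minimize (1/2)||Ax - b||^2 + lam*||x||_1.
    # Problem sizes, lam, and rho are illustrative assumptions.
    rng = np.random.default_rng(3)
    m, n = 150, 500
    A = rng.standard_normal((m, n))
    x_true = np.zeros(n); x_true[:10] = rng.standard_normal(10)
    b = A @ x_true + 0.01 * rng.standard_normal(m)
    lam, rho = 0.1 * np.max(np.abs(A.T @ b)), 1.0

    def soft(v, a):                        # soft thresholding S_a(v)
        return np.maximum(v - a, 0.0) - np.maximum(-v - a, 0.0)

    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))    # cache factorization
    Atb = A.T @ b
    x = z = y = np.zeros(n)
    for k in range(200):
        rhs = Atb + rho * z - y
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))    # x-update
        z = soft(x + y / rho, lam / rho)                     # z-update
        y = y + rho * (x - z)                                # dual update
    print("nonzeros in z:", np.count_nonzero(np.abs(z) > 1e-6))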

Sparse inverse covariance selection

- S: empirical covariance of samples from N(0, \Sigma), with \Sigma^{-1} sparse (i.e., a Gaussian Markov random field)
- estimate \Sigma^{-1} via \ell_1-regularized maximum likelihood:

      minimize  Tr(SX) - \log\det X + \lambda \|X\|_1

- methods: COVSEL (Banerjee et al. 2008), graphical lasso (FHT 2008)

Sparse inverse covariance selection via ADMM

- ADMM form:

      minimize    Tr(SX) - \log\det X + \lambda \|Z\|_1
      subject to  X - Z = 0

- ADMM:

      X^{k+1} := argmin_X ( Tr(SX) - \log\det X + (\rho/2) \|X - Z^k + U^k\|_F^2 )
      Z^{k+1} := S_{\lambda/\rho}(X^{k+1} + U^k)
      U^{k+1} := U^k + (X^{k+1} - Z^{k+1})

Analytical solution for the X-update

- compute the eigendecomposition \rho (Z^k - U^k) - S = Q \Lambda Q^T
- form the diagonal matrix \tilde X with

      \tilde X_{ii} = ( \lambda_i + \sqrt{\lambda_i^2 + 4\rho} ) / (2\rho)

- let X^{k+1} := Q \tilde X Q^T
- the cost of the X-update is an eigendecomposition

Sparse inverse covariance selection example

- \Sigma^{-1} is 1000 x 1000 with 10^4 nonzeros
- graphical lasso (Fortran): 20 seconds - 3 minutes
- ADMM (Matlab): 3 - 10 minutes (depends on the choice of \rho)
- very rough experiment, but with no special tuning, ADMM is in the ballpark of recent specialized methods
- (for comparison, COVSEL takes 25+ minutes when \Sigma^{-1} is a 400 x 400 tridiagonal matrix)
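A sketch of this analytical X-update, assuming symmetric NumPy arrays S, Z, U of matching size and a hypothetical helper name x_update; the final line checks the optimality condition \rho(X - Z + U) + S - X^{-1} = 0 on made-up data.

    import numpy as np

    def x_update(S, Z, U, rho):
        """Analytical ADMM X-update for sparse inverse covariance selection:
        argmin_X  Tr(SX) - logdet(X) + (rho/2)||X - Z + U||_F^2."""
        lam, Q = np.linalg.eigh(rho * (Z - U) - S)            # eigendecomposition
        x_diag = (lam + np.sqrt(lam ** 2 + 4.0 * rho)) / (2.0 * rho)
        return (Q * x_diag) @ Q.T                             # Q diag(x_diag) Q^T

    # quick check on illustrative data
    rng = np.random.default_rng(4)
    n, rho = 5, 1.0
    M = rng.standard_normal((n, n)); S = M @ M.T / n
    Z = np.eye(n); U = np.zeros((n, n))
    X = x_update(S, Z, U, rho)
    print(np.linalg.norm(rho * (X - Z + U) + S - np.linalg.inv(X)))   # ~0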

Consensus and exchange

Consensus optimization

- want to solve a problem with N objective terms:

      minimize  \sum_{i=1}^N f_i(x)

  e.g., f_i is the loss function for the i-th block of training data
- ADMM form:

      minimize    \sum_{i=1}^N f_i(x_i)
      subject to  x_i - z = 0

- the x_i are local variables; z is the global variable
- x_i - z = 0 are the consistency or consensus constraints
- can add regularization using a g(z) term

Consensus optimization via ADMM

- L_\rho(x, z, y) = \sum_{i=1}^N ( f_i(x_i) + y_i^T (x_i - z) + (\rho/2) \|x_i - z\|_2^2 )
- ADMM:

      x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y_i^{kT} (x_i - z^k) + (\rho/2) \|x_i - z^k\|_2^2 )
      z^{k+1}   := (1/N) \sum_{i=1}^N ( x_i^{k+1} + (1/\rho) y_i^k )
      y_i^{k+1} := y_i^k + \rho (x_i^{k+1} - z^{k+1})

- with regularization, the averaging in the z-update is followed by prox_{g,\rho}

Consensus optimization via ADMM

- using \sum_{i=1}^N y_i^k = 0, the algorithm simplifies to

      x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y_i^{kT} (x_i - \bar x^k) + (\rho/2) \|x_i - \bar x^k\|_2^2 )
      y_i^{k+1} := y_i^k + \rho (x_i^{k+1} - \bar x^{k+1})

  where \bar x^k = (1/N) \sum_{i=1}^N x_i^k
- in each iteration:
  - gather the x_i^k and average to get \bar x^k
  - scatter the average \bar x^k to the processors
  - update y_i^k locally (in each processor, in parallel)
  - update x_i locally

Statistical interpretation

- f_i is the negative log-likelihood for parameter x given the i-th data block
- x_i^{k+1} is the MAP estimate under the prior N(\bar x^k + (1/\rho) y_i^k, \rho I)
- the prior mean is the previous iteration's consensus, shifted by the "price" of processor i disagreeing with the previous consensus
- processors only need to support a Gaussian MAP method; the type or number of data in each block is not relevant
- the consensus protocol yields the global maximum-likelihood estimate
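A minimal sketch of the simplified consensus iteration for quadratic per-block losses f_i(x) = (1/2)\|A_i x - b_i\|_2^2 (so the x_i-update is a linear solve); the data split, sizes, and \rho are illustrative assumptions.

    import numpy as np

    # Consensus ADMM for: minimize sum_i (1/2)||A_i x - b_i||^2, data split into N blocks.
    # Each x_i-update solves (A_i^T A_i + rho I) x_i = A_i^T b_i - y_i + rho*xbar.
    rng = np.random.default_rng(5)
    N, m_i, n, rho = 5, 30, 8, 1.0
    A = [rng.standard_normal((m_i, n)) for _ in range(N)]
    b = [rng.standard_normal(m_i) for _ in range(N)]

    x = np.zeros((N, n)); y = np.zeros((N, n)); xbar = np.zeros(n)
    for k in range(100):
        for i in range(N):                                   # local updates (parallelizable)
            x[i] = np.linalg.solve(A[i].T @ A[i] + rho * np.eye(n),
                                   A[i].T @ b[i] - y[i] + rho * xbar)
        xbar = x.mean(axis=0)                                # gather and average
        y += rho * (x - xbar)                                # local dual updates
    # compare with the centralized least-squares solution
    x_ls, *_ = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)
    print(np.linalg.norm(xbar - x_ls))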

Consensus classification

- data (examples) (a_i, b_i), i = 1, ..., N, with a_i \in R^n, b_i \in \{-1, +1\}
- linear classifier sign(a^T w + v), with weight w, offset v
- margin for the i-th example is b_i (a_i^T w + v); want the margin to be positive
- loss for the i-th example is l(b_i (a_i^T w + v)), where l is the loss function (hinge, logistic, probit, exponential, ...)
- choose w, v to minimize

      (1/N) \sum_{i=1}^N l( b_i (a_i^T w + v) ) + r(w)

- r(w) is a regularization term (\ell_2, \ell_1, ...)
- split the data and use ADMM consensus to solve

Consensus SVM example

- hinge loss l(u) = (1 - u)_+ with \ell_2 regularization
- baby problem with n = 2, N = 400 to illustrate
- examples split into 20 groups, in the worst possible way: each group contains only positive or only negative examples

[Figures: consensus classifier after iteration 1, iteration 5, and iteration 40]

Distributed lasso example

- example with dense A \in R^{400000 x 8000} (roughly 30 GB of data)
- distributed solver written in C using MPI and GSL; no optimization or tuned libraries (like ATLAS, MKL)
- split into 80 subsystems across 10 (8-core) machines on Amazon EC2
- computation times: loading data 30 s; factorization 5 m; subsequent ADMM iterations 2 s; lasso solve (about 15 ADMM iterations) 5-6 m

Exchange problem

      minimize    \sum_{i=1}^N f_i(x_i)
      subject to  \sum_{i=1}^N x_i = 0

- another canonical problem, like consensus; in fact, it is the dual of consensus
- can interpret as N agents exchanging n goods to minimize a total cost
- (x_i)_j >= 0 means agent i receives (x_i)_j of good j from the exchange

