Convex Optimization: Algorithms and Complexity

Tags:

  Algorithm, Optimization

Transcription of Convex Optimization: Algorithms and Complexity

Foundations and Trends® in Machine Learning, Vol. 8, No. 3-4 (2015) 231–357
© 2015 S. Bubeck

Convex Optimization: Algorithms and Complexity
Sébastien Bubeck, Theory Group, Microsoft Research

Contents

1 Introduction
   Some convex optimization problems in machine learning
   Basic properties of convexity
   Why convexity?
   Black-box model
   Structured optimization
   Overview of the results and disclaimer
2 Convex optimization in finite dimension
   The center of gravity method
   The ellipsoid method
   Vaidya's cutting plane method
   Conjugate gradient
3 Dimension-free convex optimization
   Projected subgradient descent for Lipschitz functions
   Gradient descent for smooth functions
   Conditional gradient descent, aka Frank-Wolfe
   Strong convexity
   Lower bounds
   Geometric descent
   Nesterov's accelerated gradient descent
4 Almost dimension-free convex optimization in non-Euclidean spaces
   Mirror maps
   Mirror descent
   Standard setups for mirror descent
   Lazy mirror descent, aka Nesterov's dual averaging
   Mirror prox
   The vector field point of view on MD, DA, and MP
5 Beyond the black-box model
   Sum of a smooth and a simple non-smooth term
   Smooth saddle-point representation of a non-smooth function
   Interior point methods
6 Convex optimization and randomness
   Non-smooth stochastic optimization
   Smooth stochastic optimization and mini-batch SGD
   Sum of smooth and strongly convex functions
   Random coordinate descent
   Acceleration by randomization for saddle points
   Convex relaxation and randomized rounding
   Random walk based methods
Acknowledgements
References

Abstract

This monograph presents the main complexity theorems in convex optimization and their corresponding algorithms. Starting from the fundamental theory of black-box optimization, the material progresses towards recent advances in structural optimization and stochastic optimization. Our presentation of black-box optimization, strongly influenced by Nesterov's seminal book and Nemirovski's lecture notes, includes the analysis of cutting plane methods, as well as (accelerated) gradient descent schemes. We also pay special attention to non-Euclidean settings (relevant algorithms include Frank-Wolfe, mirror descent, and dual averaging) and discuss their relevance in machine learning. We provide a gentle introduction to structural optimization with FISTA (to optimize a sum of a smooth and a simple non-smooth term), saddle-point mirror prox (Nemirovski's alternative to Nesterov's smoothing), and a concise description of interior point methods. In stochastic optimization we discuss stochastic gradient descent, mini-batches, random coordinate descent, and sublinear algorithms. We also briefly touch upon convex relaxation of combinatorial problems and the use of randomness to round solutions, as well as random walk based methods.

S. Bubeck. Convex Optimization: Algorithms and Complexity. Foundations and Trends® in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015.

1 Introduction

The central objects of our study are convex functions and convex sets.

Definition (Convex sets and convex functions). A set $X \subseteq \mathbb{R}^n$ is said to be convex if it contains all of its segments, that is

$$\forall (x, y, \gamma) \in X \times X \times [0,1], \quad (1 - \gamma) x + \gamma y \in X.$$

A function $f : X \to \mathbb{R}$ is said to be convex if it always lies below its chords, that is

$$\forall (x, y, \gamma) \in X \times X \times [0,1], \quad f\big((1 - \gamma) x + \gamma y\big) \le (1 - \gamma) f(x) + \gamma f(y).$$
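As a quick numerical sanity check of the chord inequality, the following sketch (added here for illustration, assuming NumPy is available; the function and sample points are arbitrary choices) verifies that $f(x) = \|x\|_2^2$ lies below its chords.

```python
import numpy as np

# Illustrative check of the chord inequality for f(x) = ||x||_2^2,
# which is convex on all of R^n.
rng = np.random.default_rng(0)
f = lambda x: np.dot(x, x)

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    gamma = rng.uniform()
    lhs = f((1 - gamma) * x + gamma * y)      # f at a point of the segment
    rhs = (1 - gamma) * f(x) + gamma * f(y)   # corresponding value of the chord
    assert lhs <= rhs + 1e-12, "chord inequality violated"
print("chord inequality holds on all sampled pairs")
```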

We are interested in algorithms that take as input a convex set $X$ and a convex function $f$ and output an approximate minimum of $f$ over $X$. We write compactly the problem of finding the minimum of $f$ over $X$ as

$$\min. \; f(x) \quad \text{s.t. } x \in X.$$

In the following we will make more precise how the set of constraints $X$ and the objective function $f$ are specified to the algorithm. Before that we proceed to give a few important examples of convex optimization problems in machine learning.

Some convex optimization problems in machine learning

Many fundamental convex optimization problems in machine learning take the following form:

$$\min_{x \in \mathbb{R}^n} \; \sum_{i=1}^{m} f_i(x) + \lambda R(x),$$

where the functions $f_1, \dots, f_m, R$ are convex and $\lambda \ge 0$ is a fixed parameter. The interpretation is that $f_i(x)$ represents the cost of using $x$ on the $i$th element of some data set, and $R(x)$ is a regularization term which enforces some "simplicity" in $x$. We now discuss major instances of this template. In all cases one has a data set of the form $(w_i, y_i) \in \mathbb{R}^n \times \mathcal{Y}$, $i = 1, \dots, m$, and the cost function $f_i$ depends only on the pair $(w_i, y_i)$. We refer to Hastie et al. [2001], Schölkopf and Smola [2002], and Shalev-Shwartz and Ben-David [2014] for more details on the origin of these important problems. The mere objective of this section is to expose the reader to a few concrete convex optimization problems which are routinely solved.

In classification one has $\mathcal{Y} = \{-1, 1\}$. Taking $f_i(x) = \max(0, 1 - y_i x^{\top} w_i)$ (the so-called hinge loss) and $R(x) = \|x\|_2^2$ one obtains the SVM problem. On the other hand, taking $f_i(x) = \log(1 + \exp(-y_i x^{\top} w_i))$ (the logistic loss) and again $R(x) = \|x\|_2^2$ one obtains the (regularized) logistic regression problem.

In regression one has $\mathcal{Y} = \mathbb{R}$. Taking $f_i(x) = (x^{\top} w_i - y_i)^2$ and $R(x) = 0$ one obtains the vanilla least-squares problem, which can be rewritten in vector notation as

$$\min_{x \in \mathbb{R}^n} \; \|Wx - Y\|_2^2,$$

where $W \in \mathbb{R}^{m \times n}$ is the matrix with $w_i^{\top}$ on the $i$th row and $Y = (y_1, \dots, y_m)^{\top}$. With $R(x) = \|x\|_2^2$ one obtains the ridge regression problem, while with $R(x) = \|x\|_1$ this is the LASSO problem of Tibshirani [1996].
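To make the template above concrete, here is a small sketch (ours, not part of the original text; it assumes NumPy and uses made-up data) that evaluates the SVM and ridge regression objectives as instances of $\sum_i f_i(x) + \lambda R(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 50, 5, 0.1
W = rng.normal(size=(m, n))               # rows are the data points w_i
y_cls = rng.choice([-1.0, 1.0], size=m)   # labels for classification
y_reg = rng.normal(size=m)                # targets for regression

def svm_objective(x):
    # sum_i max(0, 1 - y_i <w_i, x>)  +  lam * ||x||_2^2   (hinge loss + L2)
    margins = y_cls * (W @ x)
    return np.maximum(0.0, 1.0 - margins).sum() + lam * np.dot(x, x)

def ridge_objective(x):
    # sum_i (<w_i, x> - y_i)^2  +  lam * ||x||_2^2
    residuals = W @ x - y_reg
    return np.dot(residuals, residuals) + lam * np.dot(x, x)

x0 = np.zeros(n)
print(svm_objective(x0), ridge_objective(x0))
```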

Our last two examples are of a slightly different flavor. In particular the design variable $x$ is now best viewed as a matrix, and thus we denote it by a capital letter $X$. The sparse inverse covariance estimation problem can be written as follows, given some empirical covariance matrix $Y$:

$$\min \; \mathrm{Tr}(XY) - \log\det(X) + \lambda \|X\|_1 \quad \text{s.t. } X \in \mathbb{R}^{n \times n}, \; X^{\top} = X, \; X \succeq 0.$$

Intuitively the above problem is simply a regularized maximum likelihood estimator (under a Gaussian assumption).

Finally we introduce the convex version of the matrix completion problem. Here our data set consists of observations of some of the entries of an unknown matrix $Y$, and we want to "complete" the unobserved entries of $Y$ in such a way that the resulting matrix is "simple" (in the sense that it has low rank). After some massaging (see Candès and Recht [2009]) the (convex) matrix completion problem can be formulated as follows:

$$\min \; \mathrm{Tr}(X) \quad \text{s.t. } X \in \mathbb{R}^{n \times n}, \; X^{\top} = X, \; X \succeq 0, \; X_{i,j} = Y_{i,j} \text{ for } (i,j) \in \Omega,$$

where $\Omega \subseteq [n]^2$ and $(Y_{i,j})_{(i,j) \in \Omega}$ are given.
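As an illustration (a sketch we add, assuming NumPy; the data are synthetic and the candidate point is an arbitrary choice), the next snippet evaluates the sparse inverse covariance objective $\mathrm{Tr}(XY) - \log\det(X) + \lambda\|X\|_1$ at a feasible symmetric positive definite $X$. Actually solving the program would require a dedicated convex solver, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 4, 0.05

# Synthetic empirical covariance matrix Y (symmetric positive semidefinite).
A = rng.normal(size=(20, n))
Y = A.T @ A / 20.0

def sparse_inv_cov_objective(X):
    # Tr(XY) - logdet(X) + lam * sum_ij |X_ij|, for symmetric X > 0.
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    return np.trace(X @ Y) - logdet + lam * np.abs(X).sum()

X_candidate = np.linalg.inv(Y + 0.1 * np.eye(n))   # an arbitrary feasible guess
print(sparse_inv_cov_objective(X_candidate))
```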

Basic properties of convexity

A basic result about convex sets that we shall use extensively is the Separation Theorem.

Theorem (Separation Theorem). Let $X \subseteq \mathbb{R}^n$ be a closed convex set, and $x_0 \in \mathbb{R}^n \setminus X$. Then there exist $w \in \mathbb{R}^n$ and $t \in \mathbb{R}$ such that

$$w^{\top} x_0 < t \quad \text{and} \quad \forall x \in X, \; w^{\top} x \ge t.$$

Note that if $X$ is not closed then one can only guarantee that $w^{\top} x_0 \le w^{\top} x$ for all $x \in X$ (and $w \neq 0$). This immediately implies the Supporting Hyperplane Theorem ($\partial X$ denotes the boundary of $X$, that is the closure without the interior):

Theorem (Supporting Hyperplane Theorem). Let $X \subseteq \mathbb{R}^n$ be a convex set, and $x_0 \in \partial X$. Then there exists $w \in \mathbb{R}^n$, $w \neq 0$, such that

$$\forall x \in X, \; w^{\top} x \ge w^{\top} x_0.$$

We now introduce the key notion of subgradients.

Definition (Subgradients). Let $X \subseteq \mathbb{R}^n$ and $f : X \to \mathbb{R}$. Then $g \in \mathbb{R}^n$ is a subgradient of $f$ at $x \in X$ if for any $y \in X$ one has

$$f(x) - f(y) \le g^{\top}(x - y).$$

The set of subgradients of $f$ at $x$ is denoted $\partial f(x)$.

To put it differently, for any $x \in X$ and $g \in \partial f(x)$, $f$ is above the linear function $y \mapsto f(x) + g^{\top}(y - x)$. The next result shows (essentially) that a convex function always admits subgradients.

Proposition (Existence of subgradients). Let $X \subseteq \mathbb{R}^n$ be convex and $f : X \to \mathbb{R}$. If $\forall x \in X, \, \partial f(x) \neq \emptyset$, then $f$ is convex. Conversely, if $f$ is convex then for any $x \in \mathrm{int}(X)$, $\partial f(x) \neq \emptyset$. Furthermore, if $f$ is convex and differentiable at $x$, then $\nabla f(x) \in \partial f(x)$.
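As a concrete numerical illustration of the subgradient definition (a sketch we add, assuming NumPy; the function and sample points are arbitrary choices): for the non-differentiable convex function $f(x) = |x|$, every $g \in [-1, 1]$ is a subgradient at $x = 0$, as the following check confirms.

```python
import numpy as np

# Subgradient check for f(x) = |x| at x = 0: any g in [-1, 1] works.
f = abs
rng = np.random.default_rng(0)
x = 0.0

for g in np.linspace(-1.0, 1.0, 11):
    for y in rng.uniform(-5.0, 5.0, size=200):
        # Definition: f(x) - f(y) <= g * (x - y) for all y.
        assert f(x) - f(y) <= g * (x - y) + 1e-12
print("every g in [-1, 1] satisfies the subgradient inequality at x = 0")
```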

Before going to the proof we recall the definition of the epigraph of a function $f : X \to \mathbb{R}$:

$$\mathrm{epi}(f) = \{(x, t) \in X \times \mathbb{R} : t \ge f(x)\}.$$

It is obvious that a function is convex if and only if its epigraph is a convex set.

Proof. The first claim is almost trivial: let $g \in \partial f((1 - \gamma)x + \gamma y)$; then by definition one has

$$f((1 - \gamma)x + \gamma y) \le f(x) + \gamma \, g^{\top}(y - x),$$
$$f((1 - \gamma)x + \gamma y) \le f(y) + (1 - \gamma) \, g^{\top}(x - y),$$

which clearly shows that $f$ is convex by adding the two (appropriately rescaled) inequalities.

Now let us prove that a convex function $f$ has subgradients in the interior of $X$. We build a subgradient by using a supporting hyperplane to the epigraph of the function. Let $x \in X$. Then clearly $(x, f(x)) \in \partial\,\mathrm{epi}(f)$, and $\mathrm{epi}(f)$ is a convex set. Thus, by the Supporting Hyperplane Theorem, there exists $(a, b) \in \mathbb{R}^n \times \mathbb{R}$ such that

$$a^{\top} x + b f(x) \ge a^{\top} y + b t, \quad \forall (y, t) \in \mathrm{epi}(f).$$

Clearly, by letting $t$ tend to infinity, one can see that $b \le 0$. Now let us assume that $x$ is in the interior of $X$. Then for $\varepsilon > 0$ small enough, $y = x + \varepsilon a \in X$, which implies that $b$ cannot be equal to $0$ (recall that if $b = 0$ then necessarily $a \neq 0$, which allows one to conclude by contradiction).

Thus, rewriting the above inequality with $t = f(y)$, one obtains

$$f(x) - f(y) \le \frac{1}{|b|} \, a^{\top}(x - y).$$

Thus $a / |b| \in \partial f(x)$, which concludes the proof of the second claim.

Finally, let $f$ be a convex and differentiable function. Then by definition:

$$f(y) \ge \frac{f((1 - \gamma)x + \gamma y) - (1 - \gamma) f(x)}{\gamma} = f(x) + \frac{f(x + \gamma(y - x)) - f(x)}{\gamma} \xrightarrow[\gamma \to 0]{} f(x) + \nabla f(x)^{\top}(y - x),$$

which shows that $\nabla f(x) \in \partial f(x)$.

In several cases of interest the set of constraints can have an empty interior, in which case the above proposition does not yield any information. However, it is easy to replace $\mathrm{int}(X)$ by $\mathrm{ri}(X)$, the relative interior of $X$, which is defined as the interior of $X$ when we view it as a subset of the affine subspace it generates. Other notions of convex analysis will prove to be useful in some parts of this text. In particular, the notion of closed convex functions is convenient to exclude pathological cases: these are the convex functions with closed epigraphs. Sometimes it is also useful to consider the extension of a convex function $f : X \to \mathbb{R}$ to a function from $\mathbb{R}^n$ to $\mathbb{R} \cup \{+\infty\}$ by setting $f(x) = +\infty$ for $x \notin X$. In convex analysis one uses the term proper convex function to denote a convex function with values in $\mathbb{R} \cup \{+\infty\}$ such that there exists $x \in \mathbb{R}^n$ with $f(x) < +\infty$.
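This extension convention is easy to mimic in code; the following sketch (ours, with the arbitrary choices $X = [0,1]$ and $f(x) = x^2$) represents the proper extension by returning $+\infty$ outside the domain and checks that the chord inequality still holds for segments that leave $X$.

```python
import math

# Proper extension of f(x) = x**2 from X = [0, 1] to all of R:
# the function takes the value +inf outside its original domain.
def f_ext(x):
    if 0.0 <= x <= 1.0:
        return x ** 2
    return math.inf

# Chords with an endpoint outside X have value +inf, so the chord
# inequality is trivially satisfied there and convexity is preserved.
for (x, y, gamma) in [(0.2, 0.9, 0.5), (0.5, 2.0, 0.25), (-1.0, 3.0, 0.5)]:
    lhs = f_ext((1 - gamma) * x + gamma * y)
    rhs = (1 - gamma) * f_ext(x) + gamma * f_ext(y)
    print(lhs <= rhs)
```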

From now on all convex functions will be closed, and if necessary we consider also their proper extension. We refer the reader to Rockafellar [1970] for an extensive discussion of these notions.

Why convexity?

The key to the algorithmic success in minimizing convex functions is that these functions exhibit a local to global phenomenon. We have already seen one instance of this in the proposition on the existence of subgradients, where we showed that $\nabla f(x) \in \partial f(x)$: the gradient $\nabla f(x)$ contains a priori only local information about the function $f$ around $x$, while the subdifferential $\partial f(x)$ gives global information in the form of a linear lower bound on the entire function. Another instance of this local to global phenomenon is that local minima of convex functions are in fact global minima.

Proposition (Local minima are global minima). Let $f$ be convex. If $x$ is a local minimum of $f$ then $x$ is a global minimum of $f$. Furthermore, this happens if and only if $0 \in \partial f(x)$.

Proof. Clearly $0 \in \partial f(x)$ if and only if $x$ is a global minimum of $f$. Now assume that $x$ is a local minimum of $f$. Then for $\gamma$ small enough one has, for any $y$,

$$f(x) \le f\big((1 - \gamma)x + \gamma y\big) \le (1 - \gamma) f(x) + \gamma f(y),$$

which implies $f(x) \le f(y)$, and thus $x$ is a global minimum of $f$.

The nice behavior of convex functions will allow for very fast algorithms to optimize them.
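As a small preview of such algorithms (our sketch; the function, step size, and iteration count are illustrative choices, not prescriptions from the text), plain subgradient descent on the convex function $f(x) = |x - 3|$ drifts to the global minimizer, in line with the local-to-global phenomenon.

```python
import numpy as np

# Plain subgradient descent on the convex function f(x) = |x - 3|.
# Any local minimum of a convex function is global, so the iterates
# approach the unique minimizer x* = 3.
f = lambda x: abs(x - 3.0)
subgrad = lambda x: np.sign(x - 3.0)   # an element of the subdifferential

x = 0.0
for t in range(1, 1001):
    x = x - (1.0 / np.sqrt(t)) * subgrad(x)   # step size ~ 1/sqrt(t)

print(x, f(x))   # x ends up close to 3, f(x) close to the global minimum 0
```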

