Regression shrinkage and selection via the lasso: a retrospective

Robert Tibshirani
Stanford University, USA

J. R. Statist. Soc. B (2011) 73, Part 3, pp. 273-282. © 2011 Royal Statistical Society 1369-7412/11/73273

[Presented to The Royal Statistical Society at its annual conference in a session organized by the Research Section on Wednesday, September 15th, 2010, Professor D. M. Titterington in the Chair]

Address for correspondence: Robert Tibshirani, Department of Health Research and Policy, and Department of Statistics, Stanford University, Stanford, CA 94305, USA.

Summary. In the paper I give a brief review of the basic idea and some history, and then discuss some developments since the original paper on regression shrinkage and selection via the lasso.

Keywords: l1-penalty; Penalization; Regularization

1. The lasso

Given a linear regression with standardized predictors $x_{ij}$ and centred response values $y_i$, for $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, p$, the lasso solves the $l_1$-penalized regression problem of finding $\beta = \{\beta_j\}$ to minimize

\[ \sum_{i=1}^{N} \Big( y_i - \sum_j x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| . \]

This is equivalent to minimizing the sum of squares subject to a constraint of the form $\sum_j |\beta_j| \le s$. It is similar to ridge regression, which has the constraint $\sum_j \beta_j^2 \le t$.
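As a concrete illustration of this criterion, the sketch below fits it on simulated data with the glmnet R package (discussed in Section 3). The data, the standardization steps and the choice of $\lambda$ are mine and purely illustrative; note also that glmnet parametrizes the problem as $(1/2N)$ times the residual sum of squares plus $\lambda \sum_j |\beta_j|$, so its $\lambda$ is on a different scale from the display above.

```r
# Illustrative lasso fit with glmnet (assumes the glmnet package is installed).
# For family = "gaussian", glmnet minimizes (1/(2N)) * RSS + lambda * sum(abs(beta)).
library(glmnet)

set.seed(1)
N <- 100; p <- 8
X <- scale(matrix(rnorm(N * p), N, p))          # standardized predictors
beta_true <- c(2, -1, 0, 0, 0, 0.5, 0, 0)       # a sparse "true" coefficient vector
y <- drop(X %*% beta_true + rnorm(N))
y <- y - mean(y)                                # centred response

fit <- glmnet(X, y, alpha = 1, standardize = FALSE, intercept = FALSE)
coef(fit, s = 0.1)                              # coefficients at lambda = 0.1
```

Several of the estimated coefficients are exactly zero at this value of $\lambda$, which is the variable selection behaviour contrasted with ridge regression below.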

Because of the form of the $l_1$-penalty, the lasso does variable selection and shrinkage, whereas ridge regression, in contrast, only shrinks. If we consider a more general penalty of the form

\[ \Big( \sum_{j=1}^{p} |\beta_j|^q \Big)^{1/q}, \]

then the lasso uses $q = 1$ and ridge regression has $q = 2$. Subset selection emerges as $q \to 0$, and the lasso uses the smallest value of $q$ (i.e. closest to subset selection) that yields a convex problem. Convexity is very attractive for computational purposes.

2. History of the idea

The lasso is just regression with an $l_1$-norm penalty, and $l_1$-norms have been around for a long time! My most direct influence was Leo Breiman's non-negative garrotte (Breiman, 1995). His idea was to minimize, with respect to $c = \{c_j\}$,

\[ \sum_{i=1}^{N} \Big( y_i - \sum_j c_j x_{ij} \hat{\beta}_j \Big)^2 \quad \text{subject to} \quad c_j \ge 0, \; \sum_{j=1}^{p} c_j \le t, \]

where the $\hat{\beta}_j$ are the usual least squares estimates.
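The garrotte criterion above is a quadratic program in $c$, so it can be handed to a generic solver. The R sketch below does this with the quadprog package on simulated data with $p < N$; the simulated data, variable names and budget $t$ are my own choices, and this is only an illustration, not Breiman's original procedure.

```r
# A minimal sketch of the non-negative garrotte as a quadratic program
# (assumes the quadprog package; data and budget t are illustrative).
library(quadprog)

set.seed(1)
N <- 100; p <- 5
X <- scale(matrix(rnorm(N * p), N, p))            # standardized predictors
y <- drop(X %*% c(3, -2, 0, 0, 1) + rnorm(N))
y <- y - mean(y)                                  # centred response

beta_ols <- coef(lm(y ~ X - 1))                   # usual least squares estimates
Z <- sweep(X, 2, beta_ols, "*")                   # columns x_ij * beta_hat_j

t_budget <- 2.5                                   # budget: sum(c_j) <= t
Dmat <- 2 * crossprod(Z) + 1e-8 * diag(p)         # tiny ridge for numerical stability
dvec <- drop(2 * crossprod(Z, y))
Amat <- cbind(diag(p), -rep(1, p))                # c_j >= 0 and -sum(c_j) >= -t
bvec <- c(rep(0, p), -t_budget)

c_hat <- solve.QP(Dmat, dvec, Amat, bvec)$solution
round(c_hat * beta_ols, 3)                        # garrotte-shrunken coefficients
```

Shrinking the budget $t$ drives some of the $c_j$ to exactly zero, so the garrotte, like the lasso, performs selection as well as shrinkage.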

The garrotte is undefined when $p > N$ (which was not a hot topic in 1995!), so I just combined the two stages into one (as a Canadian I also wanted a gentler name). In other related work around the same time, Frank and Friedman (1993) discussed bridge regression, which uses a penalty $\lambda \sum_j |\beta_j|^\gamma$ with both $\lambda$ and $\gamma$ estimated from the data, and Chen et al. (1998) proposed basis pursuit, which uses an $l_1$-penalty in a signal processing context. Undoubtedly there are many other references that I am unaware of.

After publication, the paper did not receive much attention until years later. Why? My guesses are that

(a) the computation in 1996 was slow compared with today,
(b) the algorithms for the lasso were black boxes and not statistically motivated (until the LARS algorithm in 2002),
(c) the statistical and numerical advantages of sparsity were not immediately appreciated (by me or the community),
(d) large data problems (in N, p or both) were rare and
(e) the community did not have the R language for fast, easy sharing of new software.

3. Computational advances

The original lasso paper used an off-the-shelf quadratic program solver.

This does not scale well and is not transparent. The LARS algorithm (Efron et al., 2002) gives an efficient way of solving the lasso and connects the lasso to forward stagewise regression. The same algorithm is contained in the homotopy approach of Osborne et al. (2000). Co-ordinate descent algorithms are extremely simple and fast, and exploit the assumed sparsity of the model to great advantage. References include Fu (1998), Friedman et al. (2007, 2010), Wu and Lange (2008) and Genkin et al. (2007). We were made aware of its real potential in the doctoral thesis of Anita van der Kooij (Leiden), working with Jacqueline Meulman. The glmnet R language package (Friedman et al., 2010) implements the co-ordinate descent method for many popular models.
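To make the co-ordinate descent idea concrete, here is a bare-bones R sketch (not the glmnet implementation itself) that cycles soft-thresholding updates over standardized predictors; the objective parametrization, fixed number of sweeps and variable names are my own simplifications.

```r
# Minimal co-ordinate descent for the lasso (illustrative; not glmnet itself).
# Minimizes (1/(2N)) * sum((y - X %*% beta)^2) + lambda * sum(abs(beta)),
# assuming the columns of X are standardized and y is centred.
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

lasso_cd <- function(X, y, lambda, n_sweeps = 200) {
  N <- nrow(X); p <- ncol(X)
  beta <- rep(0, p)
  r <- y                                        # current residual y - X %*% beta
  for (pass in 1:n_sweeps) {
    for (j in 1:p) {
      r_partial <- r + X[, j] * beta[j]         # residual ignoring predictor j
      z <- drop(crossprod(X[, j], r_partial)) / N
      beta_new <- soft_threshold(z, lambda) / (sum(X[, j]^2) / N)
      r <- r_partial - X[, j] * beta_new
      beta[j] <- beta_new
    }
  }
  beta
}

# Example on simulated data with a sparse true coefficient vector
set.seed(1)
N <- 200; p <- 10
X <- scale(matrix(rnorm(N * p), N, p))
y <- drop(X %*% c(2, -1.5, rep(0, p - 2)) + rnorm(N))
y <- y - mean(y)
round(lasso_cd(X, y, lambda = 0.1), 3)
```

The glmnet package implements essentially this cycle, with many refinements such as warm starts along a path of $\lambda$ values and active-set iterations.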

4. Some generalizations and variants of the lasso

There has been much work in recent years, applying and generalizing the lasso and $l_1$-like penalties to a variety of problems. Table 1 gives a partial list. There has also been much deep and interesting work on the mathematical aspects of the lasso, examining its ability to produce a model with minimal prediction error, and also to recover the true underlying (sparse) model. Important contributors here include Bickel, Bühlmann, Candès, Donoho, Johnstone, Meinshausen, van de Geer, Wainwright and Yu. I do not have the qualifications or the space to summarize this work properly, but I hope that Professor Bühlmann will cover this aspect in his discussion.

Lasso methods can also shed light on more traditional techniques. The LARS algorithm, which was mentioned above, brings new understanding to forward stepwise selection methods. Another example is the graphical lasso for fitting a sparse Gaussian graph, based on the Gaussian log-likelihood plus $\lambda \|\Sigma^{-1}\|_1$, which is an $l_1$-penalty applied to the inverse covariance matrix.

Since a missing edge in the graph corresponds to a zero element of $\Sigma^{-1}$, this gives a powerful method for graph selection: determining which edges to include. As a bonus, a special case of the graphical lasso gives a new simple method for fitting a graph with prespecified edges (corresponding to structural 0s in $\Sigma^{-1}$). The details are given in Chapter 17 of Hastie et al. (2008).
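As a usage illustration, the sketch below calls the glasso package for R, which fits this penalized log-likelihood; the simulated data and the penalty value rho are arbitrary choices of mine, and the point is simply that the zero pattern of the estimated inverse covariance matrix defines the selected edge set.

```r
# Graphical lasso sketch: Gaussian log-likelihood plus an l1-penalty on the
# inverse covariance matrix (assumes the glasso package is installed).
library(glasso)

set.seed(1)
N <- 200; p <- 6
Z <- matrix(rnorm(N * p), N, p)
Z[, 2] <- Z[, 1] + 0.5 * rnorm(N)             # induce one strong partial correlation
S <- cov(Z)                                   # sample covariance matrix

fit <- glasso(S, rho = 0.1)                   # rho is the l1-penalty parameter
Theta <- fit$wi                               # estimated inverse covariance matrix

# An edge (j, k) is included when Theta[j, k] is non-zero
which(Theta != 0 & upper.tri(Theta), arr.ind = TRUE)
```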

Table 1. A sampling of generalizations of the lasso

Method | Reference | Detail
Grouped lasso | Yuan and Lin (2007a) | $\sum_g \|\beta_g\|_2$
Elastic net | Zou and Hastie (2005) | $\lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2$
Fused lasso | Tibshirani et al. (2005) | $\sum_j |\beta_{j+1} - \beta_j|$
Adaptive lasso | Zou (2006) | $\lambda \sum_j w_j |\beta_j|$
Graphical lasso | Yuan and Lin (2007b); Friedman et al. (2007) | $\mathrm{loglik} + \lambda \|\Sigma^{-1}\|_1$
Dantzig selector | Candès and Tao (2007) | $\min \|X^{\mathrm{T}} (y - X\beta)\|_\infty$ subject to $\|\beta\|_1 \le t$
Near-isotonic regularization | Tibshirani et al. (2010) | $\sum_j (\beta_j - \beta_{j+1})_+$
Matrix completion | Candès and Tao (2009); Mazumder et al. (2010) | $\|X - \hat{X}\|_F^2 + \lambda \|\hat{X}\|_*$
Compressive sensing | Donoho (2004); Candès (2006) | $\min \|\beta\|_1$ subject to $y = X\beta$
Multivariate methods | Jolliffe et al. (2003); Witten et al. (2009) | Sparse principal components analysis, linear discriminant analysis and canonical correlation analysis

[Fig. 1. Examples of nearly isotonic fits for a toy example: (a) interpolating fit ($\lambda = 0$) and three joining events; (b) $\lambda = 0.25$; (c) $\lambda = 0.7$; (d) $\lambda = 0.77$, with the usual isotonic regression.]

Another recent example is nearly isotonic regression (Fig. 1) (Tibshirani et al., 2010). Given a data sequence $y_1, y_2, \ldots, y_N$, isotonic regression solves the problem of finding $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N$ to minimize

\[ \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \hat{y}_1 \le \hat{y}_2 \le \cdots \le \hat{y}_N . \]

This assumes a monotone non-decreasing approximation, with an analogous definition for the monotone non-increasing case. The solution can be computed via the well-known pool adjacent violators algorithm (Barlow et al., 1972). In nearly isotonic regression we minimize, with respect to $\beta$,

\[ \frac{1}{2} \sum_{i=1}^{N} (y_i - \beta_i)^2 + \lambda \sum_{i=1}^{N-1} (\beta_i - \beta_{i+1})_+ , \]

with $x_+$ indicating the positive part, $x_+ = x \cdot 1\{x > 0\}$. This is a convex problem, with $\hat{\beta}_i = y_i$ at $\lambda = 0$ and culminating in the usual isotonic regression as $\lambda \to \infty$. Along the way it gives nearly monotone approximations. The term $(\beta_i - \beta_{i+1})_+$ is half of an $l_1$-penalty on differences, penalizing dips but not increases in the sequence. This procedure allows us to assess the assumption of monotonicity by comparing nearly monotone approximations with the best monotone approximation.
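The pool adjacent violators idea can be stated in a few lines of code. The R sketch below is a bare-bones version for the monotone non-decreasing, equally weighted case and is checked against base R's isoreg(); it is illustrative only and does not implement the nearly isotonic path algorithm discussed next.

```r
# Bare-bones pool adjacent violators algorithm for isotonic regression
# (monotone non-decreasing, equal weights). Illustrative only; base R's
# isoreg() computes the same fit.
pava <- function(y) {
  n <- length(y)
  vals <- numeric(n)   # block means, used as a stack
  wts  <- numeric(n)   # block sizes
  k <- 0               # number of blocks currently on the stack
  for (i in 1:n) {
    k <- k + 1
    vals[k] <- y[i]; wts[k] <- 1
    # pool while the last two blocks violate monotonicity
    while (k > 1 && vals[k - 1] > vals[k]) {
      pooled <- (wts[k - 1] * vals[k - 1] + wts[k] * vals[k]) / (wts[k - 1] + wts[k])
      vals[k - 1] <- pooled
      wts[k - 1] <- wts[k - 1] + wts[k]
      k <- k - 1
    }
  }
  rep(vals[1:k], times = wts[1:k])   # expand block means back to length n
}

# Quick check against base R's isotonic regression
set.seed(1)
y <- cumsum(rnorm(20, mean = 0.2))   # roughly increasing sequence with dips
all.equal(pava(y), isoreg(y)$yf)     # should be TRUE
```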

Tibshirani et al. (2011) have provided a simple algorithm that computes the entire path of solutions, which is a kind of modified version of the pool adjacent violators procedure. They also showed that the number of degrees of freedom is the number of unique values of $\hat{y}_i$ in the solution, using results from Tibshirani and Taylor (2011).

5. Discussion

Lasso ($l_1$-)penalties are useful for fitting a wide variety of models. Newly developed computational algorithms allow application of these models to large data sets, exploiting sparsity for both statistical and computational gains. Interesting work on the lasso is being carried out in many fields, including statistics, engineering, mathematics and computer science. I conclude with a challenge for statisticians.

This is an enjoyable area to work in, but we should not invent new models and algorithms just for the sake of it. We should focus on developing tools and understanding their properties, to help us and our collaborators to solve important problems.

Acknowledgements

The work discussed here represents collaborations with many people, especially Bradley Efron, Jerome Friedman, Trevor Hastie, Holger Hoefling, Iain Johnstone, Ryan Tibshirani and Daniela Witten. I thank the Research Section of the Royal Statistical Society for inviting me to present this retrospective paper.

References

Barlow, R. E., Bartholomew, D., Bremner, J. M. and Brunk, H. D. (1972) Statistical Inference under Order Restrictions; the Theory and Applications of Isotonic Regression. New York: Wiley.
Breiman, L. ...

