
The Adaptive Lasso and Its Oracle Properties

Hui Zou

The lasso is a popular technique for simultaneous estimation and variable selection. Lasso variable selection has been shown to be consistent under certain conditions. In this work we derive a necessary condition for the lasso variable selection to be consistent. Consequently, there exist certain scenarios where the lasso is inconsistent for variable selection. We then propose a new version of the lasso, called the adaptive lasso, where adaptive weights are used for penalizing different coefficients in the $\ell_1$ penalty. We show that the adaptive lasso enjoys the oracle properties; namely, it performs as well as if the true underlying model were given in advance. Similar to the lasso, the adaptive lasso is shown to be near-minimax optimal. Furthermore, the adaptive lasso can be solved by the same efficient algorithm for solving the lasso. We also discuss the extension of the adaptive lasso in generalized linear models and show that the oracle properties still hold under mild regularity conditions.

Hui Zou is Assistant Professor of Statistics, School of Statistics, University of Minnesota, Minneapolis, MN 55455 (E-mail: hzou@stat.umn.edu). The author thanks an associate editor and three referees for their helpful comments and suggestions. Sincere thanks also go to a co-editor for his encouragement.



As a byproduct of our theory, the nonnegative garotte is shown to be consistent for variable selection.

KEY WORDS: Asymptotic normality; Lasso; Minimax; Oracle inequality; Oracle procedure; Variable selection.

1. INTRODUCTION

There are two fundamental goals in statistical learning: ensuring high prediction accuracy and discovering relevant predictive variables. Variable selection is particularly important when the true underlying model has a sparse representation. Identifying significant predictors will enhance the prediction performance of the fitted model. Fan and Li (2006) gave a comprehensive overview of feature selection and proposed a unified penalized likelihood framework to approach the problem of variable selection.

Let us consider model estimation and variable selection in linear regression models. Suppose that $\mathbf{y} = (y_1, \ldots, y_n)^T$ is the response vector and $\mathbf{x}_j = (x_{1j}, \ldots, x_{nj})^T$, $j = 1, \ldots, p$, are the linearly independent predictors. Let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_p]$ be the predictor matrix. We assume that $E[\mathbf{y} \mid \mathbf{x}] = \beta_1^* x_1 + \cdots + \beta_p^* x_p$.
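To fix ideas, the following minimal Python sketch (hypothetical dimensions and coefficient values, not taken from the paper) generates centered predictors and a response from such a sparse linear model and records which predictors are truly relevant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and coefficients, chosen only for illustration.
n, p, p0 = 200, 8, 3
beta_true = np.zeros(p)
beta_true[:p0] = [3.0, 1.5, 2.0]          # only the first p0 coefficients are nonzero

X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                        # centered predictors, so no intercept is needed
y = X @ beta_true + rng.normal(size=n)     # y = X beta* + noise
y -= y.mean()

A = np.flatnonzero(beta_true)              # the subset of truly relevant predictors
print("true active set:", A)
```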

Without loss of generality, we assume that the data are centered, so the intercept is not included in the regression function. Let $\mathcal{A} = \{j : \beta_j^* \neq 0\}$ and further assume that $|\mathcal{A}| = p_0 < p$. Thus the true model depends only on a subset of the predictors. Denote by $\hat{\beta}(\delta)$ the coefficient estimator produced by a fitting procedure $\delta$. Using the language of Fan and Li (2001), we call $\delta$ an oracle procedure if $\hat{\beta}(\delta)$ (asymptotically) has the following oracle properties:

- Identifies the right subset model, $\{j : \hat{\beta}_j \neq 0\} = \mathcal{A}$.
- Has the optimal estimation rate, $\sqrt{n}(\hat{\beta}(\delta)_{\mathcal{A}} - \beta^*_{\mathcal{A}}) \rightarrow_d N(0, \Sigma^*)$, where $\Sigma^*$ is the covariance matrix knowing the true subset model.

It has been argued (Fan and Li 2001; Fan and Peng 2004) that a good procedure should have these oracle properties. However, some extra conditions besides the oracle properties, such as continuous shrinkage, are also required in an optimal procedure.

Ordinary least squares (OLS) gives nonzero estimates to all coefficients. Traditionally, statisticians use best-subset selection to select significant variables, but this procedure has two fundamental limitations.

First, when the number of predictors is large, it is computationally infeasible to do subset selection. Second, subset selection is extremely variable because of its inherent discreteness (Breiman 1995; Fan and Li 2001). Stepwise selection is often used as a computational surrogate to subset selection; nevertheless, stepwise selection still suffers from high variability and in addition is often trapped into a local optimal solution rather than the global optimal solution. Furthermore, these selection procedures ignore the stochastic errors or uncertainty in the variable selection stage (Fan and Li 2001; Shen and Ye 2002). The lasso is a regularization technique for simultaneous estimation and variable selection (Tibshirani 1996).

The lasso estimates are defined as

$$\hat{\beta}(\text{lasso}) = \arg\min_{\beta} \Bigl\| \mathbf{y} - \sum_{j=1}^{p} \mathbf{x}_j \beta_j \Bigr\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (1)$$

where $\lambda$ is a nonnegative regularization parameter. The second term in (1) is the so-called $\ell_1$ penalty, which is crucial for the success of the lasso. The $\ell_1$ penalization approach is also called basis pursuit in signal processing (Chen, Donoho, and Saunders 2001). The lasso continuously shrinks the coefficients toward 0 as $\lambda$ increases, and some coefficients are shrunk to exactly 0 if $\lambda$ is sufficiently large. Moreover, continuous shrinkage often improves the prediction accuracy due to the bias-variance trade-off. The lasso is supported by much theoretical work. Donoho, Johnstone, Kerkyacharian, and Picard (1995) proved the near-minimax optimality of soft thresholding (the lasso shrinkage with orthogonal predictors). It also has been shown that the $\ell_1$ approach is able to discover the right sparse representation of the model under certain conditions (Donoho and Huo 2002; Donoho and Elad 2002; Donoho 2004).
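For intuition about the soft-thresholding connection, here is a minimal Python sketch (not from the paper): with orthonormal predictors ($\mathbf{X}^T\mathbf{X} = I$) the criterion in (1) decouples coordinate-wise, and each lasso coefficient is the OLS estimate soft-thresholded at $\lambda/2$ (the factor 1/2 arises because the squared-error term in (1) is not halved).

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_orthonormal(X, y, lam):
    """Lasso solution of criterion (1) when X has orthonormal columns (X^T X = I).

    The problem decouples coordinate-wise, and each coefficient is the OLS
    estimate z_j = x_j^T y soft-thresholded at lam / 2.
    """
    z = X.T @ y
    return soft_threshold(z, lam / 2.0)
```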

Meinshausen and Bühlmann (2004) showed that variable selection with the lasso can be consistent if the underlying model satisfies some conditions. It seems safe to conclude that the lasso is an oracle procedure for simultaneously achieving consistent variable selection and optimal estimation (prediction). However, there are also solid arguments against the lasso oracle statement. Fan and Li (2001) studied a class of penalization methods including the lasso. They showed that the lasso can perform automatic variable selection because the $\ell_1$ penalty is singular at the origin. On the other hand, the lasso shrinkage produces biased estimates for the large coefficients, and thus it could be suboptimal in terms of estimation risk. Fan and Li conjectured that the oracle properties do not hold for the lasso. They also proposed a smoothly clipped absolute deviation (SCAD) penalty for variable selection and proved its oracle properties.

Meinshausen and Bühlmann (2004) also showed the conflict of optimal prediction and consistent variable selection in the lasso. They proved that the optimal $\lambda$ for prediction gives inconsistent variable selection results; in fact, many noise features are included in the predictive model. This conflict can be easily understood by considering an orthogonal design model (Leng, Lin, and Wahba 2004). Whether the lasso is an oracle procedure is an important question demanding a definite answer, because the lasso has been used widely in practice. In this article we attempt to provide an answer. In particular, we are interested in whether the $\ell_1$ penalty could produce an oracle procedure and, if so, how. We consider the asymptotic setup where $\lambda$ in (1) varies with $n$ (the sample size).

We first show that the underlying model must satisfy a nontrivial condition if the lasso variable selection is consistent. Consequently, there are scenarios in which the lasso selection cannot be consistent. To fix this problem, we then propose a new version of the lasso, the adaptive lasso, in which adaptive weights are used for penalizing different coefficients in the $\ell_1$ penalty. We show that the adaptive lasso enjoys the oracle properties. We also prove the near-minimax optimality of the adaptive lasso shrinkage using the language of Donoho and Johnstone (1994). The adaptive lasso is essentially a convex optimization problem with an $\ell_1$ constraint. Therefore, the adaptive lasso can be solved by the same efficient algorithm for solving the lasso. Our results show that the $\ell_1$ penalty is at least as competitive as other concave oracle penalties and also is computationally more attractive. We consider this article to provide positive evidence supporting the use of the $\ell_1$ penalty in statistical learning and modeling.
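The computational remark above can be made concrete: a weighted $\ell_1$ penalty is absorbed into an ordinary lasso by rescaling each column of $\mathbf{X}$ by the reciprocal of its weight and rescaling the solution back. The Python sketch below is illustrative, not the paper's code; the weight choice $\hat{w}_j = 1/|\hat{\beta}_{\text{ols},j}|^{\gamma}$ is the construction developed later in the paper, and the $10^{-8}$ guard and default $\gamma = 1$ are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, lam, gamma=1.0):
    """Adaptive lasso via a reweighted ordinary lasso (illustrative sketch).

    Assumed weight choice: w_j = 1 / |beta_ols_j|^gamma.  The weighted l1
    penalty is absorbed by rescaling column j of X by 1 / w_j, solving an
    ordinary lasso, and scaling the solution back.
    """
    n, p = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = 1.0 / (np.abs(beta_ols) ** gamma + 1e-8)   # small constant guards against division by zero

    X_star = X / w                                  # column j of X divided by w_j (broadcasting)
    # scikit-learn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1,
    # so alpha = lam / (2n) matches the criterion ||y - Xb||^2 + lam*||b||_1.
    fit = Lasso(alpha=lam / (2.0 * n), fit_intercept=False).fit(X_star, y)
    return fit.coef_ / w                            # back to the original parameterization
```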

The nonnegative garotte (Breiman 1995) is another popular variable selection method. We establish a close relation between the nonnegative garotte and a special case of the adaptive lasso, which we use to prove the consistency of the nonnegative garotte selection.

The rest of the article is organized as follows. In Section 2 we derive the necessary condition for the consistency of the lasso variable selection. We give concrete examples to show when the lasso fails to be consistent in variable selection. We define the adaptive lasso in Section 3, and then prove its statistical properties. We also show that the nonnegative garotte is consistent for variable selection. We apply the LARS algorithm (Efron, Hastie, Johnstone, and Tibshirani 2004) to solve the entire solution path of the adaptive lasso. We use a simulation study to compare the adaptive lasso with several popular sparse modeling techniques. We discuss some applications of the adaptive lasso in generalized linear models in Section 4, and give concluding remarks in Section 5. We relegate technical proofs to the Appendix.

2. THE LASSO VARIABLE SELECTION COULD BE INCONSISTENT

We adopt the setup of Knight and Fu (2000) for the asymptotic analysis.

We assume two conditions:

(a) $y_i = \mathbf{x}_i \beta^* + \epsilon_i$, where $\epsilon_1, \ldots, \epsilon_n$ are independent identically distributed (iid) random variables with mean 0 and variance $\sigma^2$;
(b) $\frac{1}{n}\mathbf{X}^T\mathbf{X} \rightarrow C$, where $C$ is a positive definite matrix.

Without loss of generality, assume that $\mathcal{A} = \{1, 2, \ldots, p_0\}$. Let

$$C = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix},$$

where $C_{11}$ is a $p_0 \times p_0$ matrix.

We consider the lasso estimates $\hat{\beta}^{(n)}$,

$$\hat{\beta}^{(n)} = \arg\min_{\beta} \Bigl\| \mathbf{y} - \sum_{j=1}^{p} \mathbf{x}_j \beta_j \Bigr\|^2 + \lambda_n \sum_{j=1}^{p} |\beta_j|, \qquad (2)$$

where $\lambda_n$ varies with $n$. Let $\mathcal{A}_n = \{j : \hat{\beta}^{(n)}_j \neq 0\}$. The lasso variable selection is consistent if and only if $\lim_n P(\mathcal{A}_n = \mathcal{A}) = 1$.

Lemma 1. If $\lambda_n/n \rightarrow \lambda_0 \geq 0$, then $\hat{\beta}^{(n)} \rightarrow_p \arg\min V_1$, where
$$V_1(u) = (u - \beta^*)^T C (u - \beta^*) + \lambda_0 \sum_{j=1}^{p} |u_j|.$$

Lemma 2. If $\lambda_n/\sqrt{n} \rightarrow \lambda_0 \geq 0$, then $\sqrt{n}(\hat{\beta}^{(n)} - \beta^*) \rightarrow_d \arg\min(V_2)$, where
$$V_2(u) = -2 u^T W + u^T C u + \lambda_0 \sum_{j=1}^{p} \bigl[ u_j \, \mathrm{sgn}(\beta^*_j) I(\beta^*_j \neq 0) + |u_j| I(\beta^*_j = 0) \bigr],$$
and $W$ has a $N(0, \sigma^2 C)$ distribution.

These two lemmas are quoted from Knight and Fu (2000). From an estimation standpoint, Lemma 2 is more interesting, because it shows that the lasso estimate is root-$n$ consistent. In Lemma 1, only $\lambda_0 = 0$ guarantees estimation consistency. However, when considering the asymptotic behavior of variable selection, Lemma 2 actually implies that when $\lambda_n = O(\sqrt{n})$, $\mathcal{A}_n$ fails to equal $\mathcal{A}$ with a positive probability.

Proposition 1. If $\lambda_n/\sqrt{n} \rightarrow \lambda_0 \geq 0$, then $\limsup_n P(\mathcal{A}_n = \mathcal{A}) \leq c < 1$, where $c$ is a constant depending on the true model.

Based on Proposition 1, it seems interesting to study the asymptotic behavior of $\hat{\beta}^{(n)}$ when $\lambda_0 = \infty$, which amounts to considering the case where $\lambda_n/n \rightarrow 0$ and $\lambda_n/\sqrt{n} \rightarrow \infty$.
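As an illustrative Monte Carlo sketch (not from the paper) of the $\lambda_n = O(\sqrt{n})$ regime in Lemma 2 and Proposition 1, the Python snippet below tracks the empirical probability that the lasso recovers the true active set for a hypothetical design and an arbitrary $\lambda_0$; by Proposition 1 this probability should remain bounded away from 1 rather than tend to 1 as $n$ grows.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
beta_true = np.array([3.0, 1.5, 0.0, 0.0])   # hypothetical true model with A = {0, 1}
A_true = frozenset(np.flatnonzero(beta_true))
lam0 = 2.0                                    # arbitrary lambda_0 in lambda_n = lam0 * sqrt(n)
reps = 200

for n in (100, 400, 1600):
    hits = 0
    for _ in range(reps):
        X = rng.normal(size=(n, beta_true.size))
        y = X @ beta_true + rng.normal(size=n)
        lam_n = lam0 * np.sqrt(n)
        # alpha = lam_n / (2n) converts criterion (2) to scikit-learn's parameterization
        coef = Lasso(alpha=lam_n / (2.0 * n), fit_intercept=False).fit(X, y).coef_
        if frozenset(np.flatnonzero(np.abs(coef) > 1e-8)) == A_true:
            hits += 1
    print(f"n = {n:5d}   empirical P(A_n = A) = {hits / reps:.2f}")
```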

