Transcription of Gradient Boosted Feature Selection - Alice Zheng
1 Gradient Boosted Feature SelectionZhixiang (Eddie) Xu Washington University in Brookings Louis, HuangTsinghua University30 Shuangqing , Q. WeinbergerWashington University in Brookings Louis, X. Zheng GraphLab936 N. 34th St. Ste 208 Seattle, Feature Selection algorithm should ideally satisfy four con-ditions: reliably extract relevant features; be able to iden-tify non-linear Feature interactions; scale linearly with thenumber of features and dimensions; allow the incorpora-tion of known sparsity structure. In this work we propose anovel Feature Selection algorithm, Gradient Boosted FeatureSelection (GBFS), which satisfies all four of these require-ments. The algorithm is flexible, scalable, and surprisinglystraight-forward to implement as it is based on a modifi-cation of Gradient Boosted Trees. We evaluate GBFS onseveral real world data sets and show that it matches or out-performs other state of the art Feature Selection it scales to larger data set sizes and naturally allows fordomain-specific side and Subject [Information Storage and Retrieval]: Miscellaneous; [Pattern Recognition]: Design Methodology Fea-ture evaluation and selectionGeneral TermsLearningKeywordsFeature Selection ; Large-scale; Gradient boosting Work done while at Microsoft ResearchPermission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page.
2 Copyrights for components of this work owned by others than theauthor(s) must be honored. Abstracting with credit is permitted. To copy otherwise, orrepublish, to post on servers or to redistribute to lists, requires prior specific permissionand/or a fee. Request permissions from 14,August 24 27, 2014, New York, NY, is held by the owner/author(s). Publication rights licensed to 978-1-4503-2956-9/14/08 ..$ .1. INTRODUCTIONF eature Selection (FS) [8] is an important problems in ma-chine learning. In many applications, , bio-informatics [21]or neuroscience [12], researchers hope to gain insight by ana-lyzinghowa classifier can predict a label and what featuresit uses. Moreover, effective Feature Selection leads to par-simonious classifiers that require less memory [25] and arefaster to train and test [5]. It can also reduce Feature extrac-tion costs [29, 30] and lead to better generalization [9].
3 Linear Feature Selection algorithms such as LARS [7] arehighly effective at discovering linear dependencies betweenfeatures and labels. However, they fail when features in-teract in nonlinear ways. Nonlinear Feature Selection algo-rithms, such as Random Forest [9] or recently introducedkernel methods [32, 23], can cope with nonlinear interac-tions. But their computational and memory complexity typ-ically grow super-linearly with the training set size. As datasets grow in size, this is increasingly problematic. Balancingthe twin goals ofscalabilityandnonlinearfeature selectionis still an open this paper, we focus on the scenario where data setscontain a large number of samples. Specifically, we aim toperform efficient Feature Selection when the number of datapoints is much larger than the number of features (n d).We start with the (NP-Hard) Feature Selection problem thatalso motivated LARS [7] and LASSO [26].
4 But instead ofusing a linear classifier and approximating the Feature selec-tion cost with anl1-norm, we follow [31] and use gradientboosted regression trees [7] for which greedy approximationsexist [2].The resulting algorithm is surprisingly simple yet very ef-fective. We refer to it as Gradient Boosted Feature Selection (GBFS). Following the Gradient boosting framework, treesare built with the greedy CART algorithm [2]. Features areselected sparsely following an important change in the im-purity function: splitting on new features is penalized bya cost >0, whereas re-use of previously selected featuresincurs no additional has several compelling properties. 1. As it learnsan ensemble of regression trees, it can naturally discovernonlinear interactions between features. 2. In contrast to, , FS with Random Forests, it unifies Feature selectionand classification into a single optimization.
5 3. In contrastto existing nonlinear FS algorithms, its time and memorycomplexity scales asO(dn), whereddenotes the number offeatures dimensionality andnthe number of data points1,and is very fast in practice. 4. GBFS can naturally incorpo-rate pre-specified Feature cost structures or side-information, , select bags of features or focus on regions of interest,similar to generalized lasso in linear FS [19].We evaluate this algorithm on several real-world data setsof varying difficulty and size, and we demonstrate that GBFS tends to match or outperform the accuracy and Feature se-lection trade-off of Random Forest Feature Selection , thecurrent state-of-the-art in nonlinear Feature showcase the ability of GBFS to naturally incorporateside-information about inter- Feature dependencies on a realworld biological classification task [1]. Here, features aregrouped into nine pre-specified bags with biological mean-ing.
6 GBFS can easily adapt to this setting and select entirefeature bags. The resulting classifier matches the best accu-racy of competing methods (trained on many features) withonlya singlebag of RELATED WORKOne of the most widely used Feature Selection algorithmsis Lasso [26]. It minimizes the squared loss withl1regu-larization on the coefficient vector, which encourages sparsesolutions. Although scalable to very large data sets, Lassomodels only linear correlations between features and labelsand cannot discover non-linear Feature dependencies.[17] propose the Minimum Redundancy Maximum Rel-evance (mRMR) algorithm, which selects a subset of themost responsive features that have high mutual informationwith labels. Their objective function also penalizes select-ing redundant features. Though elegant, computing mu-tual information when the number of instance is large isintractable, and thus the algorithm does not scale.
7 HSICL asso [32], on the other hand, introduces non-linearity bycombining multiple kernel functions that each uses a singlefeature. The resulting convex optimization problem alignsthis kernel with a perfect label kernel. The algorithm re-quires constructing kernel matrices for all features, thus itstime and memory complexity scale quadratically with inputdata set size. Moreover, both algorithms separate featureselection and classification, and require additional time andcomputation for training classifiers using the selected other works avoid expensive kernel computationwhile maintaining non-linearity. Grafting [18] combinesl1andl0regularization with a non-linear classifier based ona non-convex variant of the multi-layer perceptron. Fea-ture Selection for Ranking using Boosted Trees [15] selectsthe top features with the highest relative importance scores.
8 [27] and [9] use Random Forest. Finally, while not a fea-ture Selection method, [31] employ Gradient Boosted Treesto learn cascades of classifiers to reduce test-time cost byincorporating Feature extraction budgets into the BACKGROUNDT hroughout this paper we type vectors in bold (xi), scalarsin regular math type (korC), sets in cursive (S) and ma-1In fact, if the storage of the input data is not counted, thememory complexity of GBFS scales asO(n).trices in capital bold (F) font. Specific entries in vectors ormatrices are scalars and follow the corresponding data set consists of input vectors{x1,..,xn} Rdwith corresponding labels{y1,..,yn} Ydrawn from anunknown distribution. The labels can be binary, categor-ical (multi-class) or real-valued (regression). For the sakeof clarity, we focus on binary classificationY { 1,+1},although the algorithm can be extended to multi-class andregression as Feature Selection with thel1normLasso [26] combines linear classification andl1regulariza-tionminw (xi,yi)`(xi,yi,w) + |w|1.
9 (1)In its original formulation,`( ) is defined to be the squaredloss,`(xi,yi,w) = (w>xi yi)2. However, for the sake offeature Selection , other loss functions are possible. In the bi-nary classification setting, whereyi { 1,+1}, we use thebetter suited log-loss,`(xi,yi,w) = log(1+exp(yiw>xi)) [11]. The cappedl1norml1regularization serves two purposes: It regularizes theclassifier against overfitting, and it induces sparsity for fea-ture Selection . Unfortunately, these two effects of thel1-norm are inherently tied and there is no way to regulate theimpact of either one.[33] introduce thecappedl1norm, defined by the element-wise operationq (wi) = min(|wi|, ).(2)Its advantage over the standardl1norm is that once a fea-ture is extracted, its use is not penalized further , itpenalizes using many features does not reward small is a much better approximation of thel0norm, whichonly penalizes Feature use without interfering with the mag-nitude of the weights.
10 When is small enough, , mini|wi|, we can compute the exact number of features ex-tracted withq (w)/ . In other words, penalizingq (w) is aclose proxy for penalizing the number of extracted , the cappedl1norm is not convex and thereforenot easy to cappedl1norm can be combined with a regularl1(orl2) norm, where one can control the trade-off between featureextraction and regularization by adjusting the correspondingregularization parameters, , 0:minw (xi,yi)`(xi,yi,w) + |w|1+ q (w).(3)Hereq (w) denotes [q (w1),..,q (wd)].4. Gradient Boosted Feature SELEC-TIONThe classifier in Eq. (3) is better suited for Feature selec-tion than plainl1regularization. However, it is stilllinear,which limits the flexibility of the classifer. Standard ap-proaches for incorporating non-linearity include the kernellearning [22] and boosting [3]. HSIC Lasso [32] uses kernellearning to discover non-linear Feature interactions at a priceof quadratic memory and time complexity.