Transcription of Feature Selection for High-Dimensional Data: A Fast ...
1 Feature Selection for High-Dimensional Data: A fast Correlation- based filter SolutionLei of Computer Science & Engineering, Arizona State University, Tempe, AZ 85287-5406, USAA bstractFeature Selection , as a preprocessing step tomachine learning, is effective in reducing di-mensionality, removing irrelevant data, in-creasing learning accuracy, and improvingresult comprehensibility. However, the re-cent increase of dimensionality of data posesa severe challenge to many existing featureselection methods with respect to efficiencyand effectiveness. In this work, we intro-duce a novel concept, predominant correla-tion, and propose a fast filter method whichcan identify relevant features as well as re-dundancy among relevant features withoutpairwise correlation analysis. The efficiencyand effectiveness of our method is demon-strated through extensive comparisons withother methods using real-world data of IntroductionFeature Selection is frequently used as a preprocessingstep to machine learning.
2 It is a process of choosinga subset of original features so that the Feature spaceis optimally reduced according to a certain evaluationcriterion. Feature Selection has been a fertile field ofresearch and development since 1970 s and proven tobe effective in removing irrelevant and redundant fea-tures, increasing efficiency in learning tasks, improv-ing learning performance like predictive accuracy, andenhancing comprehensibility of learned results (Blum& Langley, 1997; Dash & Liu, 1997; Kohavi & John,1997). In recent years, data has become increas-ingly larger in both number of instances and num-ber of features in many applications such as genomeprojects (Xing et al., 2001), text categorization (Yang& Pederson, 1997), image retrieval (Rui et al., 1999),and customer relationship management (Ng & Liu,2000). This enormity may cause serious problemsto many machine learning algorithms with respectto scalability and learning performance.
3 For exam-ple, high dimensional data ( , data sets with hun-dreds or thousands of features) can contain high de-gree of irrelevant and redundant information whichmay greatly degrade the performance of learning al-gorithms. Therefore, Feature Selection becomes verynecessary for machine learning tasks when facing highdimensional data nowadays. However, this trend ofenormity on both size and dimensionality also posessevere challenges to Feature Selection algorithms. Someof the recent research efforts in Feature Selection havebeen focused on these challenges from handling a hugenumber of instances (Liu et al., 2002b) to dealing withhigh dimensional data (Das, 2001; Xing et al., 2001).This work is concerned about Feature Selection for highdimensional data. In the following, we first reviewmodels of Feature Selection and explain why a filter so-lution is suitable for high dimensional data, and thenreview some recent efforts in Feature Selection for highdimensional Selection algorithms fall into two broad cat-egories, the filter model or the wrapper model (Das,2001; Kohavi & John, 1997).
4 The filter model relieson general characteristics of the training data to se-lect some features without involving any learning al-gorithm. The wrapper model requires one predeter-mined learning algorithm in Feature Selection and usesits performance to evaluate and determine which fea-tures are selected. As for each new subset of features,the wrapper model needs to learn a hypothesis (or aclassifier). It tends to find features better suited to thepredetermined learning algorithm resulting in superiorlearning performance, but it also tends to be morecomputationally expensive than the filter model (Lan-gley, 1994). When the number of features becomesvery large, the filter model is usually chosen due to itscomputational of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, combine the advantages of both models, algorithmsin a hybrid model have recently been proposed to dealwith high dimensional data (Das, 2001; Ng, 1998; Xinget al.)
5 , 2001). In these algorithms, first, a goodnessmeasure of Feature subsets based on data characteris-tics is used to choose best subsets for a given cardinal-ity, and then, cross validation is exploited to decide afinal best subset across different cardinalities. Thesealgorithms mainly focus on combining filter and wrap-per algorithms to achieve best possible performancewith a particular learning algorithm with similar timecomplexity of filter algorithms. In this work, we focuson the filter model and aim to develop a new featureselection algorithm which can effectively remove bothirrelevant and redundant features and is less costly incomputation than the currently available section 2, we review current algorithms within thefilter model and point out their problems in the contextof high dimensionality. In section 3, we describe corre-lation measures which form the base of our method inevaluating Feature relevance and redundancy. In sec-tion 4, we first propose our method which selects goodfeatures for classification based on a novel concept,predominant correlation, and then present a fastalgorithm with less than quadratic time complexity.
6 Insection 5, we evaluate the efficiency and effectivenessof this algorithm via extensive experiments on variousreal-world data sets comparing with other representa-tive Feature Selection algorithms, and discuss the im-plications of the findings. In section 6, we concludeour work with some possible Related WorkWithin the filter model, different Feature Selection al-gorithms can be further categorized into two groups,namely, Feature weighting algorithms and subsetsearch algorithms, based on whether they evaluate thegoodness of features individually or through featuresubsets. Below, we discuss the advantages and short-comings of representative algorithms in each weighting algorithms assign weights to fea-tures individually and rank them based on their rel-evance to the target concept. There are a number ofdifferent definitions on Feature relevance in machinelearning literature (Blum & Langley, 1997; Kohavi &John, 1997). A Feature is good and thus will be se-lected if its weight of relevance is greater than a thresh-old value.
7 A well known algorithm that relies on rele-vance evaluation is Relief (Kira & Rendell, 1992). Thekey idea of Relief is to estimate the relevance of fea-tures according to how well their values distinguish be-tween the instances of the same and different classesthat are near each other. Relief randomly samples anumber (m) of instances from the training set and up-dates the relevance estimation of each Feature basedon the difference between the selected instance andthe two nearest instances of the same and oppositeclasses. Time complexity of Relief for a data set withMinstances andNfeatures isO(mMN). Withmbe-ing a constant, the time complexity becomesO(MN),which makes it very scalable to data sets with both ahuge number of instances and a very high dimension-ality. However, Relief does not help with removingredundant features. As long as features are deemedrelevant to the class concept, they will all be selectedeven though many of them are highly correlated toeach other (Kira & Rendell, 1992).
8 Many other algo-rithms in this group have similar problems as Reliefdoes. They can only capture the relevance of featuresto the target concept, but cannot discover redundancyamong features. However, empirical evidence from fea-ture Selection literature shows that, along with irrele-vant features, redundant features also affect the speedand accuracy of learning algorithms and thus should beeliminated as well (Hall, 2000; Kohavi & John, 1997).Therefore, in the context of Feature Selection for highdimensional data where there may exist many redun-dant features, pure relevance- based Feature weightingalgorithms do not meet the need of Feature selectionvery search algorithms search through candidatefeature subsets guided by a certain evaluation mea-sure (Liu & Motoda, 1998) which captures the good-ness of each subset. An optimal (or near optimal) sub-set is selected when the search stops. Some existingevaluation measures that have been shown effective inremoving both irrelevant and redundant features in-clude the consistency measure (Dash et al.)
9 , 2000) andthe correlation measure (Hall, 1999; Hall, 2000). Con-sistency measure attempts to find a minimum num-ber of features that separate classes as consistently asthe full set of features can. An inconsistency is de-fined as two instances having the same Feature valuesbut different class labels. In Dash et al. (2000), dif-ferent search strategies, namely, exhaustive, heuristic,and random search, are combined with this evalua-tion measure to form different algorithms. The timecomplexity is exponential in terms of data dimension-ality for exhaustive search and quadratic for heuristicsearch. The complexity can be linear to the number ofiterations in a random search, but experiments showthat in order to find best Feature subset, the number ofiterations required is mostly at least quadratic to thenumber of features (Dash et al., 2000). In Hall (2000),a correlation measure is applied to evaluate the good-ness of Feature subsets based on the hypothesis that agood Feature subset is one that contains features highlycorrelated to the class, yet uncorrelated to each underlying algorithm, named CFS, also exploitsheuristic search.
10 Therefore, with quadratic or highertime complexity in terms of dimensionality, existingsubset search algorithms do not have strong scalabil-ity to deal with high dimensional overcome the problems of algorithms in both groupsand meet the demand for Feature Selection for highdimensional data, we develop a novel algorithm whichcan effectively identify both irrelevant and redundantfeatures with less time complexity than subset Correlation- based MeasuresIn this section, we discuss how to evaluate the good-ness of features for classification. In general, a fea-ture isgoodif it isrelevantto the class concept butis notredundantto any of the other relevant we adopt the correlation between two variables as agoodness measure, the above definition becomes thata Feature is good if it is highly correlated to the classbut not highly correlated to any of the other other words, if the correlation between a featureand the class is high enough to make it relevant to (orpredictive of) the class and the correlation between itand any other relevant features does not reach a levelso that it can be predicted by any of the other relevantfeatures, it will be regarded as a good Feature for theclassification task.