ℓ2,1-Norm Regularized Discriminative Feature Selection for Unsupervised Learning

Yi Yang1, Heng Tao Shen1, Zhigang Ma2, Zi Huang1, Xiaofang Zhou1
1 School of Information Technology & Electrical Engineering, The University of Queensland
2 Department of Information Engineering & Computer Science, University of Trento

Abstract

Compared with supervised learning for feature selection, it is much more difficult to select the discriminative features in unsupervised learning due to the lack of label information. Traditional unsupervised feature selection algorithms usually select the features which best preserve the data distribution, e.g., the manifold structure, of the whole feature set. Under the assumption that the class label of input data can be predicted by a linear classifier, we incorporate discriminative analysis and ℓ2,1-norm minimization into a joint framework for unsupervised feature selection. Different from existing unsupervised feature selection algorithms, our algorithm selects the most discriminative feature subset from the whole feature set in batch mode.

Extensive experiments on different data types demonstrate the effectiveness of our algorithm.

1 Introduction

In many areas, such as computer vision, pattern recognition and biological study, data are represented by high dimensional feature vectors. Feature selection aims to select a subset of features from the high dimensional feature set for a compact and accurate data representation. It plays a twofold role in improving the performance of data analysis. First, the dimension of the selected feature subset is much lower, making the subsequent computation on the input data more efficient. Second, the noisy features are eliminated, giving a better data representation and resulting in more accurate clustering and classification results. During recent years, feature selection has attracted much research attention, and several new feature selection algorithms have been proposed for a variety of applications. Feature selection algorithms can be roughly classified into two groups, i.e., supervised feature selection and unsupervised feature selection.

Supervised feature selection algorithms, e.g., Fisher score [Duda et al., 2001], robust regression [Nie et al., 2010], sparse multi-output regression [Zhao et al., 2010] and trace ratio [Nie et al., 2008], usually select features according to the labels of the training data. Because discriminative information is enclosed in the labels, supervised feature selection is usually able to select discriminative features. In unsupervised scenarios, however, no label information is directly available, making it much more difficult to select the discriminative features. A frequently used criterion in unsupervised learning is to select the features which best preserve the data similarity or manifold structure derived from the whole feature set [He et al., 2005; Zhao and Liu, 2007; Cai et al., 2010]. However, discriminative information is neglected, although it has been demonstrated to be important in data analysis [Fukunaga, 1990]. Most of the traditional supervised and unsupervised feature selection algorithms evaluate the importance of each feature individually [Duda et al., 2001; He et al., 2005; Zhao and Liu, 2007] and select features one by one.

A limitation is that the correlation among features is neglected [Zhao et al., 2010; Cai et al., 2010]. More recently, researchers have applied a two-step approach, i.e., spectral regression, to supervised and unsupervised feature selection [Zhao et al., 2010; Cai et al., 2010]. These efforts have shown that it is a better way to evaluate the importance of the selected features jointly. In this paper, we propose a new unsupervised feature selection algorithm which simultaneously exploits discriminative information and feature correlations. Because we utilize local discriminative information, the manifold structure is considered too. While [Zhao et al., 2010; Cai et al., 2010] also select features in batch mode, our algorithm is a one-step approach and it is able to select the discriminative features for unsupervised learning. We also propose an efficient algorithm to optimize the objective function.

2 The Objective Function

In this section, we give the objective function of the proposed Unsupervised Discriminative Feature Selection (UDFS) algorithm.

Later, in the next section, we propose an efficient algorithm to optimize the objective function. It is worth mentioning that UDFS aims to select the most discriminative features for data representation, where the manifold structure is considered as well, making it different from existing unsupervised feature selection algorithms. Denote {x1, x2, ..., xn} as the training set, where x_i ∈ R^d (1 ≤ i ≤ n) is the i-th datum and n is the total number of training data. In this paper, I is the identity matrix. For a constant m, 1_m ∈ R^m is a column vector with all of its elements being 1, and H_m = I − (1/m) 1_m 1_m^T ∈ R^{m×m} is the centering matrix. For an arbitrary matrix A ∈ R^{r×p}, its ℓ2,1-norm is defined as

    ‖A‖_{2,1} = Σ_{i=1}^{r} √( Σ_{j=1}^{p} A_{ij}^2 ).    (1)

Suppose the n training data x1, x2, ..., xn are sampled from c classes and there are n_i samples in the i-th class. We define y_i ∈ {0,1}^{c×1} (1 ≤ i ≤ n) as the label vector of x_i, whose j-th element is 1 if x_i belongs to the j-th class and 0 otherwise. Y = [y1, y2, ..., yn]^T ∈ {0,1}^{n×c} is the label matrix. The total scatter matrix S_t and the between-class scatter matrix S_b are defined as follows [Fukunaga, 1990].
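To make the notation concrete, here is a minimal NumPy sketch of the ℓ2,1-norm in (1) and the centering matrix H_m; the function names l21_norm and centering_matrix are ours, not from the paper:

    import numpy as np

    def l21_norm(A):
        # ||A||_{2,1}: sum over rows of the l2-norm of each row, as in Eq. (1)
        return np.sum(np.sqrt(np.sum(A ** 2, axis=1)))

    def centering_matrix(m):
        # H_m = I - (1/m) 1_m 1_m^T; right-multiplying a d x m data matrix by H_m
        # subtracts the mean sample from every column
        return np.eye(m) - np.ones((m, m)) / m

    A = np.array([[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]])
    print(l21_norm(A))                          # sqrt(5) + 0 + 5
    print(centering_matrix(4) @ np.ones(4))     # ~0 vector: H_m annihilates the constant vector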

S_t = Σ_{i=1}^{n} (x_i − μ)(x_i − μ)^T = X̃ X̃^T    (2)

S_b = Σ_{i=1}^{c} n_i (μ_i − μ)(μ_i − μ)^T = X̃ G G^T X̃^T    (3)

where μ is the mean of all samples, μ_i is the mean of the samples in the i-th class, n_i is the number of samples in the i-th class, X̃ = X H_n is the data matrix after being centered, and G = [G_1, ..., G_n]^T = Y (Y^T Y)^{−1/2} is the scaled label matrix. A well-known method to utilize discriminative information is to find a low dimensional subspace in which S_b is maximized while S_t is minimized [Fukunaga, 1990]. Recently, researchers have proposed two different algorithms that exploit local discriminative information [Sugiyama, 2006; Yang et al., 2010b] for classification and image clustering, demonstrating that local discriminative information is more important than the global one. Inspired by this, for each data point x_i we construct a local set N_k(x_i) comprising x_i and its k nearest neighbors x_{i1}, ..., x_{ik}. Denote X_i = [x_i, x_{i1}, ..., x_{ik}] as the local data matrix. Similar to (2) and (3), the local total scatter matrix S_t^{(i)} and the local between-class scatter matrix S_b^{(i)} of N_k(x_i) are defined as

S_t^{(i)} = X̃_i X̃_i^T;    (4)

S_b^{(i)} = X̃_i G_{(i)} G_{(i)}^T X̃_i^T,    (5)

where X̃_i = X_i H_{k+1} and G_{(i)} = [G_i, G_{i1}, ..., G_{ik}]^T.
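As a sketch of this construction (assuming X is a d×n NumPy array with samples stored column-wise; the helper name local_total_scatter is ours), the local set N_k(x_i), the centered local data matrix X̃_i and the local total scatter in (4) could be computed as follows:

    import numpy as np

    def local_total_scatter(X, i, k):
        # Returns the indices of N_k(x_i), the centered local matrix X~_i = X_i H_{k+1},
        # and the local total scatter S_t^(i) = X~_i X~_i^T of Eq. (4).
        dist = np.linalg.norm(X - X[:, [i]], axis=0)            # Euclidean distances to x_i
        neighbors = np.argsort(dist)[:k + 1]                    # x_i itself comes first (distance 0)
        Xi = X[:, neighbors]                                    # local data matrix X_i, d x (k+1)
        H = np.eye(k + 1) - np.ones((k + 1, k + 1)) / (k + 1)   # H_{k+1}
        Xi_tilde = Xi @ H
        return neighbors, Xi_tilde, Xi_tilde @ Xi_tilde.T

    X = np.random.RandomState(0).randn(5, 20)                   # toy data: d = 5, n = 20
    idx, Xi_tilde, St_i = local_total_scatter(X, i=3, k=4)
    print(idx, St_i.shape)                                      # neighbor indices and a 5 x 5 scatter matrix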

For ease of representation, we define the selection matrix S_i ∈ {0,1}^{n×(k+1)} as follows:

(S_i)_{pq} = 1 if p = F_i{q} (the q-th element of F_i), and 0 otherwise,    (6)

where F_i = {i, i1, ..., ik} is the index set of N_k(x_i). Because we are focusing on unsupervised learning, there is no label information available and G cannot be defined directly. In order to make use of local discriminative information, we assume there is a linear classifier W ∈ R^{d×c} which classifies each data point to a class, i.e., G_i = W^T x_i. Note that G_i, G_{i1}, ..., G_{ik} are the rows of G indexed by F_i, so we have

G_{(i)} = [G_i, G_{i1}, ..., G_{ik}]^T = S_i^T G = S_i^T X^T W.    (7)

It is worth noting that the proposed algorithm is an unsupervised one. In other words, G defined in (7) is the output of the algorithm, i.e., G_i = W^T x_i, and is not provided by human supervisors. If some rows of W shrink to zero, W can be regarded as the combination coefficients for the different features that best predict the class labels of the training data. Next, we give the approach which learns a discriminative W for feature selection.
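A small sketch of the selection matrix in (6) and of relation (7); the helper name selection_matrix and the toy index set F_i are ours, and W is just a random stand-in for the linear classifier:

    import numpy as np

    def selection_matrix(F_i, n):
        # S_i in {0,1}^{n x (k+1)} with (S_i)_{pq} = 1 iff p is the q-th element of F_i, Eq. (6)
        S = np.zeros((n, len(F_i)))
        S[F_i, np.arange(len(F_i))] = 1.0
        return S

    rng = np.random.RandomState(0)
    d, n, c = 5, 20, 3
    X = rng.randn(d, n)
    W = rng.randn(d, c)                      # stand-in linear classifier, so G = X^T W
    F_i = [3, 7, 1, 15, 9]                   # x_i followed by its k = 4 neighbors (toy indices)
    S_i = selection_matrix(F_i, n)
    G_i = S_i.T @ X.T @ W                    # Eq. (7): rows of G restricted to N_k(x_i)
    print(np.allclose(G_i, (X.T @ W)[F_i]))  # True: S_i simply selects those rows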

Inspired by [Fukunaga, 1990; Yang et al., 2010b], we define the local discriminative score DS_i of x_i as

DS_i = Tr[(S_t^{(i)} + λI)^{−1} S_b^{(i)}]
     = Tr[G_{(i)}^T X̃_i^T (X̃_i X̃_i^T + λI)^{−1} X̃_i G_{(i)}]
     = Tr[W^T X S_i X̃_i^T (X̃_i X̃_i^T + λI)^{−1} X̃_i S_i^T X^T W],    (8)

where λ is a parameter and λI is added to make the term (X̃_i X̃_i^T + λI) invertible. Clearly, a larger DS_i indicates that W has a higher discriminative ability w.r.t. the local set N_k(x_i). We aim to train a W corresponding to the highest discriminative scores for all the training data x_1, ..., x_n. Therefore we propose to minimize (9) for feature selection:

min_W  Σ_{i=1}^{n} { Tr[G_{(i)}^T H_{k+1} G_{(i)}] − DS_i } + γ ‖W‖_{2,1}.    (9)

Considering that the number of data in each local set is usually small, Tr[G_{(i)}^T H_{k+1} G_{(i)}] is added in (9) to avoid overfitting. The regularization term ‖W‖_{2,1} controls the capacity of W and also ensures that W is sparse in rows, making it particularly suitable for feature selection. Substituting DS_i in (9) by (8), the objective function of our UDFS is given by

min_{W^T W = I}  Σ_{i=1}^{n} Tr{ W^T X S_i H_{k+1} S_i^T X^T W − W^T X S_i X̃_i^T (X̃_i X̃_i^T + λI)^{−1} X̃_i S_i^T X^T W } + γ ‖W‖_{2,1},    (10)

where the orthogonal constraint is imposed to avoid arbitrary scaling and the trivial solution of all zeros.
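For intuition, a sketch of the score in (8), under the assumption that X, a candidate W with W^T W = I, the centered local matrix X̃_i and the selection matrix S_i are already available (the helper name local_discriminative_score is ours):

    import numpy as np

    def local_discriminative_score(X, W, Xi_tilde, S_i, lam):
        # DS_i = Tr[ W^T X S_i X~_i^T (X~_i X~_i^T + lam I)^{-1} X~_i S_i^T X^T W ], Eq. (8)
        d = Xi_tilde.shape[0]
        inv = np.linalg.inv(Xi_tilde @ Xi_tilde.T + lam * np.eye(d))
        A = W.T @ X @ S_i @ Xi_tilde.T @ inv @ Xi_tilde @ S_i.T @ X.T @ W
        return np.trace(A)

    # Toy usage with one local set of k = 4 neighbors.
    rng = np.random.RandomState(1)
    d, n, c, k, lam = 5, 20, 3, 4, 0.1
    X = rng.randn(d, n)
    W, _ = np.linalg.qr(rng.randn(d, c))               # random W with W^T W = I
    F_i = [0, 2, 5, 7, 11]                             # toy neighbor indices, x_i first
    S_i = np.zeros((n, k + 1))
    S_i[F_i, np.arange(k + 1)] = 1.0
    Xi_tilde = (X @ S_i) @ (np.eye(k + 1) - np.ones((k + 1, k + 1)) / (k + 1))
    print(local_discriminative_score(X, W, Xi_tilde, S_i, lam))

A larger value indicates that the columns of W separate the points of N_k(x_i) better, which is exactly what the minimization of (9) rewards.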

Note that the first term of (10) is equivalent to the following¹:

Tr{ W^T X { Σ_{i=1}^{n} [ S_i (H_{k+1} − X̃_i^T (X̃_i X̃_i^T + λI)^{−1} X̃_i) S_i^T ] } X^T W }.

Meanwhile, we have

H_{k+1} − X̃_i^T (X̃_i X̃_i^T + λI)^{−1} X̃_i
= H_{k+1} − H_{k+1} X̃_i^T (X̃_i X̃_i^T + λI)^{−1} X̃_i H_{k+1}
= H_{k+1} − H_{k+1} (X̃_i^T X̃_i + λI)^{−1} (X̃_i^T X̃_i + λI) X̃_i^T (X̃_i X̃_i^T + λI)^{−1} X̃_i H_{k+1}
= H_{k+1} − H_{k+1} (X̃_i^T X̃_i + λI)^{−1} X̃_i^T X̃_i H_{k+1}
= H_{k+1} − H_{k+1} (X̃_i^T X̃_i + λI)^{−1} (X̃_i^T X̃_i + λI − λI) H_{k+1}
= λ H_{k+1} (X̃_i^T X̃_i + λI)^{−1} H_{k+1}.

Therefore, the objective function of UDFS can be rewritten as

min_{W^T W = I}  Tr(W^T M W) + γ ‖W‖_{2,1},    (11)

where

M = X [ Σ_{i=1}^{n} S_i H_{k+1} (X̃_i^T X̃_i + λI)^{−1} H_{k+1} S_i^T ] X^T.    (12)

Denote w^i as the i-th row of W, i.e., W = [w^1, ..., w^d]^T. The objective function shown in (11) can also be written as

min_{W^T W = I}  Tr(W^T M W) + γ Σ_{i=1}^{d} ‖w^i‖_2.    (13)

We can see that many rows of the optimal W corresponding to (13) shrink to zeros². Consequently, for a datum x_i, x̂_i = W^T x_i is a new representation of x_i using only a small set of selected features. Alternatively, we can rank each feature f_i|_{i=1}^{d} according to ‖w^i‖_2 in descending order and select the top ranked features.

¹ It can also be interpreted from a regression viewpoint [Yang et al., 2010a].
² Usually, many rows of the optimal W are close to zero.
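Assembling M from (12) is mechanical once the local sets are fixed. A sketch follows (the helper name build_M is ours; nbrs is assumed to be a list of index lists, each with x_i first and its k neighbors after it):

    import numpy as np

    def build_M(X, nbrs, lam):
        # M = X [ sum_i S_i H_{k+1} (X~_i^T X~_i + lam I)^{-1} H_{k+1} S_i^T ] X^T, Eq. (12)
        d, n = X.shape
        k1 = len(nbrs[0])                                   # k + 1
        H = np.eye(k1) - np.ones((k1, k1)) / k1             # H_{k+1}
        inner = np.zeros((n, n))
        for F_i in nbrs:
            S_i = np.zeros((n, k1))
            S_i[F_i, np.arange(k1)] = 1.0                   # selection matrix S_i, Eq. (6)
            Xi_tilde = X[:, F_i] @ H                        # X~_i = X_i H_{k+1}
            B_i = np.linalg.inv(Xi_tilde.T @ Xi_tilde + lam * np.eye(k1))
            inner += S_i @ H @ B_i @ H @ S_i.T
        return X @ inner @ X.T                              # d x d, symmetric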

3 Optimization of UDFS Algorithm

The ℓ2,1-norm minimization problem has been studied in several previous works, such as [Argyriou et al., 2008; Nie et al., 2010; Obozinski et al., 2008; Liu et al., 2009; Zhao et al., 2010; Yang et al., 2011]. However, it remains unclear how to directly apply the existing algorithms to the optimization of our objective function, where the orthogonal constraint W^T W = I is imposed. In this section, inspired by [Nie et al., 2010], we give a new approach to solve the optimization problem shown in (11) for feature selection. We first describe the detailed approach in Algorithm 1.

Algorithm 1: The UDFS algorithm
1  For each data point x_i, construct the local data matrix X_i, the centered matrix X̃_i = X_i H_{k+1} and the selection matrix S_i;
2  Compute B_i = (X̃_i^T X̃_i + λI)^{−1};
3  Compute M_i = S_i H_{k+1} B_i H_{k+1} S_i^T;
4  Compute M = X [ Σ_{i=1}^{n} M_i ] X^T;
5  Set t = 0 and initialize D_0 ∈ R^{d×d} as an identity matrix;
6  repeat
7    P_t = M + γ D_t;
8    W_t = [p_1, ..., p_c], where p_1, ..., p_c are the eigenvectors of P_t corresponding to the first c smallest eigenvalues;
9    Update the diagonal matrix D_{t+1} as D_{t+1} = diag( 1/(2‖w_t^1‖_2), ..., 1/(2‖w_t^d‖_2) );
10   t = t + 1;
11 until convergence;
12 Sort each feature f_i|_{i=1}^{d} according to ‖w_t^i‖_2 in descending order and select the top ranked features.

Next, we briefly analyze Algorithm 1. From line 1 to line 4, it computes M defined in (12).
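The iterative part of Algorithm 1 (lines 5 to 12) can be sketched as below; it assumes M has already been formed as in (12), for instance with the build_M helper above, and uses a fixed iteration budget in place of an explicit convergence test:

    import numpy as np

    def udfs_iterations(M, c, gamma, n_iter=20):
        # Alternate between the eigen-problem on P_t = M + gamma D_t (lines 7-8)
        # and the row-norm reweighting of D_{t+1} (line 9).
        d = M.shape[0]
        D = np.eye(d)                                        # line 5: D_0 = I
        for _ in range(n_iter):
            P = M + gamma * D                                # line 7
            eigvals, eigvecs = np.linalg.eigh(P)             # eigenvalues in ascending order
            W = eigvecs[:, :c]                               # line 8: c smallest eigenvalues
            row_norms = np.linalg.norm(W, axis=1) + 1e-12    # guard against division by zero
            D = np.diag(1.0 / (2.0 * row_norms))             # line 9
        ranking = np.argsort(-np.linalg.norm(W, axis=1))     # line 12: rank features by ||w^i||_2
        return W, ranking

    # Toy usage with a random symmetric positive semi-definite M.
    rng = np.random.RandomState(0)
    A = rng.randn(6, 6)
    W, ranking = udfs_iterations(A @ A.T, c=2, gamma=0.5)
    print(ranking)                                           # feature indices, most relevant first

Taking eigenvectors keeps W^T W = I at every step, and the alternating structure follows the reweighting idea the text attributes to [Nie et al., 2010].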

