
Principal Components Analysis


Chapter 18: Principal Components Analysis

Principal Components Analysis (PCA) is one of a family of techniques for taking high-dimensional data, and using the dependencies between the variables to represent it in a more tractable, lower-dimensional form, without losing too much information. PCA is one of the simplest and most robust ways of doing such dimensionality reduction. It is also one of the oldest, and has been rediscovered many times in many fields, so it is also known as the Karhunen-Loève transformation, the Hotelling transformation, the method of empirical orthogonal functions, and singular value decomposition [1]. We will call it PCA.

[1] Strictly speaking, singular value decomposition is a matrix algebra trick which is used in the most common algorithm for PCA.

Mathematics of Principal Components

We start with p-dimensional vectors, and want to summarize them by projecting down into a q-dimensional subspace.

Our summary will be the projection of the original vectors on to q directions, the principal components, which span the subspace. There are several equivalent ways of deriving the principal components mathematically. The simplest one is by finding the projections which maximize the variance. The first principal component is the direction in space along which projections have the largest variance. The second principal component is the direction which maximizes variance among all directions orthogonal to the first. The k-th component is the variance-maximizing direction orthogonal to the previous $k-1$ components. There are p principal components in all.

Rather than maximizing variance, it might sound more plausible to look for the projection with the smallest average (mean-squared) distance between the original vectors and their projections on to the principal components; this turns out to be equivalent to maximizing the variance.

Throughout, assume that the data have been centered, so that every variable has mean 0.

If we write the centered data in a matrix x, where rows are objects and columns are variables, then $\mathbf{x}^T\mathbf{x} = n\mathbf{v}$, where $\mathbf{v}$ is the covariance matrix of the data. (You should check that last statement!)

Minimizing Projection Residuals

We'll start by looking for a one-dimensional projection. That is, we have p-dimensional vectors, and we want to project them on to a line through the origin. We can specify the line by a unit vector along it, $\vec{w}$, and then the projection of a data vector $\vec{x}_i$ on to the line is $\vec{x}_i \cdot \vec{w}$, which is a scalar. (Sanity check: this gives us the right answer when we project on to one of the coordinate axes.)
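As a quick check of the claim that $\mathbf{x}^T\mathbf{x} = n\mathbf{v}$ for centered data, and of the fact that projecting on to a unit vector gives one scalar per data vector, here is a minimal NumPy sketch. It is not part of the original text; the synthetic data, the seed, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
x = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated synthetic data

# Center the data so every column has mean 0.
x = x - x.mean(axis=0)

# Check the claim x^T x = n v, where v is the covariance matrix.
# np.cov divides by n-1 by default, so bias=True asks for the 1/n normalization.
v = np.cov(x, rowvar=False, bias=True)
assert np.allclose(x.T @ x, n * v)

# Project every data vector on to a line through the origin given by a unit vector w.
w = rng.normal(size=p)
w = w / np.linalg.norm(w)
projections = x @ w          # one scalar per data vector
print(projections.shape)     # (200,)
```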

The scalar $\vec{x}_i \cdot \vec{w}$ is the distance of the projection from the origin; the actual coordinate in p-dimensional space is $(\vec{x}_i \cdot \vec{w})\,\vec{w}$. The mean of the projections will be zero, because the mean of the vectors $\vec{x}_i$ is zero:

$$\frac{1}{n}\sum_{i=1}^{n} (\vec{x}_i \cdot \vec{w})\,\vec{w} = \left(\left(\frac{1}{n}\sum_{i=1}^{n} \vec{x}_i\right) \cdot \vec{w}\right)\vec{w}$$

If we try to use our projected or image vectors instead of our original vectors, there will be some error, because (in general) the images do not coincide with the original vectors. (When do they coincide?) The difference is the error or residual of the projection. How big is it? For any one vector, say $\vec{x}_i$, it is

$$\begin{aligned}
\left\|\vec{x}_i - (\vec{w}\cdot\vec{x}_i)\,\vec{w}\right\|^2 &= \left(\vec{x}_i - (\vec{w}\cdot\vec{x}_i)\,\vec{w}\right)\cdot\left(\vec{x}_i - (\vec{w}\cdot\vec{x}_i)\,\vec{w}\right) \\
&= \vec{x}_i\cdot\vec{x}_i - \vec{x}_i\cdot(\vec{w}\cdot\vec{x}_i)\,\vec{w} - (\vec{w}\cdot\vec{x}_i)\,\vec{w}\cdot\vec{x}_i + (\vec{w}\cdot\vec{x}_i)\,\vec{w}\cdot(\vec{w}\cdot\vec{x}_i)\,\vec{w} \\
&= \left\|\vec{x}_i\right\|^2 - 2(\vec{w}\cdot\vec{x}_i)^2 + (\vec{w}\cdot\vec{x}_i)^2\,\vec{w}\cdot\vec{w} \\
&= \vec{x}_i\cdot\vec{x}_i - (\vec{w}\cdot\vec{x}_i)^2
\end{aligned}$$

since $\vec{w}\cdot\vec{w} = \|\vec{w}\|^2 = 1$.
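A short numerical check of that identity, for a random vector and a random unit direction; this is a sketch rather than anything from the text, and the particular vectors are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
x_i = rng.normal(size=p)                 # a single (centered) data vector
w = rng.normal(size=p)
w = w / np.linalg.norm(w)                # unit vector along the projection line

residual = x_i - (w @ x_i) * w           # x_i minus its image on the line
lhs = residual @ residual                # squared length of the residual
rhs = x_i @ x_i - (w @ x_i) ** 2         # the simplified form from the algebra above
assert np.isclose(lhs, rhs)
```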

Add those residuals up across all the vectors:

$$MSE(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n} \left(\|\vec{x}_i\|^2 - (\vec{w}\cdot\vec{x}_i)^2\right) = \frac{1}{n}\left(\sum_{i=1}^{n} \|\vec{x}_i\|^2 - \sum_{i=1}^{n} (\vec{w}\cdot\vec{x}_i)^2\right)$$

The first summation doesn't depend on $\vec{w}$, so it doesn't matter for trying to minimize the mean squared residual. To make the MSE small, what we must do is make the second sum big. That is, we want to maximize

$$\frac{1}{n}\sum_{i=1}^{n} (\vec{w}\cdot\vec{x}_i)^2$$

which we can see is the sample mean of $(\vec{w}\cdot\vec{x}_i)^2$. The mean of a square is always equal to the square of the mean plus the variance:

$$\frac{1}{n}\sum_{i=1}^{n} (\vec{w}\cdot\vec{x}_i)^2 = \left(\frac{1}{n}\sum_{i=1}^{n} \vec{x}_i\cdot\vec{w}\right)^2 + \mathrm{Var}\!\left[\vec{w}\cdot\vec{x}_i\right]$$

Since we've just seen that the mean of the projections is zero, minimizing the residual sum of squares turns out to be equivalent to maximizing the variance of the projections.
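The equivalence is easy to see numerically: across many candidate unit vectors, the direction with the smallest mean squared residual is also the direction with the largest projection variance. A small sketch under assumed synthetic data; the helper names mse and proj_var are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3
x = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
x = x - x.mean(axis=0)                       # centered data

def mse(w):
    """Mean squared residual of projecting the rows of x on to the line along w."""
    proj = np.outer(x @ w, w)                # images (x_i . w) w
    return np.mean(np.sum((x - proj) ** 2, axis=1))

def proj_var(w):
    """Sample variance of the scalar projections x_i . w."""
    return np.var(x @ w)

# Over many random unit directions, minimizing MSE and maximizing variance
# pick out the same direction.
ws = rng.normal(size=(1000, p))
ws = ws / np.linalg.norm(ws, axis=1, keepdims=True)
mses = np.array([mse(w) for w in ws])
vars_ = np.array([proj_var(w) for w in ws])
assert mses.argmin() == vars_.argmax()
```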

(Of course in general we don't want to project on to just one vector, but on to multiple principal components. If those components are orthogonal and have the unit vectors $\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_k$, then the image of $\vec{x}_i$ is its projection into the space spanned by these vectors,

$$\sum_{j=1}^{k} (\vec{x}_i\cdot\vec{w}_j)\,\vec{w}_j$$

The mean of the projection on to each component is still zero. If we go through the same algebra for the mean squared error, it turns out [Exercise 2] that the cross-terms between different components all cancel out, and we are left with trying to maximize the sum of the variances of the projections on to the components.)
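In matrix form, the image of every data vector under projection on to k orthonormal directions is x w w^T, with the directions as the columns of w. A small NumPy illustration; the orthonormal directions here come from a QR decomposition of a random matrix, purely as a stand-in for actual principal components.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 100, 5, 2
x = rng.normal(size=(n, p))
x = x - x.mean(axis=0)

# Any k orthonormal directions will do for illustrating the formula.
w, _ = np.linalg.qr(rng.normal(size=(p, k)))   # columns w_1 ... w_k, orthonormal

# Image of each x_i: sum_j (x_i . w_j) w_j, i.e. x @ w @ w.T in matrix form.
scores = x @ w                 # (n, k) matrix of projections x_i . w_j
images = scores @ w.T          # (n, p) matrix whose rows are the images

# The mean of the projection on to each component is still zero.
assert np.allclose(scores.mean(axis=0), 0.0)
```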

Maximizing Variance

Accordingly, let's maximize the variance! Writing out all the summations grows tedious, so let's do our algebra in matrix form. If we stack our n data vectors into an $n \times p$ matrix, x, then the projections are given by $\mathbf{x}\mathbf{w}$, which is an $n \times 1$ matrix. The variance is

$$\sigma^2_{\vec{w}} = \frac{1}{n}\sum_{i} (\vec{x}_i\cdot\vec{w})^2 = \frac{1}{n}(\mathbf{x}\mathbf{w})^T(\mathbf{x}\mathbf{w}) = \frac{1}{n}\mathbf{w}^T\mathbf{x}^T\mathbf{x}\mathbf{w} = \mathbf{w}^T\frac{\mathbf{x}^T\mathbf{x}}{n}\mathbf{w} = \mathbf{w}^T\mathbf{v}\mathbf{w}$$

We want to choose a unit vector $\vec{w}$ so as to maximize $\sigma^2_{\vec{w}}$.
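As a sanity check on the matrix form, the sample variance of the projections and the quadratic form $\mathbf{w}^T\mathbf{v}\mathbf{w}$ agree. A sketch with assumed synthetic, centered data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 4
x = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
x = x - x.mean(axis=0)
v = x.T @ x / n                      # covariance matrix (1/n convention, centered data)

w = rng.normal(size=p)
w = w / np.linalg.norm(w)

# sigma^2_w computed two ways: directly from the projections, and as w^T v w.
assert np.isclose(np.mean((x @ w) ** 2), w @ v @ w)
```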

To maximize over unit vectors only, we need to constrain the maximization. The constraint is that $\vec{w}\cdot\vec{w} = 1$, or $\mathbf{w}^T\mathbf{w} = 1$. As explained in Appendix D, we can do this by introducing a new variable, the Lagrange multiplier $\lambda$, adding $\lambda$ times the constraint equation to our objective function, and doing an unconstrained optimization. For our projection problem,

$$\mathcal{L}(\mathbf{w},\lambda) \equiv \sigma^2_{\vec{w}} - \lambda\left(\mathbf{w}^T\mathbf{w} - 1\right)$$
$$\frac{\partial \mathcal{L}}{\partial \lambda} = \mathbf{w}^T\mathbf{w} - 1$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\mathbf{v}\mathbf{w} - 2\lambda\mathbf{w}$$

Setting the derivatives to zero at the optimum, we get

$$\mathbf{w}^T\mathbf{w} = 1$$
$$\mathbf{v}\mathbf{w} = \lambda\mathbf{w}$$

Thus, the desired vector $\mathbf{w}$ is an eigenvector of the covariance matrix $\mathbf{v}$, and the maximizing vector will be the one associated with the largest eigenvalue $\lambda$. This is good news, because finding eigenvectors is something which can be done comparatively rapidly, and because eigenvectors have many nice mathematical properties, which we can use as follows.

We know that $\mathbf{v}$ is a $p \times p$ matrix, so it will have p different eigenvectors. We know that $\mathbf{v}$ is a covariance matrix, so it is symmetric, and then linear algebra tells us that the eigenvectors must be orthogonal to one another.
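In code, this is exactly how the components can be found: form the covariance matrix and take its eigendecomposition. A minimal NumPy sketch on assumed synthetic data; np.linalg.eigh is used because v is symmetric, and it returns eigenvalues in ascending order, so we re-sort them.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 4
x = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
x = x - x.mean(axis=0)
v = x.T @ x / n

eigvals, eigvecs = np.linalg.eigh(v)         # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]            # sort into descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

w1 = eigvecs[:, 0]                           # first principal component
assert np.allclose(v @ w1, eigvals[0] * w1)  # v w = lambda w
assert np.isclose(w1 @ w1, 1.0)              # unit length

# Its projection variance w^T v w equals the largest eigenvalue.
assert np.isclose(w1 @ v @ w1, eigvals[0])
```

In practice, widely used implementations obtain the same components through the singular value decomposition of x, as the footnote above notes.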

Again because $\mathbf{v}$ is a covariance matrix, it is a positive matrix, in the sense that $\vec{x}\cdot\mathbf{v}\vec{x} \ge 0$ for any $\vec{x}$. This tells us that the eigenvalues of $\mathbf{v}$ must all be $\ge 0$.

The eigenvectors of $\mathbf{v}$ are the principal components of the data. We know that they are all orthogonal to each other from the previous paragraph, so together they span the whole p-dimensional space. The first principal component, the eigenvector which goes with the largest value of $\lambda$, is the direction along which the data have the most variance. The second principal component, the second eigenvector, is the direction orthogonal to the first component with the most variance. Because it is orthogonal to the first eigenvector, their projections will be uncorrelated.
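Both facts are easy to confirm numerically: the eigenvalues come out non-negative (up to round-off), and the sample covariance matrix of the projections on to the components is diagonal. A sketch under assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 1000, 5
x = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
x = x - x.mean(axis=0)
v = x.T @ x / n

eigvals, eigvecs = np.linalg.eigh(v)
assert np.all(eigvals >= -1e-10)             # eigenvalues are all >= 0, up to round-off

# Projections on to different principal components are uncorrelated:
# the sample covariance of the score columns is (numerically) diagonal.
scores = x @ eigvecs
score_cov = scores.T @ scores / n
off_diag = score_cov - np.diag(np.diag(score_cov))
assert np.allclose(off_diag, 0.0, atol=1e-8)
```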

In fact, projections on to all the principal components are uncorrelated with each other. If we use q principal components, our weight matrix $\mathbf{w}$ will be a $p \times q$ matrix, where each column will be a different eigenvector of the covariance matrix $\mathbf{v}$. The eigenvalues will give the total variance described by each component. The variance of the projections on to the first q principal components is then $\sum_{i=1}^{q} \lambda_i$.

More Geometry; Back to the Residuals

Suppose that the data really are q-dimensional. Then $\mathbf{v}$ will have only q positive eigenvalues, and $p-q$ zero eigenvalues. If the data fall near a q-dimensional subspace, then $p-q$ of the eigenvalues will be nearly zero.

If we pick the top q components, we can define a projection operator $\mathbf{P}_q$.
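A sketch of what this looks like for data generated near a q-dimensional subspace: only q eigenvalues are appreciably different from zero, the operator $\mathbf{P}_q = \mathbf{w}_q\mathbf{w}_q^T$ built from the top q eigenvectors is a genuine projection (symmetric and idempotent), and the top q eigenvalues sum to the variance those components describe. The data-generating setup and the variable names are assumptions made for illustration, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, q = 500, 6, 2

# Data that lie close to a q-dimensional subspace, plus a little noise.
latent = rng.normal(size=(n, q))
loadings = rng.normal(size=(q, p))
x = latent @ loadings + 0.05 * rng.normal(size=(n, p))
x = x - x.mean(axis=0)
v = x.T @ x / n

eigvals, eigvecs = np.linalg.eigh(v)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals)                        # only q of these are far from zero

w_q = eigvecs[:, :q]                  # p x q weight matrix of the top q components
P_q = w_q @ w_q.T                     # projection operator on to their span

# A projection operator is symmetric and idempotent.
assert np.allclose(P_q, P_q.T)
assert np.allclose(P_q @ P_q, P_q)

# Variance captured by the first q components is the sum of the top q eigenvalues.
captured = eigvals[:q].sum()
print(captured / eigvals.sum())       # fraction of the total variance described
```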

