Transcription of Principal Components Analysis
1 Chapter 18 Principal Components AnalysisPrincipal Components Analysis (PCA) is one of a family of techniques for takinghigh-dimensional data, and using the dependencies between the variables to representit in a more tractable, lower-dimensional form, without losing too much is one of the simplest and most robust ways of doing suchdimensionalityreduction. It is also one of the oldest, and has been rediscovered many times inmany fields, so it is also known as the Karhunen-Lo ve transformation, the Hotellingtransformation, the method of empirical orthogonal functions, and singular valuedecomposition1. We will call it Mathematics of Principal ComponentsWe start withp-dimensional vectors, and want to summarize them by projectingdown into aq-dimensional subspace. Our summary will be the projection of theoriginal vectors on toqdirections, theprincipal Components , which span the are several equivalent ways of deriving the Principal Components mathe-matically.
2 The simplest one is by finding the projections which maximize the vari-ance. The first Principal component is the direction in space along which projectionshave the largest variance. The second Principal component is the direction whichmaximizes variance among all directions orthogonal to the first. Thekthcomponentis the variance-maximizing direction orthogonal to the previousk 1 arepprincipal Components in than maximizing variance, it might sound more plausible to look for theprojection with the smallest average (mean-squared) distance between the originalvectors and their projections on to the Principal Components ; this turns out to beequivalent to maximizing the , assume that the data have been centered , so that every variablehas mean 0. If we write the centered data in a matrixx, where rows are objects and1 Strictly speaking, singular value decomposition is a matrix algebra trick which is used in the mostcommon algorithm for 18.
3 Principal Components Analysis columns are variables, thenxTx=nv, wherevis the covariance matrix of the data.(You should check that last statement!) Minimizing Projection ResidualsWe ll start by looking for a one-dimensional projection. That is, we havep-dimensionalvectors, and we want to project them on to a line through the origin. We can specifythe line by a unit vector along it, w, and then the projection of a data vector xion tothe line is xi w, which is a scalar. (Sanity check: this gives us the right answer whenwe project on to one of the coordinate axes.) This is the distance of the projectionfrom the origin; the actual coordinate inp-dimensional space is( xi w) w. The meanof the projections will be zero, because the mean of the vectors xiis zero:1nn i=1( xi w) w= 1nn i=1xi w w( )If we try to use our projected orimagevectors instead of our original vectors,there will be some error, because (in general) the images do not coincide with theoriginal vectors.
4 (When do they coincide?) The difference is the error orresidualofthe projection. How big is it? For any one vector , say xi, it s xi ( w xi) w 2= xi ( w xi) w xi ( w xi) w ( )= xi xi xi ( w xi) w( ) ( w xi) w xi+( w xi) w ( w xi) w= xi 2 2( w xi)2+( w xi)2 w w( )= xi xi ( w xi)2( )since w w= w 2=1. Add those residuals up across all the vectors:MSE( w)=1nn i=1 xi 2 ( w xi)2( )=1n n i=1 xi 2 n i=1( w xi)2 ( )The first summation doesn t depend on w, so it doesn t matter for trying to minimizethe mean squared residual. To make the MSE small, what we must do is make thesecond sum big, , we want to maximize1nn i=1( w xi)2( )which we can see is the sample mean of( w xi)2. The mean of a square is always equalto the square of the mean plus the variance:1nn i=1( w xi)2= 1nn i=1 xi w 2+Var w xi ( ) MATHEMATICS OF Principal COMPONENTS353 Since we ve just seen that the mean of the projections is zero, minimizing the residualsum of squares turns out to be equivalent to maximizing the variance of the projec-tions.
5 (Of course in general we don t want to project on to just one vector , but on tomultiple Principal Components . If those Components are orthogonal and have theunit vectors w1, w2,.. wk, then the image ofxiis its projection into the space spannedby these vectors,k j=1( xi wj) wj( )The mean of the projection on to each component is still zero. If we go through thesame algebra for the mean squared error, it turns[Exercise 2]out that the cross-termsbetween different Components all cancel out, and we are left with trying to maximizethe sum of the variances of the projections on to the Components .) Maximizing VarianceAccordingly, let s maximize the variance! Writing out all the summations grows te-dious, so let s do our algebra in matrix form. If we stack ourndata vectors into ann pmatrix,x, then the projections are given byxw, which is ann 1 matrix. Thevariance is 2 w=1n i xi w 2( )=1n(xw)T(xw)( )=1nwTxTxw( )=wTxTxnw( )=wTvw( )We want to chose a unit vector wso as to maximize 2 w.
6 To do this, we need tomake sure that we only look at unit vectors we need to constrain the constraint is that w w=1, orwTw=1. As explained in Appendix D, wecan do this by introducing a new variable, theLagrange multiplier , adding times the constraint equation to our objective function, and doing an unconstrainedoptimization. For our projection problem, (w, ) 2w (wTw 1)( ) L =wTw 1( ) L w=2vw 2 w( )354 CHAPTER 18. Principal Components ANALYSISS etting the derivatives to zero at the optimum, we getwTw=1( )vw= w( )Thus, desired vectorwis aneigenvectorof the covariance matrixv, and the maxi-mizing vector will be the one associated with the largesteigenvalue . This is goodnews, because finding eigenvectors is something which can be done comparativelyrapidly, and because eigenvectors have many nice mathematical properties, which wecan use as know thatvis ap pmatrix, so it will havepdifferent thatvis a covariance matrix, so it is symmetric, and then linear algebra tellsus that the eigenvectors must be orthogonal to one another.
7 Again becausevis acovariance matrix, it is apositive matrix, in the sense that x v x 0 for any x. Thistells us that the eigenvalues ofvmust all be eigenvectors ofvare theprincipal componentsof the data. We know thatthey are all orthogonal top each other from the previous paragraph, so together theyspan the wholep-dimensional space. The first Principal component, the eigen- vector which goes the largest value of , is the direction along which the data havethe most variance. The second Principal component, the second eigenvector, isthe direction orthogonal to the first component with the most variance. Because itis orthogonal to the first eigenvector, their projections will be uncorrelated. In fact,projections on to all the Principal Components are uncorrelated with each other. Ifwe useqprincipal Components , our weight matrixwwill be ap qmatrix, whereeach column will be a different eigenvector of the covariance matrixv.
8 The eigen-values will give the total variance described by each component. The variance of theprojections on to the firstqprincipal Components is then qi=1 More Geometry; Back to the ResidualsSuppose that the data really areq-dimensional. Thenvwill have onlyqpositiveeigenvalues, andp qzero eigenvalues. If the data fall near aq-dimensional subspace,thenp qof the eigenvalues will be nearly we pick the topqcomponents, we can define a projection operatorPq. Theimages of the data are thenxPq. Theprojection residualsarex xPqorx(I Pq).(Notice that the residuals here are vectors, not just magnitudes.) If the data reallyareq-dimensional, then the residuals will be zero. If the data areapproximately q-dimensional, then the residuals will be small. In any case, we can define theR2of theprojection as the fraction of the original variance kept by the image vectors,R2 qi=1 i pj=1 j( )just as theR2of a linear regression is the fraction of the original variance of thedependent variable kept by the fitted : ifn<p, there are onlyndistinct eigenvectors and MATHEMATICS OF Principal COMPONENTS355 Theq=1 case is especially instructive.
9 We know that the residual vectors are allorthogonal to the projections. Suppose we ask for the first Principal component ofthe residuals. This will be the direction of largest variance which is perpendicular tothe first Principal component. In other words, it will be the second Principal com-ponent of the data. This suggests a recursive algorithm for finding all the principalcomponents: thekthprincipal component is the leading component of the residu-als after subtracting off the firstk 1 Components . In practice, it is faster to useeigenvector-solvers to get all the Components at once fromv, but this idea is correctin is a good place to remark that if the data really fall in aq-dimensional sub-space, thenvwill have onlyqpositive eigenvalues, because after subtracting off thosecomponents there will be no residuals. The otherp qeigenvectors will all haveeigenvalue 0. If the data cluster around aq-dimensional subspace, thenp qof theeigenvalues will be very small, though how small they need to be before we can ne-glect them is a tricky on to the first two or three Principal Components can be visualized;however they may not be enough to really give a good summary of the data.
10 Usually,to get anR2of 1, you need to use allpprincipal many principalcomponents you should use depends on your data, and how big anR2you need. Insome fields, you can get better than 80% of the variance described with just two orthree Components . A sometimes-useful device is to plot 1 R2versus the number ofcomponents, and keep extending the curve it until it flattens Statistical Inference, or NotYou may have noticed, and even been troubled by, the fact that I have said nothingat all yet like assume the data are drawn at random from some distribution , or assume the different rows of the data frame are statistically independent . This isbecause no such assumption is required for Principal does is say these data can be summarized using projections along these directions . It says noth-ing about the larger population or stochastic process the data came from; it doesn teven suppose the latter , we couldadda statistical assumption and see how PCA behaves underthose conditions.