
A Simple Explanation of Partial Least Squares


Kee Siong Ng
April 27, 2013

1 Introduction

Partial Least Squares (PLS) is a widely used technique in chemometrics, especially in the case where the number of independent variables is significantly larger than the number of data points. There are many articles on PLS [HTF01, GK86] but the mathematical details of PLS do not always come out clearly in these treatments. This paper is an attempt to describe PLS in precise and simple mathematical terms.

2 Notation and Terminology

Definition. Let $A = [x_1 \; \cdots \; x_m]$ be an $n \times m$ matrix. The mean-centered matrix $B := [x_1 - \bar{x}_1 \; \cdots \; x_m - \bar{x}_m]$, where $\bar{x}_i$ is the mean value for $x_i$, has zero sample mean.

We will mostly work with mean-centered matrices in this paper. Suppose $X$ is a mean-centered $n \times m$ matrix and $Y$ is a mean-centered $n \times p$ matrix. The sample covariance matrix of $X$ and $Y$ is given by
$$\mathrm{cov}(X, Y) := \frac{1}{n-1} X^T Y.$$
The variance of $X$ is given by $\mathrm{var}(X) := \mathrm{cov}(X, X)$. (The reason the sample covariance matrix has $n - 1$ in the denominator rather than $n$ is to correct for the fact that we are using the sample mean instead of the true population mean to do the centering.) $S := \mathrm{var}(X)$ is symmetric. The diagonal entry $S_{j,j}$ is called the variance of $x_j$. The total variance of the data in $X$ is given by the trace of $S$: $\mathrm{tr}(S) = \sum_j S_{j,j}$. The value $S_{i,j}$, $i \neq j$, is called the covariance of $x_i$ and $x_j$.
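To make these definitions concrete, here is a minimal NumPy sketch (an illustration added here, not part of the original text; the random data and variable names are mine) of mean-centering and of the sample covariance, variance and total variance:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))    # n = 50 observations, m = 3 variables
    Y = rng.normal(size=(50, 2))    # p = 2 response variables

    Xc = X - X.mean(axis=0)         # mean-centre each column
    Yc = Y - Y.mean(axis=0)

    n = Xc.shape[0]
    cov_XY = Xc.T @ Yc / (n - 1)    # cov(X, Y) = (1 / (n - 1)) X^T Y, an m x p matrix
    S = Xc.T @ Xc / (n - 1)         # var(X) = cov(X, X)
    total_variance = np.trace(S)    # tr(S), the total variance of the data in X

    # Sanity check against NumPy's own estimator (it also uses the n - 1 convention).
    assert np.allclose(S, np.cov(X, rowvar=False))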

The correlation between $X$ and $Y$ is defined by
$$\mathrm{corr}(X, Y) := \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)}\,\sqrt{\mathrm{var}(Y)}}. \tag{1}$$

3 Problems with Ordinary Least Squares

To understand the motivation for using PLS in high-dimensional chemometrics data, it is important to understand how and why ordinary least squares fails in the case where we have a large number of independent variables and they are highly correlated. Readers who are already familiar with this topic can skip to the next section.

Given a design matrix $X$ and the response vector $y$, the least squares estimate of the parameter $\beta$ in the linear model $y = X\beta + \epsilon$ is given by the normal equation
$$\hat{\beta} = (X^T X)^{-1} X^T y. \tag{2}$$
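A small sketch of the normal equation in NumPy (the synthetic data is illustrative only, not from the paper); in practice one solves the linear system or calls a least-squares routine rather than forming the inverse explicitly:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 4))                    # design matrix
    beta_true = np.array([1.0, -2.0, 0.5, 3.0])
    y = X @ beta_true + 0.1 * rng.normal(size=100)   # synthetic response

    # Normal equation (2): solve (X^T X) beta = X^T y instead of inverting X^T X.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Equivalent, numerically preferable route via a least-squares solver.
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    assert np.allclose(beta_hat, beta_lstsq)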

Fact. The simplest case of linear regression yields some geometric intuition on the coefficient. Suppose we have a univariate model with no intercept:
$$y = x\beta + \epsilon.$$
Then the least-squares estimate $\hat{\beta}$ of $\beta$ is given by
$$\hat{\beta} = \frac{\langle x, y \rangle}{\langle x, x \rangle}.$$
This is easily seen as a special case of (2). It is also easily established by differentiating $\sum_i (y_i - x_i\beta)^2$ with respect to $\beta$ and solving the resulting KKT equations. Geometrically, $\hat{y} = \frac{\langle x, y \rangle}{\langle x, x \rangle} x$ is the projection of $y$ onto the vector $x$.
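As a quick numerical illustration of this fact (the numbers below are made up for the example), the estimate and the projection can be computed directly:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 7.8])

    beta_hat = np.dot(x, y) / np.dot(x, x)   # <x, y> / <x, x>
    y_hat = beta_hat * x                     # projection of y onto the line spanned by x

    # The residual y - y_hat is orthogonal to x (up to floating-point error).
    print(beta_hat, np.dot(y - y_hat, x))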

4 Regression by Successive Orthogonalisation

The problem with using ordinary least squares on high-dimensional data is clearly brought out in a linear regression procedure called Regression by Successive Orthogonalisation. This section is built on the material covered in [HTF01].

Definition. Two vectors $u$ and $v$ are said to be orthogonal if $\langle u, v \rangle = 0$; i.e., the vectors are perpendicular. A set of vectors is said to be orthogonal if every pair of (non-identical) vectors from the set is orthogonal. A matrix is orthogonal if the set of its column vectors is orthogonal.

Fact. For an orthogonal matrix $X$, we have $X^{-1} = X^T$.

Orthogonalisation of a matrix $X = [x_1 \; x_2 \; \cdots \; x_m]$ can be done using the Gram-Schmidt process. Writing
$$\mathrm{proj}_u(v) := \frac{\langle u, v \rangle}{\langle u, u \rangle} u,$$
the procedure transforms $X$ into an orthogonal matrix $U = [u_1 \; u_2 \; \cdots \; u_m]$ via these steps:
$$u_1 := x_1$$
$$u_2 := x_2 - \mathrm{proj}_{u_1}(x_2)$$
$$u_3 := x_3 - \mathrm{proj}_{u_1}(x_3) - \mathrm{proj}_{u_2}(x_3)$$
$$\vdots$$

$$u_m := x_m - \sum_{j=1}^{m-1} \mathrm{proj}_{u_j}(x_m)$$
The Gram-Schmidt process also gives us the $QR$ factorisation of $X$, where $Q$ is made up of the orthogonal $u_i$ vectors normalised to unit vectors as necessary, and the upper triangular $R$ matrix is obtained from the $\mathrm{proj}_{u_i}(x_j)$ coefficients. Gram-Schmidt is known to be numerically unstable; a better procedure for doing orthogonalisation and $QR$ factorisation is the Householder transformation. Householder is the dual of Gram-Schmidt in the following sense: Gram-Schmidt computes $Q$ and gets $R$ as a side product; Householder computes $R$ and gets $Q$ as a side product [GBGL08].
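The following sketch (mine, not the paper's) implements the classical Gram-Schmidt process above and shows how normalising the resulting columns yields a QR factorisation, with NumPy's Householder-based qr routine as the stable reference:

    import numpy as np

    def proj(u, v):
        # proj_u(v) = (<u, v> / <u, u>) u
        return (np.dot(u, v) / np.dot(u, u)) * u

    def gram_schmidt(X):
        # Classical Gram-Schmidt: returns U whose columns are mutually orthogonal
        # and span the same space as the columns of X.
        n, m = X.shape
        U = np.zeros((n, m))
        for j in range(m):
            u = X[:, j].copy()
            for i in range(j):
                u -= proj(U[:, i], X[:, j])
            U[:, j] = u
        return U

    rng = np.random.default_rng(2)
    X = rng.normal(size=(6, 3))
    U = gram_schmidt(X)

    # Normalising the columns of U gives the Q of a QR factorisation of X.
    Q = U / np.linalg.norm(U, axis=0)
    R = Q.T @ X                          # upper triangular up to rounding error
    print(np.allclose(Q @ R, X))         # True

    # NumPy's qr uses Householder reflections, the numerically stable alternative.
    Q_h, R_h = np.linalg.qr(X)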

Fact. If the column vectors of the design matrix $X = [x_1 \; x_2 \; \cdots \; x_m]$ form an orthogonal set, then it follows from (2) that
$$\hat{\beta}^T = \left[ \frac{\langle x_1, y \rangle}{\langle x_1, x_1 \rangle} \;\; \frac{\langle x_2, y \rangle}{\langle x_2, x_2 \rangle} \;\; \cdots \;\; \frac{\langle x_m, y \rangle}{\langle x_m, x_m \rangle} \right], \tag{3}$$
since $X^T X = \mathrm{diag}(\langle x_1, x_1 \rangle, \ldots, \langle x_m, x_m \rangle)$. In other words, $\hat{\beta}$ is made up of the univariate estimates. This means that when the input variables are orthogonal, they have no effect on each other's parameter estimates in the model.

One way to perform regression, known as the Gram-Schmidt procedure for multiple regression, is to first decompose the design matrix $X$ into $X = U\Gamma$, where $U = [u_1 \; u_2 \; \cdots \; u_m]$ is the orthogonal matrix obtained from Gram-Schmidt, and $\Gamma$ is the upper triangular matrix defined by $\Gamma_{l,l} = 1$ and $\Gamma_{l,j} = \frac{\langle u_l, x_j \rangle}{\langle u_l, u_l \rangle}$ for $l < j$, and then solve the associated regression problem $U\gamma = y$ using (3).
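A sketch of this procedure, assuming NumPy and using my own function names: it forms $U$ and $\Gamma$, obtains $\hat{\gamma}$ from the univariate estimates in (3), and recovers $\hat{\beta}$ from the triangular system $\Gamma\hat{\beta} = \hat{\gamma}$ derived below:

    import numpy as np

    def gram_schmidt(X):
        # Classical Gram-Schmidt on the columns of X, as in the previous sketch.
        U = np.zeros_like(X)
        for j in range(X.shape[1]):
            u = X[:, j].copy()
            for i in range(j):
                u -= (np.dot(U[:, i], X[:, j]) / np.dot(U[:, i], U[:, i])) * U[:, i]
            U[:, j] = u
        return U

    def successive_orthogonalisation_fit(X, y):
        U = gram_schmidt(X)
        m = X.shape[1]
        # Gamma is upper triangular with unit diagonal:
        # Gamma[l, j] = <u_l, x_j> / <u_l, u_l> for l < j, so that X = U Gamma.
        Gamma = np.eye(m)
        for l in range(m):
            for j in range(l + 1, m):
                Gamma[l, j] = np.dot(U[:, l], X[:, j]) / np.dot(U[:, l], U[:, l])
        # Univariate estimates on the orthogonal columns, as in (3).
        gamma_hat = np.array([np.dot(U[:, j], y) / np.dot(U[:, j], U[:, j])
                              for j in range(m)])
        # Recover beta by back-substitution in Gamma beta = gamma.
        return np.linalg.solve(Gamma, gamma_hat)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(40, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=40)
    print(successive_orthogonalisation_fit(X, y))
    print(np.linalg.lstsq(X, y, rcond=None)[0])   # agrees closely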

The following shows the relationship between $\hat{\beta}$ and the $\hat{\gamma}$ in $X\beta = y$:
$$X\beta = y \;\Longrightarrow\; U\Gamma\beta = y \;\Longrightarrow\; \Gamma\beta = U^T y = \gamma.$$
Since $\Gamma_{m,m} = 1$, we have
$$\hat{\beta}(m) = \hat{\gamma}(m) = \frac{\langle u_m, y \rangle}{\langle u_m, u_m \rangle}. \tag{4}$$

Fact. Since any $x_j$ can be shifted into the last position in the design matrix $X$, Equation (4) tells us something useful: the regression coefficient $\hat{\beta}(j)$ of $x_j$ is the univariate estimate of regressing $y$ on the residual of regressing $x_j$ on $x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_m$. Intuitively, $\hat{\beta}(j)$ represents the additional contribution of $x_j$ on $y$, after $x_j$ has been adjusted for $x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_m$.

From the above, we can now see how multiple linear regression can break down in practice. If $x_m$ is highly correlated with some of the other $x_k$'s, the residual vector $u_m$ will be close to zero and, from (4), the regression coefficient $\hat{\beta}(m)$ will be very unstable. Indeed, this will be true for all the variables in the correlated group.
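A small demonstration of this instability (a toy example of my own, not from the paper): with one column that is almost a copy of another, tiny perturbations of $y$ swing the estimated coefficients wildly:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 50
    x1 = rng.normal(size=n)
    x2 = x1 + 1e-6 * rng.normal(size=n)       # x2 is almost a copy of x1
    X = np.column_stack([x1, x2])
    y = x1 + 0.01 * rng.normal(size=n)

    # Tiny perturbations of y produce wildly different coefficient estimates,
    # because the residual of x2 on x1 (the u_m above) is nearly the zero vector.
    for _ in range(3):
        y_perturbed = y + 1e-4 * rng.normal(size=n)
        print(np.linalg.lstsq(X, y_perturbed, rcond=None)[0])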

5 Principal Component Regression

Partial Least Squares and the closely related principal component regression technique are both designed to handle the case of a large number of correlated independent variables, which is common in chemometrics. To understand Partial Least Squares, it helps to first get a handle on principal component regression, which we now describe.

The idea behind principal component regression is to first perform a principal component analysis (PCA) on the design matrix and then use only the first $k$ principal components to do the regression.
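A minimal sketch of this idea (my own illustration; it uses the SVD of the mean-centred design matrix to obtain the principal directions):

    import numpy as np

    def pcr_fit(X, y, k):
        # Principal component regression: PCA on the mean-centred design matrix,
        # then ordinary least squares on the first k principal components only.
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        # The rows of Vt are the principal directions (eigenvectors of Xc^T Xc),
        # ordered by decreasing singular value.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        V_k = Vt[:k].T                         # keep the first k directions
        T = Xc @ V_k                           # component scores
        theta = np.linalg.lstsq(T, yc, rcond=None)[0]
        return V_k @ theta                     # coefficients in the original variables

    rng = np.random.default_rng(5)
    X = rng.normal(size=(60, 5))
    X[:, 4] = X[:, 0] + 1e-3 * rng.normal(size=60)   # a highly correlated column
    y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=60)
    print(pcr_fit(X, y, k=3))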

To understand how it works, it helps to first understand the following results.

Definition. A matrix $A$ is said to be orthogonally diagonalisable if there are an orthogonal matrix $P$ and a diagonal matrix $D$ such that $A = PDP^T = PDP^{-1}$.

Fact. An $n \times n$ matrix $A$ is orthogonally diagonalisable if and only if $A$ is a symmetric matrix (i.e., $A^T = A$).

Fact. From the Spectral Theorem for Symmetric Matrices [Lay97], we know that an $n \times n$ symmetric matrix $A$ has $n$ real eigenvalues, counting multiplicities, and that the corresponding eigenvectors are orthogonal. (Eigenvectors are not orthogonal in general.) A symmetric matrix $A$ can thus be orthogonally diagonalised this way:
$$A = UDU^T,$$
where $U$ is made up of the eigenvectors of $A$ and $D$ is the diagonal matrix made up of the eigenvalues of $A$.

Another result we will need relates to the optimisation of quadratic forms, defined by a symmetric matrix, under a certain form of constraint.
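Before moving on, a short sketch (mine) illustrating the orthogonal diagonalisation above with NumPy's symmetric eigensolver:

    import numpy as np

    rng = np.random.default_rng(6)
    B = rng.normal(size=(4, 4))
    A = B + B.T                              # a symmetric matrix

    # eigh is specialised for symmetric matrices: it returns real eigenvalues
    # and an orthogonal matrix of eigenvectors.
    eigenvalues, U = np.linalg.eigh(A)
    D = np.diag(eigenvalues)

    print(np.allclose(A, U @ D @ U.T))       # A = U D U^T
    print(np.allclose(U.T @ U, np.eye(4)))   # the eigenvectors are orthonormal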

