Gaussian Processes for Machine Learning

C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. © 2006 Massachusetts Institute of Technology.

Transcription of Gaussian Processes for Machine Learning

A Mathematical Background

Joint, Marginal and Conditional Probability

Let the $n$ (discrete or continuous) random variables $y_1, \ldots, y_n$ have a joint probability $p(y_1, \ldots, y_n)$, or $p(y)$ for short. Technically, one ought to distinguish between probabilities (for discrete variables) and probability densities (for continuous variables); throughout the book the term probability is commonly used to refer to both. Let us partition the variables in $y$ into two groups, $y_A$ and $y_B$, where $A$ and $B$ are two disjoint sets whose union is the set $\{1, \ldots, n\}$, so that $p(y) = p(y_A, y_B)$. Each group may contain one or more variables.

The marginal probability of $y_A$ is given by

$$p(y_A) = \int p(y_A, y_B)\, dy_B.$$

The integral is replaced by a sum if the variables are discrete valued. Notice that if the set $A$ contains more than one variable, then the marginal probability is itself a joint probability; whether it is referred to as one or the other depends on the context. If the joint distribution is equal to the product of the marginals, then the variables are said to be independent; otherwise they are dependent.

The conditional probability function is defined as

$$p(y_A \mid y_B) = \frac{p(y_A, y_B)}{p(y_B)},$$

defined for $p(y_B) > 0$, as it is not meaningful to condition on an impossible event.
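As a quick illustration in Python (with a made-up 2 × 2 joint table), the marginal is obtained by summing out the other variable and the conditional by dividing by that marginal:

```python
import numpy as np

# Hypothetical joint distribution p(y_A, y_B) over two discrete variables,
# stored as a table with rows indexed by y_A and columns by y_B.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

# Marginals: sum (the discrete analogue of the integral) over the other variable.
p_A = joint.sum(axis=1)          # p(y_A)
p_B = joint.sum(axis=0)          # p(y_B)

# Conditional: p(y_A | y_B) = p(y_A, y_B) / p(y_B), defined where p(y_B) > 0.
cond_A_given_B = joint / p_B     # each column of this table sums to one

print(p_A, p_B)
print(cond_A_given_B)
```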

If $y_A$ and $y_B$ are independent, then the marginal $p(y_A)$ and the conditional $p(y_A \mid y_B)$ are equal. (One can deal with more general cases, where the density function does not exist, by using the distribution function.)

Using the definitions of both $p(y_A \mid y_B)$ and $p(y_B \mid y_A)$ we obtain Bayes' theorem,

$$p(y_A \mid y_B) = \frac{p(y_A)\, p(y_B \mid y_A)}{p(y_B)}.$$

Since conditional distributions are themselves probabilities, one can use all of the above also when further conditioning on other variables.
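Continuing the same made-up table, Bayes' rule can be checked numerically; all names below are only illustrative:

```python
import numpy as np

joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])
p_A = joint.sum(axis=1)
p_B = joint.sum(axis=0)
cond_A_given_B = joint / p_B                 # p(y_A | y_B); columns sum to one
cond_B_given_A = joint / p_A[:, None]        # p(y_B | y_A); rows sum to one

# Bayes' rule: p(y_A | y_B) = p(y_A) p(y_B | y_A) / p(y_B).
bayes = (p_A[:, None] * cond_B_given_A) / p_B[None, :]
assert np.allclose(bayes, cond_A_given_B)
```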

For example, in supervised learning one often conditions on the inputs throughout, which would lead to a version of Bayes' rule with additional conditioning on $X$ in all four probabilities above; see the main text for an example of this.

Gaussian Identities

The multivariate Gaussian (or Normal) distribution has a joint probability density given by

$$p(x \mid m, \Sigma) = (2\pi)^{-D/2} |\Sigma|^{-1/2} \exp\!\Big(-\tfrac{1}{2}(x - m)^\top \Sigma^{-1} (x - m)\Big),$$

where $m$ is the mean vector (of length $D$) and $\Sigma$ is the (symmetric, positive definite) covariance matrix (of size $D \times D$). As a shorthand we write $x \sim \mathcal{N}(m, \Sigma)$.

Let $x$ and $y$ be jointly Gaussian random vectors,

$$\begin{bmatrix} x \\ y \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix} \right) = \mathcal{N}\!\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \tilde{A} & \tilde{C} \\ \tilde{C}^\top & \tilde{B} \end{bmatrix}^{-1} \right);$$

then the marginal distribution of $x$ and the conditional distribution of $x$ given $y$ are

$$x \sim \mathcal{N}(\mu_x, A), \qquad x \mid y \sim \mathcal{N}\big(\mu_x + C B^{-1}(y - \mu_y),\; A - C B^{-1} C^\top\big),$$

or equivalently $x \mid y \sim \mathcal{N}\big(\mu_x - \tilde{A}^{-1}\tilde{C}(y - \mu_y),\; \tilde{A}^{-1}\big)$. See, e.g., von Mises [1964], and the matrix identities given below.
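A minimal NumPy sketch of the conditioning formula, using small made-up blocks $\mu_x, \mu_y, A, B, C$ and an assumed observation of $y$:

```python
import numpy as np

# Hypothetical joint Gaussian over (x, y): block means and covariance blocks.
mu_x = np.array([0.0, 1.0])
mu_y = np.array([2.0])
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # cov(x, x)
B = np.array([[1.5]])               # cov(y, y)
C = np.array([[0.3],
              [0.2]])               # cov(x, y)

y_obs = np.array([2.5])             # an assumed observed value of y

# Conditional of x given y: mean mu_x + C B^{-1}(y - mu_y), cov A - C B^{-1} C^T.
Binv = np.linalg.inv(B)
cond_mean = mu_x + C @ Binv @ (y_obs - mu_y)
cond_cov = A - C @ Binv @ C.T

print(cond_mean)
print(cond_cov)
```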

The product of two Gaussians gives another (un-normalized) Gaussian,

$$\mathcal{N}(x \mid a, A)\, \mathcal{N}(x \mid b, B) = Z^{-1} \mathcal{N}(x \mid c, C),$$

where $c = C(A^{-1}a + B^{-1}b)$ and $C = (A^{-1} + B^{-1})^{-1}$. Notice that the resulting Gaussian has a precision (inverse variance) equal to the sum of the precisions, and a mean equal to the convex sum of the means, weighted by the precisions. The normalizing constant itself looks like a Gaussian (in $a$ or $b$),

$$Z^{-1} = (2\pi)^{-D/2} |A + B|^{-1/2} \exp\!\Big(-\tfrac{1}{2}(a - b)^\top (A + B)^{-1} (a - b)\Big).$$
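The product identity can also be checked numerically; the following sketch uses two made-up Gaussian factors and SciPy's multivariate normal density:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Two made-up Gaussian factors in D = 2 dimensions.
a, A = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 2.0]])
b, B = np.array([1.0, -1.0]), np.array([[0.5, 0.0], [0.0, 1.5]])

# Parameters of the (un-normalized) product Gaussian.
Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)
C = np.linalg.inv(Ainv + Binv)
c = C @ (Ainv @ a + Binv @ b)

# Normalizing constant: itself Gaussian in (a - b) with covariance A + B.
Z_inv = mvn.pdf(a - b, mean=np.zeros(2), cov=A + B)

# Check N(x|a,A) N(x|b,B) = Z^{-1} N(x|c,C) at a test point.
x = np.array([0.3, -0.4])
lhs = mvn.pdf(x, mean=a, cov=A) * mvn.pdf(x, mean=b, cov=B)
rhs = Z_inv * mvn.pdf(x, mean=c, cov=C)
assert np.isclose(lhs, rhs)
```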

To prove the product formula, simply write out the (lengthy) expressions by introducing the Gaussian density above on both sides, and expand the terms inside the exp to verify equality. Hint: it may be helpful to expand $C$ using the matrix inversion lemma (given below),

$$C = (A^{-1} + B^{-1})^{-1} = A - A(A + B)^{-1}A = B - B(A + B)^{-1}B.$$

To generate samples $x \sim \mathcal{N}(m, K)$ with arbitrary mean $m$ and covariance matrix $K$ using a scalar Gaussian generator (which is readily available in many programming environments) we proceed as follows: first, compute the Cholesky decomposition (also known as the "matrix square root") $L$ of the positive definite symmetric covariance matrix, $K = L L^\top$, where $L$ is a lower triangular matrix; see the section on the Cholesky decomposition below. Then generate $u \sim \mathcal{N}(0, I)$ by multiple separate calls to the scalar Gaussian generator. Compute $x = m + L u$, which has the desired distribution with mean $m$ and covariance $L\,\mathbb{E}[u u^\top]\, L^\top = L L^\top = K$ (by the independence of the elements of $u$).
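A short NumPy sketch of this sampling recipe, with a made-up mean and covariance; np.linalg.cholesky returns the lower triangular factor $L$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up mean and covariance for the example.
m = np.array([1.0, -2.0, 0.5])
K = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.5, 0.4],
              [0.3, 0.4, 1.0]])

L = np.linalg.cholesky(K)        # K = L L^T, L lower triangular
u = rng.standard_normal(3)       # u ~ N(0, I), from a standard scalar generator
x = m + L @ u                    # x ~ N(m, K)

# Sanity check: the sample covariance of many draws approaches K.
U = rng.standard_normal((3, 100_000))
X = m[:, None] + L @ U
print(x)                         # one sample from N(m, K)
print(np.cov(X))                 # empirical covariance, close to K
```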

In practice it may be necessary to add a small multiple of the identity matrix, $\epsilon I$, to the covariance matrix for numerical reasons. This is because the eigenvalues of the matrix $K$ can decay very rapidly (see the main text for a closely related analytical result) and without this stabilization the Cholesky decomposition fails. The effect on the generated samples is to add additional independent noise of variance $\epsilon$. From the context $\epsilon$ can usually be chosen to have inconsequential effects on the samples, while ensuring numerical stability.
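A sketch of this jitter idea, assuming a made-up, nearly singular kernel matrix; the starting value and growth factor for $\epsilon$ are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up, nearly singular covariance: a squared-exponential kernel matrix
# on 200 one-dimensional inputs has rapidly decaying eigenvalues.
x = rng.standard_normal((200, 1))
K = np.exp(-0.5 * (x - x.T) ** 2)

eps = 0.0
while True:
    try:
        L = np.linalg.cholesky(K + eps * np.eye(len(K)))
        break
    except np.linalg.LinAlgError:
        # increase the jitter until the factorization succeeds
        eps = 1e-10 if eps == 0.0 else 10 * eps

print("jitter used:", eps)
```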

Matrix Identities

The matrix inversion lemma, also known as the Woodbury, Sherman & Morrison formula (see Press et al. [1992, p. 75]), states that

$$(Z + U W V^\top)^{-1} = Z^{-1} - Z^{-1} U (W^{-1} + V^\top Z^{-1} U)^{-1} V^\top Z^{-1},$$

assuming the relevant inverses all exist. Here $Z$ is $n \times n$, $W$ is $m \times m$ and $U$ and $V$ are both of size $n \times m$; consequently if $Z^{-1}$ is known, and a low rank ($m < n$) perturbation is made to $Z$ as in the left hand side, considerable speedup can be achieved. A similar equation exists for determinants,

$$|Z + U W V^\top| = |Z|\, |W|\, |W^{-1} + V^\top Z^{-1} U|.$$
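Both identities are easy to check numerically; the matrices below are made up, with $Z$ of size $4 \times 4$ and a rank-2 perturbation:

```python
import numpy as np

# Made-up matrices with the required shapes: Z is n x n, W is m x m,
# U and V are n x m, giving a low-rank (m < n) perturbation.
Z = np.diag([2.0, 3.0, 1.5, 4.0])
W = np.array([[2.0, 0.3],
              [0.3, 1.0]])
U = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 2.0],
              [1.0, 1.0]])
V = np.array([[0.2, 1.0],
              [1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])

Zinv, Winv = np.linalg.inv(Z), np.linalg.inv(W)

# Matrix inversion lemma (Woodbury, Sherman & Morrison formula).
lhs = np.linalg.inv(Z + U @ W @ V.T)
rhs = Zinv - Zinv @ U @ np.linalg.inv(Winv + V.T @ Zinv @ U) @ V.T @ Zinv
assert np.allclose(lhs, rhs)

# Matching identity for determinants.
assert np.isclose(np.linalg.det(Z + U @ W @ V.T),
                  np.linalg.det(Z) * np.linalg.det(W)
                  * np.linalg.det(Winv + V.T @ Zinv @ U))
```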

Let the invertible $n \times n$ matrix $A$ and its inverse $A^{-1}$ be partitioned into

$$A = \begin{pmatrix} P & Q \\ R & S \end{pmatrix}, \qquad A^{-1} = \begin{pmatrix} \tilde{P} & \tilde{Q} \\ \tilde{R} & \tilde{S} \end{pmatrix},$$

where $P$ and $\tilde{P}$ are $n_1 \times n_1$ matrices and $S$ and $\tilde{S}$ are $n_2 \times n_2$ matrices with $n = n_1 + n_2$. The submatrices of $A^{-1}$ are given in Press et al. [1992, p. 77] as

$$\tilde{P} = P^{-1} + P^{-1} Q M R P^{-1}, \quad \tilde{Q} = -P^{-1} Q M, \quad \tilde{R} = -M R P^{-1}, \quad \tilde{S} = M, \quad \text{where } M = (S - R P^{-1} Q)^{-1},$$

or equivalently

$$\tilde{P} = N, \quad \tilde{Q} = -N Q S^{-1}, \quad \tilde{R} = -S^{-1} R N, \quad \tilde{S} = S^{-1} + S^{-1} R N Q S^{-1}, \quad \text{where } N = (P - Q S^{-1} R)^{-1}.$$

Matrix Derivatives

Derivatives of the elements of an inverse matrix:

$$\frac{\partial}{\partial \theta} K^{-1} = -K^{-1} \frac{\partial K}{\partial \theta} K^{-1},$$

where $\frac{\partial K}{\partial \theta}$ is a matrix of elementwise derivatives. For the log determinant of a positive definite symmetric matrix we have

$$\frac{\partial}{\partial \theta} \log |K| = \operatorname{tr}\Big(K^{-1} \frac{\partial K}{\partial \theta}\Big).$$
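The two derivative identities can be checked against finite differences; the parametrization $K(\theta)$ below is a made-up example:

```python
import numpy as np

# A made-up symmetric positive definite matrix K(theta), used only to check
# the derivative identities numerically.
def K_of(theta):
    base = np.array([[2.0, 0.5, 0.1],
                     [0.5, 1.5, 0.3],
                     [0.1, 0.3, 1.0]])
    return base + theta * np.eye(3)

theta, eps = 0.7, 1e-6
K = K_of(theta)
dK = (K_of(theta + eps) - K_of(theta - eps)) / (2 * eps)   # elementwise dK/dtheta

# Analytic identities.
Kinv = np.linalg.inv(K)
d_Kinv = -Kinv @ dK @ Kinv
d_logdet = np.trace(Kinv @ dK)

# Finite-difference checks.
fd_Kinv = (np.linalg.inv(K_of(theta + eps)) - np.linalg.inv(K_of(theta - eps))) / (2 * eps)
fd_logdet = (np.linalg.slogdet(K_of(theta + eps))[1]
             - np.linalg.slogdet(K_of(theta - eps))[1]) / (2 * eps)

assert np.allclose(d_Kinv, fd_Kinv, atol=1e-5)
assert np.isclose(d_logdet, fd_logdet, atol=1e-6)
```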

Matrix Norms

The Frobenius norm $\|A\|_F$ of an $n_1 \times n_2$ matrix $A$ is defined as

$$\|A\|_F^2 = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} |a_{ij}|^2 = \operatorname{tr}(A A^\top),$$

[Golub and Van Loan, 1989, p. 56].

Cholesky Decomposition

The Cholesky decomposition of a symmetric, positive definite matrix $A$ decomposes $A$ into a product of a lower triangular matrix $L$ and its transpose,

$$L L^\top = A,$$

where $L$ is called the Cholesky factor. The Cholesky decomposition is useful for solving linear systems with a symmetric, positive definite coefficient matrix $A$. To solve $Ax = b$ for $x$, first solve the triangular system $L y = b$ by forward substitution and then the triangular system $L^\top x = y$ by back substitution. Using the backslash operator, we write the solution as $x = L^\top \backslash (L \backslash b)$, where the notation $A \backslash b$ is the vector $x$ which solves $A x = b$.
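A minimal sketch of the Cholesky solve in Python, using SciPy's triangular solver for the forward and back substitutions (the system itself is made up):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

# A made-up symmetric, positive definite system A x = b.
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])
b = np.array([1.0, 2.0, 3.0])

L = cholesky(A, lower=True)                   # A = L L^T
y = solve_triangular(L, b, lower=True)        # forward substitution: L y = b
x = solve_triangular(L.T, y, lower=False)     # back substitution:    L^T x = y

assert np.allclose(A @ x, b)                  # x = L^T \ (L \ b)
```

Once $L$ is available, each triangular solve costs only $O(n^2)$, which is why solving with the Cholesky factor is preferred over forming $A^{-1}$ explicitly.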