### Transcription of Gaussian processes - Stanford University

**Gaussian Processes**

Chuong B. Do (updated by Honglak Lee)

November 22, 2008

Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern: given a training set of examples sampled from some unknown distribution,

1. solve a convex optimization problem in order to identify the single "best fit" model for the data, and
2. use this estimated model to make "best guess" predictions for future test input points.

In these notes, we will talk about a different flavor of learning algorithms, known as **Bayesian methods**. Unlike classical learning algorithms, Bayesian algorithms do not attempt to identify best-fit models of the data (or similarly, make best-guess predictions for new test inputs). Instead, they compute a posterior distribution over models (or similarly, compute posterior predictive distributions for new test inputs).

These distributions provide a useful way to quantify our uncertainty in model estimates, and to exploit our knowledge of this uncertainty in order to make more robust predictions on new test points.

We focus on **regression** problems, where the goal is to learn a mapping from some input space $\mathcal{X} = \mathbb{R}^n$ of $n$-dimensional vectors to an output space $\mathcal{Y} = \mathbb{R}$ of real-valued targets. In particular, we will talk about a kernel-based fully Bayesian regression algorithm, known as **Gaussian process** regression. The **material** covered in these notes draws heavily on many different topics that we discussed previously in class (namely, the probabilistic interpretation of linear regression [1], Bayesian methods [2], kernels [3], and properties of multivariate Gaussians [4]).

The organization of these notes is as follows. In Section 1, we provide a brief review of multivariate **Gaussian** distributions and their properties.

In Section 2, we briefly review Bayesian methods in the context of probabilistic linear regression. The central ideas underlying **Gaussian processes** are presented in Section 3, and we derive the full **Gaussian** process regression model in Section 4.

[1] See course lecture notes on Supervised Learning, Discriminative Algorithms.
[2] See course lecture notes on Regularization and Model Selection.
[3] See course lecture notes on Support Vector Machines.
[4] See course lecture notes on Factor Analysis.

### 1 Multivariate Gaussians

A vector-valued random variable $x \in \mathbb{R}^n$ is said to have a **multivariate normal (or Gaussian) distribution** with mean $\mu \in \mathbb{R}^n$ and covariance matrix $\Sigma \in \mathbb{S}_{++}^n$ if

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right). \tag{1}$$

We write this as $x \sim \mathcal{N}(\mu, \Sigma)$. Here, recall from the section notes on linear algebra that $\mathbb{S}_{++}^n$ refers to the space of symmetric positive definite $n \times n$ matrices.[5]

Generally speaking, **Gaussian** random variables are extremely useful in machine learning and statistics for two main reasons.

First, they are extremely common when modeling "noise" in statistical algorithms. Quite often, noise can be considered to be the accumulation of a large number of small independent random perturbations affecting the measurement **process**; by the Central Limit Theorem, summations of independent random variables will tend to "look **Gaussian**." Second, **Gaussian** random variables are convenient for many analytical manipulations, because many of the integrals involving **Gaussian** distributions that arise in practice have simple closed-form solutions. In the remainder of this section, we will review a number of useful properties of multivariate Gaussians.

Consider a random vector $x \in \mathbb{R}^n$ with $x \sim \mathcal{N}(\mu, \Sigma)$. Suppose also that the variables in $x$ have been partitioned into two sets $x_A = [x_1 \cdots x_r]^T \in \mathbb{R}^r$ and $x_B = [x_{r+1} \cdots x_n]^T \in \mathbb{R}^{n-r}$ (and similarly for $\mu$ and $\Sigma$), such that

$$x = \begin{bmatrix} x_A \\ x_B \end{bmatrix} \qquad \mu = \begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix} \qquad \Sigma = \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix}.$$

Here, $\Sigma_{AB} = \Sigma_{BA}^T$ since $\Sigma = E[(x - \mu)(x - \mu)^T] = \Sigma^T$. The following properties hold:

1. **Normalization.** The density function normalizes, i.e.,
   $$\int_x p(x; \mu, \Sigma) \, dx = 1.$$
   This property, though seemingly trivial at first glance, turns out to be immensely useful for evaluating all sorts of integrals, even ones which appear to have no relation to probability distributions at all (see the Appendix)!

2. **Marginalization.** The marginal densities,
   $$p(x_A) = \int_{x_B} p(x_A, x_B; \mu, \Sigma) \, dx_B \qquad p(x_B) = \int_{x_A} p(x_A, x_B; \mu, \Sigma) \, dx_A,$$
   are **Gaussian**:
   $$x_A \sim \mathcal{N}(\mu_A, \Sigma_{AA}) \qquad x_B \sim \mathcal{N}(\mu_B, \Sigma_{BB}).$$

3. **Conditioning.** The conditional densities
   $$p(x_A \mid x_B) = \frac{p(x_A, x_B; \mu, \Sigma)}{\int_{x_A} p(x_A, x_B; \mu, \Sigma) \, dx_A} \qquad p(x_B \mid x_A) = \frac{p(x_A, x_B; \mu, \Sigma)}{\int_{x_B} p(x_A, x_B; \mu, \Sigma) \, dx_B}$$
   are also **Gaussian**:
   $$x_A \mid x_B \sim \mathcal{N}\left(\mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B),\ \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}\right)$$
   $$x_B \mid x_A \sim \mathcal{N}\left(\mu_B + \Sigma_{BA}\Sigma_{AA}^{-1}(x_A - \mu_A),\ \Sigma_{BB} - \Sigma_{BA}\Sigma_{AA}^{-1}\Sigma_{AB}\right).$$

[5] There are actually cases in which we would want to deal with multivariate **Gaussian** distributions where $\Sigma$ is positive semidefinite but not positive definite (i.e., $\Sigma$ is not full rank). In such cases, $\Sigma^{-1}$ does not exist, so the definition of the **Gaussian** density given in (1) does not apply. For instance, see the course lecture notes on Factor Analysis.
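The conditioning rule above is the single most important Gaussian identity for what follows, since Gaussian process regression ultimately reduces to it. As an illustrative sketch (the function name `condition_gaussian` and the example numbers are our own, not from the notes), the two conditioning formulas can be implemented directly in NumPy:

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_A, idx_B, x_B):
    """Conditional distribution of x_A given x_B = x_B, for x ~ N(mu, Sigma).

    Implements the closed-form conditioning rule:
        mu_{A|B}    = mu_A + Sigma_AB Sigma_BB^{-1} (x_B - mu_B)
        Sigma_{A|B} = Sigma_AA - Sigma_AB Sigma_BB^{-1} Sigma_BA
    """
    mu_A, mu_B = mu[idx_A], mu[idx_B]
    S_AA = Sigma[np.ix_(idx_A, idx_A)]
    S_AB = Sigma[np.ix_(idx_A, idx_B)]
    S_BB = Sigma[np.ix_(idx_B, idx_B)]
    # Solve a linear system instead of explicitly inverting Sigma_BB.
    K = np.linalg.solve(S_BB, S_AB.T).T          # K = Sigma_AB Sigma_BB^{-1}
    mu_cond = mu_A + K @ (x_B - mu_B)
    Sigma_cond = S_AA - K @ S_AB.T
    return mu_cond, Sigma_cond

# A small 3-dimensional example (values chosen arbitrarily for illustration).
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu_c, Sigma_c = condition_gaussian(mu, Sigma, idx_A=[0], idx_B=[1, 2],
                                   x_B=np.array([2.0, 0.0]))
```

Using `np.linalg.solve` rather than forming $\Sigma_{BB}^{-1}$ explicitly is the usual numerically stable choice; note also that the conditional covariance never depends on the observed value $x_B$.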

A proof of the conditioning property is given in the Appendix. (See also the Appendix for an easier version of the derivation.)

4. **Summation.** The sum of independent **Gaussian** random variables (with the same dimensionality), $y \sim \mathcal{N}(\mu, \Sigma)$ and $z \sim \mathcal{N}(\mu', \Sigma')$, is also **Gaussian**:
   $$y + z \sim \mathcal{N}(\mu + \mu', \Sigma + \Sigma').$$

### 2 Bayesian linear regression

Let $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$ be a training set of examples from some unknown distribution. The standard probabilistic interpretation of linear regression states that

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}, \qquad i = 1, \ldots, m,$$

where the $\epsilon^{(i)}$ are noise variables with independent $\mathcal{N}(0, \sigma^2)$ distributions. It follows that $y^{(i)} - \theta^T x^{(i)} \sim \mathcal{N}(0, \sigma^2)$, or equivalently,

$$P(y^{(i)} \mid x^{(i)}, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right).$$

For notational convenience, we define

$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \in \mathbb{R}^{m \times n} \qquad \vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^m \qquad \vec{\epsilon} = \begin{bmatrix} \epsilon^{(1)} \\ \epsilon^{(2)} \\ \vdots \\ \epsilon^{(m)} \end{bmatrix} \in \mathbb{R}^m.$$

Figure 1 (plot not reproduced in this transcription): Bayesian linear regression for a one-dimensional linear regression problem, $y^{(i)} = \theta x^{(i)} + \epsilon^{(i)}$, with $\epsilon^{(i)} \sim \mathcal{N}(0, 1)$ noise.

The green region denotes the 95% confidence region for predictions of the model. Note that the (vertical) width of the green region is largest at the ends but narrowest in the middle. This region reflects the uncertainty in the estimates for the parameter $\theta$. In contrast, a classical linear regression model would display a confidence region of constant width, reflecting only the $\mathcal{N}(0, \sigma^2)$ noise in the outputs.

In Bayesian linear regression, we assume that a **prior distribution** over parameters is also given; a typical choice, for instance, is $\theta \sim \mathcal{N}(0, \tau^2 I)$. Using Bayes's rule, we obtain the parameter posterior,

$$p(\theta \mid S) = \frac{p(\theta)\, p(S \mid \theta)}{\int_{\theta'} p(\theta')\, p(S \mid \theta')\, d\theta'} = \frac{p(\theta) \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}, \theta)}{\int_{\theta'} p(\theta') \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}, \theta')\, d\theta'}. \tag{2}$$

Assuming the same noise model on testing points as on our training points, the output of Bayesian linear regression on a new test point $x_*$ is not just a single guess $\hat{y}_*$, but rather an entire probability distribution over possible outputs, known as the **posterior predictive distribution**:

$$p(y_* \mid x_*, S) = \int_\theta p(y_* \mid x_*, \theta)\, p(\theta \mid S)\, d\theta. \tag{3}$$

For many types of models, the integrals in (2) and (3) are difficult to compute, and hence, we often resort to approximations, such as MAP estimation (see course lecture notes on Regularization and Model Selection). In the case of Bayesian linear regression, however, the integrals actually are tractable! In particular, for Bayesian linear regression, one can show (after much work!) that

$$\theta \mid S \sim \mathcal{N}\left(\frac{1}{\sigma^2} A^{-1} X^T \vec{y},\ A^{-1}\right)$$
$$y_* \mid x_*, S \sim \mathcal{N}\left(\frac{1}{\sigma^2} x_*^T A^{-1} X^T \vec{y},\ x_*^T A^{-1} x_* + \sigma^2\right)$$

where $A = \frac{1}{\sigma^2} X^T X + \frac{1}{\tau^2} I$. The derivation of these formulas is somewhat involved. Nonetheless, from these equations, we get at least a flavor of what Bayesian methods are all about: the posterior distribution over the test output $y_*$ for a test input $x_*$ is a **Gaussian** distribution; this distribution reflects the uncertainty in our predictions $y_* = \theta^T x_* + \epsilon_*$ arising from both the randomness in $\epsilon_*$ and the uncertainty in our choice of parameters $\theta$.
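To give a concrete feel for these closed-form expressions, the following sketch computes the parameter posterior and the posterior predictive on synthetic one-dimensional data. All of the concrete values here (the data, the random seed, and the choices of $\theta$, $\sigma$, and $\tau$) are our own illustrative assumptions, not values from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data for illustration: y = theta * x + Gaussian noise.
theta_true, sigma, tau = 1.5, 0.5, 1.0
X = rng.uniform(-3, 3, size=(50, 1))
y = X @ np.array([theta_true]) + sigma * rng.standard_normal(50)

# Parameter posterior:  theta | S ~ N(sigma^{-2} A^{-1} X^T y, A^{-1}),
# where A = sigma^{-2} X^T X + tau^{-2} I.
n = X.shape[1]
A = X.T @ X / sigma**2 + np.eye(n) / tau**2
A_inv = np.linalg.inv(A)
mu_post = A_inv @ X.T @ y / sigma**2

# Posterior predictive at a test input x_*:
#   y_* | x_*, S ~ N(sigma^{-2} x_*^T A^{-1} X^T y, x_*^T A^{-1} x_* + sigma^2)
x_star = np.array([2.0])
pred_mean = x_star @ mu_post
pred_var = x_star @ A_inv @ x_star + sigma**2
```

Note how the predictive variance decomposes exactly as the text describes: the $x_*^T A^{-1} x_*$ term grows with $\|x_*\|$ (parameter uncertainty, widest at the ends of Figure 1), while the $\sigma^2$ term is the irreducible observation noise.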

In contrast, classical probabilistic linear regression models estimate parameters $\theta$ directly from the training data but provide no estimate of how reliable these learned parameters may be (see Figure 1).

### 3 Gaussian processes

As described in Section 1, multivariate **Gaussian** distributions are useful for modeling finite collections of real-valued variables because of their nice analytical properties. **Gaussian processes** are the extension of multivariate Gaussians to infinite-sized collections of real-valued variables. In particular, this extension will allow us to think of **Gaussian processes** as distributions not just over random vectors but in fact distributions over **random functions**.

#### 3.1 Probability distributions over functions with finite domains

To understand how one might parameterize probability distributions over functions, consider the following simple example.

Let $\mathcal{X} = \{x_1, \ldots, x_m\}$ be any finite set of elements. Now, consider the set $\mathcal{H}$ of all possible functions mapping from $\mathcal{X}$ to $\mathbb{R}$. For instance, one example of a function $f_0(\cdot) \in \mathcal{H}$ assigns a fixed real value to each point, e.g. $f_0(x_1) = 5$, $f_0(x_2) = 7$, and so on through $f_0(x_m)$. Since the domain of any $f(\cdot) \in \mathcal{H}$ has only $m$ elements, we can always represent $f(\cdot)$ compactly as an $m$-dimensional vector, $\vec{f} = [f(x_1)\ f(x_2)\ \cdots\ f(x_m)]^T$. In order to specify a probability distribution over functions $f(\cdot) \in \mathcal{H}$, we must associate some probability density with each function in $\mathcal{H}$. One natural way to do this is to exploit the one-to-one correspondence between functions $f(\cdot) \in \mathcal{H}$ and their vector representations $\vec{f}$. In particular, if we specify that $\vec{f} \sim \mathcal{N}(\vec{\mu}, \sigma^2 I)$, then this in turn implies a probability distribution over functions $f(\cdot)$, whose probability density function is given by

$$p(f) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(f(x_i) - \mu_i)^2\right).$$
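The correspondence between functions on a finite domain and $m$-vectors can be made concrete with a short sketch (the domain size, seed, and parameter values below are arbitrary illustrative choices, not from the notes): drawing a random function simply means drawing an $m$-vector from $\mathcal{N}(\vec{\mu}, \sigma^2 I)$, and its density is a product of independent univariate Gaussians:

```python
import numpy as np

rng = np.random.default_rng(1)

# A finite index set X = {x_1, ..., x_m}; only its size matters here.
m = 5

# Specifying f_vec ~ N(mu, sigma^2 I) induces a distribution over every
# function f : X -> R, since each function is just an m-vector of values.
mu = np.zeros(m)
sigma = 1.0

# Draw three random functions, i.e. three samples of the vector f_vec.
samples = mu + sigma * rng.standard_normal((3, m))

# Log-density of one sampled function under the iid-Gaussian model:
#   log p(f) = sum_i log N(f(x_i); mu_i, sigma^2)
f = samples[0]
log_p = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
               - (f - mu)**2 / (2 * sigma**2))
```

With the identity covariance used here the sampled "functions" are just independent noise at each point; replacing $\sigma^2 I$ with a covariance built from a kernel is exactly the step the following sections take.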