Transcription of Gaussian Processes for Machine Learning
C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006. © 2006 Massachusetts Institute of Technology.

2 Regression

Supervised learning can be divided into regression and classification problems. Whereas the outputs for classification are discrete class labels, regression is concerned with the prediction of continuous quantities. For example, in a financial application, one may attempt to predict the price of a commodity as a function of interest rates, currency exchange rates, availability and demand. In this chapter we describe Gaussian process methods for regression problems.
Classification problems are discussed in a separate chapter.

There are several ways to interpret Gaussian process (GP) regression models. One can think of a Gaussian process as defining a distribution over functions, with inference taking place directly in the space of functions: the function-space view. Although this view is appealing, it may initially be difficult to grasp, so we start our exposition with the equivalent weight-space view, which may be more familiar and accessible to many, and then continue with the function-space view.
Gaussian processes often have characteristics that can be changed by setting certain parameters, and in a later section we discuss how the properties change as these parameters are varied. The predictions from a GP model take the form of a full predictive distribution; we also discuss how to combine a loss function with the predictive distributions using decision theory to make point predictions in an optimal way. A practical comparative example involving the learning of the inverse dynamics of a robot arm is then presented. We give some theoretical analysis of Gaussian process regression, and discuss how to incorporate explicit basis functions into the models. As much of the material in this chapter can be considered fairly standard, we postpone most references to the historical overview.

Weight-space View

The simple linear regression model, where the output is a linear combination of the inputs, has been studied and used extensively. Its main virtues are simplicity of implementation and interpretability. Its main drawback is that it only allows a limited flexibility; if the relationship between input and output cannot reasonably be approximated by a linear function, the model will give poor predictions.

In this section we first discuss the Bayesian treatment of the linear model. We then make a simple enhancement to this class of models by projecting the inputs into a high-dimensional feature space and applying the linear model there.
We show that in some feature spaces one can apply the kernel trick to carry out computations implicitly in the high-dimensional space; this last step leads to computational savings when the dimensionality of the feature space is large compared to the number of data points.

We have a training set $\mathcal{D}$ of $n$ observations, $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$, where $\mathbf{x}$ denotes an input vector (covariates) of dimension $D$ and $y$ denotes a scalar output or target (dependent variable). The column vector inputs for all $n$ cases are aggregated in the $D \times n$ design matrix¹ $X$, and the targets are collected in the vector $\mathbf{y}$, so we can write $\mathcal{D} = (X, \mathbf{y})$.
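The column-wise layout of the design matrix is easy to get wrong in code, since many libraries expect one data point per row. The following is a minimal NumPy sketch of the convention used here; the toy inputs and targets are invented purely for illustration and do not come from the text.

```python
import numpy as np

# Hypothetical toy data: n = 4 cases, each input vector of dimension D = 2.
inputs = [np.array([0.0, 1.0]),
          np.array([1.0, 1.0]),
          np.array([2.0, 1.0]),
          np.array([3.0, 1.0])]
targets = [0.1, 0.9, 2.1, 2.9]

# Stack the inputs as COLUMNS, giving the D x n design matrix of the text.
# (Many statistics texts and libraries use the n x D transpose instead.)
X = np.column_stack(inputs)   # shape (D, n) = (2, 4)
y = np.array(targets)         # shape (n,)   = (4,)

D, n = X.shape
x_3 = X[:, 2]                 # the third input vector is the third column of X
```

With this layout a single data point is an ordinary column vector, which is the advantage noted in the footnote on the design matrix.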
In the regression setting the targets are real values. We are interested in making inferences about the relationship between inputs and targets, i.e. the conditional distribution of the targets given the inputs (but we are not interested in modelling the input distribution itself).

The Standard Linear Model

We will review the Bayesian analysis of the standard linear regression model with Gaussian noise

$$f(\mathbf{x}) = \mathbf{x}^\top \mathbf{w}, \qquad y = f(\mathbf{x}) + \varepsilon,$$

where $\mathbf{x}$ is the input vector, $\mathbf{w}$ is a vector of weights (parameters) of the linear model, $f$ is the function value and $y$ is the observed target value.
Often a bias weight or offset is included, but as this can be implemented by augmenting the input vector $\mathbf{x}$ with an additional element whose value is always one, we do not explicitly include it in our notation. We have assumed that the observed values $y$ differ from the function values $f(\mathbf{x})$ by additive noise, and we will further assume that this noise follows an independent, identically distributed Gaussian distribution with zero mean and variance $\sigma_n^2$:

$$\varepsilon \sim \mathcal{N}(0, \sigma_n^2).$$

This noise assumption together with the model directly gives rise to the likelihood, the probability density of the observations given the parameters, which is factored over cases in the training set (because of the independence assumption) to give

$$p(\mathbf{y} \mid X, \mathbf{w}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left(-\frac{(y_i - \mathbf{x}_i^\top \mathbf{w})^2}{2\sigma_n^2}\right) = \frac{1}{(2\pi\sigma_n^2)^{n/2}} \exp\!\left(-\frac{1}{2\sigma_n^2}\,|\mathbf{y} - X^\top \mathbf{w}|^2\right) = \mathcal{N}(X^\top \mathbf{w}, \sigma_n^2 I),$$

where $|\mathbf{z}|$ denotes the Euclidean length of the vector $\mathbf{z}$. In the Bayesian formalism we need to specify a prior over the parameters, expressing our beliefs about the parameters before we look at the observations.

¹ In statistics texts the design matrix is usually taken to be the transpose of our definition, but our choice is deliberate and has the advantage that a data point is a standard (column) vector.
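As a numerical check on the factorization above, the sketch below evaluates the log likelihood in two ways: as a sum of per-case Gaussian log densities and in the vectorized form $\mathcal{N}(X^\top \mathbf{w}, \sigma_n^2 I)$. The function names, the random data and the chosen noise level are assumptions made for illustration only.

```python
import numpy as np

def log_likelihood_per_case(X, y, w, sigma_n):
    """Sum over the n cases of log N(y_i | x_i^T w, sigma_n^2)."""
    f = X.T @ w                                   # function values, shape (n,)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma_n**2)
                  - (y - f) ** 2 / (2 * sigma_n**2))

def log_likelihood_joint(X, y, w, sigma_n):
    """log N(y | X^T w, sigma_n^2 I), using the squared Euclidean length."""
    n = y.shape[0]
    r = y - X.T @ w                               # residual vector
    return (-0.5 * n * np.log(2 * np.pi * sigma_n**2)
            - (r @ r) / (2 * sigma_n**2))

# Hypothetical data: D = 3 input dimensions, n = 5 cases (columns of X).
rng = np.random.default_rng(0)
D, n = 3, 5
X = rng.normal(size=(D, n))
w = rng.normal(size=D)
sigma_n = 0.5
y = X.T @ w + sigma_n * rng.normal(size=n)

a = log_likelihood_per_case(X, y, w, sigma_n)
b = log_likelihood_joint(X, y, w, sigma_n)
# a and b agree up to floating-point rounding.
```

Working with log densities rather than the product itself is a design choice made here to avoid numerical underflow when $n$ is large.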
We put a zero mean Gaussian prior with covariance matrix $\Sigma_p$ on the weights

$$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p).$$

The role and properties of this prior will be discussed in a later section; for now we will continue the derivation with the prior as specified.

Inference in the Bayesian linear model is based on the posterior distribution over the weights, computed by Bayes' rule,²

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}, \qquad p(\mathbf{w} \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)},$$

where the normalizing constant, also known as the marginal likelihood (see page 19), is independent of the weights and given by

$$p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}.$$
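The posterior combines the likelihood and the prior, and captures everything we know about the parameters. Writing only the terms from the likelihood and prior which depend on the weights, and "completing the square", we obtain

$$p(\mathbf{w} \mid X, \mathbf{y}) \propto \exp\!\left(-\frac{1}{2\sigma_n^2} (\mathbf{y} - X^\top \mathbf{w})^\top (\mathbf{y} - X^\top \mathbf{w})\right) \exp\!\left(-\frac{1}{2} \mathbf{w}^\top \Sigma_p^{-1} \mathbf{w}\right) \propto \exp\!\left(-\frac{1}{2} (\mathbf{w} - \bar{\mathbf{w}})^\top \left(\frac{1}{\sigma_n^2} X X^\top + \Sigma_p^{-1}\right) (\mathbf{w} - \bar{\mathbf{w}})\right),$$

where $\bar{\mathbf{w}} = \sigma_n^{-2} (\sigma_n^{-2} X X^\top + \Sigma_p^{-1})^{-1} X \mathbf{y}$, and we recognize the form of the posterior distribution as Gaussian with mean $\bar{\mathbf{w}}$ and covariance matrix $A^{-1}$:

$$p(\mathbf{w} \mid X, \mathbf{y}) \sim \mathcal{N}\!\left(\bar{\mathbf{w}} = \frac{1}{\sigma_n^2} A^{-1} X \mathbf{y},\; A^{-1}\right),$$

where $A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1}$.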
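The posterior mean and covariance can be computed directly from these expressions. The following is a minimal NumPy sketch, not a reference implementation: the function name `weight_posterior`, the synthetic data, the identity prior covariance and the use of a row of ones to represent a bias weight are all assumptions made for illustration.

```python
import numpy as np

def weight_posterior(X, y, sigma_n, Sigma_p):
    """Posterior N(w_bar, A^{-1}) over the weights of the Bayesian linear model.

    X: (D, n) design matrix with one data point per column
    y: (n,) targets
    sigma_n: noise standard deviation
    Sigma_p: (D, D) prior covariance of the weights
    """
    # A = sigma_n^{-2} X X^T + Sigma_p^{-1}
    A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
    # w_bar = sigma_n^{-2} A^{-1} X y, computed with a linear solve
    w_bar = np.linalg.solve(A, X @ y) / sigma_n**2
    cov = np.linalg.inv(A)
    return w_bar, cov

# Hypothetical data: one real input plus a constant 1 acting as the bias input.
rng = np.random.default_rng(1)
D, n = 2, 50
w_true = np.array([1.5, -0.5])                 # assumed slope and offset
X = np.vstack([rng.uniform(-3, 3, size=n), np.ones(n)])
sigma_n = 0.3
y = X.T @ w_true + sigma_n * rng.normal(size=n)

w_bar, cov = weight_posterior(X, y, sigma_n, Sigma_p=np.eye(D))
```

Because the posterior is Gaussian, its mean is also its mode, so the gradient of the log posterior, $\sigma_n^{-2} X(\mathbf{y} - X^\top \mathbf{w}) - \Sigma_p^{-1}\mathbf{w}$, should vanish at $\bar{\mathbf{w}}$; this gives a simple sanity check on any implementation.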