Deep Gaussian Processes


Andreas C. Damianou, Neil D. Lawrence
Dept. of Computer Science & Sheffield Institute for Translational Neuroscience, University of Sheffield, UK

Abstract

In this paper we introduce deep Gaussian process (GP) models. Deep GPs are a deep belief network based on Gaussian process mappings. The data is modeled as the output of a multivariate GP. The inputs to that Gaussian process are then governed by another GP. A single layer model is equivalent to a standard GP or the GP latent variable model (GP-LVM). We perform inference in the model by approximate variational marginalization. This results in a strict lower bound on the marginal likelihood of the model which we use for model selection (number of layers and nodes per layer).

Deep belief networks are typically applied to relatively large data sets using stochastic gradient descent for optimization. Our fully Bayesian treatment allows for the application of deep models even when data is scarce. Model selection by our variational bound shows that a five layer hierarchy is justified even when modelling a digit data set containing only 150 examples.

1 Introduction

Probabilistic modelling with neural network architectures constitutes a well studied area of machine learning. The recent advances in the domain of deep learning [Hinton and Osindero, 2006, Bengio et al., 2012] have brought this kind of model back into popularity. Empirically, deep models seem to have structural advantages that can improve the quality of learning in complicated data sets associated with abstract information [Bengio, 2009].

Most deep algorithms require a large amount of data to perform learning; however, we know that humans are able to perform inductive reasoning (equivalent to concept generalization) with only a few examples [Tenenbaum et al., 2006]. This provokes the question as to whether deep structures and the learning of abstract structure can be undertaken in smaller data sets. For smaller data sets, questions of generalization arise: to demonstrate such structures are justified it is useful to have an objective measure of the model's applicability.

The traditional approach to deep learning is based around binary latent variables and the restricted Boltzmann machine (RBM) [Hinton, 2010].

Deep hierarchies are constructed by stacking these models, and various approximate inference techniques (such as contrastive divergence) are used for estimating model parameters. A significant amount of work then has to be done with annealed importance sampling if even the likelihood¹ of a data set under the RBM model is to be estimated [Salakhutdinov and Murray, 2008]. When deeper hierarchies are considered, the estimate is only of a lower bound on the data likelihood. Fitting such models to smaller data sets and using Bayesian approaches to deal with the complexity seems completely futile when faced with these intractabilities.

The emergence of the Boltzmann machine (BM) at the core of one of the most interesting approaches to modern machine learning is very much a case of the field going back to the future: BMs rose to prominence in the early 1980s, but the practical implications associated with their training led to their neglect until families of algorithms were developed for the RBM model, with its reintroduction as a product of experts in the late nineties [Hinton, 1999].

The computational intractabilities of Boltzmann machines led other families of methods, in particular kernel methods such as the support vector machine (SVM), to be considered for the domain of data classification. Almost contemporaneously to the SVM, Gaussian process (GP) models [Rasmussen and Williams, 2006] were introduced as a fully probabilistic substitute for the multilayer perceptron (MLP), inspired by the observation [Neal, 1996] that, under certain conditions, a GP is an MLP with infinite units in the hidden layer. MLPs also relate to deep learning models: deep learning algorithms have been used to pretrain autoencoders for dimensionality reduction [Hinton and Salakhutdinov, 2006].

¹ We use emphasis to clarify we are referring to the model likelihood, not the marginal likelihood required in Bayesian approaches.

Traditional GP models have been extended to more expressive variants, for example by considering sophisticated covariance functions [Durrande et al., 2011, Gönen and Alpaydın, 2011] or by embedding GPs in more complex probabilistic structures [Snelson et al., 2004, Wilson et al., 2012] able to learn more powerful representations of the data. However, all GP-based approaches considered so far do not lead to a principled way of obtaining truly deep architectures and, to date, the field of deep learning remains mainly associated with RBM-based models.

The conditional probability of a single hidden unit in an RBM model, given its parents, is written as
$$p(y \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x})^{y} \left(1 - \sigma(\mathbf{w}^\top \mathbf{x})\right)^{(1-y)},$$
where here $y$ is the output variable of the RBM, $\mathbf{x}$ is the set of inputs being conditioned on and $\sigma(z) = (1 + \exp(-z))^{-1}$.
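As an illustrative aside (not part of the original paper), a minimal NumPy sketch of this Bernoulli conditional; the weight vector and input below are made-up values chosen only to show the computation:

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = (1 + exp(-z))^-1."""
    return 1.0 / (1.0 + np.exp(-z))

def rbm_conditional(y, x, w):
    """p(y | x) = sigma(w^T x)^y * (1 - sigma(w^T x))^(1 - y).

    The Bernoulli probability of a single binary unit y given its
    parents x depends only on the linear weighted sum w^T x.
    """
    p = sigmoid(w @ x)
    return p**y * (1.0 - p)**(1 - y)

# Hypothetical numbers purely for illustration.
w = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 0.0, 1.0])
print(rbm_conditional(1, x, w))  # probability that the unit switches on
```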

The conditional density of the output depends only on a linear weighted sum of the inputs. The representational power of a Gaussian process in the same role is significantly greater than that of an RBM. For the GP the corresponding likelihood is over a continuous variable, but it is a nonlinear function of the inputs,
$$p(y \mid \mathbf{x}) = \mathcal{N}\!\left(y \mid f(\mathbf{x}), \sigma^2\right),$$
where $\mathcal{N}(\cdot \mid \mu, \sigma^2)$ is a Gaussian density with mean $\mu$ and variance $\sigma^2$. In this case the likelihood is dependent on a mapping function, $f(\cdot)$, rather than a set of intermediate parameters, $\mathbf{w}$. The approach in Gaussian process modelling is to place a prior directly over the classes of functions (which often specifies smooth, stationary nonlinear functions) and integrate them out.
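A minimal sketch of this likelihood (an editorial illustration, not the paper's code): one nonlinear function is drawn from a GP prior with an exponentiated-quadratic covariance, a common but here simply assumed choice, and the observations get a Gaussian density around it:

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance: a smooth, stationary GP prior."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50)[:, None]

# Draw one nonlinear function f from the GP prior over functions.
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))
f = rng.multivariate_normal(np.zeros(len(X)), K)

# Gaussian observation likelihood p(y | x) = N(y | f(x), sigma^2).
sigma = 0.1
y = f + sigma * rng.normal(size=len(X))
log_lik = norm.logpdf(y, loc=f, scale=sigma).sum()
print(log_lik)
```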

This integration can be done analytically. In the RBM the model likelihood is estimated and maximized with respect to the parameters, $\mathbf{w}$. For the RBM, marginalizing $\mathbf{w}$ is not analytically tractable. We note in passing that the two approaches can be mixed if
$$p(y \mid \mathbf{x}) = \sigma(f(\mathbf{x}))^{y} \left(1 - \sigma(f(\mathbf{x}))\right)^{(1-y)},$$
which recovers a GP classification model. Analytic integration is no longer possible though, and a common approach to approximate inference is the expectation propagation algorithm [see Rasmussen and Williams, 2006]. However, we don't consider this idea further in this paper.

Inference in deep models requires marginalization of $\mathbf{x}$ as they are typically treated as latent variables², which in the case of the RBM are binary variables.
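For the Gaussian (regression) case, integrating the function out yields the closed-form marginal $\log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, K + \sigma^2 I)$. The sketch below (an editorial illustration with an assumed RBF kernel and toy data) computes that quantity:

```python
import numpy as np

def rbf_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_log_marginal(X, y, sigma2):
    """log p(y | X) = log N(y | 0, K + sigma^2 I): f is integrated out in closed form."""
    n = len(y)
    K = rbf_kernel(X, X) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2 * np.pi))

# Made-up data; in practice X and y come from the task at hand.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
print(gp_log_marginal(X, y, sigma2=0.01))
```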

The number of terms in this sum over binary configurations scales exponentially with the input dimension, rendering it intractable for anything but the smallest models. In practice, sampling and, in particular, the contrastive divergence algorithm are used for training. Similarly, marginalizing $\mathbf{x}$ in the GP is analytically intractable, even for simple prior densities like the Gaussian. In the GP-LVM [Lawrence, 2005] this problem is solved through maximization with respect to the variables (instead of the parameters, which are marginalized), and these models have been combined in stacks to form the hierarchical GP-LVM [Lawrence and Moore, 2007], which is a maximum a posteriori (MAP) approach for learning deep GP models.

² They can also be treated as observed, in the uppermost layer of the hierarchy where we might include the data.
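A minimal sketch of this point-estimate idea (editorial, not the paper's implementation): the latent inputs X are treated as free variables and optimized under the GP marginal likelihood of the observed Y. The kernel, optimizer, data and latent dimensionality below are all assumptions made for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, variance=1.0, lengthscale=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def neg_log_marginal(x_flat, Y, q, sigma2=0.01):
    """- sum_d log N(Y[:, d] | 0, K(X, X) + sigma^2 I): the objective in the latent X."""
    n, d = Y.shape
    X = x_flat.reshape(n, q)
    K = rbf_kernel(X) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    return (0.5 * np.sum(Y * alpha)
            + d * np.sum(np.log(np.diag(L)))
            + 0.5 * n * d * np.log(2 * np.pi))

# Toy observed data Y; the latent dimensionality q is a modelling choice.
rng = np.random.default_rng(2)
Y = rng.normal(size=(30, 5))
q = 2
X0 = rng.normal(size=(30, q))

# Point estimate of the latent inputs: X is optimized, not marginalized.
res = minimize(neg_log_marginal, X0.ravel(), args=(Y, q),
               method="L-BFGS-B", options={"maxiter": 50})
X_point = res.x.reshape(30, q)
```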

For this MAP approach to work, however, a strong prior is required on the top level of the hierarchy to ensure the algorithm works, and MAP learning prohibits model selection because no estimate of the marginal likelihood is available.

There are two main contributions in this paper. Firstly, we exploit recent advances in variational inference [Titsias and Lawrence, 2010] to marginalize the latent variables in the hierarchy variationally. Damianou et al. [2011] have already shown how, using these approaches, two Gaussian process models can be stacked. This paper goes further to show that through variational approximations any number of GP models can be stacked to give truly deep hierarchies. The variational approach gives us a rigorous lower bound on the marginal likelihood of the model, allowing it to be used for model selection.
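To make the stacked construction concrete, here is a small generative sketch (an editorial illustration, not taken from the paper) of sampling from such a hierarchy, where each layer's outputs become the inputs of the next GP mapping. The depth, layer widths, kernel and noise level are arbitrary choices, and the variational training procedure itself is not shown:

```python
import numpy as np

def rbf_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def sample_gp_layer(X, out_dim, rng, jitter=1e-6):
    """Draw out_dim independent functions from a GP prior, evaluated at X."""
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))
    return rng.multivariate_normal(np.zeros(len(X)), K, size=out_dim).T

rng = np.random.default_rng(3)
n = 100
Z = rng.normal(size=(n, 2))           # top-layer latent inputs

# Each layer is a GP mapping whose outputs feed the layer below.
H1 = sample_gp_layer(Z,  3, rng)       # hidden layer
H2 = sample_gp_layer(H1, 3, rng)       # deeper hidden layer
F  = sample_gp_layer(H2, 5, rng)       # mapping to the observed space
Y  = F + 0.05 * rng.normal(size=F.shape)   # noisy observations
```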

