Bayesian Modelling




Transcription of Bayesian Modelling

Bayesian Modelling

Zoubin Ghahramani
Department of Engineering
University of Cambridge, 2012
La Palma

An Information Revolution?

We are in an era of abundant data:

- Society: the web, social networks, mobile networks, government, digital archives
- Science: large-scale scientific experiments, biomedical data, climate data, scientific literature
- Business: e-commerce, electronic trading, advertising, personalisation

We need tools for modelling, searching, visualising, and understanding large data sets.

Tools

Our modelling tools should:

- Faithfully represent uncertainty in our model structure and parameters, and noise in our data
- Be automated and adaptive
- Exhibit robustness
- Scale well to large data sets

Probabilistic Modelling

- A model describes data that one could observe from a system.
- If we use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model ...

- ... then inverse probability (Bayes rule) allows us to infer unknown quantities, adapt our models, make predictions and learn from data.

Bayes Rule

$$P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{data} \mid \text{hypothesis})\, P(\text{hypothesis})}{P(\text{data})}$$

Rev'd Thomas Bayes (1702-1761)

- Bayes rule tells us how to do inference about hypotheses from data.
- Learning and prediction can be seen as forms of inference.

Modeling vs toolbox views of Machine Learning

Machine Learning seeks to learn models of data: define a space of possible models; learn the parameters and structure of the models from data; make predictions and decisions.

Machine Learning is a toolbox of methods for processing data: feed the data into one of many possible methods; choose methods that have good theoretical or empirical performance;

make predictions and decisions.

Plan

- Introduce Foundations
- The Intractability Problem
- Approximation Tools
- Advanced Topics
- Limitations and Discussion

Detailed Plan [Some parts will be skipped]

- Introduce Foundations
  - Some canonical problems: classification, regression, density estimation
  - Representing beliefs and the Cox axioms
  - The Dutch Book Theorem
  - Asymptotic Certainty and Consensus
  - Occam's Razor and Marginal Likelihoods
  - Choosing Priors
    - Objective Priors: Noninformative, Jeffreys, Reference
    - Subjective Priors
    - Hierarchical Priors
    - Empirical Priors
    - Conjugate Priors
- The Intractability Problem
- Approximation Tools
  - Laplace's Approximation
  - Bayesian Information Criterion (BIC)
  - Variational Approximations
  - Expectation Propagation
  - MCMC
  - Exact Sampling
- Advanced Topics
  - Feature Selection and ARD
  - Bayesian Discriminative Learning (BPM vs SVM)

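  - From Parametric to Nonparametric Methods
  - Gaussian Processes
  - Dirichlet Process Mixtures
- Limitations and Discussion
  - Reconciling Bayesian and Frequentist Views
  - Limitations and Criticisms of Bayesian Methods
  - Discussion

Some Canonical Machine Learning Problems

- Linear Classification
- Polynomial Regression
- Clustering with Gaussian Mixtures (Density Estimation)

Linear Classification

Data: $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}$ for $n = 1, \ldots, N$ data points, with $x^{(n)} \in \mathbb{R}^D$ and $y^{(n)} \in \{+1, -1\}$.

[Figure: two classes of data points, marked x and o.]

Model:

$$P(y^{(n)} = +1 \mid \theta, x^{(n)}) = \begin{cases} 1 & \text{if } \sum_{d=1}^{D} \theta_d x_d^{(n)} + \theta_0 \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

Parameters: $\theta \in \mathbb{R}^{D+1}$

Goal: To infer $\theta$ from the data and to predict future labels $P(y \mid \mathcal{D}, x)$.

Polynomial Regression

Data: $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}$ for $n = 1, \ldots, N$, with $x^{(n)} \in \mathbb{R}$ and $y^{(n)} \in \mathbb{R}$.

[Figure: scatter plot of the regression data.]

Model: $y^{(n)} = a_0 + a_1 x^{(n)} + a_2 (x^{(n)})^2 + \cdots + a_m (x^{(n)})^m + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

Parameters: $\theta = (a_0, \ldots, a_m, \sigma)$

Goal: To infer $\theta$ from the data and to predict future outputs $P(y \mid \mathcal{D}, x, m)$.

Clustering with Gaussian Mixtures (Density Estimation)

Data: $\mathcal{D} = \{x^{(n)}\}$ for $n = 1, \ldots, N$, with $x^{(n)} \in \mathbb{R}^D$.

Model: $x^{(n)} \sim \sum_{i=1}^{m} \pi_i\, p_i(x^{(n)})$, where $p_i(x^{(n)}) = \mathcal{N}(\mu^{(i)}, \Sigma^{(i)})$.

Parameters: $\theta = ((\mu^{(1)}, \Sigma^{(1)}), \ldots, (\mu^{(m)}, \Sigma^{(m)}), \pi)$

Goal: To infer $\theta$ from the data, predict the density $p(x \mid \mathcal{D}, m)$, and infer which points belong to the same cluster.

Bayesian Machine Learning

Everything follows from two simple rules:

Sum rule: $P(x) = \sum_y P(x, y)$
Product rule: $P(x, y) = P(x)\, P(y \mid x)$

$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})}$$

Here $P(\mathcal{D} \mid \theta)$ is the likelihood of $\theta$, $P(\theta)$ is the prior probability of $\theta$, and $P(\theta \mid \mathcal{D})$ is the posterior of $\theta$ given $\mathcal{D}$.

Prediction:

$$P(x \mid \mathcal{D}, m) = \int P(x \mid \theta, \mathcal{D}, m)\, P(\theta \mid \mathcal{D}, m)\, d\theta$$

Model Comparison:

$$P(m \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid m)\, P(m)}{P(\mathcal{D})}, \qquad P(\mathcal{D} \mid m) = \int P(\mathcal{D} \mid \theta, m)\, P(\theta \mid m)\, d\theta$$

That's it!

Questions

- Why be Bayesian?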

- Where does the prior come from?
- How do we do these integrals?

Representing Beliefs (Artificial Intelligence)

Consider a robot. In order to behave intelligently the robot should be able to represent beliefs about propositions in the world:

- "my charging station is at location (x,y,z)"
- "my rangefinder is malfunctioning"
- "that stormtrooper is hostile"

We want to represent the strength of these beliefs numerically in the brain of the robot, and we want to know what mathematical rules we should use to manipulate those beliefs.

Representing Beliefs II

Let's use $b(x)$ to represent the strength of belief in (plausibility of) proposition $x$.

- $0 \le b(x) \le 1$
- $b(x) = 0$: $x$ is definitely not true
- $b(x) = 1$: $x$ is definitely true
- $b(x \mid y)$: strength of belief that $x$ is true given that we know $y$ is true

Cox Axioms (Desiderata)

- Strengths of belief (degrees of plausibility) are represented by real numbers.
- Qualitative correspondence with common sense.
- Consistency:
  - If a conclusion can be reasoned in several ways, then each way should lead to the same answer.
  - The robot must always take into account all relevant evidence.
  - Equivalent states of knowledge are represented by equivalent plausibility assignments.

Consequence: Belief functions (e.g. $b(x)$, $b(x \mid y)$, $b(x, y)$) must satisfy the rules of probability theory, including sum rule, product rule and therefore Bayes rule.

(Cox 1946; Jaynes, 1996; van Horn, 2003)

The Dutch Book Theorem

Assume you are willing to accept bets with odds proportional to the strength of your beliefs. That is, $b(x) = 0.9$ implies that you will accept a bet:

  x is true: win ≥ $1
  x is false: lose $9

Then, unless your beliefs satisfy the rules of probability theory, including Bayes rule, there exists a set of simultaneous bets (called a "Dutch Book") which you are willing to accept, and for which you are guaranteed to lose money, no matter what the outcome.

The only way to guard against Dutch Books is to ensure that your beliefs are coherent.
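The Dutch Book argument can be made concrete with a small numerical sketch. The example below is illustrative only and is not taken from the slides: it assumes an agent whose beliefs in x and in not-x add up to more than 1 (0.9 and 0.3, both made-up numbers), and a stake of $10 so that a belief of 0.9 reproduces the "win $1, lose $9" bet above. The function names are hypothetical.

```python
def bet_payoff(belief, stake, proposition_true):
    """Payoff of a bet the agent accepts at odds set by its own belief.

    With belief 0.9 and stake 10 this is the bet from the text:
    win $1 if the proposition is true, lose $9 if it is false.
    """
    if proposition_true:
        return (1.0 - belief) * stake
    return -belief * stake


def dutch_book_payoffs(belief_x, belief_not_x, stake=10.0):
    """Total payoff in each outcome when the agent bets on both x and not-x."""
    payoffs = {}
    for x_is_true in (True, False):
        on_x = bet_payoff(belief_x, stake, x_is_true)
        on_not_x = bet_payoff(belief_not_x, stake, not x_is_true)
        payoffs[x_is_true] = on_x + on_not_x
    return payoffs


# Assumed incoherent beliefs: b(x) + b(not x) = 1.2, which violates the sum rule.
incoherent = dutch_book_payoffs(0.9, 0.3)
# Coherent beliefs: b(x) + b(not x) = 1.
coherent = dutch_book_payoffs(0.9, 0.1)

print("incoherent:", incoherent)  # a loss of about $2 whether x is true or false
print("coherent:  ", coherent)    # about $0 in both outcomes, so no sure loss
```

In this sketch the guaranteed loss equals the stake times the amount by which the two beliefs exceed 1, which is why it vanishes when the beliefs obey the sum rule.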

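Coherent beliefs are those that satisfy the rules of probability theory.

Asymptotic Certainty

Assume that data set $\mathcal{D}_n$, consisting of $n$ data points, was generated from some true $\theta^*$. Then, under some regularity conditions, as long as $p(\theta^*) > 0$,

$$\lim_{n \to \infty} p(\theta \mid \mathcal{D}_n) = \delta(\theta - \theta^*)$$

In the unrealizable case, where data was generated from some $p^*(x)$ which cannot be modelled by any $\theta$, the posterior will converge to

$$\lim_{n \to \infty} p(\theta \mid \mathcal{D}_n) = \delta(\theta - \hat{\theta})$$

where $\hat{\theta}$ minimizes $\mathrm{KL}(p^*(x), p(x \mid \theta))$:

$$\hat{\theta} = \arg\min_{\theta} \int p^*(x) \log \frac{p^*(x)}{p(x \mid \theta)}\, dx = \arg\max_{\theta} \int p^*(x) \log p(x \mid \theta)\, dx$$

Warning: careful with the regularity conditions, these are just sketches of the theoretical results.

Asymptotic Consensus

Consider two Bayesians with different priors, $p_1(\theta)$ and $p_2(\theta)$, who observe the same data $\mathcal{D}$. Assume both Bayesians agree on the set of possible and impossible values of $\theta$:

$$\{\theta : p_1(\theta) > 0\} = \{\theta : p_2(\theta) > 0\}$$

Then, in the limit of $n \to \infty$, the posteriors $p_1(\theta \mid \mathcal{D}_n)$ and $p_2(\theta \mid \mathcal{D}_n)$ will converge (in uniform distance between distributions, $\rho(P_1, P_2) = \sup_E |P_1(E) - P_2(E)|$).

(Coin toss demo.)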
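A rough stand-in for a coin toss demonstration of these two results is sketched below. It is not the demo from the lecture. It assumes a Bernoulli coin with a made-up true bias of 0.7 and two hypothetical Beta priors that are both positive on the whole interval (0, 1), so the two Bayesians agree on which values are possible. For simplicity it tracks the gap between posterior means and the posterior standard deviation, which is only a proxy for the uniform distance used above.

```python
import random


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_std(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5


def simulate_consensus(true_bias=0.7, checkpoints=(10, 100, 1000, 10000), seed=0):
    """Toss a coin and record two Bayesians' posteriors at each checkpoint."""
    rng = random.Random(seed)
    prior_1 = (1.0, 1.0)    # assumed: flat prior over the coin's bias
    prior_2 = (2.0, 20.0)   # assumed: prior that strongly expects tails
    heads = 0
    history = []
    for n in range(1, max(checkpoints) + 1):
        heads += rng.random() < true_bias  # True counts as 1
        if n in checkpoints:
            tails = n - heads
            # Conjugate update: Beta(a, b) prior -> Beta(a + heads, b + tails)
            post_1 = (prior_1[0] + heads, prior_1[1] + tails)
            post_2 = (prior_2[0] + heads, prior_2[1] + tails)
            history.append({
                "n": n,
                "mean_1": beta_mean(*post_1),
                "mean_2": beta_mean(*post_2),
                "std_1": beta_std(*post_1),
            })
    return history


history = simulate_consensus()
for row in history:
    gap = abs(row["mean_1"] - row["mean_2"])
    print(row["n"], round(row["mean_1"], 3), round(row["mean_2"], 3), round(gap, 4))
```

As the number of tosses grows, the printed gap between the two posterior means shrinks (consensus) and the posterior standard deviation falls (certainty), in line with the limits stated above.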

Demo: bayescoin

Model Selection

[Figure: eight panels, M = 0 to M = 7.]

Bayesian Occam's Razor and Model Selection

Compare model classes, e.g. $m$ and $m'$, using posterior probabilities given $\mathcal{D}$:

$$p(m \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid m)\, p(m)}{p(\mathcal{D})}, \qquad p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \theta, m)\, p(\theta \mid m)\, d\theta$$

Interpretations of the Marginal Likelihood ("model evidence"):

- The probability that randomly selected parameters from the prior would generate $\mathcal{D}$.
- Probability of the data under the model, averaging over all possible parameter values.
- $\log_2\left(\frac{1}{p(\mathcal{D} \mid m)}\right)$ is the number of bits of surprise at observing data $\mathcal{D}$ under model $m$.

Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

[Figure: $P(\mathcal{D} \mid m)$ over all possible data sets of size $n$, for model classes labelled "too simple", "too complex" and "just right".]

Bayesian Model Selection: Occam's Razor at Work

[Figure: eight panels, M = 0 to M = 7, with a plot of the model evidence P(Y|M).]

For example, for quadratic polynomials ($m = 2$): $y = a_0 + a_1 x + a_2 x^2 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and parameters $\theta = (a_0\ a_1\ a_2\ \sigma)$.

Demo: polybayes
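The model comparison formula can be evaluated exactly in a toy case. The sketch below is not the polynomial example from the slides. It assumes two hypothetical model classes for a sequence of coin tosses: a "fair coin" class with no free parameters, and an "unknown bias" class with a uniform prior on the bias, whose marginal likelihood is the Beta function $B(h+1, t+1)$ for $h$ heads and $t$ tails. Equal prior probability for the two classes is also an assumption.

```python
import math


def log_evidence_fair(heads, tails):
    """log p(D | m) for the simple class: every toss has probability 1/2."""
    return (heads + tails) * math.log(0.5)


def log_evidence_unknown_bias(heads, tails):
    """log p(D | m) for the flexible class with a uniform prior on the bias.

    The integral of theta^h (1 - theta)^t over [0, 1] is B(h + 1, t + 1).
    """
    return (math.lgamma(heads + 1) + math.lgamma(tails + 1)
            - math.lgamma(heads + tails + 2))


def bits_of_surprise(log_evidence):
    """log2(1 / p(D | m)), computed from the natural-log evidence."""
    return -log_evidence / math.log(2)


def posterior_fair(heads, tails, prior_fair=0.5):
    """p(fair | D) when only these two model classes are considered."""
    log_f = log_evidence_fair(heads, tails) + math.log(prior_fair)
    log_u = log_evidence_unknown_bias(heads, tails) + math.log(1.0 - prior_fair)
    top = max(log_f, log_u)  # subtract the max for numerical stability
    w_f, w_u = math.exp(log_f - top), math.exp(log_u - top)
    return w_f / (w_f + w_u)


for heads, tails in [(5, 5), (9, 1)]:
    print(heads, tails,
          round(bits_of_surprise(log_evidence_fair(heads, tails)), 2),
          round(bits_of_surprise(log_evidence_unknown_bias(heads, tails)), 2),
          round(posterior_fair(heads, tails), 3))
```

With balanced data the simpler class is preferred even though the flexible class contains it as a special case; with lopsided data the flexible class wins. This is the Occam's razor effect described above: the flexible class spreads its probability over many possible data sets.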

Demo: run simple

On Choosing Priors

- Objective Priors: noninformative priors that attempt to capture ignorance and have good frequentist properties.
- Subjective Priors: priors should capture our beliefs as well as possible. They are subjective but not arbitrary.
- Hierarchical Priors: multiple levels of priors:
  $$p(\theta) = \int d\alpha\, p(\theta \mid \alpha)\, p(\alpha) = \int d\alpha\, p(\theta \mid \alpha) \int d\beta\, p(\alpha \mid \beta)\, p(\beta)$$
- Empirical Priors: learn some of the parameters of the prior from the data ("Empirical Bayes").

Subjective Priors

Priors should capture our beliefs as well as possible. Otherwise we are not coherent.

How do we know our beliefs?

- Think about the problem domain (no black box view of machine learning).
- Generate data from the prior. Does it match expectations?
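The advice to generate data from the prior can be tried on the polynomial regression model. The following is a minimal sketch under assumptions that are not in the slides: independent zero-mean Gaussian priors on the coefficients with a common scale `coef_std`, a fixed noise level `noise_std`, and inputs at the integers 0 to 10. All names are hypothetical.

```python
import random


def sample_coefficients(degree, coef_std, rng):
    """Draw a_0, ..., a_m from the assumed prior N(0, coef_std^2)."""
    return [rng.gauss(0.0, coef_std) for _ in range(degree + 1)]


def generate_outputs(coefs, xs, noise_std, rng):
    """Generate y = a_0 + a_1 x + ... + a_m x^m + Gaussian noise."""
    ys = []
    for x in xs:
        mean = sum(a * x ** k for k, a in enumerate(coefs))
        ys.append(mean + rng.gauss(0.0, noise_std))
    return ys


def prior_predictive_range(degree=2, coef_std=1.0, noise_std=1.0,
                           n_draws=200, seed=0):
    """Smallest and largest output seen across data sets drawn from the prior."""
    rng = random.Random(seed)
    xs = [float(i) for i in range(11)]  # assumed inputs: 0, 1, ..., 10
    lowest, highest = float("inf"), float("-inf")
    for _ in range(n_draws):
        coefs = sample_coefficients(degree, coef_std, rng)
        ys = generate_outputs(coefs, xs, noise_std, rng)
        lowest, highest = min(lowest, min(ys)), max(highest, max(ys))
    return lowest, highest


narrow = prior_predictive_range(coef_std=0.1)
wide = prior_predictive_range(coef_std=10.0)
print("coef_std=0.1 :", narrow)
print("coef_std=10.0:", wide)
```

If the outputs we expect to observe lie within, say, a few tens of units, a prior whose simulated data sets routinely reach into the thousands does not match our beliefs and its scale should be reconsidered.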

