
Tutorial on Estimation and Multivariate Gaussians


STAT 27725/CMSC 25400: Machine Learning
Shubhendu Trivedi - Technological Institute
October 2015

Things we will look at today
- Maximum Likelihood Estimation
- ML for Bernoulli Random Variables
- Maximizing a Multinomial Likelihood: Lagrange Multipliers
- Multivariate Gaussians
- Properties of Multivariate Gaussians
- Maximum Likelihood for Multivariate Gaussians
- (Time permitting) Mixture Models

The Principle of Maximum Likelihood

Suppose we have N data points X = \{x_1, x_2, ..., x_N\} (or \{(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)\}).
Suppose we know the probability distribution function that describes the data, p(x; \theta) (or p(y | x; \theta)).
Suppose we want to determine the parameter(s) \theta.
Pick \theta so as to explain your data best. What does this mean?
Suppose we had two parameter values (or vectors) \theta_1 and \theta_2, and suppose you were to pretend that \theta_1 was really the true value parameterizing p. What would be the probability that you would get the dataset that you have? Call this P_1.
If P_1 is very small, it means that such a dataset is very unlikely to occur, so perhaps \theta_1 was not a good guess.

The Principle of Maximum Likelihood
We want to pick \theta_{ML}, i.e. the best value of \theta that explains the data you have.
The plausibility of the given data is measured by the likelihood function p(x; \theta).
The Maximum Likelihood principle thus suggests we pick the \theta that maximizes the likelihood function.
The procedure:
- Write the log-likelihood function \log p(x; \theta) (we'll see later why log).
- We want to maximize it, so differentiate \log p(x; \theta) with respect to \theta and set the derivative to zero.
- Solve for the \theta that satisfies the equation. This is \theta_{ML}.

The Principle of Maximum Likelihood
As an aside: sometimes we have an initial guess for \theta BEFORE seeing the data. We then use the data to refine our guess of \theta using Bayes Theorem. This is called MAP (Maximum a posteriori) Estimation (we'll see an example).

Advantages of ML Estimation:
- Cookbook, "turn the crank" method
- "Optimal" for large data sizes

Disadvantages of ML Estimation:
- Not optimal for small sample sizes
- Can be computationally challenging (numerical methods)
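To make the "turn the crank" recipe concrete, here is a minimal sketch that is not part of the original slides; the 1-D Gaussian example, the use of NumPy, and all variable names are assumptions of this illustration. Differentiating the Gaussian log-likelihood in \mu and setting it to zero gives \mu_{ML} equal to the sample mean; the grid search below simply confirms the closed form numerically.

    # Maximum likelihood for the mean of 1-D Gaussian data with known variance 1
    # (illustration only; the slides derive the Bernoulli case next).
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.5, scale=1.0, size=500)     # data with true mu = 1.5

    def log_likelihood(mu, x):
        # log prod_i N(x_i; mu, 1) = -(n/2) log(2 pi) - 0.5 sum_i (x_i - mu)^2
        return -0.5 * len(x) * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

    grid = np.linspace(0.0, 3.0, 3001)
    mu_numeric = grid[np.argmax([log_likelihood(m, x) for m in grid])]
    print(mu_numeric, x.mean())                      # both close to the true mean 1.5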

A Gentle Introduction: Coin Tossing

Problem: estimating bias in a coin toss
A single coin toss, or a sequence of n coin tosses, produces a sequence of values, e.g. for N = 4: T,H,T,H or H,H,T,T or T,T,T,H.
A probabilistic model allows us to model the uncertainty inherent in the process (randomness in tossing a coin), as well as our uncertainty about the properties of the source (fairness of the coin).

Probabilistic model
First, for convenience, convert H -> 1, T -> 0. We have a random variable X taking values in \{0, 1\}.
Bernoulli distribution with parameter \theta: \Pr(X = 1; \theta) = \theta.
We will write for simplicity p(x) or p(x; \theta) instead of \Pr(X = x; \theta).
The parameter \theta \in [0, 1] specifies the bias of the coin; the coin is fair if \theta = 1/2.
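A small sketch of this model, not from the slides (NumPy, the seed, and the variable names are assumptions of this illustration): encode H as 1 and T as 0, sample tosses with a given bias \theta, and evaluate the Bernoulli pmf.

    import numpy as np

    theta = 0.5                                       # bias of the coin; 0.5 = fair
    rng = np.random.default_rng(1)
    tosses = rng.binomial(n=1, p=theta, size=10)      # each toss: 1 = H, 0 = T
    print(tosses)

    def p(x, theta):
        # Bernoulli pmf written as theta^x * (1 - theta)^(1 - x) for x in {0, 1}
        return theta ** x * (1 - theta) ** (1 - x)

    print(p(1, theta), p(0, theta))                   # 0.5 0.5 for a fair coin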

Reminder: probability distributions
A discrete random variable X takes values in a set \mathcal{X} = \{x_1, x_2, ...\}.
The probability mass function p: \mathcal{X} \to [0, 1] satisfies the law of total probability: \sum_{x \in \mathcal{X}} p(X = x) = 1.
Hence, for the Bernoulli distribution we know p(0; \theta) = 1 - p(1; \theta) = 1 - \theta.

Sequence probability
Now consider two tosses of the same coin, X_1, X_2. We can consider a number of probability distributions:
- Joint distribution p(X_1, X_2)
- Conditional distributions p(X_1 | X_2), p(X_2 | X_1)
- Marginal distributions p(X_1), p(X_2)
We already know the marginal distributions: p(X_1 = 1; \theta) = \theta and p(X_2 = 1; \theta) = \theta. What about the conditional?

Sequence probability (contd)
We will assume the sequence is i.i.d. - independently identically distributed. Independence, by definition, means p(X_1 | X_2) = p(X_1) and p(X_2 | X_1) = p(X_2), i.e. the conditional is the same as the marginal: knowing that X_2 was H does not tell us anything about X_1. Hence, we can compute the joint distribution using the chain rule of probability:
p(X_1, X_2) = p(X_1) p(X_2 | X_1) = p(X_1) p(X_2)
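A quick numeric check of the two facts above; this sketch is not part of the slides, and the small pmf helper and variable names are assumptions of the illustration. It verifies that the Bernoulli pmf sums to 1, and that under independence the joint of two tosses factorizes and again sums to 1.

    theta = 1 / 3
    p = lambda x: theta ** x * (1 - theta) ** (1 - x)   # Bernoulli pmf

    # Law of total probability: p(0; theta) + p(1; theta) = 1
    assert abs(p(0) + p(1) - 1.0) < 1e-12

    # Independence: the joint of two tosses factorizes, p(x1, x2) = p(x1) * p(x2),
    # and the four joint probabilities again sum to 1.
    joint = {(x1, x2): p(x1) * p(x2) for x1 in (0, 1) for x2 in (0, 1)}
    assert abs(sum(joint.values()) - 1.0) < 1e-12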

Sequence probability (contd)
More generally, for a sequence of n tosses,
p(x_1, ..., x_n; \theta) = \prod_{i=1}^{n} p(x_i; \theta).
Example: \theta = 1/3. Then
p(H,T,H; \theta) = p(H; \theta)^2 p(T; \theta) = (1/3)^2 \cdot (2/3) = 2/27.
Note: the order of outcomes does not matter, only the number of Hs.
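The same sequence probability computed numerically; a small sketch that is not part of the slides (the 0/1 encoding and names are assumptions of this illustration).

    # p(H,T,H; theta = 1/3) should equal (1/3)^2 * (2/3) = 2/27 ~= 0.0741
    theta = 1 / 3
    seq = [1, 0, 1]                                   # H -> 1, T -> 0
    prob = 1.0
    for x in seq:
        prob *= theta ** x * (1 - theta) ** (1 - x)   # multiply per-toss probabilities
    print(prob, 2 / 27)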

The parameter estimation problem
Given a sequence of n coin tosses x_1, ..., x_n \in \{0, 1\}^n, we want to estimate the bias \theta.
Consider two coins, each tossed 6 times:
- coin 1: H,H,T,H,H,H
- coin 2: T,H,T,T,H,H
What do you believe about \theta_1 vs. \theta_2? We need to convert this intuition into a precise procedure.

Maximum Likelihood estimator
We have considered p(x; \theta) as a function of x, parametrized by \theta. We can also view it as a function of \theta; this is called the likelihood function.
Maximum Likelihood estimator: choose the value of \theta that maximizes the likelihood given the observed data.

ML for Bernoulli
Likelihood of an i.i.d. sequence X = [x_1, ..., x_n]:
L(\theta) = p(X; \theta) = \prod_{i=1}^{n} p(x_i; \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i}
Log-likelihood:
l(\theta) = \log p(X; \theta) = \sum_{i=1}^{n} [x_i \log\theta + (1 - x_i) \log(1 - \theta)]
Due to the monotonicity of \log, we have
\mathrm{argmax}_\theta\, p(X; \theta) = \mathrm{argmax}_\theta\, \log p(X; \theta)
We will usually work with the log-likelihood (why?)
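A numeric illustration, not from the slides (NumPy, the use of coin 1's six tosses, and the grid are assumptions): the likelihood and the log-likelihood peak at the same \theta, which turns out to be the fraction of heads.

    import numpy as np

    x = np.array([1, 1, 0, 1, 1, 1])                  # coin 1: H,H,T,H,H,H
    thetas = np.linspace(1e-3, 1 - 1e-3, 10001)

    lik = np.array([np.prod(t ** x * (1 - t) ** (1 - x)) for t in thetas])
    loglik = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

    # Both maximizers agree (up to grid resolution) and match the fraction of heads, 5/6.
    print(thetas[lik.argmax()], thetas[loglik.argmax()], x.mean())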

ML for Bernoulli (contd)
The ML estimate is
\theta_{ML} = \mathrm{argmax}_\theta \left\{ \sum_{i=1}^{n} [x_i \log\theta + (1 - x_i) \log(1 - \theta)] \right\}
To find it, set the derivative to zero:
\frac{\partial}{\partial\theta} \log p(X; \theta) = \frac{1}{\theta} \sum_{i=1}^{n} x_i - \frac{1}{1 - \theta} \sum_{j=1}^{n} (1 - x_j) = 0
\frac{1 - \theta}{\theta} = \frac{\sum_{j=1}^{n} (1 - x_j)}{\sum_{i=1}^{n} x_i}
\theta_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i
The ML estimate is simply the fraction of times that H came up.

Are we done?
\theta_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i
Example: H,T,H,T gives \theta_{ML} = 1/2. How about H,H,H,H? \theta_{ML} = 1. Does this make sense?
Suppose we record a very large number of 4-toss sequences for a coin with true \theta = 1/2. We can expect to see H,H,H,H in about 1/16 of all sequences!
A more extreme case: consider a single toss. \theta_{ML} will be either 0 or 1!
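The closed form \theta_{ML} applied to the two coins from earlier, plus a check of the 1/16 remark; a sketch that is not part of the slides (NumPy and the names are assumptions).

    import numpy as np

    coin1 = np.array([1, 1, 0, 1, 1, 1])     # H,H,T,H,H,H
    coin2 = np.array([0, 1, 0, 0, 1, 1])     # T,H,T,T,H,H
    print(coin1.mean(), coin2.mean())         # theta_ML = 5/6 and 1/2

    # For a fair coin (theta = 1/2) the sequence H,H,H,H has probability (1/2)^4 = 1/16,
    # yet theta_ML computed from that one sequence would be 1.
    print(0.5 ** 4)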

Bayes rule
To proceed, we will need to use Bayes rule. We can write the joint probability of two RVs in two ways, using the chain rule:
p(X, Y) = p(X) p(Y | X) = p(Y) p(X | Y).
From here we get the Bayes rule:
p(X | Y) = \frac{p(X) p(Y | X)}{p(Y)}

Bayes rule and estimation
Now consider \theta to be a RV. We have
p(\theta | X) = \frac{p(X | \theta) p(\theta)}{p(X)}
Bayes rule converts the prior probability p(\theta) (our belief about \theta prior to seeing any data) to the posterior p(\theta | X), using the likelihood p(X | \theta).
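A tiny numeric check of Bayes rule on a made-up joint distribution over two binary variables; this is only an illustration and is not part of the slides.

    # Made-up joint distribution p(X, Y) over X, Y in {0, 1} (values are arbitrary)
    joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

    pX = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}          # marginal of X
    pY = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}          # marginal of Y
    pY_given_X = {(y, x): joint[(x, y)] / pX[x] for x in (0, 1) for y in (0, 1)}

    # Bayes rule: p(X = x | Y = y) = p(X = x) * p(Y = y | X = x) / p(Y = y)
    for x in (0, 1):
        for y in (0, 1):
            direct = joint[(x, y)] / pY[y]                           # conditional from the joint
            via_bayes = pX[x] * pY_given_X[(y, x)] / pY[y]
            assert abs(direct - via_bayes) < 1e-12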

MAP estimation
p(\theta | X) = \frac{p(X | \theta) p(\theta)}{p(X)}
The maximum a-posteriori (MAP) estimate is defined as
\theta_{MAP} = \mathrm{argmax}_\theta\, p(\theta | X)
Note: p(X) does not depend on \theta, so if we only care about finding the MAP estimate, we can write
p(\theta | X) \propto p(X | \theta) p(\theta)
What is p(\theta)?

Choice of prior
- Bayesian approach: try to reflect our belief about \theta
- Utilitarian approach: choose a prior which is computationally convenient
- Later in class: regularization - choose a prior that leads to better prediction performance
One possibility: uniform, p(\theta) \equiv 1 for all \theta \in [0, 1].
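A grid sketch, not from the slides (NumPy, the H,H,H,H data, and the grid are assumptions of this illustration), showing that with the uniform prior p(\theta) \equiv 1 the posterior is proportional to the likelihood, so the MAP estimate coincides with the ML estimate.

    import numpy as np

    x = np.array([1, 1, 1, 1])                        # H,H,H,H
    thetas = np.linspace(1e-3, 1 - 1e-3, 10001)

    loglik = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])
    log_prior = np.zeros_like(thetas)                 # uniform prior: log p(theta) = 0 on [0, 1]
    log_post = loglik + log_prior                     # log p(theta | X) up to a constant

    # With a flat prior the two maximizers coincide (both near 1 for this data).
    print(thetas[loglik.argmax()], thetas[log_post.argmax()])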

