
Mixture Models - Carnegie Mellon University


20.2 Estimating Parametric Mixture Models

From intro stats., we remember that it's generally a good idea to estimate distributions using maximum likelihood, when we can. How could we do that here? Remember that the likelihood is the probability (or probability density) of observing our data, as a function of the parameters.
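
As a concrete illustration of "the likelihood as a function of the parameters", here is a minimal sketch for a univariate Gaussian mixture. The function name, data, and parameter values are mine, purely for illustration; the text itself has not yet fixed a component family here.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mixture_log_likelihood(x, weights, means, sds):
    """Log-likelihood of data x under a univariate Gaussian mixture:
    sum_i log( sum_k lambda_k * N(x_i; mu_k, sigma_k) )."""
    x = np.asarray(x)[:, None]                        # shape (n, 1)
    log_comp = norm.logpdf(x, loc=means, scale=sds)   # shape (n, K): log f(x_i; theta_k)
    return logsumexp(log_comp, b=np.asarray(weights), axis=1).sum()

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

# The same data give different log-likelihoods under different parameter settings.
print(mixture_log_likelihood(x, [0.5, 0.5], [-2.0, 3.0], [1.0, 1.0]))  # near the truth
print(mixture_log_likelihood(x, [0.5, 0.5], [0.0, 0.0], [1.0, 1.0]))   # a worse setting
```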


Transcription of Mixture Models - Carnegie Mellon University

Chapter 20

Mixture Models

20.1 Two Routes to Mixture Models

20.1.1 From Factor Analysis to Mixture Models

In factor analysis, the origin myth is that we have a fairly small number, q, of real variables which happen to be unobserved ("latent"), and the much larger number p of variables we do observe arise as linear combinations of these factors, plus noise. The mythology is that it's possible for us (or for Someone) to continuously adjust the latent variables, and the distribution of observables likewise changes continuously. What if the latent variables are not continuous but ordinal, or even categorical? The natural idea would be that each value of the latent variable would give a different distribution of the observables.

20.1.2 From Kernel Density Estimates to Mixture Models

We have also previously looked at kernel density estimation, where we approximate the true distribution by sticking a small (weight 1/n) copy of a kernel pdf at each observed data point and adding them up.

With enough data, this comes arbitrarily close to any (reasonable) probability density, but it does have some drawbacks. Statistically, it labors under the curse of dimensionality. Computationally, we have to remember all of the data points, which is a lot. We saw similar problems when we looked at fully non-parametric regression, and then saw that both could be ameliorated by using things like additive models, which impose more constraints than, say, unrestricted kernel smoothing. Can we do something like that with density estimation?

Additive modeling for densities is not as common as it is for regression (it's harder to think of times when it would be natural and well-defined [1]), but we can do things to restrict density estimation.

[1] Remember that the integral of a probability density over all space must be 1, while the integral of a regression function doesn't have to be anything in particular. If we had an additive density, f(x) = \sum_j f_j(x_j), ensuring normalization is going to be very tricky; we'd need \sum_j \int f_j(x_j) \, dx_1 dx_2 \ldots dx_p = 1. It would be easier to ensure normalization while making the log-density additive, but that assumes the features are independent of each other.

For instance, instead of putting a copy of the kernel at every point, we might pick a small number K \ll n of points, which we feel are somehow typical or representative of the data, and put a copy of the kernel at each one (with weight 1/K). This uses less memory, but it ignores the other data points, and lots of them are probably very similar to those points we're taking as prototypes. The differences between prototypes and many of their neighbors are just matters of chance or noise.
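
A small sketch of the two estimates just described: a full kernel density estimate with weight 1/n at every data point, and the restricted version with weight 1/K at a handful of prototype points. The text does not say how the "representative" points should be picked; taking a random subsample (as below) or k-means centers are illustrative choices of mine.

```python
import numpy as np
from scipy.stats import norm

def kernel_density(grid, centers, h):
    """Place a Gaussian kernel of bandwidth h at each center, with equal weights
    summing to 1 (weight 1/n for the full KDE, 1/K for the prototype version)."""
    return norm.pdf(grid[:, None], loc=np.asarray(centers)[None, :], scale=h).mean(axis=1)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 2000)
grid = np.linspace(-4, 4, 201)

f_full = kernel_density(grid, data, h=0.3)              # kernel at every point, weight 1/n
prototypes = rng.choice(data, size=20, replace=False)   # K = 20 "representative" points
f_proto = kernel_density(grid, prototypes, h=0.3)       # kernel at each prototype, weight 1/K
```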

Rather than remembering all of those noisy details, why not collapse those data points, and just remember their common distribution? Different regions of the data space will have different shared distributions, but we can just keep track of those shared distributions.

20.1.3 Mixture Models

More formally, we say that a distribution f is a mixture of K component distributions f_1, f_2, ..., f_K if

f(x) = \sum_{k=1}^{K} \lambda_k f_k(x)        (20.1)

with the \lambda_k being the mixing weights, \lambda_k > 0, \sum_k \lambda_k = 1. Eq. 20.1 is a complete stochastic model, so it gives us a recipe for generating new data points: first pick a distribution, with probabilities given by the mixing weights, and then generate one observation according to that distribution.

Symbolically,

Z \sim \mathrm{Mult}(\lambda_1, \lambda_2, \ldots, \lambda_K)        (20.2)
X \mid Z \sim f_Z        (20.3)

where I've introduced the discrete random variable Z which says which component X is drawn from.

I haven't said what kind of distribution the f_k's are. In principle, we could make these completely arbitrary, and we'd still have a perfectly good mixture model. In practice, a lot of effort is given over to parametric mixture models, where the f_k are all from the same parametric family, but with different parameters; for instance they might all be Gaussians with different centers and variances, or all Poisson distributions with different means, or all power laws with different exponents.
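
The two-stage recipe in Eqs. 20.2-20.3 translates directly into a sampler. A minimal sketch for a Gaussian mixture; the component family and the particular weights, means, and standard deviations are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_mixture(n, weights, means, sds):
    """Draw n observations from a univariate Gaussian mixture by the two-stage recipe:
    Z ~ Mult(lambda_1, ..., lambda_K), then X | Z ~ f_Z."""
    z = rng.choice(len(weights), size=n, p=weights)            # which component each point comes from
    x = rng.normal(np.asarray(means)[z], np.asarray(sds)[z])   # draw from that component's distribution
    return x, z

x, z = sample_mixture(1000, weights=[0.3, 0.5, 0.2], means=[-2.0, 0.0, 3.0], sds=[0.5, 1.0, 0.7])
```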

(It's not strictly necessary that they all be of the same kind.) We'll write the parameter, or parameter vector, of the k-th component as \theta_k, so the model becomes

f(x) = \sum_{k=1}^{K} \lambda_k f(x; \theta_k)        (20.4)

The over-all parameter vector of the mixture model is thus \theta = (\lambda_1, \lambda_2, \ldots, \lambda_K, \theta_1, \theta_2, \ldots, \theta_K).

Let's consider two extremes. When K = 1, we have a simple parametric distribution, of the usual sort, and density estimation reduces to estimating the parameters, by maximum likelihood or whatever else we feel like. On the other hand, when K = n, the number of observations, we have gone back towards kernel density estimation.
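
Eq. 20.4 is straightforward to evaluate once the component family is fixed. A small sketch using scipy distribution objects so the same helper works for Gaussians, Poissons, and so on; the helper name and parameter values are mine.

```python
import numpy as np
from scipy import stats

def mixture_pdf(x, weights, params, family=stats.norm):
    """Evaluate f(x) = sum_k lambda_k f(x; theta_k), Eq. 20.4, for a common
    parametric family; each theta_k is a dict of that family's parameters."""
    x = np.asarray(x)
    return sum(w * family.pdf(x, **theta) for w, theta in zip(weights, params))

# A three-component Gaussian mixture; for a discrete family (e.g. stats.poisson) use .pmf instead.
grid = np.linspace(-5, 5, 11)
f = mixture_pdf(grid,
                weights=[0.3, 0.5, 0.2],
                params=[{"loc": -2, "scale": 0.5}, {"loc": 0, "scale": 1.0}, {"loc": 3, "scale": 0.7}])
```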

If K is fixed as n grows, we still have a parametric model, and avoid the curse of dimensionality, but a mixture of (say) ten Gaussians is more flexible than a single Gaussian, though it may still be the case that the true distribution just can't be written as a ten-Gaussian mixture. So we have our usual bias-variance or accuracy-precision trade-off: using many components in the mixture lets us fit many distributions very accurately, with low approximation error or bias, but means we have more parameters and so we can't fit any one of them as precisely, and there's more variance in our estimates.
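
One common way to navigate this trade-off in practice (not something the text has introduced at this point) is to fit mixtures over a range of K and compare them on held-out log-likelihood or an information criterion. A hedged sketch using scikit-learn's GaussianMixture, which fits each model by maximum likelihood; the toy data and the range of K are mine.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: two well-separated clusters in two dimensions
X = np.vstack([rng.normal(0, 1, size=(300, 2)), rng.normal(4, 1, size=(300, 2))])

for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # BIC rewards fit but penalizes the extra parameters that come with more components
    print(k, gm.bic(X))
```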

20.1.4 Geometry

In Chapter 18, we looked at principal components analysis, which finds linear structures of q dimensions (lines, planes, hyper-planes, ...) which are good approximations to our p-dimensional data, q \ll p. In Chapter 19, we looked at factor analysis, which imposes a statistical model for the distribution of the data around this q-dimensional plane (Gaussian noise), and a statistical model of the distribution of representative points on the plane (also Gaussian). This set-up is implied by the mythology of linear continuous latent variables, but can arise in other ways.

Now, we know from geometry that it takes q+1 points to define a q-dimensional plane, and that in general any q+1 points on the plane will do. This means that if we use a mixture model with q+1 components, we will also get data which clusters around a q-dimensional plane.

Furthermore, by adjusting the mean of each component, and their relative weights, we can make the global mean of the mixture whatever we like. And we can even match the covariance matrix of any q-factor model by using a mixture with q+1 components [2]. Now, this mixture distribution will hardly ever be exactly the same as the factor model's distribution (mixtures of Gaussians aren't Gaussian, and the mixture will usually, but not always, be multimodal while the factor distribution is always unimodal), but it will have the same geometry, the same mean and the same covariances, so we will have to look beyond those to tell them apart.
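
The mean and covariance being matched here follow from the standard law-of-total-expectation/variance formulas: the mixture's mean is the weighted average of the component means, and its covariance is the weighted average of the component covariances plus the between-component spread. A small sketch of those formulas (it only computes the moments of a given mixture; it does not construct the mixture matching a particular factor model):

```python
import numpy as np

def mixture_moments(weights, means, covs):
    """Global mean and covariance of a mixture:
    mu    = sum_k lambda_k mu_k
    Sigma = sum_k lambda_k (Sigma_k + (mu_k - mu)(mu_k - mu)^T)"""
    w = np.asarray(weights)[:, None]
    means = np.asarray(means)
    mu = (w * means).sum(axis=0)           # weighted average of component means
    d = means - mu                          # component means relative to the global mean
    Sigma = sum(wk * (Ck + np.outer(dk, dk)) for wk, Ck, dk in zip(weights, covs, d))
    return mu, Sigma

mu, Sigma = mixture_moments([0.4, 0.6],
                            means=[[0.0, 0.0], [2.0, 1.0]],
                            covs=[np.eye(2), 0.5 * np.eye(2)])
```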

Which, frankly, people hardly ever do.

20.1.5 Identifiability

Before we set about trying to estimate our probability models, we need to make sure that they are identifiable: that if we have distinct representations of the model, they make distinct observational claims. It is easy to let there be too many parameters, or the wrong choice of parameters, and lose identifiability. If there are distinct representations which are observationally equivalent, we either need to change our model, change our representation, or fix on a unique representation by some convention. With additive regression, E[Y \mid X = x] = \alpha + \sum_j f_j(x_j), we can add arbitrary constants so long as they cancel out.
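
The additive-regression example is easy to check numerically: shifting a constant c from the intercept into one of the partial functions changes the representation but not the regression function, so the two representations make the same observational claims. A tiny sketch; the particular functions and constant are illustrative.

```python
import numpy as np

def additive(x1, x2, alpha, f1, f2):
    # E[Y | X = x] = alpha + f_1(x_1) + f_2(x_2)
    return alpha + f1(x1) + f2(x2)

x1 = np.linspace(-2, 2, 50)
x2 = np.linspace(0, 4, 50)
c = 3.0

# Representation A
yA = additive(x1, x2, alpha=1.0, f1=np.sin, f2=np.sqrt)
# Representation B: add c to f_1 and subtract it from the intercept
yB = additive(x1, x2, alpha=1.0 - c, f1=lambda u: np.sin(u) + c, f2=np.sqrt)

print(np.allclose(yA, yB))  # True: distinct representations, identical predictions
```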

