The EM Algorithm for Gaussian Mixtures

Probabilistic Learning: Theory and Algorithms, CS 274A

Finite Mixture Models

We are given a data set D = {x_1, ..., x_N} where x_i is a d-dimensional vector measurement. Assume that the points are generated in an IID fashion from an underlying density p(x). We further assume that p(x) is defined as a finite mixture model with K components:

\[
p(x \mid \Theta) = \sum_{k=1}^{K} \alpha_k \, p_k(x \mid z_k, \theta_k)
\]

where:

- The p_k(x | z_k, θ_k) are mixture components, 1 ≤ k ≤ K. Each is a density or distribution defined over p(x), with parameters θ_k.
- z = (z_1, ..., z_K) is a vector of K binary indicator variables that are mutually exclusive and exhaustive (i.e., one and only one of the z_k's is equal to 1, and the others are 0). z is a K-ary random variable representing the identity of the mixture component that generated x. It is convenient for mixture models to represent z as a vector of K indicator variables.
- The α_k = p(z_k) are the mixture weights, representing the probability that a randomly selected x was generated by component k, where ∑_{k=1}^{K} α_k = 1.

The complete set of parameters for a mixture model with K components is Θ = {α_1, ..., α_K, θ_1, ..., θ_K}.
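
To make the definition concrete, here is a minimal Python sketch (not part of the notes) of evaluating a finite mixture density as a weighted sum of component densities; the function names and the two-component one-dimensional example values are purely illustrative assumptions:

```python
import numpy as np

def mixture_density(x, alphas, component_pdfs):
    """Evaluate p(x | Theta) = sum_k alpha_k * p_k(x) for a finite mixture.

    alphas         : length-K sequence of mixture weights (assumed to sum to 1)
    component_pdfs : length-K sequence of callables, each returning p_k(x)
    """
    return sum(a * pdf(x) for a, pdf in zip(alphas, component_pdfs))

# Toy usage: a 1-D mixture of two Gaussian densities (illustrative values only).
def gaussian_1d(mu, var):
    return lambda x: np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

alphas = [0.3, 0.7]
pdfs = [gaussian_1d(0.0, 1.0), gaussian_1d(3.0, 0.5)]
print(mixture_density(1.0, alphas, pdfs))   # p(x = 1.0 | Theta)
```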

Membership Weights

We can compute the membership weight of data point x_i in cluster k, given parameters Θ, as

\[
w_{ik} = p(z_{ik} = 1 \mid x_i, \Theta) = \frac{p_k(x_i \mid z_k, \theta_k)\,\alpha_k}{\sum_{m=1}^{K} p_m(x_i \mid z_m, \theta_m)\,\alpha_m}, \qquad 1 \le k \le K, \; 1 \le i \le N.
\]

This follows from a direct application of Bayes' rule. The membership weights above reflect our uncertainty, given x_i and Θ, about which of the K components generated the vector x_i. Note that we are assuming in our generative mixture model that each x_i was generated by a single component, so these probabilities reflect our uncertainty about which component x_i came from, not any mixing in the generative process.

Gaussian Mixture Models

For x ∈ R^d we can define a Gaussian mixture model by making each of the K components a Gaussian density with parameters μ_k and Σ_k. Each component is a multivariate Gaussian density

\[
p_k(x \mid \theta_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^t \Sigma_k^{-1} (x - \mu_k)}
\]

with its own parameters θ_k = {μ_k, Σ_k}.
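
The two definitions above translate directly into a short NumPy sketch: one helper evaluates the multivariate Gaussian density, the other builds the N × K membership-weight matrix via Bayes' rule. The function names are assumptions of this sketch, and a production implementation would normally work with Cholesky factors and log-probabilities for numerical stability rather than the explicit inverse and determinant used here:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate Gaussian density evaluated at each row of X (shape N x d)."""
    d = X.shape[1]
    diff = X - mu                                              # (N, d)
    Sigma_inv = np.linalg.inv(Sigma)
    mahal = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)    # (x - mu)^t Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * mahal) / norm

def membership_weights(X, alphas, mus, Sigmas):
    """Return the N x K matrix w_ik = alpha_k p_k(x_i) / sum_m alpha_m p_m(x_i)."""
    N, K = X.shape[0], len(alphas)
    W = np.empty((N, K))
    for k in range(K):
        W[:, k] = alphas[k] * gaussian_pdf(X, mus[k], Sigmas[k])
    W /= W.sum(axis=1, keepdims=True)                          # each row now sums to 1
    return W
```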

The EM Algorithm for Gaussian Mixture Models

We define the EM (Expectation-Maximization) algorithm for Gaussian mixtures as follows. The algorithm is an iterative algorithm that starts from some initial estimate of Θ (e.g., random), and then proceeds to iteratively update Θ until convergence is detected. Each iteration consists of an E-step and an M-step.

E-step: Denote the current parameter values as Θ. Compute w_ik (using the equation above for membership weights) for all data points x_i, 1 ≤ i ≤ N, and all mixture components 1 ≤ k ≤ K. Note that for each data point x_i the membership weights are defined such that ∑_{k=1}^{K} w_ik = 1. This yields an N × K matrix of membership weights, where each of the rows sums to 1.

M-step: Now use the membership weights and the data to calculate new parameter values. Let N_k = ∑_{i=1}^{N} w_ik, i.e., the sum of the membership weights for the kth component; this is the effective number of data points assigned to component k. Specifically,

\[
\alpha_k^{\text{new}} = \frac{N_k}{N}, \qquad 1 \le k \le K.
\]

These are the new mixture weights.
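
Assuming W is the N × K membership-weight matrix produced by the E-step (for example, by the membership_weights helper sketched above), the bookkeeping for N_k and the new mixture weights is only a couple of lines; the function name below is an illustrative choice:

```python
import numpy as np

def new_mixture_weights(W):
    """Given the N x K weight matrix W, return (N_k, alpha_k_new) as in the M-step above."""
    N = W.shape[0]
    N_k = W.sum(axis=0)        # effective number of points assigned to each component
    return N_k, N_k / N        # new mixture weights alpha_k^new = N_k / N
```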

The updated means are

\[
\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{i=1}^{N} w_{ik} \, x_i, \qquad 1 \le k \le K.
\]

The updated mean is calculated in a manner similar to how we would compute a standard empirical average, except that the ith data vector x_i has a fractional weight w_ik. Note that this is a vector equation since μ_k^new and x_i are both d-dimensional vectors.

The updated covariance matrices are

\[
\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{i=1}^{N} w_{ik} \, (x_i - \mu_k^{\text{new}})(x_i - \mu_k^{\text{new}})^t, \qquad 1 \le k \le K.
\]

Again we get an equation that is similar in form to how we would normally compute an empirical covariance matrix, except that the contribution of each data point is weighted by w_ik. Note that this is a matrix equation of dimensionality d × d.

The equations in the M-step need to be computed in this order, i.e., first compute the K new mixture weights α_k, then the K new means μ_k, and finally the K new covariance matrices Σ_k. Once we have computed all of the new parameters, the M-step is complete, and we can now go back and recompute the membership weights in the E-step, then recompute the parameters again in the M-step, and continue updating the parameters in this manner.
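
The complete M-step is compact in NumPy. The sketch below follows the three update equations in the order just described (mixture weights, then means, then covariances); the variable and function names are assumptions of this sketch rather than notation from the notes:

```python
import numpy as np

def m_step(X, W):
    """One M-step for a Gaussian mixture: X is the N x d data matrix, W the N x K weights."""
    N, d = X.shape
    K = W.shape[1]
    N_k = W.sum(axis=0)                       # effective counts, shape (K,)
    alphas = N_k / N                          # new mixture weights
    mus = (W.T @ X) / N_k[:, None]            # new means, shape (K, d)
    Sigmas = np.empty((K, d, d))
    for k in range(K):
        diff = X - mus[k]                     # (N, d)
        # weighted outer products: sum_i w_ik (x_i - mu_k)(x_i - mu_k)^t / N_k
        Sigmas[k] = (W[:, k, None] * diff).T @ diff / N_k[k]
    return alphas, mus, Sigmas
```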

Each pair of E and M steps is considered to be one iteration.

Initialization and Convergence Issues for EM

The EM algorithm can be started either by initializing the algorithm with a set of initial parameters and then conducting an E-step, or by starting with a set of initial weights and then doing a first M-step. The initial parameters or weights can be chosen randomly (e.g., select K random data points as initial means and use the covariance matrix of the whole data set for each of the initial K covariance matrices) or could be chosen via some heuristic method (such as by using the k-means algorithm to cluster the data first and then defining weights based on k-means memberships).

Convergence is generally detected by computing the value of the log-likelihood after each iteration and halting when it appears not to be changing in a significant manner from one iteration to the next. Note that the log-likelihood (under the IID assumption) is defined as follows:

\[
\log l(\Theta) = \sum_{i=1}^{N} \log p(x_i \mid \Theta) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \alpha_k \, p_k(x_i \mid z_k, \theta_k) \right)
\]

where p_k(x_i | z_k, θ_k) is the Gaussian density for the kth mixture component.
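
Putting the pieces together, a rough end-to-end sketch of the loop might look as follows. It reuses the hypothetical gaussian_pdf, membership_weights, and m_step helpers from the earlier sketches, initializes with K random data points as means and the data covariance for every component (one of the options suggested above), and halts when the log-likelihood improvement drops below an arbitrary tolerance:

```python
import numpy as np
# Assumes gaussian_pdf, membership_weights, and m_step from the sketches above are in scope.

def log_likelihood(X, alphas, mus, Sigmas):
    """log l(Theta) = sum_i log sum_k alpha_k p_k(x_i | theta_k)."""
    dens = np.column_stack([alphas[k] * gaussian_pdf(X, mus[k], Sigmas[k])
                            for k in range(len(alphas))])
    return np.sum(np.log(dens.sum(axis=1)))

def em_gmm(X, K, max_iter=200, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mus = X[rng.choice(N, size=K, replace=False)]                        # K random points as means
    Sigmas = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d)] * K)  # data covariance for each
    alphas = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        W = membership_weights(X, alphas, mus, Sigmas)            # E-step
        alphas, mus, Sigmas = m_step(X, W)                        # M-step
        ll = log_likelihood(X, alphas, mus, Sigmas)
        if ll - prev_ll < tol:                                    # convergence check
            break
        prev_ll = ll
    return alphas, mus, Sigmas, ll
```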

