The EM Algorithm


Ajit Singh
November 20, 2005

1 Introduction

Expectation-Maximization (EM) is a technique used in point estimation. Given a set of observable variables X and unknown (latent) variables Z, we want to estimate the parameters θ of a model.

Example 1 (Binomial Mixture Model). You have two coins with unknown probabilities of heads, denoted p and q respectively. The first coin is chosen with probability π and the second coin is chosen with probability 1 − π. The chosen coin is flipped once and the result is recorded; repeating this yields the observations x = {1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1} (Heads = 1, Tails = 0). Let Z_i ∈ {0, 1} denote which coin was used on each example; we add the latent variables Z_i for reasons that will become apparent. The parameters we want to estimate are θ = (p, q, π).

Two criteria for point estimation are maximum likelihood and maximum a posteriori:

    \theta_{ML}  = \arg\max_{\theta} \log p(x \mid \theta)
    \theta_{MAP} = \arg\max_{\theta} \log p(x, \theta) = \arg\max_{\theta} \left[ \log p(x \mid \theta) + \log p(\theta) \right]

Our presentation will focus on the maximum likelihood case (ML-EM); the maximum a posteriori case (MAP-EM) is very similar.[1]

[1] In MAP-EM the M-step is a MAP estimate instead of an ML estimate.
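To make the setup concrete, here is a minimal sketch (ours, not part of the original note) that encodes the twelve flips from Example 1 and evaluates the marginal log-likelihood log p(x | θ) of the binomial mixture at a candidate θ = (p, q, π). The function name `marginal_log_likelihood` and the chosen parameter values are assumptions for illustration.

```python
import numpy as np

# Observed flips from Example 1 (Heads = 1, Tails = 0).
x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1])

def marginal_log_likelihood(x, p, q, pi):
    """log p(x | theta) for the two-coin binomial mixture.

    Each flip comes from coin 1 (heads probability p) with probability pi,
    or from coin 2 (heads probability q) with probability 1 - pi, so
    p(x_i | theta) = pi * p^x_i (1-p)^(1-x_i) + (1-pi) * q^x_i (1-q)^(1-x_i).
    """
    lik_coin1 = pi * p**x * (1 - p)**(1 - x)
    lik_coin2 = (1 - pi) * q**x * (1 - q)**(1 - x)
    return np.sum(np.log(lik_coin1 + lik_coin2))

# Evaluate at an arbitrary guess for theta.
print(marginal_log_likelihood(x, p=0.6, q=0.3, pi=0.5))
```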

Notation

    X                  Observed variables
    Z                  Latent (unobserved) variables
    θ^(t)              The estimate of the parameters at iteration t
    ℓ(θ)               The marginal log-likelihood, log p(x | θ)
    log p(x, z | θ)    The complete log-likelihood, i.e. the log-likelihood when we also know the value of z
    q(z | x, θ)        Averaging distribution, a free distribution that EM gets to vary
    Q(θ | θ^(t))       The expected complete log-likelihood, Σ_z q(z | x, θ^(t)) log p(x, z | θ)
    H(q)               Entropy of the distribution q(z | x, θ)

2 Derivation

We could directly maximize ℓ(θ) = log Σ_z p(x, z | θ) using a gradient method (e.g., gradient ascent, conjugate gradient, quasi-Newton), but sometimes the gradient is hard to compute, hard to implement, or we do not want to bother adding in a black-box optimization routine. Consider the following inequality:

    \ell(\theta) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)                                                      (1)
                 = \log \sum_z q(z \mid x, \theta) \frac{p(x, z \mid \theta)}{q(z \mid x, \theta)}                              (2)
                 \geq \sum_z q(z \mid x, \theta) \log \frac{p(x, z \mid \theta)}{q(z \mid x, \theta)} \triangleq F(q, \theta)   (3)

where q(z | x, θ) is an arbitrary density over Z.
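As a quick numerical check of inequality (3), and not part of the original note, the sketch below evaluates the lower bound F(q, θ) on the coin data for one arbitrary choice of q (a fixed 0.7 probability of coin 1 for every flip) and compares it with ℓ(θ). The parameter values and variable names are assumptions for illustration.

```python
import numpy as np

x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1])  # flips from Example 1
p, q_coin, pi = 0.6, 0.3, 0.5                        # an arbitrary theta = (p, q, pi)

# Per-flip joint probabilities p(x_i, z_i | theta) for z_i = 1 (coin 1) and z_i = 0 (coin 2).
joint_z1 = pi * p**x * (1 - p)**(1 - x)
joint_z0 = (1 - pi) * q_coin**x * (1 - q_coin)**(1 - x)

# Marginal log-likelihood l(theta) = sum_i log p(x_i | theta).
ell = np.sum(np.log(joint_z1 + joint_z0))

# An arbitrary averaging distribution q: q(z_i = 1 | x_i) = 0.7 for every flip.
q1 = np.full_like(x, 0.7, dtype=float)
q0 = 1.0 - q1

# Lower bound F(q, theta) = sum_i sum_z q(z) log [ p(x_i, z | theta) / q(z) ].
F = np.sum(q1 * np.log(joint_z1 / q1) + q0 * np.log(joint_z0 / q0))

print(ell, F)  # F never exceeds ell, as inequality (3) requires
assert F <= ell
```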

This inequality is foundational to what are called variational methods in the machine learning literature.[2] Instead of maximizing ℓ(θ) directly, EM maximizes the lower bound F(q, θ) via coordinate ascent:

    \text{E-step:} \quad q^{(t+1)} = \arg\max_{q} F(q, \theta^{(t)})                 (4)
    \text{M-step:} \quad \theta^{(t+1)} = \arg\max_{\theta} F(q^{(t+1)}, \theta)     (5)

Starting with some initial value of the parameters θ^(0), one cycles between the E and M-steps until θ^(t) converges to a local maximum. Computing equation 4 directly involves fixing θ = θ^(t) and optimizing over the space of distributions, which looks painful. However, it is possible to show that q^(t+1) = p(z | x, θ^(t)). We can stop worrying about q as a variable over the space of distributions, since we know the optimal q is a distribution that depends on θ^(t). To compute equation 5 we fix q and note that

    \ell(\theta) \geq F(q, \theta)                                                                                                                (6)
                 = \sum_z q(z \mid x, \theta^{(t)}) \log \frac{p(x, z \mid \theta)}{q(z \mid x, \theta^{(t)})}                                    (7)
                 = \sum_z q(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta) - \sum_z q(z \mid x, \theta^{(t)}) \log q(z \mid x, \theta^{(t)})    (8)
                 = Q(\theta \mid \theta^{(t)}) + H(q)                                                                                             (9)

Since the entropy H(q) does not depend on θ, maximizing F(q, θ) is equivalent to maximizing the expected complete log-likelihood.

[2] If you feel compelled to tart it up, you can call equation 3 Gibbs' inequality and F(q, θ) the negative variational free energy.
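The claim that the E-step has the closed-form solution q^(t+1) = p(z | x, θ^(t)) can be checked numerically on the coin model: with q set to the exact posterior, the lower bound F(q, θ) touches ℓ(θ). This sketch is ours rather than the note's; the parameter values are arbitrary.

```python
import numpy as np

x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1])  # flips from Example 1
p, q_coin, pi = 0.6, 0.3, 0.5                        # current parameters theta^(t)

# Joint probabilities p(x_i, z_i | theta) for z_i = 1 and z_i = 0, and the marginal p(x_i | theta).
joint_z1 = pi * p**x * (1 - p)**(1 - x)
joint_z0 = (1 - pi) * q_coin**x * (1 - q_coin)**(1 - x)
marginal = joint_z1 + joint_z0

# E-step choice: q(z_i = 1 | x_i) = p(z_i = 1 | x_i, theta), the exact posterior.
post1 = joint_z1 / marginal
post0 = 1.0 - post1

ell = np.sum(np.log(marginal))                            # l(theta)
F_at_posterior = np.sum(post1 * np.log(joint_z1 / post1)
                        + post0 * np.log(joint_z0 / post0))

# With q equal to the posterior, the bound is tight: F(q, theta) = l(theta).
print(ell, F_at_posterior)
assert np.isclose(ell, F_at_posterior)
```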

Obscuring these details, which explain what EM is actually doing, we can re-express equations 4 and 5 as

    \text{E-step: compute } Q(\theta \mid \theta^{(t)}) = E_{p(z \mid x, \theta^{(t)})}[\log p(x, z \mid \theta)]    (10)
    \text{M-step: } \theta^{(t+1)} = \arg\max_{\theta} E_{p(z \mid x, \theta^{(t)})}[\log p(x, z \mid \theta)]       (11)

3 Limitations of EM

EM is useful for several reasons: conceptual simplicity, ease of implementation, and the fact that each iteration improves ℓ(θ). The rate of convergence on the first few steps is typically quite good, but can become excruciatingly slow as you approach a local optimum. Generally, EM works best when the fraction of missing information is small[3] and the dimensionality of the data is not too large. EM can require many iterations, and higher dimensionality can dramatically slow down each iteration.

[3] The statement "the fraction of missing information is small" can be quantified using Fisher information.

4 Using the EM algorithm

Applying EM to Example 1, we start by writing down the expected complete log-likelihood

    Q(\theta \mid \theta^{(t)}) = E\left[ \log \prod_{i=1}^{n} \left[ \pi p^{x_i} (1 - p)^{1 - x_i} \right]^{z_i} \left[ (1 - \pi) q^{x_i} (1 - q)^{1 - x_i} \right]^{1 - z_i} \right]
                                = \sum_{i=1}^{n} E[z_i \mid x_i, \theta^{(t)}] \left[ \log \pi + x_i \log p + (1 - x_i) \log (1 - p) \right]
                                  + \left( 1 - E[z_i \mid x_i, \theta^{(t)}] \right) \left[ \log (1 - \pi) + x_i \log q + (1 - x_i) \log (1 - q) \right]

Next we compute E[z_i | x_i, θ^(t)]:

    \mu_i^{(t)} \triangleq E[z_i \mid x_i, \theta^{(t)}] = p(z_i = 1 \mid x_i, \theta^{(t)})
                = \frac{p(x_i \mid z_i = 1, \theta^{(t)}) \, p(z_i = 1 \mid \theta^{(t)})}{p(x_i \mid \theta^{(t)})}
                = \frac{\pi^{(t)} [p^{(t)}]^{x_i} [1 - p^{(t)}]^{1 - x_i}}{\pi^{(t)} [p^{(t)}]^{x_i} [1 - p^{(t)}]^{1 - x_i} + (1 - \pi^{(t)}) [q^{(t)}]^{x_i} [1 - q^{(t)}]^{1 - x_i}}

Maximizing Q(θ | θ^(t)) with respect to θ yields the update equations

    \frac{\partial Q(\theta \mid \theta^{(t)})}{\partial \pi} = 0 \;\Rightarrow\; \pi^{(t+1)} = \frac{1}{n} \sum_i \mu_i^{(t)}
    \frac{\partial Q(\theta \mid \theta^{(t)})}{\partial p} = 0 \;\Rightarrow\; p^{(t+1)} = \frac{\sum_i \mu_i^{(t)} x_i}{\sum_i \mu_i^{(t)}}
    \frac{\partial Q(\theta \mid \theta^{(t)})}{\partial q} = 0 \;\Rightarrow\; q^{(t+1)} = \frac{\sum_i (1 - \mu_i^{(t)}) x_i}{\sum_i (1 - \mu_i^{(t)})}
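The update equations translate directly into code. The sketch below is ours rather than the note's; it runs EM on the twelve flips of Example 1, and the starting values θ^(0) = (0.6, 0.3, 0.5), the iteration cap, and the convergence tolerance are all arbitrary choices.

```python
import numpy as np

x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1])  # flips from Example 1
n = len(x)

p, q_coin, pi = 0.6, 0.3, 0.5                        # arbitrary starting values theta^(0)

for t in range(200):
    # E-step: responsibilities mu_i = p(z_i = 1 | x_i, theta^(t)).
    num = pi * p**x * (1 - p)**(1 - x)
    den = num + (1 - pi) * q_coin**x * (1 - q_coin)**(1 - x)
    mu = num / den

    # M-step: the update equations obtained by maximizing Q(theta | theta^(t)).
    pi_new = np.sum(mu) / n                          # pi^(t+1)
    p_new = np.sum(mu * x) / np.sum(mu)              # p^(t+1)
    q_new = np.sum((1 - mu) * x) / np.sum(1 - mu)    # q^(t+1)

    converged = np.allclose([pi, p, q_coin], [pi_new, p_new, q_new], atol=1e-10)
    pi, p, q_coin = pi_new, p_new, q_new
    if converged:
        break

print(f"p = {p:.4f}, q = {q_coin:.4f}, pi = {pi:.4f}")
```

One caveat worth noting: the sketch deliberately starts with p ≠ q. If the two coins start with identical parameters, every responsibility equals π and the updates never separate them, so initialization matters for this model.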

5 Constrained Optimization

Sometimes the M-step is a constrained maximization, which means that there are constraints on valid solutions that are not encoded in the function itself. An example of a constrained optimization is to maximize the entropy

    H(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i    (12)

such that

    \sum_{i=1}^{n} p_i = 1    (13)

Such problems can be solved using the method of Lagrange multipliers. To maximize a function f(p_1, ..., p_n) on the open set p = (p_1, ..., p_n) ∈ R^n subject to the constraint g(p) = 0, it suffices to maximize the unconstrained function

    \Lambda(p, \lambda) = f(p) - \lambda g(p)

To solve equation 12 we encode the constraint as g(p) = Σ_i p_i − 1 and maximize

    \Lambda(p, \lambda) = -\sum_{i=1}^{n} p_i \log_2 p_i - \lambda \left( \sum_{i=1}^{n} p_i - 1 \right)

in the usual unconstrained manner, by solving the system of equations

    \frac{\partial \Lambda(p, \lambda)}{\partial p_i} = 0, \qquad \frac{\partial \Lambda(p, \lambda)}{\partial \lambda} = 0

which leads to the uniform solution p_i = 1/n.
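As a numerical companion to this section (ours, not the note's), the sketch below solves the same maximum-entropy problem with an off-the-shelf constrained optimizer, SciPy's SLSQP method, and recovers the uniform solution p_i = 1/n. The choice of n = 4, the starting point, and the tolerances are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

n = 4  # dimension of the probability vector (p_1, ..., p_n)

def neg_entropy(p):
    # Minimizing -H(p) is the same as maximizing H(p) = -sum_i p_i log2 p_i.
    return np.sum(p * np.log2(p))

# Equality constraint g(p) = sum_i p_i - 1 = 0, with bounds keeping each p_i in (0, 1].
constraint = {"type": "eq", "fun": lambda p: np.sum(p) - 1.0}
bounds = [(1e-9, 1.0)] * n

result = minimize(neg_entropy, x0=np.array([0.7, 0.1, 0.1, 0.1]),
                  method="SLSQP", bounds=bounds, constraints=[constraint])

print(result.x)  # numerically close to the uniform distribution (0.25, 0.25, 0.25, 0.25)
assert np.allclose(result.x, np.full(n, 1.0 / n), atol=1e-4)
```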

Acknowledgements

The idea of EM as coordinate ascent was first presented in "A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants" by Neal and Hinton. This presentation is also indebted to an unpublished manuscript by M. I. Jordan and C. Bishop.

