Lecture 10: Expectation-Maximization Algorithm

ECE 645: Estimation Theory, Spring 2015
Instructor: Prof. Stanley H. Chan
Lecture 10: Expectation-Maximization Algorithm
(LaTeX prepared by Shaobo Fang)
May 4, 2015

This lecture note is based on ECE 645 (Spring 2015) by Prof. Stanley H. Chan in the School of Electrical and Computer Engineering at Purdue University.

1 Motivation

Consider a set of data points with their classes labeled, and assume that each class is a Gaussian as shown in Figure 1(a). Given this set of data points, finding the means of the two Gaussians can be done easily by estimating the sample mean, as the class labels are known. Now imagine that the classes are not labeled, as shown in Figure 1(b).

How should we determine the mean for each of the classes then? In order to solve this problem, we could use an iterative approach: first make a guess of the class label for each data point, then compute the means and update the guess of the class labels again. We repeat until the means converge. The problem of estimating parameters in the absence of labels is known as unsupervised learning. There are many unsupervised learning methods. We will focus on the Expectation-Maximization (EM) algorithm.

[Figure 1: (a) Class Labelled (Class 1, Class 2); (b) Class Unlabelled. Estimation of parameters becomes trivial given the labelled classes.]

2 The EM Algorithm

Notations

1. $Y$, random variable; $y$ = realization of $Y$.
2. $X$, complete data.
3. $Z$, missing data.

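Note that $X = (Y, Z)$.
4. $\theta$: unknown deterministic parameter. $\theta^{(t)}$: the $t$-th estimate of $\theta$ in the EM iteration.
5. $f(y \mid \theta)$ is the distribution of $Y$ given $\theta$.
6. $f(X \mid \theta)$ is a random variable taking the value $f(x \mid \theta)$ when $X = x$. (Remember: $f(\cdot \mid \theta)$ is a function, and thus we can put any argument into $f(\cdot \mid \theta)$ and evaluate its output.)
7. $\mathbb{E}_{X \mid y, \theta}[g(X)] = \int g(x)\, f_{X \mid y, \theta}(x \mid y, \theta)\, dx$ is the conditional expectation of $g(X)$ given $Y = y$ and $\theta$.
8. $\ell(\theta) = \log f(y \mid \theta)$ is the log-likelihood. Note that $\ell(\theta)$ depends on $y$.

EM Steps

The EM algorithm consists of two steps:

1. E-step: Given $y$, and pretending for the moment that $\theta^{(t)}$ is correct, formulate the distribution for the complete data $x$: $f(x \mid y, \theta^{(t)})$.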

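Then, we calculate the Q-function:

\[
Q(\theta \mid \theta^{(t)}) \overset{\text{def}}{=} \mathbb{E}_{X \mid y, \theta^{(t)}}[\log f(X \mid \theta)] = \int \log f(x \mid \theta)\, f(x \mid y, \theta^{(t)})\, dx.
\]

2. M-step: Maximize $Q(\theta \mid \theta^{(t)})$ with regard to $\theta$:

\[
\theta^{(t+1)} = \operatorname*{argmax}_{\theta}\; Q(\theta \mid \theta^{(t)}).
\]

Properties of $Q(\theta \mid \theta^{(t)})$

1. Ideally, if we have the distribution of the complete data $x$, then finding the parameter can be done by maximizing $f(x \mid \theta)$. However, the complete data is only a virtual thing we created to solve the problem. In reality we never know $x$. All we know is its distribution $f(x \mid \theta)$, which depends on what we know about $x$. So one way to handle this uncertainty is to compute the average.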

This average is the Q-function.

2. Another way of looking at $Q(\theta \mid \theta^{(t)})$: we can treat $\log f(X \mid \theta)$ as a function of two variables, $h(X, \theta)$. Maximizing over $\theta$ is problematic because it depends on $X$. So by taking the expectation $\mathbb{E}_X[h(X, \theta)]$ we can eliminate the dependency on $X$.

3. $Q(\theta \mid \theta^{(t)})$ can be thought of as a local approximation of the log-likelihood function $\ell(\theta)$. Here, by local we mean that $Q(\theta \mid \theta^{(t)})$ stays close to its previous estimate $\theta^{(t)}$. In fact, if $Q(\theta \mid \theta^{(t)}) \ge Q(\theta^{(t)} \mid \theta^{(t)})$, then $\ell(\theta) \ge \ell(\theta^{(t)})$.

3 Estimating Mean with Partial Observation

Let us consider the first example of the EM algorithm.

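Suppose that we generated a sequence of $n$ random variables $Y_i \sim \mathcal{N}(\theta, \sigma^2)$ for $i = 1, \ldots, n$. Imagine that we have only observed $Y = [Y_1, Y_2, \ldots, Y_m]$ where $m < n$. How should we estimate $\theta$ based on $Y$? Intuitively, the estimated $\theta$ should be the sample mean of the $m$ observations, $\hat{\theta} = \frac{1}{m}\sum_{i=1}^{m} Y_i$. However, in this example we would like to derive the EM algorithm and see if the EM algorithm would match with our intuition.

Solution: To start the EM algorithm, we first need to specify the missing data and the complete data. In this problem, the missing data is $Z = [Y_{m+1}, \ldots, Y_n]$, and the complete data is $X = [Y, Z]$. The log-density of the complete data $X$ is

\[
\log f(X \mid \theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{n} \frac{(Y_i - \theta)^2}{2\sigma^2}. \tag{1}
\]

Therefore, the Q-function is

\begin{align*}
Q(\theta \mid \theta^{(t)}) &\overset{\text{def}}{=} \mathbb{E}_{X \mid Y, \theta^{(t)}}[\log f(X \mid \theta)] \\
&= \mathbb{E}_{X \mid Y, \theta^{(t)}}\left[-\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(Y_i - \theta)^2}{2\sigma^2} - \sum_{i=m+1}^{n} \frac{(Y_i - \theta)^2}{2\sigma^2}\right] \\
&= -\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(y_i - \theta)^2}{2\sigma^2} - \sum_{i=m+1}^{n} \frac{\mathbb{E}_{X \mid Y, \theta^{(t)}}[(Y_i - \theta)^2]}{2\sigma^2}.
\end{align*}

The last expectation can be evaluated as

\begin{align*}
\mathbb{E}_{Y_i \mid Y, \theta^{(t)}}[(Y_i - \theta)^2] &= \mathbb{E}_{Y_i \mid Y, \theta^{(t)}}[Y_i^2 - 2 Y_i \theta + \theta^2] \\
&= \left[(\theta^{(t)})^2 + \sigma^2 - 2\theta^{(t)}\theta + \theta^2\right].
\end{align*}

Therefore, the Q-function is

\[
Q(\theta \mid \theta^{(t)}) = -\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(y_i - \theta)^2}{2\sigma^2} - \frac{n - m}{2\sigma^2}\left[(\theta^{(t)})^2 + \sigma^2 - 2\theta^{(t)}\theta + \theta^2\right].
\]

In the M-step, we need to maximize the Q-function.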
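As a worked intermediate step, assuming the Q-function exactly as derived above (with $\sigma^2$ treated as a known constant), differentiating with respect to $\theta$ gives:

```latex
\frac{\partial}{\partial\theta} Q(\theta \mid \theta^{(t)})
  = \frac{1}{\sigma^2}\left[\sum_{i=1}^{m} (y_i - \theta)
    + (n - m)\bigl(\theta^{(t)} - \theta\bigr)\right].
```

This expression is linear and strictly decreasing in $\theta$ (its slope is $-n/\sigma^2$), so its unique zero is the maximizer of $Q(\theta \mid \theta^{(t)})$.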

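To this end, we set

\[
\frac{\partial}{\partial\theta} Q(\theta \mid \theta^{(t)}) = 0,
\]

which yields

\[
\theta^{(t+1)} = \frac{\sum_{i=1}^{m} y_i + (n - m)\,\theta^{(t)}}{n}.
\]

It is not difficult to show that as $t \to \infty$, $\theta^{(t)} \to \theta^{(\infty)}$. Hence,

\[
\theta^{(\infty)} = \frac{\sum_{i=1}^{m} y_i}{n} + \left(1 - \frac{m}{n}\right)\theta^{(\infty)},
\]

which yields

\[
\theta^{(\infty)} = \frac{1}{m}\sum_{i=1}^{m} y_i.
\]

This result says that as the EM algorithm converges, the estimated parameter converges to the sample mean using the available $m$ samples, which is quite intuitive.

4 Gaussian Mixture With Known Mean And Variance

Our next example of the EM algorithm is to estimate the mixture weights of a Gaussian mixture with known mean and variance. A Gaussian mixture is defined as

\[
f(y \mid \theta) = \sum_{i=1}^{k} \theta_i\, \mathcal{N}(y \mid \mu_i, \sigma_i^2), \tag{2}
\]

where $\theta = [\theta_1, \ldots, \theta_k]$ is called the mixture weight. The mixture weight satisfies the condition that

\[
\sum_{i=1}^{k} \theta_i = 1.
\]

Our goal is to derive the EM algorithm for $\theta$.

Solution: We first need to define the missing data. For this problem, the observed data is $Y = [y_1, y_2, \ldots, y_n]$. The missing data can be defined as the label for each $y_j$, so that $Z = [Z_1, Z_2, \ldots, Z_n]$, with $Z_j \in \{1, \ldots, k\}$. Consequently, the complete data is $X = [X_1, X_2, \ldots, X_n]$, where $X_j = (y_j, Z_j)$.

The distribution of the complete data can be computed as

\[
f(x_j \mid \theta) = f(y_j, z_j \mid \theta) = \theta_{z_j}\, \mathcal{N}(y_j \mid \mu_{z_j}, \sigma_{z_j}^2).
\]

Thus, the Q-function is

\begin{align*}
Q(\theta \mid \theta^{(t)}) &= \mathbb{E}_{X \mid y, \theta^{(t)}}\{\log f(X \mid \theta)\} \\
&= \mathbb{E}_{Z \mid y, \theta^{(t)}}\{\log f(Z, y \mid \theta)\} \\
&= \mathbb{E}_{Z \mid y, \theta^{(t)}}\left\{\log \prod_{j=1}^{n} \theta_{z_j}\, \mathcal{N}(y_j \mid \mu_{z_j}, \sigma_{z_j}^2)\right\} \\
&= \sum_{j=1}^{n} \mathbb{E}_{Z_j \mid y_j, \theta^{(t)}}\left\{\log \theta_{z_j} + \log \mathcal{N}(y_j \mid \mu_{z_j}, \sigma_{z_j}^2)\right\}.
\end{align*}

The first expectation can be evaluated as

\begin{align*}
\mathbb{E}_{Z_j \mid y_j, \theta^{(t)}}\{\log \theta_{z_j}\} &= \sum_{z_j} \log \theta_{z_j}\, P(Z_j = z_j \mid y_j, \theta^{(t)}) \\
&= \sum_{i=1}^{k} \log \theta_i\, \underbrace{P(Z_j = i \mid y_j, \theta^{(t)})}_{\overset{\text{def}}{=}\, \gamma_{ij}^{(t)}}.
\end{align*}

Summing over all $j$, we can further define

\begin{align*}
\gamma_i^{(t)} &= \sum_{j=1}^{n} \gamma_{ij}^{(t)} = \sum_{j=1}^{n} P(Z_j = i \mid y_j, \theta^{(t)}) \\
&= \sum_{j=1}^{n} \frac{\theta_i^{(t)}\, \mathcal{N}(y_j \mid \mu_i, \sigma_i^2)}{\sum_{l=1}^{k} \theta_l^{(t)}\, \mathcal{N}(y_j \mid \mu_l, \sigma_l^2)}.
\end{align*}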

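Therefore, the Q-function becomes

\[
Q(\theta \mid \theta^{(t)}) = \sum_{j=1}^{n} \sum_{i=1}^{k} \gamma_{ij}^{(t)} \log \theta_i + C = \sum_{i=1}^{k} \gamma_i^{(t)} \log \theta_i + C,
\]

for some constant $C$ independent of $\theta$. Maximizing over $\theta$ yields

\[
\theta^{(t+1)} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{k} \gamma_i^{(t)} \log \theta_i,
\qquad \text{i.e.} \qquad
\theta_i^{(t+1)} = \frac{\gamma_i^{(t)}}{\sum_{l=1}^{k} \gamma_l^{(t)}},
\]

where the last equality is due to Gibbs inequality. To summarize, the EM algorithm is given in the algorithm below.

Algorithm: Gaussian Mixture with known mean and variance
Result: Estimated $\theta$
for $t = 1, 2, \ldots$ do
    $\gamma_i^{(t)} = \sum_{j=1}^{n} \dfrac{\theta_i^{(t)}\, \mathcal{N}(y_j \mid \mu_i, \sigma_i^2)}{\sum_{l=1}^{k} \theta_l^{(t)}\, \mathcal{N}(y_j \mid \mu_l, \sigma_l^2)}$
    $\theta_i^{(t+1)} = \dfrac{\gamma_i^{(t)}}{\sum_{l=1}^{k} \gamma_l^{(t)}}$
end

Remark: To solve $\operatorname*{argmax}_{\theta} \sum_{i=1}^{k} \gamma_i^{(t)} \log \theta_i$, we use the Gibbs inequality. Gibbs inequality states that for all $\alpha_i$ and $\beta_i$ such that $\sum_{i=1}^{n} \alpha_i = 1$, $\sum_{i=1}^{n} \beta_i = 1$, $0 \le \alpha_i \le 1$ and $0 \le \beta_i \le 1$, it holds that

\[
\sum_{i=1}^{n} \alpha_i \log \beta_i \le \sum_{i=1}^{n} \alpha_i \log \alpha_i, \tag{3}
\]

with equality when $\alpha_i = \beta_i$ for all $i$. Applying (3) with $\alpha_i = \gamma_i^{(t)} / \sum_{l=1}^{k} \gamma_l^{(t)}$ and $\beta_i = \theta_i$ shows that the objective is maximized at $\theta_i = \alpha_i$.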
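The algorithm above can be sketched numerically as follows. This is a minimal illustration rather than code from the lecture: the function names `gaussian_pdf` and `em_mixture_weights`, the uniform initial guess, the fixed iteration count, and the synthetic data in the usage section are all assumptions made for this example, and NumPy is assumed to be available.

```python
import numpy as np


def gaussian_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at y (broadcasts over arrays)."""
    return np.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)


def em_mixture_weights(y, mu, sigma2, num_iters=200):
    """EM for the mixture weights theta with known component means and variances.

    y      : observed data, shape (n,)
    mu     : known component means, shape (k,)
    sigma2 : known component variances, shape (k,)
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    k = mu.shape[0]

    # Assumed initial guess (not specified in the notes): uniform weights.
    theta = np.full(k, 1.0 / k)

    # N(y_j | mu_i, sigma_i^2) for every pair (j, i); shape (n, k).
    # These densities do not depend on theta, so compute them once.
    dens = gaussian_pdf(y[:, None], mu[None, :], sigma2[None, :])

    for _ in range(num_iters):
        # E-step: gamma_ij = P(Z_j = i | y_j, theta^(t)).
        weighted = dens * theta[None, :]
        gamma_ij = weighted / weighted.sum(axis=1, keepdims=True)
        # gamma_i = sum over data points j of gamma_ij.
        gamma_i = gamma_ij.sum(axis=0)
        # M-step: theta_i^(t+1) = gamma_i / sum_l gamma_l.
        theta = gamma_i / gamma_i.sum()
    return theta


if __name__ == "__main__":
    # Hypothetical synthetic data: two well-separated components.
    rng = np.random.default_rng(0)
    true_theta = np.array([0.3, 0.7])
    mu = np.array([-2.0, 3.0])
    sigma2 = np.array([1.0, 1.0])
    labels = rng.choice(2, size=5000, p=true_theta)
    y = rng.normal(mu[labels], np.sqrt(sigma2[labels]))
    print(em_mixture_weights(y, mu, sigma2))
```

Because the means and variances are known, the component densities are fixed across iterations and only the weights change, which is why `dens` is computed outside the loop. A practical implementation would likely replace the fixed iteration count with a convergence check on successive estimates.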

