CS229 Lecture notes
Andrew Ng

Part IV
Generative Learning algorithms

So far, we've mainly been talking about learning algorithms that model p(y|x; θ), the conditional distribution of y given x. For instance, logistic regression modeled p(y|x; θ) as h_θ(x) = g(θᵀx), where g is the sigmoid function. In these notes, we'll talk about a different type of learning algorithm.
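
As a point of reference for the contrast drawn below, here is a minimal NumPy sketch of that discriminative hypothesis (the names sigmoid, h_theta, and the example values are illustrative, not from the notes):

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def h_theta(theta, x):
        # Logistic regression's hypothesis h_theta(x) = g(theta^T x),
        # read as an estimate of p(y = 1 | x; theta).
        return sigmoid(theta @ x)

    print(h_theta(np.array([0.5, -1.0]), np.array([2.0, 1.0])))  # prints 0.5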

Consider a classification problem in which we want to learn to distinguish between elephants (y = 1) and dogs (y = 0), based on some features of an animal. Given a training set, an algorithm like logistic regression or the perceptron algorithm (basically) tries to find a straight line (that is, a decision boundary) that separates the elephants and dogs. Then, to classify a new animal as either an elephant or a dog, it checks on which side of the decision boundary it falls, and makes its prediction accordingly.

Here's a different approach. First, looking at elephants, we can build a model of what elephants look like. Then, looking at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we can match the new animal against the elephant model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in the training set.

Algorithms that try to learn p(y|x) directly (such as logistic regression), or algorithms that try to learn mappings directly from the space of inputs X to the labels {0, 1} (such as the perceptron algorithm), are called discriminative learning algorithms. Here, we'll talk about algorithms that instead try to model p(x|y) (and p(y)). These algorithms are called generative learning algorithms. For instance, if y indicates whether an example is a dog (0) or an elephant (1), then p(x|y = 0) models the distribution of dogs' features, and p(x|y = 1) models the distribution of elephants' features.

After modeling p(y) (called the class priors) and p(x|y), our algorithm can use Bayes rule to derive the posterior distribution on y given x:

    p(y|x) = p(x|y) p(y) / p(x).

Here, the denominator is given by p(x) = p(x|y = 1) p(y = 1) + p(x|y = 0) p(y = 0) (you should be able to verify that this is true from the standard properties of probabilities), and thus it can also be expressed in terms of the quantities p(x|y) and p(y) that we've learned. Actually, if we're calculating p(y|x) in order to make a prediction, then we don't actually need to calculate the denominator, since

    arg max_y p(y|x) = arg max_y p(x|y) p(y) / p(x) = arg max_y p(x|y) p(y).
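
As a concrete sketch of this prediction rule, the snippet below scores each class by p(x|y) p(y) and returns the arg max; the priors and density functions passed in are illustrative placeholders (any generative model, such as the GDA model below, could supply them):

    def predict(x, priors, class_densities):
        # priors: dict mapping each label y to its class prior p(y)
        # class_densities: dict mapping each label y to a function x -> p(x | y)
        # The denominator p(x) is the same for every y, so it is dropped.
        scores = {y: class_densities[y](x) * priors[y] for y in priors}
        return max(scores, key=scores.get)

    # Example with made-up one-dimensional densities; prints 1:
    print(predict(0.7, {0: 0.5, 1: 0.5},
                  {0: lambda x: 1.0 if 0 <= x <= 1 else 0.0,      # uniform density on [0, 1]
                   1: lambda x: 2 * x if 0 <= x <= 1 else 0.0}))  # triangular density on [0, 1]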

1 Gaussian discriminant analysis

The first generative learning algorithm that we'll look at is Gaussian discriminant analysis (GDA). In this model, we'll assume that p(x|y) is distributed according to a multivariate normal distribution. Let's talk briefly about the properties of multivariate normal distributions before moving on to the GDA model itself.

1.1 The multivariate normal distribution

The multivariate normal distribution in n dimensions, also called the multivariate Gaussian distribution, is parameterized by a mean vector μ ∈ R^n and a covariance matrix Σ ∈ R^(n×n), where Σ ≥ 0 is symmetric and positive semi-definite. Also written "N(μ, Σ)", its density is given by:

    p(x; μ, Σ) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ)).

In the equation above, |Σ| denotes the determinant of the matrix Σ.

For a random variable X distributed N(μ, Σ), the mean is (unsurprisingly) given by μ:

    E[X] = ∫_x x p(x; μ, Σ) dx = μ

The covariance of a vector-valued random variable Z is defined as Cov(Z) = E[(Z − E[Z])(Z − E[Z])ᵀ]. This generalizes the notion of the variance of a real-valued random variable. The covariance can also be defined as Cov(Z) = E[ZZᵀ] − (E[Z])(E[Z])ᵀ. (You should be able to prove to yourself that these two definitions are equivalent.) If X ∼ N(μ, Σ), then Cov(X) = Σ.
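
As a sketch, the density above can be evaluated directly in NumPy (the helper name gaussian_density and the example values are ours; scipy.stats.multivariate_normal.pdf computes the same quantity):

    import numpy as np

    def gaussian_density(x, mu, Sigma):
        # p(x; mu, Sigma) = 1 / ((2*pi)^(n/2) |Sigma|^(1/2))
        #                     * exp(-1/2 (x - mu)^T Sigma^{-1} (x - mu))
        n = mu.shape[0]
        d = x - mu
        norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
        quad = d @ np.linalg.solve(Sigma, d)  # (x - mu)^T Sigma^{-1} (x - mu)
        return np.exp(-0.5 * quad) / norm_const

    # The 2-dimensional standard normal N(0, I), evaluated at the origin:
    print(gaussian_density(np.zeros(2), np.zeros(2), np.eye(2)))  # 1/(2*pi) ≈ 0.159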

Here are some examples of what the density of a Gaussian distribution looks like:

    [Figure: three surface plots of two-dimensional Gaussian densities]

The left-most figure shows a Gaussian with mean zero (that is, the 2x1 zero-vector) and covariance matrix Σ = I (the 2x2 identity matrix). A Gaussian with zero mean and identity covariance is also called the standard normal distribution. The middle figure shows the density of a Gaussian with zero mean and Σ = 0.6 I; and the rightmost figure shows one with Σ = 2 I. We see that as Σ becomes larger, the Gaussian becomes more "spread-out," and as it becomes smaller, the distribution becomes more "compressed."

Let's look at some more examples.

    [Figure: three more surface plots of two-dimensional Gaussian densities]

The figures above show Gaussians with mean 0, and with covariance matrices respectively

    Σ = [1 0; 0 1];   Σ = [1 0.5; 0.5 1];   Σ = [1 0.8; 0.8 1].

The leftmost figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in Σ, the density becomes more "compressed" towards the 45° line (given by x1 = x2). We can see this more clearly when we look at the contours of the same three densities:

    [Figure: contour plots of the same three densities]

Here's one last set of examples generated by varying Σ:

    [Figure: contour plots of three more Gaussian densities]

The plots above used, respectively,

    Σ = [1 −0.5; −0.5 1];   Σ = [1 −0.8; −0.8 1];   Σ = [3 0.8; 0.8 1].

From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density now becomes "compressed" again, but in the opposite direction. Lastly, as we vary the parameters, more generally the contours will form ellipses (the rightmost figure showing an example).

As our last set of examples, fixing Σ = I, by varying μ, we can also move the mean of the density around.

    [Figure: three density plots with Σ = I and different means]

The figures above were generated using Σ = I, and respectively

    μ = [1; 0];   μ = [−0.5; 0];   μ = [−1; −1.5].

1.2 The Gaussian Discriminant Analysis model

When we have a classification problem in which the input features x are continuous-valued random variables, we can then use the Gaussian Discriminant Analysis (GDA) model, which models p(x|y) using a multivariate normal distribution.

The model is:

    y ∼ Bernoulli(φ)
    x | y = 0 ∼ N(μ0, Σ)
    x | y = 1 ∼ N(μ1, Σ)

Writing out the distributions, this is:

    p(y) = φ^y (1 − φ)^(1−y)

    p(x | y = 0) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp(−(1/2) (x − μ0)ᵀ Σ⁻¹ (x − μ0))

    p(x | y = 1) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp(−(1/2) (x − μ1)ᵀ Σ⁻¹ (x − μ1))

Here, the parameters of our model are φ, Σ, μ0 and μ1. (Note that while there are two different mean vectors μ0 and μ1, this model is usually applied using only one covariance matrix Σ.) The log-likelihood of the data is given by

    ℓ(φ, μ0, μ1, Σ) = log ∏_{i=1}^m p(x^(i), y^(i); φ, μ0, μ1, Σ)
                    = log ∏_{i=1}^m p(x^(i) | y^(i); μ0, μ1, Σ) p(y^(i); φ).

By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimates of the parameters (see problem set 1) to be:

    φ = (1/m) ∑_{i=1}^m 1{y^(i) = 1}

    μ0 = ∑_{i=1}^m 1{y^(i) = 0} x^(i) / ∑_{i=1}^m 1{y^(i) = 0}

    μ1 = ∑_{i=1}^m 1{y^(i) = 1} x^(i) / ∑_{i=1}^m 1{y^(i) = 1}

    Σ = (1/m) ∑_{i=1}^m (x^(i) − μ_{y^(i)}) (x^(i) − μ_{y^(i)})ᵀ.
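
These closed-form estimates translate directly into code. Below is a small NumPy sketch of the fitting step (the function name fit_gda and the synthetic data are illustrative assumptions, not from the notes):

    import numpy as np

    def fit_gda(X, y):
        # X: (m, n) matrix of inputs; y: length-m array of 0/1 labels.
        # Returns the maximum likelihood estimates phi, mu0, mu1, Sigma.
        m, n = X.shape
        phi = np.mean(y == 1)                        # fraction of positive examples
        mu0 = X[y == 0].mean(axis=0)                 # mean of the y = 0 examples
        mu1 = X[y == 1].mean(axis=0)                 # mean of the y = 1 examples
        mus = np.where((y == 1)[:, None], mu1, mu0)  # mu_{y^(i)} for each example
        diffs = X - mus
        Sigma = diffs.T @ diffs / m                  # shared covariance estimate
        return phi, mu0, mu1, Sigma

    # Example on synthetic data drawn from the model itself:
    rng = np.random.default_rng(0)
    X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), size=100),
                   rng.multivariate_normal([2, 2], np.eye(2), size=100)])
    y = np.r_[np.zeros(100), np.ones(100)]
    phi, mu0, mu1, Sigma = fit_gda(X, y)
    print(phi, mu0, mu1)  # phi = 0.5, mu0 ≈ [0, 0], mu1 ≈ [2, 2]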

Pictorially, what the algorithm is doing can be seen as follows:

    [Figure: training set with the contours of the two fitted Gaussians and the decision boundary]

Shown in the figure are the training set, as well as the contours of the two Gaussian distributions that have been fit to the data in each of the two classes. Note that the two Gaussians have contours that are the same shape and orientation, since they share a covariance matrix Σ, but they have different means μ0 and μ1. Also shown in the figure is the straight line giving the decision boundary at which p(y = 1|x) = 0.5. On one side of the boundary we'll predict y = 1 to be the most likely outcome, and on the other side we'll predict y = 0.

1.3 Discussion: GDA and logistic regression

The GDA model has an interesting relationship to logistic regression. If we view the quantity p(y = 1 | x; φ, μ0, μ1, Σ) as a function of x, we'll find that it can be expressed in the form

    p(y = 1 | x; φ, Σ, μ0, μ1) = 1 / (1 + exp(−θᵀx)),

where θ is some appropriate function of φ, Σ, μ0, μ1. This is exactly the form that logistic regression, a discriminative algorithm, used to model p(y = 1|x).
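
As a sketch of this correspondence, the snippet below computes p(y = 1 | x) from GDA parameters in two ways: directly via Bayes rule, and via the equivalent logistic form. (The notes fold the intercept into θ by appending a constant coordinate x0 = 1; the sketch keeps an explicit intercept b instead, and the function names and example values are ours.)

    import numpy as np

    def gda_posterior(x, phi, mu0, mu1, Sigma):
        # p(y = 1 | x) computed directly from Bayes rule.
        def density(v, mu):
            n = mu.shape[0]
            d = v - mu
            return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
                   ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
        p1 = density(x, mu1) * phi
        p0 = density(x, mu0) * (1 - phi)
        return p1 / (p0 + p1)

    def gda_logistic_params(phi, mu0, mu1, Sigma):
        # Expanding the two exponents and cancelling the common x^T Sigma^{-1} x term
        # leaves a linear function of x, so p(y = 1 | x) = 1 / (1 + exp(-(w^T x + b))) with:
        #   w = Sigma^{-1} (mu1 - mu0)
        #   b = -(1/2) mu1^T Sigma^{-1} mu1 + (1/2) mu0^T Sigma^{-1} mu0 + log(phi / (1 - phi))
        Sinv = np.linalg.inv(Sigma)
        w = Sinv @ (mu1 - mu0)
        b = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0 + np.log(phi / (1 - phi))
        return w, b

    # The two expressions agree at an arbitrary point:
    phi, mu0, mu1, Sigma = 0.5, np.zeros(2), np.ones(2), np.eye(2)
    x = np.array([0.3, -0.2])
    w, b = gda_logistic_params(phi, mu0, mu1, Sigma)
    print(gda_posterior(x, phi, mu0, mu1, Sigma), 1 / (1 + np.exp(-(w @ x + b))))  # both ≈ 0.289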

When would we prefer one model over another? GDA and logistic regression will, in general, give different decision boundaries when trained on the same dataset. Which is better?

We just argued that if p(x|y) is multivariate Gaussian (with shared Σ), then p(y|x) necessarily follows a logistic function. The converse, however, is not true; i.e., p(y|x) being a logistic function does not imply p(x|y) is multivariate Gaussian. This shows that GDA makes stronger modeling assumptions about the data than does logistic regression. It turns out that when these modeling assumptions are correct, then GDA will find better fits to the data, and is a better model. Specifically, when p(x|y) is indeed Gaussian (with shared Σ), then GDA is asymptotically efficient. Informally, this means that in the limit of very large training sets (large m), there is no algorithm that is strictly better than GDA (in terms of, say, how accurately they estimate p(y|x)).

In particular, it can be shown that in this setting, GDA will be a better algorithm than logistic regression; and more generally, even for small training set sizes, we would generally expect GDA to do better.

In contrast, by making significantly weaker assumptions, logistic regression is also more robust and less sensitive to incorrect modeling assumptions. There are many different sets of assumptions that would lead to p(y|x) taking the form of a logistic function. For example, if x|y = 0 ∼ Poisson(λ0) and x|y = 1 ∼ Poisson(λ1), then p(y|x) will be logistic (a short calculation verifying this is sketched below). Logistic regression will also work well on Poisson data like this. But if we were to use GDA on such data, and fit Gaussian distributions to such non-Gaussian data, then the results will be less predictable, and GDA may (or may not) do well.

To summarize: GDA makes stronger modeling assumptions, and is more data efficient (i.e., requires less training data to learn "well") when the modeling assumptions are correct or at least approximately correct.
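
Sketch of the Poisson claim above: write p(y = 1) = φ and use the Poisson pmf p(x|y) = e^(−λ) λ^x / x!. Then, by the same Bayes rule argument as before,

    p(y = 1 | x) = 1 / (1 + p(x | y = 0)(1 − φ) / (p(x | y = 1) φ))
                 = 1 / (1 + exp(x log(λ0/λ1) + (λ1 − λ0) + log((1 − φ)/φ)))
                 = 1 / (1 + exp(−(θ1 x + θ0))),

with θ1 = log(λ1/λ0) and θ0 = λ0 − λ1 + log(φ/(1 − φ)); the x! terms cancel, and the result is a logistic function of x.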

