Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Charles 10, 2014

1 Principle of maximum likelihood

Consider a family of probability distributions defined by a set of parameters $\theta$. The distributions may be either probability mass functions (pmfs) or probability density functions (pdfs). Suppose that we have a random sample drawn from a fixed but unknown member of this family. The random sample is a training set of $n$ examples $x_1$ to $x_n$. An example may also be called an observation, an outcome, an instance, or a data point. In general each $x_j$ is a vector of values, and $\theta$ is a vector of real-valued parameters. For example, for a Gaussian distribution $\theta = \langle \mu, \sigma^2 \rangle$.

We assume that the examples are independent, so the probability of the set is the product of the probabilities of the individual examples:
$$f(x_1, \ldots, x_n; \theta) = \prod_j f(x_j; \theta).$$
The notation above makes us think of the distribution as fixed and the examples $x_j$ as unknown, or varying.

However, we can think of the training data as fixed and consider alternative parameter values. This is the point of view behind the definition of the likelihood function:
$$L(\theta; x_1, \ldots, x_n) = f(x_1, \ldots, x_n; \theta).$$
Note that if $f(x; \theta)$ is a probability mass function, then the likelihood is always less than one, but if $f(x; \theta)$ is a probability density function, then the likelihood can be greater than one, since densities can be greater than one.

The principle of maximum likelihood says that given the training data, we should use as our model the distribution $f(\cdot; \hat{\theta})$ that gives the greatest possible probability to the training data. Formally,
$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}}\; L(\theta; x_1, \ldots, x_n).$$
The value $\hat{\theta}$ is called the maximum likelihood estimator (MLE) of $\theta$. In general the hat notation indicates an estimated quantity; if necessary we will use notation like $\hat{\theta}_{\mathrm{MLE}}$ to indicate the nature of an estimate.

2 Examples of maximizing likelihood

As a first example of finding a maximum likelihood estimator, consider estimating the parameter of a Bernoulli distribution.

A random variable with this distribution is a formalization of a coin toss. The value of the random variable is 1 with probability $\theta$ and 0 with probability $1 - \theta$. Let $X$ be a Bernoulli random variable, and let $x$ be an outcome of $X$. We have
$$P(X = x) = \begin{cases} \theta & \text{if } x = 1 \\ 1 - \theta & \text{if } x = 0. \end{cases}$$
Usually, we use the notation $P(\cdot)$ for a probability mass, and the notation $p(\cdot)$ for a probability density. For mathematical convenience, write $P(X)$ as
$$P(X = x) = \theta^x (1 - \theta)^{1 - x}.$$
Suppose that the training data are $x_1$ through $x_n$, where each $x_i \in \{0, 1\}$. The likelihood function is
$$L(\theta; x_1, \ldots, x_n) = f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^n P(X = x_i) = \theta^h (1 - \theta)^{n - h}$$
where $h = \sum_{i=1}^n x_i$. The maximization is performed over the possible scalar values $0 \le \theta \le 1$.

We can do the maximization by setting the derivative with respect to $\theta$ equal to zero. The derivative is
$$\frac{d}{d\theta}\, \theta^h (1 - \theta)^{n - h} = h\theta^{h-1}(1 - \theta)^{n-h} + \theta^h (n - h)(1 - \theta)^{n-h-1}(-1) = \theta^{h-1}(1 - \theta)^{n-h-1}\left[ h(1 - \theta) - (n - h)\theta \right],$$
which has solutions $\theta = 0$, $\theta = 1$, and $\theta = h/n$. The solution which is a maximum is clearly $\theta = h/n$, while $\theta = 0$ and $\theta = 1$ are minima. So we have the maximum likelihood estimate $\hat{\theta} = h/n$.
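As a quick check on this derivation, the closed-form estimate $\hat{\theta} = h/n$ can be compared against a brute-force maximization of the log likelihood $h \log \theta + (n - h)\log(1 - \theta)$ over a dense grid of candidate values. The sketch below is illustrative only; the sample data and grid resolution are arbitrary choices, not part of the original notes.

```python
import numpy as np

# Toy Bernoulli training data (arbitrary example, not from the notes).
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n, h = len(x), x.sum()

# Closed-form maximum likelihood estimate: theta_hat = h / n.
theta_closed_form = h / n

# Brute-force check: evaluate the log likelihood
#   l(theta) = h log(theta) + (n - h) log(1 - theta)
# on a dense grid strictly inside (0, 1) and take the argmax.
grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
log_lik = h * np.log(grid) + (n - h) * np.log(1 - grid)
theta_grid = grid[np.argmax(log_lik)]

print(theta_closed_form)      # 0.7
print(round(theta_grid, 4))   # approximately 0.7
```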

The log likelihood function, written $l(\theta)$, is simply the logarithm of the likelihood function $L(\theta)$. Because the logarithm is a monotonic, strictly increasing function, maximizing the log likelihood is precisely equivalent to maximizing the likelihood, and also to minimizing the negative log likelihood.

As an example of maximizing the log likelihood, consider the problem of estimating the parameters of a univariate Gaussian distribution. This distribution is
$$f(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right].$$
The log likelihood for one example $x$ is
$$l(\mu, \sigma^2; x) = \log L(\mu, \sigma^2; x) = -\log \sigma - \log\sqrt{2\pi} - \frac{(x - \mu)^2}{2\sigma^2}.$$
Suppose that we have training data $\{x_1, \ldots, x_n\}$. The maximum log likelihood estimates are
$$\langle \hat{\mu}, \hat{\sigma}^2 \rangle = \underset{\mu, \sigma^2}{\operatorname{argmax}} \left[ -n\log\sigma - n\log\sqrt{2\pi} - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right].$$
The expression in square brackets is to be optimized simultaneously over two variables. Fortunately, it can be simplified into two sequential univariate optimizations.

Eliminating the minus signs makes both these optimizations be minimizations. The first is
$$\hat{\mu} = \underset{\mu}{\operatorname{argmin}} \sum_{i=1}^n (x_i - \mu)^2$$
while the second is
$$\hat{\sigma}^2 = \underset{\sigma^2}{\operatorname{argmin}} \left[ n\log\sigma + \frac{1}{2\sigma^2} T \right]$$
where $T = \sum_{i=1}^n (x_i - \hat{\mu})^2$.

In order to do the first minimization, write $(x_i - \mu)^2$ as $(x_i - \bar{x} + \bar{x} - \mu)^2$. Then
$$\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + 2(\bar{x} - \mu)\sum_{i=1}^n (x_i - \bar{x}) + n(\bar{x} - \mu)^2.$$
The first term $\sum_{i=1}^n (x_i - \bar{x})^2$ does not depend on $\mu$, so it is irrelevant to the minimization. The second term equals zero, because $\sum_{i=1}^n (x_i - \bar{x}) = 0$. The third term is always positive, so it is clear that the sum is minimized when $\mu = \bar{x}$.

To perform the second minimization, work out the derivative symbolically and then work out when it equals zero:
$$\frac{\partial}{\partial\sigma}\left[ n\log\sigma + \frac{1}{2}\sigma^{-2}T \right] = n\sigma^{-1} + \frac{1}{2}(-2\sigma^{-3})T = \sigma^{-1}\left( n - T\sigma^{-2} \right) = 0 \quad \text{if } \sigma^2 = T/n.$$
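The two closed forms derived above, $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = T/n$, can be checked numerically by evaluating the bracketed log likelihood on a grid of candidate $(\mu, \sigma)$ pairs and confirming that the grid argmax lands near the closed-form estimates. The synthetic data and grid ranges below are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic Gaussian sample (true parameters chosen arbitrarily).
x = rng.normal(loc=5.0, scale=2.0, size=200)
n = len(x)

# Closed-form maximum likelihood estimates derived above.
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()        # T / n

# Grid search over (mu, sigma) of the log likelihood
#   -n log(sigma) - n log(sqrt(2 pi)) - (1 / (2 sigma^2)) sum_i (x_i - mu)^2
mus = np.linspace(mu_hat - 1.0, mu_hat + 1.0, 401)
sigmas = np.linspace(1.0, 3.0, 401)
M, S = np.meshgrid(mus, sigmas, indexing="ij")
sum_sq = (x ** 2).sum() - 2 * M * x.sum() + n * M ** 2   # sum_i (x_i - mu)^2
log_lik = -n * np.log(S) - n * np.log(np.sqrt(2 * np.pi)) - sum_sq / (2 * S ** 2)
i, j = np.unravel_index(log_lik.argmax(), log_lik.shape)

print(mu_hat, sigma2_hat)        # closed-form estimates
print(mus[i], sigmas[j] ** 2)    # grid argmax, close to the closed forms
```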

Maximum likelihood estimators are typically reasonable, but they may have issues. Consider the Gaussian variance estimator $\hat{\sigma}^2_{\mathrm{MLE}} = \sum_{i=1}^n (x_i - \bar{x})^2 / n$ and the case where $n = 1$. In this case $\hat{\sigma}^2_{\mathrm{MLE}} = 0$. This estimate is guaranteed to be too small. Intuitively, the estimate is optimistically assuming that all future data points $x_2$ and so on will be identical to $x_1$. It can be proved that in general the maximum likelihood estimate of the variance of a Gaussian is too small, on average:
$$E\left[ \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 ;\, \mu, \sigma^2 \right] = \frac{n-1}{n}\sigma^2 < \sigma^2.$$
This phenomenon can be considered an instance of overfitting: the observed spread around the observed mean $\bar{x}$ is less than the unknown true spread $\sigma^2$ around the unknown true mean $\mu$.
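The $(n-1)/n$ shrinkage factor is easy to see empirically: averaging $\hat{\sigma}^2_{\mathrm{MLE}}$ over many simulated samples of size $n$ approaches $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$. The sample size, true parameters, and number of trials below are arbitrary illustration choices, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)

n, mu, sigma2 = 5, 0.0, 4.0            # a small n makes the bias visible
trials = 200_000

# Draw many samples of size n and compute the MLE variance for each.
samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
mle_var = samples.var(axis=1)          # np.var divides by n, i.e. the MLE

print(mle_var.mean())                  # approximately (n - 1) / n * sigma2
print((n - 1) / n * sigma2)            # 3.2
```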

3 Conditional likelihood

An important extension of the idea of likelihood is conditional likelihood. Remember that the notation $p(y|x)$ is an abbreviation for the conditional probability $p(Y = y \mid X = x)$ where $Y$ and $X$ are random variables. The conditional likelihood of $\theta$ given data $x$ and $y$ is $L(\theta; y|x) = p(y|x) = f(y|x; \theta)$. Intuitively, $Y$ follows a probability distribution that is different for different $x$. Technically, for each $x$ there is a different function $f(y|x; \theta)$, but all these functions share the same parameters $\theta$. We assume that $x$ itself is never unknown, so there is no need to have a probabilistic model of it.

Given training data consisting of $\langle x_i, y_i \rangle$ pairs, the principle of maximum conditional likelihood says to choose a parameter estimate $\hat{\theta}$ that maximizes the product $\prod_i f(y_i | x_i; \theta)$. Note that we do not need to assume that the $x_i$ are independent in order to justify the conditional likelihood being a product; we just need to assume that the $y_i$ are independent when each is conditioned on its own $x_i$. For any specific value of $x$, $\hat{\theta}$ can then be used to compute probabilities for alternative values $y$ of $Y$. By assumption, we never want to predict values of $x$.

Suppose that $Y$ is a binary (Bernoulli) outcome and that $x$ is a real-valued vector. We can assume that the probability that $Y = 1$ is a nonlinear function of a linear function of $x$. Specifically, we assume the conditional model
$$p(Y = 1 \mid x; \alpha, \beta) = \sigma\Big( \alpha + \sum_{j=1}^d \beta_j x_j \Big) = \frac{1}{1 + \exp\big[ -\big( \alpha + \sum_{j=1}^d \beta_j x_j \big) \big]}$$
where $\sigma(z) = 1/(1 + e^{-z})$ is the nonlinear function.
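A direct transcription of this conditional model into code is shown below; the particular parameter values and input vector are made up purely to illustrate the computation of $p(Y = 1 \mid x)$ and are not taken from the notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, alpha, beta):
    """Conditional model p(Y = 1 | x) = sigma(alpha + sum_j beta_j * x_j)."""
    return sigmoid(alpha + np.dot(beta, x))

# Illustrative parameter values and one example x (hypothetical).
alpha = -1.0
beta = np.array([0.8, -0.5, 2.0])
x = np.array([1.0, 3.0, 0.5])

print(p_y1_given_x(x, alpha, beta))   # a probability strictly between 0 and 1
```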

This model is called logistic regression. We use $j$ to index over the feature values $x_1$ to $x_d$ of a single example of dimensionality $d$, since we use $i$ below to index over training examples 1 to $n$. If necessary, the notation $x_{ij}$ means the $j$th feature value of the $i$th example.

Be sure to understand the distinction between a feature and a value of a feature. Essentially, a feature is a random variable, while a value of a feature is a possible outcome of the random variable. Features may also be called attributes, predictors, or independent variables. The random variable $Y$ is sometimes called a dependent variable.

The logistic regression model is easier to understand in the form
$$\log \frac{p}{1 - p} = \alpha + \sum_{j=1}^d \beta_j x_j$$
where $p$ is an abbreviation for $p(Y = 1 \mid x; \alpha, \beta)$. The ratio $p/(1 - p)$ is called the odds of the event $Y = 1$ given $X = x$, and $\log[p/(1 - p)]$ is called the log odds. Since probabilities range between 0 and 1, odds range between 0 and $+\infty$, and log odds range unboundedly between $-\infty$ and $+\infty$. A linear expression of the form $\alpha + \sum_j \beta_j x_j$ can also take unbounded values, so it is reasonable to use a linear expression as a model for log odds, but not as a model for odds or for probabilities.
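Taking the log odds of the model's output recovers the linear expression exactly, which the small check below illustrates. It reuses the same hypothetical parameter values as the earlier sketch.

```python
import numpy as np

# Hypothetical parameters and one example (same values as above).
alpha = -1.0
beta = np.array([0.8, -0.5, 2.0])
x = np.array([1.0, 3.0, 0.5])

linear = alpha + np.dot(beta, x)      # alpha + sum_j beta_j x_j
p = 1.0 / (1.0 + np.exp(-linear))     # p(Y = 1 | x)
log_odds = np.log(p / (1 - p))        # log odds of the event Y = 1

print(linear)     # -0.7
print(log_odds)   # -0.7, identical up to floating-point rounding
```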

Essentially, logistic regression is the simplest reasonable model for a random yes/no outcome whose probability depends linearly on predictors. For each feature $j$, $\exp(\beta_j x_j)$ is a multiplicative scaling factor on the odds $p/(1 - p)$. If the predictor $x_j$ is binary, then $\exp(\beta_j)$ is the extra odds of having the outcome $Y = 1$ rather than $Y = 0$ when $x_j = 1$, compared to when $x_j = 0$. If the predictor $x_j$ is real-valued, then $\exp(\beta_j)$ is the extra odds of having the outcome $Y = 1$ when the value of $x_j$ increases by one unit. A major limitation of the basic logistic regression model is that the probability $p$ must either increase monotonically, or decrease monotonically, as a function of each predictor $x_j$. The basic model does not allow the probability to depend in a U-shaped way on any $x_j$.
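The odds-scaling interpretation of $\exp(\beta_j)$ can be verified numerically: increasing one feature by a unit multiplies the odds by exactly $\exp(\beta_j)$. The coefficients and feature vector below are hypothetical, chosen only to demonstrate the identity.

```python
import numpy as np

def odds(x, alpha, beta):
    """Odds p / (1 - p) under the logistic regression model."""
    p = 1.0 / (1.0 + np.exp(-(alpha + np.dot(beta, x))))
    return p / (1 - p)

# Hypothetical coefficients and a base feature vector.
alpha = -1.0
beta = np.array([0.8, -0.5, 2.0])
x = np.array([1.0, 3.0, 0.5])

j = 2                        # increase feature j by one unit
x_plus = x.copy()
x_plus[j] += 1.0

print(odds(x_plus, alpha, beta) / odds(x, alpha, beta))   # ratio of odds
print(np.exp(beta[j]))                                    # exp(beta_j), identical
```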

Given the training set $\{\langle x_1, y_1 \rangle, \ldots, \langle x_n, y_n \rangle\}$, we learn a logistic regression classifier by maximizing the log joint conditional likelihood. This is the sum of the log conditional likelihood for each training example:
$$LCL = \sum_{i=1}^n \log L(\theta; y_i | x_i) = \sum_{i=1}^n \log f(y_i | x_i; \theta).$$
Given a single training example $\langle x_i, y_i \rangle$, the log conditional likelihood is $\log p_i$ if the true label $y_i = 1$ and $\log(1 - p_i)$ if $y_i = 0$, where $p_i = p(y = 1 \mid x_i; \theta)$.

To simplify the following discussion, assume from now on that $\alpha = \beta_0$ and $x_0 = 1$ for every example $x$, so the parameter vector $\theta$ is $\beta \in \mathbb{R}^{d+1}$. By grouping together the positive and negative training examples, we can write the total conditional log likelihood as
$$LCL = \sum_{i: y_i = 1} \log p_i + \sum_{i: y_i = 0} \log(1 - p_i).$$
The partial derivative of $LCL$ with respect to parameter $\beta_j$ is
$$\sum_{i: y_i = 1} \frac{\partial}{\partial \beta_j} \log p_i + \sum_{i: y_i = 0} \frac{\partial}{\partial \beta_j} \log(1 - p_i).$$
For an individual training example $\langle x, y \rangle$, if its label $y = 1$ the partial derivative is
$$\frac{\partial}{\partial \beta_j} \log p = \frac{1}{p} \frac{\partial}{\partial \beta_j} p$$
while if $y = 0$ it is
$$\frac{\partial}{\partial \beta_j} \log(1 - p) = \frac{1}{1 - p}\left( -\frac{\partial}{\partial \beta_j} p \right).$$
Let $e = \exp\big[ -\sum_{j=0}^d \beta_j x_j \big]$, so $p = 1/(1 + e)$ and $1 - p = (1 + e - 1)/(1 + e) = e/(1 + e)$. With this notation we have
$$\frac{\partial}{\partial \beta_j} p = (-1)(1 + e)^{-2} \frac{\partial}{\partial \beta_j} e = (-1)(1 + e)^{-2}(e) \frac{\partial}{\partial \beta_j}\Big[ -\sum_j \beta_j x_j \Big] = (-1)(1 + e)^{-2}(e)(-x_j) = \frac{1}{1 + e}\,\frac{e}{1 + e}\, x_j = p(1 - p)x_j.$$
So
$$\frac{\partial}{\partial \beta_j} \log p = (1 - p)x_j \quad \text{and} \quad \frac{\partial}{\partial \beta_j} \log(1 - p) = -p\, x_j.$$
For the entire training set, the partial derivative of the log conditional likelihood with respect to $\beta_j$ is
$$\frac{\partial}{\partial \beta_j} LCL = \sum_{i: y_i = 1} (1 - p_i)x_{ij} + \sum_{i: y_i = 0} -p_i x_{ij} = \sum_i (y_i - p_i)x_{ij}$$
where $x_{ij}$ is the value of the $j$th feature of the $i$th training example.
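The closed-form gradient $\partial LCL / \partial \beta_j = \sum_i (y_i - p_i) x_{ij}$ translates directly into a few lines of vectorized code, and it can be checked against a finite-difference approximation of $LCL$. Everything below (data, helper names) is an illustrative sketch under those assumptions, not code from the notes; it only verifies the formula that a gradient-based training procedure would follow.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: n examples, d features, plus a constant feature x_0 = 1.
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # shape (n, d + 1)
y = rng.integers(0, 2, size=n)                              # labels in {0, 1}
beta = rng.normal(scale=0.1, size=d + 1)                    # parameter vector

def lcl(beta):
    """Log conditional likelihood: sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def lcl_gradient(beta):
    """Closed-form gradient: dLCL/dbeta_j = sum_i (y_i - p_i) x_ij."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (y - p)

# Finite-difference check of one coordinate of the gradient.
j, eps = 1, 1e-6
unit = np.eye(d + 1)[j]
numeric = (lcl(beta + eps * unit) - lcl(beta - eps * unit)) / (2 * eps)

print(lcl_gradient(beta)[j])   # analytic partial derivative
print(numeric)                 # nearly identical finite-difference estimate
```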

