Logistic Regression - Pennsylvania State University

Transcription of Logistic Regression - Pennsylvania State University

Logistic Regression
Jia Li
Department of Statistics, The Pennsylvania State University
Email: jiali

- Preserve linear classification boundaries; the rationale is the Bayes rule: $\hat{G}(x) = \arg\max_k \Pr(G=k \mid X=x)$.
- The decision boundary between class $k$ and class $l$ is determined by the equation $\Pr(G=k \mid X=x) = \Pr(G=l \mid X=x)$.
- Divide both sides by $\Pr(G=l \mid X=x)$ and take the log. The equation above is equivalent to
  $$\log \frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = 0.$$
- Since we enforce a linear boundary, we can assume
  $$\log \frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = a_0^{(k,l)} + \sum_{j=1}^{p} a_j^{(k,l)} x_j.$$
- For logistic regression, there are restrictive relations between the coefficients $a^{(k,l)}$ for different pairs $(k,l)$.

Assumptions
$$\log \frac{\Pr(G=1 \mid X=x)}{\Pr(G=K \mid X=x)} = \beta_{10} + \beta_1^T x$$
$$\log \frac{\Pr(G=2 \mid X=x)}{\Pr(G=K \mid X=x)} = \beta_{20} + \beta_2^T x$$
$$\vdots$$
$$\log \frac{\Pr(G=K-1 \mid X=x)}{\Pr(G=K \mid X=x)} = \beta_{(K-1)0} + \beta_{K-1}^T x$$

- For any pair $(k,l)$, subtracting the $l$th equation above from the $k$th gives
  $$\log \frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = \beta_{k0} - \beta_{l0} + (\beta_k - \beta_l)^T x.$$
- Number of parameters: $(K-1)(p+1)$. For example, with $K = 3$ classes and $p = 2$ inputs there are $2 \times 3 = 6$ parameters.

- Denote the entire parameter set by $\theta = \{\beta_{10}, \beta_1, \beta_{20}, \beta_2, \ldots, \beta_{(K-1)0}, \beta_{K-1}\}$.
- The log ratios of posterior probabilities are called log-odds or logit transformations.
- Under the assumptions, the posterior probabilities are given by
  $$\Pr(G=k \mid X=x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)} \quad \text{for } k = 1, \ldots, K-1,$$
  $$\Pr(G=K \mid X=x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}.$$
- For $\Pr(G=k \mid X=x)$ given above, the probabilities obviously sum up to 1: $\sum_{k=1}^{K}\Pr(G=k \mid X=x) = 1$. A simple calculation shows that the assumed log-odds equations are satisfied.

Comparison with linear regression on indicators
- Similarities:
  - Both attempt to estimate $\Pr(G=k \mid X=x)$.
  - Both have linear classification boundaries.
- Differences:
  - Linear regression on the indicator matrix approximates $\Pr(G=k \mid X=x)$ by a linear function of $x$; the estimate is not guaranteed to fall between 0 and 1 or to sum up to 1.
  - Logistic regression: $\Pr(G=k \mid X=x)$ is a nonlinear function of $x$; it is guaranteed to range from 0 to 1 and to sum up to 1.
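As a quick numerical illustration of these properties, here is a minimal NumPy sketch of the posterior formulas above; the coefficient values and variable names are hypothetical, chosen only to show that the returned probabilities lie in (0, 1) and sum to 1.

    import numpy as np

    def posteriors(x, beta0, beta):
        """Pr(G=k | X=x) for k = 1..K under the logit model.
        beta0: length K-1 intercepts; beta: (K-1) x p slope matrix."""
        eta = beta0 + beta @ x              # the K-1 log-odds relative to class K
        num = np.append(np.exp(eta), 1.0)   # numerators for classes 1..K-1 and class K
        return num / num.sum()              # common denominator 1 + sum of exp terms

    # Hypothetical example with K = 3 classes and p = 2 inputs
    probs = posteriors(np.array([0.5, -1.0]),
                       beta0=np.array([0.2, -0.3]),
                       beta=np.array([[1.0, 0.5], [-0.4, 0.8]]))
    print(probs, probs.sum())   # entries in (0, 1); the sum is exactly 1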

Fitting logistic regression models
- Criterion: find the parameters that maximize the conditional likelihood of $G$ given $X$ using the training data.
- Denote $p_k(x_i;\theta) = \Pr(G=k \mid X=x_i;\theta)$.
- Given the first input $x_1$, the posterior probability of its class being $g_1$ is $\Pr(G=g_1 \mid X=x_1)$.
- Since the samples in the training data set are independent, the posterior probability for the $N$ samples each having class $g_i$, $i = 1, 2, \ldots, N$, given their inputs $x_1, x_2, \ldots, x_N$, is
  $$\prod_{i=1}^{N}\Pr(G=g_i \mid X=x_i).$$
- The conditional log-likelihood of the class labels in the training data set is
  $$L(\theta) = \sum_{i=1}^{N}\log\Pr(G=g_i \mid X=x_i) = \sum_{i=1}^{N}\log p_{g_i}(x_i;\theta).$$

Binary classification
- For binary classification, if $g_i = 1$, denote $y_i = 1$; if $g_i = 2$, denote $y_i = 0$.
- Denote $p_1(x;\theta) = p(x;\theta)$; then $p_2(x;\theta) = 1 - p_1(x;\theta) = 1 - p(x;\theta)$.
- Since $K = 2$, the parameters are $\theta = \{\beta_{10}, \beta_1\}$. We denote $\beta = (\beta_{10}, \beta_1)^T$.
- If $y_i = 1$, i.e., $g_i = 1$,
  $$\log p_{g_i}(x;\beta) = \log p_1(x;\beta) = 1\cdot\log p(x;\beta) = y_i\log p(x;\beta).$$
  If $y_i = 0$, i.e., $g_i = 2$,
  $$\log p_{g_i}(x;\beta) = \log p_2(x;\beta) = 1\cdot\log(1 - p(x;\beta)) = (1 - y_i)\log(1 - p(x;\beta)).$$

- Since either $y_i = 0$ or $1 - y_i = 0$, we have
  $$\log p_{g_i}(x;\beta) = y_i\log p(x;\beta) + (1 - y_i)\log(1 - p(x;\beta)).$$
- The conditional log-likelihood is
  $$L(\beta) = \sum_{i=1}^{N}\log p_{g_i}(x_i;\beta) = \sum_{i=1}^{N}\left[y_i\log p(x_i;\beta) + (1 - y_i)\log(1 - p(x_i;\beta))\right].$$
- There are $p+1$ parameters in $\beta = (\beta_{10}, \beta_1)^T$. Assume a column vector form for $\beta$:
  $$\beta = (\beta_{10}, \beta_{11}, \beta_{12}, \ldots, \beta_{1p})^T.$$
- Here we add the constant term 1 to $x$ to accommodate the intercept:
  $$x = (1, x_1, x_2, \ldots, x_p)^T.$$
- By the assumption of the logistic regression model,
  $$p(x;\beta) = \Pr(G=1 \mid X=x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)}, \qquad 1 - p(x;\beta) = \Pr(G=2 \mid X=x) = \frac{1}{1 + \exp(\beta^T x)}.$$
- Substituting the above into $L(\beta)$ gives
  $$L(\beta) = \sum_{i=1}^{N}\left[y_i\beta^T x_i - \log\left(1 + e^{\beta^T x_i}\right)\right].$$
- To maximize $L(\beta)$, we set the first-order partial derivatives of $L(\beta)$ to zero:
  $$\frac{\partial L(\beta)}{\partial\beta_{1j}} = \sum_{i=1}^{N}y_i x_{ij} - \sum_{i=1}^{N}x_{ij}\frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} = \sum_{i=1}^{N}y_i x_{ij} - \sum_{i=1}^{N}p(x_i;\beta)x_{ij} = \sum_{i=1}^{N}x_{ij}\left(y_i - p(x_i;\beta)\right)$$
  for all $j = 0, 1, \ldots, p$.
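For illustration, a minimal NumPy sketch of this log-likelihood and its gradient; it assumes the design matrix X already contains the leading column of 1s, and all names are hypothetical.

    import numpy as np

    def log_likelihood(beta, X, y):
        # L(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]
        eta = X @ beta
        return np.sum(y * eta - np.log1p(np.exp(eta)))

    def gradient(beta, X, y):
        # dL/dbeta_j = sum_i x_ij * (y_i - p(x_i; beta)),  i.e.  X^T (y - p)
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        return X.T @ (y - p)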

- In matrix form, we write
  $$\frac{\partial L(\beta)}{\partial\beta} = \sum_{i=1}^{N}x_i\left(y_i - p(x_i;\beta)\right).$$
- To solve the set of $p+1$ nonlinear equations $\frac{\partial L(\beta)}{\partial\beta_{1j}} = 0$, $j = 0, 1, \ldots, p$, we use the Newton-Raphson algorithm.
- The Newton-Raphson algorithm requires the second derivatives, i.e., the Hessian matrix:
  $$\frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^T} = -\sum_{i=1}^{N}x_i x_i^T\,p(x_i;\beta)\left(1 - p(x_i;\beta)\right).$$
- The element on the $j$th row and $n$th column is (counting from 0):
  $$\frac{\partial^2 L(\beta)}{\partial\beta_{1j}\,\partial\beta_{1n}} = -\sum_{i=1}^{N}\frac{\left(1 + e^{\beta^T x_i}\right)e^{\beta^T x_i}x_{ij}x_{in} - \left(e^{\beta^T x_i}\right)^2 x_{ij}x_{in}}{\left(1 + e^{\beta^T x_i}\right)^2} = -\sum_{i=1}^{N}\left[x_{ij}x_{in}p(x_i;\beta) - x_{ij}x_{in}p(x_i;\beta)^2\right] = -\sum_{i=1}^{N}x_{ij}x_{in}\,p(x_i;\beta)\left(1 - p(x_i;\beta)\right).$$
- Starting with $\beta^{\text{old}}$, a single Newton-Raphson update is
  $$\beta^{\text{new}} = \beta^{\text{old}} - \left(\frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^T}\right)^{-1}\frac{\partial L(\beta)}{\partial\beta},$$
  where the derivatives are evaluated at $\beta^{\text{old}}$.
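A minimal sketch of one such Newton-Raphson step, built directly from the sums above (hypothetical names; not the author's code):

    import numpy as np

    def newton_step(beta_old, X, y):
        # p_i = p(x_i; beta_old), evaluated at the current estimate
        p = 1.0 / (1.0 + np.exp(-(X @ beta_old)))
        grad = X.T @ (y - p)                           # dL/dbeta
        hess = -(X * (p * (1 - p))[:, None]).T @ X     # -sum_i x_i x_i^T p_i (1 - p_i)
        return beta_old - np.linalg.solve(hess, grad)  # beta_new = beta_old - H^{-1} dL/dbeta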

- The iteration can be expressed compactly in matrix form. Let $\mathbf{y}$ denote the column vector of the $y_i$, let $\mathbf{X}$ be the $N \times (p+1)$ input matrix, and let $\mathbf{p}$ be the $N$-vector of fitted probabilities with $i$th element $p(x_i;\beta^{\text{old}})$.
- Let $\mathbf{W}$ be an $N \times N$ diagonal matrix of weights with $i$th diagonal element $p(x_i;\beta^{\text{old}})\left(1 - p(x_i;\beta^{\text{old}})\right)$.
- Then
  $$\frac{\partial L(\beta)}{\partial\beta} = \mathbf{X}^T(\mathbf{y} - \mathbf{p}), \qquad \frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^T} = -\mathbf{X}^T\mathbf{W}\mathbf{X}.$$
- The Newton-Raphson step is
  $$\beta^{\text{new}} = \beta^{\text{old}} + (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y} - \mathbf{p}) = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\left(\mathbf{X}\beta^{\text{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p})\right) = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z},$$
  where $\mathbf{z} \triangleq \mathbf{X}\beta^{\text{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p})$.
- If $\mathbf{z}$ is viewed as a response and $\mathbf{X}$ as the input matrix, $\beta^{\text{new}}$ is the solution to a weighted least squares problem:
  $$\beta^{\text{new}} \leftarrow \arg\min_{\beta}\,(\mathbf{z} - \mathbf{X}\beta)^T\mathbf{W}(\mathbf{z} - \mathbf{X}\beta).$$
- Recall that linear regression by least squares solves
  $$\arg\min_{\beta}\,(\mathbf{z} - \mathbf{X}\beta)^T(\mathbf{z} - \mathbf{X}\beta).$$
- $\mathbf{z}$ is referred to as the adjusted response, and the algorithm is referred to as iteratively reweighted least squares (IRLS).

Pseudo code
1. Initialize $\beta \leftarrow 0$.
2. Compute $\mathbf{y}$ by setting its elements to $y_i = 1$ if $g_i = 1$ and $y_i = 0$ if $g_i = 2$, $i = 1, 2, \ldots, N$.
3. Compute $\mathbf{p}$ by setting its elements to $p(x_i;\beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}$, $i = 1, 2, \ldots, N$.
4. Compute the diagonal matrix $\mathbf{W}$; its $i$th diagonal element is $p(x_i;\beta)(1 - p(x_i;\beta))$, $i = 1, 2, \ldots, N$.
5. $\mathbf{z} \leftarrow \mathbf{X}\beta + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p})$.
6. $\beta \leftarrow (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z}$.
7. If the stopping criterion is met, stop; otherwise go back to step 3.

Computational efficiency
- Since $\mathbf{W}$ is an $N \times N$ diagonal matrix, direct matrix operations with it may be very inefficient. A modified pseudo code that avoids forming $\mathbf{W}$ explicitly is given below, after the sketch.
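Before the modified pseudo code, here is a minimal NumPy sketch of the IRLS loop just described, keeping the weights as a vector rather than an N x N matrix. It is a rough illustration under the notation above (hypothetical names, no safeguards against weights near zero), not a production implementation.

    import numpy as np

    def irls_binary(X, g, n_iter=25, tol=1e-8):
        """Binary logistic regression by IRLS.
        X: N x (p+1) design matrix with leading 1s; g: class labels in {1, 2}."""
        y = (g == 1).astype(float)          # y_i = 1 if g_i = 1, else 0
        beta = np.zeros(X.shape[1])         # step 1: initialize beta to 0
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # fitted probabilities
            w = p * (1.0 - p)                          # diagonal of W, kept as a vector
            z = X @ beta + (y - p) / w                 # adjusted response
            beta_new = np.linalg.solve(X.T @ (w[:, None] * X),  # X^T W X
                                       X.T @ (w * z))            # X^T W z
            if np.max(np.abs(beta_new - beta)) < tol:  # simple stopping criterion
                return beta_new
            beta = beta_new
        return beta

The update is algebraically identical to step 6 of the pseudo code; the only change is that W is never materialized as a matrix.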

Modified pseudo code
1. Initialize $\beta \leftarrow 0$.
2. Compute $\mathbf{y}$ by setting its elements to $y_i = 1$ if $g_i = 1$ and $y_i = 0$ if $g_i = 2$, $i = 1, 2, \ldots, N$.
3. Compute $\mathbf{p}$ by setting its elements to $p(x_i;\beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}$, $i = 1, 2, \ldots, N$.
4. Compute the $N \times (p+1)$ matrix $\tilde{\mathbf{X}}$ by multiplying the $i$th row of $\mathbf{X}$ by $p(x_i;\beta)(1 - p(x_i;\beta))$, $i = 1, 2, \ldots, N$:
  $$\mathbf{X} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}, \qquad \tilde{\mathbf{X}} = \begin{pmatrix} p(x_1;\beta)(1 - p(x_1;\beta))\,x_1^T \\ p(x_2;\beta)(1 - p(x_2;\beta))\,x_2^T \\ \vdots \\ p(x_N;\beta)(1 - p(x_N;\beta))\,x_N^T \end{pmatrix}.$$
5. $\beta \leftarrow \beta + (\mathbf{X}^T\tilde{\mathbf{X}})^{-1}\mathbf{X}^T(\mathbf{y} - \mathbf{p})$.
6. If the stopping criterion is met, stop; otherwise go back to step 3.

Example: diabetes data set
- The input $X$ is two dimensional: the two dimensions are the first two principal components of the original 8 variables.
- Class 1: without diabetes; Class 2: with diabetes.
- Applying logistic regression, we obtain the fitted coefficient vector $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)^T$.
- The posterior probabilities are
  $$\Pr(G=1 \mid X=x) = \frac{\exp(\hat{\beta}^T x)}{1 + \exp(\hat{\beta}^T x)}, \qquad \Pr(G=2 \mid X=x) = \frac{1}{1 + \exp(\hat{\beta}^T x)}.$$
- The classification rule is
  $$\hat{G}(x) = \begin{cases} 1 & \hat{\beta}^T x \ge 0 \\ 2 & \hat{\beta}^T x < 0. \end{cases}$$
- [Figure: scatter plot of the diabetes training data. Solid line: decision boundary obtained by logistic regression; dashed line: a comparison decision boundary. Training-set classification error rates for both are shown with the figure.]
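A small sketch of this decision rule, reusing a coefficient vector such as the one returned by the IRLS sketch above (beta_hat is a hypothetical name):

    import numpy as np

    def classify(x, beta_hat):
        # x includes the leading constant 1; class 1 if beta^T x >= 0, class 2 otherwise
        return 1 if np.dot(beta_hat, x) >= 0.0 else 2

Since $\hat{\beta}^T x \ge 0$ is equivalent to $\Pr(G=1 \mid X=x) \ge 1/2$, this rule simply thresholds the posterior probability at 1/2.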

Multiclass case ($K \ge 3$)
- When $K \ge 3$, $\beta$ is a $(K-1)(p+1)$-vector:
  $$\beta = \begin{pmatrix} \beta_{10} \\ \beta_1 \\ \beta_{20} \\ \beta_2 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{pmatrix} = \begin{pmatrix} \beta_{10} \\ \beta_{11} \\ \vdots \\ \beta_{1p} \\ \vdots \\ \beta_{(K-1)0} \\ \vdots \\ \beta_{(K-1)p} \end{pmatrix}.$$
- Let $\bar{\beta}_l = \begin{pmatrix} \beta_{l0} \\ \beta_l \end{pmatrix}$.
- The likelihood function becomes
  $$L(\beta) = \sum_{i=1}^{N}\log p_{g_i}(x_i;\beta) = \sum_{i=1}^{N}\log\frac{e^{\bar{\beta}_{g_i}^T x_i}}{1 + \sum_{l=1}^{K-1}e^{\bar{\beta}_l^T x_i}} = \sum_{i=1}^{N}\left[\bar{\beta}_{g_i}^T x_i - \log\left(1 + \sum_{l=1}^{K-1}e^{\bar{\beta}_l^T x_i}\right)\right]$$
  (with the convention $\bar{\beta}_K = 0$, so that the formula also covers $g_i = K$).
- Note: the indicator function $I(\cdot)$ equals 1 when its argument is true and 0 otherwise.
- First-order derivatives:
  $$\frac{\partial L(\beta)}{\partial\beta_{kj}} = \sum_{i=1}^{N}\left[I(g_i=k)\,x_{ij} - \frac{e^{\bar{\beta}_k^T x_i}x_{ij}}{1 + \sum_{l=1}^{K-1}e^{\bar{\beta}_l^T x_i}}\right] = \sum_{i=1}^{N}x_{ij}\left(I(g_i=k) - p_k(x_i;\beta)\right).$$
- Second-order derivatives:
  $$\frac{\partial^2 L(\beta)}{\partial\beta_{kj}\,\partial\beta_{mn}} = \sum_{i=1}^{N}x_{ij}\frac{1}{\left(1 + \sum_{l=1}^{K-1}e^{\bar{\beta}_l^T x_i}\right)^2}\left[-e^{\bar{\beta}_k^T x_i}I(k=m)\,x_{in}\left(1 + \sum_{l=1}^{K-1}e^{\bar{\beta}_l^T x_i}\right) + e^{\bar{\beta}_k^T x_i}e^{\bar{\beta}_m^T x_i}x_{in}\right]$$
  $$= \sum_{i=1}^{N}x_{ij}x_{in}\left(-p_k(x_i;\beta)I(k=m) + p_k(x_i;\beta)p_m(x_i;\beta)\right) = -\sum_{i=1}^{N}x_{ij}x_{in}\,p_k(x_i;\beta)\left[I(k=m) - p_m(x_i;\beta)\right].$$
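A minimal sketch of these first-order derivatives in NumPy, with the coefficients held as a (K-1) x (p+1) array B whose kth row is $\bar{\beta}_k^T$ (names are hypothetical):

    import numpy as np

    def multiclass_gradient(B, X, g):
        """Gradient of L(beta) with respect to each bar-beta_k, k = 1..K-1.
        B: (K-1) x (p+1) coefficients; X: N x (p+1) with leading 1s; g: labels in {1,..,K}."""
        K_minus_1 = B.shape[0]
        expo = np.exp(X @ B.T)                                    # e^{bar-beta_l^T x_i}, N x (K-1)
        P = expo / (1.0 + expo.sum(axis=1, keepdims=True))        # p_k(x_i; beta)
        Y = np.stack([(g == k + 1).astype(float)
                      for k in range(K_minus_1)], axis=1)         # indicators I(g_i = k)
        return X.T @ (Y - P)   # (p+1) x (K-1); column k is dL/d(bar-beta_{k+1})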

Matrix form
- $\mathbf{y}$ is the concatenated indicator vector of dimension $N(K-1)$:
  $$\mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \vdots \\ \mathbf{y}_{K-1} \end{pmatrix}, \qquad \mathbf{y}_k = \begin{pmatrix} I(g_1=k) \\ I(g_2=k) \\ \vdots \\ I(g_N=k) \end{pmatrix}, \quad 1 \le k \le K-1.$$
- $\mathbf{p}$ is the concatenated vector of fitted probabilities of dimension $N(K-1)$:
  $$\mathbf{p} = \begin{pmatrix} \mathbf{p}_1 \\ \vdots \\ \mathbf{p}_{K-1} \end{pmatrix}, \qquad \mathbf{p}_k = \begin{pmatrix} p_k(x_1;\beta) \\ p_k(x_2;\beta) \\ \vdots \\ p_k(x_N;\beta) \end{pmatrix}, \quad 1 \le k \le K-1.$$
- $\tilde{\mathbf{X}}$ is an $N(K-1) \times (p+1)(K-1)$ block-diagonal matrix:
  $$\tilde{\mathbf{X}} = \begin{pmatrix} \mathbf{X} & 0 & \cdots & 0 \\ 0 & \mathbf{X} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{X} \end{pmatrix}.$$
- The matrix $\mathbf{W}$ is an $N(K-1) \times N(K-1)$ square matrix:
  $$\mathbf{W} = \begin{pmatrix} \mathbf{W}_{11} & \mathbf{W}_{12} & \cdots & \mathbf{W}_{1,K-1} \\ \mathbf{W}_{21} & \mathbf{W}_{22} & \cdots & \mathbf{W}_{2,K-1} \\ \vdots & & & \vdots \\ \mathbf{W}_{K-1,1} & \mathbf{W}_{K-1,2} & \cdots & \mathbf{W}_{K-1,K-1} \end{pmatrix}.$$
- Each submatrix $\mathbf{W}_{km}$, $1 \le k, m \le K-1$, is an $N \times N$ diagonal matrix. When $k = m$, the $i$th diagonal element of $\mathbf{W}_{kk}$ is $p_k(x_i;\beta^{\text{old}})\left(1 - p_k(x_i;\beta^{\text{old}})\right)$; when $k \ne m$, the $i$th diagonal element of $\mathbf{W}_{km}$ is $-p_k(x_i;\beta^{\text{old}})\,p_m(x_i;\beta^{\text{old}})$.
- Similarly to the binary classification case,
  $$\frac{\partial L(\beta)}{\partial\beta} = \tilde{\mathbf{X}}^T(\mathbf{y} - \mathbf{p}), \qquad \frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^T} = -\tilde{\mathbf{X}}^T\mathbf{W}\tilde{\mathbf{X}}.$$
- The formula for updating $\beta^{\text{new}}$ in the binary classification case carries over to the multiclass case:
  $$\beta^{\text{new}} = (\tilde{\mathbf{X}}^T\mathbf{W}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\mathbf{W}\mathbf{z}, \qquad \mathbf{z} \triangleq \tilde{\mathbf{X}}\beta^{\text{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p}),$$
  or simply
  $$\beta^{\text{new}} = \beta^{\text{old}} + (\tilde{\mathbf{X}}^T\mathbf{W}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T(\mathbf{y} - \mathbf{p}).$$

Computation issues
- Initialization: one option is to use $\beta = 0$.
- Convergence is not guaranteed, but it usually occurs.
- Usually the log-likelihood increases after each iteration, but overshooting can occur. In the rare cases in which the log-likelihood decreases, cut the step size by half.

Connection with LDA
- Under the model of LDA,
  $$\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=K \mid X=x)} = \log\frac{\pi_k}{\pi_K} - \frac{1}{2}(\mu_k + \mu_K)^T\Sigma^{-1}(\mu_k - \mu_K) + x^T\Sigma^{-1}(\mu_k - \mu_K) = a_{k0} + a_k^T x.$$
- The model of LDA therefore satisfies the assumption of the linear logistic model.
- The linear logistic model only specifies the conditional distribution $\Pr(G=k \mid X=x)$; no assumption is made about $\Pr(X)$.
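For completeness, here is a short sketch of how the LDA log-odds above reduces to a linear function of $x$; it is the standard calculation from the shared-covariance Gaussian class densities $\phi(x;\mu,\Sigma)$ (the same density that appears in the mixture formula below):
$$\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=K \mid X=x)} = \log\frac{\pi_k\,\phi(x;\mu_k,\Sigma)}{\pi_K\,\phi(x;\mu_K,\Sigma)} = \log\frac{\pi_k}{\pi_K} - \frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \frac{1}{2}(x-\mu_K)^T\Sigma^{-1}(x-\mu_K).$$
Expanding the two quadratic forms cancels the $x^T\Sigma^{-1}x$ terms and leaves
$$\log\frac{\pi_k}{\pi_K} - \frac{1}{2}(\mu_k+\mu_K)^T\Sigma^{-1}(\mu_k-\mu_K) + x^T\Sigma^{-1}(\mu_k-\mu_K),$$
which is linear in $x$, as claimed.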

- The LDA model specifies the joint distribution of $X$ and $G$. $\Pr(X)$ is a mixture of Gaussians:
  $$\Pr(X) = \sum_{k=1}^{K}\pi_k\,\phi(X;\mu_k,\Sigma),$$
  where $\phi$ is the Gaussian density function.
- Linear logistic regression maximizes the conditional likelihood of $G$ given $X$: $\Pr(G=k \mid X=x)$.
- LDA maximizes the joint likelihood of $G$ and $X$: $\Pr(X=x, G=k)$.
- If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently by using more information about the data.
- Samples without class labels can be used under the model of LDA.
- LDA is not robust to gross outliers.
- As logistic regression relies on fewer assumptions, it seems to be more robust.
- In practice, logistic regression and LDA often give similar results.

Simulation
- Assume the input $X$ is one-dimensional.
- The two classes have equal priors, and the class-conditional densities of $X$ are shifted versions of each other.
- Each conditional density is a mixture of two normals. Class 1 (red): a mixture of $\mathcal{N}(-2, \tfrac{1}{4})$ and $\mathcal{N}(0, 1)$.
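A sketch of how such simulation data could be generated. The mixture weights and the shift used to produce class 2 are not given above, so the values below are purely hypothetical placeholders chosen only to make the example runnable.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_class1(n, w=0.5):
        # hypothetical equal mixture weights; components N(-2, 1/4) and N(0, 1)
        comp = rng.random(n) < w
        return np.where(comp,
                        rng.normal(-2.0, np.sqrt(0.25), n),
                        rng.normal(0.0, 1.0, n))

    def sample_class2(n, shift=2.0):
        # class 2 as a shifted version of class 1; the shift amount is a placeholder
        return sample_class1(n) + shift

    # equal class priors: draw the same number of samples from each class
    x1, x2 = sample_class1(500), sample_class2(500)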

