Example: barber

Logistic Regression

Logistic RegressionInstructor: Ping LiDepartment of Statistics and BiostatiticsDepartment of Computer ScienceRutgers University20151 Calculus Review: DerivativesSimple derivatives:[logx] =1x,[xn] =nxn 1,[ex] =ex,[ax] =axlogaChain rule:[f(h(x))] =f (h(x))h (x) log ax2+e2x =1(ax2+e2x) ax2+e2x =2ax+ 2e2x(ax2+ex)Multivariate derivatives:f(x, y) =ax+xny+cy2, f(x, y) x=axloga+nxn 1y, f(x, y) y=xn+ 2cy2 Quick Review of Numerical OptimizationSlides 4 - 15 are for reviewing some basic stuff about numerical optimization, which isessential in modern Likelihood Estimation (MLE)Observationsxi,i= 1ton, are samples from a distribution with probability densityfunctionfX(x; 1, 2, .., k),where j,j= 1tok, are parameters to be maximum likelihood estimator seeks the to maximize the joint likelihood =argmax nYi=1fX(xi; )Or, equivalently, to maximize thelogjoint likelihood =argmax nXi=1logfX(xi; )This is aconvexoptimization Example: Normal DistributionIfX N , 2 , thenfX x; , 2 =1 2 e (x )22 2 Fix 2= 1,x= x; , 2 logfX

Logistic Regression Logistic regression is one of the most widely used statistical tools for predicting cateogrical outcomes. General setup for binary logistic regression n observations: {xi,yi},i = 1 to n. xi can be a vector. yi ∈ {0,1}. For example, “1” = “YES” and “0” = “NO”. Define p(xi) = Pr(yi = 1|xi) = π(xi)

Tags:

  Logistics, Binary, Binary logistic

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Advertisement

Transcription of Logistic Regression

1 Logistic RegressionInstructor: Ping LiDepartment of Statistics and BiostatiticsDepartment of Computer ScienceRutgers University20151 Calculus Review: DerivativesSimple derivatives:[logx] =1x,[xn] =nxn 1,[ex] =ex,[ax] =axlogaChain rule:[f(h(x))] =f (h(x))h (x) log ax2+e2x =1(ax2+e2x) ax2+e2x =2ax+ 2e2x(ax2+ex)Multivariate derivatives:f(x, y) =ax+xny+cy2, f(x, y) x=axloga+nxn 1y, f(x, y) y=xn+ 2cy2 Quick Review of Numerical OptimizationSlides 4 - 15 are for reviewing some basic stuff about numerical optimization, which isessential in modern Likelihood Estimation (MLE)Observationsxi,i= 1ton, are samples from a distribution with probability densityfunctionfX(x; 1, 2, .., k),where j,j= 1tok, are parameters to be maximum likelihood estimator seeks the to maximize the joint likelihood =argmax nYi=1fX(xi; )Or, equivalently, to maximize thelogjoint likelihood =argmax nXi=1logfX(xi; )This is aconvexoptimization Example: Normal DistributionIfX N , 2 , thenfX x; , 2 =1 2 e (x )22 2 Fix 2= 1,x= x; , 2 logfX x; , 2 2 fX (x; , 2) 2 1012 3 2 1 log fX(x; , 2)It is Not concave, but it is a-log convex, , a unique MLE solution of Exact MLE samples,xi N( , 2),i= x1, x2.

2 , xn; , 2 =nXi=1logfX(xi; , 2)= 12 2nXi=1(xi )2 12nlog(2 2) l =12 22nXi=1(xi ) = 0 = =1nnXi=1xi l 2=12 4nXi=1(xi )2 n2 2= 0 = 2=1nnXi=1(xi ) FunctionsA functionf(x)is convex if the second derivativef (x) 0. 2 101201234xf(x)=x2f(x) = (x)=xlog(x)f(x) = xlog(x)f(x) =x2= f = 2>0, ,f(x) =x2is convex for (x) =xlogx= f =1x, ,f(x) =xlogxis convex ifx > are widely used in statistics and data mining as loss functions,= computationally tractable algorithms: least square, Logistic panel:f(x) =1 2 e x22is-log convex, 2[ logf(x)] x2= 1> f(x) 1001020304002468xlog f(x)Right panel: a mixture of normals is not-log convexf(x) =1 2 e x22+1 2 10e (x 10)2200 The mixture of normals is an extremely useful model in general, only a local minimum can be Descentyy=f(x)xxx012x34xxProcedure:Start with an initial f (x0), where is the step the processxt+1=xt f (xt).

3 Until some criterion is met, ,f(xt+1) f(xt)The meaning of steepest is more clear in the two-dimensional Example of Steepest Descent:f(x) =x2f(x) =x2. The minimum is attained atx= 0,f (x) = steepest descent iteration formulaxt+1=xt f (xt) =xt 2 10 50510 Iteration = = = the step size is important (even whenf(x)is convex).Too small = slow convergence, , many iterations,Too large = oscillations, , also many Descent in PracticeSteepest descent is one of the most widely techniques in realworld It is extremely simple It only requires knowing the first derivative It is numerically stable (for above reasons) For real applications, it is often affordable to use very small In machine learning, is often calledlearning rate It is used in Neural Nets and Gradient Boosting (MART)11 Newton s MethodRecall the goal is to find thex to minimizef(x).

4 Iff(x)is convex, it is equivalent to finding thex such thatf (x ) = (x) =f (x). Take Taylor expansion about the optimum solutionx :h(x ) =h(x) + (x x)h (x) + negligible higher order termsBecauseh(x ) =f (x ) = 0, we know approximately0 h(x) + (x x)h (x) = x x h(x)h (x)=x f (x)f (x)12 The procedure of Newton s MethodStart with an initial guessx0 Updatex1=x0 f (x0)f (x0)Repeatxt+1=xt f (xt)f (xt)Until some stopping criterion is reached, ,xt+1 example:f(x) = (x c) (x) = 2(x c),f (x) = f (x0)f (x0)= x1=x0 2(x0 c)2=cBut we already know thatx=cminimizesf(x) = (x c) s method may find the minimum solution using only one Example of Newton s Method:f(x) =xlogxf (x) = logx+ 1,f (x) = +1=xt logxt+11 x0 = = 10 10x0 = 1 10 10 Whenx0is close to optimum solution, the convergence is very fastWhenx0is far from the optimum, the convergence is slow initiallyWhenx0is badly chosen, no convergence.

5 This example requires0< x0< Descent forf(x) =xlogxf (x) = logx+ 1,xt+1=xt (logxt+ 1)0102030400246810 Iterationxt x0 = 0, = = 10, = = 10, = ofx0, convergence is guaranteed iff(x)is be oscillating if step size is too largeConvergence is slow near the optimum AnalysisTo assess the quality of the estimator of , it is common to usebias,variance, andMSE(mean square error):Bias :E( ) Var :E E( ) 2=E( 2) E2( )MSE :E 2=V ar+Bias2 The last equality is known as thebias variance trade-off. For unbiased estimators, it isdesirable to have smaller variance as possible. As the sample size increases, the MLE (undercertain conditions) becomes unbiased and achieves the smallest variance.

6 Therefore, theMLE is often a desirable estimator. However, in some cases, biased estimators may achievesmaller MSE than the Expectations and Variances of Common DistributionsThe derivations of variances are not required in this course. Nevertheless, it is useful to knowthe expectations and variances of common distributions. Binomial:X binomail(n, p),E(X) =np,V ar(X) =np(1 p). Normal:X N( , 2),E(X) = ,V ar(X) = 2. Chi-square:X 2(k),E(X) =k,V ar(X) = 2k. Exponential:X exp( ),E(X) =1 ,V ar(X) =1 2. Poisson:X P ois( ),E(X) = ,V ar( ).17 Multinomial DistributionThe multinomial is a natural extension to the binomial and denote the observations by(n1, n2, .., nc), which follow ac-cellmultinomial distribution with the underlying probabilities( 1, 2.)

7 , c)(withPci=1 i= 1). Denoten=Pci=1ni. We write(n1, n2, .., nc) M ultinomial(n, 1, 2, .., c)The expectations are (fori= 1tocandi6=j)E(ni) =n i, V ar(ni) =n i(1 i), Cov(ninj) = n i that the cells are negatively correlated (why?).18 Logistic RegressionLogistic Regression is one of the most widely used statistical tools for predicting setup for binary Logistic regressionnobservations:{xi, yi}, i= be a {0,1}. For example, 1 = YES and 0 = NO .Definep(xi) =Pr(yi= 1|xi) = (xi) ,Pr(yi= 0|xi) = 1 p(xi).19 The major assumption of Logistic regressionlogp(xi)1 p(xi)= 0+ 1xi,1+..+ pxi,p=pXj=0 jxi, , we treatxi,0= 1. We can also use vector notation to writelogp(xi; )1 p(xi; )=xi.

8 Here, we viewxias a row-vector and as a model in vector notationp(xi; ) =exi 1 +exi ,1 p(xi; ) =11 +exi ,Log likelihood for theith observation:li( |xi) =(1 yi) log [1 p(xi; )] +yilogp(xi; )= logp(xi; )ifyi= 1log [1 p(xi; )]ifyi= 0To understand this, consider binomial with only one samplebinomial(1, p(xi))( ,Bernouli). Whenyi= 1, the log likelihood islogp(xi)and whenyi= 0, the log likelihood islog (1 p(xi)). These two formulas can be written into log likelihood fornobservations:l( |x1, .., xn) =nXi=1li( |xi)=nXi=1(1 yi) log [1 p(xi; )] +yilogp(xi; )=nXi=1yilogp(xi; )1 p(xi; )+ log [1 p(xi; )]=nXi=1yixi log 1 +exi The remaining task is to solve the optimization problem by Regression with Only One VariableBasic assumptionlogit( (xi))= logp(xi; )1 p(xi; )= 0+ 1xiJoint Log likelihoodl( |x1.)

9 , xn) =nXi=1 yi( 0+xi 1) log 1 +e 0+xi 1 Next, we solve the optimization problem for maximizing the joint likelihood, given the derivatives l( ) 0=nXi=1yi p(xi), l( ) 1=nXi=1xi(yi p(xi)),Second derivatives 2l( ) 20= nXi=1p(xi) (1 p(xi)), 2l( ) 21= nXi=1x2ip(xi) (1 p(xi)), 2l( ) 0 1= nXi=1xip(xi) (1 p(xi))Solve the MLE by Newton s Method or steepest descent (two-dim problem).24 Logistic Regression without Intercept ( 0= 0)The simplified modellogit( (xi))= logp(xi)1 p(xi)= xiEquivalently,p(xi) =e xi1 +e xi= (xi),1 p(xi) =11 +e xi,Joint log likelihood fornobservations:l( |x1, .., xn) =nXi=1xiyi log 1 +e xi 25 First derivativel ( ) =nXi=1xi(yi p(xi)),Second derivativel ( ) = nXi=1x2ip(xi) (1 p(xi)),Newton s Method updating formula t+1= t l ( t)l ( t)Steepest descent (in factascent) updating formula t+1= t+ l ( t)26A Numerical Example of Logistic RegressionDatax={8,14, 7,6,5,6, 5,1,0, 17}y={1,1,0,0,1,0,1,0,0,0}Log likelihood function 1 60 40 200 l( ) t = 0 = s MethodSteepest Descent0102030 t = 0 = s MethodSteepest DescentSteepest descent is quite sensitive to the step size.

10 Too large leads to , 0means the initial value of .2805101520 6 4 20246 Iteration t = 0 = s MethodSteepest Descent01020300510 Iteration t = 0 =10 Newton s MethodSteepest DescentNewton s Method is sensitive to the starting point 0. May not converge at starting point (mostly) only affects computing time of steepest descent. In general, with multiple variables, we need to use the matrix formulation, which in fact iseasier to implement in s Method for Logistic Regression with 0and 1 Analogous to the one variable case, theNewton s updateformula is new= old " 2l( ) T 1 l( ) # oldwhere = 0 1 , l( ) =" ni=1yi p(xi) ni=1xi(yi p(xi))#= T y1 p(x1)y2 p(x2)..yn p(xn) =XT(y p)30 2l( ) T = Pni=1p(xi) (1 p(xi)) Pni=1xip(xi) (1 p(xi)) Pni=1xip(xi) (1 p(xi)) Pni=1x2ip(xi) (1 p(xi)) = XTWXW= p(x1)(1 p(x1)) (x2)(1 p(x2)) p(xn)(1 p(xn)) 31 Multivariate Logistic Regression Solution in Matrix FormNewton update formula new= old " 2l( ) T 1 l( ) # oldwhere, in a matrix form l( ) =nXi=1xi(yi p(xi; )) =XT(y p) 2l( ) T= nXi=1xTixip(xi; ) (1 p(xi; )) = XTWXWe can write the update formula in a matrix form new= XTWX 1 XTWz,z=X old+W 1(y p)32X= 1x1,1x1, x1,p1x2,1x2, x2, ,1xn, xn,p Rn (p+1)W= p1(1 p1) (1 p2) pn(1 pn) Rn nwherepi=p(xi; old).


Related search queries