Multiclass Logistic Regression

Sargur N. Srihari
University at Buffalo, State University of New York, USA

Topics in Linear Classification using Probabilistic Discriminative Models
- Generative vs. discriminative
- Basis functions in linear classification
- Logistic Regression (two-class)
- Iterative Reweighted Least Squares (IRLS)
- Multiclass Logistic Regression
- Link functions

Topics in Multiclass Logistic Regression
- Multiclass classification problem
- Softmax Regression
- Softmax Regression implementation
- Softmax and training
- One-hot vector representation
- Objective function and gradient
- Summary of concepts in Logistic Regression
- Example of 3-class Logistic Regression

Multi-class classification problem
[Figure: K = 10 categories, N = 100 examples]

Softmax Regression
In the two-class case
$$p(C_1 \mid \phi) = y(\phi) = \sigma(w^T\phi + b)$$
where $\phi = [\phi_1, \ldots, \phi_M]^T$, $w = [w_1, \ldots, w_M]^T$ and $a = w^T\phi + b$ is the activation.
For K classes, we work with the softmax function instead of the logistic sigmoid (softmax regression):
$$p(C_k \mid \phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = w_k^T\phi + b_k, \quad k = 1, \ldots, K$$
We learn a set of K weight vectors $\{w_1, \ldots, w_K\}$ and biases $b$. Arranging the weight vectors as the columns of a matrix $W$:
$$a = W^T\phi + b, \qquad y = \mathrm{softmax}(a), \qquad y_i = \frac{\exp(a_i)}{\sum_{j=1}^{K} \exp(a_j)}$$
where $W = [w_1, \ldots, w_K]$, $w_k = [w_{k1}, \ldots, w_{kM}]^T$ and $a = \{a_1, \ldots, a_K\}$.

Softmax Regression Implementation
An example: 3-class Logistic Regression with 3 inputs. In matrix multiplication notation, the network computes
$$a = W^T x + b, \qquad y = \mathrm{softmax}(a), \qquad y_i = \frac{\exp(a_i)}{\sum_{j=1}^{3} \exp(a_j)}$$
$$W = \begin{pmatrix} W_1 & W_2 & W_3 \end{pmatrix} = \begin{pmatrix} W_{1,1} & W_{1,2} & W_{1,3} \\ W_{2,1} & W_{2,2} & W_{2,3} \\ W_{3,1} & W_{3,2} & W_{3,3} \end{pmatrix}$$

Softmax and Training
We use maximum likelihood to determine the parameters $\{w_k\}$, $k = 1, \ldots, K$.
The exp within softmax works very well when training using log-likelihood, because the log in the log-likelihood can undo the exp of softmax:
$$\mathrm{softmax}(a)_i = \frac{\exp(a_i)}{\sum_j \exp(a_j)}, \qquad \log \mathrm{softmax}(a)_i = a_i - \log \sum_j \exp(a_j)$$
- The input $a_i$ always has a direct contribution to the cost. Because this first term cannot saturate, learning can proceed even if the second term becomes very small.
- The first term encourages $a_i$ to be pushed up.
- The second term encourages all of $a$ to be pushed down.

Derivatives
The Multiclass Logistic Regression model is
$$p(C_k \mid \phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
For maximum likelihood we will need the derivatives of $y_k$ with respect to all of the activations $a_j$. These are given by
$$\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)$$
where $I_{kj}$ are the elements of the identity matrix.

One-hot vector representation
Classes $C_1, \ldots, C_K$ are represented by a 1-of-K scheme.
One-hot vector: class $C_k$ is a K-dimensional vector $t = [t_1, \ldots, t_K]^T$, $t_i \in \{0, 1\}$.
With K = 6, class $C_3$ is $(0,0,1,0,0,0)^T$ with $t_1 = t_2 = t_4 = t_5 = t_6 = 0$ and $t_3 = 1$.
The class probabilities sum to one, $\sum_{k=1}^{K} p(C_k) = 1$, as do the components of a one-hot vector, $\sum_{k=1}^{K} t_k = 1$.
If $p(t_k = 1) = \mu_k$ then
$$p(t) = \prod_{k=1}^{K} \mu_k^{t_k}, \qquad \mu = (\mu_1, \ldots, \mu_K)^T$$
so the probability of $C_3$ is $p([0,0,1,0,0,0]) = \mu_3$.
Why use a one-hot representation? If we used numerical categories 1, 2, ... we would impute an ordering among the classes. We can now use the simpler Bernoulli instead of the multinoulli.

Target Matrix, T
Classes have values 1, ..., K, each represented as a K-dimensional binary vector. We have N labeled samples, so instead of a target vector $t$ we have a target matrix $T$ whose rows index samples and whose columns index classes:
$$T = \begin{pmatrix} t_{11} & \cdots & t_{1K} \\ \vdots & & \vdots \\ t_{N1} & \cdots & t_{NK} \end{pmatrix}$$
Note that $t_{nk}$ corresponds to sample n and class k.

Objective Function & Gradient
The likelihood of the observations is
$$p(T \mid w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k \mid \phi_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}$$
where, for feature vector $\phi_n$,
$$y_{nk} = y_k(\phi_n) = \frac{\exp(w_k^T \phi_n)}{\sum_j \exp(w_j^T \phi_n)}$$
Objective function: the negative log-likelihood, known as the cross-entropy error for multi-class,
$$E(w_1, \ldots, w_K) = -\ln p(T \mid w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$$
Gradient of the error function with respect to parameter $w_j$ (error × feature vector):
$$\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, \phi_n$$
This follows using $\sum_k t_{nk} = 1$ and $\partial y_k / \partial a_j = y_k (I_{kj} - y_j)$, where $I_{kj}$ are elements of the identity matrix, $y_k(\phi) = \exp(a_k) / \sum_j \exp(a_j)$ and $a_k = w_k^T \phi$.

Gradient Descent
The gradient has the same form as for the sum-of-squares error function with the linear model, and for the cross-entropy error with the Logistic Regression model: the product of the error $(y_{nj} - t_{nj})$ times the basis function $\phi_n$.
We can use the sequential algorithm, in which inputs are presented one at a time and the weight vector is updated using
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n$$

Newton-Raphson update gives IRLS
The Hessian matrix comprises blocks of size M × M. Block j,k is given by
$$\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj})\, \phi_n \phi_n^T$$
The number of blocks is K × K, each corresponding to a pair of classes (with redundancy).
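The log-softmax identity and the derivative formula $\partial y_k / \partial a_j = y_k(I_{kj} - y_j)$ can be checked numerically. The Python sketch below is illustrative and not taken from the slides: the function names and the activation values are assumptions, and subtracting max(a) before exponentiating is a standard stabilisation that the slides do not mention.

```python
import math

def softmax(a):
    """softmax(a)_i = exp(a_i) / sum_j exp(a_j), shifted by max(a) for stability."""
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(a):
    """log softmax(a)_i = a_i - log sum_j exp(a_j), computed without overflow."""
    m = max(a)
    log_sum = m + math.log(sum(math.exp(v - m) for v in a))
    return [v - log_sum for v in a]

def softmax_jacobian(a):
    """Analytic Jacobian: J[k][j] = y_k * (I_kj - y_j)."""
    y = softmax(a)
    K = len(a)
    return [[y[k] * ((1.0 if k == j else 0.0) - y[j]) for j in range(K)]
            for k in range(K)]

def numeric_jacobian(a, eps=1e-6):
    """Central finite-difference estimate of d y_k / d a_j."""
    K = len(a)
    J = [[0.0] * K for _ in range(K)]
    for j in range(K):
        up = list(a)
        dn = list(a)
        up[j] += eps
        dn[j] -= eps
        y_up, y_dn = softmax(up), softmax(dn)
        for k in range(K):
            J[k][j] = (y_up[k] - y_dn[k]) / (2 * eps)
    return J

# Hypothetical activations, chosen only for illustration
a = [2.0, -1.0, 0.5]
J_analytic = softmax_jacobian(a)
J_numeric = numeric_jacobian(a)
max_diff = max(abs(J_analytic[k][j] - J_numeric[k][j])
               for k in range(3) for j in range(3))
print("largest analytic vs numeric difference:", max_diff)
```

Each row of the Jacobian sums to zero, which reflects the constraint that the softmax outputs always sum to one.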

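The one-hot targets, the cross-entropy error, its gradient and the sequential update can be combined into a small training loop. This is a minimal sketch under stated assumptions rather than the author's implementation: the six data points, the learning rate eta = 0.1, the 200 epochs and all function names are hypothetical, and the bias is absorbed into the weights through a dummy feature $\phi_0 = 1$.

```python
import math

def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, K):
    """1-of-K target vector for a class index in 0..K-1."""
    return [1 if k == label else 0 for k in range(K)]

def predict(W, phi):
    """y_k = softmax_k(w_k^T phi); W is a list of K weight vectors."""
    return softmax([sum(wi * pi for wi, pi in zip(w, phi)) for w in W])

def cross_entropy(W, Phi, T):
    """E = -sum_n sum_k t_nk ln y_nk (only terms with t_nk = 1 contribute)."""
    return -sum(math.log(y)
                for phi, t_row in zip(Phi, T)
                for t, y in zip(t_row, predict(W, phi)) if t)

def gradient(W, Phi, T):
    """grad_{w_j} E = sum_n (y_nj - t_nj) phi_n, one vector per class j."""
    K, M = len(W), len(W[0])
    G = [[0.0] * M for _ in range(K)]
    for phi, t_row in zip(Phi, T):
        y = predict(W, phi)
        for j in range(K):
            for i in range(M):
                G[j][i] += (y[j] - t_row[j]) * phi[i]
    return G

def train_sequential(W, Phi, T, eta=0.1, epochs=200):
    """Sequential updates w_j <- w_j - eta * (y_nj - t_nj) * phi_n."""
    W = [list(w) for w in W]
    for _ in range(epochs):
        for phi, t_row in zip(Phi, T):
            y = predict(W, phi)
            for j in range(len(W)):
                for i in range(len(phi)):
                    W[j][i] -= eta * (y[j] - t_row[j]) * phi[i]
    return W

def classify(W, phi):
    y = predict(W, phi)
    return y.index(max(y))

# Hypothetical toy data: three well-separated 2-D clusters
X = [(0.0, 0.2), (0.3, 0.0), (2.0, 0.1), (2.2, -0.2), (1.0, 2.0), (0.8, 2.3)]
labels = [0, 0, 1, 1, 2, 2]
K = 3
Phi = [[1.0, x1, x2] for x1, x2 in X]      # phi_0 = 1 is the dummy feature
T = [one_hot(c, K) for c in labels]        # N x K target matrix

W0 = [[0.0] * 3 for _ in range(K)]
G0 = gradient(W0, Phi, T)
E_initial = cross_entropy(W0, Phi, T)
W = train_sequential(W0, Phi, T)
E_final = cross_entropy(W, Phi, T)
predicted = [classify(W, phi) for phi in Phi]
print("error before and after training:", E_initial, E_final)
print("predicted classes:", predicted)
```

With all weights at zero every class has probability 1/3, so the initial error equals N ln K. Because these clusters are linearly separable, the error keeps shrinking as the weights grow; on overlapping data a regulariser or a stopping rule would be a sensible addition.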
The Hessian matrix is positive semi-definite, so the error function is convex and any minimum found is a global minimum. A batch algorithm can be based on the Newton-Raphson update.

Summary of Logistic Regression concepts
- Definition of gradient and Hessian
- Gradient and Hessian in Linear Regression
- Gradient and Hessian in 2-class Logistic Regression

Definitions of Gradient and Hessian
The first derivative of a scalar function $E(w)$ with respect to a vector $w = [w_1, w_2]^T$ is a vector called the gradient of $E(w)$:
$$\nabla E(w) = \frac{d}{dw} E(w) = \begin{pmatrix} \dfrac{\partial E}{\partial w_1} \\[2mm] \dfrac{\partial E}{\partial w_2} \end{pmatrix}$$
The second derivative of $E(w)$ is a matrix called the Hessian:
$$H = \nabla\nabla E(w) = \frac{d^2}{dw^2} E(w) = \begin{pmatrix} \dfrac{\partial^2 E}{\partial w_1^2} & \dfrac{\partial^2 E}{\partial w_1 \partial w_2} \\[2mm] \dfrac{\partial^2 E}{\partial w_2 \partial w_1} & \dfrac{\partial^2 E}{\partial w_2^2} \end{pmatrix}$$
The Jacobian matrix consists of the first derivatives of a vector-valued function with respect to a vector.
If there are M elements in the vector, then the gradient is an M × 1 vector and the Hessian is a matrix with $M^2$ elements.

Use of Gradient & Hessian in ML
[Figure: error surface for M = 2, a paraboloid with a single global minimum]
For Stochastic Gradient Descent we need $\nabla E_n(w)$, where $E(w) = \sum_n E_n(w)$:
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n(w)$$
For the Newton-Raphson update we need both $\nabla E(w)$ and $H = \nabla\nabla E(w)$:
$$w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)$$
Training samples: n = 1, ..., N. Inputs: M × 1 vectors. Outputs: $t = (t_1, \ldots, t_N)^T$.
For Linear Regression (sum-of-squared error):
$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \left[ w^T \phi(x_n) - t_n \right]^2$$
where $w = (w_0, w_1, \ldots, w_{M-1})^T$ and $\phi(x_n) = (\phi_0(x_n), \phi_1(x_n), \ldots, \phi_{M-1}(x_n))^T$.
For Logistic Regression (cross-entropy error):
$$E(w) = -\ln p(t \mid w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}, \qquad y_n = \sigma(w^T \phi(x_n))$$
The design matrix, with dummy feature $\phi_0(x) = 1$, is
$$\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}$$

Gradient and Hessian for Linear Regression
Sum-of-squared errors (equivalent to maximum likelihood):
$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \left[ w^T \phi(x_n) - t_n \right]^2$$
Gradient of E:
$$\nabla E(w) = \sum_{n=1}^{N} \left[ w^T \phi_n - t_n \right] \phi_n = \Phi^T \Phi w - \Phi^T t$$
Hessian of E:
$$H = \nabla\nabla E(w) = \Phi^T \Phi$$
Newton-Raphson:
$$w^{(new)} = w^{(old)} - (\Phi^T \Phi)^{-1} \left\{ \Phi^T \Phi\, w^{(old)} - \Phi^T t \right\} = (\Phi^T \Phi)^{-1} \Phi^T t$$
which is the same solution that gradient descent converges to, the maximum-likelihood solution $w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t$.

Gradient & Hessian: 2-class Logistic Regression
Cross-entropy error:
$$E(w) = -\ln p(t \mid w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}, \qquad y_n = \sigma(w^T \phi(x_n))$$
Gradient of E:
$$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi(x_n) = \Phi^T (y - t)$$
Hessian of E:
$$H = \nabla\nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n)\, \phi(x_n) \phi^T(x_n) = \Phi^T R\, \Phi$$
R is an N × N diagonal matrix with elements $R_{nn} = y_n (1 - y_n) = \sigma(w^T \phi(x_n)) \left( 1 - \sigma(w^T \phi(x_n)) \right)$, with $y = (y_1, \ldots, y_N)^T$ and $t = (t_1, \ldots, t_N)^T$.
The Hessian is not constant and depends on w through R.
Since H is positive-definite (i.e., for arbitrary nonzero u, $u^T H u > 0$), the error function is a convex function of w and so has a unique minimum.

Gradient & Hessian: Multi-class Logistic Regression
Cross-entropy error:
$$E(w_1, \ldots, w_K) = -\ln p(T \mid w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}, \qquad y_{nk} = y_k(\phi(x_n)) = \frac{\exp(w_k^T \phi(x_n))}{\sum_j \exp(w_j^T \phi(x_n))}$$
Gradient of E:
$$\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, \phi(x_n)$$
Hessian of E:
$$\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj})\, \phi(x_n) \phi^T(x_n)$$
Each element of the Hessian needs M multiplications and additions. Since there are $M^2$ elements in the matrix, the computation is $O(M^3)$.

An Example of 3-class Logistic Regression
[Table: input data, with $\phi_0(x) = 1$ as a dummy feature]

Three-class Logistic Regression
[Figure: three initial weight vectors, gradient, and Hessian (9 × 9, with some 3 × 3 blocks repeated)]

Final Weight Vector, Gradient and Hessian (3-class)
[Figure: final weight vector, gradient and Hessian; initial and final error]
Number of iterations: 6
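For the sum-of-squares error the Hessian $\Phi^T \Phi$ does not depend on w, so a single Newton-Raphson step reaches the least-squares solution from any starting point. The sketch below illustrates this for $\phi(x) = (1, x)$; the data values and the function name are hypothetical, and the 2 × 2 linear system is solved in closed form to keep the example free of external libraries.

```python
def newton_step_linear(w, xs, ts):
    """One Newton-Raphson step for E(w) = 1/2 sum_n (w^T phi_n - t_n)^2
    with phi(x) = (1, x), so that w = (w0, w1)."""
    g0 = g1 = 0.0             # gradient  Phi^T Phi w - Phi^T t
    h00 = h01 = h11 = 0.0     # Hessian   Phi^T Phi (symmetric 2 x 2)
    for x, t in zip(xs, ts):
        err = w[0] + w[1] * x - t
        g0 += err
        g1 += err * x
        h00 += 1.0
        h01 += x
        h11 += x * x
    det = h00 * h11 - h01 * h01
    # Solve H d = g with the closed-form inverse of a 2 x 2 matrix
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return [w[0] - d0, w[1] - d1]

# Hypothetical data, roughly t = 1 + 2x with noise
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.1, 4.9, 7.2]

w_a = newton_step_linear([0.0, 0.0], xs, ts)    # start at the origin
w_b = newton_step_linear([5.0, -3.0], xs, ts)   # start somewhere else
w_c = newton_step_linear(w_a, xs, ts)           # a second step changes nothing
print(w_a, w_b, w_c)
```

Both starting points give the same weights up to rounding, and a further step leaves them unchanged because the gradient is already zero there.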

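For two-class logistic regression the Hessian $\Phi^T R\, \Phi$ changes with w, so the Newton-Raphson update has to be iterated (IRLS). The following is a minimal sketch, assuming $\phi(x) = (1, x)$ and a small hypothetical data set with overlapping classes so that the maximum-likelihood solution is finite; the function names, the iteration cap and the stopping tolerance are illustrative choices, not values from the slides.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def irls_step(w, xs, ts):
    """One Newton-Raphson step w <- w - H^{-1} grad E for phi(x) = (1, x).
    Returns the new weights and the gradient norm at the old weights."""
    g0 = g1 = 0.0             # gradient  Phi^T (y - t)
    h00 = h01 = h11 = 0.0     # Hessian   Phi^T R Phi (symmetric 2 x 2)
    for x, t in zip(xs, ts):
        y = sigmoid(w[0] + w[1] * x)
        r = y * (1.0 - y)     # diagonal element R_nn
        g0 += y - t
        g1 += (y - t) * x
        h00 += r
        h01 += r * x
        h11 += r * x * x
    det = h00 * h11 - h01 * h01
    # Solve H d = g with the closed-form inverse of a 2 x 2 matrix
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return [w[0] - d0, w[1] - d1], math.hypot(g0, g1)

# Hypothetical data: the classes overlap, so the weights stay finite
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ts = [0, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
for iteration in range(25):
    w, grad_norm = irls_step(w, xs, ts)
    if grad_norm < 1e-10:
        break
print("weights:", w, "gradient norm:", grad_norm)
```

On linearly separable data the weights would grow without bound and R would approach zero, so a practical implementation would typically add regularisation or a limit on the step size.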