CHAPTER Logistic Regression - Stanford University


Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2019. All rights reserved. Draft of October 2, 2019.

CHAPTER 5: Logistic Regression

"And how do you know that these fine begonias are not of equal importance?" Hercule Poirot, in Agatha Christie's The Mysterious Affair at Styles

Detective stories are as littered with clues as texts are with words. Yet for the poor reader it can be challenging to know how to weigh the author's clues in order to make the crucial classification task: deciding whodunnit. In this chapter we introduce an algorithm that is admirably suited for discovering the link between features or cues and some particular outcome: logistic regression. Indeed, logistic regression is one of the most important analytic tools in the social and natural sciences.

In natural language processing, logistic regression is the baseline supervised machine learning algorithm for classification, and also has a very close relationship with neural networks. As we will see in Chapter 7, a neural network can be viewed as a series of logistic regression classifiers stacked on top of each other. Thus the classification and machine learning techniques introduced here will play an important role throughout the book.

Logistic regression can be used to classify an observation into one of two classes (like "positive sentiment" and "negative sentiment"), or into one of many classes. Because the mathematics for the two-class case is simpler, we'll describe this special case of logistic regression first in the next few sections, and then briefly summarize the use of multinomial logistic regression for more than two classes in a later section. We'll introduce the mathematics of logistic regression in the next few sections. But let's begin with some high-level issues.

Generative and Discriminative Classifiers

The most important difference between naive Bayes and logistic regression is that logistic regression is a discriminative classifier while naive Bayes is a generative classifier. These are two very different frameworks for how to build a machine learning model. Consider a visual metaphor: imagine we're trying to distinguish dog images from cat images. A generative model would have the goal of understanding what dogs look like and what cats look like. You might literally ask such a model to generate, i.e. draw, a dog. Given a test image, the system then asks whether it's the cat model or the dog model that better fits (is less surprised by) the image, and chooses that as its label. A discriminative model, by contrast, is only trying to learn to distinguish the classes (perhaps without learning much about them).

So maybe all the dogs in the training data are wearing collars and the cats aren't. If that one feature neatly separates the classes, the model is satisfied. If you ask such a model what it knows about cats all it can say is that they don't wear collars.

More formally, recall that the naive Bayes classifier assigns a class c to a document d not by directly computing P(c|d) but by computing a likelihood and a prior:

\hat{c} = \operatorname*{argmax}_{c \in C} \; \overbrace{P(d \mid c)}^{\text{likelihood}} \; \overbrace{P(c)}^{\text{prior}}

A generative model like naive Bayes makes use of this likelihood term, which expresses how to generate the features of a document if we knew it was of class c. By contrast, a discriminative model in this text categorization scenario attempts to directly compute P(c|d).
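To make the generative decision rule above concrete, here is a minimal Python sketch. The class names, word probabilities, and the tiny floor for unseen words are illustrative assumptions rather than values from the text, and the per-word independence is the naive Bayes simplification.

```python
import math

# Toy prior P(c) and per-word likelihoods P(w|c) for two classes.
# All numbers are illustrative, not estimated from real data.
prior = {"positive": 0.5, "negative": 0.5}
word_likelihood = {
    "positive": {"great": 0.10, "awful": 0.01},
    "negative": {"great": 0.01, "awful": 0.10},
}

def generative_classify(doc_words):
    """Return argmax_c P(d|c) P(c), computed in log space for numerical stability."""
    best_class, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])                        # log prior
        for w in doc_words:
            # naive Bayes independence: the document likelihood is a
            # product of per-word likelihoods (tiny floor for unseen words)
            score += math.log(word_likelihood[c].get(w, 1e-6))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(generative_classify(["great", "great", "awful"]))   # -> positive
```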

Perhaps such a discriminative model will learn to assign a high weight to document features that directly improve its ability to discriminate between possible classes, even if it couldn't generate an example of one of the classes.

Components of a probabilistic machine learning classifier: Like naive Bayes, logistic regression is a probabilistic classifier that makes use of supervised machine learning. Machine learning classifiers require a training corpus of M input/output pairs (x^(i), y^(i)). (We'll use superscripts in parentheses to refer to individual instances in the training set; for sentiment classification each instance might be an individual document to be classified.) A machine learning system for classification then has four components:

1. A feature representation of the input. For each input observation x^(i), this will be a vector of features [x_1, x_2, ..., x_n]. We will generally refer to feature i for input x^(j) as x_i^(j), sometimes simplified as x_i, but we will also see the notation f_i, f_i(x), or, for multiclass classification, f_i(c, x).
2. A classification function that computes ŷ, the estimated class, via p(y|x). In the next section we will introduce the sigmoid and softmax tools for classification.
3. An objective function for learning, usually involving minimizing error on training examples. We will introduce the cross-entropy loss function.
4. An algorithm for optimizing the objective function. We introduce the stochastic gradient descent algorithm.

Logistic regression has two phases:

training: we train the system (specifically the weights w and b) using stochastic gradient descent and the cross-entropy loss.

test: given a test example x we compute p(y|x) and return the higher-probability label, y = 1 or y = 0.
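As a rough illustration of how these four components and the two phases fit together, here is a minimal sketch in Python. The feature values, weights, and learning rate are invented for illustration, and the sigmoid and cross-entropy loss it uses are exactly the tools defined in the sections that follow.

```python
import math

# 1. Feature representation: the input observation as a vector of features.
x = [3.0, 2.0, 1.0]          # illustrative feature values for one document
y = 1                        # gold label for this training example

# 2. Classification function: p(y=1|x) from weights w and bias b via the sigmoid.
w, b = [0.5, -0.3, 0.2], 0.1
z = sum(wi * xi for wi, xi in zip(w, x)) + b
p = 1.0 / (1.0 + math.exp(-z))          # the sigmoid, introduced in the next section

# 3. Objective function: cross-entropy loss for this single training example.
loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))

# 4. Optimization: one stochastic gradient descent step on w and b (training phase).
lr = 0.1
w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
b = b - lr * (p - y)

# Test phase: return the higher-probability label for a new example.
def classify(w, b, x_new):
    z_new = sum(wi * xi for wi, xi in zip(w, x_new)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z_new)) >= 0.5 else 0

print(round(p, 3), round(loss, 3), classify(w, b, [2.0, 1.0, 0.0]))
```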

Classification: the sigmoid

The goal of binary logistic regression is to train a classifier that can make a binary decision about the class of a new input observation. Here we introduce the sigmoid classifier that will help us make this decision.

Consider a single input observation x, which we will represent by a vector of features [x_1, x_2, ..., x_n] (we'll show sample features in the next subsection). The classifier output y can be 1 (meaning the observation is a member of the class) or 0 (the observation is not a member of the class). We want to know the probability P(y=1|x) that this observation is a member of the class. So perhaps the decision is "positive sentiment" versus "negative sentiment", the features represent counts of words in a document, P(y=1|x) is the probability that the document has positive sentiment, and P(y=0|x) is the probability that the document has negative sentiment.

Logistic regression solves this task by learning, from a training set, a vector of weights and a bias term.
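Before turning to the weights, here is a small sketch of count-based features of the kind just described, mapping a document to a two-element feature vector. The word lists and the example sentence are invented for illustration.

```python
from collections import Counter

# Illustrative sentiment word lists (not from the text).
positive_words = {"great", "excellent", "awesome"}
negative_words = {"awful", "terrible", "abysmal"}

def extract_features(doc):
    """Map a document to [count of positive-lexicon words, count of negative-lexicon words]."""
    counts = Counter(doc.lower().split())
    x1 = sum(counts[w] for w in positive_words)
    x2 = sum(counts[w] for w in negative_words)
    return [x1, x2]

print(extract_features("Great plot , awesome cast , awful ending"))   # -> [2, 1]
```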

Each weight w_i is a real number, and is associated with one of the input features x_i. The weight w_i represents how important that input feature is to the classification decision, and can be positive (meaning the feature is associated with the class) or negative (meaning the feature is not associated with the class). Thus we might expect in a sentiment task the word awesome to have a high positive weight, and abysmal to have a very negative weight. The bias term, also called the intercept, is another real number that's added to the weighted inputs.

To make a decision on a test instance, after we've learned the weights in training, the classifier first multiplies each x_i by its weight w_i, sums up the weighted features, and adds the bias term b.

The resulting single number z expresses the weighted sum of the evidence for the class:

z = \left( \sum_{i=1}^{n} w_i x_i \right) + b

In the rest of the book we'll represent such sums using the dot product notation from linear algebra. The dot product of two vectors a and b, written as a · b, is the sum of the products of the corresponding elements of each vector. Thus the following is an equivalent formulation:

z = w \cdot x + b

But note that nothing in this equation forces z to be a legal probability, that is, to lie between 0 and 1. In fact, since weights are real-valued, the output might even be negative; z ranges from −∞ to ∞.
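A quick sketch of the two equivalent ways of computing z; the weights, features, and bias below are made-up numbers, and NumPy is used only to show the dot-product form.

```python
import numpy as np

# Illustrative weights, features, and bias (not values from the text).
w = [2.5, -5.0, 1.2]
x = [3.0, 2.0, 1.0]
b = 0.1

# Summation form: z = (sum_i w_i * x_i) + b
z_sum = sum(wi * xi for wi, xi in zip(w, x)) + b

# Equivalent dot-product form: z = w . x + b
z_dot = float(np.dot(w, x)) + b

print(z_sum, z_dot)   # both approximately -1.2; note that z is not confined to [0, 1]
```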

Figure: The sigmoid function y = 1/(1+e^(−z)) takes a real value and maps it to the range [0, 1]. It is nearly linear around 0 but outlier values get squashed toward 0 or 1.

To create a probability, we'll pass z through the sigmoid function, σ(z). The sigmoid function (named because it looks like an s) is also called the logistic function, and gives logistic regression its name. The sigmoid has the following equation, shown graphically in the figure above:

y = \sigma(z) = \frac{1}{1 + e^{-z}}

The sigmoid has a number of advantages: it takes a real-valued number and maps it into the range [0, 1], which is just what we want for a probability. Because it is nearly linear around 0 but has a sharp slope toward the ends, it tends to squash outlier values toward 0 or 1.
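A minimal implementation of the sigmoid, with a few sample values to show the squashing behavior described above (the specific z values chosen are arbitrary):

```python
import math

def sigmoid(z):
    """The logistic (sigmoid) function: maps any real z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Near 0 the sigmoid is roughly linear; large-magnitude z gets squashed toward 0 or 1.
for z in [-10.0, -1.2, 0.0, 1.2, 10.0]:
    print(z, round(sigmoid(z), 4))
# -10.0 0.0
#  -1.2 0.2315
#   0.0 0.5
#   1.2 0.7685
#  10.0 1.0
```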

