Transcription of CS229 Lecture Notes
1 CS229 Lecture NotesAndrew Ng(updates by Tengyu Ma)Supervised learningLet s start by talking about a few examples of supervised learning we have a dataset giving the living areas and prices of 47 housesfrom Portland, Oregon:Living area (feet2)Price (1000$s) can plot this data:50010001500200025003000350040004500 500001002003004005006007008009001000hous ing pricessquare feetprice (in $1000)1CS229 Fall 20182 Given data like this, how can we learn to predict the prices of other housesin Portland, as a function of the size of their living areas?To establish notation for future use, we ll usex(i)to denote the input variables (living area in this example), also called inputfeatures, andy(i)to denote the output ortargetvariable that we are trying to predict(price).
2 A pair (x(i),y(i)) is called atraining example, and the datasetthat we ll be using to learn a list ofntraining examples{(x(i),y(i));i=1,..,n} is called atraining set. Note that the superscript (i) in thenotation is simply an index into the training set, and has nothing to do withexponentiation. We will also useXdenote the space of input values, andYthe space of output values. In this example,X=Y= describe the supervised learning problem slightly more formally, ourgoal is, given a training set, to learn a functionh:X 7 Yso thath(x) is a good predictor for the corresponding value ofy.
3 For historical reasons, thisfunctionhis called ahypothesis. Seen pictorially, the process is thereforelike this:Training set house.)(living area ofLearning algorithmhpredicted yx(predicted price)of house)When the target variable that we re trying to predict is continuous, suchas in our housing example, we call the learning problem aregressionprob-lem. Whenycan take on only a small number of discrete values (such asif, given the living area, we wanted to predict if a dwelling is a house or anapartment, say), we call it ILinear RegressionTo make our housing example more interesting, let s consider a slightly richerdataset in which we also know the number of bedrooms in each house:Living area (feet2)#bedroomsPrice (1000$s) , thex s are two-dimensional vectors inR2.
4 For instance,x(i)1is theliving area of thei-th house in the training set, andx(i)2is its number ofbedrooms. (In general, when designing a learning problem, it will be up toyou to decide what features to choose, so if you are out in Portland gatheringhousing data, you might also decide to include other features such as whethereach house has a fireplace, the number of bathrooms, and so on. We ll saymore about feature selection later, but for now let s take the features asgiven.)To perform supervised learning, we must decide how we re going to rep-resent functions/hypotheseshin a computer.
5 As an initial choice, let s saywe decide to approximateyas a linear function ofx:h (x) = 0+ 1x1+ 2x2 Here, the i s are theparameters(also calledweights) parameterizing thespace of linear functions mapping fromXtoY. When there is no risk ofconfusion, we will drop the subscript inh (x), and write it more simply ash(x). To simplify our notation, we also introduce the convention of lettingx0= 1 (this is theintercept term), so thath(x) =d i=0 ixi= Tx,where on the right-hand side above we are viewing andxboth as vectors,and heredis the number of input variables (not countingx0).4 Now, given a training set, how do we pick, or learn, the parameters ?
6 One reasonable method seems to be to makeh(x) close toy, at least forthe training examples we have. To formalize this, we will define a functionthat measures, for each value of the s, how close theh(x(i)) s are to thecorrespondingy(i) s. We define thecost function:J( ) =12n i=1(h (x(i)) y(i)) you ve seen linear regression before, you may recognize this as the familiarleast-squares cost function that gives rise to theordinary least squaresregression model. Whether or not you have seen it previously, let s keepgoing, and we ll eventually show this to be a special case of a much broaderfamily of LMS algorithmWe want to choose so as to minimizeJ( ).
7 To do so, let s use a searchalgorithm that starts with some initial guess for , and that repeatedlychanges to makeJ( ) smaller, until hopefully we converge to a value of that minimizesJ( ). Specifically, let s consider thegradient descentalgorithm, which starts with some initial , and repeatedly performs theupdate: j:= j jJ( ).(This update is simultaneously performed for all values ofj= 0,..,d.)Here, is called thelearning rate. This is a very natural algorithm thatrepeatedly takes a step in the direction of steepest decrease order to implement this algorithm, we have to work out what is thepartial derivative term on the right hand side.
8 Let s first work it out for thecase of if we have only one training example (x,y), so that we can neglectthe sum in the definition ofJ. We have: jJ( ) = j12(h (x) y)2= 2 12(h (x) y) j(h (x) y)= (h (x) y) j(d i=0 ixi y)= (h (x) y)xj5 For a single training example, this gives the update rule:1 j:= j+ (y(i) h (x(i)))x(i) rule is called theLMSupdate rule (LMS stands for least mean squares ),and is also known as theWidrow-Hofflearning rule. This rule has severalproperties that seem natural and intuitive. For instance, the magnitude ofthe update is proportional to theerrorterm (y(i) h (x(i))); thus, for in-stance, if we are encountering a training example on which our predictionnearly matches the actual value ofy(i), then we find that there is little needto change the parameters; in contrast, a larger change to the parameters willbe made if our predictionh (x(i)) has a large error ( , if it is very far fromy(i)).
9 We d derived the LMS rule for when there was only a single trainingexample. There are two ways to modify this method for a training set ofmore than one example. The first is replace it with the following algorithm:Repeat until convergence{ j:= j+ n i=1(y(i) h (x(i)))x(i)j,(for everyj)(1)}By grouping the updates of the coordinates into an update of the vector , we can rewrite update (1) in a slightly more succinct way: := + n i=1(y(i) h (x(i)))x(i)The reader can easily verify that the quantity in the summation in theupdate rule above is just J( )/ j(for the original definition ofJ).
10 So, thisis simply gradient descent on the original cost functionJ. This method looksat every example in the entire training set on every step, and is calledbatchgradient descent. Note that, while gradient descent can be susceptibleto local minima in general, the optimization problem we have posed here1We use the notation a:=b to denote an operation (in a computer program) inwhich wesetthe value of a variableato be equal to the value ofb. In other words, thisoperation overwritesawith the value ofb. In contrast, we will write a=b when we areasserting a statement of fact, that the value ofais equal to the value linear regression has only one global, and no other local, optima; thusgradient descent always converges (assuming the learning rate is not toolarge) to the global minimum.