
Part V Support Vector Machines

CS229 Lecture notes, Andrew Ng.

This set of notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe are indeed the best) "off-the-shelf" supervised learning algorithms. To tell the SVM story, we'll need to first talk about margins and the idea of separating data with a large "gap." Next, we'll talk about the optimal margin classifier, which will lead us into a digression on Lagrange duality. We'll also see kernels, which give a way to apply SVMs efficiently in very high dimensional (such as infinite-dimensional) feature spaces, and finally, we'll close off the story with the SMO algorithm, which gives an efficient implementation of SVMs.

1 Margins: Intuition

We'll start our story on SVMs by talking about margins. This section will give the intuitions about margins and about the "confidence" of our predictions; these ideas will be made formal in Section 3.

Consider logistic regression, where the probability p(y = 1 | x; θ) is modeled by h_θ(x) = g(θ^T x).

We would then predict "1" on an input x if and only if h_θ(x) ≥ 0.5, or equivalently, if and only if θ^T x ≥ 0. Consider a positive training example (y = 1). The larger θ^T x is, the larger also is h_θ(x) = p(y = 1 | x; w, b), and thus also the higher our degree of "confidence" that the label is 1. Thus, informally we can think of our prediction as being a very confident one that y = 1 if θ^T x ≫ 0. Similarly, we think of logistic regression as making a very confident prediction of y = 0 if θ^T x ≪ 0. Given a training set, again informally it seems that we'd have found a good fit to the training data if we can find θ so that θ^T x^{(i)} ≫ 0 whenever y^{(i)} = 1, and θ^T x^{(i)} ≪ 0 whenever y^{(i)} = 0, since this would reflect a very confident (and correct) set of classifications for all the training examples. This seems to be a nice goal to aim for, and we'll soon formalize this idea using the notion of functional margins.

For a different type of intuition, consider the following figure, in which x's represent positive training examples, o's denote negative training examples, a decision boundary (this is the line given by the equation θ^T x = 0, and is also called the separating hyperplane) is also shown, and three points have also been labeled A, B and C.
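Before turning to the figure, here is a minimal numpy sketch of the confidence intuition above (the parameter vector theta and the inputs x are hypothetical values chosen only for illustration; they are not from the notes): as θ^T x moves far above or below zero, h_θ(x) = g(θ^T x) saturates toward 1 or 0.

    import numpy as np

    def sigmoid(z):
        """Logistic function g(z) = 1 / (1 + e^{-z})."""
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical parameters and inputs, purely for illustration.
    theta = np.array([2.0, -1.0])
    for x in [np.array([0.1, 0.1]),    # theta^T x near 0: barely more confident than 0.5
              np.array([3.0, -2.0]),   # theta^T x >> 0: very confident that y = 1
              np.array([-3.0, 2.0])]:  # theta^T x << 0: very confident that y = 0
        z = theta @ x
        print(f"theta^T x = {z:+.2f}  ->  h_theta(x) = {sigmoid(z):.4f}")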

[Figure: positive (x) and negative (o) training examples, the separating hyperplane θ^T x = 0, and three labeled points A, B and C.]

Notice that the point A is very far from the decision boundary. If we are asked to make a prediction for the value of y at A, it seems we should be quite confident that y = 1 there. Conversely, the point C is very close to the decision boundary, and while it's on the side of the decision boundary on which we would predict y = 1, it seems likely that just a small change to the decision boundary could easily have caused our prediction to be y = 0. Hence, we're much more confident about our prediction at A than at C. The point B lies in-between these two cases, and more broadly, we see that if a point is far from the separating hyperplane, then we may be significantly more confident in our predictions. Again, informally we think it'd be nice if, given a training set, we manage to find a decision boundary that allows us to make all correct and confident (meaning far from the decision boundary) predictions on the training examples. We'll formalize this later using the notion of geometric margins.

2 Notation

To make our discussion of SVMs easier, we'll first need to introduce a new notation for talking about classification. We will be considering a linear classifier for a binary classification problem with labels y and features x. From now on, we'll use y ∈ {−1, 1} (instead of {0, 1}) to denote the class labels. Also, rather than parameterizing our linear classifier with the vector θ, we will use parameters w, b, and write our classifier as

    h_{w,b}(x) = g(w^T x + b).

Here, g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise.
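As a hedged sketch of this notation (the parameter values below are hypothetical, not from the notes), the classifier outputs a hard label in {−1, 1} directly, with no intermediate probability estimate:

    import numpy as np

    def g(z):
        """g(z) = 1 if z >= 0, and -1 otherwise."""
        return 1 if z >= 0 else -1

    def h(w, b, x):
        """Linear classifier h_{w,b}(x) = g(w^T x + b)."""
        return g(w @ x + b)

    # Hypothetical parameters and inputs, just to exercise the definition.
    w, b = np.array([1.0, -2.0]), 0.5
    print(h(w, b, np.array([3.0, 1.0])))   # w^T x + b =  1.5 >= 0  ->  1
    print(h(w, b, np.array([0.0, 1.0])))   # w^T x + b = -1.5 <  0  -> -1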

4 B" notationallowsusto explicitlytreattheintercepttermbseparate lyfromtheotherparameters.(We alsodroptheconventionwe hadpreviouslyof lettingx0= 1be anextracoordinatein theinputfeaturevector.)Thus,btakes theroleofwhatwas previously 0, andwtakes theroleof [ 1: : : n] ,fromourde nitionofgabove, ourclassi erwilldirectlypredicteither1 or 1 ( ),without rstgoingthroughtheintermediatestepof estimatingtheprobability ofybeing1(which was whatlogisticregressiondid).3 FunctionalandgeometricmarginsLetsformali zethenotionsof atrainingexample(x(i); y(i)), we de nethefunctionalmarginof (w; b) withrespectto thetrainingexample^ (i)=y(i)(wTx+b):Notethatify(i)= 1, thenforthefunctionalmarginto be large( ,forourpredictionto be con dent andcorrect),thenwe needwTx+bto be a largepositive , ify(i)= 1, thenforthefunctionalmargintobe large,thenwe needwTx+bto be a largenegative ,ify(i)(wTx+b)>0, thenourpredictiononthisexampleis correct.

Hence, a large functional margin represents a confident and correct prediction.

For a linear classifier with the choice of g given above (taking values in {−1, 1}), there's one property of the functional margin that makes it not a very good measure of confidence, however. Given our choice of g, we note that if we replace w with 2w and b with 2b, then since g(w^T x + b) = g(2w^T x + 2b), this would not change h_{w,b}(x) at all. I.e., g, and hence also h_{w,b}(x), depends only on the sign, but not on the magnitude, of w^T x + b. However, replacing (w, b) with (2w, 2b) also results in multiplying our functional margin by a factor of 2. Thus, it seems that by exploiting our freedom to scale w and b, we can make the functional margin arbitrarily large without really changing anything meaningful. Intuitively, it might therefore make sense to impose some sort of normalization condition such as that ||w||_2 = 1; i.e., we might replace (w, b) with (w/||w||_2, b/||w||_2), and instead consider the functional margin of (w/||w||_2, b/||w||_2).
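The scaling issue is easy to check numerically; in this hypothetical sketch, rescaling (w, b) inflates the functional margin at will while the prediction itself never changes.

    import numpy as np

    w, b = np.array([1.0, 1.0]), -1.0          # hypothetical parameters
    x, y = np.array([2.0, 2.0]), 1             # hypothetical training example

    for c in [1, 2, 10]:
        wc, bc = c * w, c * b
        margin = y * (wc @ x + bc)             # functional margin grows with the scale c...
        label = 1 if wc @ x + bc >= 0 else -1  # ...but the predicted label stays the same
        print(f"scale={c:2d}  functional margin={margin:5.1f}  prediction={label:+d}")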

We'll come back to this later.

Given a training set S = {(x^{(i)}, y^{(i)}); i = 1, ..., m}, we also define the functional margin of (w, b) with respect to S as the smallest of the functional margins of the individual training examples. Denoted by γ̂, this can therefore be written:

    γ̂ = min_{i=1,...,m} γ̂^{(i)}.

Next, let's talk about geometric margins. Consider the picture below:

[Figure: the decision boundary for (w, b) together with the vector w; a positive training example at A and its projection B onto the boundary, at distance γ^{(i)}.]

The decision boundary corresponding to (w, b) is shown, along with the vector w. Note that w is orthogonal (at 90°) to the separating hyperplane. (You should convince yourself that this must be the case.) Consider the point at A, which represents the input x^{(i)} of some training example with label y^{(i)} = 1. Its distance to the decision boundary, γ^{(i)}, is given by the line segment AB.

How can we find the value of γ^{(i)}? Well, w/||w|| is a unit-length vector pointing in the same direction as w. Since A represents x^{(i)}, we therefore find that the point B is given by x^{(i)} − γ^{(i)} · w/||w||.
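A quick numerical check of this construction (with a hypothetical hyperplane and point, using the expression for γ^{(i)} that is solved for just below): stepping from A a distance γ^{(i)} against the unit vector w/||w|| lands exactly on the decision boundary.

    import numpy as np

    # Hypothetical hyperplane w^T x + b = 0 and a positive example at A.
    w, b = np.array([3.0, 4.0]), -5.0
    x_A = np.array([3.0, 3.0])

    gamma = (w @ x_A + b) / np.linalg.norm(w)   # distance of A to the boundary (derived below)
    x_B = x_A - gamma * w / np.linalg.norm(w)   # B = x_A - gamma * w/||w||

    print("gamma       =", gamma)               # 3.2
    print("w^T x_B + b =", w @ x_B + b)         # ~0, i.e. B lies on the hyperplane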

But this point lies on the decision boundary, and all points x on the decision boundary satisfy the equation w^T x + b = 0. Hence,

    w^T (x^{(i)} − γ^{(i)} w/||w||) + b = 0.

Solving for γ^{(i)} yields

    γ^{(i)} = (w^T x^{(i)} + b)/||w|| = (w/||w||)^T x^{(i)} + b/||w||.

This was worked out for the case of a positive training example at A in the figure, where being on the "positive" side of the decision boundary is good. More generally, we define the geometric margin of (w, b) with respect to a training example (x^{(i)}, y^{(i)}) to be

    γ^{(i)} = y^{(i)} ((w/||w||)^T x^{(i)} + b/||w||).

Note that if ||w|| = 1, then the functional margin equals the geometric margin; this thus gives us a way of relating these two different notions of margin. Also, the geometric margin is invariant to rescaling of the parameters; i.e., if we replace w with 2w and b with 2b, then the geometric margin does not change. This will in fact come in handy later. Specifically, because of this invariance to the scaling of the parameters, when trying to fit w and b to training data, we can impose an arbitrary scaling constraint on w without changing anything important; for instance, we can demand that ||w|| = 1, or |w_1| = 5, or |w_1 + b| + |w_2| = 2, and any of these can be satisfied simply by rescaling w and b.
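The invariance is again easy to verify numerically; in this hypothetical sketch, rescaling (w, b) by any constant c leaves the geometric margin untouched (contrast this with the functional margin above).

    import numpy as np

    def geometric_margin(w, b, x, y):
        """Geometric margin: gamma = y * ((w/||w||)^T x + b/||w||)."""
        norm = np.linalg.norm(w)
        return y * ((w / norm) @ x + b / norm)

    # Hypothetical example; rescaling (w, b) -> (c*w, c*b) changes nothing.
    w, b = np.array([3.0, 4.0]), -5.0
    x, y = np.array([3.0, 3.0]), 1
    for c in [1.0, 2.0, 0.1]:
        print(f"c={c:4.1f}  geometric margin = {geometric_margin(c * w, c * b, x, y):.4f}")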

Finally, given a training set S = {(x^{(i)}, y^{(i)}); i = 1, ..., m}, we also define the geometric margin of (w, b) with respect to S to be the smallest of the geometric margins on the individual training examples:

    γ = min_{i=1,...,m} γ^{(i)}.

4 The optimal margin classifier

Given a training set, it seems from our previous discussion that a natural desideratum is to try to find a decision boundary that maximizes the (geometric) margin, since this would reflect a very confident set of predictions on the training set and a good "fit" to the training data. Specifically, this will result in a classifier that separates the positive and the negative training examples with a "gap" (geometric margin).

For now, we will assume that we are given a training set that is linearly separable; i.e., that it is possible to separate the positive and negative examples using some separating hyperplane. How will we find the one that achieves the maximum geometric margin? We can pose the following optimization problem:

    max_{γ,w,b}  γ
    s.t.  y^{(i)}(w^T x^{(i)} + b) ≥ γ,  i = 1, ..., m
          ||w|| = 1.

I.e., we want to maximize γ, subject to each training example having functional margin at least γ.
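To make the problem statement concrete (this is emphatically not how SVMs are trained; it is only a brute-force illustration on a made-up, linearly separable toy set), one can search over unit-norm w and intercepts b for the pair whose smallest margin is largest. Because ||w|| = 1 is enforced by construction, the quantity being maximized is also the geometric margin, as discussed next.

    import numpy as np

    # Made-up, linearly separable toy set with labels in {-1, 1}.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
    y = np.array([1, 1, -1, -1])

    best_margin, best_w, best_b = -np.inf, None, None
    for angle in np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False):
        w = np.array([np.cos(angle), np.sin(angle)])   # ||w|| = 1 by construction
        for b in np.linspace(-5.0, 5.0, 201):
            margin = np.min(y * (X @ w + b))           # smallest margin over the training set
            if margin > best_margin:
                best_margin, best_w, best_b = margin, w, b

    print("best margin:", round(best_margin, 3), " w:", best_w, " b:", best_b)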

The ||w|| = 1 constraint moreover ensures that the functional margin equals the geometric margin, so we are also guaranteed that all the geometric margins are at least γ. Thus, solving this problem will result in (w, b) with the largest possible geometric margin with respect to the training set.

If we could solve the optimization problem above, we'd be done. But the "||w|| = 1" constraint is a nasty (non-convex) one, and this problem certainly isn't in any format that we can plug into standard optimization software to solve. So, let's try transforming the problem into a nicer one. Consider:

    max_{γ̂,w,b}  γ̂/||w||
    s.t.  y^{(i)}(w^T x^{(i)} + b) ≥ γ̂,  i = 1, ..., m

Here, we're going to maximize γ̂/||w||, subject to the functional margins all being at least γ̂. Since the geometric and functional margins are related by γ = γ̂/||w||, this will give us the answer we want.
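The relation γ = γ̂/||w|| can be sanity-checked numerically on any (hypothetical) parameters and data, since it holds example by example and therefore also for the minimum over the training set:

    import numpy as np

    # Hypothetical parameters and a tiny training set with labels in {-1, 1}.
    w, b = np.array([2.0, -1.0]), 0.5
    X = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])
    y = np.array([1, -1, -1])

    functional = y * (X @ w + b)                   # per-example functional margins
    geometric = functional / np.linalg.norm(w)     # per-example geometric margins

    # Set-level margins obey the same relation: gamma = gamma_hat / ||w||.
    print(functional.min(), geometric.min(), functional.min() / np.linalg.norm(w))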

Moreover, we've gotten rid of the constraint ||w|| = 1 that we didn't like. The downside is that we now have a nasty (again, non-convex) objective γ̂/||w||; and, we still don't have any off-the-shelf software that can solve this form of the problem.

Let's keep going. Recall our earlier discussion that we can add an arbitrary scaling constraint on w and b without changing anything. This is the key idea we'll use now. We will introduce the scaling constraint that the functional margin of w, b with respect to the training set must be 1:

    γ̂ = 1.

Since multiplying w and b by some constant results in the functional margin being multiplied by that same constant, this is indeed a scaling constraint, and can be satisfied by rescaling w, b. Plugging this into our problem above, and noting that maximizing γ̂/||w|| = 1/||w|| is the same thing as minimizing ||w||^2, we now have the following optimization problem:

    min_{γ,w,b}  (1/2)||w||^2
    s.t.  y^{(i)}(w^T x^{(i)} + b) ≥ 1,  i = 1, ..., m

We've now transformed the problem into a form that can be efficiently solved. The above is an optimization problem with a convex quadratic objective and only linear constraints, and its solution gives us the optimal margin classifier.
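Since the final problem is a standard quadratic program, it can be handed to off-the-shelf optimization software. As a hedged sketch (assuming the cvxpy package, which is not part of the notes, and using a made-up linearly separable toy set), the optimal margin classifier could be computed like this; because the functional margin is fixed to 1, the resulting geometric margin is 1/||w||.

    import numpy as np
    import cvxpy as cp   # assumption: cvxpy is available; it is not mentioned in the notes

    # Made-up, linearly separable toy set with labels in {-1, 1}.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
                  [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    y = np.array([1, 1, 1, -1, -1, -1])

    w = cp.Variable(2)
    b = cp.Variable()

    # min (1/2)||w||^2  s.t.  y^(i) (w^T x^(i) + b) >= 1 for every training example.
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()

    print("w =", w.value, " b =", b.value)
    print("geometric margin =", 1.0 / np.linalg.norm(w.value))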

