Transcription of A Short Introduction to Boosting
Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September, 1999. (In Japanese, translation by Naoki Abe.)

A Short Introduction to Boosting

Yoav Freund and Robert E. Schapire
AT&T Labs Research, Shannon Laboratory
180 Park Avenue, Florham Park, NJ
{yoav, schapire}@research.att.com

Abstract. Boosting is a general method for improving the accuracy of any given learning algorithm. This short overview paper introduces the boosting algorithm AdaBoost, and explains the underlying theory of boosting, including an explanation of why boosting often does not suffer from overfitting as well as boosting's relationship to support-vector machines.

Introduction. A horse-racing gambler, hoping to maximize his winnings, decides to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks a highly successful expert gambler to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a "rule of thumb" for that set of races (such as, "Bet on the horse that has recently won the most races" or "Bet on the horse with the most favored odds").
Although such a rule of thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert's opinion on different collections of races, the gambler is able to extract many rules of thumb.

In order to use these rules of thumb to maximum advantage, there are two problems faced by the gambler: First, how should he choose the collections of races presented to the expert so as to extract rules of thumb from the expert that will be the most useful? Second, once he has collected many rules of thumb, how can they be combined into a single, highly accurate prediction rule?

Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb in a manner similar to that suggested above. In this paper, we describe some of the basic underlying theory of boosting, including an explanation of why it often tends not to overfit.

Boosting has its roots in a theoretical framework for studying machine learning called the "PAC" learning model, due to Valiant [46]; see Kearns and Vazirani [32] for a good introduction to this model. Kearns and Valiant [30, 31] were the first to pose the question of whether a "weak" learning algorithm which performs just slightly better than random guessing in the PAC model can be "boosted" into an arbitrarily accurate "strong" learning algorithm. Schapire [38] came up with the first provable boosting algorithm. A year later, Freund [17] developed a much more efficient boosting algorithm which, although optimal in a certain sense, suffered from certain practical drawbacks. The first experiments with these early boosting algorithms were carried out by Drucker, Schapire and Simard [16]. The AdaBoost algorithm, introduced in 1995 by Freund and Schapire [23], solved many of the practical difficulties of the earlier boosting algorithms, and is the focus of this paper.
Pseudocode for AdaBoost is given in Fig. 1. The algorithm takes as input a training set $(x_1, y_1), \ldots, (x_m, y_m)$ where each $x_i$ belongs to some domain or instance space $X$, and each label $y_i$ is in some label set $Y$. For most of this paper, we assume $Y = \{-1, +1\}$; later, we discuss extensions to the multiclass case. AdaBoost calls a given weak or base learning algorithm repeatedly in a series of rounds $t = 1, \ldots, T$. One of the main ideas of the algorithm is to maintain a distribution or set of weights over the training set. The weight of this distribution on training example $i$ on round $t$ is denoted $D_t(i)$. Initially, all weights are set equally, but on each round, the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training set.

The weak learner's job is to find a weak hypothesis $h_t : X \to \{-1, +1\}$ appropriate for the distribution $D_t$. The goodness of a weak hypothesis is measured by its error

$$\epsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \neq y_i\right] = \sum_{i : h_t(x_i) \neq y_i} D_t(i).$$

Notice that the error is measured with respect to the distribution $D_t$ on which the weak learner was trained. In practice, the weak learner may be an algorithm that can use the weights $D_t$ on the training examples. Alternatively, when this is not possible, a subset of the training examples can be sampled according to $D_t$, and these (unweighted) resampled examples can be used to train the weak learner.

Returning to the horse-racing example, the instances $x_i$ correspond to descriptions of horse races (such as which horses are running, what are the odds, the track records of each horse, etc.) and the label $y_i$ gives the outcome (i.e., the winner) of each race.
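The weighted error above is just the total distribution weight on the examples a weak hypothesis gets wrong. A minimal sketch (the labels, predictions, and weights below are made-up toy values, not from the paper):

```python
import numpy as np

def weighted_error(h_pred, y, D):
    """Error of a weak hypothesis under distribution D: the total
    weight D(i) on the examples i that it misclassifies."""
    return float(D[h_pred != y].sum())

# Toy data (hypothetical): 5 examples under the uniform initial distribution.
y      = np.array([+1, +1, -1, -1, +1])   # true labels
h_pred = np.array([+1, -1, -1, +1, +1])   # weak hypothesis: wrong on examples 2 and 4
D      = np.full(5, 1 / 5)                # D_1(i) = 1/m

print(weighted_error(h_pred, y, D))       # 2 of 5 equally weighted mistakes -> 0.4
```

On later rounds the same computation is used, only with the non-uniform weights $D_t$ produced by AdaBoost's update rule.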
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$.
Initialize $D_1(i) = 1/m$.
For $t = 1, \ldots, T$:
* Train weak learner using distribution $D_t$.
* Get weak hypothesis $h_t : X \to \{-1, +1\}$ with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.
* Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.
* Update:
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} \;=\; \frac{D_t(i)\,\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$$
where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
Output the final hypothesis:
$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right).$$

Figure 1: The boosting algorithm AdaBoost.

Once the weak hypothesis $h_t$ has been received, AdaBoost chooses a parameter $\alpha_t$ as in the figure. Intuitively, $\alpha_t$ measures the importance that is assigned to $h_t$. Note that $\alpha_t \geq 0$ if $\epsilon_t \leq 1/2$ (which we can assume without loss of generality), and that $\alpha_t$ gets larger as $\epsilon_t$ gets smaller.

The distribution $D_t$ is next updated using the rule shown in the figure. The effect of this rule is to increase the weight of examples misclassified by $h_t$, and to decrease the weight of correctly classified examples. Thus, the weight tends to concentrate on "hard" examples. The final hypothesis $H$ is a weighted majority vote of the $T$ weak hypotheses where $\alpha_t$ is the weight assigned to $h_t$.

Schapire and Singer [42] show how AdaBoost and its analysis can be extended to handle weak hypotheses with real-valued predictions. That is, for each instance $x$, the weak hypothesis $h_t$ outputs a prediction $h_t(x) \in \mathbb{R}$ whose sign is the predicted label ($-1$ or $+1$) and whose magnitude $|h_t(x)|$ gives a measure of "confidence" in the prediction. In this paper, however, we focus only on the case of binary $\{-1, +1\}$-valued weak-hypothesis predictions.
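To make the pseudocode of Figure 1 concrete, here is a small sketch in Python. It follows the figure's steps line by line; the choice of one-dimensional decision stumps of the form $s \cdot \mathrm{sign}(x - \theta)$ as the weak learner, and the toy dataset, are our own illustrative assumptions, not part of the paper:

```python
import numpy as np

def adaboost(X, y, T):
    """AdaBoost as in Figure 1, with 1-D decision stumps as the weak learner."""
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    xs = np.sort(X)
    # Candidate thresholds: below, between, and above the data points.
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    model = []                                   # list of (theta, s, alpha_t)
    for t in range(T):
        # Weak learner: exhaustively pick the stump minimizing the weighted error.
        eps, theta, s = min(
            (D[(sg * np.sign(X - th)) != y].sum(), th, sg)
            for th in thresholds for sg in (+1, -1)
        )
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        model.append((theta, s, alpha))
        # Reweight: up-weight mistakes, down-weight correct, renormalize (Z_t).
        D *= np.exp(-alpha * y * s * np.sign(X - theta))
        D /= D.sum()
    return model

def predict(model, X):
    """Final hypothesis H(x) = sign(sum_t alpha_t h_t(x))."""
    return np.sign(sum(a * s * np.sign(X - th) for th, s, a in model))

# A labeling no single stump can fit (the best stump errs on 2 of 8 points) ...
X = np.arange(8.0)
y = np.array([+1, +1, -1, -1, -1, +1, +1, +1])
model = adaboost(X, y, T=3)
print((predict(model, X) == y).all())    # ... but three boosted stumps fit it exactly
```

The `max(eps, 1e-12)` guard is only there so a perfect stump does not divide by zero; everything else is a direct transcription of the figure.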
Figure 2: Error curves and the margin distribution graph for boosting C4.5 on the "letter" dataset as reported by Schapire et al. [41]. Left: the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of rounds of boosting. The horizontal lines indicate the test error rate of the base classifier as well as the test error of the final combined classifier. Right: The cumulative distribution of margins of the training examples after 5, 100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden) and solid curves, respectively.

Analyzing the training error. The most basic theoretical property of AdaBoost concerns its ability to reduce the training error. Let us write the error $\epsilon_t$ of $h_t$ as $\frac{1}{2} - \gamma_t$. Since a hypothesis that guesses each instance's class at random has an error rate of $1/2$ (on binary problems), $\gamma_t$ thus measures how much better than random are $h_t$'s predictions. Freund and Schapire [23] prove that the training error (the fraction of mistakes on the training set) of the final hypothesis $H$ is at most

$$\prod_t \left[2\sqrt{\epsilon_t(1 - \epsilon_t)}\right] = \prod_t \sqrt{1 - 4\gamma_t^2} \leq \exp\left(-2\sum_t \gamma_t^2\right). \qquad (1)$$

Thus, if each weak hypothesis is slightly better than random so that $\gamma_t \geq \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast.

A similar property is enjoyed by previous boosting algorithms. However, previous algorithms required that such a lower bound $\gamma$ be known a priori before boosting begins. In practice, knowledge of such a bound is very difficult to obtain. AdaBoost, on the other hand, adapts to the error rates of the individual weak hypotheses. This is the basis of its name: "Ada" is short for "adaptive."

The bound given in Eq. (1), combined with the bounds on generalization error given below, prove that AdaBoost is indeed a boosting algorithm in the sense that it can efficiently convert a weak learning algorithm (which can always generate a hypothesis with a weak edge for any distribution) into a strong learning algorithm (which can generate a hypothesis with an arbitrarily low error rate, given sufficient data).
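A quick numeric illustration of Eq. (1) (the error values below are invented for illustration, not from the paper): the product form can be compared against its exponential relaxation.

```python
import math

# Hypothetical weak-hypothesis errors, each slightly better than random.
eps = [0.25, 0.30, 0.35, 0.40, 0.45]
gammas = [0.5 - e for e in eps]                  # edges gamma_t = 1/2 - eps_t

# Product form of Eq. (1) and its relaxation exp(-2 sum gamma_t^2).
product_bound = math.prod(2 * math.sqrt(e * (1 - e)) for e in eps)
exp_bound = math.exp(-2 * sum(g * g for g in gammas))

print(round(product_bound, 4))   # 0.7381
print(round(exp_bound, 4))       # 0.7596 -- the product form is the tighter of the two
```

Even with weak hypotheses this close to random, five rounds already drive the guaranteed training error below 74%; with a fixed edge $\gamma$, the $\exp(-2T\gamma^2)$ form makes the exponential decay in $T$ explicit.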
Figure 3: Comparison of C4.5 versus boosting stumps and boosting C4.5 on a set of 27 benchmark problems as reported by Freund and Schapire [21]. Each point in each scatterplot shows the test error rate of the two competing algorithms on a single benchmark: one coordinate gives the test error rate (in percent) of C4.5 on the given benchmark, and the other gives the error rate of boosting stumps (left plot) or boosting C4.5 (right plot). All error rates have been averaged over multiple runs.

Generalization error. Freund and Schapire [23] showed how to bound the generalization error of the final hypothesis in terms of its training error, the sample size $m$, the VC-dimension $d$ of the weak hypothesis space and the number of boosting rounds $T$. (The VC-dimension is a standard measure of the "complexity" of a space of hypotheses. See, for instance, Blumer et al. [5].) Specifically, they used techniques from Baum and Haussler [4] to show that the generalization error, with high probability, is at most

$$\hat{\Pr}\left[H(x) \neq y\right] + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$$

where $\hat{\Pr}[\cdot]$ denotes empirical probability on the training sample. This bound suggests that boosting will overfit if run for too many rounds, i.e., as $T$ becomes large. In fact, this sometimes does happen. However, in early experiments, several authors [9, 15, 36] observed empirically that boosting often does not overfit, even when run for a very large number of rounds. Moreover, it was observed that AdaBoost would sometimes continue to drive down the generalization error long after the training error had reached zero, clearly contradicting the spirit of the bound above. For instance, the left side of Fig. 2 shows the training and test curves of running boosting on top of Quinlan's C4.5 decision-tree learning algorithm [37] on the "letter" dataset.

In response to these empirical findings, Schapire et al.
[41], following the work of Bartlett [2], gave an alternative analysis in terms of the margins of the training examples. The margin of example $(x, y)$ is defined to be

$$\mathrm{margin}(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}. \qquad (2)$$

It is a number in $[-1, +1]$ which is positive if and only if $H$ correctly classifies the example. Moreover, the magnitude of the margin can be interpreted as a measure of confidence in the prediction. Schapire et al. proved that larger margins on the training set translate into a superior upper bound on the generalization error. Specifically, the generalization error is at most

$$\hat{\Pr}\left[\mathrm{margin}(x, y) \leq \theta\right] + \tilde{O}\left(\sqrt{\frac{d}{m\theta^2}}\right) \qquad (3)$$

for any $\theta > 0$ with high probability. Note that this bound is entirely independent of $T$, the number of rounds of boosting. In addition, Schapire et al. proved that boosting is particularly aggressive at reducing the margin (in a quantifiable sense) since it concentrates on the examples with the smallest margins (whether positive or negative).

Figure 4: Comparison of error rates for AdaBoost and four other text categorization methods (naive Bayes, probabilistic TF-IDF, Rocchio and sleeping experts) as reported by Schapire and Singer [43]. The algorithms were tested on two text corpora, Reuters newswire articles (left) and AP newswire headlines (right), and with varying numbers of class labels as indicated on the horizontal axis of each figure.
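In code, the margin of Eq. (2) for a single example is a one-liner (the vote weights and predictions below are made-up illustrative values):

```python
import numpy as np

def margin(y, h_preds, alphas):
    """Normalized margin of Eq. (2): a number in [-1, +1] that is
    positive iff the weighted vote classifies label-y example correctly."""
    a = np.asarray(alphas, dtype=float)
    return y * float(np.dot(a, h_preds)) / float(np.abs(a).sum())

# Hypothetical: three weak hypotheses vote on one example with true label +1.
alphas  = [0.6, 0.3, 0.1]
h_preds = [+1, -1, +1]               # the second hypothesis disagrees
print(margin(+1, h_preds, alphas))   # (0.6 - 0.3 + 0.1) / 1.0, i.e. about 0.4
```

The vote is correct but the dissenting hypothesis pulls the margin well below $1$, which is exactly the "confidence" reading of its magnitude.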
Boosting's effect on the margins can be seen empirically, for instance, in the right side of Fig. 2, which shows the cumulative distribution of margins of the training examples on the "letter" dataset. In this case, even after the training error reaches zero, boosting continues to increase the margins of the training examples, effecting a corresponding drop in the test error. Attempts (not always successful) to use the insights gleaned from the theory of margins have been made by several authors [7, 27, 34].

The behavior of AdaBoost can also be understood in a game-theoretic setting as explored by Freund and Schapire [22, 24] (see also Grove and Schuurmans [27] and Breiman [8]). In particular, boosting can be viewed as repeated play of a certain game, and AdaBoost can be shown to be a special case of a more general algorithm for playing repeated games and for approximately solving a game. Boosting is thus closely related to linear programming and online learning.

Relation to support-vector machines. The margin theory points to a strong connection between boosting and the support-vector machines of Vapnik and others [6, 12, 47]. To clarify the connection, suppose that we have already found the weak hypotheses that we want to combine and are only interested in choosing the coefficients $\alpha_t$. One reasonable approach suggested by the analysis of AdaBoost's generalization error is to choose the coefficients so that the bound given in Eq.
(3) is minimized. In particular, suppose that the first term is zero and let us concentrate on the second term so that we are effectively attempting to maximize the minimum margin of any training example. To make this idea precise, let us denote the vector of weak-hypothesis predictions associated with the example $(x, y)$ by

$$\mathbf{h}(x) = \langle h_1(x), h_2(x), \ldots, h_T(x) \rangle,$$

which we call the instance vector, and the vector of coefficients by

$$\boldsymbol{\alpha} = \langle \alpha_1, \alpha_2, \ldots, \alpha_T \rangle,$$

which we call the weight vector. Using this notation and the definition of margin given in Eq. (2), we can write the goal of maximizing the minimum margin as

$$\max_{\boldsymbol{\alpha}} \min_i \frac{y_i \, (\boldsymbol{\alpha} \cdot \mathbf{h}(x_i))}{\|\boldsymbol{\alpha}\| \, \|\mathbf{h}(x_i)\|} \qquad (4)$$

where, for boosting, the norms in the denominator are defined as

$$\|\boldsymbol{\alpha}\|_1 = \sum_t |\alpha_t|, \qquad \|\mathbf{h}(x)\|_\infty = \max_t |h_t(x)|.$$

(When the $h_t$'s all have range $\{-1, +1\}$, $\|\mathbf{h}(x)\|_\infty$ is simply equal to $1$.) In comparison, the explicit goal of support-vector machines is to maximize a minimal margin of the form described in Eq. (4), but where the norms are instead Euclidean:

$$\|\boldsymbol{\alpha}\|_2^2 = \sum_t \alpha_t^2, \qquad \|\mathbf{h}(x)\|_2^2 = \sum_t h_t(x)^2.$$

Thus, SVMs use the $\ell_2$ norm for both the instance vector and the weight vector, while AdaBoost uses the $\ell_\infty$ norm for the instance vector and the $\ell_1$ norm for the weight vector. When described in this manner, SVM and AdaBoost seem very similar. However, there are several important differences.

Different norms can result in very different margins. The difference between the norms may not be very significant when one considers low dimensional spaces. However, in boosting or in SVM, the dimension is usually very high, and in such a case, the difference between the norms can result in very large differences in the margin values. Of course, AdaBoost does not explicitly attempt to maximize the minimal margin; however, Schapire et al.
's [41] analysis suggests that the algorithm does try to make the margins of all the training examples as large as possible, so in this sense, we can regard this maximum minimal margin algorithm as an illustrative approximation of AdaBoost. Interestingly, algorithms that explicitly attempt to maximize the minimal margin have not been experimentally as successful as AdaBoost [7, 27].

The difference between the norms is largest when there are only a few relevant variables. For instance, suppose the weak hypotheses all have range $\{-1, +1\}$ and that the label on all examples can be computed by a majority vote of a small number of the weak hypotheses. In this case, it can be shown that if the number of relevant weak hypotheses is a small fraction of the total number of weak hypotheses then the margin associated with AdaBoost will be much larger than the one associated with support vector machines.

The computation requirements are different. Computing the maximal margin hypothesis in SVMs is done via mathematical programming, that is, maximizing a mathematical expression given a set of inequalities. The difference between the two methods in this regard is that SVM corresponds to quadratic programming, while AdaBoost corresponds only to linear programming. (In fact, as noted above, there is a deep relationship between AdaBoost and linear programming which also connects AdaBoost with game theory and online learning [22].)
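The effect of the choice of norms can be seen in a small numeric sketch (the dimension and weight vector below are invented): with $\pm 1$-valued weak hypotheses and weight concentrated on a few relevant ones, the $\ell_1/\ell_\infty$ margin used by boosting is much larger than the Euclidean one.

```python
import numpy as np

def boosting_margin(y, h, alpha):
    """Margin of Eq. (4) with AdaBoost's norms: l1 on the weight
    vector alpha, l-infinity on the instance vector h."""
    return y * np.dot(alpha, h) / (np.abs(alpha).sum() * np.abs(h).max())

def svm_margin(y, h, alpha):
    """Margin of Eq. (4) with SVM's Euclidean norms on both vectors."""
    return y * np.dot(alpha, h) / (np.linalg.norm(alpha) * np.linalg.norm(h))

T = 100
h = np.ones(T)            # instance vector: every weak hypothesis votes +1
alpha = np.zeros(T)
alpha[:3] = 1.0           # weight concentrated on 3 "relevant" hypotheses

print(boosting_margin(+1, h, alpha))   # 3 / (3 * 1) = 1.0
print(svm_margin(+1, h, alpha))        # 3 / (sqrt(3) * 10), about 0.17
```

Growing the total number of hypotheses $T$ while keeping the 3 relevant ones shrinks the Euclidean margin further (through $\|\mathbf{h}\|_2 = \sqrt{T}$) but leaves the boosting margin untouched, which is the sparse-majority-vote phenomenon described above.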