Transcription of A Short Introduction to Boosting
Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September, 1999. (In Japanese, translation by Naoki Abe.)

A Short Introduction to Boosting

Yoav Freund and Robert E. Schapire
AT&T Labs Research, Shannon Laboratory
180 Park Avenue, Florham Park, NJ
{yoav, schapire}@research.att.com

Abstract. Boosting is a general method for improving the accuracy of any given learning algorithm. This short overview paper introduces the boosting algorithm AdaBoost, and explains the underlying theory of boosting, including an explanation of why boosting often does not suffer from overfitting as well as boosting's relationship to support-vector machines.

Introduction. A horse-racing gambler, hoping to maximize his winnings, decides to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks a highly successful expert gambler to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a "rule of thumb" for that set of races (such as, "Bet on the horse that has recently won the most races" or "Bet on the horse with the most favored odds").
Although such a rule of thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert's opinion on different collections of races, the gambler is able to extract many rules of thumb.

In order to use these rules of thumb to maximum advantage, there are two problems faced by the gambler: First, how should he choose the collections of races presented to the expert so as to extract rules of thumb from the expert that will be the most useful? Second, once he has collected many rules of thumb, how can they be combined into a single, highly accurate prediction rule?

Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb in a manner similar to that suggested above. In this paper, we describe some of the basic underlying theory of boosting, including an explanation of why it often tends not to overfit.

Boosting has its roots in a theoretical framework for studying machine learning called the "PAC" learning model, due to Valiant [46]; see Kearns and Vazirani [32] for a good introduction to this model. Kearns and Valiant [30, 31] were the first to pose the question of whether a "weak" learning algorithm which performs just slightly better than random guessing in the PAC model can be "boosted" into an arbitrarily accurate "strong" learning algorithm. Schapire [38] came up with the first provable boosting algorithm. A year later, Freund [17] developed a much more efficient boosting algorithm which, although optimal in a certain sense, suffered from certain practical drawbacks. The first experiments with these early boosting algorithms were carried out by Drucker, Schapire and Simard [16]. The AdaBoost algorithm, introduced in 1995 by Freund and Schapire [23], solved many of the practical difficulties of the earlier boosting algorithms, and is the focus of this paper.
Pseudocode for AdaBoost is given in Fig. 1. The algorithm takes as input a training set $(x_1, y_1), \ldots, (x_m, y_m)$ where each $x_i$ belongs to some domain or instance space $X$, and each label $y_i$ is in some label set $Y$. For most of this paper, we assume $Y = \{-1, +1\}$; later, we discuss extensions to the multiclass case. AdaBoost calls a given weak or base learning algorithm repeatedly in a series of rounds $t = 1, \ldots, T$. One of the main ideas of the algorithm is to maintain a distribution or set of weights over the training set. The weight of this distribution on training example $i$ on round $t$ is denoted $D_t(i)$. Initially, all weights are set equally, but on each round, the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training set.

The weak learner's job is to find a weak hypothesis $h_t : X \to \{-1, +1\}$ appropriate for the distribution $D_t$. The goodness of a weak hypothesis is measured by its error

$$\epsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \neq y_i\right] = \sum_{i : h_t(x_i) \neq y_i} D_t(i).$$

Notice that the error is measured with respect to the distribution $D_t$ on which the weak learner was trained. In practice, the weak learner may be an algorithm that can use the weights $D_t$ on the training examples. Alternatively, when this is not possible, a subset of the training examples can be sampled according to $D_t$, and these (unweighted) resampled examples can be used to train the weak learner.

Returning to the horse-racing example, the instances $x_i$ correspond to descriptions of horse races (such as which horses are running, what are the odds, the track records of each horse, etc.) and the label $y_i$ gives the outcome (i.e., the winner) of each race.
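The weighted error above is just the total distribution weight on the examples a weak hypothesis gets wrong. A minimal sketch (the labels, predictions, and weights below are made-up toy values, not from the paper):

```python
import numpy as np

def weighted_error(h_pred, y, D):
    """Error of a weak hypothesis under distribution D: the total
    weight D(i) on the examples i that it misclassifies."""
    return float(D[h_pred != y].sum())

# Toy data (hypothetical): 5 examples under the uniform initial distribution.
y      = np.array([+1, +1, -1, -1, +1])   # true labels
h_pred = np.array([+1, -1, -1, +1, +1])   # weak hypothesis: wrong on examples 2 and 4
D      = np.full(5, 1 / 5)                # D_1(i) = 1/m

print(weighted_error(h_pred, y, D))       # 2 of 5 equally weighted mistakes -> 0.4
```

On later rounds the same computation is used, only with the non-uniform weights $D_t$ produced by AdaBoost's update rule.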
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$.
Initialize $D_1(i) = 1/m$.
For $t = 1, \ldots, T$:
* Train weak learner using distribution $D_t$.
* Get weak hypothesis $h_t : X \to \{-1, +1\}$ with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.
* Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.
* Update:
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} \;=\; \frac{D_t(i)\,\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$$
where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
Output the final hypothesis:
$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right).$$

Figure 1: The boosting algorithm AdaBoost.

Once the weak hypothesis $h_t$ has been received, AdaBoost chooses a parameter $\alpha_t$ as in the figure. Intuitively, $\alpha_t$ measures the importance that is assigned to $h_t$. Note that $\alpha_t \geq 0$ if $\epsilon_t \leq 1/2$ (which we can assume without loss of generality), and that $\alpha_t$ gets larger as $\epsilon_t$ gets smaller.

The distribution $D_t$ is next updated using the rule shown in the figure. The effect of this rule is to increase the weight of examples misclassified by $h_t$, and to decrease the weight of correctly classified examples. Thus, the weight tends to concentrate on "hard" examples. The final hypothesis $H$ is a weighted majority vote of the $T$ weak hypotheses where $\alpha_t$ is the weight assigned to $h_t$.

Schapire and Singer [42] show how AdaBoost and its analysis can be extended to handle weak hypotheses with real-valued predictions. That is, for each instance $x$, the weak hypothesis $h_t$ outputs a prediction $h_t(x) \in \mathbb{R}$ whose sign is the predicted label ($-1$ or $+1$) and whose magnitude $|h_t(x)|$ gives a measure of "confidence" in the prediction. In this paper, however, we focus only on the case of binary $\{-1, +1\}$-valued weak-hypothesis predictions.
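To make the pseudocode of Figure 1 concrete, here is a small sketch in Python. It follows the figure's steps line by line; the choice of one-dimensional decision stumps of the form $s \cdot \mathrm{sign}(x - \theta)$ as the weak learner, and the toy dataset, are our own illustrative assumptions, not part of the paper:

```python
import numpy as np

def adaboost(X, y, T):
    """AdaBoost as in Figure 1, with 1-D decision stumps as the weak learner."""
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    xs = np.sort(X)
    # Candidate thresholds: below, between, and above the data points.
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    model = []                                   # list of (theta, s, alpha_t)
    for t in range(T):
        # Weak learner: exhaustively pick the stump minimizing the weighted error.
        eps, theta, s = min(
            (D[(sg * np.sign(X - th)) != y].sum(), th, sg)
            for th in thresholds for sg in (+1, -1)
        )
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        model.append((theta, s, alpha))
        # Reweight: up-weight mistakes, down-weight correct, renormalize (Z_t).
        D *= np.exp(-alpha * y * s * np.sign(X - theta))
        D /= D.sum()
    return model

def predict(model, X):
    """Final hypothesis H(x) = sign(sum_t alpha_t h_t(x))."""
    return np.sign(sum(a * s * np.sign(X - th) for th, s, a in model))

# A labeling no single stump can fit (the best stump errs on 2 of 8 points) ...
X = np.arange(8.0)
y = np.array([+1, +1, -1, -1, -1, +1, +1, +1])
model = adaboost(X, y, T=3)
print((predict(model, X) == y).all())    # ... but three boosted stumps fit it exactly
```

The `max(eps, 1e-12)` guard is only there so a perfect stump does not divide by zero; everything else is a direct transcription of the figure.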
Figure 2: Error curves and the margin distribution graph for boosting C4.5 on the "letter" dataset as reported by Schapire et al. [41]. Left: the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of rounds of boosting. The horizontal lines indicate the test error rate of the base classifier as well as the test error of the final combined classifier. Right: The cumulative distribution of margins of the training examples after 5, 100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden) and solid curves, respectively.

Analyzing the training error. The most basic theoretical property of AdaBoost concerns its ability to reduce the training error. Let us write the error $\epsilon_t$ of $h_t$ as $\frac{1}{2} - \gamma_t$. Since a hypothesis that guesses each instance's class at random has an error rate of $1/2$ (on binary problems), $\gamma_t$ thus measures how much better than random are $h_t$'s predictions. Freund and Schapire [23] prove that the training error (the fraction of mistakes on the training set) of the final hypothesis $H$ is at most

$$\prod_t \left[2\sqrt{\epsilon_t(1 - \epsilon_t)}\right] = \prod_t \sqrt{1 - 4\gamma_t^2} \leq \exp\left(-2\sum_t \gamma_t^2\right). \qquad (1)$$

Thus, if each weak hypothesis is slightly better than random so that $\gamma_t \geq \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast.

A similar property is enjoyed by previous boosting algorithms. However, previous algorithms required that such a lower bound $\gamma$ be known a priori before boosting begins. In practice, knowledge of such a bound is very difficult to obtain. AdaBoost, on the other hand, adapts to the error rates of the individual weak hypotheses. This is the basis of its name: "Ada" is short for "adaptive."

The bound given in Eq. (1), combined with the bounds on generalization error given below, prove that AdaBoost is indeed a boosting algorithm in the sense that it can efficiently convert a weak learning algorithm (which can always generate a hypothesis with a weak edge for any distribution) into a strong learning algorithm (which can generate a hypothesis with an arbitrarily low error rate, given sufficient data).
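A quick numeric illustration of Eq. (1) (the error values below are invented for illustration, not from the paper): the product form can be compared against its exponential relaxation.

```python
import math

# Hypothetical weak-hypothesis errors, each slightly better than random.
eps = [0.25, 0.30, 0.35, 0.40, 0.45]
gammas = [0.5 - e for e in eps]                  # edges gamma_t = 1/2 - eps_t

# Product form of Eq. (1) and its relaxation exp(-2 sum gamma_t^2).
product_bound = math.prod(2 * math.sqrt(e * (1 - e)) for e in eps)
exp_bound = math.exp(-2 * sum(g * g for g in gammas))

print(round(product_bound, 4))   # 0.7381
print(round(exp_bound, 4))       # 0.7596 -- the product form is the tighter of the two
```

Even with weak hypotheses this close to random, five rounds already drive the guaranteed training error below 74%; with a fixed edge $\gamma$, the $\exp(-2T\gamma^2)$ form makes the exponential decay in $T$ explicit.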
Figure 3: Comparison of C4.5 versus boosting stumps and boosting C4.5 on a set of 27 benchmark problems as reported by Freund and Schapire [21]. Each point in each scatterplot shows the test error rate of the two competing algorithms on a single benchmark: one coordinate gives the test error rate (in percent) of C4.5 on the given benchmark, and the other gives the error rate of boosting stumps (left plot) or boosting C4.5 (right plot). All error rates have been averaged over multiple runs.

Generalization error. Freund and Schapire [23] showed how to bound the generalization error of the final hypothesis in terms of its training error, the sample size $m$, the VC-dimension $d$ of the weak hypothesis space and the number of boosting rounds $T$. (The VC-dimension is a standard measure of the "complexity" of a space of hypotheses. See, for instance, Blumer et al. [5].) Specifically, they used techniques from Baum and Haussler [4] to show that the generalization error, with high probability, is at most

$$\hat{\Pr}\left[H(x) \neq y\right] + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$$

where $\hat{\Pr}[\cdot]$ denotes empirical probability on the training sample. This bound suggests that boosting will overfit if run for too many rounds, i.e., as $T$ becomes large. In fact, this sometimes does happen. However, in early experiments, several authors [9, 15, 36] observed empirically that boosting often does not overfit, even when run for a very large number of rounds. Moreover, it was observed that AdaBoost would sometimes continue to drive down the generalization error long after the training error had reached zero, clearly contradicting the spirit of the bound above. For instance, the left side of Fig. 2 shows the training and test curves of running boosting on top of Quinlan's C4.5 decision-tree learning algorithm [37] on the "letter" dataset.

In response to these empirical findings, Schapire et al.
[41], following the work of Bartlett [2], gave an alternative analysis in terms of the margins of the training examples. The margin of example $(x, y)$ is defined to be

$$\mathrm{margin}(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}. \qquad (2)$$

It is a number in $[-1, +1]$ which is positive if and only if $H$ correctly classifies the example. Moreover, the magnitude of the margin can be interpreted as a measure of confidence in the prediction. Schapire et al. proved that larger margins on the training set translate into a superior upper bound on the generalization error. Specifically, the generalization error is at most

$$\hat{\Pr}\left[\mathrm{margin}(x, y) \leq \theta\right] + \tilde{O}\left(\sqrt{\frac{d}{m\theta^2}}\right) \qquad (3)$$

for any $\theta > 0$ with high probability. Note that this bound is entirely independent of $T$, the number of rounds of boosting. In addition, Schapire et al. proved that boosting is particularly aggressive at reducing the margin (in a quantifiable sense) since it concentrates on the examples with the smallest margins (whether positive or negative).

Figure 4: Comparison of error rates for AdaBoost and four other text categorization methods (naive Bayes, probabilistic TF-IDF, Rocchio and sleeping experts) as reported by Schapire and Singer [43]. The algorithms were tested on two text corpora, Reuters newswire articles (left) and AP newswire headlines (right), and with varying numbers of class labels as indicated on the horizontal axis of each figure.
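In code, the margin of Eq. (2) for a single example is a one-liner (the vote weights and predictions below are made-up illustrative values):

```python
import numpy as np

def margin(y, h_preds, alphas):
    """Normalized margin of Eq. (2): a number in [-1, +1] that is
    positive iff the weighted vote classifies label-y example correctly."""
    a = np.asarray(alphas, dtype=float)
    return y * float(np.dot(a, h_preds)) / float(np.abs(a).sum())

# Hypothetical: three weak hypotheses vote on one example with true label +1.
alphas  = [0.6, 0.3, 0.1]
h_preds = [+1, -1, +1]               # the second hypothesis disagrees
print(margin(+1, h_preds, alphas))   # (0.6 - 0.3 + 0.1) / 1.0, i.e. about 0.4
```

The vote is correct but the dissenting hypothesis pulls the margin well below $1$, which is exactly the "confidence" reading of its magnitude.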
Boosting's effect on the margins can be seen empirically, for instance, in the right side of Fig. 2, which shows the cumulative distribution of margins of the training examples on the "letter" dataset. In this case, even after the training error reaches zero, boosting continues to increase the margins of the training examples, effecting a corresponding drop in the test error. Attempts (not always successful) to use the insights gleaned from the theory of margins have been made by several authors [7, 27, 34].

The behavior of AdaBoost can also be understood in a game-theoretic setting as explored by Freund and Schapire [22, 24] (see also Grove and Schuurmans [27] and Breiman [8]). In particular, boosting can be viewed as repeated play of a certain game, and AdaBoost can be shown to be a special case of a more general algorithm for playing repeated games and for approximately solving a game. Boosting is thus closely related to linear programming and online learning.

Relation to support-vector machines. The margin theory points to a strong connection between boosting and the support-vector machines of Vapnik and others [6, 12, 47]. To clarify the connection, suppose that we have already found the weak hypotheses that we want to combine and are only interested in choosing the coefficients $\alpha_t$. One reasonable approach suggested by the analysis of AdaBoost's generalization error is to choose the coefficients so that the bound given in Eq.
(3) is minimized. In particular, suppose that the first term is zero and let us concentrate on the second term so that we are effectively attempting to maximize the minimum margin of any training example. To make this idea precise, let us denote the vector of weak-hypothesis predictions associated with the example $(x, y)$ by

$$\mathbf{h}(x) = \langle h_1(x), h_2(x), \ldots, h_T(x) \rangle,$$

which we call the instance vector, and the vector of coefficients by

$$\boldsymbol{\alpha} = \langle \alpha_1, \alpha_2, \ldots, \alpha_T \rangle,$$

which we call the weight vector. Using this notation and the definition of margin given in Eq. (2), we can write the goal of maximizing the minimum margin as

$$\max_{\boldsymbol{\alpha}} \min_i \frac{y_i \, (\boldsymbol{\alpha} \cdot \mathbf{h}(x_i))}{\|\boldsymbol{\alpha}\| \, \|\mathbf{h}(x_i)\|} \qquad (4)$$

where, for boosting, the norms in the denominator are defined as

$$\|\boldsymbol{\alpha}\|_1 = \sum_t |\alpha_t|, \qquad \|\mathbf{h}(x)\|_\infty = \max_t |h_t(x)|.$$

(When the $h_t$'s all have range $\{-1, +1\}$, $\|\mathbf{h}(x)\|_\infty$ is simply equal to $1$.) In comparison, the explicit goal of support-vector machines is to maximize a minimal margin of the form described in Eq. (4), but where the norms are instead Euclidean:

$$\|\boldsymbol{\alpha}\|_2^2 = \sum_t \alpha_t^2, \qquad \|\mathbf{h}(x)\|_2^2 = \sum_t h_t(x)^2.$$

Thus, SVMs use the $\ell_2$ norm for both the instance vector and the weight vector, while AdaBoost uses the $\ell_\infty$ norm for the instance vector and the $\ell_1$ norm for the weight vector. When described in this manner, SVM and AdaBoost seem very similar. However, there are several important differences.

Different norms can result in very different margins. The difference between the norms may not be very significant when one considers low dimensional spaces. However, in boosting or in SVM, the dimension is usually very high, and in such a case, the difference between the norms can result in very large differences in the margin values. Of course, AdaBoost does not explicitly attempt to maximize the minimal margin; however, Schapire et al.
's [41] analysis suggests that the algorithm does try to make the margins of all the training examples as large as possible, so in this sense, we can regard this maximum minimal margin algorithm as an illustrative approximation of AdaBoost. Interestingly, algorithms that explicitly attempt to maximize the minimal margin have not been experimentally as successful as AdaBoost [7, 27].

The difference between the norms is largest when there are only a few relevant variables. For instance, suppose the weak hypotheses all have range $\{-1, +1\}$ and that the label on all examples can be computed by a majority vote of a small number of the weak hypotheses. In this case, it can be shown that if the number of relevant weak hypotheses is a small fraction of the total number of weak hypotheses then the margin associated with AdaBoost will be much larger than the one associated with support vector machines.

The computation requirements are different. Computing the maximal margin hypothesis in SVMs is done via mathematical programming, that is, maximizing a mathematical expression given a set of inequalities. The difference between the two methods in this regard is that SVM corresponds to quadratic programming, while AdaBoost corresponds only to linear programming. (In fact, as noted above, there is a deep relationship between AdaBoost and linear programming which also connects AdaBoost with game theory and online learning [22].)
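The effect of the choice of norms can be seen in a small numeric sketch (the dimension and weight vector below are invented): with $\pm 1$-valued weak hypotheses and weight concentrated on a few relevant ones, the $\ell_1/\ell_\infty$ margin used by boosting is much larger than the Euclidean one.

```python
import numpy as np

def boosting_margin(y, h, alpha):
    """Margin of Eq. (4) with AdaBoost's norms: l1 on the weight
    vector alpha, l-infinity on the instance vector h."""
    return y * np.dot(alpha, h) / (np.abs(alpha).sum() * np.abs(h).max())

def svm_margin(y, h, alpha):
    """Margin of Eq. (4) with SVM's Euclidean norms on both vectors."""
    return y * np.dot(alpha, h) / (np.linalg.norm(alpha) * np.linalg.norm(h))

T = 100
h = np.ones(T)            # instance vector: every weak hypothesis votes +1
alpha = np.zeros(T)
alpha[:3] = 1.0           # weight concentrated on 3 "relevant" hypotheses

print(boosting_margin(+1, h, alpha))   # 3 / (3 * 1) = 1.0
print(svm_margin(+1, h, alpha))        # 3 / (sqrt(3) * 10), about 0.17
```

Growing the total number of hypotheses $T$ while keeping the 3 relevant ones shrinks the Euclidean margin further (through $\|\mathbf{h}\|_2 = \sqrt{T}$) but leaves the boosting margin untouched, which is the sparse-majority-vote phenomenon described above.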