
A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Journal of Computer and System Sciences 55, 119-139 (1997)

A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting*

Yoav Freund and Robert E. Schapire, AT&T Labs, 180 Park Avenue, Florham Park, New Jersey 07932

Received December 19, 1996

In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update Littlestone-Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems.




We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games, and prediction of points in R^n. In the second part of the paper we apply the multiplicative weight-update technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of the new boosting algorithm to the problem of learning functions whose range, rather than being binary, is an arbitrary finite set or a bounded segment of the real line. © 1997 Academic Press

1. INTRODUCTION

A gambler, frustrated by persistent horse-racing losses and envious of his friends' winnings, decides to allow a group of his fellow gamblers to make bets on his behalf.

He decides he will wager a fixed sum of money in every race, but that he will apportion his money among his friends based on how well they are doing. Certainly, if he knew psychically ahead of time which of his friends would win the most, he would naturally have that friend handle all his wagers. Lacking such clairvoyance, however, he attempts to allocate each race's wager in such a way that his total winnings for the season will be reasonably close to what he would have won had he bet everything with the luckiest of his friends.

In this paper, we describe a simple algorithm for solving such dynamic allocation problems, and we show that our solution can be applied to a great assortment of learning problems. Perhaps the most surprising of these applications is the derivation of a new algorithm for ``boosting,'' i.e., for converting a ``weak'' PAC learning algorithm that performs just slightly better than random guessing into one with arbitrarily high accuracy.

We formalize our on-line allocation model as follows.

The allocation agent A has N options or strategies to choose from; we number these using the integers 1, ..., N. At each time step t = 1, 2, ..., T, the allocator A decides on a distribution p^t over the strategies; that is, p^t_i >= 0 is the amount allocated to strategy i, and ∑_{i=1}^N p^t_i = 1. Each strategy i then suffers some loss l^t_i which is determined by the (possibly adversarial) ``environment.'' The loss suffered by A is then ∑_{i=1}^N p^t_i l^t_i = p^t · l^t, i.e., the average loss of the strategies with respect to A's chosen allocation rule. We call this loss function the mixture loss.

In this paper, we always assume that the loss suffered by any strategy is bounded so that, without loss of generality, l^t_i ∈ [0, 1]. Besides this condition, we make no assumptions about the form of the loss vectors l^t, or about the manner in which they are generated; indeed, the adversary's choice for l^t may even depend on the allocator's chosen allocation p^t.

The goal of the algorithm A is to minimize its cumulative loss relative to the loss suffered by the best strategy.
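As a concrete illustration of the mixture loss just defined (the allocation and loss values below are invented for illustration, not taken from the paper):

```python
# Illustration of the mixture loss p^t . l^t for a single time step.

def mixture_loss(p, l):
    """Average loss of the strategies under allocation p (the dot product p . l)."""
    assert abs(sum(p) - 1.0) < 1e-9, "p must be a distribution"
    assert all(0.0 <= x <= 1.0 for x in l), "losses are assumed to lie in [0, 1]"
    return sum(pi * li for pi, li in zip(p, l))

p = [0.5, 0.3, 0.2]   # allocation p^t over N = 3 strategies
l = [1.0, 0.0, 0.5]   # losses l^t chosen by the environment
print(mixture_loss(p, l))  # 0.5*1.0 + 0.3*0.0 + 0.2*0.5 = 0.6
```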

That is, A attempts to minimize its net loss

L_A − min_i L_i

where

L_A = ∑_{t=1}^T p^t · l^t

is the total cumulative loss suffered by algorithm A on the first T trials, and

L_i = ∑_{t=1}^T l^t_i

is strategy i's cumulative loss.

* An extended abstract of this work appeared in the ``Proceedings of the Second European Conference on Computational Learning Theory, Barcelona, March, 1995.'' E-mail: [yoav, schapire]. © 1997 by Academic Press. All rights of reproduction in any form reserved.

In Section 2, we show that Littlestone and Warmuth's [20] ``weighted majority'' algorithm can be generalized to handle this problem, and we prove a number of bounds on the net loss. For instance, one of our results shows that the net loss of our algorithm can be bounded by O(√(T ln N)) or, put another way, that the average per-trial net loss is decreasing at the rate O(√((ln N)/T)).
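The algorithm of Section 2 is not reproduced in this excerpt, but a rough sketch of an allocator built on the multiplicative weight-update rule named above can convey the idea: each strategy's weight is scaled down according to the loss it suffers. The update factor beta and the loss sequence below are illustrative choices, not values from the paper.

```python
import math

def hedge(losses, beta=0.8):
    """Run the allocator on a T-by-N sequence of loss vectors l^t and
    return its total cumulative mixture loss, sum_t p^t . l^t."""
    n = len(losses[0])
    w = [1.0 / n] * n                                  # initial weights: uniform over N
    total = 0.0
    for l in losses:
        s = sum(w)
        p = [wi / s for wi in w]                       # p^t: normalized weights
        total += sum(pi * li for pi, li in zip(p, l))  # mixture loss p^t . l^t
        w = [wi * beta ** li for wi, li in zip(w, l)]  # shrink lossy strategies
    return total

# Strategy 0 always loses, strategy 1 never does: the allocator's
# cumulative loss stays within an additive term of the best strategy's
# cumulative loss (here 0), rather than growing linearly in T.
print(hedge([[1.0, 0.0]] * 20))
```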

Thus, as T increases, this difference decreases to zero.

Our results for the on-line allocation model can be applied to a wide variety of learning problems, as we describe in Section 3. In particular, we generalize the results of Littlestone and Warmuth [20] and Cesa-Bianchi et al. [4] for the problem of predicting a binary sequence using the advice of a team of ``experts.'' Whereas these authors proved worst-case bounds for making on-line randomized decisions over a binary decision and outcome space with a {0, 1}-valued discrete loss, we prove (slightly weaker) bounds that are applicable to any bounded loss function over any decision and outcome spaces. Our bounds express explicitly the rate at which the loss of the learning algorithm approaches that of the best expert.

Related generalizations of the expert prediction model were studied by Vovk [25], Kivinen and Warmuth [19], and Haussler et al.

[15]. Like us, these authors focused primarily on multiplicative weight-update algorithms. Chung [5] also presented a generalization, giving the problem a game-theoretic treatment.

Returning to the horse-racing story, suppose now that the gambler grows weary of choosing among the experts and instead wishes to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks his favorite expert to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a ``rule-of-thumb'' for that set of races (such as, ``Bet on the horse that has recently won the most races'' or ``Bet on the horse with the most favored odds'').

Although such a rule-of-thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert's opinion on different collections of races, the gambler is able to extract many rules-of-thumb.

In order to use these rules-of-thumb to maximum advantage, there are two problems faced by the gambler: First, how should he choose the collections of races presented to the expert so as to extract rules-of-thumb from the expert that will be the most useful? Second, once he has collected many rules-of-thumb, how can they be combined into a single, highly accurate prediction rule?

Boosting refers to this general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules-of-thumb.

In the second part of the paper, we present and analyze a new boosting algorithm inspired by the methods we used for solving the on-line allocation problem.

Briefly, boosting proceeds as follows: The booster is provided with a set of labelled training examples (x_1, y_1), ..., (x_N, y_N), where y_i is the label associated with instance x_i; for instance, in the horse-racing example, x_i might be the observable data associated with a particular horse race, and y_i the outcome (winning horse) of that race. On each round t = 1, ..., T, the booster devises a distribution D_t over the set of examples, and requests (from an unspecified oracle) a weak hypothesis (or rule-of-thumb) h_t with low error ε_t with respect to D_t (that is, ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]). Thus, distribution D_t specifies the relative importance of each example for the current round. After T rounds, the booster must combine the weak hypotheses into a single prediction rule.

Unlike the previous boosting algorithms of Freund [10, 11] and Schapire [22], the new algorithm needs no prior knowledge of the accuracies of the weak hypotheses. Rather, it adapts to these accuracies and generates a weighted majority hypothesis in which the weight of each weak hypothesis is a function of its accuracy.
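The boosting loop just described can be sketched in code. The distribution update and the weighted-majority combination below follow the adaptive scheme the text describes (correctly classified examples are down-weighted by a factor β_t = ε_t/(1−ε_t) that shrinks with the weak hypothesis's error); the toy dataset and the `stump_learner` weak learner are invented for illustration and are not part of the paper.

```python
import math

def boost(xs, ys, weak_learner, rounds):
    """Return a combined hypothesis built from `rounds` weak hypotheses."""
    n = len(xs)
    d = [1.0 / n] * n                        # D_1: uniform over the examples
    hyps, alphas = [], []
    for _ in range(rounds):
        h = weak_learner(xs, ys, d)          # weak hypothesis for D_t
        eps = sum(di for di, x, y in zip(d, xs, ys) if h(x) != y)
        eps = max(eps, 1e-10)                # guard; weak learner assumed better than random
        beta = eps / (1.0 - eps)
        # down-weight correctly classified examples, then renormalize to get D_{t+1}
        d = [di * (beta if h(x) == y else 1.0) for di, x, y in zip(d, xs, ys)]
        z = sum(d)
        d = [di / z for di in d]
        hyps.append(h)
        alphas.append(math.log(1.0 / beta))  # more accurate => larger vote weight
    def final(x):                            # weighted majority vote over the h_t
        vote = sum(a * h(x) for a, h in zip(alphas, hyps))
        return 1 if vote >= 0.5 * sum(alphas) else 0
    return final

def stump_learner(xs, ys, d):
    """Illustrative weak learner: the best threshold stump under weights d."""
    best, best_err = None, 2.0
    for t in [x + 0.5 for x in xs]:
        for flip in (False, True):
            h = lambda x, t=t, flip=flip: int((x < t) != flip)
            err = sum(di for di, x, y in zip(d, xs, ys) if h(x) != y)
            if err < best_err:
                best, best_err = h, err
    return best

# No single stump classifies this non-monotone labelling, but three
# boosting rounds produce a combined rule that does.
xs, ys = [0, 1, 2], [1, 0, 1]
final = boost(xs, ys, stump_learner, 3)
print([final(x) for x in xs])  # [1, 0, 1]
```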

For binary prediction problems, we prove in Section 4 that the error of this final hypothesis (with respect to the given set of examples) is bounded by exp(−2 ∑_{t=1}^T γ_t²), where ε_t = 1/2 − γ_t is the error of the t-th weak hypothesis. Since a hypothesis that makes entirely random guesses has error 1/2, γ_t measures the accuracy of the t-th weak hypothesis relative to random guessing. Thus, this bound shows that if we can consistently find weak hypotheses that are slightly better than random guessing, then the error of the final hypothesis drops exponentially fast.

Note that the bound on the accuracy of the final hypothesis improves when any of the weak hypotheses is improved. This is in contrast with previous boosting algorithms whose performance bound depended only on the accuracy of the least accurate weak hypothesis.
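To get a feel for this bound, one can evaluate exp(−2 ∑_t γ_t²) numerically; the advantage values γ_t below are invented for illustration.

```python
import math

def error_bound(gammas):
    """Training-error bound exp(-2 * sum_t gamma_t^2) for edge advantages gamma_t."""
    return math.exp(-2.0 * sum(g * g for g in gammas))

# 100 weak hypotheses, each only slightly better than random (gamma = 0.1),
# already drive the bound well below 1:
print(error_bound([0.1] * 100))   # exp(-2) ≈ 0.135
```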

