
Generalized Boosted Models: A guide to the gbm package

Greg Ridgeway
August 3, 2007


Boosting takes on various forms with different programs using different loss functions, different base models, and different optimization schemes. The gbm package takes the approach described in [2] and [3]. Some of the terminology differs, mostly due to an effort to cast boosting terms into more standard statistical terminology (e.g. deviance). In addition, the gbm package implements boosting for models commonly used in statistics but not commonly associated with boosting.

The Cox proportional hazard model, for example, is an incredibly useful model and the boosting framework applies quite readily with only slight modification [5]. Also some algorithms implemented in the gbm package differ from the standard implementation. The AdaBoost algorithm [1] has a particular loss function and a particular optimization algorithm associated with it. The gbm implementation of AdaBoost adopts AdaBoost's exponential loss function (its bound on misclassification rate) but uses Friedman's gradient descent algorithm rather than the original one proposed.
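The gbm() interface exposes this choice through its distribution argument. The following is a minimal sketch rather than a prescribed recipe: the data frame train and the 0/1 outcome y are hypothetical placeholders, while distribution, n.trees, interaction.depth, and shrinkage are actual gbm() arguments.

    library(gbm)
    # AdaBoost's exponential loss, fit by Friedman's gradient descent rather
    # than the original AdaBoost reweighting scheme.
    fit <- gbm(y ~ ., data = train,
               distribution = "adaboost",   # exponential loss
               n.trees = 1000,              # number of boosting iterations
               interaction.depth = 2,       # depth of each regression tree
               shrinkage = 0.05)            # learning rate (see Section 2)

Swapping in distribution = "bernoulli" would instead use the binomial deviance, in line with the more standard statistical terminology mentioned above.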

So the main purpose of this document is to spell out in detail what the gbm package implements.

1 Gradient boosting

This section essentially presents the derivation of boosting described in [2]. The gbm package also adopts the stochastic gradient boosting strategy, a small but important tweak on the basic algorithm, described in [3].

1.1 Friedman's gradient boosting machine

Friedman (2001) and the companion paper Friedman (2002) extended the work of Friedman, Hastie, and Tibshirani (2000) and laid the groundwork for a new generation of boosting algorithms.

Using the connection between boosting and optimization, this new work proposes the Gradient Boosting Machine.

In any function estimation problem we wish to find a regression function, \hat f(x), that minimizes the expectation of some loss function, \Psi(y, f), as shown in (4).

    \hat f(x) = \arg\min_{f(x)} E_{y,x} \Psi(y, f(x))
              = \arg\min_{f(x)} E_x \left[ E_{y|x} \Psi(y, f(x)) \mid x \right]    (4)

We will focus on finding estimates of f(x) such that

    \hat f(x) = \arg\min_{f(x)} E_{y|x} \left[ \Psi(y, f(x)) \mid x \right]    (5)

Parametric regression models assume that f(x) is a function with a finite number of parameters, \beta, and estimates them by selecting those values that minimize a loss function (e.g. squared error loss) over a training sample of N observations on (y, x) pairs as in (6).

    \hat\beta = \arg\min_\beta \sum_{i=1}^N \Psi(y_i, f(x_i; \beta))    (6)

    Initialize \hat f(x) to be a constant, \hat f(x) = \arg\min_\rho \sum_{i=1}^N \Psi(y_i, \rho).
    For t in 1, ..., T do
      1. Compute the negative gradient as the working response
           z_i = -\frac{\partial}{\partial f(x_i)} \Psi(y_i, f(x_i)) \Big|_{f(x_i) = \hat f(x_i)}    (1)
      2. Fit a regression model, g(x), predicting z_i from the covariates x_i.
      3. Choose a gradient descent step size as
           \rho = \arg\min_\rho \sum_{i=1}^N \Psi(y_i, \hat f(x_i) + \rho g(x_i))    (2)
      4. Update the estimate of f(x) as
           \hat f(x) \leftarrow \hat f(x) + \rho g(x)    (3)

    Figure 1: Friedman's Gradient Boost algorithm

When we wish to estimate f(x) non-parametrically the task becomes more difficult.
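As an illustrative aside (not part of the original derivation), instantiating (5) and (6) with squared-error loss makes the target concrete: with \Psi(y, f(x)) = (y - f(x))^2 the population minimizer in (5) is the conditional mean,

    \hat f(x) = \arg\min_{f(x)} E_{y|x} \left[ (y - f(x))^2 \mid x \right] = E(y \mid x),

and (6) with a linear form f(x; \beta) = x'\beta is exactly ordinary least squares. Boosting aims at the same kind of minimizer without committing to a parametric form for f(x).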

Again we can proceed similarly to [4] and modify our current estimate of f(x) by adding a new function f(x) in a greedy fashion. Letting f_i = f(x_i), we see that we want to decrease the N dimensional function

    J(f) = \sum_{i=1}^N \Psi(y_i, f(x_i)) = \sum_{i=1}^N \Psi(y_i, f_i).    (7)

The negative gradient of J(f) indicates the direction of the locally greatest decrease in J(f). Gradient descent would then have us modify f as

    \hat f \leftarrow \hat f - \rho \nabla J(f)    (8)

where \rho is the size of the step along the direction of greatest descent. Clearly, this step alone is far from our desired goal.

First, it only fits f at values of x for which we have observations. Second, it does not take into account that observations with similar x are likely to have similar values of f(x). Both these problems would have disastrous effects on generalization error. However, Friedman suggests selecting a class of functions that use the covariate information to approximate the gradient, usually a regression tree. This line of reasoning produces his Gradient Boosting algorithm shown in Figure 1. At each iteration the algorithm determines the direction, the gradient, in which it needs to improve the fit to the data and selects a particular model from the allowable class of functions that is in most agreement with the direction.

In the case of squared-error loss, \sum_{i=1}^N \Psi(y_i, f(x_i)) = \sum_{i=1}^N (y_i - f(x_i))^2, this algorithm corresponds exactly to residual fitting. There are various ways to extend and improve upon the basic framework suggested in Figure 1. For example, Friedman (2001) substituted several choices in for \Psi to develop new boosting algorithms for robust regression with least absolute deviation and Huber loss functions. Friedman (2002) showed that a simple subsampling trick can greatly improve predictive performance while simultaneously reducing computation time.
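To make the residual-fitting interpretation concrete, here is a short sketch of the Figure 1 recipe for squared-error loss, with rpart regression trees standing in for the base learner g(x). This is an illustration under stated assumptions, not the gbm package's internal code; the function name, the covariate data frame x, the response y, and the fixed shrunken step are inventions for the example.

    library(rpart)

    # Sketch of Figure 1 for squared-error loss: x is a data frame of
    # covariates and y a numeric response (both hypothetical).
    gradient_boost_sketch <- function(x, y, n.trees = 100, shrink = 0.1, depth = 2) {
      f <- rep(mean(y), length(y))        # initialize f(x) to the constant arg min of sum Psi(y_i, rho)
      trees <- vector("list", n.trees)
      for (t in seq_len(n.trees)) {
        z <- y - f                        # (1) negative gradient = residual under squared error
        g <- rpart(z ~ ., data = cbind(z = z, x),
                   control = rpart.control(maxdepth = depth))   # (2) fit g(x) to z
        # (3) for squared-error loss the exact line search gives rho = 1 when g is a
        #     least-squares fit; a fixed shrunken step is used here instead,
        #     anticipating the learning-rate discussion in Section 2
        f <- f + shrink * predict(g, x)   # (4) update the estimate of f(x)
        trees[[t]] <- g
      }
      list(trees = trees, fitted = f, shrink = shrink, init = mean(y))
    }

Each pass fits a small tree to the current residuals and adds a damped version of its predictions, which is exactly the residual-fitting behaviour described above.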

Section 2 discusses some of these modifications.

2 Improving boosting methods using control of the learning rate, sub-sampling, and a decomposition for interpretation

This section explores the variations of the previous algorithms that have the potential to improve their predictive performance and interpretability. In particular, by controlling the optimization speed or learning rate, introducing low-variance regression methods, and applying ideas from robust regression we can produce non-parametric regression procedures with many desirable properties. As a by-product some of these modifications lead directly into implementations for learning from massive datasets.

All these methods take advantage of the general form of boosting

    \hat f(x) \leftarrow \hat f(x) + E(z(y, \hat f(x)) \mid x).    (9)

So far we have taken advantage of this form only by substituting in our favorite regression procedure for E_w(z|x). I will discuss some modifications to estimating E_w(z|x) that have the potential to improve our algorithm.

2.1 Decreasing the learning rate

As several authors have phrased slightly differently, "...boosting, whatever flavor, seldom seems to overfit, no matter how many terms are included in the additive expansion."
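In the gbm package, the learning rate of this subsection and the subsampling tweak of [3] map directly onto the shrinkage and bag.fraction arguments. The following is a hedged sketch rather than a recommended specification: train and y are again hypothetical, while the arguments and gbm.perf() belong to the package.

    library(gbm)
    fit <- gbm(y ~ ., data = train,
               distribution = "gaussian",  # squared-error loss
               n.trees = 3000,             # T, the number of iterations in Figure 1
               shrinkage = 0.01,           # learning rate: smaller values usually need more trees
               bag.fraction = 0.5,         # stochastic gradient boosting: fraction of data used per tree
               cv.folds = 5)               # cross-validation for choosing the number of trees
    best.iter <- gbm.perf(fit, method = "cv")   # estimated optimal number of iterations

Because a small shrinkage slows the optimization, the usual practice is to pair it with a larger n.trees and let gbm.perf() pick the stopping point.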

