Bagging and Boosting - mit.edu

Transcription of Bagging and Boosting - mit.edu

Bagging and Boosting. Class 10, 13 March 2006. Sasha Rakhlin.

Plan
- Bagging and sub-sampling methods
- Bias-variance and stability for bagging
- Boosting and correlations of machines
- Gradient descent view of boosting

Bagging (Bootstrap AGGregatING)
Given a training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, sample $T$ sets of $n$ elements from $D$ (with replacement): $D_1, D_2, \ldots, D_T$, that is, $T$ quasi-replica training sets. Train a machine on each $D_i$, $i = 1, \ldots, T$, and obtain a sequence of $T$ outputs $f_1(x), \ldots, f_T(x)$.

Bagging (cont.)
The final aggregate classifier can be:
- for regression, $f(x) = \frac{1}{T}\sum_{i=1}^{T} f_i(x)$, the average of the $f_i$ for $i = 1, \ldots, T$;
- for classification, $f(x) = \mathrm{sign}\big(\sum_{i=1}^{T} f_i(x)\big)$, or the majority vote $f(x) = \mathrm{sign}\big(\sum_{i=1}^{T} \mathrm{sign}(f_i(x))\big)$.
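As a concrete illustration, the following is a minimal sketch of the scheme above in Python (not part of the original slides); it assumes labels in $\{-1, +1\}$ and uses scikit-learn decision trees as the base machine, with majority vote for aggregation.

```python
# Minimal bagging sketch (illustrative; the helper names and the choice of
# decision trees as the base machine are assumptions, not from the slides).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_machines(X, y, T=50, seed=0):
    """Train T classifiers, each on a bootstrap sample of size n drawn from (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    machines = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)          # sample n points with replacement
        machines.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return machines

def bagged_predict(machines, X):
    """Majority vote f(x) = sign(sum_i sign(f_i(x))), labels assumed in {-1, +1}."""
    votes = np.sum([np.sign(f_i.predict(X)) for f_i in machines], axis=0)
    return np.sign(votes)
```

The sub-sampling variations described next amount to changing how each replicate is drawn, e.g. `rng.choice(n, size=n_prime, replace=False)` for subsamples of size $n' < n$ without replacement.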

Variation I: sub-sampling methods
- Standard bagging: each of the $T$ subsamples has size $n$ and is created with replacement.
- Sub-bagging: create $T$ subsamples of size only $n' < n$.
- No replacement: same as bagging or sub-bagging, but using sampling without replacement.
- Overlap vs. non-overlap: should the $T$ subsamples overlap? (Non-overlapping: create $T$ subsamples, each with $n/T$ training points.)

Bias-variance for regression (Breiman, 1996)
Let $I[f] = \int (f(x) - y)^2\, p(x, y)\, dx\, dy$ be the expected risk and $f_0$ the regression function. With $\bar f(x) = E_S f_S(x)$, if we define the bias as $\int (f_0(x) - \bar f(x))^2\, p(x)\, dx$ and the variance as $E_S\big\{\int (f_S(x) - \bar f(x))^2\, p(x)\, dx\big\}$, we have the decomposition
$$E_S\{I[f_S]\} = I[f_0] + \text{bias} + \text{variance}.$$

Bagging reduces variance (intuition)
If each single classifier is unstable, that is, it has high variance, the aggregated classifier $f$ has a smaller variance than a single original classifier. The aggregated classifier $f$ can be thought of as an approximation to the true average $\bar f$, obtained by replacing the probability distribution $p$ with the bootstrap approximation to $p$, which concentrates mass $1/n$ at each point $(x_i, y_i)$.
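A small Monte Carlo experiment (not from the slides; all names and settings are illustrative assumptions) makes the variance-reduction intuition concrete: fully grown regression trees are unstable, and averaging them over bootstrap replicates shrinks the variance of the prediction across training sets.

```python
# Estimate, on a test grid, the variance of a single tree vs. a bagged average
# of T trees, where the variance is taken over independent training sets.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 200).reshape(-1, 1)

def fit_predict(bagged, n=50, T=25):
    X = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.5, size=n)   # noisy training set
    if not bagged:
        return DecisionTreeRegressor().fit(X, y).predict(x_test)
    preds = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)                           # bootstrap replicate
        preds.append(DecisionTreeRegressor().fit(X[idx], y[idx]).predict(x_test))
    return np.mean(preds, axis=0)                                  # average of the f_i

single_var = np.var([fit_predict(False) for _ in range(100)], axis=0).mean()
bagged_var = np.var([fit_predict(True) for _ in range(100)], axis=0).mean()
print(single_var, bagged_var)   # the bagged predictor typically shows the smaller variance
```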

Variation II: weighting and combining alternatives
- No subsampling; instead, each machine uses different weights on the training points.
- Instead of equal voting, use weighted voting.
- Instead of voting, combine using other schemes.

Weak and strong learners
Kearns and Valiant in 1988/1989 asked if there exist two types of hypothesis spaces of classifiers:
- Strong learners: given a large enough dataset, the classifier can learn the target function arbitrarily accurately (with probability $1 - \delta$).
- Weak learners: given a large enough dataset, the classifier can barely learn the target function, with accuracy just above chance, $1/2 + \gamma$ (with probability $1 - \delta$).
The hypothesis boosting problem: are the above equivalent?

The original boosting (Schapire, 1990)
For classification only:

1. Train a first classifier $f_1$ on a training set drawn from a probability $p(x, y)$. Let $\varepsilon_1$ be the obtained training performance (error).
2. Train a second classifier $f_2$ on a training set drawn from a probability $p_2(x, y)$ such that it has half its measure on the event that $f_1$ makes a mistake and half on the rest. Let $\varepsilon_2$ be the obtained performance.
3. Train a third classifier $f_3$ on the disagreements of the first two, that is, drawn from a probability $p_3(x, y)$ which has its support on the event that $f_1$ and $f_2$ disagree. Let $\varepsilon_3$ be the obtained performance.

The original boosting (cont.)
Main result: if $\varepsilon_i < p$ for all $i$, the boosted hypothesis $g = \mathrm{MajorityVote}(f_1, f_2, f_3)$ has training performance no worse than $3p^2 - 2p^3$.
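As a side remark (not on the slides), the form of this bound is easy to motivate in the idealized case where the three classifiers err independently, each with probability $p$: the majority vote is wrong only when at least two of them are wrong, so
$$\Pr[\text{majority vote wrong}] = \binom{3}{2} p^2 (1 - p) + p^3 = 3p^2 - 2p^3,$$
which is strictly smaller than $p$ whenever $p < 1/2$. Schapire's construction does not assume independence; this is only a sanity check on the functional form.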

Adaboost (Freund and Schapire, 1996)
The idea is to adaptively resample the data:
- Maintain a probability distribution over the training set;
- Generate a sequence of classifiers in which the next classifier focuses on the samples where the previous classifier failed;
- Weigh the machines according to their performance.

Adaboost: given a class $F = \{f : X \to \{-1, 1\}\}$ of weak learners and the data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, $y_i \in \{-1, 1\}$, initialize the weights as $w_1(i) = 1/n$. For $t = 1, \ldots, T$:
1. Find a weak learner $f_t$ based on the weights $w_t(i)$;
2. Compute the weighted error $\varepsilon_t = \sum_{i=1}^{n} w_t(i)\, I(y_i \neq f_t(x_i))$;
3. Compute the importance of $f_t$ as $\alpha_t = \frac{1}{2}\ln\!\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)$;
4. Update the distribution
$$w_{t+1}(i) = \frac{w_t(i)\, e^{-\alpha_t y_i f_t(x_i)}}{Z_t}, \qquad Z_t = \sum_{i=1}^{n} w_t(i)\, e^{-\alpha_t y_i f_t(x_i)}.$$

Adaboost (cont.)
Adopt as the final hypothesis
$$g(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t f_t(x)\right).$$
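A minimal sketch of the procedure above in Python (illustrative, not from the slides); it assumes labels in $\{-1, +1\}$ and uses depth-one decision trees (stumps) from scikit-learn as the weak learners $f_t$.

```python
# Adaboost as described above; decision stumps stand in for the weak learner
# class F, and labels are assumed to be in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=100):
    n = len(X)
    w = np.full(n, 1.0 / n)                     # w_1(i) = 1/n
    machines, alphas = [], []
    for t in range(T):
        # 1. weak learner f_t trained with weights w_t(i)
        f_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = f_t.predict(X)
        # 2. weighted error eps_t
        eps = np.sum(w * (pred != y))
        if eps <= 0 or eps >= 0.5:              # perfect or no-better-than-chance learner
            break
        # 3. importance alpha_t = 1/2 ln((1 - eps_t) / eps_t)
        alpha = 0.5 * np.log((1 - eps) / eps)
        # 4. reweight and renormalize: w_{t+1}(i) ~ w_t(i) exp(-alpha_t y_i f_t(x_i))
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        machines.append(f_t)
        alphas.append(alpha)
    return machines, alphas

def adaboost_predict(machines, alphas, X):
    """Final hypothesis g(x) = sign(sum_t alpha_t f_t(x))."""
    score = sum(a * f.predict(X) for a, f in zip(alphas, machines))
    return np.sign(score)
```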

Theory of boosting
We define the margin of $(x_i, y_i)$ according to the real-valued function $g$ to be $\mathrm{margin}(x_i, y_i) = y_i g(x_i)$. Note that this notion of margin is different from the SVM margin. This defines a margin for each training point!

Performance of Adaboost
Theorem: let $\gamma_t = 1/2 - \varepsilon_t$ (how much better $f_t$ is on the weighted sample than tossing a coin). Then
$$\frac{1}{n}\sum_{i=1}^{n} I(y_i g(x_i) < 0) \le \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}.$$

Gradient descent view of boosting
We would like to minimize
$$\frac{1}{n}\sum_{i=1}^{n} I(y_i g(x_i) < 0)$$
over the linear span of some base class $F$. Think of $F$ as the weak learners. Two problems: a) the linear span of $F$ can be huge and searching for the minimizer directly is intractable; b) the indicator is non-convex and the problem can be shown to be NP-hard even in simple cases. Solution to b): replace the indicator $I(yg(x) < 0)$ with a convex upper bound $\phi(yg(x))$. Solution to a)?
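Before addressing a), a one-line numerical check (not from the slides) that a convex surrogate such as the exponential loss, which reappears at the end of this section, indeed upper-bounds the indicator:

```python
# phi(z) = exp(-z) dominates the 0-1 loss I(z < 0) at every margin value z = y g(x).
import numpy as np

z = np.linspace(-3, 3, 121)                 # grid of margin values y * g(x)
indicator = (z < 0).astype(float)           # 0-1 loss I(y g(x) < 0)
surrogate = np.exp(-z)                      # convex upper bound phi(y g(x)) = e^{-y g(x)}
assert np.all(surrogate >= indicator)       # phi is an upper bound everywhere on the grid
```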

Gradient descent view of boosting (cont.)
Let us search over the linear span of $F$ step by step. At each step $t$, we add a new function $f_t \in F$ to the existing $g = \sum_{i=1}^{t-1} \alpha_i f_i$. Define $C(g) = \frac{1}{n}\sum_{i=1}^{n} \phi(y_i g(x_i))$. We wish to find $f_t \in F$ to add to $g$ such that $C(g + \epsilon f_t)$ decreases. The desired direction is $-\nabla C(g)$. We choose the new function $f_t$ such that it has the greatest inner product with $-\nabla C(g)$, i.e., it maximizes $-\langle \nabla C(g), f_t \rangle$.

Gradient descent view of boosting (cont.)
One can verify that
$$-\langle \nabla C(g), f_t \rangle = -\frac{1}{n^2}\sum_{i=1}^{n} y_i f_t(x_i)\, \phi'(y_i g(x_i)).$$
Hence, finding $f_t$ maximizing $-\langle \nabla C(g), f_t \rangle$ is equivalent to minimizing the weighted error
$$\sum_{i=1}^{n} w_t(i)\, I(f_t(x_i) \neq y_i), \qquad \text{where } w_t(i) := \frac{\phi'(y_i g(x_i))}{\sum_{j=1}^{n} \phi'(y_j g(x_j))}.$$
For $\phi(yg(x)) = e^{-yg(x)}$ this becomes Adaboost.
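A minimal sketch of this functional-gradient view (not from the slides; the names, the fixed step size, and the choice of decision stumps are assumptions). With the exponential surrogate $\phi(z) = e^{-z}$, the per-round weights $w_t(i) \propto -\phi'(y_i g(x_i)) = e^{-y_i g(x_i)}$ are exactly Adaboost's weights, which is the equivalence claimed above.

```python
# Functional gradient descent on C(g) = (1/n) sum_i phi(y_i g(x_i)) with
# phi(z) = exp(-z): each round fits a weak learner to the reweighted data,
# i.e. it minimizes the weighted 0-1 error derived above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gradient_boost_exp(X, y, T=100, step=0.5):
    n = len(X)
    g = np.zeros(n)                              # current values g(x_i) on the training set
    machines, coefs = [], []
    for t in range(T):
        w = np.exp(-y * g)                       # proportional to -phi'(y_i g(x_i))
        w /= w.sum()
        # maximizing -<grad C(g), f_t> over F == minimizing the weighted 0-1 error
        f_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        g = g + step * f_t.predict(X)            # take a step of fixed size along f_t
        machines.append(f_t)
        coefs.append(step)
    return machines, coefs
```

Here the step size is fixed for simplicity; choosing it at each round by a line search on $C$ recovers the Adaboost coefficient $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$.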

