Transcription of Predicting Good Probabilities With Supervised Learning
Ithaca, NY 14853

Abstract

We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1, yielding a [...] Bayes, which make unrealistic independence assumptions, push probabilities toward 0 [...] experiment with two ways of correcting the biased probabilities predicted by some learning methods [...] much data they [...], random forests, [...]

Introduction

In many applications it is important to predict well calibrated probabilities; good accuracy or area under the ROC curve [...]: SVMs, neural nets, decision trees, memory-based learning, bagged trees, random forests, boosted trees, boosted stumps, naive [...] show how maximum margin methods such as SVMs, boosted trees, and boosted stumps tend to push predicted probabilities away from 0 [...] predict and yields a [...] bayes have the opposite bias and tend to push predictions closer to 0 [...]

[...], Bonn, Germany, [...] author(s)/owner(s).
[...] such as bagged trees and neural nets have [...] (or lack of) characteristic to each learning method, we experiment with two [...]:

Platt Scaling: a method for transforming SVM outputs from $[-\infty, +\infty]$ to posterior probabilities (Platt, 1999)

Isotonic Regression: the method used by Zadrozny and Elkan (2002; 2001) to calibrate predictions from boosted naive bayes, SVM, and decision tree models

Platt Scaling is most effective when the distortion in the predicted probabilities is [...] a more powerful calibration method that can correct any [...], this extra power comes at a [...] learning curve analysis shows that Isotonic Regression is more prone to overfitting, and thus performs worse than Platt Scaling, when data is [...], we examine how good are the probabilities predicted by each learning method after each method's predictions have [...], neural nets and bagged decision trees are the best learning methods for predicting well-calibrated probabilities prior to calibration, but after calibration the best methods are boosted trees, [...]

Calibration Methods

In this section we describe the two [...], these methods are designed for binary classification and it is [...], calibrate the binary models, and recombine the predictions (Zadrozny & Elkan, 2002).

Platt (1999) proposed transforming SVM predictions to posterior probabilities by passing them through a [...] will see in Section 4 that a [...] learning method be $f(x)$. To get calibrated probabilities, pass the output through a sigmoid:

$$P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)} \qquad (1)$$

where the parameters $A$ and $B$ are fitted using maximum likelihood estimation from a fitting training set $(f_i, y_i)$. Gradient descent is used to find $A$ and $B$ such that they are the solution to:

$$\operatorname*{argmin}_{A,B} \left\{ -\sum_i \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \right\}, \qquad (2)$$

where

$$p_i = \frac{1}{1 + \exp(A f_i + B)} \qquad (3)$$

Two questions arise: where does the sigmoid train set come from? and how to avoid overfitting to this training set? If we use the same data set that was used to train the model we want to calibrate, [...], if the model learns to discriminate the train set perfectly and orders all the negative examples before the positive examples, then the sigmoid transformation will output just a 0, [...] order to [...], however, is not a drawback, [...] avoid overfitting to the sigmoid train set, an out-of-sample model is [...] there are $N_+$ positive examples and $N_-$ negative examples in the train set, for each training example Platt Calibration uses target values $y_+$ and $y_-$ (instead of 1 and 0, respectively), where

$$y_+ = \frac{N_+ + 1}{N_+ + 2}; \qquad y_- = \frac{1}{N_- + 2} \qquad (4)$$

For a more detailed treatment, and a justification of these particular target values see (Platt, 1999).
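As a rough illustration of equations (1) to (4), the following minimal sketch fits $A$ and $B$ by plain gradient descent on the negative log-likelihood, using the smoothed targets $y_+$ and $y_-$. The function names, learning rate, iteration count, and toy data are assumptions made for this example and do not come from the paper; in practice the scores should come from data that was not used to train the model being calibrated.

```python
import math


def sigmoid_neg(z):
    """Return 1 / (1 + exp(z)), written to avoid overflow for large |z|."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def platt_targets(labels):
    """Replace 1/0 labels with the smoothed targets y+ and y- of equation (4)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    y_pos = (n_pos + 1.0) / (n_pos + 2.0)
    y_neg = 1.0 / (n_neg + 2.0)
    return [y_pos if y == 1 else y_neg for y in labels]


def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit A and B of equation (1) by gradient descent on objective (2).

    lr and n_iter are illustrative defaults, not values from the paper.
    """
    targets = platt_targets(labels)
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for f, t in zip(scores, targets):
            p = sigmoid_neg(a * f + b)  # equation (3)
            # Derivative of the negative log-likelihood w.r.t. (A*f + B) is (t - p).
            grad_a += (t - p) * f
            grad_b += (t - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_predict(score, a, b):
    """Map a raw model output to a calibrated probability, equation (1)."""
    return sigmoid_neg(a * score + b)


# Toy held-out scores and labels (hypothetical data).
scores = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
a, b = fit_platt(scores, labels)
```

With the sign convention of equation (1), a model whose scores increase with the probability of the positive class ends up with a negative $A$.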
[...], but it [...] (2002; 2001) successfully used a more general method based on Isotonic Regression (Robertson et al., 1988) to calibrate predictions from SVMs, Naive Bayes, boosted Naive Bayes, [...] that the mapping function be isotonic (monotonically increasing). That is, given the predictions $f_i$ from a model and the true targets $y_i$, the basic assumption in Isotonic Regression is that:

$$y_i = m(f_i) + \epsilon_i \qquad (5)$$

where $m$ is an isotonic (monotonically increasing) [...], given a train set $(f_i, y_i)$, the Isotonic Regression problem is finding the isotonic function $\hat{m}$ such that

$$\hat{m} = \operatorname*{argmin}_{z} \sum_i \left( y_i - z(f_i) \right)^2 \qquad (6)$$

One algorithm that finds a stepwise constant solution for the Isotonic Regression problem is pair-adjacent violators (PAV) algorithm (Ayer et al., 1955) [...]

1 [...]: training set $(f_i, y_i)$ sorted according to $f_i$
2 Initialize $\hat{m}_{i,i} = y_i$, $w_{i,i} = 1$
3 While $\exists\, i$ s.t. $\hat{m}_{k,i-1} \geq \hat{m}_{i,l}$
    Set $w_{k,l} = w_{k,i-1} + w_{i,l}$
    Set $\hat{m}_{k,l} = (w_{k,i-1}\hat{m}_{k,i-1} + w_{i,l}\hat{m}_{i,l}) / w_{k,l}$
    Replace $\hat{m}_{k,i-1}$ and $\hat{m}_{i,l}$ with $\hat{m}_{k,l}$
4 [...]: $\hat{m}(f) = \hat{m}_{i,j}$, for $f_i < f \leq f_j$

[...] the case of Platt calibration, if we use the model training set $(x_i, y_i)$ to get the training set $(f(x_i), y_i)$ for Isotonic Regression, [...]

Data Sets

We compare algorithms on 8 [...], COVTYPE and LETTER are from UCI Repository (Blake & Merz, 1998). COVTYPE has been converted to a binary problem by treating the largest class as positive and the rest as negative. We converted LETTER to boolean two [...] O as positive and the remaining 25 letters as negative, yielding a [...], yielding a difficult, but well balanced, [...] (Gualtieri et al., 1999) where the difficult class Soybean-mintill is the positive [...] is a problem from the Stanford Linear Accelerator.

            #ATTR     TRAIN SIZE   TEST SIZE   %POZ
ADULT       14/104    4000         35222       25%
COVTYPE     54        4000         25000       36%
[...]

Qualitative [...], we train models using ten decision tree styles, neural nets of many sizes, SVMs with many kernels, [...], for each problem, and for each learning algorithm, [...], model calibration can be visualized with reliability diagrams (DeGroot & Fienberg, 1982).
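The pair-adjacent violators (PAV) procedure described for Isotonic Regression can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the block representation, and the toy data are assumptions for this example, and tied scores get no special handling.

```python
def pav_fit(scores, labels):
    """Fit a stepwise-constant, non-decreasing function with PAV.

    Returns a list of blocks [value, weight, low_score, high_score],
    ordered by score. The block layout is a choice made for this sketch.
    """
    blocks = []
    for f, y in sorted(zip(scores, labels)):
        # Each example starts as its own block: m_hat = y, w = 1.
        blocks.append([float(y), 1.0, f, f])
        # Pool adjacent blocks while the earlier value is >= the later one.
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            m2, w2, _, hi = blocks.pop()
            m1, w1, lo, _ = blocks.pop()
            w = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / w, w, lo, hi])
    return blocks


def pav_predict(blocks, f):
    """Evaluate the step function: the first block whose upper score is >= f."""
    for value, _, _, hi in blocks:
        if f <= hi:
            return value
    # Scores above every training score take the last block's value.
    return blocks[-1][0]


# Toy calibration set (hypothetical data).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 1, 1, 0]
blocks = pav_fit(scores, labels)
values = [b[0] for b in blocks]
```

On this toy set the six examples collapse into three blocks with non-decreasing values, which is the piecewise constant mapping that the calibrated predictions follow.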
First, [...], the mean predicted value is plotted against the true fraction of positive [...] shows histograms of the predicted values (top row) and reliability diagrams (middle and bottom rows) [...] is that they display a sigmoidal shape on seven of the eight problems (footnote 1), motivating the use of a sigmoid to [...] of the figure show sigmoids fitted using Platt's [...] (top row in Figure 1), note that almost all the values predicted by boosted trees lie in the central region with few predictions approaching 0 [...], a highly skewed data set that has only 3% positive [...], though careful examination of the histogram shows that even on this problem there is a sharp drop in the number of cases predicted to have [...] To show how calibration transforms predictions, we plot histograms and reliability diagrams for the eight problems [...] (Figure 2) and Isotonic Regression (Figure 3).

Footnote 1: Because boosting overfits on the ADULT problem, the best performance is [...] allowed to continue for more iterations, it will display the same sigmoidal shape on ADULT [...]
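The points of a reliability diagram, mean predicted value against the fraction of positives, can be computed as in the sketch below. The binning scheme is an assumption for illustration: the text above does not preserve how predictions were grouped, so ten equal-width bins over [0, 1] are used here, and the function name and toy data are hypothetical.

```python
def reliability_points(probs, labels, n_bins=10):
    """Return (mean predicted value, fraction of positives, count) per non-empty bin.

    Equal-width bins over [0, 1]; n_bins=10 is an assumed default.
    """
    sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        # A prediction of exactly 1.0 goes into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        sums[b] += p
        positives[b] += 1 if y == 1 else 0
        counts[b] += 1
    return [
        (sums[b] / counts[b], positives[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


# Toy predictions and labels (hypothetical data).
probs = [0.05, 0.15, 0.12, 0.95, 0.85, 0.88]
labels = [0, 0, 1, 1, 1, 0]
points = reliability_points(probs, labels)
```

For a well calibrated model the points lie near the diagonal; a sigmoidal pattern in these points is the distortion discussed in the text.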
The figures show that calibration undoes the shift in probability mass caused by boosting: after calibration many more cases have predicted probabilities near 0 [...], transforming predictions using Platt Scaling or Isotonic Regression yields a significant improvement in the predicted probabilities, [...] apparent in the histograms: because Isotonic Regression generates a piecewise constant function, the histograms are coarse, while the histograms generated by Platt Scaling are smoother. See (Niculescu-Mizil & Caruana, 2005) [...] shows the prediction histograms for the ten learning methods on the SLAC problem before calibration, and after calibration with Platt's [...]'s sigmoid-shaped reliability plots (second and third rows, respectively, of Figure 6). [...], the sigmoidal shape of the reliability plots co-occurs with the concentration of mass in the center of the histograms of predicted values, [...] and 1 [...] which shows histograms of predicted values and reliability plots for neural nets tells a [...], a task that isn't [...], scaling might hurt neural net calibration a [...]'s method have trouble fitting the tails properly, effectively pushing predictions away from 0 and 1 [...] Figure 4 look similar to the histograms for boosted trees after Platt Scaling in Figure 2, giving us confidence that the histograms reflect the underlying structure [...]

Footnote 2: SVM predictions are scaled to [0, 1] by $(x - \min)/(\max - \min)$.
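The rescaling in footnote 2 is a plain min-max transformation. The sketch below is an illustration under assumptions: the function name is hypothetical, the minimum and maximum are taken over the values passed in, and the constant-input fallback is a choice made here rather than something the text specifies.

```python
def min_max_scale(values):
    """Scale raw outputs to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: constant inputs map to the midpoint.
        return [0.5 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Toy SVM margins (hypothetical data).
scaled = min_max_scale([-2.0, 0.0, 2.0])
```

This only moves the outputs into the unit interval so they can be plotted and binned; it does not make them calibrated probabilities.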
[Figures: histograms of predicted values and reliability diagrams. x-axis: Mean Predicted Value; y-axis: Fraction of Positives. Caption fragment: [...]'s method.]

[...], we could conclude that the LETTER and HS problems, given the available features, have well defined classes with a small number of cases in the "gray" region, while in the SLAC problem the two classes have [...] is interesting to note that neural networks with a single sigmoid output unit can be viewed as a linear classifier (in the span of its hidden units) with a [...] SVMs and boosted trees after they have been calibrated using Platt's [...] is not surprising that logistic regression pre[...]

[Figures: histograms of predicted values and reliability diagrams. x-axis: Mean Predicted Value; y-axis: Fraction of Positives.]

[...], it is [...], we can deduce that regular decision trees also are well calibrated on average, in the sense that if decision trees are trained on different samples of the data and their predictions averaged, [...], a single decision tree has high variance and this variance affects its [...], six and seven in Figure 6 show the histograms (before and after calibration) and reliability diagrams for logistic regression, bagged trees, and decision trees on the SLAC [...], and not well calibrated on HS, COVTYPE, [...], RFs seem to exhibit, although to a lesser extent, the same behavior as the max margin methods: predicted values are slightly pushed toward the middle of the histogram and the reliability plots show a sigmoidal shape (more accentuated on the LETTER problems and less so on COVTYPE, MEDIS and HS).