Transcription of A Tutorial on Conformal Prediction
1 JournalofMachineLearningResearch9 (2008)371-421 Submitted8/07;Published3/08A TutorialonConformalPredictionGlennShafer CentreDepartmentofComputerScienceRoyalHo lloway, UniversityofLondonEgham,Surrey TW200EX,UKEditor:Sanjoy , togetherwitha methodthatmakesa Prediction yofa labely, it producesa setoflabels,typicallycontaining y, thatalsocontainsywithprobability1 .Conformalpredictioncanbeappliedtoany methodforproducing y: a nearest-neighbormethod,asupport-vectorma chine,ridgeregression, designedforanon-linesettinginwhichlabels arepredictedsucces-sively, eachonebeingrevealedbeforethenextis andvaluablefeatureofconformalpredictioni s thatif thesuccessive examplesaresampledindependentlyfromthesa medistribution,thenthesuccessive predictionswillberight1 ofthetime,eventhoughthey examplesaresampledindependently, treatmentofthetopicis providedinAlgorithmicLearningin a RandomWorld, byVladimirVovk,Alex Gammerman,andGlennShafer(Springer, 2005).
2 Keywords:confidence,on-linecompressionmo deling,on-linelearning, goodis yourprediction y? If youarepredictingthelabelyofa new object,how confidentareyouthaty= y? If thelabelyis a number, howclosedoyouthinkit is to y? Inmachinelearning,thesequestionsareusual lyansweredina methodformakinga Prediction y, conformalpredictionproducesa 95% Prediction . Alsoat ComputerLearningResearchCentre, aset 0:05thatcontainsywithprobabilityatleast9 5%.Typically 0:05alsocontainstheprediction y. We call ythepointprediction, andwecall 0:05theregionprediction. Inthecaseofregression,whereyis a number, 0:05is typicallyaninterval around y. Inthecaseofclassification,whereyhasa limitednumberofpossiblevalues, 0:05mayconsistofa fewofthesevaluesor, intheidealcase, methodofpointpredictionforclassification orre-gression,includingsupport-vectormac hines,decisiontrees,boosting,neuralnetwo rks, ,weconstructanonconformitymeasure,whichm easureshowunusualanexamplelooksrelative topreviousexamples, nonconformitymeasure,theconformalalgorit hmproducesa predictionregion foreveryprobabilityoferror.
3 Theregion is a(1 )-predictionregion; it containsywithprobabil-ityat least1 . Theregionsfordifferent arenested:when 1 2, sothat1 1is a lowerlevelofconfidencethan1 2, wehave 1 2. If containsonlya singlelabel(theidealoutcomeinthecaseofcl assification),wemayaskhowsmall canbemadebeforewemustenlarge byaddinga secondlabel;thecorrespondingvalueof1 is 4,theconformalalgorithmis designedforanon-linesetting,inwhichwepre dictthelabelsofobjectssuccessively, seeingeachlabelafterwehave predictedit ynofthenthlabelynmayuseobservedfeaturesx nofthenth objectandtheprecedingexamples(x1;y1); : : : ;(xn 1;yn 1). Thesizeofthepredictionregion implementingtheconformalalgorithmmaywish tolookfirstattheelementaryexamplesin 2,theon-linepictureleadstoa , a methodforfinding95%predictionregionswasc onsideredvalidif it hada 95%probabilityofcontainingthelabelpredic ted,becausebythelawoflargenumbersit wouldthenbecorrect95%ofthetimewhenrepeat edlyappliedto theon-linepicture,werepeatedlyapplya (x1;y1); : : : ;(xn 1;yn 1)andxntopredictyn, weuse(x1;y1); : : : ;(xn 1;yn 1);(xn;yn)andxn+1topredictyn+1, 95%on-linemethodtobevalid,95% ,conformalpredictionis validin thisnew validinthenew on-linesenseis theoneinwhichtheexamples(xi;yi)aresample dindependentlyfroma constantpopulation thatis,froma fixedbutunknownprobabilitydistributionQ.
4 It is alsovalidundertheslightlyweakerassumptio nthattheexamplesareprobabilisticallyexch angeable(see 3)andunderotheron-linecompressionmodels, includingthewidelyusedGaussianlinearmode l(see 5).Thevalidityofconformalpredictionunder thesemodelsis methodforproducing95%predictionregions,w earealsointer-estedinitsefficiency. It is efficientif thepredictionregionis usuallyrelativelysmallandthereforeinform ative. Inclassification,wewouldlike toseea 95%predictionregionsosmallthatit containsonlythesinglepredictedlabel yn. Inregression,wewouldlike toseea verynarrow interval aroundthepredictednumber ONCONFORMALPREDICTIONT heclaimof95%confidencefora 95%conformalpredictionregionis validunderexchange-ability, ,wemaychoosea nonconformitymeasurethatwillbeefficienti f wehave priorprobabilitiesforQ, wemayusethesepriorprobabilitiestoconstru cta pointpredictor ynanda ,wemightuseas ynthemeanoftheposteriordistributionforyn giventhefirstn 1 examplesandxn; intheclassificationcase,wemightusethelab elwiththegreatestposteriorprobability.
5 Thisstrategyoffirstguaranteeingvalidityu ndera relativelyweakassumptionandthenseekingef ficiency understrongerassumptionsconformstoadvice longgivenbyJohnTukey andothers(Tukey, 1986;Smallet al.,2006).Conformalpredictionis studiedindetailinAlgorithmicLearningina RandomWorld, byVovk,Gammerman,andShafer(2005).A recentexpositionbyGammermanandVovk(2007) emphasizesconnectionswiththetheoryofrand omness,Bayesianmethods, , themeaningofexchangeability, leave asidemany importanttopicsthataretreatedinAlgorithm icLearningina RandomWorld, traditionthatcanbetracedbacktoJerzyNeyma n sintroductionofconfidenceintervalsforpar ameters(Neyman,1937)andeventoworkbyLapla ceandothersinthelate18thcentury. Buttheshiftofemphasistoprediction(fromes timationofparameters)andtotheon-linesett ing(whereourpredictionruleisrepeatedlyup dated) thatitssuccessive beingright95%ofthetime inanunusuallydirectway.
6 In ,weillustratethispointwitha well-wornexample, ,wecontrastconfidencewithfull-fledgedcon ditionalprobability. Thiscontrasthasbeenthetopicofendlessdeba tebetweenthosewhofindconfidencemethodsin formative (classicalstatisticians)andthosewhoinsis tthatfull-fledgedprobabilitiesbasedonall one s informationarealwayspreferable,evenif theonlyavailableprobabilitiesareverysubj ective (Bayesians).Becausethedebateusuallyfocus esonestimatingparametersratherthanpredic tingfutureobservations,andbecausesomerea dersmaybeunawareofthedebate,wetake containsthetruth95% make thismoreprecise, , make onepredictionafteranother, make clearwhatvaliditymeansandhow it canbeobtainedin thison-linepicture,weconsiderpredictionu nderanassumptionoftenmadeina firstcourseinstatistics:373 SHAFERANDVOVKR andomvariablesz1;z2; : : :areindependentlydrawnfroma , whoexplainedhow togivea 95%predictioninterval forznbasedonz1; : : : ;zn 1thatis validin willstateFisher spredictionrule,illustrateitsapplication todata,andexplainwhy it is ,thepredictionsgivenbyFisher s notsurprising,becausewearepredictingznba sedonoldexamplesz1.
7 Zn ,moreprecisepredictionis possibleonlyinthemorefavorablebut morecomplicatedset-upwhereweknow somefeaturesxnofthenew exampleandcanusebothxnandtheoldexamplest opredictsomeotherfeatureyn. Butthesimplicityoftheset-upwherewepredic tznfromz1; : : : ;zn 1alonewillhelpusmake SPREDICTIONINTERVALS upposeweobserve , westartpredicting;forn=3;4; : : :, wepredictznafterhavingseenz1; : : : ;zn 1. Thenaturalpointpredictorforznis theaveragesofar:zn 1:=1n 1n 1 i=1zi;butwewanttogive anintervalthatwillcontainzn95% s answer(1935) 1, calculates2n 1:=1n 2n 1 i=1(zi zn 1)2;whichis canusuallyassumethatit is tableofpercentilesfort-distributions,fin dt0:025n 2, thepointthatthet-distributionwithn 2 degreesoffreedomexceedsexactly2:5% 1 t0:025n 2sn 1rnn 1:(1)Fisherbasedthisprocedureonthefactth atzn zn 1sn 1rn 1nhasthet-distributionwithn 2 degreesoffreedom, (1)willcontainznwithprobability95% canillustrate(1)usingsomenumbersgenerate din1900bythestudentsofEmanuelCzuber(1851 1925).
8 Thesenumbersareintegers,butthey theoreticallyhave a s first19numbers,z1; : : : ;z19:17;20;10;17;12;15;19;22;17;19;14;22 ;18;17;13;12;18;15;17:Fromthem,wecalcula tez19=16:53;s19=3:31:Theupper2:5%pointfo rthet-distributionwith18degreesoffreedom ,t0:02518, is2 (1)forz20comesoutto[9:40;24:13].Takingin toaccountourknowledgethatz20willbeaninte ger, wecansaythatthe95%predictionis thatz20willbeanintegerbetween10and24, correct;z20is picturewherethefor-mula(1)is ,wemightconductmanyseparateexperimentsth ateachconsistsofdrawing100randomnumbersf roma normaldistributionandthenpredictinga 101stdraw using(1).Eachexperimentmightinvolve a differentnormaldis-tribution(adifferentm eanandvariance),butprovidedtheexperiment sareindependentfromeachother, thelaw oflargenumberswillapply. Eachtimetheprobabilityis 95%thatz101willbeintheinterval,andsothis eventwillhappenapproximately95% ,becausetheexperimentinvolvedinpredictin gz101fromz1; : : : ;z100is notentirelyindependentoftheexperimentinv olvedinpredicting,say,z105fromz1; : : : ;z104.
9 Masteroftheanalyticalgeometryofthenormal distribution(Fisher,1925;Efron,1969),Fis herwouldhave noticed,hadhethoughtaboutit,thatthisover lapdoesnotactuallymatter. Asweshow ,theeventszn 1 t0:025n 2sn 1rnn 1 zn zn 1+t0:025n 2sn 1rnn 1(2) , ,wecanconcludethatapproximately95% calltheevents(2) (1)generalizestolinearregressionwithnorm allydistributederrors, , Fisher, thetextbookauthorsdidnothave s studentsrandomlydrewballsfromanurncontai ningsixballs,numbered1 drewaball,they noteditslabelandputit ,they recordedthenumberoftimesthattheballlabel edwitha 1 wasdrawn(Czuber,1914, 335).Thisshouldhave a binomialdistributionwithparameters100and 1=6,andit is thereforeapproximatelynormalwithmean100= 6=16:67andstandarddeviationp500=36=3 willreturntothegeneralizationtolinearreg ressionin thetextbookintervalsasconformalpredictio nregionswithintheon-lineGaussianlinearmo del, s notionofconfidencelooksata oftheziareobserved,theevent(2)involvesmu ltipleuncertainties:zn 1,sn 1, (2)holdsis 95%.
10 We is afterweobserve thefirstn 1 examplesthatwecalculatezn 1andsn 1andthencalculatetheinterval (1),andwewouldlike to beableto sayat thispointthatthereis stilla 95%probabilitythatznwillbein(1).Butthis, it seems,is madeareinsufficienttoenableustofinda numericalprobabilityfor(2) a conditionalprobabilityfor(2)givenz1; : : : ;zn 1,butit bestunderstoodfromthegame-theoreticpoint ofview. A , forexample,is anoffertotake eithersideofa betat19to1 validif theofferdoesnotputthepersonmakingit atadisadvantage,inasmuchasa longsequenceofequallyreasonableofferswil lnotallow anopponenttomultiplythecapitalheorsheris ksbya largefactor(ShaferandVovk,2001).Whenweas sumea probabilitymodel(suchasthenormalmodelwej ustusedortheon-linecompressionmodelswewi llstudylater),weareassumingthatthemodel s ,a 95%conformalpredictoris a ruleforusingtheprecedingexamples(x1;y1); : : : ;(xn 1;yn 1)anda new objectxntogive a set,say 0:05((x1;y1); : : : ;(xn 1;yn 1);xn);(3)thatwepredictwillcontainyn.