
A Tutorial on Conformal Prediction - Journal of Machine ...




Transcription of A Tutorial on Conformal Prediction - Journal of Machine ...

Journal of Machine Learning Research 9 (2008) 371-421. Submitted 8/07; Published 3/08.

A Tutorial on Conformal Prediction

Glenn Shafer and Vladimir Vovk
Computer Learning Research Centre, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK

Editor: Sanjoy Dasgupta

Abstract

Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability $\varepsilon$, together with a method that makes a prediction $\hat y$ of a label $y$, it produces a set of labels, typically containing $\hat y$, that also contains $y$ with probability $1-\varepsilon$. Conformal prediction can be applied to any method for producing $\hat y$: a nearest-neighbor method, a support-vector machine, ridge regression, and so on. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right $1-\varepsilon$ of the time, even though they are based on an accumulating data set rather than on independent data sets. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005).

Keywords: confidence, on-line compression modeling, on-line learning, prediction regions

1. Introduction

How good is your prediction $\hat y$? If you are predicting the label $y$ of a new object, how confident are you that $y = \hat y$? If the label $y$ is a number, how close do you think it is to $\hat y$? In machine learning, these questions are usually answered in a fairly rough way from past experience. Conformal prediction uses past experience to determine precise levels of confidence in predictions. Given a method for making a prediction $\hat y$, conformal prediction produces a 95% prediction region: a set $\Gamma^{0.05}$ that contains $y$ with probability at least 95%. Typically $\Gamma^{0.05}$ also contains the prediction $\hat y$. We call $\hat y$ the point prediction, and we call $\Gamma^{0.05}$ the region prediction. In the case of regression, where $y$ is a number, $\Gamma^{0.05}$ is typically an interval around $\hat y$.

In the case of classification, where $y$ has a limited number of possible values, $\Gamma^{0.05}$ may consist of a few of these values or, in the ideal case, just one. Given any method of point prediction for classification or regression, including support-vector machines, decision trees, boosting, and neural networks, we construct a nonconformity measure, which measures how unusual an example looks relative to previous examples. From the nonconformity measure, the conformal algorithm produces a prediction region $\Gamma^{\varepsilon}$ for every probability of error $\varepsilon$. The region $\Gamma^{\varepsilon}$ is a $(1-\varepsilon)$-prediction region; it contains $y$ with probability at least $1-\varepsilon$. The regions for different $\varepsilon$ are nested: when $\varepsilon_1 \ge \varepsilon_2$, so that $1-\varepsilon_1$ is a lower level of confidence than $1-\varepsilon_2$, we have $\Gamma^{\varepsilon_1} \subseteq \Gamma^{\varepsilon_2}$.
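To make these regions concrete, here is a minimal Python sketch of one standard form of the conformal algorithm for classification (the full algorithm is given in §4 of the paper). The nearest-neighbor style nonconformity measure and all function names are illustrative assumptions, not the paper's notation; a candidate label is kept whenever its p-value, the fraction of examples whose nonconformity score is at least as large as the new example's, exceeds $\varepsilon$.

    import numpy as np

    def nonconformity(train_xs, train_ys, x, y):
        # Toy nonconformity score (an assumption for this sketch):
        # distance from x to the nearest old example with the same label.
        same = [xi for xi, yi in zip(train_xs, train_ys) if yi == y]
        if not same:
            return float("inf")
        return min(np.linalg.norm(np.asarray(x) - np.asarray(xi)) for xi in same)

    def conformal_region(train_xs, train_ys, x_new, labels, eps):
        # (1 - eps)-prediction region for the label of x_new.
        region = []
        for y in labels:
            # Provisionally add (x_new, y), then score every example
            # against all the others.
            xs = list(train_xs) + [x_new]
            ys = list(train_ys) + [y]
            n = len(xs)
            scores = [nonconformity(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:],
                                    xs[i], ys[i]) for i in range(n)]
            # p-value: fraction of scores at least as large as the new one.
            p = sum(s >= scores[-1] for s in scores) / n
            if p > eps:  # keep y whenever it cannot be rejected at level eps
                region.append(y)
        return region

Because each label's p-value does not depend on eps, the regions this sketch produces for different eps are automatically nested, as stated above.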

If $\Gamma^{\varepsilon}$ contains only a single label (the ideal outcome in the case of classification), we may ask how small $\varepsilon$ can be made before we must enlarge $\Gamma^{\varepsilon}$ by adding a second label; the corresponding value of $1-\varepsilon$ is the confidence we assert in the predicted label. As explained in §4, the conformal algorithm is designed for an on-line setting, in which we predict the labels of objects successively, seeing each label after we have predicted it. Our prediction $\hat y_n$ of the $n$th label $y_n$ may use the observed features $x_n$ of the $n$th object and the preceding examples $(x_1,y_1),\dots,(x_{n-1},y_{n-1})$. The size of the prediction region may also depend on these data; readers most interested in implementing the conformal algorithm may wish to look first at the elementary examples in §4. As explained in §2, the on-line picture leads to a new concept of validity. Classically, a method for finding 95% prediction regions was considered valid if it had a 95% probability of containing the label predicted, because by the law of large numbers it would then be correct 95% of the time when repeatedly applied to independent data sets. In the on-line picture, we repeatedly apply a method to an accumulating data set: after using $(x_1,y_1),\dots,(x_{n-1},y_{n-1})$ and $x_n$ to predict $y_n$, we use $(x_1,y_1),\dots,(x_{n-1},y_{n-1}),(x_n,y_n)$ and $x_{n+1}$ to predict $y_{n+1}$, and so on.
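The confidence asserted in a single predicted label, as just described, can be read off the same p-values: the region shrinks to one label as soon as $\varepsilon$ exceeds the second-largest p-value. A sketch of this computation, reusing the toy nonconformity function above and assuming at least two candidate labels:

    def confidence_and_prediction(train_xs, train_ys, x_new, labels):
        # p-value of each candidate label, computed as in conformal_region.
        ps = {}
        for y in labels:
            xs = list(train_xs) + [x_new]
            ys = list(train_ys) + [y]
            n = len(xs)
            scores = [nonconformity(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:],
                                    xs[i], ys[i]) for i in range(n)]
            ps[y] = sum(s >= scores[-1] for s in scores) / n
        ranked = sorted(ps, key=ps.get, reverse=True)
        y_hat = ranked[0]  # point prediction: the label hardest to reject
        # The region is {y_hat} alone once eps exceeds the second-largest
        # p-value, so the asserted confidence is:
        confidence = 1.0 - ps[ranked[1]]
        return y_hat, confidence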

For a 95% on-line method to be valid, 95% of these successive predictions must be correct. Under minimal assumptions, conformal prediction is valid in this new and more demanding sense. One setting in which conformal prediction is valid in the new on-line sense is the one in which the examples $(x_i,y_i)$ are sampled independently from a constant population, that is, from a fixed but unknown probability distribution $Q$. It is also valid under the slightly weaker assumption that the examples are probabilistically exchangeable (see §3) and under other on-line compression models, including the widely used Gaussian linear model (see §5). The validity of conformal prediction under these models is demonstrated in Appendix A. In addition to the validity of a method for producing 95% prediction regions, we are also interested in its efficiency.
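A small simulation can check this on-line notion of validity empirically. The sketch below draws examples i.i.d. from an invented two-class distribution (my choice, not the paper's), predicts each label before it is revealed using the conformal_region sketch above, and counts errors; under exchangeability the long-run error rate should not exceed eps.

    rng = np.random.default_rng(0)
    xs, ys, errors, trials = [], [], 0, 0
    for n in range(100):
        y = int(rng.integers(2))             # hidden label
        x = rng.normal(loc=2.0 * y, size=2)  # features depend on the label
        if n >= 10:                          # start predicting after a warm-up
            region = conformal_region(xs, ys, x, labels=[0, 1], eps=0.05)
            errors += y not in region
            trials += 1
        xs.append(x)                         # label revealed: the data
        ys.append(y)                         # set accumulates
    print(f"on-line error rate: {errors / trials:.3f} (about 0.05 or less)")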

A method is efficient if the prediction region is usually relatively small and therefore informative. In classification, we would like to see a 95% prediction region so small that it contains only the single predicted label $\hat y_n$. In regression, we would like to see a very narrow interval around the predicted number $\hat y_n$. The claim of 95% confidence for a 95% conformal prediction region is valid under exchangeability, whatever nonconformity measure is used. But we may choose a nonconformity measure that will be efficient under stronger assumptions. If we have prior probabilities for $Q$, we may use these prior probabilities to construct a point predictor $\hat y_n$ and a nonconformity measure. In the regression case, we might use as $\hat y_n$ the mean of the posterior distribution for $y_n$ given the first $n-1$ examples and $x_n$; in the classification case, we might use the label with the greatest posterior probability.
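As a sketch of this Bayesian route to efficiency, consider the simplest regression setting with no features, where the point predictor is the posterior mean of a normal mean with known noise variance. The conjugate update and the prior parameters here are my illustrative assumptions, not an example from the paper.

    def posterior_mean(old_ys, prior_mean=0.0, prior_var=10.0, noise_var=1.0):
        # Conjugate normal-normal update for an unknown mean.
        n = len(old_ys)
        precision = 1.0 / prior_var + n / noise_var
        return (prior_mean / prior_var + sum(old_ys) / noise_var) / precision

    def bayes_nonconformity(old_ys, y):
        # Nonconformity of y: its distance from the Bayesian point prediction.
        # If the prior is well chosen, typical labels get small scores,
        # which makes the resulting conformal regions narrow (efficient);
        # validity holds regardless, as the text emphasizes.
        return abs(y - posterior_mean(old_ys))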

This strategy of first guaranteeing validity under a relatively weak assumption and then seeking efficiency under stronger assumptions conforms to advice long given by John Tukey and others (Tukey, 1986; Small et al., 2006). Conformal prediction is studied in detail in Algorithmic Learning in a Random World, by Vovk, Gammerman, and Shafer (2005). A recent exposition by Gammerman and Vovk (2007) emphasizes connections with the theory of randomness and with Bayesian methods. Here we emphasize the on-line concept of validity and the meaning of exchangeability, and we leave aside many important topics that are treated in Algorithmic Learning in a Random World.

2. Valid Prediction Regions

Our concept of validity belongs to a tradition that can be traced back to Jerzy Neyman's introduction of confidence intervals for parameters (Neyman, 1937) and even to work by Laplace and others in the late 18th century.

But the shift of emphasis to prediction (from estimation of parameters) and to the on-line setting (where our prediction rule is repeatedly updated) allows us to interpret its successive predictions being right 95% of the time in an unusually direct way. In §2.1, we illustrate this point with a well-worn example. In §2.2, we contrast confidence with full-fledged conditional probability. This contrast has been the topic of endless debate between those who find confidence methods informative (classical statisticians) and those who insist that full-fledged probabilities based on all one's information are always preferable, even if the only available probabilities are very subjective (Bayesians).

Because the debate usually focuses on estimating parameters rather than predicting future observations, and because some readers may be unaware of the debate, we take some time to explain what it means for a method of prediction to contain the truth 95% of the time. To make this precise, and to make clear what validity means and how it can be obtained when we make one prediction after another in this on-line picture, we consider prediction under an assumption often made in a first course in statistics: random variables $z_1, z_2, \dots$ are independently drawn from a normal distribution. Under this assumption, the prediction problem was solved by Fisher, who explained how to give a 95% prediction interval for $z_n$ based on $z_1,\dots,z_{n-1}$ that is valid in the on-line sense. We will state Fisher's prediction rule, illustrate its application to data, and explain why it is valid. As we will see, the predictions given by Fisher's rule are relatively weak. This is not surprising, because we are predicting $z_n$ based on the old examples $z_1,\dots,z_{n-1}$ alone. Usually, more precise prediction is possible only in the more favorable but more complicated set-up where we know some features $x_n$ of the new example and can use both $x_n$ and the old examples to predict some other feature $y_n$.

But the simplicity of the set-up where we predict $z_n$ from $z_1,\dots,z_{n-1}$ alone will help us make some fundamental points.

2.1 Fisher's Prediction Interval

Suppose we observe the $z_i$ in sequence. After observing $z_1$ and $z_2$, we start predicting; for $n = 3, 4, \dots$, we predict $z_n$ after having seen $z_1,\dots,z_{n-1}$. The natural point predictor for $z_n$ is the average so far,

$$\bar z_{n-1} := \frac{1}{n-1}\sum_{i=1}^{n-1} z_i,$$

but we want to give an interval that will contain $z_n$ 95% of the time. Fisher's answer (1935) is as follows:

1. Calculate $$s_{n-1}^2 := \frac{1}{n-2}\sum_{i=1}^{n-1}(z_i - \bar z_{n-1})^2,$$ which we can usually assume is nonzero.

2. In a table of percentiles of $t$-distributions, find $t_{n-2}^{0.025}$, the point that the $t$-distribution with $n-2$ degrees of freedom exceeds exactly 2.5% of the time.

3. Predict that $z_n$ will be in the interval $$\bar z_{n-1} \pm t_{n-2}^{0.025}\, s_{n-1}\sqrt{\frac{n}{n-1}}. \qquad (1)$$

Fisher based this procedure on the fact that $$\frac{z_n - \bar z_{n-1}}{s_{n-1}}\sqrt{\frac{n-1}{n}}$$ has the $t$-distribution with $n-2$ degrees of freedom, so that interval (1) will contain $z_n$ with probability 95%. We can illustrate (1) using some numbers generated in 1900 by the students of Emanuel Czuber (1851-1925).
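Fisher's rule translates directly into code. The sketch below computes interval (1) using SciPy's $t$-distribution percentiles in place of a printed table; the function name and the alpha parameter are my additions.

    import numpy as np
    from scipy.stats import t

    def fisher_interval(z, alpha=0.05):
        # z holds the observations z_1, ..., z_{n-1}; we predict z_n.
        z = np.asarray(z, dtype=float)
        m = len(z)                                      # m = n - 1
        zbar = z.mean()                                 # \bar z_{n-1}
        s = np.sqrt(((z - zbar) ** 2).sum() / (m - 1))  # s_{n-1}; n-2 = m-1 d.o.f.
        tcrit = t.ppf(1 - alpha / 2, df=m - 1)          # t_{n-2}^{0.025} by default
        half = tcrit * s * np.sqrt((m + 1) / m)         # sqrt(n / (n-1))
        return zbar - half, zbar + half

    # Example: a 95% prediction interval for z_4 after seeing three values.
    print(fisher_interval([17.2, 18.1, 16.9]))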

