Transcription of Model Compression - Cornell University
1 ModelCompressionCristianBucil hundredsor thousandsof base-level classi , thespacerequiredto storethismany clas-si ers,andthetimerequiredto executethemat run-time,prohibitstheirusein applicationswheretestsetsarelarge( ),wherestoragespaceis ata premium( ),andwherecomputationalpower is limited( ).We present a method for\compressing"large,complexensemblesin to smaller,fastermodels,usuallywith-outsign i cant lossin Subject [PatternRe-cognition]:Models{ :Algorithms,Experimentation,Measure-ment , Performance, :SupervisedLearning, a collectionof modelswhosepredictionsarecombinedby weightedaveragingor beenthefocusof signi cant research in thepastdecade,anda variety of ensemblemethods have knownensemblemethods includebagging[2],boosting[14],randomfor ests[3],Bayesianaveraging[9]andstacking[ 17].}
2 Much of theinterestin ensemblemethodshasbeenfueledby theirexcellent ,however,have onedisadvantagethatoftenis overlooked:many ensemblemethods unusableforapplicationswithlim-itedmemor y, storagespace,or computationalpower such asportabledevicesor sensornetworks,andforapplicationsinwhich ,forexam-ple,boosteddecisiontrees,bagged decisiontreesor thousandsof decisiontrees,each of which mustbe stored,andexecutedat run-timeto make singletreeisfast,butexecutinga thousandtreesis digitalorhardcopiesofallorpartofthiswork forpersonalorclassroomuseis copy otherwise,torepublish,topostonserversort oredistributetolists.
3 Requirespriorspecificpermissionand/ora 06,August20 23,2006,Philadelphia,Pennsylvania, $ thispaper we show how to compressthefunctionthatis learnedby a complexmodelinto a much smaller, cally, weshow how to traincompactarti cialneuralnetstomimicthefunctionlearnedb y ensembleselection,an ensemblelearningmethod introducedby Caruanaet al.[5].To achieve this,we take advantageof thewellknownproperty of arti cialneuralnets,namelythattheyareuniversa lapproximators:given enoughtrainingdata,anda largeenoughhiddenlayer,a neuralnetcanapproximateany functionto trainingtheneuralnetontheoriginal(oftens mall)
4 Trainingsetusedto traintheensemble,we usetheensembleto label a largeunlabeleddatasetandthentraintheneur alnetonthismuch larger,ensemblelabeled, neuralnetthatmakes predictionssimilarto theensemble,andwhich performsmuch betterthana culty whencompressingcomplexensemblesinto simplermodelsthisway is theneedfora somedomains,unlabeleddatais otherdomains,however,largedatasets(label edor unlabeled) thesedomains,we gen-eratesyntheticcasesthatas closelyas possiblematch thedistributionof introducea newmethod forgeneratingsyntheticcasescalledMUNGE thatoutperformsothermethods to which we have ,we areableto trainneuralnetsthatareathousandtimessmal lerandfasterthanensembleselectionensembl es,butwhich have somesituations,it is notenoughfora classi eror re-gressorto be highlyaccurate,it alsohasto many cases,however,thebestperformingmodelis too slow andtoo largetomeettheserequirements.
5 Whilefastandcompactmodelsarelessaccurate ,becauseeithertheyarenotexpressive enough,ortheyover tto such situations,we proposeusingmodelcompressionto obtainfast,compactyet to usea fastandcompactmodelto approximatethefunctionlearnedbya slower,larger, thetruefunctionthatis unknown,thefunctionlearnedby ahighperformingmodelis availableandcanbe usedto labellargeamounts of fast,compactandexpres-sive modeltrainedonenoughpseudodatawillnotove r tandwillapproximatewellthefunctionlearne dby slow,complexmodelsuchas a massive ensembleto be compressedinto a fast,compactmodelsuch as a neuralnetwithlittlelossin questionis how dowe of unlabeleddatais easyto collect( text,webandimagedomains)
6 Andcanbe usedas otherdomains,however,unla-beleddatais notreadilyavailableandsyntheticcasesneed to be moredi cultthanit might seemat is important thatthesyntheticdatamatch wellthedistributionof in a smallsubmanifoldof thesyntheticdatais drawnfroma distri-butionthathaslittleoverlapto thismanifold,thelabeledsyntheticpoints willfailto capturethetargetfunctionintheregionof ,if thedistribu-tionfromwhich thesyntheticdatais sampledis too broad,onlya fractionof thepoints willbe drawnfromthetruemanifoldandmany moresampleswillbe necessaryto ade-quatelysampletheregionof whenthesyntheticdistributionis verysimilarto minimumnumber of sampleswillbe necessaryto experiment withthreemethods of generatingpseudodata:RANDOM,generatedata foreach attributeindepen-dentlyfromitsmarginaldi stribution;NBE,estimatethejoint density of attributesusingtheNaive Bayes Estimationalgorithm[12]andthengeneratesa mplesfromthisjoint dis-tribution.
7 AndMUNGE,a newprocedurewe proposethatsamplesfroma non-parametricestimateof thejoint to generatepseudodatais to indepen-dentlysamplethevalueof each attributefromthemarginaldistributionof theprocedurepredom-inantlyusedin theliteraturewhenever thereis a needforarti cialdata( [13,6]).Usuallythenominalattributesarege neratedfroma uniformdistribution,a Gaussiandistributionwithmeanandvariancee stimatedfromthetrainingset,or viakerneldensity estimation[15].TheRANDOM method forgeneratingpseudodatausesa attribute,a valueis selecteduniformlyat randomfromthemultiset(bag)
8 Of all valuesforthatattributepresent in ,allconditionalstructureis lostandthepseudoexamplesaregeneratedfrom a distributionthatis usuallymuch broaderthanthetruedistributionof consequencemany of thegeneratedpseudoexampleswillcover uninter-estingpartsof thespace,andthismay prevent themimicmodelfromfocusingontheimportant togeneratingpseudodatais toesti-matethejoint distributionof attributesusingthetrainingset,thensample pseudoexamplesfromthisjoint distribu-1 For nominalattributesthisis equivalent to slightlydi erent thanpreviouslyproposedones,butgeneratess imilarvaluesin distributioncanbe esti-matedwell,theconditionalstructureof thedomainwouldbe preservedandthenewarti cialexampleswouldcoverwelltheinteresting regionsof to estimatethejoint distributionof a setof vari-ablesis to comingfroma mixtureof components,each component witha di erent thiscategory.
9 Usedin domainswithonlycontinuousattributesis themixtureof Gaussians[7],whereeach component consistsof a Gaussiandistributionwithadi erent mixturemodelalgorithmthathandlesbothdisc reteandcontinuousattributes,NBE(Naive Bayes Estimation),was recentlyintroducedby LowdandDomingos[12].WeusedNBEto estimatethejoint distributionof theattributesbecauseit handlesmixedattributes,it is simpleto use,itperformsas wellas learninga BayesianNetworkfromthesamedata[12],andit is fulljoint distributionis di cultwhentherearemany tryingto reliablyestimatea joint distribution,we have developed anewalgorithmthatsamplesdirectlyfroma non-parametricestimateof thejoint :setoftrainingexamplesT, sizemultiplierk,probability parameterp, localvarianceparametersReturns:unlabeled trainingsetDof sizek size(T)1:D.
10 2:loopktimes3:T0 T4:for allexampleseinT0do5:e0 theclosestexampleofefromT06:for allattributesaof examplee(excludingthelabel attribute)do7:ifais continuousthen8:withprobabilityp:ea norm(e0a; sd), ande0a norm(ea; sd), wheresd jea e0aj=s,andnorm(a; b) is a :else10:withprobabilityp: swap thevaluesof attributeaforexampleseande011:end if12:end for13:end for14:D D T015:end loop16:ReturnDStartingfromtheoriginaltra iningset,we visiteach measuredistancebetweencases,we [0,1].Given exampleeanditsclosestotherexamplee0, theval-uesforeach noncontinuousattributeareswappedbetweene ande0withprobabilitypandareleftunchanged withprobability 1 p.