Transcription of Model Compression - Cornell University
1 Model Compression. Cristian Bucilă [...] hundreds or thousands of base-level classifiers [...], the space required to store this many classifiers, and the time required to execute them at run-time, prohibits their use in applications where test sets are large, where storage space is at a premium, and where computational power is limited. We present a method for "compressing" large, complex ensembles into smaller, faster models, usually without significant loss in [...] Subject [...] [Pattern Recognition]: Models [...]: Algorithms, Experimentation, Measurement, Performance, [...]: Supervised Learning, [...] a collection of models whose predictions are combined by weighted averaging or [...] been the focus of significant research in the past decade, and a variety of ensemble methods have [...] known ensemble methods include bagging [2], boosting [14], random forests [3], Bayesian averaging [9] and stacking [17]. Much of the interest in ensemble methods has been fueled by their excellent [...], however, have one disadvantage that often is overlooked.
2 Many ensemble methods [...] unusable for applications with limited memory, storage space, or computational power such as portable devices or sensor networks, and for applications in which [...], for example, boosted decision trees, bagged decision trees or [...] thousands of decision trees, each of which must be stored, and executed at run-time to make [...] single tree is fast, but executing a thousand trees is [...] ([...]'06, August 20-23, 2006, Philadelphia, Pennsylvania [...]) In this paper we show how to compress the function that is learned by a complex model into a much smaller, [...] Specifically, we show how to train compact artificial neural nets to mimic the function learned by ensemble selection, an ensemble learning method introduced by Caruana et al. [5]. To achieve this, we take advantage of the well known property of artificial neural nets, namely that they are universal approximators.
3 Given enough training data, and a large enough hidden layer, a neural net can approximate any function to [...] training the neural net on the original (often small) training set used to train the ensemble, we use the ensemble to label a large unlabeled data set and then train the neural net on this much larger, ensemble labeled, [...] neural net that makes predictions similar to the ensemble, and which performs much better than a [...] difficulty when compressing complex ensembles into simpler models this way is the need for a [...] some domains, unlabeled data is [...] other domains, however, large data sets (labeled or unlabeled) [...] these domains, we generate synthetic cases that as closely as possible match the distribution of [...] introduce a new method for generating synthetic cases called MUNGE that outperforms other methods to which we have [...], we are able to train neural nets that are a thousand times smaller and faster than ensemble selection ensembles, but which have [...] some situations, it is not enough for a classifier or regressor to be highly accurate, it also has to [...] many cases, however, the best performing model is too slow and too large to meet these requirements, while fast and compact models are less accurate, because either they are not expressive enough, or they overfit to [...] such situations, we propose using model compression to obtain fast, compact yet [...] to use a fast and compact model to approximate the function learned by a slower, larger, [...] the true function that is unknown, the function learned by a high performing model is available and can be used to label large amounts of [...] fast.
4 Compact and expressive model trained on enough pseudo data will not overfit and will approximate well the function learned by [...] slow, complex model such as a massive ensemble to be compressed into a fast, compact model such as a neural net with little loss in [...] question is how do we [...] of unlabeled data is easy to collect ([...] text, web and image domains) and can be used as [...] other domains, however, unlabeled data is not readily available and synthetic cases need to be [...] more difficult than it might seem at [...] is important that the synthetic data match well the distribution of [...] in a small submanifold of [...] the synthetic data is drawn from a distribution that has little overlap to this manifold, the labeled synthetic points will fail to capture the target function in the region of [...], if the distribution from which the synthetic data is sampled is too broad, only a fraction of the points will be drawn from the true manifold and many more samples will be necessary to adequately sample the region of [...] when the synthetic distribution is very similar to [...] minimum number of samples will be necessary to [...] experiment with three methods of generating pseudo data: RANDOM, generate data for each attribute independently from its marginal distribution;
5 NBE, estimate the joint density of attributes using the Naive Bayes Estimation algorithm [12] and then generate samples from this joint distribution; and MUNGE, a new procedure we propose that samples from a non-parametric estimate of the joint [...] to generate pseudo data is to independently sample the value of each attribute from the marginal distribution of [...] the procedure predominantly used in the literature whenever there is a need for artificial data ([...] [13, 6]). Usually the nominal attributes are generated from a uniform distribution, [...] a Gaussian distribution with mean and variance estimated from the training set, or via kernel density estimation [15]. The RANDOM method for generating pseudo data uses a [...] attribute, a value is selected uniformly at random from the multiset (bag) of all values for that attribute present in [...], all conditional structure is lost and the pseudo examples are generated from a distribution that is usually much broader than the true distribution of [...] consequence many of the generated pseudo examples will cover uninteresting parts of the space, and this may prevent the mimic model from focusing on the important [...] to generating pseudo data is to estimate the joint distribution of attributes using the training set, then sample pseudo examples from this joint distribution [...] (Footnote 1: For nominal attributes this is equivalent to [...] slightly different than previously proposed ones, but generates similar values in [...]) [...] distribution can be estimated well, the conditional structure of the domain would be preserved and the new artificial examples would cover well the interesting regions of [...] to estimate the joint distribution of a set of variables is to [...] coming from a mixture of components, each component with a different [...] this category.
6 Used in domains with only continuous attributes is the mixture of Gaussians [7], where each component consists of a Gaussian distribution with a different [...] mixture model algorithm that handles both discrete and continuous attributes, NBE (Naive Bayes Estimation), was recently introduced by Lowd and Domingos [12]. We used NBE to estimate the joint distribution of the attributes because it handles mixed attributes, it is simple to use, it performs as well as learning a Bayesian Network from the same data [12], and it is [...] full joint distribution is difficult when there are many [...] trying to reliably estimate a joint distribution, we have developed a new algorithm that samples directly from a non-parametric estimate of the joint [...]

Input: set of training examples T, size multiplier k, probability parameter p, local variance parameter s
Returns: unlabeled training set D of size k × size(T)
 1: D ← ∅
 2: loop k times
 3:   T' ← T
 4:   for all examples e in T' do
 5:     e' ← the closest example of e from T'
 6:     for all attributes a of example e (excluding the label attribute) do
 7:       if a is continuous then
 8:         with probability p: e_a ← norm(e'_a, sd), and e'_a ← norm(e_a, sd), where sd = |e_a - e'_a| / s, and norm(a, b) is a [...]
 9:       else
10:         with probability p: swap the values of attribute a for examples e and e'
11:       end if
12:     end for
13:   end for
14:   D ← D ∪ T'
15: end loop
16: Return D

7 Starting from the original training set, we visit each [...] measure distance between cases, we [...] [0,1]. Given example e and its closest other example e', the values for each non-continuous attribute are swapped between e and e' with probability p and are left unchanged with probability 1 - p. For each continuous attribute a, with probability p, e_a is assigned a random value drawn from a normal distribution with mean e'_a and standard deviation sd = |e_a - e'_a| / s, and e'_a is assigned a random value drawn from the normal distribution with mean e_a and the same standard deviation sd. We call this approach to generating artificial data by [...] is presented in [...]

Figure 1: Synthetic data generated for a [...] (panels: TRUE DIST, RANDOM, NBE, MUNGE)

[...] shows samples generated from a simple 2D distribution (TRUE DIST), and the distributions learned by RANDOM, NBE and MUNGE from a train set of 4000 points [...], the samples generated by RANDOM cover an area much larger than the true distribution, so only relatively few of the samples overlap with the region of [...] a better job at approximating the true distribution, but still has problems, especially in the "corners".
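The MUNGE steps listed above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name munge, the is_continuous flag list and the optional rng argument are hypothetical, and the distance measure (Euclidean on continuous attributes assumed to be already scaled to [0,1], plus a 0/1 mismatch on nominal attributes) is an assumption, because the sentence describing the distance is incomplete in this transcription.

```python
import math
import random


def munge(train, k, p, s, is_continuous, rng=None):
    """Return k * len(train) unlabeled synthetic examples.

    train: list of examples, each a list of attribute values (label excluded).
    is_continuous[a]: True if attribute a is continuous (hypothetical helper).
    Continuous attributes are assumed to be scaled to [0, 1] beforehand.
    """
    rng = rng or random.Random()

    def distance(x, y):
        # Assumed metric: squared difference for continuous attributes,
        # 0/1 mismatch for nominal ones.
        total = 0.0
        for xa, ya, cont in zip(x, y, is_continuous):
            if cont:
                total += (xa - ya) ** 2
            elif xa != ya:
                total += 1.0
        return math.sqrt(total)

    synthetic = []
    for _ in range(k):
        work = [list(e) for e in train]  # T' <- T (fresh copy on each pass)
        for i, e in enumerate(work):
            others = (j for j in range(len(work)) if j != i)
            nearest = min(others, key=lambda j: distance(e, work[j]))
            e2 = work[nearest]  # closest other example in T'
            for a, cont in enumerate(is_continuous):
                if rng.random() >= p:
                    continue  # attribute left unchanged with probability 1 - p
                if cont:
                    sd = abs(e[a] - e2[a]) / s
                    # Both draws use the old values of e[a] and e2[a].
                    e[a], e2[a] = rng.gauss(e2[a], sd), rng.gauss(e[a], sd)
                else:
                    e[a], e2[a] = e2[a], e[a]  # swap nominal values
        synthetic.extend(work)  # D <- D U T'
    return synthetic
```

In this sketch a larger s keeps continuous values closer to the two neighbours they were drawn from, and p = 0 returns k exact copies of the training set. The returned examples are unlabeled; in the compression setting they would then be labeled by the ensemble.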
8 Of the three methods, [...] evaluate the effectiveness of model compression on eight binary classification [...], COVTYPE and LETTER are from the UCI Repository [1]. COVTYPE has been converted to a binary problem by treating the largest class as positive and the rest as [...] converted LETTER to a binary problem in two [...] "O" as positive and the remaining 25 letters as negative, yielding a [...] and the rest as negatives, yielding a difficult, but well balanced, [...] the Indian Pine92 data set [10] where the difficult class Soybean-mintill is the positive [...] is [...] experiment with using neural networks to compress the models built using the ensemble selection algorithm proposed by Caruana et al. [5]. The ensemble models generated by ensemble selection are very large, complex models that have very good generalization performance, thus they are a [...]

(Footnote: [...] defines munge as "To imperfectly transform information" or "To modify data in a way that cannot be described succinctly".)

Table 1: Description of [...]
          #attr    train size  test size  %poz
adult     14/104   4000        35222      25%
covtype   54       4000        25000      36%
hs        200      4000        4366       24%
[...]

Figure 2: Average perf. (x-axis: [...] size; curves: RAND, NBE, MUNGE, ensemble selection, best single model, best neural net)
9 Over the eight [...] first builds a library of diverse base-level models using many different [...] built, the basic ensemble selection procedure builds the ensemble model by greedily selecting at each iteration the model from the library that when added to the ensemble improves the performance of [...] number of enhancements to the basic ensemble selection algorithm that improve its performance, but as a side effect increase the size of the ensemble by increasing the number of base-level models it [...], a training set of 4000 points is used to train the base-level models, and a validation set of 1000 points is used as [...] hillclimb [...], the 4000 training points are used as a training set for the three algorithms for producing artificial data: RANDOM, [...] artificial data generated with each algorithm is then labeled by the ensemble model and used to train a [...], the 1000 points [...] compare the performance of the compressed models with the performance of the target ensemble selection models on the eight [...] also show the performance of the best single base-level model from the ensemble selection library, selected using the same 1000 points validation sets, and the best neural network that could be trained on the original 4000 points training sets, using the 1000 points validation sets for early stopping and for selecting the number of [...] reflect the root-mean-squared-error (RMSE) of models' predictions to the binary 0/1 targets on large independent final test sets.
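The greedy selection step and the RMSE measure described above can be illustrated with a short Python sketch. It is only a rough approximation of basic ensemble selection: combining models by simple averaging, allowing a model to be picked more than once, stopping when no candidate lowers the error, and using RMSE as the hillclimbing metric are all assumptions made for this example, and the names rmse, ensemble_selection and library_preds are hypothetical.

```python
import math


def rmse(preds, targets):
    """Root-mean-squared error of predictions against 0/1 targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


def ensemble_selection(library_preds, targets, max_iters=20):
    """Greedy forward selection on a hillclimbing (validation) set.

    library_preds: dict mapping a model name to its list of predictions
    on the hillclimbing set. Returns (chosen model names, final RMSE).
    """
    chosen = []
    sums = [0.0] * len(targets)  # running sum of the chosen models' predictions
    best_err = float("inf")
    for _ in range(max_iters):
        best_name, best_candidate_err = None, best_err
        for name, preds in library_preds.items():
            n = len(chosen) + 1
            averaged = [(s + p) / n for s, p in zip(sums, preds)]
            err = rmse(averaged, targets)
            if err < best_candidate_err:
                best_name, best_candidate_err = name, err
        if best_name is None:
            break  # assumed stopping rule: no candidate improves the ensemble
        chosen.append(best_name)
        sums = [s + p for s, p in zip(sums, library_preds[best_name])]
        best_err = best_candidate_err
    return chosen, best_err
```

The first model selected is always the best single model on the hillclimbing set, so the ensemble's hillclimbing error can never be worse than that model's error under this stopping rule.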
10 Figure 3: Performance of compressed models vs. [...] (panels: ADULT, COVTYPE, HS, [...], [...], MEDIS, MG, SLAC, AVERAGE; x-axis: number of hidden units; curves: MUNGE, ensemble selection, best single model, best neural net)

[...] shows the average RMSE performance on the eight [...] RMSE represents the [...] figure shows the performance of the best neural nets we [...] ensemble selection trained on [...] the middle is the average performance of the best single base-level models from the ensemble selection libraries, [...] the best performance we could achieve with any of the following learning methods.