Model Compression - Cornell University

Cristian Bucilă

[...] hundreds or thousands of base-level classifiers. [...] the space required to store this many classifiers, and the time required to execute them at run-time, prohibits their use in applications where test sets are large (...), where storage space is at a premium (...), and where computational power is limited (...). We present a method for "compressing" large, complex ensembles into smaller, faster models, usually without significant loss in [...].

Categories and Subject Descriptors: [...] [Pattern Recognition]: Models [...]
General Terms: Algorithms, Experimentation, Measurement, Performance, [...]
Keywords: Supervised Learning, [...]

An ensemble is a collection of models whose predictions are combined by weighted averaging or voting. Ensembles have been the focus of significant research in the past decade, and a variety of ensemble methods have been introduced. Well known ensemble methods include bagging [2], boosting [14], random forests [3], Bayesian averaging [9] and stacking [17].

Much of the interest in ensemble methods has been fueled by their excellent [...]. Ensembles, however, have one disadvantage that often is overlooked: many ensemble methods [...] unusable for applications with limited memory, storage space, or computational power such as portable devices or sensor networks, and for applications in which [...]. [...], for example, boosted decision trees, bagged decision trees or [...] thousands of decision trees, each of which must be stored, and executed at run-time to make a prediction. Executing a single tree is fast, but executing a thousand trees is not.

[...] '06, August 20-23, 2006, Philadelphia, Pennsylvania, [...]

In this paper we show how to compress the function that is learned by a complex model into a much smaller, [...]. Specifically, we show how to train compact artificial neural nets to mimic the function learned by ensemble selection, an ensemble learning method introduced by Caruana et al. [5]. To achieve this, we take advantage of the well known property of artificial neural nets, namely that they are universal approximators: given enough training data, and a large enough hidden layer, a neural net can approximate any function to [...]. Rather than training the neural net on the original (often small) training set used to train the ensemble, we use the ensemble to label a large unlabeled data set and then train the neural net on this much larger, ensemble labeled, [...]. [...] neural net that makes predictions similar to the ensemble, and which performs much better than a [...].

One difficulty when compressing complex ensembles into simpler models this way is the need for a [...]. In some domains, unlabeled data is [...]. In other domains, however, large data sets (labeled or unlabeled) [...]. In these domains, we generate synthetic cases that as closely as possible match the distribution of [...]. We introduce a new method for generating synthetic cases called MUNGE that outperforms other methods to which we have [...]. [...], we are able to train neural nets that are a thousand times smaller and faster than ensemble selection ensembles, but which have [...].

In some situations, it is not enough for a classifier or regressor to be highly accurate, it also has to [...]. In many cases, however, the best performing model is too slow and too large to meet these requirements.
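The label-then-mimic procedure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy target concept, the bagged nearest-neighbor model standing in for ensemble selection, the network size, and the cross-entropy training loop are not taken from the paper. It only shows the workflow: a large, slow model labels a large unlabeled set, and a small neural net is trained on those labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_label(X):
    # Hypothetical target concept, used only to build the toy training set.
    return (X[:, 1] > X[:, 0] ** 2 - 0.3).astype(float)

# Small labeled training set (the "original (often small) training set").
X_train = rng.uniform(-1, 1, size=(300, 2))
y_train = true_label(X_train)

class BaggedNearestNeighbor:
    """Stand-in for a large ensemble: 1-nearest-neighbor models on bootstrap samples."""
    def __init__(self, X, y, n_members=25):
        self.X, self.y = X, y
        self.bags = [rng.integers(0, len(X), len(X)) for _ in range(n_members)]

    def predict_proba(self, Q):
        # Squared distances from every query point to every stored example.
        d = ((Q[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=2)
        # Each member votes with the label of its nearest bootstrap example.
        votes = [self.y[bag][d[:, bag].argmin(axis=1)] for bag in self.bags]
        return np.mean(votes, axis=0)

def train_mimic(X, targets, hidden=8, steps=2000, lr=0.5):
    """Small one-hidden-layer net trained by gradient descent on the ensemble's outputs."""
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    w2 = rng.normal(0, 0.5, hidden); b2 = 0.0
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)
        p = 1 / (1 + np.exp(-(H @ w2 + b2)))
        g = (p - targets) / len(X)            # gradient of mean cross-entropy w.r.t. logits
        gH = np.outer(g, w2) * (1 - H ** 2)   # back-propagate through tanh
        w2 -= lr * (H.T @ g); b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gH); b1 -= lr * gH.sum(axis=0)
    return lambda Q: 1 / (1 + np.exp(-(np.tanh(Q @ W1 + b1) @ w2 + b2)))

ensemble = BaggedNearestNeighbor(X_train, y_train)
X_unlabeled = rng.uniform(-1, 1, size=(5000, 2))        # large unlabeled set
pseudo_targets = ensemble.predict_proba(X_unlabeled)    # labels come from the ensemble
mimic = train_mimic(X_unlabeled, pseudo_targets)

X_test = rng.uniform(-1, 1, size=(2000, 2))
agreement = np.mean((mimic(X_test) > 0.5) == (ensemble.predict_proba(X_test) > 0.5))
print(f"mimic/ensemble agreement on held-out points: {agreement:.3f}")
```

The mimic net never sees the true labels. It is fit only to the ensemble's outputs, so its quality is measured here as agreement with the ensemble on fresh points.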

Fast and compact models, by contrast, are less accurate, because either they are not expressive enough, or they overfit to [...]. In such situations, we propose using model compression to obtain fast, compact yet [...]. [...] to use a fast and compact model to approximate the function learned by a slower, larger, [...]. [...] the true function that is unknown, the function learned by a high performing model is available and can be used to label large amounts of [...]. A fast, compact and expressive model trained on enough pseudo data will not overfit and will approximate well the function learned by [...]. [...] slow, complex model such as a massive ensemble to be compressed into a fast, compact model such as a neural net with little loss in [...].

The question is how do we [...]. [...] of unlabeled data is easy to collect (... text, web and image domains) and can be used as [...]. In other domains, however, unlabeled data is not readily available and synthetic cases need to be [...]. [...] more difficult than it might seem at [...]. It is important that the synthetic data match well the distribution of [...]. [...] in a small submanifold of [...]. If the synthetic data is drawn from a distribution that has little overlap to this manifold, the labeled synthetic points will fail to capture the target function in the region of [...]. [...], if the distribution from which the synthetic data is sampled is too broad, only a fraction of the points will be drawn from the true manifold and many more samples will be necessary to adequately sample the region of [...]. [...] when the synthetic distribution is very similar to [...] minimum number of samples will be necessary to [...].

We experiment with three methods of generating pseudo data: RANDOM, which generates data for each attribute independently from its marginal distribution; NBE, which estimates the joint density of attributes using the Naive Bayes Estimation algorithm [12] and then generates samples from this joint distribution; and MUNGE, a new procedure we propose that samples from a non-parametric estimate of the joint density.

[...] to generate pseudo data is to independently sample the value of each attribute from the marginal distribution of [...]. [...] the procedure predominantly used in the literature whenever there is a need for artificial data (... [13, 6]). Usually the nominal attributes are generated from a uniform distribution, a Gaussian distribution with mean and variance estimated from the training set, or via kernel density estimation [15]. The RANDOM method for generating pseudo data uses a [...]. For each attribute, a value is selected uniformly at random from the multiset (bag) of all values for that attribute present in [...].
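As a rough illustration of the RANDOM procedure just described, the sketch below draws each attribute independently from the bag of values observed for it. The function name `random_pseudo_data`, the toy data, and the encoding of the nominal attribute as an integer code are assumptions made for this example. Because attributes are sampled separately, any dependence between them in the training set is not carried over to the generated data.

```python
import numpy as np

def random_pseudo_data(T, n_samples, rng):
    """For each attribute, draw values uniformly at random (with replacement)
    from the bag of values that attribute takes in the training set T."""
    n, d = T.shape
    cols = [T[rng.integers(0, n, n_samples), j] for j in range(d)]
    return np.column_stack(cols)

rng = np.random.default_rng(0)

# Toy training set: two strongly correlated continuous attributes
# and one nominal attribute stored as an integer code (0, 1 or 2).
x = rng.normal(size=1000)
T = np.column_stack([x, x + 0.1 * rng.normal(size=1000), rng.integers(0, 3, 1000)])

D = random_pseudo_data(T, 5000, rng)

# Correlation between the first two attributes, before and after sampling.
corr_T = np.corrcoef(T[:, 0], T[:, 1])[0, 1]
corr_D = np.corrcoef(D[:, 0], D[:, 1])[0, 1]
print(f"correlation in training set: {corr_T:.3f}, in pseudo data: {corr_D:.3f}")
```

Every generated value is one that actually occurs in the training set for that attribute, so each marginal is preserved while the joint structure is not.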

[...], all conditional structure is lost and the pseudo examples are generated from a distribution that is usually much broader than the true distribution of [...]. As a consequence many of the generated pseudo examples will cover uninteresting parts of the space, and this may prevent the mimic model from focusing on the important [...].

[...] to generating pseudo data is to estimate the joint distribution of attributes using the training set, then sample pseudo examples from this joint distribution. If the joint distribution can be estimated well, the conditional structure of the domain would be preserved and the new artificial examples would cover well the interesting regions of [...].

Footnote 1: For nominal attributes this is equivalent to [...]. [...] slightly different than previously proposed ones, but generates similar values in [...].

[...] to estimate the joint distribution of a set of variables is to [...] coming from a mixture of components, each component with a different [...]. [...] this category used in domains with only continuous attributes is the mixture of Gaussians [7], where each component consists of a Gaussian distribution with a different [...]. A mixture model algorithm that handles both discrete and continuous attributes, NBE (Naive Bayes Estimation), was recently introduced by Lowd and Domingos [12]. We used NBE to estimate the joint distribution of the attributes because it handles mixed attributes, it is simple to use, it performs as well as learning a Bayesian Network from the same data [12], and it is [...].

[...] full joint distribution is difficult when there are many [...]. [...] trying to reliably estimate a joint distribution, we have developed a new algorithm that samples directly from a non-parametric estimate of the joint [...].

Input: set of training examples T, size multiplier k, probability parameter p, local variance parameter s
Returns: unlabeled training set D of size k × size(T)

    D ← ∅
    loop k times
        T' ← T
        for all examples e in T' do
            e' ← the closest example of e from T'
            for all attributes a of example e (excluding the label attribute) do
                if a is continuous then
                    with probability p: e_a ← norm(e'_a, sd), and e'_a ← norm(e_a, sd),
                        where sd ← |e_a − e'_a| / s, and norm(a, b) is a [...]
                else
                    with probability p: swap the values of attribute a for examples e and e'
                end if
            end for
        end for
        D ← D ∪ T'
    end loop
    Return D

Starting from the original training set, we visit each [...]. To measure distance between cases, we [...] [0, 1]. Given example e and its closest other example e', the values for each noncontinuous attribute are swapped between e and e' with probability p and are left unchanged with probability 1 − p.
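The procedure above can be sketched in Python as follows. This is one possible reading of the listing, not the authors' implementation, and several details are assumptions: norm(a, b) is taken to be a draw from a normal distribution with mean a and standard deviation b; the distance is taken to be squared Euclidean distance over continuous attributes rescaled to [0, 1] plus a mismatch count over nominal attributes; both continuous updates use the values held before the update; and the names `munge` and `continuous` are hypothetical. The label attribute is assumed to have been removed from T beforehand.

```python
import numpy as np

def munge(T, continuous, k, p, s, rng):
    """Return k modified copies of T stacked together (k * len(T) unlabeled rows)."""
    T = np.asarray(T, dtype=float)
    continuous = np.asarray(continuous, dtype=bool)
    # Assumed scaling: continuous attributes are rescaled by their range.
    lo = T[:, continuous].min(axis=0)
    span = T[:, continuous].max(axis=0) - lo
    span[span == 0] = 1.0                       # avoid division by zero
    passes = []
    for _ in range(k):
        Tp = T.copy()                           # T' <- T
        for i in range(len(Tp)):
            e = Tp[i]
            diff_c = (Tp[:, continuous] - e[continuous]) / span
            diff_n = Tp[:, ~continuous] != e[~continuous]
            dist = (diff_c ** 2).sum(axis=1) + diff_n.sum(axis=1)
            dist[i] = np.inf                    # e is not its own neighbor
            j = int(dist.argmin())              # e' <- closest example to e in T'
            for a in range(Tp.shape[1]):
                if rng.random() >= p:
                    continue                    # attribute left unchanged
                ea, eb = Tp[i, a], Tp[j, a]
                if continuous[a]:
                    sd = abs(ea - eb) / s
                    Tp[i, a] = rng.normal(eb, sd)
                    Tp[j, a] = rng.normal(ea, sd)
                else:
                    Tp[i, a], Tp[j, a] = eb, ea  # swap nominal values
        passes.append(Tp)                       # D <- D ∪ T'
    return np.vstack(passes)

rng = np.random.default_rng(0)
# Toy data: two continuous attributes and one nominal attribute coded 0, 1 or 2.
T = np.column_stack([rng.normal(size=50), rng.normal(size=50), rng.integers(0, 3, 50)])
D = munge(T, continuous=[True, True, False], k=3, p=0.5, s=1.0, rng=rng)
print(D.shape)  # (150, 3)
```

In this reading, a larger s gives a smaller sd, so perturbed continuous values stay closer to the neighbor's value, and p = 0 returns k unmodified copies of T. The nearest-neighbor search is done by brute force on each visit, which is adequate for a sketch but quadratic in the number of examples.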
