Transcription of A fast learning algorithm for deep belief nets
1 A fastlearningalgorithmfordeepbeliefnets TehDepartmentofComputerScienceNationalUn iversityofSingapore3 ScienceDrive 3, showhowtouse complementarypriors toeliminatetheexplainingawayeffectsthatm akeinferencedifficultindensely-connected beliefnetsthathave many ,wederive a fast,greedyalgo-rithmthatcanlearndeep,di rectedbeliefnetworksonelayerata time,providedthetoptwo lay-ersformanundirectedassociative memory. Thefast,greedyalgorithmis usedto initializea slowerlearningprocedurethatfine-tunesthe weightsus-inga contrastive ,a networkwiththreehiddenlayersformsa verygoodgenerative modelgivesbetterdigitclassificationthant hebestdiscrimi-native associative memoryandit is easytoex-ploretheseravinesbyusingthedire ctedconnec-tionstodisplaywhattheassociat ive difficultin densely-connected,directedbeliefnetsthat have many hiddenlayersbecauseit is difficultto infertheconditionaldistributionofthehidd enactivitieswhengivenadatavector. Variationalmethodsusesimpleapproximation stothetrueconditionaldistribution,butthe approximationsmaybepoor, , describea modelinwhichthetoptwo hiddenlayersformanundirectedassociative memory(seefigure1)
2 Andthe To appearinNeuralComputation2006remaininghi ddenlayersforma directedacyclicgraphthatconvertstherepre sentationsintheassociative a fast,greedylearningalgorithmthatcanfinda fairlygoodsetofparametersquickly, evenindeepnetworkswithmillionsofparamete rsandmany unsupervisedbutcanbeap-pliedtolabeleddat abylearninga fine-tuningalgorithmthatlearnsanexcel-le ntgenerative modelwhichoutperformsdiscrimina-tive modelmakesit perceptis introducestheideaofa complementary priorwhichexactlycancelsthe explainingaway introducesa fast,greedylearningalgorithmforconstruct ingmulti-layerdirectednetworksonelayerat a variationalboundit showsthataseachnewlayerisadded,theoveral lgenerative weak learner, butinsteadofre-weightingeachdata-vectort oensurethatthenextsteplearnssomethingnew , it weak learnerthat2000 top-level units500 units 500 units 28 x 28 pixel image10 label unitsThis could be the top level of another sensory pathwayFigure1 , eachtrainingcaseconsistsofanimageandanex plicitclasslabel,butworkinprogresshassho wnthatthesamelearningalgorithmcanbeusedi f the labels arereplacedbya generatepairsthatconsistofanimageanda usedtoconstructdeepdirectednetsis showshowtheweightsproducedbythefastgreed yalgorithmcanbefine-tunedusingthe up-down a contrastive versionofthewake-sleepal-gorithmHintonet al.
3 (1995)thatdoesnotsufferfromthe mode-averaging showsthepatternrecognitionperformanceofa ,thegeneralizationperformanceofthenetwor kis ; (2002) , section7 showswhathappensinthemindofthenetworkwhe nit is fullgenerative model,soit iseasyto lookintoitsmind wesimplygenerateanimagefromitshigh-level , wewillconsidernetscomposedofFigure2:Asim plelogisticbeliefnetcontainingtwo inde-pendent,rarecausesthatbecomehighlya nti-correlatedwhenweobserve 10ontheearth-quake nodemeansthat,in theabsenceofany observation,thisnodeise10timesmorelikely tobeoff theearth-quake nodeis onandthetrucknodeis off, thejumpnodehasa totalinputof0whichmeansthatit a muchbetterexplanationoftheobservationtha tthehousejumpedthantheoddsofe 20whichapplyifneitherofthehiddencausesis active. Butit is wastefulto turnonbothhiddencausestoexplaintheobserv ationbecausetheprobabilityofthembothhapp eningise 10 e 10=e nodeis turnedonit explainsaway variableis anadditive functionofthestatesofitsdirectly-connect edneigh-bours(seeAppendixA fordetails).
4 2 ComplementarypriorsThephenomenonofexplai ningaway(illustratedinfigure2) ,theposteriordistributionoverthehid-denv ariablesis intractableexceptina few specialcasessuchasmixturemodelsorlinearm odelswithadditive ChainMonteCarlomethods(Neal,1992)canbeus edtosamplefromtheposterior, butthey (NealandHinton,1998)approximatethetruepo steriorwitha moretractabledistributionandthey canbeusedto improve a is comfortingthatlearningisguaranteedtoimpr ove a variationalboundevenwhentheinferenceofth ehiddenstatesisdoneincorrectly,butit wouldbemuchbettertofinda wayofeliminatingex-plainingawayaltogethe r, eveninmodelswhosehiddenvari-ableshave is widelyassumedthatthisis logisticbeliefnet(Neal,1992)is ,theprobabilityofturningonunitiisa logisticfunctionofthestatesofitsimmediat eancestors,j, andoftheweights,wij,onthedirectedconnect ionsfromtheancestors:p(si= 1) =11 + exp( bi Pjsjwij)(1)wherebiisthebiasofuniti. Ifa logisticbeliefnetonlyhasonehiddenlayer, thepriordistributionoverthehiddenvariabl esisfactorialbecausetheirbinarystatesare chosenindependentlywhenthemodelis complementary ,whenthelikelihoodtermis multipliedbytheprior, wewillgeta posteriorthatis isnotat allobviousthatcomplementarypriorsexist,b ut figure3showsa simpleexampleofaninfinitelogisticbeliefn etwithtiedweightsinwhichthepriorsarecomp lementaryateveryhiddenlayer(seeAppendixA fora moregeneraltreatmentoftheconditionsunder whichcomplementarypriorsexist).
5 Theuseoftiedweightstoconstructcomplement arypriorsmayseemlike a ,however, it leadstoa cangeneratedatafromtheinfinitedirectedne tinfig-ure3 bystartingwitha randomconfigurationat aninfinitelydeephiddenlayer1andthenperfo rminga top-down ances-tral passinwhichthebinarystateofeachvariablei na layerischosenfromtheBernoullidistributio ndeterminedbythetop-downinputcomingfromi tsactive ,it is justlike any otherdirectednets,however, wecansam-plefromthetrueposteriordistribu tionoverallofthehiddenlayersbystartingwi tha Ap-pendixAshowsthatthisproceduregivesunb iasedsamplesbecausethecomplementaryprior ateachlayerensuresthattheposteriordistri butionreallyis , Chain,soweneedtostartata layerthatisdeepcomparedwiththetimeit exactlythesameastheinferenceprocedureuse dinthewake-sleepalgorithm(Hintonet al.,1995) fora generative weight,w00ij, froma unitjinlayerH0tounitiinlayerV0(seefigure 3).Ina logisticbeliefnet,themaximumlikelihoodle arningrulefora singledata-vector,v0, is:@logp(v0)@w00ij=<h0j(v0i ^v0i)>(2)where< >denotesanaverageoverthesampledstatesand^ v0iis theprobabilitythatunitiwouldbeturnedonif ,V1, fromthesampledbinarystatesin thefirsthiddenlayer,H0, is exactlythesameprocessasrecon-structingth edata,sov1iis a samplefroma Bernoullirandomvariablewithprobability^v 0i.
6 Thelearningrulecanthereforebewrittenas:@ logp(v0)@w00ij=<h0j(v0i v1i)>(3)Thedependenceofv1ionh0jis because^v0iis anexpectationthatisconditionalonh0j. Sincetheweightsarereplicated,thefullderi vative fora generative weightis obtainedbysummingthederivativesofthegene rative weightsbetweenallpairsoflay-ers:@logp(v0 )@wij=<h0j(v0i v1i)>+<v1i(h0j h1j)>+<h1j(v1i v2i)>+:::(4) divergencelearningIt maynotbeimmediatelyobviousthattheinfinit edirectednetinfigure3 isequivalenttoa RestrictedBoltzmannMa-chine(RBM).AnRBMha sa singlelayerofhiddenunitswhicharenotconne ctedtoeachotherandhave undirected,symmetricalconnectionstoa gen-eratedatafromanRBM,wecanstartwitha randomstateinoneofthelayersandthenperfor malternatingGibbssam-pling:Alloftheunits inonelayerareupdatedinparallelgiventhecu rrentstatesoftheunitsintheotherlayerandt hisis repeateduntilthesystemis performmaximumlikelihoodlearninginanRBM, wecanusethedifferencebetweentwo ,wij, betweena visibleunitianda hiddenunit,jwemeasurethecorrelation< v0ih0j>whena representtheparametersthatareusedtoinfer samplesfromtheposteriordistributionat eachhiddenlayerofthenetwhena datavectoris , ,usingal-ternatingGibbssampling,werunthe Markov chainshowninfigure4 untilit reachesitsstationarydistributionandmeasu rethecorrelation<v1ih1j>.
7 Thegradientofthelogprobabilityofthetrain ingdatais then@logp(v0)@wij=<v0ih0j> <v1ih1j>(5)Thislearningruleisthesameasthemaximum likelihoodlearningrulefortheinfinitelogi sticbeliefnetwithtiedweights,andeachstep ofGibbssamplingcorrespondstocomputingthe exactposteriordistributionina ,KL(P0jjP1 ), betweenthedistributionofthedata,P0, andtheequilibriumdistributiondefinedbyth emodel,P1 . Incontrastive divergencelearning(Hinton,2002),weonlyru ntheMarkov equivalenttoignoringthederivatives3 Eachfullstepconsistsofupdatinghgivenvthe nupdatingvgivenh.> <00jihvijijijij t = infinityt = 0 t = 1 t = 2 t = infinity> < jihvFigure4:Thisdepictsa Markov , totheinputsreceivedfromthethecurrentstatesofthevisibleunitsinthebottomlayer, thenthevisibleunitsareallupdatedin initializedbysettingthebinarystatesofthevisibleunitstobethesameasa data-vector. Thecorrelationsintheactivitiesofa visibleanda ofthelogprobabilityoftheposteriordistributioninlayerVn, whichis alsothederivative oftheKullback-Leiblerdivergencebe-tweentheposteriordistributionin layerVn,Pn , di-vergencelearningminimizesthedifferenceoftwo Kullback-Leiblerdivergences:KL(P0jjP1 ) KL(Pn jjP1 )(6)Ignoringsamplingnoise,thisdifferenceis nevernegativebecauseGibbssamplingis usedtoproducePn is importanttono-ticethatPn dependsonthecurrentmodelparametersandthewayinwhichPn changesastheparameterschangeisbeingignoredbycontrastive divergencelearningrulescanbefoundinCarreira-PerpinanandHinton(2005).
8 Contrastive divergencelearningina restrictedBoltzmannmachineis efficientenoughtobepractical(MayrazandHi n-ton,2001).Variationsthatusereal-valued unitsanddiffer-entsamplingschemesaredesc ribedinTehetal.(2003)andhave beenquitesuccessfulformodelingtheformati onofto-pographicmaps(Wellingetal.,2003), fordenoisingnaturalimages(RothandBlack,2 005)orimagesofbiologicalcells(Ningetal., 2005).MarksandMovellan(2001)describeaway ofusingcontrastive divergencetoperformfactoranaly-sisandWel linget al.(2005)show thata networkwithlogistic,binaryvisibleunitsan dlinear, , it appearsthattheefficiency hasbeenboughtat a highprice:Whenappliedintheobviousway, contrastive divergencelearningfailsfordeep,multilaye rnetworkswithdifferentweightsat eachlayerbecausethesenetworkstake fartoolongeventoreachcondi-tionalequilib riumwitha clampeddata-vector. We now showthattheequivalencebetweenRBM s greedylearningalgorithmfortransformingre presentationsAnefficientwaytolearna complicatedmodelis tocombinea setofsimplermodelsthatarelearnedsequenti ally.
9 To forceeachmodelinthesequencetolearnsometh ingdifferentfromthepreviousmodels,thedat ais (Freund,1995),eachmodelin thesequenceis ,thevarianceina modeleddirectionis removedthusforcingthenext modeleddirectiontolieintheorthogonalsubs pace(Sanger,1989).Inprojectionpursuit(Fr iedmanandStuetzle,1981),thedatais transformedbynonlinearlydistortingonedir ectioninthedata-spacetoremove toallow eachmodelinthesequencetoreceive a showsa multilayergenerative modelinwhichthetoptwo thetopareequivalentto havinginfinitelymany ,tosimplifytheanalysis,alllayershave is possibletolearnsensible(thoughnotoptimal )valuesfortheparametersW0byassumingthatt heparame-tersbetweenhigherlayerswillbeus edto constructa comple-mentarypriorforW0. Thisis stilldifficult,goodap-proximatesolutions canbefoundrapidlybyminimizingcon-trastiv e ,thedatacanbemappedthroughWT0tocreatehig her-level data perfectmodeloftheoriginaldata,thehigher- level data , however, theRBMwillnotbeabletomodeltheoriginaldat aperfectlyandwecanmake thegenerative layershave undi-rectedconnectionsandformanassociati ve memory.
10 Thelay-ersbelowhave directed,top-down,generative connectionsthatcanbeusedtomapa stateoftheassociative ,bottom-up,recognitionconnectionsthatare usedtoinfera factorialrepresentationinonelayerfromthe binaryactivitiesinthelayerbelow. Inthegreedyinitiallearningtherecognition connectionsaretiedtothegenerative , evenifsubsequentchangesinhigherlevelweig htsmeanthatthisinferencemethodis , butuntiedfromW0, learnanRBMmodelofthehigher-level data ,it isguaranteedtoimprove thegenerative (NealandHinton,1998),thenegative logprob-abilityofa singledata-vector,v0, underthemultilayergen-erative modelis boundedbya variationalfreeenergywhichis theexpectedenergyundertheapproximatingdi stribution,Q(h0jv0), minustheentropy di-rectedmodel,the energy oftheconfigurationv0;h0isgivenby:E(v0;h0 ) = logp(h0) + logp(v0jh0) (7)Sotheboundis:logp(v0) Xallh0Q(h0jv0) logp(h0) + logp(v0jh0) Xallh0Q(h0jv0) logQ(h0jv0)(8)whereh0is a binaryconfigurationoftheunitsinthefirsth id-denlayer,p(h0)is thepriorprobabilityofh0underthecur-rentm odel(whichis definedbytheweightsaboveH0) andQ( jv0)isany probabilitydistributionoverthebinarycon- figurationsinthefirsthiddenlayer.