Transcription of A fast learning algorithm for deep belief nets
1 A fastlearningalgorithmfordeepbeliefnets TehDepartmentofComputerScienceNationalUn iversityofSingapore3 ScienceDrive 3, showhowtouse complementarypriors toeliminatetheexplainingawayeffectsthatm akeinferencedifficultindensely-connected beliefnetsthathave many ,wederive a fast,greedyalgo-rithmthatcanlearndeep,di rectedbeliefnetworksonelayerata time,providedthetoptwo lay-ersformanundirectedassociative memory. Thefast,greedyalgorithmis usedto initializea slowerlearningprocedurethatfine-tunesthe weightsus-inga contrastive ,a networkwiththreehiddenlayersformsa verygoodgenerative modelgivesbetterdigitclassificationthant hebestdiscrimi-native associative memoryandit is easytoex-ploretheseravinesbyusingthedire ctedconnec-tionstodisplaywhattheassociat ive difficultin densely-connected,directedbeliefnetsthat have many hiddenlayersbecauseit is difficultto infertheconditionaldistributionofthehidd enactivitieswhengivenadatavector.
2 Variationalmethodsusesimpleapproximation stothetrueconditionaldistribution,butthe approximationsmaybepoor, , describea modelinwhichthetoptwo hiddenlayersformanundirectedassociative memory(seefigure1)andthe To appearinNeuralComputation2006remaininghi ddenlayersforma directedacyclicgraphthatconvertstherepre sentationsintheassociative a fast,greedylearningalgorithmthatcanfinda fairlygoodsetofparametersquickly, evenindeepnetworkswithmillionsofparamete rsandmany unsupervisedbutcanbeap-pliedtolabeleddat abylearninga fine-tuningalgorithmthatlearnsanexcel-le ntgenerative modelwhichoutperformsdiscrimina-tive modelmakesit perceptis introducestheideaofa complementary priorwhichexactlycancelsthe explainingaway introducesa fast.
3 Greedylearningalgorithmforconstructingmu lti-layerdirectednetworksonelayerata variationalboundit showsthataseachnewlayerisadded,theoveral lgenerative weak learner, butinsteadofre-weightingeachdata-vectort oensurethatthenextsteplearnssomethingnew , it weak learnerthat2000 top-level units500 units 500 units 28 x 28 pixel image10 label unitsThis could be the top level of another sensory pathwayFigure1 , eachtrainingcaseconsistsofanimageandanex plicitclasslabel,butworkinprogresshassho wnthatthesamelearningalgorithmcanbeusedi f the labels arereplacedbya generatepairsthatconsistofanimageanda usedtoconstructdeepdirectednetsis showshowtheweightsproducedbythefastgreed yalgorithmcanbefine-tunedusingthe up-down a contrastive versionofthewake-sleepal-gorithmHintonet al.
4 (1995)thatdoesnotsufferfromthe mode-averaging showsthepatternrecognitionperformanceofa ,thegeneralizationperformanceofthenetwor kis ; (2002) , section7 showswhathappensinthemindofthenetworkwhe nit is fullgenerative model,soit iseasyto lookintoitsmind wesimplygenerateanimagefromitshigh-level , wewillconsidernetscomposedofFigure2:Asim plelogisticbeliefnetcontainingtwo inde-pendent,rarecausesthatbecomehighlya nti-correlatedwhenweobserve 10ontheearth-quake nodemeansthat,in theabsenceofany observation,thisnodeise10timesmorelikely tobeoff theearth-quake nodeis onandthetrucknodeis off, thejumpnodehasa totalinputof0whichmeansthatit a muchbetterexplanationoftheobservationtha tthehousejumpedthantheoddsofe 20whichapplyifneitherofthehiddencausesis active .
5 Butit is wastefulto turnonbothhiddencausestoexplaintheobserv ationbecausetheprobabilityofthembothhapp eningise 10 e 10=e nodeis turnedonit explainsaway variableis anadditive functionofthestatesofitsdirectly-connect edneigh-bours(seeAppendixA fordetails).2 ComplementarypriorsThephenomenonofexplai ningaway(illustratedinfigure2) ,theposteriordistributionoverthehid-denv ariablesis intractableexceptina few specialcasessuchasmixturemodelsorlinearm odelswithadditive ChainMonteCarlomethods(Neal,1992)canbeus edtosamplefromtheposterior, butthey (NealandHinton,1998)approximatethetruepo steriorwitha moretractabledistributionandthey canbeusedto improve a is comfortingthatlearningisguaranteedtoimpr ove a variationalboundevenwhentheinferenceofth ehiddenstatesisdoneincorrectly,butit wouldbemuchbettertofinda wayofeliminatingex-plainingawayaltogethe r, eveninmodelswhosehiddenvari-ableshave is widelyassumedthatthisis logisticbeliefnet(Neal,1992)is ,theprobabilityofturningonunitiisa logisticfunctionofthestatesofitsimmediat eancestors,j, andoftheweights,wij,onthedirectedconnect ionsfromtheancestors.
6 P(si= 1) =11 + exp( bi Pjsjwij)(1)wherebiisthebiasofuniti. Ifa logisticbeliefnetonlyhasonehiddenlayer, thepriordistributionoverthehiddenvariabl esisfactorialbecausetheirbinarystatesare chosenindependentlywhenthemodelis complementary ,whenthelikelihoodtermis multipliedbytheprior, wewillgeta posteriorthatis isnotat allobviousthatcomplementarypriorsexist,b ut figure3showsa simpleexampleofaninfinitelogisticbeliefn etwithtiedweightsinwhichthepriorsarecomp lementaryateveryhiddenlayer(seeAppendixA fora moregeneraltreatmentoftheconditionsunder whichcomplementarypriorsexist).Theuseoft iedweightstoconstructcomplementarypriors mayseemlike a ,however, it leadstoa cangeneratedatafromtheinfinitedirectedne tinfig-ure3 bystartingwitha randomconfigurationat aninfinitelydeephiddenlayer1andthenperfo rminga top-down ances-tral passinwhichthebinarystateofeachvariablei na layerischosenfromtheBernoullidistributio ndeterminedbythetop-downinputcomingfromi tsactive ,it is justlike any otherdirectednets,however.
7 Wecansam-plefromthetrueposteriordistribu tionoverallofthehiddenlayersbystartingwi tha Ap-pendixAshowsthatthisproceduregivesunb iasedsamplesbecausethecomplementaryprior ateachlayerensuresthattheposteriordistri butionreallyis , Chain,soweneedtostartata layerthatisdeepcomparedwiththetimeit exactlythesameastheinferenceprocedureuse dinthewake-sleepalgorithm(Hintonet al.,1995) fora generative weight,w00ij, froma unitjinlayerH0tounitiinlayerV0(seefigure 3).Ina logisticbeliefnet,themaximumlikelihoodle arningrulefora singledata-vector,v0, is:@logp(v0)@w00ij=<h0j(v0i ^v0i)>(2)where< >denotesanaverageoverthesampledstatesand^ v0iis theprobabilitythatunitiwouldbeturnedonif ,V1, fromthesampledbinarystatesin thefirsthiddenlayer,H0, is exactlythesameprocessasrecon-structingth edata,sov1iis a samplefroma Bernoullirandomvariablewithprobability^v 0i.
8 Thelearningrulecanthereforebewrittenas:@ logp(v0)@w00ij=<h0j(v0i v1i)>(3)Thedependenceofv1ionh0jis because^v0iis anexpectationthatisconditionalonh0j. Sincetheweightsarereplicated,thefullderi vative fora generative weightis obtainedbysummingthederivativesofthegene rative weightsbetweenallpairsoflay-ers:@logp(v0 )@wij=<h0j(v0i v1i)>+<v1i(h0j h1j)>+<h1j(v1i v2i)>+:::(4) divergencelearningIt maynotbeimmediatelyobviousthattheinfinit edirectednetinfigure3 isequivalenttoa RestrictedBoltzmannMa-chine(RBM).AnRBMha sa singlelayerofhiddenunitswhicharenotconne ctedtoeachotherandhave undirected,symmetricalconnectionstoa gen-eratedatafromanRBM,wecanstartwitha randomstateinoneofthelayersandthenperfor malternatingGibbssam-pling.
9 Alloftheunitsinonelayerareupdatedinparal lelgiventhecurrentstatesoftheunitsintheo therlayerandthisis repeateduntilthesystemis performmaximumlikelihoodlearninginanRBM, wecanusethedifferencebetweentwo ,wij, betweena visibleunitianda hiddenunit,jwemeasurethecorrelation< v0ih0j>whena representtheparametersthatareusedtoinfer samplesfromtheposteriordistributionat eachhiddenlayerofthenetwhena datavectoris , ,usingal-ternatingGibbssampling,werunthe Markov chainshowninfigure4 untilit reachesitsstationarydistributionandmeasu rethecorrelation<v1ih1j>. Thegradientofthelogprobabilityofthetrain ingdatais then@logp(v0)@wij=<v0ih0j> <v1ih1j>(5)Thislearningruleisthesameasthemaximum likelihoodlearningrulefortheinfinitelogi sticbeliefnetwithtiedweights,andeachstep ofGibbssamplingcorrespondstocomputingthe exactposteriordistributionina ,KL(P0jjP1 ), betweenthedistributionofthedata,P0, andtheequilibriumdistributiondefinedbyth emodel,P1.
10 Incontrastive divergencelearning(Hinton,2002),weonlyru ntheMarkov equivalenttoignoringthederivatives3 Eachfullstepconsistsofupdatinghgivenvthe nupdatingvgivenh.> <00jihvijijijij t = infinityt = 0 t = 1 t = 2 t = infinity> < jihvFigure4:Thisdepictsa Markov , totheinputsreceivedfromthethecurrentstatesofthevisibleunitsinthebottomlayer, thenthevisibleunitsareallupdatedin initializedbysettingthebinarystatesofthevisibleunitstobethesameasa data-vector. Thecorrelationsintheactivitiesofa visibleanda ofthelogprobabilityoftheposteriordistributioninlayerVn, whichis alsothederivative oftheKullback-Leiblerdivergencebe-tweentheposteriordistributionin layerVn,Pn , di-vergencelearningminimizesthedifferenceoftwo Kullback-Leiblerdivergences:KL(P0jjP1 ) KL(Pn jjP1 )(6)Ignoringsamplingnoise,thisdifferenceis nevernegativebecauseGibbssamplingis usedtoproducePn is importanttono-ticethat