Transcription of A fast learning algorithm for deep belief nets
1 A fastlearningalgorithmfordeepbeliefnets TehDepartmentofComputerScienceNationalUn iversityofSingapore3 ScienceDrive 3, showhowtouse complementarypriors toeliminatetheexplainingawayeffectsthatm akeinferencedifficultindensely-connected beliefnetsthathave many ,wederive a fast,greedyalgo-rithmthatcanlearndeep,di rectedbeliefnetworksonelayerata time,providedthetoptwo lay-ersformanundirectedassociative memory. Thefast,greedyalgorithmis usedto initializea slowerlearningprocedurethatfine-tunesthe weightsus-inga contrastive ,a networkwiththreehiddenlayersformsa verygoodgenerative modelgivesbetterdigitclassificationthant hebestdiscrimi-native associative memoryandit is easytoex-ploretheseravinesbyusingthedire ctedconnec-tionstodisplaywhattheassociat ive difficultin densely-connected,directedbeliefnetsthat have many hiddenlayersbecauseit is difficultto infertheconditionaldistributionofthehidd enactivitieswhengivenadatavector.
2 Variationalmethodsusesimpleapproximation stothetrueconditionaldistribution,butthe approximationsmaybepoor, , describea modelinwhichthetoptwo hiddenlayersformanundirectedassociative memory(seefigure1)andthe To appearinNeuralComputation2006remaininghi ddenlayersforma directedacyclicgraphthatconvertstherepre sentationsintheassociative a fast,greedylearningalgorithmthatcanfinda fairlygoodsetofparametersquickly, evenindeepnetworkswithmillionsofparamete rsandmany unsupervisedbutcanbeap-pliedtolabeleddat abylearninga fine-tuningalgorithmthatlearnsanexcel-le ntgenerative modelwhichoutperformsdiscrimina-tive modelmakesit perceptis introducestheideaofa complementary priorwhichexactlycancelsthe explainingaway introducesa fast,greedylearningalgorithmforconstruct ingmulti-layerdirectednetworksonelayerat a variationalboundit showsthataseachnewlayerisadded,theoveral lgenerative weak learner, butinsteadofre-weightingeachdata-vectort oensurethatthenextsteplearnssomethingnew .
3 It weak learnerthat2000 top-level units500 units 500 units 28 x 28 pixel image10 label unitsThis could be the top level of another sensory pathwayFigure1 , eachtrainingcaseconsistsofanimageandanex plicitclasslabel,butworkinprogresshassho wnthatthesamelearningalgorithmcanbeusedi f the labels arereplacedbya generatepairsthatconsistofanimageanda usedtoconstructdeepdirectednetsis showshowtheweightsproducedbythefastgreed yalgorithmcanbefine-tunedusingthe up-down a contrastive versionofthewake-sleepal-gorithmHintonet al.(1995)thatdoesnotsufferfromthe mode-averaging showsthepatternrecognitionperformanceofa ,thegeneralizationperformanceofthenetwor kis ; (2002) , section7 showswhathappensinthemindofthenetworkwhe nit is fullgenerative model,soit iseasyto lookintoitsmind wesimplygenerateanimagefromitshigh-level , wewillconsidernetscomposedofFigure2:Asim plelogisticbeliefnetcontainingtwo inde-pendent,rarecausesthatbecomehighlya nti-correlatedwhenweobserve 10ontheearth-quake nodemeansthat,in theabsenceofany observation,thisnodeise10timesmorelikely tobeoff theearth-quake nodeis onandthetrucknodeis off, thejumpnodehasa totalinputof0whichmeansthatit a muchbetterexplanationoftheobservationtha tthehousejumpedthantheoddsofe 20whichapplyifneitherofthehiddencausesis active.
4 Butit is wastefulto turnonbothhiddencausestoexplaintheobserv ationbecausetheprobabilityofthembothhapp eningise 10 e 10=e nodeis turnedonit explainsaway variableis anadditive functionofthestatesofitsdirectly-connect edneigh-bours(seeAppendixA fordetails).2 ComplementarypriorsThephenomenonofexplai ningaway(illustratedinfigure2) ,theposteriordistributionoverthehid-denv ariablesis intractableexceptina few specialcasessuchasmixturemodelsorlinearm odelswithadditive ChainMonteCarlomethods(Neal,1992)canbeus edtosamplefromtheposterior, butthey (NealandHinton,1998)approximatethetruepo steriorwitha moretractabledistributionandthey canbeusedto improve a is comfortingthatlearningisguaranteedtoimpr ove a variationalboundevenwhentheinferenceofth ehiddenstatesisdoneincorrectly,butit wouldbemuchbettertofinda wayofeliminatingex-plainingawayaltogethe r, eveninmodelswhosehiddenvari-ableshave is widelyassumedthatthisis logisticbeliefnet(Neal,1992)is ,theprobabilityofturningonunitiisa logisticfunctionofthestatesofitsimmediat eancestors,j, andoftheweights,wij,onthedirectedconnect ionsfromtheancestors:p(si= 1) =11 + exp( bi Pjsjwij)(1)wherebiisthebiasofuniti.
5 Ifa logisticbeliefnetonlyhasonehiddenlayer, thepriordistributionoverthehiddenvariabl esisfactorialbecausetheirbinarystatesare chosenindependentlywhenthemodelis complementary ,whenthelikelihoodtermis multipliedbytheprior, wewillgeta posteriorthatis isnotat allobviousthatcomplementarypriorsexist,b ut figure3showsa simpleexampleofaninfinitelogisticbeliefn etwithtiedweightsinwhichthepriorsarecomp lementaryateveryhiddenlayer(seeAppendixA fora moregeneraltreatmentoftheconditionsunder whichcomplementarypriorsexist).Theuseoft iedweightstoconstructcomplementarypriors mayseemlike a ,however, it leadstoa cangeneratedatafromtheinfinitedirectedne tinfig-ure3 bystartingwitha randomconfigurationat aninfinitelydeephiddenlayer1andthenperfo rminga top-down ances-tral passinwhichthebinarystateofeachvariablei na layerischosenfromtheBernoullidistributio ndeterminedbythetop-downinputcomingfromi tsactive ,it is justlike any otherdirectednets,however, wecansam-plefromthetrueposteriordistribu tionoverallofthehiddenlayersbystartingwi tha Ap-pendixAshowsthatthisproceduregivesunb iasedsamplesbecausethecomplementaryprior ateachlayerensuresthattheposteriordistri butionreallyis , Chain,soweneedtostartata layerthatisdeepcomparedwiththetimeit exactlythesameastheinferenceprocedureuse dinthewake-sleepalgorithm(Hintonet al.)
6 ,1995) fora generative weight,w00ij, froma unitjinlayerH0tounitiinlayerV0(seefigure 3).Ina logisticbeliefnet,themaximumlikelihoodle arningrulefora singledata-vector,v0, is:@logp(v0)@w00ij=<h0j(v0i ^v0i)>(2)where< >denotesanaverageoverthesampledstatesand^ v0iis theprobabilitythatunitiwouldbeturnedonif ,V1, fromthesampledbinarystatesin thefirsthiddenlayer,H0, is exactlythesameprocessasrecon-structingth edata,sov1iis a samplefroma Bernoullirandomvariablewithprobability^v 0i. Thelearningrulecanthereforebewrittenas:@ logp(v0)@w00ij=<h0j(v0i v1i)>(3)Thedependenceofv1ionh0jis because^v0iis anexpectationthatisconditionalonh0j. Sincetheweightsarereplicated,thefullderi vative fora generative weightis obtainedbysummingthederivativesofthegene rative weightsbetweenallpairsoflay-ers:@logp(v0 )@wij=<h0j(v0i v1i)>+<v1i(h0j h1j)>+<h1j(v1i v2i)>+:::(4) divergencelearningIt maynotbeimmediatelyobviousthattheinfinit edirectednetinfigure3 isequivalenttoa RestrictedBoltzmannMa-chine(RBM).
7 AnRBMhasa singlelayerofhiddenunitswhicharenotconne ctedtoeachotherandhave undirected,symmetricalconnectionstoa gen-eratedatafromanRBM,wecanstartwitha randomstateinoneofthelayersandthenperfor malternatingGibbssam-pling:Alloftheunits inonelayerareupdatedinparallelgiventhecu rrentstatesoftheunitsintheotherlayerandt hisis repeateduntilthesystemis performmaximumlikelihoodlearninginanRBM, wecanusethedifferencebetweentwo ,wij, betweena visibleunitianda hiddenunit,jwemeasurethecorrelation< v0ih0j>whena representtheparametersthatareusedtoinfer samplesfromtheposteriordistributionat eachhiddenlayerofthenetwhena datavectoris , ,usingal-ternatingGibbssampling,werunthe Markov chainshowninfigure4 untilit reachesitsstationarydistributionandmeasu rethecorrelation<v1ih1j>. Thegradientofthelogprobabilityofthetrain ingdatais then@logp(v0)@wij=<v0ih0j> <v1ih1j>(5)Thislearningruleisthesameasthemaximum likelihoodlearningrulefortheinfinitelogi sticbeliefnetwithtiedweights,andeachstep ofGibbssamplingcorrespondstocomputingthe exactposteriordistributionina ,KL(P0jjP1 ), betweenthedistributionofthedata,P0, andtheequilibriumdistributiondefinedbyth emodel,P1.
8 Incontrastive divergencelearning(Hinton,2002),weonlyru ntheMarkov equivalenttoignoringthederivatives3 Eachfullstepconsistsofupdatinghgivenvthe nupdatingvgivenh.> <00jihvijijijij t = infinityt = 0 t = 1 t = 2 t = infinity> < jihvFigure4:Thisdepictsa Markov , totheinputsreceivedfromthethecurrentstatesofthevisibleunitsinthebottomlayer, thenthevisibleunitsareallupdatedin initializedbysettingthebinarystatesofthevisibleunitstobethesameasa data-vector. Thecorrelationsintheactivitiesofa visibleanda ofthelogprobabilityoftheposteriordistributioninlayerVn, whichis alsothederivative oftheKullback-Leiblerdivergencebe-tweentheposteriordistributionin layerVn,Pn , di-vergencelearningminimizesthedifferenceoftwo Kullback-Leiblerdivergences:KL(P0jjP1 ) KL(Pn jjP1 )(6)Ignoringsamplingnoise,thisdifferenceis nevernegativebecauseGibbssamplingis usedtoproducePn is importanttono-ticethatPn dependsonthecurrentmodelparametersandthewayinwhichPn changesastheparameterschangeisbeingignoredbycontrastive divergencelearningrulescanbefoundinCarreira-PerpinanandHinton(2005).
9 Contrastive divergencelearningina restrictedBoltzmannmachineis efficientenoughtobepractical(MayrazandHi n-ton,2001).Variationsthatusereal-valued unitsanddiffer-entsamplingschemesaredesc ribedinTehetal.(2003)andhave beenquitesuccessfulformodelingtheformati onofto-pographicmaps(Wellingetal.,2003), fordenoisingnaturalimages(RothandBlack,2 005)orimagesofbiologicalcells(Ningetal., 2005).MarksandMovellan(2001)describeaway ofusingcontrastive divergencetoperformfactoranaly-sisandWel linget al.(2005)show thata networkwithlogistic,binaryvisibleunitsan dlinear, , it appearsthattheefficiency hasbeenboughtat a highprice:Whenappliedintheobviousway, contrastive divergencelearningfailsfordeep,multilaye rnetworkswithdifferentweightsat eachlayerbecausethesenetworkstake fartoolongeventoreachcondi-tionalequilib riumwitha clampeddata-vector. We now showthattheequivalencebetweenRBM s greedylearningalgorithmfortransformingre presentationsAnefficientwaytolearna complicatedmodelis tocombinea setofsimplermodelsthatarelearnedsequenti ally.
10 To forceeachmodelinthesequencetolearnsometh ingdifferentfromthepreviousmodels,thedat ais (Freund,1995),eachmodelin thesequenceis ,thevarianceina modeleddirectionis removedthusforcingthenext modeleddirectiontolieintheorthogonalsubs pace(Sanger,1989).Inprojectionpursuit(Fr iedmanandStuetzle,1981),thedatais transformedbynonlinearlydistortingonedir ectioninthedata-spacetoremove toallow eachmodelinthesequencetoreceive a showsa multilayergenerative modelinwhichthetoptwo thetopareequivalentto havinginfinitelymany ,tosimplifytheanalysis,alllayershave is possibletolearnsensible(thoughnotoptimal )valuesfortheparametersW0byassumingthatt heparame-tersbetweenhigherlayerswillbeus edto constructa comple-mentarypriorforW0. Thisis stilldifficult,goodap-proximatesolutions canbefoundrapidlybyminimizingcon-trastiv e ,thedatacanbemappedthroughWT0tocreatehig her-level data perfectmodeloftheoriginaldata,thehigher- level data , however, theRBMwillnotbeabletomodeltheoriginaldat aperfectlyandwecanmake thegenerative layershave undi-rectedconnectionsandformanassociati ve memory.