Example: bachelor of science

A fast learning algorithm for deep belief nets

A fastlearningalgorithmfordeepbeliefnets TehDepartmentofComputerScienceNationalUn iversityofSingapore3 ScienceDrive 3, showhowtouse complementarypriors toeliminatetheexplainingawayeffectsthatm akeinferencedifficultindensely-connected beliefnetsthathave many ,wederive a fast,greedyalgo-rithmthatcanlearndeep,di rectedbeliefnetworksonelayerata time,providedthetoptwo lay-ersformanundirectedassociative memory. Thefast,greedyalgorithmis usedto initializea slowerlearningprocedurethatfine-tunesthe weightsus-inga contrastive ,a networkwiththreehiddenlayersformsa verygoodgenerative modelgivesbetterdigitclassificationthant hebestdiscrimi-native associative memoryandit is easytoex-ploretheseravinesbyusingthedire ctedconnec-tionstodisplaywhattheassociat ive difficultin densely-connected,directedbeliefnetsthat have many hiddenlayersbecauseit is difficultto infertheconditio

pixel image 10 label units This could be the top level of another sensory pathway Figure 1: The network used to model the joint distribution ... neither of the hidden causes is active. But it is wasteful to turn on both hidden causes to explain the observation because the

Tags:

  Active, Pixel

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Advertisement

Transcription of A fast learning algorithm for deep belief nets

1 A fastlearningalgorithmfordeepbeliefnets TehDepartmentofComputerScienceNationalUn iversityofSingapore3 ScienceDrive 3, showhowtouse complementarypriors toeliminatetheexplainingawayeffectsthatm akeinferencedifficultindensely-connected beliefnetsthathave many ,wederive a fast,greedyalgo-rithmthatcanlearndeep,di rectedbeliefnetworksonelayerata time,providedthetoptwo lay-ersformanundirectedassociative memory. Thefast,greedyalgorithmis usedto initializea slowerlearningprocedurethatfine-tunesthe weightsus-inga contrastive ,a networkwiththreehiddenlayersformsa verygoodgenerative modelgivesbetterdigitclassificationthant hebestdiscrimi-native associative memoryandit is easytoex-ploretheseravinesbyusingthedire ctedconnec-tionstodisplaywhattheassociat ive difficultin densely-connected,directedbeliefnetsthat have many hiddenlayersbecauseit is difficultto infertheconditionaldistributionofthehidd enactivitieswhengivenadatavector.

2 Variationalmethodsusesimpleapproximation stothetrueconditionaldistribution,butthe approximationsmaybepoor, , describea modelinwhichthetoptwo hiddenlayersformanundirectedassociative memory(seefigure1)andthe To appearinNeuralComputation2006remaininghi ddenlayersforma directedacyclicgraphthatconvertstherepre sentationsintheassociative a fast,greedylearningalgorithmthatcanfinda fairlygoodsetofparametersquickly, evenindeepnetworkswithmillionsofparamete rsandmany unsupervisedbutcanbeap-pliedtolabeleddat abylearninga fine-tuningalgorithmthatlearnsanexcel-le ntgenerative modelwhichoutperformsdiscrimina-tive modelmakesit perceptis introducestheideaofa complementary priorwhichexactlycancelsthe explainingaway introducesa fast.

3 Greedylearningalgorithmforconstructingmu lti-layerdirectednetworksonelayerata variationalboundit showsthataseachnewlayerisadded,theoveral lgenerative weak learner, butinsteadofre-weightingeachdata-vectort oensurethatthenextsteplearnssomethingnew , it weak learnerthat2000 top-level units500 units 500 units 28 x 28 pixel image10 label unitsThis could be the top level of another sensory pathwayFigure1 , eachtrainingcaseconsistsofanimageandanex plicitclasslabel,butworkinprogresshassho wnthatthesamelearningalgorithmcanbeusedi f the labels arereplacedbya generatepairsthatconsistofanimageanda usedtoconstructdeepdirectednetsis showshowtheweightsproducedbythefastgreed yalgorithmcanbefine-tunedusingthe up-down a contrastive versionofthewake-sleepal-gorithmHintonet al.

4 (1995)thatdoesnotsufferfromthe mode-averaging showsthepatternrecognitionperformanceofa ,thegeneralizationperformanceofthenetwor kis ; (2002) , section7 showswhathappensinthemindofthenetworkwhe nit is fullgenerative model,soit iseasyto lookintoitsmind wesimplygenerateanimagefromitshigh-level , wewillconsidernetscomposedofFigure2:Asim plelogisticbeliefnetcontainingtwo inde-pendent,rarecausesthatbecomehighlya nti-correlatedwhenweobserve 10ontheearth-quake nodemeansthat,in theabsenceofany observation,thisnodeise10timesmorelikely tobeoff theearth-quake nodeis onandthetrucknodeis off, thejumpnodehasa totalinputof0whichmeansthatit a muchbetterexplanationoftheobservationtha tthehousejumpedthantheoddsofe 20whichapplyifneitherofthehiddencausesis active .

5 Butit is wastefulto turnonbothhiddencausestoexplaintheobserv ationbecausetheprobabilityofthembothhapp eningise 10 e 10=e nodeis turnedonit explainsaway variableis anadditive functionofthestatesofitsdirectly-connect edneigh-bours(seeAppendixA fordetails).2 ComplementarypriorsThephenomenonofexplai ningaway(illustratedinfigure2) ,theposteriordistributionoverthehid-denv ariablesis intractableexceptina few specialcasessuchasmixturemodelsorlinearm odelswithadditive ChainMonteCarlomethods(Neal,1992)canbeus edtosamplefromtheposterior, butthey (NealandHinton,1998)approximatethetruepo steriorwitha moretractabledistributionandthey canbeusedto improve a is comfortingthatlearningisguaranteedtoimpr ove a variationalboundevenwhentheinferenceofth ehiddenstatesisdoneincorrectly,butit wouldbemuchbettertofinda wayofeliminatingex-plainingawayaltogethe r, eveninmodelswhosehiddenvari-ableshave is widelyassumedthatthisis logisticbeliefnet(Neal,1992)is ,theprobabilityofturningonunitiisa logisticfunctionofthestatesofitsimmediat eancestors,j, andoftheweights,wij,onthedirectedconnect ionsfromtheancestors.

6 P(si= 1) =11 + exp( bi Pjsjwij)(1)wherebiisthebiasofuniti. Ifa logisticbeliefnetonlyhasonehiddenlayer, thepriordistributionoverthehiddenvariabl esisfactorialbecausetheirbinarystatesare chosenindependentlywhenthemodelis complementary ,whenthelikelihoodtermis multipliedbytheprior, wewillgeta posteriorthatis isnotat allobviousthatcomplementarypriorsexist,b ut figure3showsa simpleexampleofaninfinitelogisticbeliefn etwithtiedweightsinwhichthepriorsarecomp lementaryateveryhiddenlayer(seeAppendixA fora moregeneraltreatmentoftheconditionsunder whichcomplementarypriorsexist).Theuseoft iedweightstoconstructcomplementarypriors mayseemlike a ,however, it leadstoa cangeneratedatafromtheinfinitedirectedne tinfig-ure3 bystartingwitha randomconfigurationat aninfinitelydeephiddenlayer1andthenperfo rminga top-down ances-tral passinwhichthebinarystateofeachvariablei na layerischosenfromtheBernoullidistributio ndeterminedbythetop-downinputcomingfromi tsactive ,it is justlike any otherdirectednets,however.

7 Wecansam-plefromthetrueposteriordistribu tionoverallofthehiddenlayersbystartingwi tha Ap-pendixAshowsthatthisproceduregivesunb iasedsamplesbecausethecomplementaryprior ateachlayerensuresthattheposteriordistri butionreallyis , Chain,soweneedtostartata layerthatisdeepcomparedwiththetimeit exactlythesameastheinferenceprocedureuse dinthewake-sleepalgorithm(Hintonet al.,1995) fora generative weight,w00ij, froma unitjinlayerH0tounitiinlayerV0(seefigure 3).Ina logisticbeliefnet,themaximumlikelihoodle arningrulefora singledata-vector,v0, is:@logp(v0)@w00ij=<h0j(v0i ^v0i)>(2)where< >denotesanaverageoverthesampledstatesand^ v0iis theprobabilitythatunitiwouldbeturnedonif ,V1, fromthesampledbinarystatesin thefirsthiddenlayer,H0, is exactlythesameprocessasrecon-structingth edata,sov1iis a samplefroma Bernoullirandomvariablewithprobability^v 0i.

8 Thelearningrulecanthereforebewrittenas:@ logp(v0)@w00ij=<h0j(v0i v1i)>(3)Thedependenceofv1ionh0jis because^v0iis anexpectationthatisconditionalonh0j. Sincetheweightsarereplicated,thefullderi vative fora generative weightis obtainedbysummingthederivativesofthegene rative weightsbetweenallpairsoflay-ers:@logp(v0 )@wij=<h0j(v0i v1i)>+<v1i(h0j h1j)>+<h1j(v1i v2i)>+:::(4) divergencelearningIt maynotbeimmediatelyobviousthattheinfinit edirectednetinfigure3 isequivalenttoa RestrictedBoltzmannMa-chine(RBM).AnRBMha sa singlelayerofhiddenunitswhicharenotconne ctedtoeachotherandhave undirected,symmetricalconnectionstoa gen-eratedatafromanRBM,wecanstartwitha randomstateinoneofthelayersandthenperfor malternatingGibbssam-pling.

9 Alloftheunitsinonelayerareupdatedinparal lelgiventhecurrentstatesoftheunitsintheo therlayerandthisis repeateduntilthesystemis performmaximumlikelihoodlearninginanRBM, wecanusethedifferencebetweentwo ,wij, betweena visibleunitianda hiddenunit,jwemeasurethecorrelation< v0ih0j>whena representtheparametersthatareusedtoinfer samplesfromtheposteriordistributionat eachhiddenlayerofthenetwhena datavectoris , ,usingal-ternatingGibbssampling,werunthe Markov chainshowninfigure4 untilit reachesitsstationarydistributionandmeasu rethecorrelation<v1ih1j>. Thegradientofthelogprobabilityofthetrain ingdatais then@logp(v0)@wij=<v0ih0j> <v1ih1j>(5)Thislearningruleisthesameasthemaximum likelihoodlearningrulefortheinfinitelogi sticbeliefnetwithtiedweights,andeachstep ofGibbssamplingcorrespondstocomputingthe exactposteriordistributionina ,KL(P0jjP1 ), betweenthedistributionofthedata,P0, andtheequilibriumdistributiondefinedbyth emodel,P1.

10 Incontrastive divergencelearning(Hinton,2002),weonlyru ntheMarkov equivalenttoignoringthederivatives3 Eachfullstepconsistsofupdatinghgivenvthe nupdatingvgivenh.> <00jihvijijijij t = infinityt = 0 t = 1 t = 2 t = infinity> < jihvFigure4:Thisdepictsa Markov , totheinputsreceivedfromthethecurrentstatesofthevisibleunitsinthebottomlayer, thenthevisibleunitsareallupdatedin initializedbysettingthebinarystatesofthevisibleunitstobethesameasa data-vector. Thecorrelationsintheactivitiesofa visibleanda ofthelogprobabilityoftheposteriordistributioninlayerVn, whichis alsothederivative oftheKullback-Leiblerdivergencebe-tweentheposteriordistributionin layerVn,Pn , di-vergencelearningminimizesthedifferenceoftwo Kullback-Leiblerdivergences:KL(P0jjP1 ) KL(Pn jjP1 )(6)Ignoringsamplingnoise,thisdifferenceis nevernegativebecauseGibbssamplingis usedtoproducePn is importanttono-ticethat


Related search queries