Example: stock market

A fast learning algorithm for deep belief nets

A fastlearningalgorithmfordeepbeliefnets TehDepartmentofComputerScienceNationalUn iversityofSingapore3 ScienceDrive 3, showhowtouse complementarypriors toeliminatetheexplainingawayeffectsthatm akeinferencedifficultindensely-connected beliefnetsthathave many ,wederive a fast,greedyalgo-rithmthatcanlearndeep,di rectedbeliefnetworksonelayerata time,providedthetoptwo lay-ersformanundirectedassociative memory. Thefast,greedyalgorithmis usedto initializea slowerlearningprocedurethatfine-tunesthe weightsus-inga contrastive ,a networkwiththreehiddenlayersformsa verygoodgenerative modelgivesbetterdigitclassificationthant hebestdiscrimi-native associative memoryandit is easytoex-ploretheseravinesbyusingthedire ctedconnec-tionstodisplaywhattheassociat ive difficultin densely-connected,directedbeliefnetsthat have many hiddenlayersbecauseit is difficultto infertheconditionaldistributionofthehidd enactivitieswhengivenadatavector.

image 10 label units This could be the top level of another sensory pathway Figure 1: The network used to model the joint distribution of digit images and digit labels. In this paper, each training case consists of an image and an explicit class label, but work in progress has shown that the same learning algorithm can

Tags:

  Image

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Advertisement

Transcription of A fast learning algorithm for deep belief nets

1 A fastlearningalgorithmfordeepbeliefnets TehDepartmentofComputerScienceNationalUn iversityofSingapore3 ScienceDrive 3, showhowtouse complementarypriors toeliminatetheexplainingawayeffectsthatm akeinferencedifficultindensely-connected beliefnetsthathave many ,wederive a fast,greedyalgo-rithmthatcanlearndeep,di rectedbeliefnetworksonelayerata time,providedthetoptwo lay-ersformanundirectedassociative memory. Thefast,greedyalgorithmis usedto initializea slowerlearningprocedurethatfine-tunesthe weightsus-inga contrastive ,a networkwiththreehiddenlayersformsa verygoodgenerative modelgivesbetterdigitclassificationthant hebestdiscrimi-native associative memoryandit is easytoex-ploretheseravinesbyusingthedire ctedconnec-tionstodisplaywhattheassociat ive difficultin densely-connected,directedbeliefnetsthat have many hiddenlayersbecauseit is difficultto infertheconditionaldistributionofthehidd enactivitieswhengivenadatavector.

2 Variationalmethodsusesimpleapproximation stothetrueconditionaldistribution,butthe approximationsmaybepoor, , describea modelinwhichthetoptwo hiddenlayersformanundirectedassociative memory(seefigure1)andthe To appearinNeuralComputation2006remaininghi ddenlayersforma directedacyclicgraphthatconvertstherepre sentationsintheassociative a fast,greedylearningalgorithmthatcanfinda fairlygoodsetofparametersquickly, evenindeepnetworkswithmillionsofparamete rsandmany unsupervisedbutcanbeap-pliedtolabeleddat abylearninga fine-tuningalgorithmthatlearnsanexcel-le ntgenerative modelwhichoutperformsdiscrimina-tive modelmakesit perceptis introducestheideaofa complementary priorwhichexactlycancelsthe explainingaway introducesa fast,greedylearningalgorithmforconstruct ingmulti-layerdirectednetworksonelayerat a variationalboundit showsthataseachnewlayerisadded.

3 Theoverallgenerative weak learner, butinsteadofre-weightingeachdata-vectort oensurethatthenextsteplearnssomethingnew , it weak learnerthat2000 top-level units500 units 500 units 28 x 28 pixel image10 label unitsThis could be the top level of another sensory pathwayFigure1 , eachtrainingcaseconsistsofanimageandanex plicitclasslabel,butworkinprogresshassho wnthatthesamelearningalgorithmcanbeusedi f the labels arereplacedbya generatepairsthatconsistofanimageanda usedtoconstructdeepdirectednetsis showshowtheweightsproducedbythefastgreed yalgorithmcanbefine-tunedusingthe up-down a contrastive versionofthewake-sleepal-gorithmHintonet al.(1995)thatdoesnotsufferfromthe mode-averaging showsthepatternrecognitionperformanceofa ,thegeneralizationperformanceofthenetwor kis.

4 (2002) , section7 showswhathappensinthemindofthenetworkwhe nit is fullgenerative model,soit iseasyto lookintoitsmind wesimplygenerateanimagefromitshigh-level , wewillconsidernetscomposedofFigure2:Asim plelogisticbeliefnetcontainingtwo inde-pendent,rarecausesthatbecomehighlya nti-correlatedwhenweobserve 10ontheearth-quake nodemeansthat,in theabsenceofany observation,thisnodeise10timesmorelikely tobeoff theearth-quake nodeis onandthetrucknodeis off, thejumpnodehasa totalinputof0whichmeansthatit a muchbetterexplanationoftheobservationtha tthehousejumpedthantheoddsofe 20whichapplyifneitherofthehiddencausesis active. Butit is wastefulto turnonbothhiddencausestoexplaintheobserv ationbecausetheprobabilityofthembothhapp eningise 10 e 10=e nodeis turnedonit explainsaway variableis anadditive functionofthestatesofitsdirectly-connect edneigh-bours(seeAppendixA fordetails).

5 2 ComplementarypriorsThephenomenonofexplai ningaway(illustratedinfigure2) ,theposteriordistributionoverthehid-denv ariablesis intractableexceptina few specialcasessuchasmixturemodelsorlinearm odelswithadditive ChainMonteCarlomethods(Neal,1992)canbeus edtosamplefromtheposterior, butthey (NealandHinton,1998)approximatethetruepo steriorwitha moretractabledistributionandthey canbeusedto improve a is comfortingthatlearningisguaranteedtoimpr ove a variationalboundevenwhentheinferenceofth ehiddenstatesisdoneincorrectly,butit wouldbemuchbettertofinda wayofeliminatingex-plainingawayaltogethe r, eveninmodelswhosehiddenvari-ableshave is widelyassumedthatthisis logisticbeliefnet(Neal,1992)is ,theprobabilityofturningonunitiisa logisticfunctionofthestatesofitsimmediat eancestors,j, andoftheweights,wij,onthedirectedconnect ionsfromtheancestors:p(si= 1) =11 + exp( bi Pjsjwij)(1)wherebiisthebiasofuniti.

6 Ifa logisticbeliefnetonlyhasonehiddenlayer, thepriordistributionoverthehiddenvariabl esisfactorialbecausetheirbinarystatesare chosenindependentlywhenthemodelis complementary ,whenthelikelihoodtermis multipliedbytheprior, wewillgeta posteriorthatis isnotat allobviousthatcomplementarypriorsexist,b ut figure3showsa simpleexampleofaninfinitelogisticbeliefn etwithtiedweightsinwhichthepriorsarecomp lementaryateveryhiddenlayer(seeAppendixA fora moregeneraltreatmentoftheconditionsunder whichcomplementarypriorsexist).Theuseoft iedweightstoconstructcomplementarypriors mayseemlike a ,however, it leadstoa cangeneratedatafromtheinfinitedirectedne tinfig-ure3 bystartingwitha randomconfigurationat aninfinitelydeephiddenlayer1andthenperfo rminga top-down ances-tral passinwhichthebinarystateofeachvariablei na layerischosenfromtheBernoullidistributio ndeterminedbythetop-downinputcomingfromi tsactive ,it is justlike any otherdirectednets,however, wecansam-plefromthetrueposteriordistribu tionoverallofthehiddenlayersbystartingwi tha Ap-pendixAshowsthatthisproceduregivesunb iasedsamplesbecausethecomplementaryprior ateachlayerensuresthattheposteriordistri butionreallyis , Chain.

7 Soweneedtostartata layerthatisdeepcomparedwiththetimeit exactlythesameastheinferenceprocedureuse dinthewake-sleepalgorithm(Hintonet al.,1995) fora generative weight,w00ij, froma unitjinlayerH0tounitiinlayerV0(seefigure 3).Ina logisticbeliefnet,themaximumlikelihoodle arningrulefora singledata-vector,v0, is:@logp(v0)@w00ij=<h0j(v0i ^v0i)>(2)where< >denotesanaverageoverthesampledstatesand^ v0iis theprobabilitythatunitiwouldbeturnedonif ,V1, fromthesampledbinarystatesin thefirsthiddenlayer,H0, is exactlythesameprocessasrecon-structingth edata,sov1iis a samplefroma Bernoullirandomvariablewithprobability^v 0i. Thelearningrulecanthereforebewrittenas:@ logp(v0)@w00ij=<h0j(v0i v1i)>(3)Thedependenceofv1ionh0jis because^v0iis anexpectationthatisconditionalonh0j.

8 Sincetheweightsarereplicated,thefullderi vative fora generative weightis obtainedbysummingthederivativesofthegene rative weightsbetweenallpairsoflay-ers:@logp(v0 )@wij=<h0j(v0i v1i)>+<v1i(h0j h1j)>+<h1j(v1i v2i)>+:::(4) divergencelearningIt maynotbeimmediatelyobviousthattheinfinit edirectednetinfigure3 isequivalenttoa RestrictedBoltzmannMa-chine(RBM).AnRBMha sa singlelayerofhiddenunitswhicharenotconne ctedtoeachotherandhave undirected,symmetricalconnectionstoa gen-eratedatafromanRBM,wecanstartwitha randomstateinoneofthelayersandthenperfor malternatingGibbssam-pling:Alloftheunits inonelayerareupdatedinparallelgiventhecu rrentstatesoftheunitsintheotherlayerandt hisis repeateduntilthesystemis performmaximumlikelihoodlearninginanRBM, wecanusethedifferencebetweentwo ,wij, betweena visibleunitianda hiddenunit,jwemeasurethecorrelation< v0ih0j>whena representtheparametersthatareusedtoinfer samplesfromtheposteriordistributionat eachhiddenlayerofthenetwhena datavectoris , ,usingal-ternatingGibbssampling,werunthe Markov chainshowninfigure4 untilit reachesitsstationarydistributionandmeasu rethecorrelation<v1ih1j>.

9 Thegradientofthelogprobabilityofthetrain ingdatais then@logp(v0)@wij=<v0ih0j> <v1ih1j>(5)Thislearningruleisthesameasthemaximum likelihoodlearningrulefortheinfinitelogi sticbeliefnetwithtiedweights,andeachstep ofGibbssamplingcorrespondstocomputingthe exactposteriordistributionina ,KL(P0jjP1 ), betweenthedistributionofthedata,P0, andtheequilibriumdistributiondefinedbyth emodel,P1 . Incontrastive divergencelearning(Hinton,2002),weonlyru ntheMarkov equivalenttoignoringthederivatives3 Eachfullstepconsistsofupdatinghgivenvthe nupdatingvgivenh.> <00jihvijijijij t = infinityt = 0 t = 1 t = 2 t = infinity> < jihvFigure4:Thisdepictsa Markov , totheinputsreceivedfromthethecurrentstatesofthevisibleunitsinthebottomlayer, thenthevisibleunitsareallupdatedin initializedbysettingthebinarystatesofthevisibleunitstobethesameasa data-vector.

10 Thecorrelationsintheactivitiesofa visibleanda ofthelogprobabilityoftheposteriordistrib utioninlayerVn, whichis alsothederivative oftheKullback-Leiblerdivergencebe-tweent heposteriordistributionin layerVn,Pn , di-vergencelearningminimizesthedifferenc eoftwo Kullback-Leiblerdivergences:KL(P0jjP1 ) KL(Pn jjP1 )(6)Ignoringsamplingnoise,thisdifference is nevernegativebecauseGibbssamplingis usedtoproducePn is importanttono-ticethatPn dependsonthecurrentmodelparametersandthe wayinwhichPn changesastheparameterschangeisbeingignor edbycontrastive divergencelearningrulescanbefoundinCarre ira-PerpinanandHinton(2005).Contrastive divergencelearningina restrictedBoltzmannmachineis efficientenoughtobepractical(MayrazandHi n-ton,2001).


Related search queries