Decision Trees - stats.org.uk


Transcription of Decision Trees - stats.org.uk

1

Bayesian Learning in Probabilistic Decision Trees

Michael I. Jordan, MIT
Collaborators: Robert Jacobs (Rochester), Lei Xu (Hong Kong), Geoffrey Hinton (Toronto), Steven Nowlan (Synaptics), Marina Meila (MIT), Lawrence Saul (MIT)

Outline
- decision trees
- probabilistic decision trees
- the EM algorithm and extensions
- model selection
- Bayesian computations
- empirical results: system identification, classification
- theoretical results: training set error, test set error

Some problems with multi-layered neural networks
- the learning algorithms are slow
- hard to understand the network
- hard to build in prior knowledge
- poor performance on non-stationary data
- not natural for some functions

Supervised learning (aka regression, classification)
We assume that the learner is provided with a training set
$X = \{(x^{(t)}, y^{(t)})\}_{t=1}^{T}$,
where $x$ is an input vector and $y$ is an output vector. We will gauge performance on a test set
$X_s = \{(x^{(t)}, y^{(t)})\}_{t=1}^{T_s}$.

Decision trees
[diagram: a binary tree whose internal nodes test splits of the form $x_d < \theta$]
- drop the data set down the tree
- at each node, try to find a split of the input space (a half-plane) that yields the largest gain in "purity" on the left and right
- build a large tree and prune backward to create a nested sequence of trees
- pick the best tree from the sequence using cross-validation

Regression trees
[diagram: a tree with splits $x_d < \theta$ and linear models $y = \beta_i^T x$, $i = 1, \ldots, 4$, at the four leaves]
- splitting is based on RSS (a sketch of the split search appears below)

Some advantages
- often much faster than neural networks
- often more interpretable
- allows operating points to be utilized
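
A minimal sketch of the split search just described, assuming a regression tree where each leaf predicts the mean of its targets and "purity" is measured by the residual sum of squares (RSS); the function names and the brute-force scan are illustrative, not from the talk.

```python
import numpy as np

def rss(y):
    # RSS of a leaf that predicts the mean of its targets.
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    # Scan every coordinate d and threshold t; return the half-plane
    # split x_d < t that yields the largest reduction in RSS.
    best_dim, best_thresh, best_rss = None, None, rss(y)
    for d in range(X.shape[1]):
        for t in np.unique(X[:, d]):
            left = X[:, d] < t
            total = rss(y[left]) + rss(y[~left])
            if total < best_rss:
                best_dim, best_thresh, best_rss = d, t, total
    return best_dim, best_thresh
```

Growing the tree then amounts to applying this search recursively to the two half-planes, followed by the backward pruning and cross-validation steps described above.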

2

Some disadvantages
- non-smooth regression surface
- coordinate-dependent
- batch methods

Probabilistic Decision Trees
Hierarchical mixtures of experts (HME) (Jordan & Jacobs)

Why probabilities?
- smoother regression surface
- error bars from likelihood/Bayesian theory (e.g., the SEM algorithm)
- convergence results from likelihood/Bayesian theory
- can handle categorical variables and missing data in principled ways
- better performance (e.g., the leverage issue)

Probabilistic Decision Trees
- drop inputs down the tree and use probabilistic models for the decisions
- at the leaves of the tree, use probabilistic models to generate outputs from inputs
- use a Bayes' rule recursion to compute posterior credit for the nonterminals in the tree
The basic idea is to convert the decision tree into a mixture model (see the generative sketch below).

[diagram: a three-level tree with decision variables $\omega_i$, $\omega_{ij}$, $\omega_{ijk}$ at the nonterminals and outputs $y$ at the leaves]

Model the decisions in the decision tree using categorical probability models:
- let $\omega_i$, $\omega_{ij}$, $\omega_{ijk}$ represent multinomial decision variables at the nonterminals
- these variables will be treated as "missing data" (cf. the states of an HMM)
- each path down the tree defines a component of a mixture

Decision models at the nonterminals:
$P(\omega_i \mid x, \theta)$, $P(\omega_{ij} \mid x, \omega_i, \theta_i)$, $P(\omega_{ijk} \mid x, \omega_i, \omega_{ij}, \theta_{ij})$

Output models at the leaves:
$P(y \mid x, \omega_i, \omega_{ij}, \omega_{ijk}, \theta_{ijk})$
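
To make the "decision tree as mixture model" idea concrete, here is a minimal generative sketch for a one-level tree, assuming a softmax (multinomial logit) gate and Gaussian linear leaves; all names and the choice of leaf model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(eta):
    e = np.exp(eta - eta.max())
    return e / e.sum()

def sample_y(x, theta_gate, betas, sigma=0.1):
    """theta_gate: (n_branches, d) gate weights; betas: (n_branches, d)
    leaf regression weights. Draw one y given x."""
    g = softmax(theta_gate @ x)             # P(omega_i | x): categorical decision model
    i = rng.choice(len(g), p=g)             # sample a multinomial decision (a path)
    return rng.normal(betas[i] @ x, sigma)  # leaf model P(y | x, omega_i)
```

Each branch index $i$ plays the role of a mixture component; marginalizing over the sampled decision recovers the mixture density developed on the next page.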

3

The total probability of an output $y$ given an input $x$ is given by the sum across all paths from the root to the leaves:
$P(y \mid x, \theta) = \sum_i P(\omega_i \mid x, \theta) \sum_j P(\omega_{ij} \mid x, \omega_i, \theta_i) \sum_k P(\omega_{ijk} \mid x, \omega_i, \omega_{ij}, \theta_{ij}) \, P(y \mid x, \omega_i, \omega_{ij}, \omega_{ijk}, \theta_{ijk})$
This is a conditional mixture model.

Moments of this mixture distribution are readily computed by tree traversal processes. Define
$\mu = E[y \mid x]$, $\mu_i = E[y \mid x, \omega_i]$, $\mu_{ij} = E[y \mid x, \omega_i, \omega_{ij}]$, $\mu_{ijk} = E[y \mid x, \omega_i, \omega_{ij}, \omega_{ijk}]$,
and define
$g_i = P(\omega_i \mid x)$, $g_{j|i} = P(\omega_{ij} \mid x, \omega_i)$, $g_{k|ij} = P(\omega_{ijk} \mid x, \omega_i, \omega_{ij})$
(omitting the parameters for simplicity). Then
$\mu = \sum_i g_i \mu_i$, $\quad \mu_i = \sum_j g_{j|i} \mu_{ij}$, $\quad \mu_{ij} = \sum_k g_{k|ij} \mu_{ijk}$, $\quad \mu_{ijk} = f(\beta_{ijk}^T x)$.

[diagram: the tree annotated with the gating probabilities $g_i$, $g_{j|i}$, $g_{k|ij}$ along its branches]

Component Models
Decision models: $P(\omega_i \mid x, \theta)$ is a classification model
- any parametric classification model is appropriate
- we use a multinomial logit model
- this yields "soft" linear discriminants
- a soft version of a CART/C4.5 tree

Leaf models: we use simple generalized linear models
- regression: linear regression
- binary classification: logistic regression
- multiway classification: multinomial logit model
- can also handle count estimates, failure estimates, etc.

Multinomial logit model
The deterministic component:
$g_i = \frac{e^{\eta_i}}{\sum_j e^{\eta_j}}$, where $\eta_i = \theta_i^T x$
These are soft linear discriminants: the directions of the $\theta_i$ determine the orientations of the discriminant surfaces (i.e., the splits), and the magnitudes of the $\theta_i$ determine the sharpness of the splits.
The probabilistic component:
$P(y \mid x, \theta) = g_1^{y_1} g_2^{y_2} \cdots g_n^{y_n}$, where $y_i \in \{0, 1\}$ and $\sum_i y_i = 1$
(A sketch of the moment recursion with multinomial logit gates appears below.)
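
A small sketch of computing the mixture mean $\mu = E[y \mid x]$ by the tree-traversal recursion above, for a two-level tree with multinomial logit gates and linear leaves ($f$ taken as the identity); the array shapes and names are assumptions chosen for illustration.

```python
import numpy as np

def softmax(eta):
    e = np.exp(eta - eta.max())
    return e / e.sum()

def hme_mean(x, theta_top, theta_low, betas):
    """theta_top: (n, d); theta_low: (n, m, d); betas: (n, m, d).
    Implements mu = sum_i g_i sum_j g_{j|i} (beta_ij^T x)."""
    g = softmax(theta_top @ x)                 # g_i
    mu = 0.0
    for i in range(len(g)):
        g_cond = softmax(theta_low[i] @ x)     # g_{j|i}
        mu_i = g_cond @ (betas[i] @ x)         # mu_i = sum_j g_{j|i} mu_ij
        mu += g[i] * mu_i                      # mu = sum_i g_i mu_i
    return mu
```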

4

The log likelihood is
$l(\theta; X) = \sum_p \sum_i y_i^{(p)} \log g_i^{(p)}$,
which is the cross-entropy function. The gradient is
$\frac{\partial l}{\partial \theta_i} = \sum_p (y_i^{(p)} - g_i^{(p)})\, x^{(p)}$.
Computing the Hessian and substituting into the Newton-Raphson formula yields a simple, quadratically convergent iterative algorithm known as IRLS (Iteratively Reweighted Least Squares); a sketch appears at the end of this page.

The Log Likelihood (of the HME)
$l(\theta; X) = \sum_p \log \sum_i g_i^{(p)} \sum_j g_{j|i}^{(p)} \sum_k g_{k|ij}^{(p)} P_{ijk}(y^{(p)} \mid x^{(p)})$
Problem: the log is outside of the sums. How can we optimize such a risk function efficiently?
Solution: EM.

The EM (Expectation-Maximization) Algorithm (Baum et al.; Dempster, Laird & Rubin)
Special cases:
- mixture likelihood clustering (soft K-means)
- many missing data algorithms
- the Baum-Welch algorithm for HMMs
Applications to supervised learning: regression, classification.

EM: Tutorial
- Suppose that the problem of maximizing a likelihood would be simplified if the values of some additional variables, called "missing variables," were known.
- These values are not known, but given the current values of the parameters, they can be estimated (the E step).
- Treat the estimated values as provisionally correct and maximize the likelihood in the usual way (the M step).
- We now have better parameter values, so the E step can be repeated. Iterate.

EM: Tutorial (cont.)
- missing data: $Z$
- complete data: $Y = \{X, Z\}$
- complete likelihood: $l_c(\theta; Y)$
The complete likelihood is a random variable, so average out the randomness.
E step: $Q(\theta, \theta^{(t)}) = E[\,l_c(\theta; Y) \mid X, \theta^{(t)}\,]$
This yields a fixed function $Q$, which can be optimized.
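
The IRLS algorithm mentioned above, sketched for the binary (logistic regression) special case; this is a generic textbook Newton-Raphson implementation, not code from the talk.

```python
import numpy as np

def irls_logistic(X, y, n_iter=20):
    """Fit theta for P(y=1|x) = sigmoid(theta^T x) by iteratively
    reweighted least squares (Newton-Raphson on the cross-entropy)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))  # current predicted probabilities
        w = p * (1.0 - p)                     # Newton weights (variance terms)
        grad = X.T @ (y - p)                  # gradient of the log likelihood
        H = X.T @ (X * w[:, None])            # negative Hessian
        theta += np.linalg.solve(H, grad)     # Newton-Raphson step
    return theta
```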

5

M step: $\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)})$

Applying EM to the HME architecture
The missing data are the unknown values of the decisions in the decision tree. Define indicator variables $z_i$, $z_{j|i}$, $z_{k|ij}$.

Complete likelihood:
$l_c(\theta; Y) = \sum_p \sum_i z_i^{(p)} \sum_j z_{j|i}^{(p)} \sum_k z_{k|ij}^{(p)} \log\!\left[ g_i^{(p)} g_{j|i}^{(p)} g_{k|ij}^{(p)} P_{ijk}(y^{(p)} \mid x^{(p)}) \right]$

Incomplete likelihood:
$l(\theta; X) = \sum_p \log \sum_i g_i^{(p)} \sum_j g_{j|i}^{(p)} \sum_k g_{k|ij}^{(p)} P_{ijk}(y^{(p)} \mid x^{(p)})$

We need to compute the expected values of the missing indicator variables. Note that, e.g.,
$E[z_i^{(p)} \mid x^{(p)}, y^{(p)}] = P(\omega_i^{(p)} \mid x^{(p)}, y^{(p)})$.

Example: a one-level tree, with linear regression and Gaussian errors at each leaf. For the $i$th leaf and the $t$th data point,
$h_i^{(t)} = \frac{g_i^{(t)} \, e^{-\frac{1}{2}\|y^{(t)} - \mu_i^{(t)}\|^2}}{\sum_j g_j^{(t)} \, e^{-\frac{1}{2}\|y^{(t)} - \mu_j^{(t)}\|^2}}$,
where $\mu_i^{(t)} = \beta_i^T x^{(t)}$. This posterior is a normalized distance measure that reflects the relative magnitudes of the residuals $y^{(t)} - \mu_i^{(t)}$.

Posterior probabilities:
$h_i = P(\omega_i \mid x, y)$, $h_{j|i} = P(\omega_{ij} \mid x, y, \omega_i)$, $h_{k|ij} = P(\omega_{ijk} \mid x, y, \omega_i, \omega_{ij})$
cf. the prior probabilities:
$g_i = P(\omega_i \mid x)$, $g_{j|i} = P(\omega_{ij} \mid x, \omega_i)$, $g_{k|ij} = P(\omega_{ijk} \mid x, \omega_i, \omega_{ij})$

Bayes' rule yields:
$h_i = \dfrac{g_i \sum_j g_{j|i} \sum_k g_{k|ij} P_{ijk}(y \mid x)}{\sum_i g_i \sum_j g_{j|i} \sum_k g_{k|ij} P_{ijk}(y \mid x)}$
$h_{j|i} = \dfrac{g_{j|i} \sum_k g_{k|ij} P_{ijk}(y \mid x)}{\sum_j g_{j|i} \sum_k g_{k|ij} P_{ijk}(y \mid x)}$
$h_{k|ij} = \dfrac{g_{k|ij} P_{ijk}(y \mid x)}{\sum_k g_{k|ij} P_{ijk}(y \mid x)}$

Posterior propagation
[diagram: the tree annotated with the posteriors $h_i$, $h_{j|i}$, $h_{k|ij}$]

The E step: compute the posterior probabilities (an "up-down" algorithm); a sketch for the one-level example appears below.
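
A minimal sketch of the E step posterior $h_i$ for the one-level Gaussian example above, computed in log space for numerical stability; the names and the scalar-output restriction are illustrative assumptions.

```python
import numpy as np

def posteriors(x, y, theta_gate, betas, sigma=1.0):
    """h_i proportional to g_i * exp(-||y - beta_i^T x||^2 / (2 sigma^2)),
    for scalar targets y. Returns the vector of h_i over the leaves."""
    eta = theta_gate @ x
    g = np.exp(eta - eta.max())
    g /= g.sum()                              # priors g_i (softmax gate)
    resid = y - betas @ x                     # residual y - mu_i at each leaf
    log_h = np.log(g) - 0.5 * (resid / sigma) ** 2
    h = np.exp(log_h - log_h.max())
    return h / h.sum()                        # normalized posteriors h_i
```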

6

The M step:
- the $Q$ function decouples into a set of separate maximum likelihood problems
- at the nonterminals: fit multinomial logit models, with the posteriors $h_i^{(t)}$, $h_{j|i}^{(t)}$, etc., serving as the targets
- at the leaves: obtain weighted likelihoods, where the weights are the products of the posteriors from root to leaf

The M step, in more detail
The maximization of $Q(\theta, \theta^{(t)})$ decouples into a set of weighted MLE problems:
$\theta_i^{(t+1)} = \arg\max_{\theta_i} \sum_p \sum_i h_i^{(p)} \log g_i^{(p)}$ (a cross-entropy cost)
$\theta_{ij}^{(t+1)} = \arg\max_{\theta_{ij}} \sum_p \sum_i h_i^{(p)} \sum_j h_{j|i}^{(p)} \log g_{j|i}^{(p)}$ (a weighted cross-entropy cost)
$\theta_{ijk}^{(t+1)} = \arg\max_{\theta_{ijk}} \sum_p \sum_i h_i^{(p)} \sum_j h_{j|i}^{(p)} \log P_{ijk}(y^{(p)} \mid x^{(p)})$ (a general weighted log likelihood)
Each of these is a weighted ML problem for a generalized linear model (GLIM). They can be solved efficiently using iteratively reweighted least squares (IRLS).

HME Parameter Estimation
[diagram: the tree with gating probabilities $g$ on the way down and posteriors $h$ on the way up]
- drop the data set down the tree
- for each data point, compute the posterior probabilities for every branch of the tree
- at each nonterminal, use the posterior probabilities as (soft) classification targets
- at each leaf, fit a local model, where each data point is weighted by the product of the posterior probabilities from the root to that leaf
(A sketch of one full EM iteration for a one-level tree appears after the model selection notes below.)

Model selection
How do we choose the structure of the tree?
- initialize with CART or C4.5 (cf. K-means); this can preserve local variable selection
- ridge regression
- cross-validation: stopping within a fixed deep hierarchy; EM iterations "grow" the effective degrees of freedom
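
The sketch promised above: one full EM iteration for a one-level HME with linear leaves, assuming scalar targets. The E step computes the posteriors $h$; the M step solves the decoupled weighted problems, using weighted least squares at each leaf and a gradient step on the weighted cross-entropy for the gate (a simple stand-in for the IRLS fit described above).

```python
import numpy as np

def em_step(X, Y, theta_gate, betas, sigma=1.0, lr=0.1):
    """X: (P, d) inputs; Y: (P,) targets; theta_gate, betas: (n, d)."""
    # E step: posteriors h[p, i] for every data point p and leaf i.
    eta = X @ theta_gate.T
    g = np.exp(eta - eta.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)                 # priors g_i
    resid = Y[:, None] - X @ betas.T                  # residuals per leaf
    log_h = np.log(g) - 0.5 * (resid / sigma) ** 2
    h = np.exp(log_h - log_h.max(axis=1, keepdims=True))
    h /= h.sum(axis=1, keepdims=True)                 # posteriors h_i

    # M step, leaves: weighted least squares with weights h[:, i].
    new_betas = np.array([
        np.linalg.lstsq(X * np.sqrt(h[:, [i]]),
                        Y * np.sqrt(h[:, i]), rcond=None)[0]
        for i in range(betas.shape[0])])

    # M step, gate: gradient ascent on the weighted cross-entropy,
    # with the posteriors h serving as soft classification targets.
    new_gate = theta_gate + lr * (h - g).T @ X
    return new_gate, new_betas
```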

7

Bayesian issues
- Dirichlet priors
- Gibbs sampling is straightforward
- Gaussian approximation of the posterior via an SEM calculation of the Hessian
- mean-field approximation of the posterior

Regression: A System Identification Problem
- forward dynamics of a four-joint, three-dimensional arm
- twelve input variables, four output variables
- [?] points in the training set, [?] points in the test set
- a four-level tree with binary branches
- compare to backpropagation in an MLP with [?] hidden units
- compare to CART and MARS

Batch algorithms
[figure: relative error vs. training epochs for the batch algorithms (Algorithm 2)]

Summary: batch algorithms
[table: Architecture | Relative Error | Epochs, with rows for linear (NA epochs), backprop, HME (Algorithm 1), HME (Algorithm 2), CART (NA), CART-linear (NA), and MARS (NA); the numeric entries were not preserved in the transcription]

An On-Line Variant of HME
Use techniques from recursive estimation theory (Ljung & Söderström) to obtain the following on-line algorithm.

8

Expert networks:
$U_{ij}^{(t+1)} = U_{ij}^{(t)} + h_i^{(t)} h_{j|i}^{(t)} \,(y^{(t)} - \mu_{ij}^{(t)})\, x^{(t)T} R_{ij}^{(t)}$
where $R_{ij}$ is updated as follows:
$R_{ij}^{(t)} = \frac{1}{\lambda}\left[ R_{ij}^{(t-1)} - \frac{R_{ij}^{(t-1)} x^{(t)} x^{(t)T} R_{ij}^{(t-1)}}{\lambda / h_{ij}^{(t)} + x^{(t)T} R_{ij}^{(t-1)} x^{(t)}} \right]$
and $\lambda$ is a decay parameter. (A sketch of this update appears at the end of this page.)

Top-level gating networks:
$v_i^{(t+1)} = v_i^{(t)} + S_i^{(t)} (\ln h_i^{(t)} - \eta_i^{(t)})\, x^{(t)}$
$S_i^{(t)} = \frac{1}{\lambda}\left[ S_i^{(t-1)} - \frac{S_i^{(t-1)} x^{(t)} x^{(t)T} S_i^{(t-1)}}{\lambda + x^{(t)T} S_i^{(t-1)} x^{(t)}} \right]$

Lower-level gating networks:
$v_{ij}^{(t+1)} = v_{ij}^{(t)} + S_{ij}^{(t)} h_i^{(t)} (\ln h_{j|i}^{(t)} - \eta_{ij}^{(t)})\, x^{(t)}$
$S_{ij}^{(t)} = \frac{1}{\lambda}\left[ S_{ij}^{(t-1)} - \frac{S_{ij}^{(t-1)} x^{(t)} x^{(t)T} S_{ij}^{(t-1)}}{\lambda / h_i^{(t)} + x^{(t)T} S_{ij}^{(t-1)} x^{(t)}} \right]$

Classification Task
[table: error rates for Baseline, CART, HME, and Bayes on the Heart, Pima, and Orbitals data sets; the numeric entries were not preserved in the transcription]
Error rates are computed using [?]-fold cross-validation.

Convergence results (Jordan & Xu)
Theorem. Assume that the training set $X$ is generated by the mixture model (the realizable case). Denote $P = \mathrm{diag}\{P_1, \ldots, P_K\}$, where the $P_i$ are covariance matrices of the component models, and let $H = \partial^2 l / \partial\theta\,\partial\theta^T$. Letting $m$ and $M$ be the minimum and maximum eigenvalues of the negative definite matrix $P^T H P$, we have, with probability one,
$l(\theta^*) - l(\theta^{(k)}) \le r^k \,[\,l(\theta^*) - l(\theta^{(0)})\,]$, $\quad \|\theta^{(k)} - \theta^*\| \le |r|^{k/2} \sqrt{\tfrac{2}{|m|}\,[\,l(\theta^*) - l(\theta^{(0)})\,]}$,
where $r = (M - m)/M$. We also have $|r| < 1$ when $M$ [condition not preserved]. For any initial point $\theta^{(0)} \in D$, $\lim_{k\to\infty} \theta^{(k)} = \theta^*$ when $M$ [condition not preserved].
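
Returning to the on-line algorithm at the top of this page: a minimal sketch of the expert-network update, assuming a single expert with posterior weight h for the current data point (for a deeper tree, h would be the product of posteriors along the path, e.g. $h_i h_{j|i}$). Names and shapes are illustrative.

```python
import numpy as np

def online_expert_update(U, R, x, y, h, lam=0.99):
    """U: (q, d) expert weights; R: (d, d) inverse-covariance estimate;
    x: (d,) input; y: (q,) target; h: posterior weight; lam: decay."""
    # Rank-one recursive least-squares update of R with forgetting factor
    # lam, the current point weighted by its posterior h.
    Rx = R @ x
    R = (R - np.outer(Rx, Rx) / (lam / h + x @ Rx)) / lam
    # Posterior-weighted correction of the expert parameters toward the residual.
    U = U + h * np.outer(y - U @ x, R @ x)
    return U, R
```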

9

Test Set Error (Saul & Jordan)
Hard split model:
$y(x) = \frac{1}{\sqrt{N}} (w_1 \cdot x)\,\Theta(v \cdot x) + \frac{1}{\sqrt{N}} (w_2 \cdot x)\,\Theta(-v \cdot x)$
Consider a structurally identical teacher with weight vectors $\tilde{w}_1$, $\tilde{w}_2$, $\tilde{v}$.

Order parameters:
[matrix: the order-parameter matrix $R$ collects the normalized overlaps (e.g., $v \cdot \tilde{v}/N$, $v \cdot \tilde{w}_a/N$, $w_a \cdot \tilde{w}_b/N$) among the student weight vectors $v, w_1, w_2$ and the teacher weight vectors $\tilde{v}, \tilde{w}_1, \tilde{w}_2$; the entries were not preserved in the transcription]

Loss:
$\varepsilon(v, w_1, w_2; x) = \frac{1}{2}\left[ \tilde{y}(x) - \frac{1}{\sqrt{N}} (w_1 \cdot x)\,\Theta(v \cdot x) - \frac{1}{\sqrt{N}} (w_2 \cdot x)\,\Theta(-v \cdot x) \right]^2$
where $\tilde{y}(x)$ is the teacher's output.

Empirical risk (training energy):
$E = \sum_p \varepsilon(v, w_1, w_2; x^{(p)})$

