Transcription of Latent Dirichlet Allocation
1 JournalofMachineLearningResearch3 CA94720,USAA ndrewY. UniversityStanford, CA94305,USAM ichaelI. CA94720,USAE ditor:JohnLaffertyAbstractWe describelatentDirichletallocation(LDA),a generative is a three-level hierarchicalBayesianmodel,inwhicheachite mofa collectionis modeledasa finitemixtureover ,inturn, ,thetopicprobabilitiesprovideanexplicitr epresentationofa reportresultsindocumentmodeling,textclas sification,andcollaborative filtering,comparingtoa collectionthatenableefficientprocessingo flargecollectionswhilepreservingtheessen tialstatisticalrelationshipsthatareusefu lforbasictaskssuchasclassification,novel tydetection,summarization, (IR)(Baeza-YatesandRibeiro-Neto,1999).Th ebasicmethodologyproposedbyIRresearchers fortextcorpora amethodologysuccessfullydeployedinmodern Internetsearchengines reduceseachdocumentinthecorpustoa vectorofrealnumbers, (SaltonandMcGill,1983),a basicvocabularyof words or terms is chosen,and,foreachdocumentinthecorpus,a countis ,thistermfrequency countiscomparedtoaninversedocumentfreque ncy count,whichmeasuresthenumberofoccurrence sofac ,Andrew Y.
2 NgandMichaelI. , NG,ANDJORDAN wordintheentirecorpus(generallyona logscale,andagainsuitablynormalized).The endresultis a notablyin itsbasicidentificationofsetsofwordsthata rediscriminative fordocumentsinthecollection theapproachalsoprovidesa rela-tivelysmallamountofreductionindescr iptionlengthandrevealslittleinthewayofin ter- addresstheseshortcomings,IRresearchersha ve proposedseveralotherdimensionalityreduct iontechniques,mostnotablylatentsemantici ndexing(LSI)(Deerwesteret al.,1990).LSIusesa singularvaluedecompositionoftheXmatrixto identifya ,Deerwesteret ,whicharelinearcombinationsoftheoriginal tf-idffeatures, substantiatetheclaimsregardingLSI,andtos tudyitsrelative strengthsandweaknesses,it isusefultodevelopa generative probabilisticmodeloftextcorporaandtostud ytheabilityofLSItorecoveraspectsofthegen erative modelfromdata(Papadimitriouet al.,1998).Givena generativemodeloftext,however, it is notclearwhyoneshouldadopttheLSImethodolo gy onecanattempttoproceedmoredirectly, (1999),whopresentedtheprobabilisticLSI(p LSI)model,alsoknownastheaspectmodel, asanalternative , ,modelseachwordina documentasa samplefroma mixturemodel,wherethemixturecomponentsar emultinomialrandomvariablesthatcanbeview edasrepresentationsof topics.
3 Thuseachwordis generatedfroma singletopic,anddifferentwordsina listofmixingproportionsforthesemixtureco mponentsandtherebyreducedtoa probabilitydistributionona the reduceddescription s workis a usefulsteptowardprobabilisticmodelingoft ext,it is incompleteinthatit ,eachdocumentisrepresentedasa listofnumbers(themixingproportionsfortop ics), :(1)thenumberofparame-tersinthemodelgrow slinearlywiththesizeofthecorpus,whichlea dstoseriousproblemswithoverfitting,and(2 )it is notclearhow toassignprobabilitytoa seehow toproceedbeyondpLSI, bag-of-words assumption thattheorderofwordsina , thisis anassumptionofexchangeabilityforthewords ina document(Aldous,1985).Moreover, althoughlessoftenstatedformally, thesemethodsalsoassumethatdocumentsareex changeable;thespecificorderingofthedocum entsina (1990)establishesthatany collectionofex-changeablerandomvariables hasa representationasa mixturedistribution ,if wewishtoconsiderexchangeablerepresentati onsfordocumentsandwords, (LDA)
4 Is importanttoemphasizethatanassumptionofex changeabilityis , exchange-abilityessentiallycanbeinterpre tedasmeaning conditionallyindependentandidenticallydi s-tributed, wheretheconditioningis withrespecttoanunderlyinglatentparameter ofa , thejointdistributionoftherandomvariables issimpleandfactoredwhilemarginallyoverth elatentparameter, ,whileanassumptionofexchangeabilityis clearlya majorsimplifyingassumptioninthedomainoft extmodeling,anditsprincipaljustification is thatit leadsto methodsthatarecomputationallyefficient,t heexchangeabilityassumptionsdonotnecessa rilyleadtomethodsthatarerestrictedtosimp lefrequency aimtodemonstrateinthecurrentpaperthat,by takingthedeFinettitheoremseriously, is alsoworthnotingthattherearea largenumberofgeneralizationsofthebasicno tionofexchangeability, includingvariousformsofpartialexchangeab ility, andthatrepresentationtheo-remsareavailab leforthesecasesaswell(Diaconis,1988).Thu s,whiletheworkthatwediscussinthecurrentp aperfocusesonsimple bag-of-words models,whichleadtomixturedistributionsfo rsinglewords(unigrams),ourmethodsarealso applicabletorichermodelsthatinvolve ,textclassificationandcollaborative , Section8 usethelanguageoftextcollectionsthroughou tthepaper, referringtoentitiessuchas words, documents, and corpora.
5 Thisisusefulinthatit helpstoguideintuition, is importanttonote,however, thattheLDAmodelis notnecessarilytiedtotext,andhasapplicati onstootherproblemsinvolvingcollectionsof data,includingdatafromdomainssuchascolla borative filtering, , ,wepresentexperimentalresultsinthecollab orative , wedefinethefollowingterms: Awordis thebasicunitofdiscretedata,definedtobean itemfroma vocabularyindexedbyf1; : : : ;Vg. We representwordsusingunit-basisvectorsthat have a ,usingsuperscriptstodenotecomponents,the vthwordin thevocabularyis representedbyaV-vectorwsuchthatwv=1 andwu=0 foru6=v. Adocumentis a sequenceofNwordsdenotedbyw= (w1;w2; : : : ;wN), wherewnis thenthwordinthesequence. Acorpusis a collectionofMdocumentsdenotedbyD=fw1;w2; : : : ; , NG,ANDJORDANWe wishtofinda probabilisticmodelofa corpusthatnotonlyassignshighprobabilityt omembersofthecorpus,butalsoassignshighpr obabilitytoother similar (LDA)is a generative probabilisticmodelofa ,whereeachtopicis charac-terizedbya assumesthefollowinggenerative processforeachdocumentwina Poisson( ).
6 Dir( ). :(a)Choosea topiczn Multinomial( ).(b)Choosea wordwnfromp(wnjzn; ), a ,someofwhichweremove ,thedimensionalitykoftheDirichletdistrib ution(andthusthedimensionalityofthetopic variablez) is ,thewordprobabilitiesareparameter-izedby ak Vmatrix where i j=p(wj=1jzi=1), whichfornow wetreatasa fixedquantitythatis , thePoissonassumptionis ,notethatNisindependentofalltheotherdata generatingvariables( andz). It is cantake valuesinthe(k 1)-simplex (ak-vector liesinthe(k 1)-simplex if i 0, ki=1 i=1),andhasthefollowingprobabilitydensit yonthissimplex:p( j ) = ki=1 i ki=1 ( i) 1 11 k 1k;(1)wheretheparameter is ak-vectorwithcomponents i>0, andwhere (x)is a convenientdistributiononthesimplex it is intheexponentialfamily, hasfinitedimensionalsufficientstatistics ,andis conjugateto , and , thejointdistributionofa topicmixture , a setofNtopicsz, anda setofNwordswis givenby:p( ;z;wj ; ) =p( j )N n=1p(znj )p(wnjzn; );(2) refertothelatentmultinomialvariablesinth eLDA modelastopics,soastoexploittext-oriented intuitions,butwemake zw MNFigure1 plates ,whiletheinnerplaterepresentstherepeated choiceoftopicsandwordswithina (znj )is simply ifortheuniqueisuchthatzin= andsummingoverz, weobtainthemarginaldistributionofa document:p(wj ; ) =Zp( j ) N n=1 znp(znj )p(wnjzn; )!
7 D :(3)Finally, takingtheproductofthemarginalprobabiliti esofsingledocuments,weobtaintheproba-bil ityofa corpus:p(Dj ; ) =M d=1Zp( dj ) Nd n=1 zdnp(zdnj d)p(wdnjzdn; )!d d:TheLDAmodelis representedasa , therearethreelevelstotheLDA and arecorpus-level parameters,assumedtobesampledonceinthepr ocessofgeneratinga daredocument-levelvariables, , thevariableszdnandwdnareword-level isimportanttodistinguishLDAfroma a two-levelmodelinwhicha Dirichletis sampledoncefora corpus,a multinomialclusteringvariableis selectedonceforeachdocumentinthecorpus,a nda ,sucha modelrestrictsa documenttobeingassociatedwitha ,ontheotherhand,involvesthreelevels,andn otablythetopicnodeis , areoftenstudiedinBayesianstatisticalmode ling,wherethey arereferredtoashierarchicalmodels(Gelman et al.,1995),ormorepreciselyascon-ditionall yindependenthierarchicalmodels(KassandSt effey, 1989).Suchmodelsarealsooftenreferredtoas parametricempiricalBayesmodels, a termthatrefersnotonlytoa particularmodelstructure,butalsotothemet hodsusedforestimatingparametersinthemode l(Morris,1983).
8 In-deed,aswediscussinSection5,weadoptthe empiricalBayesapproachtoestimatingparame terssuchas and in simpleimplementationsofLDA,but , NG, andexchangeabilityAfinitesetofrandomvari ablesfz1; : : : ;zNgis saidtobeexchangeableif is a permutationoftheintegersfrom1 toN:p(z1; : : : ;zN) =p(z (1); : : : ;z (N)):Aninfinitesequenceofrandomvariables isinfinitelyexchangeableif s representationtheoremstatesthatthejointd istributionofaninfinitelyexchangeableseq uenceofrandomvariablesis asif a randomparameterweredrawnfromsomedistribu tionandthentherandomvariablesinquestionw ereindependentandidenticallydistributed, ,weassumethatwordsaregeneratedbytopics(b yfixedconditionaldistributions)andthatth osetopicsareinfinitelyexchangeablewithin a s theorem,theprob-abilityofa sequenceofwordsandtopicsmustthereforehav e theform:p(w;z) =Zp( ) N n=1p(znj )p(wnjzn)!d ;where istherandomparameterofa obtaintheLDAdistributionondocumentsinEq. (3)bymarginalizingoutthetopicvariablesan dendowing witha continuousmixture ofunigramsTheLDAmodelshowninFigure1 is , however, wecanunderstandLDA asa two-level , letusformtheworddistributionp(wj ; ):p(wj ; ) = zp(wjz; )p(zj ):Notethatthisis a randomquantitysinceit dependson.
9 998 LATENTDIRICHLETALLOCATION Figure2:Anexampledensityonunigramdistrib utionsp(wj ; )underLDA the2-Dsimplex deterministicdistributionthatassignsprob abilityonetooneofthewords; two ofthewords;andthecentroidofthetriangleis (wjz)foreachofthefourtopics,andthesurfac eshownontopofthesimplex is anexampleofa densityoverthe(V 1)-simplex (multinomialdistributionsofwords) now definethefollowinggenerative processfora Dir( ). :(a)Choosea wordwnfromp(wnj ; ).Thisprocessdefinesthemarginaldistribut ionofa documentasa continuousmixturedistribution:p(wj ; ) =Zp( j ) N n=1p(wnj ; )!d ;wherep(wnj ; )arethemixturecomponentsandp( j ) depictsthedistributiononp(wj ; )whichisinducedfroma (V 1)-simplex is attainedwithonlyk+kVparametersyetexhibit sa , NG,ANDJORDANwMN(a)unigramzwMN(b)mixtureo funigramszwMNd(c)pLSI/aspectmodelFigure3 theunigrammodel,amixtureofunigrams, ,wepresenta unifiedgeometricinterpreta-tionofthesemo delswhichhighlightstheirkey ,thewordsofeverydocumentaredrawnindepend entlyfroma singlemultinomialdistribution:p(w) =N n=1p(wn):Thisis ofunigramsIf weaugmenttheunigrammodelwitha discreterandomtopicvariablez(Figure3b),w eobtainamixture ofunigramsmodel(Nigamet al.)
10 ,2000).Underthismixturemodel,eachdocumen tis gen-eratedbyfirstchoosinga topiczandthengeneratingNwordsindependent lyfromtheconditionalmultinomialp(wjz). Theprobabilityofa documentis:p(w) = zp(z)N n=1p(wnjz):1000 LATENTDIRICHLETALLOCATIONW henestimatedfroma corpus, illustrate,thisassumptionis oftentoolimitingtoeffectivelymodela , achievedata costofjustoneadditionalparameter:therear ek 1 parametersassociatedwithp(z)inthemixture ofunigrams,versusthekparametersassociate dwithp( j ) (pLSI)is anotherwidelyuseddocumentmodel(Hofmann,1 999).ThepLSImodel,illustratedinFigure3c, positsthata documentlabeldanda wordwnareconditionallyindependentgivenan unobservedtopicz:p(d;wn) =p(d) zp(wnjz)p(zjd):ThepLSImodelattemptstorel axthesimplifyingassumptionmadeinthemixtu reofunigramsmodelthateachdocumentis sense,it doescapturethepossibilitythata documentmaycontainmultipletopicssincep(z jd)servesasthemixtureweightsofthetopicsf ora particulardocumentd. However, it is importanttonotethatdis a dummyindex intothelistofdocumentsinthetrainingset.