
Latent Dirichlet Allocation


Journal of Machine Learning Research 3 (2003) 993-1022 Submitted 2/02; Published 1/03 Latent Dirichlet Allocation David M. Blei [email protected] Computer Science Division University of California Berkeley, CA 94720, USA Andrew Y. Ng ANG@CS.STANFORD.EDU Computer Science Department Stanford University Stanford, CA 94305, USA Michael I. …


Transcription of Latent Dirichlet Allocation

Editor: John Lafferty

Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

The goal of modeling collections of discrete data such as text corpora is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments. Significant progress has been made on this problem by researchers in the field of information retrieval (IR) (Baeza-Yates and Ribeiro-Neto, 1999).

The basic methodology proposed by IR researchers for text corpora, a methodology successfully deployed in modern Internet search engines, reduces each document in the corpus to a vector of real numbers, each of which represents ratios of counts. In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of "words" or "terms" is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word. After suitable normalization, this term frequency count is compared to an inverse document frequency count, which measures the number of occurrences of a word in the entire corpus (generally on a log scale, and again suitably normalized). The end result is a term-by-document matrix X whose columns contain the tf-idf values for each of the documents in the corpus. Thus the tf-idf scheme reduces documents of arbitrary length to fixed-length lists of numbers. While the tf-idf reduction has some appealing features, notably in its basic identification of sets of words that are discriminative for documents in the collection, the approach also provides a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure. To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990).
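The tf-idf scheme described above can be sketched in a few lines. This is a minimal illustration, not the exact normalization used in any particular IR system; the toy corpus, the length-normalized term frequency, and the log(M/df) inverse document frequency are all simplifying assumptions.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Compute a tf-idf vector for each document in a toy corpus.

    corpus: list of documents, each a list of word tokens.
    Returns one {word: tf-idf weight} dict per document.
    tf = raw count / document length; idf = log(M / df), where df
    counts the documents containing the word.
    """
    M = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))          # each document counted once per word
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        n = len(doc)
        vectors.append({w: (c / n) * math.log(M / df[w])
                        for w, c in counts.items()})
    return vectors

docs = [["stock", "market", "crash"],
        ["stock", "prices", "rise"],
        ["topic", "model", "text"]]
vecs = tfidf(docs)
# "stock" appears in 2 of 3 documents, so its idf is log(3/2);
# "crash" appears in only 1, so it receives a larger weight.
```

Words that occur in many documents are down-weighted, which is exactly the discriminative behavior the passage credits to tf-idf.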

LSI uses a singular value decomposition of the X matrix to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection. This approach can achieve significant compression in large collections. Furthermore, Deerwester et al. argue that the derived features of LSI, which are linear combinations of the original tf-idf features, can capture some aspects of basic linguistic notions such as synonymy and polysemy. To substantiate the claims regarding LSI, and to study its relative strengths and weaknesses, it is useful to develop a generative probabilistic model of text corpora and to study the ability of LSI to recover aspects of the generative model from data (Papadimitriou et al., 1998). Given a generative model of text, however, it is not clear why one should adopt the LSI methodology; one can attempt to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods. A significant step forward in this regard was made by Hofmann (1999), who presented the probabilistic LSI (pLSI) model, also known as the aspect model, as an alternative to LSI. The pLSI approach models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics."
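The LSI reduction described here is a truncated singular value decomposition of the term-by-document matrix. A minimal sketch, with an assumed toy matrix X standing in for real tf-idf values:

```python
import numpy as np

# Toy term-by-document matrix X (rows = terms, columns = documents),
# an assumed stand-in for the tf-idf matrix described above.
X = np.array([[1.0, 0.8, 0.0, 0.0],
              [0.9, 1.0, 0.1, 0.0],
              [0.0, 0.1, 1.0, 0.9],
              [0.0, 0.0, 0.8, 1.0]])

# LSI: keep only the r largest singular values/vectors of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = 2                                        # latent dimensions to keep
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]  # rank-r approximation of X

# Each document is now described by r derived features (the columns
# below), which are linear combinations of the original tf-idf features.
doc_features = np.diag(s[:r]) @ Vt[:r, :]
```

The rank-r matrix X_r is the best rank-r approximation to X in the least-squares sense, which is what "captures most of the variance" means here.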

Thus each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a list of mixing proportions for these mixture components and thereby reduced to a probability distribution on a fixed set of topics; this distribution is the "reduced description" associated with the document. While Hofmann's work is a useful step toward probabilistic modeling of text, it is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. This leads to several problems: (1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting, and (2) it is not clear how to assign probability to a document outside of the training set. To see how to proceed beyond pLSI, consider the fundamental probabilistic assumptions underlying this class of dimensionality reduction methods: they are all based on the "bag-of-words" assumption, that the order of words in a document can be neglected. In the language of probability theory, this is an assumption of exchangeability for the words in a document (Aldous, 1985). Moreover, although less often stated formally, these methods also assume that documents are exchangeable.

The specific ordering of the documents in a corpus can also be neglected. A classic representation theorem due to de Finetti (1990) establishes that any collection of exchangeable random variables has a representation as a mixture distribution, in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both. This line of thinking leads to the latent Dirichlet allocation (LDA) model. It is important to emphasize that an assumption of exchangeability is not equivalent to an assumption that the random variables are independent and identically distributed. Rather, exchangeability essentially can be interpreted as meaning "conditionally independent and identically distributed," where the conditioning is with respect to an underlying latent parameter of a probability distribution. Conditionally, the joint distribution of the random variables is simple and factored; marginally over the latent parameter, the joint distribution can be quite complex. Thus, while an assumption of exchangeability is clearly a major simplifying assumption in the domain of text modeling, and its principal justification is that it leads to methods that are computationally efficient, the exchangeability assumptions do not necessarily lead to methods that are restricted to simple frequency counts or linear operations. We aim to demonstrate in the current paper that, by taking the de Finetti theorem seriously, we can capture significant intra-document statistical structure via the mixing distribution.

It is also worth noting that there are a large number of generalizations of the basic notion of exchangeability, including various forms of partial exchangeability, and that representation theorems are available for these cases as well (Diaconis, 1988). Thus, while the work that we discuss in the current paper focuses on simple "bag-of-words" models, which lead to mixture distributions for single words (unigrams), our methods are also applicable to richer models that involve mixtures for larger structural units. We present empirical results in document modeling, text classification, and collaborative filtering, and we present our conclusions in Section 8. We use the language of text collections throughout the paper, referring to entities such as "words," "documents," and "corpora." This is useful in that it helps to guide intuition. It is important to note, however, that the LDA model is not necessarily tied to text, and has applications to other problems involving collections of data, including data from domains such as collaborative filtering; indeed, we present experimental results in the collaborative filtering domain. Formally, we define the following terms:

- A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, ..., V}.

We represent words using unit-basis vectors that have a single component equal to one and all other components equal to zero. Thus, using superscripts to denote components, the v-th word in the vocabulary is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v.

- A document is a sequence of N words denoted by w = (w_1, w_2, ..., w_N), where w_n is the n-th word in the sequence.
- A corpus is a collection of M documents denoted by D = {w_1, w_2, ..., w_M}.

We wish to find a probabilistic model of a corpus that not only assigns high probability to members of the corpus, but also assigns high probability to other "similar" documents.

Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for each document w in a corpus D:

1. Choose N ~ Poisson(ξ).
2. Choose θ ~ Dir(α).
3. For each of the N words w_n:
   (a) Choose a topic z_n ~ Multinomial(θ).
   (b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.

Several simplifying assumptions are made in this basic model, some of which we remove in subsequent sections. First, the dimensionality k of the Dirichlet distribution (and thus the dimensionality of the topic variable z) is assumed known and fixed. Second, the word probabilities are parameterized by a k × V matrix β where β_ij = p(w^j = 1 | z^i = 1), which for now we treat as a fixed quantity that is to be estimated. Finally, the Poisson assumption is not critical to anything that follows, and more realistic document length distributions can be used as needed. Furthermore, note that N is independent of all the other data generating variables (θ and z); we will generally ignore its randomness in the subsequent development.
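The generative process above can be sampled from directly. A minimal sketch with NumPy; the values of k, V, α, ξ, and the randomly drawn β below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

k, V = 3, 8                        # number of topics, vocabulary size (toy)
alpha = np.full(k, 0.5)            # Dirichlet parameter alpha (k-vector)
beta = rng.dirichlet(np.ones(V), size=k)   # k x V matrix,
                                           # beta[i, j] = p(w^j = 1 | z^i = 1)

def generate_document(xi=10):
    """Sample one document from the LDA generative process."""
    N = rng.poisson(xi)            # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)   # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)       # 3a. z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])     # 3b. w_n ~ p(w | z_n, beta)
        words.append(int(w))
    return words

doc = generate_document()
# doc is a list of word indices into the vocabulary {0, ..., V-1}
```

Note how each word draws its own topic z_n: this is the repeated within-document topic sampling that distinguishes LDA from one-topic-per-document clustering models discussed below.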

A k-dimensional Dirichlet random variable θ can take values in the (k−1)-simplex (a k-vector θ lies in the (k−1)-simplex if θ_i ≥ 0 and \sum_{i=1}^{k} θ_i = 1), and has the following probability density on this simplex:

p(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1},    (1)

where the parameter α is a k-vector with components α_i > 0, and where Γ(x) is the Gamma function. The Dirichlet is a convenient distribution on the simplex: it is in the exponential family, has finite dimensional sufficient statistics, and is conjugate to the multinomial distribution. In Section 5, these properties will facilitate the development of inference and parameter estimation algorithms for LDA. Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by:

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta),    (2)

where p(z_n | θ) is simply θ_i for the unique i such that z_n^i = 1. We refer to the latent multinomial variables in the LDA model as topics, so as to exploit text-oriented intuitions, but we make no epistemological claims regarding these latent variables.

[Figure 1: Graphical model representation of LDA. The boxes are "plates" representing replicates; the outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document.]

Integrating over θ and summing over z, we obtain the marginal distribution of a document:

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta.    (3)
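The density in equation (1) can be evaluated directly from the Gamma function. A minimal sketch, checked against the case α = (1, 1, 1), where the Dirichlet is uniform on the 2-simplex with constant density 2:

```python
import math
import numpy as np

def dirichlet_pdf(theta, alpha):
    """Density of Dir(alpha) at a point theta on the (k-1)-simplex,
    following equation (1):
    Gamma(sum alpha_i) / prod Gamma(alpha_i) * prod theta_i^(alpha_i - 1).
    """
    norm = math.gamma(sum(alpha)) / np.prod([math.gamma(a) for a in alpha])
    return norm * np.prod([t ** (a - 1) for t, a in zip(theta, alpha)])

# For alpha = (1, 1, 1), every exponent alpha_i - 1 is zero, and the
# normalizer is Gamma(3)/Gamma(1)^3 = 2: the uniform density on the
# 2-simplex (which has area 1/2 in these coordinates).
p = dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])   # p == 2.0
```

The same function evaluated at non-uniform α shows why the Dirichlet is a convenient conjugate prior: the density is itself a product of powers of the θ_i, matching the multinomial likelihood's functional form.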

Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d.

The LDA model is represented as a probabilistic graphical model in Figure 1. As the figure makes clear, there are three levels to the LDA representation. The parameters α and β are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variables θ_d are document-level variables, sampled once per document. Finally, the variables z_dn and w_dn are word-level variables, sampled once for each word in each document. It is important to distinguish LDA from a simple Dirichlet-multinomial clustering model. A classical clustering model would involve a two-level model in which a Dirichlet is sampled once for a corpus, a multinomial clustering variable is selected once for each document in the corpus, and a set of words is selected for the document conditional on the cluster variable. As with many clustering models, such a model restricts a document to being associated with a single topic. LDA, on the other hand, involves three levels, and notably the topic node is sampled repeatedly within the document; documents can therefore be associated with multiple topics. Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995).
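The integral in the marginal distribution of a document, equation (3), has no simple closed form in general, but for toy values it can be estimated by Monte Carlo: draw θ from Dir(α) and average the product over words of \sum_z θ_z β_{z, w_n}. The α, β, and document below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

k, V = 2, 3
alpha = np.array([1.0, 1.0])          # assumed Dirichlet parameter
beta = np.array([[0.7, 0.2, 0.1],     # assumed k x V topic-word matrix
                 [0.1, 0.2, 0.7]])
doc = [0, 0, 2]                       # word indices of a tiny document

def marginal_likelihood(doc, alpha, beta, samples=200_000):
    """Monte Carlo estimate of p(w | alpha, beta) from equation (3)."""
    thetas = rng.dirichlet(alpha, size=samples)   # samples x k
    # per-word mixture probability sum_z theta_z * beta[z, w_n]
    per_word = thetas @ beta[:, doc]              # samples x N
    return per_word.prod(axis=1).mean()

est = marginal_likelihood(doc, alpha, beta)
# For these values the integral works out to 0.052 exactly
# (a one-dimensional polynomial integral), so est should be close to it.
```

This makes concrete why exact inference in LDA is hard: the sum over z couples with the integral over θ, and the Monte Carlo average is only practical for toy dimensions.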

More precisely, such models are referred to as conditionally independent hierarchical models (Kass and Steffey, 1989). They are also often referred to as parametric empirical Bayes models, a term that refers not only to a particular model structure, but also to the methods used for estimating parameters in the model (Morris, 1983). Indeed, as we discuss in Section 5, we adopt the empirical Bayes approach to estimating parameters such as α and β in simple implementations of LDA, but we also consider fuller Bayesian approaches.

LDA and exchangeability. A finite set of random variables {z_1, ..., z_N} is said to be exchangeable if the joint distribution is invariant to permutation. If π is a permutation of the integers from 1 to N:

p(z_1, \ldots, z_N) = p(z_{\pi(1)}, \ldots, z_{\pi(N)}).

An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable. De Finetti's representation theorem states that the joint distribution of an infinitely exchangeable sequence of random variables is as if a random parameter were drawn from some distribution and then the random variables in question were independent and identically distributed, conditioned on that parameter. In LDA, we assume that words are generated by topics (by fixed conditional distributions) and that those topics are infinitely exchangeable within a document. By de Finetti's theorem, the probability of a sequence of words and topics must therefore have the form:

p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n) \right) d\theta.

