Transcription of Latent Dirichlet Allocation
1 JournalofMachineLearningResearch3 CA94720,USAA ndrewY. UniversityStanford, CA94305,USAM ichaelI. CA94720,USAE ditor:JohnLaffertyAbstractWe describelatentDirichletallocation(LDA),a generative is a three-level hierarchicalBayesianmodel,inwhicheachite mofa collectionis modeledasa finitemixtureover ,inturn, ,thetopicprobabilitiesprovideanexplicitr epresentationofa reportresultsindocumentmodeling,textclas sification,andcollaborative filtering,comparingtoa collectionthatenableefficientprocessingo flargecollectionswhilepreservingtheessen tialstatisticalrelationshipsthatareusefu lforbasictaskssuchasclassification,novel tydetection,summarization, (IR)(Baeza-YatesandRibeiro-Neto,1999).
2 ThebasicmethodologyproposedbyIRresearche rsfortextcorpora amethodologysuccessfullydeployedinmodern Internetsearchengines reduceseachdocumentinthecorpustoa vectorofrealnumbers, (SaltonandMcGill,1983),a basicvocabularyof words or terms is chosen,and,foreachdocumentinthecorpus,a countis ,thistermfrequency countiscomparedtoaninversedocumentfreque ncy count,whichmeasuresthenumberofoccurrence sofac ,Andrew Y. NgandMichaelI. , NG,ANDJORDAN wordintheentirecorpus(generallyona logscale,andagainsuitablynormalized).The endresultis a notablyin itsbasicidentificationofsetsofwordsthata rediscriminative fordocumentsinthecollection theapproachalsoprovidesa rela-tivelysmallamountofreductionindescr iptionlengthandrevealslittleinthewayofin ter- addresstheseshortcomings,IRresearchersha ve proposedseveralotherdimensionalityreduct iontechniques,mostnotablylatentsemantici ndexing(LSI)(Deerwesteret al.)
3 ,1990).LSIusesa singularvaluedecompositionoftheXmatrixto identifya ,Deerwesteret ,whicharelinearcombinationsoftheoriginal tf-idffeatures, substantiatetheclaimsregardingLSI,andtos tudyitsrelative strengthsandweaknesses,it isusefultodevelopa generative probabilisticmodeloftextcorporaandtostud ytheabilityofLSItorecoveraspectsofthegen erative modelfromdata(Papadimitriouet al.,1998).Givena generativemodeloftext,however, it is notclearwhyoneshouldadopttheLSImethodolo gy onecanattempttoproceedmoredirectly, (1999),whopresentedtheprobabilisticLSI(p LSI) model ,alsoknownastheaspectmodel, asanalternative , ,modelseachwordina documentasa samplefroma mixturemodel,wherethemixturecomponentsar emultinomialrandomvariablesthatcanbeview edasrepresentationsof topics.
4 Thuseachwordis generatedfroma singletopic,anddifferentwordsina listofmixingproportionsforthesemixtureco mponentsandtherebyreducedtoa probabilitydistributionona the reduceddescription s workis a usefulsteptowardprobabilisticmodelingoft ext,it is incompleteinthatit ,eachdocumentisrepresentedasa listofnumbers(themixingproportionsfortop ics), :(1)thenumberofparame-tersinthemodelgrow slinearlywiththesizeofthecorpus,whichlea dstoseriousproblemswithoverfitting,and(2 )it is notclearhow toassignprobabilitytoa seehow toproceedbeyondpLSI, bag-of-words assumption thattheorderofwordsina , thisis anassumptionofexchangeabilityforthewords ina document(Aldous,1985).
5 Moreover, althoughlessoftenstatedformally, thesemethodsalsoassumethatdocumentsareex changeable;thespecificorderingofthedocum entsina (1990)establishesthatany collectionofex-changeablerandomvariables hasa representationasa mixturedistribution ,if wewishtoconsiderexchangeablerepresentati onsfordocumentsandwords, (LDA) is importanttoemphasizethatanassumptionofex changeabilityis , exchange-abilityessentiallycanbeinterpre tedasmeaning conditionallyindependentandidenticallydi s-tributed, wheretheconditioningis withrespecttoanunderlyinglatentparameter ofa , thejointdistributionoftherandomvariables issimpleandfactoredwhilemarginallyoverth elatentparameter, ,whileanassumptionofexchangeabilityis clearlya majorsimplifyingassumptioninthedomainoft extmodeling.
6 Anditsprincipaljustificationis thatit leadsto methodsthatarecomputationallyefficient,t heexchangeabilityassumptionsdonotnecessa rilyleadtomethodsthatarerestrictedtosimp lefrequency aimtodemonstrateinthecurrentpaperthat,by takingthedeFinettitheoremseriously, is alsoworthnotingthattherearea largenumberofgeneralizationsofthebasicno tionofexchangeability, includingvariousformsofpartialexchangeab ility, andthatrepresentationtheo-remsareavailab leforthesecasesaswell(Diaconis,1988).Thu s,whiletheworkthatwediscussinthecurrentp aperfocusesonsimple bag-of-words models,whichleadtomixturedistributionsfo rsinglewords(unigrams),ourmethodsarealso applicabletorichermodelsthatinvolve ,textclassificationandcollaborative , Section8 usethelanguageoftextcollectionsthroughou tthepaper, referringtoentitiessuchas words, documents, and corpora.
7 Thisisusefulinthatit helpstoguideintuition, is importanttonote,however, thattheLDAmodelis notnecessarilytiedtotext,andhasapplicati onstootherproblemsinvolvingcollectionsof data,includingdatafromdomainssuchascolla borative filtering, , ,wepresentexperimentalresultsinthecollab orative , wedefinethefollowingterms: Awordis thebasicunitofdiscretedata,definedtobean itemfroma vocabularyindexedbyf1; : : : ;Vg. We representwordsusingunit-basisvectorsthat have a ,usingsuperscriptstodenotecomponents,the vthwordin thevocabularyis representedbyaV-vectorwsuchthatwv=1 andwu=0 foru6=v.
8 Adocumentis a sequenceofNwordsdenotedbyw= (w1;w2; : : : ;wN), wherewnis thenthwordinthesequence. Acorpusis a collectionofMdocumentsdenotedbyD=fw1;w2; : : : ; , NG,ANDJORDANWe wishtofinda probabilisticmodelofa corpusthatnotonlyassignshighprobabilityt omembersofthecorpus,butalsoassignshighpr obabilitytoother similar (LDA)is a generative probabilisticmodelofa ,whereeachtopicis charac-terizedbya assumesthefollowinggenerative processforeachdocumentwina Poisson( ). Dir( ). :(a)Choosea topiczn Multinomial( ).
9 (b)Choosea wordwnfromp(wnjzn; ), a ,someofwhichweremove ,thedimensionalitykoftheDirichletdistrib ution(andthusthedimensionalityofthetopic variablez) is ,thewordprobabilitiesareparameter-izedby ak Vmatrix where i j=p(wj=1jzi=1), whichfornow wetreatasa fixedquantitythatis , thePoissonassumptionis ,notethatNisindependentofalltheotherdata generatingvariables( andz). It is cantake valuesinthe(k 1)-simplex (ak-vector liesinthe(k 1)-simplex if i 0, ki=1 i=1),andhasthefollowingprobabilitydensit yonthissimplex:p( j ) = ki=1 i ki=1 ( i) 1 11 k 1k;(1)wheretheparameter is ak-vectorwithcomponents i>0, andwhere (x)is a convenientdistributiononthesimplex it is intheexponentialfamily, hasfinitedimensionalsufficientstatistics ,andis conjugateto , and , thejointdistributionofa topicmixture , a setofNtopicsz, anda setofNwordswis givenby:p( ;z;wj.)
10 =p( j )N n=1p(znj )p(wnjzn; );(2) refertothelatentmultinomialvariablesinth eLDA modelastopics,soastoexploittext-oriented intuitions,butwemake zw MNFigure1 plates ,whiletheinnerplaterepresentstherepeated choiceoftopicsandwordswithina (znj )is simply ifortheuniqueisuchthatzin= andsummingoverz, weobtainthemarginaldistributionofa document:p(wj ; ) =Zp( j ) N n=1 znp(znj )p(wnjzn; )!d :(3)Finally, takingtheproductofthemarginalprobabiliti esofsingledocuments,weobtaintheproba-bil ityofa corpus:p(Dj ; ) =M d=1Zp( dj ) Nd n=1 zdnp(zdnj d)p(wdnjzdn; )!