Latent Dirichlet Allocation

Journal of Machine Learning Research 3

[...] CA 94720, USA
Andrew Y. Ng, [...] University, Stanford, CA 94305, USA
Michael I. Jordan, [...] CA 94720, USA

Editor: John Lafferty

Abstract

We describe latent Dirichlet allocation (LDA), a generative [...] is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over [...] in turn, [...] the topic probabilities provide an explicit representation of a [...] report results in document modeling, text classification, and collaborative filtering, comparing to a [...]

[...] collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, [...] (IR) (Baeza-Yates and Ribeiro-Neto, 1999).

The LDA model is presented in Section 3 and is compared to related latent variable models in Section 4. We discuss inference and parameter estimation for LDA in Section 5. An illustrative example of fitting LDA to data is provided in Section 6. Empirical results in text modeling, text


The basic methodology proposed by IR researchers for text corpora (a methodology successfully deployed in modern Internet search engines) reduces each document in the corpus to a vector of real numbers, [...] (Salton and McGill, 1983), a basic vocabulary of "words" or "terms" is chosen, and, for each document in the corpus, a count is [...], this term frequency count is compared to an inverse document frequency count, which measures the number of occurrences of a word in the entire corpus (generally on a log scale, and again suitably normalized). The end result is a [...] notably in its basic identification of sets of words that are discriminative for documents in the collection [...] the approach also provides a relatively small amount of reduction in description length and reveals little in the way of inter- [...]

[Page footer: [...], Andrew Y. Ng and Michael I. [...] Running head: [...] NG, AND JORDAN]

To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990). LSI uses a singular value decomposition of the X matrix to identify a [...], Deerwester et [...], which are linear combinations of the original tf-idf features, [...]

To substantiate the claims regarding LSI, and to study its relative strengths and weaknesses, it is useful to develop a generative probabilistic model of text corpora and to study the ability of LSI to recover aspects of the generative model from data (Papadimitriou et al., 1998). Given a generative model of text, however, it is not clear why one should adopt the LSI methodology; one can attempt to proceed more directly, [...] (1999), who presented the probabilistic LSI (pLSI) model, also known as the aspect model, as an alternative [...] models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics."
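The tf-idf reduction described above can be sketched in a few lines. The text only says that the counts are "suitably normalized", so the particular variant here is an assumption: term frequency is the count divided by document length, and inverse document frequency is the log of the number of documents divided by the number of documents containing the term. The function name `tf_idf` and the toy corpus are illustrative only.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return (vocab, rows), where rows[d][t] is the tf-idf weight of term t in document d.

    Assumed variant (the text does not fix one):
      tf  = count of the term in the document / document length
      idf = log(M / df), with M documents and df documents containing the term
    """
    docs = [doc.lower().split() for doc in corpus]
    vocab = sorted({word for doc in docs for word in doc})
    num_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(word for doc in docs for word in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append(
            [(counts[word] / len(doc)) * math.log(num_docs / df[word]) for word in vocab]
        )
    return vocab, rows


corpus = ["the cat sat", "the dog sat", "the dog barked"]
vocab, X = tf_idf(corpus)
for doc, row in zip(corpus, X):
    print(doc, [round(value, 3) for value in row])
```

Each document becomes one row of real numbers indexed by the vocabulary (the transpose of a term-by-document layout). A word that occurs in every document, such as "the" here, gets weight zero, which is the sense in which the scheme picks out discriminative words.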

Thus each word is generated from a single topic, and different words in a [...] list of mixing proportions for these mixture components and thereby reduced to a probability distribution on a [...] the "reduced description" [...]

[...]'s work is a useful step toward probabilistic modeling of text, it is incomplete in that it [...], each document is represented as a list of numbers (the mixing proportions for topics), [...]: (1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting, and (2) it is not clear how to assign probability to a [...]

To see how to proceed beyond pLSI, [...] "bag-of-words" assumption [...] that the order of words in a [...], this is an assumption of exchangeability for the words in a document (Aldous, 1985).
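Shortcoming (1) can be made concrete with a rough parameter count. The sketch below assumes a pLSI model stores k topic-word multinomials over a vocabulary of size V plus one k-vector of mixing proportions per training document, and contrasts it with a model whose parameters are a single k-vector and a k × V matrix, as in the LDA parameterization given later in this transcription. The counts ignore normalization constraints, and the function names and sizes are hypothetical.

```python
def plsi_param_count(k, V, M):
    """Rough pLSI count: k*V topic-word probabilities plus k mixing proportions per document."""
    return k * V + k * M


def lda_param_count(k, V):
    """Rough LDA count: a k-vector plus a k x V word-probability matrix, independent of M."""
    return k + k * V


k, V = 50, 10_000  # illustrative sizes, not taken from the paper
for M in (1_000, 10_000, 100_000):
    print(M, plsi_param_count(k, V, M), lda_param_count(k, V))
```

Only the first count depends on the number of documents M, which is the linear growth the text refers to.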

Moreover, although less often stated formally, these methods also assume that documents are exchangeable; the specific ordering of the documents in a [...] (1990) establishes that any collection of exchangeable random variables has a representation as a mixture distribution [...], if we wish to consider exchangeable representations for documents and words, [...] (LDA) [...]

It is important to emphasize that an assumption of exchangeability is [...], exchangeability essentially can be interpreted as meaning "conditionally independent and identically distributed," where the conditioning is with respect to an underlying latent parameter of a [...], the joint distribution of the random variables is simple and factored while marginally over the latent parameter, [...], while an assumption of exchangeability is clearly a major simplifying assumption in the domain of text modeling, and its principal justification is that it leads to methods that are computationally efficient, the exchangeability assumptions do not necessarily lead to methods that are restricted to simple frequency [...] We aim to demonstrate in the current paper that, by taking the de Finetti theorem seriously, [...]

It is also worth noting that there are a large number of generalizations of the basic notion of exchangeability, including various forms of partial exchangeability, and that representation theorems are available for these cases as well (Diaconis, 1988). Thus, while the work that we discuss in the current paper focuses on simple "bag-of-words" models, which lead to mixture distributions for single words (unigrams), our methods are also applicable to richer models that involve [...]

[...] text classification and collaborative [...] Section 8 [...]

We use the language of text collections throughout the paper, referring to entities such as "words," "documents," and "corpora."

This is useful in that it helps to guide intuition, [...] It is important to note, however, that the LDA model is not necessarily tied to text, and has applications to other problems involving collections of data, including data from domains such as collaborative filtering, [...], we present experimental results in the collaborative [...], we define the following terms:

A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by $\{1, \ldots, V\}$. We represent words using unit-basis vectors that have a [...], using superscripts to denote components, the $v$th word in the vocabulary is represented by a $V$-vector $w$ such that $w^v = 1$ and $w^u = 0$ for $u \neq v$.
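As a small illustration of the unit-basis representation just defined, the sketch below builds the V-vector for the v-th vocabulary word. The helper name `unit_basis` and the use of a plain Python list are assumptions made for the example; indices are 1-based to match the text.

```python
def unit_basis(v, V):
    """Return the V-vector w with w^v = 1 and w^u = 0 for u != v (v is 1-based)."""
    if not 1 <= v <= V:
        raise ValueError("v must lie in 1..V")
    return [1 if u == v else 0 for u in range(1, V + 1)]


# The 2nd word of a 5-word vocabulary.
w = unit_basis(2, 5)
print(w)
```

In practice an implementation would usually store the integer index v rather than the full vector; the vector form mainly keeps the notation for the word probabilities compact.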

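A document is a sequence of $N$ words denoted by $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, where $w_n$ is the $n$th word in the sequence. A corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$.

We wish to find a probabilistic model of a corpus that not only assigns high probability to members of the corpus, but also assigns high probability to other "similar" [...]

[...] (LDA) is a generative probabilistic model of a [...], where each topic is characterized by a [...] assumes the following generative process for each document $\mathbf{w}$ in a [...]

1. Choose $N \sim \mathrm{Poisson}(\xi)$.
2. Choose $\theta \sim \mathrm{Dir}(\alpha)$.
3. For each of the $N$ words $w_n$:
   (a) Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
   (b) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a [...]

[...], some of which we remove [...], the dimensionality $k$ of the Dirichlet distribution (and thus the dimensionality of the topic variable $z$) is [...], the word probabilities are parameterized by a $k \times V$ matrix $\beta$ where $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$, which for now we treat as a fixed quantity that is [...], the Poisson assumption is [...], note that $N$ is independent of all the other data generating variables ($\theta$ and $z$). It is [...]

[...] can take values in the $(k-1)$-simplex (a $k$-vector $\theta$ lies in the $(k-1)$-simplex if $\theta_i \ge 0$, $\sum_{i=1}^{k} \theta_i = 1$), and has the following probability density on this simplex:

$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}, \tag{1}$$

where the parameter $\alpha$ is a $k$-vector with components $\alpha_i > 0$, and where $\Gamma(x)$ is the Gamma function. [...] a convenient distribution on the simplex: it is in the exponential family, has finite dimensional sufficient statistics, and is conjugate to [...]

Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of $N$ topics $\mathbf{z}$, and a set of $N$ words $\mathbf{w}$ is given by:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta), \tag{2}$$

where $p(z_n \mid \theta)$ is simply $\theta_i$ for the unique $i$ such that $z_n^i = 1$. Integrating over $\theta$ and summing over $z$, we obtain the marginal distribution of a document:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta. \tag{3}$$

Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d.$$

[Footnote: We refer to the latent multinomial variables in the LDA model as topics, so as to exploit text-oriented intuitions, but we make [...]]

[Figure 1: graphical model with nodes including z and w and plates labeled M and N. Caption fragment: [...] "plates" [...], while the inner plate represents the repeated choice of topics and words within a [...]]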
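The generative process and the densities in Eqs. (1) to (3) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the paper's inference procedure: words and topics are stored as 0-based integer indices instead of unit-basis vectors, the Dirichlet draw normalizes independent Gamma variates, the Poisson draw uses Knuth's multiplication method, and the marginal of Eq. (3) is approximated by naive Monte Carlo over θ. All function names and the toy values of ξ, α and β are hypothetical.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw N ~ Poisson(lam) with Knuth's multiplication method (adequate for small lam)."""
    threshold = math.exp(-lam)
    n, p = 0, rng.random()
    while p > threshold:
        n += 1
        p *= rng.random()
    return n


def sample_dirichlet(rng, alpha):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_document(rng, xi, alpha, beta):
    """Steps 1-3: N ~ Poisson(xi), theta ~ Dir(alpha), then a topic and a word per position."""
    num_topics, vocab_size = len(beta), len(beta[0])
    N = sample_poisson(rng, xi)
    theta = sample_dirichlet(rng, alpha)
    z = rng.choices(range(num_topics), weights=theta, k=N)
    w = [rng.choices(range(vocab_size), weights=beta[zn], k=1)[0] for zn in z]
    return theta, z, w


def log_joint(theta, z, w, alpha, beta):
    """log p(theta, z, w | alpha, beta) as in Eq. (2), using the Dirichlet density of Eq. (1)."""
    log_p = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_p += sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    for zn, wn in zip(z, w):
        log_p += math.log(theta[zn]) + math.log(beta[zn][wn])
    return log_p


def marginal_mc(rng, w, alpha, beta, num_samples=2000):
    """Naive Monte Carlo estimate of Eq. (3): average over theta ~ Dir(alpha) of
    prod_n sum_i theta_i * beta[i][w_n]."""
    total = 0.0
    for _ in range(num_samples):
        theta = sample_dirichlet(rng, alpha)
        prob = 1.0
        for wn in w:
            prob *= sum(t * row[wn] for t, row in zip(theta, beta))
        total += prob
    return total / num_samples


rng = random.Random(0)
alpha = [1.0, 1.0]                           # k = 2 topics
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]    # k x V word probabilities, V = 3
theta, z, w = sample_document(rng, xi=8.0, alpha=alpha, beta=beta)
print(len(w), log_joint(theta, z, w, alpha, beta), marginal_mc(rng, w, alpha, beta))
```

The inner sum over topics in `marginal_mc` is cheap because each word's topic can be summed out independently once θ is fixed; the integral over θ is the part that has no closed form, which is why the sketch falls back on sampling.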

