
Latent Dirichlet Allocation - Journal of Machine Learning Research

Journal of Machine Learning Research 3 (2003) 993-1022    Submitted 2/02; Published 1/03

Latent Dirichlet Allocation

David M. Blei    BLEI@CS.BERKELEY.EDU
Computer Science Division
University of California
Berkeley, CA 94720, USA

Andrew Y. Ng    ANG@CS.STANFORD.EDU
Computer Science Department
Stanford University
Stanford, CA 94305, USA

Michael I. Jordan    JORDAN@CS.BERKELEY.EDU
Computer Science Division and Department of Statistics
University of California
Berkeley, CA 94720, USA

Editor: John Lafferty

Abstract

We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.



1. Introduction

In this paper we consider the problem of modeling text corpora and other collections of discrete data. The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments.

Significant progress has been made on this problem by researchers in the field of information retrieval (IR) (Baeza-Yates and Ribeiro-Neto, 1999). The basic methodology proposed by IR researchers for text corpora, a methodology successfully deployed in modern Internet search engines, reduces each document in the corpus to a vector of real numbers, each of which represents ratios of counts. In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of "words" or "terms" is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word.

After suitable normalization, this term frequency count is compared to an inverse document frequency count, which measures the number of occurrences of a word in the entire corpus (generally on a log scale, and again suitably normalized). The end result is a term-by-document matrix X whose columns contain the tf-idf values for each of the documents in the corpus. Thus the tf-idf scheme reduces documents of arbitrary length to fixed-length lists of numbers.

While the tf-idf reduction has some appealing features, notably in its basic identification of sets of words that are discriminative for documents in the collection, the approach also provides a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure. To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990).
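As a concrete illustration of the tf-idf reduction just described, the following sketch builds a small term-by-document matrix from raw counts. It is not code from the paper; the toy documents, the length-normalized term frequency, and the log-scaled inverse document frequency are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_matrix(documents, vocabulary):
    """Build a term-by-document matrix of tf-idf values.

    documents  : list of token lists (one list per document)
    vocabulary : list of terms; rows of the matrix follow this order
    Illustrative choices: term frequency is normalized by document length,
    and idf is log(M / document frequency).
    """
    M = len(documents)
    counts = [Counter(doc) for doc in documents]

    # Document frequency: number of documents containing each term.
    df = {term: sum(1 for c in counts if c[term] > 0) for term in vocabulary}

    matrix = []
    for term in vocabulary:
        idf = math.log(M / df[term]) if df[term] > 0 else 0.0
        row = []
        for c, doc in zip(counts, documents):
            tf = c[term] / len(doc) if doc else 0.0  # length-normalized count
            row.append(tf * idf)
        matrix.append(row)
    return matrix  # V x M: each column holds the tf-idf values of one document

docs = [["topic", "model", "text"], ["text", "retrieval", "text"]]
vocab = ["topic", "model", "text", "retrieval"]
X = tfidf_matrix(docs, vocab)
```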

LSI uses a singular value decomposition of the X matrix to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection. This approach can achieve significant compression in large collections. Furthermore, Deerwester et al. argue that the derived features of LSI, which are linear combinations of the original tf-idf features, can capture some aspects of basic linguistic notions such as synonymy and polysemy.

To substantiate the claims regarding LSI, and to study its relative strengths and weaknesses, it is useful to develop a generative probabilistic model of text corpora and to study the ability of LSI to recover aspects of the generative model from data (Papadimitriou et al., 1998). Given a generative model of text, however, it is not clear why one should adopt the LSI methodology; one can attempt to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods.

A significant step forward in this regard was made by Hofmann (1999), who presented the probabilistic LSI (pLSI) model, also known as the aspect model, as an alternative to LSI.
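Since LSI is defined by a truncated singular value decomposition of X, a minimal sketch of the projection step (assuming NumPy and an arbitrary toy matrix, neither taken from the paper) might look like this:

```python
import numpy as np

def lsi(X, k):
    """Project a term-by-document matrix onto its top-k singular directions.

    Returns a k-dimensional representation of each document, i.e. the linear
    combinations of tf-idf features that capture most of the variance.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Documents in the latent space: scale the top-k right singular vectors.
    return np.diag(s[:k]) @ Vt[:k, :]  # shape (k, M)

# Toy 4-term by 2-document matrix of tf-idf values.
X = np.array([[0.2, 0.0],
              [0.1, 0.0],
              [0.3, 0.5],
              [0.0, 0.4]])
doc_embeddings = lsi(X, k=2)
```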

The pLSI approach, which we describe in detail in Section 4, models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics." Thus each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a list of mixing proportions for these mixture components and thereby reduced to a probability distribution on a fixed set of topics. This distribution is the "reduced description" associated with the document.

While Hofmann's work is a useful step toward probabilistic modeling of text, it is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. This leads to several problems: (1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting, and (2) it is not clear how to assign probability to a document outside of the training set.

To see how to proceed beyond pLSI, let us consider the fundamental probabilistic assumptions underlying the class of dimensionality reduction methods that includes LSI and pLSI.
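For orientation, the aspect model sketched above is usually written as the following mixture, with d indexing documents, w_n a word, and z a topic; this is the standard pLSI factorization rather than an equation quoted from this section:

```latex
\[
p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)
\]
```

The mixing proportions p(z | d) are exactly the per-document list of numbers criticized above: they are parameters attached to specific training documents rather than draws from a generative model.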

All of these methods are based on the "bag-of-words" assumption: that the order of words in a document can be neglected. In the language of probability theory, this is an assumption of exchangeability for the words in a document (Aldous, 1985). Moreover, although less often stated formally, these methods also assume that documents are exchangeable; the specific ordering of the documents in a corpus can also be neglected.

A classic representation theorem due to de Finetti (1990) establishes that any collection of exchangeable random variables has a representation as a mixture distribution, in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This line of thinking leads to the latent Dirichlet allocation (LDA) model that we present in the current paper.

It is important to emphasize that an assumption of exchangeability is not equivalent to an assumption that the random variables are independent and identically distributed.

Rather, exchangeability essentially can be interpreted as meaning "conditionally independent and identically distributed," where the conditioning is with respect to an underlying latent parameter of a probability distribution. Conditionally, the joint distribution of the random variables is simple and factored, while marginally over the latent parameter, the joint distribution can be quite complex. Thus, while an assumption of exchangeability is clearly a major simplifying assumption in the domain of text modeling, and its principal justification is that it leads to methods that are computationally efficient, the exchangeability assumptions do not necessarily lead to methods that are restricted to simple frequency counts or linear operations. We aim to demonstrate in the current paper that, by taking the de Finetti theorem seriously, we can capture significant intra-document statistical structure via the mixing distribution.

It is also worth noting that there are a large number of generalizations of the basic notion of exchangeability, including various forms of partial exchangeability, and that representation theorems are available for these cases as well (Diaconis, 1988).
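The "conditionally i.i.d." reading of exchangeability can be made concrete with the usual de Finetti-style representation for the words of a document; this display is the standard statement of the idea, not an equation reproduced from this section:

```latex
\[
p(w_1, \ldots, w_N) = \int p(\theta) \left( \prod_{n=1}^{N} p(w_n \mid \theta) \right) d\theta
\]
```

Conditioned on the latent parameter θ the words factor, but marginalizing over θ couples them, which is how an exchangeable model can still capture intra-document statistical structure.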

Thus, while the work that we discuss in the current paper focuses on simple "bag-of-words" models, which lead to mixture distributions for single words (unigrams), our methods are also applicable to richer models that involve mixtures for larger structural units such as n-grams or paragraphs.

The paper is organized as follows. In Section 2 we introduce basic notation and terminology. The LDA model is presented in Section 3 and is compared to related latent variable models in Section 4. We discuss inference and parameter estimation for LDA in Section 5. An illustrative example of fitting LDA to data is provided in Section 6. Empirical results in text modeling, text classification and collaborative filtering are presented in Section 7. Finally, Section 8 presents our conclusions.

2. Notation and terminology

We use the language of text collections throughout the paper, referring to entities such as "words," "documents," and "corpora." This is useful in that it helps to guide intuition, particularly when we introduce latent variables which aim to capture abstract notions such as topics.

It is important to note, however, that the LDA model is not necessarily tied to text, and has applications to other problems involving collections of data, including data from domains such as collaborative filtering, content-based image retrieval and bioinformatics. Indeed, in Section 7, we present experimental results in the collaborative filtering domain.

Formally, we define the following terms:

- A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, ..., V}. We represent words using unit-basis vectors that have a single component equal to one and all other components equal to zero. Thus, using superscripts to denote components, the vth word in the vocabulary is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v.

- A document is a sequence of N words denoted by w = (w_1, w_2, ..., w_N), where w_n is the nth word in the sequence.

- A corpus is a collection of M documents denoted by D = {w_1, w_2, ..., w_M}.

We wish to find a probabilistic model of a corpus that not only assigns high probability to members of the corpus, but also assigns high probability to other "similar" documents.
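To make the notation concrete, here is a minimal sketch of one way to represent words, documents, and a corpus in code; the vocabulary and documents are invented for illustration, and the one-hot vectors follow the unit-basis convention defined above.

```python
import numpy as np

V = 4                                               # vocabulary size
vocabulary = ["topic", "model", "text", "corpus"]   # indexed 1..V in the paper, 0..V-1 here

def one_hot(v, V):
    """Unit-basis vector for the v-th vocabulary word: w^v = 1, w^u = 0 for u != v."""
    w = np.zeros(V)
    w[v] = 1.0
    return w

# A document is a sequence of N words; a corpus is a collection of M documents.
document = [one_hot(vocabulary.index(t), V) for t in ["text", "model", "text"]]
corpus = [document,
          [one_hot(vocabulary.index(t), V) for t in ["topic", "corpus"]]]
```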

3. Latent Dirichlet allocation

Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

LDA assumes the following generative process for each document w in a corpus D:

1. Choose N ~ Poisson(ξ).
2. Choose θ ~ Dir(α).
3. For each of the N words w_n:
   (a) Choose a topic z_n ~ Multinomial(θ).
   (b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.

Several simplifying assumptions are made in this basic model, some of which we remove in subsequent sections. First, the dimensionality k of the Dirichlet distribution (and thus the dimensionality of the topic variable z) is assumed known and fixed. Second, the word probabilities are parameterized by a k × V matrix β where β_ij = p(w^j = 1 | z^i = 1), which for now we treat as a fixed quantity that is to be estimated. Finally, the Poisson assumption is not critical to anything that follows and more realistic document length distributions can be used as needed. Furthermore, note that N is independent of all the other data generating variables (θ and z).
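The generative process above translates almost line by line into code. The following sketch assumes NumPy and toy values for ξ, α, and β, none of which come from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(xi, alpha, beta):
    """Sample one document from the LDA generative process.

    xi    : Poisson rate for the document length N
    alpha : Dirichlet parameter of length k (number of topics)
    beta  : k x V matrix with beta[i, j] = p(word j | topic i)
    Returns the sampled topic proportions and the list of word indices.
    """
    N = max(1, rng.poisson(xi))          # 1. N ~ Poisson(xi); avoid an empty document
    theta = rng.dirichlet(alpha)         # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(N):                   # 3. for each of the N words
        z = rng.choice(len(alpha), p=theta)        # (a) z_n ~ Multinomial(theta)
        w = rng.choice(beta.shape[1], p=beta[z])   # (b) w_n ~ p(w | z_n, beta)
        words.append(w)
    return theta, words

# Toy hyperparameters: k = 2 topics over a vocabulary of V = 4 words.
alpha = np.array([1.0, 1.0])
beta = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])
theta, doc = generate_document(xi=8, alpha=alpha, beta=beta)
```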

