Finding scientific topics

Colloquium

Thomas L. Griffiths* and Mark Steyvers

*Department of Psychology, Stanford University, Stanford, CA 94305; Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139-4307; and Department of Cognitive Sciences, University of California, Irvine, CA 92697.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, "Mapping Knowledge Domains," held May 9-11, 2003, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA.

To whom correspondence should be addressed. E-mail:

2004 by The National Academy of Sciences of the USA.

A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.

When scientists decide to write a paper, one of the first things they do is identify an interesting subset of the many possible topics of scientific investigation. The topics addressed by a paper are also one of the first pieces of information a person tries to extract when reading a scientific abstract. Scientific experts know which topics are pursued in their field, and this information plays a role in their assessments of whether papers are relevant to their interests, which research areas are rising or falling in popularity, and how papers relate to one another. Here, we present a statistical method for automatically extracting a representation of documents that provides a first-order approximation to the kind of knowledge available to domain experts. Our method discovers a set of topics expressed by documents, providing quantitative measures that can be used to identify the content of those documents, track changes in content over time, and express the similarity between documents. We use our method to discover the topics covered by papers in PNAS in a purely unsupervised fashion and illustrate how these topics can be used to gain insight into some of the structure of science.

The statistical model we use in our analysis is a generative model for documents; it reduces the complex process of producing a scientific paper to a small number of simple probabilistic steps and thus specifies a probability distribution over all possible documents. Generative models can be used to postulate complex latent structures responsible for a set of observations, making it possible to use statistical inference to recover this structure. This kind of approach is particularly useful with text, where the observed data (the words) are explicitly intended to communicate a latent structure (their meaning). The particular generative model we use, called Latent Dirichlet Allocation, was introduced in ref. 1. This generative model postulates a latent structure consisting of a set of topics; each document is produced by choosing a distribution over topics, and then generating each word at random from a topic chosen by using this distribution.

The plan of this article is as follows. In the next section, we describe Latent Dirichlet Allocation and present a Markov chain Monte Carlo algorithm for inference in this model, illustrating the operation of our algorithm on a small dataset. We then apply our algorithm to a corpus consisting of abstracts from PNAS from 1991 to 2001, determining the number of topics needed to account for the information contained in this corpus and extracting a set of topics. We use these topics to illustrate the relationships between different scientific disciplines, assessing trends and "hot topics" by analyzing topic dynamics and using the assignments of words to topics to highlight the semantic content of documents.

Documents, Topics, and Statistical Inference

A scientific paper can deal with multiple topics, and the words that appear in that paper reflect the particular set of topics it addresses. In statistical natural language processing, one common way of modeling the contributions of different topics to a document is to treat each topic as a probability distribution over words, viewing a document as a probabilistic mixture of these topics (1-6). If we have $T$ topics, we can write the probability of the $i$th word in a given document as

$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j), \qquad [1]$$

where $z_i$ is a latent variable indicating the topic from which the $i$th word was drawn and $P(w_i \mid z_i = j)$ is the probability of the word $w_i$ under the $j$th topic. $P(z_i = j)$ gives the probability of choosing a word from topic $j$ in the current document, which will vary across different documents.

Intuitively, $P(w \mid z)$ indicates which words are important to a topic, whereas $P(z)$ is the prevalence of those topics within a document. For example, in a journal that published only articles in mathematics or neuroscience, we could express the probability distribution over words with two topics, one relating to mathematics and the other relating to neuroscience. The content of the topics would be reflected in $P(w \mid z)$; the "mathematics" topic would give high probability to words like theory, space, or problem, whereas the "neuroscience" topic would give high probability to words like synaptic, neurons, and hippocampal. Whether a particular document concerns neuroscience, mathematics, or computational neuroscience would depend on its distribution over topics, $P(z)$, which determines how these topics are mixed together in forming documents. The fact that multiple topics can be responsible for the words occurring in a single document discriminates this model from a standard Bayesian classifier, in which it is assumed that all the words in the document come from a single class. The "soft classification" provided by this model, in which each document is characterized in terms of the contributions of multiple topics, has applications in many domains other than text (7).
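The mixture in Eq. 1 and the two-topic journal example can be made concrete with a small sketch. Everything numeric here is an assumption for illustration: the six-word vocabulary, the per-topic word probabilities, and the two document mixtures are invented, and the helper names `word_probability` and `generate_document` are hypothetical rather than taken from the paper.

```python
import random

# Hypothetical six-word vocabulary drawn from the example words in the text.
VOCAB = ["theory", "space", "problem", "synaptic", "neurons", "hippocampal"]

# P(w | z = j): one distribution over VOCAB per topic (made-up values, each sums to 1).
P_W_GIVEN_Z = {
    "mathematics":  [0.30, 0.30, 0.30, 0.04, 0.03, 0.03],
    "neuroscience": [0.04, 0.03, 0.03, 0.30, 0.30, 0.30],
}


def word_probability(word, p_z):
    """Eq. 1: P(w) = sum_j P(w | z = j) P(z = j) for a single document."""
    idx = VOCAB.index(word)
    return sum(P_W_GIVEN_Z[topic][idx] * p_z[topic] for topic in P_W_GIVEN_Z)


def generate_document(p_z, n_words, rng):
    """Draw each word by first picking a topic from P(z), then a word from P(w | z)."""
    topics = list(P_W_GIVEN_Z)
    topic_weights = [p_z[t] for t in topics]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=topic_weights, k=1)[0]
        word = rng.choices(VOCAB, weights=P_W_GIVEN_Z[topic], k=1)[0]
        words.append(word)
    return words


# P(z) differs per document: a mostly-mathematics paper and an evenly mixed one.
math_doc = {"mathematics": 0.9, "neuroscience": 0.1}
comp_neuro_doc = {"mathematics": 0.5, "neuroscience": 0.5}

print(word_probability("theory", math_doc))
print(word_probability("theory", comp_neuro_doc))
print(generate_document(comp_neuro_doc, 8, random.Random(0)))
```

The same two topics serve both documents; only $P(z)$ changes, which is the sense in which the model gives a soft classification rather than assigning the whole document to one class.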

PNAS, April 6, 2004, vol. 101, suppl. 1, 5228-5235.

Viewing documents as mixtures of probabilistic topics makes it possible to formulate the problem of discovering the set of topics that are used in a collection of documents. Given $D$ documents containing $T$ topics expressed over $W$ unique words, we can represent $P(w \mid z)$ with a set of $T$ multinomial distributions $\phi$ over the $W$ words, such that $P(w \mid z = j) = \phi_w^{(j)}$, and $P(z)$ with a set of $D$ multinomial distributions $\theta$ over the $T$ topics, such that for a word in document $d$, $P(z = j) = \theta_j^{(d)}$. To discover the set of topics used in a corpus $\mathbf{w} = \{w_1, w_2, \ldots, w_n\}$, where each $w_i$ belongs to some document $d_i$, we want to obtain an estimate of $\phi$ that gives high probability to the words that appear in the corpus. One strategy for obtaining such an estimate is to simply attempt to maximize $P(\mathbf{w} \mid \phi, \theta)$, following from Eq. 1 directly by using the Expectation-Maximization (8) algorithm to find maximum likelihood estimates of $\phi$ and $\theta$ (2, 3). However, this approach is susceptible to problems involving local maxima and is slow to converge (1, 2), encouraging the development of models that make assumptions about the source of $\theta$.

Latent Dirichlet Allocation (1) is one such model, combining Eq. 1 with a prior probability distribution on $\theta$ to provide a complete generative model for documents. This generative model specifies a simple probabilistic procedure by which new documents can be produced given just a set of topics $\phi$, allowing $\phi$ to be estimated without requiring the estimation of $\theta$. In Latent Dirichlet Allocation, documents are generated by first picking a distribution over topics $\theta$ from a Dirichlet distribution, which determines $P(z)$ for words in that document. The words in the document are then generated by picking a topic $j$ from this distribution and then picking a word from that topic according to $P(w \mid z = j)$, which is determined by a fixed $\phi^{(j)}$. The estimation problem becomes one of maximizing $P(\mathbf{w} \mid \phi, \alpha) = \int P(\mathbf{w} \mid \phi, \theta)\, P(\theta)\, d\theta$, where $P(\theta)$ is a Dirichlet($\alpha$) distribution.

$$P(\mathbf{w} \mid \mathbf{z}) = \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \right)^{T} \prod_{j=1}^{T} \frac{\prod_{w} \Gamma\!\left(n_j^{(w)} + \beta\right)}{\Gamma\!\left(n_j^{(\cdot)} + W\beta\right)}, \qquad [2]$$

in which $n_j^{(w)}$ is the number of times word $w$ has been assigned to topic $j$ in the vector of assignments $\mathbf{z}$, $\beta$ is the hyperparameter of the symmetric Dirichlet prior on $\phi$, and $\Gamma(\cdot)$ is the standard gamma function. The second term of the joint distribution $P(\mathbf{w}, \mathbf{z}) = P(\mathbf{w} \mid \mathbf{z})\, P(\mathbf{z})$ results from integrating out $\theta$, to give

$$P(\mathbf{z}) = \left( \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}} \right)^{D} \prod_{d=1}^{D} \frac{\prod_{j} \Gamma\!\left(n_j^{(d)} + \alpha\right)}{\Gamma\!\left(n_{\cdot}^{(d)} + T\alpha\right)}, \qquad [3]$$

where $n_j^{(d)}$ is the number of times a word from document $d$ has been assigned to topic $j$. Our goal is then to evaluate the posterior distribution

$$P(\mathbf{z} \mid \mathbf{w}) = \frac{P(\mathbf{w}, \mathbf{z})}{\sum_{\mathbf{z}} P(\mathbf{w}, \mathbf{z})}. \qquad [4]$$

Unfortunately, this distribution cannot be computed directly, because the sum in the denominator does not factorize and involves $T^{n}$ terms, where $n$ is the total number of word instances in the corpus.

Computing $P(\mathbf{z} \mid \mathbf{w})$ involves evaluating a probability distribution on a large discrete state space, a problem that arises often in statistical physics. Our setting is similar, in particular, to the Potts model (e.g., ref. 10), with an ensemble of discrete variables $\mathbf{z}$, each of which can take on values in $\{1, 2, \ldots, T\}$, and an energy function given by $H(\mathbf{z}) \propto -\log P(\mathbf{w}, \mathbf{z}) = -\log P(\mathbf{w} \mid \mathbf{z}) - \log P(\mathbf{z})$. Unlike the Potts model, in which the energy function
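As a rough illustration of Eqs. 2-4, the sketch below evaluates the log joint from the count-based expressions and then computes the posterior over assignments by brute-force enumeration on a toy corpus. It rests on stated assumptions: $\beta$ is taken to be the hyperparameter of a symmetric Dirichlet prior on $\phi$, the six-token corpus and the values of `alpha` and `beta` are invented, and the function names `log_joint` and `exact_posterior` are hypothetical. Enumeration is shown only to make the $T^{n}$ cost visible; it is not the inference method the paper proposes.

```python
import itertools
import math


def log_joint(words, docs, z, T, W, D, alpha, beta):
    """log P(w, z) = log P(w | z) + log P(z), evaluated from Eqs. 2 and 3."""
    n_wj = [[0] * W for _ in range(T)]  # n_j^(w): times word w is assigned to topic j
    n_dj = [[0] * T for _ in range(D)]  # n_j^(d): words in document d assigned to topic j
    for w, d, j in zip(words, docs, z):
        n_wj[j][w] += 1
        n_dj[d][j] += 1

    # Eq. 2, in log space to avoid overflow in the gamma functions.
    log_p_w_given_z = T * (math.lgamma(W * beta) - W * math.lgamma(beta))
    for j in range(T):
        log_p_w_given_z += sum(math.lgamma(n + beta) for n in n_wj[j])
        log_p_w_given_z -= math.lgamma(sum(n_wj[j]) + W * beta)

    # Eq. 3, same structure with documents in place of topics.
    log_p_z = D * (math.lgamma(T * alpha) - T * math.lgamma(alpha))
    for d in range(D):
        log_p_z += sum(math.lgamma(n + alpha) for n in n_dj[d])
        log_p_z -= math.lgamma(sum(n_dj[d]) + T * alpha)

    return log_p_w_given_z + log_p_z


def exact_posterior(words, docs, T, W, D, alpha, beta):
    """Eq. 4 by brute force: enumerate all T**n assignment vectors z."""
    assignments = list(itertools.product(range(T), repeat=len(words)))
    log_ps = [log_joint(words, docs, z, T, W, D, alpha, beta) for z in assignments]
    shift = max(log_ps)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(lp - shift) for lp in log_ps]
    total = sum(weights)
    return {z: wt / total for z, wt in zip(assignments, weights)}


# Toy corpus (made up): word ids and the document each token belongs to.
words = [0, 0, 1, 2, 3, 3]
docs = [0, 0, 0, 1, 1, 1]
posterior = exact_posterior(words, docs, T=2, W=4, D=2, alpha=0.5, beta=0.1)

best = max(posterior, key=posterior.get)
print(len(posterior), best, posterior[best])
```

With six tokens and two topics the denominator of Eq. 4 already has 64 terms, and the count grows as $T^{n}$, so this enumeration is feasible only for toy inputs. Because both priors are symmetric, relabeling the topics leaves the posterior unchanged, so each assignment shares its probability with its label-swapped counterpart.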
