
Probabilistic topic models

By David M. Blei

Surveying a suite of algorithms that offer a solution to managing large document archives.

As our collective knowledge continues to be digitized and stored (in the form of news, blogs, Web pages, scientific articles, books, images, sound, video, and social networks), it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search, and understand these vast amounts of information.

Right now, we work with online information using two main tools: search and links. We type keywords into a search engine and find a set of documents related to them. We look at the documents in that set, possibly navigating to other linked documents. This is a powerful way of interacting with our online archive, but something is missing.

Imagine searching and exploring documents based on the themes that run through them. We might "zoom in" and "zoom out" to find specific or broader themes; we might look at how those themes changed through time or how they are connected to each other. Rather than finding documents through keyword search alone, we might first find the theme that we are interested in, and then examine the documents related to that theme.

For example, consider using themes to explore the complete history of the New York Times. At a broad level, some of the themes might correspond to the sections of the newspaper: foreign policy, national affairs, sports. We could zoom in on a theme of interest, such as foreign policy, to reveal various aspects of it: Chinese foreign policy, the conflict in the Middle East, the United States' relationship with Russia. We could then navigate through time to reveal how these specific themes have changed, tracking, for example, the changes in the conflict in the Middle East over the last 50 years. And, in all of this exploration, we would be pointed to the original articles relevant to the themes. The thematic structure would be a new kind of window through which to explore and digest the collection.





But we do not interact with electronic archives in this way. While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. (See, for example, Figure 3 for topics found by analyzing the Yale Law Journal.) Topic modeling algorithms do not require any prior annotations or labeling of the documents; the topics emerge from the analysis of the original texts. Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.

Key insights:

- Topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents. Topic models can organize the collection according to the discovered themes.
- Topic modeling algorithms can be applied to massive collections of documents. Recent advances in this field allow us to analyze streaming collections, like you might find from a Web API.
- Topic modeling algorithms can be adapted to many kinds of data. Among other applications, they have been used to find patterns in genetic data, images, and social networks.

Latent Dirichlet Allocation

We first describe the basic ideas behind latent Dirichlet allocation (LDA), which is the simplest topic model. The intuition behind LDA is that documents exhibit multiple topics. For example, consider the article in Figure 1. This article, entitled "Seeking Life's Bare (Genetic) Necessities," is about using data analysis to determine the number of genes an organism needs to survive (in an evolutionary sense).

By hand, we have highlighted different words that are used in the article. Words about data analysis, such as "computer" and "prediction," are highlighted in blue; words about evolutionary biology, such as "life" and "organism," are highlighted in pink; words about genetics, such as "sequenced" and "genes," are highlighted in yellow. If we took the time to highlight every word in the article, you would see that this article blends genetics, data analysis, and evolutionary biology in different proportions. (We exclude words, such as "and," "but," or "if," which contain little topical content.) Furthermore, knowing that this article blends those topics would help you situate it in a collection of scientific articles.

LDA is a statistical model of document collections that tries to capture this intuition. It is most easily described by its generative process, the imaginary random process by which the model assumes the documents arose. (The interpretation of LDA as a probabilistic model is fleshed out later.)

We formally define a topic to be a distribution over a fixed vocabulary. For example, the genetics topic has words about genetics with high probability and the evolutionary biology topic has words about evolutionary biology with high probability. We assume that these topics are specified before any data has been generated.[a] Now for each document in the collection, we generate the words in a two-stage process:

1. Randomly choose a distribution over topics.
2. For each word in the document:
   a. Randomly choose a topic from the distribution over topics in step #1.
   b. Randomly choose a word from the corresponding distribution over the vocabulary.

This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics in different proportion (step #1); each word in each document is drawn from one of the topics (step #2b), where the selected topic is chosen from the per-document distribution over topics (step #2a).[b]
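To make the two-stage process concrete, here is a minimal simulation of it. This sketch is not from the original article: the toy vocabulary, the number of topics, and the Dirichlet parameters are all assumed for illustration, and NumPy's Dirichlet and categorical samplers stand in for the abstract "randomly choose" steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: a tiny vocabulary and K = 3 topics.
vocab = ["gene", "dna", "life", "organism", "data", "computer"]
K = 3

# Each topic is a distribution over the fixed vocabulary,
# specified before any documents are generated (see footnote [a]).
topics = rng.dirichlet(np.full(len(vocab), 0.5), size=K)

def generate_document(n_words, alpha=0.5):
    # Step 1: randomly choose this document's distribution over topics.
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(n_words):
        # Step 2a: choose a topic from the distribution chosen in step 1.
        z = rng.choice(K, p=theta)
        # Step 2b: choose a word from that topic's distribution over the vocabulary.
        w = rng.choice(len(vocab), p=topics[z])
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(20)
print(np.round(theta, 2), doc)
```

Note that `topics` is created once and shared by every call to `generate_document`, while `theta` is drawn fresh for each document. This mirrors the distinguishing characteristic of LDA described next: all documents share the same set of topics, but each exhibits them in different proportion.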

In the example article, the distribution over topics would place probability on genetics, data analysis, and evolutionary biology, and each word is drawn from one of those three topics. Notice that the next article in the collection might be about data analysis and neuroscience; its distribution over topics would place probability on those two topics. This is the distinguishing characteristic of latent Dirichlet allocation: all the documents in the collection share the same set of topics, but each document exhibits those topics in different proportion.
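The article develops a formal notation later; in the standard LDA notation (assumed here), with topics beta_1, ..., beta_K shared by the whole collection, the two-stage process for document d reads:

```latex
\beta_k \sim \mathrm{Dirichlet}(\eta), \quad k = 1, \dots, K
  \quad \text{(topics, shared by the whole collection)} \\
\theta_d \sim \mathrm{Dirichlet}(\alpha)
  \quad \text{(step \#1: per-document topic proportions)} \\
z_{d,n} \sim \mathrm{Categorical}(\theta_d)
  \quad \text{(step \#2a: topic assignment for word } n \text{)} \\
w_{d,n} \sim \mathrm{Categorical}(\beta_{z_{d,n}})
  \quad \text{(step \#2b: the observed word)}
```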

[a] Technically, the model assumes that the topics are generated first, before the documents.

[b] We should explain the mysterious name, "latent Dirichlet allocation." The distribution that is used to draw the per-document topic distributions in step #1 (the cartoon histogram in Figure 1) is called a Dirichlet distribution. In the generative process for LDA, the result of the Dirichlet is used to allocate the words of the document to different topics. Why latent? Keep reading.

Figure 1. The intuitions behind latent Dirichlet allocation. We assume that some number of "topics," which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative; they are not fit from real data. See Figure 2 for topics fit from data. [The figure depicts four example topics (gene, dna, genetic; life, evolve, organism; brain, neuron, nerve; data, number, computer), a sample document, and its topic proportions and assignments.]
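As an aside not in the original article: the Dirichlet distribution of footnote [b] has a concentration parameter, and a common modeling assumption is to set it below one, which makes a sampled distribution over topics put most of its mass on just a few topics. A quick NumPy check illustrates this, and previews why the inferred topic distribution in Figure 2 "activates" only a handful of the 100 topics.

```python
import numpy as np

rng = np.random.default_rng(1)

def topics_covering_90pct(theta):
    """Number of topics needed to account for 90% of the probability mass."""
    sorted_desc = np.sort(theta)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_desc), 0.9)) + 1

# Concentration < 1: a draw puts most mass on relatively few of the 100 topics.
sparse = rng.dirichlet(np.full(100, 0.1))
print(topics_covering_90pct(sparse))   # a small fraction of the topics

# Concentration > 1: the mass is spread much more evenly.
dense = rng.dirichlet(np.full(100, 10.0))
print(topics_covering_90pct(dense))    # a large fraction of the topics
```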

Figure 2. Real inference with LDA. We fit a 100-topic LDA model to 17,000 articles from the journal Science. At left are the inferred topic proportions for the example article in Figure 1. At right are the top 15 most frequent words from the most frequent topics found in this article.

Genetics: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
Evolution: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
Disease: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
Computers: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations

As we described in the introduction, the goal of topic modeling is to automatically discover the topics from a collection of documents. The documents themselves are observed, while the topic structure is hidden; the computational problem is to use the observed documents to infer that hidden structure.

Figure 2 shows this inference for the collection of 17,000 Science articles. (The algorithm assumed that there were 100 topics.) We then computed the inferred topic distribution for the example article (Figure 2, left), the distribution over topics that best describes its particular collection of words. Notice that this topic distribution, though it can use any of the topics, has only "activated" a handful of them. Further, we can examine the most probable terms from each of the most probable topics (Figure 2, right). On examination, we see that these terms are recognizable as terms about genetics, survival, and data analysis, the topics that are combined in the example article.
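The article does not prescribe an implementation, but as a hedged sketch, here is how this kind of analysis might look with scikit-learn's LatentDirichletAllocation, an off-the-shelf variational inference implementation. The 100-topic setting mirrors the Figure 2 experiment; `docs` is a tiny hypothetical stand-in for the Science corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical placeholder corpus standing in for the 17,000 Science articles.
docs = [
    "genome dna genetic sequence gene organism evolution",
    "computer data model network software methods information",
]

# Count words, excluding stop words like "and", "but", "if".
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Fit LDA, assuming 100 topics as in Figure 2.
lda = LatentDirichletAllocation(n_components=100, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic proportions

# Inspect the first document's most probable topics and their top 15 terms.
vocab = vectorizer.get_feature_names_out()
for k in doc_topics[0].argsort()[::-1][:4]:
    top_terms = vocab[lda.components_[k].argsort()[::-1][:15]]
    print(f"topic {k} ({doc_topics[0][k]:.2f}):", ", ".join(top_terms))
```

With a real corpus, the printed per-topic term lists would play the role of Figure 2's right panel, and `doc_topics` the inferred topic proportions at its left.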

We emphasize that the algorithms have no information about these subjects; the interpretable topic distributions arise from the statistical structure of the texts alone. In the topics found by analyzing the Yale Law Journal (Figure 3), for instance, topics about subjects like genetics and data analysis are replaced by topics about discrimination and contract law.

The utility of topic models stems from the property that the inferred hidden structure resembles the thematic structure of the collection. This interpretable hidden structure annotates each document in the collection (a task that is painstaking to perform by hand), and these annotations can be used to aid tasks like information retrieval, classification, and corpus exploration. In this way, topic modeling provides an algorithmic solution to managing, organizing, and annotating large archives of texts.

LDA and probabilistic models.

