Transcription of Termite: Visualization Techniques for Assessing …
Termite: Visualization Techniques for Assessing Textual Topic Models

Jason Chuang, Christopher D. Manning, Jeffrey Heer
Stanford University Computer Science Department
{jcchuang, manning, ...}

ABSTRACT
Topic models aid analysis of text corpora by identifying latent topics based on co-occurring words. Real-world deployments of topic models, however, often require intensive expert verification and model refinement. In this paper we present Termite, a visual analysis tool for assessing topic model quality. Termite uses a tabular layout to promote comparison of terms both within and across latent topics. We contribute a novel saliency measure for selecting relevant terms and a seriation algorithm that both reveals clustering structure and promotes the legibility of related terms. In a series of examples, we demonstrate how Termite allows analysts to identify coherent and significant themes.

Categories and Subject Descriptors
[Artificial Intelligence]: Natural Language Processing; [Information Interfaces]: User Interfaces

General Terms
Algorithms, Design, Human Factors

Keywords
Topic Models, Text Visualization, Seriation

1. INTRODUCTION

Recent growth in text data affords an opportunity to study and analyze language at an unprecedented scale. The size of text corpora, however, often exceeds the limit of what a person can read and process. While statistical topic models have the potential to aid large-scale exploration, a review of the literature reveals a scarcity of real-world analyses involving topic models. When the models are deployed, they involve time-consuming verification and model refinement.

We present Termite, a visualization system for the term-topic distributions produced by topic models. Our system contributes two novel techniques to aid topic model assessment. First, we describe a saliency measure for ranking and filtering terms. By surfacing more discriminative terms, our measure enables faster assessment and comparison of topics. Second, we introduce a seriation method for sorting terms to reveal clustering patterns.
Our technique has two desirable properties: preservation of term reading order and early termination when sorting subsets of words. We demonstrate how these techniques enable rapid classification of coherent or junk topics and reveal topical [...].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
[...] '12, May 21-25, 2012, Capri Island, Italy
Copyright 2012 ACM 978-1-4503-1287-5/12/05 ...$

2. RELATED WORK

Latent Dirichlet allocation (LDA) [3] is a popular approach for uncovering latent topics: multinomial probability distributions over terms, generated by soft clustering of words based on document co-occurrence.
While LDA produces some sensible topics, a prominent issue is the presence of junk topics [1] composed of incoherent or insignificant term groupings. Model outputs often need to be verified by domain experts and modified [5] to ensure they correspond to meaningful concepts in the domain of analysis.

[...] et al. [12] applied LDA to study research trends in computational linguistics across 14,000 publications. The authors recruited experts to validate the quality of the latent topics. These experts retained only 36 out of 100 topics, and manually inserted 10 additional topics not produced by the model. Talley et al. [24] examined 110,000 NIH grants and applied LDA to uncover 700 latent topics. The modeling process included a significant amount of revision: modifying the vocabulary to include acronyms and multi-word phrases, removing nonsensical topics, conducting parameter search, and comparing the resulting models.

Evaluations of topical quality rely heavily on experts examining lists of the most probable words in a topic [4, 19, 20].
For example, in biological texts one might find a topic with terms dna, replication, rna, repair, complex, interaction, ... Prior work in visualization suggests some alternative forms of presentation. Matrix views can surface relationships among a large number of items [2, 14] or between two data dimensions [9] if an appropriate ordering (or seriation) is applied [10, 26]. Interaction might then allow users to explore alternative orderings [22]. An appropriate model of words (e.g., statistically significant instead of frequent terms, phrases instead of words) can further aid comparison [7, 27]. Incorporating word relatedness into a visualization can surface high-level patterns in the text [6, 13]. In contrast to existing tools for summarizing LDA model output [11], Termite aims to support the domain-specific task of building and refining topic models.

3. THE TERMITE SYSTEM DESIGN

When using topic models to analyze a text collection, it is critical that the discovered latent topics be relevant to the domain task.
Prior work suggests that the quality of a topic is often determined by the coherence of its constituent words [1] and its relative importance to the analysis task [25] in comparison to other topics. Effective means for assessing topical quality are thus an important step toward making topic models more useful for real-world analysis.

Figure 1: Top 30 frequent (left) vs. salient (right) terms. Our saliency measure ranks tree, context, tasks, focus, networks above the more frequent but less informative words based, paper, approach, technique, method. Distinctive terms enable speedier identification: Topic 6 concerns focus+context techniques; this topical composition is ambiguous when examining the frequent terms.

Our goal with Termite is to support effective evaluation of term distributions associated with LDA topics. The tool is designed to help assess the quality of individual topics and of all topics as a whole.
The primary visualization used in Termite is a matrix view; rows correspond to terms and columns to topics. In the following examples we use LDA models [21] with 25 to 50 topics, trained on abstracts from 372 IEEE InfoVis conference papers from 1995 to 2010 [23].

The term-topic matrix (Figures 1-3) shows term distributions for all latent topics. Unlike lists of per-topic words (the current standard practice), matrices support comparison across both topics and terms. We use circular area to encode term probabilities. Texts typically exhibit long tails of low-probability words. Area has a higher dynamic range than length encodings (quadratic vs. linear scaling), and curvature enables perception of area even when circles overlap. We also experimented with parallel tag clouds [7], where text is displayed directly in the matrix; the result was not sufficiently compact for even a modest number of topics.

Users can drill down to examine a specific topic by clicking on a circle or topic label in the matrix.
The visualization then reveals two additional views. The word frequency view (Figure 3, middle) shows the topic's word usage relative to the full corpus. The document view (Figure 3, right) shows the representative documents belonging to the topic.

3.1 Displaying Informative Terms

Showing all words in the term-topic matrix is neither desirable nor feasible due to large vocabularies with thousands of words. Termite can filter the display to show the most probable or salient terms. Users can choose between 10 and 250 terms. On most monitors, displaying over 250 words requires a significant amount of scrolling and reduces the effectiveness of the visualization.

Table 1: Word similarity based on G² statistics

G² estimates the likelihood of an event v taking place when another event u is also observed. The likelihood can be computed [8] using the following 2×2 contingency table:

    events    u                ¬u
    v         a = P(u|v)       b = P(¬u|v)
    ¬v        c = P(u|¬v)      d = P(¬u|¬v)

The G² statistic is then defined as:

$$G^2 = a \log \frac{a(c+d)}{c(a+b)} + b \log \frac{b(c+d)}{d(a+b)}$$

For word co-occurrences, G² represents the likelihood of a word v appearing in a document/sentence when another word u also appears in the same document/sentence.
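The G² formula above can be evaluated directly from the four table cells. The following is a minimal sketch rather than the authors' implementation: the function name `g2` is hypothetical, the arguments are taken to be the cell values a, b, c, d exactly as defined in the table, and terms of the form 0 log 0 are assumed to contribute zero. It also assumes c and d are nonzero whenever the matching numerator cell is nonzero.

```python
import math


def g2(a, b, c, d):
    """Evaluate the G^2 statistic from the 2x2 contingency table.

    a = P(u|v), b = P(not u|v), c = P(u|not v), d = P(not u|not v).
    The name and signature are illustrative assumptions.
    """
    def term(x, y):
        # x * log( x (c + d) / (y (a + b)) ); a zero cell contributes 0
        # (assumed convention for 0 * log 0).
        if x == 0:
            return 0.0
        return x * math.log(x * (c + d) / (y * (a + b)))

    return term(a, c) + term(b, d)


# u is equally likely whether or not v is observed: no association.
no_association = g2(0.25, 0.75, 0.25, 0.75)

# u is far more likely when v is observed: strong association.
strong_association = g2(0.9, 0.1, 0.1, 0.9)
```

Under these assumptions the statistic is zero when u occurs at the same rate with and without v, and grows as the two rows of the table diverge.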
For bigrams, G² examines all adjacent pairs of words, and estimates the likelihood of v being the second word when u is the first word.

We define term saliency as follows. For a given word w, we compute its conditional probability P(T|w): the likelihood that observed word w was generated by latent topic T. We also compute the marginal probability P(T): the likelihood that any randomly-selected word w′ was generated by topic T. We define the distinctiveness of word w as the Kullback-Leibler divergence [15] between P(T|w) and P(T):

$$\mathrm{distinctiveness}(w) = \sum_{T} P(T|w) \log \frac{P(T|w)}{P(T)}$$

This formulation describes (in an information-theoretic sense) how informative the specific term w is for determining the generating topic, versus a randomly-selected term w′. For example, if a word w occurs in all topics, observing the word tells us little about the document's topical mixture; thus the word would receive a low distinctiveness score.

The saliency of a term is defined by the product:

$$\mathrm{saliency}(w) = P(w) \times \mathrm{distinctiveness}(w)$$

As shown in Figure 1, filtering terms by saliency can aid rapid classification and disambiguation of topics.
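The distinctiveness and saliency definitions above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the Termite code: the function name `term_saliency` and the input layout (a topics-by-terms table of P(w|T) plus a list of topic marginals P(T)) are hypothetical, P(w) and P(T|w) are derived from those inputs via the law of total probability and Bayes' rule, and 0 log 0 is treated as zero.

```python
import math


def term_saliency(p_w_given_t, p_t):
    """Return (distinctiveness, saliency), each a list with one entry per term.

    p_w_given_t[t][w] is P(w|T=t) and p_t[t] is the marginal P(T=t).
    Both the name and the input layout are assumptions for illustration.
    """
    n_topics = len(p_t)
    n_terms = len(p_w_given_t[0])
    distinctiveness, saliency = [], []
    for w in range(n_terms):
        # Marginal term probability: P(w) = sum_T P(w|T) P(T)
        p_w = sum(p_w_given_t[t][w] * p_t[t] for t in range(n_topics))
        kl = 0.0
        if p_w > 0:
            for t in range(n_topics):
                # Bayes' rule: P(T|w) = P(w|T) P(T) / P(w)
                p_t_w = p_w_given_t[t][w] * p_t[t] / p_w
                if p_t_w > 0:  # assumed convention: 0 * log 0 = 0
                    kl += p_t_w * math.log(p_t_w / p_t[t])
        distinctiveness.append(kl)    # KL divergence between P(T|w) and P(T)
        saliency.append(p_w * kl)     # saliency(w) = P(w) * distinctiveness(w)
    return distinctiveness, saliency


# Toy model with two equally likely topics and three terms:
# term 0 appears equally in both topics, terms 1 and 2 in one topic each.
toy_p_w_given_t = [[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.5]]
toy_p_t = [0.5, 0.5]
toy_distinctiveness, toy_saliency = term_saliency(toy_p_w_given_t, toy_p_t)
```

In the toy model, the term shared by both topics has zero distinctiveness and therefore zero saliency despite being the most frequent term, while the two topic-specific terms receive positive scores. This mirrors the generic-versus-distinctive contrast shown in Figure 1.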
Given the same number of words, the list of most probable terms contains more generic words (e.g., based, paper, approach) than the list of distinctive terms (e.g., tree, context, tasks). Our saliency measure speeds identification of topical composition (e.g., Topic 6 on focus+context techniques). By producing a sparser term-topic matrix, our measure can enable faster differentiation among the topics and identification of potential junk topics lacking salient terms.

3.2 Ordering the Term-Topic Matrix

Termite provides two options for topic ordering: by index (the arbitrary topic index produced by LDA) and by topic size (the number of observed terms assigned to a topic). Prior work suggests that small (rare) topics tend to contain more nonsensical and incoherent terms [19]. Topic ordering by size can help surface such topics. Termite also provides three options for term ordering: alphabetically, by frequency, or using seriation.