
Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis

Evgeniy Gabrilovich and Shaul Markovitch
Department of Computer Science
Technion - Israel Institute of Technology, 32000 Haifa, Israel


Abstract

Computing semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of Wikipedia-based concepts. Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.

1 Introduction

How related are "cat" and "mouse"?

And what about "preparing a manuscript" and "writing an article"? Reasoning about semantic relatedness of natural language utterances is routinely performed by humans but remains an unsurmountable obstacle for computers. Humans do not judge text relatedness merely at the level of text words. Words trigger reasoning at a much deeper level that manipulates concepts, the basic units of meaning that serve humans to organize and share their knowledge. Thus, humans interpret the specific wording of a document in the much larger context of their background knowledge and experience.

It has long been recognized that in order to process natural language, computers require access to vast amounts of common-sense and domain-specific world knowledge [Buchanan and Feigenbaum, 1982; Lenat and Guha, 1990]. However, prior work on semantic relatedness was based on purely statistical techniques that did not make use of background knowledge [Baeza-Yates and Ribeiro-Neto, 1999; Deerwester et al., 1990], or on lexical resources that incorporate very limited knowledge about the world [Budanitsky and Hirst, 2006; Jarmasz, 2003].

We propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic representation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of natural concepts derived from Wikipedia (http://www.wikipedia.org), the largest encyclopedia in existence. We employ text classification techniques that allow us to explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on automatically computing the degree of semantic relatedness between fragments of natural language text.

The contributions of this paper are threefold. First, we present Explicit Semantic Analysis, a new approach to representing semantics of natural language texts using natural concepts. Second, we propose a uniform way for computing relatedness of both individual words and arbitrarily long text fragments. Finally, the results of using ESA for computing semantic relatedness of texts are superior to the existing state of the art.

Moreover, using Wikipedia-based concepts makes our model easy to interpret, as we illustrate with a number of examples in what follows.

2 Explicit Semantic Analysis

Our approach is inspired by the desire to augment text representation with massive amounts of world knowledge. We represent texts as a weighted mixture of a predetermined set of natural concepts, which are defined by humans themselves and can be easily explained. To achieve this aim, we use concepts defined by Wikipedia articles, e.g., COMPUTER SCIENCE, INDIA, or LANGUAGE. An important advantage of our approach is thus the use of vast amounts of highly organized human knowledge encoded in Wikipedia. Furthermore, Wikipedia undergoes constant development, so its breadth and depth steadily increase over time.

We opted to use Wikipedia because it is currently the largest knowledge repository on the Web. Wikipedia is available in dozens of languages, while its English version is the largest of all, with 400+ million words in over one million articles (compared to 44 million words in 65,000 articles in Encyclopaedia Britannica).

Interestingly, the open editing approach yields remarkable quality: a recent study [Giles, 2005] found Wikipedia accuracy to rival that of Britannica.

We use machine learning techniques to build a semantic interpreter that maps fragments of natural language text into a weighted sequence of Wikipedia concepts ordered by their relevance to the input. This way, input texts are represented as weighted vectors of concepts, called interpretation vectors. The meaning of a text fragment is thus interpreted in terms of its affinity with a host of Wikipedia concepts. Computing semantic relatedness of texts then amounts to comparing their vectors in the space defined by the concepts, for example, using the cosine metric [Zobel and Moffat, 1998]. Our semantic analysis is explicit in the sense that we manipulate manifest concepts grounded in human cognition, rather than the latent concepts used by Latent Semantic Analysis.
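To make the comparison step concrete, the following is a minimal sketch in Python (ours, not the authors' code) of the cosine metric over sparse interpretation vectors; representing a vector as a {concept title: weight} dictionary is an assumption made for illustration.

```python
import math

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two sparse interpretation vectors,
    each given as a {concept: weight} dictionary."""
    # Iterate over the smaller vector; the dot product is symmetric.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Because a text fragment activates only a small fraction of the million-plus concepts, the vectors stay sparse and the comparison remains cheap.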

Observe that input texts are given in the same form as Wikipedia articles, that is, as plain text. Therefore, we can use conventional text classification algorithms [Sebastiani, 2002] to rank the concepts represented by these articles according to their relevance to the given text fragment. It is this key observation that allows us to use an encyclopedia directly, without the need for deep language understanding or pre-cataloged common-sense knowledge. The choice of encyclopedia articles as concepts is quite natural, as each article is focused on a single issue, which it discusses in detail.

Each Wikipedia concept is represented as an attribute vector of words that occur in the corresponding article. Entries of these vectors are assigned weights using the TFIDF scheme [Salton and McGill, 1983]. These weights quantify the strength of association between words and concepts.

To speed up semantic interpretation, we build an inverted index, which maps each word into a list of concepts in which it appears. We also use the inverted index to discard insignificant associations between words and concepts by removing those concepts whose weights for a given word are too low.

We implemented the semantic interpreter as a centroid-based classifier [Han and Karypis, 2000], which, given a text fragment, ranks all the Wikipedia concepts by their relevance to the fragment.
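The paragraphs above determine the index-building step: TFIDF attribute vectors per article, inverted so that each word maps to its weighted concepts, with insignificant associations pruned. Below is a sketch under stated assumptions; the tokenizer, the particular TFIDF variant (length-normalized term frequency times logarithmic inverse document frequency), and the fixed pruning threshold min_weight are our choices, which the paper does not specify.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(articles: dict[str, str],
                         min_weight: float = 0.01
                         ) -> dict[str, list[tuple[str, float]]]:
    """Map each word to a list of (concept, TFIDF weight) pairs,
    discarding associations whose weight falls below min_weight."""
    n = len(articles)
    df = Counter()                      # document frequency of each word
    tfs = {}                            # per-article term frequencies
    for title, text in articles.items():
        tf = Counter(tokenize(text))
        tfs[title] = tf
        df.update(tf.keys())
    index = defaultdict(list)
    for title, tf in tfs.items():
        # Length-normalize so long articles do not dominate.
        norm = math.sqrt(sum(c * c for c in tf.values()))
        for word, count in tf.items():
            weight = (count / norm) * math.log(n / df[word])
            if weight >= min_weight:    # prune insignificant associations
                index[word].append((title, weight))
    return dict(index)
```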

Given a text fragment, we first represent it as a vector using the TFIDF scheme. The semantic interpreter iterates over the text words, retrieves corresponding entries from the inverted index, and merges them into a weighted vector of concepts that represents the given text. Let $T = \{w_i\}$ be the input text, and let $\langle v_i \rangle$ be its TFIDF vector, where $v_i$ is the weight of word $w_i$. Let $\langle k_j \rangle$ be an inverted index entry for word $w_i$, where $k_j$ quantifies the strength of association of word $w_i$ with Wikipedia concept $c_j \in \{c_1, \ldots, c_N\}$ (where $N$ is the total number of Wikipedia concepts). Then, the semantic interpretation vector $V$ for text $T$ is a vector of length $N$, in which the weight of each concept $c_j$ is defined as $\sum_{w_i \in T} v_i \cdot k_j$. Entries of this vector reflect the relevance of the corresponding concepts to text $T$. To compute semantic relatedness of a pair of text fragments we compare their vectors using the cosine metric.

Figure 1 illustrates the process of Wikipedia-based semantic interpretation. Further implementation details are available in [Gabrilovich, In preparation].
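The interpretation step is then a direct transcription of the formula above. The sketch below reuses tokenize, build_inverted_index, and cosine from the two previous snippets; using plain term frequency for the input text's weights $v_i$, rather than full TFIDF, is a simplification on our part.

```python
from collections import Counter, defaultdict

def interpret(text: str,
              index: dict[str, list[tuple[str, float]]]) -> dict[str, float]:
    """Interpretation vector V for text T:
    V[c_j] = sum over words w_i in T of v_i * k_j."""
    tf = Counter(tokenize(text))        # v_i: word weights in T (simplified)
    vector = defaultdict(float)
    for word, v_i in tf.items():
        for concept, k_j in index.get(word, []):
            vector[concept] += v_i * k_j
    return dict(vector)

def relatedness(text1: str, text2: str,
                index: dict[str, list[tuple[str, float]]]) -> float:
    """Semantic relatedness of two fragments: cosine of their vectors."""
    return cosine(interpret(text1, index), interpret(text2, index))
```

Sorting the result of interpret by weight and keeping the ten highest-scoring concepts reproduces the kind of rankings shown in Table 1.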

In our earlier work [Gabrilovich and Markovitch, 2006], we used a similar method for generating features for text categorization. Since text categorization is a supervised learning task, words occurring in the training documents serve as valuable features; consequently, in that work we used Wikipedia concepts to augment the bag of words.

[Figure 1: Semantic interpreter. Building the semantic interpreter: Wikipedia articles become a weighted list of concepts (= Wikipedia articles) and then a weighted inverted index. Using the semantic interpreter: Text1 and Text2 (words word_1 .. word_n) are mapped through the index to weighted vectors of Wikipedia concepts, which are compared for relatedness estimation.]

Table 1: First ten concepts in sample interpretation vectors.

#    Input: "equipment"                  Input: "investor"
1    Tool                                Investment
2    Digital Equipment Corporation       Angel investor
3    Military technology and equipment   Stock trader
4    Camping                             Mutual fund
5    Engineering vehicle                 Margin (finance)
6    Weapon                              Modern portfolio theory
7    Original equipment manufacturer     Equity investment
8    French Army                         Exchange-traded fund
9    Electronic test equipment           Hedge fund
10   Distance Measuring Equipment        Ponzi scheme

On the other hand, computing semantic relatedness of a pair of texts is essentially a one-off task; therefore, we replace the bag-of-words representation with one based on concepts.

To illustrate our approach, we show the ten highest-scoring Wikipedia concepts in the interpretation vectors for sample text fragments. When concepts in each vector are sorted in decreasing order of their score, the top ten concepts are the most relevant ones for the input text. Table 1 shows the most relevant Wikipedia concepts for individual words ("equipment" and "investor", respectively), while Table 2 uses longer passages as examples. It is particularly interesting to juxtapose the interpretation vectors for fragments that contain ambiguous words. Table 3 shows the first entries in the vectors for phrases that contain the ambiguous words "bank" and "jaguar". As can be readily seen, our semantic interpretation methodology is capable of performing word sense disambiguation, by considering ambiguous words in the context of their neighbors.

3 Empirical Evaluation

We implemented our ESA approach using a Wikipedia snapshot as of March 26, 2006.

After parsing the Wikipedia XML dump, we obtained 2.9 Gb of text in 1,187,839 articles.

Table 2: Sample longer text fragments used as inputs.

Input: "U.S. intelligence cannot say conclusively that Saddam Hussein has weapons of mass destruction, an information gap that is complicating White House efforts to build support for an attack on Saddam's Iraqi regime. The CIA has advised top administration officials to assume that Iraq has some weapons of mass destruction. But the agency has not given President Bush a 'smoking gun,' according to U.S. intelligence and administration officials."

Input: "The development of T-cell leukaemia following the otherwise successful treatment of three patients with X-linked severe combined immune deficiency (X-SCID) in gene-therapy trials using haematopoietic stem cells has led to a re-evaluation of this approach. Using a mouse model for gene therapy of X-SCID, we find that the corrective therapeutic gene IL2RG itself can act as a contributor to the genesis of T-cell lymphomas, with one-third of animals being affected."

