Computing Semantic Relatedness Using Wikipedia …

Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis

Evgeniy Gabrilovich and Shaul Markovitch
Department of Computer Science
Technion - Israel Institute of Technology, 32000 Haifa, Israel

Computing semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of Wikipedia-based concepts.

Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r = … for individual words and from r = … for texts. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.

Introduction

How related are "cat" and "mouse"? And what about "preparing a manuscript" and "writing an article"? Reasoning about semantic relatedness of natural language utterances is routinely performed by humans but remains an unsurmountable obstacle for computers.

Humans do not judge text relatedness merely at the level of text words. Words trigger reasoning at a much deeper level that manipulates concepts, the basic units of meaning that serve humans to organize and share their knowledge. Thus, humans interpret the specific wording of a document in the much larger context of their background knowledge and experience.

It has long been recognized that in order to process natural language, computers require access to vast amounts of common-sense and domain-specific world knowledge [Buchanan and Feigenbaum, 1982; Lenat and Guha, 1990]. However, prior work on semantic relatedness was based on purely statistical techniques that did not make use of background knowledge [Baeza-Yates and Ribeiro-Neto, 1999; Deerwester et al., 1990], or on lexical resources that incorporate very limited knowledge about the world [Budanitsky and Hirst, 2006; Jarmasz, 2003].

We propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic representation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of natural concepts derived from Wikipedia, the largest encyclopedia in existence. We employ text classification techniques that allow us to explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on automatically computing the degree of semantic relatedness between fragments of natural language text.

The contributions of this paper are threefold.

First, we present Explicit Semantic Analysis, a new approach to representing semantics of natural language texts using natural concepts. Second, we propose a uniform way for computing relatedness of both individual words and arbitrarily long text fragments. Finally, the results of using ESA for computing semantic relatedness of texts are superior to the existing state of the art. Moreover, using Wikipedia-based concepts makes our model easy to interpret, as we illustrate with a number of examples in what follows.

Explicit Semantic Analysis

Our approach is inspired by the desire to augment text representation with massive amounts of world knowledge.

We represent texts as a weighted mixture of a predetermined set of natural concepts, which are defined by humans themselves and can be easily explained. To achieve this aim, we use concepts defined by Wikipedia articles, e.g., COMPUTER SCIENCE, INDIA, or LANGUAGE. An important advantage of our approach is thus the use of vast amounts of highly organized human knowledge encoded in Wikipedia. Furthermore, Wikipedia undergoes constant development, so its breadth and depth steadily increase over time.

We opted to use Wikipedia because it is currently the largest knowledge repository on the Web. Wikipedia is available in dozens of languages, while its English version is the largest of all, with 400+ million words in over one million articles (compared to 44 million words in 65,000 articles in Encyclopaedia Britannica¹).

Interestingly, the open editing approach yields remarkable quality: a recent study [Giles, 2005] found Wikipedia accuracy to rival that of Britannica.

¹ Visited on May 12, 2006.

IJCAI-07 1606

We use machine learning techniques to build a semantic interpreter that maps fragments of natural language text into a weighted sequence of Wikipedia concepts ordered by their relevance to the input. This way, input texts are represented as weighted vectors of concepts, called interpretation vectors. The meaning of a text fragment is thus interpreted in terms of its affinity with a host of Wikipedia concepts. Computing semantic relatedness of texts then amounts to comparing their vectors in the space defined by the concepts, for example, using the cosine metric [Zobel and Moffat, 1998].
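The comparison step can be illustrated with a minimal sketch. It assumes interpretation vectors are stored sparsely as dictionaries from concept name to weight; the function name `cosine`, the example concepts, and all weights are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {concept: weight} dicts."""
    # Iterate over the shorter vector when accumulating the dot product.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: an empty vector is related to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical interpretation vectors for two short texts.
cat = {"CAT": 0.9, "PET": 0.5, "MOUSE": 0.3}
mouse = {"MOUSE": 0.8, "RODENT": 0.6, "CAT": 0.2}
print(round(cosine(cat, mouse), 3))
```

Texts that share no concepts score 0 under this measure, and a text compared with itself scores 1.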

Our semantic analysis is explicit in the sense that we manipulate manifest concepts grounded in human cognition, rather than "latent concepts" used by Latent Semantic Analysis.

Observe that input texts are given in the same form as Wikipedia articles, that is, as plain text. Therefore, we can use conventional text classification algorithms [Sebastiani, 2002] to rank the concepts represented by these articles according to their relevance to the given text fragment. It is this key observation that allows us to use the encyclopedia directly, without the need for deep language understanding or pre-cataloged common-sense knowledge. The choice of encyclopedia articles as concepts is quite natural, as each article is focused on a single issue, which it discusses in detail.

Each Wikipedia concept is represented as an attribute vector of words that occur in the corresponding article.

Entries of these vectors are assigned weights using the TFIDF scheme [Salton and McGill, 1983]. These weights quantify the strength of association between words and concepts.

To speed up semantic interpretation, we build an inverted index, which maps each word into a list of concepts in which it appears. We also use the inverted index to discard insignificant associations between words and concepts by removing those concepts whose weights for a given word are too low.

We implemented the semantic interpreter as a centroid-based classifier [Han and Karypis, 2000], which, given a text fragment, ranks all the Wikipedia concepts by their relevance to the fragment.
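The weighting and indexing steps described above might look roughly like the following sketch. Several details are assumptions made for illustration only: the text here does not specify the exact TFIDF variant, so the sketch uses (1 + log tf) * log(N / df); tokenization is plain whitespace splitting; and `build_inverted_index`, `min_weight`, and the toy articles are hypothetical names and data.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(articles, min_weight=0.0):
    """Map each word to {concept: TFIDF weight}, dropping weights <= min_weight.

    articles: dict of concept name -> article text (whitespace-tokenized here).
    """
    tokenized = {concept: text.lower().split() for concept, text in articles.items()}
    n_concepts = len(tokenized)

    # Document frequency: in how many articles does each word occur?
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))

    index = defaultdict(dict)
    for concept, words in tokenized.items():
        for word, count in Counter(words).items():
            # Assumed TFIDF variant: (1 + log tf) * log(N / df).
            weight = (1.0 + math.log(count)) * math.log(n_concepts / doc_freq[word])
            if weight > min_weight:  # prune weak word-concept associations
                index[word][concept] = weight
    return dict(index)


# Hypothetical toy "articles"; real Wikipedia articles are far longer.
articles = {
    "CAT": "cat cat pet feline mouse",
    "MOUSE": "mouse mouse rodent cat",
    "COMPUTER SCIENCE": "algorithm computer mouse",
}
index = build_inverted_index(articles)
print(index["cat"])
```

In this toy collection the word "mouse" occurs in every article, so its weight is zero and it is dropped from the index; raising `min_weight` removes further weak associations.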

Given a text fragment, we first represent it as a vector using the TFIDF scheme. The semantic interpreter iterates over the text words, retrieves corresponding entries from the inverted index, and merges them into a weighted vector of concepts that represents the given text. Let $T = \{w_i\}$ be the input text, and let $\langle v_i \rangle$ be its TFIDF vector, where $v_i$ is the weight of word $w_i$. Let $\langle k_j \rangle$ be an inverted index entry for word $w_i$, where $k_j$ quantifies the strength of association of word $w_i$ with Wikipedia concept $c_j$, $c_j \in \{c_1, \ldots, c_N\}$ (where $N$ is the total number of Wikipedia concepts). Then, the semantic interpretation vector $V$ for text $T$ is a vector of length $N$, in which the weight of each concept $c_j$ is defined as $\sum_{w_i \in T} v_i \cdot k_j$.
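The merging step defined above can be sketched as follows. This is an illustrative reading of the formula, not the authors' implementation: the function name `interpret`, the dictionary-based index layout, and every weight in the toy data are assumptions.

```python
from collections import defaultdict


def interpret(text_weights, inverted_index):
    """Return concepts ranked by V[c_j] = sum over words w_i in T of v_i * k_j.

    text_weights:   {word w_i: TFIDF weight v_i} for the input text T.
    inverted_index: {word w_i: {concept c_j: association weight k_j}}.
    """
    vector = defaultdict(float)
    for word, v_i in text_weights.items():
        # Words absent from the index contribute nothing to any concept.
        for concept, k_j in inverted_index.get(word, {}).items():
            vector[concept] += v_i * k_j
    # Order concepts by decreasing relevance to the input text.
    return sorted(vector.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inverted index and input-text weights.
toy_index = {
    "cat": {"CAT": 0.7, "MOUSE": 0.4},
    "rodent": {"MOUSE": 0.9},
    "algorithm": {"COMPUTER SCIENCE": 1.1},
}
text_weights = {"cat": 1.0, "rodent": 0.5}
ranked = interpret(text_weights, toy_index)
print(ranked)
```

Only concepts reachable through the words of the input receive a nonzero weight, so the resulting vector stays sparse even when $N$ is very large.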
