Text Retrieval using Linear Algebra - Computer …

Text Retrieval using Linear AlgebraJoshuaDrewJosh Drewgraduated summa cum laude in 2002 fromBall State with a degree in Computer Science anda minor in Physics. This article is based on his researchduring 2001 and 2002. Josh is currently a senior pro-grammer with SpinWeb Net Designs, Retrieval is an important area of research. As information and methodsof its storage have proliferated, the need to have efficient methods of locatingsubsets of this information has increased as well. A widely-researched textsearching method involves modeling a text collection in a term-by-documentmatrix, and evaluating the documents relevance to a query with simple lin-ear Algebra . This document is an abstract of research performed throughout2001 and 2002, and presents one such system, developed to search the this method, a matrixAis constructed such that the frequency of eachtermiin each documentj, is stored atAi,j.

To build this matrix for bsu-cs,a simple index-creator was programmed. Beginning with a list of known Webpages, the script analyzed each document encountered. Word frequencies wererecorded, and links to other bsu-cs Web pages were extracted. Those links wereboth added to the page list, and tabulated as possible indicators of documentimportance (more links to pageximplies that pagexis more important). Notethat common words such as the and how have little intrinsic meaning [1].Therefore, a list of stop words was created. The members of this list wereexcluded by the matrix-forming code. After several hours of parsing, the index-creator, in conjunction with a script to organize each page s raw data, produceda 26257 by 3557, sparse a user supplies a list of key words, the relevance of those words to eachof the documents in the term-by-document matrixA= [~a1, ~a2.]

,~an] can bedetermined. A common measure of such is the cosine of the angle betweenthe query vector,~qand each document (column) vector,~aj[3]. If the query isrepresented by a vector~qand the term-by-document matrix asA, the cosineof the angle between~qand a document~ajis found by:cos j= (~aj ~q)/( ~aj 2 ~q 2);j= 1,2, .. , nwhere the Euclidean vector norm ~x 2is defined by ~x 2= (~x ~x)1/2, the Undergraduate Mathematics Exchange, Vol. 1, No. 1 (Fall 2003)5root of the dot product of~xwith itself [3]. The results of this computation canbe stored for all documents,j= 1, .. , n. If the cosine of the angle between twovectors is one, the vectors are parallel. However, if the cosine of the angle is zero,the vectors are orthogonal [6]. Therefore, a cosine closer to one implies that adocument,~aj, is relevant to a user s search vector,~q.

In general, documentsnot exceeding some cosine threshold, which is determined experimentally, arereturned to the user [4]. The C++ program, search 1, uses this idea, and athreshold of , to provide users with query rank of a matrix is the dimension of the column space of that matrix [6].So, for the term-by-document matrixA, a rank-reduction involves removingunnecessary documents, a Web page mirror. These are eliminated withthe consequence of improved search project uses a truncated singular value decomposition (SVD) [5] anmbynmatrix. Then a SVD ofAalways exists and is givenbyA=U SVTwith anmbymorthogonal matrixU, anmbynmatrixS,and annbynorthogonal matrixV, such that the only non-zero entries ofSare along its diagonal and are non-increasing and non-negative (they arecalled thesingular values).

Focusing on the firstksingular values, we obtain atruncated SVD if we take only the firstkcolumns ofU, thekbyksubmatrixofScontaining our selected singular values, and the firstkcolumns is, we can approximateA U SVT, whereUis anmbykmatrix withorthonormal columns,Sakbykdiagonal matrix, andVannbykmatrixwith orthonormal benefits arise from this usage: Noise and uncertainty, present in all large databases, are reduced [1]. Word usage is estimated across documents [2], helping to compensate forpolysemy, the situation in which a word has multiple meanings. Queries may return relevant documents containing none of the user ssearch terms. The SVD models a latent semantic structure assumed toexist in the document collection [2]. Calculations can be performed faster because A is replaced by a information Retrieval methods are crucial to making today s largetext collections useful.

This project establishes a foundation on which studentsat the undergraduate or graduate level can explore new possibilities such as: Developing new weighting schemes based upon term and document im-portance and the likelihood of relevance to a user s query. using an adjacency matrix of documents to rank pages; taking advantageof the fact that the number ofk-step sequences between vertexiandjin a graph with the adjacency matrixMis the (i, j) entry source code for search can be found on the project Web Undergraduate Mathematics Exchange, Vol. 1, No. 1 (Fall 2003) Modularizing server-side processes by separating this project s search-performing program into a client and a daemon which communicate project produced a working search engine for the bsu-cs sever. Allrelevant source code, input, output, and documentation produced is sincere thanks go to Dr.

James Baglama, my faculty advisor, and ScottHinkley for their invaluable contributions to this [1] Berry, M. Browne, Understanding search engines: mathematicalmodeling and text Retrieval , SIAM (1999).[2] , S. T. Dumais, O Brien, using Linear Algebra for intelli-gent information Retrieval , SIAM (1995) 573 595.[3] Berry, Z. Drmac, Jessup,Matrices, vector spaces, and informa-tion Retrieval , SIAM (2) 335 362.[4] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman,Indexingby latent semantic analysis, Journal of the American Society for InformationScience41(1990) 391 407.[5] G. Golub, Loan, Matrix computations (Second Ed.), Johns-HopkinsUniversity Press (1989).[6] Lay, Linear Algebra and its applications (Second Ed.), Addison-Wesley(2000). Undergraduate Mathematics Exchange, Vol.

1, No. 1 (Fall 2003)7

Text Retrieval using Linear Algebra - Computer …

Tags:

Information

Transcription of Text Retrieval using Linear Algebra - Computer …

Related search queries

Text Retrieval using Linear Algebra - Computer …

Tags:

Information

Related documents

Related search queries