Document Similarity in Information Retrieval

Mausam
(Based on slides of W. Arms, Thomas Hofmann, Ata Kaban, Melanie Martin)

Standard Web Search Engine Architecture
[Figure; slide adapted from Marti Hearst / UC Berkeley. Diagram labels: crawl the web; store documents, check for duplicates, extract links; create an inverted index; inverted index; DocIds; search engine servers; user query; show results to user.]

Indexing Subsystem
[Figure. Stages: assign document IDs, break into tokens, stop list*, stemming*, term weighting*, index database. Data labels: documents, document numbers and *field numbers, text, tokens, non-stoplist tokens, stemmed terms, terms with weights.]
*Indicates optional operation.

Search Subsystem
[Figure. Stages: parse query, stop list*, stemming*, Boolean operations*, ranking*, with lookups against the index database. Data labels: query, query tokens, non-stoplist tokens, stemmed terms, retrieved document set, relevant document set, ranked document set.]
*Indicates optional operation.
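The indexing and search stages named on the slides can be sketched roughly as follows. This is only an illustration, not the pipeline from the slides: the stop list, the suffix-stripping stemmer, the use of raw term counts as weights, and all function names (tokenize, stem, index_terms, build_index, boolean_and) are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop list; a real system would use a much longer one.
STOP_LIST = {"the", "a", "of", "in", "and", "to", "is"}


def tokenize(text):
    """Break text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Toy stemmer (an assumption): strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokens -> non-stoplist tokens -> stemmed terms -> terms with weights.

    The weight used here is the raw term count, chosen only for simplicity.
    """
    kept = [t for t in tokenize(text) if t not in STOP_LIST]
    return Counter(stem(t) for t in kept)


def build_index(documents):
    """Assign document IDs and build an inverted index: term -> {doc_id: weight}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for term, weight in index_terms(text).items():
            index[term][doc_id] = weight
    return index


def boolean_and(index, query):
    """Return the IDs of documents containing every query term."""
    terms = list(index_terms(query))
    if not terms:
        return set()
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings)


docs = ["Crawling the web", "Indexing documents in the web", "Ranking documents"]
index = build_index(docs)
matches = boolean_and(index, "web documents")
```

The query passes through the same tokenizing, stop-list, and stemming steps as the documents, so that query terms and index terms are comparable.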

Summary: Vector Similarity Computation with Weights
Documents in a collection are assigned terms from a set of n terms.
The term vector space W is defined as:
  if term k does not occur in document d_i, w_ik = 0
  if term k occurs in document d_i, w_ik is greater than zero
  (w_ik is called the weight of term k in document d_i)
Similarity between d_i
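The transcription breaks off before it names a similarity measure. As a minimal sketch, assuming cosine similarity (a common choice in the vector space model, but not confirmed by the text above), the similarity between two documents can be computed from their weight vectors as below. The function name cosine_similarity and the example weights are hypothetical.

```python
import math


def cosine_similarity(w_i, w_j):
    """Cosine of the angle between two term-weight vectors of equal length n.

    w_i[k] is the weight of term k in document d_i (0 if the term is absent).
    Cosine similarity is an assumption here; the source does not name a measure.
    """
    dot = sum(a * b for a, b in zip(w_i, w_j))
    norm_i = math.sqrt(sum(a * a for a in w_i))
    norm_j = math.sqrt(sum(b * b for b in w_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # a document with no weighted terms matches nothing
    return dot / (norm_i * norm_j)


# Made-up weights over n = 3 terms.
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]  # same direction as d1
d3 = [0.0, 0.0, 3.0]  # shares no terms with d1

sim_12 = cosine_similarity(d1, d2)
sim_13 = cosine_similarity(d1, d3)
```

With non-negative weights the result lies between 0 (no shared terms) and 1 (same direction), so it does not depend on document length.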
