Document Similarity in Information Retrieval

Mausam (based on slides of W. Arms, Thomas Hofmann, Ata Kaban, Melanie Martin)

Standard Web Search Engine Architecture
[Slide adapted from Marti Hearst / UC Berkeley]
- Crawl the web.
- Store documents, check for duplicates, extract links.
- Create an inverted index (DocIds).
- Search engine servers take the user query, consult the inverted index, and show results to the user.

Indexing Subsystem
- documents: assign document IDs (document numbers and *field numbers go to the index database)
- text: break into tokens
- tokens: stop list*
- non-stoplist tokens: stemming*
- stemmed terms: term weighting*
- terms with weights: stored in the index database
*Indicates optional operation.

Search Subsystem
- query: parse query
- query tokens: stop list*
- non-stoplist tokens: stemming*
- stemmed terms: Boolean operations* against the index database
- retrieved document set: ranking*
- ranked document set
- Document sets labeled on the slide: retrieved, relevant, ranked.
*Indicates optional operation.
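The indexing steps named above (assign document IDs, break into tokens, stop list, stemming, term weighting, index database) can be sketched in Python. This is a minimal illustration, not the slides' implementation: the stop list, the suffix-stripping `stem` function, and the use of raw term frequency as the weight are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop list; a real system would use a much larger one.
STOP_LIST = {"a", "an", "and", "the", "of", "in", "to", "is"}

def tokenize(text):
    """Break text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Toy stemmer (assumption): strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def build_index(documents):
    """Return an inverted index: term -> {doc_id: weight}.

    The weight is the raw term frequency, one simple choice of term weighting.
    """
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):  # assign document IDs
        tokens = [t for t in tokenize(text) if t not in STOP_LIST]
        terms = [stem(t) for t in tokens]
        for term, freq in Counter(terms).items():
            index[term][doc_id] = freq
    return dict(index)

docs = ["Crawling the web", "The web index holds documents"]
index = build_index(docs)
```

Looking up `index["web"]` returns the IDs of both documents with their weights, which is the lookup a search subsystem would perform for each stemmed query term.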

Summary: Vector Similarity Computation with Weights

Documents in a collection are assigned terms from a set of n terms. The term vector space W is defined as:
- if term k does not occur in document d_i, w_ik = 0
- if term k occurs in document d_i, w_ik is greater than zero (w_ik is called the weight of term k in document d_i)

Similarity between d_i
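To make the weighted term vectors concrete, the sketch below compares two documents by the cosine of the angle between their weight vectors. Cosine similarity is an assumption here: it is a common measure for weighted term vectors, but the text above does not state which measure it uses. The function name and the sample weights are illustrative only.

```python
import math

def cosine_similarity(w_i, w_j):
    """Cosine of the angle between two term-weight vectors of equal length n."""
    if len(w_i) != len(w_j):
        raise ValueError("vectors must have the same number of terms")
    dot = sum(a * b for a, b in zip(w_i, w_j))
    norm_i = math.sqrt(sum(a * a for a in w_i))
    norm_j = math.sqrt(sum(b * b for b in w_j))
    if norm_i == 0 or norm_j == 0:
        # Assumption: a document with no weighted terms matches nothing.
        return 0.0
    return dot / (norm_i * norm_j)

# Each position k holds w_ik, the weight of term k in the document
# (0 when the term does not occur).
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]
d3 = [0.0, 0.0, 3.0]
```

With these sample vectors, `d1` and `d2` point in the same direction and score close to 1, while `d1` and `d3` share no terms and score 0.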
