An Inverted Index is a data structure used in information retrieval systems to efficiently retrieve documents or web pages containing a specific term or set of terms. In an inverted index, the index is organized by terms (words), and each term points to a list of documents or web pages that contain that term.
Note: Inverted indexes are widely used in search engines, database systems, and other applications where efficient text search is required.
An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap-like data structure that directs you from a word to the documents or web pages that contain it. Inverted indexes are especially useful for large collections of documents, where scanning every document for each query would be prohibitively slow.
Consider the following two documents. To create an inverted index for them, we first tokenize each document into terms (words).
Document 1: The quick brown fox jumped over the lazy dog.
Document 2: The lazy dog slept in the sun.
Next, we create an index of the terms, where each term points to a list of documents that contain that term, as follows.
The -> Document 1, Document 2
Quick -> Document 1
Brown -> Document 1
Fox -> Document 1
Jumped -> Document 1
Over -> Document 1
Lazy -> Document 1, Document 2
Dog -> Document 1, Document 2
Slept -> Document 2
In -> Document 2
Sun -> Document 2
To search for documents containing a particular term or set of terms, the search engine queries the inverted index for those terms and retrieves the list of documents associated with each term. The search engine can then use this information to rank the documents based on relevance to the query and present them to the user in order of importance.
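The lookup described above can be sketched in Python as follows. This is a minimal illustration, not a production search engine: the helper names `tokenize`, `build_index`, and `search` are assumptions made for this example, a multi-term query is assumed to mean "documents containing all of the terms", and ranking is left out entirely.

```python
import string
from collections import defaultdict

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    words = (word.strip(string.punctuation) for word in text.lower().split())
    return [word for word in words if word]

def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the sorted IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    # Fetch the document set for each term, then keep only the common IDs.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

documents = {
    1: "The quick brown fox jumped over the lazy dog.",
    2: "The lazy dog slept in the sun.",
}
index = build_index(documents)

print(search(index, "lazy dog"))   # [1, 2]
print(search(index, "quick dog"))  # [1]
print(search(index, "sun fox"))    # []
```

Each query term costs one dictionary lookup, and only the matching document sets are intersected, so no document text is scanned at query time.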
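There are two types of inverted indexes:
Record-level inverted index: for each word, it stores a list of references to the documents that contain the word. The index built above is of this type.
Word-level inverted index: it additionally stores the position of each word within each document. This form supports more functionality, such as phrase search, but needs more processing power and space to build.
Suppose we want to search the texts "hello everyone", "this article is based on inverted index", and "which is hashmap-like data structure". If we index by (text, word within the text), so that each entry records the text number and the word's position in that text, the index is: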
hello (1, 1)
everyone (1, 2)
this (2, 1)
article (2, 2)
is (2, 3); (3, 2)
based (2, 4)
on (2, 5)
inverted (2, 6)
index (2, 7)
which (3, 1)
hashmap (3, 3)
like (3, 4)
data (3, 5)
structure (3, 6)
The word "hello" is in document 1 ("hello everyone") at word position 1, so it has the entry (1, 1). The word "is" appears in documents 2 and 3, at the 3rd and 2nd word positions respectively, so it has the entries (2, 3) and (3, 2). Positions count words, not characters.
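A word-level index of this kind could be built along the following lines. This is a sketch under stated assumptions: the function name `build_positional_index` is hypothetical, a word is taken to be any run of letters or digits (so "hashmap-like" yields two words), and the second text is written as "this article is based on inverted index", which is the wording the positions listed above correspond to.

```python
import re
from collections import defaultdict

def build_positional_index(texts):
    """Map each word to a list of (text number, word position) pairs, both 1-based."""
    index = defaultdict(list)
    for text_no, text in enumerate(texts, start=1):
        # Treat every run of letters or digits as one word, so "hashmap-like"
        # yields the two words "hashmap" and "like".
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words, start=1):
            index[word].append((text_no, position))
    return index

texts = [
    "hello everyone",
    "this article is based on inverted index",
    "which is hashmap-like data structure",
]
positional_index = build_positional_index(texts)

print(positional_index["hello"])  # [(1, 1)]
print(positional_index["is"])     # [(2, 3), (3, 2)]
print(positional_index["index"])  # [(2, 7)]
```

Storing positions is what makes phrase queries possible: two words form a phrase in a text when their positions in that text are consecutive.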
Note: The index may have weights, frequencies, or other indicators.
Example:
Words Document
ant doc1
demo doc2
world doc1, doc2
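The following steps build an inverted index for two sample documents, "The quick brown fox jumped over the lazy dog." and "The lazy dog slept in the sun.", which serve as input to the algorithm.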
Step 1: Tokenize the input documents by converting them to lowercase and splitting them into individual words. Then combine the resulting tokens from both documents into a single list of unique terms.
Step 2: Create an empty dictionary to store the inverted index, and then iterate through each term in the list of unique terms. For each term, create an empty list of documents, check whether the term appears in each input document, and add the document to the list whenever it does. Finally, add an entry to the inverted index dictionary for the current term, with the list of documents that contain that term as its value.
Step 3: Iterate through the entries in the inverted index dictionary and print out each term along with the list of documents that contain it.
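The three steps above can be sketched in Python roughly as follows. This is a minimal sketch rather than a definitive listing: the variable names are assumptions, and the input is assumed to be the two sample documents from the first example. Because the text is split on whitespace only, punctuation stays attached to words, so "dog." and "dog" end up as separate terms.

```python
# The two sample documents used as input.
document1 = "The quick brown fox jumped over the lazy dog."
document2 = "The lazy dog slept in the sun."

# Step 1: lowercase each document, split it on whitespace,
# and combine the tokens into a single list of unique terms.
tokens1 = document1.lower().split()
tokens2 = document2.lower().split()
terms = list(set(tokens1 + tokens2))  # set order is arbitrary

# Step 2: for each term, record which documents contain it.
inverted_index = {}
for term in terms:
    documents = []
    if term in tokens1:
        documents.append("Document 1")
    if term in tokens2:
        documents.append("Document 2")
    inverted_index[term] = documents

# Step 3: print each term with the documents that contain it.
for term, documents in inverted_index.items():
    print(term, "->", ", ".join(documents))
```

Stripping punctuation from each token before indexing would merge "dog." and "dog" into a single term, at the cost of one extra normalization step.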
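Output (terms appear in no particular order; because the documents are split on whitespace only, punctuation stays attached, so "dog." and "dog" are listed as separate terms):
jumped -> Document 1
fox -> Document 1
lazy -> Document 1, Document 2
the -> Document 1, Document 2
in -> Document 2
dog. -> Document 1
quick -> Document 1
dog -> Document 2
slept -> Document 2
sun. -> Document 2
brown -> Document 1
over -> Document 1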