VOOZH about

URL: https://www.geeksforgeeks.org/data-science/semantic-document-search/

⇱ Semantic Document Search - GeeksforGeeks


  • Courses
  • Tutorials
  • Interview Prep

Semantic Document Search

Last Updated : 31 Oct, 2025

Semantic search allows computers to understand the meaning behind user queries rather than relying only on exact keyword matching. Using FAISS (Facebook AI Similarity Search), we can build a high-performance system that searches through hundreds or even thousands of documents by meaning and not just by text overlap. This approach enables smarter, faster and more context-aware information retrieval.

Implementation

Let's see how the model will work:

👁 objec
Working
  • Document Ingestion: PDF, DOCX and TXT files are read and converted into text.
  • Chunking: Each document is split into smaller, meaningful parts.
  • Embedding Generation: Text chunks are converted into dense vector representations.
  • FAISS Indexing: Embeddings are stored in FAISS for efficient similarity search.
  • Query Encoding: A user query is embedded into the same vector space.
  • Similarity Search: FAISS finds top matches based on cosine similarity.
  • Results Display: The most relevant document snippets are shown.

Let's implement this model,

Used samples can be downloaded from here.

Step 1: Install Dependencies

We need to install the required dependencies such as faiss-cpu, sentence-transformers, python-docx.

Step 2: Import Libraries

We will import the necessary libraires such as os, docx, numpy, SentenceTransformers, faiss.

Step 3: Extract Text from Documents

We need to define the function for document loading,

  • Reads different file formats like .pdf, .docx and .txt.
  • Ensures content is extracted as plain text for embedding generation.

Step 4: Split Text into Chunks

  • Divides long documents into smaller segments (chunks).
  • Improves search accuracy and performance by focusing on smaller text units.

Step 5: Load and Process Documents

Here:

  • Reads all files in the documents/ folder.
  • Splits them into manageable text chunks and stores the source file name for each.

Output:

👁 Screenshot-2025-10-28-125125
Result

Step 6: Generate Text Embeddings

Here we will:

  • Converts text chunks into vector representations using SentenceTransformer.
  • Normalizes vectors for cosine similarity in FAISS.
  • Shows embedding progress for transparency.

Output:

👁 Screenshot-2025-10-28-125118
Output

Step 7: Create FAISS Index

  • Initializes a FAISS IndexFlatIP index (for cosine similarity).
  • Adds all text embeddings into the FAISS index for fast retrieval.

Output:

FAISS index created with 3 vectors.

Step 8: Define Cleaning and Search Functions

1. clean_text(): Removes unwanted formatting and extra spaces.

2. semantic_search_best():

  • Converts the query into a vector.
  • Searches the FAISS index for similar embeddings.
  • Displays the best matches with readable snippets.

Step 9: Run Semantic Search

  • Retrieves top semantically relevant chunks for each query.
  • Displays source document name, similarity score and wrapped text preview.

Output:

👁 Screenshot-2025-10-28-125106
Result

Output:

👁 Screenshot-2025-10-28-125059
Result

The source code can be downloaded from here.

Advantages

  • High Performance: FAISS enables instant retrieval even across thousands of vectors.
  • Semantic Understanding: Captures meaning beyond keywords.
  • Multi-format Support: Works with PDF, DOCX and TXT files.
  • Scalable: Easily extensible to large datasets.
Comment

Explore