Semantic Document Search

Last Updated : 31 Oct, 2025

Semantic search allows computers to understand the meaning behind user queries rather than relying only on exact keyword matching. Using FAISS (Facebook AI Similarity Search), we can build a high-performance system that searches through hundreds or even thousands of documents by meaning and not just by text overlap. This approach enables smarter, faster and more context-aware information retrieval.

Implementation

Let's see how the model will work:

👁 objec

Working

Document Ingestion: PDF, DOCX and TXT files are read and converted into text.
Chunking: Each document is split into smaller, meaningful parts.
Embedding Generation: Text chunks are converted into dense vector representations.
FAISS Indexing: Embeddings are stored in FAISS for efficient similarity search.
Query Encoding: A user query is embedded into the same vector space.
Similarity Search: FAISS finds top matches based on cosine similarity.
Results Display: The most relevant document snippets are shown.

Let's implement this model,

Used samples can be downloaded from here.

Step 1: Install Dependencies

We need to install the required dependencies such as faiss-cpu, sentence-transformers, python-docx.

Step 2: Import Libraries

We will import the necessary libraires such as os, docx, numpy, SentenceTransformers, faiss.

Step 3: Extract Text from Documents

We need to define the function for document loading,

Reads different file formats like .pdf, .docx and .txt.
Ensures content is extracted as plain text for embedding generation.

Step 4: Split Text into Chunks

Divides long documents into smaller segments (chunks).
Improves search accuracy and performance by focusing on smaller text units.

Step 5: Load and Process Documents

Here:

Reads all files in the documents/ folder.
Splits them into manageable text chunks and stores the source file name for each.

Output:

👁 Screenshot-2025-10-28-125125

Result

Step 6: Generate Text Embeddings

Here we will:

Converts text chunks into vector representations using SentenceTransformer.
Normalizes vectors for cosine similarity in FAISS.
Shows embedding progress for transparency.

Output:

👁 Screenshot-2025-10-28-125118

Output

Step 7: Create FAISS Index

Initializes a FAISS IndexFlatIP index (for cosine similarity).
Adds all text embeddings into the FAISS index for fast retrieval.

Output:

FAISS index created with 3 vectors.

Step 8: Define Cleaning and Search Functions

1. clean_text(): Removes unwanted formatting and extra spaces.

2. semantic_search_best():

Converts the query into a vector.
Searches the FAISS index for similar embeddings.
Displays the best matches with readable snippets.

Step 9: Run Semantic Search

Retrieves top semantically relevant chunks for each query.
Displays source document name, similarity score and wrapped text preview.

Output:

👁 Screenshot-2025-10-28-125106

Result

Output:

👁 Screenshot-2025-10-28-125059

Result

The source code can be downloaded from here.

Advantages

High Performance: FAISS enables instant retrieval even across thousands of vectors.
Semantic Understanding: Captures meaning beyond keywords.
Multi-format Support: Works with PDF, DOCX and TXT files.
Scalable: Easily extensible to large datasets.

Comment

Article Tags:

Data Science

Artificial Intelligence

NLP

Data Science

Explore

Introduction to Machine Learning

Python for Machine Learning

Introduction to Statistics

Feature Engineering

Model Evaluation and Tuning

Data Science Practice

Courses

URL: https://www.geeksforgeeks.org/data-science/semantic-document-search/