
URL: https://www.geeksforgeeks.org/artificial-intelligence/rag-with-langchain/


RAG with LangChain

Last Updated : 29 Jul, 2025

Retrieval-Augmented Generation (RAG) is an approach in natural language processing that combines retrieval-based methods with large language models (LLMs). By grounding generation in external knowledge sources, RAG can improve response accuracy, reduce model hallucinations and enable domain-adaptive, knowledge-intensive applications.

What Is Retrieval-Augmented Generation (RAG)?

RAG is a hybrid architecture that augments the text generation capabilities of a large language model (LLM) by retrieving relevant external information from documents, databases or knowledge bases and passing it to the model as context.

  • Instead of solely relying on the LLM’s internal parameters, the model queries an external retriever.
  • The retriever searches through a large corpus, returning document snippets semantically relevant to the user's query.
  • The LLM conditions its generation on both the input query and these retrieved documents.
  • This process grounds the LLM’s output in concrete, up-to-date data, leading to more factual and contextually correct answers.
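
The conditioning step described above can be sketched in a few lines. This is a minimal illustration, not LangChain's prompt template: the function name build_grounded_prompt and the prompt wording are assumptions made for this example.

```python
def build_grounded_prompt(query: str, snippets: list[str]) -> str:
    """Combine the user query with retrieved snippets into one prompt."""
    # Number each snippet so the answer can be traced back to its source.
    context = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    return (
        "Answer the question using only the context below.\n"
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "What does the retriever return?",
    ["The retriever returns document snippets relevant to the query."],
)
print(prompt)
```

The LLM receives this single string, so its answer is conditioned on both the query and the retrieved text.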

Benefits of RAG

  • Improves accuracy and reduces hallucinations by relying on real data.
  • Lets LLMs with limited context windows draw on large knowledge bases, since only the most relevant chunks are placed in the prompt.
  • Supports easy updates of knowledge without retraining large models.
  • Well suited to domain-specific assistants, question answering and chatbots that need factual grounding.

Major Components of a RAG System

[Figure: RAG Architecture]
  1. Document Loader and Chunker: Raw documents (text, PDFs, web pages) are loaded and split into smaller, manageable chunks for indexing.
  2. Embeddings Model: Converts text chunks and user queries into semantic vector representations that capture meaning.
  3. Vector Store (Index): Stores embeddings and supports efficient similarity search to retrieve relevant chunks.
  4. Retriever: Queries the vector store to find the top relevant chunks based on the user’s input.
  5. Generative LLM: Takes the query plus retrieved chunks as input and generates a grounded, coherent response.
  6. Chain/Orchestration: Integrates all of the above components into an end-to-end pipeline, managing retrieval, prompt construction and generation.
  7. Memory (optional but important): Maintains conversation history or other contextual information for multi-turn interactions (covered in the LangChain-specific implementation).
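
To make the embedding, vector store and retriever roles concrete, here is a toy sketch that uses only the standard library. Word counts stand in for a neural embedding model and a brute-force cosine ranking stands in for a real index; the names embed, cosine and ToyVectorStore are hypothetical and exist only in this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Stores (chunk, vector) pairs and ranks them against a query."""

    def __init__(self, chunks: list[str]):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


store = ToyVectorStore([
    "FAISS stores embeddings and supports similarity search.",
    "The chunker splits raw documents into smaller pieces.",
    "Memory keeps the conversation history for multi-turn chats.",
])
top = store.retrieve("Which component stores embeddings?", k=1)
print(top)
```

A production system replaces embed with a trained model and the sorted scan with an index such as FAISS, but the data flow stays the same.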

LangChain: A Modular Framework for RAG

LangChain is an open-source framework, used here through its Python SDK, for building LLM-powered applications. It offers composable components for document loading, embedding, retrieval, memory and model invocation, and this modular architecture makes assembling RAG pipelines straightforward.

RAG Implementation with LangChain and Gemini 2.5 Flash

Prerequisites

Install necessary Python packages:

  • langchain: Build applications using large language models.
  • langchain-google-genai: Use Google’s generative AI models in LangChain.
  • faiss-cpu: Fast similarity search for vector embeddings on CPU.
  • sentence-transformers: Create text embeddings for semantic search and tasks.
  • google-colab: Helper utilities for Google Colab notebooks (preinstalled in Colab), such as access to stored secrets.

The example below integrates these steps:

  • Load and split documents into chunks.
  • Use local SentenceTransformers to convert chunks into semantic vector embeddings stored in FAISS.
  • When a user query comes in, embed it and use FAISS to quickly retrieve relevant chunks.
  • Pass those retrieved chunks as context to the Google Gemini 2.5 Flash model via LangChain’s RetrievalQA chain.
  • Gemini generates a context-aware answer based on the retrieved information.
  • Run this entire pipeline interactively, for example, in Google Colab with securely managed API keys.

1. Set Up Google Cloud API Key in Colab

This setup authenticates access to Gemini 2.5 Flash via LangChain’s Google GenAI integration. Add the API key to the Secrets panel of Google Colab.
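
A minimal sketch of reading the key follows. It assumes the secret is named GOOGLE_API_KEY (the name is an assumption, use whatever you entered in Colab) and falls back to an environment variable so the same code also works outside Colab. The helper load_api_key is hypothetical.

```python
import os


def load_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from Colab secrets, falling back to the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata

        key = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        # with this notebook.
        key = None
    key = key or os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"Set the {name} secret in Colab or as an environment variable."
        )
    return key
```

A typical call would be os.environ["GOOGLE_API_KEY"] = load_api_key(), since the Google GenAI integration can read the key from that environment variable.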

2. Embed Document Text Directly in the Code
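
For a small demo the document can simply live in a Python string. The text below is a hypothetical placeholder written for this sketch, not the article's original sample; replace it with your own content.

```python
# Hypothetical placeholder text, used only so the later steps have input.
document_text = """
Retrieval-Augmented Generation (RAG) combines a retriever with a large language model.
The retriever finds passages related to the user's query in an external corpus.

The language model then generates an answer conditioned on the query and the retrieved passages.
Grounding the answer in retrieved text helps reduce hallucinations.

Because knowledge lives in the corpus, it can be updated without retraining the model.
""".strip()

print(f"{len(document_text)} characters, {len(document_text.split())} words")
```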

3. Split Document into Chunks for Better Retrieval

Splitting keeps each chunk small enough to embed without truncation and lets the retriever return focused passages instead of whole documents.

  • The RecursiveCharacterTextSplitter class splits a large text into chunks of a specified size by trying a sequence of separators (paragraph breaks, line breaks, spaces) in order, moving to a finer separator only when a piece is still too large. This keeps related text together where possible while still allowing long documents to be processed.
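
The recursive idea can be illustrated without any dependencies. The function below is a simplified sketch written for this article, not LangChain's implementation: it has no chunk overlap and uses a fixed separator list. In LangChain itself the equivalent step is RecursiveCharacterTextSplitter(chunk_size=..., chunk_overlap=...) followed by split_text(text).

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring parts
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # Still too large: retry this part with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


sample = "Para one is short.\n\nPara two is a bit longer and must be split further."
pieces = recursive_split(sample, chunk_size=30)
print(pieces)
```

The first paragraph fits as a whole, while the second is broken at word boundaries because no coarser separator yields a small enough piece.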

4. Use Sentence Transformers for Local Embeddings

The embedding model runs entirely locally and needs no external API calls once the model weights have been downloaded, which makes it suitable for embedding both text chunks and queries.
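
One way to set this up is sketched below. It assumes the langchain-community package is installed and uses the all-MiniLM-L6-v2 model; both are assumptions, since the text does not name the wrapper or the model. Newer LangChain releases also ship this wrapper in the separate langchain-huggingface package.

```python
def build_embeddings(model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
    """Create a local SentenceTransformers model wrapped for LangChain."""
    # Imported lazily so the heavy dependency loads only when it is needed.
    from langchain_community.embeddings import HuggingFaceEmbeddings

    return HuggingFaceEmbeddings(model_name=model_name)
```

After embeddings = build_embeddings(), calling embeddings.embed_query("What is RAG?") returns the query vector as a list of floats, and embeddings.embed_documents(chunks) does the same for a list of chunks.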

Output:

[Figure: SentenceTransformers for Embeddings]

5. Create FAISS Vector Store for Efficient Similarity Search

FAISS provides fast nearest-neighbor search over dense vectors and offers both exact and approximate index types.
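
A minimal sketch of building and querying the index through LangChain follows. It assumes langchain-community and faiss-cpu are installed and that embeddings is a LangChain embeddings object such as the one created in step 4. The helper names build_vector_store and top_chunks are hypothetical.

```python
def build_vector_store(chunks: list[str], embeddings):
    """Embed the chunks and index them in an in-memory FAISS store."""
    # Imported lazily; requires langchain-community and faiss-cpu.
    from langchain_community.vectorstores import FAISS

    return FAISS.from_texts(chunks, embeddings)


def top_chunks(vector_store, query: str, k: int = 3) -> list[str]:
    """Return the text of the k chunks most similar to the query."""
    docs = vector_store.similarity_search(query, k=k)
    return [doc.page_content for doc in docs]
```

Calling top_chunks(store, "What is RAG?") is a quick way to inspect what the retriever would hand to the LLM before wiring up the full chain.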

6. Initialize Gemini 2.5 Flash LLM Using the API Key

This creates a client for the hosted Gemini 2.5 Flash model, which handles text generation.
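
The client can be created as sketched below, assuming the langchain-google-genai package is installed. The temperature of 0.0 is an assumption chosen here to keep answers close to the retrieved context, and build_llm is a hypothetical helper.

```python
def build_llm(api_key: str, model: str = "gemini-2.5-flash", temperature: float = 0.0):
    """Create a LangChain chat model client for Gemini."""
    # Imported lazily; requires the langchain-google-genai package.
    from langchain_google_genai import ChatGoogleGenerativeAI

    return ChatGoogleGenerativeAI(
        model=model,
        google_api_key=api_key,
        temperature=temperature,
    )
```

No model weights are downloaded at this point; requests are sent to the Gemini API when the chain is invoked.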

7. Build the RetrievalQA Chain

LangChain’s RetrievalQA chain wires the retriever and the LLM together: it fetches the relevant chunks, inserts them into a prompt and asks the model for an answer.
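
A sketch of the chain construction follows. It assumes a LangChain release in which langchain.chains.RetrievalQA is available (the 0.x series); newer releases may relocate or deprecate this class, so check the version you have installed. The value k=3 and the helper name build_qa_chain are assumptions.

```python
def build_qa_chain(llm, vector_store, k: int = 3):
    """Combine a retriever and an LLM into a question-answering chain."""
    # Imported lazily; assumes a LangChain version that still provides it.
    from langchain.chains import RetrievalQA

    # Retrieve the k most similar chunks for each query.
    retriever = vector_store.as_retriever(search_kwargs={"k": k})
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # place all retrieved chunks into one prompt
        retriever=retriever,
        return_source_documents=True,  # keep the chunks for inspection
    )
```

The "stuff" chain type is the simplest option and works as long as the retrieved chunks fit in the model's context window.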

8. Query the RAG System and Print Results

The model returns a grounded answer, supported by the relevant document chunks.
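
Querying the chain could look like the sketch below. It assumes qa_chain is a RetrievalQA chain built with return_source_documents=True, so the response dictionary carries both "result" and "source_documents". The helper ask is hypothetical.

```python
def ask(qa_chain, question: str) -> str:
    """Run one query and print the answer with its supporting chunks."""
    response = qa_chain.invoke({"query": question})
    print("Answer:", response["result"])
    # Show the start of each retrieved chunk so the grounding can be checked.
    for i, doc in enumerate(response.get("source_documents", []), start=1):
        print(f"Source {i}: {doc.page_content[:100]}")
    return response["result"]
```

Printing the sources next to the answer makes it easy to spot cases where the retriever returned irrelevant chunks.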

[Figure: Output of the RAG query]

Google Colab: RAG with LangChain

LangChain Memory Integration

While the above example covers single-turn queries, LangChain supports memory modules to store conversational history over multi-turn interactions. This lets RAG systems maintain user context and state across queries to build coherent, personalized dialogues.

For example, ConversationBufferMemory stores the full conversation history and includes it as context in each prompt sent to Gemini, so that answers can take past exchanges into account.
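
The buffering idea can be shown without LangChain. The class below is an illustrative stand-in written for this article, not the ConversationBufferMemory API; the names BufferMemory and prompt_with_history are hypothetical.

```python
class BufferMemory:
    """Keep every turn and replay it as context in later prompts."""

    def __init__(self) -> None:
        self.turns: list[tuple[str, str]] = []

    def save(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def as_context(self) -> str:
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)


def prompt_with_history(memory: BufferMemory, question: str) -> str:
    """Prefix the new question with the stored conversation, if any."""
    history = memory.as_context()
    prefix = f"Conversation so far:\n{history}\n\n" if history else ""
    return f"{prefix}Human: {question}\nAI:"


memory = BufferMemory()
memory.save("What is RAG?", "Retrieval-Augmented Generation.")
print(prompt_with_history(memory, "Which steps does it involve?"))
```

Because the full history is replayed on every turn, the prompt grows with the conversation, which is why long dialogues often need a summarizing or windowed memory instead.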

Further Enhancements

  • Long-term Memory: Incorporate LangChain memory modules for multi-turn conversational RAG.
  • Advanced Prompting: Customize prompt templates to give the LLM clearer instructions.
  • Different Vector Stores: Use Pinecone, Weaviate or Chroma for scalability or managed hosting.
  • Alternative Embeddings: Use OpenAI embeddings, Google Vertex AI embeddings or local transformer models.
  • Dynamic Document Loading: Load documents from PDFs, web pages or databases dynamically.
  • Chain Types: Use the “map_reduce” or “refine” chain types when the retrieved text is too large to fit in a single prompt.