Voozh

April 17, 2026

26 min read

LangChain has become the most widely adopted framework for building applications powered by large language models. With over 100,000 GitHub stars and millions of monthly PyPI downloads, it provides the abstractions developers need to connect LLMs to real-world data sources, APIs, and tools. This langchain tutorial walks you through building a complete RAG-powered chatbot from scratch in 13 steps, covering everything from environment setup to production deployment with LangSmith observability.

Whether you are building your first LLM application or migrating from an older LangChain version, this guide uses LangChain v1.x with LangChain Expression Language (LCEL) and LangGraph integration. By the end, you will have a fully functional retrieval-augmented generation system that can answer questions from your own documents with source citations, streaming output, and conversation memory.

Prerequisites and Environment Setup

Before starting this langchain tutorial, make sure you have the following software installed and API keys ready. LangChain v1.x dropped Python 3.9 support, so you need Python 3.10 or higher. The framework has been restructured into separate packages since the 1.0 stable release in October 2025, so the installation process differs from older tutorials you may have seen.

Requirement	Minimum Version	Recommended Version	Purpose
Python	3.10	3.12	Runtime
langchain	1.0.0	1.2.x	Core framework
langchain-openai	0.3.0	Latest	OpenAI provider
langchain-community	0.3.0	Latest	Third-party integrations
langgraph	1.0.0	1.1.x	Agent orchestration
chromadb	0.5.0	Latest	Vector store
langsmith	0.2.0	Latest	Observability and tracing
pip	23.0	24.x	Package manager

You will also need an OpenAI API key for the LLM and embedding models. If you prefer a different provider, LangChain supports Anthropic, Google, Cohere, and dozens of others through provider-specific packages. The modular architecture introduced in v1.0 means you install only the providers you actually use, which keeps your dependency tree clean and avoids version conflicts.

LangSmith is optional but strongly recommended for debugging and monitoring. It provides request-level tracing so you can see exactly what prompts are sent, what the LLM returns, and how long each step takes. LangSmith Self-Hosted v0.13 added Insights support in January 2026, giving teams running their own infrastructure feature parity with the cloud version.

Step 1: Install LangChain and Create the Project Structure

Start by creating a virtual environment and installing the required packages. The LangChain ecosystem was split into multiple packages with the v1.0 release, so you need to install each one separately. This modular approach prevents dependency bloat and lets you update providers independently of the core framework.

👁 Step 1: Install LangChain and Create the Project Structure

# Create project directory and virtual environment
mkdir langchain-rag-chatbot && cd langchain-rag-chatbot
python -m venv .venv
source .venv/bin/activate # On Windows: .venvScriptsactivate

# Install core packages
pip install langchain langchain-openai langchain-community
pip install langgraph chromadb
pip install langsmith python-dotenv

# Verify installation
python -c "import langchain; print(f'LangChain {langchain.__version__}')"

Next, create the project structure. A clean folder layout makes it easier to manage prompts, document loaders, and the retrieval chain separately. Create a .env file for your API keys and never commit it to version control.

# Project structure
langchain-rag-chatbot/
├── .env # API keys (add to .gitignore)
├── app.py # Main application entry point
├── ingest.py # Document loading and indexing
├── chain.py # RAG chain definition
├── prompts/
│ └── rag_prompt.py # Prompt templates
├── docs/ # Source documents to index
│ └── sample.txt
├── vectorstore/ # Chroma persistent storage
└── requirements.txt

# .env file contents
OPENAI_API_KEY=sk-your-key-here
LANGSMITH_API_KEY=lsv2-your-key-here
LANGSMITH_TRACING=true
LANGSMITH_PROJECT=rag-chatbot

The LANGSMITH_TRACING=true environment variable enables automatic tracing of every LangChain call. Once set, all your chains, retrievers, and LLM calls appear in the LangSmith dashboard without any code changes. This is one of the most powerful debugging features in the LangChain ecosystem and it costs nothing for small-scale development.

Step 2: Load and Split Documents for Indexing

RAG (Retrieval-Augmented Generation) works by finding relevant chunks of text from your documents and feeding them to the LLM as context. The first step is loading your source documents and splitting them into appropriately sized chunks. LangChain provides over 80 document loaders for different file formats, including PDF, HTML, Markdown, CSV, and more.

The chunk size directly affects retrieval quality. Chunks that are too large dilute the relevant information with noise. Chunks that are too small lose important context. A chunk size of 1000 characters with 200 characters of overlap is a good starting point for most use cases. The overlap ensures that sentences split across chunk boundaries are still captured.

# ingest.py
from langchain_community.document_loaders import (
 TextLoader,
 DirectoryLoader,
 PyPDFLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from dotenv import load_dotenv
import os

load_dotenv()

def load_documents(docs_dir: str = "./docs"):
 """Load all supported documents from a directory."""
 loaders = {
 ".txt": TextLoader,
 ".pdf": PyPDFLoader,
 }
 documents = []
 for file in os.listdir(docs_dir):
 ext = os.path.splitext(file)[1].lower()
 if ext in loaders:
 loader = loaders[ext](os.path.join(docs_dir, file))
 documents.extend(loader.load())
 print(f"Loaded {len(documents)} documents")
 return documents


def split_documents(documents, chunk_size=1000, chunk_overlap=200):
 """Split documents into chunks for indexing."""
 splitter = RecursiveCharacterTextSplitter(
 chunk_size=chunk_size,
 chunk_overlap=chunk_overlap,
 separators=["nn", "n", ". ", " ", ""],
 )
 chunks = splitter.split_documents(documents)
 print(f"Split into {len(chunks)} chunks")
 return chunks


def create_vectorstore(chunks, persist_dir: str = "./vectorstore"):
 """Create a Chroma vector store from document chunks."""
 embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
 vectorstore = Chroma.from_documents(
 documents=chunks,
 embedding=embeddings,
 persist_directory=persist_dir,
 )
 print(f"Created vector store with {vectorstore._collection.count()} vectors")
 return vectorstore


if __name__ == "__main__":
 docs = load_documents()
 chunks = split_documents(docs)
 vectorstore = create_vectorstore(chunks)

The RecursiveCharacterTextSplitter tries each separator in order, starting with double newlines (paragraph breaks), then single newlines, then sentences, and finally individual words. This hierarchy preserves semantic structure as much as possible. For code-heavy documents, consider using Language.PYTHON or other language-specific splitters that understand code syntax.

Step 3: Build the Retrieval Chain with LCEL

LangChain Expression Language (LCEL) is the composable syntax for building chains in LangChain v1.x. Instead of the older LLMChain and SequentialChain classes, you pipe components together using the | operator. LCEL provides automatic batching, streaming, and async support without any extra code. In LangChain v1.1.0, LCEL gained summarization middleware that can automatically compress conversation history using model profiles.

The retrieval chain takes a user question, searches the vector store for relevant chunks, formats them into a prompt with the original question, and sends everything to the LLM. This is the core RAG pattern that powers most production LangChain applications.

# chain.py
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from dotenv import load_dotenv

load_dotenv()


def get_vectorstore(persist_dir: str = "./vectorstore"):
 """Load the existing Chroma vector store."""
 embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
 return Chroma(
 persist_directory=persist_dir,
 embedding_function=embeddings,
 )


def format_docs(docs):
 """Format retrieved documents into a single string."""
 return "nn---nn".join(
 f"Source: {doc.metadata.get('source', 'unknown')}n{doc.page_content}"
 for doc in docs
 )


def create_rag_chain():
 """Build the RAG chain using LCEL."""
 vectorstore = get_vectorstore()
 retriever = vectorstore.as_retriever(
 search_type="similarity",
 search_kwargs={"k": 4},
 )
 
 llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
 
 prompt = ChatPromptTemplate.from_messages([
 ("system", """You are a helpful assistant that answers questions 
based on the provided context. Always cite which source document 
your answer comes from. If the context does not contain enough 
information to answer, say so honestly.

Context:
{context}"""),
 MessagesPlaceholder(variable_name="chat_history", optional=True),
 ("human", "{question}"),
 ])
 
 # LCEL chain: retrieve docs in parallel with passing the question
 rag_chain = (
 RunnableParallel(
 context=retriever | format_docs,
 question=RunnablePassthrough(),
 )
 | prompt
 | llm
 | StrOutputParser()
 )
 
 return rag_chain


# Quick test
if __name__ == "__main__":
 chain = create_rag_chain()
 response = chain.invoke("What is this document about?")
 print(response)

The RunnableParallel component is key here. It runs the retriever and the question passthrough simultaneously, which reduces latency. The retriever returns document objects, which the format_docs function converts into a formatted string that the prompt template can use. The entire chain streams by default when you call chain.stream() instead of chain.invoke().

Step 4: Add Conversation Memory with Message History

A chatbot needs to remember previous messages in the conversation. LangChain v1.x handles this through message history wrappers that store and retrieve chat messages. The recommended approach uses RunnableWithMessageHistory, which wraps any LCEL chain with automatic history management. Each conversation gets a unique session ID so multiple users can interact simultaneously.

👁 Step 4: Add Conversation Memory with Message History

For production applications, you would store message history in Redis, PostgreSQL, or another persistent backend. For development and testing, the in-memory ChatMessageHistory store works fine. The LangChain community package includes history backends for over a dozen storage systems.

# Add to chain.py
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

# In-memory store for development
message_store: dict[str, ChatMessageHistory] = {}


def get_session_history(session_id: str) -> BaseChatMessageHistory:
 """Retrieve or create message history for a session."""
 if session_id not in message_store:
 message_store[session_id] = ChatMessageHistory()
 return message_store[session_id]


def create_conversational_chain():
 """Build a RAG chain with conversation memory."""
 vectorstore = get_vectorstore()
 retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
 llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
 
 # Prompt with chat history placeholder
 prompt = ChatPromptTemplate.from_messages([
 ("system", """You are a helpful assistant. Answer based on context.
Cite sources. If unsure, say so.

Context:
{context}"""),
 MessagesPlaceholder(variable_name="chat_history"),
 ("human", "{question}"),
 ])
 
 rag_chain = (
 RunnableParallel(
 context=lambda x: format_docs(
 retriever.invoke(x["question"])
 ),
 question=lambda x: x["question"],
 chat_history=lambda x: x.get("chat_history", []),
 )
 | prompt
 | llm
 | StrOutputParser()
 )
 
 # Wrap with message history
 conversational_chain = RunnableWithMessageHistory(
 rag_chain,
 get_session_history,
 input_messages_key="question",
 history_messages_key="chat_history",
 )
 
 return conversational_chain

When you invoke the conversational chain, pass a config dictionary with the session ID. The wrapper automatically loads previous messages and appends the new exchange after each turn. This pattern scales cleanly because the history logic is completely decoupled from the chain logic.

Step 5: Implement Streaming Output for Real-Time Responses

Streaming is essential for any chatbot interface because LLM responses can take several seconds to generate. Without streaming, users stare at a blank screen until the full response is ready. With LCEL, streaming is built in. Every chain that ends with an LLM or output parser supports the .stream() method, which yields tokens as they are generated.

LangGraph v1.1 introduced type-safe streaming with the version="v2" parameter. This returns unified StreamPart objects with type, ns (namespace), and data keys, making it easier to handle different event types in complex multi-step chains. For a simple RAG chain, the basic LCEL streaming is sufficient.

# app.py - Streaming chatbot interface
from chain import create_conversational_chain

def main():
 chain = create_conversational_chain()
 session_id = "user-001"
 config = {"configurable": {"session_id": session_id}}
 
 print("RAG Chatbot Ready. Type 'quit' to exit.n")
 
 while True:
 question = input("You: ").strip()
 if question.lower() in ("quit", "exit"):
 break
 if not question:
 continue
 
 print("Bot: ", end="", flush=True)
 
 # Stream tokens as they arrive
 for chunk in chain.stream(
 {"question": question},
 config=config,
 ):
 print(chunk, end="", flush=True)
 
 print("n")


if __name__ == "__main__":
 main()

The flush=True parameter ensures each token appears immediately in the terminal instead of being buffered. In a web application, you would use Server-Sent Events (SSE) or WebSockets to push tokens to the browser. FastAPI with StreamingResponse pairs well with LangChain streaming for production APIs. If you need to build a REST API layer, see our FastAPI tutorial for the full setup.

Step 6: Add Tool Calling and Function Execution

Modern LLM applications go beyond simple question answering. Tool calling lets the model decide when to invoke external functions, such as searching the web, querying a database, or calling an API. LangChain v1.x uses the @tool decorator to define tools, and the model’s built-in function calling capabilities to select and invoke them. LangChain v1.1.0 introduced structured output inference through model profiles, which automatically detects whether a model supports native structured output.

Tools are Python functions with type hints and docstrings. LangChain converts the function signature into a JSON schema that the LLM uses to understand what parameters the tool expects. The docstring becomes the tool description, so write it clearly.

# tools.py
from langchain_core.tools import tool
from datetime import datetime
import json


@tool
def get_current_time() -> str:
 """Get the current date and time. Use when the user asks about
 the current time or today's date."""
 return datetime.now().strftime("%Y-%m-%d %H:%M:%S")


@tool
def calculate(expression: str) -> str:
 """Evaluate a mathematical expression. Use when the user asks
 to calculate something.
 
 Args:
 expression: A valid Python math expression (e.g., '2 + 2', '100 * 0.15')
 """
 try:
 # Only allow safe math operations
 allowed = set("0123456789+-*/.(). ")
 if not all(c in allowed for c in expression):
 return "Error: Invalid characters in expression"
 result = eval(expression) 
 return str(result)
 except Exception as e:
 return f"Error: {e}"


@tool 
def search_documents(query: str) -> str:
 """Search the knowledge base for relevant information.
 
 Args:
 query: The search query to find relevant documents
 """
 from chain import get_vectorstore, format_docs
 vectorstore = get_vectorstore()
 docs = vectorstore.similarity_search(query, k=3)
 return format_docs(docs)


# Collect all tools
all_tools = [get_current_time, calculate, search_documents]

To use these tools with a chat model, bind them using llm.bind_tools(tools). This tells the model about available tools and lets it decide when to call them. The model returns a structured tool call message instead of plain text, and LangChain handles parsing the arguments and invoking the function. For more complex agent workflows with loops and conditional logic, use LangGraph as described in our LangGraph tutorial.

Step 7: Build an Agent with LangGraph for Complex Workflows

While LCEL chains follow a fixed sequence of steps, agents can dynamically decide which actions to take based on the LLM’s reasoning. LangGraph, which reached v1.0 alongside LangChain in October 2025, provides a graph-based framework for building agents with explicit state management, conditional branching, and human-in-the-loop workflows. LangGraph v1.1 fixed critical time travel issues with interrupts and subgraphs, making production agent deployments more reliable.

👁 Step 7: Build an Agent with LangGraph for Complex Workflows

The create_react_agent function is the quickest way to build an agent that reasons about which tools to use. It creates a ReAct (Reasoning + Acting) loop where the model thinks about what to do, executes a tool, observes the result, and repeats until it has enough information to answer.

# agent.py
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langchain_core.messages import HumanMessage
from tools import all_tools
from dotenv import load_dotenv

load_dotenv()


def create_agent():
 """Create a ReAct agent with tools."""
 llm = ChatOpenAI(model="gpt-4o", temperature=0)
 
 # System message defines the agent's behavior
 system_message = """You are a helpful AI assistant with access to tools.
Use the search_documents tool to find information from the knowledge base.
Use calculate for math. Use get_current_time for date/time questions.
Always explain your reasoning before using a tool."""
 
 agent = create_react_agent(
 model=llm,
 tools=all_tools,
 prompt=system_message,
 )
 
 return agent


def run_agent():
 """Interactive agent loop."""
 agent = create_agent()
 config = {"configurable": {"thread_id": "session-001"}}
 
 print("Agent Ready. Type 'quit' to exit.n")
 
 while True:
 user_input = input("You: ").strip()
 if user_input.lower() in ("quit", "exit"):
 break
 
 print("Agent: ", end="", flush=True)
 
 for event in agent.stream(
 {"messages": [HumanMessage(content=user_input)]},
 config=config,
 ):
 if "agent" in event:
 msg = event["agent"]["messages"][-1]
 if hasattr(msg, "content") and msg.content:
 print(msg.content, end="", flush=True)
 
 print("n")


if __name__ == "__main__":
 run_agent()

The agent output shows each step of its reasoning. When the model decides to use a tool, LangGraph executes the tool function and feeds the result back to the model. The loop continues until the model generates a final text response without tool calls. For multi-agent systems where different agents collaborate on complex tasks, see our CrewAI tutorial.

Step 8: Connect LangSmith for Observability and Debugging

LangSmith is LangChain’s observability platform that captures detailed traces of every chain execution. When something goes wrong in a multi-step RAG pipeline, LangSmith shows you exactly which step failed, what inputs it received, and how long it took. This is invaluable for debugging retrieval quality issues, prompt engineering, and latency optimization.

If you set the LANGSMITH_TRACING=true environment variable in Step 1, tracing is already active. Every call to chain.invoke() or agent.stream() automatically sends trace data to LangSmith. You can view traces at docs.smith.langchain.com. The platform was rebranded with the Agent Builder renamed to LangSmith Fleet in March 2026.

Beyond basic tracing, LangSmith supports evaluation datasets, A/B testing of prompts, and annotation queues for human feedback. You can create test datasets with expected outputs and run your chain against them to measure accuracy over time. This is how production teams catch regressions when they change prompts or swap models.

# evaluation.py - LangSmith evaluation example
from langsmith import Client
from langchain_openai import ChatOpenAI
from chain import create_rag_chain

client = Client()

# Create a dataset for evaluation
dataset = client.create_dataset("rag-eval", description="RAG accuracy tests")

# Add test examples
client.create_examples(
 inputs=[
 {"question": "What is the main topic of the document?"},
 {"question": "What are the key findings?"},
 {"question": "Who authored this report?"},
 ],
 outputs=[
 {"answer": "The document covers AI infrastructure trends."},
 {"answer": "Key findings include 35% growth in chip demand."},
 {"answer": "The report was authored by the research team."},
 ],
 dataset_id=dataset.id,
)

# Run evaluation
from langsmith.evaluation import evaluate

def predict(inputs: dict) -> dict:
 chain = create_rag_chain()
 return {"answer": chain.invoke(inputs["question"])}

results = evaluate(
 predict,
 data=dataset.name,
 experiment_prefix="rag-v1",
)
print(f"Evaluation complete. View results in LangSmith dashboard.")

LangSmith Self-Hosted v0.13 brought Insights support to self-hosted deployments, so enterprise teams can run the full observability stack on their own infrastructure. This is critical for organizations with data residency requirements that prevent sending trace data to external services.

Step 9: Add Multiple Document Types and Advanced Retrieval

A production RAG system needs to handle more than plain text files. LangChain provides document loaders for PDFs, Word documents, HTML pages, Notion exports, Slack messages, and many more. You can also improve retrieval quality beyond basic similarity search by using hybrid search, re-ranking, or multi-query retrieval strategies.

Multi-query retrieval generates multiple rephrased versions of the user’s question and runs each one against the vector store. The results are combined and deduplicated, which catches relevant documents that might be missed by a single query formulation. This technique improves recall by 15-30% in most benchmarks.

# advanced_retrieval.py
from langchain.retrievers import MultiQueryRetriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from dotenv import load_dotenv

load_dotenv()


def create_multi_query_retriever():
 """Retriever that generates multiple query variants."""
 embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
 vectorstore = Chroma(
 persist_directory="./vectorstore",
 embedding_function=embeddings,
 )
 base_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
 llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
 
 retriever = MultiQueryRetriever.from_llm(
 retriever=base_retriever,
 llm=llm,
 )
 return retriever


def create_compressed_retriever():
 """Retriever that extracts only relevant parts of documents."""
 embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
 vectorstore = Chroma(
 persist_directory="./vectorstore",
 embedding_function=embeddings,
 )
 base_retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
 llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
 
 compressor = LLMChainExtractor.from_llm(llm)
 retriever = ContextualCompressionRetriever(
 base_compressor=compressor,
 base_retriever=base_retriever,
 )
 return retriever

The contextual compression retriever takes the retrieved documents and uses an LLM to extract only the parts that are relevant to the question. This reduces the amount of irrelevant text in the prompt context, which improves answer quality and reduces token costs. The trade-off is an additional LLM call for compression, so use a smaller model like gpt-4o-mini for the compression step.

Retrieval Strategy	Recall Improvement	Latency Impact	Cost Impact	Best For
Basic similarity	Baseline	Fastest	Lowest	Simple use cases
Multi-query	+15-30%	+200-400ms	+1 LLM call	Ambiguous queries
Contextual compression	+10-20%	+500-800ms	+1 LLM call per doc	Long documents
Hybrid (BM25 + vector)	+20-35%	+100-200ms	Minimal	Keyword-heavy domains
Re-ranking (Cohere)	+25-40%	+300-500ms	+API call	High-accuracy needs
Parent document	+15-25%	+50-100ms	Minimal	Structured documents

Step 10: Build a FastAPI Web Server for the Chatbot

A terminal chatbot is useful for testing, but real applications need a web API. FastAPI is the ideal choice because it natively supports async operations and streaming responses, both of which are critical for LLM applications. LangChain’s async methods (ainvoke, astream) work smoothly with FastAPI’s async endpoints.

👁 Step 10: Build a FastAPI Web Server for the Chatbot

# server.py
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from chain import create_conversational_chain
import asyncio
import uvicorn

app = FastAPI(title="LangChain RAG Chatbot API")
chain = create_conversational_chain()


class ChatRequest(BaseModel):
 question: str
 session_id: str = "default"


class ChatResponse(BaseModel):
 answer: str
 session_id: str


@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
 """Non-streaming chat endpoint."""
 config = {"configurable": {"session_id": request.session_id}}
 response = await chain.ainvoke(
 {"question": request.question},
 config=config,
 )
 return ChatResponse(answer=response, session_id=request.session_id)


@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
 """Streaming chat endpoint using Server-Sent Events."""
 config = {"configurable": {"session_id": request.session_id}}
 
 async def generate():
 async for chunk in chain.astream(
 {"question": request.question},
 config=config,
 ):
 yield f"data: {chunk}nn"
 yield "data: [DONE]nn"
 
 return StreamingResponse(
 generate(),
 media_type="text/event-stream",
 )


@app.get("/health")
async def health():
 return {"status": "ok"}


if __name__ == "__main__":
 uvicorn.run(app, host="0.0.0.0", port=8000)

Run the server with python server.py and test it with curl:

# Test non-streaming endpoint
curl -X POST http://localhost:8000/chat 
 -H "Content-Type: application/json" 
 -d '{"question": "What is this document about?", "session_id": "test-001"}'

# Expected output:
# {"answer":"Based on the provided context, the document covers...","session_id":"test-001"}

# Test streaming endpoint
curl -X POST http://localhost:8000/chat/stream 
 -H "Content-Type: application/json" 
 -d '{"question": "Summarize the key points", "session_id": "test-001"}'

# Expected output (streamed):
# data: Based
# data: on
# data: the
# data: document
# ...
# data: [DONE]

The streaming endpoint uses Server-Sent Events (SSE), which is the same protocol that OpenAI and Anthropic use for their streaming APIs. Any SSE-compatible frontend library can consume this endpoint. For a complete FastAPI production setup with error handling and middleware, see our FastAPI REST API tutorial.

Step 11: Handle Errors and Edge Cases in Production

Production LangChain applications encounter API rate limits, network timeouts, malformed inputs, and empty retrieval results. Strong error handling prevents these issues from crashing your application and provides meaningful feedback to users. LangChain includes built-in retry logic for LLM calls, but you need to handle retrieval and chain-level errors yourself.

The most common production issue is empty retrieval results. When the vector store returns no relevant documents, the LLM hallucinates answers because it has no context to ground its response. Always check retrieval results before passing them to the LLM, and provide a clear fallback message when no relevant documents are found.

# error_handling.py
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnableLambda
from langchain_core.runnables.config import RunnableConfig
import logging

logger = logging.getLogger(__name__)


def create_robust_chain():
 """Chain with thorough error handling."""
 from chain import get_vectorstore, format_docs
 
 vectorstore = get_vectorstore()
 retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
 
 llm = ChatOpenAI(
 model="gpt-4o",
 temperature=0.1,
 max_retries=3, # Retry on transient API errors
 request_timeout=30, # 30-second timeout per request
 )
 
 def safe_retrieve(question: str) -> str:
 """Retrieve with fallback for empty results."""
 try:
 docs = retriever.invoke(question)
 if not docs:
 logger.warning(f"No documents found for: {question}")
 return "No relevant documents found in the knowledge base."
 return format_docs(docs)
 except Exception as e:
 logger.error(f"Retrieval error: {e}")
 return "Unable to search the knowledge base at this time."
 
 def safe_generate(inputs: dict) -> str:
 """Generate with error handling."""
 try:
 from langchain_core.prompts import ChatPromptTemplate
 prompt = ChatPromptTemplate.from_messages([
 ("system", "Answer based on context:n{context}"),
 ("human", "{question}"),
 ])
 chain = prompt | llm
 response = chain.invoke(inputs)
 return response.content
 except Exception as e:
 logger.error(f"Generation error: {e}")
 return "I'm having trouble generating a response. Please try again."
 
 # Compose with error handling
 chain = (
 RunnableLambda(lambda q: {
 "context": safe_retrieve(q),
 "question": q,
 })
 | RunnableLambda(safe_generate)
 )
 
 return chain

Configure the max_retries parameter on the LLM to automatically retry on transient errors like rate limits (HTTP 429) and server errors (HTTP 500). The default is 2 retries with exponential backoff. For high-throughput applications, set up a token bucket rate limiter to stay within your API provider’s limits proactively rather than relying on retry logic.

Step 12: Write Tests for Your LangChain Application

Testing LLM applications is different from testing traditional software because outputs are non-deterministic. You cannot assert that the model returns an exact string. Instead, test the structure of your chains, the behavior of your tools, and the quality of your retrieval pipeline. Use mock LLMs for unit tests and real LLMs for integration tests.

# test_chain.py
import pytest
from unittest.mock import patch, MagicMock
from langchain_core.documents import Document


def test_format_docs():
 """Test document formatting function."""
 from chain import format_docs
 
 docs = [
 Document(
 page_content="Hello world",
 metadata={"source": "test.txt"},
 ),
 Document(
 page_content="Second doc",
 metadata={"source": "test2.txt"},
 ),
 ]
 result = format_docs(docs)
 assert "Hello world" in result
 assert "Source: test.txt" in result
 assert "---" in result # Separator between docs


def test_document_splitting():
 """Test that documents are split correctly."""
 from ingest import split_documents
 from langchain_core.documents import Document
 
 long_text = "A" * 2000 # 2000 characters
 docs = [Document(page_content=long_text, metadata={"source": "test"})]
 chunks = split_documents(docs, chunk_size=500, chunk_overlap=50)
 
 assert len(chunks) > 1
 assert all(len(c.page_content) <= 500 for c in chunks)


def test_tool_calculate():
 """Test the calculate tool."""
 from tools import calculate
 
 assert calculate.invoke("2 + 2") == "4"
 assert calculate.invoke("100 * 0.15") == "15.0"
 assert "Error" in calculate.invoke("import os")


def test_tool_current_time():
 """Test the time tool returns valid format."""
 from tools import get_current_time
 
 result = get_current_time.invoke("")
 assert len(result) == 19 # YYYY-MM-DD HH:MM:SS format


# Run with: pytest test_chain.py -v

For integration tests that call real LLMs, use LangSmith evaluation datasets as shown in Step 8. These tests are slower and cost money, so run them in CI only on pull requests, not on every commit. For more on Python testing best practices, see our Pytest tutorial with CI/CD.

Step 13: Deploy to Production with Docker

The final step is packaging your application for production deployment. Docker ensures consistent behavior across development, staging, and production environments. The Dockerfile below creates a minimal image with only the required dependencies and runs the FastAPI server behind Uvicorn with multiple workers.

👁 Step 13: Deploy to Production with Docker

# Dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install dependencies first for better caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose port
EXPOSE 8000

# Run with multiple workers for production
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

---

# docker-compose.yml
version: "3.9"
services:
 chatbot:
 build: .
 ports:
 - "8000:8000"
 env_file:
 - .env
 volumes:
 - ./vectorstore:/app/vectorstore
 restart: unless-stopped
 healthcheck:
 test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
 interval: 30s
 timeout: 10s
 retries: 3

# Build and run
# docker compose up --build -d
# docker compose logs -f chatbot

For production deployments on cloud platforms, consider using LangServe (LangChain’s deployment library) or deploying to AWS ECS, Google Cloud Run, or Kubernetes. The Docker image above works on any container platform. Mount the vector store directory as a volume so the indexed data persists across container restarts. For Kubernetes deployment patterns, see our Kubernetes and Helm tutorial.

Common Pitfalls and How to Avoid Them

After building hundreds of LangChain applications, these are the mistakes that catch developers most often. Each pitfall includes a concrete fix so you can avoid wasting hours on debugging.

Pitfall 1: Using the wrong LangChain import paths. The v1.0 restructuring moved many classes to new packages. Importing from langchain.chat_models import ChatOpenAI fails. Use from langchain_openai import ChatOpenAI instead. Every provider now has its own package (langchain-openai, langchain-anthropic, langchain-google-genai).

Pitfall 2: Not setting chunk overlap in text splitting. Without overlap, sentences that span chunk boundaries are split and become unsearchable. Always use at least 10-20% overlap relative to your chunk size.

Pitfall 3: Forgetting to persist the vector store. Chroma operates in-memory by default. If you do not pass persist_directory, your entire index disappears when the process exits. Always set a persist directory in production.

Pitfall 4: Using temperature=1.0 for RAG responses. High temperature causes the model to hallucinate details not present in the context documents. Use temperature=0.0 to 0.2 for factual RAG applications. Reserve higher temperatures for creative tasks.

Pitfall 5: Not handling empty retrieval results. When the vector store returns no matches, the LLM gets an empty context and invents answers. Always check for empty results and return a clear “I don’t know” message instead of letting the model hallucinate.

Pitfall 6: Passing raw user input to SQL or shell tools. If your agent has database or system tools, sanitize inputs to prevent injection attacks. LangChain’s @tool decorator validates type hints but does not sanitize string content. The CVE-2025-67644 vulnerability in langgraph-checkpoint-sqlite was caused by exactly this kind of unsanitized input in metadata filters.

Pitfall 7: Ignoring token limits in conversation memory. As conversations grow, the message history exceeds the model’s context window. Use the LCEL summarization middleware introduced in v1.1.0 or implement a sliding window that keeps only the last N messages plus a summary of older ones.

Troubleshooting Guide for LangChain Applications

When things go wrong, systematic debugging saves time. Here are the most common errors and their solutions, organized by category.

Error	Cause	Solution
`ImportError: cannot import name 'ChatOpenAI' from 'langchain'`	Using old import paths from pre-v1.0	Install `langchain-openai` and import from `langchain_openai`
`AuthenticationError: Incorrect API key`	Missing or invalid OPENAI_API_KEY	Check `.env` file and verify key at platform.openai.com
`RateLimitError: Rate limit reached`	Too many API calls per minute	Add `max_retries=3` to LLM constructor or add delays
`ValueError: Missing input keys`	Chain input schema mismatch	Check `chain.input_schema.schema()` to see expected keys
`ChromaDB: no such table`	Vector store not initialized	Run `python ingest.py` first to create the store
Empty responses from RAG chain	No matching documents found	Lower `k` threshold or check embedding model consistency
`Context length exceeded`	Too many retrieved documents or long history	Reduce `k` value, add compression, or summarize history
Agent stuck in infinite loop	Model keeps calling same tool	Add `max_iterations` parameter or improve tool descriptions
Streaming not working	Using `invoke()` instead of `stream()`	Switch to `chain.stream()` or `chain.astream()`
`LangSmith traces not appearing`	Tracing not enabled	Set `LANGSMITH_TRACING=true` in environment variables

Debugging tip 1: Enable verbose logging with import langchain; langchain.debug = True. This prints every prompt, response, and intermediate step to the console. Turn it off in production.

Debugging tip 2: Use LangSmith’s trace comparison feature to diff two runs side by side. This quickly reveals which step changed when a chain starts behaving differently after a code change.

Debugging tip 3: Check embedding model consistency. If you indexed documents with text-embedding-3-small but query with text-embedding-ada-002, similarity scores will be meaningless. Always use the same embedding model for indexing and retrieval.

Advanced Tips for Production LangChain Applications

Once your basic RAG chatbot works, these optimization techniques can significantly improve performance, reduce costs, and increase answer quality in production environments.

Tip 1: Use smaller models for intermediate steps. Not every step needs GPT-4o. Use gpt-4o-mini for query reformulation, document compression, and tool argument parsing. Reserve the larger model for the final answer generation. This can cut costs by 50-70% with minimal quality loss.

Tip 2: Cache embeddings and LLM responses. LangChain supports caching through the set_llm_cache function. Use SQLite caching for development and Redis for production. Cached responses return instantly and cost nothing, which is especially valuable for repeated queries.

Tip 3: Implement semantic caching. Instead of exact-match caching, use semantic similarity to find cached responses for questions that are worded differently but mean the same thing. The GPTCache integration in LangChain supports this out of the box.

Tip 4: Pre-filter documents with metadata. Add metadata like category, date, and source type during ingestion. Use metadata filters in your retriever to narrow the search space before vector similarity kicks in. This improves both speed and accuracy for large document collections.

Tip 5: Monitor token usage. Use LangSmith or callback handlers to track token consumption per request. Set budget alerts to catch runaway chains that might consume thousands of tokens in a single agent loop. The get_openai_callback context manager gives you per-call token counts.

Tip 6: Use structured output for reliable parsing. Instead of parsing free-text LLM responses, use LangChain’s with_structured_output method that uses the model’s native function calling to return typed Pydantic objects. LangChain v1.1.0 automatically infers whether a model supports native structured output through model profiles.

Related Coverage

LangChain Performance Optimization Checklist

Before deploying your LangChain application to production, work through this optimization checklist. Each item addresses a common performance bottleneck that affects real-world applications handling concurrent users.

Connection pooling: Create a single LLM and embedding client instance and reuse it across requests. Creating new clients per request wastes time on TCP handshakes and authentication. Use FastAPI’s dependency injection or module-level singletons.

Async everywhere: Use ainvoke and astream instead of their synchronous counterparts in web servers. Synchronous calls block the event loop and prevent your server from handling other requests while waiting for LLM responses. A single async endpoint can serve 10x more concurrent users than a synchronous one.

Batch processing: When indexing documents, use vectorstore.add_documents(chunks, batch_size=100) instead of adding one document at a time. Batching reduces the number of API calls to your embedding provider and can speed up ingestion by 5-10x.

Prompt optimization: Every token in your system prompt costs money on every request. Minimize the system prompt to essential instructions only. Move dynamic content like few-shot examples into the retrieval pipeline where they are only included when relevant.

Embedding model selection: OpenAI’s text-embedding-3-small produces 1536-dimension vectors at $0.02 per million tokens. For most RAG applications, it performs within 2-3% of the larger text-embedding-3-large model while costing 5x less. Only upgrade to the large model if you have measured a meaningful retrieval quality difference on your specific data.

Frequently Asked Questions

What is LangChain and why should I use it in 2026?

LangChain is a Python and JavaScript framework for building applications powered by large language models. It provides standardized interfaces for connecting LLMs to external data sources, tools, and APIs. Since its v1.0 stable release in October 2025, it has become the most widely adopted LLM application framework with over 100,000 GitHub stars. You should use it when you need to build RAG systems, chatbots, agents, or any application that combines LLMs with external data.

What is the difference between LangChain and LangGraph?

LangChain provides the core primitives: LLM wrappers, document loaders, vector stores, prompt templates, and LCEL chains. LangGraph builds on top of LangChain to provide a graph-based framework for agent orchestration with explicit state management, conditional logic, and human-in-the-loop workflows. Use LangChain for straightforward retrieval chains and LangGraph when you need agents that can loop, branch, and make decisions.

How much does it cost to run a LangChain RAG application?

Costs depend on your LLM provider and usage volume. Using OpenAI as an example: embedding 1 million tokens with text-embedding-3-small costs $0.02. Generating responses with GPT-4o costs $2.50 per million input tokens and $10.00 per million output tokens. A typical RAG query with 4 retrieved documents and a 500-token response costs roughly $0.005-0.01. At 1,000 queries per day, expect $5-10 per day in API costs.

Can I use LangChain with models other than OpenAI?

Yes. LangChain supports dozens of LLM providers through provider-specific packages. Install langchain-anthropic for Claude, langchain-google-genai for Gemini, or langchain-community for open-source models via Ollama and HuggingFace. The standard chat model interface means you can swap providers by changing one import and one model name. For a comparison of the top models, see our Claude vs ChatGPT 2026 guide.

What is LCEL and how does it differ from the old chain syntax?

LangChain Expression Language (LCEL) is the composable pipe syntax introduced in LangChain v1.x. Instead of LLMChain(llm=llm, prompt=prompt), you write prompt | llm | parser. LCEL automatically supports streaming, async, batching, and LangSmith tracing without any boilerplate. The old classes like LLMChain, SequentialChain, and ConversationChain are deprecated in v1.x.

How do I keep my vector store up to date with new documents?

Use incremental indexing. Track which documents have been indexed (by file hash or last-modified date) and only process new or changed files. LangChain’s index function in langchain.indexes provides built-in deduplication that compares document hashes and only adds new content to the vector store. Run this as a scheduled job (cron or CI pipeline) to keep your knowledge base current.

Is LangChain production-ready?

Yes. LangChain v1.0 was specifically designed for production use with a stable API, semantic versioning, and backward compatibility guarantees. Companies including Elastic, Rakuten, and Replit have deployed LangChain in production. The framework has matured significantly since 2023, with security patches actively maintained (langchain-core has received multiple CVE fixes in 2025-2026). Pair it with LangSmith for observability and LangGraph for complex agent workflows.

What Python version do I need for LangChain v1.x?

LangChain v1.1.0 dropped Python 3.9 support. You need Python 3.10 or higher. Python 3.12 is recommended for the best performance and compatibility. If you are stuck on Python 3.9, you can use LangChain 0.3.x, but it will not receive new features.

👁 Sofia Lindström

Sofia Lindström

Editor-in-Chief

Sofia Lindström is the Editor-in-Chief at Tech Insider, where she leads editorial strategy and oversees coverage across AI, cybersecurity, and enterprise technology. With over a decade in Swedish tech journalism, she previously served as technology editor at Dagens Industri and covered the Nordic startup ecosystem for Breakit. Sofia holds an MSc in Media Technology from KTH Royal Institute of Technology and is a frequent speaker at Web Summit and Slush. She is passionate about making complex technology accessible to business leaders.

View all articles

URL: https://tech-insider.org/langchain-tutorial-rag-chatbot-python-2026/

⇱ LangChain Tutorial: Build a RAG Chatbot in 13 Steps [2026]