LangChain Document Loaders

Last Updated : 6 Nov, 2025

LangChain Document Loaders convert data from various formats such as CSV, PDF, HTML and JSON into standardized Document objects. These objects contain the raw content, metadata and optional identifiers, allowing LLMs to process and analyze the data efficiently.

Document loaders also let developers manage and standardize content across multiple workflows. They support a wide range of file types and sources, including YouTube, Wikipedia and GitHub.

Document Object in LangChain

Before exploring loaders, we need to understand the Document object, which stores the content and its metadata. It has three fields:

page_content: Stores the textual content of the document.
metadata: Provides additional information about the document like source or category.
id: Optional identifier that uniquely identifies the Document object.
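
The three fields can be seen in a minimal sketch like the one below. It assumes the langchain-core package is installed; the content, metadata values and the id "doc-001" are made up for illustration.

```python
def build_document():
    """Return a Document carrying content, metadata and an explicit id."""
    # Imported inside the function so the rest of a script still runs
    # when langchain-core is not installed.
    from langchain_core.documents import Document

    return Document(
        page_content="Document loaders turn raw files into Document objects.",
        metadata={"source": "notes.txt", "category": "tutorial"},  # made-up values
        id="doc-001",  # optional; hypothetical identifier
    )
```

Calling build_document() returns an object whose page_content, metadata and id attributes hold the values passed in.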

This structure allows loaders to consistently format data regardless of its original source.

Types of Document Loaders

LangChain provides over 200 document loaders, categorized as follows:

By File Type: CSV, PDF, HTML, Markdown, MS Office documents, JSON.
By Data Source: YouTube, Wikipedia, GitHub.

Data sources can be public, requiring no authentication, or private, requiring credentials (for example, AWS or Azure).

1. CSV (Comma-Separated Values)

CSVLoader loads each row of a CSV file as a separate Document object.

file_path: Path to the CSV file.
metadata_columns: Columns to include in the metadata of each Document.
csv_args: Arguments for CSV parsing (e.g., delimiter).
source_column (optional): Column whose value is used as the source in each Document's metadata instead of the file path.
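
As a rough sketch of how these parameters fit together, the example below writes a small CSV file and loads it with CSVLoader. It assumes langchain-community is installed; the file name, column names and rows are invented for illustration.

```python
import csv
from pathlib import Path


def write_sample_csv(directory):
    """Write a tiny, made-up CSV file and return its path."""
    path = Path(directory) / "team.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "role", "team"])
        writer.writerow(["Ada", "engineer", "core"])
        writer.writerow(["Lin", "analyst", "data"])
    return path


def load_csv_rows(path):
    """Load each CSV row as one Document; the 'team' column goes to metadata."""
    # Optional dependency, imported only when the function is called.
    from langchain_community.document_loaders import CSVLoader

    loader = CSVLoader(
        file_path=str(path),
        metadata_columns=["team"],    # copied into Document.metadata
        csv_args={"delimiter": ","},  # forwarded to csv.DictReader
        source_column="name",         # metadata["source"] takes this column's value
    )
    return loader.load()
```

With this sample file the loader returns two Document objects, one per data row, and each carries its row index and the team value in metadata.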

2. HTML (HyperText Markup Language)

HTML pages can be loaded either from a saved file or directly from a URL.

urls: List of URLs to load.
mode='single': loads the entire HTML as one Document.
mode='elements': splits the page into multiple documents based on HTML tags.
Metadata includes file type, URL and parent element identifiers.
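
The article does not name the loader here, so the sketch below assumes UnstructuredURLLoader from langchain-community, which accepts urls and mode. It also needs the unstructured package and network access. The early mode check is our own addition, not part of the library call.

```python
VALID_MODES = ("single", "elements")


def load_web_pages(urls, mode="single"):
    """Fetch each URL and return the parsed pages as Document objects."""
    if mode not in VALID_MODES:
        # Reject typos before any network request is made.
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")

    # Optional dependency, imported only when the function is called.
    from langchain_community.document_loaders import UnstructuredURLLoader

    loader = UnstructuredURLLoader(urls=list(urls), mode=mode)
    return loader.load()
```

For example, load_web_pages(["https://example.com"], mode="elements") would return one Document per parsed HTML element rather than one per page.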

3. Markdown

UnstructuredMarkdownLoader loads Markdown files, either as a single Document or split by elements or pages.

file_path: Path to the Markdown file.
mode: 'single', 'elements' or 'paged' for splitting.
Metadata includes last modified date, file type and section category.
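
A minimal sketch, assuming langchain-community together with the unstructured and markdown packages is installed. The sample file and its contents are made up.

```python
from pathlib import Path


def write_sample_markdown(directory):
    """Write a short, made-up Markdown file and return its path."""
    path = Path(directory) / "guide.md"
    text = "# Setup\n\nInstall the package.\n\n## Usage\n\nRun the loader.\n"
    path.write_text(text, encoding="utf-8")
    return path


def load_markdown(path, mode="elements"):
    """Load a Markdown file; mode controls how it is split into Documents."""
    # Optional dependency, imported only when the function is called.
    from langchain_community.document_loaders import UnstructuredMarkdownLoader

    loader = UnstructuredMarkdownLoader(str(path), mode=mode)
    return loader.load()
```

With mode="single" the whole file comes back as one Document, while mode="elements" returns one Document per heading, paragraph or other parsed element.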

4. JSON

JSONLoader parses JSON files into Document objects.

file_path: Path to the JSON file.
jq_schema: JQ query to select content. '.' loads all content.
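text_content: Whether the selected content is expected to be a string (default True). Set it to False when the query returns objects or arrays, which are then serialized to text.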
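
The sketch below shows one way these parameters can be combined. It assumes langchain-community and the jq Python package are installed; the file layout and the ".messages[].text" query are invented for illustration.

```python
import json
from pathlib import Path


def write_sample_json(directory):
    """Write a small, made-up chat export and return its path."""
    path = Path(directory) / "chat.json"
    data = {
        "messages": [
            {"sender": "Ada", "text": "Hello"},
            {"sender": "Lin", "text": "Hi there"},
        ]
    }
    path.write_text(json.dumps(data), encoding="utf-8")
    return path


def load_message_texts(path):
    """Return one Document per value matched by the jq query."""
    # Optional dependency, imported only when the function is called.
    from langchain_community.document_loaders import JSONLoader

    loader = JSONLoader(
        file_path=str(path),
        jq_schema=".messages[].text",  # selects the text of every message
        text_content=True,             # each selected value is already a string
    )
    return loader.load()
```

Here every matched string becomes the page_content of its own Document, so the sample file yields two documents.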

5. MS Office Documents

MS Word (.docx) documents can be loaded using Docx2txtLoader.

file_path: Path to the Word document.
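
A minimal sketch, assuming langchain-community and the docx2txt package are installed. The existence check is our own addition so that a wrong path fails with a clear message.

```python
from pathlib import Path


def load_word_document(path):
    """Load a .docx file and return its text as a list with one Document."""
    path = Path(path)
    if not path.is_file():
        # Fail early with a clear error instead of a loader-specific one.
        raise FileNotFoundError(f"No Word document at {path}")

    # Optional dependency, imported only when the function is called.
    from langchain_community.document_loaders import Docx2txtLoader

    return Docx2txtLoader(str(path)).load()
```

For a hypothetical file such as report.docx, load_word_document("report.docx") would return the document text in page_content and the file path under the source key in metadata.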

6. PDF (Portable Document Format)

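PDF files can be loaded with several loaders, each backed by a different parser. The parameters below correspond to the Unstructured-based PDF loader.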

file_path: Path to the PDF.
mode: 'single' or 'elements'.
strategy: 'hi_res', 'ocr_only', 'fast' or 'auto' for parsing method.
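
Since mode and strategy are the options described, the sketch below assumes UnstructuredPDFLoader from langchain-community, with the unstructured package and its PDF dependencies installed. The strategy check is our own addition and simply mirrors the values listed above.

```python
STRATEGIES = ("hi_res", "ocr_only", "fast", "auto")


def load_pdf(path, mode="elements", strategy="fast"):
    """Parse a PDF into Documents using the chosen mode and strategy."""
    if strategy not in STRATEGIES:
        # Catch typos before the (potentially slow) parser starts.
        raise ValueError(f"strategy must be one of {STRATEGIES}, got {strategy!r}")

    # Optional dependency, imported only when the function is called.
    from langchain_community.document_loaders import UnstructuredPDFLoader

    # Extra keyword arguments such as strategy are forwarded to the PDF parser.
    loader = UnstructuredPDFLoader(str(path), mode=mode, strategy=strategy)
    return loader.load()
```

The "fast" strategy extracts embedded text directly, while "hi_res" and "ocr_only" are slower options intended for scanned or layout-heavy files.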

7. Loading Multiple Files

We can load all files in a directory using DirectoryLoader.

glob: File pattern to match.
loader_cls: Loader class for matched files.
loader_kwargs: Arguments for the loader class.
show_progress: Show loading progress.
use_multithreading: Enable multithreaded loading.
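
A minimal sketch of these options, assuming langchain-community is installed. It pairs DirectoryLoader with TextLoader so that no extra parsing packages are needed; the folder contents are made up. show_progress is left off here because the progress bar relies on the tqdm package.

```python
from pathlib import Path


def make_sample_directory(directory):
    """Create two .txt files and one .md file in the given directory."""
    root = Path(directory)
    (root / "a.txt").write_text("first file", encoding="utf-8")
    (root / "b.txt").write_text("second file", encoding="utf-8")
    (root / "notes.md").write_text("# skipped by the glob", encoding="utf-8")
    return root


def load_text_directory(root):
    """Load every .txt file under root as a Document."""
    # Optional dependency, imported only when the function is called.
    from langchain_community.document_loaders import DirectoryLoader, TextLoader

    loader = DirectoryLoader(
        str(root),
        glob="**/*.txt",                       # only .txt files, at any depth
        loader_cls=TextLoader,                 # loader used for each matched file
        loader_kwargs={"encoding": "utf-8"},   # passed to TextLoader
        show_progress=False,
        use_multithreading=True,
    )
    return loader.load()
```

With multithreading enabled the order of the returned documents is not guaranteed, so sort by metadata["source"] if a stable order matters.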

8. Wikipedia

WikipediaLoader fetches articles based on search queries.

query: Search term for Wikipedia articles.
load_max_docs: Maximum number of articles to load.
doc_content_chars_max: Maximum characters per article content.
load_all_available_meta: Include metadata like categories, references and image URLs.
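
A minimal sketch, assuming langchain-community and the wikipedia package are installed and that network access is available. The default limits and the check on max_docs are our own choices for illustration.

```python
def load_wikipedia(query, max_docs=2, max_chars=2000):
    """Search Wikipedia and return up to max_docs articles as Documents."""
    if max_docs < 1:
        # Avoid a pointless request when no articles are wanted.
        raise ValueError("max_docs must be at least 1")

    # Optional dependency, imported only when the function is called.
    from langchain_community.document_loaders import WikipediaLoader

    loader = WikipediaLoader(
        query=query,
        load_max_docs=max_docs,
        doc_content_chars_max=max_chars,  # truncate long articles
        load_all_available_meta=False,    # keep only the basic metadata
    )
    return loader.load()
```

For example, load_wikipedia("LangChain", max_docs=1) would return at most one Document, with the article text cut off at 2000 characters.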

With these loaders, we can bring many types of documents into LangChain and then use them for tasks such as text analysis, question answering, summarization and retrieval-based applications.


URL: https://www.geeksforgeeks.org/artificial-intelligence/langchain-document-loaders/