Rag Content Chunker

Pricing

from $0.50 / 1,000 results

Turn raw text, Markdown, or Apify datasets into token-perfect RAG chunks with deterministic IDs, source metadata, and a billing-ready summary—ready for embeddings or vector DBs without extra glue code.

Developer: mick_ (maintained by Community)

Last modified: 3 months ago

🆕 New Feature: Bulk File Upload

Already have a document? Upload it directly to Apify Storage and point the chunker at it — no crawling needed.

How to upload a file:

Go to Apify Console → Storage → Key-Value Stores
Click + Create new store → give it a name
Click + Add record → upload your .txt or .md file
Find your file → click the 🔗 icon to copy the direct URL

Make sure the URL starts with api.apify.com — not console.apify.com. The console URL is the web page, not the file.
Paste the URL into the file_url field and run

Example input:

{
  "file_url": "https://api.apify.com/v2/key-value-stores/YOUR_STORE_ID/records/document.txt",
  "strategy": "markdown",
  "chunk_size": 512,
  "chunk_overlap": 64
}

Or skip storage entirely — paste text or Markdown directly:

{
  "text": "# My Document\n\nThis is my content...",
  "strategy": "markdown",
  "chunk_size": 512
}

Supported file formats: .txt, .md, .markdown, .html, .pdf
Max file size: 5MB
URL requirements: Must be a public HTTPS URL. Apify Storage, S3, Dropbox (shared public link), or GitHub raw URLs all work.

PDF note: Text-based PDFs are supported. Scanned/image-only PDFs have no text layer and will fail — convert them with OCR first. Office documents (.docx, .xlsx) are not yet supported.
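The URL rules above can be checked before a run. The following is a minimal sketch, assuming the rules are exactly the ones listed in this section; `check_file_url` and `SUPPORTED_EXTENSIONS` are hypothetical names for illustration, and the actor's own validation may differ.

```python
from urllib.parse import urlparse

# Extensions listed in the section above; the actor's own check may differ.
SUPPORTED_EXTENSIONS = (".txt", ".md", ".markdown", ".html", ".pdf")


def check_file_url(url: str) -> list[str]:
    """Return a list of problems found in a candidate file_url (empty if none)."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("URL must use HTTPS")
    if parsed.hostname == "console.apify.com":
        problems.append("console.apify.com is the web page; use the api.apify.com URL")
    if not parsed.path.lower().endswith(SUPPORTED_EXTENSIONS):
        problems.append("unsupported file extension")
    return problems


print(check_file_url(
    "https://api.apify.com/v2/key-value-stores/YOUR_STORE_ID/records/document.txt"
))
```

An empty list means the URL passes these three checks. It does not confirm that the file is publicly readable or under 5MB.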

Features

Three chunking strategies: recursive (general), markdown (header-aware), sentence (boundary-preserving)
Token-aware splitting using tiktoken cl100k_base (compatible with OpenAI embeddings and GPT-4)
Deterministic chunk IDs (SHA-256) for incremental vector DB updates
Three input modes: direct text, file URL, or dataset chaining from any crawler
Dot-notation support for nested dataset fields (e.g., metadata.content)
Configurable chunk size (64-8192 tokens) and overlap (0-2048 tokens)
Input validation and sanitization (size limits, control char stripping, injection prevention)
No external API calls, no API keys required -- pure local computation
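To illustrate why deterministic IDs help with incremental vector DB updates, here is a minimal sketch of one way to derive a stable ID with SHA-256. The fields that go into the hash and the 16-character truncation are assumptions (the truncation mirrors the length of the example chunk_id shown later); the actor's actual recipe is not documented here, and `make_chunk_id` is a hypothetical name.

```python
import hashlib


def make_chunk_id(source: str, chunk_index: int, text: str, length: int = 16) -> str:
    """Derive a stable ID from the chunk's source, position, and text.

    Assumption: these three fields identify a chunk. The unit-separator
    character keeps adjacent fields from running into each other.
    """
    payload = f"{source}\x1f{chunk_index}\x1f{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]


first = make_chunk_id("https://example.com/page", 0, "# Introduction")
again = make_chunk_id("https://example.com/page", 0, "# Introduction")
print(first == again)  # the same input always yields the same ID
```

Because the ID depends only on the input, re-chunking unchanged content produces the same IDs, so an upsert keyed on the ID overwrites existing vectors instead of duplicating them.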

Requirements

Python 3.11+
Apify platform account (for running as Actor)

Install dependencies:

$ pip install -r requirements.txt

Input

Choose one of three input modes. When multiple are provided, priority order is: dataset_id → file_url → text.
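The priority order above can be expressed in a few lines. This is a sketch of the documented rule only; `select_input_mode` is a hypothetical helper, not the actor's code.

```python
def select_input_mode(actor_input: dict) -> str:
    """Return the input field that wins under the documented priority order."""
    for key in ("dataset_id", "file_url", "text"):
        if actor_input.get(key):  # skip missing or empty values
            return key
    raise ValueError("No input provided")


print(select_input_mode({"text": "# Doc", "file_url": "https://example.com/a.md"}))
```

Here file_url wins because it ranks above text and no dataset_id is present.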

Mode 1: Direct Text or Markdown

Paste content directly into the text field. Supports plain text, Markdown, and code. Best for quick tests or single documents.

{
  "text": "# My Document\n\nContent here...",
  "strategy": "markdown",
  "chunk_size": 512,
  "chunk_overlap": 64
}

Mode 2: File URL (Bulk Upload)

Provide a public HTTPS URL to a .txt, .md, .markdown, .html, or text-based .pdf file (max 5MB). Best for documents already in cloud storage or Apify Key-Value Store.

{
  "file_url": "https://api.apify.com/v2/key-value-stores/STORE_ID/records/document.md",
  "strategy": "markdown",
  "chunk_size": 512,
  "chunk_overlap": 64
}

Mode 3: Dataset Chaining

Provide an Apify dataset ID from a previous actor run (e.g., Website Content Crawler). Chunks every item in the dataset. Best for bulk web content.

{
  "dataset_id": "your-crawler-dataset-id",
  "dataset_field": "markdown",
  "strategy": "recursive",
  "chunk_size": 512,
  "chunk_overlap": 64
}
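The dataset_field value may use dot notation for nested fields, such as metadata.content. A minimal sketch of how such a path could be resolved against a dataset item follows; `get_nested_field` is a hypothetical helper, and returning None for a missing key is an assumption about the behavior.

```python
from typing import Any


def get_nested_field(item: dict, dotted_path: str) -> Any:
    """Walk a dict along a dot-separated path; return None if any key is missing."""
    current: Any = item
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


item = {"url": "https://example.com/page", "metadata": {"content": "Hello"}}
print(get_nested_field(item, "metadata.content"))
```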

Configuration

Actor Inputs

Defined in .actor/INPUT_SCHEMA.json:

text (string, optional) -- plain text or Markdown to chunk, max 500,000 characters
file_url (string, optional) -- public HTTPS URL to a .txt, .md, .markdown, .html, or text-based .pdf file, max 5MB. Takes priority over text
dataset_id (string, optional) -- Apify dataset ID from a previous actor run (e.g., Website Content Crawler). Takes priority over file_url and text
dataset_field (string, optional) -- field to read from each dataset item. Default: "text". Supports dot notation
strategy (string, optional) -- "recursive" (default), "markdown", or "sentence"
chunk_size (integer, optional) -- target chunk size in tokens, 64-8192. Default: 512
chunk_overlap (integer, optional) -- overlapping tokens between chunks, 0-2048. Default: 64

At least one of text, file_url, or dataset_id must be provided.
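The bounds on chunk_size and chunk_overlap, together with the rule that the overlap must stay below the chunk size (see Troubleshooting), can be checked before a run. This is a sketch built from the documented limits; `validate_chunk_params` is a hypothetical name, and only the last message is taken from the documented error text.

```python
def validate_chunk_params(chunk_size: int = 512, chunk_overlap: int = 64) -> None:
    """Raise ValueError if the parameters fall outside the documented limits."""
    if not 64 <= chunk_size <= 8192:
        raise ValueError("chunk_size must be between 64 and 8192")
    if not 0 <= chunk_overlap <= 2048:
        raise ValueError("chunk_overlap must be between 0 and 2048")
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be less than chunk_size")


validate_chunk_params(256, 32)  # passes silently
```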

Usage

Local (CLI)

$ APIFY_TOKEN=your-token apify run

Direct Text Input

{
  "text": "# Introduction\n\nThis is a sample document with multiple sections.\n\n## Section One\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n## Section Two\n\nUt enim ad minim veniam, quis nostrud exercitation.",
  "strategy": "markdown",
  "chunk_size": 256,
  "chunk_overlap": 32
}

Dataset Chaining (from Website Content Crawler)

{
  "dataset_id": "abc123XYZ",
  "dataset_field": "text",
  "strategy": "recursive",
  "chunk_size": 512,
  "chunk_overlap": 64
}
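The same inputs can be sent from Python. The sketch below assumes the apify-client package is installed (pip install apify-client), that APIFY_TOKEN is set in the environment, and that the run is returned as a dict with a defaultDatasetId key, as in the 1.x releases of that client. The actor ID is taken from this page's URL; `build_run_input` and `run_chunker` are hypothetical helper names.

```python
import os


def build_run_input(dataset_id: str, field: str = "text") -> dict:
    """Build the dataset-chaining input shown above."""
    return {
        "dataset_id": dataset_id,
        "dataset_field": field,
        "strategy": "recursive",
        "chunk_size": 512,
        "chunk_overlap": 64,
    }


def run_chunker(run_input: dict) -> list[dict]:
    """Start the actor, wait for it to finish, and return its dataset items."""
    # Imported here so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor("labrat011/rag-content-chunker").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (needs network access and a valid token):
# items = run_chunker(build_run_input("abc123XYZ"))
```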

Example Output

Each chunk is a separate dataset item:

{
  "chunk_id": "a1b2c3d4e5f67890",
  "chunk_index": 0,
  "text": "# Introduction\n\nThis is a sample document with multiple sections.",
  "token_count": 12,
  "source_url": "https://example.com/page",
  "page_title": "Example Page",
  "section_heading": "Introduction"
}

A summary item is appended at the end:

{
  "_summary": true,
  "total_chunks": 3,
  "total_tokens": 847,
  "strategy": "markdown",
  "chunk_size": 256,
  "chunk_overlap": 32,
  "processing_time": 0.142,
  "billing": {
    "total_chunks": 3,
    "amount": 0.0015,
    "rate_per_chunk": 0.0005
  }
}
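Because the summary shares a dataset with the chunks, downstream code usually needs to separate the two before embedding. A minimal sketch, assuming the item shapes shown above; `split_output` and `billing_is_consistent` are hypothetical helpers.

```python
import math
from typing import Optional


def split_output(items: list[dict]) -> tuple[list[dict], Optional[dict]]:
    """Separate chunk items from the trailing summary item."""
    chunks = [item for item in items if not item.get("_summary")]
    summary = next((item for item in items if item.get("_summary")), None)
    return chunks, summary


def billing_is_consistent(summary: dict) -> bool:
    """Check that amount equals total_chunks * rate_per_chunk."""
    billing = summary["billing"]
    expected = billing["total_chunks"] * billing["rate_per_chunk"]
    return math.isclose(billing["amount"], expected, rel_tol=1e-9)


items = [
    {"chunk_id": "a1b2c3d4e5f67890", "chunk_index": 0, "text": "# Introduction"},
    {"_summary": True, "total_chunks": 3,
     "billing": {"total_chunks": 3, "amount": 0.0015, "rate_per_chunk": 0.0005}},
]
chunks, summary = split_output(items)
print(len(chunks), billing_is_consistent(summary))
```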

Pipeline Position

This actor fills the chunking step in a standard RAG pipeline:

Crawl (Website Content Crawler, 101K+ users)
-> Clean (optional preprocessing)
-> Chunk (this actor)
-> Embed (OpenAI, Cohere, etc.)
-> Store (Pinecone, Qdrant, Weaviate integrations)

Chunking Strategy Guide

Strategy	Best For	How It Splits
`recursive`	General text, mixed content	Paragraphs -> sentences -> words -> hard token cuts
`markdown`	Documentation, crawled web pages	Markdown headers (h1-h6), preserves section structure
`sentence`	Conversational content, Q&A, prose	Sentence boundaries, preserves sentence integrity
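To make the markdown row concrete, here is a simplified sketch of header-based splitting. It is not the actor's implementation: it ignores token limits and overlap, and it would misread a line starting with # inside a fenced code block as a heading. `split_by_headers` is a hypothetical name.

```python
import re

# One to six '#' characters followed by whitespace and the heading text.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per h1-h6 heading."""
    sections: list[dict] = []
    heading = None
    lines: list[str] = []

    def flush() -> None:
        body = "\n".join(lines).strip()
        if body:
            sections.append({"section_heading": heading, "text": body})

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            heading = match.group(2).strip()
            lines = [line]
        else:
            lines.append(line)
    flush()
    return sections


doc = "# Introduction\n\nThis is a sample document.\n\n## Section One\n\nLorem ipsum."
for section in split_by_headers(doc):
    print(section["section_heading"])
```

Each section keeps its own heading line in the text, which matches the shape of the example output above, where the chunk text starts with "# Introduction".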

Choosing chunk_size

256 tokens -- fine-grained retrieval, higher precision, more chunks
512 tokens (default) -- balanced for most RAG use cases
1024 tokens -- broader context per chunk, fewer chunks, good for summarization
2048+ tokens -- large context windows, best with newer embedding models
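For a rough sense of how chunk_size and chunk_overlap affect the number of chunks (and therefore cost), the sketch below estimates the count for a plain sliding window. This is an approximation under a stated assumption: the real strategies split on paragraph, header, or sentence boundaries, so actual counts will differ. `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(total_tokens: int, chunk_size: int = 512,
                         chunk_overlap: int = 64) -> int:
    """Estimate chunks for a fixed-size sliding window over total_tokens."""
    if total_tokens <= 0:
        return 0
    stride = chunk_size - chunk_overlap  # new tokens covered by each extra chunk
    if stride <= 0:
        raise ValueError("chunk_overlap must be less than chunk_size")
    return max(1, math.ceil((total_tokens - chunk_overlap) / stride))


print(estimate_chunk_count(1000))           # 3 windows: 0-512, 448-960, 896-1000
print(estimate_chunk_count(10000, 256, 32))  # 45
```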

Architecture

src/agent/main.py -- Actor entry point, input handling, dataset chaining, output
src/agent/chunker.py -- Core chunking engine with three strategies and token counting
src/agent/validation.py -- Input validation, sanitization, and security checks
src/agent/pricing.py -- PPE billing calculator ($0.0005/chunk)
skill.md -- Machine-readable skill contract for agent discovery

Security

Size limits: 500K chars max text, 10K max dataset items, bounded chunk parameters
Sanitization: Strips null bytes and control characters (preserves newlines/tabs for Markdown)
Injection prevention: Dataset IDs and field names validated against strict regex patterns
No LLM calls: Pure text processing, zero prompt injection surface
No secrets: Actor requires no API keys or credentials
No outbound calls during chunking: All processing is local computation; the only network access is fetching the file_url or dataset you provide
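The sanitization rule above can be sketched with one regular expression. The exact character set the actor strips is not documented here; this version assumes ASCII control characters are removed while tab, line feed, and carriage return are kept. `sanitize_text` is a hypothetical name.

```python
import re

# ASCII control characters except \t (0x09), \n (0x0A), and \r (0x0D).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitize_text(text: str) -> str:
    """Remove null bytes and control characters, keeping tabs and newlines."""
    return _CONTROL_CHARS.sub("", text)


print(repr(sanitize_text("a\x00b\tc\nd\x07")))  # 'ab\tc\nd'
```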

Pricing

Pay-Per-Event (PPE): $0.0005 per chunk ($0.50 per 1,000 chunks).

Content Size	Approx. Chunks	Cost
Single blog post	10-20	$0.005-$0.01
10-page website	50-100	$0.025-$0.05
100-page docs site	500-1,000	$0.25-$0.50
Large knowledge base	5,000-10,000	$2.50-$5.00
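The costs in the table follow directly from the per-chunk rate. A small sketch of the arithmetic, using Decimal to avoid floating-point rounding; `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

RATE_PER_CHUNK = Decimal("0.0005")  # USD per chunk, from the pricing above


def estimate_cost(chunk_count: int) -> Decimal:
    """Return the estimated charge in USD for a given number of chunks."""
    return RATE_PER_CHUNK * chunk_count


for count in (20, 100, 1000, 10000):
    print(count, estimate_cost(count))
```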

Troubleshooting

"No input provided": Supply at least one of text, file_url, or dataset_id
"Text exceeds maximum length": Split content into batches under 500K chars, or use dataset mode
"Invalid dataset_id format": Must be alphanumeric with hyphens/underscores, 1-64 chars
"chunk_overlap must be less than chunk_size": Reduce overlap or increase chunk size
"No chunks produced": Input text may be empty or contain only whitespace/control characters
Dataset errors: Verify the dataset ID exists and the actor has access to it
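The dataset_id rule quoted above (alphanumeric with hyphens/underscores, 1-64 chars) can be checked locally before starting a run. The pattern below is written from that description and is an assumption about the actor's actual regex; `is_valid_dataset_id` is a hypothetical name.

```python
import re

# Letters, digits, hyphens, and underscores; 1 to 64 characters.
DATASET_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if the whole string matches the documented format."""
    return DATASET_ID_RE.fullmatch(dataset_id) is not None


print(is_valid_dataset_id("abc123XYZ"))   # True
print(is_valid_dataset_id("bad/id"))      # False
```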

License

See LICENSE file for details.

MCP Integration

This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.

Endpoint: https://mcp.apify.com?tools=labrat011/rag-content-chunker
Auth: Authorization: Bearer <APIFY_TOKEN>
Transport: Streamable HTTP
Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI

Example MCP config (Claude Desktop / Cursor):

{
  "mcpServers": {
    "rag-content-chunker": {
      "url": "https://mcp.apify.com?tools=labrat011/rag-content-chunker",
      "headers": {
        "Authorization": "Bearer <APIFY_TOKEN>"
      }
    }
  }
}

AI agents can use this actor to split text and documents into optimally-sized chunks for RAG pipelines, prepare content for embedding, and build retrieval-ready datasets -- all as a callable MCP tool.

URL: https://apify.com/labrat011/rag-content-chunker