
URL: https://apify.com/labrat011/rag-content-chunker



Pricing

from $0.50 / 1,000 results


Rag Content Chunker

Turn raw text, Markdown, or Apify datasets into token-perfect RAG chunks with deterministic IDs, source metadata, and a billing-ready summary, ready for embeddings or vector DBs without extra glue code.


Developer

mick_

Maintained by Community


Last modified: 3 months ago


Apify Actor that splits text and Markdown into optimally-sized, token-counted chunks for RAG pipelines. Supports recursive, Markdown-aware, and sentence-based chunking strategies. Outputs flat chunk objects with deterministic IDs, ready for embedding and vector DB ingestion. MCP-ready for AI agent integration.


🆕 New Feature: Bulk File Upload

Already have a document? Upload it directly to Apify Storage and point the chunker at it; no crawling needed.

How to upload a file:

  1. Go to Apify Console → Storage → Key-Value Stores
  2. Click + Create new store → give it a name
  3. Click + Add record → upload your .txt or .md file
  4. Find your file → click the 🔗 icon to copy the direct URL

    Make sure the URL starts with api.apify.com, not console.apify.com. The console URL is the web page, not the file.

  5. Paste the URL into the file_url field and run

Example input:

{
  "file_url": "https://api.apify.com/v2/key-value-stores/YOUR_STORE_ID/records/document.txt",
  "strategy": "markdown",
  "chunk_size": 512,
  "chunk_overlap": 64
}

Or skip storage entirely and paste text or Markdown directly:

{
  "text": "# My Document\n\nThis is my content...",
  "strategy": "markdown",
  "chunk_size": 512
}

Supported file formats: .txt, .md, .markdown, .html, .pdf
Max file size: 5MB
URL requirements: Must be a public HTTPS URL. Apify Storage, S3, Dropbox (shared public link), or GitHub raw URLs all work.

PDF note: Text-based PDFs are supported. Scanned/image-only PDFs have no text layer and will fail β€” convert them with OCR first. Office documents (.docx, .xlsx) are not yet supported.
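The URL rules above can be checked before starting a run. The following is a minimal pre-flight sketch, not the actor's own validation: the function name check_file_url and its messages are hypothetical, and the extension check assumes the file extension appears at the end of the URL path.

```python
from urllib.parse import urlparse

# Extensions taken from the "Supported file formats" note above.
SUPPORTED_EXTENSIONS = (".txt", ".md", ".markdown", ".html", ".pdf")


def check_file_url(url: str) -> list[str]:
    """Return a list of problems with a candidate file_url (empty if none found)."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("URL must use HTTPS")
    if parsed.hostname == "console.apify.com":
        # The console URL is the web page, not the stored record.
        problems.append("use the api.apify.com record URL, not console.apify.com")
    if not parsed.path.lower().endswith(SUPPORTED_EXTENSIONS):
        problems.append("unsupported file extension")
    return problems
```

An empty list means the URL passes these three checks; it does not guarantee the file is public or under 5MB.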


Features

  • Three chunking strategies: recursive (general), markdown (header-aware), sentence (boundary-preserving)
  • Token-aware splitting using tiktoken cl100k_base (compatible with OpenAI embeddings and GPT-4)
  • Deterministic chunk IDs (SHA-256) for incremental vector DB updates
  • Three input modes: direct text, file URL, or dataset chaining from any crawler
  • Dot-notation support for nested dataset fields (e.g., metadata.content)
  • Configurable chunk size (64-8192 tokens) and overlap (0-2048 tokens)
  • Input validation and sanitization (size limits, control char stripping, injection prevention)
  • No external API calls, no API keys required -- pure local computation
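To illustrate why deterministic chunk IDs help with incremental vector DB updates, here is a minimal sketch. The exact fields the actor hashes are not documented here, so make_chunk_id is a hypothetical scheme: it hashes the source, chunk index, and text with SHA-256 and truncates to 16 hex characters (the length seen in the example output).

```python
import hashlib


def make_chunk_id(source: str, chunk_index: int, text: str, length: int = 16) -> str:
    """Hypothetical ID scheme: SHA-256 over source, index, and text, truncated."""
    payload = f"{source}\x00{chunk_index}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]


def diff_chunk_ids(stored: set[str], fresh: set[str]) -> tuple[set[str], set[str]]:
    """Return (ids to upsert, ids to delete) when re-chunking a source."""
    return fresh - stored, stored - fresh
```

Because identical input always yields the same ID, unchanged chunks fall out of both sets and do not need to be re-embedded.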

Requirements

  • Python 3.11+
  • Apify platform account (for running as Actor)

Install dependencies:

$ pip install -r requirements.txt

Input

Choose one of three input modes. When multiple are provided, priority order is: dataset_id → file_url → text.
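The priority order can be sketched as a small lookup. This is an illustration of the documented order only; resolve_input_mode is a hypothetical helper, and treating blank strings as "not provided" is an assumption.

```python
def resolve_input_mode(actor_input: dict) -> str:
    """Return the input key that wins under the documented priority order."""
    for key in ("dataset_id", "file_url", "text"):
        value = actor_input.get(key)
        # Assumption: empty or whitespace-only strings count as missing.
        if isinstance(value, str) and value.strip():
            return key
    raise ValueError("No input provided")
```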

Mode 1: Direct Text or Markdown

Paste content directly into the text field. Supports plain text, Markdown, and code. Best for quick tests or single documents.

{
  "text": "# My Document\n\nContent here...",
  "strategy": "markdown",
  "chunk_size": 512,
  "chunk_overlap": 64
}

Mode 2: File URL (Bulk Upload)

Provide a public HTTPS URL to a .txt, .md, .markdown, .html, or text-based .pdf file (max 5MB). Best for documents already in cloud storage or Apify Key-Value Store.

{
  "file_url": "https://api.apify.com/v2/key-value-stores/STORE_ID/records/document.md",
  "strategy": "markdown",
  "chunk_size": 512,
  "chunk_overlap": 64
}

Mode 3: Dataset Chaining

Provide an Apify dataset ID from a previous actor run (e.g., Website Content Crawler). Chunks every item in the dataset. Best for bulk web content.

{
  "dataset_id": "your-crawler-dataset-id",
  "dataset_field": "markdown",
  "strategy": "recursive",
  "chunk_size": 512,
  "chunk_overlap": 64
}

Configuration

Actor Inputs

Defined in .actor/INPUT_SCHEMA.json:

  • text (string, optional) -- plain text or Markdown to chunk, max 500,000 characters
  • file_url (string, optional) -- public HTTPS URL to a .txt, .md, .markdown, .html, or text-based .pdf file, max 5MB. Takes priority over text
  • dataset_id (string, optional) -- Apify dataset ID from a previous actor run (e.g., Website Content Crawler). Takes priority over file_url and text
  • dataset_field (string, optional) -- field to read from each dataset item. Default: "text". Supports dot notation
  • strategy (string, optional) -- "recursive" (default), "markdown", or "sentence"
  • chunk_size (integer, optional) -- target chunk size in tokens, 64-8192. Default: 512
  • chunk_overlap (integer, optional) -- overlapping tokens between chunks, 0-2048. Default: 64

At least one of text, file_url, or dataset_id must be provided.
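Dot notation in dataset_field means a value such as metadata.content is resolved by walking nested objects. A minimal sketch of that lookup follows; get_nested_field is a hypothetical helper, and returning a default for missing keys is an assumption about how missing fields might be handled.

```python
from typing import Any


def get_nested_field(item: dict, path: str, default: Any = None) -> Any:
    """Resolve a dot-notation path such as 'metadata.content' in a dataset item."""
    current: Any = item
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current
```

For example, get_nested_field({"metadata": {"content": "hi"}}, "metadata.content") walks into metadata and returns the content value.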

Usage

Local (CLI)

$ APIFY_TOKEN=your-token apify run

Direct Text Input

{
  "text": "# Introduction\n\nThis is a sample document with multiple sections.\n\n## Section One\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n## Section Two\n\nUt enim ad minim veniam, quis nostrud exercitation.",
  "strategy": "markdown",
  "chunk_size": 256,
  "chunk_overlap": 32
}

Dataset Chaining (from Website Content Crawler)

{
  "dataset_id": "abc123XYZ",
  "dataset_field": "text",
  "strategy": "recursive",
  "chunk_size": 512,
  "chunk_overlap": 64
}
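The same inputs can also be sent from Python. The sketch below assumes the apify-client package is installed and that the actor ID is labrat011/rag-content-chunker (taken from the store URL); build_run_input and run_chunker are hypothetical helper names, and the call requires a valid Apify token and network access.

```python
def build_run_input(text: str, strategy: str = "recursive",
                    chunk_size: int = 512, chunk_overlap: int = 64) -> dict:
    """Build a direct-text input object using the documented field names."""
    return {
        "text": text,
        "strategy": strategy,
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
    }


def run_chunker(token: str, run_input: dict) -> list[dict]:
    """Start the actor, wait for it to finish, and return all dataset items."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor("labrat011/rag-content-chunker").call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A typical call would be run_chunker(token, build_run_input("# Title\n\nBody", strategy="markdown")), where token is your own Apify API token.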

Example Output

Each chunk is a separate dataset item:

{
  "chunk_id": "a1b2c3d4e5f67890",
  "chunk_index": 0,
  "text": "# Introduction\n\nThis is a sample document with multiple sections.",
  "token_count": 12,
  "source_url": "https://example.com/page",
  "page_title": "Example Page",
  "section_heading": "Introduction"
}

A summary item is appended at the end:

{
  "_summary": true,
  "total_chunks": 3,
  "total_tokens": 847,
  "strategy": "markdown",
  "chunk_size": 256,
  "chunk_overlap": 32,
  "processing_time": 0.142,
  "billing": {
    "total_chunks": 3,
    "amount": 0.0015,
    "rate_per_chunk": 0.0005
  }
}

Pipeline Position

This actor fills the chunking step in a standard RAG pipeline:

Crawl (Website Content Crawler, 101K+ users)
-> Clean (optional preprocessing)
-> Chunk (this actor)
-> Embed (OpenAI, Cohere, etc.)
-> Store (Pinecone, Qdrant, Weaviate integrations)

Chunking Strategy Guide

Strategy | Best For | How It Splits
recursive | General text, mixed content | Paragraphs -> sentences -> words -> hard token cuts
markdown | Documentation, crawled web pages | Markdown headers (h1-h6), preserves section structure
sentence | Conversational content, Q&A, prose | Sentence boundaries, preserves sentence integrity
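The recursive strategy's fallback chain can be sketched as follows. This is an illustration of the general technique, not the actor's code: count_tokens is a stand-in that counts whitespace-separated words rather than tiktoken tokens, so the "words" and "hard token cut" steps collapse into one word-window cut, overlap is omitted, and merged pieces are joined with a single space.

```python
import re

# Coarse-to-fine separators: paragraph breaks, then sentence boundaries.
SEPARATORS = [r"\n\n", r"(?<=[.!?])\s+"]


def count_tokens(text: str) -> int:
    """Stand-in counter (assumption): whitespace-separated words, not real tokens."""
    return len(text.split())


def recursive_split(text: str, chunk_size: int, level: int = 0) -> list[str]:
    """Split text so that every returned piece has at most chunk_size tokens."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= chunk_size:
        return [text]
    if level >= len(SEPARATORS):
        # No finer separator left: hard cut into fixed-size word windows.
        words = text.split()
        return [" ".join(words[i:i + chunk_size])
                for i in range(0, len(words), chunk_size)]
    pieces: list[str] = []
    for part in re.split(SEPARATORS[level], text):
        pieces.extend(recursive_split(part, chunk_size, level + 1))
    # Greedily merge neighbouring pieces back together up to the size limit.
    merged: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and count_tokens(candidate) > chunk_size:
            merged.append(current)
            current = piece
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged
```

The key design point is that coarse boundaries are tried first, so a chunk is only cut mid-sentence when a single sentence is itself larger than chunk_size.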

Choosing chunk_size

  • 256 tokens -- fine-grained retrieval, higher precision, more chunks
  • 512 tokens (default) -- balanced for most RAG use cases
  • 1024 tokens -- broader context per chunk, fewer chunks, good for summarization
  • 2048+ tokens -- large context windows, best with newer embedding models
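To get a feel for how chunk_size and chunk_overlap trade off against chunk count, a rough estimate can be computed from the total token count. The sketch below assumes idealized fixed-size sliding windows; the actual strategies respect paragraph, header, and sentence boundaries, so real counts will usually be somewhat higher. estimate_chunk_count is a hypothetical helper.

```python
import math


def estimate_chunk_count(total_tokens: int, chunk_size: int = 512,
                         chunk_overlap: int = 64) -> int:
    """Estimate chunk count for fixed-size windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be less than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    # Each window after the first advances by (chunk_size - chunk_overlap) tokens.
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((total_tokens - chunk_size) / step)
```

For a 10,000-token document this gives 23 chunks at 512/64 and 45 chunks at 256/32, which shows how halving chunk_size roughly doubles the chunk count.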

Architecture

  • src/agent/main.py -- Actor entry point, input handling, dataset chaining, output
  • src/agent/chunker.py -- Core chunking engine with three strategies and token counting
  • src/agent/validation.py -- Input validation, sanitization, and security checks
  • src/agent/pricing.py -- PPE billing calculator ($0.0005/chunk)
  • skill.md -- Machine-readable skill contract for agent discovery

Security

  • Size limits: 500K chars max text, 10K max dataset items, bounded chunk parameters
  • Sanitization: Strips null bytes and control characters (preserves newlines/tabs for Markdown)
  • Injection prevention: Dataset IDs and field names validated against strict regex patterns
  • No LLM calls: Pure text processing, zero prompt injection surface
  • No secrets: Actor requires no API keys or credentials
  • No network calls during chunking: all chunk processing is local computation (file_url and dataset_id inputs still have to be fetched first)
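The sanitization and ID-validation rules above can be approximated with two regular expressions. This is a sketch under stated assumptions, not the contents of validation.py: the dataset ID pattern follows the troubleshooting note (alphanumeric with hyphens/underscores, 1-64 chars), and preserving carriage returns alongside newlines and tabs is an assumption.

```python
import re

# Control characters except tab (\x09), newline (\x0a), and carriage return (\x0d).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

# Alphanumeric plus hyphens and underscores, 1 to 64 characters.
_DATASET_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def sanitize_text(text: str) -> str:
    """Strip null bytes and control characters, keeping newlines and tabs."""
    return _CONTROL_CHARS.sub("", text)


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Check a dataset ID against the strict allow-list pattern."""
    return _DATASET_ID.fullmatch(dataset_id) is not None
```

Using fullmatch with an allow-list means anything outside the permitted character set is rejected outright rather than escaped.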

Pricing

Pay-Per-Event (PPE): $0.0005 per chunk ($0.50 per 1,000 chunks).

Content Size | Approx. Chunks | Cost
Single blog post | 10-20 | $0.005-$0.01
10-page website | 50-100 | $0.025-$0.05
100-page docs site | 500-1,000 | $0.25-$0.50
Large knowledge base | 5,000-10,000 | $2.50-$5.00
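The costs in the table follow directly from the per-chunk rate. A minimal sketch of the arithmetic, assuming the $0.0005 rate stated above stays unchanged (estimate_cost is a hypothetical helper):

```python
RATE_PER_CHUNK = 0.0005  # USD per chunk, from the pricing line above


def estimate_cost(chunk_count: int) -> float:
    """Return the estimated charge in USD for a given number of chunks."""
    return round(chunk_count * RATE_PER_CHUNK, 4)
```

For example, the 3-chunk run in the example summary comes to 0.0015, and 1,000 chunks come to 0.5.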

Troubleshooting

  • "No input provided": Supply at least one of text, file_url, or dataset_id
  • "Text exceeds maximum length": Split content into batches under 500K chars, or use dataset mode
  • "Invalid dataset_id format": Must be alphanumeric with hyphens/underscores, 1-64 chars
  • "chunk_overlap must be less than chunk_size": Reduce overlap or increase chunk size
  • "No chunks produced": Input text may be empty or contain only whitespace/control characters
  • Dataset errors: Verify the dataset ID exists and the actor has access to it
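Several of these errors can be caught before a run is started. The sketch below checks the chunk parameters against the documented ranges; validate_chunk_params is a hypothetical helper, and only the overlap message is taken from the list above (the two range messages are invented for illustration).

```python
def validate_chunk_params(chunk_size: int = 512, chunk_overlap: int = 64) -> None:
    """Raise ValueError if the chunk parameters fall outside the documented bounds."""
    if not 64 <= chunk_size <= 8192:
        # Hypothetical message; the documented range is 64-8192 tokens.
        raise ValueError("chunk_size must be between 64 and 8192")
    if not 0 <= chunk_overlap <= 2048:
        # Hypothetical message; the documented range is 0-2048 tokens.
        raise ValueError("chunk_overlap must be between 0 and 2048")
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be less than chunk_size")
```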

License

See LICENSE file for details.


MCP Integration

This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.

  • Endpoint: https://mcp.apify.com?tools=labrat011/rag-content-chunker
  • Auth: Authorization: Bearer <APIFY_TOKEN>
  • Transport: Streamable HTTP
  • Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI

Example MCP config (Claude Desktop / Cursor):

{
  "mcpServers": {
    "rag-content-chunker": {
      "url": "https://mcp.apify.com?tools=labrat011/rag-content-chunker",
      "headers": {
        "Authorization": "Bearer <APIFY_TOKEN>"
      }
    }
  }
}

AI agents can use this actor to split text and documents into optimally-sized chunks for RAG pipelines, prepare content for embedding, and build retrieval-ready datasets -- all as a callable MCP tool.
