Text Splitter & Chunker for RAG / LLMs

Pricing

from $5.00 / 1,000 text chunks

Split text into clean, overlapping chunks ready for embeddings, vector databases, RAG and LLM context. Configurable size, overlap, and split strategy.

Rating

0.0

(0)

Developer

Rosario Vitale

Maintained by Community

Actor stats

Last modified: 7 days ago

Why

Every RAG / LLM pipeline needs chunking, and everyone re-implements the same fiddly logic: respect paragraph and sentence boundaries, keep an overlap so context isn't lost, normalize messy whitespace, and estimate tokens. This Actor does it for you, reliably, in one call.

Features

✂️ Smart chunking — packs text up to your target size while respecting paragraph/sentence boundaries.
🔁 Overlap — keeps a configurable overlap so ideas spanning a boundary aren't lost.
🔢 Characters or tokens — size and overlap in characters or approximate tokens (~4 chars/token).
🧹 Cleaning — normalizes whitespace and collapses excessive blank lines.
📦 Batch — split many documents in a single run.
📊 Token estimate — every chunk includes charCount and approxTokens.
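
The features above can be pictured with a small sketch. This is not the Actor's source code, only an illustration of paragraph- and sentence-aware packing with overlap under simple assumptions: the function names (`clean_text`, `split_units`, `chunk_text`) are hypothetical, sentence detection is a naive regex, sizes are in characters, and the overlap is taken as the last `chunk_overlap` characters of the previous chunk.

```python
import re


def clean_text(text: str) -> str:
    """Normalize whitespace: collapse spaces/tabs and runs of blank lines."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def split_units(text: str, split_by: str) -> list[str]:
    """Break text into paragraphs or sentences (naive rules, for illustration)."""
    if split_by == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    elif split_by == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError(f"unsupported split_by: {split_by}")
    return [p.strip() for p in parts if p.strip()]


def chunk_text(text, chunk_size=1000, chunk_overlap=100, split_by="paragraph"):
    """Greedily pack units into chunks of roughly chunk_size characters.

    Each new chunk starts with the tail of the previous chunk. In this
    sketch a unit longer than chunk_size is kept whole, and the overlap
    prefix may push a chunk slightly past the target size.
    """
    units = split_units(clean_text(text), split_by)
    chunks, current = [], ""
    for unit in units:
        candidate = f"{current} {unit}" if current else unit
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            tail = current[-chunk_overlap:] if chunk_overlap > 0 else ""
            current = f"{tail} {unit}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Calling `chunk_text(document, chunk_size=1000, chunk_overlap=100)` returns a list of strings. The Actor's real boundary handling may differ, for example in how it splits an oversized paragraph.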

Input

Field	Type	Description
`text`	string	A single document to split.
`texts`	array	Multiple documents (one per item).
`chunkSize`	integer	Target chunk size. Default `1000`.
`chunkOverlap`	integer	Overlap between chunks. Default `100`.
`unit`	select	`characters` or `tokens`. Default `characters`.
`splitBy`	select	`paragraph`, `sentence` or `character`. Default `paragraph`.
`clean`	boolean	Normalize whitespace. Default `true`.

Example input

{
  "text": "Your long document text goes here...",
  "chunkSize": 1000,
  "chunkOverlap": 100,
  "unit": "characters",
  "splitBy": "paragraph",
  "clean": true
}
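
As a rough sketch of how such an input could be sent from code, the snippet below uses the `apify-client` Python package. The Actor ID is taken from this page's URL; `build_run_input` and `run_chunker` are hypothetical helper names, and the validation inside `build_run_input` is a local sanity check rather than a rule documented by the Actor.

```python
ACTOR_ID = "zenomastro/text-splitter-for-llm"  # from the Actor's page URL


def build_run_input(docs, chunk_size=1000, chunk_overlap=100,
                    unit="characters", split_by="paragraph", clean=True):
    """Assemble an input object using the field names from the Input table."""
    if unit not in ("characters", "tokens"):
        raise ValueError("unit must be 'characters' or 'tokens'")
    if split_by not in ("paragraph", "sentence", "character"):
        raise ValueError("splitBy must be 'paragraph', 'sentence' or 'character'")
    # Assumption: an overlap as large as the chunk itself makes no sense.
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunkOverlap should be >= 0 and smaller than chunkSize")
    run_input = {
        "chunkSize": chunk_size,
        "chunkOverlap": chunk_overlap,
        "unit": unit,
        "splitBy": split_by,
        "clean": clean,
    }
    # A single string goes into `text`, several documents into `texts`.
    if isinstance(docs, str):
        run_input["text"] = docs
    else:
        run_input["texts"] = list(docs)
    return run_input


def run_chunker(token, run_input):
    """Start the Actor, wait for it to finish, and return its dataset items."""
    from apify_client import ApifyClient  # third-party: pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

With a valid API token, `run_chunker(token, build_run_input("Your long document text goes here..."))` would return the chunk items as a list of dictionaries.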

Output

One dataset item per chunk:

{
  "sourceIndex": 0,
  "chunkIndex": 0,
  "totalChunks": 3,
  "text": "Retrieval-Augmented Generation (RAG) combines a language model ...",
  "charCount": 312,
  "approxTokens": 78
}

Export as JSON, CSV, or Excel, or pull via the Apify API — then send the chunks straight to your embeddings model or vector DB.
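
Once the items are downloaded, a little glue code is usually enough to prepare them for an embeddings model or vector DB. The sketch below assumes the items have the fields shown in the output example; `group_chunks`, `to_records`, and the `doc{n}-chunk{m}` ID format are hypothetical choices, not something the Actor prescribes.

```python
from collections import defaultdict


def group_chunks(items):
    """Return {sourceIndex: [chunk texts in chunkIndex order]}."""
    by_source = defaultdict(list)
    for item in items:
        by_source[item["sourceIndex"]].append(item)
    return {
        source: [c["text"] for c in sorted(chunks, key=lambda c: c["chunkIndex"])]
        for source, chunks in by_source.items()
    }


def to_records(items):
    """Shape each chunk as an id/text/metadata record for a vector store."""
    return [
        {
            # Hypothetical ID scheme built from the two index fields.
            "id": f"doc{item['sourceIndex']}-chunk{item['chunkIndex']}",
            "text": item["text"],
            "metadata": {
                "sourceIndex": item["sourceIndex"],
                "chunkIndex": item["chunkIndex"],
                "totalChunks": item["totalChunks"],
                "approxTokens": item["approxTokens"],
            },
        }
        for item in items
    ]
```

The `text` field of each record is what would be passed to the embeddings model; the metadata lets a retrieved chunk be traced back to its document and position.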

Common use cases

Prepare documents for embeddings + vector search (Pinecone, Qdrant, Weaviate, pgvector).
Build RAG context for ChatGPT/Claude apps.
Fit long content into LLM context windows.
Pairs perfectly with PDF to Structured Data — extract text from PDFs, then chunk it here.

Notes

Token counts are an estimate (~4 characters per token); exact tokenization depends on the model.
In `character` split mode the text is hard-cut at the size boundary; the `paragraph` and `sentence` modes respect natural boundaries.
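
Both notes can be illustrated in a few lines. This is a sketch under stated assumptions, not the Actor's code: the estimate divides the character count by 4 and rounds up (the rounding rule is a guess, though it agrees with the output example, where 312 characters give 78 tokens), and `hard_cut` shows one plausible way to cut at fixed offsets with overlap. All function names are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # the ~4 characters per token heuristic from the notes


def approx_tokens(text: str) -> int:
    """Estimate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def tokens_to_chars(tokens: int) -> int:
    """Convert a size given in approximate tokens to characters."""
    return tokens * CHARS_PER_TOKEN


def hard_cut(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut text at fixed character offsets, ignoring word boundaries."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```

For example, `hard_cut("abcdefghij", 4, 1)` yields `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one and words are not respected. For exact token counts, use the tokenizer of the target model instead of the heuristic.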

You might also like

RAG-Ready Markdown Converter & Chunker

foxpink/apify-rag-markdown-chunker

Convert raw HTML/text into clean Markdown and split into ready-to-ingest chunks for RAG pipelines, Vector DBs, and LLM fine-tuning workflows.

Nguyễn Anh Duy

4.7

RAG-Ready Web Scraper & Smart Chunker for AI Knowledge Bases

adinfosys-labs/rag-ready-web-scraper-smart-chunker-for-ai-knowledge-bases

RAG-ready web scraper that collects, cleans, deduplicates, filters, and chunks web content into structured datasets for AI pipelines. Generates high-quality knowledge-base data optimized for LLMs, embeddings, and vector databases.

Artashes Arakelyan

Rag Content Chunker

labrat011/rag-content-chunker

Turn raw text, Markdown, or Apify datasets into token-perfect RAG chunks with deterministic IDs, source metadata, and a billing-ready summary—ready for embeddings or vector DBs without extra glue code.

mick_

RAG Text Chunker — heading & sentence aware, Japanese ready

shoebill-dev27/rag-text-chunker

Split Markdown or plain text into retrieval-ready chunks for RAG pipelines: cuts at headings, packs whole sentences up to a size limit with optional overlap, and tags every chunk with its heading breadcrumb. Handles Japanese sentence boundaries. No LLM cost.

Shinobu Otani

Rag Embedding Generator

labrat011/rag-embedding-generator

Generate vector embeddings from text or chunked datasets using OpenAI or Cohere. Chains with RAG Content Chunker for end-to-end RAG pipelines. Outputs raw vectors ready for any vector database.

mick_

PDF to Text API | Document Extraction for LLMs & RAG

andok/pdf-text-converter

Convert bulk PDF documents via URL into clean, raw text. The perfect document scraper for LLMs, vector databases, and RAG pipelines.

Andok

Website to Text & Markdown — AI / RAG Content Crawler

inexhaustible_glass/rag-website-crawler

Scrape any website into clean text & Markdown with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant) and AI chatbots. Also extracts linked PDF/Word/Excel. Anti-block, robots.txt-aware. Website-to-text for beginners, full RAG pipeline for pros. CPU only, no API key.

Hitman studio

AI Context Fetcher: Clean Text for RAG

sarvesh_bijawe/ai-context-fetcher-clean-text-for-rag

Instantly extracts clean, ad-free text from any URL. Designed for AI Agents, RAG pipelines, and LLM context windows.

Sarvesh Bijawe

Website to Markdown for LLM and RAG

jeweled_jockstrap/my-actor-3

Convert any URL to clean Markdown text for AI applications. Strips HTML and extracts content for LLM training, RAG pipelines, and vector databases. Free Firecrawl alternative.

Juan Triviño

Docs To Rag

gabrielaxy/docs-to-rag

Transform documentation websites into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.

Gabriel Antony Xaviour

URL: https://apify.com/zenomastro/text-splitter-for-llm
