URL: https://apify.com/zenomastro/text-splitter-for-llm

Text Splitter & Chunker for RAG / LLMs

Pricing

from $5.00 / 1,000 "text chunked" events

Split text into clean, overlapping chunks ready for embeddings, vector databases, RAG and LLM context. Configurable size, overlap, and split strategy.

Rating: 0.0 (0)

Developer: Rosario Vitale (Maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 0 monthly active users, last modified 7 days ago

Split any text into clean, overlapping chunks that are ready for embeddings, vector databases, RAG pipelines and LLM context windows, without writing your own splitter.

Paste text (or send many documents), pick a chunk size and overlap, and get back tidy chunks with character counts and approximate token counts as JSON or CSV.

Why

Every RAG / LLM pipeline needs chunking, and everyone re-implements the same fiddly logic: respect paragraph and sentence boundaries, keep an overlap so context isn't lost, normalize messy whitespace, and estimate tokens. This Actor does it for you, reliably, in one call.

Features

  • βœ‚οΈ Smart chunking β€” packs text up to your target size while respecting paragraph/sentence boundaries.
  • πŸ” Overlap β€” keeps a configurable overlap so ideas spanning a boundary aren't lost.
  • πŸ”’ Characters or tokens β€” size and overlap in characters or approximate tokens (~4 chars/token).
  • 🧹 Cleaning β€” normalizes whitespace and collapses excessive blank lines.
  • πŸ“¦ Batch β€” split many documents in a single run.
  • πŸ“Š Token estimate β€” every chunk includes charCount and approxTokens.
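
The packing and overlap behaviour listed above can be illustrated with a minimal Python sketch. This is not the Actor's actual implementation: the function names, the greedy packing rule, and the fallback of hard-cutting oversized paragraphs are all assumptions made for illustration. Note that the overlap tail is sliced by character count, so it may start mid-word.

```python
import re


def clean_text(text: str) -> str:
    """Normalize whitespace: collapse runs of spaces/tabs and excess blank lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)       # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)      # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)    # at most one blank line in a row
    return text.strip()


def chunk_paragraphs(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Greedily pack paragraphs into chunks of at most chunk_size characters.

    Each new chunk is seeded with the last `overlap` characters of the
    previous chunk (hypothetical strategy, chosen for simplicity).
    """
    room = chunk_size - overlap - 2  # space left for a piece after tail + separator
    if overlap < 0 or room <= 0:
        raise ValueError("need 0 <= overlap and overlap + 2 < chunk_size")

    chunks: list[str] = []
    current = ""
    for para in (p for p in clean_text(text).split("\n\n") if p):
        # Paragraphs longer than `room` are hard-cut so every piece can fit.
        for i in range(0, len(para), room):
            piece = para[i:i + room]
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
                continue
            chunks.append(current)
            tail = current[-overlap:] if overlap else ""
            current = f"{tail}\n\n{piece}" if tail else piece
    if current:
        chunks.append(current)
    return chunks
```

With three 40-character paragraphs, `chunk_size=100` and `overlap=10`, the first two paragraphs share a chunk and the third starts a new chunk that begins with the last 10 characters of the previous one.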

Input

Field         Type     Description
text          string   A single document to split.
texts         array    Multiple documents (one per item).
chunkSize     integer  Target chunk size. Default 1000.
chunkOverlap  integer  Overlap between chunks. Default 100.
unit          select   characters or tokens. Default characters.
splitBy       select   paragraph, sentence or character. Default paragraph.
clean         boolean  Normalize whitespace. Default true.

Example input

{
  "text": "Your long document text goes here...",
  "chunkSize": 1000,
  "chunkOverlap": 100,
  "unit": "characters",
  "splitBy": "paragraph",
  "clean": true
}
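
As a rough sketch of how such an input might be built and sent from Python, the snippet below validates the documented fields and then starts the Actor with the `apify-client` package. Several things here are assumptions rather than facts from this page: the Actor ID is inferred from the store URL, the `APIFY_TOKEN` environment variable name is a convention, the rule that at least one of `text` or `texts` must be present is a guess, and the helper names are hypothetical.

```python
import os

# Assumption: the Actor ID matches the path of its Apify Store URL.
ACTOR_ID = "zenomastro/text-splitter-for-llm"

UNITS = {"characters", "tokens"}
SPLIT_MODES = {"paragraph", "sentence", "character"}


def build_run_input(text=None, texts=None, chunk_size=1000, chunk_overlap=100,
                    unit="characters", split_by="paragraph", clean=True):
    """Build the Actor input dict, using the defaults from the input table."""
    if unit not in UNITS:
        raise ValueError(f"unit must be one of {sorted(UNITS)}")
    if split_by not in SPLIT_MODES:
        raise ValueError(f"splitBy must be one of {sorted(SPLIT_MODES)}")
    if text is None and not texts:
        # Assumption: the Actor needs at least one document to work on.
        raise ValueError("provide text or texts")
    run_input = {
        "chunkSize": chunk_size,
        "chunkOverlap": chunk_overlap,
        "unit": unit,
        "splitBy": split_by,
        "clean": clean,
    }
    if text is not None:
        run_input["text"] = text
    if texts:
        run_input["texts"] = list(texts)
    return run_input


def run_chunker(run_input, token=None):
    """Run the Actor and return its dataset items (requires `pip install apify-client`)."""
    from apify_client import ApifyClient  # imported lazily so validation works offline

    client = ApifyClient(token or os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Usage would look like `chunks = run_chunker(build_run_input(text=document))`, where `document` is your own string.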

Output

One dataset item per chunk:

{
  "sourceIndex": 0,
  "chunkIndex": 0,
  "totalChunks": 3,
  "text": "Retrieval-Augmented Generation (RAG) combines a language model ...",
  "charCount": 312,
  "approxTokens": 78
}

Export as JSON, CSV, or Excel, or pull via the Apify API, then send the chunks straight to your embeddings model or vector DB.
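
Before sending chunks to an embeddings model, it can help to restore document order and attach stable IDs. The sketch below does this for a list of dataset items shaped like the example above. The ID scheme (`doc<sourceIndex>-chunk<chunkIndex>`) and the record layout are hypothetical; adapt them to whatever your vector database expects.

```python
from collections import defaultdict


def group_by_source(items):
    """Group dataset items by sourceIndex and sort each group by chunkIndex."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["sourceIndex"]].append(item)
    return {
        source: sorted(chunks, key=lambda c: c["chunkIndex"])
        for source, chunks in sorted(grouped.items())
    }


def to_embedding_records(items):
    """Flatten items into records with a stable ID, the text, and metadata."""
    records = []
    for source, chunks in group_by_source(items).items():
        for chunk in chunks:
            records.append({
                # Hypothetical ID scheme, not something the Actor emits.
                "id": f"doc{source}-chunk{chunk['chunkIndex']}",
                "text": chunk["text"],
                "metadata": {
                    "sourceIndex": source,
                    "chunkIndex": chunk["chunkIndex"],
                    "approxTokens": chunk["approxTokens"],
                },
            })
    return records
```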

Common use cases

  • Prepare documents for embeddings + vector search (Pinecone, Qdrant, Weaviate, pgvector).
  • Build RAG context for ChatGPT/Claude apps.
  • Fit long content into LLM context windows.
  • Pair it with PDF to Structured Data: extract text from PDFs, then chunk it here.

Notes

  • Token counts are an estimate (~4 characters per token); exact tokenization depends on the model.
  • In character split mode the text is hard-cut at the size boundary; paragraph and sentence modes respect natural boundaries.
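
The two notes above can be made concrete with a short sketch. It assumes the token estimate is the character count divided by 4 and rounded up (the rounding mode is a guess; it does match the output example, where 312 characters give 78 tokens), and that a hard cut simply slices the text at fixed offsets. The function names are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # the approximation stated in the notes


def approx_tokens(text: str) -> int:
    """Estimate tokens as ceil(characters / 4); rounding up is an assumption."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def to_characters(size: int, unit: str = "characters") -> int:
    """Convert a size given in approximate tokens to a character budget."""
    return size * CHARS_PER_TOKEN if unit == "tokens" else size


def hard_cut(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Cut text at fixed character offsets, ignoring word and sentence boundaries."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last slice already reached the end of the text
        start += step
    return chunks
```

For example, `hard_cut("abcdefghij", 4, overlap=1)` yields `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the one before it.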
