RAG Text Chunker — heading & sentence aware, Japanese ready

Pricing

Pay per usage

Split Markdown or plain text into retrieval-ready chunks for RAG pipelines: cuts at headings, packs whole sentences up to a size limit with optional overlap, and tags every chunk with its heading breadcrumb. Handles Japanese sentence boundaries. No LLM cost.

Developer

Shinobu Otani

Maintained by Community

Last modified: 6 days ago

RAG Text Chunker

Split Markdown or plain text into retrieval-ready chunks. Heading-aware, sentence-aware, Japanese-ready — deterministic, no LLM cost.

Cuts at headings first: chunks never mix sections; fenced code blocks are not mistaken for headings
Packs whole sentences up to max_chars; oversized sentences are hard-split as a last resort
Optional overlap between consecutive chunks for retrieval continuity
Japanese-aware boundaries: 。！？ with closing-quote handling alongside Latin .!? (decimals like 3.14 stay intact)
Heading breadcrumbs: every chunk carries heading_path for citation
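The behaviour listed above can be approximated in a few dozen lines of Python. This is only a minimal sketch under stated assumptions, not the actor's actual implementation: the function names (`split_sections`, `split_sentences`, `pack`, `chunk_documents`) are hypothetical, sentences are joined with a single space, and overlap is modelled as the last `overlap` characters of the previous chunk prepended to the next one (so an overlapped chunk may exceed `max_chars`). The source does not say how the actor itself applies overlap.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")   # ATX headings only
FENCE = re.compile(r"^\s*(`{3,}|~{3,})")          # opening or closing code fence
CLOSERS = "[" + re.escape("」』）”\"')]") + "]*"    # quotes/brackets that may trail a sentence
SENT_END = re.compile(
    r"[。！？]+" + CLOSERS                 # Japanese terminators end a sentence anywhere
    + r"|[.!?]+" + CLOSERS + r"(?=\s|$)"  # Latin ones only before whitespace, so 3.14 survives
)


def split_sections(markdown):
    """Return (heading_path, body) pairs; '#' lines inside fences are not headings."""
    sections, path, body, in_fence = [], [], [], False

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append((list(path), text))
        body.clear()

    for line in markdown.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                              # a heading always closes the current section
            level = len(match.group(1))
            path[level - 1:] = [match.group(2)]  # replace this level and drop deeper ones
        else:
            body.append(line)
    flush()
    return sections


def split_sentences(text):
    """Cut after each sentence terminator (plus trailing closers); keep any tail."""
    sentences, start = [], 0
    for match in SENT_END.finditer(text):
        piece = text[start:match.end()].strip()
        if piece:
            sentences.append(piece)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def pack(sentences, max_chars, overlap=0):
    """Greedily pack whole sentences; hard-split any sentence longer than max_chars."""
    units = []
    for sentence in sentences:
        units.extend(sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars))
    chunks, current = [], ""
    for unit in units:
        candidate = f"{current} {unit}" if current else unit
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    if overlap > 0:  # assumption: overlap = tail of the previous chunk, prepended
        chunks = chunks[:1] + [prev[-overlap:] + " " + cur for prev, cur in zip(chunks, chunks[1:])]
    return chunks


def chunk_documents(documents, max_chars, overlap=0):
    """Produce one dict per chunk, using the field names from the output example."""
    items = []
    for doc_index, doc in enumerate(documents):
        for path, body in split_sections(doc):
            for text in pack(split_sentences(body), max_chars, overlap):
                items.append({"id": len(items), "document_index": doc_index,
                              "heading_path": path, "text": text, "char_count": len(text)})
    return items
```

Because sections are flushed at every heading before any packing happens, no chunk in this sketch can span two sections, and the same input always yields the same boundaries.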

Input

{"documents":["# 概要\n\n検証は三段階で行う。まず再現する。"],"max_chars":1500,"overlap":200}
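When the input is assembled programmatically, it can help to validate it and serialize it without escaping Japanese text. The helper below is a hypothetical sketch: `build_run_input` is not part of the actor, and the rule that `overlap` must be smaller than `max_chars` is an assumption, not a documented constraint.

```python
import json


def build_run_input(documents, max_chars, overlap):
    """Assemble the input object shown above (hypothetical helper)."""
    if not documents or not all(isinstance(doc, str) for doc in documents):
        raise ValueError("documents must be a non-empty list of strings")
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:  # assumed constraint, not documented by the actor
        raise ValueError("overlap must be in the range [0, max_chars)")
    return {"documents": list(documents), "max_chars": max_chars, "overlap": overlap}


payload = build_run_input(["# 概要\n\n検証は三段階で行う。まず再現する。"], 1500, 200)
# ensure_ascii=False keeps Japanese characters readable instead of \uXXXX escapes
body = json.dumps(payload, ensure_ascii=False)
```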

Output (one dataset item per chunk)

{"id":0,"document_index":0,"heading_path":["概要"],"text":"検証は三段階で行う。 まず再現する。","char_count":19}
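Downstream, each dataset item can be turned into a record for an embedding step, with `heading_path` joined into a readable citation label. This is a sketch only: `to_embedding_records`, the `key` format, and the `" > "` separator are illustrative choices, not something the actor defines.

```python
def to_embedding_records(items, separator=" > "):
    """Map chunk items (as in the output example) to records for an embedding step."""
    records = []
    for item in items:
        records.append({
            # Combine both indices so the key stays unique whether ids are global or per document
            "key": f"doc{item['document_index']}-chunk{item['id']}",
            # The breadcrumb becomes a citation label; an empty path gets a placeholder
            "citation": separator.join(item["heading_path"]) or "(no heading)",
            "text": item["text"],
        })
    return records


example = [{"id": 0, "document_index": 0, "heading_path": ["概要"],
            "text": "検証は三段階で行う。 まず再現する。", "char_count": 19}]
records = to_embedding_records(example)
```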

Typical uses: chunking docs/knowledge bases before embedding; Japanese or mixed-language corpora for vector search; reproducible chunk boundaries.

URL: https://apify.com/shoebill-dev27/rag-text-chunker