URL: https://apify.com/shoebill-dev27/rag-text-chunker

RAG Text Chunker - heading & sentence aware, Japanese ready · Apify


Pricing: Pay per usage

Split Markdown or plain text into retrieval-ready chunks for RAG pipelines: cuts at headings, packs whole sentences up to a size limit with optional overlap, and tags every chunk with its heading breadcrumb. Handles Japanese sentence boundaries. No LLM cost.

Rating: 0.0 (0)

Developer: Shinobu Otani

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 6 days ago

RAG Text Chunker

Split Markdown or plain text into retrieval-ready chunks. Heading-aware, sentence-aware, Japanese-ready: deterministic, no LLM cost.

  • Cuts at headings first: chunks never mix sections; fenced code blocks are not mistaken for headings
  • Packs whole sentences up to max_chars; oversized sentences are hard-split as a last resort
  • Optional overlap between consecutive chunks for retrieval continuity
  • Japanese-aware boundaries: 。！？ with closing-quote handling alongside Latin .!? (decimals like 3.14 stay intact)
  • Heading breadcrumbs: every chunk carries heading_path for citation
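
The sentence-aware packing described in the list above can be sketched roughly as follows. This is an illustration only, not the actor's actual code: the function names `split_sentences` and `pack_sentences` are hypothetical, the regular expression is one plausible reading of "Japanese-aware boundaries", and treating `overlap` as "carry over trailing sentences totalling at most that many characters" is an assumption.

```python
import re

# Japanese terminators may be followed by closing quotes/brackets.
# Latin terminators only count when followed by whitespace or end of text,
# so the period in "3.14" is not treated as a sentence end.
_SENTENCE_END = re.compile(r"[。！？][」』）]*|[.!?]+[\"'”’)\]]*(?=\s|$)")


def split_sentences(text):
    """Split text into sentences, keeping each terminator with its sentence."""
    sentences, start = [], 0
    for match in _SENTENCE_END.finditer(text):
        piece = text[start:match.end()].strip()
        if piece:
            sentences.append(piece)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def _size(parts):
    """Length of the parts once joined with single spaces."""
    return len(" ".join(parts))


def pack_sentences(sentences, max_chars=1500, overlap=0):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks, current = [], []
    for sentence in sentences:
        # Last resort: hard-split a sentence longer than max_chars.
        pieces = [sentence[i:i + max_chars]
                  for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            if current and _size(current + [piece]) > max_chars:
                chunks.append(" ".join(current))
                # Assumed overlap policy: carry trailing sentences whose
                # joined length stays within `overlap` characters.
                carried = []
                for prev in reversed(current):
                    if _size([prev] + carried) > overlap:
                        break
                    carried.insert(0, prev)
                # Drop the carry-over if it would push the chunk past the limit.
                if _size(carried + [piece]) > max_chars:
                    carried = []
                current = carried
            current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, `pack_sentences(split_sentences("検証は三段階で行う。まず再現する。"))` returns a single chunk with the two sentences joined by a space, since both fit well within the default limit.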

Input

{"documents":["# 概要\n\n検証は三段階で行う。まず再現する。"],"max_chars":1500,"overlap":200}
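
An input like the one above could be sent to the actor from Python with the `apify-client` package. The sketch below is a hedged example: the actor ID is taken from the page URL, the helper names `build_run_input` and `run_chunker` are hypothetical, and the check that `overlap` is smaller than `max_chars` is an assumption rather than a documented rule of this actor.

```python
def build_run_input(documents, max_chars=1500, overlap=200):
    """Build the input object using the three fields shown in the example."""
    if not documents or not all(isinstance(d, str) for d in documents):
        raise ValueError("documents must be a non-empty list of strings")
    # Assumption: an overlap at or above the chunk size makes no sense.
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    return {"documents": list(documents), "max_chars": max_chars, "overlap": overlap}


def run_chunker(token, run_input):
    """Run the actor and return its dataset items (one per chunk)."""
    # Imported lazily so the helper above works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor("shoebill-dev27/rag-text-chunker").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would then look like `run_chunker("<YOUR_APIFY_TOKEN>", build_run_input(["# 概要\n\n検証は三段階で行う。まず再現する。"]))`, where the token placeholder must be replaced with a real API token.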

Output (one dataset item per chunk)

{"id":0,"document_index":0,"heading_path":["概要"],"text":"検証は三段階で行う。 まず再現する。","char_count":19}
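
One way a `heading_path` like the one in this output could be derived is by keeping a stack of open headings while scanning the document line by line. The following is a minimal sketch under stated assumptions, not the actor's implementation: `split_by_headings` is a hypothetical name, only ATX-style `#` headings are recognized, and fence tracking is a simple toggle on lines starting with ``` or ~~~.

```python
import re

_HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
_FENCE = re.compile(r"^\s*(```|~~~)")


def split_by_headings(markdown):
    """Return sections as dicts with a heading_path breadcrumb and body text."""
    sections = []
    path = []        # stack of (level, title) for the currently open headings
    body = []        # lines collected for the current section
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"heading_path": [t for _, t in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        if _FENCE.match(line):
            # Toggle fence state so "# ..." inside code is not a heading.
            in_fence = not in_fence
            body.append(line)
            continue
        match = None if in_fence else _HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections
```

Each returned section could then be passed through sentence packing, with the section's `heading_path` copied onto every chunk produced from it so that retrieved chunks can be cited by section.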

Typical uses: chunking docs/knowledge bases before embedding; Japanese or mixed-language corpora for vector search; reproducible chunk boundaries.

You might also like

RAG-Ready Markdown Converter & Chunker

foxpink/apify-rag-markdown-chunker

Convert raw HTML/text into clean Markdown and split into ready-to-ingest chunks for RAG pipelines, Vector DBs, and LLM fine-tuning workflows.

Nguyễn Anh Duy

3

4.7

Web-to-Markdown Generator for AI & RAG Pipelines

profitstack/web-to-markdown-generator-for-ai-rag-pipelines

Convert any website into clean, LLM-ready Markdown with heading-based chunking for RAG and AI agents.

Docs-to-RAG Crawler

automation-lab/docs-rag-crawler

Crawl documentation sites (ReadTheDocs, GitBook, Docusaurus, Mintlify) into RAG-ready Markdown/JSON chunks with stable chunk IDs, heading breadcrumbs, word counts, and token estimates.

Stas Persiianenko

7

AI / RAG Web Crawler

groupoject/ai-rag-web-crawler

Crawl any website and extract clean, LLM-ready Markdown chunks to feed AI agents, chatbots, and RAG pipelines. One row per embeddable chunk.

Website to Markdown Crawler for LLM & RAG

logiover/website-text-markdown-crawler

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Markdown RAG Chunker

codepoetry/markdown-rag-chunker

Chunk any document for RAG: PDF, HTML, Word, Excel, PPTX, Markdown and more. Header-aware splits with token counts and stable IDs.

Website Content Pipeline for AI: Markdown, Tokens, RAG Chunks

scrapemint/website-content-crawler

Crawl any website and ship clean Markdown, plain text, and HTML for AI, LLM, and RAG pipelines. Each row carries token estimates, JSON LD metadata, link graph, and optional auto chunk splitting for vector databases. Pay per page.

Rag Content Chunker

labrat011/rag-content-chunker

Turn raw text, Markdown, or Apify datasets into token-perfect RAG chunks with deterministic IDs, source metadata, and a billing-ready summary, ready for embeddings or vector DBs without extra glue code.