RAG Text Chunker: heading & sentence aware, Japanese ready
Pricing
Pay per usage
Split Markdown or plain text into retrieval-ready chunks for RAG pipelines: cuts at headings, packs whole sentences up to a size limit with optional overlap, and tags every chunk with its heading breadcrumb. Handles Japanese sentence boundaries. No LLM cost.
RAG Text Chunker
Split Markdown or plain text into retrieval-ready chunks. Heading-aware, sentence-aware, Japanese-ready; deterministic, no LLM cost.
- Cuts at headings first: chunks never mix sections; fenced code blocks are not mistaken for headings
- Packs whole sentences up to `max_chars`; oversized sentences are hard-split as a last resort
- Optional overlap between consecutive chunks for retrieval continuity
- Japanese-aware boundaries: `。！？` with closing-quote handling, alongside Latin `.!?` (decimals like `3.14` stay intact)
- Heading breadcrumbs: every chunk carries `heading_path` for citation
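The sentence-splitting and packing behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the actor's actual code: the function names `split_sentences` and `pack`, the regular expression, the single-space join, and the choice to measure overlap in whole trailing sentences are all assumptions made for the example.

```python
import re

# Assumed boundary rule: a Japanese or Latin terminator, then any closing
# quotes/brackets. A period directly followed by a digit (3.14) is skipped.
_BOUNDARY = re.compile(r"""(?:[。！？!?]|\.(?!\d))[」』）)"']*""")


def split_sentences(text):
    """Split text into sentences, keeping each terminator with its sentence."""
    sentences, start = [], 0
    for m in _BOUNDARY.finditer(text):
        piece = text[start:m.end()].strip()
        if piece:
            sentences.append(piece)
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def _size(parts):
    """Length of the parts once joined with single spaces."""
    return sum(len(p) for p in parts) + max(len(parts) - 1, 0)


def pack(sentences, max_chars, overlap=0):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks, current = [], []
    for sentence in sentences:
        # Last resort: hard-split a sentence that cannot fit on its own.
        pieces = [sentence[i:i + max_chars]
                  for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            if current and _size(current + [piece]) > max_chars:
                chunks.append(" ".join(current))
                # Seed the next chunk with trailing sentences within `overlap`.
                carried = []
                for prev in reversed(current):
                    if _size([prev] + carried) > overlap:
                        break
                    carried.insert(0, prev)
                # Drop the overlap if it would push the chunk past max_chars.
                current = carried if _size(carried + [piece]) <= max_chars else []
            current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


print(pack(split_sentences("検証は三段階で行う。まず再現する。"), max_chars=1500, overlap=200))
# ['検証は三段階で行う。 まず再現する。']
```

Packing whole sentences first and hard-splitting only when a single sentence exceeds the limit keeps chunk boundaries reproducible for the same input and settings.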
Input
{"documents":["# 概要\n\n検証は三段階で行う。まず再現する。"],"max_chars":1500,"overlap":200}
Output (one dataset item per chunk)
{"id":0,"document_index":0,"heading_path":["概要"],"text":"検証は三段階で行う。 まず再現する。","char_count":19}
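One way a `heading_path` breadcrumb like the one in this output could be derived is to walk the document line by line, tracking a stack of open headings and ignoring `#` lines inside fenced code blocks. The sketch below is illustrative only: `split_sections` is a hypothetical helper, it recognises only ATX-style (`#`) headings, and it treats any line starting with a fence marker as a simple open/close toggle.

```python
import re

_HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
_FENCE = re.compile(r"^\s*(```|~~~)")


def split_sections(markdown):
    """Return one dict per section: its heading breadcrumb and body text."""
    sections = []
    path = []      # stack of (level, title) for the currently open headings
    body = []      # lines collected for the current section
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"heading_path": [t for _, t in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        if _FENCE.match(line):
            # Simplification: any fence marker toggles the fenced state.
            in_fence = not in_fence
            body.append(line)
            continue
        match = None if in_fence else _HEADING.match(line)
        if match:
            flush()  # close the previous section before changing the path
            level = len(match.group(1))
            # Pop headings at the same or deeper level, then push the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections


print(split_sections("# 概要\n\n検証は三段階で行う。まず再現する。"))
# [{'heading_path': ['概要'], 'text': '検証は三段階で行う。まず再現する。'}]
```

Flushing the body whenever a heading is encountered is what keeps a chunk from mixing text from two sections.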
Typical uses: chunking docs/knowledge bases before embedding; Japanese or mixed-language corpora for vector search; reproducible chunk boundaries.
