Japanese Text Normalizer: NFKC, kana, whitespace, sentences
Pricing
Pay per usage
Normalize Japanese text for data pipelines: Unicode NFKC (full/half-width unification), wave-dash unification, whitespace cleanup, hiragana/katakana conversion, Japanese-aware sentence splitting, and per-script character stats.
Japanese Text Normalizer
Clean and normalize Japanese text for search indexes, datasets, and LLM pipelines. Processing is deterministic and instant, with no LLM cost.
What it does
- Unicode NFKC: full-width alphanumerics → ASCII (Ｃｌａｕｄｅ → Claude), half-width katakana → full-width (ｶﾞｲﾄﾞ → ガイド)
- Wave-dash unification: ～ (U+FF5E) → 〜 (U+301C), without touching real ASCII tildes in paths/URLs
- Whitespace cleanup: collapses space runs (including ideographic spaces), trims line ends, collapses 3+ blank lines, normalizes CRLF
- Kana conversion: hiragana/katakana (optional)
- Sentence segmentation: Japanese-aware (。！？ with closing-quote handling) plus Latin punctuation
- Character statistics: per-script counts (hiragana / katakana / kanji / ASCII / digits) before and after
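The normalization steps listed above can be approximated with the Python standard library. This is a minimal sketch, not the Actor's source code: the function name `normalize_ja`, the `"katakana"` option value, and the choice to shrink long blank-line runs to two blank lines are all assumptions made for illustration.

```python
import re
import unicodedata

# Hiragana U+3041..U+3096 sits exactly 0x60 below katakana U+30A1..U+30F6.
_HIRA_TO_KATA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}


def normalize_ja(text: str, kana: str = "none") -> str:
    """Hypothetical helper that mirrors the documented steps."""
    # Map the full-width tilde to the wave dash before NFKC. NFKC alone would
    # fold U+FF5E into ASCII "~", while U+301C has no compatibility mapping
    # and passes through unchanged. ASCII tildes are never touched.
    text = text.replace("\uff5e", "\u301c")
    # NFKC folds full-width alphanumerics to ASCII and half-width katakana
    # (including voiced sound marks) to full-width katakana.
    text = unicodedata.normalize("NFKC", text)
    # Normalize CRLF and bare CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs. U+3000 is listed for clarity, although
    # NFKC has already turned it into an ASCII space.
    text = re.sub(r"[ \t\u3000]+", " ", text)
    # Trim trailing spaces on every line.
    text = re.sub(r" +$", "", text, flags=re.MULTILINE)
    # Assumption: three or more blank lines shrink to two blank lines.
    text = re.sub(r"\n{4,}", "\n\n\n", text)
    if kana == "katakana":  # assumed option value, not confirmed by the README
        text = text.translate(_HIRA_TO_KATA)
    return text


print(normalize_ja("Ｃｌａｕｄｅ　Ｃｏｄｅ"))          # Claude Code
print(normalize_ja("ｶﾞｲﾄﾞ"))                        # ガイド
print(normalize_ja("10～20 ~/tmp"))                 # 10〜20 ~/tmp
print(normalize_ja("ひらがな", kana="katakana"))    # ヒラガナ
```

Replacing U+FF5E before calling `unicodedata.normalize` is what keeps the wave dash distinct from a real ASCII tilde; doing it afterwards would no longer be able to tell the two apart.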
Input
{"texts":["Ｃｌａｕｄｅ　Ｃｏｄｅで開発した。...と思った。"],"kana":"none","split_sentences":true}
Output (one dataset item per text)
{"text":"Claude Codeγ§ιηΊγγγγγγγγγ¨ζγ£γγ","changed":true,"sentences":["Claude Codeγ§ιηΊγγγ","γγγγγγ¨ζγ£γγ"],"sentence_count":2,"stats_before":{"hiragana":8,"katakana":0,"kanji":4,"...":"..."},"stats_after":{"...":"..."}}
Typical uses
- Preprocessing scraped Japanese text before indexing or embedding
- Unifying mixed full-width/half-width product data
- Sentence-level dataset construction from raw Japanese prose
