Japanese Text Normalizer — NFKC, kana, whitespace, sentences

Pricing

Pay per usage

Normalize Japanese text for data pipelines: Unicode NFKC (full/half-width unification), wave-dash unification, whitespace cleanup, hiragana/katakana conversion, Japanese-aware sentence splitting, and per-script character stats.

Developer

Shinobu Otani

Maintained by Community

Last modified

6 days ago

Japanese Text Normalizer

Clean and normalize Japanese text for search indexes, datasets, and LLM pipelines — deterministic, instant, no LLM cost.

What it does

Unicode NFKC: full-width alphanumerics → ASCII (Ｃｌａｕｄｅ → Claude), half-width katakana → full-width (ｶﾞｲﾄﾞ → ガイド)
Wave-dash unification: ～ (U+FF5E) → 〜 (U+301C), without touching real ASCII tildes in paths/URLs
Whitespace cleanup: collapses space runs (including ideographic spaces), trims line ends, collapses 3+ blank lines, normalizes CRLF
Kana conversion: hiragana ↔ katakana (optional)
Sentence segmentation: Japanese-aware (。！？ with closing-quote handling) plus Latin punctuation
Character statistics: per-script counts (hiragana / katakana / kanji / ASCII / digits) before and after
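
The normalization steps listed above can be approximated with the Python standard library. This is a minimal sketch, not the actor's actual code: the function names, the `"hiragana"` / `"katakana"` values for `kana`, and the choice to collapse 3+ blank lines down to two are all assumptions made for illustration. The one ordering detail that matters is that the full-width tilde is swapped for the wave dash before NFKC runs, because NFKC alone would fold U+FF5E into an ASCII `~`.

```python
import re
import unicodedata

FULLWIDTH_TILDE = "\uff5e"  # ～
WAVE_DASH = "\u301c"        # 〜


def to_katakana(text: str) -> str:
    # Hiragana U+3041..U+3096 sit exactly 0x60 below their katakana twins.
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch for ch in text
    )


def to_hiragana(text: str) -> str:
    # Katakana U+30A1..U+30F6 map back by subtracting 0x60.
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch for ch in text
    )


def normalize(text: str, kana: str = "none") -> str:
    # 1. Wave-dash unification first, so NFKC never sees U+FF5E.
    #    Real ASCII tildes (paths, URLs) are left alone.
    text = text.replace(FULLWIDTH_TILDE, WAVE_DASH)
    # 2. NFKC: full-width alphanumerics -> ASCII, half-width katakana ->
    #    full-width, ideographic space U+3000 -> ASCII space.
    text = unicodedata.normalize("NFKC", text)
    # 3. Whitespace cleanup.
    text = text.replace("\r\n", "\n").replace("\r", "\n")        # normalize CRLF
    text = re.sub(r"[ \t]+", " ", text)                          # collapse space runs
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)      # trim line ends
    text = re.sub(r"\n{4,}", "\n\n\n", text)                     # assumed: 3+ blank lines -> 2
    # 4. Optional kana conversion ("hiragana"/"katakana" are assumed values).
    if kana == "katakana":
        text = to_katakana(text)
    elif kana == "hiragana":
        text = to_hiragana(text)
    return text


if __name__ == "__main__":
    print(normalize("Ｃｌａｕｄｅ　Ｃｏｄｅで開発する。"))
    print(normalize("10～20件 ~/data"))
```

Because every step is a pure string transformation, the same input always yields the same output, which is what makes this kind of cleanup safe to run ahead of indexing or embedding.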

Input

{
  "texts": ["Ｃｌａｕｄｅ　Ｃｏｄｅで開発する。「すごい」と思った。"],
  "kana": "none",
  "split_sentences": true
}
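
For readers who want to send this input from code, the sketch below builds the same payload and shows one way to run the actor with the `apify-client` Python package. The helper names are hypothetical, the actor ID is read off the page URL, and the only `kana` value confirmed by this page is `"none"`. The call pattern (`actor(...).call(run_input=...)`, then reading the run's default dataset) follows the client's documented usage, but the exact shape of the returned run object may differ between client versions, so treat `run_normalizer` as an outline rather than a tested integration.

```python
from typing import Any

# Actor ID taken from the page URL (apify.com/shoebill-dev27/jp-text-normalizer).
ACTOR_ID = "shoebill-dev27/jp-text-normalizer"


def build_run_input(texts, kana: str = "none", split_sentences: bool = True) -> dict[str, Any]:
    """Build the input object shown above (hypothetical helper)."""
    if isinstance(texts, str):
        texts = [texts]  # accept a single string for convenience
    return {"texts": list(texts), "kana": kana, "split_sentences": split_sentences}


def run_normalizer(token: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Run the actor and return its dataset items (needs network and an API token)."""
    # Imported here so build_run_input works without the third-party package.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    # Assumes the run is returned as a dict with a "defaultDatasetId" key.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    payload = build_run_input("Ｃｌａｕｄｅ　Ｃｏｄｅで開発する。「すごい」と思った。")
    print(payload)
```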

Output (one dataset item per text)

{
  "text": "Claude Codeで開発する。「すごい」と思った。",
  "changed": true,
  "sentences": ["Claude Codeで開発する。", "「すごい」と思った。"],
  "sentence_count": 2,
  "stats_before": {"hiragana": 9, "katakana": 0, "kanji": 3, "...": "..."},
  "stats_after": {"...": "..."}
}
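
To make the `sentences` and `stats_*` fields concrete, here is a rough sketch of Japanese-aware sentence splitting and per-script counting. It is an illustration under assumptions, not the actor's implementation: the set of closing quotes, the rule that a Latin period only ends a sentence before whitespace or end of text, the Unicode ranges used for each script, and the `ascii` / `digits` key names are all guesses.

```python
import re

# Closing quotes/brackets that stay attached to the sentence they end (assumed set).
_CLOSERS = "」』）)\"'”’"
_BOUNDARY = re.compile(
    r"[。！？!?]+[" + re.escape(_CLOSERS) + r"]*"                  # Japanese and !? terminators
    r"|\.+[" + re.escape(_CLOSERS) + r"]*(?=\s|$)"                 # Latin period before space/end
)


def split_sentences(text: str) -> list[str]:
    sentences, start = [], 0
    for match in _BOUNDARY.finditer(text):
        piece = text[start:match.end()].strip()
        if piece:
            sentences.append(piece)
        start = match.end()
    tail = text[start:].strip()  # text after the last terminator, if any
    if tail:
        sentences.append(tail)
    return sentences


def script_stats(text: str) -> dict[str, int]:
    stats = {"hiragana": 0, "katakana": 0, "kanji": 0, "ascii": 0, "digits": 0}
    for ch in text:
        cp = ord(ch)
        if 0x3041 <= cp <= 0x309F:
            stats["hiragana"] += 1
        elif 0x30A0 <= cp <= 0x30FF or 0xFF66 <= cp <= 0xFF9F:
            stats["katakana"] += 1          # full-width and half-width katakana
        elif 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
            stats["kanji"] += 1             # CJK Unified Ideographs + Extension A
        elif "0" <= ch <= "9":
            stats["digits"] += 1
        elif cp < 0x80:
            stats["ascii"] += 1             # remaining ASCII, including spaces
    return stats


def build_item(before: str, after: str) -> dict:
    # Mirrors the output shape above; "before" is the raw text, "after" the normalized one.
    sentences = split_sentences(after)
    return {
        "text": after,
        "changed": before != after,
        "sentences": sentences,
        "sentence_count": len(sentences),
        "stats_before": script_stats(before),
        "stats_after": script_stats(after),
    }


if __name__ == "__main__":
    print(split_sentences("「本当？」と聞いた。Version 3.14 is out. Done!"))
```

Keeping the closing quote with its terminator means a quoted exclamation such as 「本当？」 is emitted as one unit, while a quote without an internal terminator, as in the example output, stays inside the surrounding sentence.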

Typical uses

Preprocessing scraped Japanese text before indexing or embedding
Unifying mixed full-width/half-width product data
Sentence-level dataset construction from raw Japanese prose


URL: https://apify.com/shoebill-dev27/jp-text-normalizer