Library notes Published: 2026-06-07

How kuromoji.js powers the Japanese tools on NoSend Tools

The Japanese tools on NoSend Tools — kanji-to-hiragana, kanji-to-romaji, wakati-tokenize, furigana-html, japanese-counter-word and more — all run morphological analysis in the browser using kuromoji.js. Here is what the 12 MB IPADIC dictionary contains, how Viterbi tokenisation works, and what it means to run Japanese NLP without sending text to a cloud API.

Why Japanese text processing needs morphological analysis

Written Japanese has no whitespace between words. The Unicode standard UAX #29 defines a default word segmentation algorithm, but it handles Latin-script languages far better than Japanese, where a single continuous string of kanji and kana must be split into meaningful units before anything useful can be done with it. Converting "日本語処理" to "にほんごしょり" requires first knowing that "日本語" and "処理" are two separate words — and that "日本語" is read "にほんご", not "にほんかた" or any other plausible combination.

Morphological analysis is the discipline of segmenting text into its minimal meaningful units (morphemes) and annotating each one with part-of-speech, reading, lemma, and inflection type. Every Japanese tool on this site depends on it: kanji-to-hiragana and kanji-to-romaji need the reading field per morpheme, wakati-tokenize needs the segment boundaries, furigana-html needs both, and japanese-counter-word needs to identify the numeral and the noun that follows it to look up the correct counter suffix.
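As a rough illustration of how those fields feed the tools, the sketch below works on two hand-written token objects rather than real analyser output. The property names `surface_form` and `reading` follow the token shape kuromoji.js returns; the helper names (`katakanaToHiragana`, `toWakati`, `toHiragana`, `toFuriganaHtml`) are hypothetical and are not taken from the NoSend Tools source.

```javascript
// Hand-written sample tokens shaped like kuromoji.js output (illustrative,
// not captured from a real run). Only the fields used below are included.
const sampleTokens = [
  { surface_form: "日本語", pos: "名詞", reading: "ニホンゴ" },
  { surface_form: "処理", pos: "名詞", reading: "ショリ" },
];

// IPADIC readings are katakana. Shifting each code point in the range
// U+30A1..U+30F6 down by 0x60 gives the matching hiragana.
function katakanaToHiragana(text) {
  return text.replace(/[\u30A1-\u30F6]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0x60)
  );
}

// Assumption: tokens without a usable reading (for example unknown words)
// have `reading` missing or set to "*".
function hasReading(token) {
  return Boolean(token.reading) && token.reading !== "*";
}

// wakati-style output: surface forms joined by spaces.
function toWakati(tokens) {
  return tokens.map((t) => t.surface_form).join(" ");
}

// Hiragana output: use the reading where there is one, else keep the surface.
function toHiragana(tokens) {
  return tokens
    .map((t) => (hasReading(t) ? katakanaToHiragana(t.reading) : t.surface_form))
    .join("");
}

// Furigana output: wrap tokens that contain kanji in <ruby> with their reading.
// A real implementation would also HTML-escape the surface form.
function toFuriganaHtml(tokens) {
  const containsKanji = /[\u4E00-\u9FFF]/;
  return tokens
    .map((t) =>
      containsKanji.test(t.surface_form) && hasReading(t)
        ? `<ruby>${t.surface_form}<rt>${katakanaToHiragana(t.reading)}</rt></ruby>`
        : t.surface_form
    )
    .join("");
}
```

The same token list serves all three outputs, which is why one analysis pass can back several tools: segmentation only needs the boundaries, hiragana conversion only needs the readings, and furigana needs both.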

kuromoji.js and the IPADIC dictionary

kuromoji.js, written by takuyaa, is a JavaScript port of the Java-based Kuromoji morphological analyser originally developed by Atilika. Its core algorithm combines a lattice with Viterbi decoding: the input string is expanded into a lattice of every tokenisation the dictionary allows, each candidate word carries its own cost, each pair of adjacent words carries a connection cost, and dynamic programming finds the path through the lattice with the lowest total. This is how "東京都に住む" is split as "東京都 / に / 住む" rather than "東京 / 都 / に / 住む".
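The lattice search can be sketched on a toy scale. Everything in the block below is an assumption made for illustration: the seven-word dictionary, the word costs, and the connection costs are invented so that the cheapest path matches the segmentation described above, and they are not IPADIC values. The real analyser looks up connection costs by context ID in a matrix rather than by surface form.

```javascript
// Hypothetical toy dictionary: surface form -> word cost (lower is preferred).
const WORD_COST = new Map([
  ["東", 4], ["京都", 3], ["東京", 3], ["都", 4],
  ["東京都", 5], ["に", 1], ["住む", 2],
]);

// Hypothetical connection costs between adjacent words; unlisted pairs cost 1.
const CONNECTION_COST = new Map([["東京>都", 2]]);

function connectionCost(left, right) {
  return CONNECTION_COST.get(`${left}>${right}`) ?? 1;
}

function viterbiSegment(text) {
  // endingAt[i] holds the lattice nodes for words that end at offset i.
  const endingAt = Array.from({ length: text.length + 1 }, () => []);
  endingAt[0].push({ surface: "", cost: 0, prev: null }); // start-of-sentence node

  for (let start = 0; start < text.length; start++) {
    if (endingAt[start].length === 0) continue; // no path reaches this offset
    for (const [surface, wordCost] of WORD_COST) {
      if (!text.startsWith(surface, start)) continue;
      // Keep only the cheapest way of reaching this candidate word.
      let best = null;
      for (const prev of endingAt[start]) {
        const conn = prev.prev === null ? 0 : connectionCost(prev.surface, surface);
        const cost = prev.cost + conn + wordCost;
        if (best === null || cost < best.cost) best = { surface, cost, prev };
      }
      endingAt[start + surface.length].push(best);
    }
  }

  const finals = endingAt[text.length];
  if (finals.length === 0) return null; // text not coverable by this dictionary

  // Take the cheapest node at the end of the text and follow back-pointers.
  let node = finals.reduce((a, b) => (b.cost < a.cost ? b : a));
  const words = [];
  while (node.prev !== null) {
    words.unshift(node.surface);
    node = node.prev;
  }
  return words;
}
```

With these invented costs, `viterbiSegment("東京都に住む")` returns `["東京都", "に", "住む"]`: the single-word path costs less than either "東京 / 都" or "東 / 京都". Changing the costs changes the winner, which is the point of storing them in the dictionary rather than in the algorithm.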

The dictionary bundled with kuromoji.js is a compressed binary build of IPADIC (IPA Dictionary), which packs over 300,000 lexical entries and downloads to approximately 12 MB in the browser. Each entry stores surface form, reading, pronunciation, part-of-speech, sub-classification, conjugation type, conjugation form, and base form.

UniDic, the alternative often mentioned in NLP comparisons, is maintained by the National Institute for Japanese Language and Linguistics and differs in vocabulary granularity, morpheme definition boundaries, and licence; the standard kuromoji.js build ships IPADIC.

The kanji-to-romaji tool adds kuroshiro on top of kuromoji.js: kuroshiro uses kuromoji.js as its morphological analysis backend and adds a conversion layer that maps reading fields to romanisation schemes including Hepburn and Nihon-shiki.
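Loading the dictionary is the one asynchronous step. The sketch below wraps the `kuromoji.builder({ dicPath }).build(callback)` call from the kuromoji.js README in a Promise and reuses the result so that the dictionary is parsed once per page. The helper names `loadTokenizer` and `getTokenizer` are hypothetical, and the library object is passed in as an argument so that the sketch makes no assumption about how the site bundles it.

```javascript
// `kuromoji` is the library object (for example the global exposed by the
// kuromoji.js browser build); `dicPath` is the folder holding the dictionary files.
function loadTokenizer(kuromoji, dicPath) {
  return new Promise((resolve, reject) => {
    kuromoji.builder({ dicPath }).build((err, tokenizer) => {
      if (err) reject(err);
      else resolve(tokenizer);
    });
  });
}

// Cache the pending Promise so repeated calls share a single dictionary load.
let tokenizerPromise = null;

function getTokenizer(kuromoji, dicPath) {
  if (tokenizerPromise === null) {
    tokenizerPromise = loadTokenizer(kuromoji, dicPath);
  }
  return tokenizerPromise;
}
```

A caller would then write something like `const tokenizer = await getTokenizer(kuromoji, "/dict/")` followed by `tokenizer.tokenize(text)`, where `"/dict/"` is a placeholder path. Each returned token exposes the entry fields listed above under names such as `surface_form`, `reading`, `pronunciation`, `pos`, and `basic_form`.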

Separating the 12 MB dictionary load from your data

When you open a Japanese tool for the first time, the kuromoji.js dictionary (~12 MB) downloads to your browser. That download happens once and is cached afterward. The critical distinction: none of those 12 MB contain your input text. The dictionary is a static asset served from a CDN, and no code path sends the text you type, whether a confidential document or an unpublished manuscript, to a server.

Cloud Japanese NLP APIs such as Google Natural Language API and Yahoo! Text Analysis API deliver higher accuracy in some cases, but they require transmitting your text to an external server. At that point the terms of service govern what the provider can do with it: storage duration, secondary use, model training eligibility. Running kuromoji.js in the browser replaces that policy promise with a structural guarantee: the text never leaves the device because nothing in the page transmits it.

Verifying zero transmission yourself

Open DevTools, start recording in the Network tab, and run any of the Japanese tools: paste text into kanji-to-hiragana or wakati-tokenize and trigger the conversion. The only requests you should see are the initial page load (HTML / JS / CSS) and the kuromoji dictionary binary on first visit. No POST, XHR, or fetch request fires at the moment of analysis. Once the dictionary is loaded, all tokenisation runs as synchronous JavaScript inside the browser's memory.
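For a scripted cross-check alongside the Network tab, a small wrapper can log calls to `fetch` while you run a tool. This is a sketch under stated assumptions: `recordFetchCalls` is a hypothetical helper, it only observes `fetch` on the object you pass in (for example `window` in the DevTools console), and it does not see XMLHttpRequest, `navigator.sendBeacon`, WebSocket, or worker traffic. The Network tab remains the authoritative view.

```javascript
// Wraps target.fetch so every call is logged and then passed through unchanged.
// Returns the log array and a restore() function that undoes the wrapping.
function recordFetchCalls(target) {
  const original = target.fetch;
  const log = [];

  target.fetch = function (input, init) {
    // `input` may be a string, a URL object, or a Request object.
    const url = typeof input === "string" ? input : input.url ?? String(input);
    const method = (init && init.method) || (input && input.method) || "GET";
    log.push({ url, method });
    return original.call(target, input, init);
  };

  return {
    log,
    restore() {
      target.fetch = original;
    },
  };
}
```

In the console this would be used as `const rec = recordFetchCalls(window)`, followed by running a conversion and inspecting `rec.log`. If the behaviour described above holds, the log gains no entries during analysis; call `rec.restore()` when finished.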

This is verifiable evidence, not just a policy statement. The browser displays each request's destination URL and body in the Network panel, readable without specialist tooling. If you want to read the implementation, the repository otomomik/nosend-tools is public on GitHub; each tool's kuromoji.js invocation is in `src/tools/japanese/<slug>/`. Open it and confirm directly.