UTF-8 vs Shift_JIS vs EUC-JP — which Japanese encoding should you pick?
Compare three Japanese encodings by character coverage, byte structure, and modern compatibility. Maps each to CSV exports, file handoffs, and legacy system integration — and the pairs that produce mojibake.
Four axes that decide the encoding
Picking a Japanese text encoding is not about which one is newest — it is about what the receiving system expects. Four axes carry the decision. Character coverage decides whether JIS X 0208 is enough or whether you need full Unicode (emoji, simplified and traditional Chinese, Korean). Byte structure affects how easy or hazardous parsing becomes: ASCII compatibility, variable vs fixed length, and presence of shift states. Compatibility turns hostile fast when the file leaves your machine — Excel on Windows, an internal CMS from the 1990s, or a mail header all have different defaults. Use case flips the default answer: green-field development and legacy bridges rarely agree.
“UTF-8 is the de facto standard, just use UTF-8 everywhere” collapses the moment Excel mojibakes a BOM-less UTF-8 CSV. Encoding is an agreement between sender and receiver; break the agreement and you get mojibake. There is no inherently good or bad encoding, only matched and mismatched.
Side-by-side comparison
| Property | UTF-8 | Shift_JIS (CP932) | EUC-JP | ISO-2022-JP |
|---|---|---|---|---|
| Character set | All of Unicode | JIS X 0208 + half-width kana + NEC/IBM extensions | JIS X 0208 + 0212 | JIS X 0208 |
| Bytes per character | 1-4 (variable) | 1-2 (variable) | 1-3 (variable) | 1-7 (incl. escapes) |
| ASCII compatible | Yes (first byte identifies length) | Partial (0x5C is yen sign) | Yes | Yes (via escape sequences) |
| BOM | Optional | None | None | None |
| Primary environment | Web / new projects / Linux / macOS | Windows JP / Excel CSV | Older UNIX / legacy web | Mail headers and bodies |
| Standard / year | RFC 3629 (2003) | MS CP932 / JIS X 0208 (1978) | EUC (1985) | RFC 1468 (1993) |
| Web usage in 2026 | ~98% | <1% | <0.1% | Mail only |
UTF-8 represents ASCII (0x00–0x7F) as a single byte and most Japanese kanji and kana as three bytes; the rare kanji outside the Basic Multilingual Plane take four. Shift_JIS uses one byte for the ANK range (alphanumerics and half-width kana) and two for kanji, so it sometimes produces smaller files than UTF-8. The CP932 extensions (NEC special characters, IBM extension kanji), however, include characters like “①”, “髙”, and “﨑” that are not in strict Shift_JIS, so anything not consistently CP932-aware will mojibake on them. EUC-JP can also reach three bytes when you include JIS X 0212 (supplementary kanji), and the resulting complexity is one reason UTF-8 displaced it.
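The size and coverage differences described above can be checked directly. The following is a minimal sketch using Python's built-in codecs; the helper names `encoded_sizes` and `can_encode` are made up for illustration, and it assumes Python's `shift_jis` codec as the "strict" variant and `cp932` as the Microsoft one.

```python
# Illustrative sketch: compare encoded sizes and check CP932-only characters.
# Helper names are hypothetical; codec names are Python's built-in aliases.

def encoded_sizes(text, encodings=("utf-8", "cp932", "euc_jp")):
    """Return {encoding: byte length} for the given text."""
    return {enc: len(text.encode(enc)) for enc in encodings}


def can_encode(text, encoding):
    """True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sizes = encoded_sizes("日本語")      # three common kanji
ascii_sizes = encoded_sizes("abc")   # ASCII stays one byte per character

# Characters from the CP932 extensions: check both codecs side by side.
for ch in "①髙﨑":
    print(ch, "cp932:", can_encode(ch, "cp932"),
          "shift_jis:", can_encode(ch, "shift_jis"))
```

Running a check like this against real sample data before choosing a label is a cheap way to find out whether a file is plain Shift_JIS or actually needs CP932.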
Picking by situation
New web sites, APIs, mobile apps: UTF-8 without BOM, end to end. Keep <meta charset="UTF-8">, the HTTP Content-Type: text/html; charset=utf-8 header, and the database (utf8mb4 on MySQL) in agreement, and mojibake essentially disappears. A BOM (EF BB BF) at the start of a file trips up shell scripts and JSON parsers, so the web stack skips it by convention.
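To see why a BOM is unwelcome in a web stack, here is a small sketch using Python's standard `json` module as a stand-in for "a JSON parser". The helper `parses_as_json` is a hypothetical name, and other parsers may behave differently.

```python
import json


def parses_as_json(text):
    """True if the string is accepted by Python's json parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# "utf-8-sig" prepends the BOM bytes EF BB BF when encoding.
raw = '{"name": "山田"}'.encode("utf-8-sig")

kept_bom = raw.decode("utf-8")      # plain UTF-8 decode keeps the BOM as U+FEFF
stripped = raw.decode("utf-8-sig")  # this codec drops a leading BOM if present

print(parses_as_json(kept_bom), parses_as_json(stripped))
```

Decoding untrusted input with `utf-8-sig` is a tolerant choice on the reading side, since it accepts both BOM-prefixed and BOM-less UTF-8.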
CSV files that Excel will open: UTF-8 with BOM, or Shift_JIS outright. Excel for Windows reads BOM-less UTF-8 CSV as Shift_JIS, turning the multibyte parts into garbage. Adding the BOM is the lightest fix; writing Shift_JIS directly is the bullet-proof one. “UTF-8 with BOM whenever Excel is the consumer” is a common shop rule.
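As a sketch of the "UTF-8 with BOM" rule, the snippet below builds CSV bytes with Python's `csv` module and the `utf-8-sig` codec. The function name `csv_bytes_for_excel` and the sample rows are illustrative assumptions, not part of any tool mentioned here.

```python
import csv
import io


def csv_bytes_for_excel(rows):
    """Serialize rows as CSV and encode as UTF-8 with a leading BOM."""
    buf = io.StringIO(newline="")   # let csv.writer control line endings
    csv.writer(buf).writerows(rows)  # default line terminator is CRLF
    return buf.getvalue().encode("utf-8-sig")  # prepends EF BB BF


data = csv_bytes_for_excel([["名前", "金額"], ["山田", "100"]])

# Writing the file is then a plain binary write, for example:
# with open("export.csv", "wb") as f:
#     f.write(data)
```

If the consumer requires Shift_JIS instead, swapping the codec for `cp932` is the equivalent change, at the cost of failing on characters CP932 cannot represent.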
Legacy mainframes and 1990s internal portals: follow whatever the counterparty specifies. If they ask for EUC-JP or Shift_JIS, flag upfront that characters outside the target repertoire (circled digits in strict Shift_JIS or EUC-JP, certain person-name variants) will be downgraded to “?” or “〓”. That conversation needs to happen before the data round-trips.
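One way to prepare for that conversation is to list the characters that will not survive before sending anything. The sketch below assumes Python's built-in `shift_jis` and `euc_jp` codecs approximate the counterparty's repertoire, which may not hold for every legacy system; `unencodable_chars` is a hypothetical helper name.

```python
def unencodable_chars(text, encoding):
    """Return the distinct characters of text that the encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


name_field = "髙橋①"
lost_in_sjis = unencodable_chars(name_field, "shift_jis")
lost_in_euc = unencodable_chars("山田😀", "euc_jp")

# With errors="replace", Python substitutes "?" for each unencodable character,
# which mirrors the silent downgrade described above.
downgraded = name_field.encode("shift_jis", errors="replace")
```

Reporting the `missing` list to the counterparty gives both sides a concrete set of characters to agree on a substitute for.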
Email subjects and bodies: ISO-2022-JP is still the workhorse. The encoding switches between 7-bit ASCII and JIS X 0208 with escape sequences (ESC $ B and friends), then MIME wraps each header field as =?ISO-2022-JP?B?...?=.
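The two layers described above (escape sequences, then MIME wrapping) can be observed with Python's standard `email.header` module. This is a sketch with a made-up subject line; it relies on the standard library's defaults for the ISO-2022-JP charset.

```python
from email.header import Header, decode_header

subject = "見積書の送付"  # hypothetical subject line

# Layer 1: the raw ISO-2022-JP bytes are 7-bit, bracketed by escape sequences.
raw = subject.encode("iso2022_jp")

# Layer 2: MIME wraps those bytes as a Base64 encoded-word for the header.
encoded = Header(subject, charset="iso-2022-jp").encode()

# Decoding reverses both layers.
decoded = "".join(
    part.decode(charset) for part, charset in decode_header(encoded)
)

print(raw)
print(encoded)
```

Because every byte stays below 0x80, the header passes through mail relays that only guarantee 7-bit transport.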
Recovering and converting inside the browser
For text that has already been mangled — a copy-pasted line, a CSV that opens as garbage — mojibake-fix lets you try the common breakage patterns (“繧エ繝⦅ア” from UTF-8 misread as Shift_JIS, “ã®ã” from UTF-8 misread as Latin-1) until the right combination of source and assumed encoding restores the original. For whole files, csv-encoding-convert re-encodes between UTF-8, UTF-8 with BOM, Shift_JIS, EUC-JP, and ISO-2022-JP entirely in the browser.
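The try-each-combination approach can be sketched in a few lines. This is an illustration of the general idea only, not the implementation of the tools named above; the function name `try_repair` and the candidate pairs are assumptions.

```python
def try_repair(garbled, candidates=(("latin-1", "utf-8"),
                                    ("cp932", "utf-8"),
                                    ("latin-1", "cp932"))):
    """Undo a wrong decode: re-encode with the encoding that was wrongly
    assumed, then decode with the one that was actually used.

    Returns {(assumed, actual): repaired_text} for every pair that
    round-trips without an error; a human still has to pick the right one.
    """
    results = {}
    for assumed, actual in candidates:
        try:
            results[(assumed, actual)] = garbled.encode(assumed).decode(actual)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the garbled text
    return results


original = "日本語のテキスト"
# Simulate the breakage: UTF-8 bytes misread as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
repaired = try_repair(garbled)
```

Recovery is only lossless when the wrong decode preserved every byte; once a viewer has replaced bytes with “?” or U+FFFD, the original cannot be rebuilt this way.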
Three places to be careful. (1) BOM presence: Shift_JIS has no concept of a BOM, so BOM-prefixed UTF-8 needs the BOM stripped before conversion; a converter that passes it through leaves garbage at the top of the file. (2) Shift_JIS vs CP932: text containing “①”, “㈱”, or “髙” is not pure Shift_JIS; it is CP932 (Microsoft’s extension). Choosing the right label changes whether those characters survive. (3) CSV line endings: Excel on older Windows treats LF-only CSVs as a single line, so pick CRLF even when the encoding is right. The implementation is published on GitHub, and the DevTools Network tab lets you confirm that none of the file content leaves your machine during conversion.
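The three precautions can be combined into one conversion step. The following is a minimal sketch in Python, not the published implementation; `utf8_to_cp932_csv` is a hypothetical name and the sample data is invented.

```python
def utf8_to_cp932_csv(data: bytes) -> bytes:
    """Convert UTF-8 CSV bytes (with or without BOM) to CP932 with CRLF."""
    # (1) "utf-8-sig" drops a leading BOM if there is one.
    text = data.decode("utf-8-sig")
    # (3) Normalize every line ending to CRLF.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\r\n")
    # (2) Use cp932, not strict shift_jis, so "㈱" and "髙" survive.
    return text.encode("cp932")


source = "\ufeff社名,担当\n㈱テスト,髙橋\n".encode("utf-8")
converted = utf8_to_cp932_csv(source)
```

Characters that CP932 cannot represent still raise `UnicodeEncodeError` here, which is usually preferable to silently replacing them.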