# Kiri (切り)

A Japanese morphological analyzer for Bun, Node.js, and Elixir, ported from Sudachi (Java) with reference to Kotori (Kotlin).

Kiri reads Sudachi binary dictionaries and provides segmentation, part-of-speech tagging, reading forms, and normalization. It ships in three flavors: a pure TypeScript engine with zero runtime dependencies (`kiri-ji`), an optional Rust native backend (`kiri-native`) for significantly faster lattice construction and Viterbi search, and a pure Elixir package (`kiri`) with its own Viterbi solver, trie traversal, and concurrency-safe dictionary sharing.
## Features

- Three split modes (A/B/C) — shortest, medium, and named-entity segmentation
- Full plugin stack ported from Sudachi:
  - NFKC input text normalization
  - MeCab-style OOV (out-of-vocabulary) handling
  - Path rewriting (katakana joining, numeric joining with kanji normalization)
  - Prolonged sound mark collapsing
  - Yomigana stripping
  - Regex-based OOV matching
  - Connection cost editing / inhibition
- Optional Rust backend (`kiri-native`) — mmap'd dictionary, native trie/lattice/Viterbi for the hot path; auto-detected, with graceful fallback to pure TypeScript
- Pure Elixir package (`kiri`) — the full tokenizer in Elixir: Viterbi lattice solver, DARTSCLONE trie, MeCab OOV, all via binary pattern matching; the dictionary is shared across processes via `:persistent_term` for concurrent tokenization with no memory duplication
- User dictionary support — load custom `.dic` files alongside the system dictionary
- CLI tool — Sudachi-compatible TSV output, plus JSON and wakachi (space-separated) modes
- Runs on Bun, Node.js, and Elixir — native backend via NAPI (TS); pure TS via `Bun.mmap()` or `fs.readFileSync`; pure Elixir via binary pattern matching
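The Viterbi search at the core of all three backends can be sketched in miniature. Everything below is illustrative only: the words and costs are invented, and the real engine adds a connection-cost matrix, a double-array trie, and OOV nodes.

```ts
// Toy Viterbi segmentation: choose the dictionary-word path with the
// lowest total cost. Word costs here are invented for illustration.
const dict = new Map<string, number>([
  ["東京", 5],
  ["東京都", 3],
  ["京都", 4],
  ["都", 6],
  ["に", 1],
]);

function segment(text: string): string[] {
  const n = text.length;
  // best[i] = minimal cost to segment text[0..i); prev[i] = start of the last word
  const best = new Array<number>(n + 1).fill(Infinity);
  const prev = new Array<number>(n + 1).fill(-1);
  best[0] = 0;
  for (let i = 0; i < n; i++) {
    if (best[i] === Infinity) continue; // position unreachable
    for (let j = i + 1; j <= n; j++) {
      const cost = dict.get(text.slice(i, j));
      if (cost !== undefined && best[i] + cost < best[j]) {
        best[j] = best[i] + cost;
        prev[j] = i;
      }
    }
  }
  // Walk back-pointers to recover the cheapest path
  const words: string[] = [];
  for (let i = n; i > 0; i = prev[i]) words.unshift(text.slice(prev[i], i));
  return words;
}

console.log(segment("東京都に")); // → [ "東京都", "に" ]
```

Note that "東京都" (cost 3) beats "東京" + "都" (cost 11), which is exactly the kind of decision the real connection matrix refines.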
## Benchmarks

Measured on an Apple M4 Pro with SudachiDict v20260116.

Full interactive results: smart-knowledge-systems.com/kiri | Raw numbers
## Dictionary Setup

Kiri uses Sudachi's prebuilt binary dictionaries. Download one from the SudachiDict releases. Three sizes are available:
| Dictionary | Description | Size |
|---|---|---|
| small | UniDic vocabulary only | ~75 MB |
| core | Basic vocabulary (recommended) | ~207 MB |
| full | Includes miscellaneous proper nouns | ~330 MB |
### Elixir dictionary conversion

The Elixir package uses a pre-converted `.kiri` format for fast loading. Convert once after downloading:
## Installation

### Node.js / Bun

### Elixir

```elixir
def deps do
  [{:kiri, "~> 0.2"}]
end
```

### Rust
## Quick Start

### Library

```ts
import { createTokenizer } from "kiri-ji";

const tokenizer = await createTokenizer("~/.kiri-ji/dict/system_core.dic");
const morphemes = tokenizer.tokenize("東京都に行った");

for (const m of morphemes) {
  console.log(m.surface, m.partOfSpeech.tags.join(","), m.normalizedForm);
}
// 東京都  名詞,固有名詞,地名,一般,*,*              東京都
// に      助詞,格助詞,*,*,*,*                      に
// 行っ    動詞,非自立可能,*,*,五段-カ行,連用形-促音便  行く
// た      助動詞,*,*,*,助動詞-タ,終止形-一般          た
```
### CLI

```sh
# Text as argument

# Pipe from stdin

# Split mode A (shortest units)

# JSON output

# Wakachi (space-separated surfaces)
# 東京都 に 行っ た

# Specify dictionary path

# User dictionary
```
## API

### `createTokenizer(dictPath, options?)`

Creates a fully configured tokenizer with all plugins enabled.

```ts
const tokenizer = await createTokenizer("system_core.dic", {
  mode: SplitMode.A,
  userDictionaries: ["custom.dic"],
  prolongedSoundMarks: true,
  ignoreYomigana: true,
  disableNormalization: false,
  disableNumericNormalize: false,
  regexOov: [{ pattern: /https?:\/\/\S+/, posId: 0, leftId: 0, rightId: 0, cost: 0 }],
});
```
| Option | Type | Default | Description |
|---|---|---|---|
| `mode` | `SplitMode` | `C` | Default split mode (A/B/C) |
| `userDictionaries` | `string[]` | — | Paths to user `.dic` files |
| `prolongedSoundMarks` | `boolean \| config` | `false` | Collapse repeated prolonged sound marks |
| `ignoreYomigana` | `boolean \| config` | `false` | Strip bracketed readings after kanji |
| `disableNormalization` | `boolean` | `false` | Skip NFKC input text normalization |
| `disableNumericNormalize` | `boolean` | `false` | Skip numeric normalization in path rewrite |
| `regexOov` | `RegexOovConfig[]` | — | Regex-based OOV provider configurations |
### `tokenizer.tokenize(text, mode?)`

Returns an array of `Morpheme` objects. The optional `mode` parameter overrides the tokenizer's default.

### `Morpheme`

```ts
interface Morpheme {
  surface: string;             // Surface text as it appears in the input
  partOfSpeech: POS;           // 6-level POS hierarchy (.tags array)
  partOfSpeechId: number;
  normalizedForm: string;      // Spelling-normalized form
  dictionaryForm: string;      // Lemma / dictionary form
  readingForm: string;         // Katakana reading
  begin: number;               // Start character index in the original text
  end: number;                 // End character index in the original text
  isOOV: boolean;              // True if not found in any dictionary
  wordId: number;
  dictionaryId: number;        // 0 = system, 1+ = user dictionaries
  synonymGroupIds: Int32Array;
}
```
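The `begin`/`end` fields are character offsets into the original input, so slicing the input recovers each surface. A quick check of that property, with the morpheme offsets hardcoded from the Quick Start sample output (no dictionary needed):

```ts
// Verify that begin/end offsets index into the original text and tile it
// exactly. The morpheme list is hardcoded sample data, not library output.
const text = "東京都に行った";
const morphemes = [
  { surface: "東京都", begin: 0, end: 3 },
  { surface: "に", begin: 3, end: 4 },
  { surface: "行っ", begin: 4, end: 6 },
  { surface: "た", begin: 6, end: 7 },
];

for (const m of morphemes) {
  // Each surface is literally a slice of the input
  console.assert(text.slice(m.begin, m.end) === m.surface);
}
```

Because the morphemes tile the input, `begin` of each morpheme equals `end` of the previous one — useful for mapping annotations back onto the source text.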
## Split Modes

| Mode | Description | Example: 関西国際空港 |
|---|---|---|
| C | Named entities / longest units | 関西国際空港 |
| B | Middle-length units | 関西 / 国際 / 空港 |
| A | Shortest units (UniDic short units) | 関西 / 国際 / 空港 |
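Coarser modes never cross the boundaries of finer ones: every C-mode unit is a concatenation of consecutive A-mode units. A small check of that property, with the segmentations hardcoded from the example above (a real check would tokenize with each mode):

```ts
// Check that a coarse segmentation refines into a finer one: each coarse
// unit must equal the concatenation of consecutive fine units.
function refines(coarse: string[], fine: string[]): boolean {
  let i = 0;
  for (const unit of coarse) {
    let acc = "";
    while (acc.length < unit.length && i < fine.length) acc += fine[i++];
    if (acc !== unit) return false; // boundary crossed
  }
  return i === fine.length; // all fine units consumed
}

const modeC = ["関西国際空港"];
const modeA = ["関西", "国際", "空港"];
console.log(refines(modeC, modeA)); // → true
```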
## Native Backend

Install `kiri-native` alongside `kiri-ji` to enable the Rust-accelerated backend.

When present, `createTokenizer()` automatically uses the native backend for dictionary loading (via mmap), trie lookup, lattice construction, Viterbi search, and MeCab OOV generation. Input text plugins, path rewriting, split mode processing, and regex OOV remain in TypeScript.

If `kiri-native` is not installed, everything works identically via the pure TypeScript engine — no code changes needed.

Prebuilt binaries are available for:

- macOS arm64 / x64
- Linux x64 (gnu) / arm64 (gnu)

```ts
import { getBackend } from "kiri-ji";

console.log(getBackend()); // "native" or "core"
```
## Elixir

Documentation: hexdocs.pm/kiri | Changelog

Add `kiri` as a dependency in `mix.exs`:

```elixir
def deps do
  [{:kiri, "~> 0.2"}]
end
```
### Dictionary conversion

Convert a Sudachi `.dic` file to the `.kiri` format (one-time step):
### Usage

```elixir
# Load once at application startup
{:ok, dict} = Kiri.load_dictionary("~/.kiri-ji/system_core.kiri")

# Tokenize from anywhere — concurrency-safe, no GenServer needed
morphemes = Kiri.tokenize(dict, "東京都に行った")

for m <- morphemes do
  IO.puts "#{m.surface}\t#{Enum.join(m.part_of_speech, ",")}\t#{m.normalized_form}"
end
# 東京都  名詞,固有名詞,地名,一般,*,*              東京都
# に      助詞,格助詞,*,*,*,*                      に
# 行っ    動詞,非自立可能,*,*,五段-カ行,連用形-促音便  行く
# た      助動詞,*,*,*,助動詞-タ,終止形-一般          た
```
### Concurrency

The `%Dictionary{}` struct is a ~2 KB handle. The actual ~150 MB of binary data lives in `:persistent_term`, shared across all processes with zero copying.

```elixir
texts
|> Task.async_stream(&Kiri.tokenize(dict, &1), max_concurrency: 100)
|> Enum.to_list()
```
### Split Modes

Override the default split mode per call:

```elixir
morphemes = Kiri.tokenize(dict, "関西国際空港", mode: :a)
```
### Morpheme

```elixir
%Kiri.Morpheme{
  surface: "東京都",
  part_of_speech: ["名詞", "固有名詞", "地名", "一般", "*", "*"],
  part_of_speech_id: 42,
  normalized_form: "東京都",
  dictionary_form: "東京都",
  reading_form: "トウキョウト",
  begin: 0,
  end: 3,
  is_oov: false,
  word_id: 12345,
  dictionary_id: 0,
  synonym_group_ids: []
}
```
### Options

Options can be passed to `Kiri.tokenize/3`:

| Option | Type | Default | Description |
|---|---|---|---|
| `mode` | `:a \| :b \| :c` | `:c` | Split mode (A/B/C) |
| `prolonged_sound_marks` | `boolean` | `false` | Collapse repeated prolonged sound marks |
| `ignore_yomigana` | `boolean` | `false` | Strip bracketed readings after kanji |
| `disable_normalization` | `boolean` | `false` | Skip NFKC input text normalization |
| `disable_numeric_normalize` | `boolean` | `false` | Skip numeric normalization in path rewrite |
| `backend` | `:elixir \| :nif` | `:elixir` | Tokenization backend |
The Elixir package is a pure Elixir implementation — no Rust toolchain or NIF compilation required. The full plugin stack (input text normalization, path rewriting, split modes, prolonged sound marks, yomigana stripping) and the core algorithms (Viterbi lattice solver, DARTSCLONE trie, MeCab OOV) are implemented in Elixir using binary pattern matching against `:persistent_term`-stored dictionary sections.
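Where Elixir destructures a fixed-layout binary in a single pattern match (e.g. `<<magic::32-little, version::16-little, count::32-little, _rest::binary>>`), the TypeScript engine does the equivalent with `DataView` offset arithmetic. A hypothetical illustration — the header layout and magic number below are invented, not the actual `.kiri` format:

```ts
// Parse a (made-up) 10-byte little-endian header: u32 magic, u16 version,
// u32 entry count. Mirrors a one-line Elixir binary pattern match.
function parseHeader(buf: ArrayBuffer) {
  const view = new DataView(buf);
  return {
    magic: view.getUint32(0, true),      // magic::32-little
    version: view.getUint16(4, true),    // version::16-little
    entryCount: view.getUint32(6, true), // entry_count::32-little
  };
}

// Build a sample header and round-trip it.
const buf = new ArrayBuffer(10);
const w = new DataView(buf);
w.setUint32(0, 0x4b495249, true); // invented magic number
w.setUint16(4, 2, true);
w.setUint32(6, 123456, true);

const h = parseHeader(buf);
console.log(h.version, h.entryCount); // → 2 123456
```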
## Project Structure

```
kiri/
├── packages/
│   ├── kiri-core/     Pure TypeScript tokenizer engine            [npm]
│   ├── kiri-engine/   Shared Rust lib crate (pure logic, no FFI)  [crates.io]
│   ├── kiri-kotoba/   Input text processing Rust crate            [crates.io]
│   ├── kiri-yaiba/    Standalone Rust tokenizer + CLI             [crates.io]
│   ├── kiri-native/   NAPI wrapper over kiri-engine (Node/Bun)    [npm + crates.io]
│   ├── kiri-nif/      Erlang NIF wrapper over kiri-engine         [crates.io]
│   ├── kiri/          Pure Elixir tokenizer package               [hex.pm]
│   └── kiri-ji/       Public TypeScript API + CLI                 [npm]
└── bench/             Benchmark scripts and corpus
```
## Development

```sh
# Install dependencies

# Run tests (TypeScript)

# Type check + lint

# Format

# Rust tests (requires Rust toolchain)

# Elixir tests
```
### Profiling (Elixir)

See `bench/PROFILING.md` for a guide to profiling the pure Elixir tokenizer with OTP's cprof, eprof, and fprof.

## Development Reports

- Elixir Pure Rewrite — Sprint L report: the NIF/GenServer architecture replaced with a pure Elixir Viterbi solver, `:persistent_term` dictionary sharing, and concurrent tokenization.
## Acknowledgments
Kiri is a derivative work of Sudachi by Works Applications, a Java-based Japanese morphological analyzer licensed under Apache 2.0. The core algorithms (lattice construction, Viterbi search, double-array trie, plugin architecture) and binary dictionary format are ported from Sudachi's source.
Kotori by Wanasit Tanakitrungruang served as an additional reference, particularly for its approach to porting Sudachi concepts to a non-Java runtime.
## License
Apache License 2.0 — see LICENSE.
