kiri-engine 0.1.1

Core Rust engine for Kiri Japanese morphological analyzer
# Kiri (切り)

[![npm: kiri-ji](https://img.shields.io/npm/v/kiri-ji?label=kiri-ji&color=cb3837)](https://www.npmjs.com/package/kiri-ji)
[![npm: kiri-core](https://img.shields.io/npm/v/kiri-core?label=kiri-core&color=cb3837)](https://www.npmjs.com/package/kiri-core)
[![npm: kiri-native](https://img.shields.io/npm/v/kiri-native?label=kiri-native&color=cb3837)](https://www.npmjs.com/package/kiri-native)
[![hex: kiri](https://img.shields.io/hexpm/v/kiri?label=kiri&color=6e4a7e)](https://hex.pm/packages/kiri)
[![crates.io: kiri-engine](https://img.shields.io/crates/v/kiri-engine?label=kiri-engine&color=dea584)](https://crates.io/crates/kiri-engine)
[![crates.io: kiri_nif](https://img.shields.io/crates/v/kiri_nif?label=kiri_nif&color=dea584)](https://crates.io/crates/kiri_nif)
[![crates.io: kiri-native](https://img.shields.io/crates/v/kiri-native?label=kiri-native&color=dea584)](https://crates.io/crates/kiri-native)

A Japanese morphological analyzer for Bun and Node.js, ported from [Sudachi](https://github.com/WorksApplications/Sudachi) (Java) with reference to [Kotori](https://github.com/wanasit/kotori) (Kotlin).

Kiri reads Sudachi binary dictionaries and provides segmentation, part-of-speech tagging, reading forms, and normalization. It ships as pure TypeScript with zero runtime dependencies (`kiri-ji`), with an optional Rust native backend (`kiri-native`) for significantly faster lattice construction and Viterbi search (see Benchmarks), and an Elixir wrapper (`kiri`) that exposes an idiomatic API via Rustler NIFs.

## Features

- **Three split modes** (A/B/C) — shortest, medium, and named-entity segmentation
- **Full plugin stack** ported from Sudachi:
  - NFKC input text normalization
  - MeCab-style OOV (out-of-vocabulary) handling
  - Path rewriting (katakana joining, numeric joining with kanji normalization)
  - Prolonged sound mark collapsing
  - Yomigana stripping
  - Regex-based OOV matching
  - Connection cost editing / inhibition
- **Optional Rust backend** (`kiri-native`) — mmap'd dictionary, native trie/lattice/Viterbi for the hot path; auto-detected with graceful fallback to pure TypeScript
- **Elixir wrapper** (`kiri`) — idiomatic Elixir API via rustler NIFs on the same Rust engine, with full plugin stack ported to Elixir
- **User dictionary support** — load custom `.dic` files alongside the system dictionary
- **CLI tool** — Sudachi-compatible TSV output, JSON, and wakachi (space-separated) modes
- **Runs on Bun, Node.js, and Elixir** — native backend via NAPI or rustler NIFs, pure TS via `Bun.mmap()` or `fs.readFileSync`
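
As a rough illustration of what three of the input-side plugins listed above do, the sketch below uses only built-in string operations. The helper names are hypothetical and this is not Kiri's plugin code; Sudachi's real plugins are configurable and handle more cases (for example, additional prolonged sound mark characters).

```typescript
// Illustrative only: hypothetical helpers, not Kiri's plugin implementation.

// NFKC folds compatibility characters (full-width ASCII, half-width katakana)
// into their canonical forms.
function normalizeNfkc(text: string): string {
  return text.normalize("NFKC");
}

// Collapse a run of two or more prolonged sound marks (U+30FC) into one.
function collapseProlongedSoundMarks(text: string): string {
  return text.replace(/\u30FC{2,}/g, "\u30FC");
}

// Drop a bracketed kana reading that directly follows kanji.
function stripYomigana(text: string): string {
  const pattern = /([\u4E00-\u9FFF]+)[（(][\u3041-\u3096\u30A1-\u30FA\u30FC]+[）)]/g;
  return text.replace(pattern, "$1");
}

console.log(normalizeNfkc("ＡＢＣ１２３ ｶﾞｰﾙ")); // ABC123 ガール
console.log(collapseProlongedSoundMarks("すごーーーい")); // すごーい
console.log(stripYomigana("徳島（とくしま）に行く")); // 徳島に行く
```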

## Benchmarks

Measured on an Apple M4 Pro with SudachiDict v20260116. All figures are single-threaded throughput.

[![Tokenization throughput](bench/benchmarks.png)](bench/benchmarks.html)

Full interactive results: [smart-knowledge-systems.com/kiri](https://smart-knowledge-systems.com/kiri) | [Raw numbers](bench/results.md)

Known issue: concurrent throughput in the Elixir implementation is currently well below expectations. Debugging is in progress, and it is not yet clear how much it can be improved.

## Dictionary Setup

Kiri uses Sudachi's prebuilt binary dictionaries. Download one from the [SudachiDict releases](https://github.com/WorksApplications/SudachiDict/releases):

```bash
mkdir -p ~/.kiri-ji/dict
curl -L -o ~/.kiri-ji/dict/sudachi-dictionary-core.zip \
  https://github.com/WorksApplications/SudachiDict/releases/download/v20260116/sudachi-dictionary-20260116-core.zip
unzip -o ~/.kiri-ji/dict/sudachi-dictionary-core.zip -d ~/.kiri-ji/dict
mv ~/.kiri-ji/dict/sudachi-dictionary-*/system_core.dic ~/.kiri-ji/dict/
```
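
Node's `fs` APIs do not expand `~` on their own, and this README does not say whether `createTokenizer` does. A minimal sketch for resolving the path yourself before handing it to the library could look like the following; `expandHome` and `resolveDictionary` are hypothetical helpers, not part of `kiri-ji`.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: expand a leading "~" to the current user's home directory.
function expandHome(p: string): string {
  if (p === "~") return os.homedir();
  if (p.startsWith("~/")) return path.join(os.homedir(), p.slice(2));
  return p;
}

// Return the absolute dictionary path, or throw a readable error if the file is missing.
function resolveDictionary(p: string): string {
  const resolved = path.resolve(expandHome(p));
  if (!fs.existsSync(resolved)) {
    throw new Error(`Dictionary not found: ${resolved}`);
  }
  return resolved;
}

console.log(expandHome("~/.kiri-ji/dict/system_core.dic"));
```

Failing early with the resolved path in the message tends to be easier to debug than a loader error from deeper in the stack.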

Three sizes are available:

| Dictionary | Description                         | Size    |
| ---------- | ----------------------------------- | ------- |
| **small**  | UniDic vocabulary only              | ~75 MB  |
| **core**   | Basic vocabulary (recommended)      | ~207 MB |
| **full**   | Includes miscellaneous proper nouns | ~330 MB |

## Installation

### Node.js / Bun

```bash
bun add kiri-ji            # Public API + CLI
bun add kiri-native        # Optional: Rust-accelerated backend
```

### Elixir

```elixir
def deps do
  [{:kiri, "~> 0.1"}]
end
```

### Rust

```bash
cargo add kiri-engine      # Core engine (no FFI)
```

## Quick Start

### Library

```typescript
import { createTokenizer } from "kiri-ji";

const tokenizer = await createTokenizer("~/.kiri-ji/dict/system_core.dic");
const morphemes = tokenizer.tokenize("東京都に行った");

for (const m of morphemes) {
  console.log(m.surface, m.partOfSpeech.tags.join(","), m.normalizedForm);
}
// 東京都  名詞,固有名詞,地名,一般,*,*  東京都
// に      助詞,格助詞,*,*,*,*            に
// 行っ    動詞,非自立可能,*,*,五段-カ行,連用形-促音便  行く
// た      助動詞,*,*,*,助動詞-タ,終止形-一般  た
```
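
A common next step is filtering by part of speech. The sketch below does that on a local stand-in type so it runs without a dictionary; `MorphemeLike` and `contentWords` are illustrative names, and the sample data is copied from the output above.

```typescript
// Minimal local stand-in for the fields used here; the real Morpheme has more.
interface MorphemeLike {
  surface: string;
  partOfSpeech: { tags: string[] };
  normalizedForm: string;
}

// Sample data copied from the Quick Start output.
const sample: MorphemeLike[] = [
  { surface: "東京都", partOfSpeech: { tags: ["名詞", "固有名詞", "地名", "一般", "*", "*"] }, normalizedForm: "東京都" },
  { surface: "に", partOfSpeech: { tags: ["助詞", "格助詞", "*", "*", "*", "*"] }, normalizedForm: "に" },
  { surface: "行っ", partOfSpeech: { tags: ["動詞", "非自立可能", "*", "*", "五段-カ行", "連用形-促音便"] }, normalizedForm: "行く" },
  { surface: "た", partOfSpeech: { tags: ["助動詞", "*", "*", "*", "助動詞-タ", "終止形-一般"] }, normalizedForm: "た" },
];

// Keep only nouns and verbs (top-level POS tag) and return their normalized forms.
function contentWords(morphemes: MorphemeLike[]): string[] {
  const keep = ["名詞", "動詞"];
  return morphemes
    .filter((m) => keep.indexOf(m.partOfSpeech.tags[0] ?? "") !== -1)
    .map((m) => m.normalizedForm);
}

console.log(contentWords(sample).join(" ")); // 東京都 行く
```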

### CLI

```bash
# Text as argument
bunx kiri-ji "東京都に行った"

# Pipe from stdin
echo "お寿司が食べたい" | bunx kiri-ji

# Split mode A (shortest units)
bunx kiri-ji --mode A "関西国際空港"

# JSON output
bunx kiri-ji --format json "すもももももももものうち"

# Wakachi (space-separated surfaces)
bunx kiri-ji --wakachi "東京都に行った"
# 東京都 に 行っ た

# Specify dictionary path
bunx kiri-ji --dict /path/to/system_core.dic "text"

# User dictionary
bunx kiri-ji --user-dict custom.dic "text"
```
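
For reference, Sudachi's default CLI layout is one morpheme per line (surface, comma-joined POS, normalized form, separated by tabs) followed by an `EOS` line. The sketch below reproduces that layout and the wakachi layout from plain data. It assumes Kiri's "Sudachi-compatible" TSV matches this shape exactly; `Row`, `toTsv`, and `toWakachi` are hypothetical names.

```typescript
// Hypothetical row shape holding only the columns printed by the default layout.
interface Row {
  surface: string;
  pos: string[];
  normalizedForm: string;
}

// One tab-separated line per morpheme, terminated by an "EOS" marker line.
function toTsv(rows: Row[]): string {
  const lines = rows.map((r) => [r.surface, r.pos.join(","), r.normalizedForm].join("\t"));
  lines.push("EOS");
  return lines.join("\n");
}

// Space-separated surfaces, as in the --wakachi example above.
function toWakachi(rows: Row[]): string {
  return rows.map((r) => r.surface).join(" ");
}

const rows: Row[] = [
  { surface: "東京都", pos: ["名詞", "固有名詞", "地名", "一般", "*", "*"], normalizedForm: "東京都" },
  { surface: "に", pos: ["助詞", "格助詞", "*", "*", "*", "*"], normalizedForm: "に" },
];

console.log(toTsv(rows));
console.log(toWakachi(rows)); // 東京都 に
```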

## API

### `createTokenizer(dictPath, options?)`

Creates a fully configured tokenizer with all plugins enabled.

```typescript
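// Assumes SplitMode is exported from kiri-ji alongside createTokenizer.
import { createTokenizer, SplitMode } from "kiri-ji";

const tokenizer = await createTokenizer("system_core.dic", {
  mode: SplitMode.A,
  userDictionaries: ["custom.dic"],
  prolongedSoundMarks: true,
  ignoreYomigana: true,
  disableNormalization: false,
  disableNumericNormalize: false,
  // Treat URLs as single out-of-vocabulary tokens.
  regexOov: [{ pattern: /https?:\/\/\S+/, posId: 0, leftId: 0, rightId: 0, cost: 0 }],
});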
```

| Option                    | Type                | Default | Description                                |
| ------------------------- | ------------------- | ------- | ------------------------------------------ |
| `mode`                    | `SplitMode`         | `C`     | Default split mode (A/B/C)                 |
| `userDictionaries`        | `string[]`          | none    | Paths to user `.dic` files                 |
| `prolongedSoundMarks`     | `boolean \| config` | `false` | Collapse repeated prolonged sound marks    |
| `ignoreYomigana`          | `boolean \| config` | `false` | Strip bracketed readings after kanji       |
| `disableNormalization`    | `boolean`           | `false` | Skip NFKC input text normalization         |
| `disableNumericNormalize` | `boolean`           | `false` | Skip numeric normalization in path rewrite |
| `regexOov`                | `RegexOovConfig[]`  | none    | Regex-based OOV provider configurations    |
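
To make the `regexOov` option more concrete, here is a sketch of how a regex rule can be turned into out-of-vocabulary candidates with offsets. The shapes mirror the fields shown in the example above, but `RegexOovRule`, `OovCandidate`, and `findOovCandidates` are hypothetical; Kiri's provider is presumably queried per lattice position rather than scanning the whole text in one pass.

```typescript
// Hypothetical shapes for illustration; field names mirror the regexOov option.
interface RegexOovRule {
  pattern: RegExp;
  posId: number;
  leftId: number;
  rightId: number;
  cost: number;
}

interface OovCandidate {
  surface: string;
  begin: number; // start offset in the input
  end: number;   // end offset (exclusive)
  posId: number;
  cost: number;
}

// Scan the text with each rule and emit one candidate per non-empty match.
function findOovCandidates(text: string, rules: RegexOovRule[]): OovCandidate[] {
  const out: OovCandidate[] = [];
  for (const rule of rules) {
    // Re-create the regex with the global flag so exec() advances through the text.
    const flags = rule.pattern.flags.indexOf("g") === -1 ? rule.pattern.flags + "g" : rule.pattern.flags;
    const re = new RegExp(rule.pattern.source, flags);
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      if (m[0].length === 0) {
        re.lastIndex++; // avoid an infinite loop on empty matches
        continue;
      }
      out.push({ surface: m[0], begin: m.index, end: m.index + m[0].length, posId: rule.posId, cost: rule.cost });
    }
  }
  return out;
}

const urlRule: RegexOovRule = { pattern: /https?:\/\/\S+/, posId: 0, leftId: 0, rightId: 0, cost: 0 };
console.log(findOovCandidates("詳細は https://example.com を参照", [urlRule]));
```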

### `tokenizer.tokenize(text, mode?)`

Returns an array of `Morpheme` objects. The optional `mode` parameter overrides the tokenizer's default.

### Morpheme

```typescript
interface Morpheme {
  surface: string; // Surface text as it appears in input
  partOfSpeech: POS; // 6-level POS hierarchy (.tags array)
  partOfSpeechId: number;
  normalizedForm: string; // Spelling-normalized form
  dictionaryForm: string; // Lemma / dictionary form
  readingForm: string; // Katakana reading
  begin: number; // Start character index in original text
  end: number; // End character index in original text
  isOOV: boolean; // True if not found in dictionary
  wordId: number;
  dictionaryId: number; // 0 = system, 1+ = user dictionaries
  synonymGroupIds: Int32Array;
}
```
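
The `begin` and `end` fields make it possible to map morphemes back onto the input. The sketch below assumes `end` is exclusive and that offsets are compatible with `String.prototype.slice` (UTF-16 code units); the README does not state the unit, so treat that as an assumption. The offsets are hand-written for the Quick Start sentence, not captured from Kiri.

```typescript
// Only the fields needed for offset handling.
interface Span {
  surface: string;
  begin: number;
  end: number;
}

// Wrap each morpheme in brackets using only begin/end, showing that the spans tile the input.
function bracket(input: string, spans: Span[]): string {
  return spans.map((s) => `[${input.slice(s.begin, s.end)}]`).join("");
}

const sentence = "東京都に行った";
const spans: Span[] = [
  { surface: "東京都", begin: 0, end: 3 },
  { surface: "に", begin: 3, end: 4 },
  { surface: "行っ", begin: 4, end: 6 },
  { surface: "た", begin: 6, end: 7 },
];

console.log(bracket(sentence, spans)); // [東京都][に][行っ][た]
```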

## Split Modes

| Mode  | Description                    | Example: 関西国際空港 |
| ----- | ------------------------------ | --------------------- |
| **C** | Named entities / longest units | 関西国際空港          |
| **B** | Middle-length units            | 関西 / 国際 / 空港    |
| **A** | Shortest units (UniDic short)  | 関西 / 国際 / 空港    |
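
Sudachi dictionaries record, for each long entry, the word IDs of its shorter A-unit and B-unit parts, and splitting expands an entry into those parts. The toy model below illustrates that idea under the assumption that Kiri follows the same design; the entries, word IDs, and the `splitWord` helper are invented for the example.

```typescript
// Toy dictionary entry: a long unit optionally lists the word IDs of its shorter parts.
interface Entry {
  surface: string;
  aUnitSplit: number[];
  bUnitSplit: number[];
}

type Mode = "A" | "B" | "C";

// Invented entries; the array index plays the role of the word ID.
const entries: Entry[] = [
  { surface: "関西", aUnitSplit: [], bUnitSplit: [] },
  { surface: "国際", aUnitSplit: [], bUnitSplit: [] },
  { surface: "空港", aUnitSplit: [], bUnitSplit: [] },
  { surface: "関西国際空港", aUnitSplit: [0, 1, 2], bUnitSplit: [0, 1, 2] },
];

function lookup(dict: Entry[], wordId: number): Entry {
  const entry = dict[wordId];
  if (!entry) throw new Error(`Unknown word ID: ${wordId}`);
  return entry;
}

// Mode C keeps the entry whole; A and B expand it when split information exists.
function splitWord(dict: Entry[], wordId: number, mode: Mode): string[] {
  const entry = lookup(dict, wordId);
  const parts = mode === "A" ? entry.aUnitSplit : mode === "B" ? entry.bUnitSplit : [];
  if (parts.length === 0) return [entry.surface];
  return parts.map((id) => lookup(dict, id).surface);
}

console.log(splitWord(entries, 3, "C").join(" / ")); // 関西国際空港
console.log(splitWord(entries, 3, "A").join(" / ")); // 関西 / 国際 / 空港
```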

## Native Backend

Install `kiri-native` alongside `kiri-ji` to enable the Rust-accelerated backend:

```bash
bun add kiri-native
```

When present, `createTokenizer()` automatically uses the native backend for dictionary loading (via mmap), trie lookup, lattice construction, Viterbi search, and MeCab OOV generation. Input text plugins, path rewriting, split mode processing, and regex OOV remain in TypeScript.
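
For readers unfamiliar with the lattice search named above, the following is a minimal Viterbi sketch over a toy lattice. It is a generic minimum-cost path search, not Kiri's implementation: the node costs are made up, the connection cost is a constant, and context ID 0 is used for the sentence boundaries.

```typescript
interface LatticeNode {
  surface: string;
  begin: number;   // start offset in the input
  end: number;     // end offset (exclusive)
  cost: number;    // word cost (made up here)
  leftId: number;  // context IDs used to look up connection costs
  rightId: number;
}

type ConnectionCost = (prevRightId: number, nextLeftId: number) => number;

// Return the surfaces on the cheapest path covering [0, length).
function viterbi(nodes: LatticeNode[], length: number, connect: ConnectionCost): string[] {
  // Group node indices by end offset so predecessors are easy to find.
  const endingAt: number[][] = [];
  for (let i = 0; i <= length; i++) endingAt.push([]);
  nodes.forEach((n, i) => endingAt[n.end].push(i));

  const best: number[] = nodes.map(() => Infinity); // cheapest total cost reaching each node
  const prev: number[] = nodes.map(() => -1);       // back-pointers for path recovery

  // Visit nodes by begin offset so every predecessor is final before it is used.
  const order = nodes.map((_, i) => i).sort((a, b) => nodes[a].begin - nodes[b].begin);
  for (const i of order) {
    const n = nodes[i];
    if (n.begin === 0) {
      best[i] = connect(0, n.leftId) + n.cost; // connect from the start boundary
      continue;
    }
    for (const p of endingAt[n.begin]) {
      const total = best[p] + connect(nodes[p].rightId, n.leftId) + n.cost;
      if (total < best[i]) {
        best[i] = total;
        prev[i] = p;
      }
    }
  }

  // Choose the cheapest node that reaches the end of the input.
  let last = -1;
  let lastCost = Infinity;
  for (const i of endingAt[length]) {
    const total = best[i] + connect(nodes[i].rightId, 0);
    if (total < lastCost) {
      lastCost = total;
      last = i;
    }
  }

  const path: string[] = [];
  for (let i = last; i !== -1; i = prev[i]) path.unshift(nodes[i].surface);
  return path;
}

const toyLattice: LatticeNode[] = [
  { surface: "東", begin: 0, end: 1, cost: 6000, leftId: 1, rightId: 1 },
  { surface: "東京", begin: 0, end: 2, cost: 3000, leftId: 1, rightId: 1 },
  { surface: "京都", begin: 1, end: 3, cost: 3500, leftId: 1, rightId: 1 },
  { surface: "都", begin: 2, end: 3, cost: 5000, leftId: 1, rightId: 1 },
  { surface: "東京都", begin: 0, end: 3, cost: 4000, leftId: 1, rightId: 1 },
];

console.log(viterbi(toyLattice, 3, () => 100).join(" / ")); // 東京都
```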

If `kiri-native` is not installed, everything works identically via the pure TypeScript engine — no code changes needed.

Prebuilt binaries are available for:

- macOS arm64 / x64
- Linux x64 (gnu) / arm64 (gnu)

```typescript
import { getBackend } from "kiri-ji";

console.log(getBackend()); // "native" or "core"
```
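
The auto-detection described in this section follows a common optional-dependency pattern: try to load the native module and fall back if loading throws. A minimal sketch of that pattern, with a hypothetical injected loader so it runs without `kiri-native` installed (this is not Kiri's actual detection code):

```typescript
type Backend = "native" | "core";

// `loadNative` is a hypothetical loader (for example, one wrapping require("kiri-native")).
// It is expected to throw when the optional package is not installed.
function detectBackend(loadNative: () => unknown): Backend {
  try {
    loadNative();
    return "native";
  } catch {
    return "core"; // optional dependency missing: fall back to pure TypeScript
  }
}

// Simulate a missing package and an installed one.
console.log(detectBackend(() => { throw new Error("Cannot find module 'kiri-native'"); })); // core
console.log(detectBackend(() => ({}))); // native
```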

## Elixir

Documentation: [hexdocs.pm/kiri](https://hexdocs.pm/kiri)

Add `kiri` as a dependency in `mix.exs`:

```elixir
def deps do
  [{:kiri, "~> 0.1"}]
end
```

### Usage

```elixir
{:ok, tokenizer} = Kiri.create_tokenizer("~/.kiri-ji/dict/system_core.dic",
  mode: :c,
  user_dictionaries: ["custom.dic"],
  prolonged_sound_marks: true,
  ignore_yomigana: true
)

morphemes = Kiri.tokenize(tokenizer, "東京都に行った")

for m <- morphemes do
  IO.puts "#{m.surface}\t#{Enum.join(m.part_of_speech, ",")}\t#{m.normalized_form}"
end
# 東京都  名詞,固有名詞,地名,一般,*,*  東京都
# に      助詞,格助詞,*,*,*,*            に
# 行っ    動詞,非自立可能,*,*,五段-カ行,連用形-促音便  行く
# た      助動詞,*,*,*,助動詞-タ,終止形-一般  た
```

### Split Modes

Override the default split mode per call:

```elixir
morphemes = Kiri.tokenize(tokenizer, "関西国際空港", mode: :a)
```

### Morpheme

```elixir
%Kiri.Morpheme{
  surface: "東京都",
  part_of_speech: ["名詞", "固有名詞", "地名", "一般", "*", "*"],
  part_of_speech_id: 42,
  normalized_form: "東京都",
  dictionary_form: "東京都",
  reading_form: "トウキョウト",
  begin: 0,
  end: 3,
  is_oov: false,
  word_id: 12345,
  dictionary_id: 0,
  synonym_group_ids: []
}
```

### Options

| Option                      | Type                | Default | Description                                |
| --------------------------- | ------------------- | ------- | ------------------------------------------ |
| `mode`                      | `:a \| :b \| :c`    | `:c`    | Default split mode (A/B/C)                 |
| `user_dictionaries`         | `[String.t()]`      | `[]`    | Paths to user `.dic` files                 |
| `prolonged_sound_marks`     | `boolean \| config` | `false` | Collapse repeated prolonged sound marks    |
| `ignore_yomigana`           | `boolean \| config` | `false` | Strip bracketed readings after kanji       |
| `disable_normalization`     | `boolean`           | `false` | Skip NFKC input text normalization         |
| `disable_numeric_normalize` | `boolean`           | `false` | Skip numeric normalization in path rewrite |

The Elixir wrapper uses the same shared Rust engine (`kiri-engine`) as the Node.js native backend, with the full plugin stack (input text normalization, path rewriting, split mode, prolonged sound marks, yomigana stripping) ported to Elixir. Precompiled NIF binaries are available via `rustler_precompiled`.

## Project Structure

```
kiri/
├── packages/
│   ├── kiri-core/     Pure TypeScript tokenizer engine          [npm]
│   ├── kiri-engine/   Shared Rust lib crate (pure logic, no FFI) [crates.io]
│   ├── kiri-native/   NAPI wrapper over kiri-engine (Node/Bun)  [npm + crates.io]
│   ├── kiri-nif/      Rustler NIF wrapper over kiri-engine       [crates.io]
│   ├── kiri/          Elixir Mix package (plugins + API)         [hex.pm]
│   └── kiri-ji/       Public TypeScript API + CLI                [npm]
└── bench/             Benchmark scripts and corpus
```

## Development

```bash
# Install dependencies
bun install

# Run tests (TypeScript)
bun test

# Type check + lint
bun run check

# Format
bun run format

# Rust tests (requires Rust toolchain)
cd packages/kiri-engine && cargo test
cd packages/kiri-native && cargo test
cd packages/kiri-nif && cargo test

# Elixir tests (requires Elixir + Rust toolchains)
cd packages/kiri && mix test
```

## Acknowledgments

Kiri is a derivative work of [Sudachi](https://github.com/WorksApplications/Sudachi) by [Works Applications](https://nlp.worksap.co.jp/), a Java-based Japanese morphological analyzer licensed under Apache 2.0. The core algorithms (lattice construction, Viterbi search, double-array trie, plugin architecture) and binary dictionary format are ported from Sudachi's source.

[Kotori](https://github.com/wanasit/kotori) by Wanasit Tanakitrungruang served as an additional reference, particularly for its approach to porting Sudachi concepts to a non-Java runtime.

## License

Apache License 2.0 — see [LICENSE](LICENSE).