kiwi-rs

한국어 README | kiwipiepy parity (EN) | kiwipiepy parity (KO)

Rust bindings for Kiwi via the official C API (include/kiwi/capi.h).

AI user guide

If you use an AI assistant (Codex/ChatGPT/Claude/Gemini, etc.) to generate kiwi-rs code, ask for output with this contract:

Choose one init path only (Kiwi::init, Kiwi::new, or Kiwi::from_config) and explain why.
Return runnable Rust code (fn main() -> Result<(), Box<dyn std::error::Error>>).
Include one verification command (cargo run --example ... or cargo run).
List 2-3 request-specific pitfalls (not generic advice).

Prompt template:

Use kiwi-rs and provide:
1) init path choice with reason,
2) copy-paste runnable Rust code,
3) one verification command,
4) pitfalls for this exact task.
Task: <describe your task here>
Environment: <OS / whether KIWI_LIBRARY_PATH and KIWI_MODEL_PATH are set>

Accuracy checks you should ask AI to follow:

Treat UTF-8 offsets as character indices, not byte indices.
Check supports_utf16_api() before UTF-16 APIs.
Check supports_analyze_mw() before analyze_many_utf16_via_native.
Do not assume full kiwipiepy parity (see docs/kiwipiepy_parity.md).

Skill-based usage (`skills/`)

This repository includes a local AI skill for kiwi-rs:

Skill file: skills/kiwi-rs-assistant/SKILL.md
Reference docs: skills/kiwi-rs-assistant/references/
Agent metadata: skills/kiwi-rs-assistant/agents/openai.yaml

If your assistant supports skill invocation, call it explicitly:

Use $kiwi-rs-assistant and implement: <your task>

`llms.txt` usage

Use llms.txt as the first context file when prompting AI. It summarizes the canonical docs, API surface, examples, and guardrails in one place.

File: llms.txt
Recommended prompt add-on:

Read llms.txt first, then answer using repository APIs and examples only.

Current support status

As of February 16, 2026:

C API symbol loading: complete (101/101 symbols in capi.h are loaded)
Core high-level usage: implemented (init/new/from_config, analyze/tokenize/split/join, MorphemeSet, Pretokenized, typo APIs, SwTokenizer, CoNg APIs)
kiwipiepy full surface parity: partial (Python/C++-specific layers still missing)

Installation

[dependencies]
kiwi-rs = "0.1"

Runtime setup options

Option 1: automatic bootstrap in code

Kiwi::init() tries local paths first, then downloads a matching release pair (library + model) into cache.

use kiwi_rs::Kiwi;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let kiwi = Kiwi::init()?;
    let tokens = kiwi.tokenize("아버지가방에들어가신다.")?;
    println!("{}", tokens.len());
    Ok(())
}

Environment variables used by bootstrap:

KIWI_RS_VERSION (default: latest, e.g. v0.22.2)
KIWI_RS_CACHE_DIR (default: OS cache directory)

External commands required by bootstrap:

Common: curl, tar
Windows zip extraction: powershell (Expand-Archive)

Option 2: helper installer scripts

Linux/macOS:

cd kiwi-rs
make install-kiwi

Windows (PowerShell):

cd kiwi-rs
powershell -NoProfile -ExecutionPolicy Bypass -File .\scripts\install_kiwi.ps1

Installer options:

KIWI_VERSION / -Version (default: latest)
KIWI_PREFIX / -Prefix (default: $HOME/.local/kiwi on Unix, %LOCALAPPDATA%\\kiwi on Windows)
KIWI_MODEL_VARIANT / -ModelVariant (default: base)

Manual path configuration

Env-based (`Kiwi::new`)

KIWI_LIBRARY_PATH: dynamic library path
KIWI_MODEL_PATH: model directory path

Config-based (`Kiwi::from_config`)

use kiwi_rs::{Kiwi, KiwiConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = KiwiConfig::default()
        .with_library_path("/path/to/libkiwi.dylib")
        .with_model_path("/path/to/models/cong/base")
        .add_user_word("러스트", "NNP", 0.0);

    let kiwi = Kiwi::from_config(config)?;
    let analyses = kiwi.analyze_top_n("형태소 분석 예시", 2)?;
    println!("{} candidates", analyses.len());
    Ok(())
}

API overview

Core

Initialization: Kiwi::init, Kiwi::new, Kiwi::from_config, Kiwi::init_direct
Analyze/tokenize: analyze*, tokenize*, analyze_many*, tokenize_many*
Sentence split: split_into_sents*, split_into_sents_with_options*
Join/spacing: join*, space*, glue*

Advanced

Builder: user words, alias words, pre-analyzed words, dictionary loading, regex rules, extract APIs
Constraints: MorphemeSet, Pretokenized
Typo: KiwiTypo, default typo sets, cost controls
Subword: SwTokenizer
CoNg: similarity/context/prediction/context-id conversion

UTF-16 and optional API checks

Kiwi::supports_utf16_api
Kiwi::supports_analyze_mw
KiwiLibrary::supports_builder_init_stream

Supported APIs

Kiwi

The main struct for analysis.

Initialization: init, init_with_version, new, from_config, init_direct, with_model_path
Analysis: analyze, analyze_top_n, analyze_with_options, analyze_with_blocklist, analyze_with_pretokenized, analyze_with_blocklist_and_pretokenized
Tokenization: tokenize, tokenize_with_match_options, tokenize_with_options, tokenize_with_blocklist, tokenize_with_pretokenized, tokenize_with_blocklist_and_pretokenized
Multi-string Analysis: analyze_many_with_options, analyze_many_via_native, tokenize_many, tokenize_many_with_echo
Sentence Splitting: split_into_sents, split_into_sents_with_options
Spacing/Joining: space, space_many, glue, glue_with_options, join, prepare_join_morphs, prepare_join_tokens, prepare_joiner, join_prepared, join_prepared_utf16
Configuration:
- global_config, set_global_config
- set_option, get_option, set_option_f, get_option_f
- cutoff_threshold, set_cutoff_threshold
- integrate_allomorph, set_integrate_allomorph
- space_penalty, set_space_penalty, space_tolerance, set_space_tolerance
- max_unk_form_size, set_max_unk_form_size
- typo_cost_weight, set_typo_cost_weight
Morpheme/Sense Info: morpheme, morpheme_info, morpheme_form, list_senses, tag_to_string, script_name, list_all_scripts
Search: find_morphemes, find_morphemes_with_prefix
Semantics (CoNg):
- most_similar_morphemes, most_similar_contexts
- predict_words_from_context, predict_next_morpheme
- predict_words_from_context_diff, predict_next_morpheme_diff
- morpheme_similarity, context_similarity
- to_context_id, from_context_id
Sub-objects Creation: typo, basic_typo, default_typo_set, new_morphset, new_pretokenized, open_sw_tokenizer
UTF-16: analyze_utf16*, tokenize_utf16*, split_into_sents_utf16*, join_utf16, analyze_many_utf16_via_native
Misc: library_version, num_workers, model_type, typo_cost_threshold, add_re_word, clear_re_words

KiwiBuilder

Used to customize the dictionary and build a Kiwi instance.

Build: build, build_with_default_options
Word Management: add_user_word, add_pre_analyzed_word, add_rule, add_re_rule, add_alias, add_automata
Dictionary Loading: load_dictionary, load_user_dictionary, extract_add_words
Configuration: set_option, get_option, set_option_f, get_option_f, set_cut_off_threshold, set_integrate_allomorph, set_model_path

KiwiTypo

Corrects typos in text.

Creation: Kiwi::typo, Kiwi::basic_typo, Kiwi::default_typo_set
Management: add, update, scale_cost, set_continual_typo_cost, set_lengthening_typo_cost, copy

SwTokenizer

Subword tokenizer.

Usage: encode, encode_with_offsets, decode

MorphemeSet

A set of morphemes for blocklisting.

Management: add, add_utf16

Pretokenized

Defines pre-analyzed token spans.

Management: add_span, add_token_to_span, add_token_to_span_utf16

Examples

cd kiwi-rs
cargo run --example basic
cargo run --example analyze_options
cargo run --example builder_custom_words
cargo run --example typo_build
cargo run --example blocklist_and_pretokenized
cargo run --example split_sentences
cargo run --example utf16_api
cargo run --example native_batch
cargo run --example sw_tokenizer -- /path/to/tokenizer.json
cargo run --example morpheme_semantics
cargo run --release --example bench_tokenize -- --iters 1000 --warmup 100
cargo run --release --example bench_features -- --iters 5000 --warmup 100 --batch-size 256 --batch-iters 500

What each example is for:

Example	What you learn	Key APIs	Notes
`basic`	End-to-end quick start (init + tokenize)	`Kiwi::init`, `Kiwi::tokenize`	Demonstrates cache bootstrap behavior when assets are missing.
`analyze_options`	How candidate analysis options change output	`AnalyzeOptions`, `Kiwi::analyze_with_options`	Shows `top_n`, `match_options`, and candidate probabilities.
`builder_custom_words`	Building a custom analyzer with user lexicon/rules	`KiwiLibrary::builder`, `add_user_words`, `add_re_rule`	Uses builder-time customization APIs.
`typo_build`	Enabling typo-aware analysis	`default_typo_set`, `build_with_typo_and_default_options`	Prints typo-related token metadata.
`blocklist_and_pretokenized`	Blocking specific morphemes and forcing token spans	`new_morphset`, `new_pretokenized`, `tokenize_with_blocklist_and_pretokenized`	Useful for domain constraints and deterministic spans.
`split_sentences`	Sentence segmentation with per-sentence token/sub-sentence structures	`split_into_sents_with_options`	Shows the `Sentence` return surface (`text/start/end/tokens/subs`).
`utf16_api`	UTF-16 analysis/tokenization/sentence split path	`supports_utf16_api`, `analyze_utf16`, `tokenize_utf16`, `split_into_sents_utf16*`	Includes runtime feature check for UTF-16 support.
`native_batch`	Native callback-based batch analysis route	`analyze_many_via_native`, `analyze_many_utf16_via_native`	Useful for higher-throughput multi-line processing.
`sw_tokenizer`	Subword tokenizer encode/decode flow	`open_sw_tokenizer`, `encode_with_offsets`, `decode`	Requires `tokenizer.json` path argument.
`morpheme_semantics`	Morpheme ID lookup and CoNg semantic utilities	`find_morphemes`, `morpheme`, `most_similar_morphemes`, `to_context_id`	Shows semantic APIs that operate on morpheme/context IDs.
`bench_tokenize`	Fair latency/throughput timing split by phase	`Kiwi::init`, `Kiwi::tokenize`	Prints `init`, first call, and steady-state tokenize metrics using the same text repeatedly.
`bench_features`	Expanded feature throughput/latency comparison (Rust side)	`tokenize`, `analyze_with_options`, `split_into_sents`, `space`, `join`, `glue`, `analyze_many`, `tokenize_many`	Pair with `scripts/bench_features_kiwipiepy.py` and `scripts/compare_feature_bench.py` for Rust vs Python comparison.

Rust vs Python benchmark (same conditions)

Use the same input text / warmup / iteration count for both sides:

cd kiwi-rs
cargo run --release --example bench_tokenize -- --text "아버지가방에들어가신다." --warmup 100 --iters 5000
python3 scripts/bench_kiwipiepy.py --text "아버지가방에들어가신다." --warmup 100 --iters 5000

Notes:

Compare bench_avg_ms, calls_per_sec, and tokens_per_sec for steady-state speed.
Compare init_ms and first_tokenize_ms separately; startup can dominate one-shot runs.
Ensure both runtimes use the same Kiwi library/model assets (KIWI_LIBRARY_PATH, KIWI_MODEL_PATH) when strict 1:1 comparison is required.
For option parity with kiwipiepy tokenize defaults, add --python-default-options on the Rust benchmark command.

Expanded feature benchmark snapshot (local run, 2026-02-17)

Commands:

cd kiwi-rs
mkdir -p tmp
.venv-bench/bin/python scripts/compare_feature_bench.py \
  --text "아버지가방에들어가신다." \
  --warmup 100 --iters 5000 \
  --batch-size 256 --batch-iters 500 \
  --input-mode repeated --variant-pool 4096 \
  --repeats 1 \
  --md-out tmp/feature_bench_repeated.md \
  --json-out tmp/feature_bench_repeated.json

.venv-bench/bin/python scripts/compare_feature_bench.py \
  --text "아버지가방에들어가신다." \
  --warmup 100 --iters 5000 \
  --batch-size 256 --batch-iters 500 \
  --input-mode varied --variant-pool 8192 \
  --repeats 1 \
  --md-out tmp/feature_bench_varied.md \
  --json-out tmp/feature_bench_varied.json

Automated weekly run (same command) is configured in .github/workflows/feature-benchmark.yml. Generated markdown/json snapshots now include benchmark environment and config metadata.

Summary below is the median of 1 run, with min-max in brackets (same value for single-run snapshots).

Benchmark environment:

Item	Value
Timestamp (local)	2026-02-17T17:10:06+09:00
OS	Darwin 24.6.0
Platform	macOS-15.7.4-arm64-arm-64bit-Mach-O
CPU	arm64 (CPU brand unavailable in sandbox)
Cores (physical/logical)	-/10
Memory	16.00 GiB (17179869184 bytes)
rustc	rustc 1.93.1 (01f6ddf75 2026-02-11)
cargo	cargo 1.93.1 (083ac5135 2025-12-15)
Python (harness)	3.14.3 (main, Feb 3 2026, 15:32:20) [Clang 17.0.0 (clang-1700.6.3.2)]
Python (bench bin)	Python 3.14.3 (`.venv-bench/bin/python`)
kiwipiepy	0.22.2
Git	`753b8dc4d648d33b5ed6f163ba2ae3cb46397a7e` (`main`, dirty=True)

Benchmark config:

Item	Value
text	아버지가방에들어가신다.
warmup	100
iters	5000
batch_size	256
batch_iters	500
input_mode	repeated
variant_pool	4096
repeats	1
join_lm_search	true

Throughput comparison (calls_per_sec, higher is better):

Feature	`kiwi-rs`	`kiwipiepy`	Relative (`kiwi-rs / kiwipiepy`)
`tokenize`	1185489.51 [1185489.51-1185489.51]	7792.55 [7792.55-7792.55]	152.13x
`analyze_top1`	1199112.66 [1199112.66-1199112.66]	7612.25 [7612.25-7612.25]	157.52x
`split_into_sents`	28908752.41 [28908752.41-28908752.41]	3802.38 [3802.38-3802.38]	7602.80x
`split_into_sents_with_tokens`	250558.01 [250558.01-250558.01]	4872.41 [4872.41-4872.41]	51.42x
`space`	357757.20 [357757.20-357757.20]	4768.69 [4768.69-4768.69]	75.02x
`join`	2402355.08 [2402355.08-2402355.08]	675759.32 [675759.32-675759.32]	3.56x
`glue`	6221490.02 [6221490.02-6221490.02]	7613.64 [7613.64-7613.64]	817.15x
`analyze_many_loop`	32.36 [32.36-32.36]	27.94 [27.94-27.94]	1.16x
`analyze_many_native`	166.11 [166.11-166.11]	165.71 [165.71-165.71]	1.00x
`tokenize_many_loop`	3409.24 [3409.24-3409.24]	28.66 [28.66-28.66]	118.95x
`tokenize_many_batch`	3134.67 [3134.67-3134.67]	184.16 [184.16-184.16]	17.02x
`split_many_loop`	27.87 [27.87-27.87]	29.18 [29.18-29.18]	0.96x
`space_many_loop`	29.39 [29.39-29.39]	27.22 [27.22-27.22]	1.08x
`space_many_batch`	161.79 [161.79-161.79]	160.39 [160.39-160.39]	1.01x
`batch_analyze_native`	166.11 [166.11-166.11]	165.71 [165.71-165.71]	1.00x

Startup (init_ms, lower is better):

Init path	`kiwi-rs`	`kiwipiepy`
`Kiwi::init()` / `Kiwi()`	1417.905 [1417.905-1417.905] ms	680.748 [680.748-680.748] ms

Rust-only benchmark features:

Feature	`kiwi-rs`
`join_prepared`	277556.12 [277556.12-277556.12]
`join_prepared_utf16`	278618.79 [278618.79-278618.79]
`joiner_reuse`	3518440.85 [3518440.85-3518440.85]
`joiner_reuse_utf16`	2743359.29 [2743359.29-2743359.29]

Python-only benchmark features:

Feature	`kiwipiepy`
`split_many_batch`	181.50 [181.50-181.50]

Varied-input (near no-cache) ratio snapshot (input_mode=varied, variant_pool=8192):

Feature	Repeated Ratio	Repeated Δ%	Varied Ratio	Varied Δ%
`tokenize`	152.13x	+15113.0%	0.94x	-6.0%
`analyze_top1`	157.52x	+15652.0%	1.01x	+1.0%
`split_into_sents`	7602.80x	+760180.0%	1.16x	+16.0%
`split_into_sents_with_tokens`	51.42x	+5042.0%	1.02x	+2.0%
`glue`	817.15x	+81615.0%	1.15x	+15.0%
`analyze_many_native`	1.00x	+0.0%	0.82x	-18.0%
`tokenize_many_batch`	17.02x	+1602.0%	0.79x	-21.0%
`space_many_batch`	1.01x	+1.0%	0.95x	-5.0%
`join`	3.56x	+256.0%	4.37x	+337.0%

Δ% is (kiwi-rs / kiwipiepy - 1) * 100.
+ means kiwi-rs is faster, - means slower.

Visual bar charts (relative throughput):

xychart-beta
    title "Repeated Input Ratio (Selected)"
    x-axis ["tokenize","analyze_top1","split_with_tokens","join","analyze_many_native","tokenize_many_batch","space_many_batch"]
    y-axis "kiwi-rs / kiwipiepy (x)" 0 --> 170
    bar [152.13,157.52,51.42,3.56,1.00,17.02,1.01]

xychart-beta
    title "Repeated Input Ratio (Split + Glue)"
    x-axis ["split_into_sents","glue"]
    y-axis "kiwi-rs / kiwipiepy (x)" 0 --> 8000
    bar [7602.80,817.15]

xychart-beta
    title "Varied Input Ratio (Near No-Cache)"
    x-axis ["tokenize","analyze_top1","split","split_with_tokens","space","glue","join","analyze_many_native","tokenize_many_batch","space_many_batch"]
    y-axis "kiwi-rs / kiwipiepy (x)" 0 --> 5
    bar [0.94,1.01,1.16,1.02,1.10,1.15,4.37,0.82,0.79,0.95]

Absolute-value charts (varied input, near no-cache):

Throughput = number of calls processed per second (calls/sec, higher is better)
Latency = average time per call (avg_ms, lower is better)
mermaid xychart-beta can visually overlap multi-bar series in some renderers.
To keep readability, charts below are split by engine.

xychart-beta
    title "Varied Throughput (Core Features, kiwi-rs)"
    x-axis ["tokenize","analyze_top1","split","split_with_tokens","space","glue","analyze_many_native","tokenize_many_batch","space_many_batch"]
    y-axis "calls/sec (higher is better)" 0 --> 8000
    bar [6956.95,7319.22,5104.73,4372.13,4944.59,5692.86,158.62,151.12,150.76]

xychart-beta
    title "Varied Throughput (Core Features, kiwipiepy)"
    x-axis ["tokenize","analyze_top1","split","split_with_tokens","space","glue","analyze_many_native","tokenize_many_batch","space_many_batch"]
    y-axis "calls/sec (higher is better)" 0 --> 8000
    bar [7393.81,7212.44,4399.49,4282.95,4497.21,4965.80,192.74,190.38,159.43]

xychart-beta
    title "Varied Throughput (Join)"
    x-axis ["join (kiwi-rs)","join (kiwipiepy)"]
    y-axis "calls/sec (higher is better)" 0 --> 3000000
    bar [2927258.22,669983.08]

xychart-beta
    title "Varied Latency (Core Features, kiwi-rs)"
    x-axis ["tokenize","analyze_top1","split","split_with_tokens","space","glue","analyze_many_native","tokenize_many_batch","space_many_batch"]
    y-axis "avg ms/call (lower is better)" 0 --> 7
    bar [0.143741,0.136627,0.195897,0.228721,0.202241,0.175659,6.304233,6.617300,6.632977]

xychart-beta
    title "Varied Latency (Core Features, kiwipiepy)"
    x-axis ["tokenize","analyze_top1","split","split_with_tokens","space","glue","analyze_many_native","tokenize_many_batch","space_many_batch"]
    y-axis "avg ms/call (lower is better)" 0 --> 7
    bar [0.135248,0.138649,0.227299,0.233484,0.222360,0.201377,5.188234,5.252784,6.272204]

Side-by-side numeric comparison (varied input, near no-cache):

Feature	`kiwi-rs` calls/sec	`kiwipiepy` calls/sec	Ratio (`x`)	Δ%	`kiwi-rs` avg_ms	`kiwipiepy` avg_ms
`tokenize`	6956.95	7393.81	0.94x	-6.0%	0.143741	0.135248
`analyze_top1`	7319.22	7212.44	1.01x	+1.0%	0.136627	0.138649
`split_into_sents`	5104.73	4399.49	1.16x	+16.0%	0.195897	0.227299
`split_into_sents_with_tokens`	4372.13	4282.95	1.02x	+2.0%	0.228721	0.233484
`space`	4944.59	4497.21	1.10x	+10.0%	0.202241	0.222360
`glue`	5692.86	4965.80	1.15x	+15.0%	0.175659	0.201377
`join`	2927258.22	669983.08	4.37x	+337.0%	0.000342	0.001493
`analyze_many_native`	158.62	192.74	0.82x	-18.0%	6.304233	5.188234
`tokenize_many_batch`	151.12	190.38	0.79x	-21.0%	6.617300	5.252784
`space_many_batch`	150.76	159.43	0.95x	-5.0%	6.632977	6.272204

Δ% is (kiwi-rs / kiwipiepy - 1) * 100.

Interpretation:

join is now faster on kiwi-rs for repeated identical morph sequences because the default join path reuses an internal LRU joiner cache.
split_into_sents and glue are now above 1.0x even in the varied scenario after reducing miss-path cache overhead and reusing glue pair decisions.
prepare_joiner (joiner_reuse*) remains the fastest path when explicitly reusing a fixed morph sequence.
Repeated identical inputs show large gains on tokenize*, analyze*, and tokenized sentence split paths because internal result caches are reused.
For strict fairness, publish both scenarios together: input_mode=repeated (warm-cache) and input_mode=varied (near no-cache).
split_many_batch is still Python-only in this benchmark set.
Kiwi::init() includes runtime asset discovery/bootstrap checks, so startup should be evaluated separately from steady-state throughput.

kiwipiepy parity

Detailed matrix:

English: docs/kiwipiepy_parity.md
Korean: docs/kiwipiepy_parity.ko.md

In short, kiwi-rs already covers most C API-backed workflows, while Python/C++-specific layers (template/dataset/ngram utilities) remain out of scope for a pure C API binding.

Common errors

failed to load library
- Library path is invalid or inaccessible. Set KIWI_LIBRARY_PATH explicitly or use Kiwi::init().
Cannot open extract.mdl for WordDetector
- Model path is wrong. Point KIWI_MODEL_PATH (or config model path) to the directory containing model files.
reading type 'Ds' failed (iostream-style errors)
- Library/model version mismatch. Use matching assets from the same Kiwi release tag.

Local quality checks

cd kiwi-rs
cargo fmt
cargo check
cargo test
cargo clippy --all-targets -- -D warnings
cargo check --examples
cargo package --allow-dirty

License

kiwi-rs is licensed under LGPL-2.1-or-later.
The upstream Kiwi C library used by this project is distributed under LGPL 2.1 terms.
See LICENSE for the full license text.

kiwi-rs 0.1.4