kiwi-rs
Korean README | kiwipiepy parity (EN) | kiwipiepy parity (KO)
Rust bindings for Kiwi via the official C API (include/kiwi/capi.h).
AI user guide
If you use an AI assistant (Codex/ChatGPT/Claude/Gemini, etc.) to generate kiwi-rs code, ask for output with this contract:
- Choose one init path only (`Kiwi::init`, `Kiwi::new`, or `Kiwi::from_config`) and explain why.
- Return runnable Rust code (`fn main() -> Result<(), Box<dyn std::error::Error>>`).
- Include one verification command (`cargo run --example ...` or `cargo run`).
- List 2-3 request-specific pitfalls (not generic advice).
Prompt template:
Use kiwi-rs and provide:
1) init path choice with reason,
2) copy-paste runnable Rust code,
3) one verification command,
4) pitfalls for this exact task.
Task: <describe your task here>
Environment: <OS / whether KIWI_LIBRARY_PATH and KIWI_MODEL_PATH are set>
Accuracy checks you should ask AI to follow:
- Treat UTF-8 offsets as character indices, not byte indices.
- Check `supports_utf16_api()` before UTF-16 APIs.
- Check `supports_analyze_mw()` before `analyze_many_utf16_via_native`.
- Do not assume full `kiwipiepy` parity (see `docs/kiwipiepy_parity.md`).
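The first check is easy to get wrong in Rust, because `&str` slicing takes byte offsets. The sketch below is not part of kiwi-rs: it uses only the standard library, and the helper names `char_range_to_byte_range` and `slice_by_chars` are hypothetical. It assumes, as the guidance above states, that a reported start/end pair counts characters.

```rust
/// Convert a half-open character range into a byte range for `text`.
/// Returns `None` if the range is reversed or runs past the end of the string.
fn char_range_to_byte_range(text: &str, start: usize, end: usize) -> Option<(usize, usize)> {
    if start > end {
        return None;
    }
    // Byte offset of every char boundary, followed by the end-of-string boundary.
    let mut boundaries = text
        .char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(text.len()));
    let byte_start = boundaries.nth(start)?;
    let byte_end = if end == start {
        byte_start
    } else {
        // `nth` already consumed `start + 1` items, so skip the remaining difference.
        boundaries.nth(end - start - 1)?
    };
    Some((byte_start, byte_end))
}

/// Slice `text` by character indices instead of byte indices.
fn slice_by_chars(text: &str, start: usize, end: usize) -> Option<&str> {
    let (byte_start, byte_end) = char_range_to_byte_range(text, start, end)?;
    text.get(byte_start..byte_end)
}

fn main() {
    let text = "아버지가 방에";
    // Each Hangul syllable occupies 3 bytes in UTF-8, so char 3 is byte 9.
    println!("{:?}", char_range_to_byte_range(text, 0, 3));
    println!("{:?}", slice_by_chars(text, 0, 3));
}
```

Indexing `text[0..3]` directly would panic or cut a syllable in half here, which is why the conversion step matters for Korean input.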
Skill-based usage (skills/)
This repository includes a local AI skill for kiwi-rs:
- Skill file: `skills/kiwi-rs-assistant/SKILL.md`
- Reference docs: `skills/kiwi-rs-assistant/references/`
- Agent metadata: `skills/kiwi-rs-assistant/agents/openai.yaml`
If your assistant supports skill invocation, call it explicitly:
Use $kiwi-rs-assistant and implement: <your task>
llms.txt usage
Use llms.txt as the first context file when prompting AI. It summarizes the canonical docs, API surface, examples, and guardrails in one place.
- File: `llms.txt`
- Recommended prompt add-on:
Read llms.txt first, then answer using repository APIs and examples only.
Current support status
As of February 16, 2026:
- C API symbol loading: complete (`101/101` symbols in `capi.h` are loaded)
- Core high-level usage: implemented (`init`/`new`/`from_config`, `analyze`/`tokenize`/`split`/`join`, `MorphemeSet`, `Pretokenized`, typo APIs, `SwTokenizer`, CoNg APIs)
- kiwipiepy full surface parity: partial (Python/C++-specific layers still missing)
Installation
[dependencies]
kiwi-rs = "0.1"
Runtime setup options
Option 1: automatic bootstrap in code
Kiwi::init() tries local paths first, then downloads a matching release pair (library + model) into cache.
use kiwi_rs::Kiwi;
Environment variables used by bootstrap:
- `KIWI_RS_VERSION` (default: `latest`, e.g. `v0.22.2`)
- `KIWI_RS_CACHE_DIR` (default: OS cache directory)
External commands required by bootstrap:
- Common: `curl`, `tar`
- Windows zip extraction: `powershell` (`Expand-Archive`)
Option 2: helper installer scripts
Linux/macOS:
Windows (PowerShell):
cd kiwi-rs
powershell -NoProfile -ExecutionPolicy Bypass -File .\scripts\install_kiwi.ps1
Installer options:
- `KIWI_VERSION` / `-Version` (default: `latest`)
- `KIWI_PREFIX` / `-Prefix` (default: `$HOME/.local/kiwi` on Unix, `%LOCALAPPDATA%\\kiwi` on Windows)
- `KIWI_MODEL_VARIANT` / `-ModelVariant` (default: `base`)
Manual path configuration
Env-based (Kiwi::new)
- `KIWI_LIBRARY_PATH`: dynamic library path
- `KIWI_MODEL_PATH`: model directory path
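Before taking the env-based path, it can help to confirm that both variables are actually set. The following is a minimal standard-library sketch, not a kiwi-rs API: the `Preflight` type and `preflight` function are hypothetical names introduced for illustration, and the filesystem checks in `main` only report whether the paths exist.

```rust
use std::path::Path;

/// Outcome of a preflight check on the two environment variables.
#[derive(Debug, PartialEq)]
enum Preflight {
    Ready { library: String, model_dir: String },
    Missing(Vec<&'static str>),
}

/// Hypothetical helper: decide whether the env-based path can be attempted.
/// Unset and empty values are both treated as missing.
fn preflight(library: Option<String>, model_dir: Option<String>) -> Preflight {
    let mut missing = Vec::new();
    if library.as_deref().map_or(true, str::is_empty) {
        missing.push("KIWI_LIBRARY_PATH");
    }
    if model_dir.as_deref().map_or(true, str::is_empty) {
        missing.push("KIWI_MODEL_PATH");
    }
    match (library, model_dir) {
        (Some(library), Some(model_dir)) if missing.is_empty() => {
            Preflight::Ready { library, model_dir }
        }
        _ => Preflight::Missing(missing),
    }
}

fn main() {
    let result = preflight(
        std::env::var("KIWI_LIBRARY_PATH").ok(),
        std::env::var("KIWI_MODEL_PATH").ok(),
    );
    match &result {
        Preflight::Ready { library, model_dir } => {
            println!("library file exists: {}", Path::new(library).is_file());
            println!("model dir exists: {}", Path::new(model_dir).is_dir());
        }
        Preflight::Missing(names) => {
            println!("unset: {names:?}; the bootstrap path (Kiwi::init) may be a better fit")
        }
    }
}
```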
Config-based (Kiwi::from_config)
use ;
API overview
Core
- Initialization: `Kiwi::init`, `Kiwi::new`, `Kiwi::from_config`, `Kiwi::init_direct`
- Analyze/tokenize: `analyze*`, `tokenize*`, `analyze_many*`, `tokenize_many*`
- Sentence split: `split_into_sents*`, `split_into_sents_with_options*`
- Join/spacing: `join*`, `space*`, `glue*`
Advanced
- Builder: user words, alias words, pre-analyzed words, dictionary loading, regex rules, extract APIs
- Constraints: `MorphemeSet`, `Pretokenized`
- Typo: `KiwiTypo`, default typo sets, cost controls
- Subword: `SwTokenizer`
- CoNg: similarity/context/prediction/context-id conversion
UTF-16 and optional API checks
- `Kiwi::supports_utf16_api`
- `Kiwi::supports_analyze_mw`
- `KiwiLibrary::supports_builder_init_stream`
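When working with UTF-16 buffers, character counts and UTF-16 code-unit counts differ for characters outside the Basic Multilingual Plane. The sketch below uses only the standard library and a hypothetical helper, `char_index_to_utf16_offset`. It makes no claim about which unit the kiwi-rs UTF-16 APIs report; it only shows how to map a character index into a `Vec<u16>` position if that conversion is needed.

```rust
/// Convert a character index into a UTF-16 code-unit offset.
/// Returns `None` if the index is past the end of the string.
fn char_index_to_utf16_offset(text: &str, char_index: usize) -> Option<usize> {
    if char_index > text.chars().count() {
        return None;
    }
    // Each char contributes 1 code unit (BMP) or 2 (surrogate pair).
    Some(text.chars().take(char_index).map(char::len_utf16).sum())
}

fn main() {
    let text = "가😀나";
    let units: Vec<u16> = text.encode_utf16().collect();
    println!(
        "bytes={}, chars={}, utf16 units={}",
        text.len(),
        text.chars().count(),
        units.len()
    );
    println!("{:?}", char_index_to_utf16_offset(text, 2));
}
```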
Supported APIs
Kiwi
The main struct for analysis.
- Initialization: `init`, `init_with_version`, `new`, `from_config`, `init_direct`, `with_model_path`
- Analysis: `analyze`, `analyze_top_n`, `analyze_with_options`, `analyze_with_blocklist`, `analyze_with_pretokenized`, `analyze_with_blocklist_and_pretokenized`
- Tokenization: `tokenize`, `tokenize_with_match_options`, `tokenize_with_options`, `tokenize_with_blocklist`, `tokenize_with_pretokenized`, `tokenize_with_blocklist_and_pretokenized`
- Multi-string Analysis: `analyze_many_with_options`, `analyze_many_via_native`, `tokenize_many`, `tokenize_many_with_echo`
- Sentence Splitting: `split_into_sents`, `split_into_sents_with_options`
- Spacing/Joining: `space`, `space_many`, `glue`, `glue_with_options`, `join`, `prepare_join_morphs`, `prepare_join_tokens`, `prepare_joiner`, `join_prepared`, `join_prepared_utf16`
- Configuration:
  - `global_config`, `set_global_config`
  - `set_option`, `get_option`, `set_option_f`, `get_option_f`
  - `cutoff_threshold`, `set_cutoff_threshold`
  - `integrate_allomorph`, `set_integrate_allomorph`
  - `space_penalty`, `set_space_penalty`, `space_tolerance`, `set_space_tolerance`
  - `max_unk_form_size`, `set_max_unk_form_size`
  - `typo_cost_weight`, `set_typo_cost_weight`
- Morpheme/Sense Info: `morpheme`, `morpheme_info`, `morpheme_form`, `list_senses`, `tag_to_string`, `script_name`, `list_all_scripts`
- Search: `find_morphemes`, `find_morphemes_with_prefix`
- Semantics (CoNg):
  - `most_similar_morphemes`, `most_similar_contexts`
  - `predict_words_from_context`, `predict_next_morpheme`
  - `predict_words_from_context_diff`, `predict_next_morpheme_diff`
  - `morpheme_similarity`, `context_similarity`
  - `to_context_id`, `from_context_id`
- Sub-objects Creation: `typo`, `basic_typo`, `default_typo_set`, `new_morphset`, `new_pretokenized`, `open_sw_tokenizer`
- UTF-16: `analyze_utf16*`, `tokenize_utf16*`, `split_into_sents_utf16*`, `join_utf16`, `analyze_many_utf16_via_native`
- Misc: `library_version`, `num_workers`, `model_type`, `typo_cost_threshold`, `add_re_word`, `clear_re_words`
KiwiBuilder
Used to customize the dictionary and build a Kiwi instance.
- Build: `build`, `build_with_default_options`
- Word Management: `add_user_word`, `add_pre_analyzed_word`, `add_rule`, `add_re_rule`, `add_alias`, `add_automata`
- Dictionary Loading: `load_dictionary`, `load_user_dictionary`, `extract_add_words`
- Configuration: `set_option`, `get_option`, `set_option_f`, `get_option_f`, `set_cut_off_threshold`, `set_integrate_allomorph`, `set_model_path`
KiwiTypo
Corrects typos in text.
- Creation: `Kiwi::typo`, `Kiwi::basic_typo`, `Kiwi::default_typo_set`
- Management: `add`, `update`, `scale_cost`, `set_continual_typo_cost`, `set_lengthening_typo_cost`, `copy`
SwTokenizer
Subword tokenizer.
- Usage: `encode`, `encode_with_offsets`, `decode`
MorphemeSet
A set of morphemes for blocklisting.
- Management: `add`, `add_utf16`
Pretokenized
Defines pre-analyzed token spans.
- Management: `add_span`, `add_token_to_span`, `add_token_to_span_utf16`
Examples
What each example is for:
| Example | What you learn | Key APIs | Notes |
|---|---|---|---|
| `basic` | End-to-end quick start (init + tokenize) | `Kiwi::init`, `Kiwi::tokenize` | Demonstrates cache bootstrap behavior when assets are missing. |
| `analyze_options` | How candidate analysis options change output | `AnalyzeOptions`, `Kiwi::analyze_with_options` | Shows `top_n`, `match_options`, and candidate probabilities. |
| `builder_custom_words` | Building a custom analyzer with user lexicon/rules | `KiwiLibrary::builder`, `add_user_words`, `add_re_rule` | Uses builder-time customization APIs. |
| `typo_build` | Enabling typo-aware analysis | `default_typo_set`, `build_with_typo_and_default_options` | Prints typo-related token metadata. |
| `blocklist_and_pretokenized` | Blocking specific morphemes and forcing token spans | `new_morphset`, `new_pretokenized`, `tokenize_with_blocklist_and_pretokenized` | Useful for domain constraints and deterministic spans. |
| `split_sentences` | Sentence segmentation with per-sentence token/sub-sentence structures | `split_into_sents_with_options` | Shows the `Sentence` return surface (`text`/`start`/`end`/`tokens`/`subs`). |
| `utf16_api` | UTF-16 analysis/tokenization/sentence split path | `supports_utf16_api`, `analyze_utf16*`, `tokenize_utf16*`, `split_into_sents_utf16*` | Includes runtime feature check for UTF-16 support. |
| `native_batch` | Native callback-based batch analysis route | `analyze_many_via_native`, `analyze_many_utf16_via_native` | Useful for higher-throughput multi-line processing. |
| `sw_tokenizer` | Subword tokenizer encode/decode flow | `open_sw_tokenizer`, `encode_with_offsets`, `decode` | Requires `tokenizer.json` path argument. |
| `morpheme_semantics` | Morpheme ID lookup and CoNg semantic utilities | `find_morphemes`, `morpheme`, `most_similar_morphemes`, `to_context_id` | Shows semantic APIs that operate on morpheme/context IDs. |
| `bench_tokenize` | Fair latency/throughput timing split by phase | `Kiwi::init`, `Kiwi::tokenize` | Prints init, first call, and steady-state tokenize metrics using the same text repeatedly. |
| `bench_features` | Expanded feature throughput/latency comparison (Rust side) | `tokenize`, `analyze_with_options`, `split_into_sents*`, `space*`, `join*`, `glue`, `analyze_many*`, `tokenize_many` | Pair with `scripts/bench_features_kiwipiepy.py` and `scripts/compare_feature_bench.py` for Rust vs Python comparison. |
Rust vs Python benchmark (same conditions)
Use the same input text / warmup / iteration count for both sides:
Notes:
- Compare `bench_avg_ms`, `calls_per_sec`, and `tokens_per_sec` for steady-state speed.
- Compare `init_ms` and `first_tokenize_ms` separately; startup can dominate one-shot runs.
- Ensure both runtimes use the same Kiwi library/model assets (`KIWI_LIBRARY_PATH`, `KIWI_MODEL_PATH`) when strict 1:1 comparison is required.
- For option parity with `kiwipiepy` tokenize defaults, add `--python-default-options` on the Rust benchmark command.
Expanded feature benchmark snapshot (local run, 2026-02-17)
Commands:
Automated weekly run (same command) is configured in .github/workflows/feature-benchmark.yml.
Generated markdown/json snapshots now include benchmark environment and config metadata.
Summary below is the median of 1 run, with min-max in brackets (same value for single-run snapshots).
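The "median [min-max]" cell format can be reproduced with a few lines of standard-library Rust. This is an illustrative sketch rather than the project's benchmark script: `summarize` and `format_summary` are hypothetical names, and averaging the two middle values for an even number of runs is an assumed convention.

```rust
/// Return (median, min, max) for a set of per-run measurements.
fn summarize(runs: &[f64]) -> Option<(f64, f64, f64)> {
    if runs.is_empty() {
        return None;
    }
    let mut sorted = runs.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    let median = if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        // Assumed convention: average the two middle values.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    };
    Some((median, sorted[0], sorted[sorted.len() - 1]))
}

/// Format a summary the way the tables in this section present it.
fn format_summary(runs: &[f64]) -> Option<String> {
    let (median, min, max) = summarize(runs)?;
    Some(format!("{median:.2} [{min:.2}-{max:.2}]"))
}

fn main() {
    // With a single run, median, min, and max are the same value.
    println!("{:?}", format_summary(&[7792.55]));
    println!("{:?}", format_summary(&[3.0, 1.0, 2.0]));
}
```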
Benchmark environment:
| Item | Value |
|---|---|
| Timestamp (local) | 2026-02-17T17:10:06+09:00 |
| OS | Darwin 24.6.0 |
| Platform | macOS-15.7.4-arm64-arm-64bit-Mach-O |
| CPU | arm64 (CPU brand unavailable in sandbox) |
| Cores (physical/logical) | -/10 |
| Memory | 16.00 GiB (17179869184 bytes) |
| rustc | rustc 1.93.1 (01f6ddf75 2026-02-11) |
| cargo | cargo 1.93.1 (083ac5135 2025-12-15) |
| Python (harness) | 3.14.3 (main, Feb 3 2026, 15:32:20) [Clang 17.0.0 (clang-1700.6.3.2)] |
| Python (bench bin) | Python 3.14.3 (.venv-bench/bin/python) |
| kiwipiepy | 0.22.2 |
| Git | 753b8dc4d648d33b5ed6f163ba2ae3cb46397a7e (main, dirty=True) |
Benchmark config:
| Item | Value |
|---|---|
| text | 아버지가방에들어가신다. |
| warmup | 100 |
| iters | 5000 |
| batch_size | 256 |
| batch_iters | 500 |
| input_mode | repeated |
| variant_pool | 4096 |
| repeats | 1 |
| join_lm_search | true |
Throughput comparison (calls_per_sec, higher is better):
| Feature | `kiwi-rs` | `kiwipiepy` | Relative (`kiwi-rs` / `kiwipiepy`) |
|---|---|---|---|
| `tokenize` | 1185489.51 [1185489.51-1185489.51] | 7792.55 [7792.55-7792.55] | 152.13x |
| `analyze_top1` | 1199112.66 [1199112.66-1199112.66] | 7612.25 [7612.25-7612.25] | 157.52x |
| `split_into_sents` | 28908752.41 [28908752.41-28908752.41] | 3802.38 [3802.38-3802.38] | 7602.80x |
| `split_into_sents_with_tokens` | 250558.01 [250558.01-250558.01] | 4872.41 [4872.41-4872.41] | 51.42x |
| `space` | 357757.20 [357757.20-357757.20] | 4768.69 [4768.69-4768.69] | 75.02x |
| `join` | 2402355.08 [2402355.08-2402355.08] | 675759.32 [675759.32-675759.32] | 3.56x |
| `glue` | 6221490.02 [6221490.02-6221490.02] | 7613.64 [7613.64-7613.64] | 817.15x |
| `analyze_many_loop` | 32.36 [32.36-32.36] | 27.94 [27.94-27.94] | 1.16x |
| `analyze_many_native` | 166.11 [166.11-166.11] | 165.71 [165.71-165.71] | 1.00x |
| `tokenize_many_loop` | 3409.24 [3409.24-3409.24] | 28.66 [28.66-28.66] | 118.95x |
| `tokenize_many_batch` | 3134.67 [3134.67-3134.67] | 184.16 [184.16-184.16] | 17.02x |
| `split_many_loop` | 27.87 [27.87-27.87] | 29.18 [29.18-29.18] | 0.96x |
| `space_many_loop` | 29.39 [29.39-29.39] | 27.22 [27.22-27.22] | 1.08x |
| `space_many_batch` | 161.79 [161.79-161.79] | 160.39 [160.39-160.39] | 1.01x |
| `batch_analyze_native` | 166.11 [166.11-166.11] | 165.71 [165.71-165.71] | 1.00x |
Startup (init_ms, lower is better):
| Init path | `kiwi-rs` | `kiwipiepy` |
|---|---|---|
| `Kiwi::init()` / `Kiwi()` | 1417.905 [1417.905-1417.905] ms | 680.748 [680.748-680.748] ms |
Rust-only benchmark features:
| Feature | kiwi-rs |
|---|---|
| `join_prepared` | 277556.12 [277556.12-277556.12] |
| `join_prepared_utf16` | 278618.79 [278618.79-278618.79] |
| `joiner_reuse` | 3518440.85 [3518440.85-3518440.85] |
| `joiner_reuse_utf16` | 2743359.29 [2743359.29-2743359.29] |
Python-only benchmark features:
| Feature | kiwipiepy |
|---|---|
| `split_many_batch` | 181.50 [181.50-181.50] |
Varied-input (near no-cache) ratio snapshot (input_mode=varied, variant_pool=8192):
| Feature | Repeated Ratio | Repeated Δ% | Varied Ratio | Varied Δ% |
|---|---|---|---|---|
| `tokenize` | 152.13x | +15113.0% | 0.94x | -6.0% |
| `analyze_top1` | 157.52x | +15652.0% | 1.01x | +1.0% |
| `split_into_sents` | 7602.80x | +760180.0% | 1.16x | +16.0% |
| `split_into_sents_with_tokens` | 51.42x | +5042.0% | 1.02x | +2.0% |
| `glue` | 817.15x | +81615.0% | 1.15x | +15.0% |
| `analyze_many_native` | 1.00x | +0.0% | 0.82x | -18.0% |
| `tokenize_many_batch` | 17.02x | +1602.0% | 0.79x | -21.0% |
| `space_many_batch` | 1.01x | +1.0% | 0.95x | -5.0% |
| `join` | 3.56x | +256.0% | 4.37x | +337.0% |
Δ% is (kiwi-rs / kiwipiepy - 1) * 100.
+ means kiwi-rs is faster, - means slower.
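The ratio and Δ% columns follow directly from the two throughput figures. The sketch below uses hypothetical helper names and plain standard-library arithmetic. Rounding the ratio to two decimals before applying the formula is an inference from the published numbers (for example, 0.94x paired with -6.0%), not a confirmed detail of the benchmark scripts.

```rust
/// Round to two decimal places, matching the "0.94x" style in the tables.
fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}

/// Relative throughput: kiwi-rs calls/sec divided by kiwipiepy calls/sec.
fn ratio(kiwi_rs_cps: f64, kiwipiepy_cps: f64) -> f64 {
    round2(kiwi_rs_cps / kiwipiepy_cps)
}

/// Delta percent as defined above: (ratio - 1) * 100.
fn delta_percent(ratio: f64) -> f64 {
    (ratio - 1.0) * 100.0
}

/// Render a row fragment such as "0.94x | -6.0%".
fn format_row(kiwi_rs_cps: f64, kiwipiepy_cps: f64) -> String {
    let r = ratio(kiwi_rs_cps, kiwipiepy_cps);
    format!("{:.2}x | {:+.1}%", r, delta_percent(r))
}

fn main() {
    // Varied-input tokenize figures from this section.
    println!("{}", format_row(6956.95, 7393.81));
    // Repeated-input tokenize figures from this section.
    println!("{}", format_row(1185489.51, 7792.55));
}
```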
Visual bar charts (relative throughput):
xychart-beta
title "Repeated Input Ratio (Selected)"
x-axis ["tokenize","analyze_top1","split_with_tokens","join","analyze_many_native","tokenize_many_batch","space_many_batch"]
y-axis "kiwi-rs / kiwipiepy (x)" 0 --> 170
bar [152.13,157.52,51.42,3.56,1.00,17.02,1.01]
xychart-beta
title "Repeated Input Ratio (Split + Glue)"
x-axis ["split_into_sents","glue"]
y-axis "kiwi-rs / kiwipiepy (x)" 0 --> 8000
bar [7602.80,817.15]
xychart-beta
title "Varied Input Ratio (Near No-Cache)"
x-axis ["tokenize","analyze_top1","split","split_with_tokens","space","glue","join","analyze_many_native","tokenize_many_batch","space_many_batch"]
y-axis "kiwi-rs / kiwipiepy (x)" 0 --> 5
bar [0.94,1.01,1.16,1.02,1.10,1.15,4.37,0.82,0.79,0.95]
Absolute-value charts (varied input, near no-cache):
- Throughput = number of calls processed per second (`calls/sec`, higher is better)
- Latency = average time per call (`avg_ms`, lower is better)
- `mermaid xychart-beta` can visually overlap multi-bar series in some renderers.
- To keep readability, charts below are split by engine.
xychart-beta
title "Varied Throughput (Core Features, kiwi-rs)"
x-axis ["tokenize","analyze_top1","split","split_with_tokens","space","glue","analyze_many_native","tokenize_many_batch","space_many_batch"]
y-axis "calls/sec (higher is better)" 0 --> 8000
bar [6956.95,7319.22,5104.73,4372.13,4944.59,5692.86,158.62,151.12,150.76]
xychart-beta
title "Varied Throughput (Core Features, kiwipiepy)"
x-axis ["tokenize","analyze_top1","split","split_with_tokens","space","glue","analyze_many_native","tokenize_many_batch","space_many_batch"]
y-axis "calls/sec (higher is better)" 0 --> 8000
bar [7393.81,7212.44,4399.49,4282.95,4497.21,4965.80,192.74,190.38,159.43]
xychart-beta
title "Varied Throughput (Join)"
x-axis ["join (kiwi-rs)","join (kiwipiepy)"]
y-axis "calls/sec (higher is better)" 0 --> 3000000
bar [2927258.22,669983.08]
xychart-beta
title "Varied Latency (Core Features, kiwi-rs)"
x-axis ["tokenize","analyze_top1","split","split_with_tokens","space","glue","analyze_many_native","tokenize_many_batch","space_many_batch"]
y-axis "avg ms/call (lower is better)" 0 --> 7
bar [0.143741,0.136627,0.195897,0.228721,0.202241,0.175659,6.304233,6.617300,6.632977]
xychart-beta
title "Varied Latency (Core Features, kiwipiepy)"
x-axis ["tokenize","analyze_top1","split","split_with_tokens","space","glue","analyze_many_native","tokenize_many_batch","space_many_batch"]
y-axis "avg ms/call (lower is better)" 0 --> 7
bar [0.135248,0.138649,0.227299,0.233484,0.222360,0.201377,5.188234,5.252784,6.272204]
Side-by-side numeric comparison (varied input, near no-cache):
| Feature | `kiwi-rs` calls/sec | `kiwipiepy` calls/sec | Ratio (x) | Δ% | `kiwi-rs` avg_ms | `kiwipiepy` avg_ms |
|---|---|---|---|---|---|---|
| `tokenize` | 6956.95 | 7393.81 | 0.94x | -6.0% | 0.143741 | 0.135248 |
| `analyze_top1` | 7319.22 | 7212.44 | 1.01x | +1.0% | 0.136627 | 0.138649 |
| `split_into_sents` | 5104.73 | 4399.49 | 1.16x | +16.0% | 0.195897 | 0.227299 |
| `split_into_sents_with_tokens` | 4372.13 | 4282.95 | 1.02x | +2.0% | 0.228721 | 0.233484 |
| `space` | 4944.59 | 4497.21 | 1.10x | +10.0% | 0.202241 | 0.222360 |
| `glue` | 5692.86 | 4965.80 | 1.15x | +15.0% | 0.175659 | 0.201377 |
| `join` | 2927258.22 | 669983.08 | 4.37x | +337.0% | 0.000342 | 0.001493 |
| `analyze_many_native` | 158.62 | 192.74 | 0.82x | -18.0% | 6.304233 | 5.188234 |
| `tokenize_many_batch` | 151.12 | 190.38 | 0.79x | -21.0% | 6.617300 | 5.252784 |
| `space_many_batch` | 150.76 | 159.43 | 0.95x | -5.0% | 6.632977 | 6.272204 |
Δ% is (kiwi-rs / kiwipiepy - 1) * 100.
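The calls/sec and avg_ms columns are reciprocal views of the same measurement, which gives a quick sanity check on any row: 1000 divided by calls/sec should match avg_ms up to rounding. A minimal sketch with hypothetical helper names:

```rust
/// Average latency in milliseconds for a given throughput in calls/sec.
fn avg_ms_from_calls_per_sec(calls_per_sec: f64) -> Option<f64> {
    if calls_per_sec > 0.0 {
        Some(1000.0 / calls_per_sec)
    } else {
        None
    }
}

/// Throughput in calls/sec for a given average latency in milliseconds.
fn calls_per_sec_from_avg_ms(avg_ms: f64) -> Option<f64> {
    if avg_ms > 0.0 {
        Some(1000.0 / avg_ms)
    } else {
        None
    }
}

fn main() {
    // Varied-input tokenize on kiwi-rs: 6956.95 calls/sec is listed with 0.143741 ms.
    if let Some(ms) = avg_ms_from_calls_per_sec(6956.95) {
        println!("{ms:.6} ms/call");
    }
    if let Some(cps) = calls_per_sec_from_avg_ms(0.135248) {
        println!("{cps:.2} calls/sec");
    }
}
```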
Interpretation:
- `join` is now faster on `kiwi-rs` for repeated identical morph sequences because the default `join` path reuses an internal LRU joiner cache.
- `split_into_sents` and `glue` are now above 1.0x even in the `varied` scenario after reducing miss-path cache overhead and reusing glue pair decisions.
- `prepare_joiner` (`joiner_reuse*`) remains the fastest path when explicitly reusing a fixed morph sequence.
- Repeated identical inputs show large gains on `tokenize*`, `analyze*`, and tokenized sentence split paths because internal result caches are reused.
- For strict fairness, publish both scenarios together: `input_mode=repeated` (warm-cache) and `input_mode=varied` (near no-cache).
- `split_many_batch` is still Python-only in this benchmark set.
- `Kiwi::init()` includes runtime asset discovery/bootstrap checks, so startup should be evaluated separately from steady-state throughput.
kiwipiepy parity
Detailed matrix:
- English: `docs/kiwipiepy_parity.md`
- Korean: `docs/kiwipiepy_parity.ko.md`
In short, kiwi-rs already covers most C API-backed workflows, while Python/C++-specific layers (template/dataset/ngram utilities) remain out of scope for a pure C API binding.
Common errors
- `failed to load library`
  - Library path is invalid or inaccessible. Set `KIWI_LIBRARY_PATH` explicitly or use `Kiwi::init()`.
- `Cannot open extract.mdl for WordDetector`
  - Model path is wrong. Point `KIWI_MODEL_PATH` (or config model path) to the directory containing model files.
- `reading type 'Ds' failed` (iostream-style errors)
  - Library/model version mismatch. Use matching assets from the same Kiwi release tag.
Local quality checks
License
- `kiwi-rs` is licensed under LGPL-2.1-or-later.
- The upstream Kiwi C library used by this project is distributed under LGPL 2.1 terms.
- See `LICENSE` for the full license text.