kiwi-rs
Korean README | kiwipiepy parity (EN) | kiwipiepy parity (KO)
Rust bindings for Kiwi via the official C API (include/kiwi/capi.h).
AI user guide
If you use an AI assistant (Codex/ChatGPT/Claude/Gemini, etc.) to generate kiwi-rs code, ask for output with this contract:
- Choose one init path only (Kiwi::init, Kiwi::new, or Kiwi::from_config) and explain why.
- Return runnable Rust code (fn main() -> Result<(), Box<dyn std::error::Error>>).
- Include one verification command (cargo run --example ... or cargo run).
- List 2-3 request-specific pitfalls (not generic advice).
Prompt template:
Use kiwi-rs and provide:
1) init path choice with reason,
2) copy-paste runnable Rust code,
3) one verification command,
4) pitfalls for this exact task.
Task: <describe your task here>
Environment: <OS / whether KIWI_LIBRARY_PATH and KIWI_MODEL_PATH are set>
Accuracy checks you should ask AI to follow:
- Treat UTF-8 offsets as character indices, not byte indices.
- Check supports_utf16_api() before UTF-16 APIs.
- Check supports_analyze_mw() before analyze_many_utf16_via_native.
- Do not assume full kiwipiepy parity (see docs/kiwipiepy_parity.md).
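The first check above is the easiest one to get wrong in Rust, because &str slicing works on byte offsets. The following is a minimal, std-only sketch of converting character offsets to byte offsets before slicing. The helper names char_to_byte and slice_by_chars are hypothetical and not part of kiwi-rs, and the span [0, 3) is an illustrative value rather than real analyzer output.

```rust
/// Map a character index to the byte offset where that character starts.
/// An index equal to the character count maps to `text.len()` (end of string).
fn char_to_byte(text: &str, char_idx: usize) -> Option<usize> {
    text.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(text.len()))
        .nth(char_idx)
}

/// Slice `text` by a half-open character range `[start, end)`.
/// Returns None if the range is reversed or out of bounds.
fn slice_by_chars(text: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    let byte_start = char_to_byte(text, start)?;
    let byte_end = char_to_byte(text, end)?;
    text.get(byte_start..byte_end)
}

fn main() {
    let text = "아버지가방에들어가신다.";

    // Hypothetical span expressed in characters, as the guideline requires.
    println!("by chars: {:?}", slice_by_chars(text, 0, 3));

    // The same numbers used as byte offsets cover only the first syllable,
    // because each Hangul syllable occupies 3 bytes in UTF-8.
    println!("by bytes: {:?}", text.get(0..3));
}
```

If a generated answer indexes text[start..end] directly with offsets taken from tokens, that is a sign the assistant confused the two units.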
Skill-based usage (skills/)
This repository includes a local AI skill for kiwi-rs:
- Skill file: skills/kiwi-rs-assistant/SKILL.md
- Reference docs: skills/kiwi-rs-assistant/references/
- Agent metadata: skills/kiwi-rs-assistant/agents/openai.yaml
If your assistant supports skill invocation, call it explicitly:
Use $kiwi-rs-assistant and implement: <your task>
llms.txt usage
Use llms.txt as the first context file when prompting AI. It summarizes the canonical docs, API surface, examples, and guardrails in one place.
- File: llms.txt
- Recommended prompt add-on: Read llms.txt first, then answer using repository APIs and examples only.
Current support status
As of February 16, 2026:
- C API symbol loading: complete (101/101 symbols in capi.h are loaded)
- Core high-level usage: implemented (init/new/from_config, analyze/tokenize/split/join, MorphemeSet, Pretokenized, typo APIs, SwTokenizer, CoNg APIs)
- kiwipiepy full surface parity: partial (Python/C++-specific layers still missing)
Installation
[dependencies]
kiwi-rs = "0.1"
Runtime setup options
Option 1: automatic bootstrap in code
Kiwi::init() tries local paths first, then downloads a matching release pair (library + model) into cache.
use Kiwi;
Environment variables used by bootstrap:
- KIWI_RS_VERSION (default: latest, e.g. v0.22.2)
- KIWI_RS_CACHE_DIR (default: OS cache directory)
External commands required by bootstrap:
- Common: curl, tar
- Windows zip extraction: powershell (Expand-Archive)
Option 2: helper installer scripts
Linux/macOS:
Windows (PowerShell):
cd kiwi-rs
powershell -NoProfile -ExecutionPolicy Bypass -File .\scripts\install_kiwi.ps1
Installer options:
- KIWI_VERSION / -Version (default: latest)
- KIWI_PREFIX / -Prefix (default: $HOME/.local/kiwi on Unix, %LOCALAPPDATA%\\kiwi on Windows)
- KIWI_MODEL_VARIANT / -ModelVariant (default: base)
Manual path configuration
Env-based (Kiwi::new)
- KIWI_LIBRARY_PATH: dynamic library path
- KIWI_MODEL_PATH: model directory path
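Before calling Kiwi::new, it can help to confirm that both variables are set and point at the right kind of filesystem entry, since a wrong path otherwise surfaces later as a load error. The sketch below is a std-only preflight check. The PreflightError type and the preflight function are hypothetical helpers, not kiwi-rs APIs, and the check assumes the library path is a regular file and the model path is a directory, as described above.

```rust
use std::path::PathBuf;

/// Hypothetical error type for this sketch; not part of kiwi-rs.
#[derive(Debug, PartialEq)]
enum PreflightError {
    Unset(&'static str),
    NotAFile(PathBuf),
    NotADirectory(PathBuf),
}

/// Validate the two values that would come from KIWI_LIBRARY_PATH and
/// KIWI_MODEL_PATH. Takes plain options so it can be exercised without
/// touching the process environment.
fn preflight(
    library: Option<&str>,
    model: Option<&str>,
) -> Result<(PathBuf, PathBuf), PreflightError> {
    let library = PathBuf::from(library.ok_or(PreflightError::Unset("KIWI_LIBRARY_PATH"))?);
    let model = PathBuf::from(model.ok_or(PreflightError::Unset("KIWI_MODEL_PATH"))?);
    if !library.is_file() {
        return Err(PreflightError::NotAFile(library));
    }
    if !model.is_dir() {
        return Err(PreflightError::NotADirectory(model));
    }
    Ok((library, model))
}

fn main() {
    let library = std::env::var("KIWI_LIBRARY_PATH").ok();
    let model = std::env::var("KIWI_MODEL_PATH").ok();
    match preflight(library.as_deref(), model.as_deref()) {
        Ok((lib, model)) => {
            println!("ok: library={} model={}", lib.display(), model.display())
        }
        Err(err) => eprintln!("preflight failed: {err:?}"),
    }
}
```

This only checks that the paths exist. It does not verify that the library and model come from the same Kiwi release.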
Config-based (Kiwi::from_config)
use ;
API overview
Core
- Initialization: Kiwi::init, Kiwi::new, Kiwi::from_config, Kiwi::init_direct
- Analyze/tokenize: analyze*, tokenize*, analyze_many*, tokenize_many*
- Sentence split: split_into_sents*, split_into_sents_with_options*
- Join/spacing: join*, space*, glue*
Advanced
- Builder: user words, alias words, pre-analyzed words, dictionary loading, regex rules, extract APIs
- Constraints: MorphemeSet, Pretokenized
- Typo: KiwiTypo, default typo sets, cost controls
- Subword: SwTokenizer
- CoNg: similarity/context/prediction/context-id conversion
UTF-16 and optional API checks
- Kiwi::supports_utf16_api
- Kiwi::supports_analyze_mw
- KiwiLibrary::supports_builder_init_stream
Supported APIs
Kiwi
The main struct for analysis.
- Initialization: init, init_with_version, new, from_config, init_direct, with_model_path
- Analysis: analyze, analyze_top_n, analyze_with_options, analyze_with_blocklist, analyze_with_pretokenized, analyze_with_blocklist_and_pretokenized
- Tokenization: tokenize, tokenize_with_match_options, tokenize_with_options, tokenize_with_blocklist, tokenize_with_pretokenized, tokenize_with_blocklist_and_pretokenized
- Multi-string Analysis: analyze_many_with_options, analyze_many_via_native, tokenize_many, tokenize_many_with_echo
- Sentence Splitting: split_into_sents, split_into_sents_with_options
- Spacing/Joining: space, space_many, glue, glue_with_options, join, prepare_join_morphs, prepare_join_tokens, prepare_joiner, join_prepared, join_prepared_utf16
- Configuration:
  - global_config, set_global_config
  - set_option, get_option, set_option_f, get_option_f
  - cutoff_threshold, set_cutoff_threshold
  - integrate_allomorph, set_integrate_allomorph
  - space_penalty, set_space_penalty, space_tolerance, set_space_tolerance
  - max_unk_form_size, set_max_unk_form_size
  - typo_cost_weight, set_typo_cost_weight
- Morpheme/Sense Info: morpheme, morpheme_info, morpheme_form, list_senses, tag_to_string, script_name, list_all_scripts
- Search: find_morphemes, find_morphemes_with_prefix
- Semantics (CoNg):
  - most_similar_morphemes, most_similar_contexts
  - predict_words_from_context, predict_next_morpheme
  - predict_words_from_context_diff, predict_next_morpheme_diff
  - morpheme_similarity, context_similarity
  - to_context_id, from_context_id
- Sub-objects Creation: typo, basic_typo, default_typo_set, new_morphset, new_pretokenized, open_sw_tokenizer
- UTF-16: analyze_utf16*, tokenize_utf16*, split_into_sents_utf16*, join_utf16, analyze_many_utf16_via_native
- Misc: library_version, num_workers, model_type, typo_cost_threshold, add_re_word, clear_re_words
KiwiBuilder
Used to customize the dictionary and build a Kiwi instance.
- Build: build, build_with_default_options
- Word Management: add_user_word, add_pre_analyzed_word, add_rule, add_re_rule, add_alias, add_automata
- Dictionary Loading: load_dictionary, load_user_dictionary, extract_add_words
- Configuration: set_option, get_option, set_option_f, get_option_f, set_cut_off_threshold, set_integrate_allomorph, set_model_path
KiwiTypo
Corrects typos in text.
- Creation: Kiwi::typo, Kiwi::basic_typo, Kiwi::default_typo_set
- Management: add, update, scale_cost, set_continual_typo_cost, set_lengthening_typo_cost, copy
SwTokenizer
Subword tokenizer.
- Usage: encode, encode_with_offsets, decode
MorphemeSet
A set of morphemes for blocklisting.
- Management: add, add_utf16
Pretokenized
Defines pre-analyzed token spans.
- Management: add_span, add_token_to_span, add_token_to_span_utf16
Examples
What each example is for:
| Example | What you learn | Key APIs | Notes |
|---|---|---|---|
| basic | End-to-end quick start (init + tokenize) | Kiwi::init, Kiwi::tokenize | Demonstrates cache bootstrap behavior when assets are missing. |
| analyze_options | How candidate analysis options change output | AnalyzeOptions, Kiwi::analyze_with_options | Shows top_n, match_options, and candidate probabilities. |
| builder_custom_words | Building a custom analyzer with user lexicon/rules | KiwiLibrary::builder, add_user_words, add_re_rule | Uses builder-time customization APIs. |
| typo_build | Enabling typo-aware analysis | default_typo_set, build_with_typo_and_default_options | Prints typo-related token metadata. |
| blocklist_and_pretokenized | Blocking specific morphemes and forcing token spans | new_morphset, new_pretokenized, tokenize_with_blocklist_and_pretokenized | Useful for domain constraints and deterministic spans. |
| split_sentences | Sentence segmentation with per-sentence token/sub-sentence structures | split_into_sents_with_options | Shows the Sentence return surface (text/start/end/tokens/subs). |
| utf16_api | UTF-16 analysis/tokenization/sentence split path | supports_utf16_api, analyze_utf16*, tokenize_utf16*, split_into_sents_utf16* | Includes runtime feature check for UTF-16 support. |
| native_batch | Native callback-based batch analysis route | analyze_many_via_native, analyze_many_utf16_via_native | Useful for higher-throughput multi-line processing. |
| sw_tokenizer | Subword tokenizer encode/decode flow | open_sw_tokenizer, encode_with_offsets, decode | Requires tokenizer.json path argument. |
| morpheme_semantics | Morpheme ID lookup and CoNg semantic utilities | find_morphemes, morpheme, most_similar_morphemes, to_context_id | Shows semantic APIs that operate on morpheme/context IDs. |
| bench_tokenize | Fair latency/throughput timing split by phase | Kiwi::init, Kiwi::tokenize | Prints init, first call, and steady-state tokenize metrics using the same text repeatedly. |
| bench_features | Expanded feature throughput/latency comparison (Rust side) | tokenize, analyze_with_options, split_into_sents*, space*, join*, glue, analyze_many*, tokenize_many | Pair with scripts/bench_features_kiwipiepy.py and scripts/compare_feature_bench.py for Rust vs Python comparison. |
Rust vs Python benchmark (same conditions)
Use the same input text / warmup / iteration count for both sides:
Notes:
- Compare bench_avg_ms, calls_per_sec, and tokens_per_sec for steady-state speed.
- Compare init_ms and first_tokenize_ms separately; startup can dominate one-shot runs.
- Ensure both runtimes use the same Kiwi library/model assets (KIWI_LIBRARY_PATH, KIWI_MODEL_PATH) when strict 1:1 comparison is required.
- For option parity with kiwipiepy tokenize defaults, add --python-default-options on the Rust benchmark command.
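The split between first-call and steady-state numbers in the notes above can be sketched with std::time::Instant alone. This is not the repository's benchmark code: the BenchResult struct and bench function are hypothetical, and the closure passed in main is a placeholder workload standing in for a real tokenize call. The field names mirror the metric names used in the notes.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical result holder; field names follow the metrics named above.
struct BenchResult {
    first_call_ms: f64,
    bench_avg_ms: f64,
    calls_per_sec: f64,
}

/// Time the first call on its own, discard `warmup` calls, then average
/// over `iters` steady-state calls.
fn bench<F: FnMut() -> usize>(mut call: F, warmup: usize, iters: usize) -> BenchResult {
    let start = Instant::now();
    black_box(call());
    let first_call_ms = start.elapsed().as_secs_f64() * 1e3;

    for _ in 0..warmup {
        black_box(call());
    }

    let start = Instant::now();
    for _ in 0..iters {
        black_box(call());
    }
    // Guard against a zero reading on coarse clocks.
    let total_secs = start.elapsed().as_secs_f64().max(f64::MIN_POSITIVE);

    BenchResult {
        first_call_ms,
        bench_avg_ms: total_secs * 1e3 / iters as f64,
        calls_per_sec: iters as f64 / total_secs,
    }
}

fn main() {
    let text = "아버지가방에들어가신다.";
    // Placeholder workload; swap in the call under test.
    let result = bench(|| text.chars().count(), 100, 5000);
    println!("first_call_ms={:.3}", result.first_call_ms);
    println!("bench_avg_ms={:.6}", result.bench_avg_ms);
    println!("calls_per_sec={:.2}", result.calls_per_sec);
}
```

Reporting the first call separately keeps one-time costs such as lazy loading out of the steady-state average.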
Expanded feature benchmark snapshot (local run, 2026-02-17)
Commands:
Automated weekly run (same command) is configured in .github/workflows/feature-benchmark.yml.
Generated markdown/json snapshots now include benchmark environment and config metadata.
Summary below is the median of 1 run, with min-max in brackets (same value for single-run snapshots).
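For readers reproducing the tables, the "median [min-max]" cell format and the relative column can be computed as in the sketch below. The summarize and relative functions are hypothetical helpers written for this illustration, not code from the comparison scripts, and the two-decimal formatting is an assumption based on the published numbers.

```rust
/// Format samples as "median [min-max]" with two decimals.
/// Returns None for an empty slice.
fn summarize(samples: &[f64]) -> Option<String> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    // Even-sized inputs use the mean of the two middle values.
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    Some(format!("{:.2} [{:.2}-{:.2}]", median, sorted[0], sorted[n - 1]))
}

/// Relative throughput as "kiwi-rs / kiwipiepy".
fn relative(kiwi_rs: f64, kiwipiepy: f64) -> String {
    format!("{:.2}x", kiwi_rs / kiwipiepy)
}

fn main() {
    // Single-run snapshot: median, min, and max are the same value.
    println!("{:?}", summarize(&[7792.55]));
    // Tokenize row from the throughput table.
    println!("{}", relative(1185489.51, 7792.55));
}
```

With repeats set to 1, the brackets carry no spread information, so a larger repeat count is needed before min-max says anything about run-to-run variance.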
Benchmark environment:
| Item | Value |
|---|---|
| Timestamp (local) | 2026-02-17T17:10:06+09:00 |
| OS | Darwin 24.6.0 |
| Platform | macOS-15.7.4-arm64-arm-64bit-Mach-O |
| CPU | arm64 (CPU brand unavailable in sandbox) |
| Cores (physical/logical) | -/10 |
| Memory | 16.00 GiB (17179869184 bytes) |
| rustc | rustc 1.93.1 (01f6ddf75 2026-02-11) |
| cargo | cargo 1.93.1 (083ac5135 2025-12-15) |
| Python (harness) | 3.14.3 (main, Feb 3 2026, 15:32:20) [Clang 17.0.0 (clang-1700.6.3.2)] |
| Python (bench bin) | Python 3.14.3 (.venv-bench/bin/python) |
| kiwipiepy | 0.22.2 |
| Git | 753b8dc4d648d33b5ed6f163ba2ae3cb46397a7e (main, dirty=True) |
Benchmark config:
| Item | Value |
|---|---|
| text | 아버지가방에들어가신다. |
| warmup | 100 |
| iters | 5000 |
| batch_size | 256 |
| batch_iters | 500 |
| input_mode | repeated |
| variant_pool | 4096 |
| repeats | 1 |
| join_lm_search | true |
Throughput comparison (calls_per_sec, higher is better):
| Feature | kiwi-rs | kiwipiepy | Relative (kiwi-rs / kiwipiepy) |
|---|---|---|---|
| tokenize | 1185489.51 [1185489.51-1185489.51] | 7792.55 [7792.55-7792.55] | 152.13x |
| analyze_top1 | 1199112.66 [1199112.66-1199112.66] | 7612.25 [7612.25-7612.25] | 157.52x |
| split_into_sents | 28908752.41 [28908752.41-28908752.41] | 3802.38 [3802.38-3802.38] | 7602.80x |
| split_into_sents_with_tokens | 250558.01 [250558.01-250558.01] | 4872.41 [4872.41-4872.41] | 51.42x |
| space | 357757.20 [357757.20-357757.20] | 4768.69 [4768.69-4768.69] | 75.02x |
| join | 2402355.08 [2402355.08-2402355.08] | 675759.32 [675759.32-675759.32] | 3.56x |
| glue | 6221490.02 [6221490.02-6221490.02] | 7613.64 [7613.64-7613.64] | 817.15x |
| analyze_many_loop | 32.36 [32.36-32.36] | 27.94 [27.94-27.94] | 1.16x |
| analyze_many_native | 166.11 [166.11-166.11] | 165.71 [165.71-165.71] | 1.00x |
| tokenize_many_loop | 3409.24 [3409.24-3409.24] | 28.66 [28.66-28.66] | 118.95x |
| tokenize_many_batch | 3134.67 [3134.67-3134.67] | 184.16 [184.16-184.16] | 17.02x |
| split_many_loop | 27.87 [27.87-27.87] | 29.18 [29.18-29.18] | 0.96x |
| space_many_loop | 29.39 [29.39-29.39] | 27.22 [27.22-27.22] | 1.08x |
| space_many_batch | 161.79 [161.79-161.79] | 160.39 [160.39-160.39] | 1.01x |
| batch_analyze_native | 166.11 [166.11-166.11] | 165.71 [165.71-165.71] | 1.00x |
Startup (init_ms, lower is better):
| Init path | kiwi-rs | kiwipiepy |
|---|---|---|
| Kiwi::init() / Kiwi() | 1417.905 [1417.905-1417.905] ms | 680.748 [680.748-680.748] ms |
Rust-only benchmark features:
| Feature | kiwi-rs |
|---|---|
| join_prepared | 277556.12 [277556.12-277556.12] |
| join_prepared_utf16 | 278618.79 [278618.79-278618.79] |
| joiner_reuse | 3518440.85 [3518440.85-3518440.85] |
| joiner_reuse_utf16 | 2743359.29 [2743359.29-2743359.29] |
Python-only benchmark features:
| Feature | kiwipiepy |
|---|---|
| split_many_batch | 181.50 [181.50-181.50] |
Varied-input (near no-cache) ratio snapshot (input_mode=varied, variant_pool=8192):
| Feature | Repeated Ratio | Varied Ratio |
|---|---|---|
| tokenize | 152.13x | 0.94x |
| analyze_top1 | 157.52x | 1.01x |
| split_into_sents | 7602.80x | 1.16x |
| split_into_sents_with_tokens | 51.42x | 1.02x |
| glue | 817.15x | 1.15x |
| analyze_many_native | 1.00x | 0.82x |
| tokenize_many_batch | 17.02x | 0.79x |
| space_many_batch | 1.01x | 0.95x |
| join | 3.56x | 4.37x |
Visual bar charts (relative throughput):
xychart-beta
title "Repeated Input Ratio (Selected)"
x-axis ["tokenize","analyze_top1","split_with_tokens","join","analyze_many_native","tokenize_many_batch","space_many_batch"]
y-axis "kiwi-rs / kiwipiepy (x)" 0 --> 170
bar [152.13,157.52,51.42,3.56,1.00,17.02,1.01]
xychart-beta
title "Repeated Input Ratio (Split + Glue)"
x-axis ["split_into_sents","glue"]
y-axis "kiwi-rs / kiwipiepy (x)" 0 --> 8000
bar [7602.80,817.15]
xychart-beta
title "Varied Input Ratio (Near No-Cache)"
x-axis ["tokenize","analyze_top1","split","split_with_tokens","space","glue","join","analyze_many_native","tokenize_many_batch","space_many_batch"]
y-axis "kiwi-rs / kiwipiepy (x)" 0 --> 5
bar [0.94,1.01,1.16,1.02,1.10,1.15,4.37,0.82,0.79,0.95]
Interpretation:
- join is now faster on kiwi-rs for repeated identical morph sequences because the default join path reuses an internal LRU joiner cache.
- split_into_sents and glue are now above 1.0x even in the varied scenario after reducing miss-path cache overhead and reusing glue pair decisions.
- prepare_joiner (joiner_reuse*) remains the fastest path when explicitly reusing a fixed morph sequence.
- Repeated identical inputs show large gains on tokenize*, analyze*, and tokenized sentence split paths because internal result caches are reused.
- For strict fairness, publish both scenarios together: input_mode=repeated (warm-cache) and input_mode=varied (near no-cache).
- split_many_batch is still Python-only in this benchmark set.
- Kiwi::init() includes runtime asset discovery/bootstrap checks, so startup should be evaluated separately from steady-state throughput.
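The gap between the repeated and varied ratios follows from how a result cache behaves under each input mode. The toy model below illustrates that effect only. It is not the kiwi-rs implementation: CachedAnalyzer is a hypothetical type, the cache is an unbounded HashMap rather than an LRU, and the "analysis" is a placeholder character count.

```rust
use std::collections::HashMap;

/// Toy memoizing analyzer used only to count cache hits and misses.
struct CachedAnalyzer {
    cache: HashMap<String, usize>,
    hits: usize,
    misses: usize,
}

impl CachedAnalyzer {
    fn new() -> Self {
        Self { cache: HashMap::new(), hits: 0, misses: 0 }
    }

    fn analyze(&mut self, text: &str) -> usize {
        if let Some(&cached) = self.cache.get(text) {
            self.hits += 1;
            return cached;
        }
        self.misses += 1;
        // Placeholder for the expensive path a real analyzer would take.
        let result = text.chars().count();
        self.cache.insert(text.to_string(), result);
        result
    }
}

/// Run all inputs through a fresh analyzer and return (hits, misses).
fn run(inputs: &[String]) -> (usize, usize) {
    let mut analyzer = CachedAnalyzer::new();
    for input in inputs {
        analyzer.analyze(input);
    }
    (analyzer.hits, analyzer.misses)
}

fn main() {
    let repeated: Vec<String> = vec!["아버지가방에들어가신다.".to_string(); 1000];
    let varied: Vec<String> = (0..1000).map(|i| format!("문장 {i}")).collect();

    // Repeated input pays the expensive path once; varied input pays it every time.
    println!("repeated (hits, misses) = {:?}", run(&repeated));
    println!("varied   (hits, misses) = {:?}", run(&varied));
}
```

In the repeated case nearly every call is a lookup, so the measured throughput mostly reflects cache speed rather than analysis speed, which is why both scenarios are worth publishing side by side.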
kiwipiepy parity
Detailed matrix:
- English: docs/kiwipiepy_parity.md
- Korean: docs/kiwipiepy_parity.ko.md
In short, kiwi-rs already covers most C API-backed workflows, while Python/C++-specific layers (template/dataset/ngram utilities) remain out of scope for a pure C API binding.
Common errors
- failed to load library
  - Library path is invalid or inaccessible. Set KIWI_LIBRARY_PATH explicitly or use Kiwi::init().
- Cannot open extract.mdl for WordDetector
  - Model path is wrong. Point KIWI_MODEL_PATH (or config model path) to the directory containing model files.
- reading type 'Ds' failed (iostream-style errors)
  - Library/model version mismatch. Use matching assets from the same Kiwi release tag.
Local quality checks
License
Same as Kiwi: LGPL v3.