jmdict-fast
Blazing-fast Japanese dictionary engine, powered by FST indexing.
A Rust library that turns the official JMdict dataset into memory-mapped FST indexes and serves lookups in ~4 µs. Designed for Japanese readers, IMEs, language-learning tools, and anything that needs to look up words fast.
Note: This crate uses bunpo for Japanese conjugation handling. Both crates live in the same monorepo but are published separately to crates.io.
Features
- Instant lookups: O(log n) exact matching across kanji, kana, and romaji (~4 µs per lookup)
- Multimodal search: exact, prefix, fuzzy (edit-distance), and English-gloss reverse lookup
- Memory-mapped: zero-copy access on load; the kernel pages data in on demand, with no upfront read into a `Vec` and no allocations during lookup
- Deinflection-aware: finds 食べる from 食べます via bunpo
- Two loading modes: embedded (compile-time) or runtime-loaded (filesystem)
- Full JMdict data: antonyms, dialects, field tags, cross-references, JMdict IDs
- Filterable queries: by part-of-speech, misc tag, field, dialect, common-only, with limits
- Stable lookup by JMdict ID plus a sequential iterator over every entry
Performance at a Glance
| Metric | Value |
|---|---|
| Index size | ~888 KB (FSTs) |
| Data size | ~16 MB binary blob |
| Lookup speed | O(log n), ~4 µs |
| Memory usage | Memory-mapped, zero allocations |
Side-by-side: jmdict-fast vs jmdict
The bundled Criterion bench (`benches/lookup_word.rs`) looks up 猫 against both crates on the same machine:
| Crate | Approach | Time per lookup | Relative |
|---|---|---|---|
| jmdict-fast (this) | FST index + memory-mapped binary blob | ~4.06 µs | 1× |
| jmdict v2.x | Linear filter over `entries()` iterator | ~511.96 µs | ~125× slower |

That's the gap between an O(log n) FST walk and an O(n) full-table scan: the bigger your dictionary, the wider the gap gets. Run `cargo bench -p jmdict-fast` to reproduce on your hardware.
Comparisons are deliberately scoped to crates that solve the same problem (in-process JMdict lookup from Rust). If you're aware of another crate worth benching against, please open an issue or PR.
Getting Started
1. Generate or download dictionary data
Data files are not included in the crate. Generate them or download pre-built artifacts:
- Option A: generate from source (requires network access) with `cargo xtask generate`.
- Option B: download pre-built data from GitHub Releases. The asset name encodes the JMdict and format versions; check Releases for current values.
This produces seven files in dist/: kana.fst, kanji.fst, romaji.fst, id.fst, gloss.fst, entries.bin, and gloss_postings.bin.
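Before loading, it can help to confirm that all seven files are actually present. The following is a minimal sketch using only the standard library; `missing_data_files` is a hypothetical helper written for illustration, not part of jmdict-fast.

```rust
use std::path::Path;

/// The seven data files named in the paragraph above.
const DATA_FILES: [&str; 7] = [
    "kana.fst",
    "kanji.fst",
    "romaji.fst",
    "id.fst",
    "gloss.fst",
    "entries.bin",
    "gloss_postings.bin",
];

/// Returns the names of expected files that are absent from `dir`.
/// Hypothetical helper for illustration; not an API of jmdict-fast.
fn missing_data_files(dir: &Path) -> Vec<&'static str> {
    DATA_FILES
        .iter()
        .copied()
        .filter(|name| !dir.join(name).is_file())
        .collect()
}
```

Calling `missing_data_files(Path::new("dist"))` after generation should return an empty vector if every file was written.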
2. Add the dependency

    [dependencies]
    jmdict-fast = "0.1.1"

3. Use the library
Runtime-loaded mode (default)

    use jmdict_fast::Dict;
    let dict = Dict::load_default()?;

Embedded mode
Bake data into your binary at compile time:

    [dependencies]
    jmdict-fast = { version = "0.1.1", features = ["embedded"] }

    let dict = Dict::load_embedded()?;

Requires data files in `dist/` when building. Run `cargo xtask generate` first.
Loading Behavior
Dict::load_default() tries sources in order:
1. Embedded data (if the `embedded` feature is enabled)
2. `JMDICT_DATA` env var: path to a directory with data files
3. `dist/` relative to the current directory
4. `dist/` relative to the workspace root
Or load from an explicit path:

    // `data_dir` is a placeholder for the directory holding the data files.
    let dict = Dict::load(data_dir)?;

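The resolution order that `Dict::load_default()` is described as following can be sketched with the standard library alone. This is an illustration under assumptions, not the crate's implementation: `DataSource` and `resolve_source` are hypothetical names, and the real loader may validate candidates differently (for example by checking individual files rather than directories).

```rust
use std::path::{Path, PathBuf};

/// Where the data would come from. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum DataSource {
    Embedded,
    Directory(PathBuf),
}

/// Picks the first available source in the documented order.
/// `embedded` mirrors whether the `embedded` feature is enabled and
/// `env_dir` is the value of JMDICT_DATA, if set (both passed in so the
/// function stays easy to test).
fn resolve_source(
    embedded: bool,
    env_dir: Option<&Path>,
    current_dir: &Path,
    workspace_root: &Path,
) -> Option<DataSource> {
    if embedded {
        return Some(DataSource::Embedded);
    }
    let candidates = vec![
        env_dir.map(Path::to_path_buf),
        Some(current_dir.join("dist")),
        Some(workspace_root.join("dist")),
    ];
    candidates
        .into_iter()
        .flatten() // skip the env candidate when JMDICT_DATA is unset
        .find(|dir| dir.is_dir())
        .map(DataSource::Directory)
}
```

In a real program the `env_dir` argument would typically come from `std::env::var_os("JMDICT_DATA")`.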
| Variable | Description |
|---|---|
| `JMDICT_DATA` | Path to directory containing FST and `entries.bin` files |

| Feature | Description |
|---|---|
| `embedded` | Bake dictionary data into the binary via `include_bytes!` |
Data Structure

    kana.fst   kanji.fst   romaji.fst    id.fst       gloss.fst
       │           │           │           │              │
       └───────────┼───────────┘           │              ▼
                   ▼                       │      gloss_postings.bin
              entries.bin ◄────────────────┘      (per token: u32 count
     (postcard-serialized entries                  followed by count × u64
         with version header)                      entry ids, little-endian)

- FST maps: sorted key→entry-id indexes for each writing system, plus a JMdict-ID index
- `entries.bin`: versioned binary blob (magic `JMDF` + format version + postcard-serialized entries)
- `gloss.fst` + `gloss_postings.bin`: English-gloss reverse-lookup index, mapping tokens → byte offset into a postings file containing the matching entry-id sets
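The postings layout in the diagram (a little-endian `u32` count followed by that many little-endian `u64` entry ids) is simple enough to decode by hand. The sketch below reads one posting list at a given byte offset; `read_postings` is a hypothetical name, and the crate's own reader may be structured differently.

```rust
use std::convert::TryInto;

/// Reads one posting list starting at `offset`: a little-endian u32 count
/// followed by `count` little-endian u64 entry ids.
/// Returns None if the buffer is too short for the declared list.
/// Illustrative only; not the crate's actual reader.
fn read_postings(buf: &[u8], offset: usize) -> Option<Vec<u64>> {
    let count_end = offset.checked_add(4)?;
    let count_bytes: [u8; 4] = buf.get(offset..count_end)?.try_into().ok()?;
    let count = u32::from_le_bytes(count_bytes) as usize;

    let ids_end = count_end.checked_add(count.checked_mul(8)?)?;
    let ids = buf.get(count_end..ids_end)?;
    Some(
        ids.chunks_exact(8)
            // chunks_exact(8) guarantees 8-byte chunks, so the conversion cannot fail.
            .map(|chunk| u64::from_le_bytes(chunk.try_into().unwrap()))
            .collect(),
    )
}
```

Bounds are checked with `get` and `checked_*` arithmetic so that a truncated or corrupt file yields `None` rather than a panic.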
How It Works
- Build phase: `cargo xtask generate` downloads JMdict, normalizes it, and emits the five FSTs plus `entries.bin` and `gloss_postings.bin`.
- Runtime phase: `Dict::load` memory-maps the FSTs; a lookup walks the FST to find an entry offset, then deserializes a single entry from `entries.bin`. No global parse, no allocations on the hot path.
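The two-step runtime flow (index lookup yields an offset, then exactly one record is decoded) can be illustrated without any dependencies. In this sketch a sorted slice of `(key, offset)` pairs stands in for the FST, and a length-prefixed UTF-8 string stands in for a postcard-encoded entry; both are assumptions made for illustration and do not reflect the crate's real formats.

```rust
use std::convert::TryInto;

/// Stand-in for the FST: sorted (key, byte offset) pairs searched by bisection.
fn find_offset(index: &[(&str, usize)], term: &str) -> Option<usize> {
    index
        .binary_search_by(|(key, _)| key.cmp(&term))
        .ok()
        .map(|i| index[i].1)
}

/// Decodes only the record at `offset`, leaving the rest of the blob untouched.
/// Assumed layout: u32 little-endian length, then that many UTF-8 bytes.
fn read_record(blob: &[u8], offset: usize) -> Option<&str> {
    let len_bytes: [u8; 4] = blob.get(offset..offset.checked_add(4)?)?.try_into().ok()?;
    let start = offset + 4;
    let end = start.checked_add(u32::from_le_bytes(len_bytes) as usize)?;
    std::str::from_utf8(blob.get(start..end)?).ok()
}

/// Index lookup followed by a single-record decode.
fn lookup<'a>(index: &[(&str, usize)], blob: &'a [u8], term: &str) -> Option<&'a str> {
    read_record(blob, find_offset(index, term)?)
}
```

Because the returned `&str` borrows from `blob`, the same code works unchanged when `blob` is a memory-mapped region rather than a heap buffer.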
API Reference
Loading
- `Dict::load(path)`: load from a specific directory
- `Dict::load_default()`: auto-detect data location
- `Dict::load_embedded()`: load compile-time embedded data (requires the `embedded` feature)
Lookups
- `dict.lookup_exact(term)`: exact match across kana, kanji, romaji
- `dict.lookup_partial(prefix)`: prefix search
- `dict.lookup_exact_with_deinflection(term)`: exact match with verb/adjective deinflection
- `dict.lookup_by_id(jmdict_id)`: fetch by stable JMdict ID (string)
- `dict.lookup_gloss("to eat")`: reverse lookup by English gloss (multi-token = AND)
- `dict.resolve_xref(&xref)`: walk `SenseEntry::related` / `antonym` to entries
- `dict.lookup(term)`: returns a `QueryBuilder` with `mode`, `common_only`, `pos`, `misc`, `field`, `dialect`, `limit`, `max_distance`
- `dict.lookup_batch(terms)`: same builder, multiple terms at once
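The "multi-token = AND" behavior of gloss lookup amounts to intersecting one set of entry ids per token. A minimal sketch of that idea follows; `intersect_all` is a hypothetical helper, and it assumes each list is sorted ascending without duplicates, which the source does not state about jmdict-fast's own postings.

```rust
/// Keeps only the ids that appear in every posting list.
/// Assumes each list is sorted ascending with no duplicates (an assumption
/// for this sketch, not a documented property of the crate's data).
fn intersect_all(lists: &[Vec<u64>]) -> Vec<u64> {
    let mut iter = lists.iter();
    let mut result = match iter.next() {
        Some(first) => first.clone(),
        None => return Vec::new(),
    };
    for list in iter {
        // Binary search is valid because each list is sorted.
        result.retain(|id| list.binary_search(id).is_ok());
    }
    result
}
```

For a query such as "to eat", the lists for `to` and `eat` would be intersected, so only entries whose glosses contain both tokens remain.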
Browsing
- `dict.get(seq_id)`: fetch by sequential (internal) index `0..entry_count()`
- `dict.iter_entries()`: lazy iterator over every entry
- `dict.entry_count()` / `dict.version()`: dictionary metadata
Entry helpers
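
    // Assumes `lookup_exact` returns an indexable, non-empty list of results,
    // each exposing an `entry` field; adjust to the actual return type.
    let entry = dict.lookup_exact("猫")[0].entry.clone();
    entry.primary_kanji();   // Some("猫")
    entry.primary_kana();    // Some("ねこ")
    entry.headword();        // kanji if present, else kana
    entry.is_common();
    entry.glosses();         // Iterator<Item = &str>
    entry.parts_of_speech(); // Vec<&str>, distinct, first-seen order
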
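The comments on `headword` and `parts_of_speech` describe two small rules: fall back from kanji to kana, and deduplicate tags while preserving first-seen order. The sketch below shows those rules on a simplified, hypothetical `SketchEntry` type; the real entry struct has more fields and its internals may differ.

```rust
/// Hypothetical, simplified entry used only to illustrate the helper semantics.
struct SketchEntry {
    kanji: Vec<String>,
    kana: Vec<String>,
    /// Part-of-speech tags, one list per sense.
    senses_pos: Vec<Vec<String>>,
}

impl SketchEntry {
    /// First kanji form if present, otherwise the first kana form.
    fn headword(&self) -> Option<&str> {
        self.kanji
            .first()
            .or_else(|| self.kana.first())
            .map(String::as_str)
    }

    /// Distinct part-of-speech tags across all senses, in first-seen order.
    fn parts_of_speech(&self) -> Vec<&str> {
        let mut seen: Vec<&str> = Vec::new();
        for tag in self.senses_pos.iter().flatten() {
            if !seen.contains(&tag.as_str()) {
                seen.push(tag.as_str());
            }
        }
        seen
    }
}
```

A linear `contains` check is used instead of a hash set because an entry carries only a handful of tags and the output order must match the order of first appearance.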
Entry structure
See docs.rs for the full API.
Contributing
Issues, PRs, and ideas welcome, especially around new lookup modes, query ergonomics, and data quality. Fork, branch, test, PR.
License
MIT License; see LICENSE.
Acknowledgments
- JMdict: the source dictionary data. See the EDRDG dictionary licence statement.
- fst: the underlying finite-state-transducer crate.
- 10ten Japanese Reader: for their deinflector implementation.
Built with ❤️ and Rust 🦀