jmdict-fast
Blazing-fast Japanese dictionary engine powered by FST indexing
Note: This crate uses bunpo for Japanese conjugation handling. Both crates are part of the same monorepo but are published separately to crates.io.
Features
- O(log n) lookups across kanji, kana, and romaji
- Memory-mapped FST indexes with zero-copy access
- Deinflection support via bunpo
- Two loading modes: embedded (compile-time) or runtime (filesystem)
- Full JMdict data including antonyms, dialects, field tags, and cross-references
Getting Started
1. Generate or download dictionary data
Data files are not included in the crate. Generate them or download pre-built artifacts:
# Option A: Generate from source (requires network access)
cargo xtask generate
# Option B: Download pre-built data from GitHub Releases
# (asset name encodes JMdict + format versions; check Releases for current values)
This produces five files in dist/: kana.fst, kanji.fst, romaji.fst, id.fst, entries.bin.
2. Add the dependency
[dependencies]
jmdict-fast = "0.1.1"
3. Use the library
Runtime-loaded mode (default)
use jmdict_fast::Dict;
Embedded mode
Bake data into your binary at compile time:
[dependencies]
jmdict-fast = { version = "0.1.1", features = ["embedded"] }
let dict = Dict::load_embedded()?;
Requires data files in dist/ when building. Run cargo xtask generate first.
Loading behavior
Dict::load_default() tries sources in order:
- Embedded data (if the embedded feature is enabled)
- JMDICT_DATA env var: path to a directory with data files
- dist/ relative to the current directory
- dist/ relative to the workspace root
You can also load from an explicit path:
let dict = Dict::load(path)?;
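The filesystem part of the Dict::load_default() order described in this section can be sketched with the standard library alone. This is not the crate's implementation: resolve_data_dir and has_all_data_files are hypothetical helpers, the embedded step is left out, and the rule that a directory only counts when all five generated files are present is an assumption.

```rust
use std::path::{Path, PathBuf};

/// The five files this README says data generation produces.
const DATA_FILES: [&str; 5] = ["kana.fst", "kanji.fst", "romaji.fst", "id.fst", "entries.bin"];

/// True when `dir` contains every expected data file.
/// ASSUMPTION: the real crate may validate directories differently.
fn has_all_data_files(dir: &Path) -> bool {
    DATA_FILES.iter().all(|name| dir.join(name).is_file())
}

/// Hypothetical helper mirroring the documented order after the embedded step:
/// JMDICT_DATA, then dist/ under the current directory, then dist/ under the
/// workspace root. Inputs are passed in explicitly so the function is easy to test.
fn resolve_data_dir(
    env_dir: Option<PathBuf>,
    cwd: &Path,
    workspace_root: &Path,
) -> Option<PathBuf> {
    let candidates = vec![
        env_dir,
        Some(cwd.join("dist")),
        Some(workspace_root.join("dist")),
    ];
    // Return the first candidate that exists and holds a complete data set.
    candidates
        .into_iter()
        .flatten()
        .find(|dir| has_all_data_files(dir))
}
```

A caller would pass std::env::var_os("JMDICT_DATA").map(PathBuf::from) as env_dir and std::env::current_dir() as cwd. How the crate locates the workspace root is not stated here, so that argument is left to the caller.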
Environment Variables
| Variable | Description |
|---|---|
| JMDICT_DATA | Path to directory containing FST and entries.bin files |
Feature Flags
| Feature | Description |
|---|---|
| embedded | Bake dictionary data into the binary via include_bytes! |
Data Generation (cargo xtask generate)
The xtask crate handles downloading JMdict and building the FST indexes:
# Generate to default output (dist/)
cargo xtask generate
# Generate to a custom directory
Pre-built data is also attached to GitHub Releases as jmdict-data.tar.gz.
Data Structure
kana.fst    kanji.fst    romaji.fst    id.fst
   │            │             │           │
   └────────────┼─────────────┘           │
                ▼                         │
           entries.bin ◄──────────────────┘
      (postcard-serialized entries
       with version header)
- FST maps: sorted key→entry-id indexes for each writing system
- entries.bin: versioned binary blob (magic JMDF + format version + postcard-serialized entries)
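To make the entries.bin layout concrete, here is a minimal sketch of validating the header before handing the rest to a postcard deserializer. Only the JMDF magic comes from this README. The 4-byte little-endian version field, the split_header function, and the HeaderError type are assumptions for illustration; the real field width and byte order may differ.

```rust
/// Magic bytes this README gives for entries.bin.
const MAGIC: &[u8] = b"JMDF";

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum HeaderError {
    TooShort,
    BadMagic,
}

/// Splits a blob into (format_version, payload).
/// ASSUMPTION: the version is a little-endian u32 stored directly after the
/// magic; the README does not state the actual width or byte order.
fn split_header(blob: &[u8]) -> Result<(u32, &[u8]), HeaderError> {
    if blob.len() < 8 {
        return Err(HeaderError::TooShort);
    }
    let (magic, rest) = blob.split_at(4);
    if magic != MAGIC {
        return Err(HeaderError::BadMagic);
    }
    let (v, payload) = rest.split_at(4);
    let version = u32::from_le_bytes([v[0], v[1], v[2], v[3]]);
    // `payload` would hold the postcard-serialized entries.
    Ok((version, payload))
}
```

Checking the magic and version first lets a loader reject stale or foreign files with a clear error instead of failing partway through deserialization.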
Migration from embedded-only API
Previous versions embedded dictionary data via include_bytes! in build.rs. The new architecture:
- Data generation moved to cargo xtask generate (separate from compilation)
- Embedded mode is now opt-in via the embedded feature flag
- Runtime loading is the default: set JMDICT_DATA or place files in dist/
- Dict::load_default() cascades: embedded → env var → filesystem
If you were using Dict::load_default() before, it continues to work — just generate data first with cargo xtask generate.
API Reference
Loading
- Dict::load(path): Load from a specific directory
- Dict::load_default(): Auto-detect data location
- Dict::load_embedded(): Load compile-time embedded data (requires the embedded feature)
Lookups
- dict.lookup_exact(term): Exact match across kana, kanji, romaji
- dict.lookup_partial(prefix): Prefix search
- dict.lookup_exact_with_deinflection(term): Exact match with verb/adjective deinflection
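The difference between exact and prefix lookup over a sorted key→entry-id index can be illustrated with a standard-library stand-in. This sketch uses a BTreeMap rather than an FST, SketchIndex is a hypothetical type, and returning bare u32 ids is an assumption; the crate's real return types are documented on docs.rs. Deinflection is not modeled.

```rust
use std::collections::BTreeMap;

/// Stand-in index: sorted key -> entry ids.
/// The real crate uses FST files; a BTreeMap only mimics the sorted-key behavior.
struct SketchIndex {
    keys: BTreeMap<String, Vec<u32>>,
}

impl SketchIndex {
    fn new(pairs: &[(&str, u32)]) -> Self {
        let mut keys: BTreeMap<String, Vec<u32>> = BTreeMap::new();
        for &(key, id) in pairs {
            keys.entry(key.to_string()).or_insert_with(Vec::new).push(id);
        }
        SketchIndex { keys }
    }

    /// Exact match: only ids stored under exactly `term`.
    fn lookup_exact(&self, term: &str) -> Vec<u32> {
        self.keys.get(term).cloned().unwrap_or_default()
    }

    /// Prefix search: start at the first key >= `prefix` and stop at the
    /// first key that no longer starts with it (matches are contiguous
    /// because the keys are sorted).
    fn lookup_partial(&self, prefix: &str) -> Vec<u32> {
        self.keys
            .range(prefix.to_string()..)
            .take_while(|(key, _)| key.starts_with(prefix))
            .flat_map(|(_, ids)| ids.iter().copied())
            .collect()
    }
}
```

Several keys can map to the same id, which is how one entry stays reachable through its kana, kanji, and romaji forms.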
Entry Structure
See docs.rs for full API documentation.
Performance
| Metric | Value |
|---|---|
| Index Size | ~888 KB (FSTs) |
| Data Size | ~16 MB binary blob |
| Lookup Speed | O(log n), ~4 µs |
License
MIT License — see LICENSE for details.
Acknowledgments
- JMdict: the source dictionary data (see the EDRDG Dictionary Licence Statement)
- FST crate: fast finite state transducer implementation
- 10ten Japanese Reader: for their deinflector implementation