jmdict-fast

Blazing-fast Japanese dictionary engine powered by FST indexing

Note: This crate uses bunpo for Japanese conjugation handling. Both crates are part of the same monorepo but are published separately to crates.io.

Features

O(log n) lookups across kanji, kana, and romaji
Memory-mapped FST indexes with zero-copy access
Deinflection support via bunpo
Two loading modes: embedded (compile-time) or runtime (filesystem)
Full JMdict data including antonyms, dialects, field tags, and cross-references

Getting Started

1. Generate or download dictionary data

Data files are not included in the crate. Generate them or download pre-built artifacts:

# Option A: Generate from source (requires network access)
cargo xtask generate

# Option B: Download pre-built data from GitHub Releases
# (asset name encodes JMdict + format versions; check Releases for current values)
mkdir -p dist
curl -L https://github.com/theGlenn/jmdict-fst/releases/latest/download/jmdict-data-jmdict3.6.1-fmt3.tar.gz | tar xz -C dist/

This produces five files in dist/: kana.fst, kanji.fst, romaji.fst, id.fst, entries.bin.

2. Add the dependency

[dependencies]
jmdict-fast = "0.1.1"

3. Use the library

Runtime-loaded mode (default)

use jmdict_fast::Dict;

fn main() -> anyhow::Result<()> {
    // Loads from JMDICT_DATA env var, or dist/ directory
    let dict = Dict::load_default()?;

    // Exact lookup
    let results = dict.lookup_exact("猫");
    for entry in &results {
        println!("{}: {}", entry.kanji[0].text, entry.sense[0].gloss[0].text);
    }

    // Prefix search
    let results = dict.lookup_partial("こんに");

    // With deinflection (finds 食べる from 食べます)
    let results = dict.lookup_exact_with_deinflection("食べます");

    Ok(())
}

Embedded mode

Bake data into your binary at compile time:

[dependencies]
jmdict-fast = { version = "0.1.1", features = ["embedded"] }

let dict = Dict::load_embedded()?;

Embedded mode needs the data files in dist/ at build time, so run cargo xtask generate before compiling.

Loading behavior

Dict::load_default() tries sources in order:

Embedded data (if embedded feature is enabled)
JMDICT_DATA env var — path to a directory with data files
dist/ relative to the current directory
dist/ relative to the workspace root
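
The filesystem part of that order can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: resolve_data_dir and has_all_data_files are hypothetical helpers, "../dist" merely stands in for the workspace root, and it is an assumption that a directory only counts when all five data files are present.

```rust
use std::path::{Path, PathBuf};

/// The five data files the README lists for a generated dist/ directory.
const DATA_FILES: [&str; 5] = ["kana.fst", "kanji.fst", "romaji.fst", "id.fst", "entries.bin"];

/// Returns true when `dir` contains every expected data file.
/// ASSUMPTION: the real crate may validate directories differently.
fn has_all_data_files(dir: &Path) -> bool {
    DATA_FILES.iter().all(|name| dir.join(name).is_file())
}

/// Hypothetical helper: checks the env-var value first, then each fallback
/// in the order given, and returns the first directory that is usable.
fn resolve_data_dir(env_value: Option<&str>, fallbacks: &[PathBuf]) -> Option<PathBuf> {
    env_value
        .map(PathBuf::from)
        .into_iter()
        .chain(fallbacks.iter().cloned())
        .find(|dir| has_all_data_files(dir))
}

fn main() {
    let env_value = std::env::var("JMDICT_DATA").ok();
    // "../dist" is only a placeholder for "dist/ relative to the workspace root".
    let fallbacks = [PathBuf::from("dist"), PathBuf::from("../dist")];
    match resolve_data_dir(env_value.as_deref(), &fallbacks) {
        Some(dir) => println!("data directory: {}", dir.display()),
        None => eprintln!("no data found; run `cargo xtask generate` or set JMDICT_DATA"),
    }
}
```

A check like this can be handy in a build script or startup path to report a missing dist/ directory before calling Dict::load_default().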

You can also load from an explicit path:

let dict = Dict::load("/path/to/data")?;

Environment Variables

Variable	Description
`JMDICT_DATA`	Path to directory containing FST and entries.bin files

Feature Flags

Feature	Description
`embedded`	Bake dictionary data into the binary via `include_bytes!`

Data Generation (`cargo xtask generate`)

The xtask crate handles downloading JMdict and building the FST indexes:

# Generate to default output (dist/)
cargo xtask generate

# Generate to a custom directory
cargo xtask generate --output /path/to/output

Pre-built data is also attached to GitHub Releases as a jmdict-data-*.tar.gz archive; the asset name encodes the JMdict and format versions.

Data Structure

kana.fst     kanji.fst     romaji.fst    id.fst
   │              │              │            │
   └──────────────┼──────────────┘            │
                  ▼                           │
            entries.bin ◄─────────────────────┘
         (postcard-serialized entries
          with version header)

FST maps — sorted key→entry-id indexes for each writing system
entries.bin — versioned binary blob (magic JMDF + format version + postcard-serialized entries)
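
To make the versioned header concrete, here is a hedged sketch of validating such a blob before deserializing it. The README only states that the file starts with the magic JMDF followed by a format version; the 4-byte little-endian version field, the HeaderError type, and parse_header are assumptions made for illustration, and the real layout may differ.

```rust
/// Magic bytes the README names for entries.bin.
const MAGIC: &[u8; 4] = b"JMDF";

#[derive(Debug, PartialEq)]
enum HeaderError {
    TooShort,
    BadMagic,
    UnsupportedVersion(u32),
}

/// Splits a blob into (format version, payload).
/// ASSUMPTION: the version is a little-endian u32 stored right after the magic.
fn parse_header(blob: &[u8], supported: u32) -> Result<(u32, &[u8]), HeaderError> {
    if blob.len() < 8 {
        return Err(HeaderError::TooShort);
    }
    let (magic, rest) = blob.split_at(4);
    if magic != &MAGIC[..] {
        return Err(HeaderError::BadMagic);
    }
    let version = u32::from_le_bytes([rest[0], rest[1], rest[2], rest[3]]);
    if version != supported {
        return Err(HeaderError::UnsupportedVersion(version));
    }
    // Everything after the 8-byte header would be the serialized entries.
    Ok((version, &rest[4..]))
}

fn main() {
    // Build a tiny fake blob: magic + version 3 + two payload bytes.
    let mut blob = b"JMDF".to_vec();
    blob.extend_from_slice(&3u32.to_le_bytes());
    blob.extend_from_slice(&[0xAA, 0xBB]);

    match parse_header(&blob, 3) {
        Ok((version, payload)) => println!("format v{}, {} payload bytes", version, payload.len()),
        Err(err) => eprintln!("rejected blob: {:?}", err),
    }
}
```

Rejecting a mismatched version up front is what lets a loader fail with a clear message instead of a confusing deserialization error when data and library versions drift apart.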

Migration from embedded-only API

Previous versions embedded dictionary data via include_bytes! in build.rs. The new architecture:

Data generation moved to cargo xtask generate (separate from compilation)
Embedded mode is now opt-in via the embedded feature flag
Runtime loading is the default — set JMDICT_DATA or place files in dist/
Dict::load_default() cascades: embedded → env var → filesystem

If you were using Dict::load_default() before, it continues to work — just generate data first with cargo xtask generate.

API Reference

Loading

Dict::load(path) — Load from a specific directory
Dict::load_default() — Auto-detect data location
Dict::load_embedded() — Load compile-time embedded data (requires embedded feature)

Lookups

dict.lookup_exact(term) — Exact match across kana, kanji, romaji
dict.lookup_partial(prefix) — Prefix search
dict.lookup_exact_with_deinflection(term) — Exact match with verb/adjective deinflection
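
The difference between the three lookups can be illustrated without the crate. The sketch below uses a sorted vector and binary search as a stand-in for an FST, and a single toy deinflection rule; SortedIndex, deinflect_candidates, and exact_with_deinflection are hypothetical names, and neither the fst crate nor bunpo works this way internally.

```rust
/// Stand-in for one FST index: (key, entry id) pairs kept sorted by key.
struct SortedIndex {
    pairs: Vec<(String, u64)>,
}

impl SortedIndex {
    fn new(mut pairs: Vec<(String, u64)>) -> Self {
        pairs.sort();
        SortedIndex { pairs }
    }

    /// Position of the first key that is >= `key`, found by binary search.
    fn lower_bound(&self, key: &str) -> usize {
        self.pairs.partition_point(|(k, _)| k.as_str() < key)
    }

    /// Exact match: ids of every pair whose key equals `term`.
    fn exact(&self, term: &str) -> Vec<u64> {
        self.pairs[self.lower_bound(term)..]
            .iter()
            .take_while(|(k, _)| k.as_str() == term)
            .map(|(_, id)| *id)
            .collect()
    }

    /// Prefix search: keys sharing a prefix are contiguous in sorted order.
    fn prefix(&self, prefix: &str) -> Vec<u64> {
        self.pairs[self.lower_bound(prefix)..]
            .iter()
            .take_while(|(k, _)| k.starts_with(prefix))
            .map(|(_, id)| *id)
            .collect()
    }
}

/// Toy rule: polite "-ます" becomes plain "-る" (only right for ichidan verbs).
/// Wrong guesses are harmless because they simply miss in the index.
fn deinflect_candidates(term: &str) -> Vec<String> {
    let mut candidates = vec![term.to_string()];
    if let Some(stem) = term.strip_suffix("ます") {
        candidates.push(format!("{}る", stem));
    }
    candidates
}

/// Exact lookup over the term and each of its deinflected candidates.
fn exact_with_deinflection(index: &SortedIndex, term: &str) -> Vec<u64> {
    let mut ids: Vec<u64> = deinflect_candidates(term)
        .iter()
        .flat_map(|candidate| index.exact(candidate))
        .collect();
    ids.sort_unstable();
    ids.dedup();
    ids
}

fn main() {
    // Entry ids here are made up for the example.
    let index = SortedIndex::new(vec![
        ("猫".to_string(), 1),
        ("ねこ".to_string(), 1),
        ("食べる".to_string(), 2),
        ("こんにちは".to_string(), 3),
        ("こんにゃく".to_string(), 4),
    ]);
    println!("exact 猫: {:?}", index.exact("猫"));
    println!("prefix こんに: {:?}", index.prefix("こんに"));
    println!("deinflected 食べます: {:?}", exact_with_deinflection(&index, "食べます"));
}
```

The deinflection step follows the usual generate-then-filter pattern: produce candidate dictionary forms, then keep only those that the index actually contains.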

Entry Structure

pub struct Entry {
    pub id: String,
    pub kanji: Vec<KanjiEntry>,
    pub kana: Vec<KanaEntry>,
    pub sense: Vec<SenseEntry>,
}
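
Because kanji and sense are vectors, indexing them with [0] panics on entries that have no kanji form, such as kana-only words. The sketch below shows a panic-free way to pick a headword and first gloss. The types are local stand-ins that mirror the struct above so the example compiles on its own; the Gloss type name and the text field on KanaEntry are assumptions, and headword and first_gloss are hypothetical helpers, not part of the crate.

```rust
// Local stand-ins mirroring the README's Entry; not the crate's real types.
struct KanjiEntry { text: String }
struct KanaEntry { text: String }   // ASSUMPTION: has a `text` field like KanjiEntry
struct Gloss { text: String }       // ASSUMPTION: type name is hypothetical
struct SenseEntry { gloss: Vec<Gloss> }

struct Entry {
    id: String,
    kanji: Vec<KanjiEntry>,
    kana: Vec<KanaEntry>,
    sense: Vec<SenseEntry>,
}

/// First kanji form if present, otherwise the first kana reading.
fn headword(entry: &Entry) -> Option<&str> {
    entry
        .kanji
        .first()
        .map(|k| k.text.as_str())
        .or_else(|| entry.kana.first().map(|k| k.text.as_str()))
}

/// First gloss of the first sense, or None when either list is empty.
fn first_gloss(entry: &Entry) -> Option<&str> {
    entry.sense.first()?.gloss.first().map(|g| g.text.as_str())
}

fn main() {
    // A kana-only entry with a made-up id.
    let entry = Entry {
        id: "0000001".to_string(),
        kanji: vec![],
        kana: vec![KanaEntry { text: "こんにちは".to_string() }],
        sense: vec![SenseEntry { gloss: vec![Gloss { text: "hello".to_string() }] }],
    };
    println!(
        "{} {}: {}",
        entry.id,
        headword(&entry).unwrap_or("?"),
        first_gloss(&entry).unwrap_or("(no gloss)")
    );
}
```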

See docs.rs for full API documentation.

Performance

Metric	Value
Index Size	~888 KB (FSTs)
Data Size	~16 MB binary blob
Lookup Speed	O(log n), ~4 µs

License

MIT License — see LICENSE for details.

Acknowledgments

JMdict: the source dictionary data (see the EDRDG Dictionary Licence Statement)
FST crate: fast finite state transducer implementation
10ten Japanese Reader: for their deinflector implementation

jmdict-fast 0.1.2