rsmarisa
Pure Rust port of marisa-trie, a static and space-efficient trie data structure.
About
MARISA (Matching Algorithm with Recursively Implemented StorAge) is a static and space-efficient trie data structure. This is a pure Rust implementation (no C++ dependencies) that maintains full binary compatibility with the original C++ marisa-trie while leveraging Rust's safety features.
Why rsmarisa?
- Pure Rust implementation - no C++ dependencies or FFI overhead
- Binary compatible with C++ marisa-trie (can read/write the same files)
- Memory safe with comprehensive test coverage (321 tests)
- Identical behavior to the original C++ library
- Memory-mapped I/O support for efficient loading of large dictionaries
Features
- Lookup: Check whether a given string exists in the dictionary
- Reverse lookup: Restore a key from its ID
- Common prefix search: Find keys from prefixes of a given string
- Predictive search: Find keys starting with a given string
- Space-efficient: Compressed trie structure with LOUDS encoding
- Binary I/O: Save and load tries to/from files
- Memory-mapped I/O: Efficient loading with zero-copy memory mapping
- Full binary compatibility: Rust-built tries are byte-for-byte identical to C++-built tries
Quick Start
Add this to your Cargo.toml:
[]
= "0.2"
Basic Usage
use ;
// Build a trie
let mut keyset = new;
keyset.push_back_str.unwrap;
keyset.push_back_str.unwrap;
keyset.push_back_str.unwrap;
let mut trie = new;
trie.build;
// Lookup
let mut agent = new;
agent.init_state.unwrap;
agent.set_query_str;
assert!;
// Common prefix search
agent.set_query_str;
while trie.common_prefix_search
// Prints: "app", "application"
Save and Load
use ;
// Build and save
let mut keyset = new;
keyset.push_back_str.unwrap;
keyset.push_back_str.unwrap;
let mut trie = new;
trie.build;
trie.save.unwrap;
// Load
let mut loaded_trie = new;
loaded_trie.load.unwrap;
Memory-Mapped I/O
For efficient loading of large dictionaries (instant startup, zero-copy):
use Trie;
// Memory-map from file (recommended for large dictionaries)
let mut trie = new;
trie.mmap.unwrap;
// Or embed dictionary data in the binary
static DICT_DATA: & = include_bytes!;
let mut trie = new;
trie.map.unwrap;
When to use mmap() vs load():
- Large dictionaries (>100MB): Use
mmap()for O(1) startup time - Small dictionaries (<1MB): Use
load()for simplicity - Embedded data: Use
map()withinclude_bytes!()
Both methods produce identical behavior and support the same operations.
Examples
Run the included examples:
# Basic usage demonstration
# Save/load file I/O
Status
✅ Production ready! Core functionality is complete with full binary compatibility and all CLI tools working.
Latest Updates (v0.2.0 - 2026-01-27):
- ✅ Memory-mapped I/O: Added
Trie::mmap()for efficient zero-copy loading of large dictionaries - ✅ Instant startup: O(1) load time for large dictionaries using memory mapping
- ✅ Static memory support:
Trie::map()for embedding dictionary data in binaries - ✅ Binary compatibility: Both
mmap()andload()work with C++-created files
Upcoming in v0.3.0 (Breaking Change):
- 🔄 Library name alignment: The library name has been changed from
marisatorsmarisato match the package name- Migration required: Update all imports from
use marisa::touse rsmarisa:: - Why: Eliminates confusion and aligns with Rust naming conventions
- Impact: All users must update their code when upgrading to v0.3.0
- Migration required: Update all imports from
See CHANGELOG.md for complete version history.
What's Implemented
- ✅ LOUDS trie construction - Space-efficient trie building
- ✅ Lookup operations - Exact string matching
- ✅ Common prefix search - Find all prefixes of a string
- ✅ Binary I/O - Save/load with C++ marisa-trie compatibility
- ✅ Memory-mapped I/O - Efficient zero-copy loading
- ✅ File format validation - "We love Marisa." header check
- ✅ 321 comprehensive tests - All passing ✅
Compatibility
| Feature | Status | Notes |
|---|---|---|
| Behavioral compatibility | ✅ | Search operations match C++ exactly |
| Binary file format | ✅ | Byte-for-byte identical to C++ output |
| C++ file reading | ✅ | Can read files from C++ marisa-trie |
| C++ file writing | ✅ | C++ can read Rust-built files |
| Cross-verification | ✅ | Verified with cmp and marisa-lookup |
| Memory-mapped I/O | ✅ | Implemented using memmap2 |
Testing
# Run all tests
# Run with output
# Run specific test
CLI Tools
Command-line tools with rsmarisa- prefix (to avoid conflicts with C++ marisa-trie):
Available:
- ✅
rsmarisa-build- Build a dictionary from text input (binary compatible with C++) - ✅
rsmarisa-lookup- Look up keys in a dictionary (results match C++ exactly) - ✅
rsmarisa-common-prefix-search- Find keys that are prefixes of queries (results match C++ exactly) - ✅
rsmarisa-dump- Dump dictionary contents (results match C++ exactly) - ✅
rsmarisa-predictive-search- Find keys with a given prefix (results match C++ exactly) - ✅
rsmarisa-reverse-lookup- Restore keys from IDs (results match C++ exactly)
Coming Soon:
rsmarisa-benchmark- Performance benchmarking
All CLI tools are now fully functional and produce output identical to their C++ counterparts.
Usage Examples
# Build a dictionary
|
# Look up keys
|
# Output: 0\tapple
# -1\tgrape
# Find common prefixes
|
# Output: 3 found
# 0\ta\tapplication
# 1\tapp\tapplication
# 5\tapplication\tapplication
Performance
MARISA tries are designed to be:
- Space-efficient: Uses LOUDS encoding and suffix compression
- Fast lookups: O(m) where m is the query length
- Cache-friendly: Sequential memory access patterns
See PORTING_STATUS.md for detailed implementation progress.
Original Project
- Repository: https://github.com/s-yata/marisa-trie
- Author: Susumu Yata
- Version: 0.3.1
- Baseline commit:
4ef33cc5a2b6b4f5e147e4564a5236e163d67982
Contributing
Contributions are welcome! Please see CLAUDE.md for porting guidelines and project structure.
Development
# Build
# Run tests
# Run clippy
# Format code
License
BSD-2-Clause (same as the original project)
See LICENSE for details.
Acknowledgments
This is a Rust port of the excellent marisa-trie library by Susumu Yata.