DictUtils
A high-performance Rust library for dictionary operations, with support for multiple dictionary formats (MDict, StarDict, ZIM) and advanced indexing capabilities.
Experimental Status
DictUtils is currently experimental and not suitable for production use. Many format parsers rely on placeholder logic that does not validate real dictionary files, index sidecars are not compatible with production dictionaries, and compression/IO helpers are best-effort prototypes. Use this crate only for prototyping or research experiments. Contributions are welcome to replace the mock parsing layers with real format support.
Features
- High Performance: B-TREE indexing and memory-mapped files for optimal speed
- Multi-Format Support: MDict, StarDict, and ZIM dictionary formats
- Advanced Search: Prefix, fuzzy, and full-text search capabilities
- Concurrent Access: Thread-safe operations with parallel processing
- Memory Efficient: LRU caching and lazy loading
- Flexible Configuration: Customizable cache sizes, indexing options, and more
Quick Start
Add DictUtils to your Cargo.toml:
[dependencies]
dictutils = "0.1.0"
Or with optional features:
[dependencies]
dictutils = { version = "0.1.0", features = ["criterion", "rayon", "cli", "encoding-support"] }
Basic usage example:
use *;
Documentation
Core Concepts
Dictionary Loading
// Auto-detection of dictionary format
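let mut dict = DictLoader::new().load(/* path to a dictionary file */)?;
// With custom configuration
let config = DictConfig { /* fields elided */ };
let loader = DictLoader::with_config(config);
let mut dict = loader.load(/* path to a dictionary file */)?;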
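This README does not spell out how format auto-detection works. As a rough illustration only, one common approach is to guess the format from the file extension. The sketch below is self-contained and does not use the DictUtils API: `DetectedFormat` and `detect_by_extension` are hypothetical names, and a real loader would probably also inspect file headers.

```rust
use std::path::Path;

/// Dictionary formats named in this README (hypothetical enum, not the crate's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DetectedFormat {
    MDict,
    StarDict,
    Zim,
    Babylon,
}

/// Guess the format from the file extension, ignoring ASCII case.
/// Returns `None` when the extension is missing or not recognized.
pub fn detect_by_extension(path: &Path) -> Option<DetectedFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "mdx" => Some(DetectedFormat::MDict),
        "ifo" => Some(DetectedFormat::StarDict),
        "zim" => Some(DetectedFormat::Zim),
        "bgl" => Some(DetectedFormat::Babylon),
        _ => None,
    }
}
```

Extension matching is cheap but easy to fool, so treat it as a first guess rather than validation.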
Search Operations
use *;
// Prefix search - find words starting with "comp"
let prefix_results = dict.search_prefix(/* arguments elided */)?;
// Fuzzy search - find words similar to "programing"
let fuzzy_results = dict.search_fuzzy(/* arguments elided */)?;
// Full-text search - search within content
let fts_iterator = dict.search_fulltext(/* arguments elided */)?;
let fts_results: Vec<_> = fts_iterator.collect::<Result<_, _>>()?;
// Range queries
let range_results = dict.get_range(/* arguments elided */)?;
// Batch lookups
let keys = vec![/* keys elided */];
let batch_results = dict.get_batch(/* arguments elided */)?;
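To make the prefix and fuzzy search ideas concrete, here is a minimal standalone sketch built on the standard library only. It is not the crate's implementation: `prefix_search`, `levenshtein`, and `fuzzy_search` are hypothetical helpers, the index is assumed to be a plain `BTreeMap`, and fuzzy matching is assumed to mean Levenshtein edit distance.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Return up to `limit` keys that start with `prefix`, in sorted order.
/// The sorted map lets us seek to the first candidate instead of scanning every key.
pub fn prefix_search<'a>(
    index: &'a BTreeMap<String, String>,
    prefix: &str,
    limit: usize,
) -> Vec<&'a str> {
    index
        .range::<str, _>((Bound::Included(prefix), Bound::<&str>::Unbounded))
        .take_while(|(key, _)| key.starts_with(prefix))
        .take(limit)
        .map(|(key, _)| key.as_str())
        .collect()
}

/// Levenshtein edit distance between two strings, computed row by row.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed part of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = (prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1);
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Return keys within `max_distance` edits of `query`, closest first.
/// This scans every key, so it only suits small indexes.
pub fn fuzzy_search<'a>(
    index: &'a BTreeMap<String, String>,
    query: &str,
    max_distance: usize,
) -> Vec<(&'a str, usize)> {
    let mut hits: Vec<(&str, usize)> = index
        .keys()
        .map(|key| (key.as_str(), levenshtein(key, query)))
        .filter(|&(_, distance)| distance <= max_distance)
        .collect();
    hits.sort_by(|a, b| a.1.cmp(&b.1).then_with(|| a.0.cmp(b.0)));
    hits
}
```

With keys such as "company", "compile", and "computer", `prefix_search(&index, "comp", 2)` stops after the first two matches in sorted order, which is why a result limit keeps prefix queries cheap.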
Performance Optimization
// Build indexes for better performance
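dict.build_indexes()?;
// Configure for memory efficiency
let efficient_config = DictConfig { /* fields elided */ };
// Monitor performance statistics
let stats = dict.stats();
println!(/* output elided */);
println!(/* output elided */);
for (/* pattern elided */) in &stats.index_sizes { /* body elided */ }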
Architecture
Core Components
dictutils/
├── traits.rs           # Core trait definitions
├── dict/               # Dictionary format implementations
│   ├── mdict.rs        # Monkey's Dictionary format
│   ├── stardict.rs     # StarDict format
│   └── zimdict.rs      # ZIM format
├── index/              # High-performance indexing
│   ├── btree.rs        # B-TREE index for fast lookups
│   └── fts.rs          # Full-text search index
├── util/               # Utility modules
│   ├── compression.rs  # Compression algorithms
│   ├── encoding.rs     # Text encoding conversion
│   └── buffer.rs       # Binary buffer utilities
└── lib.rs              # Main library module
Design Principles
- Performance First: Optimized for speed with efficient data structures
- Memory Efficiency: Lazy loading, caching, and memory mapping
- Thread Safety: All operations are thread-safe by default
- Format Agnostic: Unified interface across different dictionary formats
- Extensible: Easy to add new dictionary formats and features
Performance Guide
Dictionary Size Recommendations
| Dictionary Size | Configuration | Memory Mapping | Indexes |
|---|---|---|---|
| < 10MB | Basic config | Optional | Optional |
| 10MB - 100MB | Standard | Recommended | B-TREE |
| 100MB - 1GB | Optimized | Recommended | B-TREE + FTS |
| > 1GB | Enterprise | Required | B-TREE + FTS |
Performance Tips
1. Index Optimization
// Build B-TREE index for fast exact lookups
dict.build_indexes()?;
// Enable memory mapping for better I/O performance
let config = DictConfig { /* fields elided */ };
// Cache frequently accessed entries
let config = DictConfig { /* fields elided */ };
2. Search Optimization
// Use batch operations for multiple lookups
let keys = vec!;
let results = dict.get_batch?;
// Cache search results
let mut cache = new;
// Prefix search with limits
let results = dict.search_prefix?;
// Use appropriate search type
if query.len <= 3 else if query.contains else
3. Memory Optimization
// Use memory mapping for large files
let config = DictConfig { /* fields elided */ };
// Clear cache periodically
dict.clear_cache();
// Monitor memory usage
let stats = dict.stats();
println!(/* output elided */);
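The caching advice above relies on an LRU (least recently used) policy. As an illustration of that policy only, here is a deliberately simple sketch. `LruCache` here is a hypothetical type, not the crate's cache, and it trades speed for clarity: marking a key as recently used is O(n) in the number of cached entries.

```rust
use std::collections::{HashMap, VecDeque};

/// A tiny LRU cache for string entries (illustrative, not the crate's type).
pub struct LruCache {
    capacity: usize,
    map: HashMap<String, String>,
    /// Keys ordered from least recently used (front) to most recently used (back).
    order: VecDeque<String>,
}

impl LruCache {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Look up a key and mark it as most recently used on a hit.
    pub fn get(&mut self, key: &str) -> Option<&String> {
        if self.map.contains_key(key) {
            self.touch(key);
        }
        self.map.get(key)
    }

    /// Insert or update an entry, evicting the least recently used one when full.
    pub fn put(&mut self, key: String, value: String) {
        if self.capacity == 0 {
            return;
        }
        if self.map.insert(key.clone(), value).is_some() {
            // Existing key: only its recency changes.
            self.touch(&key);
            return;
        }
        if self.map.len() > self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.order.push_back(key);
    }

    /// Drop every cached entry.
    pub fn clear(&mut self) {
        self.map.clear();
        self.order.clear();
    }

    pub fn len(&self) -> usize {
        self.map.len()
    }

    /// Move `key` to the most-recently-used end of the queue.
    fn touch(&mut self, key: &str) {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            if let Some(k) = self.order.remove(pos) {
                self.order.push_back(k);
            }
        }
    }
}
```

A production cache would usually pair the map with a linked list so that updating recency is O(1). The point here is the eviction order: the entry that has gone longest without a read or write is dropped first.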
Benchmarking
Run performance benchmarks:
# Run all benchmarks
# Run specific benchmark category
# Profile memory usage
Expected performance characteristics:
- Dictionary Loading: 10-100ms for dictionaries < 100MB
- Exact Lookup: < 1ms with B-TREE index
- Prefix Search: < 10ms for 1000 results
- Fuzzy Search: < 100ms for 100 results
- Full-Text Search: < 50ms for 100 results
Advanced Usage
Concurrent Access
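use std::sync::Arc;
use std::thread;
// Share dictionary across threads
let dict = Arc::new(/* dictionary value elided */);
// Thread 1: Reading operations
let dict1 = Arc::clone(&dict);
let handle1 = thread::spawn(/* closure elided */);
// Thread 2: Search operations
let dict2 = Arc::clone(&dict);
let handle2 = thread::spawn(/* closure elided */);
handle1.join().unwrap().unwrap();
handle2.join().unwrap().unwrap();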
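The sharing pattern above can be tried without DictUtils at all. The following sketch assumes a plain `HashMap` stands in for the dictionary and uses `Arc<RwLock<...>>` so that many readers can proceed at once. `concurrent_lookups` is a hypothetical helper, and whether the crate itself uses `RwLock` is not stated in this README.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Look up each key on its own thread against one shared, read-locked map.
/// Results are returned in the same order as `keys`.
pub fn concurrent_lookups(
    entries: HashMap<String, String>,
    keys: Vec<String>,
) -> Vec<Option<String>> {
    let shared = Arc::new(RwLock::new(entries));

    // Spawn one reader thread per key; each gets its own handle to the shared map.
    let handles: Vec<_> = keys
        .into_iter()
        .map(|key| {
            let dict = Arc::clone(&shared);
            thread::spawn(move || {
                let guard = dict.read().expect("lock poisoned");
                let value = guard.get(&key).cloned();
                value
            })
        })
        .collect();

    // Joining in spawn order keeps results aligned with the input keys.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("reader thread panicked"))
        .collect()
}
```

One thread per key is only for demonstration. A real workload would more likely use a thread pool or a parallel iterator, such as the one the optional rayon feature suggests.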
Custom Dictionary Processing
// Process large dictionaries efficiently
Format Conversion
// Convert between dictionary formats
use ;
Dictionary Formats
MDict (Monkey's Dictionary)
High-performance binary format with:
- B-TREE indexing for fast lookups
- Memory-mapped file access
- Compression support (GZIP, LZ4, Zstandard)
- Custom metadata fields
Best for: Large dictionaries, performance-critical applications
StarDict
Classic format with:
- Binary search support
- Synonym and mnemonic files
- Cross-platform compatibility
- Simple layout: a plain-text .ifo metadata file alongside binary .idx and .dict files
- Enhanced DICTZIP handling: random-access via RA tables or deterministic sequential inflation when RA is missing
Best for: General purpose dictionaries, simple implementations
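As background for the binary search point in the list above, the sketch below parses a StarDict-style .idx buffer and looks a word up by binary search. It assumes the common 32-bit layout (a NUL-terminated UTF-8 word, then a big-endian `u32` offset and a big-endian `u32` size) and assumes the entries are sorted in plain byte order; real StarDict files use their own case-insensitive comparison, so treat this as a simplification. `IndexEntry`, `parse_idx`, and `find_entry` are hypothetical names, not part of DictUtils.

```rust
/// One record of a StarDict-style index: where a word's definition lives in the .dict file.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IndexEntry {
    pub word: String,
    pub offset: u32,
    pub size: u32,
}

/// Parse a whole .idx buffer. Returns `None` if a record is truncated or not valid UTF-8.
pub fn parse_idx(mut data: &[u8]) -> Option<Vec<IndexEntry>> {
    let mut entries = Vec::new();
    while !data.is_empty() {
        // The word runs up to the first NUL byte.
        let nul = data.iter().position(|&b| b == 0)?;
        let word = std::str::from_utf8(&data[..nul]).ok()?.to_string();
        let rest = &data[nul + 1..];
        if rest.len() < 8 {
            return None;
        }
        // Offset and size are stored as big-endian 32-bit integers.
        let offset = u32::from_be_bytes([rest[0], rest[1], rest[2], rest[3]]);
        let size = u32::from_be_bytes([rest[4], rest[5], rest[6], rest[7]]);
        entries.push(IndexEntry { word, offset, size });
        data = &rest[8..];
    }
    Some(entries)
}

/// Binary-search a sorted index for an exact word match.
pub fn find_entry<'a>(index: &'a [IndexEntry], word: &str) -> Option<&'a IndexEntry> {
    index
        .binary_search_by(|entry| entry.word.as_str().cmp(word))
        .ok()
        .map(|i| &index[i])
}
```

The returned offset and size would then be used to slice the definition out of the .dict data.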
ZIM
Wikipedia offline format with:
- Article-based storage
- Built-in compression
- Rich metadata support
- Efficient for encyclopedia content
Best for: Offline wikis, reference materials
Babylon (BGL)
Babylon format with:
- Sidecar index support
- Memory-mapped file access
- Requires external indexing tools
Important: The BGL implementation does NOT parse raw .bgl binaries. It only consumes pre-built sidecar index files (.btree and .fts), which must be produced by an external tool such as GoldenDict's indexer.
Best for: Babylon dictionaries with pre-built indexes
Error Handling
All operations return Result<T, DictError>:
use ;
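Since the variants of DictError are not listed in this README, the following sketch uses a stand-in error type to show the general Result-based pattern. `SketchError`, `lookup`, and `describe` are hypothetical and only mirror the shape of `Result<T, DictError>`; they are not the crate's definitions.

```rust
use std::fmt;

/// Stand-in error type; the real `DictError` variants may differ.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    NotFound(String),
    UnsupportedFormat(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::NotFound(key) => write!(f, "entry not found: {key}"),
            SketchError::UnsupportedFormat(name) => write!(f, "unsupported format: {name}"),
        }
    }
}

impl std::error::Error for SketchError {}

/// Look up `key` in a small in-memory table, reporting a typed error on a miss.
pub fn lookup(entries: &[(&str, &str)], key: &str) -> Result<String, SketchError> {
    entries
        .iter()
        .find(|(k, _)| *k == key)
        .map(|(_, v)| v.to_string())
        .ok_or_else(|| SketchError::NotFound(key.to_string()))
}

/// Turn a lookup result into a user-facing message, handling the miss case separately.
pub fn describe(result: &Result<String, SketchError>) -> String {
    match result {
        Ok(definition) => format!("definition: {definition}"),
        Err(SketchError::NotFound(key)) => format!("no entry for '{key}'"),
        Err(other) => format!("lookup failed: {other}"),
    }
}
```

Matching on specific variants first and falling back to the Display output keeps expected misses quiet while still surfacing unexpected failures.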
Testing
Running Tests
# Run all tests
# Run specific test categories
# Run with coverage
# Run benchmarks (requires criterion feature)
Performance Testing
# Run performance tests
# Run memory leak detection
# Run concurrent stress tests
Optional Features
Enable additional functionality with Cargo features:
[dependencies.dictutils]
version = "0.1.0"
features = [
"criterion", # Performance benchmarks
"rayon", # Parallel processing
"cli", # Command-line tools
"serde", # Serialization support
"debug_leaks" # Memory leak detection
]
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
# Clone repository
# Install development dependencies
# Run tests
# Run linting
# Run benchmarks
Adding New Dictionary Formats
To add support for a new dictionary format:
- Implement the DictFormat trait
- Implement the Dict trait for your format
- Add format detection logic to DictLoader
- Add comprehensive tests
Example template:
use *;
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- MDict format specification
- StarDict documentation
- ZIM format documentation
- Rust ecosystem crates that made this possible
Benchmarks
Performance results on typical hardware (Intel i7, 16GB RAM):
| Operation | Small Dict (<1MB) | Medium Dict (10MB) | Large Dict (100MB) |
|---|---|---|---|
| Load Time | < 10ms | < 100ms | < 500ms |
| Exact Lookup | < 0.1ms | < 0.1ms | < 0.1ms |
| Prefix Search | < 1ms | < 5ms | < 20ms |
| Fuzzy Search | < 10ms | < 50ms | < 200ms |
| Full-Text Search | < 20ms | < 100ms | < 500ms |
Support
- Documentation: docs.rs/dictutils
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Discord: Join our Discord server
Made with ❤️ by the DictUtils team