# DictUtils - Comprehensive Documentation
## Overview
DictUtils is a high-performance Rust library for working with dictionary formats. It provides fast and efficient dictionary operations with support for multiple dictionary formats including Monkey's Dictionary (MDict), StarDict, and ZIM format. The library features B-TREE indexing for fast lookups, full-text search capabilities, memory-mapped file support, compression handling, batch operations, thread safety, and lazy loading.
## Library Constants
- `VERSION`: Library version string
- `NAME`: Library name ("dictutils")
- `DESCRIPTION`: Library description
- `MAX_DICT_SIZE`: Maximum supported dictionary size (2GB - 2,147,483,648 bytes)
- `DEFAULT_CACHE_SIZE`: Default cache size for entries (1000)
- `DEFAULT_BATCH_SIZE`: Default batch size for operations (100)
- `MIN_MEMORY`: Minimum memory required for basic operations (64MB)
- `RECOMMENDED_MEMORY`: Recommended memory for optimal performance (256MB)
## Core Modules
### 1. Library Root (`src/lib.rs`)
The main library module that provides:
- Re-exports of common types and functions
- Convenience prelude module for easy imports
- CLI utilities (when `cli` feature is enabled)
- Library constants and configuration
### 2. Traits Module (`src/traits.rs`)
Core trait definitions and types that all dictionary implementations must satisfy.
#### 2.1 Types and Enums
**`Result<T>`**: Custom result type for dictionary operations
- Alias for `std::result::Result<T, DictError>`
**`SearchResult`**: Search result containing word and entry data
```rust
pub struct SearchResult {
pub word: String, // The matching word/term
pub entry: Vec<u8>, // The dictionary entry data
pub score: Option<f32>, // Optional relevance score
pub highlights: Option<Vec<(usize, usize)>>, // Optional highlight information
}
```
**`BatchResult`**: Batch lookup result
```rust
pub struct BatchResult {
pub word: String, // Original word that was searched
pub entry: Option<Vec<u8>>, // Entry data if found
pub error: Option<DictError>, // Error if lookup failed
}
```
**`DictMetadata`**: Dictionary entry metadata
```rust
pub struct DictMetadata {
pub name: String, // Dictionary name
pub version: String, // Dictionary format version
pub entries: u64, // Total number of entries
pub description: Option<String>, // Dictionary description
pub author: Option<String>, // Dictionary author/publisher
pub language: Option<String>, // Dictionary language
pub file_size: u64, // File size in bytes
pub created: Option<String>, // Creation date
pub has_btree: bool, // Whether B-TREE index is available
pub has_fts: bool, // Whether FTS index is available
}
```
**`DictError`**: Error types for dictionary operations
```rust
pub enum DictError {
FileNotFound(String), // File not found
InvalidFormat(String), // Invalid file format
UnsupportedOperation(String), // Unsupported operation
IoError(String), // I/O error
MmapError(String), // Memory mapping error
IndexError(String), // Index error
DecompressionError(String), // Decompression error
SerializationError(String), // Serialization error
Internal(String), // Internal error with message
}
```
**`DictConfig`**: Configuration options for dictionary initialization
```rust
pub struct DictConfig {
pub load_btree: bool, // Whether to load B-TREE index
pub load_fts: bool, // Whether to load FTS index
pub use_mmap: bool, // Whether to enable memory mapping
pub cache_size: usize, // Cache size for entries
pub batch_size: usize, // Batch size for bulk operations
pub encoding: Option<String>, // Custom encoding (auto-detect if None)
pub build_btree: bool, // Whether to build B-TREE index
pub build_fts: bool, // Whether to build FTS index
}
```
**`DictStats`**: Statistics about dictionary performance and usage
```rust
pub struct DictStats {
pub total_entries: u64, // Total number of entries
pub cache_hit_rate: f32, // Cache hit rate (0.0 to 1.0)
pub memory_usage: u64, // Estimated memory usage in bytes
pub index_sizes: HashMap<String, u64>, // Size of different indexes
}
```
**`EntryIterator<'a, K>`**: Iterator over dictionary entries
```rust
pub struct EntryIterator<'a, K> {
pub keys: std::vec::IntoIter<K>,
pub dictionary: &'a dyn Dict<K>,
}
```
#### 2.2 Core Traits
**`Dict<K>`**: Core trait that defines all dictionary operations
```rust
pub trait Dict<K>: Send + Sync
where
K: Hash + Eq + Clone + fmt::Display,
{
// Metadata and basic operations
fn metadata(&self) -> &DictMetadata;
fn contains(&self, key: &K) -> Result<bool>;
fn get(&self, key: &K) -> Result<Vec<u8>>;
fn len(&self) -> usize;
fn is_empty(&self) -> bool;
fn file_paths(&self) -> &[std::path::PathBuf];
// Batch operations
fn get_multiple(&self, keys: &[K]) -> Result<Vec<BatchResult>>;
fn get_batch(&self, keys: &[K], batch_size: Option<usize>) -> Result<Vec<BatchResult>>;
// Collection operations
fn keys(&self) -> Result<Vec<K>>;
fn values(&self) -> Result<Vec<Vec<u8>>>;
fn get_range(&self, range: Range<usize>) -> Result<Vec<(K, Vec<u8>)>>;
fn iter(&self) -> Result<EntryIterator<K>>;
// Search operations
fn search_prefix(&self, prefix: &str, limit: Option<usize>) -> Result<Vec<SearchResult>>;
fn search_fuzzy(&self, query: &str, max_distance: Option<u32>) -> Result<Vec<SearchResult>>;
fn search_fulltext(&self, query: &str) -> Result<Box<dyn Iterator<Item = Result<SearchResult>> + Send>>;
fn prefix_iter(&self, prefix: &str) -> Result<Box<dyn Iterator<Item = Result<(K, Vec<u8>)>> + Send>>;
// Maintenance operations
fn reload_indexes(&mut self) -> Result<()>;
fn clear_cache(&mut self);
fn stats(&self) -> DictStats;
fn build_indexes(&mut self) -> Result<()>;
}
```
**`DictBuilder<K>`**: Trait for building dictionaries (for creating new dictionary files)
```rust
pub trait DictBuilder<K> {
fn add_entry(&mut self, key: K, entry: &[u8]) -> Result<()>;
fn build(&mut self, output_path: &Path, config: Option<DictConfig>) -> Result<()>;
fn set_metadata(&mut self, metadata: DictMetadata);
fn len(&self) -> usize;
fn is_empty(&self) -> bool;
}
```
**`HighPerformanceDict<K>`**: Trait for high-performance operations
```rust
pub trait HighPerformanceDict<K>: Dict<K>
where
K: Hash + Eq + Clone + fmt::Display,
{
fn binary_search_get(&self, key: &K) -> Result<Vec<u8>>;
fn stream_search(&self, query: &str) -> Result<Box<dyn Iterator<Item = Result<SearchResult>>>>;
}
```
**`DictFormat<K>`**: Trait for dictionary formats
```rust
pub trait DictFormat<K> {
const FORMAT_NAME: &'static str;
const FORMAT_VERSION: &'static str;
fn is_valid_format(path: &Path) -> Result<bool>;
fn load(path: &Path, config: DictConfig) -> Result<Box<dyn Dict<K> + Send + Sync>>;
}
```
#### 2.3 Constants
**Format Constants**:
- `FORMAT_MDICT`: "mdict"
- `FORMAT_STARDICT`: "stardict"
- `FORMAT_ZIM`: "zim"
**File Extension Constants**:
- `EXT_DICT`: ".dict"
- `EXT_IDX`: ".idx"
- `EXT_INFO`: ".info"
- `EXT_BTREE`: ".btree"
- `EXT_FTS`: ".fts"
### 3. Dictionary Module (`src/dict/`)
Implementations for various dictionary formats.
#### 3.1 Dictionary Loader (`src/dict/mod.rs`)
**`DictLoader`**: Dictionary format detection and loading
```rust
pub struct DictLoader {
default_config: DictConfig,
}
// Methods:
impl DictLoader {
pub fn new() -> Self; // Create new loader
pub fn with_config(config: DictConfig) -> Self; // Create with custom config
pub fn load<P: AsRef<Path>>(&self, path: P) -> Result<Box<dyn Dict<String> + Send + Sync>>; // Auto-detect format
pub fn load_format<P: AsRef<Path>>(&self, path: P, format: &str) -> Result<Box<dyn Dict<String> + Send + Sync>>; // Load specific format
pub fn detect_format(&self, path: &Path) -> Result<String>; // Detect format from file
pub fn scan_directory<P: AsRef<Path>>(&self, dir: P) -> Result<Vec<PathBuf>>; // Scan for dictionaries
pub fn is_dictionary_file(&self, path: &Path) -> bool; // Check if file is dictionary
pub fn supported_formats(&self) -> Vec<String>; // Get supported formats
pub fn default_config(&self) -> &DictConfig; // Get default config
pub fn set_default_config(&mut self, config: DictConfig); // Set default config
}
```
**`BatchOperations`**: Dictionary batch operations utilities
```rust
pub struct BatchOperations;
// Methods:
impl BatchOperations {
pub fn load_batch<P: AsRef<Path>>(paths: &[P], config: Option<DictConfig>) -> Result<Vec<Box<dyn Dict<String> + Send + Sync>>>;
pub fn search_multiple(dictionaries: &[Box<dyn Dict<String> + Send + Sync>], query: &str, search_type: SearchType) -> Result<Vec<SearchResult<String>>>;
pub fn merge<K>(dictionaries: &[Box<dyn Dict<K> + Send + Sync>], output_path: &Path, format: &str) -> Result<()>
where
K: Clone + std::fmt::Display + serde::Serialize + serde::de::DeserializeOwned + Eq + std::hash::Hash;
pub fn validate_batch<P: AsRef<Path>>(paths: &[P]) -> Result<Vec<(PathBuf, bool)>>;
}
```
**`SearchType`**: Search type for batch operations
```rust
pub enum SearchType {
Prefix(String), // Prefix search
Fuzzy(String), // Fuzzy search
Fulltext(String), // Full-text search
}
```
**`SearchResult<K>`**: Search result for multiple dictionaries
```rust
pub struct SearchResult<K> {
pub key: K, // Dictionary key
pub entry: Vec<u8>, // Entry data
pub score: Option<f32>, // Relevance score
pub source_dict: Option<String>, // Source dictionary name
pub highlights: Option<Vec<(usize, usize)>>, // Highlight information
}
```
**Dictionary Utility Functions** (`utils` submodule):
```rust
pub fn get_dict_size<P: AsRef<Path>>(path: P) -> Result<u64>;
pub fn is_readable<P: AsRef<Path>>(path: P) -> bool;
pub fn get_dict_format<P: AsRef<Path>>(path: P) -> Result<String>;
pub fn copy_dict<P: AsRef<Path>>(source: P, destination: P, create_indexes: bool) -> Result<()>;
pub fn remove_dict<P: AsRef<Path>>(path: P) -> Result<()>;
pub fn list_dicts<P: AsRef<Path>>(directory: P) -> Result<Vec<PathBuf>>;
```
#### 3.2 MDict Implementation (`src/dict/mdict.rs`)
**`MDict`**: Monkey's Dictionary implementation
```rust
pub struct MDict {
file_path: std::path::PathBuf, // File path
mmap: Option<Arc<Mmap>>, // Memory-mapped file
file: Option<File>, // File for sequential access
header: MdictHeader, // Header information
btree_index: Option<BTreeIndex>, // B-TREE index for fast lookups
fts_index: Option<FtsIndex>, // FTS index for full-text search
entry_cache: Arc<RwLock<lru_cache::LruCache<String, Vec<u8>>>>, // Cache for frequently accessed entries
config: DictConfig, // Index configuration
metadata: DictMetadata, // Cached metadata
}
// Methods:
impl MDict {
pub fn new<P: AsRef<Path>>(path: P, config: DictConfig) -> Result<Self>; // Create new MDict instance
pub fn build_indexes(&mut self) -> Result<()>; // Build indexes for this MDict
pub fn file_paths(&self) -> Vec<std::path::PathBuf>; // Get file paths for this dictionary
}
```
**`MdictHeader`**: MDict header information
```rust
struct MdictHeader {
encoding: String, // Encoding name as in header (normalized)
version: f64, // Version as parsed from GeneratedByEngineVersion
encrypted: i32, // Encrypted flags (bitmask)
rtl: bool, // Right-to-left flag
title: String, // Title (or filename fallback)
description: String, // Description (plain text)
attributes: HashMap<String, String>, // Raw attribute map for extensibility
number_size: u8, // Number size for numeric fields (4 or 8)
headword_block_info_pos: u64, // Position of headword block info (absolute in file)
headword_block_info_size: u64, // Size of headword block info (compressed or plain)
num_headword_blocks: u64, // Number of headword blocks
word_count: u64, // Total word count (entries)
headword_block_size: u64, // Size of headword block (compressed/decompressed descriptor)
record_block_info_pos: u64, // Position of record block info table
total_records_size: u64, // Total decompressed size of all records
record_blocks: Vec<RecordIndex>, // Record blocks (compressed/decompressed sizes and shadow offsets)
file_size: u64, // Absolute file size for metadata and safety checks
}
```
**`RecordIndex`**: Record block index entry
```rust
struct RecordIndex {
compressed_size: u64, // Compressed size
decompressed_size: u64, // Decompressed size
start_pos: u64, // Start position (relative to first record block) in compressed space
shadow_start_pos: u64, // Start position in concatenated decompressed space
shadow_end_pos: u64, // End position in decompressed space
}
```
**`MdictKeyEntry`**: One headword entry mapped to a record offset/size
```rust
struct MdictKeyEntry {
key: String, // Key
record_offset: u64, // Absolute record offset in concatenated decompressed record stream
record_size: u64, // Length of the record data
}
```
#### 3.3 StarDict Implementation (`src/dict/stardict.rs`)
**`StarDict`**: StarDict dictionary implementation
```rust
pub struct StarDict {
ifo_path: PathBuf, // .ifo path
dict_path: PathBuf, // Associated .dict or .dict.dz
dict_is_dz: bool, // Whether dict is DICTZIP-compressed (.dict.dz)
syn_path: Option<PathBuf>, // Optional .syn path
header: StarDictHeader, // Parsed header
index: HashMap<String, EntryLoc>, // In-memory index: word → (offset,size)
mmap: Option<Arc<Mmap>>, // Memory-mapped .dict (for uncompressed dict)
dict_file: File, // File handle for .dict/.dz
btree_index: Option<BTreeIndex>, // Optional BTree index for fast key lookups
fts_index: Option<FtsIndex>, // Optional FTS index for full-text search
entry_cache: Arc<RwLock<lru_cache::LruCache<String, Vec<u8>>>>, // Cache for frequently accessed entries
config: DictConfig, // Configuration
metadata: DictMetadata, // Cached metadata
}
// Methods:
impl StarDict {
pub fn new<P: AsRef<Path>>(path: P, config: DictConfig) -> Result<Self>; // Create new StarDict from .ifo file
}
```
**`Ifo`**: Parsed contents of .ifo file
```rust
struct Ifo {
version: String, // Version
bookname: String, // Book name
wordcount: u64, // Word count
synwordcount: u64, // Synonym word count
idxfilesize: Option<u64>, // Index file size
idxoffsetbits: u32, // Index offset bits
sametypesequence: Option<String>, // Same type sequence
dicttype: Option<String>, // Dictionary type
description: Option<String>, // Description
copyright: Option<String>, // Copyright
author: Option<String>, // Author
email: Option<String>, // Email
website: Option<String>, // Website
date: Option<String>, // Date
}
```
**`StarDictHeader`**: StarDict dictionary header/metadata
```rust
struct StarDictHeader {
ifo: Ifo, // Parsed .ifo contents
encoding: String, // Encoding for textual parts
idx_64bit: bool, // True if index offsets are 64-bit
}
```
**`EntryLoc`**: Entry location in .dict or .dict.dz
```rust
struct EntryLoc {
offset: u64, // Offset
size: u64, // Size
}
```
#### 3.4 ZIM Implementation (`src/dict/zimdict.rs`)
**`ZimDict`**: ZIM format implementation
```rust
pub struct ZimDict {
file_path: PathBuf, // Main ZIM file path
mmap: Option<Arc<Mmap>>, // Memory-mapped file for fast random access
file: File, // File handle for IO fallback
header: ZimHeader, // Parsed header
mime_types: Vec<String>, // Mime types list (index → string)
btree_index: Option<BTreeIndex>, // Optional BTree index (external)
fts_index: Option<FtsIndex>, // Optional FTS index (external)
entry_cache: Arc<RwLock<lru_cache::LruCache<String, Vec<u8>>>>, // Cache for frequently accessed entries
config: DictConfig, // Configuration
metadata: DictMetadata, // Cached metadata
}
// Methods:
impl ZimDict {
pub fn new<P: AsRef<Path>>(path: P, config: DictConfig) -> Result<Self>; // Create new ZimDict from .zim file
}
```
**`ZimHeader`**: ZIM file header (subset based on references/zim.cc ZIM_header)
```rust
struct ZimHeader {
magic_number: u32, // Magic number
major_version: u16, // Major version
minor_version: u16, // Minor version
article_count: u32, // Article count
cluster_count: u32, // Cluster count
url_ptr_pos: u64, // URL pointer position
title_ptr_pos: u64, // Title pointer position
cluster_ptr_pos: u64, // Cluster pointer position
mime_list_pos: u64, // Mime list position
}
```
**`ArticleLoc`**: Location of an article blob: (cluster, blob_index)
```rust
struct ArticleLoc {
cluster: u32, // Cluster
blob: u32, // Blob index
}
```
#### 3.5 BGL Implementation (`src/dict/bgl.rs`)
**`BglDict`**: Lightweight BGL dictionary backed by sidecar indexes
```rust
pub struct BglDict {
bgl_path: PathBuf, // Original BGL file path
index_path: PathBuf, // Index/chunks file path (`.bglx` / `.idx`)
header: BglIndexHeader, // Parsed header from index (for metadata/chunks_offset)
btree_index: Option<BTreeIndex>, // BTree-based index for key lookups
fts_index: Option<FtsIndex>, // Full-text search index (optional)
cache: Arc<RwLock<lru_cache::LruCache<String, Vec<u8>>>>, // Cache for entries
config: DictConfig, // Configuration
metadata: DictMetadata, // Metadata
}
// Methods:
impl BglDict {
pub fn new<P: AsRef<Path>>(path: P, config: DictConfig) -> Result<Self>; // Create BglDict using existing BGL file and compatible sidecar index files
}
```
**`BglIndexHeader`**: Minimal BGL "index header" used only for metadata and chunks base offset
```rust
struct BglIndexHeader {
signature: [u8; 4], // Magic signature, expected "BGLX"
format_version: u32, // Format version (opaque here)
article_count: u32, // Number of articles
word_count: u32, // Number of words (for metadata only)
chunks_offset: u64, // Offset to chunked article storage in index file
}
```
#### 3.6 DSL Implementation (`src/dict/dsl.rs`)
**`DslDict`**: Main DSL dictionary implementation
```rust
pub struct DslDict {
dsl_path: PathBuf, // Primary path (.dsl or .dsl.dz)
entries: HashMap<String, String>, // Parsed entries (headword -> UTF-8 body)
btree_index: Option<BTreeIndex>, // Optional BTree index (sidecar)
fts_index: Option<FtsIndex>, // Optional FTS index (sidecar)
entry_cache: Arc<RwLock<lru_cache::LruCache<String, Vec<u8>>>>, // Cache for frequently accessed entries
config: DictConfig, // Configuration
metadata: DictMetadata, // Cached metadata
}
// Methods:
impl DslDict {
pub fn new(path: &Path, config: DictConfig) -> Result<Self>; // Load DSL dictionary from the given path
}
// Utility function:
pub fn levenshtein(a: &str, b: &str) -> usize; // Simple Levenshtein distance used for DSL fuzzy search
```
**`DslEntry`**: In-memory representation of a parsed DSL entry
```rust
struct DslEntry {
headword: String, // Headword
body: String, // Body
}
```
**`DslEncoding`**: Supported DSL encodings
```rust
enum DslEncoding {
Utf16Le, // UTF-16 Little Endian
Utf16Be, // UTF-16 Big Endian
Utf8, // UTF-8
Windows1252, // Windows-1252
Windows1251, // Windows-1251
Windows1250, // Windows-1250
}
```
### 4. Index Module (`src/index/`)
High-performance indexing system for dictionary operations.
#### 4.1 Index Core (`src/index/mod.rs`)
**`IndexStats`**: Common index statistics
```rust
pub struct IndexStats {
pub entries: u64, // Number of entries indexed
pub size: u64, // Index file size in bytes
pub build_time: u64, // Index build time in milliseconds
pub version: String, // Index version
pub config: IndexConfig, // Index configuration
}
```
**`IndexConfig`**: Configuration for index operations
```rust
pub struct IndexConfig {
pub btree_order: Option<usize>, // B-TREE order (branching factor)
pub fts_config: FtsConfig, // FTS analyzer settings
pub compression: Option<CompressionConfig>, // Compression settings
pub build_in_memory: bool, // Whether to build index in memory first
pub max_memory: Option<u64>, // Maximum memory usage during build (bytes)
}
```
**`FtsConfig`**: Full-Text Search configuration
```rust
pub struct FtsConfig {
pub min_token_len: usize, // Minimum token length for indexing
pub max_token_len: usize, // Maximum token length for indexing
pub use_stemming: bool, // Whether to use stemming
pub stop_words: Vec<String>, // Stop words to ignore during indexing
pub language: Option<String>, // Analyzer language
}
```
**`CompressionConfig`**: Compression configuration
```rust
pub struct CompressionConfig {
pub algorithm: CompressionAlgorithm, // Compression algorithm
pub level: u32, // Compression level (0-9 for gzip, 1-19 for zstd)
}
```
**`CompressionAlgorithm`**: Compression algorithm types
```rust
pub enum CompressionAlgorithm {
None, // No compression
Gzip, // GZIP compression
Lz4, // LZ4 compression
Zstd, // Zstandard compression
}
```
**`IndexError`**: Error types specific to index operations
```rust
pub enum IndexError {
CorruptedIndex(String), // Index corruption detected
VersionMismatch { expected: String, found: String }, // Index version mismatch
NotBuilt(String), // Index not built
IoError(String), // Index I/O error
ConfigError(String), // Index configuration error
InsufficientMemory(String), // Index too large for memory
}
```
**`IndexManager`**: Manager for multiple indexes
```rust
pub struct IndexManager {
btree: Option<btree::BTreeIndex>, // B-TREE index
fts: Option<fts::FtsIndex>, // FTS index
config: IndexConfig, // Index configuration
paths: HashMap<&'static str, PathBuf>, // Paths to index files
stats: IndexStats, // Index statistics
}
// Methods:
impl IndexManager {
pub fn new(config: IndexConfig) -> Self; // Create new index manager
pub fn build_all(&mut self, entries: &[(String, Vec<u8>)]) -> Result<()>; // Build both B-TREE and FTS indexes
pub fn load_all(&mut self, base_path: &Path, extensions: &[(&str, &str)]) -> Result<()>; // Load indexes from files
pub fn save_all(&self, base_path: &Path, extensions: &[(&str, &str)]) -> Result<()>; // Save indexes to files
pub fn binary_search(&self, key: &str) -> Result<Option<(Vec<u8>, u64)>>; // Binary search using B-TREE index
pub fn fulltext_search(&self, query: &str) -> Result<Vec<(String, f32)>>; // Search using FTS index
pub fn stats(&self) -> &IndexStats; // Get all statistics
pub fn is_built(&self) -> bool; // Check if indexes are built
pub fn clear(&mut self); // Clear all indexes
pub fn verify(&self) -> Result<bool>; // Verify all indexes
}
```
**`Index`**: Trait that defines common index operations
```rust
pub trait Index: Send + Sync {
const INDEX_TYPE: &'static str; // Index type identifier
fn build(&mut self, entries: &[(String, Vec<u8>)], config: &IndexConfig) -> Result<()>; // Build index from entries
fn load(&mut self, path: &Path) -> Result<()>; // Load index from file
fn save(&self, path: &Path) -> Result<()>; // Save index to file
fn stats(&self) -> &IndexStats; // Get index statistics
fn is_built(&self) -> bool; // Check if index is built
fn clear(&mut self); // Clear the index
fn verify(&self) -> Result<bool>; // Verify index integrity
}
```
#### 4.2 B-TREE Index (`src/index/btree.rs`)
**`BTreeIndex`**: Production-ready B-Tree index implementation
```rust
pub struct BTreeIndex {
order: usize, // Maximum number of keys per node (fan-out minus one)
root: Option<usize>, // Root node index
nodes: Vec<BTreeNode>, // All nodes backing the tree
stats: IndexStats, // Statistics about the index
lock: Arc<RwLock<()>>, // Thread-safe access control
node_cache: LruCache<usize, BTreeNode>, // Lightweight cache for recently accessed nodes to avoid cloning
}
// Methods:
impl BTreeIndex {
pub fn new() -> Self; // Create new empty B-Tree index
pub fn with_order(order: usize) -> Self; // Create new B-Tree index with requested order
pub fn binary_search(&self, key: &str) -> Result<Option<(Vec<u8>, u64)>>; // Perform binary search for key
pub fn search(&self, key: &str) -> Result<Option<(Vec<u8>, u64)>>; // Public search helper
pub fn range_query(&self, start: &str, end: &str) -> Result<Vec<(String, u64)>>; // Get range of keys inclusively between start and end
pub fn validate(&self) -> Result<bool>; // Validate B-Tree properties
}
```
**`BTreeNode`**: B-Tree node containing keys and child pointers
```rust
struct BTreeNode {
keys: Vec<String>, // Keys in this node (sorted)
values: Vec<u64>, // Values (file offsets) in this node
children: Vec<usize>, // Child pointers stored as node indices
is_leaf: bool, // Whether this is a leaf node
}
```
**`BTreeSnapshot`**: On-disk snapshot persisted through save()/load()
```rust
struct BTreeSnapshot {
order: usize, // Order
root: Option<usize>, // Root
nodes: Vec<BTreeNode>, // Nodes
stats: IndexStats, // Statistics
}
```
**`RangeQueryResult`**: Range query aggregation helper
```rust
pub struct RangeQueryResult {
pub keys: Vec<String>, // Matching keys
pub values: Vec<u64>, // Corresponding values (file offsets)
pub count: usize, // Total number of results
}
// Methods:
impl RangeQueryResult {
pub fn new() -> Self;
pub fn add(&mut self, key: String, value: u64);
}
```
#### 4.3 FTS Index (`src/index/fts.rs`)
**`FtsIndex`**: FTS Index implementation using inverted indexing
```rust
pub struct FtsIndex {
inverted_index: HashMap<String, InvertedIndexEntry>, // Inverted index: term -> posting list
documents: HashMap<DocId, Document>, // Forward index: doc_id -> document
next_term_id: TermId, // Next available term ID
next_doc_id: DocId, // Next available document ID
stop_words: HashSet<String>, // Stop words set
term_stats: HashMap<String, u32>, // Term statistics
config: FtsConfig, // Index configuration
stats: IndexStats, // Index statistics
lock: Arc<RwLock<()>>, // Thread-safe access
}
// Methods:
impl FtsIndex {
pub fn new() -> Self; // Create new FTS index
pub fn with_config(config: FtsConfig) -> Self; // Create FTS index with custom configuration
pub fn search(&self, query: &str) -> Result<Vec<(String, f32)>>; // Search for documents containing the query
pub fn prefix_search(&self, prefix: &str) -> Result<Vec<String>>; // Get terms starting with prefix
pub fn term_frequency(&self, term: &str) -> u32; // Get term frequency
pub fn document_frequency(&self, term: &str) -> u32; // Get document frequency
pub fn vocabulary_size(&self) -> usize; // Get vocabulary size
pub fn avg_doc_length(&self) -> f32; // Get average document length
pub fn suggest_spelling(&self, query: &str) -> Result<Vec<String>>; // Get suggestions for misspelled query
pub fn get_snippet(&self, doc_id: DocId, query: &str, max_length: usize) -> Option<String>; // Get highlighted snippet for a document
pub fn validate(&self) -> Result<bool>; // Validate index integrity
pub fn get_stats(&self) -> &IndexStats; // Get index statistics
}
```
**`FtsSearchResult`**: Search result with score
```rust
pub struct FtsSearchResult {
pub doc_id: DocId, // Document ID
pub key: String, // Document key (word)
pub score: f32, // Relevance score
pub highlights: Vec<(usize, usize)>, // Snippet highlighting positions
}
```
**Supporting Types**:
```rust
type DocId = u32; // Document ID for FTS operations
type TermId = u32; // Term ID for token indexing
struct Token { // Token with position information
text: String, // Token text
term_id: TermId, // Term ID
position: u32, // Position in document
doc_freq: u32, // Document frequency
}
struct InvertedIndexEntry { // Inverted index entry
term_id: TermId, // Term ID
term: String, // Term text
postings: Vec<Posting>, // Documents containing this term
doc_freq: u32, // Total document frequency
term_freq: u32, // Total term frequency
}
struct Posting { // Posting list entry
doc_id: DocId, // Document ID
term_freq: u32, // Term frequency in document
positions: Vec<u32>, // Positions where term occurs
}
struct Document { // Document representation
doc_id: DocId, // Document ID
key: String, // Document key (word)
content: Vec<u8>, // Document content
doc_length: u32, // Document length in tokens
}
```
### 5. Utility Module (`src/util/`)
Utility functions for dictionary operations.
#### 5.1 File Utilities (`src/util/mod.rs` - file_utils submodule)
**File Operations**:
```rust
pub fn read_file(path: &Path) -> Result<Vec<u8>>; // Read entire file into memory
pub fn read_file_mmap(path: &Path) -> Result<memmap2::Mmap>; // Read file with memory mapping
pub fn write_file_atomic(path: &Path, data: &[u8]) -> Result<()>; // Write data to file with atomic operations
pub fn file_size(path: &Path) -> Result<u64>; // Get file size
pub fn is_readable(path: &Path) -> bool; // Check if file exists and is readable
pub fn ensure_dir(path: &Path) -> Result<()>; // Create directory if it doesn't exist
pub fn crc32(data: &[u8]) -> u32; // Calculate CRC32 checksum
pub fn verify_crc32(path: &Path, expected_crc: u32) -> Result<bool>; // Verify file integrity with CRC32
```
#### 5.2 Buffer Utilities (`src/util/mod.rs` - buffer submodule)
**Read Operations**:
```rust
pub fn read_exact<R: Read>(reader: &mut R, buf: &mut [u8]) -> Result<()>; // Read bytes from reader with error handling
pub fn read_u32_le<R: Read>(reader: &mut R) -> Result<u32>; // Read 32-bit unsigned integer (little-endian)
pub fn read_u32_be<R: Read>(reader: &mut R) -> Result<u32>; // Read 32-bit unsigned integer (big-endian)
pub fn read_u64_le<R: Read>(reader: &mut R) -> Result<u64>; // Read 64-bit unsigned integer (little-endian)
pub fn read_u64_be<R: Read>(reader: &mut R) -> Result<u64>; // Read 64-bit unsigned integer (big-endian)
pub fn read_varint<R: Read>(reader: &mut R) -> Result<u64>; // Read variable-length integer (VARINT)
pub fn read_string<R: Read, F: FnMut(String) -> Result<()>>(reader: &mut R, callback: F) -> Result<()>; // Read length-prefixed string
pub fn read_u8<R: Read>(reader: &mut R) -> Result<u8>; // Read 8-bit unsigned integer
pub fn read_u16_le<R: Read>(reader: &mut R) -> Result<u16>; // Read 16-bit unsigned integer (little-endian)
pub fn read_u16_be<R: Read>(reader: &mut R) -> Result<u16>; // Read 16-bit unsigned integer (big-endian)
```
**Write Operations**:
```rust
pub fn write_all<W: Write>(writer: &mut W, buf: &[u8]) -> Result<()>; // Write bytes to writer with error handling
pub fn write_u32_le<W: Write>(writer: &mut W, value: u32) -> Result<()>; // Write 32-bit unsigned integer (little-endian)
pub fn write_u32_be<W: Write>(writer: &mut W, value: u32) -> Result<()>; // Write 32-bit unsigned integer (big-endian)
pub fn write_u64_le<W: Write>(writer: &mut W, value: u64) -> Result<()>; // Write 64-bit unsigned integer (little-endian)
pub fn write_u64_be<W: Write>(writer: &mut W, value: u64) -> Result<()>; // Write 64-bit unsigned integer (big-endian)
pub fn write_varint<W: Write>(writer: &mut W, value: u64) -> Result<()>; // Write variable-length integer (VARINT)
pub fn write_string<W: Write>(writer: &mut W, s: &str) -> Result<()>; // Write length-prefixed string
```
#### 5.3 Binary Search Utilities (`src/util/mod.rs` - binary_search submodule)
```rust
pub fn search_sorted<'a, K, V>(keys: &'a [K], values: &'a [V], target: &K, compare: impl Fn(&K, &K) -> std::cmp::Ordering) -> Option<(usize, &'a V)> // Binary search in sorted array
where K: Ord;
pub fn lower_bound<K>(keys: &[K], target: &K, compare: impl Fn(&K, &K) -> std::cmp::Ordering) -> usize // Binary search for lower bound
where K: Ord;
pub fn upper_bound<K>(keys: &[K], target: &K, compare: impl Fn(&K, &K) -> std::cmp::Ordering) -> usize // Binary search for upper bound
where K: Ord;
```
#### 5.4 Memory Management Utilities (`src/util/mod.rs` - memory submodule)
```rust
pub fn optimal_cache_size(entries: usize, avg_entry_size: usize) -> usize; // Calculate optimal cache size based on available memory
pub fn total_memory() -> u64; // Get total available system memory (bytes)
pub fn used_memory() -> u64; // Get currently used memory by the process (bytes)
pub fn has_sufficient_memory(required: u64) -> bool; // Check if we have enough memory for an operation
pub fn clear_buffer(buf: &mut [u8]); // Clear memory buffer to prevent data leakage
pub fn zero_sensitive<T: Default>(data: &mut T); // Securely zero sensitive data
```
#### 5.5 Performance Monitoring Utilities (`src/util/mod.rs` - performance submodule)
**`Profiler`**: Simple performance profiler
```rust
pub struct Profiler {
start_time: Instant, // Start time
operations: std::collections::HashMap<String, u64>, // Operations counter
}
// Methods:
impl Profiler {
pub fn new() -> Self; // Create new profiler
pub fn record(&mut self, operation: &str, count: u64); // Record an operation count
pub fn elapsed(&self) -> std::time::Duration; // Get elapsed time since profiler creation
pub fn print_stats(&self); // Print statistics
pub fn operations_per_second(&self, operation: &str) -> f64; // Get operations per second
}
```
**Performance Functions**:
```rust
pub fn measure_time<T>(f: impl FnOnce() -> T) -> (T, std::time::Duration); // Measure function execution time
pub fn benchmark<T>(iterations: usize, mut f: impl FnMut() -> T) -> (T, std::time::Duration, std::time::Duration); // Benchmark a function
```
#### 5.6 Serialization Utilities (`src/util/mod.rs` - serialization submodule)
```rust
pub fn serialize_to_vec<T: serde::Serialize>(data: &T) -> Result<Vec<u8>>; // Serialize data with error handling
pub fn deserialize_from_bytes<T: serde::de::DeserializeOwned>(bytes: &[u8]) -> Result<T>; // Deserialize data with error handling
pub fn serialize_and_compress<T: serde::Serialize>(data: &T, compression: compression::CompressionAlgorithm) -> Result<Vec<u8>>; // Serialize and compress data
pub fn decompress_and_deserialize<T: serde::de::DeserializeOwned>(compressed: &[u8], compression: compression::CompressionAlgorithm) -> Result<T>; // Decompress and deserialize data
pub fn serialize_with_metadata<T: serde::Serialize>(data: &T, version: &str) -> Result<Vec<u8>>; // Serialize with metadata (version, timestamp, etc.)
pub fn deserialize_with_metadata<T: serde::de::DeserializeOwned>(bytes: &[u8], expected_version: &str) -> Result<T>; // Deserialize with metadata
```
#### 5.7 Hash Utilities (`src/util/mod.rs` - hash submodule)
```rust
pub fn fast_hash(data: &[u8]) -> u64; // Calculate hash using fast non-cryptographic hash
pub fn secure_hash(data: &[u8]) -> Vec<u8>; // Calculate hash using cryptographically secure hash
pub fn hash_file(path: &Path, secure: bool) -> Result<Vec<u8>>; // Hash a file
```
#### 5.8 Test Utilities (`src/util/mod.rs` - test_utils submodule)
```rust
pub fn generate_test_entries(count: usize) -> Vec<(String, Vec<u8>)>; // Generate test dictionary entries
pub fn temp_dir() -> Result<std::path::PathBuf>; // Create temporary directory for testing
pub fn cleanup_temp_dir(path: &std::path::Path) -> Result<()>; // Clean up temporary directory
pub fn validate_dictionary_integrity<K: std::fmt::Display + std::cmp::PartialEq + std::cmp::Ord>(entries: &[(K, Vec<u8>)]) -> Result<()>; // Validate dictionary integrity
pub fn benchmark_dict_operations<K, D>(dict: &D, test_keys: &[K], iterations: usize) -> Result<std::collections::HashMap<String, f64>> // Benchmark dictionary operations
where K: Clone + std::fmt::Display + std::hash::Hash + std::cmp::Eq, D: crate::traits::Dict<K>;
```
### 6. Compression Utilities (`src/util/compression.rs`)
Compression and decompression functions using various algorithms.
#### 6.1 Core Functions
**`CompressionAlgorithm`**: Compression algorithm types
```rust
pub enum CompressionAlgorithm {
None, // No compression
Gzip, // GZIP compression
Lz4, // LZ4 compression
Zstd, // Zstandard compression
}
```
**Main Functions**:
```rust
pub fn compress(data: &[u8], algorithm: CompressionAlgorithm) -> Result<Vec<u8>>; // Compress data using specified algorithm
pub fn decompress(compressed: &[u8], algorithm: CompressionAlgorithm) -> Result<Vec<u8>>; // Decompress data using specified algorithm
pub fn compression_level(level: u32, algorithm: &CompressionAlgorithm) -> u32; // Get compression level based on algorithm
pub fn max_compression_level(algorithm: &CompressionAlgorithm) -> u32; // Get maximum compression level for algorithm
pub fn suggested_compression_level(algorithm: &CompressionAlgorithm) -> u32; // Get suggested compression level for size vs speed
```
**Compression Analysis**:
```rust
pub fn estimate_compression_ratio(original_size: u64, algorithm: &CompressionAlgorithm, level: u32) -> f32; // Estimate compression ratio
pub fn get_algorithm_settings(algorithm: &CompressionAlgorithm) -> AlgorithmSettings; // Get algorithm-specific settings
```
**`AlgorithmSettings`**: Algorithm-specific settings
```rust
pub struct AlgorithmSettings {
pub supports_streaming: bool, // Supports streaming
pub supports_dictionary: bool, // Supports dictionary
pub typical_ratio: f32, // Typical compression ratio
pub speed_category: SpeedCategory, // Speed category
pub memory_overhead: u64, // Memory overhead
}
```
**`SpeedCategory`**: Speed categories
```rust
pub enum SpeedCategory {
VeryFast, // Very fast
Fast, // Fast
Medium, // Medium
Slow, // Slow
}
```
**Streaming Functions**:
```rust
pub fn compress_stream<R: Read, W: Write>(input: &mut R, output: &mut W, algorithm: CompressionAlgorithm) -> Result<u64>; // Compress data with streaming for large datasets
pub fn decompress_stream<R: Read, W: Write>(input: &mut R, output: &mut W, algorithm: CompressionAlgorithm) -> Result<u64>; // Decompress data with streaming for large datasets
```
### 7. Encoding Utilities (`src/util/encoding.rs`)
Encoding detection and conversion utilities.
#### 7.1 Text Encoding Types
**`TextEncoding`**: Supported text encodings
```rust
pub enum TextEncoding {
Utf8, // UTF-8 encoding
Utf16Le, // UTF-16 Little Endian
Utf16Be, // UTF-16 Big Endian
Windows1252, // Windows-1252 (Latin-1)
Iso88591, // ISO-8859-1 (Latin-1)
Gb2312, // GBK/GB2312 (Chinese)
Big5, // Big5 (Traditional Chinese)
ShiftJis, // Shift-JIS (Japanese)
EucKr, // EUC-KR (Korean)
Unknown, // Unknown encoding
}
```
**Methods for TextEncoding**:
```rust
impl TextEncoding {
pub fn name(&self) -> &'static str; // Get canonical name
pub fn is_unicode(&self) -> bool; // Check if Unicode-based
pub fn is_variable_width(&self) -> bool; // Check if variable-width characters
pub fn max_char_bytes(&self) -> usize; // Get maximum byte length for single character
}
```
#### 7.2 Detection and Conversion Functions
**Core Functions**:
```rust
pub fn detect_encoding(data: &[u8]) -> Result<TextEncoding>; // Detect encoding of byte data
pub fn convert_to_utf8(data: &[u8], from_encoding: TextEncoding) -> Result<String>; // Convert byte data from one encoding to UTF-8 string
pub fn is_valid_utf8_str(s: &str) -> bool; // Validate if string is valid UTF-8
pub fn get_encoding_stats(encoding: TextEncoding) -> EncodingStats; // Get encoding statistics
```
**`EncodingStats`**: Encoding statistics
```rust
pub struct EncodingStats {
pub name: &'static str, // Encoding name
pub supports_unicode: bool, // Supports Unicode
pub max_char_size: usize, // Maximum character size
pub is_variable_width: bool, // Is variable width
pub common_in: Vec<&'static str>, // Common usage contexts
}
```
## Feature Flags
- **`cli`**: Enables command-line interface utilities
- **`bench`**: Enables benchmarking tests
## Usage Examples
### Basic Dictionary Loading
```rust
use dictutils::prelude::*;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let loader = DictLoader::new();
let dict = loader.load("path/to/dictionary.mdict")?;
let entry = dict.get(&"example".to_string())?;
println!("Entry: {}", String::from_utf8_lossy(&entry));
Ok(())
}
```
### Batch Operations
```rust
use dictutils::prelude::*;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let loader = DictLoader::new();
let dict = loader.load("path/to/dictionary.mdict")?;
let keys = vec!["word1".to_string(), "word2".to_string(), "word3".to_string()];
let results = dict.get_batch(&keys, Some(10))?;
for result in results {
match result.entry {
Some(entry) => println!("Found: {}", String::from_utf8_lossy(&entry)),
None => println!("Not found: {}", result.word),
}
}
Ok(())
}
```
### Search Operations
```rust
use dictutils::prelude::*;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let loader = DictLoader::new();
let dict = loader.load("path/to/dictionary.mdict")?;
// Prefix search
let prefix_results = dict.search_prefix("pre", Some(10))?;
for result in prefix_results {
println!("Prefix match: {}", result.word);
}
// Full-text search
let fts_results = dict.search_fulltext("search terms")?;
for result in fts_results {
println!("FTS match: {}", result.word);
}
Ok(())
}
```
### Custom Configuration
```rust
use dictutils::prelude::*;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let config = DictConfig {
load_btree: true, // Fast key lookups
load_fts: true, // Full-text search
use_mmap: true, // Memory mapping
cache_size: 2000, // Larger cache
batch_size: 200, // Larger batches
..Default::default()
};
let loader = DictLoader::with_config(config);
let dict = loader.load("path/to/large_dictionary.mdict")?;
Ok(())
}
```
## Error Handling
The library uses a custom `DictError` enum for all error handling:
```rust
match dict.get(&key) {
Ok(entry) => println!("Found: {}", String::from_utf8_lossy(&entry)),
Err(DictError::FileNotFound(path)) => println!("File not found: {}", path),
Err(DictError::InvalidFormat(msg)) => println!("Invalid format: {}", msg),
Err(DictError::IndexError(msg)) => println!("Index error: {}", msg),
Err(e) => println!("Other error: {}", e),
}
```
## Thread Safety
All dictionary operations are thread-safe and can be shared across threads using standard Rust concurrency patterns. The library uses:
- `Arc<RwLock<T>>` for shared mutable access
- `Send + Sync` bounds on trait objects
- Lock-free operations where possible
## Performance Considerations
1. **Memory Mapping**: Enable for files > 100MB using `use_mmap: true`
2. **Caching**: Adjust `cache_size` based on available memory and access patterns
3. **Batch Operations**: Use `get_batch()` for multiple lookups
4. **Index Building**: Build indexes once and reuse them
5. **Compression**: Choose appropriate compression algorithm based on use case
## Supported Dictionary Formats
### Monkey's Dictionary (MDict)
- File extensions: `.mdict`
- Features: B-TREE indexing, FTS, compression
- Supports: UTF-16LE, UTF-8, various encodings
### StarDict
- File extensions: `.ifo` (entry point), `.idx`, `.dict`
- Features: B-TREE indexing, FTS, synonyms (`.syn`)
- Supports: DICTZIP compression (`.dict.dz`)
### ZIM
- File extensions: `.zim`
- Features: Article-based storage, compression
- Requires: External indexing for full functionality
### Babylon (BGL)
- File extensions: `.bgl`
- Features: Sidecar index support
- Requires: `.bglx` or `.idx` sidecar files
- **Important**: BGL implementation does NOT parse raw `.bgl` binaries directly. It requires externally built sidecar index files (`.btree` and `.fts`) that must be provided by an external tool like GoldenDict's indexer. The BGL parser only consumes these pre-built indexes and does not implement raw BGL binary parsing.
### DSL (ABBYY Lingvo)
- File extensions: `.dsl`, `.dsl.dz`
- Features: Text-based, compression support
- Supports: UTF-16LE, UTF-8, Windows encodings
## Implementation Details
### B-TREE Index
- Configurable order (default: 256)
- Thread-safe with RwLock
- Persistent storage with serialization
- Range queries and validation
### Full-Text Search Index
- Inverted indexing
- TF-IDF scoring
- Stop word filtering
- Configurable tokenization
### Compression Support
- GZIP, LZ4, Zstandard
- Streaming operations for large files
- Memory-efficient decompression
- Algorithm-specific optimizations
### Encoding Detection
- BOM detection
- Statistical analysis
- UTF-8 validation
- Multi-encoding support
This documentation covers all public APIs, types, and functionality provided by the DictUtils crate. The library is designed for high-performance dictionary operations with a focus on flexibility, thread safety, and support for multiple dictionary formats.