liblevenshtein-rust
A Rust implementation of liblevenshtein for fast approximate string matching using Levenshtein Automata—deterministic finite automata that enable O(|W|) construction complexity for error-tolerant dictionary queries.
Based on "Fast String Correction with Levenshtein-Automata" (Schulz & Mihov, 2002).
Overview
This library provides efficient fuzzy string matching against large dictionaries using finite state automata. It supports multiple Levenshtein distance algorithms and pluggable dictionary backends.
What Makes This Library Fast
Unlike naive approaches that compute Levenshtein distance for every dictionary word (O(|D| × |W| × |V|)), this library uses a Levenshtein automaton: a deterministic finite automaton that accepts exactly the words within distance n of a query word.
Key Algorithmic Advantages
- O(|W|) Automaton Construction - Linear in query word length for fixed error bound n
- O(|D|) Dictionary Traversal - Single pass through dictionary (measured as total edges)
- No Distance Recomputation - Automaton guides search, avoiding per-word calculations
- Proven Correctness - Accepts exactly L_Lev(n,W) = {V | d_L(W,V) ≤ n}
Result: Query time is effectively independent of dictionary size for well-distributed tries!
Theory: Based on "Fast String Correction with Levenshtein-Automata" (Schulz & Mihov, 2002). See complete algorithm documentation for algorithmic details including Position structures, Subsumption relations, and Characteristic vectors.
Theoretical Background
How Levenshtein Automata Work
Traditional approaches compute edit distance for every dictionary word, requiring O(|D| × |W| × |V|) operations. This library uses Levenshtein Automata—deterministic finite automata that accept exactly all words within distance n from query word W:
LEV_n(W) accepts L_Lev(n,W) = {V | d_L(W,V) ≤ n}
Key Concepts
Position (i#e): Represents "matched i characters of W with e errors"
- Example: 3#1 means "matched 3 characters, used 1 error"
- Range: 0 ≤ i ≤ |W|, 0 ≤ e ≤ n
Subsumption (⊑): Eliminates redundant positions for minimal state sets
- Position i#e subsumes j#f if: (e < f) ∧ (|j-i| ≤ f-e)
- Any word accepted from j#f is also accepted from i#e
Characteristic Vectors (χ): Encode where characters appear in query word
- χ(x, W) = ⟨b₁, ..., b_|W|⟩ where b_i = 1 if W[i] = x, and 0 otherwise
- Used to determine valid transitions without recomputing distances
Elementary Transitions: Move between positions based on dictionary character
- Defined in Tables 4.1, 7.1, 8.1 of the paper for each algorithm variant
- Enable O(1) state updates during traversal
Result: Automaton construction is O(|W|), dictionary traversal is O(|D|), regardless of dictionary size!
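These definitions are small enough to state directly in code. A self-contained sketch (illustrative types, not the library's internals):

```rust
/// A standard position i#e: matched `i` characters of W using `e` errors.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Position {
    i: usize, // boundary in W (0 ≤ i ≤ |W|)
    e: usize, // errors used (0 ≤ e ≤ n)
}

impl Position {
    /// i#e subsumes j#f iff e < f and |j - i| ≤ f - e.
    /// A subsumed position is redundant: every word accepted from
    /// j#f is already accepted from i#e.
    fn subsumes(&self, other: &Position) -> bool {
        self.e < other.e && self.i.abs_diff(other.i) <= other.e - self.e
    }
}

/// Characteristic vector χ(x, W): bit b_i is set iff W[i] = x.
fn characteristic_vector(x: u8, w: &[u8]) -> Vec<bool> {
    w.iter().map(|&c| c == x).collect()
}

fn main() {
    let p = Position { i: 3, e: 1 }; // "matched 3 characters, used 1 error"
    let q = Position { i: 4, e: 2 };
    assert!(p.subsumes(&q)); // e < f and |4 - 3| ≤ 2 - 1

    // χ('t', "test") = ⟨1, 0, 0, 1⟩
    assert_eq!(
        characteristic_vector(b't', b"test"),
        vec![true, false, false, true]
    );
}
```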
Foundational Papers
Core Algorithm:
- Schulz, Klaus U., and Stoyan Mihov. "Fast string correction with Levenshtein automata." International Journal on Document Analysis and Recognition 5.1 (2002): 67-85.
Extensions:
- Mitankin, Petar, Stoyan Mihov, and Klaus Schulz. "Universal Levenshtein Automata. Building and Properties." Information Processing & Management 41.4 (2005): 687-702.
Documentation
- Complete Algorithm Documentation - Detailed paper summary
- Implementation Mapping - Code-to-paper correspondence
- Glossary - Terms and notation reference
- Universal Levenshtein Automata Research - Restricted substitutions
- Weighted Distance Research - Variable operation costs
Features
- Fast approximate string matching using Levenshtein Automata with O(|W|) construction complexity
- Multiple algorithms (all maintaining O(|W|) construction complexity):
  - Standard: Insert, delete, substitute operations (Chapters 4-6 of the foundational paper)
    - Example: "hello" → "helo" (delete), "helo" → "hello" (insert), "hello" → "hallo" (substitute)
    - Use for: General fuzzy matching, spell checking
  - Transposition: Adds adjacent character swaps (Chapter 7) - Damerau-Levenshtein distance
    - Example: "teh" → "the" costs 1 (not 2), "recieve" → "receive" costs 1
    - Use for: Typing errors, keyboard input corrections
  - MergeAndSplit: Adds two-character ↔ one-character operations (Chapter 8)
    - Example: "rn" ↔ "m", "vv" ↔ "w" (common in OCR)
    - Use for: Scanned documents, character recognition systems
- Multiple dictionary backends:
  - DoubleArrayTrie (recommended, default) - O(1) transitions with excellent cache locality and minimal memory footprint
  - DoubleArrayTrieChar - Character-level variant for correct Unicode Levenshtein distances
  - PathMap (optional `pathmap-backend` feature) - High-performance trie with structural sharing, supports dynamic updates
  - PathMapChar (optional `pathmap-backend` feature) - Character-level PathMap for Unicode fuzzy matching
  - DAWG - Directed Acyclic Word Graph for space-efficient storage
  - OptimizedDawg - Arena-based DAWG with 20-25% faster construction
  - DynamicDawg - DAWG with online insert/delete/minimize operations
  - DynamicDawgChar - Character-level DAWG with Unicode support and online updates
  - SuffixAutomaton - For substring/infix matching
  - SuffixAutomatonChar - Character-level suffix automaton for Unicode substring matching
  - Extensible trait-based design for custom backends
- Ordered results: `OrderedQueryIterator` returns results sorted by distance first, then lexicographically
- Filtering and prefix matching: Filter results with custom predicates, enable prefix mode for code completion
- Serialization support (optional `serialization` feature):
  - Multiple formats: Bincode (binary), JSON, Plain Text (newline-delimited UTF-8), Protobuf
  - Optimized Protobuf formats: V2 (62% smaller), DAT-specific, SuffixAutomaton-specific
  - Gzip compression (optional `compression` feature) - 85% file size reduction
  - Save and load dictionaries to/from disk
- Full-featured CLI tool (optional `cli` feature):
  - Interactive REPL for exploration
  - Query, insert, delete, convert operations
  - Support for all serialization formats including compressed variants
- Runtime dictionary updates:
  - Thread-safe insert, remove, and clear operations (PathMap, DynamicDawg)
  - Queries automatically see updates via `RwLock`-based interior mutability
  - Concurrent queries during modifications
- Lazy evaluation - results generated on-demand
- Efficient memory usage - shared dictionary state across transducers
- SIMD acceleration (optional `simd` feature, x86_64 only):
  - AVX2/SSE4.1 optimized operations with runtime CPU feature detection
  - 20-64% faster query performance across all workloads
  - Optimized characteristic vectors, position subsumption, state operations
  - Dictionary edge lookup with data-driven threshold tuning
  - Distance matrix computation with vectorized operations
  - Automatic fallback to scalar implementation when SIMD unavailable
- WASM/WASI support (optional `wasm` feature):
  - WebAssembly compilation for browser and Node.js targets
  - WASI support for server-side WebAssembly runtimes
  - `wasm-phonetic` feature combines WASM with phonetic rules
  - C FFI via `ffi` feature for native integration
- LLev/LLRE - Phonetic Pattern Languages (requires `phonetic-rules` feature):
  - `.llev` format for phonetic rewrite rules with metadata and includes
  - `.llre` format for regex-style patterns with NFA compilation
  - Regex-like syntax: quantifiers (`*`, `+`, `?`, `{n,m}`), alternation, grouping
  - Built-in named character classes (`[:alpha:]`, `[:vowel:]`, etc.)
  - NFA engine with automatic optimization (epsilon elimination, dead state removal)
  - AOT compilation to binary format for instant loading
  - See LLev Grammar and LLRE Grammar
Installation
From crates.io (Recommended)
Add to your Cargo.toml:
```toml
[dependencies]
liblevenshtein = "0.8"

# Or with SIMD acceleration (x86_64 only, requires SSE4.1/AVX2):
liblevenshtein = { version = "0.8", features = ["simd"] }
```

Or install the CLI tool:

```bash
cargo install liblevenshtein --features cli
```
Pre-built Packages
Download pre-built packages from the GitHub Releases page:
- Debian/Ubuntu: `.deb` packages
- Fedora/RHEL/CentOS: `.rpm` packages
- Arch Linux: `.pkg.tar.zst` packages
- macOS: `.dmg` disk images for easy installation (x86_64 and ARM64)
- Binaries: `.tar.gz` and `.zip` archives for Linux and macOS (x86_64 and ARM64)
Usage
Basic Querying
```rust
use liblevenshtein::prelude::*; // glob import reconstructed; path illustrative

// Create a dictionary from terms (using DoubleArrayTrie for best performance)
let terms = vec!["test", "tester"];
let dict = DoubleArrayTrie::from_terms(terms);

// Create a transducer with the Standard algorithm (max distance 2)
let transducer = Transducer::new(dict, 2);

// Query for terms within edit distance 2 (query term illustrative)
for term in transducer.query("testa") {
    println!("Match: {}", term);
}

// Query with distances
for candidate in transducer.query_with_distance("testa") {
    println!("Match: {} (distance: {})", candidate.term, candidate.distance);
}
```
Output:
```
Match: test
Match: tester
Match: test (distance: 1)
Match: tester (distance: 2)
```
Unicode Support
For correct character-level Levenshtein distances with Unicode text, use the character-level dictionary variants:
```rust
use liblevenshtein::dictionary::DoubleArrayTrieChar; // import path illustrative
use liblevenshtein::prelude::*;

// Create a character-level dictionary for Unicode content (terms illustrative)
let terms = vec!["¡", "café", "naïve"];
let dict = DoubleArrayTrieChar::from_terms(terms);
let transducer = Transducer::new(dict, 1);

// Character-level distance: "" → "¡" is distance 1 (one character);
// byte-level would incorrectly report distance 2 (two UTF-8 bytes)
let results: Vec<String> = transducer.query("").collect();
// Results include single-character Unicode terms

// Works with accented characters
for candidate in transducer.query_with_distance("cafe") {
    println!("{} (distance: {})", candidate.term, candidate.distance);
}
```
Character-level backends:
- `DoubleArrayTrieChar` - Character-level Double-Array Trie
- `PathMapDictionaryChar` (requires `pathmap-backend` feature) - Character-level PathMap with dynamic updates
When to use:
- ✅ Dictionaries containing non-ASCII Unicode (accented characters, CJK, emoji)
- ✅ When edit distance must count characters, not bytes
- ✅ Multi-language applications requiring correct Unicode semantics
Performance: ~5% overhead for UTF-8 decoding, 4x memory for edge labels (char vs u8).
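The underlying issue is that UTF-8 encodes non-ASCII characters as multiple bytes, so byte-level edits overcount. Plain Rust shows the discrepancy:

```rust
fn main() {
    let s = "café";
    assert_eq!(s.len(), 5);           // byte length: 'é' is 2 bytes in UTF-8
    assert_eq!(s.chars().count(), 4); // character length
    // A byte-level dictionary treats "cafe" → "café" as two edits
    // (substitute one byte, insert another); a character-level
    // dictionary sees a single substitution, distance 1.
}
```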
Runtime Dictionary Updates
The DynamicDawg backend supports thread-safe runtime updates:
```rust
use liblevenshtein::prelude::*;

// Create dictionary with runtime update support (terms illustrative)
let dict = DynamicDawg::from_terms(vec!["cat", "dog"]);
// The handle is a cheap clone sharing the same RwLock-protected state
let transducer = Transducer::new(dict.clone(), 1);

// Insert new terms at runtime
dict.insert("cot");
dict.insert("bird");

// Queries immediately see the new terms
let results: Vec<String> = transducer.query("cot").collect();
// Results: ["cat", "cot"]

// Remove terms
dict.remove("cot");
```
Thread Safety: The dictionary uses RwLock for interior mutability:
- Multiple concurrent queries are allowed (read locks)
- Modifications acquire exclusive write locks
- Active `Transducer` instances automatically see updates
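As a sketch of the concurrency model (assuming, as in the example above, that dictionary handles are cheap clones sharing the same `RwLock`-protected state; exact signatures may differ):

```rust
use std::thread;

// Hypothetical sketch: a writer thread inserts while this thread queries.
// Each operation takes the RwLock internally; no external locking needed.
let dict = DynamicDawg::from_terms(vec!["cat", "dog"]);
let transducer = Transducer::new(dict.clone(), 1);

let writer = {
    let dict = dict.clone();
    thread::spawn(move || dict.insert("cot")) // exclusive write lock per call
};

// Concurrent queries proceed under shared read locks
let _matches: Vec<String> = transducer.query("cot").collect();
writer.join().unwrap();
```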
Performance Optimizations: Configure Bloom filter and auto-minimization for better performance:
```rust
use liblevenshtein::prelude::*;

// Enable Bloom filter for 10x faster contains() operations
let dict = DynamicDawg::with_config(/* bloom filter enabled */);

// Or enable auto-minimization for bulk insertions (30% faster for large datasets)
let dict = DynamicDawg::with_config(/* auto-minimization enabled */);

// Or enable both optimizations
let dict = DynamicDawg::with_config(/* bloom filter + auto-minimization */);
```
Optimization Guide:
- Bloom Filter: Use when frequently checking if terms exist (88-93% faster `contains()`)
- Auto-Minimization: Use for bulk insertions of 1000+ terms (30% faster, prevents memory bloat)
- Default: Both disabled for maximum flexibility and minimal overhead on small datasets
Note: PathMapDictionary also supports runtime updates but requires the optional pathmap-backend feature.
See examples/dynamic_dictionary.rs for a complete demonstration.
Value-Mapped Dictionaries
Store metadata with dictionary terms using value-mapped dictionaries:
```rust
use liblevenshtein::prelude::*;
use std::collections::HashSet;

// DynamicDawg with integer values (e.g., word frequencies)
let dict: DynamicDawg<u32> = DynamicDawg::new();
dict.insert_with_value("apple", 42);
dict.insert_with_value("apply", 7);

// Retrieve values (accessor name illustrative)
assert_eq!(dict.get_value("apple"), Some(42));
assert_eq!(dict.get_value("apply"), Some(7));

// Use with FuzzyMap for fuzzy lookups with values
let map = FuzzyMap::new(dict, 1);
let results = map.get_with_distance("aple"); // fuzzy lookup
// Results include both terms and their values

// Or use FuzzyMultiMap to aggregate multiple values
let dict: DynamicDawg<HashSet<u32>> = DynamicDawg::new();
dict.insert_with_value("apple", HashSet::from([1, 2]));
```
Supported backends:
- `DynamicDawg<V>` - Dynamic dictionary with values of type `V`
- `DynamicDawgChar<V>` - Character-level dynamic dictionary with Unicode support
- `PathMapDictionary<V>` - PathMap with values (requires `pathmap-backend`)
- `PathMapDictionaryChar<V>` - Character-level PathMap with values
Common value types:
- `u32`/`u64` - Frequencies, scores, IDs
- `HashSet<T>` - Multiple associations per term
- `Vec<T>` - Ordered collections
- Any type implementing `Clone + Send + Sync + 'static`
Integration with contextual completion:
```rust
use liblevenshtein::prelude::*;

// Use a DynamicDawg backend for contextual completion
let dict: DynamicDawg<u32> = DynamicDawg::new();
let engine = ContextualCompletionEngine::with_dictionary(dict);

// Insert terms with context IDs (arguments illustrative)
let ctx = engine.create_root_context();
engine.insert_finalized(ctx, "my_variable");
```
Ordered Results
Get results sorted by edit distance first, then alphabetically:
```rust
use liblevenshtein::prelude::*;

let dict = DoubleArrayTrie::from_terms(vec!["apple", "apply", "ape"]);
let transducer = Transducer::new(dict, 2);

// Results ordered by distance, then alphabetically (query term illustrative)
for candidate in transducer.query_ordered("appl") {
    println!("{}: {}", candidate.term, candidate.distance);
}
```
Output:
```
ape: 1
apple: 1
apply: 1
```
Filtering and Prefix Matching
Filter results and enable prefix matching for code completion:
```rust
use liblevenshtein::prelude::*;

let dict = DoubleArrayTrie::from_terms(vec!["getValue", "getVariable", "setValue"]);
let transducer = Transducer::new(dict, 1);

// Prefix matching with filtering (query and closure illustrative)
for candidate in transducer
    .query_ordered("getVal")
    .prefix()                                // Match terms starting with query ± edits
    .filter(|c| c.term.starts_with("get"))   // Only getter methods
{
    println!("{}: {}", candidate.term, candidate.distance);
}
```
Output:
```
getValue: 0
getVariable: 1
```
See examples/code_completion_demo.rs and examples/contextual_filtering_optimization.rs for more examples.
Dictionary Backend Comparison
The library provides multiple dictionary backends optimized for different use cases:
| Backend | Best For | Performance Highlights |
|---|---|---|
| DoubleArrayTrie | General use (recommended) | 3x faster queries, 30x faster contains, 8 bytes/state |
| DoubleArrayTrieChar | Unicode text, character-level distances | Correct Unicode semantics, ~5% overhead |
| PathMap | Dynamic updates, runtime modifications | Thread-safe insert/delete, fastest dynamic backend |
| PathMapChar | Unicode + dynamic updates | Character-level distances with runtime modifications |
| DAWG | Static dictionaries, moderate size | Good balance of speed and memory |
| OptimizedDawg | Static dictionaries, construction speed | 20-25% faster construction than DAWG |
| DynamicDawg | Occasional modifications | Best fuzzy matching for dynamic use |
| DynamicDawgChar | Unicode + occasional modifications | Character-level with insert/remove, ~5% overhead |
| SuffixAutomaton | Substring/infix matching | Find patterns anywhere in text |
Performance Comparison (10,000 words):
```
Construction:   DAT: 3.2ms    PathMap: 3.5ms    DAWG: 7.2ms
Exact Match:    DAT: 6.6µs    DAWG: 19.8µs      PathMap: 71.1µs
Contains (100): DAT: 0.22µs   DAWG: 6.7µs       PathMap: 132µs
Distance 1:     DAT: 12.9µs   DAWG: 319µs       PathMap: 888µs
Distance 2:     DAT: 16.3µs   DAWG: 2,150µs     PathMap: 5,919µs
Memory/state:   DAT: ~8B      OptDawg: ~13B     DAWG: ~32B
```
Why DoubleArrayTrie is Fastest:
- O(1) transitions - Direct array indexing vs pointer chasing
- Excellent cache locality - Sequential memory layout minimizes cache misses
- Minimal memory footprint - ~8 bytes per state (vs 32 bytes for DAWG)
- Perfect for parallel traversal - Optimal implementation of dictionary automaton A^D from paper
All backends implement the dictionary automaton A^D that is traversed in parallel with the Levenshtein automaton LEV_n(W) during queries. The choice of backend affects the constant factors in the O(|D|) query complexity, with DoubleArrayTrie providing the smallest constants due to hardware-friendly memory access patterns.
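Conceptually, each query step advances a dictionary edge and the Levenshtein automaton state together, pruning any subtree whose automaton state dies. A simplified, self-contained sketch with hypothetical trait interfaces (not the library's actual types):

```rust
// Hypothetical minimal interfaces, for illustration only.
trait TrieNode {
    fn edges(&self) -> Vec<(u8, &Self)>;
    fn is_final(&self) -> bool;
}

trait LevState: Sized {
    fn step(&self, label: u8) -> Option<Self>; // None = dead state, prune
    fn is_accepting(&self) -> bool;
}

fn intersect<N: TrieNode, S: LevState>(
    node: &N,
    state: &S,
    prefix: &mut Vec<u8>,
    out: &mut Vec<String>,
) {
    if node.is_final() && state.is_accepting() {
        out.push(String::from_utf8_lossy(prefix).into_owned());
    }
    for (label, child) in node.edges() {
        if let Some(next) = state.step(label) {
            prefix.push(label);
            intersect(child, &next, prefix, out);
            prefix.pop();
        }
        // else: the entire subtree is skipped, which is where the
        // backend's constant factors in the O(|D|) bound matter
    }
}
```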
Prefix Matching Support: All backends except SuffixAutomaton support efficient prefix matching through the .prefix() method, making them ideal for code completion and autocomplete applications.
When to use each backend:
- Static dictionaries (ASCII/Latin-1) → `DoubleArrayTrie` (best overall performance, default choice)
- Static dictionaries (Unicode) → `DoubleArrayTrieChar` (correct character-level distances)
- Dynamic updates (ASCII/Latin-1) → `DynamicDawg` (thread-safe runtime modifications)
- Dynamic updates (Unicode) → `DynamicDawgChar` (character-level with insert/remove, best fuzzy matching)
- Dynamic updates (Unicode, frequent) → `PathMapChar` (alternative, requires `pathmap-backend`)
- Substring search → `SuffixAutomaton` (finds patterns anywhere in text)
- Memory-constrained → `DoubleArrayTrie` (8 bytes/state, most efficient)
- Multi-language apps → Character-level variants (`*Char`) for correct Unicode semantics
Serialization and Compression
Save and load dictionaries with optional compression:
```rust
use liblevenshtein::prelude::*;
use std::fs::File;

let dict = DoubleArrayTrie::from_terms(vec!["test", "tester"]);

// Save with compression (85% file size reduction); file names illustrative
let file = File::create("dict.bin.gz")?;
serialize(&dict, file, /* format: bincode + gzip */)?;

// Load compressed dictionary
let file = File::open("dict.bin.gz")?;
let dict: DoubleArrayTrie = deserialize(file, /* format */)?;

// Or use optimized Protobuf formats
let file = File::create("dict.pb")?;
serialize(&dict, file, /* format: optimized protobuf */)?; // 62% smaller than standard
```
Requires the `serialization` and `compression` features:

```toml
[dependencies]
liblevenshtein = { git = "https://github.com/universal-automata/liblevenshtein-rust", tag = "v0.8.0", features = ["serialization", "compression"] }
```
CLI Tool
The library includes a full-featured command-line tool with interactive REPL:
```bash
# Install with CLI support (from GitHub)
cargo install --git https://github.com/universal-automata/liblevenshtein-rust --features cli
# Or download pre-built binaries from GitHub Releases
```

Typical workflows include querying a dictionary, converting between formats with compression, and launching an interactive REPL. The CLI auto-detects formats from file extensions and supports:
- Formats: text, bincode, json, protobuf (plus `-gz` compressed variants)
- Backends: path-map, dawg, dynamic-dawg
- Operations: query, insert, delete, convert, save, info
See docs/developer-guide/building.md for comprehensive CLI documentation.
FuzzyMultiMap - Aggregating Values
The FuzzyMultiMap enables fuzzy lookup of associated values, aggregating results from all matching terms:
```rust
use liblevenshtein::prelude::*;
use liblevenshtein::collections::FuzzyMultiMap; // module path illustrative
use std::collections::HashSet;

// Create a dictionary with values (e.g., user IDs for each name)
let dict = DynamicDawg::from_terms_with_values(vec![
    ("alice", HashSet::from([101])),
    ("alicia", HashSet::from([102, 103])),
    ("bob", HashSet::from([201])),
]);

// Create fuzzy multimap (max distance illustrative)
let fuzzy_map = FuzzyMultiMap::new(dict, 3);

// Query "alise" - matches both "alice" and "alicia"
let user_ids = fuzzy_map.query("alise").unwrap();
// Returns: HashSet {101, 102, 103} - union of all matching values

// Practical use case: find all IDs for a fuzzy name search
let ids = fuzzy_map.query("bobb").unwrap();
// Returns the union of IDs from every name within the distance budget
```
Supported collection types:
- `HashSet<T>` - Union of all matching sets
- `BTreeSet<T>` - Union of all matching sets
- `Vec<T>` - Concatenation of all matching vectors
Use cases:
- User ID lookup with fuzzy name matching
- Tag aggregation across similar terms
- Multi-valued dictionary queries with error tolerance
Contextual Code Completion (Zipper-Based)
The ContextualCompletionEngine provides hierarchical scope-aware code completion with draft state management:
```rust
use liblevenshtein::contextual::ContextualCompletionEngine; // module paths illustrative
use liblevenshtein::transducer::Algorithm;

// Create engine (thread-safe, shareable with Arc)
let engine = ContextualCompletionEngine::with_algorithm(Algorithm::Standard, 2);

// Create hierarchical scopes (global → function → block)
let global = engine.create_root_context();
let function = engine.create_child_context(global).unwrap();
let block = engine.create_child_context(function).unwrap();

// Add finalized terms to each scope (terms illustrative)
engine.finalize_direct(global, "global_config").unwrap();
engine.finalize_direct(function, "parameter").unwrap();
engine.finalize_direct(function, "helper").unwrap();
engine.finalize_direct(block, "buffer").unwrap();

// Incremental typing in block scope (draft state)
engine.insert_str(block, "local_var").unwrap();

// Query completions - sees all visible scopes (block + function + global)
let completions = engine.complete(block, "param", 1);
for comp in completions {
    println!("{} (distance: {}, draft: {}, from context: {:?})",
        comp.term, comp.distance, comp.is_draft, comp.context);
}
// Output: parameter (distance: 0, draft: false, from context: [1])

// Checkpoint/undo support for editor integration
engine.checkpoint(block).unwrap();
engine.insert_str(block, "iable").unwrap(); // Now: "local_variable"
engine.undo(block).unwrap();                // Restore to: "local_var"

// Finalize draft to add to dictionary
let term = engine.finalize(block).unwrap(); // Returns "local_var"
assert!(term == "local_var");

// Query now sees finalized term
let results = engine.complete(block, "local_var", 1);
// Returns: local_var (now finalized, visible in this context)
```
Features:
- Hierarchical visibility: Child contexts see parent terms (global → function → block)
- Draft state: Incremental typing without polluting finalized dictionary
- Checkpoint/undo: Editor-friendly state management
- Thread-safe: Share engine across threads with `Arc`
- Mixed queries: Search both drafts and finalized terms simultaneously
Performance (sub-millisecond for interactive use):
- Insert character: ~4 µs
- Checkpoint: ~116 ns
- Query (500 terms, distance 1): ~11.5 µs
- Query (distance 2): ~309 µs
Use cases:
- LSP servers with multi-file scope awareness
- Code editors with context-sensitive completion
- REPL environments with session-scoped symbols
- Any application requiring hierarchical fuzzy matching
See examples/contextual_completion.rs for a complete example.
Prefix-Based Iteration (PrefixZipper)
For efficient autocomplete and code completion scenarios, use the PrefixZipper trait to navigate directly to a prefix and iterate only matching terms. This is 5-10× faster than full dictionary iteration with filtering when the prefix matches a small subset of terms.
Performance: O(k) navigation + O(m) iteration, where k = prefix length, m = matching terms.
```rust
use liblevenshtein::prelude::*;
use liblevenshtein::zipper::{PrefixZipper, DoubleArrayTrieZipper}; // paths illustrative

let terms = vec!["apple", "apply", "application", "banana"];
let dict = DoubleArrayTrie::from_terms(terms);

// Create zipper and navigate to prefix
let zipper = DoubleArrayTrieZipper::new_from_dict(&dict);
if let Some(iter) = zipper.with_prefix("app") {
    for term in iter {
        println!("{}", term); // apple, apply, application
    }
}
```
Unicode support (character-level):
```rust
use liblevenshtein::dictionary::DoubleArrayTrieChar; // paths illustrative
use liblevenshtein::zipper::{DoubleArrayTrieCharZipper, PrefixZipper};

let terms = vec!["café", "caffeine", "dog"];
let dict = DoubleArrayTrieChar::from_terms(terms);

let zipper = DoubleArrayTrieCharZipper::new_from_dict(&dict);
let prefix: Vec<char> = "caf".chars().collect();
if let Some(iter) = zipper.with_prefix(&prefix) {
    for term in iter {
        println!("{}", term); // café, caffeine
    }
}
```
Valued dictionaries (for metadata like scope IDs, frequencies, etc.):
```rust
use liblevenshtein::zipper::ValuedPrefixZipper; // path illustrative

let dict = DynamicDawg::from_terms_with_values(vec![("apple", 1u32), ("apply", 2)]);
let zipper = DynamicDawgZipper::new_from_dict(&dict); // zipper type illustrative
if let Some(iter) = zipper.with_prefix_values("app") {
    for (term, value) in iter {
        println!("{}: {}", term, value);
    }
}
```
Use cases:
- Code completion / autocomplete
- Prefix search in large dictionaries
- Pattern-aware completion (LSP servers)
- Any scenario requiring "terms starting with X"
Backend support: All dictionary backends support PrefixZipper (DoubleArrayTrie, DynamicDawg, PathMap, etc.) through blanket trait implementation.
Advanced: Direct Zipper API
For fine-grained control over traversal, use the zipper API directly:
```rust
use liblevenshtein::dictionary::PathMapDictionary; // module paths illustrative
use liblevenshtein::zipper::{PathMapZipper, IntersectionZipper};

// Create dictionary
let dict = PathMapDictionary::new();
dict.insert("test");
dict.insert("testing");
dict.insert("tested");

// Create zippers
let dict_zipper = PathMapZipper::new_from_dict(&dict);
let auto_zipper = LevenshteinZipper::new("test", 1); // automaton zipper; type name illustrative

// Intersect dictionary and automaton
let intersection = IntersectionZipper::new(dict_zipper, auto_zipper);

// Manual traversal
let mut pool = StatePool::new();
for (label, child) in intersection.children(&mut pool) {
    // custom traversal logic (DFS, A*, beam search, ...)
}
```
When to use:
- Custom traversal algorithms (DFS, A*, beam search)
- Early termination with custom heuristics
- Integration with external data structures
- Research and experimentation
Note: Most users should use Transducer::query() or ContextualCompletionEngine instead. The zipper API is lower-level and requires manual state management.
Value-Filtered Queries
For performance optimization when querying dictionaries with scoped values:
```rust
use liblevenshtein::prelude::*;
use std::collections::HashSet;

// Dictionary with scope IDs
let dict = DynamicDawg::from_terms_with_values(vec![
    ("global_var", 0u32),
    ("function_param", 1),
    ("local_var", 2),
]);
let transducer = Transducer::new(dict, 2);

// Query only specific scopes (e.g., visible from scope 1)
let visible_scopes: HashSet<u32> = HashSet::from([0, 1]); // global + function
for term in transducer.query_by_value_set("function_parm", &visible_scopes) {
    println!("{}", term);
}
// Output: function_param (scope 1 is visible)
// Does NOT return: local_var (scope 2 not in visible set)

// Or use a custom predicate (query term illustrative)
for term in transducer.query_filtered("function_parm", |scope| *scope <= 1) {
    println!("{}", term);
}
// Returns matches from scopes 0 and 1 only
```
Performance benefit: Filtering by value during traversal is significantly faster than post-filtering results, especially for large dictionaries with many out-of-scope matches.
Use cases:
- Scope-based code completion (only show visible symbols)
- Access control (filter by user permissions)
- Multi-tenancy (filter by tenant ID)
Phonetic Rewrite Rules (Formally Verified)
The phonetic-rules feature provides mathematically verified phonetic transformation rules for fuzzy matching. Compose phonetic NFAs with Levenshtein automata for approximate string matching without normalization.
Compile-Time Macros
Use llre! and llev! macros to compile patterns and rules at build time (NFA embedded in binary):
```rust
use liblevenshtein::phonetic::{llre, llev, llre_file, llev_file}; // macro paths illustrative

// Compile regex pattern at build time - NFA embedded in binary
let pattern = llre!("[:vowel:]+"); // pattern illustrative
assert!(pattern.matches("ae"));
assert!(!pattern.matches("xyz"));

// Compile LLev rules at build time
let rules = llev!(/* rule source */);

// Load from files (imports resolved at compile time); file paths illustrative
let phonetic = llre_file!("patterns/phonetic.llre");
let english = llev_file!("rules/english.llev");
```
let english = llev_file!;
Composing NFAs with Levenshtein Automata
Build fuzzy regex matchers by composing phonetic NFAs with Levenshtein automata:
```rust
use liblevenshtein::phonetic::{compile, ProductAutomatonChar}; // module paths illustrative
use liblevenshtein::phonetic::llre::parse;

// Step 1: Parse phonetic regex pattern (pattern illustrative)
let regex = parse("gr[ae]y")?;

// Step 2: Compile regex to NFA
let nfa = compile(&regex)?;

// Step 3: Compose NFA with Levenshtein automaton (max distance 2)
let product = ProductAutomatonChar::new(nfa, 2);

// Step 4: Fuzzy match without preprocessing
assert!(product.matches("gray"));  // exact match (distance 0)
assert!(product.matches("grey"));  // exact match (distance 0)
assert!(product.matches("graey")); // distance 1 (insertion)
assert!(product.matches("gry"));   // distance 1 (deletion)
assert!(product.matches("groy"));  // distance 1 (substitution)

// Get minimum edit distance (accessor name illustrative)
assert_eq!(product.distance("gray"), Some(0));
assert_eq!(product.distance("graay"), Some(1));
assert_eq!(product.distance("grayzzz"), None); // outside budget
```
Spelling Correction with Embedded English Rules
Combine pre-compiled English phonetic rules and search a dictionary for correction candidates:
```rust
use liblevenshtein::phonetic::english;              // module paths illustrative
use liblevenshtein::phonetic::RuleSetChar;
use liblevenshtein::phonetic::rules_to_nfa_char;
use liblevenshtein::phonetic::ProductAutomatonChar;
use liblevenshtein::dictionary::DynamicDawgChar;

// Load pre-compiled English phonetic rules
let zompist = english::zompist();        // 62 rules: ph→f, gh→∅, tion→ʃən
let homophones = english::homophones();  // too/two→to, their/there→ther
let text_speak = english::text_speak();  // u→you, 2→to, thx→thanks

// Combine all rules into a new RuleSetChar
let mut combined = RuleSetChar::new();
combined.merge(zompist);
combined.merge(homophones);
combined.merge(text_speak);

// Convert combined rules to NFA for fuzzy matching (no normalization)
let rules_nfa = rules_to_nfa_char(&combined);
let product = ProductAutomatonChar::new(rules_nfa, 1); // max distance illustrative

// Build a dictionary of terms to search
let mut dictionary = DynamicDawgChar::new();
dictionary.insert("phone");
dictionary.insert("fone");
dictionary.insert("night");
dictionary.insert("knight");

// Query for correction candidates by filtering dictionary terms
let query = "fone";
let mut candidates: Vec<(String, usize)> = dictionary
    .iter()
    .filter_map(|term| product.distance(query, &term).map(|d| (term, d))) // shape illustrative
    .collect();

// Sort by distance (best matches first)
candidates.sort_by_key(|&(_, d)| d);
// candidates = [("fone", 0), ("phone", 0), ...]

// Optional: Apply rules to normalize text
let normalized = combined.apply("phone");  // "fone" (ph→f)
let normalized = combined.apply("knight"); // "nit" (silent k, gh→∅)
```
Algorithm Variants
```rust
use liblevenshtein::transducer::Algorithm; // path illustrative; assumes a `dict` built as in earlier examples

// Standard Levenshtein
let standard = Transducer::new(dict.clone(), 2);

// Transposition-aware (character swaps count as 1 edit)
let transposition = Transducer::with_algorithm(dict.clone(), 2, Algorithm::Transposition);

// Merge-and-split (for OCR: "cl"→"d", "ä"→"ae")
let merge_split = Transducer::with_algorithm(dict, 2, Algorithm::MergeAndSplit);
```
PhoneticGrep (Convenience API)
For quick on-the-fly matching without building a dictionary:
```rust
use liblevenshtein::phonetic::PhoneticGrep; // path illustrative

// Quick on-the-fly matching (patterns illustrative)
let grep = PhoneticGrep::from_pattern("gr[ae]y")?;
assert!(grep.matches("gray"));
assert!(grep.matches("grey"));

// With phonetic flags (case + accent insensitive)
let grep = PhoneticGrep::from_pattern(/* pattern with case/accent-insensitive flags */)?;
assert!(grep.matches("GRÄY"));

// With rules from file
let grep = PhoneticGrep::with_rules(/* pattern, rules file */)?;
assert!(grep.matches(/* term rewritten by the rules */));
```
Named Character Classes (phonetic-aware):
```
// [:vowel:] includes ASCII vowels + IPA vowels (ə, ɪ, ʊ, ɛ, etc.)
// [:consonant:] includes ASCII consonants + IPA symbols
// [:fricative:] matches f,v,s,z,h + digraphs (sh,th,zh)
// [:nasal:] matches m,n + ng digraph + IPA ŋ
// [:stop:] / [:plosive:] matches p,b,t,d,k,g
// [:front_vowel:] / [:back_vowel:] by tongue position
// [:voiced:] / [:voiceless:] by voicing
```
Formal Verification: All algorithms are proven correct in Coq/Rocq:
Phonetic Rules (5 theorems):
- Well-formedness - All rules satisfy structural constraints
- Bounded Expansion - Output length ≤ input length + 20 (guaranteed)
- Non-Confluence - Rule application order matters (proven)
- Termination - Sequential application always terminates (guaranteed)
- Idempotence - Fixed points remain unchanged (stable semantics)
Levenshtein Automata:
- Complete axiom-free proofs for distance properties (triangle inequality, lower bounds)
- Position skipping optimization: 50/50 proofs complete (100% verified)
- Modular proof decomposition with 0 Admitted lemmas
Verification Artifacts:
- Coq proofs: `docs/verification/phonetic/`
- Rust implementation: `src/phonetic/`
- Property tests: `src/phonetic/properties.rs`
- Benchmarks: `benches/phonetic_rules.rs`
- Example: `examples/phonetic_rewrite.rs`
Enable with:
```toml
[dependencies]
liblevenshtein = { git = "https://github.com/universal-automata/liblevenshtein-rust", features = ["phonetic-rules"] }
```
Source: Based on Zompist's English spelling rules with complete formal verification in Coq/Rocq (630+ lines of proofs, 100% proven, zero Admitted).
LLev/LLRE - Phonetic Pattern Languages
For complex phonetic matching, use .llev files for rewrite rules or .llre files for regex-style patterns with phonetic extensions:
```rust
use liblevenshtein::phonetic::llev::parse_str; // module paths illustrative
use liblevenshtein::phonetic::llre;

// LLev: Rewrite rules with phonetic context and named classes
let rules = parse_str(/* contents of a .llev file */)?;

// LLRE: Compile regex pattern to NFA (pattern illustrative)
let pattern = llre::compile_pattern("[:fricative:]o")?;
assert!(pattern.matches("fo"));  // f ∈ [:fricative:]
assert!(pattern.matches("sho")); // sh ∈ [:fricative:]

// Compose LLRE pattern with Levenshtein for fuzzy matching
use liblevenshtein::phonetic::ProductAutomatonChar;
let product = ProductAutomatonChar::new(pattern, 1);
assert!(product.matches("fho")); // distance 1 from pattern
```
See examples/phonetic_spellcheck for a complete demo, and the LLev Grammar / LLRE Grammar for full syntax reference.
Documentation
- Technical Glossary - Comprehensive reference for all technical terms (70+ entries)
- User Guide - Getting started, algorithms, backends, and features
- Developer Guide - Architecture, building, contributing, and performance
- Building and Testing - Comprehensive build instructions and CLI usage
- Contributing Guidelines - How to contribute to the project
- Features Overview - Detailed feature documentation
- Publishing Guide - Requirements for publishing to crates.io
- Changelog - Version history and release notes
How Code Maps to Theory
For developers wanting to understand the implementation or extend the algorithms:
| Paper Concept | Code Location | Description |
|---|---|---|
| Position (i#e) | `src/transducer/position.rs:11-35` | Index + error count structure |
| Subsumption (⊑) | `src/transducer/position.rs:231-269` | Position redundancy elimination |
| Characteristic Vector (χ) | `src/transducer/position.rs` | Character occurrence encoding |
| Elementary Transitions (δ) | `src/transducer/transition.rs:119-438` | Tables 4.1, 7.1, 8.1 from paper |
| State Transitions (Δ) | `src/transducer/query.rs` | Parallel traversal implementation |
| Algorithm Variants | `src/transducer/algorithm.rs` | Standard/Transposition/MergeAndSplit |
| Imitation Method | `src/transducer/query.rs:86-188` | Chapter 6 - on-demand state generation |
| Dictionary Automaton (A^D) | `src/dictionary/` | Multiple backend implementations |
Key Implementation Patterns:
- Position Structure: Tracks (term_index, num_errors, is_special) corresponding to i#e_t or i#e_s in the paper
- Subsumption Checking: Ensures minimal state sets by eliminating redundant positions
- Transition Functions: Implement the case analysis from Tables 4.1, 7.1, and 8.1
- Parallel Traversal: Simultaneously navigate dictionary automaton A^D and Levenshtein automaton LEV_n(W)
See Implementation Mapping for detailed code-to-paper correspondence with examples.
Performance
The library maintains the theoretical guarantees from the foundational paper while adding practical optimizations:
Theoretical Complexity
- Construction: O(|W|) time for automaton construction (Theorem 5.2.1)
- Query: O(|D|) time for dictionary traversal where |D| is total edges (Chapter 3)
- Space: O(|W|) states for fixed error bound n (Theorem 4.0.32)
vs Naive Approach: Computing Levenshtein distance for every dictionary entry requires O(|D| × |W| × |V|) operations. This library achieves 100-1000× speedup on large dictionaries by avoiding per-word distance calculations through automata-guided search.
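For concreteness, the naive baseline is the textbook dynamic-programming computation below, which must run once per dictionary word; the automaton replaces all of these per-word table fills with a single guided traversal. (Standard algorithm shown for comparison; this is not the library's query path.)

```rust
/// Textbook Levenshtein DP: O(|w| × |v|) time per word pair,
/// so scanning a dictionary costs O(|D| × |W| × |V|) overall.
fn levenshtein(w: &str, v: &str) -> usize {
    let (w, v): (Vec<char>, Vec<char>) = (w.chars().collect(), v.chars().collect());
    let mut prev: Vec<usize> = (0..=v.len()).collect();
    for (i, &wc) in w.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &vc) in v.iter().enumerate() {
            let cost = if wc == vc { 0 } else { 1 };
            // min of substitution, deletion, insertion
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[v.len()]
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
}
```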
Core Optimizations
Beyond the algorithmic foundations, the implementation includes:
- Arc Path Sharing: Eliminated expensive cloning operations during traversal
- StatePool: Object pool pattern for state reuse with exceptional performance gains
- SmallVec Integration: Stack-allocated vectors reduce heap allocation pressure
- Lazy Edge Iteration: 15-50% faster PathMap edge iteration with zero-copy implementation
- Aggressive Inlining: Hot path functions inlined for optimal performance
Benchmarks show 3.3x speedup for DAWG operations and 5-18% improvements across filtering/prefix scenarios.
SIMD Acceleration (optional simd feature)
When compiled with the simd feature on x86_64 platforms, the library achieves 20-64% performance gains through:
- 8 SIMD-optimized components across critical performance paths
- AVX2/SSE4.1 implementations with runtime CPU feature detection
- Data-driven threshold tuning based on empirical benchmarking
- Automatic fallback to scalar implementation when SIMD unavailable
Key optimizations:
- Characteristic vector operations (vectorized bit manipulation)
- Position subsumption checking (parallel state comparisons)
- State operations (vectorized distance computations)
- Dictionary edge lookups (batched queries with adaptive thresholds)
- Distance matrix computation (vectorized row/column operations)
Performance improvements vary by workload:
- Small dictionaries (< 1,000 terms): 20-30% faster
- Medium dictionaries (1,000-10,000 terms): 30-45% faster
- Large dictionaries (> 10,000 terms): 45-64% faster
See docs/analysis/ for detailed SIMD performance analysis (950+ lines of documentation).
Unicode Performance
Character-level dictionary variants (*Char) for Unicode support:
- ~5% overhead for UTF-8 decoding during traversal
- 4x memory for edge labels (4 bytes per `char` vs 1 byte per `u8`)
- Zero-cost abstraction via monomorphization (no runtime polymorphism)
- Same query performance characteristics as byte-level variants
When to use:
- ✅ Use `*Char` variants for multi-language dictionaries with non-ASCII Unicode
- ✅ Use byte-level variants (`DoubleArrayTrie`, `PathMapDictionary`) for ASCII/Latin-1 content
Phonetic Pattern Matching
When using the phonetic-rules feature:
- NFA optimizations: 2-7× speedup through epsilon elimination, dead state removal, and transition deduplication
- Position skipping: Up to 26.6× speedup for pattern matching with skip-ahead optimization
- Lexer optimizations: 7-11% speedup via string interning and stack-based operations
Research & Planned Features
This library includes extensive research documentation on potential enhancements:
Universal Levenshtein Automata (Planned)
Support for restricted substitutions where only specific character pairs can be substituted:
Use Cases:
- Keyboard proximity - Only adjacent keys allowed (QWERTY/AZERTY/Dvorak layouts)
- Example: 'a' ↔ 's' ↔ 'd' (adjacent on QWERTY)
- OCR error modeling - Visual similarity constraints
- Example: 'l' ↔ 'I' ↔ '1', 'O' ↔ '0', 'rn' ↔ 'm'
- Phonetic matching - Sound-alike rules
- Example: 'f' ↔ 'ph', 'k' ↔ 'c', 's' ↔ 'z'
Theory: Based on "Universal Levenshtein Automata. Building and Properties" (Mitankin, Mihov, Schulz, 2005). Maintains O(|W|) construction complexity with restricted substitution set S ⊆ Σ × Σ.
Status: Research complete, implementation planned. See Universal Levenshtein Automata Documentation for complete details.
Weighted Operations (Research Phase)
Support for variable operation costs through discretization:
Use Cases:
- Keyboard distance costs - Closer keys cost less (adjacent: 0.5, same row: 1.0, different rows: 1.5)
- Context-dependent costs - Position-aware error penalties (beginning/end of word)
- Frequency-based costs - Common errors cost less than rare ones
Complexity: O(|W| × max_cost/precision) through cost discretization—still linear in query length when cost range and precision are fixed.
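As a worked example of the discretization (illustrative only): with precision 0.5, keyboard costs {0.5, 1.0, 1.5} map to integer units {1, 2, 3}, and an error budget of 2.0 becomes 4 units, so state error counts stay bounded for fixed max_cost/precision:

```rust
// Hypothetical discretization: scale real-valued costs to integer units.
fn discretize(cost: f64, precision: f64) -> u32 {
    (cost / precision).round() as u32
}

fn main() {
    let precision = 0.5;
    // adjacent key: 0.5 → 1 unit; same row: 1.0 → 2; different rows: 1.5 → 3
    assert_eq!(discretize(0.5, precision), 1);
    assert_eq!(discretize(1.0, precision), 2);
    assert_eq!(discretize(1.5, precision), 3);
    // an error budget of 2.0 becomes 4 units, so positions carry error
    // counts in 0..=4 and construction stays linear in |W|
    assert_eq!(discretize(2.0, precision), 4);
}
```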
Status: Methodology documented, implementation research phase. See Weighted Levenshtein Automata Research for feasibility analysis and implementation approach.
Contributing Research
Interested in implementing these features or researching new extensions? The research documents linked above provide complete theoretical foundations and implementation roadmaps for extending the library.
License
Licensed under the Apache License, Version 2.0. See LICENSE for details.