# amt-phonetic (Rust)

Articulatory Moment Transform: language-agnostic phonetic token matching.

The crate is published as `amt-phonetic`; the library is imported as `amt`.

Designed and benchmarked for personal names across Latin, Arabic, CJK, Cyrillic, Devanagari, and Hebrew scripts. The core encoder generalizes to other short tokens (places, brands, drugs); see the top-level README for the caveats around the name-specific preprocessing (ال / AL-EL-UL-AS-ES prefix stripping, silent trailing H).

```toml
[dependencies]
amt-phonetic = "1.0"
```
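
```rust
use amt::{...};

assert!(matches(...));
assert!(matches(...)); // Latin ↔ Arabic
assert!(matches(...)); // Egyptian ↔ Standard
assert!(...);
let s: f32 = similarity(...); // ≈ 1.0

let mut tree: BKTree<...> = BKTree::new();
for name in &customer_names {
    ...
}
let q = encode_token(...);
let hits = tree.query(...);
```
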
## Features

| flag | default | what it does |
|---|---|---|
| `smallvec` | on | Stack-allocate small class / spectral / bloom tuples. |

Disable with `default-features = false` if you cannot pull in `smallvec`.
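
As a sketch, opting out of the default feature set in `Cargo.toml` could look like the fragment below. It assumes the `1.0` version requirement shown earlier and that no other features need to be re-enabled.

```toml
[dependencies]
# Assumption: only the default `smallvec` feature is being turned off.
amt-phonetic = { version = "1.0", default-features = false }
```
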
## API surface

| Item | Purpose |
|---|---|
| `encode_token(s)` | Encode one token → `Code { spectrals, blooms, .. }` |
| `encode(name)` | Encode multi-token name → `Vec<Code>` |
| `matches(a, b)` | Boolean phonetic match |
| `similarity(a, b) -> f32` | Graded similarity in [0, 1] |
| `token_distance(&a, &b)` | Token-level distance |
| `BKTree<T>` | Metric tree for radius-bounded fuzzy search |
| `Code` / `class_of(c)` | Inspect raw fingerprints / sonority class of a char |
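
To illustrate what "metric tree for radius-bounded fuzzy search" means in general, here is a minimal, std-only sketch of a BK-tree. It is not the crate's `BKTree<T>`: the `Node` type, the `search` helper, and the use of Levenshtein distance over plain strings are all assumptions made for illustration, standing in for the crate's own codes and `token_distance`.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Levenshtein edit distance: a stand-in metric for this sketch only
/// (the crate itself compares encoded tokens, not raw strings).
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        let mut cur = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            cur[j] = (prev[j] + 1).min(cur[j - 1] + 1).min(prev[j - 1] + cost);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Hypothetical BK-tree node: children are keyed by their distance
/// to this node's word.
struct Node {
    word: String,
    children: HashMap<usize, Node>,
}

impl Node {
    fn new(word: &str) -> Self {
        Node { word: word.to_string(), children: HashMap::new() }
    }

    fn insert(&mut self, word: &str) {
        let d = levenshtein(&self.word, word);
        if d == 0 {
            return; // already stored
        }
        match self.children.entry(d) {
            Entry::Occupied(e) => e.into_mut().insert(word),
            Entry::Vacant(e) => {
                e.insert(Node::new(word));
            }
        }
    }

    fn query(&self, target: &str, radius: usize, out: &mut Vec<(usize, String)>) {
        let d = levenshtein(&self.word, target);
        if d <= radius {
            out.push((d, self.word.clone()));
        }
        // Triangle inequality: only subtrees whose edge distance lies in
        // [d - radius, d + radius] can contain a word within `radius`.
        let (lo, hi) = (d.saturating_sub(radius), d + radius);
        for (edge, child) in &self.children {
            if (lo..=hi).contains(edge) {
                child.query(target, radius, out);
            }
        }
    }
}

/// Builds a tree from `words` and returns every word within `radius`
/// of `target`, sorted by (distance, word).
fn search(words: &[&str], target: &str, radius: usize) -> Vec<(usize, String)> {
    let mut iter = words.iter();
    let mut root = match iter.next() {
        Some(first) => Node::new(first),
        None => return Vec::new(),
    };
    for w in iter {
        root.insert(w);
    }
    let mut hits = Vec::new();
    root.query(target, radius, &mut hits);
    hits.sort();
    hits
}
```

Calling `search(&["jon", "john", "joan", "maria"], "jhon", 2)` visits only the subtrees that the triangle inequality cannot rule out, which is why the structure requires a true metric. The radius bound is what keeps a lookup from degenerating into a full scan when the stored set is large.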
## Benchmarks

Run the in-tree throughput benchmark (Criterion) with `cargo bench`.

The end-to-end corpus and recall benchmarks require regenerated data; see `../benchmarks/README.md`.

## Algorithm

See the whitepaper for the full mathematical treatment, sonority classes, and head-to-head recall numbers vs Soundex, Metaphone, NYSIIS, Beider-Morse, and friends.

## License

MIT — see LICENSE.