shabdakosh
shabdakosh (Sanskrit: dictionary) — Pronunciation dictionary crate for AGNOS.
Maps words to svara Phoneme sequences using a 10,600+ entry English dictionary derived from the CMU Pronouncing Dictionary. Multi-language support via optional varna integration.
Features
- 10,600+ entry English dictionary generated at compile time from CMUdict (zero runtime parsing)
- ARPABET mapping — bidirectional conversion between ARPABET notation and svara phonemes
- IPA mapping — bidirectional IPA-Phoneme conversion with greedy parser
- User overlay — application-specific entries that override the base dictionary
- Variant pronunciations — heteronyms (read, live, wind) with frequency and region metadata
- Import/export — CMUdict, IPA, JSON, W3C PLS, SSML <phoneme> tags
- Dictionary operations — merge (override/conservative), diff (added/removed/changed)
- Multi-language (varna feature) — inventory validation, lexicon ingestion, script/language detection
- no_std compatible — works with alloc, no standard library required
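The greedy IPA parsing mentioned in the feature list can be pictured as longest-match tokenization: at each position, take the longest known symbol that prefixes the remaining input. The sketch below is only an illustration of that technique, not the crate's implementation. The `SYMBOLS` table and the `tokenize_greedy` function are hypothetical, and plain string slices stand in for svara `Phoneme` values.

```rust
/// Hypothetical symbol table (not the crate's inventory): a handful of IPA
/// symbols, some of which are prefixes of longer ones ("t" vs "tʃ").
const SYMBOLS: &[&str] = &["tʃ", "dʒ", "aɪ", "t", "d", "ʃ", "ʒ", "k", "a", "ɪ"];

/// Split `input` into known symbols, always preferring the longest match.
/// Returns `None` as soon as some position matches no symbol at all.
fn tokenize_greedy(input: &str) -> Option<Vec<&'static str>> {
    let mut tokens = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        // Among all symbols that prefix `rest`, pick the longest (in bytes).
        let best = SYMBOLS
            .iter()
            .copied()
            .filter(|s| rest.starts_with(*s))
            .max_by_key(|s| s.len())?;
        tokens.push(best);
        // `best` is a prefix of `rest`, so this slice lands on a char boundary.
        rest = &rest[best.len()..];
    }
    Some(tokens)
}
```

With this table, `tokenize_greedy("tʃaɪ")` yields the affricate and the diphthong as two tokens rather than four single-character ones, which is the point of matching greedily.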
Quick Start
use PronunciationDict;
let dict = english;
assert!;
assert!;
User Overlay
Override or extend the built-in dictionary with application-specific pronunciations:
use PronunciationDict;
use Phoneme;
let mut dict = english;
// Add a custom word
dict.insert_user;
// User entries take precedence over base entries
assert!;
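To make the precedence rule concrete, here is a minimal, self-contained sketch of a two-layer lookup in which user entries shadow base entries. It is not the crate's code: `OverlayDict` and its `lookup` method are hypothetical names, `std::collections::HashMap` stands in for hashbrown, and ARPABET strings stand in for svara `Phoneme` values.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical two-layer dictionary: a base map plus a user overlay.
struct OverlayDict {
    base: HashMap<String, Vec<String>>,
    user: BTreeMap<String, Vec<String>>,
}

impl OverlayDict {
    fn new(base: HashMap<String, Vec<String>>) -> Self {
        Self { base, user: BTreeMap::new() }
    }

    /// Add or replace a user entry. The base layer is never modified.
    fn insert_user(&mut self, word: &str, phonemes: &[&str]) {
        let phonemes = phonemes.iter().map(|p| p.to_string()).collect();
        self.user.insert(word.to_string(), phonemes);
    }

    /// Look up a word, consulting the overlay before the base.
    fn lookup(&self, word: &str) -> Option<&[String]> {
        self.user
            .get(word)
            .or_else(|| self.base.get(word))
            .map(|v| v.as_slice())
    }
}
```

Keeping the overlay in a separate map means a user entry can be dropped later without losing the base pronunciation underneath it.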
Import/Export
use format;
// Parse CMUdict format
let input = "hello HH AH0 L OW1\nworld W ER1 L D\n";
let dict = parse_cmudict.unwrap;
// Export back to CMUdict format
let output = to_cmudict;
Also supported: IPA text, JSON (json feature), W3C PLS XML, SSML <phoneme> tags.
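For readers unfamiliar with the CMUdict text layout, the following sketch shows the round trip on the same sample input: one entry per line, the word first, then whitespace-separated ARPABET symbols. It is an illustration under assumptions, not the crate's parser. The names `parse_cmudict_sketch` and `to_cmudict_sketch` are hypothetical, entries are plain `(String, Vec<String>)` pairs, and CMUdict's alternate-pronunciation markers such as `WORD(1)` are not handled.

```rust
/// Parse CMUdict-style text into (word, phonemes) pairs.
/// Blank lines and ";;;" comment lines (the CMUdict convention) are skipped.
fn parse_cmudict_sketch(input: &str) -> Result<Vec<(String, Vec<String>)>, String> {
    let mut entries = Vec::new();
    for (i, line) in input.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() || line.starts_with(";;;") {
            continue;
        }
        let mut fields = line.split_whitespace();
        let word = match fields.next() {
            Some(w) => w,
            None => continue,
        };
        let phonemes: Vec<String> = fields.map(str::to_string).collect();
        if phonemes.is_empty() {
            return Err(format!("line {}: no phonemes for {:?}", i + 1, word));
        }
        entries.push((word.to_string(), phonemes));
    }
    Ok(entries)
}

/// Serialize entries back to one "word PH1 PH2 ..." line per entry.
fn to_cmudict_sketch(entries: &[(String, Vec<String>)]) -> String {
    let mut out = String::new();
    for (word, phonemes) in entries {
        out.push_str(word);
        out.push(' ');
        out.push_str(&phonemes.join(" "));
        out.push('\n');
    }
    out
}
```

Because the sample input already uses single spaces and a trailing newline, parsing it and exporting it again reproduces the input byte for byte.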
Multi-Language (varna feature)
use PronunciationDict;
// Ingest a varna lexicon
let lexicon = by_code.unwrap;
let dict = from_lexicon;
assert_eq!;
// Detect script from Unicode
use detect;
assert_eq!;
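Script detection from Unicode generally comes down to mapping code points to script blocks. The sketch below shows that idea in isolation and should not be read as varna's detection logic: the `Script` enum, `script_of`, and `detect_script` are hypothetical, only three scripts are covered, and the ranges are approximate block ranges rather than full Unicode script properties.

```rust
/// Hypothetical script tag; a real inventory would cover many more scripts.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Latin,
    Cyrillic,
    Devanagari,
    Unknown,
}

/// Classify a single character by (approximate) Unicode block.
/// Returns `None` for digits, punctuation, whitespace, and uncovered scripts.
fn script_of(c: char) -> Option<Script> {
    match c as u32 {
        // Basic Latin letters plus Latin-1 Supplement and Latin Extended-A/B.
        0x0041..=0x005A | 0x0061..=0x007A | 0x00C0..=0x024F => Some(Script::Latin),
        // Cyrillic block.
        0x0400..=0x04FF => Some(Script::Cyrillic),
        // Devanagari block.
        0x0900..=0x097F => Some(Script::Devanagari),
        _ => None,
    }
}

/// Report the script of the first classifiable character in `text`.
fn detect_script(text: &str) -> Script {
    text.chars().find_map(script_of).unwrap_or(Script::Unknown)
}
```

Deciding on the first classifiable character keeps the sketch short. Counting characters per script and taking the majority would be more robust for mixed-script input.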
Feature Flags
| Feature | Default | Description |
|---|---|---|
| std | Yes | Standard library support. Disable for no_std + alloc |
| json | No | JSON import/export via serde_json |
| varna | No | Multi-language: validation, lexicon ingestion, script detection |
| full | No | All features |
Architecture
shabdakosh
├── arpabet.rs          ARPABET <-> svara Phoneme
├── ipa.rs              IPA <-> svara Phoneme
└── dictionary/
    ├── mod.rs          PronunciationDict (hashbrown + BTreeMap overlay)
    ├── entry.rs        DictEntry, Pronunciation, Region
    ├── validate.rs     inventory validation (varna feature)
    ├── detect.rs       script/language detection (varna feature)
    └── format/
        ├── mod.rs      CMUdict, IPA, JSON import/export
        ├── pls.rs      W3C PLS XML
        └── ssml.rs     SSML <phoneme> tags
See docs/architecture/overview.md for the full architecture overview.
Documentation
- Usage Guide — comprehensive examples for all features
- Architecture Overview — module map, data flow, design principles
- ADRs — architecture decision records
Consumers
- shabda — G2P engine (dictionary lookup + rules fallback)
- dhvani — Audio engine
- vansh — Voice AI shell
License
GPL-3.0-only