shabdakosh

shabdakosh (Sanskrit: dictionary) — Pronunciation dictionary crate for AGNOS.

Maps words to svara Phoneme sequences using an English dictionary of more than 10,600 entries derived from the CMU Pronouncing Dictionary. Multi-language support is available through the optional varna integration.

Features

10,600+ entry English dictionary generated at compile time from CMUdict (zero runtime parsing)
ARPABET mapping — bidirectional conversion between ARPABET notation and svara phonemes
IPA mapping — bidirectional IPA-Phoneme conversion with greedy parser
User overlay — application-specific entries that override the base dictionary
Variant pronunciations — heteronyms (read, live, wind) with frequency and region metadata
Import/export — CMUdict, IPA, JSON, W3C PLS, SSML <phoneme> tags
Dictionary operations — merge (override/conservative), diff (added/removed/changed)
Multi-language (varna feature) — inventory validation, lexicon ingestion, script/language detection
no_std compatible — works with alloc, no standard library required
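
The greedy IPA parsing mentioned in the list above can be pictured as longest-match tokenization: at each position the longest known symbol is tried first, so multi-character symbols such as tʃ or aɪ are not split apart. The sketch below only illustrates that idea. The `SYMBOLS` table and the `tokenize_ipa` function are hypothetical and are not part of shabdakosh's API, and a real symbol inventory would be much larger.

```rust
/// Hypothetical symbol table for illustration only (not shabdakosh's inventory).
const SYMBOLS: &[&str] = &[
    "tʃ", "dʒ", "aɪ", "oʊ", "t", "d", "ʃ", "ʒ", "a", "ɪ", "o", "ʊ", "h", "l", "ə",
];

/// Length of the longest symbol, counted in characters rather than bytes.
fn max_symbol_chars() -> usize {
    SYMBOLS.iter().map(|s| s.chars().count()).max().unwrap_or(0)
}

/// Greedy longest-match tokenizer.
/// Returns None as soon as a character is not covered by any symbol.
fn tokenize_ipa(input: &str) -> Option<Vec<&'static str>> {
    let chars: Vec<char> = input.chars().collect();
    let max_len = max_symbol_chars();
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < chars.len() {
        let mut matched = None;
        // Try the longest candidate first, then shrink one character at a time.
        for len in (1..=max_len.min(chars.len() - pos)).rev() {
            let candidate: String = chars[pos..pos + len].iter().collect();
            if let Some(sym) = SYMBOLS.iter().find(|s| **s == candidate) {
                matched = Some((*sym, len));
                break;
            }
        }
        let (sym, len) = matched?;
        out.push(sym);
        pos += len;
    }
    Some(out)
}
```

With this table, `tokenize_ipa("tʃaɪ")` yields the two symbols tʃ and aɪ rather than four single characters, and any unknown character makes the whole parse return `None`.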

Quick Start

use shabdakosh::PronunciationDict;

// Load the built-in English dictionary (generated at compile time from CMUdict)
let dict = PronunciationDict::english();
assert!(dict.lookup("hello").is_some());
assert!(dict.len() >= 10_000);

User Overlay

Override or extend the built-in dictionary with application-specific pronunciations:

use shabdakosh::PronunciationDict;
use svara::phoneme::Phoneme;

let mut dict = PronunciationDict::english();

// Add a custom word
dict.insert_user("agnos", &[
    Phoneme::VowelAsh, Phoneme::PlosiveG,
    Phoneme::NasalN, Phoneme::VowelO, Phoneme::FricativeS,
]);

// User entries take precedence over base entries
assert!(dict.lookup("agnos").is_some());
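
To make the precedence rule concrete, here is a minimal sketch of a two-layer lookup in the spirit of the "hashbrown + BTreeMap overlay" noted in the Architecture section. It is an assumption-laden illustration, not the crate's implementation: `OverlayDict` and its methods are hypothetical, `std::collections::HashMap` stands in for hashbrown, and phonemes are plain ARPABET strings instead of svara `Phoneme` values.

```rust
use std::collections::{BTreeMap, HashMap};

/// Illustrative two-layer dictionary; NOT shabdakosh's actual type.
struct OverlayDict {
    base: HashMap<String, Vec<String>>,  // stands in for the built-in entries
    user: BTreeMap<String, Vec<String>>, // application-specific overrides
}

impl OverlayDict {
    fn new() -> Self {
        OverlayDict { base: HashMap::new(), user: BTreeMap::new() }
    }

    fn insert_base(&mut self, word: &str, phones: &[&str]) {
        let phones = phones.iter().map(|p| p.to_string()).collect();
        self.base.insert(word.to_string(), phones);
    }

    fn insert_user(&mut self, word: &str, phones: &[&str]) {
        let phones = phones.iter().map(|p| p.to_string()).collect();
        self.user.insert(word.to_string(), phones);
    }

    /// Check the user layer first, then fall back to the base layer.
    fn lookup(&self, word: &str) -> Option<&[String]> {
        self.user
            .get(word)
            .or_else(|| self.base.get(word))
            .map(|v| v.as_slice())
    }
}
```

In this sketch the base entry is never modified, so dropping the user entry would make the original pronunciation visible again.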

Import/Export

use shabdakosh::dictionary::format;

// Parse CMUdict format
let input = "hello  HH AH0 L OW1\nworld  W ER1 L D\n";
let dict = format::parse_cmudict(input).unwrap();

// Export back to CMUdict format
let output = format::to_cmudict(&dict);

Also supported: IPA text, JSON (json feature), W3C PLS XML, SSML <phoneme> tags.
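
For readers unfamiliar with the CMUdict text format used above: each line holds a word, whitespace, and a sequence of ARPABET symbols, where vowels carry a trailing stress digit (0, 1, or 2) and lines starting with `;;;` are comments. The sketch below parses and re-serializes a single line under those assumptions. `Phone`, `parse_line`, and `format_line` are hypothetical names for illustration and do not describe how `format::parse_cmudict` is implemented.

```rust
/// One ARPABET symbol with an optional trailing stress digit, as written in CMUdict.
#[derive(Debug, PartialEq)]
struct Phone {
    symbol: String,
    stress: Option<u8>,
}

/// Parse one line such as "hello  HH AH0 L OW1" into (word, phones).
/// Returns None for blank lines, ";;;" comment lines, and lines without phones.
fn parse_line(line: &str) -> Option<(String, Vec<Phone>)> {
    let line = line.trim();
    if line.is_empty() || line.starts_with(";;;") {
        return None;
    }
    let mut parts = line.split_whitespace();
    let word = parts.next()?.to_string();
    let phones: Vec<Phone> = parts
        .map(|tok| match tok.chars().last().and_then(|c| c.to_digit(10)) {
            // A trailing ASCII digit is one byte, so slicing it off is safe.
            Some(d) => Phone { symbol: tok[..tok.len() - 1].to_string(), stress: Some(d as u8) },
            None => Phone { symbol: tok.to_string(), stress: None },
        })
        .collect();
    if phones.is_empty() { None } else { Some((word, phones)) }
}

/// Serialize back to a CMUdict-style line (two spaces after the word).
fn format_line(word: &str, phones: &[Phone]) -> String {
    let rendered: Vec<String> = phones
        .iter()
        .map(|p| match p.stress {
            Some(d) => format!("{}{}", p.symbol, d),
            None => p.symbol.clone(),
        })
        .collect();
    format!("{}  {}", word, rendered.join(" "))
}
```

Parsing a line and formatting the result reproduces the input line, which is the round-trip property the import/export pair above is meant to provide.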

Multi-Language (varna feature)

use shabdakosh::PronunciationDict;
use shabdakosh::dictionary::detect;

// Ingest a varna lexicon
let lexicon = varna::lexicon::swadesh::by_code("es").unwrap();
let dict = PronunciationDict::from_lexicon(&lexicon);
assert_eq!(dict.language(), Some("es"));

// Detect the script from Unicode code points ("Deva" is the ISO 15924 code for Devanagari)
assert_eq!(detect::detect_script("नमस्ते"), Some("Deva"));
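
As a rough picture of how script detection from Unicode can work, the sketch below maps characters to ISO 15924 codes by code point range and returns the script that covers the most characters. It is a simplified assumption, not the crate's algorithm: `script_of_char` and `detect_script_sketch` are hypothetical names, only four scripts are covered, and the ranges are approximate Unicode blocks rather than exact script assignments.

```rust
/// Map a character to an ISO 15924 script code using approximate Unicode block ranges.
/// Hypothetical helper for illustration; real script data is far more detailed.
fn script_of_char(c: char) -> Option<&'static str> {
    match c as u32 {
        0x0041..=0x005A | 0x0061..=0x007A | 0x00C0..=0x024F => Some("Latn"),
        0x0400..=0x04FF => Some("Cyrl"),
        0x0600..=0x06FF => Some("Arab"),
        0x0900..=0x097F => Some("Deva"),
        _ => None, // digits, punctuation, whitespace, and uncovered scripts
    }
}

/// Return the script that covers the most characters, or None if nothing matched.
fn detect_script_sketch(text: &str) -> Option<&'static str> {
    let mut counts: Vec<(&'static str, usize)> = Vec::new();
    for script in text.chars().filter_map(script_of_char) {
        match counts.iter_mut().find(|(s, _)| *s == script) {
            Some(entry) => entry.1 += 1,
            None => counts.push((script, 1)),
        }
    }
    counts.into_iter().max_by_key(|&(_, n)| n).map(|(s, _)| s)
}
```

Counting per script rather than stopping at the first character keeps a stray Latin letter or digit from deciding the result for text that is mostly in another script.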

Feature Flags

Feature	Default	Description
`std`	Yes	Standard library support. Disable for `no_std` + `alloc`
`json`	No	JSON import/export via serde_json
`varna`	No	Multi-language: validation, lexicon ingestion, script detection
`full`	No	All features

Architecture

shabdakosh
├── arpabet.rs          ARPABET <-> svara Phoneme
├── ipa.rs              IPA <-> svara Phoneme
└── dictionary/
    ├── mod.rs           PronunciationDict (hashbrown + BTreeMap overlay)
    ├── entry.rs         DictEntry, Pronunciation, Region
    ├── validate.rs      inventory validation (varna feature)
    ├── detect.rs        script/language detection (varna feature)
    └── format/
        ├── mod.rs       CMUdict, IPA, JSON import/export
        ├── pls.rs       W3C PLS XML
        └── ssml.rs      SSML <phoneme> tags

See docs/architecture/overview.md for the full architecture overview.

Documentation

Usage Guide — comprehensive examples for all features
Architecture Overview — module map, data flow, design principles
ADRs — architecture decision records

Consumers

shabda — G2P engine (dictionary lookup + rules fallback)
dhvani — Audio engine
vansh — Voice AI shell

License

GPL-3.0-only

shabdakosh 2.0.0