shabdakosh 2.0.0

shabdakosh — Pronunciation dictionary with ARPABET/CMUdict support for svara phonemes
Documentation

shabdakosh

shabdakosh (Sanskrit: dictionary) — Pronunciation dictionary crate for AGNOS.

Maps words to svara Phoneme sequences using a 10,600+ entry English dictionary derived from the CMU Pronouncing Dictionary. Multi-language support via optional varna integration.

Features

  • 10,600+ entry English dictionary generated at compile time from CMUdict (zero runtime parsing)
  • ARPABET mapping — bidirectional conversion between ARPABET notation and svara phonemes
  • IPA mapping — bidirectional IPA-Phoneme conversion with greedy parser
  • User overlay — application-specific entries that override the base dictionary
  • Variant pronunciations — heteronyms (read, live, wind) with frequency and region metadata
  • Import/export — CMUdict, IPA, JSON, W3C PLS, SSML <phoneme> tags
  • Dictionary operations — merge (override/conservative), diff (added/removed/changed)
  • Multi-language (varna feature) — inventory validation, lexicon ingestion, script/language detection
  • no_std compatible — works with alloc, no standard library required

Quick Start

use shabdakosh::PronunciationDict;

let dict = PronunciationDict::english();
assert!(dict.lookup("hello").is_some());
assert!(dict.len() >= 10000);

User Overlay

Override or extend the built-in dictionary with application-specific pronunciations:

use shabdakosh::PronunciationDict;
use svara::phoneme::Phoneme;

let mut dict = PronunciationDict::english();

// Add a custom word
dict.insert_user("agnos", &[
    Phoneme::VowelAsh, Phoneme::PlosiveG,
    Phoneme::NasalN, Phoneme::VowelO, Phoneme::FricativeS,
]);

// User entries take precedence over base entries
assert!(dict.lookup("agnos").is_some());

Import/Export

use shabdakosh::dictionary::format;

// Parse CMUdict format
let input = "hello  HH AH0 L OW1\nworld  W ER1 L D\n";
let dict = format::parse_cmudict(input).unwrap();

// Export back to CMUdict format
let output = format::to_cmudict(&dict);

Also supported: IPA text, JSON (json feature), W3C PLS XML, SSML <phoneme> tags.

Multi-Language (varna feature)

use shabdakosh::PronunciationDict;

// Ingest a varna lexicon
let lexicon = varna::lexicon::swadesh::by_code("es").unwrap();
let dict = PronunciationDict::from_lexicon(&lexicon);
assert_eq!(dict.language(), Some("es"));

// Detect script from Unicode
use shabdakosh::dictionary::detect;
assert_eq!(detect::detect_script("नमस्ते"), Some("Deva"));

Feature Flags

Feature Default Description
std Yes Standard library support. Disable for no_std + alloc
json No JSON import/export via serde_json
varna No Multi-language: validation, lexicon ingestion, script detection
full No All features

Architecture

shabdakosh
├── arpabet.rs          ARPABET <-> svara Phoneme
├── ipa.rs              IPA <-> svara Phoneme
└── dictionary/
    ├── mod.rs           PronunciationDict (hashbrown + BTreeMap overlay)
    ├── entry.rs         DictEntry, Pronunciation, Region
    ├── validate.rs      inventory validation (varna feature)
    ├── detect.rs        script/language detection (varna feature)
    └── format/
        ├── mod.rs       CMUdict, IPA, JSON import/export
        ├── pls.rs       W3C PLS XML
        └── ssml.rs      SSML <phoneme> tags

See docs/architecture/overview.md for the full architecture overview.

Documentation

Consumers

  • shabda — G2P engine (dictionary lookup + rules fallback)
  • dhvani — Audio engine
  • vansh — Voice AI shell

License

GPL-3.0-only