# glossa

[![glossa.crate](https://img.shields.io/crates/v/glossa.svg?logo=rust&logoColor=lightsalmon&label=glossa)](https://crates.io/crates/glossa)

[![Documentation](https://docs.rs/glossa/badge.svg)](https://docs.rs/glossa)
[![Apache-2 licensed](https://img.shields.io/crates/l/glossa.svg?logo=apache)](./License)

<!-- Language -->
<details>
<summary>
<a href="Readme-zh.md">
<img alt="Language/语言" src="./svg/language.svg"/>
</a>
</summary>

- en: English
- [zh: 中文](Readme-zh.md)
- [zh-Hant: 繁體中文](Readme-zh-Hant.md)

</details>

<!-- TOC -->
<details open>
<summary>
<img alt="Table of Contents" src="./svg/toc/toc.svg" />
</summary>

- [Locale Fallback Chain](#locale-fallback-chain)
  - [Example: zh-Hans-HK](#example-zh-hans-hk)
  - [Example: en-AU](#example-en-au)
  - [Example: gsw-LI](#example-gsw-li)
- [Practical Usage](#practical-usage)
  - [Code Generation](#code-generation)
  - [LocaleContext](#localecontext)
  - [Trait Example](#trait-example)
  - [Bilingual Example](#bilingual-example)

</details>

<!--  -->

## Locale Fallback Chain

The core functionality of the glossa crate:

- Generates a fallback chain: an array of the available locales, ranked by their **similarity** to the current locale.
  - In principle, locales with higher similarity come first.

Q: Why is fallback necessary?

A:
When localized text for the current locale is missing, falling back to a more familiar language (e.g., another variant of the current language) ensures a better user experience.

> A person may master multiple languages (or different variants of the same language).

Assume the current locale is `pt-PT` (Português, Portugal), and the available locales are `pt-PT`, `pt` (Português, Brasil), `es-419` (Español, Latinoamérica), and `en`.

In this case, the i18n library should retrieve localized text in the order `[pt-PT, pt, en]`, not `[pt-PT, en]`.

Ignoring language similarity and directly falling back to `en` not only reduces localization (L10n) coverage but may also increase cognitive load for users.
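
The lookup order described above can be sketched without glossa at all. The snippet below is a minimal illustration using only the standard library; `text`, `first_available`, and the sample strings are hypothetical names and data made up for this example, not part of the glossa API.

```rust
/// Hypothetical L10n table for this sketch:
/// `pt-PT` only provides "bye", while `pt` and `en` provide both keys.
fn text(locale: &str, key: &str) -> Option<&'static str> {
  match (locale, key) {
    ("pt-PT", "bye") => Some("Adeus"),
    ("pt", "bye") => Some("Tchau"),
    ("pt", "hello") => Some("Olá"),
    ("en", "bye") => Some("Goodbye"),
    ("en", "hello") => Some("Hello"),
    _ => None,
  }
}

/// Walks the chain in priority order and returns the first locale
/// that has an entry for `key`, together with its text.
fn first_available<'a>(
  chain: &[&'a str],
  key: &str,
) -> Option<(&'a str, &'static str)> {
  chain
    .iter()
    .find_map(|&locale| text(locale, key).map(|t| (locale, t)))
}
```

With the chain `["pt-PT", "pt", "en"]`, a key missing from `pt-PT` is served from `pt`; with `["pt-PT", "en"]`, the same key jumps straight to English.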

### Example: zh-Hans-HK

Assume the current locale is `zh-Hans-HK`, and the available locales are `zh-Hant-MO`, `zh-SG`, `ru`, `zh-Hant`, `fr`, `zh`, `ar`, `zh-HK`, `en-001`, `lzh`.

After calling `try_init_chain()`, the generated locale chain is: `["zh", "zh-SG", "zh-HK", "zh-Hant-MO", "zh-Hant"]`.

When the log level is `debug` or `trace`, you can see `[... DEBUG glossa::fallback] ...<(id, score)>`:

```rust
[
  ("zh", 37),       // zh-Hans-CN
  ("zh-SG", 36),    // zh-Hans-SG
  ("zh-HK", 35),    // zh-Hant-HK
  ("zh-Hant-MO", 31),
  ("zh-Hant", 28)   // zh-Hant-TW
]
```

> Higher scores indicate higher priority.

- Exact match: full score (50 points).
- Partial matches:
  - Same language: +20 points.
    - Since the current language is `zh` (Chinese), and no other languages are included in the built-in rules, only `zh` variants appear in the chain.
    - Theoretically, `lzh` (Classical Chinese) shares some similarity with modern Chinese, but it is not included in the built-in fallback rules for `zh-Hans-HK`.
  - Same script: +15 points.
    - The current script is `Hans` (Simplified). `Hans` scores higher than `Hant`.
      - `zh-HK` is essentially `zh-Hant-HK`.
        - Since `Hans` scores higher than `Hant`, and `zh-Hans` resources exist, `zh-HK` does not have the highest score.
  - Matches built-in fallback rules:
    - Full match: +9 points.
    - Partial match (language + script): +6 points.
  - Same region: +4 points.
    - Comparing `zh-Hant` (zh-Hant-TW), `zh-Hant-MO`, and `zh-HK` (zh-Hant-HK):
      - `zh-HK` shares the same region (HK) as the current locale, earning +4 points.
      - `zh-Hant` and `zh-Hant-MO` do not share the HK region, so no bonus.
  - Proximity bonus:
    - Same sub-region (e.g., East Asia): +2 points.
    - Same continent (e.g., Asia): +1 point.
    - Comparing `zh` (zh-Hans-CN) and `zh-SG` (zh-Hans-SG):
      - HK (Hong Kong SAR, China) and CN (Mainland China) are both in East Asia (+2).
      - SG (Singapore) is in Southeast Asia, sharing the same continent (Asia) with HK (+1).
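
As a worked example of the arithmetic, the sketch below adds up the bonuses for `zh` and `zh-SG`. It assumes the bonuses are simply summed; `Candidate`, `Proximity`, and `partial_score` are hypothetical names for this illustration and do not mirror glossa's internal types.

```rust
// Point values quoted in the list above.
const SAME_LANGUAGE: u8 = 20;
const SAME_SCRIPT: u8 = 15;
const SAME_REGION: u8 = 4;

/// Hypothetical proximity classes for this sketch.
#[derive(Clone, Copy)]
enum Proximity {
  SameSubRegion, // +2
  SameContinent, // +1
  Elsewhere,     // +0
}

/// Hypothetical description of how one candidate relates to the current locale.
/// The candidate is assumed to share the current language already.
struct Candidate {
  same_script: bool,
  rule_bonus: u8, // 0, 6 (language + script) or 9 (full match)
  same_region: bool,
  proximity: Proximity,
}

/// Assumption: the individual bonuses are added together.
fn partial_score(c: &Candidate) -> u8 {
  let proximity = match c.proximity {
    Proximity::SameSubRegion => 2,
    Proximity::SameContinent => 1,
    Proximity::Elsewhere => 0,
  };
  SAME_LANGUAGE
    + u8::from(c.same_script) * SAME_SCRIPT
    + c.rule_bonus
    + u8::from(c.same_region) * SAME_REGION
    + proximity
}
```

Under that assumption, `zh` (same language, same script, same sub-region) yields 20 + 15 + 2 = 37, and `zh-SG` (same language, same script, same continent) yields 20 + 15 + 1 = 36, matching the logged scores.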

### Example: en-AU

Assume the current locale is `en-AU`, with extensive localization resources for various regions (including sparsely populated islands).

From a linguistic similarity perspective, `en-NZ` (New Zealand English) is closer to `en-AU` (Australian English) than `en-GB` (British English).

However, the chain generated by glossa is not guaranteed to reflect this: in the scores below, `en-GB` ranks above `en-NZ`.

```rust
// <(id, score)>:
[
  ("en-AU", 50), ("en-GB", 44), ("en-CC", 43), ("en-CX", 43), ("en-NF", 43),
  ("en-NZ", 43), ("en-UM", 42), ("en-CK", 42), ("en-DG", 42), ("en-FJ", 42),
  ("en-FM", 42), ("en-KI", 42), ("en-NR", 42), ("en-NU", 42), ("en-PG", 42),
  ("en-PN", 42), ("en-PW", 42), ("en-SB", 42), ("en-TK", 42), ("en-TO", 42),
  ("en-TV", 42), ("en-VU", 42), ("en-WS", 42), ("en-AS", 42), ("en-GU", 42),
  ("en-MH", 42), ("en-MP", 42), ("en-US", 22), ...
]
```

### Example: gsw-LI

> `gsw` is Swiss German (Schwiizertüütsch), while `de` is Standard German (Deutsch).

```rust
use glossa::{
  error::GlossaError, fallback::conv_to_str_chain,
  try_init_chain_from_slice,
};

let chain = try_init_chain_from_slice(
  // current:
  "gsw-LI",

  // all_locales:
  &[
    "en", "es", "pt", "zh", "gsw", "gsw-FR", "gsw-LI", "de", "de-AT", "de-BE", "de-CH", "de-IT",
    "de-LI", "de-LU",
  ],
)?;
// <(id, score)>:
// [ ("gsw-LI", 50), ("gsw", 37), ("gsw-FR", 37), ("de-LI", 27), ("de", 26),
//   ("de-AT", 23), ("de-BE", 23), ("de-CH", 23), ("de-LU", 23), ("de-IT", 22) ]

let v = conv_to_str_chain(&chain);

assert_eq!(
  v.as_ref(),
  [
    "gsw-LI", "gsw", "gsw-FR", "de-LI", "de", "de-AT", "de-BE", "de-CH",
    "de-LU", "de-IT",
  ]
);
```

## Practical Usage

> Implement the lookup logic according to the type of **localization resource (L10n Map)** that `glossa-codegen` generates.

### Code Generation

```rust
use glossa_codegen::{Generator, L10nResources, Visibility, generator::MapType};

let generator = Generator::default()
  .with_resources(L10nResources::new("locales").with_include_map_names(["yes-no"]))
  .with_visibility(Visibility::Pub);
```

The `Generator` can emit several kinds of output.
If you invoke `generator.output_match_fn_all_in_one_without_map_name(MapType::Regular)?`, the generated code will resemble:

```rust
pub const fn map(language: &[u8], key: &[u8]) -> &'static str {
  match (language, key) {
    (b"cs", b"cancel") => r#####"Zrušit"#####,
    (b"cs", b"no") => r#####"Ne"#####,
    (b"cs", b"yes") => r#####"Ano"#####,
    (b"de", b"cancel") => r#####"Abbrechen"#####,
    (b"de", b"no") => r#####"Nein"#####,
    (b"de", b"yes") => r#####"Ja"#####,
    (b"en", b"cancel") => r#####"Cancel"#####,
    (b"en", b"no") => r#####"No"#####,
    (b"en", b"ok") => r#####"OK"#####,
    (b"en", b"yes") => r#####"Yes"#####,
    (b"es", b"cancel") => r#####"Cancelar"#####,
    (b"es", b"ok") => r#####"Aceptar"#####,
    (b"es", b"yes") => r#####"Sí"#####,
    (b"fr", b"cancel") => r#####"Annuler"#####,
    (b"fr", b"no") => r#####"Non"#####,
    (b"fr", b"yes") => r#####"Oui"#####,
    (b"ja", b"cancel") => r#####"取消"#####,
    (b"ja", b"no") => r#####"いいえ"#####,
    (b"ja", b"ok") => r#####"了解"#####,
    (b"ja", b"yes") => r#####"はい"#####,
    (b"ko", b"cancel") => r#####"취소"#####,
    (b"ko", b"no") => r#####"아니오"#####,
    (b"ko", b"ok") => r#####"확인"#####,
    (b"ko", b"yes") => r#####"예"#####,
    (b"ru", b"no") => r#####"Нет"#####,
    (b"ru", b"yes") => r#####"Да"#####,
    (b"zh-Hant", b"cancel") => r#####"取消"#####,
    (b"zh-Hant", b"no") => r#####"否"#####,
    (b"zh-Hant", b"ok") => r#####"確定"#####,
    (b"zh-Hant", b"yes") => r#####"是"#####,
    (b"zh-Latn-CN", b"cancel") => r#####"QuXiao"#####,
    (b"zh-Latn-CN", b"no") => r#####"Fou"#####,
    (b"zh-Latn-CN", b"ok") => r#####"QueDing"#####,
    (b"zh-Latn-CN", b"yes") => r#####"Shi"#####,
    _ => "",
  }
}
```

Invoking `generator.output_locales_fn(MapType::Regular, true)?` generates:

```rust
// super: use glossa_shared::lang_id;

pub const fn all_locales() -> [super::lang_id::LangID; 10] {
  #[allow(unused_imports)]
  use super::lang_id::RawID;
  use super::lang_id::consts::*;
  [
    lang_id_cs(),
    lang_id_de(),
    lang_id_en(),
    lang_id_es(),
    lang_id_fr(),
    lang_id_ja(),
    lang_id_ko(),
    lang_id_ru(),
    lang_id_zh_hant(),
    lang_id_zh_pinyin(),
  ]
}
```

### LocaleContext

Next, implement the logic that looks up localized text, based on the types generated by codegen.
As shown above, codegen produces a `match_fn`.

Given the function definition: `const fn map(language: &[u8], key: &[u8]) -> &'static str`, the lookup logic is:

```rust
let lookup = |(language, key)| match map(language, key) {
  "" => None,
  s => Some(s),
};
```

If the generated function uses `map(language, map_name, key)`, adjust the lookup accordingly:

```rust
let lookup = |(language, map_name, key)| match map(language, map_name, key) {
  "" => None,
  s => Some(s),
};
```

For binary serialized data (e.g., bincode), deserialize it into a `HashMap` or `BTreeMap`,
then use `.get()` for lookups.

```rust
let map = glossa_shared::decode::file::decode_file_to_maps(path)?;
let lookup = |language, tuple_key| {
  map
    .get(language)?
    .get(&tuple_key)
};
```

### Trait Example

```rust
use glossa::{LocaleContext, traits::ChainProvider};

trait GetL10nText: ChainProvider {
  fn try_get_by_key<'t>(&self, key: &[u8]) -> Option<&'t str> {
    let lookup = |(language, key)| match map(language, key) {
      "" => None,
      s => Some(s),
    };

    self
      .provide_chain()?
      .iter()
      .map(|id| (id.as_bytes(), key))
      .find_map(lookup)
  }
}

impl GetL10nText for LocaleContext {}

#[test]
pub(crate) fn print_l10n_text() {
  let new_ctx = || LocaleContext::default().with_all_locales(all_locales());

  // #[cfg(any(target_os = "macos", target_os = "linux"))]
  let set_env_lang = |value| unsafe { std::env::set_var("LANG", value) };

  let display = |ctx: &LocaleContext, key: &str| {
    let text = ctx
      .try_get_by_key(key.as_bytes())
      .unwrap_or_else(|| panic!("{}", glossa::Error::new_text_not_found(key)));
    println!("{key}: {text}")
  };

  {
    // set_env_lang("gsw_CH.UTF-8");
    //
    let ctx = new_ctx()
      .with_current_locale(Some(glossa_shared::lang_id::consts::lang_id_gsw()));
    // [("de", 26)]

    for key in ["yes", "no", "ok", "cancel"] {
      display(&ctx, key)
    }
  }
  // Output:
  //   yes: Ja
  //   no: Nein
  //   ok: OK
  //   cancel: Abbrechen

  {
    set_env_lang("zh_MO.UTF-8");
    // new_ctx();                           // current_locale =>  get_static_locale()
    let ctx = new_ctx().with_current_locale(None);

    log::debug!("\n---\n--- current locale => zh-MO");

    // [("zh-Hant", 43), ("zh-Latn-CN", 22)]
    for key in ["yes", "no", "ok", "cancel", "confirm"] {
      display(&ctx, key)
    }
  }
  // Output:
  //   yes: 是
  //   no: 否
  //   ok: 確定
  //   cancel: 取消
  //   confirm: Confirm
}
```

### Bilingual Example

**Scenario 1**:

In resource-constrained environments, Chinese characters may fail to display properly.
In such cases, we can switch the localization language to **zh-pinyin** (Chinese romanization).

However, because Mandarin Chinese has many homophones with different meanings, text written only in Pinyin (without Chinese characters) can be ambiguous in certain contexts.

This is precisely where the **bilingual functionality** shines brightly ✨!

> The "bilingual functionality" must be **manually implemented**.

---

```rust
#[ignore]
#[test]
// en-GB, zh-pinyin
fn test_bilingual() {
  use glossa_shared::lang_id::consts::{lang_id_en_gb, lang_id_zh_pinyin};

  let new_ctx = |id| {
    LocaleContext::default()
      .with_current_locale(Some(id))
      .with_all_locales(all_locales())
  };
  let zh_pinyin_ctx = new_ctx(lang_id_zh_pinyin());
  let en_gb_ctx = new_ctx(lang_id_en_gb());

  fn get_text<'a>(ctx: &LocaleContext, key: &str) -> Option<&'a str> {
    let key_bytes = key.as_bytes();
    let lookup = |language| match map(language, key_bytes) {
      "" => None,
      x => Some(x),
    };

    ctx
      .get_or_try_init_chain()?
      .iter()
      .map(|id| id.as_bytes())
      .find_map(lookup)
  }

  let get_cancel_text = |ctx| get_text(ctx, "cancel").unwrap_or_default();

  let zh_pinyin_text = get_cancel_text(&zh_pinyin_ctx);
  let en_gb_text = get_cancel_text(&en_gb_ctx);

  let text = if zh_pinyin_text == en_gb_text {
    zh_pinyin_text.into()
  } else {
    glossa_shared::fmt_compact!("{en_gb_text}; {zh_pinyin_text}")
  };

  assert_eq!(text, "Cancel; QuXiao")
}
```
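
The combining step can also be factored into a small standalone helper. This is only a sketch using the standard library; `bilingual` is a hypothetical name, and the handling of empty strings is an extra assumption that the test above does not make.

```rust
use std::borrow::Cow;

/// Joins two translations as "primary; secondary".
/// Identical texts are shown once; an empty side is skipped (assumption).
fn bilingual<'a>(primary: &'a str, secondary: &'a str) -> Cow<'a, str> {
  if primary == secondary || secondary.is_empty() {
    Cow::Borrowed(primary)
  } else if primary.is_empty() {
    Cow::Borrowed(secondary)
  } else {
    Cow::Owned(format!("{primary}; {secondary}"))
  }
}
```

Returning `Cow` avoids an allocation whenever one of the inputs can be used as is.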