static-lang-word-lists 0.3.1

Runtime decompressed statically-included word lists
# `static-lang-word-lists`

A collection of word lists for various scripts, compressed at build time, baked into the binary, and decompressed lazily at run time.

## Motivation

The goals are to include word lists in the binary, to take up no more space than necessary, and to remain publishable on crates.io (10 MiB size limit).

## Usage

Available on crates.io as `static-lang-word-lists`.

For documentation, please refer to docs.rs.

## How this crate works

A build script downloads the word lists from GitHub, compresses them with Brotli, and embeds the compressed data in the binary, where it is lazily decompressed at run time.
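The lazy, decode-once part of this design can be sketched with `std::sync::OnceLock`. This is a minimal illustration, not the crate's actual code: `LazyWordList`, `identity`, and `DEMO` are hypothetical names, and the decompressor is a plain function pointer standing in for a Brotli decoder so that the sketch needs only the standard library.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for one embedded word list.
pub struct LazyWordList {
    /// Bytes baked into the binary (Brotli-compressed in the real crate).
    compressed: &'static [u8],
    /// Placeholder decompressor; an assumption standing in for a Brotli decoder.
    decompress: fn(&[u8]) -> Vec<u8>,
    /// Empty until first access, then reused for every later call.
    text: OnceLock<String>,
}

impl LazyWordList {
    pub const fn new(compressed: &'static [u8], decompress: fn(&[u8]) -> Vec<u8>) -> Self {
        Self { compressed, decompress, text: OnceLock::new() }
    }

    /// Decompresses on the first call only; later calls return the cached text.
    pub fn text(&self) -> &str {
        self.text.get_or_init(|| {
            let bytes = (self.decompress)(self.compressed);
            // This sketch checks UTF-8 here for safety, even though the
            // build step is described as having validated it already.
            String::from_utf8(bytes).expect("word list should be valid UTF-8")
        })
    }
}

/// Identity "decompressor" used only for this demonstration.
fn identity(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}

/// Demo data; a real list would be embedded with something like `include_bytes!`.
static DEMO: LazyWordList = LazyWordList::new(b"alpha\nbeta\n", identity);
```

Because the cache lives inside the static, a binary that never touches a given list never pays to decompress it.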

### Technical details

Note: adding or removing a word list requires that `egg.py` be re-run.

1. A list of all the supported word lists is generated by `egg.py` (this step is manual and must be run when adding or removing a word list!). Their names are derived from their paths: `data/diffenator/latin.txt` becomes `DIFFENATOR_LATIN` in the final crate.
2. The build script uses the code generated by `egg.py` (called `chicken.rs`) to construct the URLs from which the files are downloaded.
3. Each file is downloaded, validated as UTF-8 (to avoid needing to do this at run time), compressed with Brotli, and saved as a build artifact.
4. The build script generates `word_list_codegen.rs`, which uses the `wordlist!` macro to create the structs through which crate consumers access the data, as well as `map_codegen.rs`, which uses [`phf`](https://lib.rs/crates/phf) to construct a lookup table for the word lists.
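The path-to-name mapping in step 1 can be sketched as follows. The real mapping lives in `egg.py`; this Rust function (`const_name_for_path`, a hypothetical name) only reproduces the one documented example, and the exact rules are assumptions: strip a leading `data/`, drop the extension, upper-case ASCII letters and digits, and replace everything else with `_`.

```rust
/// Derives a constant name from a word list path.
/// The rules here are assumptions inferred from a single documented example.
pub fn const_name_for_path(path: &str) -> String {
    // Drop the leading `data/` directory, if present.
    let relative = path.strip_prefix("data/").unwrap_or(path);

    // Drop everything after the last `.` (the file extension).
    let stem = match relative.rsplit_once('.') {
        Some((stem, _extension)) => stem,
        None => relative,
    };

    // Upper-case ASCII letters and digits; map separators and any other
    // character to `_` so the result is a valid Rust identifier fragment.
    stem.chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() {
                c.to_ascii_uppercase()
            } else {
                '_'
            }
        })
        .collect()
}
```

With these rules, `const_name_for_path("data/diffenator/latin.txt")` returns `"DIFFENATOR_LATIN"`, matching the example in step 1.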

## Developing

To build using local files, set the `STATIC_LANG_WORD_LISTS_LOCAL` environment variable.
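A build script might check such a switch roughly as follows. This is a sketch under assumptions, not the crate's actual logic: the `Source` enum and both functions are hypothetical, and it assumes that any value, even an empty one, counts as "set". The `cargo:rerun-if-env-changed` line is a standard Cargo build-script directive that re-runs the script when the variable changes.

```rust
use std::env;
use std::ffi::OsString;

/// Name of the environment variable documented above.
const LOCAL_VAR: &str = "STATIC_LANG_WORD_LISTS_LOCAL";

/// Where the build script reads word lists from (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Source {
    /// Read files already present on disk.
    Local,
    /// Download files over the network.
    Remote,
}

/// Pure decision function, kept separate so it can be tested
/// without touching the process environment.
pub fn source_from(value: Option<OsString>) -> Source {
    match value {
        // Assumption: the variable's presence alone selects local files.
        Some(_) => Source::Local,
        None => Source::Remote,
    }
}

/// Reads the variable and tells Cargo to re-run the build script
/// whenever its value changes.
pub fn source_from_env() -> Source {
    println!("cargo:rerun-if-env-changed={LOCAL_VAR}");
    source_from(env::var_os(LOCAL_VAR))
}
```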

## Credits

Diffenator wordlists are from [diffenator2](https://github.com/googlefonts/diffenator2). [Apache-2.0](https://github.com/googlefonts/diffenator2/blob/69a873d79811e957aa5824e04d4859717f206c47/LICENSE.txt) licensed.

Emoji wordlists are from [unicode.org](https://home.unicode.org/). [Unicode](https://www.unicode.org/license.txt) licensed.

AOSP word lists are from [aosp-test-texts](https://github.com/googlefonts/aosp-test-texts), using the files produced by `scripts/extract_words.py`. [Apache-2.0](https://github.com/googlefonts/aosp-test-texts/blob/c8134e8ae2be52feb842df5cb5fa03d29e3df06f/LICENSE) licensed.