# `static-lang-word-lists`
A collection of word lists for various scripts, compressed at build time, baked into the binary, and decompressed lazily at run time.
## Motivation
The goals are to include word lists directly in the binary, take up no more space than necessary, and remain publishable on crates.io (which has a 10 MiB size limit).
## Usage
Available on crates.io as `static-lang-word-lists`.
For documentation, please refer to docs.rs.
## How this crate works
A build script downloads the word lists from GitHub, compresses them with Brotli, and embeds the compressed data in the binary, where it is lazily decompressed at run time.
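The lazy part can be pictured with a minimal sketch using only the standard library. The names `LazyWordList`, `decode`, and `EXAMPLE` are hypothetical and not part of this crate's API, and `decode` is a stand-in that merely copies the bytes; the real crate performs Brotli decompression at that point.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a word list embedded in the binary.
pub struct LazyWordList {
    /// Bytes baked in at build time (Brotli-compressed in the real crate).
    stored: &'static [u8],
    /// Filled on first access and reused afterwards.
    text: OnceLock<String>,
}

impl LazyWordList {
    pub const fn new(stored: &'static [u8]) -> Self {
        Self { stored, text: OnceLock::new() }
    }

    /// Returns the word list text, decoding it on the first call only.
    pub fn text(&self) -> &str {
        self.text.get_or_init(|| decode(self.stored))
    }
}

/// Placeholder for the decompression step (assumption: the real crate
/// runs a Brotli decoder here). The data is expected to be valid UTF-8
/// because it was validated at build time.
fn decode(stored: &[u8]) -> String {
    String::from_utf8(stored.to_vec()).expect("validated as UTF-8 at build time")
}

/// Example static, analogous in spirit to a generated word list constant.
pub static EXAMPLE: LazyWordList = LazyWordList::new(b"alpha\nbeta\ngamma\n");
```

Calling `EXAMPLE.text()` repeatedly returns the same cached string, so the decoding cost is paid at most once, and not at all for word lists that are never accessed.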
### Technical details
Note: adding or removing a word list requires that `egg.py` be re-run.
1. A list of all the supported word lists is generated by `egg.py` (this step is manual and must be run when adding or removing a word list!). Their names are derived from their paths: `data/diffenator/latin.txt` becomes `DIFFENATOR_LATIN` in the resulting crate
2. The build script uses the code generated by `egg.py` (called `chicken.rs`) to construct URLs to download files
3. Each file is downloaded, validated as UTF-8 (so that this does not need to happen at run time), compressed with Brotli, and saved as a build artifact
4. The build script generates `word_list_codegen.rs`, which uses the `wordlist!` macro to define the structs through which crate consumers access the data, as well as `map_codegen.rs`, which uses [`phf`](https://lib.rs/crates/phf) to construct a lookup table for the word lists
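The naming rule from step 1 can be illustrated with a small sketch. The actual logic lives in `egg.py`; the function `constant_name` below is hypothetical, and its handling of characters other than ASCII letters and digits (mapping them to `_`) is an assumption rather than documented behaviour.

```rust
/// Hypothetical helper mirroring the path-to-name rule described in step 1.
pub fn constant_name(path: &str) -> String {
    // Drop the leading `data/` directory and the `.txt` extension if present.
    let trimmed = path.strip_prefix("data/").unwrap_or(path);
    let trimmed = trimmed.strip_suffix(".txt").unwrap_or(trimmed);

    // Upper-case ASCII letters and digits are kept as-is after upper-casing.
    // Assumption: everything else, including `/`, becomes `_`.
    trimmed
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() {
                c.to_ascii_uppercase()
            } else {
                '_'
            }
        })
        .collect()
}
```

With this sketch, `constant_name("data/diffenator/latin.txt")` yields `DIFFENATOR_LATIN`, matching the example in step 1.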
## Developing
To build using local files, set the `STATIC_LANG_WORD_LISTS_LOCAL` environment variable
## Credits
Diffenator wordlists are from [diffenator2](https://github.com/googlefonts/diffenator2). [Apache-2.0](https://github.com/googlefonts/diffenator2/blob/69a873d79811e957aa5824e04d4859717f206c47/LICENSE.txt) licensed.
Emoji wordlists are from [unicode.org](https://home.unicode.org/). [Unicode](https://www.unicode.org/license.txt) licensed.
AOSP word lists are from [aosp-test-texts](https://github.com/googlefonts/aosp-test-texts), using the files produced by `scripts/extract_words.py`. [Apache-2.0](https://github.com/googlefonts/aosp-test-texts/blob/c8134e8ae2be52feb842df5cb5fa03d29e3df06f/LICENSE) licensed.