static-lang-word-lists 0.2.2

Runtime decompressed statically-included word lists

static-lang-word-lists

A collection of word lists for various scripts, compressed at build time, baked into the binary, and decompressed lazily at run time.

Motivation

The goals are to include word lists in the binary, to take up no more space than necessary, and to stay publishable on crates.io (10 MiB size limit).

Usage

Available on crates.io as static-lang-word-lists.

For documentation, please refer to docs.rs.

How this crate works

A build script downloads the word lists from GitHub, compresses them with Brotli, and embeds the compressed data in the binary. The data is decompressed lazily at run time.
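The lazy decompression described above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the names LazyWordList, words, is_decoded, and decompress are hypothetical, and decompress is a placeholder that only copies its input so the example needs nothing beyond the standard library. In the real crate that step would be Brotli decompression.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for an embedded word list: holds the bytes baked
/// into the binary and decodes them at most once, on first access.
pub struct LazyWordList {
    compressed: &'static [u8],
    decoded: OnceLock<String>,
}

impl LazyWordList {
    /// `const` so that instances can be placed in a `static`.
    pub const fn new(compressed: &'static [u8]) -> Self {
        Self {
            compressed,
            decoded: OnceLock::new(),
        }
    }

    /// Decode on first call; later calls reuse the cached string.
    pub fn text(&self) -> &str {
        self.decoded
            .get_or_init(|| decompress(self.compressed))
            .as_str()
    }

    /// Iterate over the words, assuming one word per line.
    pub fn words(&self) -> impl Iterator<Item = &str> {
        self.text().lines()
    }

    /// Whether the data has been decoded yet.
    pub fn is_decoded(&self) -> bool {
        self.decoded.get().is_some()
    }
}

/// Placeholder for the real decompression step (assumption: the crate uses
/// Brotli here). This version just copies the bytes into a `String`.
fn decompress(bytes: &[u8]) -> String {
    String::from_utf8(bytes.to_vec()).expect("data is validated as UTF-8 at build time")
}
```

With this shape, a word list that is never accessed costs only its compressed size in the binary, and the decoding work is paid once per list rather than on every access.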

Technical details

Note: adding or removing a word list requires that egg.py be re-run.

  1. A list of all the supported word lists is generated by egg.py (this step is manual and must be run when adding or removing a word list!). Their names are derived from their paths: data/diffenator/latin.txt becomes DIFFENATOR_LATIN in the final crate
  2. The build script uses the code generated by egg.py (called chicken.rs) to construct the URLs of the files to download
  3. Each file is downloaded, validated as UTF-8 (so that this does not need to happen at run time), compressed with Brotli, and saved as a build artifact
  4. The build script generates word_list_codegen.rs, which uses the wordlist! macro to define the structs that crate consumers use to access the data, as well as map_codegen.rs, which uses phf to construct a lookup table for the word lists
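The naming rule from step 1 can be illustrated with a short sketch. The exact rule egg.py applies is not shown here, so this is an assumption generalised from the single example above: drop the leading data/ directory and the .txt extension, replace every other non-alphanumeric character with an underscore, and upper-case the rest. The function name const_name is hypothetical.

```rust
/// Derive a constant-style name from a word list path.
/// Assumed rule (inferred from `data/diffenator/latin.txt` -> `DIFFENATOR_LATIN`):
/// strip the `data/` prefix and `.txt` suffix, then map separators to `_`
/// and letters to upper case.
fn const_name(path: &str) -> String {
    let trimmed = path.strip_prefix("data/").unwrap_or(path);
    let trimmed = trimmed.strip_suffix(".txt").unwrap_or(trimmed);
    trimmed
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() {
                c.to_ascii_uppercase()
            } else {
                '_'
            }
        })
        .collect()
}
```

Under these assumptions, const_name("data/diffenator/latin.txt") yields DIFFENATOR_LATIN, matching the example in step 1.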

Developing

To build using local files, set the STATIC_LANG_WORD_LISTS_LOCAL environment variable
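A build script could branch on this variable roughly as sketched below. How the real build script interprets the variable is not documented here, so this sketch assumes that any value, even an empty one, selects local files. The Source enum and both function names are hypothetical.

```rust
use std::env;
use std::ffi::OsString;

/// Where the build script reads word lists from (hypothetical type).
#[derive(Debug, PartialEq)]
enum Source {
    /// Use files already present in the local checkout.
    Local,
    /// Download the files from GitHub.
    Download,
}

/// Assumption: the variable only needs to be set; its value is ignored.
fn choose_source(local_flag: Option<OsString>) -> Source {
    match local_flag {
        Some(_) => Source::Local,
        None => Source::Download,
    }
}

/// Read the flag from the environment of the running build script.
fn source_from_env() -> Source {
    choose_source(env::var_os("STATIC_LANG_WORD_LISTS_LOCAL"))
}
```

Keeping the decision in a function that takes the value as an argument makes it testable without mutating the process environment.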

Credits

Diffenator wordlists are from diffenator2. Apache-2.0 licensed.

Emoji wordlists are from unicode.org. Unicode licensed.

AOSP word lists are from aosp-test-texts, using the files produced by scripts/extract_words.py. Apache-2.0 licensed.