static-lang-word-lists
A collection of word lists for various scripts, compressed at build time, baked into the binary, and decompressed lazily at run time.
Motivation
The goals are to include word lists in the binary, take up no more space than necessary, and remain publishable on crates.io (10 MiB size limit).
Usage
On crates.io as static-lang-word-lists
For documentation, please refer to docs.rs
How this crate works
A build script downloads the word lists from GitHub, compresses them with Brotli, and embeds the compressed data in the binary, where it is lazily decompressed at runtime.
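The lazy, one-time decompression described here can be sketched with only the standard library. This is a minimal illustration, not the crate's actual code: the names `LazyWordList`, `decode_stand_in`, `DECODE_CALLS`, and `EXAMPLE` are hypothetical, and the stand-in decoder merely copies bytes where the real crate would run a Brotli decoder.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the stand-in decoder runs, to show that it runs only once.
pub static DECODE_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for Brotli decompression (assumption: the real crate calls a
/// Brotli decoder here). It only copies the embedded bytes into a String.
fn decode_stand_in(compressed: &[u8]) -> String {
    DECODE_CALLS.fetch_add(1, Ordering::SeqCst);
    String::from_utf8_lossy(compressed).into_owned()
}

/// Hypothetical word list type: embedded bytes plus a lazily filled cache.
pub struct LazyWordList {
    compressed: &'static [u8],
    decoded: OnceLock<String>,
}

impl LazyWordList {
    pub const fn new(compressed: &'static [u8]) -> Self {
        Self {
            compressed,
            decoded: OnceLock::new(),
        }
    }

    /// Decodes on first access and returns the cached text afterwards.
    pub fn text(&self) -> &str {
        self.decoded
            .get_or_init(|| decode_stand_in(self.compressed))
    }

    /// Iterates over the words, one per line.
    pub fn words(&self) -> impl Iterator<Item = &str> {
        self.text().lines()
    }
}

/// In a real build the bytes would come from a build artifact
/// (for example via `include_bytes!`); a literal is used here instead.
pub static EXAMPLE: LazyWordList = LazyWordList::new(b"alpha\nbeta\ngamma\n");
```

Calling `EXAMPLE.words()` several times triggers the decoder only on the first call. Later calls read the cached `String`, so a word list that is never accessed costs nothing beyond its compressed size.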
Technical details
Note: adding or removing a wordlist requires that egg.py be re-run
- A list of all the supported word lists is generated by egg.py (this step is manual and must be run when adding or removing a word list!). Their names are generated based on their path: data/diffenator/latin.txt becomes DIFFENATOR_LATIN in the end crate
- The build script uses the code generated by egg.py (called chicken.rs) to construct URLs to download files
- Each file is downloaded, validated as UTF-8 (to avoid needing to do this at binary runtime), compressed with Brotli, and saved as a build artifact
- The build script generates word_list_codegen.rs, which uses the wordlist! macro to make the structs for accessing the data for crate consumers, as well as map_codegen.rs, which uses phf to construct a lookup table for the word lists
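The naming rule and the build-time UTF-8 check from the steps above can be sketched as follows. This is an illustration only. The actual name generation lives in egg.py (a Python script), and its exact rule is not documented here. The sketch assumes that the leading data directory and the file extension are dropped, and that the remaining path parts are upper-cased and joined with underscores. The function names `const_name_from_path` and `validate_utf8` are hypothetical.

```rust
use std::path::Path;
use std::string::FromUtf8Error;

/// Derives a constant name from a word list path.
/// Assumed rule: drop the leading `data` directory and the file extension,
/// upper-case each remaining part, map other characters to `_`, join with `_`.
pub fn const_name_from_path(path: &str) -> String {
    let without_ext = Path::new(path).with_extension("");
    without_ext
        .components()
        .filter_map(|c| c.as_os_str().to_str())
        .skip_while(|part| *part == "data")
        .map(|part| {
            part.chars()
                .map(|ch| {
                    if ch.is_ascii_alphanumeric() {
                        ch.to_ascii_uppercase()
                    } else {
                        '_'
                    }
                })
                .collect::<String>()
        })
        .collect::<Vec<_>>()
        .join("_")
}

/// Validates downloaded bytes as UTF-8 at build time, so that no
/// validation is needed when the binary runs.
pub fn validate_utf8(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}
```

With this rule, `const_name_from_path("data/diffenator/latin.txt")` yields `DIFFENATOR_LATIN`, matching the example in the list. Rejecting invalid bytes in the build script means a bad upstream file fails the build instead of surfacing as a runtime error for crate consumers.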
Developing
To build using local files, set the STATIC_LANG_WORD_LISTS_LOCAL environment variable
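A build script could branch on this variable roughly as sketched below. This is not the crate's actual build script. The sketch assumes that any value, even an empty one, counts as "set", and the names `Source`, `choose_source`, and `source_from_env` are hypothetical.

```rust
use std::env;
use std::ffi::OsString;

/// Name of the environment variable mentioned above.
pub const LOCAL_VAR: &str = "STATIC_LANG_WORD_LISTS_LOCAL";

/// Where the build script reads the word lists from (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Source {
    LocalFiles,
    Download,
}

/// Assumption: the variable only needs to be present; its value is ignored.
pub fn choose_source(value: Option<OsString>) -> Source {
    match value {
        Some(_) => Source::LocalFiles,
        None => Source::Download,
    }
}

/// Reads the variable and tells Cargo to re-run the build script
/// whenever the variable changes.
pub fn source_from_env() -> Source {
    println!("cargo:rerun-if-env-changed={}", LOCAL_VAR);
    choose_source(env::var_os(LOCAL_VAR))
}
```

Keeping the decision in a pure function (`choose_source`) makes it testable without touching the process environment.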
Credits
Diffenator wordlists are from diffenator2. Apache-2.0 licensed.
Emoji wordlists are from unicode.org. Unicode licensed.
AOSP word lists are from aosp-test-texts, using the files produced by scripts/extract_words.py. Apache-2.0 licensed.