static-lang-word-lists
A collection of word lists for various scripts, compressed at build time, baked into the binary, and decompressed lazily at run time.
Motivation
The goals are to include word lists in the binary, take up no more space than necessary, and remain publishable on crates.io (10 MiB size limit).
Usage
On crates.io as static-lang-word-lists
For documentation, please refer to docs.rs.
How this crate works
A build script downloads the word lists from GitHub, compresses them with Brotli, and embeds the compressed data in the binary, where it is lazily decompressed at run time.
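As a rough illustration of the "lazily decompressed" part, here is a minimal std-only sketch. The LazyWordList type, the EXAMPLE static, and the toy_decompress function are all hypothetical and are not this crate's API. A toy run-length decoder stands in for Brotli so that the example has no dependencies.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a word list embedded in the binary.
pub struct LazyWordList {
    /// Compressed bytes (the real crate would take these from the build output).
    compressed: &'static [u8],
    /// Decompressed text, filled in on first access.
    text: OnceLock<String>,
}

impl LazyWordList {
    pub const fn new(compressed: &'static [u8]) -> Self {
        Self { compressed, text: OnceLock::new() }
    }

    /// Decompresses on the first call, then returns the cached text.
    pub fn text(&self) -> &str {
        self.text.get_or_init(|| toy_decompress(self.compressed))
    }

    /// Iterates over the words, assuming one word per line.
    pub fn words(&self) -> impl Iterator<Item = &str> {
        self.text().lines()
    }
}

/// Toy run-length decoder standing in for Brotli (an assumption made for
/// illustration): the input is a sequence of (count, byte) pairs.
fn toy_decompress(data: &[u8]) -> String {
    let mut out = Vec::new();
    for pair in data.chunks_exact(2) {
        out.extend(std::iter::repeat(pair[1]).take(pair[0] as usize));
    }
    String::from_utf8(out).expect("toy data is valid UTF-8")
}

/// Encodes "a\nbb" as (1, 'a'), (1, '\n'), (2, 'b').
static EXAMPLE: LazyWordList = LazyWordList::new(&[1, b'a', 1, b'\n', 2, b'b']);
```

Nothing is decompressed until text() or words() is first called, so lists that are never used cost only their compressed size.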
Technical details
Note: adding or removing a wordlist requires that cargo xtask slwl be re-run.
See the xtasks' README for more details on what it's doing.
This README only concerns the build script's role.
- A list of all the supported wordlists is generated by cargo xtask slwl (this step is manual and must be run when adding or removing a word list!)
- The build script reads the list of paths generated by the xtask (chicken.rs, which is include!d in the build script)
- If building from remote sources, a zipball of the repo is downloaded from GitHub, and the selected word lists are extracted (word lists may be enabled or disabled through feature flags)
- Selected word lists are compressed with brotli (compression level is reduced in debug builds to speed up crate build time) and are written to OUT_DIR under their relative path, where static-lang-word-lists/src/declarations.rs is expecting them
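The decisions a build script makes in the steps above might look roughly like the following helpers. This is a sketch, not the crate's actual build script: the function names and the concrete quality numbers are assumptions. What it relies on from Cargo is documented behaviour: build scripts receive PROFILE ("debug" or "release"), OUT_DIR, and one CARGO_FEATURE_<NAME> variable per enabled feature.

```rust
use std::path::{Path, PathBuf};

/// Picks a Brotli quality from Cargo's PROFILE value.
/// The numbers are assumptions; Brotli quality ranges from 0 to 11.
fn compression_quality(profile: &str) -> u32 {
    if profile == "debug" { 1 } else { 11 }
}

/// Name of the environment variable Cargo sets for an enabled feature:
/// the feature name uppercased, with '-' translated to '_'.
fn feature_env_var(feature: &str) -> String {
    format!("CARGO_FEATURE_{}", feature.to_uppercase().replace('-', "_"))
}

/// Output location for a compressed list: OUT_DIR joined with the list's
/// relative path, so that the include site can find it again.
fn output_path(out_dir: &Path, relative: &Path) -> PathBuf {
    out_dir.join(relative)
}

// In a build script's main, these would be fed from the environment, e.g.
// std::env::var("PROFILE"), std::env::var_os("OUT_DIR"), and
// std::env::var_os(feature_env_var("some-feature")).is_some().
```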
Developing
To build using local files, set the STATIC_LANG_WORD_LISTS_LOCAL environment variable.
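A build script could branch on this variable along the following lines. This sketch assumes that the variable merely needs to be set and that its value is ignored; this README does not say whether that is the case. WordListSource and choose_source are hypothetical names.

```rust
use std::ffi::OsStr;

/// Where the build takes its word lists from (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum WordListSource {
    Local,
    Remote,
}

/// Assumption: any value, even an empty one, selects local files.
fn choose_source(local_var: Option<&OsStr>) -> WordListSource {
    match local_var {
        Some(_) => WordListSource::Local,
        None => WordListSource::Remote,
    }
}

// In a build script this might be called as:
// choose_source(std::env::var_os("STATIC_LANG_WORD_LISTS_LOCAL").as_deref())
```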
Adding a new word list
- Add the word lists as .txt files into static-lang-word-lists/data, in a subdirectory with a kebab-case name for your source
- For each .txt file, create a corresponding TOML file with the same stem. See the schema for the required & optional fields
- Run cargo xtask slwl. It'll emit crate feature definitions to stdout; copy & paste these over the existing [features] table in static-lang-word-lists/Cargo.toml
- Check the crate builds with cargo build --package static-lang-word-lists. You will need the STATIC_LANG_WORD_LISTS_LOCAL environment variable set
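The naming conventions in the first two steps can be checked mechanically. The helpers below are a sketch for illustration only; they are hypothetical and not part of the xtask. The kebab-case definition used here (lowercase ASCII letters and digits, separated by single hyphens) is also an assumption.

```rust
use std::path::{Path, PathBuf};

/// Metadata path for a word list: same directory and stem, .toml extension.
fn metadata_path(word_list: &Path) -> PathBuf {
    word_list.with_extension("toml")
}

/// True if `name` consists of non-empty groups of lowercase ASCII letters
/// or digits, separated by single hyphens.
fn is_kebab_case(name: &str) -> bool {
    !name.is_empty()
        && name.split('-').all(|part| {
            !part.is_empty()
                && part.bytes().all(|b| b.is_ascii_lowercase() || b.is_ascii_digit())
        })
}
```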
Word list metadata schema
Metadata files are TOML files. The ones that live in this crate have the same file name as their word list, only differing in extension.
| Field name | Field type | Required? | Description |
|---|---|---|---|
| name | string | ✔️ | A cosmetic name for the word list, usually in snake_case |
| script | string | ❌ | An ISO 15924 four-letter capitalised code* |
| language | string | ❌ | An ISO 639-1 two-letter lowercase code* |
(* this is not enforced, but will at least be true of crate-provided word lists.)
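For readers who want to check the starred conventions themselves, the schema could be modelled as below. This is a hypothetical sketch and not the crate's own metadata type. It only checks the shape of the codes (for example "Latn" or "en"), not whether they are actually registered in ISO 15924 or ISO 639-1.

```rust
/// Hypothetical in-memory form of a metadata file.
#[derive(Debug, Clone, PartialEq)]
pub struct WordListMetadata {
    pub name: String,
    pub script: Option<String>,
    pub language: Option<String>,
}

/// Shape of an ISO 15924 code: one uppercase ASCII letter followed by
/// three lowercase ones, e.g. "Latn".
fn looks_like_script_code(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 4 && b[0].is_ascii_uppercase() && b[1..].iter().all(u8::is_ascii_lowercase)
}

/// Shape of an ISO 639-1 code: two lowercase ASCII letters, e.g. "en".
fn looks_like_language_code(s: &str) -> bool {
    s.len() == 2 && s.bytes().all(|b| b.is_ascii_lowercase())
}

impl WordListMetadata {
    /// Optional fields pass when absent; present ones must have the right shape.
    pub fn follows_conventions(&self) -> bool {
        self.script.as_deref().map_or(true, looks_like_script_code)
            && self.language.as_deref().map_or(true, looks_like_language_code)
    }
}
```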
Credits
Diffenator wordlists are from diffenator2. Apache-2.0 licensed.
Emoji wordlists are from unicode.org. Unicode licensed.
AOSP word lists are from the aosp-test-texts repo, using the files produced by scripts/extract_words.py in that repo. Apache-2.0 licensed.
LibreOffice word lists are generated from the LibreOffice dictionaries repo, using the files produced by scripts/import_libreoffice.py. MPL v2.0 and LGPL v3+ dual-licensed.