static-lang-word-lists 0.4.0

Runtime decompressed statically-included word lists

static-lang-word-lists

A collection of word lists for various scripts, compressed at build time, baked into the binary, and decompressed lazily at run time.

Motivation

The goals are to include word lists in the binary, to take up no more space than necessary, and to remain publishable on crates.io (10 MiB size limit).

Usage

Available on crates.io as static-lang-word-lists.

For documentation, please refer to docs.rs.

How this crate works

A build script downloads the word lists from GitHub, compresses them with Brotli, and embeds the compressed data in the binary, where it is lazily decompressed at run time.
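The lazy decompression described above can be sketched with the standard library's OnceLock. This is a minimal illustration, not the crate's actual code: the WordList type, its methods, and DEMO are hypothetical names, and toy_decompress is a run-length stand-in for Brotli so that the example needs no external crates.

```rust
use std::sync::OnceLock;

/// Hypothetical type: compressed bytes live in the binary, and the text is
/// decoded at most once, on first access.
pub struct WordList {
    compressed: &'static [u8],
    text: OnceLock<String>,
}

impl WordList {
    pub const fn new(compressed: &'static [u8]) -> Self {
        Self { compressed, text: OnceLock::new() }
    }

    /// Decodes on the first call; later calls reuse the cached string.
    pub fn text(&self) -> &str {
        self.text.get_or_init(|| toy_decompress(self.compressed))
    }

    /// Iterates over the words, one per line of the decoded text.
    pub fn words(&self) -> impl Iterator<Item = &str> + '_ {
        self.text().lines()
    }

    /// Reports whether decoding has already happened.
    pub fn is_decoded(&self) -> bool {
        self.text.get().is_some()
    }
}

/// Stand-in for Brotli (assumption): input is a sequence of (count, byte)
/// run-length pairs.
fn toy_decompress(data: &[u8]) -> String {
    let mut out = Vec::new();
    for pair in data.chunks_exact(2) {
        out.extend(std::iter::repeat(pair[1]).take(pair[0] as usize));
    }
    String::from_utf8(out).expect("toy data is valid UTF-8")
}

/// A real crate would more likely embed a file from OUT_DIR with
/// `include_bytes!`; a literal keeps this sketch self-contained.
pub static DEMO: WordList = WordList::new(&[1, b'a', 2, b'b', 1, b'\n', 3, b'c']);
```

Because the cell is filled only inside text(), a word list that is compiled in but never read costs only its compressed size. Calling DEMO.words() triggers the one-time decode.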

Technical details

Note: adding or removing a word list requires re-running cargo xtask slwl. See the xtask's README for more details on what it does. This README only covers the build script's role.

  1. A list of all supported word lists is generated by cargo xtask slwl (this step is manual and must be run whenever a word list is added or removed!)
  2. The build script reads the list of paths generated by the xtask (chicken.rs, which the build script pulls in with include!)
  3. If building from remote sources, a zipball of the repo is downloaded from GitHub, and the selected word lists are extracted (word lists may be enabled or disabled through feature flags)
  4. The selected word lists are compressed with Brotli (the compression level is reduced in debug builds to speed up crate build time) and written to OUT_DIR under their relative path, where static-lang-word-lists/src/declarations.rs expects them
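The decisions in the steps above can be sketched as a handful of pure helpers. This is a hedged sketch, not the crate's build script: all function names are hypothetical, the quality values 11 and 5 are illustrative assumptions, and treating any value of STATIC_LANG_WORD_LISTS_LOCAL as "use local files" is also an assumption. What is standard Cargo behaviour is that build scripts see OUT_DIR, PROFILE ("release" or "debug"), and one CARGO_FEATURE_<NAME> variable per enabled feature.

```rust
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Cargo exposes an enabled feature `foo-bar` to build scripts as the
/// environment variable `CARGO_FEATURE_FOO_BAR`.
pub fn feature_env_var(feature: &str) -> String {
    let upper = feature.to_uppercase().replace('-', "_");
    format!("CARGO_FEATURE_{upper}")
}

/// Keeps the relative paths of word lists whose feature is enabled.
/// `lists` holds hypothetical (feature, relative path) pairs; a real build
/// script would pass `|k| std::env::var_os(k).is_some()` as `is_set`.
pub fn enabled<'a>(
    lists: &[(&'a str, &'a str)],
    is_set: impl Fn(&str) -> bool,
) -> Vec<&'a str> {
    lists
        .iter()
        .filter(|(feature, _)| is_set(&feature_env_var(feature)))
        .map(|(_, path)| *path)
        .collect()
}

/// Assumption: any value of STATIC_LANG_WORD_LISTS_LOCAL selects local files.
pub fn use_local_sources(value: Option<&OsStr>) -> bool {
    value.is_some()
}

/// Brotli quality ranges from 0 to 11. The exact numbers here are
/// illustrative: a lower level in debug builds trades size for build speed.
pub fn brotli_quality(profile: &str) -> u32 {
    if profile == "release" { 11 } else { 5 }
}

/// Mirrors a word list's relative path under OUT_DIR.
pub fn output_path(out_dir: &Path, relative: &Path) -> PathBuf {
    out_dir.join(relative)
}
```

Keeping the helpers free of direct environment access makes them easy to test. A build script would feed them std::env::var("PROFILE"), std::env::var_os("OUT_DIR"), and std::env::var_os("STATIC_LANG_WORD_LISTS_LOCAL").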

Developing

To build using local files instead of downloading them, set the STATIC_LANG_WORD_LISTS_LOCAL environment variable.

Adding a new word list

  1. Add the word lists as .txt files to static-lang-word-lists/data, in a subdirectory with a kebab-case name for your source
  2. For each .txt file, create a corresponding TOML file with the same stem. See the schema for the required and optional fields
  3. Run cargo xtask slwl. It emits crate feature definitions to stdout; copy and paste them over the existing [features] table in static-lang-word-lists/Cargo.toml
  4. Check that the crate builds with cargo build --package static-lang-word-lists. You will need the STATIC_LANG_WORD_LISTS_LOCAL environment variable set

Word list metadata schema

Metadata files are TOML files. The ones that live in this crate have the same file name as their word list, only differing in extension.

| Field name | Field type | Required? | Description |
|------------|------------|-----------|-------------|
| name       | string     | ✔️        | A cosmetic name for the word list, usually in snake_case |
| script     | string     |           | An ISO 15924 four-letter capitalised code* |
| language   | string     |           | An ISO 639-1 two-letter lowercase code* |

(* this is not enforced, but will at least be true of crate-provided word lists.)
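To make the schema concrete, here is a sketch of a metadata file and the shape it maps to. The Metadata struct, parse_metadata, and the example values are hypothetical. The parser handles only flat key = "value" lines and stands in for a real TOML parser, which the crate's tooling presumably uses instead.

```rust
/// Hypothetical in-memory form of one metadata file.
#[derive(Debug, PartialEq)]
pub struct Metadata {
    pub name: String,             // required
    pub script: Option<String>,   // optional, e.g. "Latn"
    pub language: Option<String>, // optional, e.g. "en"
}

/// Illustrative metadata file contents; the name is made up.
pub const EXAMPLE: &str = r#"
name = "example_latin"
script = "Latn"
language = "en"
"#;

/// Toy parser for flat `key = "value"` lines only. It is not a TOML
/// implementation: no tables, escapes, or non-string values.
pub fn parse_metadata(src: &str) -> Result<Metadata, String> {
    let (mut name, mut script, mut language) = (None, None, None);
    let lines = src
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'));
    for line in lines {
        let (key, value) = line
            .split_once('=')
            .ok_or_else(|| format!("not a key/value line: {line}"))?;
        let value = value.trim().trim_matches('"').to_string();
        match key.trim() {
            "name" => name = Some(value),
            "script" => script = Some(value),
            "language" => language = Some(value),
            _ => {} // fields outside the table above are ignored here
        }
    }
    Ok(Metadata {
        name: name.ok_or("missing required field: name")?,
        script,
        language,
    })
}
```

Parsing EXAMPLE yields a Metadata with all three fields set, while a file without name is rejected, which matches the Required? column.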

Credits

Diffenator word lists are from diffenator2. Apache-2.0 licensed.

Emoji word lists are from unicode.org. Unicode licensed.

AOSP word lists are from the aosp-test-texts repo, using the files produced by scripts/extract_words.py in that repo. Apache-2.0 licensed.

LibreOffice word lists are generated from the LibreOffice dictionaries repo, using the files produced by scripts/import_libreoffice.py. MPL v2.0 and LGPL v3+ dual-licensed.