static_lang_word_lists/lib.rs
//! # `static-lang-word-lists`
//!
//! A collection of word lists for various scripts, compressed at build time
//! with [Brotli](https://brotli.org/), baked into your compiled binary, and
//! decompressed lazily at run time.
//!
//! Word lists are compressed less when building the "debug" profile to speed up
//! build times.
//! Any other profile will use maximum compression.
//!
//! ## Accessing word lists
//!
//! If there's a specific word list you're after, you can refer to its `static`
//! by name.
//! The crate also provides the static [`ALL_WORD_LISTS`] for convenient
//! iteration/filtering.
//!
//! Word lists are decompressed when you call [`WordList::iter`].
//!
//! ## Feature flags
//!
//! This crate has a plethora of feature flags to help you reduce build times by
//! only compressing the word lists you plan to use.
//!
//! There are three categories of feature flag you can use to choose your word
//! lists:
//! - Source name (e.g. `diffenator`): enables all the word lists from a
//!   particular source
//! - Script code (e.g. `script-latn`): enables all the word lists for the given
//!   [ISO 15924](https://en.wikipedia.org/wiki/ISO_15924) script (the code is
//!   written in lowercase)
//! - Language code (e.g. `lang-en`): enables all the word lists for the given
//!   [ISO 639-1](https://en.wikipedia.org/wiki/ISO_639-1) language
//!
//! You can, of course, mix & match at will. Thanks to the magic of Cargo's
//! [feature unification](https://doc.rust-lang.org/cargo/reference/features.html#feature-unification),
//! any word lists needed by other dependencies will still be pulled in &
//! compiled even if you don't request them in your own crate.
//!
//! If there are no word lists for a script/language, there won't be a feature
//! flag.
//!
//! To enable all word lists, there is the catch-all feature `all`.
//! (This does not enable functionality-related flags, such as `rayon`.)
//!
//! **By default, only the diffenator word lists are enabled**.
//!
//! It is not considered a breaking change if more word lists get added to a
//! given feature.
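//!
//! As a hedged illustration of combining these flags, a dependent crate's
//! `Cargo.toml` might look like the sketch below. The version requirement is a
//! placeholder (no version is stated here), and the feature combination is
//! only an example assembled from the flags named above. It assumes that
//! disabling default features is how the diffenator lists are opted out of.
//!
//! ```toml
//! [dependencies.static-lang-word-lists]
//! # Placeholder version requirement: pin a real release in practice
//! version = "*"
//! # Assumption: turning off default features drops the diffenator lists
//! default-features = false
//! # Enable only Latin-script and English lists, plus the `rayon` flag
//! features = ["script-latn", "lang-en", "rayon"]
//! ```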
//!
//! ## Creating your own word lists
//!
//! - In-memory words: [`WordList::define`]
//! - Word list file (with sidecar metadata): [`WordList::load`]
//! - Word list file (no metadata): [`WordList::load_without_metadata`]
//!
//! ## How this crate works (⚠️disclaimer⚠️)
//!
//! A build script for this crate downloads a zipball of the GitHub repo for
//! this project at build time in order to get the word lists.
//! It is not possible for us to include these in the crate hosted on crates.io
//! as the crate would immediately exceed the size limit.
//!
//! If you use the repository as a path or git dependency, you can avoid the
//! download by setting the environment variable `STATIC_LANG_WORD_LISTS_LOCAL`.
//! Otherwise, you're welcome to audit the [build script](https://github.com/googlefonts/fontheight/blob/main/static-lang-word-lists/build.rs).

mod word_lists;

pub(crate) use word_lists::WordListMetadata;
#[cfg(feature = "rayon")]
pub use word_lists::rayon::ParWordListIter;
pub use word_lists::{WordList, WordListError, WordListIter};

use crate::word_lists::{Word, WordSource};

/// Splits `input` into words on any Unicode whitespace (not only newlines).
///
/// `str::split_whitespace` never yields empty slices, so no extra filtering is
/// needed.
fn newline_delimited_words(input: impl AsRef<str>) -> WordSource {
    input
        .as_ref()
        .split_whitespace()
        .map(Word::from)
        .collect()
}
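
// A minimal, std-only sketch of the splitting behaviour that
// `newline_delimited_words` builds on. It deliberately avoids `Word` and
// `WordSource` (their APIs are not shown in this file) and only exercises
// `str::split_whitespace`; the module and test names are illustrative.
#[cfg(test)]
mod split_whitespace_sketch {
    #[test]
    fn splits_on_any_whitespace_and_never_yields_empty_words() {
        // Newlines, repeated spaces and trailing blank lines all act as
        // separators, and none of them produces an empty slice
        let raw = "alpha\nbeta  gamma\n\n";
        let words: Vec<&str> = raw.split_whitespace().collect();
        assert_eq!(words, ["alpha", "beta", "gamma"]);
    }
}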

macro_rules! word_list {
    (
        ident: $ident:ident,
        metadata: $metadata:expr,
        bytes: $bytes:expr $(,)?
    ) => {
        /// The
        #[doc = ::std::stringify!($ident)]
        /// word list.
        ///
        /// Compiled into the binary compressed with Brotli, decompressed at
        /// runtime.
        pub static $ident: $crate::WordList = $crate::WordList::new_lazy(
            $metadata,
            ::std::sync::LazyLock::new(|| {
                let mut brotli_bytes: &[u8] = $bytes;
                let mut buf =
                    ::std::vec::Vec::with_capacity(brotli_bytes.len());
                ::brotli_decompressor::BrotliDecompress(
                    &mut brotli_bytes,
                    &mut buf,
                )
                .unwrap_or_else(|err| {
                    ::std::panic!(
                        "failed to decode {}: {err}",
                        ::std::stringify!($ident),
                    );
                });
                let raw_words =
                    // SAFETY: UTF-8 validity is checked by the build script
                    unsafe { ::std::string::String::from_utf8_unchecked(buf) };
                ::log::debug!("loaded words for {}", ::std::stringify!($ident));
                $crate::newline_delimited_words(raw_words)
            }),
        );
    };
}
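
// A minimal, std-only sketch of the lazy initialisation that `word_list!`
// relies on: the closure passed to `LazyLock::new` runs on first access and
// only once. The counter, the static names and the literal word data are
// illustrative stand-ins, not part of this crate.
#[cfg(test)]
mod lazy_lock_sketch {
    use std::sync::{
        LazyLock,
        atomic::{AtomicUsize, Ordering},
    };

    // Counts how many times the initialiser closure has run
    static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

    // Stand-in for a word list: built lazily from a string literal
    static WORDS: LazyLock<Vec<&'static str>> = LazyLock::new(|| {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        "alpha\nbeta\n".split_whitespace().collect()
    });

    #[test]
    fn initialises_once_on_first_access() {
        // Nothing has touched `WORDS` yet, so the closure has not run
        assert_eq!(INIT_COUNT.load(Ordering::SeqCst), 0);
        // The first access runs the closure; the second reuses the result
        assert_eq!(WORDS.len(), 2);
        assert_eq!(WORDS.len(), 2);
        assert_eq!(INIT_COUNT.load(Ordering::SeqCst), 1);
    }
}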

// Module declaration has to be below macro definition to be able to use it
mod declarations;
pub use declarations::*;