# Decompound
Decompose a compound word into its constituent parts. Works in any language, as you
provide the rules around what constitutes a (*single*) word. The algorithm is
Unicode-aware.
Useful for [culling down existing dictionaries at build time](#motivation).
The docs are best viewed via [docs.rs](https://docs.rs/decompound).
## Usage
Usage is very straightforward. There is only one (free) function of interest,
[`decompound`]. Its party piece is a closure argument, deciding whether a *single* word
is valid. As this can be highly dynamic and language-specific, this decision is left to
the user.
```rust
use decompound::{decompound, DecompositionOptions};
let is_valid_single_word = |w: &str| ["bed", "room"].contains(&w);
assert_eq!(
decompound(
"bedroom",
&is_valid_single_word,
DecompositionOptions::empty(),
).unwrap(),
vec!["bed", "room"]
);
```
Candidates for validity checks are simple dictionary lookups (for example, using
[`std::collections::HashSet`], [`phf`](https://crates.io/crates/phf), [Finite State
Transducers](https://docs.rs/fst/latest/fst/), [binary
search](https://docs.rs/b4s/latest/b4s/), ...), or any elaborate algorithm of your
choice.
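As a minimal sketch of the simplest of these candidates, the following builds a validity check on top of [`std::collections::HashSet`]. The helper name `make_validity_check` is hypothetical and not part of this crate; only the standard library is used.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of `decompound`): builds a validity check
/// backed by a `HashSet`, so each lookup is a constant-time membership test.
fn make_validity_check(words: &[&str]) -> impl Fn(&str) -> bool {
    // Own the words so the returned closure does not borrow from the input.
    let dictionary: HashSet<String> = words.iter().map(|w| w.to_string()).collect();
    move |word: &str| dictionary.contains(word)
}
```

The returned closure could then be passed by reference wherever `&is_valid_single_word` appears in the example above.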
### Configuration
Configuration is exposed as a [bit
field](https://docs.rs/bitflags/latest/bitflags/index.html) via
[`DecompositionOptions`]. Its options are freely combinable and enable more complex use
cases. Their usefulness largely depends on the natural language at hand. Some languages,
for example German, might require:
```rust
use decompound::{decompound, DecompositionOptions};
let is_valid_single_word = |w: &str| ["Rüben", "Knollen", "Küche"].contains(&w);
assert_eq!(
decompound(
"Rübenknollen-Küche",
&is_valid_single_word,
// Wouldn't find anything without titlecasing `knollen` to `Knollen`,
// and splitting on hyphens.
DecompositionOptions::SPLIT_HYPHENATED
| DecompositionOptions::TRY_TITLECASE_SUFFIX
).unwrap(),
vec!["Rüben", "Knollen", "Küche"]
);
```
This covers all currently available options already:
```rust
use decompound::DecompositionOptions;
assert!(
(
// This is doc-tested so new options are not forgotten.
DecompositionOptions::SPLIT_HYPHENATED
| DecompositionOptions::TRY_TITLECASE_SUFFIX
).is_all()
);
```
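To make the two options more concrete, here is a rough, standard-library-only sketch of the transformations their names suggest. It is not this crate's implementation; the function names `titlecase` and `split_hyphenated` are made up for illustration, and the real options may behave differently in edge cases.

```rust
/// Hypothetical illustration: uppercases the first character of `word`.
/// Works on `char`s rather than bytes, so `küche` becomes `Küche`.
/// Note: this uses uppercase mapping, which differs from true Unicode
/// titlecase for a few rare characters.
fn titlecase(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        // `to_uppercase` may yield more than one `char`, hence the iterator chain.
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

/// Hypothetical illustration: splits on ASCII hyphens, dropping empty pieces.
fn split_hyphenated(word: &str) -> Vec<&str> {
    word.split('-').filter(|part| !part.is_empty()).collect()
}
```

In the German example, splitting on the hyphen would yield `Rübenknollen` and `Küche`, and titlecasing the suffix `knollen` would yield the dictionary entry `Knollen`.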
### Failure modes
If the word cannot be decomposed, a [`DecompositionError`] is returned.
```rust
use decompound::{decompound, DecompositionError, DecompositionOptions};
let is_valid_single_word = |w: &str| ["water", "melon"].contains(&w);
assert_eq!(
decompound(
"snowball",
&is_valid_single_word,
DecompositionOptions::empty(),
).unwrap_err(),
DecompositionError::NothingValid
);
```
#### Overeager validity checks
Nothing prevents you from providing a closure *which itself accepts compound words*.
Compound words (like `railroad`) being included in a lookup dictionary (instead of
*only* its root words `rail` and `road`) is an example "pathological" case.
Accommodating compound words *yourself* is precisely what this crate is [supposed to
alleviate](#motivation). If you already have and do not want to or cannot drop that
capability, this crate might be obsolete for your case (hence "overeager checks").
Although [`decompound`] prefers splits if possible, such as
```rust
use decompound::{decompound, DecompositionOptions};
// Contains a compound word *and* its root words.
let is_valid_single_word = |w: &str| ["blueberry", "blue", "berry"].contains(&w);
assert_eq!(
decompound(
"blueberry",
&is_valid_single_word,
DecompositionOptions::empty(),
).unwrap(),
vec!["blue", "berry"]
);
```
if root words are missing but the compound itself is present, decomposition technically
*fails*. This case is considered an error, and marked as such. That is [more
ergonomic](https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/) than
being returned a [`Vec`] of constituents of length 1, requiring more awkward error
handling at the call site.
```rust
use decompound::{decompound, DecompositionError, DecompositionOptions};
// *Only* contains a compound word, not its root words.
let is_valid_single_word = |w: &str| ["firefly"].contains(&w);
assert_eq!(
decompound(
"firefly",
&is_valid_single_word,
DecompositionOptions::empty(),
).unwrap_err(),
DecompositionError::SingleWord("firefly".to_string())
);
```
Match on this variant if this case is not an error in your domain (this crate itself
does so internally, too).
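Such a match could look like the sketch below. To keep it compilable on its own, it declares a local stand-in enum mirroring the two variants shown above; in real code, import `decompound::DecompositionError` instead. The helper name `constituents_or_word` is hypothetical, and the success type is assumed to be `Vec<String>`.

```rust
/// Local stand-in for `decompound::DecompositionError`, mirroring only the
/// two variants shown in this document (an assumption for illustration).
#[derive(Debug, PartialEq)]
enum DecompositionError {
    NothingValid,
    SingleWord(String),
}

/// Hypothetical helper: treats a valid but indivisible word as a success
/// with a single constituent, and only `NothingValid` as a real failure.
fn constituents_or_word(
    result: Result<Vec<String>, DecompositionError>,
) -> Option<Vec<String>> {
    match result {
        Ok(parts) => Some(parts),
        // The word itself is valid, it just could not be split further.
        Err(DecompositionError::SingleWord(word)) => Some(vec![word]),
        Err(DecompositionError::NothingValid) => None,
    }
}
```

Whether folding `SingleWord` into the success path is appropriate depends on your domain; the explicit variant keeps that decision at the call site.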
## Motivation
The crate implementation is simple and nothing you wouldn't be able to write yourself.
There is a catch though. As mentioned, this crate can help you move checks for compound
words from static (a fixed dictionary) to runtime ([`decompound`]). For some languages,
this is strictly *required*, as the set of compound words might be immense, or
(effectively, not mathematically) unbounded, meaning root words may be combined to
arbitrary lengths. German is such a case. No dictionary exists to cover all possible
German words. However, [existing ones](https://sourceforge.net/projects/germandict/) are
almost guaranteed to themselves contain *some* compound words (which is generally
helpful). When using such dictionaries *and* this crate to cover all remaining,
arbitrary compound words, **duplication arises, and the dictionary is no longer
minimal**. Most, perhaps all, compound words in the dictionary could be detected at
runtime instead (providing a single source of truth along the way).
Culling the dictionary might lead to significant, [perhaps
necessary](https://github.com/rust-lang/crates.io/issues/195) savings in size (memory
and executable), so a [build
script](https://doc.rust-lang.org/cargo/reference/build-scripts.html) is needed. But
now, *both* the actual code *and* the [build script
depend](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#build-dependencies)
on that same detection algorithm. If what you cull the dictionary with gets out of sync
with what's done at runtime, bugs arise. The build script cannot depend on what it's
building. Currently (2023-08-19), **there is no place for the compound check to live
except another crate**, external to both the build script and actual code. That's this
crate. It affords a **non-cyclic build graph** and a single source of truth for the
compound check, and allows the usage of *any* dictionary with no out-of-band preprocessing
necessary (the original dictionary can be kept).
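The culling step of such a build script might be sketched as follows. The function `cull_dictionary` is hypothetical, and the compound check is passed in as a plain closure so the sketch stands alone. In an actual `build.rs`, that closure would presumably wrap a call to [`decompound`] (for example, checking whether it returns `Ok`), using the same options as the runtime code.

```rust
/// Hypothetical build-time helper: keeps only dictionary entries that the
/// compound check does *not* recognize, as those must stay in the dictionary.
/// Entries recognized as compounds can be detected again at runtime.
fn cull_dictionary<'a>(
    words: &[&'a str],
    is_compound: impl Fn(&str) -> bool,
) -> Vec<&'a str> {
    words
        .iter()
        .copied()
        .filter(|&word| !is_compound(word))
        .collect()
}
```

Because the build script and the runtime code would both call into the same external crate, the two cannot drift apart.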