Features
- Full UTF-8 support.
- Robust parsing.
- Language-specific rules (each language is defined by its own parsing expression grammar, or PEG).
- Fast and memory-efficient parsing via the pest library.
- Sentences can contain quotes which can contain subsentences.
Bindings
Besides native Rust, bindings for the following programming languages are available:
Supported languages
- Croatian (standard)
- English (standard)
There is also an additional Baseline "language" that simply splits the text on sentence terminals as defined by Unicode. It is intended for benchmarking only.
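To illustrate what terminal-only splitting means, and why language-specific rules are needed, here is a minimal standalone sketch. It does not use the crate: the function name `baseline_split` and the `TERMINALS` list are hypothetical, and the list is only a small subset of the characters Unicode treats as sentence terminals.

```rust
/// Small illustrative subset of sentence terminals (an assumption,
/// not the full Unicode set and not the crate's actual list).
const TERMINALS: [char; 6] = ['.', '!', '?', '…', '。', '؟'];

/// Splits `text` after every run of terminal characters and returns
/// the trimmed, non-empty slices. Hypothetical helper, not the crate's API.
fn baseline_split(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut chars = text.char_indices().peekable();

    while let Some((idx, ch)) = chars.next() {
        if !TERMINALS.contains(&ch) {
            continue;
        }
        // Keep consecutive terminals ("?!", "...") attached to the same sentence.
        let mut end = idx + ch.len_utf8();
        while let Some(&(next_idx, next_ch)) = chars.peek() {
            if !TERMINALS.contains(&next_ch) {
                break;
            }
            end = next_idx + next_ch.len_utf8();
            chars.next();
        }
        let sentence = text[start..end].trim();
        if !sentence.is_empty() {
            sentences.push(sentence);
        }
        start = end;
    }

    // Whatever follows the last terminal is kept as a final sentence.
    let rest = text[start..].trim();
    if !rest.is_empty() {
        sentences.push(rest);
    }
    sentences
}

fn main() {
    let text = "Petar Krešimir IV. je vladao od 1058. do 1074.";
    for sentence in baseline_split(text) {
        println!("{sentence:?}");
    }
}
```

With this naive approach the single Croatian sentence in `main` is cut into three fragments, because the ordinal numbers "IV." and "1058." end in a period. Handling such cases is what the language-specific grammars are for.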
Example
After adding the cutters dependency to your Cargo.toml file, usage is simple.
This produces the following output (note that the str fields of both structs are of type &str):
[
    Sentence {
        str: "Petar Krešimir IV. je vladao od 1058. do 1074. ",
        quotes: [],
    },
    Sentence {
        str: "St. Louis 9LX je događaj u svijetu šaha.",
        quotes: [],
    },
    Sentence {
        str: "To je prof.dr.sc. Ivan Horvat.",
        quotes: [],
    },
    Sentence {
        str: "Volim rock, punk, funk, pop itd.",
        quotes: [],
    },
    Sentence {
        str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
        quotes: [
            Quote {
                str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
                sentences: [
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            },
        ],
    },
]
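The nesting in this output (sentences containing quotes, which in turn contain subsentences) can be walked with ordinary iterators. The sketch below is self-contained and does not depend on the crate: the `Sentence` and `Quote` structs are hypothetical mirrors inferred only from the field names printed above, and `quoted_subsentences` is an illustrative helper. The crate's real type definitions may differ.

```rust
// Hypothetical mirrors of the structs seen in the output above;
// the crate's actual definitions may differ.
#[derive(Debug)]
struct Quote<'a> {
    str: &'a str,
    sentences: Vec<&'a str>,
}

#[derive(Debug)]
struct Sentence<'a> {
    str: &'a str,
    quotes: Vec<Quote<'a>>,
}

/// Collects every subsentence found inside the quotes of the given sentences.
fn quoted_subsentences<'a>(sentences: &[Sentence<'a>]) -> Vec<&'a str> {
    sentences
        .iter()
        .flat_map(|sentence| sentence.quotes.iter())
        .flat_map(|quote| quote.sentences.iter().copied())
        .collect()
}

fn main() {
    // Hand-built sample data copied from the output above.
    let sentences = vec![
        Sentence {
            str: "Volim rock, punk, funk, pop itd.",
            quotes: vec![],
        },
        Sentence {
            str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
            quotes: vec![Quote {
                str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
                sentences: vec![
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            }],
        },
    ];

    for sentence in &sentences {
        println!("{} ({} quote(s))", sentence.str, sentence.quotes.len());
        for quote in &sentence.quotes {
            println!("  quoted: {}", quote.str);
        }
    }
    println!("{:#?}", quoted_subsentences(&sentences));
}
```

Keeping every field as a borrowed `&str` slice means that walking the result copies no text; only the vectors that hold the slices are allocated.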