§Overview
Implementation of Tibor Kiss’ and Jan Strunk’s Punkt algorithm for sentence tokenization. Results have been compared against NLTK's tokenization of both small and large texts.
§Training
Training data can be provided to a SentenceTokenizer for better results. Data can be acquired manually by training with a Trainer, or by using already compiled data from NLTK (example: TrainingData::english()).
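To illustrate why training data improves results, here is a minimal, self-contained sketch. It is not the Punkt algorithm and does not use this crate's API: `toy_split` and its hand-written abbreviation set are hypothetical stand-ins for what a trained model would supply.

```rust
use std::collections::HashSet;

/// Toy splitter (hypothetical, not part of rust-punkt): breaks `doc` at
/// '.', '!' or '?' when followed by whitespace or end of text, unless the
/// word before a '.' is listed in `abbrevs` (lowercase, without the period).
fn toy_split<'a>(doc: &'a str, abbrevs: &HashSet<&str>) -> Vec<&'a str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut chars = doc.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if !matches!(c, '.' | '!' | '?') {
            continue;
        }
        // Only treat the mark as a candidate boundary before whitespace or EOF.
        let at_boundary = match chars.peek() {
            Some(&(_, next)) => next.is_whitespace(),
            None => true,
        };
        if !at_boundary {
            continue;
        }
        // The word immediately before the punctuation mark.
        let word = doc[start..i].rsplit(char::is_whitespace).next().unwrap_or("");
        if c == '.' && abbrevs.contains(word.to_lowercase().as_str()) {
            continue; // the period belongs to a known abbreviation
        }
        let end = i + c.len_utf8();
        let sentence = doc[start..end].trim();
        if !sentence.is_empty() {
            sentences.push(sentence);
        }
        start = end;
    }
    // Keep any trailing text that has no closing punctuation.
    let rest = doc[start..].trim();
    if !rest.is_empty() {
        sentences.push(rest);
    }
    sentences
}
```

With an empty set, "Dr. Smith arrived. He was late." splits into three pieces because "Dr." looks like a sentence end. Once "dr" is in the set, it splits into the expected two sentences.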
§Typical Usage
The Punkt algorithm allows you to derive all the data necessary for sentence tokenization from the document itself.
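use punkt::params::Standard;
use punkt::{SentenceTokenizer, Trainer, TrainingData};
let doc = "this is a great sentence! this is a sad sentence."; // illustrative sample text
// Gather training data from the document itself.
let trainer: Trainer<Standard> = Trainer::new();
let mut data = TrainingData::new();
trainer.train(doc, &mut data);
// Iterate over the sentence slices of the document.
for s in SentenceTokenizer::<Standard>::new(doc, &data) {
    println!("{:?}", s);
}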
rust-punkt also provides pretrained data that can be loaded for certain languages.
let data = TrainingData::english();
rust-punkt also allows training data to be gathered incrementally.
let trainer: Trainer<Standard> = Trainer::new();
let mut data = TrainingData::new();
for d in docs.iter() {
    trainer.train(d, &mut data);
    for s in SentenceTokenizer::<Standard>::new(d, &data) {
        println!("{:?}", s);
    }
}
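The idea behind incremental gathering can be sketched without the crate. The following is a simplified illustration, not rust-punkt's internals: `ToyData`, `toy_train`, and `period_ratio` are hypothetical names. The sketch only counts how often a token is followed by a period, whereas the real trainer records far more.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for accumulated training data: how often each
/// lowercase token appears with a trailing period, and how often in total.
#[derive(Default)]
struct ToyData {
    with_period: HashMap<String, usize>,
    total: HashMap<String, usize>,
}

/// Adds the counts from one document to `data`, keeping earlier counts.
fn toy_train(doc: &str, data: &mut ToyData) {
    for token in doc.split_whitespace() {
        let word = token.trim_end_matches('.').to_lowercase();
        if word.is_empty() {
            continue;
        }
        if token.ends_with('.') {
            *data.with_period.entry(word.clone()).or_insert(0) += 1;
        }
        *data.total.entry(word).or_insert(0) += 1;
    }
}

/// Fraction of occurrences of `word` that carried a trailing period.
fn period_ratio(data: &ToyData, word: &str) -> f64 {
    let total = data.total.get(word).copied().unwrap_or(0);
    if total == 0 {
        return 0.0;
    }
    let dotted = data.with_period.get(word).copied().unwrap_or(0);
    dotted as f64 / total as f64
}
```

Because `toy_train` takes `&mut ToyData`, each call refines the same statistics, so documents seen later benefit from everything observed earlier. This mirrors the shape of the `trainer.train(d, &mut data)` loop above.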
§Customization
rust-punkt exposes a number of traits to customize how the trainer, sentence tokenizer, and internal tokenizers work. The default settings, which are nearly identical to the ones in the Python library, are available in punkt::params::Standard.
To modify only how the trainer works:
struct MyParams;
// Empty impls keep the default character sets.
impl DefinesInternalPunctuation for MyParams {}
impl DefinesNonPrefixCharacters for MyParams {}
impl DefinesNonWordCharacters for MyParams {}
impl DefinesPunctuation for MyParams {}
impl DefinesSentenceEndings for MyParams {}
// Only the trainer's thresholds and flags are set explicitly.
impl TrainerParameters for MyParams {
    const ABBREV_LOWER_BOUND: f64 = 0.3;
    const ABBREV_UPPER_BOUND: f64 = 8.0;
    const IGNORE_ABBREV_PENALTY: bool = false;
    const COLLOCATION_LOWER_BOUND: f64 = 7.88;
    const SENTENCE_STARTER_LOWER_BOUND: f64 = 35.0;
    const INCLUDE_ALL_COLLOCATIONS: bool = false;
    const INCLUDE_ABBREV_COLLOCATIONS: bool = true;
    const COLLOCATION_FREQUENCY_LOWER_BOUND: f64 = 0.8;
}
To fully modify how everything works:
struct MyParams;
impl DefinesSentenceEndings for MyParams {
    // const SENTENCE_ENDINGS: &'static Set<char> = &phf_set![...];
}
impl DefinesInternalPunctuation for MyParams {
    // const INTERNAL_PUNCTUATION: &'static Set<char> = &phf_set![...];
}
impl DefinesNonWordCharacters for MyParams {
    // const NONWORD_CHARS: &'static Set<char> = &phf_set![...];
}
impl DefinesPunctuation for MyParams {
    // const PUNCTUATION: &'static Set<char> = &phf_set![...];
}
impl DefinesNonPrefixCharacters for MyParams {
    // const NONPREFIX_CHARS: &'static Set<char> = &phf_set![...];
}
impl TrainerParameters for MyParams {
    // const ABBREV_LOWER_BOUND: f64 = ...;
    // const ABBREV_UPPER_BOUND: f64 = ...;
    // const IGNORE_ABBREV_PENALTY: bool = ...;
    // const COLLOCATION_LOWER_BOUND: f64 = ...;
    // const SENTENCE_STARTER_LOWER_BOUND: f64 = ...;
    // const INCLUDE_ALL_COLLOCATIONS: bool = ...;
    // const INCLUDE_ABBREV_COLLOCATIONS: bool = true;
    // const COLLOCATION_FREQUENCY_LOWER_BOUND: f64 = ...;
}
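The customization pattern above relies on traits with associated constants, read through a type parameter. Here is a standalone sketch of that language mechanism. `ToyParams`, `Defaults`, `Strict`, and the two helper functions are hypothetical and much smaller than the crate's real traits. The empty `impl` blocks in the trainer-only example suggest that the crate's traits likewise supply defaults, which the sketch assumes.

```rust
/// Hypothetical parameter trait: every constant has a default that an
/// implementor may override.
trait ToyParams {
    const LOWER_BOUND: f64 = 0.3;
    const SENTENCE_ENDINGS: &'static [char] = &['.', '!', '?'];
}

/// Accepts every default by leaving the impl body empty.
struct Defaults;
impl ToyParams for Defaults {}

/// Overrides both constants.
struct Strict;
impl ToyParams for Strict {
    const LOWER_BOUND: f64 = 0.9;
    const SENTENCE_ENDINGS: &'static [char] = &['.'];
}

/// Generic code reads the constants through the type parameter, so the
/// configuration is fixed at compile time and no parameter value is passed.
fn is_sentence_ending<P: ToyParams>(c: char) -> bool {
    P::SENTENCE_ENDINGS.contains(&c)
}

/// Compares a score against the threshold chosen by the parameter type.
fn passes_threshold<P: ToyParams>(score: f64) -> bool {
    score >= P::LOWER_BOUND
}
```

Selecting a configuration is then a matter of naming a type, for example `is_sentence_ending::<Strict>('!')`, in the same way that `Trainer<Standard>` and `SentenceTokenizer::<Standard>` name their parameters.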
§Modules
- params
- Contains traits for configuring all tokenizers and the trainer. Also contains default parameters for the tokenizers and the trainer.
§Structs
- SentenceByteOffsetTokenizer
- Iterator over the byte offsets of a document.
- SentenceTokenizer
- Iterator over the sentence slices of a document.
- Trainer
- A trainer builds data about abbreviations, sentence starters, collocations, and the contexts in which tokens appear. The sentence tokenizer uses this data to determine whether a period is likely part of an abbreviation or actually marks the end of a sentence.
- TrainingData
- Stores data that was obtained during training.
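The difference between iterating over byte offsets and iterating over sentence slices can be shown with a small standalone sketch. The functions `toy_offsets`, `push_span`, and `toy_slices` are hypothetical, and the sketch assumes offsets are reported as `(start, end)` byte pairs, which the text above does not specify. It splits naively on every '.', '!' or '?' and does not reproduce the crate's behavior.

```rust
/// Returns (start, end) byte offsets of chunks that end in '.', '!' or '?'.
/// Leading whitespace is excluded from each span.
fn toy_offsets(doc: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start = 0;
    for (i, c) in doc.char_indices() {
        if matches!(c, '.' | '!' | '?') {
            // `i` is a byte index, so add the mark's UTF-8 length.
            let end = i + c.len_utf8();
            push_span(doc, start, end, &mut spans);
            start = end;
        }
    }
    // Trailing text without closing punctuation.
    push_span(doc, start, doc.len(), &mut spans);
    spans
}

/// Records the span unless it is empty or whitespace only.
fn push_span(doc: &str, start: usize, end: usize, spans: &mut Vec<(usize, usize)>) {
    let chunk = &doc[start..end];
    let skipped = chunk.len() - chunk.trim_start().len();
    if start + skipped < end {
        spans.push((start + skipped, end));
    }
}

/// Slices are recovered by indexing the document with the byte offsets.
fn toy_slices(doc: &str) -> Vec<&str> {
    toy_offsets(doc).into_iter().map(|(s, e)| &doc[s..e]).collect()
}
```

Offsets are cheap to store and can index back into the original document at any time, while slices are more convenient for immediate use. The offsets count bytes, not characters: in "Héllo there! Bye." the first span is `(0, 13)` because 'é' occupies two bytes.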