Crate punkt

§Overview

An implementation of Tibor Kiss’ and Jan Strunk’s Punkt algorithm for sentence tokenization. Its output has been compared against both small and large texts tokenized with NLTK.

§Training

Training data can be provided to a SentenceTokenizer for better results. The data can be gathered manually by training with a Trainer, or loaded from data already compiled by NLTK (for example, TrainingData::english()).

§Typical Usage

The Punkt algorithm can derive all the data needed for sentence tokenization from the document itself.

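use punkt::params::Standard;
use punkt::{SentenceTokenizer, Trainer, TrainingData};

// `doc` is the text to tokenize, as a `&str`.
let trainer: Trainer<Standard> = Trainer::new();
let mut data = TrainingData::new();

// Gather training data from the document itself.
trainer.train(doc, &mut data);

// Iterate over the sentences of the same document.
for s in SentenceTokenizer::<Standard>::new(doc, &data) {
  println!("{:?}", s);
}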

rust-punkt also provides pretrained data that can be loaded for certain languages.

let data = TrainingData::english();

rust-punkt also allows training data to be gathered incrementally, so that each document is tokenized with everything learned from the documents before it.

let trainer: Trainer<Standard> = Trainer::new();
let mut data = TrainingData::new();

for d in docs.iter() {
  trainer.train(d, &mut data);

  for s in SentenceTokenizer::<Standard>::new(d, &data) {
    println!("{:?}", s);
  }
}
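
To illustrate why incremental gathering helps, here is a minimal, self-contained sketch. It does not use punkt and does not implement the Punkt algorithm: ToyData, toy_train, and likely_abbrev are hypothetical stand-ins, and the "seen with a period at least twice, never without" rule is an assumption chosen only to keep the example short. What it shows is that evidence accumulated in one shared data value can change the decision for later documents.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for training data: for each lowercase token,
/// how often it was seen (with a trailing period, without one).
#[derive(Default)]
struct ToyData {
    counts: HashMap<String, (u32, u32)>,
}

/// Adds the counts from one document to `data` (not punkt's trainer).
fn toy_train(doc: &str, data: &mut ToyData) {
    for tok in doc.split_whitespace() {
        let (word, with_period) = match tok.strip_suffix('.') {
            Some(w) => (w, true),
            None => (tok, false),
        };
        if word.is_empty() {
            continue;
        }
        let entry = data.counts.entry(word.to_lowercase()).or_insert((0, 0));
        if with_period {
            entry.0 += 1;
        } else {
            entry.1 += 1;
        }
    }
}

/// Assumed toy rule: a token is a likely abbreviation if it was seen
/// with a period at least twice and never without one.
fn likely_abbrev(data: &ToyData, word: &str) -> bool {
    matches!(data.counts.get(&word.to_lowercase()), Some(&(with, 0)) if with >= 2)
}

fn main() {
    let docs = ["Dr. Smith left.", "Dr. Jones left early."];
    let mut data = ToyData::default();

    for d in docs.iter() {
        toy_train(d, &mut data);
        println!("dr is a likely abbreviation: {}", likely_abbrev(&data, "dr"));
    }
}
```

After the first document the toy rule has too little evidence for "dr"; after the second it has enough, while "left" is ruled out because it also appears without a period.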

§Customization

rust-punkt exposes a number of traits to customize how the trainer, sentence tokenizer, and internal tokenizers work. The default settings, which are nearly identical to those of the Python library, are available in punkt::params::Standard.

To modify only how the trainer works:

use punkt::params::*;

struct MyParams;

// Empty impl blocks keep the default character sets.
impl DefinesInternalPunctuation for MyParams {}
impl DefinesNonPrefixCharacters for MyParams {}
impl DefinesNonWordCharacters for MyParams {}
impl DefinesPunctuation for MyParams {}
impl DefinesSentenceEndings for MyParams {}

impl TrainerParameters for MyParams {
  const ABBREV_LOWER_BOUND: f64 = 0.3;
  const ABBREV_UPPER_BOUND: f64 = 8.0;
  const IGNORE_ABBREV_PENALTY: bool = false;
  const COLLOCATION_LOWER_BOUND: f64 = 7.88;
  const SENTENCE_STARTER_LOWER_BOUND: f64 = 35.0;
  const INCLUDE_ALL_COLLOCATIONS: bool = false;
  const INCLUDE_ABBREV_COLLOCATIONS: bool = true;
  const COLLOCATION_FREQUENCY_LOWER_BOUND: f64 = 0.8;
}
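
The empty impl blocks above suggest that these traits supply default values for their associated constants. The following self-contained sketch shows that general Rust pattern without depending on punkt. DefinesEndings, Defaults, WithSemicolon, and count_endings are hypothetical names, and a plain char slice stands in for whatever set type the crate actually uses.

```rust
/// Hypothetical stand-in trait (not punkt's definition) with a
/// default value for its associated constant.
trait DefinesEndings {
    const SENTENCE_ENDINGS: &'static [char] = &['.', '?', '!'];

    fn is_sentence_ending(c: char) -> bool {
        Self::SENTENCE_ENDINGS.contains(&c)
    }
}

/// An empty impl block keeps the default constant.
struct Defaults;
impl DefinesEndings for Defaults {}

/// Overriding the constant changes behavior for this parameter type only.
struct WithSemicolon;
impl DefinesEndings for WithSemicolon {
    const SENTENCE_ENDINGS: &'static [char] = &['.', '?', '!', ';'];
}

/// Generic code selects the parameter set through a type argument,
/// in the same spirit as `Trainer<Standard>`.
fn count_endings<P: DefinesEndings>(text: &str) -> usize {
    text.chars().filter(|&c| P::is_sentence_ending(c)).count()
}

fn main() {
    let text = "One; two. Three!";
    println!("{}", count_endings::<Defaults>(text));
    println!("{}", count_endings::<WithSemicolon>(text));
}
```

Because the parameters are associated constants on a type, the choice is resolved at compile time and no configuration value has to be passed around at runtime.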

To fully modify how everything works:

struct MyParams;

impl DefinesSentenceEndings for MyParams {
  // const SENTENCE_ENDINGS: &'static Set<char> = &phf_set![...];
}

impl DefinesInternalPunctuation for MyParams {
  // const INTERNAL_PUNCTUATION: &'static Set<char> = &phf_set![...];
}

impl DefinesNonWordCharacters for MyParams {
  // const NONWORD_CHARS: &'static Set<char> = &phf_set![...];
}

impl DefinesPunctuation for MyParams {
  // const PUNCTUATION: &'static Set<char> = &phf_set![...];
}

impl DefinesNonPrefixCharacters for MyParams {
  // const NONPREFIX_CHARS: &'static Set<char> = &phf_set![...];
}

impl TrainerParameters for MyParams {
  // const ABBREV_LOWER_BOUND: f64 = ...;
  // const ABBREV_UPPER_BOUND: f64 = ...;
  // const IGNORE_ABBREV_PENALTY: bool = ...;
  // const COLLOCATION_LOWER_BOUND: f64 = ...;
  // const SENTENCE_STARTER_LOWER_BOUND: f64 = ...;
  // const INCLUDE_ALL_COLLOCATIONS: bool = ...;
  // const INCLUDE_ABBREV_COLLOCATIONS: bool = ...;
  // const COLLOCATION_FREQUENCY_LOWER_BOUND: f64 = ...;
}

Modules§

params: Contains traits for configuring all tokenizers and the trainer, along with default parameters for both.

Structs§

SentenceByteOffsetTokenizer: Iterator over the byte offsets of the sentences in a document.
SentenceTokenizer: Iterator over the sentence slices of a document.
Trainer: A trainer builds data about abbreviations, sentence starters, collocations, and the contexts in which tokens appear. The sentence tokenizer uses this data to decide whether a period is likely part of an abbreviation or actually marks the end of a sentence.
TrainingData: Stores data that was obtained during training.
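
As a rough illustration of the abbreviation decision described for Trainer, and of how byte offsets relate to sentence slices, here is a deliberately naive, self-contained sketch. It is not the Punkt algorithm and does not use this crate: sentence_offsets is a hypothetical function, only periods are treated as sentence endings, and a hand-written set of abbreviations stands in for trained data.

```rust
use std::collections::HashSet;

/// Returns (start, end) byte offsets of the sentences in `doc`.
/// `abbrevs` is a hypothetical stand-in for trained data: lowercase
/// tokens, without the trailing period, known to be abbreviations.
fn sentence_offsets(doc: &str, abbrevs: &HashSet<&str>) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start = 0;

    for (i, c) in doc.char_indices() {
        if c != '.' {
            continue;
        }
        // The word immediately before the period.
        let word = doc[start..i].rsplit(char::is_whitespace).next().unwrap_or("");
        if abbrevs.contains(word.to_lowercase().as_str()) {
            continue; // The period belongs to an abbreviation.
        }
        let end = i + c.len_utf8();
        spans.push((start, end));
        // The next sentence starts after any whitespace that follows.
        let rest = &doc[end..];
        start = end + (rest.len() - rest.trim_start().len());
    }

    // Keep trailing text that has no final period.
    if start < doc.len() {
        spans.push((start, doc.len()));
    }
    spans
}

fn main() {
    let doc = "Dr. Smith arrived. He left.";
    let abbrevs: HashSet<&str> = ["dr"].into_iter().collect();

    for (start, end) in sentence_offsets(doc, &abbrevs) {
        // Byte offsets can be turned back into sentence slices.
        println!("{}..{}: {:?}", start, end, &doc[start..end]);
    }
}
```

With "dr" in the set, the text is split into two sentences; with an empty set, the period after "Dr" is also treated as a sentence boundary.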