Punkt
Implementation of Tibor Kiss' and Jan Strunk's Punkt algorithm for sentence tokenization. Includes a word tokenizer that tokenizes words based on regexes defined in Python's NLTK library. Results have been compared with small and large texts that have been tokenized with NLTK's library.
Note: The output of this library may differ slightly from that of NLTK's implementation of the algorithm, because the WordTokenizer was developed with speed in mind and avoids regular expressions.
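To illustrate what tokenizing words without regular expressions can look like, here is a minimal sketch using only the standard library. It is not the crate's WordTokenizer: the function name `tokenize` and the rule set (split on whitespace, peel trailing `,;:!?` into their own tokens, keep periods attached so a later step can decide whether they end a sentence) are assumptions made for illustration.

```rust
/// Hypothetical regex-free word tokenizer (illustration only, not the
/// punkt crate's WordTokenizer).
///
/// Whitespace separates chunks. Trailing `,;:!?` characters become their
/// own tokens. Periods stay attached to the word.
fn tokenize(text: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    for chunk in text.split_whitespace() {
        // Strip trailing separators from the chunk, leaving periods alone.
        let core = chunk.trim_end_matches(|c: char| matches!(c, ',' | ';' | ':' | '!' | '?'));
        if !core.is_empty() {
            tokens.push(core);
        }
        // Emit every stripped separator as a single-character token.
        let tail = &chunk[core.len()..];
        for (i, c) in tail.char_indices() {
            tokens.push(&tail[i..i + c.len_utf8()]);
        }
    }
    tokens
}
```

Calling `tokenize("Hello, world! Mr. Smith left.")` yields `Hello`, `,`, `world`, `!`, `Mr.`, `Smith`, `left.`. The tokens are slices of the input, so no strings are copied.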
Usage
The library allows you to train the tokenizer yourself or use pre-trained data.
To use pre-trained data from English texts (other languages available):
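extern crate punkt;
// NOTE: module paths are reconstructed and may differ between crate versions.
use punkt::trainer::TrainingData;
use punkt::tokenizer::SentenceTokenizer;
let doc: &str = ...;
// Load the pre-trained English data.
let data = TrainingData::english();
for sent in SentenceTokenizer::new(doc, &data) {
  println!("{:?}", sent);
}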
The paper describing the Punkt algorithm also states that the tokenizer can learn all of the necessary information from the document it is tokenizing.
To emulate this behavior:
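extern crate punkt;
use std::default::Default;
// NOTE: module paths are reconstructed and may differ between crate versions.
use punkt::trainer::Trainer;
use punkt::tokenizer::SentenceTokenizer;
let doc: &str = ...;
let mut data = Default::default();
// Trainer requires a mutable reference to data, so it must be instantiated
// within a block, to release the mutable borrow.
{
  let mut trainer = Trainer::new(&mut data);
  trainer.train(doc);
  trainer.finalize();
}
for sent in SentenceTokenizer::new(doc, &data) {
  println!("{:?}", sent);
}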
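The comment about releasing the mutable borrow describes a general Rust pattern, which can be sketched without the crate. The types `Counts` and `Learner` below are hypothetical stand-ins, not punkt's types. They only show that a value holding `&mut` access must go out of scope before the data can be borrowed again.

```rust
// Hypothetical stand-ins for illustration; NOT the punkt crate's types.
#[derive(Default)]
struct Counts {
    tokens: usize,
}

// Holds exclusive (mutable) access to `Counts` for as long as it lives.
struct Learner<'a> {
    counts: &'a mut Counts,
}

impl<'a> Learner<'a> {
    fn new(counts: &'a mut Counts) -> Self {
        Learner { counts }
    }

    // Accumulates a whitespace-delimited token count into the borrowed data.
    fn train(&mut self, doc: &str) {
        self.counts.tokens += doc.split_whitespace().count();
    }
}

fn count_tokens(doc: &str) -> usize {
    let mut counts = Counts::default();
    {
        // The mutable borrow of `counts` is confined to this block.
        let mut learner = Learner::new(&mut counts);
        learner.train(doc);
    } // `learner` goes out of scope here, ending the mutable borrow.

    // `counts` can now be borrowed immutably, as a tokenizer would do.
    let shared = &counts;
    shared.tokens
}
```

On compilers with non-lexical lifetimes, the explicit block is often unnecessary when the borrowing value is not used afterwards. Keeping the block still makes the scope of the borrow obvious.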
You can also manually train the tokenizer:
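extern crate punkt;
use std::default::Default;
// NOTE: module paths and Trainer calls are reconstructed; check the crate docs.
use punkt::trainer::Trainer;
use punkt::tokenizer::SentenceTokenizer;
let docs: Vec<&str> = vec![...];
let mut data = Default::default();
{
  let mut trainer = Trainer::new(&mut data);
  for d in docs.iter() { trainer.train(*d); }
  trainer.finalize();
}
// Tokenize any document with the trained data; here, the first one.
for sent in SentenceTokenizer::new(docs[0], &data) { println!("{:?}", sent); }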
Benchmarks
Specs of my machine:
- Intel Core i5-4460 @ 3.20 GHz x 4
- 8 GB RAM
- Fedora 20
- SSD
test tokenizer::sentence::bench_sentence_tokenizer_train_on_document_long ... bench: 126307934 ns/iter (+/- 2396747)
test tokenizer::sentence::bench_sentence_tokenizer_train_on_document_medium ... bench: 873410 ns/iter (+/- 32674)
test tokenizer::sentence::bench_sentence_tokenizer_train_on_document_short ... bench: 680775 ns/iter (+/- 21910)
test tokenizer::word::word_tokenizer_bench_long ... bench: 10532662 ns/iter (+/- 349243)
test tokenizer::word::word_tokenizer_bench_medium ... bench: 223920 ns/iter (+/- 10958)
test tokenizer::word::word_tokenizer_bench_short ... bench: 187889 ns/iter (+/- 17145)
test tokenizer::word::word_tokenizer_bench_very_long ... bench: 36701973 ns/iter (+/- 1131175)
test trainer::trainer::bench_trainer_long ... bench: 25637329 ns/iter (+/- 794873)
test trainer::trainer::bench_trainer_medium ... bench: 616952 ns/iter (+/- 41141)
test trainer::trainer::bench_trainer_short ... bench: 478615 ns/iter (+/- 29167)
test trainer::trainer::bench_trainer_very_long ... bench: 87842418 ns/iter (+/- 1961972)
For sentence tokenization, and training on the input document, the Rust implementation runs roughly 10x faster than the Python implementations. To benchmark the Python side, I used timeit on the tokenize method after calling train.
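The benchmark figures above are reported in nanoseconds per iteration, which is hard to read at a glance. As a small convenience sketch (the helper names `parse_ns_per_iter` and `ns_to_ms` are made up for this example), the following extracts the number from a benchmark output line and converts it to milliseconds.

```rust
/// Extracts the `ns/iter` figure from one line of benchmark output.
/// Returns `None` if the line has no `bench:` field or no digits after it.
/// Thousands separators (commas) are tolerated and ignored.
fn parse_ns_per_iter(line: &str) -> Option<u64> {
    // Everything after the first "bench:" marker.
    let after = line.split("bench:").nth(1)?;
    let digits: String = after
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_digit() || *c == ',')
        .filter(|c| *c != ',')
        .collect();
    digits.parse().ok()
}

/// Converts nanoseconds to milliseconds.
fn ns_to_ms(ns: u64) -> f64 {
    ns as f64 / 1_000_000.0
}
```

Applied to the `bench_sentence_tokenizer_train_on_document_long` line, this gives 126307934 ns, or about 126.3 ms per iteration. The `bench_trainer_very_long` line works out to about 87.8 ms.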