Crate segtok


A rule-based sentence segmenter (splitter) and word tokenizer that rely on orthographic features. Ported from the original Python package, which is no longer maintained; this port also fixes the contractions bug.

use segtok::{segmenter::*, tokenizer::*};

let input = include_str!("../tests/test_google.txt");

let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
    .into_iter()
    .map(|span| split_contractions(web_tokenizer(&span)).collect())
    .collect();
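
To give a rough feel for what "rule-based" segmentation on orthographic features can mean, here is a minimal, self-contained sketch. It is not segtok's actual rule set: the function `naive_split` is a hypothetical name invented for this illustration, and the single rule it applies (split after `.`, `!` or `?` when whitespace and then an uppercase letter follow) is an assumption chosen for simplicity.

```rust
/// Illustration only, NOT segtok's implementation.
/// Splits after '.', '!' or '?' when the terminator is followed by
/// whitespace and the next non-whitespace character is uppercase.
fn naive_split(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0; // byte offset where the current sentence begins

    for (i, c) in text.char_indices() {
        if matches!(c, '.' | '!' | '?') {
            let end = i + c.len_utf8();
            let rest = &text[end..];
            // Orthographic cues: whitespace after the terminator,
            // then an uppercase letter opening the next sentence.
            let followed_by_space = rest.chars().next().map_or(false, char::is_whitespace);
            let next_is_upper = rest
                .trim_start()
                .chars()
                .next()
                .map_or(false, char::is_uppercase);
            if followed_by_space && next_is_upper {
                sentences.push(text[start..end].trim());
                start = end;
            }
        }
    }

    // Whatever remains after the last split point is the final sentence.
    let tail = text[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}
```

A single rule like this leaves the period in "3.5" alone because no whitespace follows it, but it would wrongly split after an abbreviation such as "Dr." when a capitalized name follows. Handling such cases is the reason a fuller pattern set is needed.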

Modules

segmenter
A pattern-based sentence segmentation strategy.
tokenizer