Crate sentencepiece

This crate provides Rust bindings to the sentencepiece library, an unsupervised text tokenizer.

The main data structure of this crate is SentencePieceProcessor, which is used to tokenize sentences:

use sentencepiece::SentencePieceProcessor;

// Load a trained sentencepiece model from disk.
let spp = SentencePieceProcessor::open("testdata/toy.model").unwrap();
// Tokenize the sentence and keep only the piece strings.
let pieces = spp.encode("I saw a girl with a telescope.").unwrap()
  .into_iter().map(|p| p.piece).collect::<Vec<_>>();
assert_eq!(pieces, vec!["▁I", "▁saw", "▁a", "▁girl", "▁with",
  "▁a", "▁t", "el", "es", "c", "o", "pe", "."]);
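
In the pieces above, the ▁ character (U+2581) marks a position where the original text had whitespace. The following is a minimal sketch of how such pieces map back to the input sentence. The helper join_pieces is hypothetical and is not part of this crate; it only illustrates the marker convention and assumes whitespace in the input was a single space.

```rust
/// Hypothetical helper (not part of the sentencepiece crate): joins pieces
/// back into text by turning each U+2581 marker into a space.
/// Assumes every marker stood for exactly one space in the original input.
fn join_pieces(pieces: &[&str]) -> String {
    // Concatenate the pieces without any separator.
    let joined: String = pieces.concat();
    // Replace the whitespace marker with a space and drop the leading space
    // that the first piece's marker produces.
    joined.replace('\u{2581}', " ").trim_start().to_string()
}
```

Applied to the pieces in the example above, this helper should give back the sentence that was passed to encode.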

Structs

PieceWithId
Sentence piece with its identifier and string span.
SentencePieceProcessor
Sentence piece tokenizer.

Enums

CSentencePieceError
Errors that are returned by the sentencepiece library.
SentencePieceError