Crate sentencepiece
This crate binds the sentencepiece library. sentencepiece is an unsupervised text tokenizer.
The main data structure of this crate is SentencePieceProcessor, which is used to tokenize sentences:
use sentencepiece::SentencePieceProcessor;

// Load a trained model from disk.
let spp = SentencePieceProcessor::open("testdata/toy.model").unwrap();
// Encode a sentence and keep only the piece strings.
let pieces = spp.encode("I saw a girl with a telescope.").unwrap()
    .into_iter().map(|p| p.piece).collect::<Vec<_>>();
assert_eq!(pieces, vec!["▁I", "▁saw", "▁a", "▁girl", "▁with",
    "▁a", "▁t", "el", "es", "c", "o", "pe", "."]);
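In the pieces above, the "▁" character (U+2581) marks a position where the original text had whitespace, so the pieces can be joined back into the sentence. The sketch below illustrates that convention using only the standard library. `join_pieces` is a hypothetical helper written for this illustration; it is not part of this crate's API and assumes the input was plain space-separated text.

```rust
/// Rejoin SentencePiece-style pieces into plain text.
/// Hypothetical helper for illustration only; not part of the sentencepiece crate.
fn join_pieces(pieces: &[&str]) -> String {
    // Concatenate all pieces, then turn each U+2581 marker back into a space.
    let joined: String = pieces.concat().replace('\u{2581}', " ");
    // The marker on the first piece produces one leading space; drop it if present.
    joined.strip_prefix(' ').unwrap_or(joined.as_str()).to_string()
}
```

Calling `join_pieces` on the thirteen pieces from the example should give back "I saw a girl with a telescope.", since "▁t", "el", "es", "c", "o", "pe" concatenate to " telescope" before the leading marker is stripped.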
Structs
- Sentence piece with its identifier and string span.
- Sentence piece tokenizer.
Enums
- Errors returned by the sentencepiece library.