This crate provides text tokenizers for preparing inputs for machine-learning model inference. It implements popular tokenization methods such as WordPiece (used by BERT) and Byte Pair Encoding (used by GPT-2).
It does not support training new vocabularies and isn’t optimized for processing very large volumes of text. If you need a tokenization crate with more complete functionality, see HuggingFace tokenizers.
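To make the WordPiece method mentioned above concrete, below is a minimal sketch of its greedy longest-match-first algorithm in plain Rust. The function name, the toy vocabulary, and the `##` continuation marker follow the general scheme described in the BERT paper; this is an illustration of the technique, not this crate's actual API.

```rust
use std::collections::HashMap;

/// A minimal sketch of WordPiece-style greedy longest-match tokenization.
/// Returns `None` if the word cannot be covered by the vocabulary
/// (real implementations emit an `[UNK]` token instead).
fn wordpiece_tokenize(word: &str, vocab: &HashMap<String, u32>) -> Option<Vec<u32>> {
    let chars: Vec<char> = word.chars().collect();
    let mut ids = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        // Find the longest vocabulary entry that matches at `start`.
        let mut end = chars.len();
        let mut found = None;
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{piece}"); // continuation-piece marker
            }
            if let Some(&id) = vocab.get(&piece) {
                found = Some((id, end));
                break;
            }
            end -= 1;
        }
        let (id, next) = found?;
        ids.push(id);
        start = next;
    }
    Some(ids)
}

fn main() {
    let vocab: HashMap<String, u32> = [("un", 0), ("##affable", 1)]
        .into_iter()
        .map(|(s, id)| (s.to_string(), id))
        .collect();
    // "unaffable" splits into "un" + "##affable", as in the BERT paper.
    assert_eq!(wordpiece_tokenize("unaffable", &vocab), Some(vec![0, 1]));
}
```

The greedy longest-match step is what distinguishes WordPiece inference from BPE, which instead merges character pairs according to a learned merge table.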
Modules
- Tools for performing string normalization prior to tokenization (see the sketch after this list).
- Tokenizers for converting text into sequences of token IDs.
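As an illustration of the kind of normalization the first module performs, here is a minimal, std-only sketch that lowercases text and collapses whitespace. The function name is hypothetical; real normalizers typically also handle Unicode normalization forms and accent stripping.

```rust
/// A minimal sketch of pre-tokenization normalization:
/// lowercase the input and collapse runs of whitespace.
fn normalize(text: &str) -> String {
    text.to_lowercase()
        .split_whitespace() // collapses interior runs and trims the ends
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    assert_eq!(normalize("  Hello\tWORLD  "), "hello world");
}
```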