Crate rten_text

source ·
Expand description

This crate provides text tokenizers for preparing inputs for inference of machine-learning models. It provides implementations of popular tokenization methods such as WordPiece (used by BERT), and Byte Pair Encoding (used by GPT-2).

It does not support training new vocabularies and isn’t optimized for processing very large volumes of text. If you need a tokenization crate with more complete functionality, see HuggingFace tokenizers.

Modules§

  • Tools for performing string normalization prior to tokenization.
  • Tokenizers for converting text into sequences of token IDs.