Module tokenizer

Text tokenizer for embeddings.

Provides BPE (Byte Pair Encoding) and WordPiece tokenization, vocabulary management, special token handling, bidirectional token-to-ID mapping, text encoding/decoding, max-sequence-length truncation, and batch tokenization.
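To make the bidirectional mapping, encoding/decoding, and truncation concrete, here is a minimal sketch. It does not use this module's API: `Vocab`, its fields, and the `encode`/`decode` signatures are hypothetical, and whitespace splitting stands in for the real sub-word algorithms.

```rust
use std::collections::HashMap;

/// Hypothetical vocabulary with a bidirectional token <-> ID mapping.
/// Not the crate's type; names and layout are assumptions.
struct Vocab {
    token_to_id: HashMap<String, u32>,
    id_to_token: Vec<String>,
}

impl Vocab {
    /// Assign IDs in insertion order, skipping duplicate tokens.
    fn new(tokens: &[&str]) -> Self {
        let mut token_to_id = HashMap::new();
        let mut id_to_token = Vec::new();
        for t in tokens {
            if !token_to_id.contains_key(*t) {
                token_to_id.insert(t.to_string(), id_to_token.len() as u32);
                id_to_token.push(t.to_string());
            }
        }
        Vocab { token_to_id, id_to_token }
    }

    /// Whitespace-split encoding (a stand-in for BPE/WordPiece).
    /// Unknown words map to `unk_id`; output is truncated to `max_len` IDs.
    fn encode(&self, text: &str, unk_id: u32, max_len: usize) -> Vec<u32> {
        text.split_whitespace()
            .map(|w| *self.token_to_id.get(w).unwrap_or(&unk_id))
            .take(max_len)
            .collect()
    }

    /// Map IDs back to tokens, skipping IDs outside the vocabulary.
    fn decode(&self, ids: &[u32]) -> String {
        ids.iter()
            .filter_map(|&id| self.id_to_token.get(id as usize).map(String::as_str))
            .collect::<Vec<_>>()
            .join(" ")
    }
}
```

Keeping a `HashMap` for token-to-ID lookups and a `Vec` for ID-to-token lookups is one common way to get constant-time access in both directions; the module itself may store its vocabulary differently.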

Structs§

EncodeResult
The result of encoding a piece of text.
MergeRule
A single BPE merge rule: the adjacent pair (left, right) is merged into the single token merged.
Tokenizer
A text tokenizer supporting BPE and WordPiece sub-word algorithms.
TokenizerConfig
Configuration for building a Tokenizer.
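As an illustration of what a merge rule does, the sketch below applies an ordered list of rules to one word. The `Rule` struct and `apply_merges` function are hypothetical: the field names follow the MergeRule description above, but they are not taken from the crate's source, and this is a simplified variant of BPE encoding that applies each rule once, in list order.

```rust
/// Hypothetical stand-in for a BPE merge rule (not the crate's MergeRule).
struct Rule {
    left: String,
    right: String,
    merged: String,
}

/// Split `word` into single-character symbols, then apply each rule in
/// order, replacing every adjacent (left, right) pair with `merged`.
fn apply_merges(word: &str, rules: &[Rule]) -> Vec<String> {
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    for rule in rules {
        let mut out = Vec::with_capacity(symbols.len());
        let mut i = 0;
        while i < symbols.len() {
            let is_pair = i + 1 < symbols.len()
                && symbols[i] == rule.left
                && symbols[i + 1] == rule.right;
            if is_pair {
                // Consume both symbols and emit the merged token.
                out.push(rule.merged.clone());
                i += 2;
            } else {
                out.push(symbols[i].clone());
                i += 1;
            }
        }
        symbols = out;
    }
    symbols
}
```

With the rules `(l, o) -> lo` followed by `(lo, w) -> low`, the word `lower` goes from `l o w e r` to `lo w e r` and then to `low e r`.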

Enums§

SpecialToken
Well-known special tokens used by transformer models.
TokenizerMode
The sub-word algorithm used by the tokenizer.
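For comparison with BPE, here is a rough sketch of WordPiece-style greedy longest-match tokenization together with a small special-token enum. Everything in it is an assumption for illustration: the `Special` enum, its BERT-style strings (`[UNK]`, `[CLS]`, `[SEP]`), the `##` continuation prefix, and both function names are hypothetical and may not match SpecialToken or TokenizerMode in this module.

```rust
use std::collections::HashSet;

/// Hypothetical subset of special tokens (BERT-style strings assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Special {
    Unk,
    Cls,
    Sep,
}

impl Special {
    fn as_str(self) -> &'static str {
        match self {
            Special::Unk => "[UNK]",
            Special::Cls => "[CLS]",
            Special::Sep => "[SEP]",
        }
    }
}

/// Greedy longest-match-first split of one word. Pieces after the first
/// carry a "##" prefix. If any position has no match, the whole word
/// becomes the unknown token.
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Shrink the candidate from the right until it is in the vocabulary.
        while end > start {
            let mut cand: String = chars[start..end].iter().collect();
            if start > 0 {
                cand.insert_str(0, "##");
            }
            if vocab.contains(cand.as_str()) {
                found = Some(cand);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            None => return vec![Special::Unk.as_str().to_string()],
        }
    }
    pieces
}

/// Wrap a token sequence in [CLS] ... [SEP].
fn with_specials(pieces: Vec<String>) -> Vec<String> {
    let mut out = vec![Special::Cls.as_str().to_string()];
    out.extend(pieces);
    out.push(Special::Sep.as_str().to_string());
    out
}
```

Unlike BPE, which builds tokens bottom-up by applying merges, this approach works top-down against a fixed vocabulary, so it needs no merge list at encoding time.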