Tokenizer for language models.
```rust
use kitoken::Kitoken;

let encoder = Kitoken::from_file("tests/models/llama2.kit")?;

let tokens = encoder.encode("Your future belongs to me.", true)?;
let string = String::from_utf8(encoder.decode(&tokens, true)?)?;
assert!(string == "Your future belongs to me.");
```

Overview
Kitoken is a fast and versatile tokenizer for language models with support for BPE, Unigram and WordPiece tokenization.
Kitoken is compatible with many existing tokenizer formats,
including SentencePiece, HuggingFace Tokenizers, OpenAI Tiktoken and Mistral Tekken,
and provides utilities for converting these formats. See convert for information about the supported formats and conversion utilities.
See Kitoken for the main entry point and additional information.
Cargo features

Default features
- std: Enables standard library features, including reading and writing definitions from and to files.
- serialization: Enables serde implementations and methods for serialization and deserialization of definitions.
- normalization: Enables all input normalization features. When disabled, individual normalizers can be enabled using the following features:
  - normalization-unicode: Enables unicode input normalization support. This is required for certain models. Can be disabled to reduce binary size if unicode normalization is not required.
  - normalization-charsmap: Enables precompiled charsmap input normalization support. This is required for certain models. Can be disabled to reduce binary size if charsmap normalization is not required.
- convert: Enables detection and conversion utilities for common tokenizer data formats. When disabled, individual converters can be enabled using the following features:
  - convert-tokenizers: Enables conversion from HuggingFace Tokenizers tokenizer definitions.
  - convert-sentencepiece: Enables conversion from SentencePiece tokenizer definitions.
  - convert-tiktoken: Enables conversion from OpenAI Tiktoken tokenizer definitions.
  - convert-tekken: Enables conversion from Mistral Tekken tokenizer definitions.
  - convert-detect: Enables detection of supported formats during deserialization. Enables the serialization feature.
- regex-perf: Enables additional regex performance optimizations. Can be disabled to reduce binary size.
- multiversion: Enables the use of multiversion for generating multiple code paths with different CPU feature utilization.
Optional features
- split: Enables additional split features, including unicode script splitting.
  - split-unicode-script: Enables unicode script splitting. This is required for certain models. Disabled by default since it increases binary size and the majority of models don’t require it.
- regex-unicode: Enables support for additional regex unicode patterns, including script and segmentation extensions. Disabled by default since it increases binary size and the majority of models don’t make use of these patterns.
- regex-onig: Enables use of the oniguruma regex engine instead of fancy-regex. Generally not recommended, since it has worse runtime performance and adds a dependency on the native oniguruma library. However, it may be useful for certain models that require specific regex behavior that is not supported by or differs from fancy-regex.
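As a sketch of how these features combine, a build that only needs to load and convert Tiktoken definitions, without any input normalization, could disable the defaults and opt back into just what it uses. The feature names below are taken from the lists above; the version requirement is a placeholder, not a recommendation:

```toml
[dependencies]
# Disable default features, then re-enable only what this build needs.
kitoken = { version = "*", default-features = false, features = [
    "std",              # reading definitions from files
    "convert-tiktoken", # conversion from OpenAI Tiktoken definitions
] }
```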
Modules

- convert - Utilities for converting different tokenizer formats into Kitoken definitions.
Structs

- CharsMap - Character mapping structure for custom normalization rules.
- Configuration - Configuration for the tokenizer.
- Definition - Kitoken tokenizer definition.
- Kitoken - Kitoken tokenizer. A fast and versatile tokenizer for language models.
- Metadata - Kitoken tokenizer definition metadata.
- Regex - Regex wrapper for different regex engines with serialization support.
- RegexError - Regex error type.
- SpecialToken - Special token structure.
- Template - Output template.
- Token - Token structure.
Enums

- ConfigurationError - Errors returned when the configuration fails to validate.
- DecodeError - Errors encountered during decoding.
- Decoding - Post-detokenization output decoding configuration.
- DecodingReplacePattern - Replacement pattern.
- DeserializationError - Errors encountered when deserializing the tokenizer.
- EncodeError - Errors encountered during encoding.
- Fallback - Tokenization mode fallback.
- InitializationError - Errors encountered during initialization.
- InsertionPosition - Template insertion position.
- Model - Kitoken tokenizer model.
- Normalization - Pre-tokenization input normalization configuration.
- NormalizationCondition - Condition for conditional normalization.
- NormalizationReplacePattern - Replacement pattern.
- Processing - Post-tokenization output processing configuration.
- ProcessingDirection - Processing direction.
- SpecialTokenKind - Special token type.
- Split - Pre-tokenization input split configuration.
- SplitBehavior - Split behavior.
- SplitPattern - Split pattern.
- UnicodeNormalization - Unicode normalization scheme.
Type Aliases

- Scores - List of token scores.
- SpecialTokenIdent - Identifier for special tokens.
- SpecialVocab - List of special tokens.
- TokenBytes - Byte sequence of a token.
- TokenId - Numeric identifier of a token.
- TokenScore - Score of a token.
- Vocab - List of tokens.