Crate kitoken


Tokenizer for language models.

use kitoken::Kitoken;
let encoder = Kitoken::from_file("tests/models/llama2.kit")?;

let tokens = encoder.encode("Your future belongs to me.", true)?;
let string = String::from_utf8(encoder.decode(&tokens, true)?)?;

assert!(string == "Your future belongs to me.");

§Overview

Kitoken is a fast and versatile tokenizer for language models with support for BPE, Unigram and WordPiece tokenization.

Kitoken is compatible with many existing tokenizer formats, including SentencePiece, HuggingFace Tokenizers, OpenAI Tiktoken and Mistral Tekken, and provides utilities for converting them. See convert for information about the supported formats and conversion utilities.

See Kitoken for the main entry point and additional information.

§Cargo features

§Default features

  • std: Enables standard library features, including reading and writing definitions from and to files.
  • serialization: Enables serde implementations and methods for serialization and deserialization of definitions.
  • normalization: Enables all input normalization features. When disabled, individual normalizers can be enabled using the following features:
    • normalization-unicode: Enables unicode input normalization support. This is required for certain models. Can be disabled to reduce binary size if unicode normalization is not required.
    • normalization-charsmap: Enables precompiled charsmap input normalization support. This is required for certain models. Can be disabled to reduce binary size if charsmap normalization is not required.
  • convert: Enables detection and conversion utilities for common tokenizer data formats. When disabled, individual converters can be enabled using the following features:
    • convert-tokenizers: Enables conversion from HuggingFace Tokenizers tokenizer definitions.
    • convert-sentencepiece: Enables conversion from SentencePiece tokenizer definitions.
    • convert-tiktoken: Enables conversion from OpenAI Tiktoken tokenizer definitions.
    • convert-tekken: Enables conversion from Mistral Tekken tokenizer definitions.
    • convert-detect: Enables detection of supported formats during deserialization. Enables the serialization feature.
  • regex-perf: Enables additional regex performance optimizations. Can be disabled to reduce binary size.
  • multiversion: Enables the use of multiversion for generating multiple code paths with different CPU feature utilization.

§Optional features

  • split: Enables additional split features including unicode script splitting.
    • split-unicode-script: Enables unicode script splitting. This is required for certain models. Disabled by default since it increases binary size and the majority of models don’t require it.
  • regex-unicode: Enables support for additional regex unicode patterns including script and segmentation extensions. Disabled by default since it increases binary size and the majority of models don’t make use of these patterns.
  • regex-onig: Enables use of the oniguruma regex engine instead of fancy-regex. Generally not recommended, since it has worse runtime performance and adds a dependency on the native oniguruma library. However, it can be useful for models that require specific regex behavior which fancy-regex does not support or implements differently.
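Feature selection follows the usual Cargo pattern; for example, a build that only needs SentencePiece conversion and unicode normalization could disable default features and opt back in selectively. A sketch (adjust the version requirement and feature set to your model's needs):

```toml
[dependencies]
kitoken = { version = "*", default-features = false, features = [
    "std",                   # reading and writing definition files
    "convert-sentencepiece", # conversion from SentencePiece definitions
    "normalization-unicode", # unicode input normalization
] }
```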

Modules§

convert
Utilities for converting different tokenizer formats into Kitoken definitions.

Structs§

CharsMap
Character mapping structure for custom normalization rules.
Configuration
Configuration for the tokenizer.
Definition
Kitoken tokenizer definition.
Kitoken
Kitoken tokenizer. A fast and versatile tokenizer for language models.
Metadata
Kitoken tokenizer definition metadata.
Regex
Regex wrapper for different regex engines with serialization support.
RegexError
Regex error type.
SpecialToken
Special token structure.
Template
Output template.
Token
Token structure.

Enums§

ConfigurationError
Errors returned when the configuration fails to validate.
DecodeError
Errors encountered during decoding.
Decoding
Post-detokenization output decoding configuration.
DecodingReplacePattern
Replacement pattern.
DeserializationError (serialization)
Errors encountered when deserializing the tokenizer.
EncodeError
Errors encountered during encoding.
Fallback
Tokenization mode fallback.
InitializationError
Errors encountered during initialization.
InsertionPosition
Template insertion position.
Model
Kitoken tokenizer model.
Normalization
Pre-tokenization input normalization configuration.
NormalizationCondition
Condition for conditional normalization.
NormalizationReplacePattern
Replacement pattern.
Processing
Post-tokenization output processing configuration.
ProcessingDirection
Processing direction.
SpecialTokenKind
Special token type.
Split
Pre-tokenization input split configuration.
SplitBehavior
Split behavior.
SplitPattern
Split pattern.
UnicodeNormalization
Unicode normalization scheme.

Type Aliases§

Scores
List of token scores.
SpecialTokenIdent
Identifier for special tokens.
SpecialVocab
List of special tokens.
TokenBytes
Byte sequence of a token.
TokenId
Numeric identifier of a token.
TokenScore
Score of a token.
Vocab
List of tokens.