Crate wordchipper

§`wordchipper` LLM Tokenizer Suite

This is a high-performance LLM tokenizer suite.

§Client Summary

§Core Client Types

TokenType - the parameterized integer type used for tokens; choose from { u16, u32, u64 }.
UnifiedTokenVocab<T> - the unified vocabulary type.
TokenEncoder<T> and TokenDecoder<T> - the encoder and decoder interfaces.
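
To make the encoder and decoder roles concrete, here is a minimal, self-contained sketch of the two directions (bytes or text to `Vec<T>`, and `&[T]` back to bytes or text). The `ToyEncoder`, `ToyDecoder`, and `ByteLevel` names and their method signatures are hypothetical and are not wordchipper's actual traits; the toy tokenizer simply maps each byte to one token.

```rust
use std::convert::TryFrom;

/// Hypothetical encoder shape (not wordchipper's `TokenEncoder`): bytes in, tokens out.
pub trait ToyEncoder<T> {
    fn encode_bytes(&self, bytes: &[u8]) -> Vec<T>;

    /// Text is encoded through its UTF-8 bytes.
    fn encode_str(&self, text: &str) -> Vec<T> {
        self.encode_bytes(text.as_bytes())
    }
}

/// Hypothetical decoder shape (not wordchipper's `TokenDecoder`): tokens in, bytes out.
pub trait ToyDecoder<T> {
    /// Returns `None` if any token is unknown to the decoder.
    fn decode_bytes(&self, tokens: &[T]) -> Option<Vec<u8>>;

    /// Returns `None` for unknown tokens or invalid UTF-8.
    fn decode_string(&self, tokens: &[T]) -> Option<String> {
        String::from_utf8(self.decode_bytes(tokens)?).ok()
    }
}

/// Toy tokenizer: one token per byte, with the byte value as the token id.
pub struct ByteLevel;

impl<T: From<u8>> ToyEncoder<T> for ByteLevel {
    fn encode_bytes(&self, bytes: &[u8]) -> Vec<T> {
        bytes.iter().map(|&b| T::from(b)).collect()
    }
}

impl<T: Copy> ToyDecoder<T> for ByteLevel
where
    u8: TryFrom<T>,
{
    fn decode_bytes(&self, tokens: &[T]) -> Option<Vec<u8>> {
        // Any token id above 255 is unknown to this toy decoder.
        tokens.iter().map(|&t| u8::try_from(t).ok()).collect()
    }
}
```

Because both traits are generic over `T`, the same `ByteLevel` value can produce `Vec<u16>`, `Vec<u32>`, or `Vec<u64>` depending on the type the caller asks for.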

§Pre-Trained Models

WordchipperDiskCache - the disk cache for loading models.
OATokenizer - public pre-trained OpenAI tokenizers.

§`TokenType` and `WCHash`* Types

wordchipper is parameterized over an abstract primitive integer TokenType. This permits vocabularies and tokenizers in the { u16, u32, u64 } types.
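
As an illustration of why the token width is a parameter, the sketch below models an integer token trait and checks whether a vocabulary of a given size fits in each width. `ToyToken`, `vocab_fits`, and `token_buffer_bytes` are hypothetical names introduced here; they are not part of wordchipper and do not reflect how `TokenType` is actually defined.

```rust
use std::fmt::Debug;

/// Hypothetical stand-in for a primitive-integer token trait.
/// This is NOT wordchipper's `TokenType`; it only mirrors the idea.
pub trait ToyToken: Copy + Eq + Debug {
    /// Largest token id representable by this type.
    const MAX_INDEX: u64;
}

impl ToyToken for u16 {
    const MAX_INDEX: u64 = u16::MAX as u64;
}
impl ToyToken for u32 {
    const MAX_INDEX: u64 = u32::MAX as u64;
}
impl ToyToken for u64 {
    const MAX_INDEX: u64 = u64::MAX;
}

/// Returns true when every id in `0..vocab_size` fits in `T`.
pub fn vocab_fits<T: ToyToken>(vocab_size: u64) -> bool {
    vocab_size == 0 || vocab_size - 1 <= T::MAX_INDEX
}

/// Bytes needed to store `len` tokens of type `T`.
pub fn token_buffer_bytes<T: ToyToken>(len: usize) -> usize {
    len * std::mem::size_of::<T>()
}
```

Under these assumptions, a 50,000-entry vocabulary fits in `u16` and halves token storage relative to `u32`, while a 200,000-entry vocabulary needs at least `u32`.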

It is also feature-parameterized over the WCHashSet and WCHashMap types, which are used to represent sets and maps of tokens. These are provided for convenience and are not required for correctness.

§Unified Vocabulary

The core user-facing vocabulary type is UnifiedTokenVocab<T>.

Pre-trained vocabulary loaders return UnifiedTokenVocab<T> instances, which can be converted between TokenTypes via UnifiedTokenVocab::to_token_type.
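
The idea behind converting a vocabulary between token types can be sketched with plain standard-library types. The helper below re-keys a span-to-token map from one integer width to another and reports failure when an id does not fit. `convert_token_map` is a hypothetical function, and the `Option` return is an assumption made for this sketch; how `UnifiedTokenVocab::to_token_type` handles out-of-range ids is not stated here.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Hypothetical helper (not a wordchipper API): converts a bytes -> token map
/// from token type `A` to token type `B`.
/// Returns `None` if any token id does not fit in `B`.
pub fn convert_token_map<A, B>(src: &HashMap<Vec<u8>, A>) -> Option<HashMap<Vec<u8>, B>>
where
    A: Copy,
    B: TryFrom<A>,
{
    src.iter()
        // A failed conversion becomes `None`, which short-circuits the collect.
        .map(|(span, &tok)| B::try_from(tok).ok().map(|t| (span.clone(), t)))
        .collect()
}
```

Widening (for example `u32` to `u64`) always succeeds in this sketch; narrowing succeeds only when every id is within range of the target type.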

§Loading and Saving Models

Loading a pre-trained model means reading the vocabulary as either a vocab::SpanMapVocab or a vocab::PairMapVocab (each of which must have an attached vocab::ByteMapVocab), then combining it with a spanners::TextSpanningConfig to produce a UnifiedTokenVocab<T>.
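
To show how a byte map and a pair map can work together in BPE encoding, here is a small standalone sketch. `ToyVocab` and `toy_vocab` are hypothetical and do not mirror the layout of `vocab::ByteMapVocab` or `vocab::PairMapVocab`. The sketch also assumes that a lower merged token id means a higher merge priority, which is a common convention but is not confirmed here for wordchipper.

```rust
use std::collections::HashMap;

/// Hypothetical toy vocabulary: a byte table plus pair merges.
pub struct ToyVocab {
    /// Token id for each of the 256 byte values.
    pub byte_to_token: [u32; 256],
    /// (left, right) -> merged token id; lower ids are assumed to merge first.
    pub pair_to_token: HashMap<(u32, u32), u32>,
}

impl ToyVocab {
    /// Encodes one span by repeatedly applying the lowest-id merge available.
    pub fn encode_span(&self, span: &[u8]) -> Vec<u32> {
        let mut tokens: Vec<u32> = span
            .iter()
            .map(|&b| self.byte_to_token[b as usize])
            .collect();
        loop {
            // Find the adjacent pair whose merged token id is smallest
            // (ties are broken by the leftmost position).
            let best = tokens
                .windows(2)
                .enumerate()
                .filter_map(|(i, w)| self.pair_to_token.get(&(w[0], w[1])).map(|&t| (t, i)))
                .min();
            match best {
                Some((merged, i)) => {
                    // Replace the pair at positions i and i + 1 with the merged token.
                    tokens[i] = merged;
                    tokens.remove(i + 1);
                }
                None => return tokens,
            }
        }
    }
}

/// Builds a toy vocabulary with identity byte tokens and two merges.
pub fn toy_vocab() -> ToyVocab {
    let mut byte_to_token = [0u32; 256];
    for (b, slot) in byte_to_token.iter_mut().enumerate() {
        *slot = b as u32;
    }
    let mut pair_to_token = HashMap::new();
    pair_to_token.insert((b'a' as u32, b'b' as u32), 256); // "ab"
    pair_to_token.insert((256, b'c' as u32), 257); // "abc"
    ToyVocab { byte_to_token, pair_to_token }
}
```

In a real pipeline, a spanning configuration would first split the input text into spans, and each span would then be encoded independently along these lines.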

A number of IO helpers are provided in vocab::io.

§Loading Public Pre-trained Models

For a number of pretrained models, simplified constructors are available to download, cache, and load the vocabulary.

Most users will want to use the load_vocab function, which returns a LabeledVocab; its vocab() accessor yields the UnifiedTokenVocab containing the vocabulary and spanner configuration.

There is also a list_vocabs function which lists the available pretrained models.

See disk_cache::WordchipperDiskCache for details on the disk cache.

use std::sync::Arc;

use wordchipper::{
    Tokenizer,
    TokenizerOptions,
    WCResult,
    disk_cache::WordchipperDiskCache,
    load_vocab,
};

fn example() -> WCResult<Arc<Tokenizer<u32>>> {
    // Use the default disk cache for downloaded vocabularies.
    let mut disk_cache = WordchipperDiskCache::default();
    // Download (if needed), cache, and load the named pre-trained vocabulary.
    let loaded = load_vocab("openai:o200k_harmony", &mut disk_cache)?;
    // Build a tokenizer with default options from the loaded vocabulary.
    Ok(TokenizerOptions::default().build(loaded.vocab().clone()))
}

§Crate Features

client — The base set of features needed to load and run pre-trained encoders and decoders. Uses platform-native TLS by default (OpenSSL on Linux). Combine with rustls-tls instead of default-tls for environments without system OpenSSL (e.g. manylinux).
download — The download feature enables downloading vocabularies from the internet.
default-tls — Use the platform-native TLS backend (OpenSSL on Linux).
rustls-tls — Use rustls (pure Rust) TLS backend for downloads. Useful for manylinux builds.
std (enabled by default) — The “std” feature enables the use of the std library. Without “std”, the crate uses hashbrown for HashMap/HashSet.
datagym — Enables datagym io.
fast-hash (enabled by default) — Swaps HashMap/HashSet to foldhash for faster hashing. Works in both std and no_std environments.
parallel (enabled by default) — Enables parallelism wrappers using the rayon crate.
concurrent (enabled by default) — Gates thread pool (PoolToy) and concurrency utilities.
tracing — Enables a number of tracing instrumentation points. This is only useful for tracing the timing of the library itself.
testing — Enable test utilities for downstream users.
huggingface — Enable loading pretrained huggingface modules.

Modules§

decoders: Token Decoders
disk_cache: Wordchipper Disk Cache
encoders: Token Encoders
pretrained: Public Vocabulary Information
spanners: Text Segmentation
support: Support and Utility Modules
vocab: Vocabulary

Macros§

carrot_str: Generate a “<|$name|>” string literal.
declare_carrot_special: Declare a special token constant with carrot_str!().
join_patterns: An extension of join_strs!() which uses “|” as the separator.
join_strs: A macro to concatenate multiple string literals with a specified separator.
reserved_carrot_str: Generate a “<|reserved_{$value}|>” string literal.
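
The string-building macros above can be approximated with `concat!`. The sketch below uses hypothetical `toy_carrot_str!` and `toy_join_strs!` macros with an invented argument syntax; it does not reproduce the crate's own macro definitions or calling conventions, only the shape of the literals they are described as producing.

```rust
/// Hypothetical re-creation of the "<|name|>" pattern; not the crate's macro.
macro_rules! toy_carrot_str {
    ($name:literal) => {
        concat!("<|", $name, "|>")
    };
}

/// Hypothetical literal joiner: `toy_join_strs!(sep; a, b, c)`.
/// The `sep; items...` syntax is an assumption made for this sketch.
macro_rules! toy_join_strs {
    ($sep:literal; $first:literal $(, $rest:literal)*) => {
        concat!($first $(, $sep, $rest)*)
    };
}

/// Both macros expand to string literals, so they can initialize constants.
pub const TOY_END_OF_TEXT: &str = toy_carrot_str!("endoftext");
pub const TOY_PATTERN: &str = toy_join_strs!("|"; "[a-z]+", "[0-9]+", "\\s+");
```

Because the joining happens at compile time, the resulting patterns and special-token strings carry no runtime formatting cost.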

Structs§

LabeledVocab: Resolved vocabulary with its description and loaded vocabulary.
TokenDecoderOptions: Options for configuring a TokenDecoder.
TokenEncoderOptions: Options for configuring a TokenEncoder.
Tokenizer: Unified Tokenizer.
TokenizerOptions: Options for configuring a Tokenizer.
UnifiedTokenVocab: A unified vocabulary structure for BPE tokenization that provides coherent views of vocabulary components through multiple mapping interfaces.
VocabDescription: A description of a pretrained tokenizer.
VocabListing: A listing of known tokenizers.
VocabQuery: A lookup query.

Enums§

SpecialFilter: A policy for filtering special tokens.
WCError: Errors from wordchipper operations.

Traits§

TokenDecoder: The common trait for &[T] -> Vec<u8>/String decoders.
TokenEncoder: The common trait for String/&[u8] -> Vec<T> encoders.
TokenType: A type that can be used as a token in BPE-based encoders.
VocabIndex: Common traits for token vocabularies.

Functions§

hash_map_new: Create a new empty hash map.
hash_map_with_capacity: Create a new hash map with the given capacity.
list_models: List the available pretrained models.
list_vocabs: List all known vocabularies across all loaders.
load_vocab: Load a LabeledVocab by name.
resolve_vocab: Resolve a VocabListing by name.

Type Aliases§

Pair: A pair of tokens.
WCHashIter: Iterator over hash map entries.
WCHashMap: Type Alias for hash maps in this crate.
WCHashSet: Type Alias for hash sets in this crate.
WCResult: Result type for wordchipper operations.