§wordchipper LLM Tokenizer Suite
This is a high-performance LLM tokenizer suite.
§Client Summary
§Core Client Types
- TokenType - the parameterized integer type used for tokens; choose from { u16, u32, u64 }.
- UnifiedTokenVocab<T> - the unified vocabulary type.
- TokenEncoder<T> and TokenDecoder<T> - the encoder and decoder interfaces.
§Pre-Trained Models
- WordchipperDiskCache - the disk cache for loading models.
- OATokenizer - public pre-trained OpenAI tokenizers.
§TokenType and WCHash* Types
wordchipper is parameterized over an abstract primitive integer
TokenType. This permits vocabularies and tokenizers in the { u16, u32, u64 } types.
It is also feature-parameterized over the WCHashSet and WCHashMap
types, which are used to represent sets and maps of tokens.
These are provided for convenience and are not required for correctness.
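As a rough illustration of why the token width is a parameter, the sketch below picks the narrowest of the three widths that can hold every id of a vocabulary. The helper `min_token_bits` is hypothetical and not part of wordchipper; it also assumes token ids are dense and start at 0.

```rust
/// Hypothetical helper (not part of wordchipper): the narrowest unsigned
/// integer width, in bits, that can represent every token id of a
/// vocabulary with `vocab_size` entries, assuming ids run `0..vocab_size`.
fn min_token_bits(vocab_size: u64) -> u32 {
    // The largest id is one less than the number of entries.
    let max_id = vocab_size.saturating_sub(1);
    if max_id <= u64::from(u16::MAX) {
        16
    } else if max_id <= u64::from(u32::MAX) {
        32
    } else {
        64
    }
}
```

Under these assumptions a vocabulary of about 50,000 entries fits in u16, while one of about 200,000 entries needs u32.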
§Unified Vocabulary
The core user-facing vocabulary type is UnifiedTokenVocab<T>.
Pre-trained vocabulary loaders return UnifiedTokenVocab<T> instances,
which can be converted between TokenTypes via
UnifiedTokenVocab::to_token_type.
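Converting between token widths is only lossless when every id fits in the target type. The following is a minimal sketch of a checked conversion over a plain token slice; `convert_tokens` is a hypothetical helper built on the standard `TryFrom` trait, and it is not the implementation of UnifiedTokenVocab::to_token_type, whose behaviour on overflow is not described here.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not the crate's `to_token_type`): convert a token
/// sequence to another integer width, returning `None` if any id does not
/// fit in the target type.
fn convert_tokens<A, B>(tokens: &[A]) -> Option<Vec<B>>
where
    A: Copy,
    B: TryFrom<A>,
{
    // Collecting into `Option<Vec<_>>` stops at the first failed conversion.
    tokens.iter().map(|&t| B::try_from(t).ok()).collect()
}
```

Widening (for example u32 to u64) always succeeds; narrowing (u32 to u16) yields `None` as soon as one id exceeds the target range.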
§Loading and Saving Models
Loading a pre-trained model requires reading in the vocabulary,
either as a vocab::SpanMapVocab or vocab::PairMapVocab
(either of which must have an attached vocab::ByteMapVocab);
and merging that with a spanners::TextSpanningConfig
to produce a UnifiedTokenVocab<T>.
A number of IO helpers are provided in vocab::io.
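To make the relationship between the two vocabulary shapes concrete, here is a sketch using plain standard-library maps. It assumes, from the type names only, that a byte map assigns a token to each byte, a pair map records BPE merges as (left, right) -> merged, and a span map associates tokens with byte spans. The `Token` alias and `spans_from_pairs` function are hypothetical and do not reflect the crate's actual types or IO helpers.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a token id; the crate's own types are not used here.
type Token = u32;

/// Expand a pair-merge table into the byte span each token decodes to.
/// Merges that refer to unknown tokens are left unresolved and skipped.
fn spans_from_pairs(
    byte_map: &HashMap<u8, Token>,
    pairs: &HashMap<(Token, Token), Token>,
) -> HashMap<Token, Vec<u8>> {
    // Seed the result with the single-byte tokens.
    let mut spans: HashMap<Token, Vec<u8>> = byte_map
        .iter()
        .map(|(&byte, &token)| (token, vec![byte]))
        .collect();

    let mut pending: Vec<((Token, Token), Token)> =
        pairs.iter().map(|(&pair, &merged)| (pair, merged)).collect();

    loop {
        let before = pending.len();
        // Resolve every merge whose two halves already have known spans.
        pending.retain(|&((left, right), merged)| {
            let span = match (spans.get(&left), spans.get(&right)) {
                (Some(l), Some(r)) => Some([l.as_slice(), r.as_slice()].concat()),
                _ => None,
            };
            match span {
                Some(bytes) => {
                    spans.insert(merged, bytes);
                    false // resolved, drop from the pending list
                }
                None => true, // keep for a later pass
            }
        });
        // Stop when everything is resolved or no progress was made.
        if pending.is_empty() || pending.len() == before {
            break;
        }
    }
    spans
}
```

This is why either shape needs the byte map attached: without the byte-level tokens there is nothing to anchor the merged spans to.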
§Loading Public Pre-trained Models
For a number of pretrained models, simplified constructors are available to download, cache, and load the vocabulary.
Most users will want to use the load_vocab function, which returns a
LabeledVocab; its vocab() accessor yields the UnifiedTokenVocab containing
the vocabulary and spanner configuration.
There is also a list_vocabs function which lists the available
pretrained models.
See disk_cache::WordchipperDiskCache for details on the disk cache.
use std::sync::Arc;

use wordchipper::{
    Tokenizer,
    TokenizerOptions,
    UnifiedTokenVocab,
    WCResult,
    disk_cache::WordchipperDiskCache,
    load_vocab,
};

fn example() -> WCResult<Arc<Tokenizer<u32>>> {
    let mut disk_cache = WordchipperDiskCache::default();
    let loaded = load_vocab("openai:o200k_harmony", &mut disk_cache)?;
    Ok(TokenizerOptions::default().build(loaded.vocab().clone()))
}
§Crate Features
- client - The base set of features needed to load and run pre-trained encoders and decoders. Uses platform-native TLS by default (OpenSSL on Linux). Combine with rustls-tls instead of default-tls for environments without system OpenSSL (e.g. manylinux).
- download - Enables downloading vocabularies from the internet.
- default-tls - Use the platform-native TLS backend (OpenSSL on Linux).
- rustls-tls - Use the rustls (pure Rust) TLS backend for downloads. Useful for manylinux builds.
- std (enabled by default) - Enables the use of the std library. Without std, the crate uses hashbrown for HashMap/HashSet.
- datagym - Enables datagym IO.
- fast-hash (enabled by default) - Swaps HashMap/HashSet to foldhash for faster hashing. Works in both std and no_std environments.
- parallel (enabled by default) - Enables parallelism wrappers using the rayon crate.
- concurrent (enabled by default) - Gates the thread pool (PoolToy) and concurrency utilities.
- tracing - Enables a number of tracing instrumentation points. This is only useful for timing traces of the library itself.
- testing - Enables test utilities for downstream users.
- huggingface - Enables loading pretrained huggingface modules.
Modules§
- decoders - Token Decoders
- disk_cache - Wordchipper Disk Cache
- encoders - Token Encoders
- pretrained - Public Vocabulary Information
- spanners - Text Segmentation
- support - Support and Utility Modules
- vocab - Vocabulary
Macros§
- carrot_str - Generate a “<|$name|>” string literal.
- declare_carrot_special - Declare a special token constant with carrot_str!().
- join_patterns - An extension of join_strs!() which uses “|” as the separator.
- join_strs - A macro to concatenate multiple string literals with a specified separator.
- reserved_carrot_str - Generate a “<|reserved_{$value}|>” string literal.
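The string-building macros above can be approximated with the standard `concat!` macro. The sketch below is illustrative only: the `_sketch` names and the `sep; a, b, c` invocation syntax are assumptions, and the crate's real macros may take their arguments differently.

```rust
/// Hypothetical analogue of a macro producing a "<|name|>" literal.
macro_rules! carrot_str_sketch {
    ($name:literal) => {
        concat!("<|", $name, "|>")
    };
}

/// Hypothetical analogue of a macro producing a "<|reserved_N|>" literal.
macro_rules! reserved_carrot_str_sketch {
    ($value:literal) => {
        concat!("<|reserved_", $value, "|>")
    };
}

/// Hypothetical analogue of a literal-joining macro: the separator comes
/// first, then a `;`, then one or more comma-separated literals.
macro_rules! join_strs_sketch {
    ($sep:literal; $first:literal $(, $rest:literal)*) => {
        concat!($first $(, $sep, $rest)*)
    };
}
```

Because everything goes through `concat!`, each result is a `&'static str` built at compile time, which is what makes such macros usable in constant declarations and regex pattern literals.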
Structs§
- LabeledVocab - Resolved vocabulary with its description and loaded vocabulary.
- TokenDecoderOptions - Options for configuring a TokenDecoder.
- TokenEncoderOptions - Options for configuring a TokenEncoder.
- Tokenizer - Unified Tokenizer.
- TokenizerOptions - Options for configuring a Tokenizer.
- UnifiedTokenVocab - A unified vocabulary structure for BPE tokenization that provides coherent views of vocabulary components through multiple mapping interfaces.
- VocabDescription - A description of a pretrained tokenizer.
- VocabListing - A listing of a known tokenizer.
- VocabQuery - A lookup query.
Enums§
- SpecialFilter - A policy for filtering special tokens.
- WCError - Errors from wordchipper operations.
Traits§
- TokenDecoder - The common trait for &[T] -> Vec<u8>/String decoders.
- TokenEncoder - The common trait for String/&[u8] -> Vec<T> encoders.
- TokenType - A type that can be used as a token in BPE-based encoders.
- VocabIndex - Common traits for token vocabularies.
Functions§
- hash_map_new - Create a new empty hash map.
- hash_map_with_capacity - Create a new hash map with the given capacity.
- list_models - List the available pretrained models.
- list_vocabs - List all known vocabularies across all loaders.
- load_vocab - Load a LabeledVocab by name.
- resolve_vocab - Resolve a VocabListing by name.
Type Aliases§
- Pair - A pair of tokens.
- WCHashIter - Iterator over hash map entries.
- WCHashMap - Type alias for hash maps in this crate.
- WCHashSet - Type alias for hash sets in this crate.
- WCResult - Result type for wordchipper operations.