Crate wordchipper

§wordchipper LLM Tokenizer Suite

This is a high-performance LLM tokenizer suite.

§Client Summary

§Core Client Types

§Pre-Trained Models

§TokenType and WCHash* Types

wordchipper is parameterized over an abstract primitive integer TokenType. This permits vocabularies and tokenizers in the { u16, u32, u64 } types.
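As a rough illustration of what parameterizing over a token integer type buys, the sketch below checks whether a vocabulary of a given size fits in a candidate type. The `TokenLike` trait and `vocab_fits` function are hypothetical stand-ins written for this example; they are not the crate's `TokenType` trait, whose actual bounds may differ.

```rust
use std::fmt::Debug;
use std::hash::Hash;

/// Hypothetical stand-in for the crate's `TokenType` trait; the real trait
/// may require different bounds.
pub trait TokenLike: Copy + Eq + Hash + Debug + TryFrom<usize> {}

impl TokenLike for u16 {}
impl TokenLike for u32 {}
impl TokenLike for u64 {}

/// Returns true when every id in `0..vocab_size` is representable as `T`.
pub fn vocab_fits<T: TokenLike>(vocab_size: usize) -> bool {
    // The largest id in a vocabulary of `vocab_size` entries is `vocab_size - 1`.
    vocab_size == 0 || T::try_from(vocab_size - 1).is_ok()
}
```

Under these assumptions, `vocab_fits::<u16>(200_000)` is false while `vocab_fits::<u32>(200_000)` is true, which is the kind of trade-off that choosing a narrower token type involves.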

It is also feature-parameterized over the WCHashSet and WCHashMap types, which are used to represent sets and maps of tokens. These are provided for convenience and are not required for correctness.
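One common way to make a map type feature-dependent is a pair of `cfg`-gated type aliases. The sketch below shows that pattern only; `SketchHashMap` and `sketch_hash_map_with_capacity` are hypothetical names, and the crate's real `WCHashMap` definition may be arranged differently. When the `fast-hash` feature is not set (as in a standalone build), only the standard-library branch is compiled.

```rust
// Assumption: with "fast-hash" enabled, the alias points at foldhash's map.
#[cfg(feature = "fast-hash")]
pub type SketchHashMap<K, V> = foldhash::HashMap<K, V>;

// Fallback: the standard library map with its default hasher.
#[cfg(not(feature = "fast-hash"))]
pub type SketchHashMap<K, V> = std::collections::HashMap<K, V>;

/// Hypothetical analogue of `hash_map_with_capacity`: callers never name the
/// hasher, so the alias can change without touching call sites.
pub fn sketch_hash_map_with_capacity<K, V>(capacity: usize) -> SketchHashMap<K, V> {
    SketchHashMap::with_capacity_and_hasher(capacity, Default::default())
}
```

Routing construction through a helper like this is what lets downstream code stay identical regardless of which hasher the feature flags select.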

§Unified Vocabulary

The core user-facing vocabulary type is UnifiedTokenVocab<T>.

Pre-trained vocabulary loaders return UnifiedTokenVocab<T> instances, which can be converted between TokenTypes via UnifiedTokenVocab::to_token_type.
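Converting a vocabulary between token types amounts to re-encoding every token id in the target integer type. The sketch below does this for a plain span-to-token map and reports failure when an id does not fit. `convert_vocab` is a hypothetical helper, not the crate's API; how `UnifiedTokenVocab::to_token_type` handles out-of-range ids is not stated in this page.

```rust
use std::collections::HashMap;

/// Re-encode every token id of `vocab` as type `B`.
/// Returns `None` if any id is not representable in `B`.
pub fn convert_vocab<A, B>(vocab: &HashMap<Vec<u8>, A>) -> Option<HashMap<Vec<u8>, B>>
where
    A: Copy + Into<u64>,
    B: TryFrom<u64>,
{
    vocab
        .iter()
        .map(|(span, &tok)| {
            // Widen to u64 first, then narrow with a checked conversion.
            B::try_from(tok.into()).ok().map(|t| (span.clone(), t))
        })
        .collect()
}
```

Collecting into `Option<HashMap<..>>` stops at the first id that does not fit, so a partially converted vocabulary is never returned.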

§Loading and Saving Models

Loading a pre-trained model requires reading in the vocabulary, either as a vocab::SpanMapVocab or vocab::PairMapVocab (either of which must have an attached vocab::ByteMapVocab); and merging that with a spanners::TextSpanningConfig to produce a UnifiedTokenVocab<T>.

A number of IO helpers are provided in vocab::io.

§Loading Public Pre-trained Models

For a number of pretrained models, simplified constructors are available to download, cache, and load the vocabulary.

Most users will want to use the load_vocab function, which returns a LabeledVocab; its vocab() accessor yields the UnifiedTokenVocab containing the vocabulary and spanner configuration.

There is also a list_vocabs function which lists the available pretrained models.

See disk_cache::WordchipperDiskCache for details on the disk cache.

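use std::sync::Arc;

use wordchipper::{
    Tokenizer,
    TokenizerOptions,
    WCResult,
    disk_cache::WordchipperDiskCache,
    load_vocab,
};

// Load a public pretrained vocabulary and build a tokenizer from it.
fn example() -> WCResult<Arc<Tokenizer<u32>>> {
    // Default disk cache; see disk_cache::WordchipperDiskCache.
    let mut disk_cache = WordchipperDiskCache::default();
    // Resolve the named vocabulary and load it through the cache.
    let loaded = load_vocab("openai:o200k_harmony", &mut disk_cache)?;
    // Build a tokenizer with default options from the loaded vocabulary.
    Ok(TokenizerOptions::default().build(loaded.vocab().clone()))
}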

§Crate Features

  • client — The base set of features needed to load and run pre-trained encoders and decoders. Uses platform-native TLS by default (OpenSSL on Linux). Combine with rustls-tls instead of default-tls for environments without system OpenSSL (e.g. manylinux).
  • download — The download feature enables downloading vocabularies from the internet.
  • default-tls — Use the platform-native TLS backend (OpenSSL on Linux).
  • rustls-tls — Use rustls (pure Rust) TLS backend for downloads. Useful for manylinux builds.
  • std (enabled by default) — The “std” feature enables the use of the std library. Without “std”, the crate uses hashbrown for HashMap/HashSet.
  • datagym — Enables datagym io.
  • fast-hash (enabled by default) — Swaps HashMap/HashSet to foldhash for faster hashing. Works in both std and no_std environments.
  • parallel (enabled by default) — Enables parallelism wrappers using the rayon crate.
  • concurrent (enabled by default) — Gates thread pool (PoolToy) and concurrency utilities.
  • tracing — This enables a number of tracing instrumentation points. This is only useful for timing tracing of the library itself.
  • testing — Enable test utilities for downstream users.
  • huggingface — Enable loading pretrained huggingface modules.
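
As a sketch of how the features above might be selected, the Cargo.toml fragment below enables client and download on top of the defaults. The version requirement is a placeholder, not a known release; how default-tls and rustls-tls interact with the default feature set is not spelled out here, so the commented alternative is an assumption to verify against the crate's Cargo.toml.

```toml
[dependencies]
# Placeholder version: pin to an actual published release.
wordchipper = { version = "*", features = ["client", "download"] }

# Assumed alternative for environments without system OpenSSL:
# wordchipper = { version = "*", features = ["client", "download", "rustls-tls"] }
```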

Modules§

decoders
Token Decoders
disk_cache
Wordchipper Disk Cache
encoders
Token Encoders
pretrained
Public Vocabulary Information
spanners
Text Segmentation
support
Support and Utility Modules
vocab
Vocabulary

Macros§

carrot_str
Generate a “<|$name|>” string literal.
declare_carrot_special
Declare a special token constant with carrot_str!().
join_patterns
An extension of join_strs!() which uses “|” as the separator.
join_strs
A macro to concatenate multiple string literals with a specified separator.
reserved_carrot_str
Generate a “<|reserved_{$value}|>” string literal.
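
To make the descriptions above concrete, the sketch below shows how macros of this kind can be written with `concat!` so that the result is a compile-time string literal. The `*_sketch` macros and the constants are hypothetical; the crate's actual macros may accept different syntax.

```rust
/// Hypothetical analogue of `carrot_str!`: wraps a name in "<|" and "|>".
macro_rules! carrot_str_sketch {
    ($name:literal) => {
        concat!("<|", $name, "|>")
    };
}

/// Hypothetical analogue of `reserved_carrot_str!`.
macro_rules! reserved_carrot_str_sketch {
    ($value:literal) => {
        concat!("<|reserved_", $value, "|>")
    };
}

/// Hypothetical analogue of `join_strs!`: joins literals with a separator.
/// The `sep; a, b, c` call syntax is an assumption made for this sketch.
macro_rules! join_strs_sketch {
    ($sep:literal; $first:literal $(, $rest:literal)*) => {
        concat!($first $(, $sep, $rest)*)
    };
}

// Because the macros expand to literals, they can initialize constants.
pub const END_OF_TEXT: &str = carrot_str_sketch!("endoftext");
pub const RESERVED_7: &str = reserved_carrot_str_sketch!(7);
pub const PATTERN: &str = join_strs_sketch!("|"; "a+", "b+", "c+");
```

Expanding to a literal rather than building a `String` at run time is what allows such tokens and patterns to be used in `const` items.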

Structs§

LabeledVocab
Resolved vocabulary with its description and loaded vocabulary.
TokenDecoderOptions
Options for configuring a TokenDecoder.
TokenEncoderOptions
Options for configuring a TokenEncoder.
Tokenizer
Unified Tokenizer.
TokenizerOptions
Options for configuring a Tokenizer.
UnifiedTokenVocab
A unified vocabulary structure for BPE tokenization that provides coherent views of vocabulary components through multiple mapping interfaces.
VocabDescription
A description of a pretrained tokenizer.
VocabListing
A listing of a known tokenizer.
VocabQuery
A lookup query.

Enums§

SpecialFilter
A policy for filtering special tokens.
WCError
Errors from wordchipper operations.

Traits§

TokenDecoder
The common trait for &[T] -> Vec<u8>/String decoders.
TokenEncoder
The common trait for String/&[u8] -> Vec<T> encoders.
TokenType
A type that can be used as a token in BPE-based encoders.
VocabIndex
Common traits for token vocabularies.

Functions§

hash_map_new
Create a new empty hash map.
hash_map_with_capacity
Create a new hash map with the given capacity.
list_models
List the available pretrained models.
list_vocabs
List all known vocabularies across all loaders.
load_vocab
Load a LabeledVocab by name.
resolve_vocab
Resolve a VocabListing by name.

Type Aliases§

Pair
A pair of tokens.
WCHashIter
Iterator over hash map entries.
WCHashMap
Type Alias for hash maps in this crate.
WCHashSet
Type Alias for hash sets in this crate.
WCResult
Result type for wordchipper operations.