//! # `wordchipper` LLM Tokenizer Suite
//!
//! This is a high-performance LLM tokenizer suite.
//!
//! ## Client Summary
//!
//! ### Core Client Types
//! * [`TokenType`] - the parameterized integer type used for tokens; choose
//! from `{ u16, u32, u64 }`.
//! * [`UnifiedTokenVocab<T>`] - the unified vocabulary type.
//! * [`TokenEncoder<T>`] and [`TokenDecoder<T>`] - the encoder and decoder
//! interfaces.
//!
//! ### Pre-Trained Models
//! * [`WordchipperDiskCache`](`disk_cache::WordchipperDiskCache`) - the disk
//! cache for loading models.
//! * [`OATokenizer`](`pretrained::openai::OATokenizer`) - public pre-trained
//! `OpenAI` tokenizers.
//!
//! ## `TokenType` and `WCHash*` Types
//!
//! `wordchipper` is parameterized over an abstract primitive integer
//! [`TokenType`], so vocabularies and tokenizers can use `u16`, `u32`, or
//! `u64` tokens.
//!
//! It is also feature-parameterized over the [`WCHashSet`] and [`WCHashMap`]
//! types, which are used to represent sets and maps of tokens.
//! These are provided for convenience and are not required for correctness.
//!
//! ## Unified Vocabulary
//!
//! The core user-facing vocabulary type is [`UnifiedTokenVocab<T>`].
//!
//! Pre-trained vocabulary loaders return [`UnifiedTokenVocab<T>`] instances,
//! which can be converted between [`TokenType`]s via
//! [`UnifiedTokenVocab::to_token_type`].
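//!
//! The idea of re-keying a vocabulary into a different token width can be
//! sketched with the standard library alone. `SketchToken` and
//! `convert_tokens` below are hypothetical names used only for illustration;
//! they are not part of `wordchipper`, and the real conversion may handle
//! out-of-range tokens differently.
//!
//! ```rust
//! use std::collections::HashMap;
//! use std::hash::Hash;
//!
//! // Hypothetical stand-in for a token integer trait (not the crate's own).
//! trait SketchToken: Copy + Eq + Hash + TryFrom<u64> + Into<u64> {}
//! impl SketchToken for u16 {}
//! impl SketchToken for u32 {}
//! impl SketchToken for u64 {}
//!
//! // Re-key a span -> token map into another token width.
//! // Returns `None` if any token does not fit in the target type.
//! fn convert_tokens<A: SketchToken, B: SketchToken>(
//!     vocab: &HashMap<Vec<u8>, A>,
//! ) -> Option<HashMap<Vec<u8>, B>> {
//!     vocab
//!         .iter()
//!         .map(|(span, &tok)| {
//!             // Widen to u64 first, then attempt the checked narrowing.
//!             let wide: u64 = Into::<u64>::into(tok);
//!             <B as TryFrom<u64>>::try_from(wide)
//!                 .ok()
//!                 .map(|narrow| (span.clone(), narrow))
//!         })
//!         .collect()
//! }
//! ```
//!
//! In this sketch, a `u32` map whose largest token is below 65,536 converts
//! to `u16` successfully, while a map containing token 70,000 yields `None`.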
//!
//! ## Loading and Saving Models
//!
//! Loading a pre-trained model requires reading in the vocabulary, either
//! as a [`vocab::SpanMapVocab`] or a [`vocab::PairMapVocab`] (each of which
//! must have an attached [`vocab::ByteMapVocab`]), and then merging it with
//! a [`spanners::TextSpanningConfig`] to produce a [`UnifiedTokenVocab<T>`].
//!
//! A number of IO helpers are provided in [`vocab::io`].
//!
//! ## Loading Public Pre-trained Models
//!
//! For a number of pre-trained models, simplified constructors are
//! available to download, cache, and load the vocabulary.
//!
//! Most users will want to use the [`load_vocab`] function, which will
//! return a [`UnifiedTokenVocab`] containing the vocabulary and
//! text-spanning configuration.
//!
//! There is also a [`list_vocabs`] function, which lists the available
//! pre-trained models.
//!
//! See [`disk_cache::WordchipperDiskCache`] for details on the disk cache.
//!
//! ```rust,no_run
//! use std::sync::Arc;
//!
//! use wordchipper::{
//!     Tokenizer,
//!     TokenizerOptions,
//!     UnifiedTokenVocab,
//!     WCResult,
//!     disk_cache::WordchipperDiskCache,
//!     load_vocab,
//! };
//!
//! fn example() -> WCResult<Arc<Tokenizer<u32>>> {
//!     let mut disk_cache = WordchipperDiskCache::default();
//!     let loaded = load_vocab("openai:o200k_harmony", &mut disk_cache)?;
//!     Ok(TokenizerOptions::default().build(loaded.vocab().clone()))
//! }
//! ```
//!
//! ## Crate Features
extern crate std;
extern crate alloc;
/// Re-exports of common `alloc` types that are normally in the std prelude.
///
/// Modules that use `Vec`, `String`, `Box`, or `ToString` should add:
/// ```ignore
/// use crate::prelude::*;
/// ```
pub
pub use wordchipper_disk_cache as disk_cache;
pub use TokenDecoder;
pub use TokenDecoderOptions;
pub use TokenEncoder;
pub use TokenEncoderOptions;
pub use *;
pub use ;
pub use *;
pub use *;
pub use ;