//! # Vocabulary
//!
//! This module provides the vocabulary types and their related I/O mechanisms.
//!
//! ## Byte Vocabulary
//!
//! Because of differing conventions in the community, byte values must be
//! mapped explicitly to token ranks. This mapping is provided by:
//! * [`ByteMapVocab`].
//!
//! ## Unified Vocabulary
//!
//! [`UnifiedTokenVocab<T>`] is the primary vocabulary type for end users.
//! It unifies several component vocabularies into a coherent interface:
//! * [`ByteMapVocab`] - bidirectional byte⟷token mapping
//! * [`PairMapVocab`] - BPE merge pair mapping: `(T, T) → T`
//! * [`SpanMapVocab`] - span dictionary mapping: `Vec<u8> → T`
//! * [`crate::spanners::TextSpanningConfig`] - text spanning configuration that
//!   defines how text is split into spans for encoding, including special token
//!   words
//!
//! Pre-trained vocabulary loaders return [`UnifiedTokenVocab<T>`] instances,
//! which can be converted between [`crate::TokenType`]s via
//! [`UnifiedTokenVocab::to_token_type`].
//!
//! ## Loading and Saving Models
//!
//! Loading a pre-trained model requires reading in the vocabulary,
//! as either a [`SpanMapVocab`] or a [`PairMapVocab`]
//! (each of which must have an attached [`ByteMapVocab`]),
//! and then combining it with a [`crate::spanners::TextSpanningConfig`]
//! to produce a [`UnifiedTokenVocab<T>`].
//!
//! A number of I/O helpers are provided in [`io`].
pub use *;
pub use *;
pub use *;
pub use *;
pub use *;
pub use *;
pub use *;
pub use *;
/// Default expected bytes-per-token ratio.
///
/// This is an observed bytes-per-token ratio, used as a baseline
/// for sizing encode/decode buffers. Different languages
/// and encodings will see different ratios, so it
/// may be worth adjusting the ratio used by encoders/decoders
/// in production settings.
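///
/// A minimal sketch of how such a ratio might be used to pre-size buffers.
/// The helpers `estimate_token_capacity` and `estimate_byte_capacity` are
/// hypothetical names, not part of this crate, and the constant is
/// redeclared locally only so the sketch stands alone:
///
/// ```rust
/// // Local stand-in for this module's constant (assumption: same value).
/// const DEFAULT_BYTE_PER_TOKEN_RATIO: f32 = 4.8;
///
/// // Hypothetical helper: rough token count expected for `byte_len` input
/// // bytes, rounded up so the buffer is less likely to need regrowing.
/// fn estimate_token_capacity(byte_len: usize, ratio: f32) -> usize {
///     ((byte_len as f32) / ratio).ceil() as usize
/// }
///
/// // Hypothetical helper: rough byte count expected when decoding
/// // `token_count` tokens, rounded up.
/// fn estimate_byte_capacity(token_count: usize, ratio: f32) -> usize {
///     ((token_count as f32) * ratio).ceil() as usize
/// }
/// ```
///
/// An encoder could then allocate its output with something like
/// `Vec::with_capacity(estimate_token_capacity(text.len(), DEFAULT_BYTE_PER_TOKEN_RATIO))`.
/// These are estimates only; the buffer must still be allowed to grow.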
pub const DEFAULT_BYTE_PER_TOKEN_RATIO: f32 = 4.8;