// bpe.rs
//
// Copyright (c) 2023-2024 Junpei Kawamoto
//
// This software is released under the MIT License.
//
// http://opensource.org/licenses/mit-license.php
//! This module provides a tokenizer based on the Byte Pair Encoding (BPE) model.
//!
//! Byte Pair Encoding is a sub-word tokenization technique that builds its vocabulary by
//! repeatedly merging the most frequent pairs of adjacent symbols in a training corpus,
//! which often yields a more compact representation of text for machine learning models.
//! For a deeper understanding of BPE, refer to the
//! [original paper](https://www.aclweb.org/anthology/P16-1162/).
//!
//! This module creates a [`Tokenizer`] from the path to a directory containing `vocab.json`
//! and `merges.txt`, the two files the BPE algorithm requires.
//! You can also pass an optional suffix to the decoder to adjust how tokens are decoded.
//! For details on the decoder suffix, see the documentation for
//! [BPEDecoder](https://docs.rs/tokenizers/latest/tokenizers/decoders/bpe/struct.BPEDecoder.html).
//!
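//! As a rough illustration of what the rules in `merges.txt` do, the sketch below applies an
//! ordered list of merge rules to a single word. It is a simplified, self-contained example:
//! `apply_merges` is a hypothetical helper, not part of this crate or of the `tokenizers`
//! crate, and it ignores details such as pre-tokenization and suffix markers.
//!
//! ```rust
//! /// Applies merge rules to a word that starts out split into single characters.
//! /// Rules that appear earlier in `merges` have higher priority.
//! fn apply_merges(word: &str, merges: &[(&str, &str)]) -> Vec<String> {
//!     let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
//!     loop {
//!         // Find the adjacent pair whose rule has the lowest rank, as (rank, index).
//!         let mut best: Option<(usize, usize)> = None;
//!         for i in 0..symbols.len().saturating_sub(1) {
//!             if let Some(rank) = merges
//!                 .iter()
//!                 .position(|(a, b)| *a == symbols[i] && *b == symbols[i + 1])
//!             {
//!                 if best.map_or(true, |(r, _)| rank < r) {
//!                     best = Some((rank, i));
//!                 }
//!             }
//!         }
//!         match best {
//!             // Merge the chosen pair into one symbol and look for further merges.
//!             Some((_, i)) => {
//!                 let right = symbols.remove(i + 1);
//!                 symbols[i].push_str(&right);
//!             }
//!             // No rule matches any adjacent pair, so the word is fully tokenized.
//!             None => return symbols,
//!         }
//!     }
//! }
//!
//! fn main() {
//!     // Assumed example rules, written the way lines of `merges.txt` pair two symbols.
//!     let merges = [("l", "o"), ("lo", "w"), ("e", "r")];
//!     assert_eq!(apply_merges("lower", &merges), ["low", "er"]);
//! }
//! ```
//!
//! Each resulting symbol would then be looked up in `vocab.json` to obtain its token ID.
//!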
//! The tokenizer instances created can be utilized in conjunction with structures like
//! [`Translator`][crate::Translator] and [`Generator`][crate::Generator] for tasks such as
//! translation or text generation.
//!
//! ## Example
//! The following example creates a BPE tokenizer and uses it to build a
//! [`Generator`][crate::Generator]:
//!
//! ```no_run
//! # use anyhow::Result;
//! #
//! use ct2rs::{Config, Generator};
//! use ct2rs::tokenizers::bpe;
//!
//! # fn main() -> Result<()> {
//! let path = "/path/to/model";
//! let t = Generator::with_tokenizer(&path, bpe::new(&path, None)?, &Config::default())?;
//! # Ok(())
//! # }
//! ```
use std::path::Path;
use ;
use tokenizers::decoders::bpe::BPEDecoder;
use tokenizers::models::bpe::BPE;
use tokenizers::processors::roberta::RobertaProcessing;
use tokenizers::Tokenizer as HFTokenizer;
use Tokenizer;
const VOCAB_FILE: &str = "vocab.json";
const MERGES_FILE: &str = "merges.txt";
/// Create a tokenizer instance by specifying the path to a directory containing `vocab.json`
/// and `merges.txt`.
/// Create a tokenizer instance by specifying the path to `vocab.json` and `merges.txt`.