`aleph-alpha-tokenizer` is a fast word-piece-like tokenizer based on `fst`. It can be used as a `Model` in huggingface's `tokenizers`, or standalone.
By default, this library builds only the code needed for standalone use. Add it to your `Cargo.toml` with the following `[dependencies]` entry:

```toml
[dependencies]
aleph-alpha-tokenizer = "0.3"
```
If you want to use it together with `tokenizers`, you need to enable the `huggingface` feature, so the dependency entry becomes:

```toml
[dependencies]
aleph-alpha-tokenizer = { version = "0.3", features = ["huggingface"] }
```
# Examples
To use it as a `Model`, you need to box it:

```rust
use tokenizers::{
    tokenizer::{EncodeInput, Model, Tokenizer},
    pre_tokenizers::bert::BertPreTokenizer,
};
use aleph_alpha_tokenizer::AlephAlphaTokenizer;

let mut tokenizer = Tokenizer::new(
    Box::new(AlephAlphaTokenizer::from_vocab("vocab.txt")?));
tokenizer.with_pre_tokenizer(Box::new(BertPreTokenizer));
let _result = tokenizer.encode(
    EncodeInput::Single("Some Test".to_string()), true)?;
```
Remember this depends on the `huggingface` feature. Otherwise, you can use it directly:
```rust
use aleph_alpha_tokenizer::AlephAlphaTokenizer;

let source_text = "Ein interessantes Beispiel";
let tokenizer = AlephAlphaTokenizer::from_vocab("vocab.txt")?;
let mut ids: Vec<i64> = Vec::new();
let mut ranges = Vec::new();
tokenizer.tokens_into(source_text, &mut ids, &mut ranges, None);
for (id, range) in ids.iter().zip(ranges.iter()) {
    let _token_source = &source_text[range.clone()];
    let _token_text = tokenizer.text_of(*id);
    let _is_special = tokenizer.is_special(*id);
    // etc.
}
```
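The "word-piece-like" behavior mentioned above can be illustrated with a minimal, self-contained sketch of greedy longest-match tokenization. This is not the crate's implementation (which matches against an `fst` and tracks byte ranges); the `wordpiece` function and its tiny vocabulary here are purely illustrative:

```rust
use std::collections::HashSet;

/// Greedy longest-match word-piece tokenization: repeatedly take the
/// longest vocabulary entry that prefixes the remaining input, marking
/// non-initial pieces with the "##" continuation prefix.
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Option<Vec<String>> {
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < word.len() {
        // Candidate end positions on char boundaries, longest first.
        let boundaries: Vec<usize> = word[start..]
            .char_indices()
            .map(|(i, c)| start + i + c.len_utf8())
            .collect();
        let mut matched = None;
        for &end in boundaries.iter().rev() {
            let piece = if start == 0 {
                word[start..end].to_string()
            } else {
                format!("##{}", &word[start..end])
            };
            if vocab.contains(piece.as_str()) {
                matched = Some((piece, end));
                break;
            }
        }
        // No piece matches: a real tokenizer would emit an [UNK] token.
        let (piece, end) = matched?;
        pieces.push(piece);
        start = end;
    }
    Some(pieces)
}

fn main() {
    let vocab: HashSet<&str> = ["Bei", "##spiel", "##s"].into_iter().collect();
    // → Some(["Bei", "##spiel", "##s"])
    println!("{:?}", wordpiece("Beispiels", &vocab));
}
```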
# Structs

- `AlephAlphaTokenizer`: The Tokenizer. Use `AlephAlphaTokenizer::from_vocab` to create an instance.
# Traits

- `TokenID`: A trait for converting token IDs on the fly.
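The `TokenID` trait is what lets `tokens_into` fill a buffer of whatever integer type the caller needs (the example above used `Vec<i64>`). The idea can be sketched with a minimal conversion trait; `TokenIdLike`, `from_raw`, and `push_ids` are illustrative names, not the crate's actual API:

```rust
/// Illustrative stand-in for a token-ID conversion trait: any integer
/// type that can be produced from a raw u32 vocabulary index.
trait TokenIdLike {
    fn from_raw(raw: u32) -> Self;
}

impl TokenIdLike for i64 {
    fn from_raw(raw: u32) -> Self {
        raw as i64
    }
}

impl TokenIdLike for u32 {
    fn from_raw(raw: u32) -> Self {
        raw
    }
}

/// A tokenizer can then push IDs into whichever buffer type the
/// caller provides, converting on the fly instead of in a second pass.
fn push_ids<T: TokenIdLike>(raw_ids: &[u32], out: &mut Vec<T>) {
    out.extend(raw_ids.iter().map(|&r| T::from_raw(r)));
}

fn main() {
    let raw = [101u32, 7592, 102];
    let mut as_i64: Vec<i64> = Vec::new();
    push_ids(&raw, &mut as_i64);
    // → [101, 102, 7592] in i64, no intermediate allocation
    println!("{:?}", as_i64);
}
```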