Crate aleph_alpha_tokenizer
aleph-alpha-tokenizer is a fast word-piece-like tokenizer based on fst. It can be used as a Model in huggingface's tokenizers, or standalone.
By default, this library builds only the code to be used standalone. Add it to your Cargo.toml with the following [dependencies] entry:

```toml
[dependencies]
aleph-alpha-tokenizer = "0.3"
```
If you want to use it together with tokenizers, you need to enable the huggingface feature, so the dependency entry becomes:

```toml
[dependencies]
aleph-alpha-tokenizer = { version = "0.3", features = ["huggingface"] }
```
Examples
To use as a Model, you need to box it:

```rust
use tokenizers::{
    tokenizer::{EncodeInput, Model, Tokenizer},
    pre_tokenizers::bert::BertPreTokenizer,
};
use aleph_alpha_tokenizer::AlephAlphaTokenizer;

let mut tokenizer = Tokenizer::new(
    Box::new(AlephAlphaTokenizer::from_vocab("vocab.txt")?));
tokenizer.with_pre_tokenizer(Box::new(BertPreTokenizer));
let _result = tokenizer.encode(
    EncodeInput::Single("Some Test".to_string()), true)?;
```
Remember this depends on the huggingface feature. Otherwise, you can use it directly:

```rust
use aleph_alpha_tokenizer::AlephAlphaTokenizer;

let source_text = "Ein interessantes Beispiel";
let tokenizer = AlephAlphaTokenizer::from_vocab("vocab.txt")?;
let mut ids: Vec<i64> = Vec::new();
let mut ranges = Vec::new();
tokenizer.tokens_into(source_text, &mut ids, &mut ranges, None);
for (id, range) in ids.iter().zip(ranges.iter()) {
    let _token_source = &source_text[range.clone()];
    let _token_text = tokenizer.text_of(*id);
    let _is_special = tokenizer.is_special(*id);
    // etc.
}
```
Structs
AlephAlphaTokenizer | The Tokenizer |
Traits
TokenID | A trait to be able to convert token IDs on the fly |
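The TokenID trait is what lets the example above fill a `Vec<i64>` while other callers could ask for a different integer type. The following is an illustrative sketch of that idea, not the crate's actual definition: it assumes a minimal trait with a single conversion method and a generic function that fills whichever id buffer the caller provides.

```rust
// Hypothetical re-implementation of the idea behind `TokenID`:
// code that produces token ids can stay generic over the integer
// type the caller wants to collect them into.
trait TokenID: Copy {
    fn from_raw(id: u32) -> Self;
}

impl TokenID for i64 {
    fn from_raw(id: u32) -> Self {
        id as i64
    }
}

impl TokenID for u32 {
    fn from_raw(id: u32) -> Self {
        id
    }
}

// A tokenizer-like function can then fill any caller-chosen buffer,
// converting each raw id on the fly.
fn push_ids<T: TokenID>(raw: &[u32], out: &mut Vec<T>) {
    out.extend(raw.iter().map(|&id| T::from_raw(id)));
}

fn main() {
    let raw = [101u32, 2023, 102];

    let mut as_i64: Vec<i64> = Vec::new();
    let mut as_u32: Vec<u32> = Vec::new();
    push_ids(&raw, &mut as_i64);
    push_ids(&raw, &mut as_u32);

    assert_eq!(as_i64, vec![101i64, 2023, 102]);
    assert_eq!(as_u32, vec![101u32, 2023, 102]);
}
```

The benefit of this design is that callers targeting, say, an inference runtime that expects `i64` ids (as in the example above) pay no extra conversion pass: the ids land in the right type as they are produced.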