Crate aleph_alpha_tokenizer

aleph-alpha-tokenizer is a fast word-piece-like tokenizer based on fst.

This can be used as a Model in huggingface's tokenizers, or standalone.

By default, this library builds only the code to be used standalone. Add it to your Cargo.toml with the following [dependencies] entry:

[dependencies]
aleph-alpha-tokenizer = "0.1"

If you want to use it together with tokenizers, you need to enable the huggingface feature, so the dependency entry becomes:

[dependencies]
aleph-alpha-tokenizer = { version = "0.1", features = ["huggingface"] }

Examples

To use as a Model, you need to box it:

use tokenizers::{
    tokenizer::{EncodeInput, Model, Tokenizer},
    pre_tokenizers::bert::BertPreTokenizer,
};
use aleph_alpha_tokenizer::AlephAlphaTokenizer;

let mut tokenizer = Tokenizer::new(
    Box::new(AlephAlphaTokenizer::from_vocab("vocab.txt")?));
tokenizer.with_pre_tokenizer(Box::new(BertPreTokenizer));
let _result = tokenizer.encode(
    EncodeInput::Single("Some Test".to_string()), true)?;

Remember this depends on the huggingface feature. Otherwise, you can use it directly:

use aleph_alpha_tokenizer::AlephAlphaTokenizer;

let source_text = "Ein interessantes Beispiel";
let tokenizer = AlephAlphaTokenizer::from_vocab("vocab.txt")?;
let mut ids: Vec<i64> = Vec::new();
let mut ranges = Vec::new();
tokenizer.tokens_into(source_text, &mut ids, &mut ranges, None);
for (id, range) in ids.iter().zip(ranges.iter()) {
    // `range` is the span of this token within `source_text`
    let _token_source = &source_text[range.clone()];
    let _token_text = tokenizer.text_of(*id);
    let _is_special = tokenizer.is_special(*id);
    // etc.
}
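
To make the shape of the `ids` and `ranges` output above more concrete, here is a minimal sketch of greedy longest-match tokenization using only the standard library. It is an illustration under stated assumptions, not this crate's implementation: the names `toy_tokens_into` and `toy_word_pieces` are hypothetical, a `HashMap` stands in for the fst-backed vocabulary, words are split on whitespace only, and word-piece continuation markers and special tokens are left out.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical helper: split one word into the longest vocabulary pieces,
/// left to right. If some position has no matching piece, the whole word is
/// emitted as a single `unk_id` token instead.
fn toy_word_pieces(
    vocab: &HashMap<&str, i64>,
    unk_id: i64,
    text: &str,
    word: Range<usize>,
    ids: &mut Vec<i64>,
    ranges: &mut Vec<Range<usize>>,
) {
    let (ids_len, ranges_len) = (ids.len(), ranges.len());
    let mut start = word.start;
    while start < word.end {
        // Try the longest candidate first, then shrink by one char at a time.
        let mut end = word.end;
        let found = loop {
            if let Some(&id) = vocab.get(&text[start..end]) {
                break Some((id, end));
            }
            match text[start..end].char_indices().next_back() {
                Some((offset, _)) if offset > 0 => end = start + offset,
                _ => break None, // even a single char did not match
            }
        };
        match found {
            Some((id, end)) => {
                ids.push(id);
                ranges.push(start..end);
                start = end;
            }
            None => {
                // Drop the pieces found so far and emit one unknown token.
                ids.truncate(ids_len);
                ranges.truncate(ranges_len);
                ids.push(unk_id);
                ranges.push(word.clone());
                return;
            }
        }
    }
}

/// Hypothetical helper: tokenize whitespace-separated words, appending token
/// IDs to `ids` and the matching byte ranges of `text` to `ranges`.
fn toy_tokens_into(
    vocab: &HashMap<&str, i64>,
    unk_id: i64,
    text: &str,
    ids: &mut Vec<i64>,
    ranges: &mut Vec<Range<usize>>,
) {
    let mut word_start: Option<usize> = None;
    // A trailing sentinel space flushes the last word.
    let sentinel = std::iter::once((text.len(), ' '));
    for (i, c) in text.char_indices().chain(sentinel) {
        if c.is_whitespace() {
            if let Some(start) = word_start.take() {
                toy_word_pieces(vocab, unk_id, text, start..i, ids, ranges);
            }
        } else if word_start.is_none() {
            word_start = Some(i);
        }
    }
}
```

Because each range is a byte range into the original string, slicing the source with it (as in the loop above) recovers the exact text a token came from, even when the token's vocabulary entry is an unknown-token placeholder.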

Structs

AlephAlphaTokenizer

The Tokenizer. Use AlephAlphaTokenizer::from_vocab to create an instance.

Traits

TokenID

A trait for converting token IDs on the fly.
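
The standalone example collects IDs into a `Vec<i64>`; a conversion trait of this kind is presumably what lets callers pick the integer type they need. The sketch below shows the general pattern only. `ToyTokenId`, its methods, and `toy_emit_ids` are hypothetical names invented for illustration and are not the crate's `TokenID` definition, whose actual methods are not shown on this page.

```rust
/// Hypothetical stand-in for an ID-conversion trait (not the crate's `TokenID`).
trait ToyTokenId: Copy {
    /// Convert from an internal numeric ID into the caller's ID type.
    fn from_internal(id: u64) -> Self;
    /// Convert the caller's ID type back into the internal numeric ID.
    fn to_internal(self) -> u64;
}

impl ToyTokenId for i64 {
    fn from_internal(id: u64) -> Self {
        id as i64
    }
    fn to_internal(self) -> u64 {
        self as u64
    }
}

impl ToyTokenId for u32 {
    fn from_internal(id: u64) -> Self {
        id as u32
    }
    fn to_internal(self) -> u64 {
        u64::from(self)
    }
}

/// Fill `out` with IDs converted to whatever type the caller chose,
/// without building an intermediate vector of internal IDs.
fn toy_emit_ids<T: ToyTokenId>(internal: &[u64], out: &mut Vec<T>) {
    out.clear();
    out.extend(internal.iter().map(|&id| T::from_internal(id)));
}
```

With this pattern the same tokenizing function can write directly into a `Vec<i64>` for one consumer and a `Vec<u32>` for another, with the conversion happening as each ID is emitted.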