Crate bert_tokenizer

This crate is a Rust port of Google’s BERT WordPiece tokenizer.

Structs§

BasicTokenizer
A basic tokenizer that runs basic tokenization (punctuation splitting, optional lower casing, etc.). By default, it does not lower case the input.
BasicTokenizerBuilder
Builder for BasicTokenizer.
FullTokenizer
A full tokenizer that runs basic tokenization followed by WordPiece tokenization.
FullTokenizerBuilder
Builder for FullTokenizer.
WordPieceTokenizer
A subword tokenizer that runs the WordPiece tokenization algorithm.
WordPieceTokenizerBuilder
Builder for WordPieceTokenizer.
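For reference, here is a minimal, self-contained sketch of the greedy longest-match-first algorithm that WordPiece tokenization performs. The plain HashSet vocabulary, function name, and signature are illustrative only, not the crate’s API:

```rust
use std::collections::HashSet;

// Greedy longest-match-first WordPiece: repeatedly take the longest prefix of
// the remaining characters that appears in the vocabulary; pieces after the
// first are looked up with a "##" continuation prefix.
fn wordpiece(word: &str, vocab: &HashSet<&str>, unk: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found: Option<String> = None;
        while start < end {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{}", piece); // continuation pieces get "##"
            }
            if vocab.contains(piece.as_str()) {
                found = Some(piece);
                break;
            }
            end -= 1; // shrink the candidate from the right
        }
        match found {
            Some(p) => {
                tokens.push(p);
                start = end;
            }
            // No piece matched: the whole word maps to the unknown token.
            None => return vec![unk.to_string()],
        }
    }
    tokens
}

fn main() {
    let vocab: HashSet<&str> = ["un", "##aff", "##able", "want", "##ed"].into();
    println!("{:?}", wordpiece("unaffable", &vocab, "[UNK]"));
    // → ["un", "##aff", "##able"]
}
```

Collapsing an unsegmentable word to the unknown token, rather than emitting partial pieces, matches the behavior of the original BERT implementation.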

Traits§

Tokenizer
A trait for tokenizing text, implemented by BasicTokenizer and WordPieceTokenizer.
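The trait’s exact definition is not shown on this page; a plausible minimal shape, with a toy whitespace implementor standing in for the real tokenizers, might look like the following. The method name and signature are assumptions, not the crate’s actual interface:

```rust
// Hypothetical sketch of a Tokenizer trait; the crate's real trait may use
// different method names or signatures.
trait Tokenizer {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

// Toy implementor standing in for BasicTokenizer / WordPieceTokenizer.
struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_string).collect()
    }
}

fn main() {
    let t = WhitespaceTokenizer;
    println!("{:?}", t.tokenize("hello world"));
    // → ["hello", "world"]
}
```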

Functions§

load_vocab
Loads a vocabulary from a vocabulary file. Using this function directly is not recommended; prefer FullTokenizerBuilder::vocab_from_file instead.
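A sketch of what vocabulary loading involves, assuming the standard BERT vocab.txt layout of one token per line, where a token’s id is its zero-based line index. The crate’s load_vocab reads from a file path; this illustrative variant parses a string so the logic is easy to test:

```rust
use std::collections::HashMap;

// Parse a BERT-style vocabulary: one token per line, id = line index.
// Hypothetical helper for illustration, not the crate's load_vocab.
fn parse_vocab(contents: &str) -> HashMap<String, usize> {
    contents
        .lines()
        .enumerate()
        .map(|(id, token)| (token.to_string(), id))
        .collect()
}

fn main() {
    let vocab = parse_vocab("[PAD]\n[UNK]\n[CLS]\n[SEP]\nhello");
    println!("{:?}", vocab.get("hello"));
    // → Some(4)
}
```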

Type Aliases§

Ids
A sequence of token ids.
InvVocab
A mapping from token ids back to token strings.
Vocab
A mapping from token strings to token ids.