
Tokenize words into word pieces.

This crate provides a subword tokenizer. A subword tokenizer splits a token into several pieces, so-called word pieces. Word pieces were popularized by their use in the BERT natural language encoder.

The tokenizer splits a word into pieces, returning an iterator over them. Each piece consists of the piece string and its vocabulary index.

use std::fs::File;
use std::io::BufReader;

use wordpieces::WordPieces;

// Load a word piece vocabulary from a file.
let f = File::open("testdata/test.pieces").unwrap();
let word_pieces = WordPieces::from_buf_read(BufReader::new(f)).unwrap();

// A word that can be split fully.
let pieces = word_pieces.split("coördinatie")
    .map(|p| p.piece())
    .collect::<Vec<_>>();
assert_eq!(pieces, vec![Some("coördina"), Some("tie")]);

// A word that can only be split partially.
let pieces = word_pieces.split("voorkomen")
    .map(|p| p.piece())
    .collect::<Vec<_>>();
assert_eq!(pieces, vec![Some("voor"), None]);
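Each piece also carries its vocabulary index. The sketch below prints both; it assumes, as a version-dependent detail, that WordPiece exposes an idx() accessor alongside piece(), so consult the WordPiece documentation of the release in use.

// Print each piece together with its vocabulary index. `idx()` is
// assumed to mirror `piece()`, returning None when no piece matched.
for piece in word_pieces.split("coördinatie") {
    match (piece.piece(), piece.idx()) {
        (Some(p), Some(idx)) => println!("{} has vocabulary index {}", p, idx),
        _ => println!("the remaining suffix is not in the vocabulary"),
    }
}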

Structs

WordPieces: A set of word pieces.

Enums

WordPiece: A single word piece.
WordPiecesError: Errors that can occur while reading word pieces.
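
The reading functions report failures through this error type. Below is a minimal sketch of propagating the error instead of unwrapping; it assumes WordPiecesError implements std::error::Error (so it converts into a boxed error), and load_pieces is a hypothetical helper, not part of the crate.

use std::error::Error;
use std::fs::File;
use std::io::BufReader;

use wordpieces::WordPieces;

// Hypothetical helper: propagate open/parse failures to the caller
// instead of panicking with `unwrap`.
fn load_pieces(path: &str) -> Result<WordPieces, Box<dyn Error>> {
    let f = File::open(path)?;
    Ok(WordPieces::from_buf_read(BufReader::new(f))?)
}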