Crate bbpe

Crate bbpe 

Source
Expand description

Binary byte pair encoding (BPE) training library and CLI.

The crate exposes both a library API and a bbpe command line interface for training binary-aware BPE tokenizers that interoperate with the Hugging Face tokenizers ecosystem. Typical usage loads a binary corpus, trains a BpeModel, and then persists the resulting tokenizer.json.

use bbpe::{IngestConfig, Trainer, TrainerConfig};

let trainer_cfg = TrainerConfig::builder()
    .target_vocab_size(4096)
    .min_frequency(2)
    .show_progress(false)
    .build()?;
let trainer = Trainer::new(trainer_cfg);
let ingest_cfg = IngestConfig::default();
let artifacts = trainer.train_from_paths(&["/path/to/binaries"], &ingest_cfg)?;
artifacts.model.save_huggingface("tokenizer.json")?;

The CLI is enabled by default through the cli feature. Users targeting the library portion only can disable default features to avoid the CLI dependencies: bbpe = { version = "...", default-features = false }.

Re-exports§

pub use config::IngestConfig;
pub use config::PreprocessorConfig;
pub use config::PreprocessorKind;
pub use config::TrainerBuilder;
pub use config::TrainerConfig;
pub use error::BbpeError;
pub use error::Result;
pub use metrics::IterationMetrics;
pub use metrics::TrainingMetrics;
pub use model::BinaryTokenizer;
pub use model::BpeModel;
pub use model::TokenId;
pub use trainer::Trainer;
pub use trainer::TrainerArtifacts;

Modules§

bytes
Utilities for converting between binary data and serialized token strings.
config
Configuration builders controlling training and corpus ingestion.
corpus
Facilities for discovering input files and loading binary corpora.
error
Error handling utilities shared across the crate.
metrics
Metrics describing the evolution of the training process.
model
Model types and helpers for working with trained BPE tokenizers.
preprocess
Preprocessing helpers that transform raw byte sequences before training.
serialization
Helpers for (de)serialising tokenizers in Hugging Face formats.
special_tokens
Helpers for loading and arranging special token inventories.
trainer
Core training loop responsible for producing tokenizer.json artefacts.