Expand description
Binary byte pair encoding (BPE) training library and CLI.
The crate exposes both a library API and a bbpe command line interface for
training binary-aware BPE tokenizers that interoperate with the Hugging Face
tokenizers ecosystem. Typical usage loads a binary corpus, trains a
BpeModel, and then persists the resulting tokenizer.json.
use bbpe::{IngestConfig, Trainer, TrainerConfig};
let trainer_cfg = TrainerConfig::builder()
.target_vocab_size(4096)
.min_frequency(2)
.show_progress(false)
.build()?;
let trainer = Trainer::new(trainer_cfg);
let ingest_cfg = IngestConfig::default();
let artifacts = trainer.train_from_paths(&["/path/to/binaries"], &ingest_cfg)?;
artifacts.model.save_huggingface("tokenizer.json")?;The CLI is enabled by default through the cli feature. Users targeting the
library portion only can disable default features to avoid the CLI
dependencies: bbpe = { version = "...", default-features = false }.
Re-exports§
pub use config::IngestConfig;pub use config::PreprocessorConfig;pub use config::PreprocessorKind;pub use config::TrainerBuilder;pub use config::TrainerConfig;pub use error::BbpeError;pub use error::Result;pub use metrics::IterationMetrics;pub use metrics::TrainingMetrics;pub use model::BinaryTokenizer;pub use model::BpeModel;pub use model::TokenId;pub use trainer::Trainer;pub use trainer::TrainerArtifacts;
Modules§
- bytes
- Utilities for converting between binary data and serialized token strings.
- config
- Configuration builders controlling training and corpus ingestion.
- corpus
- Facilities for discovering input files and loading binary corpora.
- error
- Error handling utilities shared across the crate.
- metrics
- Metrics describing the evolution of the training process.
- model
- Model types and helpers for working with trained BPE tokenizers.
- preprocess
- Preprocessing helpers that transform raw byte sequences before training.
- serialization
- Helpers for (de)serialising tokenizers in Hugging Face formats.
- special_
tokens - Helpers for loading and arranging special token inventories.
- trainer
- Core training loop responsible for producing
tokenizer.jsonartefacts.