Crate brainwires_datasets

Brainwires Datasets

Training data pipelines for the Brainwires Agent Framework.

Provides JSONL I/O, tokenization, deduplication, format conversion, and dataset management for cloud and local model fine-tuning workflows.
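As a rough illustration of the deduplication step over JSONL data, the sketch below keeps the first occurrence of each record line using only the standard library. The function `dedup_jsonl_lines` is a hypothetical helper written for this example; it is not part of the crate's API, and the crate's own deduplication may use a different strategy (for example, comparing parsed content rather than raw text).

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of brainwires_datasets): keeps the first
/// occurrence of each non-empty JSONL line, comparing trimmed text exactly.
fn dedup_jsonl_lines(input: &str) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut kept: Vec<String> = Vec::new();
    for line in input.lines() {
        let record = line.trim();
        if record.is_empty() {
            continue; // blank lines are not records
        }
        // `insert` returns false when the record was already present.
        if seen.insert(record.to_string()) {
            kept.push(record.to_string());
        }
    }
    kept
}
```

Exact text matching treats two records that differ only in key order or whitespace inside the JSON as distinct, so this is a lower bound on what a content-aware deduplicator would remove.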

Re-exports

pub use dataset::Dataset;
pub use dataset::InstructDataset;
pub use dataset::PreferenceDataset;
pub use error::DatasetError;
pub use error::DatasetResult;
pub use format::AlpacaFormat;
pub use format::ChatMlFormat;
pub use format::FormatConverter;
pub use format::OpenAiFormat;
pub use format::PreferenceConverter;
pub use format::ShareGptFormat;
pub use format::TogetherFormat;
pub use format::detect_format;
pub use jsonl::JsonlReader;
pub use jsonl::JsonlWriter;
pub use jsonl::read_jsonl;
pub use jsonl::read_jsonl_preferences;
pub use jsonl::write_jsonl;
pub use jsonl::write_jsonl_preferences;
pub use quality::DataValidator;
pub use quality::DatasetStats;
pub use quality::HistogramBucket;
pub use quality::IssueSeverity;
pub use quality::PreferenceStats;
pub use quality::RoleCounts;
pub use quality::ValidationIssue;
pub use quality::ValidationReport;
pub use quality::ValidatorConfig;
pub use quality::compute_preference_stats;
pub use quality::compute_stats;
pub use sampling::PreferenceSplitResult;
pub use sampling::SplitConfig;
pub use sampling::SplitResult;
pub use sampling::curriculum_order;
pub use sampling::preference_curriculum_order;
pub use sampling::preference_sample_n;
pub use sampling::preference_train_eval_split;
pub use sampling::sample_n;
pub use sampling::train_eval_split;
pub use types::DataFormat;
pub use types::PreferencePair;
pub use types::TrainingExample;
pub use types::TrainingMessage;
pub use types::TrainingRole;
pub use tokenizer::HfTokenizer;
pub use tokenizer::Tokenizer;

Modules

dataset
Dataset trait and concrete dataset implementations.
error
Error types for dataset operations.
format
Format converters for various fine-tuning providers.
jsonl
JSONL reader and writer for streaming I/O.
quality
Data quality validation, statistics, and deduplication.
sampling
Train/eval splitting, curriculum ordering, and sampling utilities.
tokenizer
Tokenizer abstractions and implementations.
types
Core training data types (messages, examples, preference pairs).
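To make the train/eval splitting mentioned for the sampling module more concrete, here is a minimal standard-library sketch of a fraction-based split. The name `split_by_fraction` and its signature are assumptions made for this example; they do not describe the crate's `train_eval_split` or `SplitConfig`, whose parameters are not shown on this page.

```rust
/// Hypothetical helper, not the crate's `train_eval_split`: the leading
/// `train_fraction` of `items` becomes the training set, the rest the eval set.
fn split_by_fraction<T: Clone>(items: &[T], train_fraction: f64) -> (Vec<T>, Vec<T>) {
    // Clamp so an out-of-range fraction cannot push the cut past the slice.
    let fraction = train_fraction.clamp(0.0, 1.0);
    let cut = ((items.len() as f64) * fraction).round() as usize;
    let cut = cut.min(items.len());
    let (train, eval) = items.split_at(cut);
    (train.to_vec(), eval.to_vec())
}
```

This sketch preserves input order and does no shuffling. A practical split would typically shuffle with a seeded random number generator first so that both sets are drawn from the same distribution and the result is reproducible.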