Brainwires Datasets
Training data pipelines for the Brainwires Agent Framework.
Provides JSONL I/O, tokenization, deduplication, format conversion, and dataset management for cloud and local model fine-tuning workflows.
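As an illustration of the JSONL and deduplication steps named above, here is a minimal sketch that uses only the Rust standard library. It does not call this crate's API, since the signatures of `read_jsonl` and the `quality` helpers are not shown on this page. The function name `dedup_jsonl_lines` is hypothetical, and the exact-match strategy (comparing trimmed lines) is an assumption, not necessarily how the crate deduplicates.

```rust
use std::collections::HashSet;
use std::io::{self, BufRead};

/// Hypothetical helper (not part of this crate): reads JSONL records from
/// `reader`, skips blank lines, and drops exact duplicate lines.
/// The first occurrence of each record is kept, in input order.
pub fn dedup_jsonl_lines<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut out = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue; // JSONL files often end with a blank line
        }
        // `insert` returns false when the record was already seen.
        if seen.insert(trimmed.to_owned()) {
            out.push(trimmed.to_owned());
        }
    }
    Ok(out)
}
```

Any `BufRead` source works here, for example `std::io::BufReader::new(File::open(path)?)` or an in-memory byte slice. Comparing raw lines treats two records with the same fields in a different key order as distinct; a real pipeline would likely parse and normalize records first.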
Re-exports
pub use dataset::Dataset;
pub use dataset::InstructDataset;
pub use dataset::PreferenceDataset;
pub use error::DatasetError;
pub use error::DatasetResult;
pub use format::AlpacaFormat;
pub use format::ChatMlFormat;
pub use format::FormatConverter;
pub use format::OpenAiFormat;
pub use format::PreferenceConverter;
pub use format::TogetherFormat;
pub use format::detect_format;
pub use jsonl::JsonlReader;
pub use jsonl::JsonlWriter;
pub use jsonl::read_jsonl;
pub use jsonl::read_jsonl_preferences;
pub use jsonl::write_jsonl;
pub use jsonl::write_jsonl_preferences;
pub use quality::DataValidator;
pub use quality::DatasetStats;
pub use quality::HistogramBucket;
pub use quality::IssueSeverity;
pub use quality::PreferenceStats;
pub use quality::RoleCounts;
pub use quality::ValidationIssue;
pub use quality::ValidationReport;
pub use quality::ValidatorConfig;
pub use quality::compute_preference_stats;
pub use quality::compute_stats;
pub use sampling::PreferenceSplitResult;
pub use sampling::SplitConfig;
pub use sampling::SplitResult;
pub use sampling::curriculum_order;
pub use sampling::preference_curriculum_order;
pub use sampling::preference_sample_n;
pub use sampling::preference_train_eval_split;
pub use sampling::sample_n;
pub use sampling::train_eval_split;
pub use types::DataFormat;
pub use types::PreferencePair;
pub use types::TrainingExample;
pub use types::TrainingMessage;
pub use types::TrainingRole;
pub use tokenizer::HfTokenizer;
pub use tokenizer::Tokenizer;
Modules
- dataset: Dataset trait and concrete dataset implementations.
- error: Error types for dataset operations.
- format: Format converters for various fine-tuning providers.
- jsonl: JSONL reader and writer for streaming I/O.
- quality: Data quality validation, statistics, and deduplication.
- sampling: Train/eval splitting, curriculum ordering, and sampling utilities.
- tokenizer: Tokenizer abstractions and implementations.
- types: Core training data types (messages, examples, preference pairs).
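To make the sampling module's description more concrete, the following standalone sketch shows one way curriculum ordering and a train/eval split can work. Both function names (`order_by_length`, `split_at_ratio`) are hypothetical and are not this crate's `curriculum_order` or `train_eval_split`, whose signatures and behavior are not documented on this page. Ordering by character count and splitting without shuffling are assumptions chosen to keep the example deterministic.

```rust
/// Hypothetical helper (not part of this crate): orders examples from
/// shortest to longest by character count. The sort is stable, so examples
/// of equal length keep their original relative order.
pub fn order_by_length(mut examples: Vec<String>) -> Vec<String> {
    examples.sort_by_key(|e| e.chars().count());
    examples
}

/// Hypothetical helper (not part of this crate): splits `items` into
/// `(train, eval)`. `train_ratio` is clamped to the range 0.0..=1.0 and the
/// train size is rounded to the nearest whole item. No shuffling is done.
pub fn split_at_ratio<T>(mut items: Vec<T>, train_ratio: f64) -> (Vec<T>, Vec<T>) {
    let ratio = train_ratio.clamp(0.0, 1.0);
    let n_train = ((items.len() as f64) * ratio).round() as usize;
    // `split_off` leaves the first `n_train` items in `items`
    // and returns the remainder.
    let eval = items.split_off(n_train.min(items.len()));
    (items, eval)
}
```

With five items and a ratio of 0.8, `split_at_ratio` keeps the first four for training and the last one for evaluation. A production split would normally shuffle with a seeded random number generator first, so that the evaluation set is not biased by input order.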