# brainwires-datasets

Training data pipelines for the Brainwires Agent Framework — JSONL I/O, tokenization, deduplication, format conversion.
## Overview

brainwires-datasets handles every step between raw training data and model-ready datasets. It reads and writes JSONL files, converts between popular fine-tuning formats (OpenAI, Together, Alpaca, ShareGPT, ChatML), validates data quality, deduplicates examples, tokenizes text, and splits into train/eval sets.

Design principles:

- **Format-agnostic** — a single `TrainingExample` type normalizes all formats; convert freely between them
- **Streaming I/O** — `JsonlReader`/`JsonlWriter` stream line-by-line to handle datasets larger than memory
- **Quality-first** — `DataValidator` catches missing fields, empty messages, and role violations before training
- **Pluggable tokenizers** — HuggingFace Tokenizers and tiktoken via feature flags
- **Deduplication** — exact hash-based dedup to remove repeated examples
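The normalized shape these principles converge on can be sketched in plain Rust. This is a simplified, std-only sketch — the real `TrainingExample`/`TrainingMessage` types also carry optional metadata and serde support, and the names `Role`, `Message`, and `Example` here are illustrative stand-ins:

```rust
// Simplified sketch of the normalized example model. Variant names
// mirror the crate's TrainingRole table; everything else is illustrative.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Role {
    System,
    User,
    Assistant,
    Tool,
}

#[derive(Debug, Clone)]
pub struct Message {
    pub role: Role,
    pub content: String,
}

#[derive(Debug, Clone)]
pub struct Example {
    pub messages: Vec<Message>,
}

impl Example {
    /// Every format converter reads from / writes into this one shape.
    pub fn new(messages: Vec<Message>) -> Self {
        Self { messages }
    }
}
```

Normalizing into one shape is what keeps conversion cheap: N formats need N converters to and from the hub type, rather than N² pairwise converters.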
```
┌──────────────────────────────────────────────────────────────┐
│                     brainwires-datasets                      │
│                                                              │
│  ┌────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐  │
│  │ JSONL  │──▶│  Format   │──▶│  Dataset  │──▶│  Quality  │  │
│  │ Reader │   │ Converter │   │ Instruct  │   │ Validator │  │
│  │ Writer │   │  OpenAI   │   │Preference │   │   Dedup   │  │
│  └────────┘   │ Together  │   └───────────┘   └─────┬─────┘  │
│               │  Alpaca   │                         │        │
│               │ ShareGPT  │                         ▼        │
│               │  ChatML   │                   ┌───────────┐  │
│               └───────────┘                   │ Tokenizer │  │
│                                               │ HF / Tik  │  │
│                                               └─────┬─────┘  │
│                                                     │        │
│                                               ┌─────▼─────┐  │
│                                               │Train/Eval │  │
│                                               │   Split   │  │
│                                               └───────────┘  │
└──────────────────────────────────────────────────────────────┘
```
Flow: JSONL → Format Converter → Dataset → Quality/Dedup → Tokenizer → Split
## Quick Start

Add to your Cargo.toml:

```toml
[dependencies]
brainwires-datasets = "0.3"
```
Load a JSONL dataset and validate it:

```rust
use brainwires_datasets::prelude::*;

// Read training examples from JSONL
let examples: Vec<TrainingExample> = read_jsonl("train.jsonl")?;

// Validate data quality
let validator = DataValidator::new(ValidatorConfig::default());
let report = validator.validate(&examples)?;
println!("{} issues found", report.issues.len());

// Compute statistics
let stats = compute_stats(&examples);
println!("{stats:?}");

// Build a dataset
let dataset = InstructDataset::from_examples(examples);
```
## Features

| Feature | Default | Description |
|---|---|---|
| `hf-tokenizer` | Yes | HuggingFace Tokenizers for token counting and BPE tokenization |
| `tiktoken` | No | OpenAI tiktoken tokenizer (cl100k_base, o200k_base, etc.) |
| `dedup` | No | Exact deduplication via SHA-256 hashing |
| `full` | No | Enables all optional features |
```toml
# With tiktoken for OpenAI token counting
[dependencies]
brainwires-datasets = { version = "0.3", features = ["tiktoken"] }

# Full feature set
[dependencies]
brainwires-datasets = { version = "0.3", features = ["full"] }

# Minimal — no tokenizer, just I/O and format conversion
[dependencies]
brainwires-datasets = { version = "0.3", default-features = false }
```
## Architecture

### Core Types

| Type | Description |
|---|---|
| `TrainingExample` | Normalized training example: list of messages with optional metadata |
| `TrainingMessage` | Single message with role and content |
| `TrainingRole` | `System`, `User`, `Assistant`, or `Tool` |
| `PreferencePair` | Chosen + rejected responses for preference tuning (DPO/ORPO) |
| `DataFormat` | Enum of supported formats: `OpenAi`, `Together`, `Alpaca`, `ShareGpt`, `ChatMl` |
### JSONL I/O

| Type | Description |
|---|---|
| `JsonlReader` | Streaming line-by-line JSONL reader |
| `JsonlWriter` | Streaming JSONL writer with flush control |
| `read_jsonl` | Convenience function to read all examples from a file |
| `write_jsonl` | Convenience function to write all examples to a file |
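The streaming model behind `JsonlReader` can be sketched with nothing but the standard library: wrap any reader, yield one non-empty line (one JSON record) at a time, and never buffer the whole file. Parsing each line into `TrainingExample` (via serde in the real crate) is omitted here:

```rust
use std::io::{BufRead, BufReader, Read};

/// Sketch of line-streaming: records are produced lazily, one per
/// line, so memory use stays flat regardless of file size. Blank
/// lines are skipped; the real reader would also deserialize each line.
fn stream_records<R: Read>(reader: R) -> impl Iterator<Item = String> {
    BufReader::new(reader)
        .lines()
        .filter_map(|line| line.ok())
        .filter(|line| !line.trim().is_empty())
}
```

Because the iterator is lazy, a caller can validate, count, or convert records while the file is still being read, which is what makes larger-than-memory datasets workable.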
### Dataset Abstractions

| Type | Description |
|---|---|
| `InstructDataset` | Collection of instruction-following examples (system/user/assistant turns) |
| `PreferenceDataset` | Collection of preference pairs for alignment training |
| `Dataset` | Common trait with `len()`, `get()`, `iter()`, and `shuffle()` |
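A minimal sketch of what a common trait like this might look like — method names are taken from the table above, but the signatures are assumptions, and `shuffle()` is omitted since it needs an RNG:

```rust
/// Sketch of a shared dataset interface: both instruct and preference
/// collections can expose the same indexed access.
trait Dataset {
    type Item;
    fn len(&self) -> usize;
    fn get(&self, idx: usize) -> Option<&Self::Item>;
    fn is_empty(&self) -> bool {
        self.len() == 0
    }
}

/// Trivial Vec-backed implementation for illustration.
struct VecDataset<T>(Vec<T>);

impl<T> Dataset for VecDataset<T> {
    type Item = T;
    fn len(&self) -> usize {
        self.0.len()
    }
    fn get(&self, idx: usize) -> Option<&T> {
        self.0.get(idx)
    }
}
```

Sharing one trait is what lets downstream tools (splitting, sampling, stats) operate on instruct and preference data uniformly.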
### Format Converters

All converters implement the `FormatConverter` trait with `to_format()` and `from_format()` methods.

| Converter | Target Format | Notes |
|---|---|---|
| `OpenAiFormat` | OpenAI fine-tuning JSONL | `messages` array with role/content |
| `TogetherFormat` | Together AI format | Similar to OpenAI with provider-specific fields |
| `AlpacaFormat` | Stanford Alpaca | `instruction`, `input`, `output` fields |
| `ShareGptFormat` | ShareGPT conversations | `conversations` array with from/value |
| `ChatMlFormat` | ChatML template | `<\|im_start\|>role\n...<\|im_end\|>` markup |
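As a concrete example of the simplest target, the ChatML markup from the last row can be rendered in a few lines. This sketch takes role/content pairs as plain tuples; the real `ChatMlFormat` operates on `TrainingExample` values:

```rust
/// Render messages into ChatML markup: each turn becomes
/// <|im_start|>role\ncontent<|im_end|> on its own block.
fn to_chatml(messages: &[(&str, &str)]) -> String {
    messages
        .iter()
        .map(|(role, content)| format!("<|im_start|>{role}\n{content}<|im_end|>\n"))
        .collect()
}
```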
### Quality Tools

| Type | Description |
|---|---|
| `DataValidator` | Validates examples against configurable rules (required fields, role order, length limits) |
| `ValidatorConfig` | Configuration for validation rules |
| `ValidationReport` | Summary of all issues found |
| `DatasetStats` | Statistics: example count, token distribution, role balance |
| `Deduplicator` | SHA-256 based exact deduplication (requires `dedup` feature) |
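Two of the checks listed above — empty messages and role-order violations — can be sketched like this. It is a simplified stand-in for `DataValidator` (roles as plain strings, rules hard-coded rather than driven by `ValidatorConfig`):

```rust
/// Sketch of two validation rules: no empty message content, and an
/// assistant turn must be preceded by a user turn.
fn find_issues(messages: &[(&str, &str)]) -> Vec<String> {
    let mut issues = Vec::new();
    let mut prev_role: Option<&str> = None;
    for (i, &(role, content)) in messages.iter().enumerate() {
        if content.trim().is_empty() {
            issues.push(format!("message {i}: empty content"));
        }
        if role == "assistant" && prev_role != Some("user") {
            issues.push(format!("message {i}: assistant not preceded by user"));
        }
        prev_role = Some(role);
    }
    issues
}
```

Running checks like these before training is cheap insurance: a single malformed example can otherwise fail an entire fine-tuning job after upload.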
### Tokenizers

| Type | Feature | Description |
|---|---|---|
| `Tokenizer` | — | Common trait for all tokenizers |
| `HfTokenizer` | `hf-tokenizer` | HuggingFace Tokenizers — any model from the Hub |
| `TiktokenTokenizer` | `tiktoken` | OpenAI tiktoken — cl100k_base, o200k_base |
### Sampling

| Function | Description |
|---|---|
| `train_eval_split` | Split dataset by ratio with optional shuffle |
| `sample_n` | Random sample of N examples |
| `curriculum_order` | Sort examples by complexity (token count) for curriculum learning |
## Usage Examples

### Format Conversion

```rust
use brainwires_datasets::prelude::*;

// Read Alpaca-format data
let examples = AlpacaFormat::from_file("alpaca.jsonl")?;

// Convert to OpenAI format
let converter = OpenAiFormat::default();
let openai_lines: Vec<String> = examples
    .iter()
    .map(|ex| converter.to_format(ex))
    .collect::<Result<_, _>>()?;

write_jsonl("openai.jsonl", &openai_lines)?;
```
### Token Counting

```rust
use brainwires_datasets::prelude::*;

// Requires the `hf-tokenizer` feature; see the `Tokenizer` trait docs.
let tokenizer = HfTokenizer::from_pretrained("gpt2")?;
let count = tokenizer.count_tokens("Hello, world!")?;
println!("{count} tokens");
```
### Deduplication
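`Deduplicator` (behind the `dedup` feature) performs exact, hash-based dedup: keep the first occurrence of each distinct example and drop later copies. The core idea, sketched dependency-free with std's `DefaultHasher` standing in for SHA-256:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Exact dedup sketch: keyed by a content hash, keeping first
/// occurrences in order. (A 64-bit hash can collide; the real
/// Deduplicator's SHA-256 makes collisions practically impossible.)
fn dedup_exact(examples: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    examples
        .into_iter()
        .filter(|ex| {
            let mut h = DefaultHasher::new();
            ex.hash(&mut h);
            seen.insert(h.finish()) // true only for the first occurrence
        })
        .collect()
}
```

Hashing instead of storing full examples keeps the seen-set small, which matters when deduplicating millions of records.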
### Train/Eval Split

```rust
use brainwires_datasets::prelude::*;

let config = SplitConfig { eval_ratio: 0.1, shuffle: true, ..Default::default() };
let split = train_eval_split(&dataset, &config)?;
println!("train: {}, eval: {}", split.train.len(), split.eval.len());
```
### Preference Datasets

```rust
use brainwires_datasets::prelude::*;

let pairs = vec![
    // One chosen + one rejected response per prompt (DPO/ORPO style)
    PreferencePair::new(prompt, chosen, rejected),
];
let dataset = PreferenceDataset::from_pairs(pairs);
```
### Data Validation

```rust
use brainwires_datasets::prelude::*;

let validator = DataValidator::new(ValidatorConfig::default());
let report = validator.validate(&examples)?;
for issue in &report.issues {
    eprintln!("{issue:?}");
}
```
## Integration with Brainwires

Use via the brainwires facade crate:

```toml
[dependencies]
brainwires = { version = "0.3", features = ["datasets"] }
```

Or depend on brainwires-datasets directly for standalone dataset tooling without the rest of the framework.

The brainwires-training crate consumes brainwires-datasets types directly — datasets flow seamlessly into both cloud and local training pipelines.
## References

### Papers

- FED: GPU-Accelerated Deduplication Framework (Jan 2025) — high-throughput dedup strategies
- LSHBloom: Internet-Scale Deduplication (Nov 2024) — locality-sensitive hashing for massive datasets
- Linguistic Laws & Subword Tokenization (Nov 2024) — analysis of tokenizer behavior
- DPO: Direct Preference Optimization (2023) — the preference pair format consumed by `PreferenceDataset`
- ORPO: Monolithic Preference Optimization (2024) — single-stage alignment data format
- SLM-Bench: Small Language Model Benchmark (EMNLP 2025) — evaluation datasets for small models

### Technical Blogs & Guides

- Modern Tokenization Techniques — CodeSignal — BPE, WordPiece, and SentencePiece
- Tokenization Deep Dive — Let's Data Science — why tokenization matters
- Diffusion Curriculum (DisCL) — ICCV 2025 — curriculum learning strategies (cf. `curriculum_order`)
- Synthetic Data for ML 2025 — generating training data

### Data Tools

- Duplodocus — Allen AI — large-scale deduplication
- fastdedup — fast exact dedup
- DataTrove — HuggingFace — data processing pipelines
## License
Licensed under the MIT License. See LICENSE for details.