# pmetal-data
Dataset loading and preprocessing for LLM training.
## Overview
This crate provides data loading, preprocessing, and batching utilities optimized for LLM fine-tuning. It supports multiple dataset formats and includes advanced features like sequence packing and chat template application.
## Supported Formats
| Format | Description | Example |
|---|---|---|
| ShareGPT | Conversation format | `{"conversations": [...]}` |
| Alpaca | Instruction format | `{"instruction": ..., "output": ...}` |
| Messages | Chat format | `{"messages": [...]}` |
| Text | Raw text | `{"text": "..."}` |
## Features
- Sequence Packing: Pack multiple sequences for efficient training
- Chat Templates: Apply model-specific conversation formatting
- Response Masking: Mask prompt tokens in loss computation
- Streaming Loading: Memory-efficient loading of large datasets
- Tokenizer Integration: HuggingFace tokenizers support
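The sequence-packing feature above boils down to binning variable-length sequences into fixed-size rows so that padding is minimized. A generic greedy first-fit sketch of the idea (this is an illustration of the technique, not the crate's actual algorithm; `pack_sequences` is a hypothetical helper):

```rust
/// Greedy first-fit packing: place each sequence into the first bin
/// with enough remaining capacity, opening a new bin when none fits.
/// Returns, for each bin, the indices of the sequences packed into it.
fn pack_sequences(lengths: &[usize], max_len: usize) -> Vec<Vec<usize>> {
    // Each bin tracks (tokens used, indices of packed sequences).
    let mut bins: Vec<(usize, Vec<usize>)> = Vec::new();
    for (i, &len) in lengths.iter().enumerate() {
        match bins.iter_mut().find(|(used, _)| *used + len <= max_len) {
            Some((used, idxs)) => {
                *used += len;
                idxs.push(i);
            }
            None => bins.push((len, vec![i])),
        }
    }
    bins.into_iter().map(|(_, idxs)| idxs).collect()
}

fn main() {
    // Five sequences packed into bins of 8 tokens:
    // lengths 5+3 fill one bin, 6+2 fill another, 4 gets its own.
    let bins = pack_sequences(&[5, 3, 6, 2, 4], 8);
    println!("{bins:?}"); // [[0, 1], [2, 3], [4]]
}
```

First-fit is a common heuristic here because it is O(n·bins) and, on typical fine-tuning data with many short sequences, already achieves the high packing efficiency this crate reports.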
## Usage

### Basic Dataset Loading
```rust
use pmetal_data::{DataLoader, Dataset}; // item names are illustrative

// Load dataset
let dataset = Dataset::from_jsonl("train.jsonl")?;

// Create dataloader
let loader = DataLoader::new(dataset, batch_size);

for batch in loader {
    // train on batch
}
```
### With Sequence Packing
```rust
use pmetal_data::{packing, Dataset}; // item names are illustrative

let dataset = Dataset::from_jsonl("train.jsonl")?;

// Pack sequences for efficient training
let packed = packing::pack(dataset, max_seq_len)?;
// Reports: "Packing: 1000 sequences → 850 batches, 99.5% efficiency"
```
### Chat Template Application
```rust
use pmetal_data::chat_templates::ChatTemplate; // path is illustrative

let template = ChatTemplate::from_tokenizer(&tokenizer)?;
let formatted = template.apply(&messages)?;
```
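Conceptually, applying a chat template means rendering role/content pairs into a single prompt string with model-specific delimiters. A minimal generic sketch (the `<|role|>` markers and `render_chat` helper are illustrative, not this crate's API; real templates come from the tokenizer config):

```rust
/// Render chat messages into one prompt string, appending a
/// generation prompt for the assistant turn.
fn render_chat(messages: &[(&str, &str)]) -> String {
    let mut out = String::new();
    for (role, content) in messages {
        // One delimited block per message: <|role|>\ncontent\n
        out.push_str(&format!("<|{role}|>\n{content}\n"));
    }
    out.push_str("<|assistant|>\n"); // cue the model to respond
    out
}

fn main() {
    let msgs = [("system", "You are helpful."), ("user", "Hi!")];
    print!("{}", render_chat(&msgs));
}
```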
## Dataset Format Examples
### ShareGPT
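A representative ShareGPT record, following the common convention of `from`/`value` keys inside `conversations` (field contents are illustrative):

```json
{"conversations": [
  {"from": "human", "value": "What is sequence packing?"},
  {"from": "gpt", "value": "Combining short sequences into one row to reduce padding."}
]}
```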
### Alpaca
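A representative Alpaca-style record with the usual `instruction`/`input`/`output` keys (contents are illustrative; `input` may be empty):

```json
{"instruction": "Summarize the text.", "input": "Sequence packing reduces padding waste.", "output": "Packing cuts padding."}
```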
### Messages
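A representative `messages` record in the widely used `role`/`content` chat shape (contents are illustrative):

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "Hello!"},
  {"role": "assistant", "content": "Hi! How can I help?"}
]}
```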
## Modules
| Module | Description |
|---|---|
| `dataset` | Dataset abstractions and loading |
| `dataloader` | Batching and iteration |
| `packing` | Sequence packing utilities |
| `chat_templates` | Conversation formatting |
| `tokenizer` | Tokenizer integration |
| `collator` | Batch collation |
## License
MIT OR Apache-2.0