dset
A Rust library for processing and managing dataset-related files, with a focus on machine learning datasets, captions, and safetensors files. Built on top of xio for efficient file operations.
Features
- 🔧 SafeTensors file processing and metadata extraction
- 📝 Caption file handling and conversion
- 🔄 JSON processing utilities
- 🎯 Smart content splitting and tag processing
- ⚡ Asynchronous operations using Tokio
- 🛡️ Robust error handling with anyhow
- 🖼️ Image-related utilities (via imx integration)
Installation
Add this to your Cargo.toml:
[dependencies]
dset = "0.1.1"
Usage Examples
Processing SafeTensors Files
Extract metadata from SafeTensors files:
use ;
use Result;
async
Working with Caption Files
Process and convert caption files:
use ;
use Result;
async
JSON Processing
Convert JSON files to caption format:
use ;
use io;
async
Content Splitting
Split content into tags and sentences:
use split_content;
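To illustrate what splitting content into tags and sentences can look like, here is a minimal std-only sketch. It is not the crate's `split_content` implementation: the function names `split_tags_and_sentences` and `parse_tags` are hypothetical, and the `"., "` delimiter between the tag list and the sentence text is an assumption made for this example.

```rust
/// Hypothetical helper: splits caption text into comma-separated tags and
/// trailing sentence text.
///
/// Assumption: the tag list ends at the first "., " sequence. If that
/// sequence is absent, the whole input is treated as tags.
fn split_tags_and_sentences(content: &str) -> (Vec<String>, String) {
    match content.split_once("., ") {
        Some((tags, sentences)) => (parse_tags(tags), sentences.trim().to_string()),
        None => (parse_tags(content), String::new()),
    }
}

/// Splits on commas, trims whitespace, and drops empty entries.
fn parse_tags(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|tag| !tag.is_empty())
        .map(String::from)
        .collect()
}
```

Returning the tags as a `Vec<String>` and the sentences as a single `String` keeps the two parts independent, so either can be filtered or rewritten before the caption is reassembled.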
Advanced Features
SafeTensors Metadata Processing
- Extracts embedded metadata from SafeTensors files
- Automatically decodes JSON-encoded strings in metadata
- Handles special fields like:
  - ss_bucket_info
  - ss_tag_frequency
  - ss_dataset_dirs
  - ss_network_args
  - resize_params
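For background on where this metadata lives: a SafeTensors file starts with an 8-byte little-endian length, followed by a JSON header of that many bytes. Free-form metadata sits under the `__metadata__` key as a string-to-string map, which is why fields such as ss_network_args hold JSON encoded as a string and need a second decoding pass. The sketch below only locates the raw header using the standard library; `safetensors_header` and `build_example` are hypothetical names and are not part of dset's API.

```rust
/// Returns the raw JSON header of a SafeTensors buffer.
///
/// Layout: bytes 0..8 hold the header length N as a little-endian u64,
/// and bytes 8..8+N hold the UTF-8 JSON header.
fn safetensors_header(bytes: &[u8]) -> Result<&str, String> {
    let prefix = bytes
        .get(..8)
        .ok_or("buffer is shorter than the 8-byte length prefix")?;
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(prefix);
    let header_len = u64::from_le_bytes(len_bytes) as usize;

    // Guard against a declared length that runs past the buffer.
    let end = 8usize
        .checked_add(header_len)
        .ok_or("header length overflows usize")?;
    let header = bytes
        .get(8..end)
        .ok_or("buffer is shorter than the declared header length")?;

    std::str::from_utf8(header).map_err(|e| format!("header is not valid UTF-8: {}", e))
}

/// Builds a header-only buffer for experimenting (no tensor data follows).
fn build_example(header_json: &str) -> Vec<u8> {
    let mut buf = (header_json.len() as u64).to_le_bytes().to_vec();
    buf.extend_from_slice(header_json.as_bytes());
    buf
}
```

The returned string can then be handed to a JSON parser; values under `__metadata__` that are themselves JSON text would be parsed once more.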
Caption Processing
- Supports both JSON and plain text caption formats
- Automatic format detection and parsing
- Tag probability filtering
- Special character escaping
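Tag probability filtering and character escaping can be pictured with the following std-only sketch. The names `tags_to_caption` and `escape_parens` are hypothetical, and both the "keep tags at or above a threshold" rule and the choice to escape parentheses are assumptions for illustration; the crate's actual rules may differ.

```rust
/// Hypothetical helper: keeps tags whose probability is at least `threshold`,
/// escapes parentheses, and joins the result with ", ".
fn tags_to_caption(tags: &[(&str, f32)], threshold: f32) -> String {
    tags.iter()
        .filter(|entry| entry.1 >= threshold)
        .map(|entry| escape_parens(entry.0))
        .collect::<Vec<_>>()
        .join(", ")
}

/// Assumption: parentheses are the characters that need escaping, since
/// some prompt syntaxes give them special meaning.
fn escape_parens(tag: &str) -> String {
    tag.replace('(', "\\(").replace(')', "\\)")
}
```

Filtering before escaping avoids doing string work on tags that are about to be discarded.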
Image Utilities (via imx)
- Image file detection
- Caption file validation
- Letterbox removal
- File extension handling
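As a rough picture of image file detection and extension handling, the sketch below uses only `std::path`. It does not call imx: `is_image_file`, `caption_path_for`, the extension list, and the convention that a caption sits next to its image with a `.txt` extension are all assumptions made for this example.

```rust
use std::path::{Path, PathBuf};

/// Assumed set of image extensions; the real list may be longer.
const IMAGE_EXTENSIONS: [&str; 4] = ["jpg", "jpeg", "png", "webp"];

/// Hypothetical helper: true if the path has a known image extension
/// (compared case-insensitively).
fn is_image_file(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| {
            IMAGE_EXTENSIONS
                .iter()
                .any(|known| known.eq_ignore_ascii_case(ext))
        })
        .unwrap_or(false)
}

/// Hypothetical helper: the caption path that would sit next to an image,
/// assuming captions share the file stem and use `.txt`.
fn caption_path_for(image: &Path) -> PathBuf {
    image.with_extension("txt")
}
```

Checking the extension only inspects the file name; it says nothing about whether the file's contents are a valid image.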
Error Handling
All operations return Result types with detailed error information:
- anyhow::Result for rich error context
- io::Result for basic file operations
- Comprehensive error messages with context
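The idea of attaching context to an error can be shown with the standard library alone. In the sketch below, `read_caption` is a hypothetical function that wraps an `io::Error` with the path that failed while preserving the original error kind. With anyhow, the `Context` trait's `with_context` method serves the same purpose more concisely.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: reads a caption file and, on failure, returns an
/// error whose message names the offending path.
fn read_caption(path: &Path) -> io::Result<String> {
    fs::read_to_string(path).map_err(|err| {
        // Keep the original kind so callers can still match on it.
        io::Error::new(
            err.kind(),
            format!("failed to read caption {}: {}", path.display(), err),
        )
    })
}
```

Callers can still branch on `err.kind()` (for example `io::ErrorKind::NotFound`) while the message carries enough detail for logs.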
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License.