dset
A Rust library for processing and managing dataset-related files, with a focus on machine learning datasets, captions, and safetensors files. Built on top of xio for efficient file operations.
Features
🔧 SafeTensors Processing
- Extract and decode embedded metadata from SafeTensors files
- Automatic JSON decoding of nested metadata fields
- Support for special metadata fields
- Memory-mapped file handling for efficient processing
- Pretty-printed JSON output
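For orientation, a SafeTensors file starts with an 8-byte little-endian length, followed by that many bytes of JSON header; free-form metadata sits under the header's `__metadata__` key. The sketch below is not dset's API. `header_json` is a hypothetical helper, written against the standard library only, that slices the JSON header out of a byte buffer (which could just as well be a memory-mapped file).

```rust
/// Returns the JSON header of a SafeTensors buffer as a string slice.
/// Hypothetical helper for illustration; not part of dset's API.
fn header_json(bytes: &[u8]) -> Result<&str, String> {
    // The first 8 bytes hold the header length as a little-endian u64.
    if bytes.len() < 8 {
        return Err("buffer too short for the 8-byte length prefix".to_string());
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&bytes[..8]);
    let header_len = u64::from_le_bytes(len_bytes);

    // Reject headers that claim more bytes than the buffer contains.
    let available = (bytes.len() - 8) as u64;
    if header_len > available {
        return Err(format!(
            "header claims {} bytes but only {} remain",
            header_len, available
        ));
    }

    // The header occupies the bytes right after the length prefix.
    let end = 8 + header_len as usize;
    std::str::from_utf8(&bytes[8..end])
        .map_err(|e| format!("header is not valid UTF-8: {}", e))
}
```

The returned slice can then be handed to a JSON parser to read `__metadata__` and decode any nested JSON strings it contains.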
📝 Caption File Handling
- Multi-format support:
  - Plain text captions
  - JSON captions
  - Automatic format detection
- Caption file validation:
  - Check for existence and content
  - Handle empty and whitespace-only files
- Tag extraction and probability filtering
- Special character escaping (e.g., parentheses)
- Conversion between formats
- Batch processing capabilities
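As a rough illustration of format detection and empty-file handling, the sketch below classifies caption text with a simple heuristic. `CaptionKind` and `classify_caption` are hypothetical names, and the rule used here (whitespace-only means empty, a leading `{` or `[` means JSON) is an assumption; dset's actual detection may differ, for example by attempting a full JSON parse.

```rust
/// Coarse classification of a caption file's contents.
/// Hypothetical type for illustration; not part of dset's API.
#[derive(Debug, PartialEq)]
enum CaptionKind {
    /// Nothing but whitespace (or nothing at all).
    Empty,
    /// Looks like a JSON object or array.
    Json,
    /// Anything else is treated as a plain text caption.
    PlainText,
}

/// Classifies caption text using a cheap leading-character heuristic.
fn classify_caption(content: &str) -> CaptionKind {
    let trimmed = content.trim();
    if trimmed.is_empty() {
        CaptionKind::Empty
    } else if trimmed.starts_with('{') || trimmed.starts_with('[') {
        CaptionKind::Json
    } else {
        CaptionKind::PlainText
    }
}
```

A heuristic like this is cheap enough to run over a whole directory before deciding which files need a real parse.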
🗃️ File Operations
- File management:
  - Rename files (remove image extensions)
  - Check file existence
  - Content validation
- Batch processing capabilities
- Efficient async I/O operations
- Format conversions
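To make "remove image extensions" concrete, the sketch below computes the target name for a file such as `photo.jpg.txt`, which would become `photo.txt`. This reading of the feature, the helper name `without_image_extension`, and the extension list are all assumptions for illustration; they are not taken from dset's API.

```rust
use std::path::{Path, PathBuf};

/// Image extensions to strip (an assumed list, matched case-insensitively).
const IMAGE_EXTENSIONS: [&str; 4] = ["jpg", "jpeg", "png", "webp"];

/// Returns the path with an inner image extension removed, or `None`
/// when the file name has no inner image extension.
/// Hypothetical helper for illustration; not part of dset's API.
fn without_image_extension(path: &Path) -> Option<PathBuf> {
    let ext = path.extension()?.to_str()?; // outer extension, e.g. "txt"
    let stem = Path::new(path.file_stem()?); // e.g. "photo.jpg"
    let inner = stem.extension()?.to_str()?; // inner extension, e.g. "jpg"
    if !IMAGE_EXTENSIONS.iter().any(|e| inner.eq_ignore_ascii_case(e)) {
        return None;
    }

    // Rebuild the file name as "<base>.<outer extension>".
    let mut name = stem.file_stem()?.to_os_string();
    name.push(".");
    name.push(ext);
    Some(path.with_file_name(name))
}
```

The returned path can then be passed to `std::fs::rename`, or to `tokio::fs::rename` in async code, to perform the actual rename.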
🔢 JSON Processing
- Format validation and pretty printing
- Deep JSON string decoding
- Nested JSON structure handling
- Automatic type conversion
- Support for None values
- Probability-based tag filtering
🎯 Content Processing
- Smart content splitting into tags and sentences
- Tag probability threshold filtering (default: 0.2)
- Special character escaping in tags
- Sorting tags by probability
- Batch file processing
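The filtering, sorting, and escaping steps listed above can be sketched with the standard library alone. `filter_tags` is a hypothetical function, not dset's API, and two details are assumptions: tags at exactly the threshold are kept, and the result is ordered from highest to lowest probability.

```rust
/// Keeps tags whose probability is at least `threshold`, orders them from
/// most to least probable, and backslash-escapes parentheses.
/// Hypothetical helper for illustration; not part of dset's API.
fn filter_tags(tags: &[(&str, f32)], threshold: f32) -> Vec<String> {
    // Drop tags below the threshold (0.2 is the documented default).
    let mut kept: Vec<(&str, f32)> = tags
        .iter()
        .copied()
        .filter(|&(_, p)| p >= threshold)
        .collect();

    // Stable sort by descending probability; total_cmp gives a total order on f32.
    kept.sort_by(|a, b| b.1.total_cmp(&a.1));

    // Escape parentheses so they are read literally by downstream prompt parsers.
    kept.into_iter()
        .map(|(tag, _)| tag.replace('(', "\\(").replace(')', "\\)"))
        .collect()
}
```

Calling `filter_tags(&tags, 0.2)` applies the default threshold; joining the result with `", "` yields a single caption line.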
⚡ Performance Features
- Asynchronous operations using Tokio
- Memory-mapped file handling
- Parallel processing capabilities
- Efficient string and JSON parsing
- Optimized file I/O
🛡️ Error Handling
- Comprehensive error context with anyhow
- Detailed error messages
- Safe error recovery
- Proper resource cleanup
Installation
Add this to your Cargo.toml:
[dependencies]
dset = "0.1.6"
Usage Examples
SafeTensors Metadata Extraction
use ;
use Result;
async
Caption File Processing
use ;
use Result;
async
File Operations
use ;
use io;
async
JSON Processing and Formatting
use ;
use Value;
use Result;
async
Content Splitting
use dset::split_content;
Error Handling
The library uses anyhow for comprehensive error handling:
use Path;
use ;
async
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. When contributing:
- Ensure all tests pass
- Add tests for new features
- Update documentation
- Follow the existing code style
- Add error handling where appropriate
License
This project is licensed under the MIT License.