dset 0.1.2

dset

A Rust library for processing and managing dataset-related files, with a focus on machine learning datasets, captions, and safetensors files. Built on top of xio for efficient file operations.

Features

  • 🔧 SafeTensors file processing and metadata extraction
  • 📝 Caption file handling and conversion
  • 🔄 JSON processing utilities
  • 🎯 Smart content splitting and tag processing
  • ⚡ Asynchronous operations using Tokio
  • 🛡️ Robust error handling with anyhow
  • 🖼️ Image-related utilities (via imx integration)

Installation

Add this to your Cargo.toml:

[dependencies]
dset = "0.1.2"

Usage Examples

Processing SafeTensors Files

Extract metadata from SafeTensors files:

use dset::{Path, process_safetensors_file};
use anyhow::Result;

async fn extract_model_metadata(path: &str) -> Result<()> {
    process_safetensors_file(Path::new(path)).await
}

Working with Caption Files

Process and convert caption files:

use dset::{Path, process_caption_file};
use anyhow::Result;

async fn handle_caption(path: &str) -> Result<()> {
    process_caption_file(Path::new(path)).await
}

JSON Processing

Convert JSON files to caption format:

use dset::{Path, process_json_to_caption};
use std::io;

async fn convert_json_caption(path: &str) -> io::Result<()> {
    process_json_to_caption(Path::new(path)).await
}

Content Splitting

Split content into tags and sentences:

use dset::split_content;

fn process_content(content: &str) {
    let (tags, sentences) = split_content(content);
    println!("Tags: {:?}", tags);
    println!("Sentences: {}", sentences);
}
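
To make the idea concrete, here is a minimal, standard-library-only sketch of what such a split could look like. It is not dset's implementation: the function name `split_content_sketch` is hypothetical, and it assumes that tags are comma-separated and that the first `., ` marks the boundary between tags and sentences. The crate's real rules may differ.

```rust
/// Illustrative only: splits `content` at the first "., " into
/// comma-separated tags and trailing free-form sentences.
/// The "., " separator is an assumption, not taken from dset's source.
fn split_content_sketch(content: &str) -> (Vec<String>, String) {
    // Everything before the separator is treated as the tag list;
    // without a separator, the whole input is treated as tags.
    let (tag_part, sentence_part) = match content.split_once("., ") {
        Some((tags, rest)) => (tags, rest),
        None => (content, ""),
    };

    // Trim whitespace around each tag and drop empty entries.
    let tags = tag_part
        .split(',')
        .map(str::trim)
        .filter(|tag| !tag.is_empty())
        .map(String::from)
        .collect();

    (tags, sentence_part.trim().to_string())
}
```

With this sketch, `"cat, outdoors, sunny., A cat sits in the sun."` yields the tags `cat`, `outdoors`, `sunny` and the sentence `A cat sits in the sun.`.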

Advanced Features

SafeTensors Metadata Processing

  • Extracts embedded metadata from SafeTensors files
  • Automatically decodes JSON-encoded strings in metadata
  • Handles special fields like:
    • ss_bucket_info
    • ss_tag_frequency
    • ss_dataset_dirs
    • ss_network_args
    • resize_params
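
For background, a SafeTensors file starts with an 8-byte little-endian length, followed by that many bytes of UTF-8 JSON. The optional `__metadata__` object in that JSON maps strings to strings, which is why structured fields such as `ss_tag_frequency` arrive as JSON-encoded strings that need a second decoding pass. The sketch below only locates the header using the standard library; `safetensors_header_json` is a hypothetical helper, not part of dset, and the actual decoding would be left to a JSON parser.

```rust
use std::convert::TryFrom;

/// Returns the JSON header of an in-memory SafeTensors buffer.
/// Layout: 8-byte little-endian u64 header length N, then N bytes of JSON.
/// Hypothetical helper for illustration; not dset's API.
fn safetensors_header_json(bytes: &[u8]) -> Option<&str> {
    // Read the 8-byte length prefix; bail out if the buffer is too short.
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(bytes.get(..8)?);
    let header_len = usize::try_from(u64::from_le_bytes(len_bytes)).ok()?;

    // Guard against overflow and against a length that exceeds the buffer.
    let end = 8usize.checked_add(header_len)?;
    let header = bytes.get(8..end)?;

    // The header must be valid UTF-8 to be valid JSON.
    std::str::from_utf8(header).ok()
}
```

Returning `Option` keeps the sketch short; a real implementation would likely report which check failed.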

Caption Processing

  • Supports both JSON and plain text caption formats
  • Automatic format detection and parsing
  • Tag probability filtering
  • Special character escaping
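
The bullets above can be sketched roughly as follows, using only the standard library. Everything here is an assumption made for illustration: the names `CaptionFormat`, `detect_format`, and `filter_and_escape` are hypothetical, format detection is reduced to checking the first non-whitespace character, and escaping is limited to parentheses. dset's actual behavior and thresholds may differ.

```rust
/// Hypothetical caption formats, for illustration only.
#[derive(Debug, PartialEq)]
enum CaptionFormat {
    Json,
    PlainText,
}

/// Guesses the format from the first non-whitespace character.
/// A real detector would attempt a full JSON parse instead.
fn detect_format(content: &str) -> CaptionFormat {
    let trimmed = content.trim_start();
    if trimmed.starts_with('{') || trimmed.starts_with('[') {
        CaptionFormat::Json
    } else {
        CaptionFormat::PlainText
    }
}

/// Keeps tags whose probability is at least `threshold` and
/// backslash-escapes parentheses in the surviving tags.
fn filter_and_escape(tags: &[(&str, f32)], threshold: f32) -> Vec<String> {
    tags.iter()
        .filter(|(_, probability)| *probability >= threshold)
        .map(|(tag, _)| tag.replace('(', "\\(").replace(')', "\\)"))
        .collect()
}
```

For example, with a threshold of `0.2`, the tags `("cat", 0.95)`, `("artist_(style)", 0.40)`, and `("dog", 0.05)` reduce to `cat` and `artist_\(style\)`.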

Image Utilities (via imx)

  • Image file detection
  • Caption file validation
  • Letterbox removal
  • File extension handling
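
As a rough picture of image detection and extension handling, the sketch below uses only `std::path`. It does not call imx: `is_image_file`, `caption_path_for`, and the extension list are hypothetical, and it assumes a caption lives next to its image with a `.txt` extension.

```rust
use std::path::{Path, PathBuf};

/// Extensions treated as images in this sketch (an assumed, partial list).
const IMAGE_EXTENSIONS: [&str; 4] = ["jpg", "jpeg", "png", "webp"];

/// Returns true if the path has a known image extension, ignoring case.
fn is_image_file(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| {
            IMAGE_EXTENSIONS
                .iter()
                .any(|known| ext.eq_ignore_ascii_case(known))
        })
        .unwrap_or(false)
}

/// Derives the sibling caption path by swapping the extension for `.txt`.
fn caption_path_for(image: &Path) -> PathBuf {
    image.with_extension("txt")
}
```

Matching on the extension alone is cheap but can be fooled by misnamed files; checking the file's leading bytes would be the stricter alternative.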

Error Handling

Fallible operations return Result types with detailed error information:

  • anyhow::Result for rich error context
  • io::Result for basic file operations
  • Comprehensive error messages with context
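
To illustrate what "context" means here, the following standard-library-only sketch attaches the offending path to an I/O error while preserving its kind. The helper `read_caption` is hypothetical and not part of dset; with anyhow, the `Context` trait's `with_context` method achieves the same effect more concisely.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads a caption file and, on failure, returns an error that names the path.
/// Hypothetical helper for illustration; not dset's API.
fn read_caption(path: &Path) -> io::Result<String> {
    fs::read_to_string(path).map_err(|err| {
        // Keep the original error kind so callers can still match on it.
        io::Error::new(
            err.kind(),
            format!("failed to read {}: {}", path.display(), err),
        )
    })
}
```

Callers can then propagate the error with `?` and still get a message that says which file could not be read.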

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License.