dset

A Rust library for processing and managing dataset-related files, with a focus on machine learning datasets, captions, and safetensors files. Built on top of xio for efficient file operations.

Features

🔧 SafeTensors Processing

Extract and decode embedded metadata from SafeTensors files
Automatic JSON decoding of nested metadata fields
Support for special metadata fields
Memory-mapped file handling for efficient processing
Pretty-printed JSON output

📝 Caption File Handling

Multi-format support:
- Plain text captions
- JSON captions
- e621 JSON format support
- Automatic format detection
Caption file validation:
- Check for existence and content
- Handle empty and whitespace-only files
Tag extraction and probability filtering
Special character escaping (e.g., parentheses)
Conversion between formats
Batch processing capabilities
e621 tag processing with:
- Artist name formatting with prefix/suffix options
- Tag filtering for years, aspect ratios, etc.
- Optional underscore replacement (spaces vs underscores)
- Customizable rating conversions
- Custom caption format templates
Text processing utilities:
- String replacement with formatting options
- Special character normalization (smart quotes → standard quotes)
- Whitespace and newline normalization

🗃️ File Operations

File management:
- Rename files (remove image extensions)
- Check file existence
- Content validation
Batch processing capabilities
Efficient async I/O operations
Format conversions

🔢 JSON Processing

Format validation and pretty printing
Deep JSON string decoding
Nested JSON structure handling
Automatic type conversion
Support for None values
Probability-based tag filtering
e621 JSON post data extraction

🎯 Content Processing

Smart content splitting into tags and sentences
Tag probability threshold filtering (default: 0.2)
Special character escaping in tags
Sorting tags by probability
Batch file processing

⚡ Performance Features

Asynchronous operations using Tokio
Memory-mapped file handling
Parallel processing capabilities
Efficient string and JSON parsing
Optimized file I/O

🛡️ Error Handling

Comprehensive error context with anyhow
Detailed error messages
Safe error recovery
Proper resource cleanup

E621 Caption Processing

The library provides comprehensive support for processing e621 JSON post data into standardized caption files. This functionality is particularly useful for creating training datasets from e621 posts.

Configuration

The processing can be customized using E621Config:

use dset::{E621Config, Path, process_e621_json_file};
use std::collections::HashMap;
use anyhow::Result;

async fn process_with_config() -> Result<()> {
    // Create a custom configuration
    let mut custom_ratings = HashMap::new();
    custom_ratings.insert("s".to_string(), "sfw".to_string());
    custom_ratings.insert("q".to_string(), "maybe".to_string());
    custom_ratings.insert("e".to_string(), "nsfw".to_string());

    let config = E621Config::new()
        .with_filter_tags(false)  // Disable tag filtering
        .with_rating_conversions(Some(custom_ratings))  // Custom rating names
        .with_format(Some("Rating: {rating}\nArtists: {artists}\nTags: {general}".to_string()));  // Custom format

    process_e621_json_file(Path::new("e621_post.json"), Some(config)).await
}

Available Options

Tag Filtering (filter_tags: bool, default: true)
- When enabled, filters out noise tags
- Can be disabled to include all tags
Rating Conversions (rating_conversions: Option<HashMap<String, String>>)
- Default conversions:
  - "s" → "safe"
  - "q" → "questionable"
  - "e" → "explicit"
- Can be customized or disabled (set to None to use raw ratings)
Artist Formatting (new in 0.1.8)
- artist_prefix: Option<String> (default: Some("by "))
- artist_suffix: Option<String> (default: None)
- Customize how artist names are formatted
- Set both to None for raw artist names
- Examples:
  - Default: "by artist_name" → "by artist name"
  - Custom prefix: "drawn by artist_name" → "drawn by artist name"
  - Custom suffix: "artist_name (Artist)" → "artist name (Artist)"
  - Both: "art by artist_name (verified)" → "art by artist name (verified)"
  - None: "artist_name" → "artist name"
Format String (format: Option<String>)
- Default: "{rating}, {artists}, {characters}, {species}, {copyright}, {general}, {meta}"
- Available placeholders:
  - {rating} - The rating (after conversion)
  - {artists} - Artist tags (with configured formatting)
  - {characters} - Character tags
  - {species} - Species tags
  - {copyright} - Copyright tags
  - {general} - General tags
  - {meta} - Meta tags
- Each tag group is internally joined with ", "

Tag Processing

Artist Tags
- Configurable prefix (default: "by ")
- Optional suffix
- Underscores replaced with spaces
- "(artist)" suffix removed from source
- Examples:
  - Default: "artist_name (artist)" → "by artist name"
  - Custom: "artist_name" → "drawn by artist name (verified)"
  - Raw: "artist_name" → "artist name"
Character Tags
- Underscores replaced with spaces
- Original character names preserved
- Example: "character_name" → "character name"
Species Tags
- Included as-is with spaces
- Useful for dataset filtering
Copyright Tags
- Source material references preserved
- Underscores replaced with spaces
General Tags
- Common descriptive tags
- Underscores replaced with spaces
- Filtered to remove noise
Meta Tags
- Selected important meta information
- Art medium and style information preserved

Tag Filtering

Tag filtering is enabled by default but can be disabled. When enabled, it automatically filters out:

Year tags (e.g., "2023")
Aspect ratio tags (e.g., "16:9")
Conditional DNP tags
Empty or whitespace-only tags

To disable filtering, pass Some(false) as the filter_tags parameter.

Caption File Generation

Creates .txt files from e621 JSON posts
Filename derived from post's image MD5
Format: [rating], [artist tags], [character tags], [other tags]
Skips generation if no valid tags remain after filtering (when filtering is enabled)

Example Usage

use dset::{E621Config, Path, process_e621_json_file};
use anyhow::Result;

async fn process_e621() -> Result<()> {
    // Process with default settings
    process_e621_json_file(Path::new("e621_post.json"), None).await?;
    
    // Process with custom format
    let config = E621Config::new()
        .with_format(Some("{rating}\nBy: {artists}\nTags: {general}".to_string()));
    process_e621_json_file(Path::new("e621_post.json"), Some(config)).await?;
    
    // Process with raw ratings (no conversion)
    let config = E621Config::new()
        .with_rating_conversions(None);
    process_e621_json_file(Path::new("e621_post.json"), Some(config)).await?;
    
    Ok(())
}

Example Outputs

With default settings:

safe, by artist name, character name, species, tag1, tag2

With custom format:

Rating: safe
Artists: by artist name
Tags: tag1, tag2

With raw ratings:

s, by artist name, character name, species, tag1, tag2

Batch Processing Example

use dset::{E621Config, Path, process_e621_json_file};
use anyhow::Result;
use tokio::fs;

async fn batch_process_e621() -> Result<()> {
    // Optional: customize processing for all files
    let config = E621Config::new()
        .with_filter_tags(false)
        .with_format(Some("{rating}\n{artists}\n{general}".to_string()));
    
    let entries = fs::read_dir("e621_posts").await?;
    
    for entry in entries {
        if let Ok(entry) = entry {
            let path = entry.path();
            if path.extension().map_or(false, |ext| ext == "json") {
                process_e621_json_file(&path, Some(config.clone())).await?;
            }
        }
    }
    Ok(())
}

Installation

cargo add dset

Logging Configuration

The library uses the log crate for logging. To enable logging in your application:

Add a logging implementation like env_logger to your project:
```
cargo add env_logger
```

Initialize the logger in your application:

use env_logger;

fn main() {
    env_logger::init();
    // Your code here...
}

Set the log level using the RUST_LOG environment variable:

export RUST_LOG=info    # Show info and error messages
export RUST_LOG=debug   # Show debug, info, and error messages
export RUST_LOG=trace   # Show all log messages

The library uses different log levels:

error: For unrecoverable errors
warn: For recoverable errors or unexpected conditions
info: For important operations and successful processing
debug: For detailed processing information
trace: For very detailed debugging information

Usage Examples

SafeTensors Metadata Extraction

use dset::{Path, process_safetensors_file};
use anyhow::Result;

async fn extract_metadata(path: &str) -> Result<()> {
    // Extracts metadata and saves it as a JSON file
    process_safetensors_file(Path::new(path)).await?;
    
    // The output will be saved as "{path}.json"
    Ok(())
}

Caption File Processing

use dset::{
    Path,
    process_caption_file,
    process_json_to_caption,
    caption::caption_file_exists_and_not_empty
};
use anyhow::Result;

async fn handle_captions() -> Result<()> {
    let path = Path::new("image1.txt");
    
    // Check if caption file exists and has content
    if caption_file_exists_and_not_empty(&path).await {
        // Process the caption file (auto-detects format)
        process_caption_file(&path).await?;
    }
    
    // Convert JSON caption to text format
    process_json_to_caption(Path::new("image2.json")).await?;
    
    Ok(())
}

File Operations

use dset::{Path, rename_file_without_image_extension};
use std::io;

async fn handle_files() -> io::Result<()> {
    // Remove intermediate image extensions from files
    let path = Path::new("image.jpg.toml");
    rename_file_without_image_extension(&path).await?;  // Will rename to "image.toml"
    
    // Won't modify files that are actually images
    let img = Path::new("photo.jpg");
    rename_file_without_image_extension(&img).await?;  // Will remain "photo.jpg"
    
    Ok(())
}

JSON Processing and Formatting

use dset::{Path, format_json_file, process_json_file};
use serde_json::Value;
use anyhow::Result;

async fn handle_json() -> Result<()> {
    // Format a JSON file
    format_json_file(Path::new("data.json").to_path_buf()).await?;
    
    // Process JSON with custom handler
    process_json_file(Path::new("data.json"), |json: &Value| async {
        println!("Processing: {}", json);
        Ok(())
    }).await?;
    
    Ok(())
}

Content Splitting

use dset::split_content;
use log::info;

fn process_tags_and_text() {
    let content = "tag1, tag2, tag3., This is the main text.";
    let (tags, sentences) = split_content(content);
    
    info!("Tags: {:?}", tags);  // ["tag1", "tag2", "tag3"]
    info!("Text: {}", sentences);  // "This is the main text."
}

Text Processing

use dset::caption::{format_text_content, replace_string, replace_special_chars};
use std::path::{Path, PathBuf};
use anyhow::Result;
use log::info;

async fn example() -> Result<()> {
    // Format text by normalizing whitespace
    let formatted = format_text_content("  Multiple    spaces   \n\n  and newlines  ")?;
    assert_eq!(formatted, "Multiple spaces and newlines");
    
    // Replace text in a file
    replace_string(Path::new("caption.txt"), "old text", "new text").await?;
    
    // Replace special characters in a file (smart quotes, etc.)
    replace_special_chars(PathBuf::from("document.txt")).await?;
    
    Ok(())
}

Error Handling

The library uses anyhow for comprehensive error handling:

use dset::Path;
use anyhow::{Context, Result};
use log::info;

async fn example() -> Result<()> {
    process_safetensors_file(Path::new("model.safetensors"))
        .await
        .context("Failed to process safetensors file")?;
    info!("Successfully processed safetensors file");
    Ok(())
}

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. When contributing:

Ensure all tests pass
Add tests for new features
Update documentation
Follow the existing code style
Add error handling where appropriate

License

This project is licensed under the MIT License.

dset 0.1.10