dset
A Rust library for processing and managing dataset-related files, with a focus on machine learning datasets, captions, and safetensors files. Built on top of xio for efficient file operations.
Features
🔧 SafeTensors Processing
- Extract and decode embedded metadata from SafeTensors files
- Automatic JSON decoding of nested metadata fields
- Support for special metadata fields
- Memory-mapped file handling for efficient processing
- Pretty-printed JSON output
📝 Caption File Handling
- Multi-format support:
  - Plain text captions
  - JSON captions
  - e621 JSON format support
  - Automatic format detection
- Caption file validation:
  - Check for existence and content
  - Handle empty and whitespace-only files
- Tag extraction and probability filtering
- Special character escaping (e.g., parentheses)
- Conversion between formats
- Batch processing capabilities
- e621 tag processing with:
  - Artist name formatting with prefix/suffix options
  - Tag filtering for years, aspect ratios, etc.
  - Optional underscore replacement (spaces vs underscores)
  - Customizable rating conversions
  - Custom caption format templates
- Text processing utilities:
  - String replacement with formatting options
  - Special character normalization (smart quotes → standard quotes)
  - Whitespace and newline normalization
🗃️ File Operations
- File management:
  - Rename files (remove image extensions)
  - Check file existence
  - Content validation
- Batch processing capabilities
- Efficient async I/O operations
- Format conversions
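As an illustration of the "remove image extensions" item, here is a minimal sketch that computes the target name for a rename. It assumes the feature targets names such as photo.jpg.txt, where an image extension sits in front of the real one. The function name without_image_extension and the extension list are hypothetical and are not taken from the library.

```rust
use std::path::{Path, PathBuf};

/// Extensions treated as image extensions. This list is an assumption.
const IMAGE_EXTENSIONS: [&str; 5] = ["jpg", "jpeg", "png", "webp", "gif"];

/// Returns the path with an inner image extension removed, for example
/// "photo.jpg.txt" becomes "photo.txt". Returns None when the name has no
/// inner image extension, so the caller can skip the rename.
pub fn without_image_extension(path: &Path) -> Option<PathBuf> {
    // Outer extension, e.g. "txt".
    let ext = path.extension()?.to_str()?;
    // Remaining stem, e.g. "photo.jpg".
    let stem = Path::new(path.file_stem()?);
    // Inner extension, e.g. "jpg", compared case-insensitively.
    let inner = stem.extension()?.to_str()?.to_ascii_lowercase();
    if !IMAGE_EXTENSIONS.iter().any(|e| *e == inner) {
        return None;
    }
    // Rebuild "<base>.<ext>" in the same directory.
    let mut name = stem.file_stem()?.to_os_string();
    name.push(".");
    name.push(ext);
    Some(path.with_file_name(name))
}
```

The function only computes the new path; the actual rename (for example with an async file API) is left to the caller.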
🔢 JSON Processing
- Format validation and pretty printing
- Deep JSON string decoding
- Nested JSON structure handling
- Automatic type conversion
- Support for None values
- Probability-based tag filtering
- e621 JSON post data extraction
🎯 Content Processing
- Smart content splitting into tags and sentences
- Tag probability threshold filtering (default: 0.2)
- Special character escaping in tags
- Sorting tags by probability
- Batch file processing
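The filtering, sorting, and escaping steps listed above can be sketched with the standard library alone. This is not the library's implementation: the function name filter_tags_by_probability is hypothetical, the input is modelled as (tag, probability) pairs rather than JSON, and treating the threshold as inclusive is an assumption.

```rust
/// Keeps tags whose probability is at least `threshold` (inclusive, by
/// assumption), orders them from most to least likely, and escapes
/// parentheses with a backslash.
pub fn filter_tags_by_probability(tags: &[(&str, f32)], threshold: f32) -> Vec<String> {
    let mut kept: Vec<(&str, f32)> = tags
        .iter()
        .copied()
        .filter(|(_, p)| *p >= threshold)
        .collect();
    // Descending order; total_cmp provides a total order over floats.
    kept.sort_by(|a, b| b.1.total_cmp(&a.1));
    kept.into_iter()
        .map(|(tag, _)| tag.replace('(', "\\(").replace(')', "\\)"))
        .collect()
}
```

Calling it with the documented default threshold of 0.2 drops low-confidence tags before the caption is written.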
⚡ Performance Features
- Asynchronous operations using Tokio
- Memory-mapped file handling
- Parallel processing capabilities
- Efficient string and JSON parsing
- Optimized file I/O
🛡️ Error Handling
- Comprehensive error context with anyhow
- Detailed error messages
- Safe error recovery
- Proper resource cleanup
E621 Caption Processing
The library provides comprehensive support for processing e621 JSON post data into standardized caption files. This functionality is particularly useful for creating training datasets from e621 posts.
Configuration
The processing can be customized using E621Config:
use ;
use HashMap;
use Result;
async
Available Options
- Tag Filtering (filter_tags: bool, default: true)
  - When enabled, filters out noise tags
  - Can be disabled to include all tags
- Rating Conversions (rating_conversions: Option<HashMap<String, String>>)
  - Default conversions:
    - "s" → "safe"
    - "q" → "questionable"
    - "e" → "explicit"
  - Can be customized or disabled (set to None to use raw ratings)
- Artist Formatting (new in 0.1.8)
  - artist_prefix: Option<String> (default: Some("by "))
  - artist_suffix: Option<String> (default: None)
  - Customize how artist names are formatted
  - Set both to None for raw artist names
  - Examples:
    - Default: "by artist_name" → "by artist name"
    - Custom prefix: "drawn by artist_name" → "drawn by artist name"
    - Custom suffix: "artist_name (Artist)" → "artist name (Artist)"
    - Both: "art by artist_name (verified)" → "art by artist name (verified)"
    - None: "artist_name" → "artist name"
- Format String (format: Option<String>)
  - Default: "{rating}, {artists}, {characters}, {species}, {copyright}, {general}, {meta}"
  - Available placeholders:
    - {rating} - The rating (after conversion)
    - {artists} - Artist tags (with configured formatting)
    - {characters} - Character tags
    - {species} - Species tags
    - {copyright} - Copyright tags
    - {general} - General tags
    - {meta} - Meta tags
  - Each tag group is internally joined with ", "
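To make the options and their defaults concrete, the sketch below mirrors them in a stand-alone struct. E621ConfigSketch and convert_rating are hypothetical names used only for illustration; the field names and defaults follow the list above, but the real E621Config may be constructed differently. Returning an unknown rating unchanged is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical mirror of the documented options, not the library's type.
#[derive(Debug, Clone)]
pub struct E621ConfigSketch {
    pub filter_tags: bool,
    pub rating_conversions: Option<HashMap<String, String>>,
    pub artist_prefix: Option<String>,
    pub artist_suffix: Option<String>,
    pub format: Option<String>,
}

impl Default for E621ConfigSketch {
    fn default() -> Self {
        // Default rating table as documented: s, q, e.
        let conversions = [("s", "safe"), ("q", "questionable"), ("e", "explicit")]
            .into_iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        Self {
            filter_tags: true,
            rating_conversions: Some(conversions),
            artist_prefix: Some("by ".to_string()),
            artist_suffix: None,
            // None stands for "use the default format string" in this sketch.
            format: None,
        }
    }
}

impl E621ConfigSketch {
    /// Maps a raw rating through the table. The raw value is returned when
    /// conversions are disabled (None) or the key is missing (assumption).
    pub fn convert_rating(&self, raw: &str) -> String {
        self.rating_conversions
            .as_ref()
            .and_then(|table| table.get(raw))
            .cloned()
            .unwrap_or_else(|| raw.to_string())
    }
}
```

Struct update syntax, such as E621ConfigSketch { rating_conversions: None, ..Default::default() }, gives raw ratings while keeping the other defaults.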
Tag Processing
- Artist Tags
  - Configurable prefix (default: "by ")
  - Optional suffix
  - Underscores replaced with spaces
  - "(artist)" suffix removed from source
  - Examples:
    - Default: "artist_name (artist)" → "by artist name"
    - Custom: "artist_name" → "drawn by artist name (verified)"
    - Raw: "artist_name" → "artist name"
- Character Tags
  - Underscores replaced with spaces
  - Original character names preserved
  - Example: "character_name" → "character name"
- Species Tags
  - Included as-is with spaces
  - Useful for dataset filtering
- Copyright Tags
  - Source material references preserved
  - Underscores replaced with spaces
- General Tags
  - Common descriptive tags
  - Underscores replaced with spaces
  - Filtered to remove noise
- Meta Tags
  - Selected important meta information
  - Art medium and style information preserved
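The artist tag rules above can be expressed as a small pure function. This is a sketch under stated assumptions: format_artist is a hypothetical name, and the way the "(artist)" marker is detected (a plain trailing-suffix check) is a guess at one reasonable approach rather than the library's own logic.

```rust
/// Formats one artist tag: drops a trailing "(artist)" marker, replaces
/// underscores with spaces, then applies the optional prefix and suffix.
pub fn format_artist(tag: &str, prefix: Option<&str>, suffix: Option<&str>) -> String {
    let trimmed = tag.trim();
    // Remove the marker, then any separator left in front of it.
    let name = trimmed
        .strip_suffix("(artist)")
        .unwrap_or(trimmed)
        .trim_end_matches(|c: char| c == '_' || c == ' ');
    let name = name.replace('_', " ");
    format!("{}{}{}", prefix.unwrap_or(""), name, suffix.unwrap_or(""))
}
```

With Some("by ") and no suffix this reproduces the default example, and passing None for both yields the raw artist name.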
Tag Filtering
Tag filtering is enabled by default but can be disabled. When enabled, it automatically filters out:
- Year tags (e.g., "2023")
- Aspect ratio tags (e.g., "16:9")
- Conditional DNP tags
- Empty or whitespace-only tags
To disable filtering, pass Some(false) as the filter_tags parameter.
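The filter rules listed above can be sketched as predicates over a single tag. The names should_keep_tag and filter_noise are hypothetical, and the exact patterns (four-digit years, digits:digits ratios, the literal conditional_dnp tag) are assumptions about what counts as noise.

```rust
/// True for four-digit tags such as "2023".
fn is_year(tag: &str) -> bool {
    tag.len() == 4 && tag.chars().all(|c| c.is_ascii_digit())
}

/// True for tags shaped like "16:9".
fn is_aspect_ratio(tag: &str) -> bool {
    match tag.split_once(':') {
        Some((w, h)) => {
            !w.is_empty()
                && !h.is_empty()
                && w.chars().all(|c| c.is_ascii_digit())
                && h.chars().all(|c| c.is_ascii_digit())
        }
        None => false,
    }
}

/// Returns false for tags the filter would drop.
pub fn should_keep_tag(tag: &str) -> bool {
    let tag = tag.trim();
    !(tag.is_empty()
        || is_year(tag)
        || is_aspect_ratio(tag)
        || tag.eq_ignore_ascii_case("conditional_dnp")
        || tag.eq_ignore_ascii_case("conditional dnp"))
}

/// Applies the filter only when `filter_tags` is true; otherwise every tag
/// is passed through (trimmed).
pub fn filter_noise<'a>(tags: &[&'a str], filter_tags: bool) -> Vec<&'a str> {
    tags.iter()
        .copied()
        .map(str::trim)
        .filter(|t| !filter_tags || should_keep_tag(t))
        .collect()
}
```

Keeping the predicate separate from the flag makes it easy to test each rule on its own.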
Caption File Generation
- Creates .txt files from e621 JSON posts
- Filename derived from post's image MD5
- Format: [rating], [artist tags], [character tags], [other tags]
- Skips generation if no valid tags remain after filtering (when filtering is enabled)
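The generation rules above can be sketched as two helpers: one that derives the caption path from the MD5, and one that builds the caption line or returns None when nothing is left to write. Both names (caption_path, build_caption) and the output directory parameter are hypothetical, and the sketch assumes the tags have already been formatted and filtered.

```rust
use std::path::{Path, PathBuf};

/// Caption path for a post: "<out_dir>/<md5>.txt".
pub fn caption_path(out_dir: &Path, md5: &str) -> PathBuf {
    out_dir.join(format!("{}.txt", md5))
}

/// Builds "[rating], [artist tags], [character tags], [other tags]".
/// Returns None when no tags remain, so the caller can skip the file.
pub fn build_caption(
    rating: &str,
    artists: &[String],
    characters: &[String],
    other: &[String],
) -> Option<String> {
    if artists.is_empty() && characters.is_empty() && other.is_empty() {
        return None;
    }
    let mut parts = vec![rating.to_string()];
    parts.extend(artists.iter().cloned());
    parts.extend(characters.iter().cloned());
    parts.extend(other.iter().cloned());
    Some(parts.join(", "))
}
```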
Example Usage
use ;
use Result;
async
Example Outputs
With default settings:
safe, by artist name, character name, species, tag1, tag2
With custom format:
Rating: safe
Artists: by artist name
Tags: tag1, tag2
With raw ratings:
s, by artist name, character name, species, tag1, tag2
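The custom format output above can be reproduced with plain placeholder substitution. The sketch below is one possible approach and not the library's code: render_caption is a hypothetical name, each group is joined with ", " as documented, and the handling of empty groups in the default format is deliberately left out.

```rust
/// Replaces each "{name}" placeholder in `template` with the matching tag
/// group joined by ", ". Placeholders without a matching group are left as-is.
pub fn render_caption(template: &str, groups: &[(&str, Vec<&str>)]) -> String {
    let mut out = template.to_string();
    for (name, tags) in groups {
        // "{{" and "}}" are escaped braces, so this builds e.g. "{rating}".
        let placeholder = format!("{{{}}}", name);
        out = out.replace(&placeholder, &tags.join(", "));
    }
    out
}
```

Passing the template "Rating: {rating}\nArtists: {artists}\nTags: {general}" with the groups rating, artists, and general yields the three-line caption shown under "With custom format".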
Batch Processing Example
use ;
use Result;
use fs;
async
Installation
Logging Configuration
The library uses the log crate for logging. To enable logging in your application:
- Add a logging implementation like env_logger to your project.
- Initialize the logger in your application, for example by calling env_logger::init() at the start of main.
- Set the log level using the RUST_LOG environment variable:
  - RUST_LOG=info: show info and error messages
  - RUST_LOG=debug: show debug, info, and error messages
  - RUST_LOG=trace: show all log messages
The library uses different log levels:
- error: For unrecoverable errors
- warn: For recoverable errors or unexpected conditions
- info: For important operations and successful processing
- debug: For detailed processing information
- trace: For very detailed debugging information
Usage Examples
SafeTensors Metadata Extraction
use ;
use Result;
async
Caption File Processing
use ;
use Result;
async
File Operations
use ;
use io;
async
JSON Processing and Formatting
use ;
use Value;
use Result;
async
Content Splitting
use split_content;
use info;
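A minimal sketch of splitting a caption into tags and sentences follows. It relies on an assumption that is not confirmed by this document: that the tag list ends with "., " and everything after that marker is prose. The function name split_tags_and_sentences is hypothetical and is not the library's split_content.

```rust
/// Splits `content` into a list of tags and the trailing sentences.
/// Assumption: the first "., " marks the boundary between the two parts.
pub fn split_tags_and_sentences(content: &str) -> (Vec<&str>, &str) {
    // Without a boundary marker, the whole input is treated as tags.
    let (tag_part, sentence_part) = content.split_once("., ").unwrap_or((content, ""));
    let tags = tag_part
        .split(',')
        .map(str::trim)
        .filter(|t| !t.is_empty())
        .collect();
    (tags, sentence_part.trim())
}
```

Both returned values borrow from the input, so no strings are copied until the caller decides to keep them.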
Text Processing
use ;
use ;
use Result;
use info;
async
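The normalization steps named in the feature list (smart quotes to standard quotes, whitespace and newline cleanup) can be sketched as below. normalize_text is a hypothetical name, and collapsing every whitespace run into a single space is an assumption; the library may preserve some line breaks.

```rust
/// Replaces typographic quotes with ASCII quotes and collapses all
/// whitespace runs, including newlines, into single spaces.
pub fn normalize_text(input: &str) -> String {
    let replaced: String = input
        .chars()
        .map(|c| match c {
            // Left and right single quotation marks.
            '\u{2018}' | '\u{2019}' => '\'',
            // Left and right double quotation marks.
            '\u{201C}' | '\u{201D}' => '"',
            other => other,
        })
        .collect();
    // split_whitespace also trims leading and trailing whitespace.
    replaced.split_whitespace().collect::<Vec<_>>().join(" ")
}
```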
Error Handling
The library uses anyhow for comprehensive error handling:
use Path;
use ;
use info;
async
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. When contributing:
- Ensure all tests pass
- Add tests for new features
- Update documentation
- Follow the existing code style
- Add error handling where appropriate
License
This project is licensed under the MIT License.