# voirs-dataset

Dataset utilities and processing for VoiRS speech synthesis training and evaluation.

This crate provides comprehensive tools for loading, processing, and managing speech datasets used in VoiRS model training and evaluation. It supports popular datasets like LJSpeech and JVS, as well as custom dataset creation and preprocessing pipelines.
## Features
- **Multi-dataset Support**: LJSpeech, JVS, VCTK, LibriTTS, and custom datasets
- **Audio Processing**: Normalization, resampling, trimming, and quality validation
- **Data Augmentation**: Speed perturbation, pitch shifting, noise injection
- **Parallel Processing**: Multi-threaded audio processing with Rayon
- **Manifest Generation**: JSON, CSV, and Parquet metadata formats
- **Quality Control**: Automatic filtering of low-quality samples
- **Streaming**: Memory-efficient processing of large datasets
## Quick Start

```rust
// Illustrative sketch: type and method names may differ from the crate's API.
use voirs_dataset::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a standard dataset and inspect it.
    let dataset = LjSpeechDataset::from_path("./data/LJSpeech-1.1")?;
    println!("Loaded {} samples", dataset.len());
    Ok(())
}
```
## Supported Datasets
| Dataset | Language | Speakers | Hours | Domain | Status |
|---|---|---|---|---|---|
| LJSpeech | English | 1 | 24h | Audiobooks | ✅ Stable |
| JVS | Japanese | 100 | 30h | Various | ✅ Stable |
| VCTK | English | 110 | 44h | News | 🚧 Beta |
| LibriTTS | English | 2,456 | 585h | Audiobooks | 🚧 Beta |
| Common Voice | Multi | 50,000+ | 20,000h+ | Read speech | 📋 Planned |
| JSUT | Japanese | 1 | 10h | News | 📋 Planned |
## Architecture

```
Raw Dataset → Download → Extract → Validate → Process → Augment → Export
     ↓           ↓          ↓          ↓          ↓         ↓        ↓
  Archive    Metadata     Audio     Quality    Enhance  Variants  Manifest
```
## Core Components

1. **Dataset Loaders**
   - Automatic download and extraction
   - Metadata parsing and validation
   - Audio file discovery and loading

2. **Audio Processing**
   - Format conversion and normalization
   - Resampling and channel management
   - Silence detection and trimming

3. **Data Augmentation**
   - Speed and pitch perturbation
   - Noise injection and mixing
   - Room impulse response simulation

4. **Quality Control**
   - SNR and clipping detection
   - Duration and frequency validation
   - Manual review and annotation tools
## API Reference

The core abstractions are the `Dataset` trait, the `DatasetSample` record it yields, and the `AudioData` payload carried by each sample.
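As a hedged sketch of how these pieces fit together (the field and method names here are illustrative stand-ins, not the crate's exact definitions):

```rust
// Illustrative sketch of the core types; the crate's real definitions may differ.

/// Decoded audio carried by each sample.
pub struct AudioData {
    pub samples: Vec<f32>,
    pub sample_rate: u32,
    pub channels: u16,
}

impl AudioData {
    /// Duration in seconds, derived from sample count, rate, and channel count.
    pub fn duration(&self) -> f32 {
        self.samples.len() as f32 / (self.sample_rate as f32 * self.channels as f32)
    }
}

/// One utterance: an id, its transcript, and the audio.
pub struct DatasetSample {
    pub id: String,
    pub text: String,
    pub audio: AudioData,
}

/// Core trait implemented by every dataset loader.
pub trait Dataset {
    fn len(&self) -> usize;
    fn get(&self, index: usize) -> Option<&DatasetSample>;
}

fn main() {
    let audio = AudioData {
        samples: vec![0.0; 22050],
        sample_rate: 22050,
        channels: 1,
    };
    println!("duration = {:.2}s", audio.duration()); // prints "duration = 1.00s"
}
```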
## Usage Examples

### Loading Standard Datasets

```rust
// Dataset type names and paths below are illustrative; check the crate docs.
use voirs_dataset::{JvsDataset, LjSpeechDataset, VctkDataset};

// Load LJSpeech dataset
let ljspeech = LjSpeechDataset::from_path("./data/LJSpeech-1.1")?;

// Load JVS dataset
let jvs = JvsDataset::from_path("./data/jvs_ver1")?;

// Load VCTK dataset
let vctk = VctkDataset::from_path("./data/VCTK-Corpus")?;
```
### Custom Dataset Creation

```rust
// Builder name and argument values are illustrative.
use voirs_dataset::{AudioFormat, CustomDatasetBuilder};

let dataset = CustomDatasetBuilder::new()
    .name("my-dataset")
    .audio_dir("./audio")
    .transcript_file("./transcripts.txt")
    .audio_format(AudioFormat::Wav)
    .sample_rate(22050)
    .build()?;

// Save dataset manifest
dataset.save_manifest("manifest.json")?;
```
### Data Processing Pipeline

```rust
// Step names are illustrative; the steps mirror the audio-processing features above.
use voirs_dataset::processing::{ProcessingStep, Processor};

let processor = Processor::builder()
    .add_step(ProcessingStep::Normalize)
    .add_step(ProcessingStep::Resample { target_rate: 22050 })
    .add_step(ProcessingStep::TrimSilence)
    .add_step(ProcessingStep::ValidateQuality)
    .build();

let processed_dataset = processor.process(&dataset).await?;
```
### Data Augmentation

```rust
// Augmentation variants are illustrative; the values shown match the
// augmentation defaults in the configuration section below.
use voirs_dataset::augment::{Augmentation, Augmenter};

let augmenter = Augmenter::builder()
    .add_augmentation(Augmentation::Speed { factors: vec![0.9, 1.0, 1.1] })
    .add_augmentation(Augmentation::PitchShift { semitones: -2.0..2.0 })
    .add_augmentation(Augmentation::Noise { snr_db: 20.0..40.0 })
    .build();

let augmented_dataset = augmenter.augment(&dataset).await?;
```
### Dataset Splitting

```rust
// Field names are illustrative.
use voirs_dataset::{DatasetSplit, SplitConfig};

let split_config = SplitConfig {
    train: 0.8,
    validation: 0.1,
    test: 0.1,
};

let DatasetSplit { train, validation, test } = dataset.split(&split_config)?;

println!("Train: {} samples", train.len());
println!("Validation: {} samples", validation.len());
println!("Test: {} samples", test.len());
```
### Parallel Processing

```rust
// Configuration fields are illustrative.
use voirs_dataset::parallel::{ParallelProcessor, ProcessingConfig};

let config = ProcessingConfig {
    num_workers: 8,
    ..Default::default()
};

let processor = ParallelProcessor::new(config);
let results = processor.process_dataset(&dataset).await?;
```
### Quality Analysis

```rust
// Report fields are illustrative; they mirror the quality checks listed above.
use voirs_dataset::quality::QualityAnalyzer;

let analyzer = QualityAnalyzer::new();
let report = analyzer.analyze_dataset(&dataset).await?;

println!("Total samples: {}", report.total_samples);
println!("Average SNR: {:.1} dB", report.average_snr);
println!("Clipped samples: {}", report.clipped_samples);
println!("Average duration: {:.2} s", report.average_duration);
println!("Samples flagged for review: {}", report.flagged_samples);
```
### Streaming Large Datasets

```rust
// Streaming types and fields are illustrative.
use voirs_dataset::streaming::{StreamingConfig, StreamingDataset};
use futures::StreamExt;

let config = StreamingConfig {
    chunk_size: 64,
    ..Default::default()
};

let streaming_dataset = StreamingDataset::new("./data/large-dataset", config);
let mut stream = streaming_dataset.stream();

while let Some(sample) = stream.next().await {
    let sample = sample?;
    // Each sample is loaded on demand, keeping memory usage bounded.
}
```
### Export to Different Formats

```rust
// Exporter API and output paths are illustrative.
use voirs_dataset::export::{ExportFormat, Exporter};

let exporter = Exporter::new(&dataset);

// Export to HuggingFace Datasets format
exporter.export(ExportFormat::HuggingFace, "./export/hf").await?;

// Export to PyTorch format
exporter.export(ExportFormat::PyTorch, "./export/torch").await?;
```
## Performance

### Processing Speed (Intel i7-12700K, 8 workers)
| Operation | Speed | Memory Usage | Notes |
|---|---|---|---|
| Audio loading | 450 files/s | 2GB | WAV files, 22kHz |
| Resampling | 280 files/s | 1.5GB | 48kHz → 22kHz |
| Mel extraction | 220 files/s | 2.5GB | 80 mel bins |
| Augmentation | 180 files/s | 3GB | 3x augmentation |
| Quality analysis | 320 files/s | 1GB | SNR, clipping, etc. |
### Memory Efficiency

- **Streaming**: Constant memory usage regardless of dataset size
- **Chunked processing**: Configurable memory limits
- **Lazy loading**: Audio files loaded on demand
- **Memory pooling**: Reuse of audio buffers
## Installation

Add to your `Cargo.toml`:

```toml
[dependencies]
voirs-dataset = "0.1"
```

To enable specific features:

```toml
[dependencies.voirs-dataset]
version = "0.1"
features = ["augmentation", "streaming", "export"]
```
### Feature Flags

- `augmentation`: Enable data augmentation pipeline
- `streaming`: Enable streaming dataset support
- `export`: Enable dataset export to various formats
- `analysis`: Enable quality analysis tools
- `visualization`: Enable dataset visualization
- `pandrs`: Integration with PandRS for ETL operations
### System Dependencies

Audio processing (package names are typical examples and may vary by platform):

```bash
# Ubuntu/Debian
sudo apt-get install libasound2-dev pkg-config

# macOS
brew install pkg-config
```

Optional dependencies:

```bash
# For advanced audio processing
sudo apt-get install ffmpeg

# For dataset downloading
sudo apt-get install wget
```
## Configuration

Create `~/.voirs/dataset.toml`:

```toml
# Key names are illustrative; consult the crate documentation for the exact schema.
[general]
cache_dir = "~/.voirs/cache/datasets"
num_workers = 8
memory_limit = "4GB"

[download]
verify_checksums = true
resume = true
mirrors = [
    "https://github.com/keithito/tacotron/releases/download/v0.3/",
    "https://mirror.voirs.org/datasets/"
]

[audio]
sample_rate = 22050
channels = 1
normalization = "rms"
trim_silence = true
silence_threshold_db = -40.0

[augmentation]
speed_factors = [0.9, 1.0, 1.1]
pitch_shift_semitones = [-2.0, 2.0]
noise_snr_db = [20.0, 40.0]
rir_dir = "~/.voirs/rir/"

[quality]
min_snr_db = 15.0
max_clipping_ratio = 0.01
min_duration_s = 0.5
max_duration_s = 20.0

[manifest]
format = "json"
include_checksums = true
compression = "gzip"
```
## Dataset Creation Guide

### Preparing a Custom Dataset

```rust
// Type and field names are illustrative.
use voirs_dataset::creation::{CreationConfig, DatasetCreator};

let config = CreationConfig {
    audio_dir: "./raw/audio".into(),
    transcript_file: "./raw/transcripts.txt".into(),
    ..Default::default()
};
let creator = DatasetCreator::new(config);

// Validate input data
let validation = creator.validate().await?;
if !validation.is_valid {
    eprintln!("Validation issues: {:?}", validation.issues);
}

// Create dataset
let dataset = creator.create().await?;

// Generate quality report
let report = dataset.quality_report().await?;
println!("Samples: {}", report.total_samples);
println!("Average quality score: {:.2}", report.average_quality);
```
### Audio File Organization

```
dataset/
├── audio/
│   ├── speaker1/
│   │   ├── 001.wav
│   │   ├── 002.wav
│   │   └── ...
│   └── speaker2/
│       ├── 001.wav
│       └── ...
├── transcripts.txt
├── speakers.json
└── metadata.json
```
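A loader for this layout can discover per-speaker audio files with only the standard library; `discover_audio` below is a hypothetical helper written for this README, not part of the crate:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Collect all .wav files under `<root>/audio/<speaker>/`, tagged by speaker id.
fn discover_audio(root: &Path) -> std::io::Result<Vec<(String, PathBuf)>> {
    let mut files = Vec::new();
    for speaker in fs::read_dir(root.join("audio"))? {
        let speaker = speaker?;
        if !speaker.file_type()?.is_dir() {
            continue; // skip stray files next to the speaker directories
        }
        let id = speaker.file_name().to_string_lossy().into_owned();
        for entry in fs::read_dir(speaker.path())? {
            let path = entry?.path();
            if path.extension().and_then(|e| e.to_str()) == Some("wav") {
                files.push((id.clone(), path));
            }
        }
    }
    files.sort(); // deterministic order across platforms
    Ok(files)
}

fn main() -> std::io::Result<()> {
    // Build a tiny example tree in a temp directory and scan it.
    let root = std::env::temp_dir().join("voirs_demo_dataset");
    fs::create_dir_all(root.join("audio/speaker1"))?;
    fs::write(root.join("audio/speaker1/001.wav"), b"")?;
    let files = discover_audio(&root)?;
    println!("found {} file(s)", files.len());
    Ok(())
}
```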
Transcript Format
# transcripts.txt
speaker1/001.wav|Hello, this is the first sentence.
speaker1/002.wav|This is the second sentence with punctuation!
speaker2/001.wav|Different speaker saying something else.
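This pipe-delimited format can be parsed with a few lines of standard-library Rust; the `TranscriptEntry` struct and `parse_transcripts` function below are hypothetical illustrations, not part of the crate's API:

```rust
/// One line of a pipe-delimited transcript file.
#[derive(Debug)]
struct TranscriptEntry {
    audio_path: String,
    text: String,
}

/// Parse `path|text` lines, skipping blank lines and `#` comments.
fn parse_transcripts(contents: &str) -> Vec<TranscriptEntry> {
    contents
        .lines()
        .filter_map(|line| {
            let line = line.trim();
            if line.is_empty() || line.starts_with('#') {
                return None;
            }
            // Lines without a `|` separator are silently skipped.
            let (path, text) = line.split_once('|')?;
            Some(TranscriptEntry {
                audio_path: path.trim().to_string(),
                text: text.trim().to_string(),
            })
        })
        .collect()
}

fn main() {
    let data = "# transcripts.txt\n\
                speaker1/001.wav|Hello, this is the first sentence.\n\
                speaker2/001.wav|Different speaker saying something else.";
    for e in parse_transcripts(data) {
        println!("{} -> {}", e.audio_path, e.text);
    }
}
```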
### Speaker Information
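The exact schema of `speakers.json` is defined by the crate; a minimal hypothetical example, keyed by the speaker directory names, might look like:

```json
{
  "speaker1": { "name": "Speaker One", "gender": "female", "language": "en" },
  "speaker2": { "name": "Speaker Two", "gender": "male", "language": "en" }
}
```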
## Error Handling

```rust
// The error variants below are illustrative, not the crate's exact enum.
use voirs_dataset::DatasetError;

match dataset.get(index) {
    Ok(sample) => println!("Loaded sample {}", sample.id),
    Err(DatasetError::AudioLoad { path }) => {
        eprintln!("Failed to load audio file: {}", path);
    }
    Err(DatasetError::MissingTranscript { id }) => {
        eprintln!("No transcript found for sample {}", id);
    }
    Err(e) => return Err(e.into()),
}
```
## Advanced Features
### Custom Data Loaders

Formats not listed above can be supported by implementing the core `Dataset` trait for your own loader type.
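As a sketch of the pattern (the trait below is a simplified stand-in for the crate's actual `Dataset` trait, and `InMemoryDataset` is a hypothetical loader):

```rust
/// Simplified stand-in for the crate's `Dataset` trait.
trait Dataset {
    type Sample;
    fn len(&self) -> usize;
    fn get(&self, index: usize) -> Option<&Self::Sample>;
}

/// A custom loader backed by an in-memory list of (path, transcript) pairs.
struct InMemoryDataset {
    items: Vec<(String, String)>,
}

impl Dataset for InMemoryDataset {
    type Sample = (String, String);

    fn len(&self) -> usize {
        self.items.len()
    }

    fn get(&self, index: usize) -> Option<&Self::Sample> {
        self.items.get(index)
    }
}

fn main() {
    let ds = InMemoryDataset {
        items: vec![("001.wav".into(), "Hello.".into())],
    };
    println!("{} sample(s), first: {:?}", ds.len(), ds.get(0));
}
```

A real loader would populate `items` from disk (for example via the transcript parser shown earlier) and decode audio lazily in `get`.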
### Integration with ML Frameworks

```rust
// Wrapper and builder names are illustrative.
use voirs_dataset::integrations::torch::{TorchDataLoader, TorchDataset};

let torch_dataset = TorchDataset::from_dataset(&dataset)?;

let dataloader = TorchDataLoader::new(torch_dataset)
    .batch_size(32)
    .shuffle(true)
    .num_workers(4)
    .build()?;

for batch in dataloader {
    // Train on each batch of audio and text tensors.
}
```
## Contributing
We welcome contributions! Please see the main repository for contribution guidelines.
### Development Setup

```bash
# Install development dependencies
cargo build --all-features

# Download test datasets (commands are project-specific; see the repository)

# Run tests
cargo test --all-features

# Run benchmarks
cargo bench

# Check code quality
cargo clippy --all-targets
cargo fmt --check
```
### Adding New Datasets

1. Implement the `Dataset` trait for your dataset
2. Add dataset-specific loading and processing logic
3. Create comprehensive tests with sample data
4. Add documentation and usage examples
5. Update the supported datasets table
## License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.