pmetal-vocoder 0.1.0

Neural vocoder (BigVGAN) for text-to-speech synthesis
docs.rs failed to build pmetal-vocoder-0.1.0
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.

pmetal-vocoder

BigVGAN neural vocoder for text-to-speech synthesis.

Overview

This crate provides a Rust implementation of BigVGAN, a high-fidelity neural vocoder that converts mel spectrograms to audio waveforms. It's designed for integration with speech synthesis systems.

Architecture

BigVGAN uses an anti-aliased multi-periodicity architecture:

Mel Spectrogram → Conv1d → [AMP Blocks × N] → Conv1d → Audio Waveform
                              ↓
                     Snake Activations
                     Anti-aliased Upsampling
                     Multi-period Processing

Features

  • High Fidelity: 24kHz+ audio generation
  • 256× Upsampling: Mel frames to audio samples
  • Snake Activations: Learnable periodic activations
  • Anti-Aliased Upsampling: Reduced aliasing artifacts
  • Multi-Resolution Discriminators: MPD, CQT-D for training

Components

Generator

  • Anti-Aliased Multi-Periodicity (AMP) blocks
  • Snake/SnakeBeta activations for periodic signals
  • Configurable upsampling ratios

Discriminators (for training)

  • MPD: Multi-Period Discriminator
  • MSD: Multi-Scale Discriminator
  • CQT-D: Constant-Q Transform Discriminator

Audio Processing

  • STFT/iSTFT for spectral analysis
  • Mel filterbank computation
  • Resampling utilities

Usage

use pmetal_vocoder::{BigVGANGenerator, VocoderConfig};

// Load vocoder
let config = VocoderConfig::default();
let vocoder = BigVGANGenerator::new(config)?;

// Convert mel spectrogram to audio
let mel = /* mel spectrogram [batch, mel_bins, frames] */;
let audio = vocoder.forward(&mel)?;
// audio: [batch, 1, samples]

Configuration

Parameter Description Default
upsample_rates Upsampling factors [8, 8, 2, 2]
upsample_kernel_sizes Kernel sizes [16, 16, 4, 4]
resblock_kernel_sizes ResBlock kernels [3, 7, 11]
resblock_dilation_sizes Dilation rates [[1,3,5], ...]

Modules

Module Description
generator BigVGAN generator architecture
discriminator Training discriminators
nn Custom neural network layers
audio Audio processing utilities
loss Training loss functions

License

MIT OR Apache-2.0