Crate ferrotorch_whisper

Whisper-family audio encoder model composition for ferrotorch.

Assembles the encoder half of OpenAI’s Whisper model from ferrotorch primitives:

WhisperEncoder
├── WhisperConvStem
│   ├── Conv1d (conv1: num_mel_bins → d_model, k=3, stride=1, pad=1, bias)
│   └── Conv1d (conv2: d_model → d_model,     k=3, stride=2, pad=1, bias)
├── embed_positions (sinusoidal, loaded from state-dict as a parameter)
├── WhisperEncoderLayer × N
│   ├── LayerNorm  (self_attn_layer_norm)                  ← PRE-NORM
│   ├── WhisperEncoderSelfAttention
│   │   ├── Linear q_proj    [d_model, d_model] (bias)
│   │   ├── Linear k_proj    [d_model, d_model] (NO bias)
│   │   ├── Linear v_proj    [d_model, d_model] (bias)
│   │   └── Linear out_proj  [d_model, d_model] (bias)
│   ├── LayerNorm  (final_layer_norm)                      ← PRE-NORM
│   ├── Linear fc1 [d_model, encoder_ffn_dim] (bias) + GELU
│   └── Linear fc2 [encoder_ffn_dim, d_model] (bias)
└── LayerNorm (layer_norm — final encoder LayerNorm)
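
The stride-2 second convolution in the stem is what halves the time axis, so 3000 mel frames become 1500 encoder positions. The sketch below checks that arithmetic with the standard dilation-1 convolution length formula. The function names are hypothetical and are not part of this crate's API; d_model = 384 is the whisper-tiny width.

```rust
/// Output length of a 1-D convolution with dilation 1:
/// floor((len + 2 * pad - kernel) / stride) + 1.
/// Assumes len + 2 * pad >= kernel (otherwise the subtraction underflows).
pub fn conv1d_out_len(len: usize, kernel: usize, stride: usize, pad: usize) -> usize {
    (len + 2 * pad - kernel) / stride + 1
}

/// Shape `[batch, positions, d_model]` of the hidden states produced from a
/// `[batch, num_mel_bins, n_frames]` input, following the stem in the diagram.
pub fn encoder_output_shape(batch: usize, n_frames: usize, d_model: usize) -> [usize; 3] {
    // conv1: k=3, stride=1, pad=1 keeps the time axis unchanged.
    let after_conv1 = conv1d_out_len(n_frames, 3, 1, 1);
    // conv2: k=3, stride=2, pad=1 halves the time axis.
    let after_conv2 = conv1d_out_len(after_conv1, 3, 2, 1);
    [batch, after_conv2, d_model]
}
```

Calling `encoder_output_shape(1, 3000, 384)` gives `[1, 1500, 384]`, the hidden-state shape quoted for whisper-tiny.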

§Audio preprocessing

audio::log_mel_spectrogram turns 16 kHz mono f32 PCM into the [1, 80, 3000] log-mel tensor the encoder consumes. The 80-bin filter bank is shipped as the embedded binary asset assets/mel_filters_80x201.bin, byte-for-byte equal to WhisperFeatureExtractor.mel_filters.T, so any drift between this module and the reference is in the STFT / log / clip / normalize pipeline — never in the mel scale.
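
As a rough illustration of the log / clip / normalize stage, here is a sketch based on the OpenAI reference preprocessing: a log10 with a 1e-10 floor, a clip to 8.0 below the maximum, then the rescale (x + 4) / 4. It is not this crate's code. The constants are local stand-ins for the reference values (400-point FFT, hop of 160 samples, 30-second window), and every name here is hypothetical.

```rust
// Local stand-ins for the reference constants (assumed, not read from the crate).
pub const SAMPLE_RATE_HZ: usize = 16_000;
pub const HOP_LENGTH: usize = 160;
pub const CHUNK_SECONDS: usize = 30;
pub const N_FFT: usize = 400;

/// Frames in one 30 s window: 30 * 16000 / 160 = 3000.
pub const fn frames_per_chunk() -> usize {
    CHUNK_SECONDS * SAMPLE_RATE_HZ / HOP_LENGTH
}

/// One-sided frequency bins of the STFT: 400 / 2 + 1 = 201.
pub const fn n_freq_bins() -> usize {
    N_FFT / 2 + 1
}

/// Log-compress, clip and rescale a mel power spectrogram in place.
pub fn log_clip_normalize(mel_power: &mut [f32]) {
    // log10 with a floor of 1e-10 so that silent bins do not produce -inf.
    for v in mel_power.iter_mut() {
        *v = (*v).max(1e-10).log10();
    }
    // Clip everything more than 8.0 (log10 units) below the global maximum.
    let max = mel_power.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let floor = max - 8.0;
    // Shift and scale so typical values land roughly in [-1, 1].
    for v in mel_power.iter_mut() {
        *v = ((*v).max(floor) + 4.0) / 4.0;
    }
}
```

The 201 frequency bins and 3000 frames match the 80x201 filter bank and the [1, 80, 3000] output described above.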

§Loading real weights

WhisperEncoder::load_hf_state_dict accepts a StateDict whose keys use the HuggingFace WhisperModel naming convention. It filters out non-encoder keys (decoder, proj_out, etc.) and returns an encoder::DropReport documenting every drop, so the pin script can confirm that no encoder key was silently lost. Combined with ferrotorch_serialize::load_safetensors and the load_whisper_encoder helper, this gives a direct path from a downloaded openai/whisper-tiny checkpoint to an encoder ready to produce [1, 1500, 384] hidden states.
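
The filter-and-report idea can be sketched with standard-library types only. This does not show the real signatures of load_hf_state_dict, StateDict, or DropReport. The `DroppedKeys` struct, the `keep_encoder_keys` function, and the assumption that encoder keys start with `encoder.` or `model.encoder.` are all illustrative.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a drop report: every key that was not kept.
#[derive(Debug, Default, PartialEq)]
pub struct DroppedKeys {
    pub dropped: Vec<String>,
}

/// Split a name -> tensor map into encoder entries plus a report of the rest.
/// Assumption: encoder keys start with "encoder." or "model.encoder.".
pub fn keep_encoder_keys<T>(state: BTreeMap<String, T>) -> (BTreeMap<String, T>, DroppedKeys) {
    let mut kept = BTreeMap::new();
    let mut report = DroppedKeys::default();
    for (key, tensor) in state {
        if key.starts_with("encoder.") || key.starts_with("model.encoder.") {
            kept.insert(key, tensor);
        } else {
            // Record the drop instead of discarding it silently.
            report.dropped.push(key);
        }
    }
    (kept, report)
}
```

Returning the dropped keys alongside the kept ones lets a caller assert that nothing with an encoder prefix appears in the report, which is the kind of check the pin script is described as performing.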

§Out of scope

The decoder (cross-attention, kv-cache, beam search) is intentionally not implemented in this crate. Phase B.2 of real-artifact-driven development is encoder-only.

§REQ status (per `.design/<area>/<file>.md`)

REQ	Status	Evidence
REQ-1	SHIPPED	impl: `#![deny(...)]` / `#![allow(...)]` block at `lib.rs:5-44`; non-test consumer: enforced by every other file in the crate.
REQ-2	SHIPPED	impl: `pub mod` declarations in `lib.rs`; non-test consumer: every other `.rs` file in the crate uses `crate::<mod>::...` paths.
REQ-3	SHIPPED	impl: `pub use` block at `lib.rs:105-110`; non-test consumer: downstream binaries import these names directly.
REQ-4	SHIPPED	impl: `//!` doc-comment block at `lib.rs:46-96`; non-test consumer: published via `cargo doc -p ferrotorch-whisper`.

Re-exports§

pub use attention::WhisperEncoderSelfAttention;
pub use audio::N_FRAMES;
pub use audio::N_MELS;
pub use audio::SAMPLE_RATE;
pub use audio::log_mel_spectrogram;
pub use config::HfWhisperConfig;
pub use config::WhisperConfig;
pub use encoder::DropReport;
pub use encoder::WhisperConvStem;
pub use encoder::WhisperEncoder;
pub use layer::WhisperEncoderLayer;
pub use safetensors_loader::load_whisper_encoder;

Modules§

attention: Whisper encoder self-attention block.
audio: Whisper audio preprocessing — 16 kHz mono f32 PCM → [1, 80, 3000] log-mel spectrogram.
config: Typed Whisper encoder configuration.
encoder: Whisper encoder — conv stem + sinusoidal positional embedding + N × WhisperEncoderLayer + final LayerNorm.
layer: Single Whisper encoder layer (pre-norm).
safetensors_loader: Helpers that turn a path-to-safetensors into a loaded WhisperEncoder.