Crate ferrotorch_whisper

Whisper-family audio encoder model composition for ferrotorch.

Assembles the encoder half of OpenAI’s Whisper model from ferrotorch primitives:

WhisperEncoder
├── WhisperConvStem
│   ├── Conv1d (conv1: num_mel_bins → d_model, k=3, stride=1, pad=1, bias)
│   └── Conv1d (conv2: d_model → d_model,     k=3, stride=2, pad=1, bias)
├── embed_positions (sinusoidal, loaded from state-dict as a parameter)
├── WhisperEncoderLayer × N
│   ├── LayerNorm  (self_attn_layer_norm)                  ← PRE-NORM
│   ├── WhisperEncoderSelfAttention
│   │   ├── Linear q_proj    [d_model, d_model] (bias)
│   │   ├── Linear k_proj    [d_model, d_model] (NO bias)
│   │   ├── Linear v_proj    [d_model, d_model] (bias)
│   │   └── Linear out_proj  [d_model, d_model] (bias)
│   ├── LayerNorm  (final_layer_norm)                      ← PRE-NORM
│   ├── Linear fc1 [d_model, encoder_ffn_dim] (bias) + GELU
│   └── Linear fc2 [encoder_ffn_dim, d_model] (bias)
└── LayerNorm (layer_norm — final encoder LayerNorm)
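The shapes in the tree can be checked with a little arithmetic. The sketch below is not part of the crate: the function names are hypothetical, and the values d_model = 384 and encoder_ffn_dim = 1536 are assumed to match the whisper-tiny configuration. It applies the standard Conv1d output-length formula to the two stem convolutions and counts the learnable parameters of one encoder layer as listed above.

```rust
/// Output length of a 1-D convolution with dilation 1 (standard floor formula).
fn conv1d_out_len(len_in: usize, kernel: usize, stride: usize, pad: usize) -> usize {
    (len_in + 2 * pad - kernel) / stride + 1
}

/// Frames left after the conv stem: k=3, pad=1, stride 1 then stride 2.
fn stem_out_frames(mel_frames: usize) -> usize {
    let after_conv1 = conv1d_out_len(mel_frames, 3, 1, 1);
    conv1d_out_len(after_conv1, 3, 2, 1)
}

/// Learnable parameters in one encoder layer, following the tree above.
fn encoder_layer_params(d_model: usize, ffn_dim: usize) -> usize {
    // Weight matrix plus an optional bias vector.
    let linear = |n_in: usize, n_out: usize, bias: bool| -> usize {
        n_in * n_out + if bias { n_out } else { 0 }
    };
    let layer_norm = 2 * d_model; // weight + bias
    let attn = linear(d_model, d_model, true) // q_proj
        + linear(d_model, d_model, false) // k_proj (no bias)
        + linear(d_model, d_model, true) // v_proj
        + linear(d_model, d_model, true); // out_proj
    let mlp = linear(d_model, ffn_dim, true) + linear(ffn_dim, d_model, true);
    2 * layer_norm + attn + mlp
}

fn main() {
    // Assumed whisper-tiny sizes: 3000 mel frames, d_model 384, ffn 1536.
    println!("frames after stem: {}", stem_out_frames(3000));
    println!("params per layer:  {}", encoder_layer_params(384, 1536));
}
```

Under these assumptions the stride-2 convolution halves 3000 mel frames to 1500, which is why the encoder emits 1500 hidden states per 30-second window.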

§Audio preprocessing

audio::log_mel_spectrogram turns 16 kHz mono f32 PCM into the [1, 80, 3000] log-mel tensor the encoder consumes. The 80-bin filter bank ships as the embedded binary asset assets/mel_filters_80x201.bin, byte-for-byte equal to WhisperFeatureExtractor.mel_filters.T. Any drift between this module and the reference must therefore come from the STFT / log / clip / normalize pipeline, never from the mel scale.
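As a rough illustration of the log / clip / normalize stage, here is a minimal sketch that assumes the crate follows the reference Whisper recipe (floor at 1e-10, clip to 8 log10 units below the global maximum, then rescale with (x + 4) / 4). The function names, the 160-sample hop, and the 30-second window are assumptions for illustration, not taken from this crate's source.

```rust
// Assumed reference values: 16 kHz audio, 160-sample hop, 30 s window.
const SAMPLE_RATE_HZ: usize = 16_000;
const HOP_LENGTH: usize = 160;
const CHUNK_SECONDS: usize = 30;

/// Number of STFT frames in one fixed-length window.
fn frames_per_chunk() -> usize {
    SAMPLE_RATE_HZ * CHUNK_SECONDS / HOP_LENGTH
}

/// Log-compress, clip, and rescale a mel power spectrogram in place.
fn log_clip_normalize(mel_power: &mut [f32]) {
    // log10 with a floor so silent bins do not become -inf.
    for v in mel_power.iter_mut() {
        *v = (*v).max(1e-10_f32).log10();
    }
    // Global maximum of the log spectrogram.
    let max = mel_power.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Clip to at most 8 log10 units below the maximum, then rescale.
    for v in mel_power.iter_mut() {
        *v = ((*v).max(max - 8.0) + 4.0) / 4.0;
    }
}

fn main() {
    let mut mel = [1.0_f32, 1e-12, 100.0];
    log_clip_normalize(&mut mel);
    println!("normalized: {:?}", mel);
    println!("frames per 30 s chunk: {}", frames_per_chunk());
}
```

Because the clip threshold depends on the global maximum, a comparison against the reference only makes sense on whole spectrograms, not on individual frames.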

§Loading real weights

WhisperEncoder::load_hf_state_dict accepts a StateDict whose keys use the HuggingFace WhisperModel naming convention. It filters out non-encoder keys (decoder / proj_out / etc.) and returns an encoder::DropReport documenting every drop, so the pin script can confirm no encoder key was silently lost. Combined with ferrotorch_serialize::load_safetensors and the load_whisper_encoder helper, this gives a direct path from a downloaded openai/whisper-tiny checkpoint to an encoder ready to produce [1, 1500, 384] hidden states.
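The filtering idea can be sketched independently of the crate. In the example below, DropSummary and partition_encoder_keys are hypothetical stand-ins: the real DropReport fields and the load_hf_state_dict signature are not shown on this page and may differ. The sketch also assumes that keys may or may not carry a leading "model." prefix.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the crate's DropReport; the real fields may differ.
#[derive(Debug, Default)]
struct DropSummary {
    kept: Vec<String>,
    dropped: Vec<String>,
}

/// Split checkpoint entries into encoder entries (prefix stripped) and drops.
fn partition_encoder_keys<V>(state: BTreeMap<String, V>) -> (BTreeMap<String, V>, DropSummary) {
    let mut encoder = BTreeMap::new();
    let mut report = DropSummary::default();
    for (key, value) in state {
        // Assumption: an optional "model." prefix precedes "encoder.".
        let bare = key.strip_prefix("model.").unwrap_or(&key);
        match bare.strip_prefix("encoder.") {
            Some(rest) => {
                report.kept.push(key.clone());
                encoder.insert(rest.to_string(), value);
            }
            // Every non-encoder key is recorded rather than silently discarded.
            None => report.dropped.push(key.clone()),
        }
    }
    (encoder, report)
}

fn main() {
    let mut state = BTreeMap::new();
    for key in [
        "model.encoder.conv1.weight",
        "model.decoder.embed_tokens.weight",
        "proj_out.weight",
    ] {
        state.insert(key.to_string(), ()); // () stands in for a tensor
    }
    let (encoder, report) = partition_encoder_keys(state);
    println!("encoder keys: {:?}", encoder.keys().collect::<Vec<_>>());
    println!("dropped: {:?}", report.dropped);
}
```

Recording each dropped key, instead of only counting them, lets a verification script compare the drop list against the decoder-side keys it expects.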

§Out of scope

The decoder (cross-attention, kv-cache, beam search) is intentionally not implemented in this crate. Phase B.2 of real-artifact-driven development is encoder-only.

§REQ status (per .design/<area>/<file>.md)

| REQ | Status | Evidence |
|-------|---------|----------|
| REQ-1 | SHIPPED | impl: #![deny(...)] / #![allow(...)] block at lib.rs:5-44; non-test consumer: enforced by every other file in the crate. |
| REQ-2 | SHIPPED | impl: pub mod declarations in lib.rs; non-test consumer: every other .rs file in the crate uses crate::<mod>::... paths. |
| REQ-3 | SHIPPED | impl: pub use block at lib.rs:105-110; non-test consumer: downstream binaries import these names directly. |
| REQ-4 | SHIPPED | impl: //! doc-comment block at lib.rs:46-96; non-test consumer: published via cargo doc -p ferrotorch-whisper. |

Re-exports§

pub use attention::WhisperEncoderSelfAttention;
pub use audio::N_FRAMES;
pub use audio::N_MELS;
pub use audio::SAMPLE_RATE;
pub use audio::log_mel_spectrogram;
pub use config::HfWhisperConfig;
pub use config::WhisperConfig;
pub use encoder::DropReport;
pub use encoder::WhisperConvStem;
pub use encoder::WhisperEncoder;
pub use layer::WhisperEncoderLayer;
pub use safetensors_loader::load_whisper_encoder;

Modules§

attention
Whisper encoder self-attention block.
audio
Whisper audio preprocessing — 16 kHz mono f32 PCM → [1, 80, 3000] log-mel spectrogram.
config
Typed Whisper encoder configuration.
encoder
Whisper encoder — conv stem + sinusoidal positional embedding + N × WhisperEncoderLayer + final LayerNorm.
layer
Single Whisper encoder layer (pre-norm).
safetensors_loader
Helpers that turn a path-to-safetensors into a loaded WhisperEncoder.