Whisper-family audio encoder model composition for ferrotorch.
Assembles the encoder half of OpenAI’s Whisper model from ferrotorch primitives:
WhisperEncoder
├── WhisperConvStem
│   ├── Conv1d (conv1: num_mel_bins → d_model, k=3, stride=1, pad=1, bias)
│   └── Conv1d (conv2: d_model → d_model, k=3, stride=2, pad=1, bias)
├── embed_positions (sinusoidal, loaded from state-dict as a parameter)
├── WhisperEncoderLayer × N
│   ├── LayerNorm (self_attn_layer_norm) ← PRE-NORM
│   ├── WhisperEncoderSelfAttention
│   │   ├── Linear q_proj [d_model, d_model] (bias)
│   │   ├── Linear k_proj [d_model, d_model] (NO bias)
│   │   ├── Linear v_proj [d_model, d_model] (bias)
│   │   └── Linear out_proj [d_model, d_model] (bias)
│   ├── LayerNorm (final_layer_norm) ← PRE-NORM
│   ├── Linear fc1 [d_model, encoder_ffn_dim] (bias) + GELU
│   └── Linear fc2 [encoder_ffn_dim, d_model] (bias)
└── LayerNorm (layer_norm, the final encoder LayerNorm)
§Audio preprocessing
audio::log_mel_spectrogram turns 16 kHz mono f32 PCM into the
[1, 80, 3000] log-mel tensor the encoder consumes. The 80-bin
filter bank is shipped as the embedded binary asset
assets/mel_filters_80x201.bin, byte-for-byte equal to
WhisperFeatureExtractor.mel_filters.T, so any drift between this
module and the reference is in the STFT / log / clip / normalize
pipeline — never in the mel scale.
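As a rough illustration of the log / clip / normalize tail of that pipeline, the sketch below follows the steps used by the reference Whisper feature extractor: a floored log10, a clip to 8 log10 units (80 dB) below the global peak, then a shift and rescale. The function name `log_clip_normalize` and the flat-slice layout are assumptions made for this example; this is not the crate's `audio::log_mel_spectrogram` implementation, and it omits the STFT and mel projection entirely.

```rust
/// Hypothetical helper: Whisper-style log compression, dynamic-range clip,
/// and rescale, applied to a mel power spectrogram stored as a flat slice.
fn log_clip_normalize(mel_power: &[f32]) -> Vec<f32> {
    // log10 with a floor so silent bins do not produce -inf.
    let log_spec: Vec<f32> = mel_power
        .iter()
        .map(|&p| p.max(1e-10).log10())
        .collect();

    // Global peak over the whole spectrogram (NEG_INFINITY for empty input).
    let peak = log_spec.iter().copied().fold(f32::NEG_INFINITY, f32::max);

    // Clip everything more than 8 log10 units (80 dB) below the peak.
    let floor = peak - 8.0;

    // Shift and scale so typical speech lands roughly in [-1, 1].
    log_spec.iter().map(|&v| (v.max(floor) + 4.0) / 4.0).collect()
}
```

Because the mel filter bank is fixed, comparing the output of a step like this against the reference extractor on the same power spectrogram is one way to localize drift to the log / clip / normalize stage rather than the STFT.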
§Loading real weights
WhisperEncoder::load_hf_state_dict accepts a StateDict whose
keys use the HuggingFace WhisperModel naming convention. It
filters out non-encoder keys (decoder / proj_out / etc.) and
returns an encoder::DropReport documenting every drop so the pin
script can confirm no encoder key was silently lost. Combined with
ferrotorch_serialize::load_safetensors and the
load_whisper_encoder helper this gives a direct path from a
downloaded openai/whisper-tiny checkpoint to an encoder ready to
produce [1, 1500, 384] hidden states.
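The key-filtering step described above can be sketched on plain strings. Everything in this block is hypothetical: `KeyPartition` is a stand-in for illustration, not the real `encoder::DropReport`, and the assumption that encoder tensors live under an `encoder.` prefix, optionally preceded by `model.`, reflects common HuggingFace Whisper checkpoints rather than anything this crate guarantees.

```rust
/// Hypothetical stand-in for a drop report: which keys were kept for the
/// encoder and which were dropped as non-encoder keys.
#[derive(Debug, Default, PartialEq)]
struct KeyPartition {
    kept: Vec<String>,
    dropped: Vec<String>,
}

/// Split checkpoint keys into encoder keys and everything else.
/// Assumption: encoder tensors are named `encoder.*`, optionally wrapped
/// in a leading `model.` prefix.
fn partition_encoder_keys<'a>(keys: impl IntoIterator<Item = &'a str>) -> KeyPartition {
    let mut out = KeyPartition::default();
    for key in keys {
        // Tolerate the optional `model.` wrapper prefix.
        let stripped = key.strip_prefix("model.").unwrap_or(key);
        if stripped.starts_with("encoder.") {
            // Keep the key in its prefix-free form.
            out.kept.push(stripped.to_string());
        } else {
            // Record the original key so every drop can be audited.
            out.dropped.push(key.to_string());
        }
    }
    out
}
```

Recording dropped keys under their original names, rather than discarding them silently, is what lets a verification script check that nothing beginning with an encoder prefix ended up in the dropped set.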
§Out of scope
The decoder (cross-attention, kv-cache, beam search) is intentionally not implemented in this crate. Phase B.2 of real-artifact-driven development is encoder-only.
§REQ status (per .design/<area>/<file>.md)
| REQ | Status | Evidence |
|---|---|---|
| REQ-1 | SHIPPED | impl: #![deny(...)] / #![allow(...)] block at lib.rs:5-44; non-test consumer: enforced by every other file in the crate. |
| REQ-2 | SHIPPED | impl: pub mod declarations in lib.rs; non-test consumer: every other .rs file in the crate uses crate::<mod>::... paths. |
| REQ-3 | SHIPPED | impl: pub use block at lib.rs:105-110; non-test consumer: downstream binaries import these names directly. |
| REQ-4 | SHIPPED | impl: //! doc-comment block at lib.rs:46-96; non-test consumer: published via cargo doc -p ferrotorch-whisper. |
Re-exports§
pub use attention::WhisperEncoderSelfAttention;
pub use audio::N_FRAMES;
pub use audio::N_MELS;
pub use audio::SAMPLE_RATE;
pub use audio::log_mel_spectrogram;
pub use config::HfWhisperConfig;
pub use config::WhisperConfig;
pub use encoder::DropReport;
pub use encoder::WhisperConvStem;
pub use encoder::WhisperEncoder;
pub use layer::WhisperEncoderLayer;
pub use safetensors_loader::load_whisper_encoder;
Modules§
- attention
- Whisper encoder self-attention block.
- audio
- Whisper audio preprocessing: 16 kHz mono f32 PCM → [1, 80, 3000] log-mel spectrogram.
- config
- Typed Whisper encoder configuration.
- encoder
- Whisper encoder: conv stem + sinusoidal positional embedding + N × WhisperEncoderLayer + final LayerNorm.
- layer
- Single Whisper encoder layer (pre-norm).
- safetensors_loader
- Helpers that turn a path-to-safetensors into a loaded WhisperEncoder.