pub struct TransformerConfig {
pub hidden_size: usize,
pub num_attention_heads: usize,
pub num_kv_heads: usize,
pub intermediate_size: usize,
pub num_hidden_layers: usize,
pub vocab_size: usize,
pub max_position_embeddings: usize,
pub rms_norm_eps: f32,
pub rope_theta: f32,
pub use_bias: bool,
pub head_dim_override: Option<usize>,
pub architecture: ModelArchitecture,
pub hf_architecture: Option<String>,
pub hf_model_type: Option<String>,
pub tie_word_embeddings: bool,
}
Configuration for transformer models
Fields
hidden_size: usize
Hidden dimension (embedding size)
num_attention_heads: usize
Number of attention heads
num_kv_heads: usize
Number of key-value heads (for grouped-query attention)
intermediate_size: usize
Feed-forward network intermediate dimension
num_hidden_layers: usize
Number of transformer layers
vocab_size: usize
Vocabulary size
max_position_embeddings: usize
Maximum sequence length
rms_norm_eps: f32
RMS normalization epsilon
rope_theta: f32
RoPE theta base
use_bias: bool
Whether to use bias in linear layers
head_dim_override: Option<usize>
Explicit per-head dimension (overrides hidden_size / num_heads). Required for Qwen3, where head_dim=128 but hidden_size/num_heads=80.
architecture: ModelArchitecture
Architecture family: encoder (BERT/RoBERTa) or decoder (LLaMA/Qwen). Determines position encoding, normalization, activation, and pooling strategy.
hf_architecture: Option<String>
HuggingFace architecture class name (e.g., “Qwen2ForCausalLM”, “LlamaForCausalLM”). Used for checkpoint config.json compatibility.
hf_model_type: Option<String>
HuggingFace model type (e.g., “qwen2”, “llama”). Used for checkpoint config.json compatibility.
tie_word_embeddings: bool
Whether to tie input/output embeddings (embed_tokens and lm_head). Qwen2: true, LLaMA: false.
Implementations
impl TransformerConfig
pub fn llama2_13b() -> Self
LLaMA 2 13B configuration
pub fn mistral_7b() -> Self
Mistral 7B configuration
pub fn qwen2_0_5b() -> Self
Qwen2 0.5B configuration (good for testing).
Empirically verified on 2026-05-04 against
~/.cache/huggingface/hub/models--Qwen--Qwen2.5-Coder-0.5B-Instruct/.../config.json.
Pinned by contracts/apr-pretrain-arch-polymorphic-v1.yaml FALSIFY-001.
Note: tie_word_embeddings: true is the Qwen2.5 0.5B/1.5B convention
(the 7B variant turns this off; see qwen2_7b()). This is a quirk of how
the Qwen family scales: small Qwen models share one weight matrix between
embed_tokens and lm_head to save parameters, while the larger variants pay
the parameter cost for untied weights. Drift prevention: this flag must
stay true for the SHIP-TWO-001 §49 MODEL-2 fine-tune from a
Qwen2.5-Coder-0.5B checkpoint.
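To make the parameter saving concrete, the following is a rough sketch of the arithmetic, not code from this crate. The helper name embedding_params is hypothetical. The vocabulary size 151936 is the value quoted on this page for Qwen2; the hidden size 896 is an assumption taken from the published Qwen2.5-0.5B config.json and does not appear on this page.

```rust
/// Parameters held by the token embedding plus the output projection (lm_head).
/// Hypothetical helper for illustration only; not part of `TransformerConfig`.
fn embedding_params(vocab_size: usize, hidden_size: usize, tied: bool) -> usize {
    // Each matrix has shape [vocab_size, hidden_size].
    let one_matrix = vocab_size * hidden_size;
    if tied {
        one_matrix // embed_tokens and lm_head share a single matrix
    } else {
        2 * one_matrix // separate matrices for input and output
    }
}
```

Under those assumptions, tying removes one vocab_size x hidden_size matrix, which is a large share of a model whose total size is on the order of half a billion parameters.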
pub fn qwen2_1_5b() -> Self
Qwen2.5-Coder-1.5B-Instruct: 28 layers, 12 heads, 2 KV heads, hidden=1536
pub fn qwen2_7b() -> Self
Qwen2.5-Coder 7B configuration (GH-371)
Qwen2.5-Coder-7B-Instruct: 28 layers, 28 heads, 4 KV heads, hidden=3584
Contract: contracts/model-families/qwen2.yaml
pub fn qwen3_4b() -> Self
Qwen3 4B configuration
Qwen3-4B: 36 layers, 32 heads, 8 KV heads, hidden=2560, head_dim=128. Same vocab_size as Qwen2 (151936). No attention bias (Qwen3 family).
pub fn qwen3_5_9b() -> Self
Qwen3.5 9B configuration
Key differences from Qwen2: no attention bias, head_dim=256 (explicit), vocab_size=248320, hybrid attention (standard + linear layers).
Contract: contracts/model-families/qwen3_5.yaml
pub fn from_apr_metadata(
    hidden_size: Option<usize>,
    num_heads: Option<usize>,
    num_kv_heads: Option<usize>,
    intermediate_size: Option<usize>,
    num_layers: Option<usize>,
    vocab_size: Option<usize>,
    max_position_embeddings: Option<usize>,
    rms_norm_eps: Option<f32>,
    rope_theta: Option<f32>,
    architecture: Option<&str>,
) -> Option<Self>
Construct from APR v2 metadata fields.
CONTRACT: The .apr file is the single source of truth for model
architecture. These fields were validated at import time by the
tensor-layout-v1 contract. This function propagates that contract
to the training pipeline — no hardcoded lookups, no silent fallbacks.
Returns None if any required field is missing, forcing the caller to handle the error explicitly rather than silently degrading to tiny().
GH-376: Fixes instruct pipeline ignoring .apr architecture metadata.
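The all-or-nothing behaviour described above can be sketched with the ? operator on Option. This is a minimal illustration under stated assumptions, not the crate's implementation: MiniConfig and mini_from_metadata are hypothetical names, only three of the ten parameters are modelled, and which fields the real function treats as required is not stated on this page.

```rust
/// Reduced stand-in for `TransformerConfig`; three fields only, for illustration.
#[derive(Debug, PartialEq)]
struct MiniConfig {
    hidden_size: usize,
    num_attention_heads: usize,
    num_hidden_layers: usize,
}

/// Returns `None` as soon as any field is absent; no default is substituted.
fn mini_from_metadata(
    hidden_size: Option<usize>,
    num_heads: Option<usize>,
    num_layers: Option<usize>,
) -> Option<MiniConfig> {
    Some(MiniConfig {
        // `?` on an `Option` returns `None` from the whole function if the value is missing.
        hidden_size: hidden_size?,
        num_attention_heads: num_heads?,
        num_hidden_layers: num_layers?,
    })
}
```

A caller then has to match on the result, so a missing metadata field surfaces as an explicit error path instead of a silently substituted default.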
pub fn from_size_str(size: &str) -> Result<Self, String>
Resolve config from a model size string. Errors on unknown sizes.
GH-377: Replaces _ => TransformerConfig::tiny() catch-all pattern.
This is the single canonical mapping from size strings to configs.
Every callsite that previously had its own match table should use this.
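The difference between a catch-all fallback and an explicit error can be sketched as follows. This is an illustration only: layers_from_size_str is a hypothetical function that returns a layer count instead of a full config, and the accepted strings ("1.5b", "7b", "4b") are assumptions, since this page does not list the size strings the real function accepts. The layer counts are the ones quoted on this page.

```rust
/// Hypothetical stand-in: maps a size string to a layer count rather than a full config.
fn layers_from_size_str(size: &str) -> Result<usize, String> {
    match size {
        "1.5b" => Ok(28), // Qwen2.5-Coder-1.5B-Instruct: 28 layers
        "7b" => Ok(28),   // Qwen2.5-Coder-7B-Instruct: 28 layers
        "4b" => Ok(36),   // Qwen3-4B: 36 layers
        // No `_ => tiny()` arm: an unknown size is reported, not silently replaced.
        other => Err(format!("unknown model size: {other}")),
    }
}
```

Keeping a single table like this means a typo in a size string fails loudly at the call site instead of training a much smaller model than intended.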
pub fn codebert() -> Self
CodeBERT (microsoft/codebert-base) encoder configuration.
RoBERTa architecture: 12 layers, 768 hidden, 12 heads, GELU, LayerNorm, learned positions. SSC v11 Section 4: 125M params, ~20ms CPU inference, WASM-deployable.
pub fn is_encoder(&self) -> bool
Whether this config describes an encoder (BERT/RoBERTa) architecture.
pub fn hf_architecture_name(&self) -> &str
HuggingFace architecture class name for checkpoint config.json. Uses explicit override if set, otherwise infers from config.
pub fn hf_model_type_str(&self) -> &str
HuggingFace model_type string for checkpoint config.json.
pub fn ties_embeddings(&self) -> bool
Whether embeddings are tied (embed_tokens == lm_head). Uses explicit flag if set, otherwise infers from architecture.
pub fn head_dim(&self) -> usize
Per-head dimension.
Uses explicit override when set (Qwen3: head_dim=128 with hidden=2560, 32 heads). Falls back to hidden_size / num_heads for standard architectures.
pub fn q_dim(&self) -> usize
Total Q/O projection dimension = num_heads * head_dim.
Equals hidden_size for standard architectures but differs when head_dim is explicitly overridden (e.g. Qwen3-4B: 32 * 128 = 4096 != 2560).
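The relationship between head_dim() and q_dim() can be sketched with a reduced stand-in struct. HeadConfig is a hypothetical type that carries only the three fields these two helpers read; the method bodies follow the descriptions above and are an assumption about the logic, not a copy of the crate's source.

```rust
/// Reduced stand-in for `TransformerConfig` with only the fields these helpers read.
struct HeadConfig {
    hidden_size: usize,
    num_attention_heads: usize,
    head_dim_override: Option<usize>,
}

impl HeadConfig {
    /// Explicit override if present, otherwise hidden_size / num_attention_heads.
    fn head_dim(&self) -> usize {
        self.head_dim_override
            .unwrap_or_else(|| self.hidden_size / self.num_attention_heads)
    }

    /// Total Q/O projection width: num_attention_heads * head_dim.
    fn q_dim(&self) -> usize {
        self.num_attention_heads * self.head_dim()
    }
}
```

With the Qwen3-4B numbers quoted above (hidden_size 2560, 32 heads, override 128), head_dim() is 128 and q_dim() is 4096; without the override the same struct would report 80 and 2560.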
pub fn per_layer_weight_elements(&self) -> usize
Per-layer weight VRAM in f32 elements (constant, independent of seq_len).
Maps to cuda_block.rs lines 212-220: GpuBuffer::from_host() uploads.
pub fn total_training_vram_bytes(&self, max_seq_len: usize) -> usize
Total VRAM in bytes for all layers at a given max_seq_len.
Postcondition: result is exact for the current cuda_block.rs buffer layout.
Total VRAM in bytes with SHARED scratch workspace (1 per model, not per layer).
This is the correct budget formula when gradient buffers are shared across layers (canonical in PyTorch/JAX). Only weights are truly per-layer.
Postcondition: result < total_training_vram_bytes(s) for L > 1
Solve for the maximum seq_len that fits in the given VRAM budget (bytes), using shared scratch workspace.
This is the solver to use with the shared-scratch architecture. Returns None if even seq_len=1 exceeds the budget.
pub fn max_seq_len_for_vram(&self, vram_bytes: usize) -> Option<usize>
Solve for the maximum seq_len that fits in the given VRAM budget (bytes).
Binary search over [1, max_position_embeddings]. Returns None if even seq_len=1 exceeds the budget.
Precondition: vram_bytes > 0
Postcondition: total_training_vram_bytes(result) <= vram_bytes
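The binary search described above can be sketched generically. This is a minimal illustration, not the crate's code: max_seq_len_within is a hypothetical free function, and the cost closure stands in for total_training_vram_bytes, whose actual formula is not given on this page. The sketch assumes the cost is non-decreasing in seq_len, which is what makes binary search valid.

```rust
/// Largest `seq_len` in `[1, max_len]` with `cost(seq_len) <= budget`,
/// or `None` if even `seq_len = 1` does not fit.
/// Assumes `cost` is non-decreasing in `seq_len`.
fn max_seq_len_within(
    budget: usize,
    max_len: usize,
    cost: impl Fn(usize) -> usize,
) -> Option<usize> {
    if max_len == 0 || cost(1) > budget {
        return None;
    }
    // Invariant: cost(lo) <= budget, and the answer lies in [lo, hi].
    let (mut lo, mut hi) = (1, max_len);
    while lo < hi {
        // Upper midpoint, so `lo = mid` always makes progress.
        let mid = lo + (hi - lo + 1) / 2;
        if cost(mid) <= budget {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    Some(lo)
}
```

The returned value satisfies the stated postcondition by construction, because lo only ever moves to a point whose cost has been checked against the budget.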
Trait Implementations
impl Clone for TransformerConfig
fn clone(&self) -> TransformerConfig
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for TransformerConfig
impl<'de> Deserialize<'de> for TransformerConfig
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where
    __D: Deserializer<'de>,
Auto Trait Implementations
impl Freeze for TransformerConfig
impl RefUnwindSafe for TransformerConfig
impl Send for TransformerConfig
impl Sync for TransformerConfig
impl Unpin for TransformerConfig
impl UnsafeUnpin for TransformerConfig
impl UnwindSafe for TransformerConfig
Blanket Implementations
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T where T: Clone
impl<T> FmtForward for T
fn fmt_binary(self) -> FmtBinary<Self> where Self: Binary
Causes self to use its Binary implementation when Debug-formatted.
fn fmt_display(self) -> FmtDisplay<Self> where Self: Display
Causes self to use its Display implementation when Debug-formatted.
fn fmt_lower_exp(self) -> FmtLowerExp<Self> where Self: LowerExp
Causes self to use its LowerExp implementation when Debug-formatted.
fn fmt_lower_hex(self) -> FmtLowerHex<Self> where Self: LowerHex
Causes self to use its LowerHex implementation when Debug-formatted.
fn fmt_octal(self) -> FmtOctal<Self> where Self: Octal
Causes self to use its Octal implementation when Debug-formatted.
fn fmt_pointer(self) -> FmtPointer<Self> where Self: Pointer
Causes self to use its Pointer implementation when Debug-formatted.
fn fmt_upper_exp(self) -> FmtUpperExp<Self> where Self: UpperExp
Causes self to use its UpperExp implementation when Debug-formatted.
fn fmt_upper_hex(self) -> FmtUpperHex<Self> where Self: UpperHex
Causes self to use its UpperHex implementation when Debug-formatted.
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
impl<T> Pipe for T where T: ?Sized
fn pipe<R>(self, func: impl FnOnce(Self) -> R) -> R where Self: Sized
fn pipe_ref<'a, R>(&'a self, func: impl FnOnce(&'a Self) -> R) -> R where R: 'a
Borrows self and passes that borrow into the pipe function.
fn pipe_ref_mut<'a, R>(&'a mut self, func: impl FnOnce(&'a mut Self) -> R) -> R where R: 'a
Mutably borrows self and passes that borrow into the pipe function.
fn pipe_borrow<'a, B, R>(&'a self, func: impl FnOnce(&'a B) -> R) -> R
fn pipe_borrow_mut<'a, B, R>(&'a mut self, func: impl FnOnce(&'a mut B) -> R) -> R
fn pipe_as_ref<'a, U, R>(&'a self, func: impl FnOnce(&'a U) -> R) -> R
Borrows self, then passes self.as_ref() into the pipe function.
fn pipe_as_mut<'a, U, R>(&'a mut self, func: impl FnOnce(&'a mut U) -> R) -> R
Mutably borrows self, then passes self.as_mut() into the pipe function.
fn pipe_deref<'a, T, R>(&'a self, func: impl FnOnce(&'a T) -> R) -> R
Borrows self, then passes self.deref() into the pipe function.
impl<T> Pointable for T
impl<T> PolicyExt for T where T: ?Sized
impl<T> Tap for T
fn tap_borrow<B>(self, func: impl FnOnce(&B)) -> Self
Immutable access to the Borrow<B> of a value.
fn tap_borrow_mut<B>(self, func: impl FnOnce(&mut B)) -> Self
Mutable access to the BorrowMut<B> of a value.
fn tap_ref<R>(self, func: impl FnOnce(&R)) -> Self
Immutable access to the AsRef<R> view of a value.
fn tap_ref_mut<R>(self, func: impl FnOnce(&mut R)) -> Self
Mutable access to the AsMut<R> view of a value.
fn tap_deref<T>(self, func: impl FnOnce(&T)) -> Self
Immutable access to the Deref::Target of a value.
fn tap_deref_mut<T>(self, func: impl FnOnce(&mut T)) -> Self
Mutable access to the Deref::Target of a value.
fn tap_dbg(self, func: impl FnOnce(&Self)) -> Self
Calls .tap() only in debug builds, and is erased in release builds.
fn tap_mut_dbg(self, func: impl FnOnce(&mut Self)) -> Self
Calls .tap_mut() only in debug builds, and is erased in release builds.
fn tap_borrow_dbg<B>(self, func: impl FnOnce(&B)) -> Self
Calls .tap_borrow() only in debug builds, and is erased in release builds.
fn tap_borrow_mut_dbg<B>(self, func: impl FnOnce(&mut B)) -> Self
Calls .tap_borrow_mut() only in debug builds, and is erased in release builds.
fn tap_ref_dbg<R>(self, func: impl FnOnce(&R)) -> Self
Calls .tap_ref() only in debug builds, and is erased in release builds.
fn tap_ref_mut_dbg<R>(self, func: impl FnOnce(&mut R)) -> Self
Calls .tap_ref_mut() only in debug builds, and is erased in release builds.
fn tap_deref_dbg<T>(self, func: impl FnOnce(&T)) -> Self
Calls .tap_deref() only in debug builds, and is erased in release builds.