Struct TransformerConfig 

pub struct TransformerConfig {
    pub hidden_size: usize,
    pub num_layers: usize,
    pub num_attention_heads: usize,
    pub num_kv_heads: usize,
    pub head_dim: usize,
    pub intermediate_size: usize,
    pub vocab_size: usize,
    pub norm_type: NormType,
    pub norm_eps: f64,
    pub activation: Activation,
    pub qkv_layout: QkvLayout,
    pub mlp_layout: MlpLayout,
    pub qkv_bias: bool,
    pub o_proj_bias: bool,
    pub mlp_bias: bool,
    pub embedding_scale: Option<f64>,
    pub tie_word_embeddings: bool,
    pub rope_theta: f64,
    pub max_position_embeddings: usize,
    pub attn_logit_softcapping: Option<f64>,
    pub final_logit_softcapping: Option<f64>,
    pub query_pre_attn_scalar: Option<f64>,
    pub use_post_norms: bool,
    pub sliding_window: Option<usize>,
    pub alternating_sliding_window: bool,
}

Configuration for a generic decoder-only transformer.

Captures ~12 configuration axes that distinguish modern transformer architectures. Parsed from HuggingFace config.json via from_hf_config.

§Supported model families

| Family | Key config traits |
|---|---|
| LLaMA 1/2/3 | Baseline: GQA, SiLU, RmsNorm |
| Qwen 2/2.5 | + QKV bias, conditional tied embeddings |
| Gemma / Gemma 2 | + GemmaRmsNorm, embedding scale, soft-capping, 4-norm |
| Phi-3 / Phi-4 | + Fused QKV, fused MLP |
| StarCoder2 | + Plain MLP, GELU, bias everywhere |
| Mistral | + Sliding window attention |

§config.json field reference

§Required fields (all families)

| Field | config.json key |
|---|---|
| (none; used for dispatch) | model_type |
| hidden_size | hidden_size |
| num_layers | num_hidden_layers |
| num_attention_heads | num_attention_heads |
| intermediate_size | intermediate_size |
| vocab_size | vocab_size |

§Optional fields (all families)

| Field | config.json key | Default |
|---|---|---|
| num_kv_heads | num_key_value_heads | num_attention_heads |
| head_dim | head_dim | hidden_size / num_attention_heads |
| norm_eps | rms_norm_eps ¹ | 1e-5 ² |
| rope_theta | rope_theta | 10 000 ³ |
| max_position_embeddings | max_position_embeddings | 4 096 ⁴ |
| tie_word_embeddings | tie_word_embeddings | false ⁵ |

¹ StarCoder2 reads norm_epsilon instead.
² 1e-6 for Qwen2, Gemma, Gemma 2.
³ 1 000 000 for Qwen2.
⁴ 32 768 for Qwen2/Mistral; 16 384 for StarCoder2; 8 192 for Gemma/Gemma 2; 4 096 for LLaMA/Phi-3.
⁵ true for Gemma, Gemma 2, StarCoder2.
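The derived defaults in the table above reduce to two small rules; a sketch of them (an illustration, not the crate's parser):

```rust
/// Mirrors the default rules above: num_kv_heads falls back to
/// num_attention_heads (plain MHA), and head_dim falls back to
/// hidden_size / num_attention_heads.
fn derived_defaults(
    hidden_size: usize,
    num_attention_heads: usize,
    num_key_value_heads: Option<usize>,
    head_dim: Option<usize>,
) -> (usize, usize) {
    let kv = num_key_value_heads.unwrap_or(num_attention_heads);
    let hd = head_dim.unwrap_or(hidden_size / num_attention_heads);
    (kv, hd)
}

fn main() {
    // LLaMA-7B-style shape: 4096 hidden, 32 heads, no overrides.
    assert_eq!(derived_defaults(4096, 32, None, None), (32, 128));
    // GQA override: 8 KV heads shared across 32 query heads.
    assert_eq!(derived_defaults(4096, 32, Some(8), None), (8, 128));
}
```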

§Hardcoded architecture axes

The following fields are set by the family-specific parser, not read from config.json (except where noted):

| Field | Description |
|---|---|
| norm_type | RmsNorm for most; GemmaRmsNorm for Gemma/Gemma 2; read from the norm_type key for StarCoder2 (default RmsNorm; "layer_norm" → LayerNorm) |
| activation | Silu for LLaMA/Qwen2/Phi-3/Mistral; GeluApprox for Gemma/Gemma 2/StarCoder2 |
| qkv_layout | Fused for Phi-3; Separate for all others |
| mlp_layout | GatedFused for Phi-3; Plain for StarCoder2; GatedSeparate for all others |
| embedding_scale | Some(sqrt(hidden_size)) for Gemma/Gemma 2; None for all others |
| use_post_norms | true for Gemma 2 (4 norms per layer); false for all others |
| alternating_sliding_window | true for Gemma 2; false for all others |

§Per-family config.json extensions

Qwen2 — reads attention_bias (default true) → qkv_bias.

Gemma / Gemma 2 — hardcodes embedding_scale to sqrt(hidden_size), tie_word_embeddings defaults to true, and norm_eps defaults to 1e-6. Gemma 2 additionally reads:

| config.json key | Field | Default |
|---|---|---|
| attn_logit_softcapping | attn_logit_softcapping | None |
| final_logit_softcapping | final_logit_softcapping | None |
| query_pre_attn_scalar | query_pre_attn_scalar | Some(256.0) |
| sliding_window | sliding_window | None |

Phi-3 — no extra config.json keys; fused QKV and fused gated MLP are hardcoded.

StarCoder2 — reads use_bias (default true) → qkv_bias, o_proj_bias, and mlp_bias. Reads norm_type (default RmsNorm; "layer_norm" → LayerNorm). Uses the norm_epsilon key (not rms_norm_eps). Hardcodes Plain MLP and GeluApprox activation.

Mistral — reads sliding_window (default None). Otherwise identical to LLaMA; max_position_embeddings defaults to 32 768.

Fields§

§hidden_size: usize

Hidden dimension (d_model).

§num_layers: usize

Number of transformer layers (decoder blocks).

§num_attention_heads: usize

Number of query attention heads.

§num_kv_heads: usize

Number of key/value heads (GQA when < num_attention_heads).

§head_dim: usize

Dimension per head (usually hidden_size / num_attention_heads).

§intermediate_size: usize

MLP intermediate dimension.

§vocab_size: usize

Vocabulary size.

§norm_type: NormType

Normalization variant.

§norm_eps: f64

Epsilon for normalization layers.

§activation: Activation

MLP activation function.

§qkv_layout: QkvLayout

QKV projection layout (separate or fused).

§mlp_layout: MlpLayout

MLP layout (gated separate, gated fused, or plain).

§qkv_bias: bool

Whether Q, K, V projections have bias terms.

§o_proj_bias: bool

Whether the output projection (o_proj) has a bias term.

§mlp_bias: bool

Whether MLP projections have bias terms.

§embedding_scale: Option<f64>

Embedding scale factor (Some(sqrt(hidden_size)) for Gemma models).

§tie_word_embeddings: bool

Whether the LM head shares weights with the token embedding.

§rope_theta: f64

Base frequency for rotary position embeddings.

§max_position_embeddings: usize

Maximum sequence length for position embeddings.

§attn_logit_softcapping: Option<f64>

Attention logit soft-capping: tanh(scores / cap) * cap before softmax. Some(50.0) for Gemma 2; None for most models.

§final_logit_softcapping: Option<f64>

Final logit soft-capping: tanh(logits / cap) * cap after LM head. Some(30.0) for Gemma 2; None for most models.

§query_pre_attn_scalar: Option<f64>

Custom attention scaling factor. When set, scale = 1/sqrt(scalar) instead of the default 1/sqrt(head_dim). Some(256.0) for Gemma 2; None for most models.
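The two soft-capping fields and the custom attention scale reduce to small formulas; a sketch of them (illustrative, not the crate's kernels):

```rust
/// tanh soft-capping as described above; identity when no cap is set.
fn softcap(x: f64, cap: Option<f64>) -> f64 {
    match cap {
        Some(c) => (x / c).tanh() * c,
        None => x,
    }
}

/// Attention scale: 1/sqrt(query_pre_attn_scalar) when set,
/// otherwise the usual 1/sqrt(head_dim).
fn attn_scale(head_dim: usize, query_pre_attn_scalar: Option<f64>) -> f64 {
    1.0 / query_pre_attn_scalar.unwrap_or(head_dim as f64).sqrt()
}

fn main() {
    // Gemma 2 attention cap of 50.0: 50 * tanh(100/50) ≈ 48.2014.
    assert!((softcap(100.0, Some(50.0)) - 48.20138).abs() < 1e-4);
    // No cap configured: scores pass through unchanged.
    assert_eq!(softcap(3.0, None), 3.0);
    // Gemma 2 scalar of 256 gives a scale of 1/16 regardless of head_dim.
    assert_eq!(attn_scale(128, Some(256.0)), 1.0 / 16.0);
}
```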

§use_post_norms: bool

Whether each layer has post-attention and post-feedforward norms (4 norms per layer instead of 2). true for Gemma 2.

§sliding_window: Option<usize>

Sliding window size. None for global attention.

§alternating_sliding_window: bool

Whether sliding window alternates with global attention per layer. When true, even layers (0, 2, 4, …) use sliding window and odd layers use global causal. true for Gemma 2.
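The per-layer window selection described above can be sketched as (an illustration of the rule, not the crate's code):

```rust
/// Effective window for a layer: with alternation enabled, even layers
/// use the sliding window and odd layers fall back to global causal
/// attention (None).
fn layer_window(
    layer: usize,
    sliding_window: Option<usize>,
    alternating: bool,
) -> Option<usize> {
    match sliding_window {
        Some(w) if !alternating || layer % 2 == 0 => Some(w),
        _ => None,
    }
}

fn main() {
    // Gemma 2 style: 4096-token window on even layers only.
    assert_eq!(layer_window(0, Some(4096), true), Some(4096));
    assert_eq!(layer_window(1, Some(4096), true), None);
    // Mistral style: the same window on every layer.
    assert_eq!(layer_window(5, Some(4096), false), Some(4096));
}
```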

Implementations§

impl TransformerConfig

pub fn from_hf_config(config: &Value) -> Result<Self>

Parse a TransformerConfig from a HuggingFace config.json value.

Dispatches on the model_type field to a family-specific parser. See the TransformerConfig struct-level documentation for the full field reference (required/optional keys, defaults, and per-family extensions).

§Errors

Returns MIError::Config if model_type is missing, unsupported, or if required fields are absent.

impl TransformerConfig

pub fn from_hf_config_auto(config: &Value, tensor_names: &[String]) -> Result<Self>

Parse a TransformerConfig from a HuggingFace config.json value and safetensors tensor names.

Two-tier dispatch:

  • Known families (listed in SUPPORTED_MODEL_TYPES): delegates to the existing manually-validated parser via from_hf_config.
  • Unknown families: auto-detects architecture axes from config.json scalars and safetensors tensor names (QKV/MLP layout, bias flags, norm type, post-norms), with model_type-based fixups for Gemma-family traits.

tensor_names should contain all tensor names from the model’s safetensors file(s). Use tensor_names_from_safetensors or tensor_names_from_index to obtain them without loading weights.

§Errors

Returns MIError::Config if model_type is missing or if required dimension fields are absent.

impl TransformerConfig

pub fn check_config_fields(config: &Value) -> CompatibilityReport

Check whether config.json contains the required fields for auto-config.

This is a lightweight check that does not require tensor names or downloading weights. It validates that the five required scalar fields (hidden_size, num_hidden_layers, num_attention_heads, intermediate_size, vocab_size) are present.

A passing check does not guarantee full compatibility — use check_auto_compatibility with tensor names for a definitive answer.
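The presence check reduces to scanning for the five required keys; a self-contained sketch using a HashMap as a stand-in for the parsed config.json (the real API takes a serde_json Value and returns a CompatibilityReport):

```rust
use std::collections::HashMap;

/// The five required scalar fields listed above.
const REQUIRED: [&str; 5] = [
    "hidden_size",
    "num_hidden_layers",
    "num_attention_heads",
    "intermediate_size",
    "vocab_size",
];

/// Returns the required keys that are absent from the config.
fn missing_required(config: &HashMap<&str, u64>) -> Vec<&'static str> {
    REQUIRED
        .iter()
        .copied()
        .filter(|k| !config.contains_key(*k))
        .collect()
}

fn main() {
    let mut cfg = HashMap::new();
    cfg.insert("hidden_size", 4096u64);
    cfg.insert("num_hidden_layers", 32);
    cfg.insert("num_attention_heads", 32);
    cfg.insert("intermediate_size", 11008);
    // vocab_size deliberately omitted: the check reports it.
    assert_eq!(missing_required(&cfg), vec!["vocab_size"]);
}
```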

pub fn check_auto_compatibility(config: &Value, tensor_names: &[String]) -> CompatibilityReport

Check whether a model is fully compatible with GenericTransformer auto-config loading.

Validates both config.json fields and safetensors tensor names against the patterns GenericTransformer::load() expects. Call this after downloading but before loading to get a clear diagnostic instead of a cryptic “tensor not found” error.

Checks performed:

  • Required config.json scalars are present
  • Embedding tensor (model.embed_tokens.weight) exists
  • Layer-0 normalization tensors exist (input_layernorm.weight, post_attention_layernorm.weight)
  • Final norm tensor (model.norm.weight) exists
  • At least one recognized attention projection pattern
  • At least one recognized MLP projection pattern
  • lm_head.weight exists when tie_word_embeddings is false
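The last check in the list has a conditional shape worth spelling out; a sketch of that one rule (illustrative, not the crate's implementation):

```rust
/// lm_head.weight must be present in the safetensors tensor names
/// unless the LM head is tied to the token embedding.
fn lm_head_present(tensor_names: &[&str], tie_word_embeddings: bool) -> bool {
    tie_word_embeddings || tensor_names.iter().any(|n| *n == "lm_head.weight")
}

fn main() {
    let names = ["model.embed_tokens.weight", "model.norm.weight"];
    // Tied embeddings: no lm_head tensor required.
    assert!(lm_head_present(&names, true));
    // Untied: the missing lm_head.weight is a compatibility failure.
    assert!(!lm_head_present(&names, false));
}
```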

Trait Implementations§

impl Clone for TransformerConfig

fn clone(&self) -> TransformerConfig

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for TransformerConfig

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl PartialEq for TransformerConfig

fn eq(&self, other: &TransformerConfig) -> bool

Tests for self and other values to be equal, and is used by ==.

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.

impl StructuralPartialEq for TransformerConfig
