Struct TransformerConfig 

pub struct TransformerConfig {
    pub hidden_size: usize,
    pub num_layers: usize,
    pub num_attention_heads: usize,
    pub num_kv_heads: usize,
    pub head_dim: usize,
    pub intermediate_size: usize,
    pub vocab_size: usize,
    pub norm_type: NormType,
    pub norm_eps: f64,
    pub activation: Activation,
    pub qkv_layout: QkvLayout,
    pub mlp_layout: MlpLayout,
    pub qkv_bias: bool,
    pub o_proj_bias: bool,
    pub mlp_bias: bool,
    pub embedding_scale: Option<f64>,
    pub tie_word_embeddings: bool,
    pub rope_theta: f64,
    pub max_position_embeddings: usize,
    pub attn_logit_softcapping: Option<f64>,
    pub final_logit_softcapping: Option<f64>,
    pub query_pre_attn_scalar: Option<f64>,
    pub use_post_norms: bool,
    pub sliding_window: Option<usize>,
    pub alternating_sliding_window: bool,
}

Configuration for a generic decoder-only transformer.

Captures ~12 configuration axes that distinguish modern transformer architectures. Parsed from HuggingFace config.json via from_hf_config.

§Supported model families

| Family | Key config traits |
|---|---|
| LLaMA 1/2/3 | Baseline: GQA, SiLU, RmsNorm |
| Qwen 2/2.5 | + QKV bias, conditional tied embeddings |
| Gemma / Gemma 2 | + GemmaRmsNorm, embedding scale, soft-capping, 4-norm |
| Phi-3 / Phi-4 | + Fused QKV, fused MLP |
| StarCoder2 | + Plain MLP, GELU, bias everywhere |
| Mistral | + Sliding window attention |

§config.json field reference

§Required fields (all families)

| Field | config.json key |
|---|---|
| (none; used for dispatch) | model_type |
| hidden_size | hidden_size |
| num_layers | num_hidden_layers |
| num_attention_heads | num_attention_heads |
| intermediate_size | intermediate_size |
| vocab_size | vocab_size |

§Optional fields (all families)

| Field | config.json key | Default |
|---|---|---|
| num_kv_heads | num_key_value_heads | num_attention_heads |
| head_dim | head_dim | hidden_size / num_attention_heads |
| norm_eps | rms_norm_eps ¹ | 1e-5 ² |
| rope_theta | rope_theta | 10 000 ³ |
| max_position_embeddings | max_position_embeddings | 4 096 ⁴ |
| tie_word_embeddings | tie_word_embeddings | false ⁵ |

¹ StarCoder2 reads norm_epsilon instead.
² 1e-6 for Qwen2, Gemma, Gemma 2.
³ 1 000 000 for Qwen2.
⁴ 32 768 for Qwen2/Mistral; 16 384 for StarCoder2; 8 192 for Gemma/Gemma 2; 4 096 for LLaMA/Phi-3.
⁵ true for Gemma, Gemma 2, StarCoder2.
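The derived defaults in the table above reduce to two small rules; a sketch of them (an illustration, not the crate's parser):

```rust
/// Mirrors the default rules above: num_kv_heads falls back to
/// num_attention_heads (plain MHA), and head_dim falls back to
/// hidden_size / num_attention_heads.
fn derived_defaults(
    hidden_size: usize,
    num_attention_heads: usize,
    num_key_value_heads: Option<usize>,
    head_dim: Option<usize>,
) -> (usize, usize) {
    let kv = num_key_value_heads.unwrap_or(num_attention_heads);
    let hd = head_dim.unwrap_or(hidden_size / num_attention_heads);
    (kv, hd)
}

fn main() {
    // LLaMA-7B-style shape: 4096 hidden, 32 heads, no overrides.
    assert_eq!(derived_defaults(4096, 32, None, None), (32, 128));
    // GQA override: 8 KV heads shared across 32 query heads.
    assert_eq!(derived_defaults(4096, 32, Some(8), None), (8, 128));
}
```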

§Hardcoded architecture axes

The following fields are set by the family-specific parser, not read from config.json (except where noted):

| Field | Description |
|---|---|
| norm_type | RmsNorm for most; GemmaRmsNorm for Gemma/Gemma 2; read from the norm_type key for StarCoder2 (default RmsNorm; "layer_norm" → LayerNorm) |
| activation | Silu for LLaMA/Qwen2/Phi-3/Mistral; GeluApprox for Gemma/Gemma 2/StarCoder2 |
| qkv_layout | Fused for Phi-3; Separate for all others |
| mlp_layout | GatedFused for Phi-3; Plain for StarCoder2; GatedSeparate for all others |
| embedding_scale | Some(sqrt(hidden_size)) for Gemma/Gemma 2; None for all others |
| use_post_norms | true for Gemma 2 (4 norms per layer); false for all others |
| alternating_sliding_window | true for Gemma 2; false for all others |

§Per-family config.json extensions

Qwen2 — reads attention_bias (default true) → qkv_bias.

Gemma / Gemma 2 — hardcodes embedding_scale to sqrt(hidden_size), tie_word_embeddings defaults to true, and norm_eps defaults to 1e-6. Gemma 2 additionally reads:

| config.json key | Field | Default |
|---|---|---|
| attn_logit_softcapping | attn_logit_softcapping | None |
| final_logit_softcapping | final_logit_softcapping | None |
| query_pre_attn_scalar | query_pre_attn_scalar | Some(256.0) |
| sliding_window | sliding_window | None |

Phi-3 — no extra config.json keys; fused QKV and fused gated MLP are hardcoded.

StarCoder2 — reads use_bias (default true) → qkv_bias, o_proj_bias, and mlp_bias. Reads norm_type (default RmsNorm; "layer_norm" → LayerNorm). Uses the norm_epsilon key (not rms_norm_eps). Hardcodes Plain MLP and GeluApprox activation.

Mistral — reads sliding_window (default None). Otherwise identical to LLaMA; max_position_embeddings defaults to 32 768.

Fields§

§hidden_size: usize

Hidden dimension (d_model).

§num_layers: usize

Number of transformer layers (decoder blocks).

§num_attention_heads: usize

Number of query attention heads.

§num_kv_heads: usize

Number of key/value heads (GQA when < num_attention_heads).

§head_dim: usize

Dimension per head (usually hidden_size / num_attention_heads).

§intermediate_size: usize

MLP intermediate dimension.

§vocab_size: usize

Vocabulary size.

§norm_type: NormType

Normalization variant.

§norm_eps: f64

Epsilon for normalization layers.

§activation: Activation

MLP activation function.

§qkv_layout: QkvLayout

QKV projection layout (separate or fused).

§mlp_layout: MlpLayout

MLP layout (gated separate, gated fused, or plain).

§qkv_bias: bool

Whether Q, K, V projections have bias terms.

§o_proj_bias: bool

Whether the output projection (o_proj) has a bias term.

§mlp_bias: bool

Whether MLP projections have bias terms.

§embedding_scale: Option<f64>

Embedding scale factor (Some(sqrt(hidden_size)) for Gemma models).

§tie_word_embeddings: bool

Whether the LM head shares weights with the token embedding.

§rope_theta: f64

Base frequency for rotary position embeddings.

§max_position_embeddings: usize

Maximum sequence length for position embeddings.

§attn_logit_softcapping: Option<f64>

Attention logit soft-capping: tanh(scores / cap) * cap before softmax. Some(50.0) for Gemma 2; None for most models.

§final_logit_softcapping: Option<f64>

Final logit soft-capping: tanh(logits / cap) * cap after LM head. Some(30.0) for Gemma 2; None for most models.

§query_pre_attn_scalar: Option<f64>

Custom attention scaling factor. When set, scale = 1/sqrt(scalar) instead of the default 1/sqrt(head_dim). Some(256.0) for Gemma 2; None for most models.
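The two soft-capping fields and the custom attention scale reduce to small formulas; a sketch of them (illustrative, not the crate's kernels):

```rust
/// tanh soft-capping as described above; identity when no cap is set.
fn softcap(x: f64, cap: Option<f64>) -> f64 {
    match cap {
        Some(c) => (x / c).tanh() * c,
        None => x,
    }
}

/// Attention scale: 1/sqrt(query_pre_attn_scalar) when set,
/// otherwise the usual 1/sqrt(head_dim).
fn attn_scale(head_dim: usize, query_pre_attn_scalar: Option<f64>) -> f64 {
    1.0 / query_pre_attn_scalar.unwrap_or(head_dim as f64).sqrt()
}

fn main() {
    // Gemma 2 attention cap of 50.0: 50 * tanh(100/50) ≈ 48.2014.
    assert!((softcap(100.0, Some(50.0)) - 48.20138).abs() < 1e-4);
    // No cap configured: scores pass through unchanged.
    assert_eq!(softcap(3.0, None), 3.0);
    // Gemma 2 scalar of 256 gives a scale of 1/16 regardless of head_dim.
    assert_eq!(attn_scale(128, Some(256.0)), 1.0 / 16.0);
}
```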

§use_post_norms: bool

Whether each layer has post-attention and post-feedforward norms (4 norms per layer instead of 2). true for Gemma 2.

§sliding_window: Option<usize>

Sliding window size. None for global attention.

§alternating_sliding_window: bool

Whether sliding window alternates with global attention per layer. When true, even layers (0, 2, 4, …) use sliding window and odd layers use global causal. true for Gemma 2.
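The per-layer window selection described above can be sketched as (an illustration of the rule, not the crate's code):

```rust
/// Effective window for a layer: with alternation enabled, even layers
/// use the sliding window and odd layers fall back to global causal
/// attention (None).
fn layer_window(
    layer: usize,
    sliding_window: Option<usize>,
    alternating: bool,
) -> Option<usize> {
    match sliding_window {
        Some(w) if !alternating || layer % 2 == 0 => Some(w),
        _ => None,
    }
}

fn main() {
    // Gemma 2 style: 4096-token window on even layers only.
    assert_eq!(layer_window(0, Some(4096), true), Some(4096));
    assert_eq!(layer_window(1, Some(4096), true), None);
    // Mistral style: the same window on every layer.
    assert_eq!(layer_window(5, Some(4096), false), Some(4096));
}
```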

Implementations§

impl TransformerConfig

pub fn from_hf_config(config: &Value) -> Result<Self>

Parse a TransformerConfig from a HuggingFace config.json value.

Dispatches on the model_type field to a family-specific parser. See the TransformerConfig struct-level documentation for the full field reference (required/optional keys, defaults, and per-family extensions).

§Errors

Returns MIError::Config if model_type is missing, unsupported, or if required fields are absent.

impl TransformerConfig

pub fn from_hf_config_auto(config: &Value, tensor_names: &[String]) -> Result<Self>

Parse a TransformerConfig from a HuggingFace config.json value and safetensors tensor names.

Two-tier dispatch:

  • Known families (listed in SUPPORTED_MODEL_TYPES): delegates to the existing manually-validated parser via from_hf_config.
  • Unknown families: auto-detects architecture axes from config.json scalars and safetensors tensor names (QKV/MLP layout, bias flags, norm type, post-norms), with model_type-based fixups for Gemma-family traits.

tensor_names should contain all tensor names from the model’s safetensors file(s). Use tensor_names_from_safetensors or tensor_names_from_index to obtain them without loading weights.

§Errors

Returns MIError::Config if model_type is missing or if required dimension fields are absent.

impl TransformerConfig

pub fn check_config_fields(config: &Value) -> CompatibilityReport

Check whether config.json contains the required fields for auto-config.

This is a lightweight check that does not require tensor names or downloading weights. It validates that the five required scalar fields (hidden_size, num_hidden_layers, num_attention_heads, intermediate_size, vocab_size) are present.

A passing check does not guarantee full compatibility — use check_auto_compatibility with tensor names for a definitive answer.
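The presence check reduces to scanning for the five required keys; a self-contained sketch using a HashMap as a stand-in for the parsed config.json (the real API takes a serde_json Value and returns a CompatibilityReport):

```rust
use std::collections::HashMap;

/// The five required scalar fields listed above.
const REQUIRED: [&str; 5] = [
    "hidden_size",
    "num_hidden_layers",
    "num_attention_heads",
    "intermediate_size",
    "vocab_size",
];

/// Returns the required keys that are absent from the config.
fn missing_required(config: &HashMap<&str, u64>) -> Vec<&'static str> {
    REQUIRED
        .iter()
        .copied()
        .filter(|k| !config.contains_key(*k))
        .collect()
}

fn main() {
    let mut cfg = HashMap::new();
    cfg.insert("hidden_size", 4096u64);
    cfg.insert("num_hidden_layers", 32);
    cfg.insert("num_attention_heads", 32);
    cfg.insert("intermediate_size", 11008);
    // vocab_size deliberately omitted: the check reports it.
    assert_eq!(missing_required(&cfg), vec!["vocab_size"]);
}
```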

pub fn check_auto_compatibility(config: &Value, tensor_names: &[String]) -> CompatibilityReport

Check whether a model is fully compatible with GenericTransformer auto-config loading.

Validates both config.json fields and safetensors tensor names against the patterns GenericTransformer::load() expects. Call this after downloading but before loading to get a clear diagnostic instead of a cryptic “tensor not found” error.

Checks performed:

  • Required config.json scalars are present
  • Embedding tensor (model.embed_tokens.weight) exists
  • Layer-0 normalization tensors exist (input_layernorm.weight, post_attention_layernorm.weight)
  • Final norm tensor (model.norm.weight) exists
  • At least one recognized attention projection pattern
  • At least one recognized MLP projection pattern
  • lm_head.weight exists when tie_word_embeddings is false
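The last check in the list has a conditional shape worth spelling out; a sketch of that one rule (illustrative, not the crate's implementation):

```rust
/// lm_head.weight must be present in the safetensors tensor names
/// unless the LM head is tied to the token embedding.
fn lm_head_present(tensor_names: &[&str], tie_word_embeddings: bool) -> bool {
    tie_word_embeddings || tensor_names.iter().any(|n| *n == "lm_head.weight")
}

fn main() {
    let names = ["model.embed_tokens.weight", "model.norm.weight"];
    // Tied embeddings: no lm_head tensor required.
    assert!(lm_head_present(&names, true));
    // Untied: the missing lm_head.weight is a compatibility failure.
    assert!(!lm_head_present(&names, false));
}
```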

Trait Implementations§

impl Clone for TransformerConfig

fn clone(&self) -> TransformerConfig

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for TransformerConfig

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl PartialEq for TransformerConfig

fn eq(&self, other: &TransformerConfig) -> bool

Tests for self and other values to be equal, and is used by ==.

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.

impl StructuralPartialEq for TransformerConfig
