pub struct TransformerConfig {
pub hidden_size: usize,
pub num_layers: usize,
pub num_attention_heads: usize,
pub num_kv_heads: usize,
pub head_dim: usize,
pub intermediate_size: usize,
pub vocab_size: usize,
pub norm_type: NormType,
pub norm_eps: f64,
pub activation: Activation,
pub qkv_layout: QkvLayout,
pub mlp_layout: MlpLayout,
pub qkv_bias: bool,
pub o_proj_bias: bool,
pub mlp_bias: bool,
pub embedding_scale: Option<f64>,
pub tie_word_embeddings: bool,
pub rope_theta: f64,
pub max_position_embeddings: usize,
pub attn_logit_softcapping: Option<f64>,
pub final_logit_softcapping: Option<f64>,
pub query_pre_attn_scalar: Option<f64>,
pub use_post_norms: bool,
pub sliding_window: Option<usize>,
pub alternating_sliding_window: bool,
}
Configuration for a generic decoder-only transformer.
Captures ~12 configuration axes that distinguish modern transformer
architectures. Parsed from HuggingFace config.json via
from_hf_config.
§Supported model families
| Family | Key config traits |
|---|---|
| LLaMA 1/2/3 | Baseline: GQA, SiLU, RmsNorm |
| Qwen 2/2.5 | + QKV bias, conditional tied embeddings |
| Gemma / Gemma 2 | + GemmaRmsNorm, embedding scale, soft-capping, 4-norm |
| Phi-3 / Phi-4 | + Fused QKV, fused MLP |
| StarCoder2 | + Plain MLP, GELU, bias everywhere |
| Mistral | + Sliding window attention |
§config.json field reference
§Required fields (all families)
| Field | config.json key |
|---|---|
| — | model_type |
| hidden_size | hidden_size |
| num_layers | num_hidden_layers |
| num_attention_heads | num_attention_heads |
| intermediate_size | intermediate_size |
| vocab_size | vocab_size |
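As an illustration, a minimal config.json carrying only the required keys might look like the fragment below (the dimension values are made up for the example):

```json
{
  "model_type": "llama",
  "hidden_size": 4096,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "intermediate_size": 11008,
  "vocab_size": 32000
}
```

Everything else falls back to the family-specific defaults documented below.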
§Optional fields (all families)
| Field | config.json key | Default |
|---|---|---|
| num_kv_heads | num_key_value_heads | num_attention_heads |
| head_dim | head_dim | hidden_size / num_attention_heads |
| norm_eps | rms_norm_eps ¹ | 1e-5 ² |
| rope_theta | rope_theta | 10 000 ³ |
| max_position_embeddings | max_position_embeddings | 4 096 ⁴ |
| tie_word_embeddings | tie_word_embeddings | false ⁵ |
¹ StarCoder2 reads norm_epsilon instead.
² 1e-6 for Qwen2, Gemma, Gemma 2.
³ 1 000 000 for Qwen2.
⁴ 32 768 for Qwen2/Mistral; 16 384 for StarCoder2; 8 192 for
Gemma/Gemma 2; 4 096 for LLaMA/Phi-3.
⁵ true for Gemma, Gemma 2, StarCoder2.
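The footnoted defaults amount to per-family lookup tables. A minimal sketch of two of them (the helper names are invented for this illustration, not part of the crate's API):

```rust
// Illustrative sketch of the per-family defaults footnoted above.
// These helper names are hypothetical, not the crate's API.
fn default_rope_theta(model_type: &str) -> f64 {
    match model_type {
        "qwen2" => 1_000_000.0,
        _ => 10_000.0,
    }
}

fn default_max_position_embeddings(model_type: &str) -> usize {
    match model_type {
        "qwen2" | "mistral" => 32_768,
        "starcoder2" => 16_384,
        "gemma" | "gemma2" => 8_192,
        _ => 4_096, // LLaMA, Phi-3
    }
}
```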
§Hardcoded architecture axes
The following fields are set by the family-specific parser, not
read from config.json (except where noted):
| Field | Description |
|---|---|
| norm_type | RmsNorm for most; GemmaRmsNorm for Gemma/Gemma 2; read from norm_type key for StarCoder2 (default RmsNorm, "layer_norm" → LayerNorm) |
| activation | Silu for LLaMA/Qwen2/Phi-3/Mistral; GeluApprox for Gemma/Gemma 2/StarCoder2 |
| qkv_layout | Fused for Phi-3; Separate for all others |
| mlp_layout | GatedFused for Phi-3; Plain for StarCoder2; GatedSeparate for all others |
| embedding_scale | Some(sqrt(hidden_size)) for Gemma/Gemma 2; None for all others |
| use_post_norms | true for Gemma 2 (4 norms per layer); false for all others |
| alternating_sliding_window | true for Gemma 2; false for all others |
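Two of these axes reduce to one-liners. A hedged sketch (helper names are invented for this illustration):

```rust
// Sketch of the Gemma embedding scale and the GQA grouping implied by
// num_kv_heads; helper names here are illustrative, not the crate's API.
fn embedding_scale(model_type: &str, hidden_size: usize) -> Option<f64> {
    matches!(model_type, "gemma" | "gemma2").then(|| (hidden_size as f64).sqrt())
}

// Number of query heads sharing each key/value head under GQA.
fn kv_group_size(num_attention_heads: usize, num_kv_heads: usize) -> usize {
    num_attention_heads / num_kv_heads
}
```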
§Per-family config.json extensions
Qwen2 — reads attention_bias (default true) → qkv_bias.
Gemma / Gemma 2 — hardcodes embedding_scale to sqrt(hidden_size),
tie_word_embeddings defaults to true, and norm_eps defaults to 1e-6.
Gemma 2 additionally reads:
| config.json key | Field | Default |
|---|---|---|
| attn_logit_softcapping | attn_logit_softcapping | None |
| final_logit_softcapping | final_logit_softcapping | None |
| query_pre_attn_scalar | query_pre_attn_scalar | Some(256.0) |
| sliding_window | sliding_window | None |
Phi-3 — no extra config.json keys; fused QKV and fused gated MLP
are hardcoded.
StarCoder2 — reads use_bias (default true) → qkv_bias,
o_proj_bias, and mlp_bias. Reads norm_type (default RmsNorm,
"layer_norm" → LayerNorm). Uses norm_epsilon key (not
rms_norm_eps). Hardcodes Plain MLP and
GeluApprox activation.
Mistral — reads sliding_window (default None). Otherwise
identical to LLaMA; max_position_embeddings defaults to 32 768.
Fields§
hidden_size: usize
Hidden dimension (d_model).
num_layers: usize
Number of transformer layers (decoder blocks).
num_attention_heads: usize
Number of query attention heads.
num_kv_heads: usize
Number of key/value heads (GQA when < num_attention_heads).
head_dim: usize
Dimension per head (usually hidden_size / num_attention_heads).
intermediate_size: usize
MLP intermediate dimension.
vocab_size: usize
Vocabulary size.
norm_type: NormType
Normalization variant.
norm_eps: f64
Epsilon for normalization layers.
activation: Activation
MLP activation function.
qkv_layout: QkvLayout
QKV projection layout (separate or fused).
mlp_layout: MlpLayout
MLP layout (gated separate, gated fused, or plain).
qkv_bias: bool
Whether Q, K, V projections have bias terms.
o_proj_bias: bool
Whether the output projection (o_proj) has a bias term.
mlp_bias: bool
Whether MLP projections have bias terms.
embedding_scale: Option<f64>
Embedding scale factor (Some(sqrt(hidden_size)) for Gemma models).
tie_word_embeddings: bool
Whether the LM head shares weights with the token embedding.
rope_theta: f64
Base frequency for rotary position embeddings.
max_position_embeddings: usize
Maximum sequence length for position embeddings.
attn_logit_softcapping: Option<f64>
Attention logit soft-capping: tanh(scores / cap) * cap before softmax.
Some(50.0) for Gemma 2; None for most models.
final_logit_softcapping: Option<f64>
Final logit soft-capping: tanh(logits / cap) * cap after LM head.
Some(30.0) for Gemma 2; None for most models.
query_pre_attn_scalar: Option<f64>
Custom attention scaling factor. When set, scale = 1/sqrt(scalar)
instead of the default 1/sqrt(head_dim).
Some(256.0) for Gemma 2; None for most models.
use_post_norms: bool
Whether each layer has post-attention and post-feedforward norms
(4 norms per layer instead of 2). true for Gemma 2.
sliding_window: Option<usize>
Sliding window size. None for global attention.
alternating_sliding_window: bool
Whether sliding window alternates with global attention per layer.
When true, even layers (0, 2, 4, …) use sliding window and
odd layers use global causal. true for Gemma 2.
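The attention-related fields above combine as in this sketch of their documented semantics, written as pure functions (this is an illustration, not the crate's code):

```rust
// How the attention-related config fields interact, per the field docs.
// Pure-function illustration of the documented semantics, not crate code.
fn attn_scale(query_pre_attn_scalar: Option<f64>, head_dim: usize) -> f64 {
    match query_pre_attn_scalar {
        Some(s) => 1.0 / s.sqrt(),
        None => 1.0 / (head_dim as f64).sqrt(),
    }
}

// Soft-capping squashes a logit into (-cap, cap) via tanh.
fn softcap(x: f64, cap: Option<f64>) -> f64 {
    match cap {
        Some(c) => (x / c).tanh() * c,
        None => x,
    }
}

// Whether a layer uses sliding-window attention under the alternating
// scheme: even layers slide, odd layers are global causal.
fn layer_uses_sliding_window(
    layer: usize,
    sliding_window: Option<usize>,
    alternating: bool,
) -> bool {
    sliding_window.is_some() && (!alternating || layer % 2 == 0)
}
```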
Implementations§
impl TransformerConfig

pub fn from_hf_config(config: &Value) -> Result<Self>
Parse a TransformerConfig from a HuggingFace config.json value.
Dispatches on the model_type field to a family-specific parser.
See the TransformerConfig struct-level documentation for the
full field reference (required/optional keys, defaults, and
per-family extensions).
§Errors
Returns MIError::Config if model_type is missing, unsupported,
or if required fields are absent.
impl TransformerConfig

pub fn from_hf_config_auto(config: &Value, tensor_names: &[String]) -> Result<Self>
Parse a TransformerConfig from a HuggingFace config.json value
and safetensors tensor names.
Two-tier dispatch:
- Known families (listed in SUPPORTED_MODEL_TYPES): delegates to the existing manually-validated parser via from_hf_config.
- Unknown families: auto-detects architecture axes from config.json scalars and safetensors tensor names (QKV/MLP layout, bias flags, norm type, post-norms), with model_type-based fixups for Gemma-family traits.
tensor_names should contain all tensor names from the model’s
safetensors file(s). Use tensor_names_from_safetensors or
tensor_names_from_index to obtain them without loading weights.
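As an illustration of the tensor-name side of auto-detection, a sketch in the spirit of the two-tier dispatch above (the actual name patterns the crate matches may differ):

```rust
// Illustrative sketch of QKV-layout auto-detection from safetensors
// tensor names. The substring pattern here is an assumption; the real
// parser's patterns may differ.
#[derive(Debug, PartialEq)]
enum QkvLayout {
    Separate,
    Fused,
}

fn detect_qkv_layout(tensor_names: &[String]) -> QkvLayout {
    // Fused QKV (e.g. Phi-3-style checkpoints) typically exposes a single
    // qkv_proj tensor per layer instead of separate q/k/v projections.
    if tensor_names.iter().any(|n| n.contains("qkv_proj")) {
        QkvLayout::Fused
    } else {
        QkvLayout::Separate
    }
}
```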
§Errors
Returns MIError::Config if model_type is missing or if required
dimension fields are absent.
impl TransformerConfig

pub fn check_config_fields(config: &Value) -> CompatibilityReport
Check whether config.json contains the required fields for auto-config.
This is a lightweight check that does not require tensor names or
downloading weights. It validates that the five required scalar
fields (hidden_size, num_hidden_layers, num_attention_heads,
intermediate_size, vocab_size) are present.
A passing check does not guarantee full compatibility — use
check_auto_compatibility with
tensor names for a definitive answer.
pub fn check_auto_compatibility(config: &Value, tensor_names: &[String]) -> CompatibilityReport
Check whether a model is fully compatible with GenericTransformer
auto-config loading.
Validates both config.json fields and safetensors tensor names
against the patterns GenericTransformer::load() expects. Call
this after downloading but before loading to get a clear diagnostic
instead of a cryptic “tensor not found” error.
Checks performed:
- Required config.json scalars are present
- Embedding tensor (model.embed_tokens.weight) exists
- Layer-0 normalization tensors exist (input_layernorm.weight, post_attention_layernorm.weight)
- Final norm tensor (model.norm.weight) exists
- At least one recognized attention projection pattern
- At least one recognized MLP projection pattern
- lm_head.weight exists when tie_word_embeddings is false
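A minimal stand-in for the exact-name portion of these checks; CompatibilityReport's shape and the full layer-0 tensor paths are assumptions for this sketch, and the pattern-based attention/MLP checks are omitted:

```rust
// Sketch of the exact-name tensor checks listed above. The struct shape
// and the "model.layers.0." path prefixes are assumptions, not the
// crate's real definitions.
struct CompatibilityReport {
    missing: Vec<String>,
}

fn check_tensors(tensor_names: &[String], tie_word_embeddings: bool) -> CompatibilityReport {
    let mut required = vec![
        "model.embed_tokens.weight",
        "model.layers.0.input_layernorm.weight",
        "model.layers.0.post_attention_layernorm.weight",
        "model.norm.weight",
    ];
    // lm_head.weight is only required when embeddings are not tied.
    if !tie_word_embeddings {
        required.push("lm_head.weight");
    }
    let missing = required
        .into_iter()
        .filter(|r| !tensor_names.iter().any(|n| n == r))
        .map(String::from)
        .collect();
    CompatibilityReport { missing }
}
```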
Trait Implementations§
impl Clone for TransformerConfig
impl Debug for TransformerConfig
impl PartialEq for TransformerConfig
impl StructuralPartialEq for TransformerConfig
Auto Trait Implementations§
impl Freeze for TransformerConfig
impl RefUnwindSafe for TransformerConfig
impl Send for TransformerConfig
impl Sync for TransformerConfig
impl Unpin for TransformerConfig
impl UnsafeUnpin for TransformerConfig
impl UnwindSafe for TransformerConfig
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized
impl<T> CloneToUninit for T where T: Clone
impl<T> Instrument for T
impl<T> IntoEither for T