# Qwen3.5 Hybrid Attention Architecture
Qwen3.5-9B-Instruct introduces a **hybrid attention** architecture that alternates
between standard softmax attention and **Gated Delta Net** (GDN) linear attention
layers. This chapter explains the architecture, the contract-driven config pipeline,
and how to use it with `apr`.
## Architecture Overview
Qwen3.5 uses a 32-layer transformer with two types of attention:

| Layer type | Computation | Complexity | Per-layer state |
|---|---|---|---|
| Standard (softmax) | `Q*K^T / sqrt(d)` → softmax → `V` | O(n²) over the sequence | KV cache |
| Linear (GDN) | Recurrent state update | O(n) total, constant per decode step | Recurrent + conv |
The `layer_types` field in `config.json` specifies which layers use which mechanism:
```json
{
"layer_types": [
"attention", "linear", "attention", "linear",
...
]
}
```
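A minimal sketch of how this list might be turned into a per-block dispatch table (hypothetical helper, not the exact `apr` loader code):
```rust
/// Hypothetical helper: map the layer_types strings onto a per-block flag so
/// later stages can ask is_linear_layer(block_idx).
fn build_linear_layer_mask(layer_types: &[String]) -> Vec<bool> {
    layer_types.iter().map(|t| t.as_str() == "linear").collect()
}
```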
### Gated Delta Net Recurrence
Linear attention layers implement the **Gated Delta Net** equations:
```
GDN-1:  S_t     = exp(g_t) * S_{t-1} + k_t ⊗ delta_t
GDN-2:  delta_t = beta_t * (v_t - S_{t-1}^T k_t)
GDN-3:  o_t     = S_t^T q_t
```
Where:
- `g_t = -exp(A_log) * softplus(a_t + dt_bias)` is the per-head decay factor
- `beta_t = sigmoid(b_t)` is the update gate
- `⊗` is the outer product, so `k_t ⊗ delta_t` is a rank-1 `[key_head_dim, value_head_dim]` update
- Q and K are L2-normalized
- State `S` is a `[num_v_heads, key_head_dim, value_head_dim]` tensor (one matrix per value head)
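To make the recurrence concrete, here is a single-head, single-step sketch on plain slices. It is illustrative only (hypothetical function; the real kernel operates on batched GPU tensors across all `num_v_heads` heads):
```rust
/// Illustrative single-head GDN step implementing GDN-1..3; not the apr kernel.
/// `s` is S_{t-1} as [key_head_dim][value_head_dim], updated in place to S_t.
/// q and k are assumed to be L2-normalized already.
fn gdn_step(
    s: &mut [Vec<f32>],
    q: &[f32],   // q_t  [key_head_dim]
    k: &[f32],   // k_t  [key_head_dim]
    v: &[f32],   // v_t  [value_head_dim]
    g: f32,      // g_t = -exp(A_log) * softplus(a_t + dt_bias)
    beta: f32,   // beta_t = sigmoid(b_t)
) -> Vec<f32> {
    let (key_dim, value_dim) = (k.len(), v.len());
    let decay = g.exp();

    // GDN-2: delta_t = beta_t * (v_t - S_{t-1}^T k_t)
    let mut delta = vec![0.0f32; value_dim];
    for j in 0..value_dim {
        let s_t_k: f32 = (0..key_dim).map(|i| s[i][j] * k[i]).sum();
        delta[j] = beta * (v[j] - s_t_k);
    }

    // GDN-1: S_t = exp(g_t) * S_{t-1} + k_t ⊗ delta_t (rank-1 update)
    for i in 0..key_dim {
        for j in 0..value_dim {
            s[i][j] = decay * s[i][j] + k[i] * delta[j];
        }
    }

    // GDN-3: o_t = S_t^T q_t
    (0..value_dim)
        .map(|j| (0..key_dim).map(|i| s[i][j] * q[i]).sum())
        .collect()
}
```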
### Key Differences from Qwen2
| Parameter | Qwen2 | Qwen3.5-9B |
|---|---|---|
| head_dim | 128 (computed) | 256 (explicit) |
| num_heads | 28 | 16 |
| num_kv_heads | 4 | 4 |
| attention_bias | true | **false** |
| layer_types | all standard | **hybrid** |
| vocab_size | 152064 | 248320 |
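The `head_dim` change matters for loading: Qwen2 configs omit it and the loader computes `hidden_dim / num_heads`, while Qwen3.5 carries it explicitly (cf. `explicit_head_dim` in the config pipeline below). A hedged one-line sketch of that fallback:
```rust
/// Hypothetical helper: prefer the explicit head_dim from config.json,
/// fall back to the Qwen2-style computed value when it is absent.
fn resolve_head_dim(explicit: Option<usize>, hidden_dim: usize, num_heads: usize) -> usize {
    explicit.unwrap_or(hidden_dim / num_heads)
}
```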
## Config Pipeline
The Qwen3.5 config flows through three stages:
```
SafeTensors config.json
|
v (SafetensorsConfig with layer_types, head_dim, linear_* fields)
AprTransformerConfig
|
v (config_to_gpu with explicit_head_dim, layer_types)
GpuModelConfig
|
v (is_linear_layer(block_idx) dispatch)
forward_linear_block_incremental() or forward_block_incremental()
```
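At the bottom of the pipeline the per-layer dispatch is a single branch on `is_linear_layer(block_idx)`. A hedged sketch of that loop (closures stand in for the real `forward_linear_block_incremental` / `forward_block_incremental`, which take GPU buffers and per-layer caches):
```rust
/// Illustrative per-block dispatch driven by the layer_types-derived mask.
fn run_blocks(
    layer_is_linear: &[bool],
    mut hidden: Vec<f32>,
    linear_block: impl Fn(usize, Vec<f32>) -> Vec<f32>,
    softmax_block: impl Fn(usize, Vec<f32>) -> Vec<f32>,
) -> Vec<f32> {
    for (block_idx, &is_linear) in layer_is_linear.iter().enumerate() {
        hidden = if is_linear {
            linear_block(block_idx, hidden) // GDN path: recurrent + conv state
        } else {
            softmax_block(block_idx, hidden) // standard path: KV cache
        };
    }
    hidden
}
```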
### Contract-Driven Validation
The `contracts/model-families/qwen3_5.yaml` contract enforces:
```yaml
constraints:
attention_type: gqa
has_bias: "false" # No attention bias tensors
activation: silu
mlp_type: swiglu
positional_encoding: rope
```
These constraints drive weight loading: `has_bias=false` means the loader
skips bias tensors entirely instead of loading zeros.
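A minimal sketch of what that contract-driven decision could look like in the loader (hypothetical helper; the real loader reads the contract YAML and the safetensors index):
```rust
/// Hypothetical sketch: the contract's has_bias flag gates whether bias
/// tensors are even requested from the checkpoint.
fn attention_tensor_names(layer: usize, has_bias: bool) -> Vec<String> {
    let mut names: Vec<String> = ["q_proj", "k_proj", "v_proj", "o_proj"]
        .iter()
        .map(|p| format!("model.layers.{layer}.self_attn.{p}.weight"))
        .collect();
    if has_bias {
        // Qwen2-style checkpoints also carry q/k/v bias tensors.
        for p in ["q_proj", "k_proj", "v_proj"] {
            names.push(format!("model.layers.{layer}.self_attn.{p}.bias"));
        }
    }
    names
}
```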
## Weight Loading
Linear attention layers have **different tensor names** from standard layers:
### Standard Attention Layer Tensors
```
model.layers.{n}.self_attn.q_proj.weight [4096, 4096]
model.layers.{n}.self_attn.k_proj.weight [1024, 4096]
model.layers.{n}.self_attn.v_proj.weight [1024, 4096]
model.layers.{n}.self_attn.o_proj.weight [4096, 4096]
```
### Gated Delta Net Layer Tensors
```
model.layers.{n}.self_attn.in_proj_qkvz.weight [QKVZ_dim, 4096]
model.layers.{n}.self_attn.in_proj_ba.weight [2*num_v_heads, 4096]
model.layers.{n}.self_attn.out_proj.weight [4096, value_dim]
model.layers.{n}.self_attn.conv1d.weight [conv_dim, 1, kernel]
model.layers.{n}.self_attn.A_log [num_v_heads]
model.layers.{n}.self_attn.dt_bias [num_v_heads]
model.layers.{n}.self_attn.norm.weight [value_dim]
```
The `in_proj_qkvz` tensor is a **combined** projection that gets split into
Q, K, V, and Z (gate) during loading:
```
in_proj_qkvz = [Q | K | V | Z]
Q: [key_dim, hidden_dim] -> qkv_weight (part)
K: [key_dim, hidden_dim] -> qkv_weight (part)
V: [value_dim, hidden_dim] -> qkv_weight (part)
Z: [value_dim, hidden_dim] -> linear_attn.z_weight
```
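A hedged sketch of that split as row-offset slicing, assuming the contiguous `[Q | K | V | Z]` layout shown above with `key_dim = num_k_heads * key_head_dim` and `value_dim = num_v_heads * value_head_dim` (names illustrative; the real loader works on raw safetensors slices):
```rust
/// Illustrative split of the fused [QKVZ_dim, hidden_dim] projection into
/// contiguous row blocks, flattened row-major.
struct QkvzSplit<'a> {
    q: &'a [f32], // [key_dim   * hidden_dim]
    k: &'a [f32], // [key_dim   * hidden_dim]
    v: &'a [f32], // [value_dim * hidden_dim]
    z: &'a [f32], // [value_dim * hidden_dim] -> linear_attn.z_weight
}

fn split_qkvz(w: &[f32], hidden_dim: usize, key_dim: usize, value_dim: usize) -> QkvzSplit<'_> {
    let (qk, vz) = (key_dim * hidden_dim, value_dim * hidden_dim);
    assert_eq!(w.len(), 2 * qk + 2 * vz, "QKVZ_dim mismatch");
    QkvzSplit {
        q: &w[0..qk],
        k: &w[qk..2 * qk],
        v: &w[2 * qk..2 * qk + vz],
        z: &w[2 * qk + vz..],
    }
}
```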
## CLI Usage
```bash
# Import Qwen3.5 from HuggingFace
apr import hf://Qwen/Qwen3.5-9B-Instruct -o qwen35.apr --arch qwen3_5
# Inspect hybrid attention config
# Run inference (realizar handles dispatch)
apr run qwen35.apr --prompt "What is 2+2?" --max-tokens 32
# QA validation
apr qa qwen35.apr --assert-tps 50
```
## Falsification Tests
The contract is protected by 8 Popperian falsification tests:
| Test | Falsifiable claim |
|---|---|
| QWEN35-001 | Exact dimensions (hidden 4096, head_dim 256, heads 16, kv_heads 4, vocab 248320) |
| QWEN35-002 | has_bias must be false |
| QWEN35-003 | hidden_dim == num_heads * head_dim |
| QWEN35-004 | GQA divisibility (16 % 4 == 0) |
| QWEN35-005 | Shape template dimensions |
| QWEN35-006 | rope_theta = 1,000,000 |
| QWEN35-007 | Architecture class = Qwen3_5ForCausalLM |
| QWEN35-008 | SwiGLU MLP (silu + swiglu) |
Run: `cargo test -- falsify_mf_qwen35`
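These tests are intentionally small. A hedged sketch of the shape a test like QWEN35-003/004 might take (test name and hard-coded values are illustrative; the real tests read `qwen3_5.yaml`):
```rust
#[test]
fn falsify_mf_qwen35_gqa_divisibility_sketch() {
    // QWEN35-004: num_heads must be divisible by num_kv_heads (16 % 4 == 0),
    // otherwise GQA head grouping is impossible.
    let (num_heads, num_kv_heads) = (16usize, 4usize);
    assert_eq!(num_heads % num_kv_heads, 0, "GQA grouping broken");
    // QWEN35-003: hidden_dim must equal num_heads * head_dim.
    assert_eq!(4096, num_heads * 256);
}
```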
## Fine-Tuning Support
Qwen3.5-9B is wired into the `apr finetune` CLI for classification fine-tuning:
```bash
# Plan mode — shows config and trainable parameter count
apr finetune --task classify --model-size 9B --plan
# Output:
# Model: 4096h x 32L
# LoRA: rank=16, alpha=16.0, 64 adapters
# Classifier: 4096->5 (20485 params)
# Total trainable: 6,836,229 params
```
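The reported totals are consistent with rank-16 LoRA adapters on two projections per layer (32 layers x 2 = 64 adapters) plus the 5-class head. The adapter placement (q_proj and v_proj) is an assumption; only the totals come from the plan output:
```rust
fn main() {
    // Hypothetical breakdown reproducing the --plan totals; assumes rank-16
    // LoRA on q_proj [4096x4096] and v_proj [1024x4096] in each of 32 layers.
    let (hidden, kv_dim, rank, layers, classes) = (4096u64, 1024u64, 16u64, 32u64, 5u64);
    let lora_q = rank * hidden + hidden * rank; // A [16,4096] + B [4096,16] = 131_072
    let lora_v = rank * hidden + kv_dim * rank; // A [16,4096] + B [1024,16] =  81_920
    let classifier = hidden * classes + classes; // 4096*5 + 5 = 20_485
    let total = layers * (lora_q + lora_v) + classifier;
    assert_eq!(total, 6_836_229); // matches "Total trainable" above
}
```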
The `TransformerConfig::qwen3_5_9b()` factory in entrenar mirrors the contract:
| Field | Value | Contract source |
|---|---|---|
| hidden_size | 4096 | qwen3_5.yaml `hidden_dim` |
| num_attention_heads | 16 | qwen3_5.yaml `num_heads` |
| num_kv_heads | 4 | qwen3_5.yaml `num_kv_heads` |
| num_hidden_layers | 32 | qwen3_5.yaml `num_layers` |
| vocab_size | 248320 | qwen3_5.yaml `vocab_size` |
| use_bias | false | qwen3_5.yaml `has_bias` |
| head_dim() | 256 | 4096 / 16 |
CLI aliases: `--model-size 9B`, `--model-size qwen3.5-9b`, `--model-size qwen3.5`
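A hedged sketch of what the factory encodes (field names come from the table above; the struct shape and any remaining fields are assumptions, not the exact entrenar definition):
```rust
/// Illustrative mirror of TransformerConfig::qwen3_5_9b(); values only.
struct TransformerConfigSketch {
    hidden_size: usize,
    num_attention_heads: usize,
    num_kv_heads: usize,
    num_hidden_layers: usize,
    vocab_size: usize,
    use_bias: bool,
}

impl TransformerConfigSketch {
    fn qwen3_5_9b() -> Self {
        Self {
            hidden_size: 4096,
            num_attention_heads: 16,
            num_kv_heads: 4,
            num_hidden_layers: 32,
            vocab_size: 248_320,
            use_bias: false,
        }
    }

    /// head_dim() = 4096 / 16 = 256, matching the contract's explicit value.
    fn head_dim(&self) -> usize {
        self.hidden_size / self.num_attention_heads
    }
}
```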
### Fine-Tuning Falsification Tests
Seven additional tests (FALSIFY-FT-QWEN35-001..007) verify the config factory
matches the contract:
| Test | Falsifiable claim |
|---|---|
| FT-QWEN35-001 | vocab_size must be 248320 (not Qwen2's 152064) |
| FT-QWEN35-002 | use_bias must be false |
| FT-QWEN35-003 | head_dim() must be 256 |
| FT-QWEN35-004 | num_hidden_layers must be 32 |
| FT-QWEN35-005 | num_kv_heads must be 4 |
| FT-QWEN35-006 | All YAML dimensions match config factory |
| FT-QWEN35-007 | CLI dispatch consistency (9B != Qwen2) |
Run: `cargo test -- falsify_ft_qwen35`