Configuration for the multi-view diffusion pipeline.
§Classifier-Free Guidance (CFG)
The pipeline uses CFG to control the strength of IP-Adapter conditioning. CFG combines the conditional and unconditional predictions, extrapolating past the conditional prediction when guidance_scale is greater than 1.0:
prediction = unconditional + guidance_scale * (conditional - unconditional)
§How CFG Works in GAF
- Conditional Pass: U-Net forward pass WITH IP-Adapter tokens from the reference image (CLIP embeddings)
- Unconditional Pass: U-Net forward pass WITHOUT IP-Adapter tokens (skips reference conditioning)
- Interpolation: Combine predictions based on guidance_scale
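The interpolation step can be sketched in plain Rust as follows. This is a minimal illustration, not the crate's implementation: the function name apply_cfg and the use of flat f32 slices in place of latent tensors are assumptions made for the example.

```rust
/// Combine unconditional and conditional predictions with CFG.
/// `apply_cfg` is a hypothetical helper; flat slices stand in for latent tensors.
fn apply_cfg(unconditional: &[f32], conditional: &[f32], guidance_scale: f32) -> Vec<f32> {
    assert_eq!(
        unconditional.len(),
        conditional.len(),
        "prediction shapes must match"
    );
    unconditional
        .iter()
        .zip(conditional)
        // prediction = unconditional + guidance_scale * (conditional - unconditional)
        .map(|(&u, &c)| u + guidance_scale * (c - u))
        .collect()
}
```

With guidance_scale = 1.0 the result equals the conditional prediction, and with 0.0 it equals the unconditional one.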
§Guidance Scale Selection
- 1.0: Pure conditional (no guidance, equivalent to single forward pass)
- 3.0-7.5: Balanced (recommended for GAF, default: 3.0)
- >10.0: Strong conditioning (may oversaturate or reduce diversity)
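Because a scale of 1.0 reduces the formula to the conditional prediction, the unconditional pass is redundant at that setting. The sketch below shows one way to exploit this. It is an assumption that a pipeline would skip the pass in this way; guided_prediction is a hypothetical name, and the unet closure stands in for a U-Net forward pass with (true) or without (false) IP-Adapter tokens.

```rust
/// Hypothetical guided step: `unet(true)` runs WITH IP-Adapter tokens,
/// `unet(false)` runs WITHOUT them. Not the crate's actual API.
fn guided_prediction<F>(mut unet: F, guidance_scale: f32) -> Vec<f32>
where
    F: FnMut(bool) -> Vec<f32>,
{
    let conditional = unet(true);
    // At 1.0 the CFG formula collapses to the conditional prediction,
    // so a single forward pass is enough.
    if guidance_scale == 1.0 {
        return conditional;
    }
    let unconditional = unet(false);
    unconditional
        .iter()
        .zip(&conditional)
        .map(|(&u, &c)| u + guidance_scale * (c - u))
        .collect()
}
```

Any other scale costs two forward passes per denoising step, which is worth weighing when choosing between 1.0 and the recommended range.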
§IP-Adapter Architecture
IP-Adapter provides fine-grained identity preservation by conditioning on CLIP image embeddings. The architecture includes:
- CLIP Encoder: ViT-H/14 encodes the reference image into 257×1280 embeddings (257 tokens with 1280 features each)
- Projection: Linear projection from 1280 → 1024 (cross_attention_dim)
- IP Cross-Attention: Dedicated attn_ip layer in each transformer block
- Integration: Each spatial position attends to image tokens
This differs from text conditioning by providing direct visual features rather than semantic embeddings.
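The projection step can be illustrated with a dependency-free sketch. The function name project_tokens, the nested Vec layout, and the presence of a bias term are assumptions for the example; the source only states that the projection is linear and maps 1280 to 1024 features.

```rust
/// Apply a linear projection to every image token independently.
/// `weight` is laid out as [out_dim][in_dim]; `bias` has out_dim entries.
/// Hypothetical helper: the bias term is an assumption, not confirmed by the docs.
fn project_tokens(tokens: &[Vec<f32>], weight: &[Vec<f32>], bias: &[f32]) -> Vec<Vec<f32>> {
    assert_eq!(weight.len(), bias.len(), "one bias per output feature");
    tokens
        .iter()
        .map(|token| {
            weight
                .iter()
                .zip(bias)
                .map(|(row, &b)| {
                    assert_eq!(row.len(), token.len(), "in_dim mismatch");
                    // Dot product of one weight row with the token, plus bias.
                    row.iter().zip(token).map(|(w, x)| w * x).sum::<f32>() + b
                })
                .collect::<Vec<f32>>()
        })
        .collect()
}
```

With the shapes listed above, tokens would hold 257 rows of 1280 values and weight 1024 rows of 1280 values, giving a 257×1024 output whose width matches cross_attention_dim. The token count is unchanged, so each spatial position still attends to 257 image tokens.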
Structs§
- DiffusionConfig: Full configuration for the multi-view diffusion model.