Module config


Configuration for the multi-view diffusion pipeline.

§Classifier-Free Guidance (CFG)

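The pipeline uses CFG to control the strength of IP-Adapter conditioning. CFG starts from the unconditional prediction and moves toward the conditional one; a guidance_scale above 1.0 extrapolates past the conditional prediction rather than interpolating between the two: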

prediction = unconditional + guidance_scale * (conditional - unconditional)

§How CFG Works in GAF

  1. Conditional Pass: U-Net forward pass WITH IP-Adapter tokens from the reference image (CLIP embeddings)
  2. Unconditional Pass: U-Net forward pass WITHOUT IP-Adapter tokens (skips reference conditioning)
  3. Interpolation: Combine predictions based on guidance_scale
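
A minimal sketch of step 3, assuming both predictions are available as flat `f32` slices of equal length. The helper name `apply_cfg` is hypothetical and is not taken from this crate; it only restates the formula above element by element.

```rust
/// Combine two noise predictions with classifier-free guidance.
/// Hypothetical helper for illustration; not part of this crate's API.
pub fn apply_cfg(unconditional: &[f32], conditional: &[f32], guidance_scale: f32) -> Vec<f32> {
    assert_eq!(
        unconditional.len(),
        conditional.len(),
        "prediction lengths must match"
    );
    unconditional
        .iter()
        .zip(conditional)
        // prediction = unconditional + guidance_scale * (conditional - unconditional)
        .map(|(&u, &c)| u + guidance_scale * (c - u))
        .collect()
}
```

With a scale of 0.0 this returns the unconditional prediction unchanged, and with 1.0 it returns the conditional prediction.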

§Guidance Scale Selection

  • 1.0: Pure conditional (no guidance; equivalent to a single forward pass)
  • 3.0-7.5: Balanced (recommended for GAF, default: 3.0)
  • >10.0: Strong conditioning (may oversaturate or reduce diversity)
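
The single-pass case for a scale of 1.0 can be sketched as follows. This is an illustration under stated assumptions: `guided_prediction` and its `forward` closure are hypothetical names, with `forward(true)` standing in for the pass with IP-Adapter tokens and `forward(false)` for the pass without them. Whether the pipeline itself applies this shortcut is not stated here.

```rust
/// Hypothetical driver showing when the unconditional pass is needed.
/// `forward(true)` runs with IP-Adapter tokens, `forward(false)` without.
pub fn guided_prediction<F>(guidance_scale: f32, mut forward: F) -> Vec<f32>
where
    F: FnMut(bool) -> Vec<f32>,
{
    let conditional = forward(true);
    // At a scale of exactly 1.0 the CFG formula reduces to the conditional
    // prediction, so the second forward pass would not change the result.
    if guidance_scale == 1.0 {
        return conditional;
    }
    let unconditional = forward(false);
    unconditional
        .iter()
        .zip(&conditional)
        .map(|(&u, &c)| u + guidance_scale * (c - u))
        .collect()
}
```

Any other scale costs two forward passes per denoising step, which is worth keeping in mind when choosing a value.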

§IP-Adapter Architecture

IP-Adapter provides pixel-level identity preservation by conditioning on CLIP image embeddings. The architecture includes:

  • CLIP Encoder: ViT-H/14 encodes the reference image into 257×1280 embeddings (257 tokens, each of width 1280)
  • Projection: Linear projection from 1280 → 1024 (cross_attention_dim)
  • IP Cross-Attention: Dedicated attn_ip layer in each transformer block
  • Integration: Each spatial position attends to image tokens

This differs from text conditioning by providing direct visual features rather than semantic embeddings.
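
The projection step listed above can be sketched with plain vectors to make the shapes concrete. The constants come from the dimensions in this section; the function name `project_tokens`, the row-major weight layout, and the presence of a bias term are assumptions made for illustration, not details of the actual layer.

```rust
/// Shapes taken from the description above.
pub const CLIP_TOKENS: usize = 257;
pub const CLIP_DIM: usize = 1280;
pub const CROSS_ATTENTION_DIM: usize = 1024;

/// Apply `y = W x + b` to every token (hypothetical helper).
/// `weight` holds one row per output feature; each row has one entry per
/// input feature. The bias term is an assumption.
pub fn project_tokens(tokens: &[Vec<f32>], weight: &[Vec<f32>], bias: &[f32]) -> Vec<Vec<f32>> {
    assert_eq!(weight.len(), bias.len(), "one bias per output feature");
    tokens
        .iter()
        .map(|token| {
            weight
                .iter()
                .zip(bias)
                .map(|(row, &b)| {
                    assert_eq!(row.len(), token.len(), "weight row must match token width");
                    // Dot product of one weight row with the token, plus bias.
                    b + row.iter().zip(token).map(|(&w, &x)| w * x).sum::<f32>()
                })
                .collect()
        })
        .collect()
}
```

Fed `CLIP_TOKENS` tokens of width `CLIP_DIM` and a weight matrix of `CROSS_ATTENTION_DIM` rows, this yields 257 tokens of width 1024, which is the shape the IP cross-attention layers attend to.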

Structs§

DiffusionConfig
Full configuration for the multi-view diffusion model.