Module config

oxigaf_diffusion

Configuration for the multi-view diffusion pipeline.

§Classifier-Free Guidance (CFG)

The pipeline uses CFG to control the strength of IP-Adapter conditioning. CFG combines the conditional and unconditional predictions, extrapolating past the conditional prediction when guidance_scale is greater than 1:

prediction = unconditional + guidance_scale * (conditional - unconditional)

§How CFG Works in GAF

Conditional Pass: U-Net forward pass WITH IP-Adapter tokens from the reference image (CLIP embeddings)
Unconditional Pass: U-Net forward pass WITHOUT IP-Adapter tokens (skips reference conditioning)
Interpolation: Combine the two predictions using the CFG formula, weighted by guidance_scale
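
The three steps above can be sketched in plain Rust. This is a minimal illustration, not the crate's implementation: `apply_cfg` and `guided_prediction` are hypothetical names, tensors are flattened into `f32` slices, and the `unet` closure is a stand-in for the real U-Net forward pass, with `Some(tokens)` meaning "with IP-Adapter tokens" and `None` meaning "without".

```rust
/// Combines the two noise predictions element-wise:
/// prediction = unconditional + guidance_scale * (conditional - unconditional)
fn apply_cfg(unconditional: &[f32], conditional: &[f32], guidance_scale: f32) -> Vec<f32> {
    assert_eq!(
        unconditional.len(),
        conditional.len(),
        "both predictions must have the same number of elements"
    );
    unconditional
        .iter()
        .zip(conditional)
        .map(|(&u, &c)| u + guidance_scale * (c - u))
        .collect()
}

/// Runs both passes and combines them.
/// `unet` is a hypothetical stand-in for the U-Net forward pass.
fn guided_prediction<F>(
    unet: F,
    latents: &[f32],
    ip_tokens: &[f32],
    guidance_scale: f32,
) -> Vec<f32>
where
    F: Fn(&[f32], Option<&[f32]>) -> Vec<f32>,
{
    // 1. Conditional pass: IP-Adapter tokens are supplied.
    let conditional = unet(latents, Some(ip_tokens));
    // 2. Unconditional pass: reference conditioning is skipped.
    let unconditional = unet(latents, None);
    // 3. Combine the two predictions.
    apply_cfg(&unconditional, &conditional, guidance_scale)
}
```

Keeping the combination in its own function makes the cost of guidance visible: every denoising step needs two U-Net evaluations whenever the unconditional prediction contributes to the result.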

§Guidance Scale Selection

1.0: Pure conditional (the unconditional term cancels, so the result is equivalent to a single conditional forward pass)
3.0-7.5: Balanced (recommended for GAF, default: 3.0)
>10.0: Strong conditioning (may oversaturate or reduce diversity)
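
As an illustration of the ranges above, a caller might classify a candidate value before building a configuration. The sketch below is an assumption, not part of the crate: `GuidanceRegime`, `classify_guidance_scale`, and `DEFAULT_GUIDANCE_SCALE` are hypothetical names, and values that fall outside the listed ranges are reported as `Unspecified` rather than guessed.

```rust
/// Default value stated in the list above.
const DEFAULT_GUIDANCE_SCALE: f32 = 3.0;

/// Hypothetical labels for the ranges listed above.
#[derive(Debug, PartialEq)]
enum GuidanceRegime {
    /// Exactly 1.0: the unconditional term cancels out.
    PureConditional,
    /// 3.0 to 7.5 inclusive: the recommended range.
    Balanced,
    /// Above 10.0: may oversaturate or reduce diversity.
    Strong,
    /// Anything the list above does not cover.
    Unspecified,
}

fn classify_guidance_scale(scale: f32) -> GuidanceRegime {
    if (scale - 1.0).abs() < f32::EPSILON {
        GuidanceRegime::PureConditional
    } else if (3.0..=7.5).contains(&scale) {
        GuidanceRegime::Balanced
    } else if scale > 10.0 {
        GuidanceRegime::Strong
    } else {
        GuidanceRegime::Unspecified
    }
}
```

For example, `classify_guidance_scale(DEFAULT_GUIDANCE_SCALE)` yields `GuidanceRegime::Balanced`, while a value such as 8.0 lands in `Unspecified` because the list leaves the interval between 7.5 and 10.0 open.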

§IP-Adapter Architecture

IP-Adapter provides fine-grained, patch-level identity preservation by conditioning on CLIP image embeddings. The architecture includes:

CLIP Encoder: ViT-H/14 encodes the reference image into 257 tokens of dimension 1280 (a 257×1280 embedding)
Projection: Linear projection from 1280 → 1024 (cross_attention_dim)
IP Cross-Attention: Dedicated attn_ip layer in each transformer block
Integration: Each spatial position in the U-Net feature map attends to the projected image tokens
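
To make the shapes concrete, the projection step can be sketched as a bias-free linear layer over flattened, row-major `f32` buffers. This is a shape-level illustration only: `Linear`, `projected_len`, and the constant names are hypothetical, the weights would come from a checkpoint in practice, and the real layer may include a bias or normalization that is not shown here.

```rust
/// Dimensions taken from the list above.
const NUM_TOKENS: usize = 257;
const CLIP_DIM: usize = 1280;
const CROSS_ATTENTION_DIM: usize = 1024;

/// Hypothetical bias-free linear layer; `weight` is row-major [out_dim, in_dim].
struct Linear {
    weight: Vec<f32>,
    in_dim: usize,
    out_dim: usize,
}

impl Linear {
    fn new(weight: Vec<f32>, in_dim: usize, out_dim: usize) -> Self {
        assert_eq!(weight.len(), in_dim * out_dim, "weight must be [out_dim, in_dim]");
        Self { weight, in_dim, out_dim }
    }

    /// Projects `tokens` (row-major [n, in_dim]) to a row-major [n, out_dim] buffer.
    fn forward(&self, tokens: &[f32]) -> Vec<f32> {
        assert_eq!(tokens.len() % self.in_dim, 0, "tokens must be [n, in_dim]");
        let n = tokens.len() / self.in_dim;
        let mut out = vec![0.0f32; n * self.out_dim];
        for (t, row) in tokens.chunks_exact(self.in_dim).enumerate() {
            for o in 0..self.out_dim {
                let w = &self.weight[o * self.in_dim..(o + 1) * self.in_dim];
                // Dot product of one token with one output row of the weight matrix.
                out[t * self.out_dim + o] = row.iter().zip(w).map(|(a, b)| a * b).sum::<f32>();
            }
        }
        out
    }
}

/// Element count of the projected image tokens: 257 × 1024.
fn projected_len() -> usize {
    NUM_TOKENS * CROSS_ATTENTION_DIM
}
```

With these dimensions, a layer built as `Linear::new(weights, CLIP_DIM, CROSS_ATTENTION_DIM)` maps a 257×1280 input to a 257×1024 output, which is the key/value source for the attn_ip layers.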

Unlike text conditioning, which supplies semantic embeddings derived from a prompt, this supplies visual features taken directly from the reference image.

Structs§

DiffusionConfig: Full configuration for the multi-view diffusion model.