Multi-view U-Net with camera-conditioned cross-view attention.
The U-Net follows the SD 2.1 architecture but replaces every spatial
transformer block with a MultiViewSpatialTransformer that adds:
- Cross-view attention: allows spatial positions to attend across all views
- IP-Adapter conditioning: a dedicated cross-attention layer (attn_ip) that conditions on CLIP image embeddings from the reference photo
- Camera-pose conditioning: camera extrinsics added to the timestep embedding
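The cross-view step above can be sketched with plain data. A minimal, hypothetical illustration (not the crate's actual tensor code): self-attention sees each view separately as (B·V, H·W, C), while cross-view attention folds the view axis into the token axis so every spatial position can attend across all V views, i.e. (B·V, H·W, C) → (B, V·H·W, C). The function name and nested-Vec layout are illustrative.

```rust
/// Regroup per-view token matrices so one batch item holds the tokens
/// of all its views. `x` has length B*V; each entry is a
/// (tokens, channels) matrix for one view.
fn merge_views(x: &[Vec<Vec<f32>>], views: usize) -> Vec<Vec<Vec<f32>>> {
    assert!(x.len() % views == 0, "input length must be B*V");
    let batch = x.len() / views;
    (0..batch)
        .map(|b| {
            // Concatenate the token rows of all V views for this batch item.
            (0..views)
                .flat_map(|v| x[b * views + v].iter().cloned())
                .collect()
        })
        .collect()
}

fn main() {
    // B = 1, V = 2, 2 tokens per view, C = 1.
    let x = vec![
        vec![vec![1.0_f32], vec![2.0]], // view 0
        vec![vec![3.0], vec![4.0]],     // view 1
    ];
    let merged = merge_views(&x, 2);
    assert_eq!(merged.len(), 1);    // B batch items
    assert_eq!(merged[0].len(), 4); // V * tokens per item
    println!("{:?}", merged);
}
```

After this regrouping, an ordinary attention layer applied to the merged token axis attends across views for free; the inverse reshape restores the per-view layout for the next block.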
§IP-Adapter Integration
Each transformer block contains four attention layers:
- attn1: Self-attention (within view)
- attn_cv: Cross-view attention (across views)
- attn2: Text cross-attention (unused in GAF, always zero)
- attn_ip: IP-Adapter cross-attention (reference image conditioning)
When ip_tokens is None (unconditional pass), the attn_ip layer is
skipped entirely, producing the unconditional prediction for CFG.
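The two passes are then blended in the standard classifier-free guidance way: a pass with ip_tokens set gives the conditional noise prediction, a pass with ip_tokens = None (attn_ip skipped) gives the unconditional one. A minimal sketch of the blend, assuming the usual CFG formula and an illustrative guidance_scale parameter (not names from this crate):

```rust
/// Classifier-free guidance: extrapolate from the unconditional
/// prediction toward the conditional one by `guidance_scale`.
/// out = uncond + scale * (cond - uncond), element-wise.
fn cfg_combine(uncond: &[f32], cond: &[f32], guidance_scale: f32) -> Vec<f32> {
    assert_eq!(uncond.len(), cond.len());
    uncond
        .iter()
        .zip(cond)
        .map(|(u, c)| u + guidance_scale * (c - u))
        .collect()
}

fn main() {
    let uncond = [0.0_f32, 1.0];
    let cond = [1.0_f32, 1.0];
    // A scale > 1 pushes past the conditional prediction where the
    // two passes disagree, and leaves agreeing elements unchanged.
    let out = cfg_combine(&uncond, &cond, 2.0);
    assert_eq!(out, vec![2.0, 1.0]);
}
```

With guidance_scale = 1.0 the result is exactly the conditional prediction; with 0.0 it is the unconditional one.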
§Architecture Details
The U-Net structure:
- Encoder: 4 downsampling stages (320 → 640 → 1280 → 1280 channels)
- Bottleneck: ResBlock + Attention + ResBlock at 1280 channels
- Decoder: 4 upsampling stages with skip connections
- Output: GroupNorm + Conv → 4-channel latent prediction
Each stage contains 2 ResBlocks + 1 MultiViewSpatialTransformer (if attention enabled).
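The channel progression above follows the usual SD 2.1-style schedule of a base width scaled by per-stage multipliers. A small sketch, assuming base width 320 and multipliers [1, 2, 4, 4] (constants inferred from the 320 → 640 → 1280 → 1280 figures above, not read from the crate):

```rust
/// Illustrative encoder channel schedule: base width times a
/// per-stage multiplier, mirroring 320 -> 640 -> 1280 -> 1280.
const BASE_CHANNELS: usize = 320;
const CHANNEL_MULT: [usize; 4] = [1, 2, 4, 4];

fn encoder_channels() -> Vec<usize> {
    CHANNEL_MULT.iter().map(|m| m * BASE_CHANNELS).collect()
}

fn main() {
    assert_eq!(encoder_channels(), vec![320, 640, 1280, 1280]);
    println!("{:?}", encoder_channels());
}
```

The decoder mirrors this schedule in reverse, with skip connections concatenating each encoder stage's features onto the matching decoder stage.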
Structs§
- MultiViewUNet - The multi-view U-Net for diffusion-based avatar generation.