
Module unet


Multi-view U-Net with camera-conditioned cross-view attention.

The U-Net follows the SD 2.1 architecture but replaces every spatial transformer block with a MultiViewSpatialTransformer that adds:

  1. Cross-view attention: Allows spatial positions to attend across all views
  2. IP-Adapter conditioning: Dedicated cross-attention layer (attn_ip) that conditions on CLIP image embeddings from the reference photo
  3. Camera-pose conditioning: Camera extrinsics added to timestep embedding
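The third mechanism can be sketched as a simple additive combination, as the description states. This is an illustrative sketch only: the function name is hypothetical, plain slices stand in for the crate's real tensor type, and any projection of the raw extrinsics into the embedding dimension is assumed to happen upstream.

```rust
// Hypothetical sketch of camera-pose conditioning: a projection of the
// camera extrinsics is added element-wise to the timestep embedding.
// `add_camera_conditioning` is an illustrative name, not the crate's API.
fn add_camera_conditioning(t_emb: &[f32], cam_proj: &[f32]) -> Vec<f32> {
    t_emb.iter().zip(cam_proj).map(|(t, c)| t + c).collect()
}

fn main() {
    let t_emb = vec![0.5f32, 1.0]; // timestep embedding (toy values)
    let cam = vec![1.5f32, 2.0]; // projected camera extrinsics (toy values)
    let conditioned = add_camera_conditioning(&t_emb, &cam);
    println!("{:?}", conditioned); // [2.0, 3.0]
}
```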

§IP-Adapter Integration

Each transformer block contains four attention layers:

  • attn1: Self-attention (within view)
  • attn_cv: Cross-view attention (across views)
  • attn2: Text cross-attention (unused in GAF, always zero)
  • attn_ip: IP-Adapter cross-attention (reference image conditioning)

When ip_tokens is None (unconditional pass), the attn_ip layer is skipped entirely, yielding the unconditional prediction used for classifier-free guidance (CFG).
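The control flow above can be sketched as follows. Everything here is illustrative: `Tensor` is a toy stand-in for the crate's real tensor type, each attention layer is modeled as identity plus residual, and the method names mirror the layer names listed above rather than the actual API.

```rust
// Toy stand-in for the crate's tensor type (illustrative only).
#[derive(Clone, Debug, PartialEq)]
struct Tensor(Vec<f32>);

impl Tensor {
    fn add(&self, other: &Tensor) -> Tensor {
        Tensor(self.0.iter().zip(&other.0).map(|(a, b)| a + b).collect())
    }
}

// Hypothetical sketch of one MultiViewSpatialTransformer block.
struct Block;

impl Block {
    // Each layer is modeled as identity for brevity; real layers compute attention.
    fn attn1(&self, x: &Tensor) -> Tensor { x.clone() }   // self-attention (within view)
    fn attn_cv(&self, x: &Tensor) -> Tensor { x.clone() } // cross-view attention
    fn attn_ip(&self, x: &Tensor, _ip: &Tensor) -> Tensor { x.clone() }

    fn forward(&self, x: &Tensor, ip_tokens: Option<&Tensor>) -> Tensor {
        let mut h = x.add(&self.attn1(x));  // residual self-attention
        h = h.add(&self.attn_cv(&h));       // residual cross-view attention
        // attn2 (text cross-attention) is unused in GAF: always contributes zero.
        if let Some(ip) = ip_tokens {
            // Conditional pass: attend to CLIP image embeddings of the reference photo.
            h = h.add(&self.attn_ip(&h, ip));
        }
        // When ip_tokens is None, attn_ip is skipped: unconditional branch for CFG.
        h
    }
}

fn main() {
    let block = Block;
    let x = Tensor(vec![1.0, 2.0]);
    let ip = Tensor(vec![0.5, 0.5]);
    let cond = block.forward(&x, Some(&ip));
    let uncond = block.forward(&x, None);
    println!("cond: {:?}, uncond: {:?}", cond.0, uncond.0);
}
```

The only difference between the two CFG branches is whether the attn_ip residual is applied; the other layers run identically in both passes.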

§Architecture Details

The U-Net structure:

  • Encoder: 4 downsampling stages (320 → 640 → 1280 → 1280 channels)
  • Bottleneck: ResBlock + Attention + ResBlock at 1280 channels
  • Decoder: 4 upsampling stages with skip connections
  • Output: GroupNorm + Conv → 4-channel latent prediction

Each stage contains 2 ResBlocks + 1 MultiViewSpatialTransformer (if attention is enabled for that stage).
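The channel progression above follows the usual SD-style base-times-multiplier scheme. A minimal sketch, assuming that convention (the constant names are hypothetical, not taken from the crate):

```rust
// Illustrative sketch of the encoder/decoder channel progression described
// above; names are hypothetical, not the crate's actual configuration API.
const BASE_CHANNELS: usize = 320;
const CHANNEL_MULT: [usize; 4] = [1, 2, 4, 4]; // -> 320, 640, 1280, 1280

fn encoder_channels() -> Vec<usize> {
    CHANNEL_MULT.iter().map(|m| BASE_CHANNELS * m).collect()
}

fn main() {
    let enc = encoder_channels();
    assert_eq!(enc, vec![320, 640, 1280, 1280]);
    // The decoder mirrors the encoder in reverse, consuming skip connections
    // from the matching encoder stage at each resolution.
    let dec: Vec<usize> = enc.iter().rev().copied().collect();
    println!("encoder: {:?}, decoder: {:?}", enc, dec);
}
```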

Structs§

MultiViewUNet
The multi-view U-Net for diffusion-based avatar generation.