Multi-view U-Net with camera-conditioned cross-view attention.
The U-Net follows the SD 2.1 architecture but replaces every spatial
transformer block with a MultiViewSpatialTransformer that adds:
- Cross-view attention: allows spatial positions to attend across all views
- IP-Adapter conditioning: a dedicated cross-attention layer (attn_ip) that conditions on CLIP image embeddings from the reference photo
- Camera-pose conditioning: camera extrinsics added to the timestep embedding
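The cross-view step above can be sketched with plain data. A minimal, hypothetical illustration (not the crate's actual tensor code): self-attention sees each view separately as (B·V, H·W, C), while cross-view attention folds the view axis into the token axis so every spatial position can attend across all V views, i.e. (B·V, H·W, C) → (B, V·H·W, C). The function name and nested-Vec layout are illustrative.

```rust
/// Regroup per-view token matrices so one batch item holds the tokens
/// of all its views. `x` has length B*V; each entry is a
/// (tokens, channels) matrix for one view.
fn merge_views(x: &[Vec<Vec<f32>>], views: usize) -> Vec<Vec<Vec<f32>>> {
    assert!(x.len() % views == 0, "input length must be B*V");
    let batch = x.len() / views;
    (0..batch)
        .map(|b| {
            // Concatenate the token rows of all V views for this batch item.
            (0..views)
                .flat_map(|v| x[b * views + v].iter().cloned())
                .collect()
        })
        .collect()
}

fn main() {
    // B = 1, V = 2, 2 tokens per view, C = 1.
    let x = vec![
        vec![vec![1.0_f32], vec![2.0]], // view 0
        vec![vec![3.0], vec![4.0]],     // view 1
    ];
    let merged = merge_views(&x, 2);
    assert_eq!(merged.len(), 1);    // B batch items
    assert_eq!(merged[0].len(), 4); // V * tokens per item
    println!("{:?}", merged);
}
```

After this regrouping, an ordinary attention layer applied to the merged token axis attends across views for free; the inverse reshape restores the per-view layout for the next block.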
§IP-Adapter Integration
Each transformer block contains four attention layers:
- attn1: Self-attention (within view)
- attn_cv: Cross-view attention (across views)
- attn2: Text cross-attention (unused in GAF, always zero)
- attn_ip: IP-Adapter cross-attention (reference image conditioning)
When ip_tokens is None (unconditional pass), the attn_ip layer is
skipped entirely, producing the unconditional prediction for CFG.
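The two passes are then blended in the standard classifier-free guidance way: a pass with ip_tokens set gives the conditional noise prediction, a pass with ip_tokens = None (attn_ip skipped) gives the unconditional one. A minimal sketch of the blend, assuming the usual CFG formula and an illustrative guidance_scale parameter (not names from this crate):

```rust
/// Classifier-free guidance: extrapolate from the unconditional
/// prediction toward the conditional one by `guidance_scale`.
/// out = uncond + scale * (cond - uncond), element-wise.
fn cfg_combine(uncond: &[f32], cond: &[f32], guidance_scale: f32) -> Vec<f32> {
    assert_eq!(uncond.len(), cond.len());
    uncond
        .iter()
        .zip(cond)
        .map(|(u, c)| u + guidance_scale * (c - u))
        .collect()
}

fn main() {
    let uncond = [0.0_f32, 1.0];
    let cond = [1.0_f32, 1.0];
    // A scale > 1 pushes past the conditional prediction where the
    // two passes disagree, and leaves agreeing elements unchanged.
    let out = cfg_combine(&uncond, &cond, 2.0);
    assert_eq!(out, vec![2.0, 1.0]);
}
```

With guidance_scale = 1.0 the result is exactly the conditional prediction; with 0.0 it is the unconditional one.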
§Architecture Details
The U-Net structure:
- Encoder: 4 downsampling stages (320 → 640 → 1280 → 1280 channels)
- Bottleneck: ResBlock + Attention + ResBlock at 1280 channels
- Decoder: 4 upsampling stages with skip connections
- Output: GroupNorm + Conv → 4-channel latent prediction
Each stage contains 2 ResBlocks + 1 MultiViewSpatialTransformer (if attention enabled).
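The channel progression above follows the usual SD 2.1-style schedule of a base width scaled by per-stage multipliers. A small sketch, assuming base width 320 and multipliers [1, 2, 4, 4] (constants inferred from the 320 → 640 → 1280 → 1280 figures above, not read from the crate):

```rust
/// Illustrative encoder channel schedule: base width times a
/// per-stage multiplier, mirroring 320 -> 640 -> 1280 -> 1280.
const BASE_CHANNELS: usize = 320;
const CHANNEL_MULT: [usize; 4] = [1, 2, 4, 4];

fn encoder_channels() -> Vec<usize> {
    CHANNEL_MULT.iter().map(|m| m * BASE_CHANNELS).collect()
}

fn main() {
    assert_eq!(encoder_channels(), vec![320, 640, 1280, 1280]);
    println!("{:?}", encoder_channels());
}
```

The decoder mirrors this schedule in reverse, with skip connections concatenating each encoder stage's features onto the matching decoder stage.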
Structs§
- MultiViewUNet - The multi-view U-Net for diffusion-based avatar generation.