Attention-based building blocks for multi-view diffusion.
Implements the multi-view transformer block that replaces the standard
SD 2.1 `BasicTransformerBlock` with additional attention layers.
§Multi-View Transformer Architecture
Each MultiViewTransformerBlock contains five sequential operations:
- Self-Attention (`attn1`): Attention within each view's spatial tokens
- Cross-View Attention (`attn_cv`): Attention across all N views at each spatial position, enabling 3D consistency
- Text Cross-Attention (`attn2`): Conditions on text embeddings (always zero in GAF since we don't use text prompts)
- IP-Adapter Cross-Attention (`attn_ip`): Conditions on CLIP image embeddings from the reference photo, providing identity preservation
- Feed-Forward (`ff`): GeGLU-activated MLP for feature processing
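The ordering of the five operations above can be sketched as a chain of residual sub-layers. This is an illustrative stand-in, not the crate's actual API: the closures below replace the real attention modules, and the residual-add-per-layer structure is assumed from SD 2.1 convention.

```rust
// Hedged sketch of the MultiViewTransformerBlock forward order.
// Each sub-layer is modeled as a function over a flat feature vector,
// applied with a residual connection (assumed, as in SD 2.1 blocks).
fn forward_block(
    x: Vec<f32>,
    attn1: impl Fn(&[f32]) -> Vec<f32>,   // self-attention
    attn_cv: impl Fn(&[f32]) -> Vec<f32>, // cross-view attention
    attn2: impl Fn(&[f32]) -> Vec<f32>,   // text cross-attention
    attn_ip: impl Fn(&[f32]) -> Vec<f32>, // IP-Adapter cross-attention
    ff: impl Fn(&[f32]) -> Vec<f32>,      // feed-forward
) -> Vec<f32> {
    // x <- x + f(x) for each sub-layer, in the documented order.
    let residual = |x: Vec<f32>, f: &dyn Fn(&[f32]) -> Vec<f32>| -> Vec<f32> {
        x.iter().zip(f(&x)).map(|(a, b)| a + b).collect()
    };
    let x = residual(x, &attn1);
    let x = residual(x, &attn_cv);
    let x = residual(x, &attn2);
    let x = residual(x, &attn_ip);
    residual(x, &ff)
}
```

With five sub-layers that each add a constant, the output accumulates one contribution per layer, which makes the sequential structure easy to check.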
§IP-Adapter Mechanism
The IP-Adapter layer enables pixel-level identity conditioning:
- Input: CLIP ViT-H/14 encodes the reference image into 257×1280 token embeddings
- Projection: A linear layer projects them to `cross_attention_dim` (1024)
- Attention: Each spatial position (h×w) attends to the 257 image tokens
- Output: Spatially varying conditioning based on reference features
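The projection step above can be illustrated at the shape level. The function below is a hypothetical stand-in for the learned linear layer: a plain matrix-vector product applied to each token, so [257 × 1280] tokens with a [1024 × 1280] weight yield [257 × 1024] conditioning tokens.

```rust
// Shape-level sketch of the IP-Adapter projection (not the real layer):
// each CLIP token (length d_in) is mapped through a weight matrix
// (d_out rows of length d_in) to a token of length d_out.
fn project(tokens: &[Vec<f32>], w: &[Vec<f32>]) -> Vec<Vec<f32>> {
    tokens
        .iter()
        .map(|t| {
            w.iter()
                .map(|row| row.iter().zip(t).map(|(a, b)| a * b).sum::<f32>())
                .collect()
        })
        .collect()
}
```

In the real model the dimensions would be d_in = 1280 (CLIP ViT-H/14) and d_out = 1024 (`cross_attention_dim`); tiny dimensions are used here only to keep the example checkable.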
When ip_tokens=None (CFG unconditional pass), the IP-Adapter layer is
skipped entirely via early return, producing unconditional predictions.
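The early-return behavior can be sketched as follows. The function name and closure are illustrative, but the control flow mirrors the description: with `ip_tokens = None`, the hidden states pass through unchanged, which is exactly what the unconditional branch of classifier-free guidance needs.

```rust
// Sketch of the CFG skip: when no image tokens are provided (the
// unconditional pass), the IP-Adapter layer contributes nothing.
fn ip_adapter_residual(
    hidden: Vec<f32>,
    ip_tokens: Option<&[f32]>,
    attn_ip: impl Fn(&[f32], &[f32]) -> Vec<f32>,
) -> Vec<f32> {
    let Some(tokens) = ip_tokens else {
        // Early return: identity on the unconditional branch.
        return hidden;
    };
    // Conditional branch: add the image-conditioned attention output.
    hidden
        .iter()
        .zip(attn_ip(&hidden, tokens))
        .map(|(h, a)| h + a)
        .collect()
}
```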
§Flash Attention Support
When the `flash_attention` feature is enabled, attention modules can use
memory-efficient flash attention with O(N) memory complexity instead of
O(N²). This is controlled via the `use_flash_attention` field in
`DiffusionConfig`.
Flash attention provides 2-4× memory reduction for large images without sacrificing accuracy (< 1e-3 numerical difference from standard attention).
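A back-of-the-envelope calculation makes the O(N²) vs O(N) claim concrete. The helper functions below are illustrative only: standard attention materializes a full N×N score matrix per head, while flash attention keeps only a few running statistics per query row (the exact constant depends on the kernel).

```rust
// Illustrative memory estimates (f32, single head, constants assumed).
// Standard attention stores the full N x N score matrix.
fn score_matrix_bytes(n_tokens: usize) -> usize {
    n_tokens * n_tokens * 4
}
// Flash attention keeps O(N) running statistics (here: a running max
// and a softmax normalizer per query row).
fn flash_stats_bytes(n_tokens: usize) -> usize {
    n_tokens * 2 * 4
}
```

For a 64×64 latent (N = 4096 spatial tokens), the score matrix alone is 64 MiB per head, while the flash-attention statistics are a few tens of kilobytes, which is where the memory savings for large images come from.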
Structs§
- `CrossAttention`: Cross-attention module with optional flash attention support.
- `MultiViewSpatialTransformer`: A spatial transformer that includes multi-view attention in every block. Replaces the standard `SpatialTransformer` from SD 2.1.
- `MultiViewTransformerBlock`: A transformer block with multi-view cross-attention support.