
Module attention


Attention-based building blocks for multi-view diffusion.

Implements the multi-view transformer block that replaces the standard SD 2.1 BasicTransformerBlock, extending it with additional attention layers.

Multi-View Transformer Architecture

Each MultiViewTransformerBlock contains five sequential operations:

  1. Self-Attention (attn1): Attention within each view’s spatial tokens
  2. Cross-View Attention (attn_cv): Attention across all N views at each spatial position, enabling 3D consistency
  3. Text Cross-Attention (attn2): Attends to text embeddings (in GAF these embeddings are always zero, since no text prompts are used)
  4. IP-Adapter Cross-Attention (attn_ip): Conditions on CLIP image embeddings from the reference photo, providing identity preservation
  5. Feed-Forward (ff): GeGLU-activated MLP for feature processing
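Step 2 above is what distinguishes this block from the standard one: tokens are regrouped so that attention runs across the N views at each spatial position, rather than across positions within one view. A minimal sketch of that regrouping, assuming a row-major (view, position, channel) layout (the function name and layout are illustrative, not the crate's actual API):

```rust
// Hypothetical sketch of the token regrouping behind cross-view attention
// (attn_cv). Input layout: (n_views, hw, c) flattened row-major. Cross-view
// attention wants sequences of length n_views at each spatial position,
// i.e. (hw, n_views, c), so attention mixes views, not positions.

fn regroup_for_cross_view(x: &[f32], n_views: usize, hw: usize, c: usize) -> Vec<f32> {
    assert_eq!(x.len(), n_views * hw * c);
    let mut out = vec![0.0; x.len()];
    for v in 0..n_views {
        for p in 0..hw {
            let src = (v * hw + p) * c;
            let dst = (p * n_views + v) * c;
            out[dst..dst + c].copy_from_slice(&x[src..src + c]);
        }
    }
    out
}

fn main() {
    // Toy example: 2 views, 3 spatial positions, 1 channel.
    // Each token value encodes its (view, position) as 10*v + p.
    let x: Vec<f32> = (0..2)
        .flat_map(|v| (0..3).map(move |p| (10 * v + p) as f32))
        .collect(); // [0, 1, 2, 10, 11, 12]
    let y = regroup_for_cross_view(&x, 2, 3, 1);
    // Position 0 now holds both views' tokens back to back: [0, 10].
    assert_eq!(y, vec![0.0, 10.0, 1.0, 11.0, 2.0, 12.0]);
}
```

After attention, the inverse permutation restores the per-view layout for the remaining operations.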

IP-Adapter Mechanism

The IP-Adapter layer enables pixel-level identity conditioning:

  • Input: CLIP ViT-H/14 encodes reference image → 257×1280 embeddings
  • Projection: Linear layer projects to cross_attention_dim (1024)
  • Attention: Each of the h×w spatial positions attends to the 257 image tokens
  • Output: Spatially-varying conditioning based on reference features
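The projection step can be sketched as a single shared linear map applied per token; the function below is a hypothetical stand-in for the crate's linear layer, with toy dimensions for speed (the real shapes are 257 tokens, d_in = 1280, d_out = cross_attention_dim = 1024):

```rust
// Hypothetical sketch of the IP-Adapter projection: each CLIP image token
// is mapped from d_in to d_out by one shared linear layer, so the token
// count (257 in the real model) is unchanged.

fn project(tokens: &[Vec<f32>], w: &[Vec<f32>]) -> Vec<Vec<f32>> {
    // w has shape (d_out, d_in); bias omitted for brevity.
    tokens
        .iter()
        .map(|t| {
            w.iter()
                .map(|row| row.iter().zip(t).map(|(a, b)| a * b).sum())
                .collect()
        })
        .collect()
}

fn main() {
    let (n_tokens, d_in, d_out) = (4, 8, 5); // stand-ins for 257, 1280, 1024
    let tokens = vec![vec![1.0; d_in]; n_tokens];
    let w = vec![vec![0.5; d_in]; d_out];
    let projected = project(&tokens, &w);
    assert_eq!(projected.len(), n_tokens); // token count preserved
    assert_eq!(projected[0].len(), d_out); // channels now cross_attention_dim
    assert_eq!(projected[0][0], 4.0);      // 8 * (1.0 * 0.5)
}
```

Because all 257 tokens survive the projection, each spatial query can weight different regions of the reference image differently, which is what makes the conditioning spatially varying.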

When ip_tokens=None (CFG unconditional pass), the IP-Adapter layer is skipped entirely via early return, producing unconditional predictions.
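That early return can be illustrated with a minimal sketch (the function name, signature, and the toy "attention" body are illustrative only):

```rust
// Hypothetical sketch of the CFG skip: with no ip_tokens the layer is the
// identity, so the unconditional branch never mixes in image features.

fn ip_adapter_attention(hidden: &[f32], ip_tokens: Option<&[f32]>) -> Vec<f32> {
    let tokens = match ip_tokens {
        // Unconditional CFG pass: skip the layer entirely.
        None => return hidden.to_vec(),
        Some(t) => t,
    };
    // Stand-in for real cross-attention: shift each hidden value by the
    // mean of the image tokens, just to make the conditional path visible.
    let bias: f32 = tokens.iter().sum::<f32>() / tokens.len() as f32;
    hidden.iter().map(|h| h + bias).collect()
}

fn main() {
    let hidden = vec![1.0, 2.0, 3.0];
    // Identity when skipped (unconditional pass).
    assert_eq!(ip_adapter_attention(&hidden, None), hidden);
    // Conditional pass actually changes the features.
    assert_eq!(ip_adapter_attention(&hidden, Some(&[2.0, 4.0])), vec![4.0, 5.0, 6.0]);
}
```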

Flash Attention Support

When the flash_attention feature is enabled, attention modules can use memory-efficient flash attention with O(N) memory complexity instead of O(N²). This is controlled via the use_flash_attention field in DiffusionConfig.

Flash attention provides 2-4× memory reduction for large images without sacrificing accuracy (< 1e-3 numerical difference from standard attention).
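The memory gap is easy to quantify. The figures below assume a hypothetical 96×96 latent with f16 scores and a 128-row processing block; the block size is an implementation detail, chosen here only for illustration:

```rust
// Back-of-the-envelope memory for one attention head's score matrix.
// Standard attention materializes a full N x N matrix; flash attention
// streams it in fixed-size blocks, so its working set grows linearly in N.

fn main() {
    let n: u64 = 96 * 96; // tokens for a hypothetical 96x96 latent
    let bytes_f16: u64 = 2;

    // Standard attention: full N x N score matrix per head.
    let standard = n * n * bytes_f16; // 169_869_312 bytes (~162 MiB)

    // Flash attention: one block of score rows at a time.
    let block = 128u64;
    let flash = n * block * bytes_f16; // 2_359_296 bytes (~2.25 MiB)

    assert_eq!(standard, 169_869_312);
    assert!(flash < standard / 50);
    println!("standard: {standard} B, flash working set: {flash} B");
}
```

Since the score matrix dominates activation memory at large resolutions, shrinking it is what yields the 2-4× end-to-end reduction quoted above.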

Structs

CrossAttention
Cross-attention module with optional flash attention support.
MultiViewSpatialTransformer
A spatial transformer that includes multi-view attention in every block. Replaces the standard SpatialTransformer from SD 2.1.
MultiViewTransformerBlock
A transformer block with multi-view cross-attention support.