
Module attention


Attention-based building blocks for multi-view diffusion.

Implements the multi-view transformer block that replaces the standard SD 2.1 BasicTransformerBlock, extending it with additional attention layers.

Multi-View Transformer Architecture

Each MultiViewTransformerBlock contains five sequential operations:

  1. Self-Attention (attn1): Attention within each view’s spatial tokens
  2. Cross-View Attention (attn_cv): Attention across all N views at each spatial position, enabling 3D consistency
  3. Text Cross-Attention (attn2): Attends to text embeddings (in GAF these embeddings are always zero, since no text prompts are used)
  4. IP-Adapter Cross-Attention (attn_ip): Conditions on CLIP image embeddings from the reference photo, providing identity preservation
  5. Feed-Forward (ff): GeGLU-activated MLP for feature processing
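Step 2 above is what distinguishes this block from the standard one: tokens are regrouped so that attention runs across the N views at each spatial position, rather than across positions within one view. A minimal sketch of that regrouping, assuming a row-major (view, position, channel) layout (the function name and layout are illustrative, not the crate's actual API):

```rust
// Hypothetical sketch of the token regrouping behind cross-view attention
// (attn_cv). Input layout: (n_views, hw, c) flattened row-major. Cross-view
// attention wants sequences of length n_views at each spatial position,
// i.e. (hw, n_views, c), so attention mixes views, not positions.

fn regroup_for_cross_view(x: &[f32], n_views: usize, hw: usize, c: usize) -> Vec<f32> {
    assert_eq!(x.len(), n_views * hw * c);
    let mut out = vec![0.0; x.len()];
    for v in 0..n_views {
        for p in 0..hw {
            let src = (v * hw + p) * c;
            let dst = (p * n_views + v) * c;
            out[dst..dst + c].copy_from_slice(&x[src..src + c]);
        }
    }
    out
}

fn main() {
    // Toy example: 2 views, 3 spatial positions, 1 channel.
    // Each token value encodes its (view, position) as 10*v + p.
    let x: Vec<f32> = (0..2)
        .flat_map(|v| (0..3).map(move |p| (10 * v + p) as f32))
        .collect(); // [0, 1, 2, 10, 11, 12]
    let y = regroup_for_cross_view(&x, 2, 3, 1);
    // Position 0 now holds both views' tokens back to back: [0, 10].
    assert_eq!(y, vec![0.0, 10.0, 1.0, 11.0, 2.0, 12.0]);
}
```

After attention, the inverse permutation restores the per-view layout for the remaining operations.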

IP-Adapter Mechanism

The IP-Adapter layer enables pixel-level identity conditioning:

  • Input: CLIP ViT-H/14 encodes reference image → 257×1280 embeddings
  • Projection: Linear layer projects to cross_attention_dim (1024)
  • Attention: Each of the h×w spatial positions attends to the 257 image tokens
  • Output: Spatially-varying conditioning based on reference features
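The projection step can be sketched as a single shared linear map applied per token; the function below is a hypothetical stand-in for the crate's linear layer, with toy dimensions for speed (the real shapes are 257 tokens, d_in = 1280, d_out = cross_attention_dim = 1024):

```rust
// Hypothetical sketch of the IP-Adapter projection: each CLIP image token
// is mapped from d_in to d_out by one shared linear layer, so the token
// count (257 in the real model) is unchanged.

fn project(tokens: &[Vec<f32>], w: &[Vec<f32>]) -> Vec<Vec<f32>> {
    // w has shape (d_out, d_in); bias omitted for brevity.
    tokens
        .iter()
        .map(|t| {
            w.iter()
                .map(|row| row.iter().zip(t).map(|(a, b)| a * b).sum())
                .collect()
        })
        .collect()
}

fn main() {
    let (n_tokens, d_in, d_out) = (4, 8, 5); // stand-ins for 257, 1280, 1024
    let tokens = vec![vec![1.0; d_in]; n_tokens];
    let w = vec![vec![0.5; d_in]; d_out];
    let projected = project(&tokens, &w);
    assert_eq!(projected.len(), n_tokens); // token count preserved
    assert_eq!(projected[0].len(), d_out); // channels now cross_attention_dim
    assert_eq!(projected[0][0], 4.0);      // 8 * (1.0 * 0.5)
}
```

Because all 257 tokens survive the projection, each spatial query can weight different regions of the reference image differently, which is what makes the conditioning spatially varying.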

When ip_tokens=None (CFG unconditional pass), the IP-Adapter layer is skipped entirely via early return, producing unconditional predictions.
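That early return can be illustrated with a minimal sketch (the function name, signature, and the toy "attention" body are illustrative only):

```rust
// Hypothetical sketch of the CFG skip: with no ip_tokens the layer is the
// identity, so the unconditional branch never mixes in image features.

fn ip_adapter_attention(hidden: &[f32], ip_tokens: Option<&[f32]>) -> Vec<f32> {
    let tokens = match ip_tokens {
        // Unconditional CFG pass: skip the layer entirely.
        None => return hidden.to_vec(),
        Some(t) => t,
    };
    // Stand-in for real cross-attention: shift each hidden value by the
    // mean of the image tokens, just to make the conditional path visible.
    let bias: f32 = tokens.iter().sum::<f32>() / tokens.len() as f32;
    hidden.iter().map(|h| h + bias).collect()
}

fn main() {
    let hidden = vec![1.0, 2.0, 3.0];
    // Identity when skipped (unconditional pass).
    assert_eq!(ip_adapter_attention(&hidden, None), hidden);
    // Conditional pass actually changes the features.
    assert_eq!(ip_adapter_attention(&hidden, Some(&[2.0, 4.0])), vec![4.0, 5.0, 6.0]);
}
```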

Flash Attention Support

When the flash_attention feature is enabled, attention modules can use memory-efficient flash attention with O(N) memory complexity instead of O(N²). This is controlled via the use_flash_attention field in DiffusionConfig.

Flash attention provides 2-4× memory reduction for large images without sacrificing accuracy (< 1e-3 numerical difference from standard attention).
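The memory gap is easy to quantify. The figures below assume a hypothetical 96×96 latent with f16 scores and a 128-row processing block; the block size is an implementation detail, chosen here only for illustration:

```rust
// Back-of-the-envelope memory for one attention head's score matrix.
// Standard attention materializes a full N x N matrix; flash attention
// streams it in fixed-size blocks, so its working set grows linearly in N.

fn main() {
    let n: u64 = 96 * 96; // tokens for a hypothetical 96x96 latent
    let bytes_f16: u64 = 2;

    // Standard attention: full N x N score matrix per head.
    let standard = n * n * bytes_f16; // 169_869_312 bytes (~162 MiB)

    // Flash attention: one block of score rows at a time.
    let block = 128u64;
    let flash = n * block * bytes_f16; // 2_359_296 bytes (~2.25 MiB)

    assert_eq!(standard, 169_869_312);
    assert!(flash < standard / 50);
    println!("standard: {standard} B, flash working set: {flash} B");
}
```

Since the score matrix dominates activation memory at large resolutions, shrinking it is what yields the 2-4× end-to-end reduction quoted above.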

Structs

CrossAttention
Cross-attention module with optional flash attention support.
MultiViewSpatialTransformer
A spatial transformer that includes multi-view attention in every block. Replaces the standard SpatialTransformer from SD 2.1.
MultiViewTransformerBlock
A transformer block with multi-view cross-attention support.