# oxigaf-diffusion

Multi-view diffusion model inference for GAF.
## Overview

This crate implements the multi-view diffusion pipeline for the Gaussian Avatar Framework (GAF):
- CLIP image encoding — Extract semantic features from input images
- Multi-view U-Net — Generate novel views with camera-conditioned cross-view attention
- Latent Upsampler — 32×32 → 64×64 latent upsampling (sd-x2-latent-upscaler) for 512×512 output
- IP-Adapter — Identity-preserving image conditioning for consistent face generation
- Classifier-Free Guidance (CFG) — Quality improvement with configurable guidance scale (1.0–20.0)
- VAE decoding — Decode latent representations to RGB images
- DDIM scheduling — Fast sampling with 50-100 steps (vs 1000 for DDPM)
- Flash Attention — Memory-efficient attention (O(N) memory) for large images
The pipeline takes a single input image and generates multiple novel views of the subject at 512×512 resolution, which are then used to initialize and optimize 3D Gaussians.
### v0.1.0 — what's included
- Full 512×512 multi-view generation pipeline (Latent Upsampler + IP-Adapter + CFG)
- 66 tests (all passing)
- Benchmarks: standard vs Flash Attention, sequence lengths, DDIM scheduler
## Installation

```toml
[dependencies]
oxigaf-diffusion = "0.1"
```
## Features

| Feature | Description |
|---|---|
| `default` | `["accelerate", "flash_attention"]` — CPU with optimizations |
| `accelerate` | Platform-native BLAS/LAPACK (Accelerate on macOS, OpenBLAS on Linux) |
| `cuda` | NVIDIA GPU acceleration (requires CUDA toolkit) |
| `metal` | Apple Silicon GPU acceleration (M1/M2/M3) |
| `flash_attention` | Memory-efficient attention, O(N) memory (enabled by default) |
| `mixed_precision` | FP16/BF16 inference (planned, not yet implemented) |
### Feature Details

- `accelerate`: Uses native BLAS/LAPACK for tensor operations
  - macOS: Apple Accelerate framework
  - Linux: OpenBLAS or Intel MKL
  - Windows: OpenBLAS
- `cuda`: NVIDIA GPU acceleration via the candle CUDA backend
  - Requires CUDA toolkit (11.8+ recommended)
  - Requires compute capability 7.0+ (Volta and newer)
  - Not available on macOS
- `metal`: Apple Silicon GPU acceleration via Metal
  - macOS only
  - Optimized for M1/M2/M3 chips
  - Automatic selection on compatible hardware
- `flash_attention`: Block-based attention computation
  - Reduces memory usage by 2-4× for large images
  - Maintains quality while being faster
  - Enabled by default for efficiency
### Example Usage

```toml
# CPU-only with flash attention (default)
oxigaf-diffusion = "0.1"

# Apple Silicon with Metal acceleration
oxigaf-diffusion = { version = "0.1", features = ["metal", "flash_attention"] }

# NVIDIA GPU with CUDA
oxigaf-diffusion = { version = "0.1", features = ["cuda", "flash_attention"] }
```
## Usage
### Basic Multi-View Inference

Generate novel views of a subject from a single reference image loaded with the `image` crate.
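The code for this example did not survive extraction; as a placeholder, here is a hypothetical sketch of the intended call sequence. Names such as `MultiViewPipeline`, `load`, and `generate_views` are illustrative assumptions, not the crate's confirmed API — check the crate docs for the real entry points:

```rust
// Hypothetical sketch -- type and method names are assumptions, not the
// crate's confirmed API.
use oxigaf_diffusion::MultiViewPipeline; // assumed entry point

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input = image::open("face.png")?;                 // reference image
    let pipeline = MultiViewPipeline::load("weights/")?;  // assumed weight loader
    // 4 novel views at 512x512 with the crate defaults (50 DDIM steps, CFG 7.5).
    let views = pipeline.generate_views(&input, 4)?;
    for (i, view) in views.iter().enumerate() {
        view.save(format!("view_{i}.png"))?;
    }
    Ok(())
}
```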
### Custom Camera Poses

Novel-view cameras can be positioned explicitly; poses are rigid transforms built with `nalgebra`.
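Independent of the crate's pose types, the usual way to place cameras around a subject is a look-at construction: given an eye position and a target, build a camera-to-world matrix. A plain-Rust sketch (no `nalgebra`, right-handed, camera looking down −Z):

```rust
// Camera-to-world pose via look-at: the camera at `eye` faces `target`.
// Plain-Rust sketch, independent of the crate's nalgebra-based pose types.
fn sub(a: [f32; 3], b: [f32; 3]) -> [f32; 3] {
    [a[0] - b[0], a[1] - b[1], a[2] - b[2]]
}
fn cross(a: [f32; 3], b: [f32; 3]) -> [f32; 3] {
    [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]
}
fn normalize(v: [f32; 3]) -> [f32; 3] {
    let n = (v[0] * v[0] + v[1] * v[1] + v[2] * v[2]).sqrt();
    [v[0] / n, v[1] / n, v[2] / n]
}

/// 4x4 camera-to-world matrix (row-major), camera looks down -Z.
fn look_at(eye: [f32; 3], target: [f32; 3], up: [f32; 3]) -> [[f32; 4]; 4] {
    let f = normalize(sub(target, eye)); // forward
    let r = normalize(cross(f, up));     // right
    let u = cross(r, f);                 // recomputed orthogonal up
    [
        [r[0], u[0], -f[0], eye[0]],
        [r[1], u[1], -f[1], eye[1]],
        [r[2], u[2], -f[2], eye[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]
}

fn main() {
    // Four cameras on a ring of radius 2, all looking at the origin.
    for k in 0..4 {
        let a = std::f32::consts::TAU * k as f32 / 4.0;
        let pose = look_at([2.0 * a.cos(), 0.0, 2.0 * a.sin()], [0.0; 3], [0.0, 1.0, 0.0]);
        println!("camera {k}: {:?}", pose);
    }
}
```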
### DDIM Scheduler Configuration

The number of DDIM denoising steps is configurable; the pipeline defaults to 50 steps as a speed/quality trade-off.
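Configuring fewer steps amounts to choosing a subsequence of the 1000 training timesteps. One common scheme (evenly spaced strides; the crate's scheduler may space them differently) can be sketched as:

```rust
/// Evenly spaced DDIM timesteps: pick `steps` indices out of `train_steps`
/// (e.g. 50 out of 1000), descending for the reverse (denoising) process.
/// One common spacing scheme; shown for illustration.
fn ddim_timesteps(train_steps: usize, steps: usize) -> Vec<usize> {
    let stride = train_steps / steps;
    (0..steps).rev().map(|i| i * stride).collect()
}

fn main() {
    let ts = ddim_timesteps(1000, 50);
    println!("{} steps: {:?} ... {:?}", ts.len(), &ts[..3], &ts[47..]);
}
```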
### Memory-Efficient Inference with Flash Attention

Flash attention is enabled by default and reduces attention memory usage by 2-4× for large images.
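The crate's kernel lives in `flash_attention.rs`; independent of that implementation, the core idea — block-tiled attention with an online softmax, so the full N×N score matrix is never materialized — can be sketched in plain Rust:

```rust
// Block-tiled attention with an online softmax (the idea behind flash
// attention). Keys/values are streamed in blocks; per query we keep only a
// running max `m`, softmax denominator `l`, and output accumulator `acc`,
// giving O(N) memory instead of the O(N^2) score matrix.
// Plain-Rust sketch, not the crate's kernel.
fn attention_tiled(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>], block: usize) -> Vec<Vec<f32>> {
    let d = q[0].len();
    let scale = 1.0 / (d as f32).sqrt();
    q.iter().map(|qi| {
        let mut m = f32::NEG_INFINITY; // running max of scores
        let mut l = 0.0_f32;           // running softmax denominator
        let mut acc = vec![0.0_f32; d];
        for blk in (0..k.len()).step_by(block) {
            for j in blk..(blk + block).min(k.len()) {
                let s: f32 = qi.iter().zip(&k[j]).map(|(a, b)| a * b).sum::<f32>() * scale;
                let m_new = m.max(s);
                let corr = (m - m_new).exp(); // rescale earlier partial sums
                let w = (s - m_new).exp();
                l = l * corr + w;
                for (a, vj) in acc.iter_mut().zip(&v[j]) {
                    *a = *a * corr + w * vj;
                }
                m = m_new;
            }
        }
        acc.into_iter().map(|a| a / l).collect()
    }).collect()
}

fn main() {
    let q = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let v = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let out = attention_tiled(&q, &q.clone(), &v, 1);
    println!("{:?}", out);
}
```

The output matches standard softmax attention exactly; only the evaluation order differs.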
## Pipeline Components
### CLIP Image Encoder
Extracts semantic features from input images using CLIP ViT (Vision Transformer):
- Input: RGB image (224×224)
- Output: 768-dimensional feature vector
- Pre-trained on 400M image-text pairs
### Multi-View U-Net
Denoises latent representations with camera-conditioned attention:
- Camera-conditioned cross-attention for view consistency
- Multi-scale feature pyramid (4 levels)
- Skip connections for detail preservation
- Supports batch processing of multiple views
### VAE Decoder
Decodes latent representations to RGB images:
- Latent space: 4 channels
- RGB output: 3 channels
- Upsampling factor: 8× (e.g., 64×64 latent → 512×512 RGB)
### Latent Upsampler (v0.1.0)

Upscales latent representations from 32×32 to 64×64 for 512×512 output:
- Separate U-Net (`upsampler.rs`) from `stabilityai/sd-x2-latent-upscaler`
- 10-step DDIM denoising in latent space
- Fallback: `BilinearVae` mode for CPU inference
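For illustration, the bilinear fallback path can be sketched on a single latent channel in plain Rust (the real `BilinearVae` mode operates on 4-channel latent tensors via candle):

```rust
/// 2x bilinear upsampling of one latent channel (h x w -> 2h x 2w).
/// Illustrative sketch of the bilinear fallback idea, not the crate's code.
fn upsample_bilinear_2x(src: &[f32], h: usize, w: usize) -> Vec<f32> {
    let (oh, ow) = (2 * h, 2 * w);
    let mut out = vec![0.0; oh * ow];
    for oy in 0..oh {
        for ox in 0..ow {
            // Map output pixel centers back into source coordinates.
            let sy = ((oy as f32 + 0.5) / 2.0 - 0.5).max(0.0);
            let sx = ((ox as f32 + 0.5) / 2.0 - 0.5).max(0.0);
            let (y0, x0) = (sy.floor() as usize, sx.floor() as usize);
            let (y1, x1) = ((y0 + 1).min(h - 1), (x0 + 1).min(w - 1));
            let (fy, fx) = (sy - y0 as f32, sx - x0 as f32);
            // Blend the four nearest source texels.
            let top = src[y0 * w + x0] * (1.0 - fx) + src[y0 * w + x1] * fx;
            let bot = src[y1 * w + x0] * (1.0 - fx) + src[y1 * w + x1] * fx;
            out[oy * ow + ox] = top * (1.0 - fy) + bot * fy;
        }
    }
    out
}

fn main() {
    let src = [0.0, 1.0, 2.0, 3.0]; // 2x2 latent channel
    let up = upsample_bilinear_2x(&src, 2, 2);
    println!("{:?}", up); // 4x4 result
}
```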
### IP-Adapter (v0.1.0)

Adds pixel-level identity conditioning:
- Additional `attn_ip` cross-attention layer in transformer blocks
- Context = VAE-encoded reference image
- Ensures face identity consistency across all generated views
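The decoupled-attention idea — hidden states attend separately to the semantic context and to the identity context, and the two results are summed — can be sketched with a toy single-query attention (`ip_scale` and the unweighted key/value projections are illustrative assumptions, not the crate's parameters):

```rust
// Decoupled cross-attention in the IP-Adapter style: attend to the CLIP
// context and the identity context separately, then sum with a scale.
// Toy sketch: keys double as values and projections are omitted.
fn softmax(xs: &[f32]) -> Vec<f32> {
    let m = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let e: Vec<f32> = xs.iter().map(|x| (x - m).exp()).collect();
    let s: f32 = e.iter().sum();
    e.into_iter().map(|x| x / s).collect()
}

fn cross_attn(q: &[f32], ctx: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = ctx.iter()
        .map(|k| q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale)
        .collect();
    let w = softmax(&scores);
    let mut out = vec![0.0; q.len()];
    for (wi, v) in w.iter().zip(ctx) {
        for (o, vi) in out.iter_mut().zip(v) { *o += wi * vi; }
    }
    out
}

fn main() {
    let q = vec![1.0, 0.0];
    let clip_ctx = vec![vec![0.5, 0.5]];  // semantic tokens
    let ip_ctx = vec![vec![0.0, 1.0]];    // VAE-encoded reference tokens
    let ip_scale = 1.0; // illustrative blending weight
    let base = cross_attn(&q, &clip_ctx);
    let ip = cross_attn(&q, &ip_ctx);     // the extra `attn_ip` branch
    let out: Vec<f32> = base.iter().zip(&ip).map(|(a, b)| a + ip_scale * b).collect();
    println!("{:?}", out);
}
```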
### Classifier-Free Guidance (v0.1.0)

Improves generation quality via a dual forward pass:
- Conditional: full CLIP + IP embeddings
- Unconditional: zero embeddings
- `noise_pred = uncond + guidance_scale * (cond - uncond)`
- Configurable `guidance_scale` (default: 7.5, range: 1.0–20.0)
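The combination rule above extrapolates from the unconditional prediction toward the conditional one; a scale of 1.0 reduces to the plain conditional prediction, and larger values push harder toward the condition. Applied elementwise to the noise prediction:

```rust
// Classifier-free guidance combine step, applied elementwise:
// noise_pred = uncond + guidance_scale * (cond - uncond).
fn cfg_combine(uncond: &[f32], cond: &[f32], guidance_scale: f32) -> Vec<f32> {
    uncond.iter().zip(cond)
        .map(|(u, c)| u + guidance_scale * (c - u))
        .collect()
}

fn main() {
    let uncond = [0.0, 1.0];
    let cond = [1.0, 1.0];
    // Where cond == uncond the scale has no effect; elsewhere it amplifies.
    println!("{:?}", cfg_combine(&uncond, &cond, 7.5)); // [7.5, 1.0]
}
```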
### DDIM Scheduler
Fast sampling with fewer steps than DDPM:
- DDPM: 1000 steps (slow)
- DDIM: 50-100 steps (20× faster)
- Deterministic sampling for reproducibility
- Supports both ε-prediction and v-prediction
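The deterministic (η = 0) update for the ε-prediction case can be written as a scalar sketch; the scheduler applies it elementwise to latent tensors, with `abar` the cumulative product of the noise-schedule alphas:

```rust
// One deterministic DDIM step (eta = 0, epsilon-prediction):
// first recover the predicted clean sample x0 from the noise estimate,
// then re-noise it to the previous timestep's noise level.
fn ddim_step(x_t: f32, eps: f32, abar_t: f32, abar_prev: f32) -> f32 {
    // Predicted clean sample from the current noisy sample and noise estimate.
    let x0 = (x_t - (1.0 - abar_t).sqrt() * eps) / abar_t.sqrt();
    abar_prev.sqrt() * x0 + (1.0 - abar_prev).sqrt() * eps
}

fn main() {
    // With abar_prev = 1 the step returns the predicted clean sample itself.
    println!("{:.4}", ddim_step(0.9, 0.5, 0.8, 1.0));
}
```

Determinism is what makes DDIM sampling reproducible: the same latent noise and timestep schedule always yield the same image.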
## Performance
Inference times on various hardware (512×512 resolution, 4 views, 50 steps):
| Hardware | Time (with flash attention) | Time (without) |
|---|---|---|
| CPU (Apple M2 Max) | ~12s | ~25s |
| Apple M2 Max (Metal) | ~3s | ~6s |
| NVIDIA RTX 4090 (CUDA) | ~1.5s | ~3s |
| NVIDIA RTX 3080 (CUDA) | ~2.5s | ~5s |
Memory usage:
| Resolution | Standard Attention | Flash Attention |
|---|---|---|
| 512×512 | ~8 GB | ~4 GB |
| 1024×1024 | ~24 GB | ~8 GB |
## Statistics

- Tests: 66 (all passing)
- Source files: `attention.rs`, `camera.rs`, `clip.rs`, `flash_attention.rs`, `pipeline.rs`, `scheduler.rs`, `unet.rs`, `upsampler.rs`, `vae.rs`
- Benchmarks: `diffusion_bench.rs`, `flash_attention_bench.rs`
## Documentation
## License

Licensed under the Apache License, Version 2.0 (LICENSE).