Crate ferrotorch_diffusion

Stable-Diffusion model composition for ferrotorch.

Phase B.3 of real-artifact-driven development, targeting runwayml/stable-diffusion-v1-5. Phase B.3a covers the VAE decoder and Phase B.3b covers the UNet2DConditionModel. The VAE encoder, the CLIP text encoder, the scheduler, and the end-to-end pipeline live in their own modules (vae_encoder, clip_text_encoder, scheduler, pipeline).

VAE decoder

Mirrors vae/config.json. VaeDecoder decodes a latent [B, 4, 64, 64] into an image [B, 3, 512, 512]. See vae.
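
The shape relationship above can be sketched as a small helper. This is only an illustration: decoded_shape and VAE_SCALE are hypothetical names, not part of this crate's API, and the factor of 8 is simply read off the 64 to 512 mapping stated above.

```rust
/// Spatial upscaling factor implied by the documented shapes (64 -> 512).
/// Hypothetical constant, not taken from the crate.
const VAE_SCALE: usize = 8;

/// Maps a latent shape [B, 4, H, W] to the decoded image shape
/// [B, 3, 8H, 8W]. Returns None when the channel count is not 4.
fn decoded_shape(latent: [usize; 4]) -> Option<[usize; 4]> {
    let [b, c, h, w] = latent;
    if c != 4 {
        return None;
    }
    Some([b, 3, h * VAE_SCALE, w * VAE_SCALE])
}
```

With these assumptions, decoded_shape([1, 4, 64, 64]) yields Some([1, 3, 512, 512]).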

UNet2DConditionModel

Mirrors unet/config.json — UNet2DConditionModel consumes (noisy_latent [B, 4, 64, 64], timestep [B], text_embed [B, S, 768]) and returns predicted noise [B, 4, 64, 64]. See unet.
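
Before the UNet can use the timestep [B] input, each scalar timestep is usually expanded into a sinusoidal vector, which the time_embedding module describes as the positional encoding that precedes the MLP. The following is a minimal standalone sketch of that encoding, not the crate's implementation. The function name, the cosine-first layout, and the maximum period of 10000 are assumptions based on common Stable-Diffusion 1.5 settings.

```rust
/// Sinusoidal embedding of one timestep, laid out as [cos..., sin...].
/// Assumes `dim` is even and at least 2, with a maximum period of 10000
/// (both are assumptions for this sketch).
fn sinusoidal_embedding(t: f32, dim: usize) -> Vec<f32> {
    let half = dim / 2;
    let mut out = vec![0.0f32; 2 * half];
    for i in 0..half {
        // Frequencies fall geometrically from 1 towards 1/10000.
        let freq = (-(10000.0f32).ln() * i as f32 / half as f32).exp();
        let angle = t * freq;
        out[i] = angle.cos();
        out[half + i] = angle.sin();
    }
    out
}
```

At t = 0 every cosine entry is 1 and every sine entry is 0, which makes a convenient sanity check for the layout.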

ResnetBlock2DTime (UNet flavour with time bias):

h = silu(norm1(x)); h = conv1(h)
t = silu(temb); h = h + Linear(t).view(B, out, 1, 1)
h = silu(norm2(h)); h = conv2(h)
out = h + (x if in==out else conv_shortcut(x))
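
The time-bias step (h = h + Linear(t).view(B, out, 1, 1)) is a broadcast add over the spatial dimensions. Below is a minimal sketch on plain slices, assuming a contiguous NCHW layout. silu and add_time_bias are hypothetical helpers written for illustration and do not use the ferrotorch tensor API.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Adds a per-(batch, channel) bias to a contiguous NCHW buffer in place.
/// `h` holds B * C * hw values; `bias` holds B * C values, i.e. the
/// already-projected time embedding viewed as [B, C, 1, 1].
fn add_time_bias(h: &mut [f32], bias: &[f32], hw: usize) {
    assert!(hw > 0);
    assert_eq!(h.len(), bias.len() * hw);
    // Each chunk of `hw` values is one (batch, channel) plane.
    for (plane, &b) in h.chunks_mut(hw).zip(bias) {
        for v in plane {
            *v += b;
        }
    }
}
```

In the block itself, bias would come from applying the linear layer to silu(temb); the sketch only shows how that result is broadcast over H and W.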

Transformer2DModel (SD UNet flavour):

residual = x
h = GroupNorm(x); h = proj_in(h) (Conv2d k=1, [B, inner, H, W])
h = flatten to [B, HW, inner]; for block in blocks: h = block(h, ehs)
h = reshape back to [B, inner, H, W]; h = proj_out(h) (Conv2d k=1); out = h + residual
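
The flatten and reshape-back steps are a transpose between channel-major and token-major layouts. Here is a minimal sketch on flat buffers, assuming contiguous row-major storage. nchw_to_tokens and tokens_to_nchw are hypothetical names used for illustration only.

```rust
/// Rearranges a contiguous [B, C, H*W] buffer into [B, H*W, C].
fn nchw_to_tokens(x: &[f32], b: usize, c: usize, hw: usize) -> Vec<f32> {
    assert_eq!(x.len(), b * c * hw);
    let mut out = vec![0.0f32; x.len()];
    for n in 0..b {
        for ch in 0..c {
            for p in 0..hw {
                out[(n * hw + p) * c + ch] = x[(n * c + ch) * hw + p];
            }
        }
    }
    out
}

/// Inverse rearrangement: [B, H*W, C] back to [B, C, H*W].
fn tokens_to_nchw(x: &[f32], b: usize, c: usize, hw: usize) -> Vec<f32> {
    assert_eq!(x.len(), b * c * hw);
    let mut out = vec![0.0f32; x.len()];
    for n in 0..b {
        for p in 0..hw {
            for ch in 0..c {
                out[(n * c + ch) * hw + p] = x[(n * hw + p) * c + ch];
            }
        }
    }
    out
}
```

Applying the two functions in sequence returns the original buffer, which is what lets the residual be added in the [B, inner, H, W] layout after the transformer blocks run.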

Each BasicTransformerBlock is the canonical pre-LN (self-attn → cross-attn → GEGLU FF) stack.
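
The GEGLU feed-forward gate can be sketched for a single token as follows. This is an illustration under stated assumptions: geglu and gelu_tanh are hypothetical names, the value-first, gate-second split is assumed, and the tanh approximation of GELU stands in for whichever GELU variant the crate actually uses.

```rust
/// GELU using the tanh approximation (a stand-in chosen for this sketch).
fn gelu_tanh(x: f32) -> f32 {
    let k = (2.0f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (k * (x + 0.044715 * x * x * x)).tanh())
}

/// GEGLU over one token. `proj` is the output of a linear layer of width
/// 2 * d; the first half is treated as the value and the second half as
/// the gate (an assumed ordering).
fn geglu(proj: &[f32]) -> Vec<f32> {
    assert!(proj.len() % 2 == 0);
    let (value, gate) = proj.split_at(proj.len() / 2);
    value
        .iter()
        .zip(gate)
        .map(|(v, g)| v * gelu_tanh(*g))
        .collect()
}
```

A gate of zero suppresses its value entirely, while a large positive gate passes the value through scaled by roughly the gate itself.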

Re-exports

pub use attention::Attention;
pub use attention::BasicTransformerBlock;
pub use attention::FeedForward;
pub use attention::Transformer2DModel;
pub use blocks::AttnBlock2D;
pub use blocks::DownEncoderBlock2D;
pub use blocks::Downsample2D;
pub use blocks::ResnetBlock2D;
pub use blocks::UNetMidBlock2D;
pub use blocks::UpDecoderBlock2D;
pub use blocks::Upsample2D;
pub use clip_text_encoder::ClipEncoder;
pub use clip_text_encoder::ClipEncoderLayer;
pub use clip_text_encoder::ClipMlp;
pub use clip_text_encoder::ClipSelfAttention;
pub use clip_text_encoder::ClipTextConfig;
pub use clip_text_encoder::ClipTextEmbeddings;
pub use clip_text_encoder::ClipTextEncoder;
pub use config::VaeDecoderConfig;
pub use pipeline::PipelineStepDump;
pub use pipeline::StableDiffusionPipeline;
pub use resnet_block_time::ResnetBlock2DTime;
pub use safetensors_loader::DropReport;
pub use safetensors_loader::load_clip_text_encoder;
pub use safetensors_loader::load_unet;
pub use safetensors_loader::load_vae_decoder;
pub use safetensors_loader::load_vae_encoder;
pub use scheduler::BetaSchedule;
pub use scheduler::DDIMConfig;
pub use scheduler::DDIMScheduler;
pub use scheduler::PredictionType;
pub use scheduler::TimestepSpacing;
pub use time_embedding::TimestepEmbedding;
pub use time_embedding::Timesteps;
pub use unet::AnyDownBlock;
pub use unet::AnyUpBlock;
pub use unet::CrossAttnDownBlock2D;
pub use unet::CrossAttnUpBlock2D;
pub use unet::DownBlock2D;
pub use unet::UNet2DConditionModel;
pub use unet::UNetMidBlock2DCrossAttn;
pub use unet::UpBlock2D;
pub use unet_config::UNet2DConditionConfig;
pub use vae::Decoder;
pub use vae::VaeDecoder;
pub use vae_encoder::DiagonalGaussianDistribution;
pub use vae_encoder::Encoder;
pub use vae_encoder::VaeEncoder;
pub use vae_encoder::VaeEncoderConfig;

Modules

attention: Multi-head attention + the Transformer2DModel wrapper used by the SD UNet’s CrossAttn blocks.
blocks: Building blocks of the Stable-Diffusion VAE decoder.
clip_text_encoder: Stable-Diffusion 1.5 CLIP text encoder (openai/clip-vit-large-patch14 — the text tower of CLIP-ViT-L/14).
config: Configuration for the Stable-Diffusion VAE decoder.
pipeline: Stable-Diffusion 1.5 end-to-end text-to-image generation pipeline.
resnet_block_time: ResnetBlock2DTime — the time-conditioned variant of the ResnetBlock2D used by the SD UNet.
safetensors_loader: Helpers that turn a path-to-safetensors into a loaded model (load_vae_decoder, load_vae_encoder, load_unet, load_clip_text_encoder).
scheduler: Deterministic DDIM scheduler matching diffusers.schedulers.DDIMScheduler for the Stable-Diffusion-1.5 sampling defaults.
time_embedding: Time-step sinusoidal positional encoding + the MLP that follows it.
unet: Stable-Diffusion UNet2DConditionModel forward pass.
unet_config: Configuration for the Stable-Diffusion UNet2DConditionModel.
vae: Stable-Diffusion VAE decoder composition.
vae_encoder: Stable-Diffusion VAE encoder composition.
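
As an illustration of the kind of quantity the scheduler module works with, the sketch below computes a "scaled linear" beta schedule and the cumulative alpha products a DDIM step reads from. It is a standalone sketch, not the crate's code: the function names are hypothetical, and the values quoted afterwards are assumptions based on the usual Stable-Diffusion 1.5 settings.

```rust
/// "Scaled linear" schedule: betas are linear in sqrt-space, then squared.
fn scaled_linear_betas(beta_start: f64, beta_end: f64, steps: usize) -> Vec<f64> {
    assert!(steps >= 2);
    let (s, e) = (beta_start.sqrt(), beta_end.sqrt());
    (0..steps)
        .map(|i| {
            let r = s + (e - s) * i as f64 / (steps - 1) as f64;
            r * r
        })
        .collect()
}

/// Running product of (1 - beta_t), one entry per training timestep.
fn alphas_cumprod(betas: &[f64]) -> Vec<f64> {
    let mut acc = 1.0;
    betas
        .iter()
        .map(|b| {
            acc *= 1.0 - b;
            acc
        })
        .collect()
}
```

Assuming beta_start = 0.00085, beta_end = 0.012, and 1000 training steps, the cumulative product decreases monotonically from just under 1 towards 0 as the timestep grows.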