Stable-Diffusion model composition for ferrotorch.
Phase B.3 of real-artifact-driven development. This crate implements
the VAE decoder (Phase B.3a) and the UNet2DConditionModel
(Phase B.3b) of runwayml/stable-diffusion-v1-5. The encoder, the
CLIP text encoder, and the scheduler are out of scope and tracked
under follow-up dispatches.
§VAE decoder
Mirrors vae/config.json — VaeDecoder inverts a latent
[B, 4, 64, 64] into an image [B, 3, 512, 512]. See vae.
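The shape mapping above implies a fixed ×8 spatial upsampling (64 → 512) and a 4 → 3 channel change. A minimal sketch of that arithmetic follows. The helper `decoded_shape` and its constants are hypothetical names for illustration, not part of this crate's API.

```rust
/// Hypothetical helper (not part of the crate): shape of the decoded image
/// for a given latent shape `[B, C, H, W]`.
///
/// Assumes 4 latent channels, 3 image channels and a fixed x8 spatial scale,
/// which is what the `[B, 4, 64, 64] -> [B, 3, 512, 512]` mapping implies.
pub fn decoded_shape(latent: [usize; 4]) -> Option<[usize; 4]> {
    const LATENT_CHANNELS: usize = 4;
    const IMAGE_CHANNELS: usize = 3;
    const SCALE: usize = 8;

    let [b, c, h, w] = latent;
    if c != LATENT_CHANNELS {
        // A latent with the wrong channel count cannot be decoded.
        return None;
    }
    Some([b, IMAGE_CHANNELS, h * SCALE, w * SCALE])
}
```

Under these assumptions, `decoded_shape([1, 4, 64, 64])` yields `Some([1, 3, 512, 512])`.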
§UNet2DConditionModel
Mirrors unet/config.json — UNet2DConditionModel consumes
(noisy_latent [B, 4, 64, 64], timestep [B], text_embed [B, S, 768])
and returns predicted noise [B, 4, 64, 64]. See unet.
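The input contract above can be written down as a shape check. The sketch below is illustrative only: `check_unet_inputs` and `ShapeError` are hypothetical names, and the constants 4 and 768 are taken from the shapes quoted above rather than read from `unet/config.json`.

```rust
/// Hypothetical error type for the shape check below (not part of the crate).
#[derive(Debug, PartialEq)]
pub enum ShapeError {
    BatchMismatch,
    LatentChannels(usize),
    EmbedWidth(usize),
}

/// Hypothetical helper: validates the three UNet inputs and returns the shape
/// of the predicted noise, which matches the noisy latent's shape.
pub fn check_unet_inputs(
    noisy_latent: [usize; 4], // [B, 4, H, W]
    timestep: [usize; 1],     // [B]
    text_embed: [usize; 3],   // [B, S, 768]
) -> Result<[usize; 4], ShapeError> {
    let [b, c, _h, _w] = noisy_latent;
    // All three inputs must agree on the batch dimension.
    if timestep[0] != b || text_embed[0] != b {
        return Err(ShapeError::BatchMismatch);
    }
    if c != 4 {
        return Err(ShapeError::LatentChannels(c));
    }
    if text_embed[2] != 768 {
        return Err(ShapeError::EmbedWidth(text_embed[2]));
    }
    Ok(noisy_latent)
}
```

The sequence length `S` is left unconstrained here because the text above does not fix it.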
ResnetBlock2DTime (UNet flavour with time bias):
    h = silu(norm1(x)); h = conv1(h)
    t = silu(temb); h = h + Linear(t).view(B, out, 1, 1)
    h = silu(norm2(h)); h = conv2(h)
    out = h + (x if in==out else conv_shortcut(x))

Transformer2DModel (SD UNet flavour):
    h = GroupNorm(x); h = proj_in (Conv2d k=1, [B, inner, H, W])
    h = flatten to [B, HW, inner]; for block in blocks: h = block(h, ehs)
    h = reshape back; h = proj_out (Conv2d k=1); out = h + residual

Each BasicTransformerBlock is the canonical pre-LN
(self-attn → cross-attn → GEGLU FF) stack.
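The "flatten to [B, HW, inner]" and "reshape back" steps in the Transformer2DModel outline amount to a channels-last permutation of a row-major `[B, C, H, W]` buffer and its inverse. The sketch below shows that index arithmetic on plain `f32` slices. The function names are hypothetical, and it does not use ferrotorch's tensor type, whose API is not shown here.

```rust
/// Hypothetical sketch: permute a row-major `[B, C, H, W]` buffer into
/// `[B, H*W, C]` tokens (one token per spatial position).
pub fn nchw_to_tokens(x: &[f32], b: usize, c: usize, h: usize, w: usize) -> Vec<f32> {
    assert_eq!(x.len(), b * c * h * w, "buffer length must match the shape");
    let hw = h * w;
    let mut out = vec![0.0f32; x.len()];
    for bi in 0..b {
        for ci in 0..c {
            for p in 0..hw {
                // Source index: [bi, ci, p]; destination index: [bi, p, ci].
                out[(bi * hw + p) * c + ci] = x[(bi * c + ci) * hw + p];
            }
        }
    }
    out
}

/// Inverse of `nchw_to_tokens`: `[B, H*W, C]` tokens back to `[B, C, H, W]`.
pub fn tokens_to_nchw(t: &[f32], b: usize, c: usize, h: usize, w: usize) -> Vec<f32> {
    assert_eq!(t.len(), b * c * h * w, "buffer length must match the shape");
    let hw = h * w;
    let mut out = vec![0.0f32; t.len()];
    for bi in 0..b {
        for p in 0..hw {
            for ci in 0..c {
                out[(bi * c + ci) * hw + p] = t[(bi * hw + p) * c + ci];
            }
        }
    }
    out
}
```

Because the two functions are exact inverses, the residual add at the end of the outline operates on tensors with identical layout.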
Re-exports§
pub use attention::Attention;
pub use attention::BasicTransformerBlock;
pub use attention::FeedForward;
pub use attention::Transformer2DModel;
pub use blocks::AttnBlock2D;
pub use blocks::DownEncoderBlock2D;
pub use blocks::Downsample2D;
pub use blocks::ResnetBlock2D;
pub use blocks::UNetMidBlock2D;
pub use blocks::UpDecoderBlock2D;
pub use blocks::Upsample2D;
pub use clip_text_encoder::ClipEncoder;
pub use clip_text_encoder::ClipEncoderLayer;
pub use clip_text_encoder::ClipMlp;
pub use clip_text_encoder::ClipSelfAttention;
pub use clip_text_encoder::ClipTextConfig;
pub use clip_text_encoder::ClipTextEmbeddings;
pub use clip_text_encoder::ClipTextEncoder;
pub use config::VaeDecoderConfig;
pub use pipeline::PipelineStepDump;
pub use pipeline::StableDiffusionPipeline;
pub use resnet_block_time::ResnetBlock2DTime;
pub use safetensors_loader::DropReport;
pub use safetensors_loader::load_clip_text_encoder;
pub use safetensors_loader::load_unet;
pub use safetensors_loader::load_vae_decoder;
pub use safetensors_loader::load_vae_encoder;
pub use scheduler::BetaSchedule;
pub use scheduler::DDIMConfig;
pub use scheduler::DDIMScheduler;
pub use scheduler::PredictionType;
pub use scheduler::TimestepSpacing;
pub use time_embedding::TimestepEmbedding;
pub use time_embedding::Timesteps;
pub use unet::AnyDownBlock;
pub use unet::AnyUpBlock;
pub use unet::CrossAttnDownBlock2D;
pub use unet::CrossAttnUpBlock2D;
pub use unet::DownBlock2D;
pub use unet::UNet2DConditionModel;
pub use unet::UNetMidBlock2DCrossAttn;
pub use unet::UpBlock2D;
pub use unet_config::UNet2DConditionConfig;
pub use vae::Decoder;
pub use vae::VaeDecoder;
pub use vae_encoder::DiagonalGaussianDistribution;
pub use vae_encoder::Encoder;
pub use vae_encoder::VaeEncoder;
pub use vae_encoder::VaeEncoderConfig;
Modules§
- attention
- Multi-head attention + the Transformer2DModel wrapper used by the SD UNet’s CrossAttn blocks.
- blocks
- Building blocks of the Stable-Diffusion VAE decoder.
- clip_text_encoder
- Stable-Diffusion 1.5 CLIP text encoder (openai/clip-vit-large-patch14, the text tower of CLIP-ViT-L/14).
- config
- Configuration for the Stable-Diffusion VAE decoder.
- pipeline
- Stable-Diffusion 1.5 end-to-end text-to-image generation pipeline.
- resnet_block_time
- ResnetBlock2DTime: the time-conditioned variant of the ResnetBlock2D used by the SD UNet.
- safetensors_loader
- Helpers that turn a path-to-safetensors into a loaded VaeDecoder.
- scheduler
- Deterministic DDIM scheduler matching diffusers.schedulers.DDIMScheduler for the Stable-Diffusion-1.5 sampling defaults.
- time_embedding
- Time-step sinusoidal positional encoding + the MLP that follows it.
- unet
- Stable-Diffusion UNet2DConditionModel forward pass.
- unet_config
- Configuration for the Stable-Diffusion UNet2DConditionModel.
- vae
- Stable-Diffusion VAE decoder composition.
- vae_encoder
- Stable-Diffusion VAE encoder composition.
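
As an illustration of the sinusoidal time-step encoding that the time_embedding entry above describes, here is a minimal sketch in plain Rust. It assumes the common diffusers-style variant with a maximum period of 10000, no frequency shift, and the cosine half placed before the sine half. Those settings, and the function name `timestep_embedding`, are assumptions for illustration and are not taken from this crate's source.

```rust
/// Hypothetical sketch of a sinusoidal time-step embedding.
///
/// Assumptions (not read from the crate): max period 10000, zero frequency
/// shift, and `[cos | sin]` ordering of the two halves.
pub fn timestep_embedding(t: f32, dim: usize) -> Vec<f32> {
    assert!(dim >= 2 && dim % 2 == 0, "dim must be a positive even number");
    let half = dim / 2;
    let mut out = vec![0.0f32; dim];
    for i in 0..half {
        // Frequencies fall geometrically from 1 towards 1/10000.
        let freq = (-(10000.0f32).ln() * i as f32 / half as f32).exp();
        let arg = t * freq;
        out[i] = arg.cos();        // first half: cosines
        out[half + i] = arg.sin(); // second half: sines
    }
    out
}
```

In the full model this vector would then pass through the MLP mentioned in the module description, which is omitted here.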