Vision Transformer & CLIP primitives for OxiCUDA: ViT patch embedding, multi-head self-attention, CLIP contrastive learning, FPN, RoI align, DETR decoder — pure Rust, zero CUDA SDK dependency.
//! Patch embedding for Vision Transformers.
//!
//! Converts a CHW image to a sequence of patch tokens by applying a
//! strided Conv2D with `kernel_size == stride == patch_size`.
//! Also provides 2-D sinusoidal and learnable positional encodings.
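//!
//! The equivalence between the strided convolution and "flatten each patch,
//! then apply a linear projection" can be sketched as below. This is a
//! minimal, self-contained illustration and not this crate's implementation:
//! `patch_embed_sketch`, its parameter order, the `[embed_dim, c, patch, patch]`
//! weight layout, and the row-major token order are all assumptions made for
//! the example, and it does not use `PatchEmbed` or `PatchEmbedConfig`.
//!
//! ```rust
//! /// Hypothetical helper (not part of this crate's API): embeds a CHW image
//! /// into `(h / patch) * (w / patch)` tokens of length `bias.len()`.
//! fn patch_embed_sketch(
//!     image: &[f32],  // CHW layout, length c * h * w
//!     c: usize,
//!     h: usize,
//!     w: usize,
//!     patch: usize,   // kernel_size == stride == patch
//!     weight: &[f32], // assumed layout: [embed_dim, c, patch, patch]
//!     bias: &[f32],   // [embed_dim]
//! ) -> Vec<Vec<f32>> {
//!     assert!(patch > 0 && h % patch == 0 && w % patch == 0);
//!     assert_eq!(image.len(), c * h * w);
//!     let embed_dim = bias.len();
//!     assert_eq!(weight.len(), embed_dim * c * patch * patch);
//!
//!     let (grid_h, grid_w) = (h / patch, w / patch);
//!     let mut tokens = Vec::with_capacity(grid_h * grid_w);
//!     // Because stride == kernel size, the windows never overlap:
//!     // each output position reads exactly one patch.
//!     for py in 0..grid_h {
//!         for px in 0..grid_w {
//!             let mut token = bias.to_vec();
//!             for (e, out) in token.iter_mut().enumerate() {
//!                 for ch in 0..c {
//!                     for ky in 0..patch {
//!                         for kx in 0..patch {
//!                             let y = py * patch + ky;
//!                             let x = px * patch + kx;
//!                             let wi = ((e * c + ch) * patch + ky) * patch + kx;
//!                             *out += weight[wi] * image[(ch * h + y) * w + x];
//!                         }
//!                     }
//!                 }
//!             }
//!             tokens.push(token);
//!         }
//!     }
//!     tokens
//! }
//! ```
//!
//! Under these assumptions, a 3x224x224 image with a patch size of 16 yields
//! 14 * 14 = 196 tokens, each of length `embed_dim`.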

pub mod conv2d_patch;
pub mod pos_embed;

pub use conv2d_patch::{PatchEmbed, PatchEmbedConfig, PatchEmbedWeights, prepend_cls};
pub use pos_embed::{LearnablePosEmbed, add_pos_embed, pos_2d_sincos};