oxicuda-vision 0.2.0

Vision Transformer & CLIP primitives for OxiCUDA: ViT patch embedding, multi-head self-attention, CLIP contrastive learning, FPN, RoI align, DETR decoder — pure Rust, zero CUDA SDK dependency.

//! Text encoders for vision-language models.
//!
//! Provides:
//! - **`clip_text`**: the CLIP Transformer text tower (Radford et al. 2021) —
//!   token + positional embeddings, pre-LN causal self-attention blocks, a
//!   final LayerNorm, EOS-token pooling, and a linear projection into the
//!   joint image-text embedding space with L2 normalisation.

pub mod clip_text;

pub use clip_text::{ClipTextConfig, ClipTextEncoder};
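
The tail of the text tower described in the module docs (EOS-token pooling, linear projection, L2 normalisation) can be sketched in plain Rust as follows. This is a minimal illustration, not the crate's implementation: the function names `eos_position`, `project`, `l2_normalize` and `pool_and_project` are hypothetical, the tensors are flat row-major `f32` slices, and the EOS position is assumed to be the position of the largest token id, which holds for the original CLIP tokenizer but may not match how `ClipTextEncoder` locates it.

```rust
/// Position of the EOS token, ASSUMING EOS has the largest id in the
/// sequence (true for the original CLIP vocabulary). Ties resolve to the
/// first occurrence. `token_ids` is expected to be non-empty.
fn eos_position(token_ids: &[u32]) -> usize {
    let mut best = 0;
    for (i, &id) in token_ids.iter().enumerate() {
        if id > token_ids[best] {
            best = i;
        }
    }
    best
}

/// Multiply a row vector `x` (length `d_in`) by a row-major
/// `[d_in, d_out]` weight matrix `w`, with no bias term.
fn project(x: &[f32], w: &[f32], d_out: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; d_out];
    for (i, &xi) in x.iter().enumerate() {
        for j in 0..d_out {
            out[j] += xi * w[i * d_out + j];
        }
    }
    out
}

/// Scale `v` in place to unit Euclidean length. `eps` is a lower bound on
/// the norm so that an all-zero vector does not divide by zero.
fn l2_normalize(v: &mut [f32], eps: f32) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt().max(eps);
    for x in v.iter_mut() {
        *x /= norm;
    }
}

/// Take the hidden state at the EOS position, project it into the joint
/// embedding space, and L2-normalise the result.
/// `hidden` is row-major `[seq_len, width]` and is assumed to have already
/// passed through the final LayerNorm.
fn pool_and_project(
    hidden: &[f32],
    token_ids: &[u32],
    width: usize,
    w: &[f32],
    d_out: usize,
) -> Vec<f32> {
    let pos = eos_position(token_ids);
    let pooled = &hidden[pos * width..(pos + 1) * width];
    let mut emb = project(pooled, w, d_out);
    l2_normalize(&mut emb, 1e-12);
    emb
}
```

Because the output has unit length, the dot product of a text embedding with a similarly normalised image embedding is their cosine similarity, which is the quantity a CLIP-style contrastive loss compares.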