oxicuda-vision 0.1.6

Vision Transformer & CLIP primitives for OxiCUDA: ViT patch embedding, multi-head self-attention, CLIP contrastive learning, FPN, RoI align, DETR decoder — pure Rust, zero CUDA SDK dependency.
//! CLIP (Contrastive Language–Image Pre-Training) vision module.
//!
//! Provides:
//! - **`ClipVisionEncoder`**: ViT-based image encoder that produces a single
//!   CLS-token embedding per image.
//! - **`ProjectionHead`**: linear projection + L2 normalisation mapping
//!   encoder embeddings to a shared CLIP embedding space.
//! - **`info_nce_loss`**: numerically stable symmetric InfoNCE / NT-Xent loss.

pub mod contrastive;
pub mod projection;
pub mod vision_encoder;

pub use contrastive::info_nce_loss;
pub use projection::{ProjectionHead, ProjectionWeights};
pub use vision_encoder::{ClipVisionConfig, ClipVisionEncoder};
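
// Illustrative sketch only, not this crate's implementation: the module docs
// above mention L2 normalisation and a numerically stable symmetric InfoNCE
// loss, and the standalone functions below show one common way to compute
// both on plain `f32` slices. The names `l2_normalize`, `log_sum_exp` and
// `info_nce_sketch`, the row-major layout and the `temperature` parameter are
// all assumptions made for illustration; the real `info_nce_loss` signature
// may differ.

```rust
/// Hypothetical helper: L2-normalise one embedding in place.
fn l2_normalize(v: &mut [f32]) {
    // Clamp the norm so an all-zero vector does not divide by zero.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
    for x in v.iter_mut() {
        *x /= norm;
    }
}

/// log(sum(exp(x))) computed with the maximum subtracted first, so large
/// logits (e.g. from a small temperature) do not overflow `exp`.
fn log_sum_exp(xs: &[f32]) -> f32 {
    let m = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    m + xs.iter().map(|x| (x - m).exp()).sum::<f32>().ln()
}

/// Symmetric InfoNCE over `n` paired embeddings of dimension `d`, stored
/// row-major. Pair `i` (image `i`, text `i`) is the positive; every other
/// combination in the batch is a negative.
fn info_nce_sketch(img: &[f32], txt: &[f32], n: usize, d: usize, temperature: f32) -> f32 {
    assert_eq!(img.len(), n * d);
    assert_eq!(txt.len(), n * d);

    // logits[i * n + j] = <img_i, txt_j> / temperature
    let mut logits = vec![0.0f32; n * n];
    for i in 0..n {
        for j in 0..n {
            let dot: f32 = (0..d).map(|k| img[i * d + k] * txt[j * d + k]).sum();
            logits[i * n + j] = dot / temperature;
        }
    }

    let mut loss = 0.0;
    for i in 0..n {
        let positive = logits[i * n + i];
        // Image-to-text direction: cross-entropy over row i.
        let row = &logits[i * n..(i + 1) * n];
        loss += log_sum_exp(row) - positive;
        // Text-to-image direction: cross-entropy over column i.
        let col: Vec<f32> = (0..n).map(|j| logits[j * n + i]).collect();
        loss += log_sum_exp(&col) - positive;
    }
    // Average over both directions and over the batch.
    loss / (2.0 * n as f32)
}
```

// In this sketch, embeddings are expected to be L2-normalised first (as the
// projection head is described as doing), so each logit is a cosine
// similarity scaled by 1 / temperature.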