oxicuda-vision

Vision Transformer & CLIP primitives for OxiCUDA -- ViT patch embedding, multi-head self-attention, CLIP contrastive learning, FPN, RoI Align, and DETR decoder, all in pure Rust.

Part of the OxiCUDA project. See the workspace README for the full crate map.

Overview

oxicuda-vision provides the architectural pieces of modern vision deep learning. For Vision Transformers it includes a strided Conv2D PatchEmbed, sinusoidal and learnable positional embeddings, a complete ViT block (pre-norm MHSA + MLP), an encoder stack, and a full ViTModel. For CLIP it includes a vision encoder, a projection head, and the InfoNCE contrastive loss. It also covers geometric, photometric, and normalisation image augmentations, a Feature Pyramid Network with lateral 1x1 convolutions and a top-down pathway, and a DETR-style decoder with roi_align and bipartite_match primitives.

All forward passes operate on flat row-major Vec<f32> tensors so the same code drives CPU unit tests, PTX kernel verification, and CPU-only deployments. PTX kernels (patch_embed, bilinear_interp, contrastive_loss, roi_align, image_normalize, adaptive_avg_pool, focal_loss) are emitted for SM 7.5 through SM 12.0. The only crate dependency is thiserror.
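To make the flat row-major layout concrete, here is a minimal indexing sketch. It assumes a channel-major `[C, H, W]` ordering, which matches the `3 * 32 * 32` buffer used in the Quick Start but is not confirmed by this README. `chw_index` and `get_pixel` are hypothetical helpers written for illustration, not part of the crate's API.

```rust
/// Flat offset of element (c, y, x) in a row-major tensor of shape [C, H, W].
/// The CHW ordering is an assumption made for this sketch.
fn chw_index(c: usize, y: usize, x: usize, height: usize, width: usize) -> usize {
    (c * height + y) * width + x
}

/// Hypothetical helper: reads one value, returning None when any
/// coordinate is out of bounds or the buffer is too short.
fn get_pixel(
    data: &[f32],
    dims: (usize, usize, usize), // (channels, height, width)
    c: usize,
    y: usize,
    x: usize,
) -> Option<f32> {
    let (channels, height, width) = dims;
    if c >= channels || y >= height || x >= width {
        return None;
    }
    data.get(chw_index(c, y, x, height, width)).copied()
}
```

Because the tensor is a plain `Vec<f32>`, the same offset arithmetic can be used when checking a host-side result against the output of a kernel, element by element.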

Modules

Module	Description
`error`	`VisionError` / `VisionResult`
`handle`	`VisionHandle`, `SmVersion`, `LcgRng`
`patch_embed`	`PatchEmbed`, `PatchEmbedConfig`, `LearnablePosEmbed`, `pos_2d_sincos`, `add_pos_embed`, `prepend_cls`
`vit`	`ViTBlock`, `ViTEncoder`, `ViTModel`, `ViTConfig::tiny()`
`clip`	`ClipVisionEncoder`, `ClipVisionConfig`, `ProjectionHead`, `info_nce_loss`
`augment`	`AugOp`, `Pipeline`; `Resize`, `RandomCrop`, `HorizontalFlip`, ImageNet normalize
`fpn`	`FeatureMap`, `FpnConfig`, `Fpn`, `LateralConv1x1`
`detection`	`roi_align`, `bipartite_match`, `DetrDecoder`, `DetrConfig::tiny()`
`ptx_kernels`	PTX for the seven kernels listed above
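For readers unfamiliar with the contrastive loss named in the `clip` row, the following is a standalone sketch of a symmetric InfoNCE loss over flat row-major buffers. It is not the crate's `info_nce_loss`: the function name `info_nce_sketch`, its signature, and the choice to return `Option<f32>` are assumptions for illustration, and the inputs are assumed to be L2-normalised by the caller.

```rust
/// Numerically stable log(sum(exp(x))) over a slice.
fn log_sum_exp(xs: &[f32]) -> f32 {
    let max = xs.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    max + xs.iter().map(|x| (x - max).exp()).sum::<f32>().ln()
}

/// Symmetric InfoNCE over `n` paired embeddings (hypothetical signature).
/// `a` and `b` are row-major [n, dim]; row i of `a` is paired with row i of `b`.
fn info_nce_sketch(a: &[f32], b: &[f32], n: usize, dim: usize, temperature: f32) -> Option<f32> {
    if n == 0 || dim == 0 || a.len() != n * dim || b.len() != n * dim || temperature <= 0.0 {
        return None;
    }
    // logits[i * n + j] = <a_i, b_j> / temperature
    let mut logits = vec![0.0_f32; n * n];
    for i in 0..n {
        for j in 0..n {
            let dot: f32 = (0..dim).map(|k| a[i * dim + k] * b[j * dim + k]).sum();
            logits[i * n + j] = dot / temperature;
        }
    }
    // Cross-entropy with the diagonal as the target, taken along rows
    // (a -> b) and along columns (b -> a), then averaged.
    let mut total = 0.0_f32;
    for i in 0..n {
        let row: Vec<f32> = (0..n).map(|j| logits[i * n + j]).collect();
        let col: Vec<f32> = (0..n).map(|j| logits[j * n + i]).collect();
        total += log_sum_exp(&row) - logits[i * n + i];
        total += log_sum_exp(&col) - logits[i * n + i];
    }
    Some(total / (2.0 * n as f32))
}
```

With perfectly aligned pairs, lowering the temperature sharpens the softmax and drives the loss toward zero, which is a quick sanity check for any implementation of this loss.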

Quick Start

```rust
use oxicuda_vision::prelude::*;

let mut rng = LcgRng::new(1);

// 32x32 RGB image, patch_size = 4, embed_dim = 16 -> 64 patch tokens.
let cfg = PatchEmbedConfig::new(32, 4, 3, 16)?;
let pe  = PatchEmbed::new(cfg.clone(), &mut rng);
let image = vec![0.5_f32; 3 * 32 * 32];
let tokens = pe.forward(&image)?;
assert_eq!(tokens.len(), cfg.n_patches() * cfg.embed_dim);

// Tiny ViT classifier (10-way).
let model = ViTModel::new(ViTConfig::tiny(), &mut rng)?;
let logits = model.forward(&image)?;
assert_eq!(logits.len(), 10);
# Ok::<(), VisionError>(())
```
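The "64 patch tokens" figure in the example follows from (32 / 4)^2 = 64. The sketch below reproduces that arithmetic and shows how one non-overlapping patch could be gathered from a flat image. A strided convolution whose kernel size equals its stride is equivalent to a linear projection of each flattened patch, so this is the data each token is computed from. The names `n_patches` and `extract_patch` here are free functions written for illustration (the crate exposes `n_patches()` as a method on the config), and the `[C, H, W]` layout is an assumption.

```rust
/// Number of non-overlapping patches in a square image.
/// Returns None if the patch size is zero or does not divide the image size.
fn n_patches(img_size: usize, patch_size: usize) -> Option<usize> {
    if patch_size == 0 || img_size % patch_size != 0 {
        return None;
    }
    let per_side = img_size / patch_size;
    Some(per_side * per_side)
}

/// Hypothetical helper: copies patch (py, px) out of a flat [C, H, W] image
/// into a vector of length channels * patch_size * patch_size.
fn extract_patch(
    image: &[f32],
    channels: usize,
    img_size: usize,
    patch_size: usize,
    py: usize,
    px: usize,
) -> Option<Vec<f32>> {
    // Validates patch_size and divisibility; the count itself is not needed here.
    n_patches(img_size, patch_size)?;
    let per_side = img_size / patch_size;
    if image.len() != channels * img_size * img_size || py >= per_side || px >= per_side {
        return None;
    }
    let mut patch = Vec::with_capacity(channels * patch_size * patch_size);
    for c in 0..channels {
        for dy in 0..patch_size {
            let y = py * patch_size + dy;
            // Start of this patch row inside channel c.
            let start = (c * img_size + y) * img_size + px * patch_size;
            patch.extend_from_slice(&image[start..start + patch_size]);
        }
    }
    Some(patch)
}
```

For the Quick Start configuration, each patch holds 3 * 4 * 4 = 48 values, which a patch embedding then maps to `embed_dim = 16` outputs per token.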

Status

Item	Value
Version	0.1.5
Release date	2026-05-03
Default features	Pure Rust (`thiserror` only)
`unwrap()`	0 in production code
