oxicuda-vision — Vision Transformer & CLIP primitives for OxiCUDA.
Pure-Rust CPU reference implementation providing:
- patch_embed: strided Conv2D patch embedder, sinusoidal & learnable positional encodings.
- vit: ViT block (pre-norm MHSA + MLP), encoder stack, full ViT model, and the Swin Transformer windowed / shifted-window block.
- convnext: ConvNeXt modern-CNN block (depthwise conv + channel LayerNorm + inverted-bottleneck + layer scale).
- clip: CLIP vision encoder, projection head, InfoNCE contrastive loss.
- augment: geometric, photometric, and normalisation image augmentations, plus MixUp / CutMix batch mixing regularisers.
- imgproc: classical image processing, covering Sobel gradients and the Canny edge detector, binary/grayscale morphology, union-find connected-component labelling, and the Hough line transform.
- fpn: Feature Pyramid Network (lateral 1×1 convolutions + top-down pathway).
- detection: RoI Align, DETR decoder, bipartite set matching, IoU / GIoU / DIoU / CIoU box-regression losses, the RTMDet detector (CSPNeXt backbone + PAFPN neck + decoupled head + SimOTA-lite cost), and the OWL-ViT open-vocabulary detector (per-patch image-text matching).
- segmentation: the Segment Anything Model (SAM), with a ViT image encoder, prompt encoder, and a two-way transformer mask decoder.
- ssl: the DINOv2 self-supervised distillation recipe, comprising a ViT backbone ([CLS] + patch tokens), weight-normalised prototype head, centred / sharpened teacher-student DINO loss, EMA teacher, centering buffer, and the iBOT masked-patch term.
- text: the CLIP Transformer text encoder, with token + positional embeddings, causal self-attention blocks, EOS pooling, and joint-space projection.
- pointcloud: the Point Transformer vector self-attention layer over kNN neighbourhoods.
- losses: focal loss (sigmoid & softmax) and soft Dice segmentation loss.
- ptx_kernels: 7 GPU PTX kernel string generators (SM 7.5–12.0).
No CUDA SDK dependency; all forward passes run on CPU f32 tensors
using flat row-major Vec<f32> layouts.
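The flat row-major layout mentioned above can be illustrated with a small sketch. It assumes a CHW (channel, height, width) ordering, as the augment module description suggests; `chw_index` and `extract_patch` are hypothetical helpers written for this example and are not part of the oxicuda-vision API.

```rust
/// Hypothetical helper (not part of oxicuda-vision): flat offset of element
/// (c, y, x) in a row-major CHW tensor with height `h` and width `w`.
fn chw_index(c: usize, y: usize, x: usize, h: usize, w: usize) -> usize {
    (c * h + y) * w + x
}

/// Copies the non-overlapping `p x p` patch at patch-grid position (py, px)
/// into a flat vector ordered by channel, then row, then column.
/// Assumes `h` and `w` are multiples of `p`.
fn extract_patch(
    img: &[f32],
    c: usize,
    h: usize,
    w: usize,
    p: usize,
    py: usize,
    px: usize,
) -> Vec<f32> {
    let mut out = Vec::with_capacity(c * p * p);
    for ch in 0..c {
        for dy in 0..p {
            for dx in 0..p {
                out.push(img[chw_index(ch, py * p + dy, px * p + dx, h, w)]);
            }
        }
    }
    out
}

fn main() {
    // A 1-channel 4x4 image holding the values 0..16 in row-major order.
    let (c, h, w, p) = (1, 4, 4, 2);
    let img: Vec<f32> = (0..c * h * w).map(|i| i as f32).collect();

    // Bottom-left 2x2 patch: rows 2..4, columns 0..2.
    let patch = extract_patch(&img, c, h, w, p, 1, 0);
    println!("{:?}", patch); // prints [8.0, 9.0, 12.0, 13.0]
}
```

A patch embedder would typically follow this gather with a linear projection per patch; the crate's own patch_embed module may organise the computation differently.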
Re-exports
- pub use error::VisionError;
- pub use error::VisionResult;
- pub use handle::LcgRng;
- pub use handle::SmVersion;
- pub use handle::VisionHandle;
Modules
- augment
- Image augmentation pipeline for CHW tensors.
- blocks
- EfficientNet-style building blocks.
- clip
- CLIP (Contrastive Language–Image Pre-Training) vision module.
- convnext
- ConvNeXt modern-CNN components.
- detection
- Object detection components.
- error
- Error types for oxicuda-vision.
- fpn
- Feature Pyramid Network (FPN) components.
- handle
- Session handle for oxicuda-vision.
- imgproc
- Classical image-processing primitives operating on flat f32 buffers.
- losses
- Loss functions for classification, detection, and segmentation heads.
- optimize
- Inference-time optimization passes for vision models.
- patch_embed
- Patch embedding for Vision Transformers.
- pointcloud
- Point-cloud neural network primitives.
- prelude
- ptx_kernels
- PTX GPU kernel sources for vision model operations.
- segmentation
- Image segmentation models.
- ssl
- Self-supervised learning (SSL) methods for vision backbones.
- text
- Text encoders for vision-language models.
- vit
- Vision Transformer (ViT) components.
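
To give a feel for the box-regression losses listed under the detection module, here is a minimal sketch of IoU and GIoU for axis-aligned boxes. The `BBox` type and the `iou_and_giou` function are hypothetical names introduced for illustration; the crate's actual box representation and signatures are not shown on this page and may differ.

```rust
/// Axis-aligned box with corners (x1, y1) and (x2, y2), x1 <= x2, y1 <= y2.
/// Hypothetical type for illustration; not the crate's box representation.
#[derive(Clone, Copy, Debug)]
struct BBox {
    x1: f32,
    y1: f32,
    x2: f32,
    y2: f32,
}

impl BBox {
    fn area(&self) -> f32 {
        (self.x2 - self.x1).max(0.0) * (self.y2 - self.y1).max(0.0)
    }
}

/// Returns (IoU, GIoU). GIoU subtracts from IoU the fraction of the smallest
/// enclosing box that is not covered by the union of the two boxes.
fn iou_and_giou(a: BBox, b: BBox) -> (f32, f32) {
    let eps = 1e-7_f32; // guards against division by zero for degenerate boxes

    // Intersection rectangle (zero if the boxes do not overlap).
    let iw = (a.x2.min(b.x2) - a.x1.max(b.x1)).max(0.0);
    let ih = (a.y2.min(b.y2) - a.y1.max(b.y1)).max(0.0);
    let inter = iw * ih;
    let union = a.area() + b.area() - inter;
    let iou = inter / (union + eps);

    // Smallest axis-aligned box enclosing both inputs.
    let cw = a.x2.max(b.x2) - a.x1.min(b.x1);
    let ch = a.y2.max(b.y2) - a.y1.min(b.y1);
    let c_area = cw * ch;
    let giou = iou - (c_area - union) / (c_area + eps);

    (iou, giou)
}

fn main() {
    let a = BBox { x1: 0.0, y1: 0.0, x2: 2.0, y2: 2.0 };
    let b = BBox { x1: 1.0, y1: 1.0, x2: 3.0, y2: 3.0 };
    let (iou, giou) = iou_and_giou(a, b);
    // The corresponding regression losses are commonly taken as 1 - IoU and 1 - GIoU.
    println!("iou = {iou:.4}, giou = {giou:.4}, giou loss = {:.4}", 1.0 - giou);
}
```

For the two boxes in `main`, the intersection is 1, the union is 7, and the enclosing box has area 9, so IoU is 1/7 and GIoU is 1/7 − 2/9. Unlike plain IoU, GIoU stays informative when boxes do not overlap, which is why it is often preferred as a regression loss.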