Crate oxicuda_vision


oxicuda-vision — Vision Transformer & CLIP primitives for OxiCUDA.

Pure-Rust CPU reference implementation providing:

  • patch_embed: strided Conv2D patch embedder, sinusoidal & learnable positional encodings.
  • vit: ViT block (pre-norm MHSA + MLP), encoder stack, full ViT model, and the Swin Transformer windowed / shifted-window block.
  • convnext: ConvNeXt modern-CNN block (depthwise conv + channel LayerNorm + inverted-bottleneck + layer scale).
  • clip: CLIP vision encoder, projection head, InfoNCE contrastive loss.
  • augment: geometric, photometric, and normalisation image augmentations, plus MixUp / CutMix batch mixing regularisers.
  • imgproc: classical image processing — Sobel gradients and the Canny edge detector, binary/grayscale morphology, union-find connected-component labelling, and the Hough line transform.
  • fpn: Feature Pyramid Network (lateral 1×1 convolutions + top-down pathway).
  • detection: RoI Align, DETR decoder, bipartite set matching, IoU / GIoU / DIoU / CIoU box-regression losses, the RTMDet detector (CSPNeXt backbone + PAFPN neck + decoupled head + SimOTA-lite cost), and the OWL-ViT open-vocabulary detector (per-patch image-text matching).
  • segmentation: the Segment Anything Model (SAM) — ViT image encoder, prompt encoder, and a two-way transformer mask decoder.
  • ssl: the DINOv2 self-supervised distillation recipe — ViT backbone ([CLS] + patch tokens), weight-normalised prototype head, centred / sharpened teacher-student DINO loss, EMA teacher, centering buffer, and the iBOT masked-patch term.
  • text: the CLIP Transformer text encoder — token + positional embeddings, causal self-attention blocks, EOS pooling, and joint-space projection.
  • pointcloud: the Point Transformer vector self-attention layer over kNN neighbourhoods.
  • losses: focal loss (sigmoid & softmax) and soft Dice segmentation loss.
  • ptx_kernels: 7 GPU PTX kernel string generators (SM 7.5–12.0).
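As an illustration of the box-regression losses named in the detection entry, here is a minimal sketch of IoU and GIoU for two axis-aligned boxes. The `[x1, y1, x2, y2]` layout, the type alias, and the function names are assumptions made for this example; they are not taken from the crate's API.

```rust
/// Axis-aligned box as [x1, y1, x2, y2] with x1 <= x2 and y1 <= y2.
/// This layout is an assumption for the sketch, not the crate's type.
type BoxXyxy = [f32; 4];

/// Area of a box; degenerate boxes clamp to zero.
fn box_area(b: &BoxXyxy) -> f32 {
    (b[2] - b[0]).max(0.0) * (b[3] - b[1]).max(0.0)
}

/// Returns (IoU, GIoU) for two boxes.
/// GIoU = IoU - (hull - union) / hull, where hull is the area of the
/// smallest axis-aligned box enclosing both inputs.
fn iou_and_giou(a: &BoxXyxy, b: &BoxXyxy) -> (f32, f32) {
    let eps = 1e-7_f32; // guards against division by zero

    // Intersection rectangle (zero if the boxes do not overlap).
    let iw = (a[2].min(b[2]) - a[0].max(b[0])).max(0.0);
    let ih = (a[3].min(b[3]) - a[1].max(b[1])).max(0.0);
    let inter = iw * ih;
    let union = box_area(a) + box_area(b) - inter;

    // Smallest enclosing box.
    let hw = a[2].max(b[2]) - a[0].min(b[0]);
    let hh = a[3].max(b[3]) - a[1].min(b[1]);
    let hull = hw * hh;

    let iou = inter / (union + eps);
    let giou = iou - (hull - union) / (hull + eps);
    (iou, giou)
}

/// GIoU loss in its usual form: 1 - GIoU, ranging over [0, 2].
fn giou_loss(a: &BoxXyxy, b: &BoxXyxy) -> f32 {
    1.0 - iou_and_giou(a, b).1
}
```

Unlike plain IoU, GIoU stays informative for disjoint boxes: it becomes more negative as the boxes move apart, so the loss still provides a gradient signal.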

No CUDA SDK dependency; all forward passes run on CPU f32 tensors using flat row-major Vec<f32> layouts.
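To make the flat row-major layout concrete, the sketch below indexes a CHW image stored in a single `Vec<f32>` and computes per-channel means. Both helpers are hypothetical and written only for this illustration; the crate's own functions may use different names and signatures.

```rust
/// Offset of element (c, y, x) in a CHW tensor stored as one flat
/// row-major buffer. Hypothetical helper, not part of the crate.
fn chw_index(c: usize, y: usize, x: usize, height: usize, width: usize) -> usize {
    (c * height + y) * width + x
}

/// Mean of each channel plane of a flat CHW buffer.
fn channel_means(data: &[f32], channels: usize, height: usize, width: usize) -> Vec<f32> {
    let plane = height * width;
    assert!(plane > 0, "height and width must be non-zero");
    assert_eq!(data.len(), channels * plane, "buffer length must equal C*H*W");

    // In row-major CHW order each channel occupies one contiguous
    // chunk of H*W values, so chunks_exact walks channel by channel.
    data.chunks_exact(plane)
        .map(|p| p.iter().sum::<f32>() / plane as f32)
        .collect()
}
```

Because each channel is contiguous, per-channel operations such as normalisation can run over plain slices without any stride arithmetic.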

Re-exports

pub use error::VisionError;
pub use error::VisionResult;
pub use handle::LcgRng;
pub use handle::SmVersion;
pub use handle::VisionHandle;

Modules

augment
Image augmentation pipeline for CHW tensors.
blocks
EfficientNet-style building blocks.
clip
CLIP (Contrastive Language–Image Pre-Training) vision module.
convnext
ConvNeXt modern-CNN components.
detection
Object detection components.
error
Error types for oxicuda-vision.
fpn
Feature Pyramid Network (FPN) components.
handle
Session handle for oxicuda-vision.
imgproc
Classical image-processing primitives operating on flat f32 buffers.
losses
Loss functions for classification, detection, and segmentation heads.
optimize
Inference-time optimization passes for vision models.
patch_embed
Patch embedding for Vision Transformers.
pointcloud
Point-cloud neural network primitives.
prelude
ptx_kernels
PTX GPU kernel sources for vision model operations.
segmentation
Image segmentation models.
ssl
Self-supervised learning (SSL) methods for vision backbones.
text
Text encoders for vision-language models.
vit
Vision Transformer (ViT) components.
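
The patch_embed module is described as providing sinusoidal positional encodings. The sketch below shows the standard fixed sin/cos formulation written into a flat row-major `[num_tokens, dim]` buffer. The function name, the base of 10000, and the interleaved sin/cos ordering are assumptions borrowed from the original Transformer formulation; the crate's implementation may differ.

```rust
/// Fixed sinusoidal positional encodings, returned as a flat row-major
/// buffer of shape [num_tokens, dim]. Hypothetical helper for illustration.
///
/// For position `pos` and pair index `i`:
///   pe[pos, 2i]     = sin(pos / 10000^(2i / dim))
///   pe[pos, 2i + 1] = cos(pos / 10000^(2i / dim))
fn sinusoidal_positions(num_tokens: usize, dim: usize) -> Vec<f32> {
    assert!(dim % 2 == 0, "dim must be even so sin/cos pair up");
    let mut pe = vec![0.0_f32; num_tokens * dim];
    for pos in 0..num_tokens {
        for i in 0..dim / 2 {
            // Frequency decays geometrically with the pair index.
            let exponent = (2 * i) as f32 / dim as f32;
            let freq = 1.0 / 10000.0_f32.powf(exponent);
            let angle = pos as f32 * freq;
            pe[pos * dim + 2 * i] = angle.sin();
            pe[pos * dim + 2 * i + 1] = angle.cos();
        }
    }
    pe
}
```

The resulting buffer would typically be added element-wise to the patch embeddings before the first encoder block.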