Crate oxicuda_vision

oxicuda-vision — Vision Transformer & CLIP primitives for OxiCUDA.

Pure-Rust CPU reference implementation providing:

patch_embed: strided Conv2D patch embedder, sinusoidal & learnable positional encodings.
vit: ViT block (pre-norm MHSA + MLP), encoder stack, full ViT model, and the Swin Transformer windowed / shifted-window block.
convnext: ConvNeXt modern-CNN block (depthwise conv + channel LayerNorm + inverted-bottleneck + layer scale).
clip: CLIP vision encoder, projection head, InfoNCE contrastive loss.
augment: geometric, photometric, and normalisation image augmentations, plus MixUp / CutMix batch mixing regularisers.
imgproc: classical image processing — Sobel gradients and the Canny edge detector, binary/grayscale morphology, union-find connected-component labelling, and the Hough line transform.
fpn: Feature Pyramid Network (lateral 1×1 convolutions + top-down pathway).
detection: RoI Align, DETR decoder, bipartite set matching, IoU / GIoU / DIoU / CIoU box-regression losses, the RTMDet detector (CSPNeXt backbone + PAFPN neck + decoupled head + SimOTA-lite cost), and the OWL-ViT open-vocabulary detector (per-patch image-text matching).
segmentation: the Segment Anything Model (SAM) — ViT image encoder, prompt encoder, and a two-way transformer mask decoder.
ssl: the DINOv2 self-supervised distillation recipe — ViT backbone ([CLS] + patch tokens), weight-normalised prototype head, centred / sharpened teacher-student DINO loss, EMA teacher, centering buffer, and the iBOT masked-patch term.
text: the CLIP Transformer text encoder — token + positional embeddings, causal self-attention blocks, EOS pooling, and joint-space projection.
pointcloud: the Point Transformer vector self-attention layer over kNN neighbourhoods.
losses: focal loss (sigmoid & softmax) and soft Dice segmentation loss.
ptx_kernels: 7 GPU PTX kernel string generators (SM 7.5–12.0).
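As an illustration of the sigmoid focal loss named in the `losses` entry, here is a minimal per-element sketch. The function names `sigmoid_focal_loss` and `mean_focal_loss` and their parameters are hypothetical and are not taken from the crate's API; the formula is the standard alpha-balanced form, FL = -alpha_t * (1 - p_t)^gamma * ln(p_t).

```rust
/// Sigmoid focal loss for one logit/target pair.
/// Hypothetical helper, not the crate's actual signature.
/// `target` is 0.0 or 1.0; `alpha` balances classes; `gamma` down-weights easy examples.
fn sigmoid_focal_loss(logit: f32, target: f32, alpha: f32, gamma: f32) -> f32 {
    // Predicted probability of the positive class.
    let p = 1.0 / (1.0 + (-logit).exp());
    // p_t is the probability assigned to the true class.
    let positive = target > 0.5;
    let p_t = if positive { p } else { 1.0 - p };
    let alpha_t = if positive { alpha } else { 1.0 - alpha };
    // Clamp before the logarithm to avoid ln(0).
    -alpha_t * (1.0 - p_t).powf(gamma) * p_t.max(1e-12).ln()
}

/// Mean focal loss over flat slices of logits and targets (assumed equal length).
fn mean_focal_loss(logits: &[f32], targets: &[f32], alpha: f32, gamma: f32) -> f32 {
    assert_eq!(logits.len(), targets.len());
    if logits.is_empty() {
        return 0.0;
    }
    let sum: f32 = logits
        .iter()
        .zip(targets)
        .map(|(&z, &t)| sigmoid_focal_loss(z, t, alpha, gamma))
        .sum();
    sum / logits.len() as f32
}
```

With `gamma = 0.0` and `alpha = 0.5` this reduces to half the binary cross-entropy, which is a convenient sanity check.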

No CUDA SDK dependency; all forward passes run on CPU f32 tensors using flat row-major Vec<f32> layouts.
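To make the flat row-major layout concrete, the sketch below shows how a channel-height-width (CHW) image could be indexed inside a single `Vec<f32>`. The helpers `chw_index` and `get_chw` are illustrative names only; the crate's own accessors are not shown on this page and may differ.

```rust
/// Flat offset of element (c, y, x) in a row-major CHW buffer.
/// Hypothetical helper for illustration.
fn chw_index(c: usize, y: usize, x: usize, height: usize, width: usize) -> usize {
    (c * height + y) * width + x
}

/// Bounds-checked read from a CHW buffer; `dims` is (channels, height, width).
fn get_chw(data: &[f32], dims: (usize, usize, usize), c: usize, y: usize, x: usize) -> Option<f32> {
    let (channels, height, width) = dims;
    if c >= channels || y >= height || x >= width {
        return None;
    }
    data.get(chw_index(c, y, x, height, width)).copied()
}
```

For a 2×3×4 tensor, the element at channel 1, row 2, column 3 sits at offset (1 * 3 + 2) * 4 + 3 = 23, the last slot of the 24-element buffer.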

Re-exports§

pub use error::VisionError;
pub use error::VisionResult;
pub use handle::LcgRng;
pub use handle::SmVersion;
pub use handle::VisionHandle;

Modules§

augment: Image augmentation pipeline for CHW tensors.
blocks: EfficientNet-style building blocks.
clip: CLIP (Contrastive Language–Image Pre-Training) vision module.
convnext: ConvNeXt modern-CNN components.
detection: Object detection components.
error: Error types for oxicuda-vision.
fpn: Feature Pyramid Network (FPN) components.
handle: Session handle for oxicuda-vision.
imgproc: Classical image-processing primitives operating on flat f32 buffers.
losses: Loss functions for classification, detection, and segmentation heads.
optimize: Inference-time optimization passes for vision models.
patch_embed: Patch embedding for Vision Transformers.
pointcloud: Point-cloud neural network primitives.
prelude
ptx_kernels: PTX GPU kernel sources for vision model operations.
segmentation: Image segmentation models.
ssl: Self-supervised learning (SSL) methods for vision backbones.
text: Text encoders for vision-language models.
vit: Vision Transformer (ViT) components.
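For readers unfamiliar with the box-regression terms listed under `detection`, the following is a minimal sketch of IoU and GIoU for axis-aligned boxes. It assumes an `[x1, y1, x2, y2]` corner format and a hypothetical function name `iou_giou`; neither is confirmed by this page, and the crate's real interface may use a different box encoding.

```rust
/// IoU and GIoU of two axis-aligned boxes given as [x1, y1, x2, y2].
/// Hypothetical helper; the box format is an assumption.
fn iou_giou(a: [f32; 4], b: [f32; 4]) -> (f32, f32) {
    let eps = 1e-7_f32;
    let area = |r: [f32; 4]| (r[2] - r[0]).max(0.0) * (r[3] - r[1]).max(0.0);

    // Intersection rectangle, clamped to zero when the boxes do not overlap.
    let iw = (a[2].min(b[2]) - a[0].max(b[0])).max(0.0);
    let ih = (a[3].min(b[3]) - a[1].max(b[1])).max(0.0);
    let inter = iw * ih;
    let union = area(a) + area(b) - inter;

    // Smallest enclosing box of both inputs.
    let cw = a[2].max(b[2]) - a[0].min(b[0]);
    let ch = a[3].max(b[3]) - a[1].min(b[1]);
    let hull = cw * ch;

    let iou = inter / (union + eps);
    // GIoU subtracts the fraction of the enclosing box not covered by the union.
    let giou = iou - (hull - union) / (hull + eps);
    (iou, giou)
}
```

The corresponding losses are commonly taken as `1.0 - iou` and `1.0 - giou`; unlike IoU, GIoU stays informative for disjoint boxes because it goes negative as they move apart.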
