oxicuda-vision 0.2.0

Vision Transformer & CLIP primitives for OxiCUDA: ViT patch embedding, multi-head self-attention, CLIP contrastive learning, FPN, RoI align, DETR decoder — pure Rust, zero CUDA SDK dependency.
//! Self-supervised learning (SSL) methods for vision backbones.
//!
//! Provides:
//! - **`dinov2`**: the DINOv2 self-distillation recipe (Oquab et al. 2023),
//!   which combines the image-level DINO objective (Caron et al. 2021) with
//!   the patch-level iBOT masked-image-modelling objective (Zhou et al. 2022).
//!   The module contains:
//!   - a ViT backbone that returns the `[CLS]` token and the patch tokens,
//!   - a weight-normalised prototype projection head,
//!   - a cross-entropy loss between a centred, sharpened teacher distribution
//!     and a softer student distribution,
//!   - an EMA teacher update,
//!   - a centering buffer, and
//!   - the iBOT masked-patch term.

pub mod dinov2;

pub use dinov2::{
    BackboneOutput, CenteringBuffer, DinoBackbone, DinoHead, cross_entropy, dino_loss, ibot_loss,
    student_softmax, teacher_softmax,
};
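
The loss pieces named in the module docs (a centred and sharpened teacher, a softer student, a cross-entropy between the two, and an EMA update) can be sketched in plain Rust as follows. This is a minimal illustration, not the crate's implementation: every function name and signature here is hypothetical, the temperature and momentum values are illustrative assumptions, and the real `dinov2` items (`teacher_softmax`, `student_softmax`, `cross_entropy`, `CenteringBuffer`) may look different.

```rust
// Sketch only: hypothetical helpers, NOT the oxicuda-vision API.

/// Numerically stable softmax of `logits / temperature`.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the maximum before exponentiating to avoid overflow.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Teacher distribution: subtract the running centre, then apply a
/// low-temperature (sharpening) softmax.
fn sketch_teacher_probs(logits: &[f32], center: &[f32], teacher_temp: f32) -> Vec<f32> {
    let centred: Vec<f32> = logits.iter().zip(center).map(|(&l, &c)| l - c).collect();
    softmax_with_temperature(&centred, teacher_temp)
}

/// Cross-entropy H(teacher, student) = -sum_k t_k * ln(s_k).
fn sketch_cross_entropy(teacher: &[f32], student: &[f32]) -> f32 {
    -teacher
        .iter()
        .zip(student)
        .map(|(&t, &s)| t * s.max(1e-12).ln()) // clamp to avoid ln(0)
        .sum::<f32>()
}

/// Exponential moving average: target = m * target + (1 - m) * source.
/// The same rule can drive both the teacher weights and the centre.
fn sketch_ema_update(target: &mut [f32], source: &[f32], momentum: f32) {
    for (t, &s) in target.iter_mut().zip(source) {
        *t = momentum * *t + (1.0 - momentum) * s;
    }
}

fn main() {
    // Illustrative prototype logits for one image view.
    let teacher_logits = [0.2_f32, 0.1, 0.0];
    let student_logits = [0.15_f32, 0.12, 0.02];
    let mut center = [0.0_f32; 3];

    // Assumed temperatures: the teacher is sharper than the student.
    let teacher = sketch_teacher_probs(&teacher_logits, &center, 0.04);
    let student = softmax_with_temperature(&student_logits, 0.1);
    let loss = sketch_cross_entropy(&teacher, &student);
    println!("loss = {loss}");

    // Move the centre towards the teacher logits (assumed momentum 0.9).
    sketch_ema_update(&mut center, &teacher_logits, 0.9);
    println!("center = {center:?}");
}
```

In this sketch the centre is subtracted before the teacher softmax, so adding the same constant to every centre entry leaves the teacher distribution unchanged. Only differences between centre entries shift probability mass.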