§jepa-vision
Vision Transformer (ViT) encoders and predictors for image and video JEPA.
This crate provides the concrete vision modules that implement the
abstract traits defined in jepa_core:
Image / Video
      │
      ▼
┌────────────┐   ┌──────────────────┐
│ Patch /    │──►│ ViT Encoder      │──► Representation
│ Tubelet    │   │ (+ 2D/3D RoPE)   │    [B, S, D]
│ Embedding  │   └──────────────────┘
└────────────┘
§Modules
| Module | Contents | Reference |
|---|---|---|
| patch | PatchEmbedding: 2D image patchification + linear projection | ViT (Dosovitskiy 2021) |
| rope | RotaryPositionEncoding2D: 2D rotary position encoding | RoFormer (Su 2021) |
| vit | VitEncoder: image ViT with configurable presets (Tiny → giant) | |
| image | TransformerPredictor, IJepa: I-JEPA pipeline with forward_step_strict | Assran et al. (2023) |
| video | VitVideoEncoder, VJepa: V-JEPA with 3D tubelets + 3D RoPE | Bardes et al. (2024) |
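The sequence length S in the [B, S, D] representation follows from how the input is cut into patches or tubelets. The sketch below shows that arithmetic in plain Rust, assuming non-overlapping square patches and tubelets, which is the standard ViT scheme. The function names `image_tokens` and `video_tokens` are hypothetical and not part of jepa_vision, and rejecting non-divisible sizes is an assumption made here for illustration; the crate may handle such inputs differently.

```rust
/// Token count for an image split into non-overlapping `patch` x `patch` squares.
/// Hypothetical helper, not a jepa_vision API.
/// Returns `None` when the patch size is zero or does not divide both dimensions
/// (an assumption of this sketch, not documented crate behavior).
fn image_tokens(height: usize, width: usize, patch: usize) -> Option<usize> {
    if patch == 0 || height % patch != 0 || width % patch != 0 {
        return None;
    }
    Some((height / patch) * (width / patch))
}

/// Token count for a video split into tubelets spanning `tubelet_t` frames
/// and `patch` x `patch` pixels. Hypothetical helper, not a jepa_vision API.
fn video_tokens(
    frames: usize,
    height: usize,
    width: usize,
    tubelet_t: usize,
    patch: usize,
) -> Option<usize> {
    if tubelet_t == 0 || frames % tubelet_t != 0 {
        return None;
    }
    // Each temporal slice contributes one grid of spatial patches.
    image_tokens(height, width, patch).map(|per_slice| (frames / tubelet_t) * per_slice)
}
```

For example, a 224 × 224 image with 16 × 16 patches gives 14 × 14 = 196 tokens, and a 16-frame clip at the same resolution with tubelets of 2 frames gives 8 × 196 = 1568 tokens.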
§Quick start
use jepa_vision::vit::VitConfig;
use jepa_core::Encoder;
use burn_ndarray::NdArray;
type B = NdArray<f32>;
let device = burn_ndarray::NdArrayDevice::Cpu;
// Tiny ViT for tests; use VitConfig::vit_base_patch16() for real workloads
let encoder = VitConfig::tiny_test().init::<B>(&device);
assert_eq!(encoder.embed_dim(), 32);