Crate jepa_vision

§jepa-vision

Vision Transformer (ViT) encoders and predictors for image and video JEPA.

This crate provides the concrete vision modules that implement the abstract traits defined in jepa_core:

 Image / Video
      │
      ▼
┌─────────────┐   ┌──────────────────┐
│ Patch /     │──►│  ViT Encoder     │──► Representation
│ Tubelet     │   │  (+ 2D/3D RoPE)  │    [B, S, D]
│ Embedding   │   └──────────────────┘
└─────────────┘
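
The `[B, S, D]` shape in the diagram can be made concrete with a little arithmetic. The sketch below is not part of jepa_vision: the function names (`image_seq_len`, `video_seq_len`, `representation_shape`) are hypothetical, and it assumes non-overlapping square patches and tubelets with no extra tokens (such as a class token), which may not match the crate's encoders exactly.

```rust
/// Hypothetical helper: number of tokens S when an image is cut into
/// non-overlapping `patch` x `patch` squares.
/// Returns `None` if the image is not evenly divisible by the patch size.
fn image_seq_len(height: usize, width: usize, patch: usize) -> Option<usize> {
    if patch == 0 || height % patch != 0 || width % patch != 0 {
        return None;
    }
    Some((height / patch) * (width / patch))
}

/// Hypothetical helper: number of tokens S when a clip is cut into
/// non-overlapping tubelets spanning `tubelet_t` frames and
/// `patch` x `patch` pixels.
fn video_seq_len(
    frames: usize,
    height: usize,
    width: usize,
    tubelet_t: usize,
    patch: usize,
) -> Option<usize> {
    if tubelet_t == 0 || frames % tubelet_t != 0 {
        return None;
    }
    // Tokens per temporal slice times the number of temporal slices.
    image_seq_len(height, width, patch).map(|per_slice| per_slice * (frames / tubelet_t))
}

/// Shape of the encoder output under the assumptions above: [B, S, D].
fn representation_shape(batch: usize, seq_len: usize, embed_dim: usize) -> [usize; 3] {
    [batch, seq_len, embed_dim]
}
```

Under these assumptions, a 224×224 image with 16×16 patches gives S = 14 × 14 = 196 tokens, and a 16-frame clip at the same resolution with tubelets spanning 2 frames gives S = 8 × 196 = 1568.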

§Modules

| Module | Contents | Reference |
|--------|----------|-----------|
| patch | PatchEmbedding: 2D image patchification + linear projection | ViT (Dosovitskiy 2021) |
| rope | RotaryPositionEncoding2D: 2D rotary position encoding | RoFormer (Su 2021) |
| vit | VitEncoder: image ViT with configurable presets (Tiny → giant) | |
| image | TransformerPredictor, IJepa: I-JEPA pipeline with forward_step_strict | Assran et al. (2023) |
| video | VitVideoEncoder, VJepa: V-JEPA with 3D tubelets + 3D RoPE | Bardes et al. (2024) |
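
To illustrate what a 2D rotary position encoding does, here is a minimal sketch of one common "axial" formulation: the first half of a feature vector is rotated according to the patch's row index and the second half according to its column index. This is an assumption for illustration only. The names `rope_1d` and `rope_2d` are hypothetical, and the crate's RotaryPositionEncoding2D may lay out channels or frequencies differently.

```rust
/// Hypothetical 1D RoPE: rotate each consecutive pair (x[2i], x[2i+1])
/// by the angle pos * base^(-2i/d), where d is the slice length.
fn rope_1d(x: &mut [f32], pos: f32, base: f32) {
    let d = x.len();
    for i in 0..d / 2 {
        let theta = pos * base.powf(-(2.0 * i as f32) / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}

/// Hypothetical axial 2D RoPE: the first half of the vector encodes the
/// row position and the second half encodes the column position.
fn rope_2d(x: &mut [f32], row: usize, col: usize, base: f32) {
    // Each half must hold whole (even, odd) pairs.
    assert!(x.len() % 4 == 0, "feature dimension must be a multiple of 4");
    let (row_half, col_half) = x.split_at_mut(x.len() / 2);
    rope_1d(row_half, row as f32, base);
    rope_1d(col_half, col as f32, base);
}
```

Because every pair is only rotated, the vector's norm is unchanged, and a token at position (0, 0) passes through unmodified.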

§Quick start

use jepa_vision::vit::VitConfig;
use jepa_core::Encoder;
use burn_ndarray::NdArray;

type B = NdArray<f32>;
let device = burn_ndarray::NdArrayDevice::Cpu;

// Tiny ViT for tests; use VitConfig::vit_base_patch16() for real workloads
let encoder = VitConfig::tiny_test().init::<B>(&device);
assert_eq!(encoder.embed_dim(), 32);

Modules§

image    I-JEPA (Image Joint Embedding Predictive Architecture) pipeline.
patch    Patch embedding for images.
rope     Rotary Position Embedding (RoPE) for 2D spatial positions.
video    V-JEPA video encoder with 3D tubelets and 3D RoPE.
vit      Vision Transformer (ViT) encoder for JEPA.