jepa_vision/
lib.rs

//! # jepa-vision
//!
//! Vision Transformer (ViT) encoders and predictors for image and video JEPA.
//!
//! This crate provides the concrete vision modules that implement the
//! abstract traits defined in [`jepa_core`]:
//!
//! ```text
//!  Image / Video
//!       │
//!       ▼
//! ┌─────────────┐   ┌──────────────────┐
//! │ Patch /     │──►│  ViT Encoder     │──► Representation
//! │ Tubelet     │   │  (+ 2D/3D RoPE)  │    [B, S, D]
//! │ Embedding   │   └──────────────────┘
//! └─────────────┘
//! ```
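//!
//! As a rough guide to the `[B, S, D]` shape in the diagram, the sketch below
//! computes the sequence length `S`. It assumes non-overlapping patches (or
//! tubelets), input sizes that divide evenly, and no extra class token. The
//! helpers `num_patches` and `num_tubelets` are hypothetical illustrations and
//! are not part of this crate's API.
//!
//! ```rust
//! // Number of 2D patches for an image of `height` x `width` pixels.
//! // Hypothetical helper; assumes `patch` divides both dimensions.
//! fn num_patches(height: usize, width: usize, patch: usize) -> usize {
//!     (height / patch) * (width / patch)
//! }
//!
//! // Number of 3D tubelets for a clip of `frames` frames.
//! // Hypothetical helper; each tubelet spans `tubelet` frames and one patch.
//! fn num_tubelets(
//!     frames: usize,
//!     height: usize,
//!     width: usize,
//!     tubelet: usize,
//!     patch: usize,
//! ) -> usize {
//!     (frames / tubelet) * num_patches(height, width, patch)
//! }
//!
//! // A 224x224 image with 16x16 patches gives a 14x14 grid.
//! assert_eq!(num_patches(224, 224, 16), 196);
//! // A 16-frame clip with 2-frame tubelets gives 8 temporal slices of that grid.
//! assert_eq!(num_tubelets(16, 224, 224, 2, 16), 1568);
//! ```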
//!
//! ## Modules
//!
//! | Module | Contents | Reference |
//! |--------|----------|-----------|
//! | [`patch`] | [`PatchEmbedding`](patch::PatchEmbedding) — 2D image patchification + linear projection | ViT (Dosovitskiy 2021) |
//! | [`rope`] | [`RotaryPositionEncoding2D`](rope::RotaryPositionEncoding2D) — 2D rotary position encoding | RoFormer (Su 2021) |
//! | [`vit`] | [`VitEncoder`](vit::VitEncoder) — image ViT with configurable presets (Tiny → giant) | |
//! | [`image`] | [`TransformerPredictor`](image::TransformerPredictor), [`IJepa`](image::IJepa) — I-JEPA pipeline with `forward_step_strict` | Assran et al. (2023) |
//! | [`video`] | [`VitVideoEncoder`](video::VitVideoEncoder), [`VJepa`](video::VJepa) — V-JEPA with 3D tubelets + 3D RoPE | Bardes et al. (2024) |
//!
//! ## Quick start
//!
//! ```rust
//! use jepa_vision::vit::VitConfig;
//! use jepa_core::Encoder;
//! use burn_ndarray::NdArray;
//!
//! type B = NdArray<f32>;
//! let device = burn_ndarray::NdArrayDevice::Cpu;
//!
//! // Tiny ViT for tests; use VitConfig::vit_base_patch16() for real workloads
//! let encoder = VitConfig::tiny_test().init::<B>(&device);
//! assert_eq!(encoder.embed_dim(), 32);
//! ```

pub mod image;
pub mod patch;
pub mod rope;
pub(crate) mod token_ops;
pub mod video;
pub mod vit;