pub struct VitEncoder<B: Backend> { /* private fields */ }
Vision Transformer encoder.
Maps images to patch-level representations via:
- Patch embedding (linear projection of flattened patches)
- 2D Rotary Position Encoding
- Stack of transformer blocks (self-attention + MLP)
- Final layer normalization
Output shape: [batch, num_patches, embed_dim]
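The output shape above can be made concrete with a small geometry sketch. It assumes square, non-overlapping patches, image sides divisible by the patch size, and row-major flattening of the patch grid; none of these are confirmed by this page. `PatchGrid` and its methods are hypothetical names used only for illustration and are not part of this crate's API.

```rust
/// Illustrative patch-grid geometry (hypothetical helper, not the crate's API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PatchGrid {
    pub rows: usize,
    pub cols: usize,
}

impl PatchGrid {
    /// Builds the grid, or returns `None` if the image does not split
    /// evenly into `patch` x `patch` tiles (an assumption of this sketch).
    pub fn new(height: usize, width: usize, patch: usize) -> Option<Self> {
        if patch == 0 || height % patch != 0 || width % patch != 0 {
            return None;
        }
        Some(Self {
            rows: height / patch,
            cols: width / patch,
        })
    }

    /// Total number of patch tokens produced for one image.
    pub fn num_patches(&self) -> usize {
        self.rows * self.cols
    }

    /// Maps a row-major flattened index to (row, col), the pair of
    /// coordinates a 2D position encoding would consume.
    pub fn coords(&self, index: usize) -> Option<(usize, usize)> {
        if index < self.num_patches() {
            Some((index / self.cols, index % self.cols))
        } else {
            None
        }
    }

    /// Shape of the encoder output: [batch, num_patches, embed_dim].
    pub fn output_shape(&self, batch: usize, embed_dim: usize) -> [usize; 3] {
        [batch, self.num_patches(), embed_dim]
    }
}
```

Under these assumptions, a 224x224 image with 16x16 patches gives a 14x14 grid, so a batch of 2 with an embedding width of 768 would produce an output of shape [2, 196, 768].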
Implementations
impl<B: Backend> VitEncoder<B>
pub fn forward(&self, images: &Tensor<B, 4>) -> Representation<B>
pub fn forward_visible_tokens(
    &self,
    images: &Tensor<B, 4>,
    visible_indices: &[usize],
) -> Representation<B>
Encode only the visible patch tokens for strict JEPA context encoding.
The image is patchified and position-encoded over the full grid, so the surviving tokens keep their original flattened positions. Masked tokens are then removed before self-attention runs.
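The ordering described here (encode positions first, drop masked tokens second) can be sketched in plain Rust without any tensor backend. `Token`, `encode_positions`, and `keep_visible` are hypothetical stand-ins, and tagging a token with its index is only a toy substitute for the real 2D rotary encoding.

```rust
/// A patch token tagged with its flattened grid position (toy stand-in
/// for a position-encoded embedding; not the crate's representation).
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub position: usize,
    pub features: Vec<f32>,
}

/// Step 1: attach positions over the FULL grid, before any masking.
pub fn encode_positions(patches: Vec<Vec<f32>>) -> Vec<Token> {
    patches
        .into_iter()
        .enumerate()
        .map(|(position, features)| Token { position, features })
        .collect()
}

/// Step 2: keep only the tokens named by `visible_indices`, in the order
/// given. Returns `None` if any index is out of range.
pub fn keep_visible(tokens: &[Token], visible_indices: &[usize]) -> Option<Vec<Token>> {
    visible_indices
        .iter()
        .map(|&i| tokens.get(i).cloned())
        .collect()
}
```

Because positions are assigned before the gather, a token kept at slot 1 of the shortened sequence still carries its full-grid position (for example 3), which is the property the description relies on. Assigning positions after the gather would instead renumber the survivors 0, 1, 2, and so on.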
pub fn load_named_tensors(
    self,
    tensors: &HashMap<String, TensorData>,
) -> Result<Self, VitLoadError>
Load a ViT encoder from a map of burn-style parameter names to tensor data.
Expected parameter names match the burn module record layout, for example
patch_embed.projection.weight and blocks.0.attn.out_proj.bias.
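A loader of this kind typically builds dotted parameter names and fails when one is absent. The sketch below illustrates that lookup pattern only: it uses `Vec<f32>` in place of `TensorData`, and `MissingTensor`, `block_param`, and `require` are hypothetical names. The actual variants of `VitLoadError` are not shown on this page and are not reproduced here.

```rust
use std::collections::HashMap;

/// Hypothetical error for this sketch; not the crate's `VitLoadError`.
#[derive(Debug, PartialEq)]
pub struct MissingTensor(pub String);

/// Builds a per-block parameter name such as `blocks.0.attn.out_proj.bias`.
pub fn block_param(index: usize, suffix: &str) -> String {
    format!("blocks.{}.{}", index, suffix)
}

/// Looks up a tensor by name, reporting the missing name on failure.
/// `Vec<f32>` stands in for the real tensor data type.
pub fn require<'a>(
    tensors: &'a HashMap<String, Vec<f32>>,
    name: &str,
) -> Result<&'a [f32], MissingTensor> {
    tensors
        .get(name)
        .map(|values| values.as_slice())
        .ok_or_else(|| MissingTensor(name.to_string()))
}
```

Returning the missing name in the error makes a mismatch between a checkpoint's key layout and the expected record layout easier to diagnose.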
pub fn ema_update_from(self, online: &Self, ema: &Ema, step: usize) -> Self
Update this (target) encoder toward an online encoder using an exponential moving average (EMA).
The returned encoder preserves the gradient setting of the target encoder parameters while detaching the blended tensors from any active autodiff graph.
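For reference, the standard EMA blend is target = m * target + (1 - m) * online, applied element-wise to each parameter. The sketch below shows that arithmetic on plain slices with a fixed momentum `m`. How the crate's `Ema` type derives its momentum from `step` is not documented on this page, so no schedule is modeled, and `ema_blend` is a hypothetical helper.

```rust
/// Element-wise EMA blend: m * target + (1 - m) * online.
/// Returns `None` if the two parameter slices differ in length.
/// Hypothetical helper; the momentum is fixed here, with no step schedule.
pub fn ema_blend(target: &[f32], online: &[f32], momentum: f32) -> Option<Vec<f32>> {
    if target.len() != online.len() {
        return None;
    }
    Some(
        target
            .iter()
            .zip(online)
            .map(|(t, o)| momentum * t + (1.0 - momentum) * o)
            .collect(),
    )
}
```

With a momentum close to 1 the target changes slowly. At exactly 1.0 it does not move at all, and at 0.0 it copies the online parameters.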