pub struct ClipVisionEncoder {
    pub config: ClipVisionConfig,
    pub patch_embed: PatchEmbed,
    pub pos_embed: LearnablePosEmbed,
    pub encoder: ViTEncoder,
    pub cls_token: Vec<f32>,
}
CLIP vision encoder: a ViT backbone that produces a single [embed_dim]
CLS-token embedding per image.
Pipeline:
image [C × H × W]
→ patch_embed → [n_patches, embed_dim]
→ prepend_cls → [n_patches + 1, embed_dim]
→ add_pos_embed → [n_patches + 1, embed_dim]
→ encoder → [n_patches + 1, embed_dim]
→ tokens[0] → [embed_dim] (CLS token output)
Fields§
config: ClipVisionConfig
Full configuration.
patch_embed: PatchEmbed
Strided Conv2D patch embedder.
pos_embed: LearnablePosEmbed
Learnable positional embeddings: n_patches + 1 positions (incl. CLS).
encoder: ViTEncoder
Stack of ViT transformer blocks with final layer-norm.
cls_token: Vec<f32>
CLS token: flat [embed_dim], Gaussian-initialised with scale 0.02.
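The shape bookkeeping in the pipeline above can be sketched on flat, row-major buffers. This is a minimal illustration, not the crate's implementation: the helper names and the `patch_size` parameter are assumptions introduced here, and the patch embedder and transformer stack are omitted.

```rust
/// Number of patches for a square image cut into square, non-overlapping patches.
/// `patch_size` is a hypothetical parameter; assumes `img_size` is a multiple of it.
fn n_patches(img_size: usize, patch_size: usize) -> usize {
    let per_side = img_size / patch_size;
    per_side * per_side
}

/// Prepend the CLS token to a row-major `[n_patches, embed_dim]` buffer,
/// giving a `[n_patches + 1, embed_dim]` buffer.
fn prepend_cls(patches: &[f32], cls_token: &[f32]) -> Vec<f32> {
    let mut tokens = Vec::with_capacity(cls_token.len() + patches.len());
    tokens.extend_from_slice(cls_token);
    tokens.extend_from_slice(patches);
    tokens
}

/// Add a positional table of the same shape element-wise, in place.
fn add_pos_embed(tokens: &mut [f32], pos: &[f32]) {
    assert_eq!(tokens.len(), pos.len(), "positional table must match token shape");
    for (t, p) in tokens.iter_mut().zip(pos) {
        *t += *p;
    }
}

/// Row 0 of the `[n_patches + 1, embed_dim]` buffer is the CLS position.
fn cls_output(tokens: &[f32], embed_dim: usize) -> &[f32] {
    &tokens[..embed_dim]
}
```

Because the CLS token is prepended, it always occupies row 0, which is why the final step can read `tokens[0]` without tracking an index.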
Implementations§
impl ClipVisionEncoder
pub fn new(cfg: ClipVisionConfig, rng: &mut LcgRng) -> VisionResult<Self>
Construct a new CLIP vision encoder.
Initialises:
- Patch embedder (Conv2D kernel, bias).
- Learnable positional embedding table with n_patches + 1 rows.
- ViT encoder stack.
- CLS token vector (N(0, 0.02²)).
§Errors
Propagates any errors from the sub-component constructors.
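As an illustration of the N(0, 0.02²) CLS-token initialisation, the sketch below draws Gaussian samples from a linear congruential generator using the Box-Muller transform. `SketchLcg`, `next_uniform`, `next_gaussian`, and `init_cls_token` are hypothetical names: the real `LcgRng` API is not shown on this page, and whether the crate uses Box-Muller is an assumption.

```rust
/// Hypothetical stand-in for the crate's `LcgRng`; its real API is not shown here.
struct SketchLcg(u64);

impl SketchLcg {
    /// Advance the state and return a uniform sample in (0, 1].
    fn next_uniform(&mut self) -> f32 {
        // 64-bit LCG step with Knuth's MMIX constants.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Keep the top 24 bits so the value converts to f32 exactly.
        let bits = (self.0 >> 40) as f32;
        (bits + 1.0) / 16_777_216.0
    }
}

/// One standard-normal sample via the Box-Muller transform (assumed method).
fn next_gaussian(rng: &mut SketchLcg) -> f32 {
    let u1 = rng.next_uniform(); // strictly positive, so ln(u1) is finite
    let u2 = rng.next_uniform();
    (-2.0 * u1.ln()).sqrt() * (2.0 * std::f32::consts::PI * u2).cos()
}

/// Flat `[embed_dim]` vector with entries drawn from N(0, 0.02²).
fn init_cls_token(embed_dim: usize, rng: &mut SketchLcg) -> Vec<f32> {
    (0..embed_dim).map(|_| 0.02 * next_gaussian(rng)).collect()
}
```

Passing the generator by mutable reference, as `new` does with `rng`, means two encoders built from the same seed in the same order receive identical initial values.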
pub fn forward_single(&self, image: &[f32]) -> VisionResult<Vec<f32>>
Run the encoder on a single image and return the CLS embedding.
§Parameters
- image: flat [in_chans × img_size × img_size] CHW buffer.
§Returns
[embed_dim] CLS-token embedding.
§Errors
Returns VisionError::DimensionMismatch if the image buffer size does not
match the configured in_chans × img_size × img_size dimensions.
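To make the flat CHW layout concrete, the following sketch computes the offset of a pixel and checks the buffer length before use. `chw_index` and `check_image_len` are hypothetical helpers, not part of the crate, and the plain `Result<(), usize>` stands in for `VisionError::DimensionMismatch`, whose fields are not shown here.

```rust
/// Flat offset of (channel `c`, row `y`, column `x`) in a CHW buffer
/// holding a square `img_size × img_size` image.
fn chw_index(c: usize, y: usize, x: usize, img_size: usize) -> usize {
    (c * img_size + y) * img_size + x
}

/// Length check for the documented `[in_chans × img_size × img_size]` shape.
/// On mismatch, returns the expected length as the error value.
fn check_image_len(image: &[f32], in_chans: usize, img_size: usize) -> Result<(), usize> {
    let expected = in_chans * img_size * img_size;
    if image.len() == expected {
        Ok(())
    } else {
        Err(expected)
    }
}
```

In CHW order each channel is stored as one contiguous plane, so interleaved HWC data (for example RGBRGB... from an image decoder) would need to be transposed before being passed in.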
pub fn forward_batch(
    &self,
    images: &[f32],
    batch_size: usize,
) -> VisionResult<Vec<Vec<f32>>>
Run the encoder on a batch of images.
§Parameters
- images: flat [batch × in_chans × img_size × img_size] buffer.
- batch_size: number of images.
§Returns
Vec<Vec<f32>> of length batch_size; each element is an [embed_dim] embedding.
§Errors
Returns VisionError::DimensionMismatch if the flat buffer length
does not match batch_size × in_chans × img_size × img_size, or if
any individual forward pass fails.
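The batch contract described above (validate the total length, then process one image-sized slice at a time) could look roughly like this. It is a sketch under assumptions: `forward_batch_sketch` is a hypothetical free function, `forward` stands in for a per-image call such as `forward_single`, and `String` replaces the crate's error type.

```rust
/// Split a flat batch buffer into per-image slices and run `forward` on each.
/// Stops at the first per-image error and returns it.
fn forward_batch_sketch<F>(
    images: &[f32],
    batch_size: usize,
    image_len: usize,
    forward: F,
) -> Result<Vec<Vec<f32>>, String>
where
    F: Fn(&[f32]) -> Result<Vec<f32>, String>,
{
    let expected = batch_size * image_len;
    // `chunks_exact` panics on a zero chunk size, so reject that case too.
    if image_len == 0 || images.len() != expected {
        return Err(format!("expected {} values, got {}", expected, images.len()));
    }
    // Collecting into `Result` short-circuits on the first failing image.
    images.chunks_exact(image_len).map(|img| forward(img)).collect()
}
```

Checking the total length up front means every chunk is guaranteed to be a full image, so a per-image failure can only come from the forward pass itself.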