Struct ClipTextEncoder

pub struct ClipTextEncoder {
    pub config: ClipTextConfig,
    pub token_embedding: Vec<f32>,
    pub positional_embedding: Vec<f32>,
    /* private fields */
}

The CLIP Transformer text tower.

Fields§

§config: ClipTextConfig

Configuration.

§token_embedding: Vec<f32>

Token embedding table: [vocab_size · width] row-major.

§positional_embedding: Vec<f32>

Learned positional embedding: [n_ctx · width] row-major.
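Because both tables are flat row-major buffers, row `i` occupies the range `i * width .. (i + 1) * width`. The following is a minimal sketch of that lookup; `embedding_row` is a hypothetical helper written for illustration and is not part of this crate's API.

```rust
/// Borrow row `row` of a row-major `[rows · width]` table stored as a flat slice.
///
/// Hypothetical helper for illustration only. Returns `None` when the row
/// lies outside the table or the index arithmetic would overflow.
fn embedding_row(table: &[f32], width: usize, row: usize) -> Option<&[f32]> {
    let start = row.checked_mul(width)?;
    let end = start.checked_add(width)?;
    table.get(start..end)
}
```

With a table of two rows and `width = 2`, `embedding_row(&table, 2, 1)` borrows elements 2 and 3 of the buffer; the same indexing would apply to `positional_embedding` with a position in place of a token id.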

Implementations§

impl ClipTextEncoder

pub fn new(cfg: ClipTextConfig, rng: &mut LcgRng) -> VisionResult<Self>

Construct a CLIP text encoder with Gaussian-initialised weights.

§Errors

Propagates validation errors raised by the configuration or by sub-component construction.

pub fn eot_position(&self, tokens: &[usize]) -> usize

Locate the pooling position for a token sequence.

Following CLIP, the joint embedding is read at the position of the end-of-text token. We select the position of the last occurrence of eot_token; if it never appears, we fall back to CLIP’s argmax convention (the position of the highest token id), and as a final fallback the last index.
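The fallback chain described above can be sketched as a free function. This is an illustration under stated assumptions, not the crate's implementation: `eot_position_sketch` is a hypothetical name, `eot_token` is passed explicitly rather than read from the configuration, ties in the argmax step are assumed to resolve to the first occurrence, and an empty input is assumed to map to index 0.

```rust
/// Sketch of the pooling-position rule (hypothetical, for illustration).
fn eot_position_sketch(tokens: &[usize], eot_token: usize) -> usize {
    // 1. Prefer the last occurrence of the end-of-text token.
    if let Some(pos) = tokens.iter().rposition(|&t| t == eot_token) {
        return pos;
    }

    // 2. Otherwise use the argmax convention: the position of the highest
    //    token id. Assumption: ties resolve to the first occurrence.
    let mut best: Option<(usize, usize)> = None; // (position, token id)
    for (i, &t) in tokens.iter().enumerate() {
        match best {
            Some((_, best_id)) if t <= best_id => {}
            _ => best = Some((i, t)),
        }
    }
    if let Some((pos, _)) = best {
        return pos;
    }

    // 3. Final fallback: the last index. Assumption: an empty slice yields 0.
    tokens.len().saturating_sub(1)
}
```

Selecting the last occurrence matters when the end-of-text id also serves as padding, since the first occurrence and the last would then differ.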

pub fn hidden_states(&self, tokens: &[usize]) -> VisionResult<Vec<f32>>

Run the full encoder and return the contextual hidden states, taken after the final LayerNorm but before pooling and projection, as a flat [n · width] buffer, where n is the number of input tokens.

Exposed so causality tests can probe individual token hidden states.

§Errors

VisionError::EmptyInput if tokens is empty.
VisionError::Internal if the sequence is longer than n_ctx or a token id is out of the vocabulary range.
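The error conditions above amount to three checks on the input. The sketch below mirrors them with a stand-in error type; `SketchError`, `validate_tokens`, the order of the checks, and the message strings are all assumptions made for illustration, and the real `VisionError` variants may carry different payloads.

```rust
/// Stand-in for the two `VisionError` variants named above (hypothetical).
#[derive(Debug, PartialEq)]
enum SketchError {
    EmptyInput,
    Internal(String),
}

/// Check a token sequence against the documented preconditions.
/// `n_ctx` and `vocab_size` are passed explicitly here; the encoder would
/// presumably read them from its configuration.
fn validate_tokens(
    tokens: &[usize],
    n_ctx: usize,
    vocab_size: usize,
) -> Result<(), SketchError> {
    if tokens.is_empty() {
        return Err(SketchError::EmptyInput);
    }
    if tokens.len() > n_ctx {
        return Err(SketchError::Internal(format!(
            "sequence length {} exceeds n_ctx {}",
            tokens.len(),
            n_ctx
        )));
    }
    // Valid token ids are 0..vocab_size.
    if let Some(&bad) = tokens.iter().find(|&&t| t >= vocab_size) {
        return Err(SketchError::Internal(format!(
            "token id {} is outside vocabulary of size {}",
            bad, vocab_size
        )));
    }
    Ok(())
}
```

Running checks like these before any embedding lookup keeps out-of-range ids from turning into out-of-bounds slice accesses on the row-major tables.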
