pub struct ClipTextEncoder {
pub config: ClipTextConfig,
pub token_embedding: Vec<f32>,
pub positional_embedding: Vec<f32>,
/* private fields */
}
The CLIP Transformer text tower.
Fields§
config: ClipTextConfig
Configuration.
token_embedding: Vec<f32>
Token embedding table: [vocab_size · width] row-major.
positional_embedding: Vec<f32>
Learned positional embedding: [n_ctx · width] row-major.
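The row-major layout of the two tables can be illustrated with a small lookup sketch. The helper name embed_tokens and its parameters are hypothetical and not part of this crate; it also assumes that token and positional rows are summed per position, as in the standard CLIP text tower, which this page does not state explicitly.

```rust
/// Hypothetical helper (not part of the crate): look up and sum token and
/// positional embeddings for a sequence.
/// Both tables are flat, row-major slices with `width` values per row, so
/// row `r` occupies the range `r * width .. (r + 1) * width`.
fn embed_tokens(
    token_embedding: &[f32],
    positional_embedding: &[f32],
    width: usize,
    tokens: &[usize],
) -> Vec<f32> {
    let mut out = Vec::with_capacity(tokens.len() * width);
    for (pos, &id) in tokens.iter().enumerate() {
        // Row of the token table selected by the token id.
        let tok_row = &token_embedding[id * width..(id + 1) * width];
        // Row of the positional table selected by the sequence position.
        let pos_row = &positional_embedding[pos * width..(pos + 1) * width];
        // Assumption: the two rows are added element-wise.
        out.extend(tok_row.iter().zip(pos_row).map(|(t, p)| t + p));
    }
    out
}
```

This sketch panics on out-of-range ids or positions; a real implementation would validate the inputs first.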
Implementations§
impl ClipTextEncoder
pub fn new(cfg: ClipTextConfig, rng: &mut LcgRng) -> VisionResult<Self>
Construct a CLIP text encoder with Gaussian-initialised weights.
§Errors
Propagates configuration / sub-component validation errors.
pub fn eot_position(&self, tokens: &[usize]) -> usize
Locate the pooling position for a token sequence.
Following CLIP, the joint embedding is read at the position of the
end-of-text token. We select the position of the last occurrence of
eot_token; if it never appears, we fall back to CLIP’s argmax
convention (the position of the highest token id), and as a final
fallback the last index.
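The selection order described above can be sketched as a free function. This is a minimal illustration, not the crate's implementation: the name eot_position_sketch is hypothetical, eot_token is passed explicitly instead of being read from the configuration, and the tie-breaking rule (first occurrence of the highest id) and the empty-input result (0) are assumptions.

```rust
/// Hypothetical stand-alone version of the pooling-position rule.
fn eot_position_sketch(tokens: &[usize], eot_token: usize) -> usize {
    // 1. Prefer the last occurrence of the end-of-text token.
    if let Some(pos) = tokens.iter().rposition(|&t| t == eot_token) {
        return pos;
    }
    // 2. Otherwise use the argmax convention: the position of the highest
    //    token id (assumption: the first occurrence wins on ties).
    let mut best: Option<(usize, usize)> = None; // (position, token id)
    for (i, &t) in tokens.iter().enumerate() {
        if best.map_or(true, |(_, best_id)| t > best_id) {
            best = Some((i, t));
        }
    }
    if let Some((pos, _)) = best {
        return pos;
    }
    // 3. Final fallback: the last index (assumption: 0 for an empty slice).
    tokens.len().saturating_sub(1)
}
```

For example, with eot_token = 9 the sequence [5, 9, 2, 9, 0] pools at position 3, while [1, 7, 3] contains no end-of-text token and pools at position 1, where the highest id sits.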
Self::hidden_states runs the full encoder and returns the contextual hidden
states before pooling and projection: [n · width] for n input tokens, taken
after the final LayerNorm. It is exposed so that causality tests can probe
individual token hidden states.
§Errors
VisionError::EmptyInput if tokens is empty.
VisionError::Internal if the sequence is longer than n_ctx or a token id is out of the vocabulary range.
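The three error conditions can be sketched as an up-front validation step. Everything here is hypothetical: SketchError is a local stand-in whose variant shapes are assumed and may not match VisionError, and validate_tokens, n_ctx and vocab_size are illustrative parameters, not the crate's API.

```rust
/// Local stand-in for the crate's error type (assumed variant shapes).
#[derive(Debug, PartialEq)]
enum SketchError {
    EmptyInput,
    Internal(String),
}

/// Hypothetical input check mirroring the documented error conditions.
fn validate_tokens(
    tokens: &[usize],
    n_ctx: usize,
    vocab_size: usize,
) -> Result<(), SketchError> {
    // Empty input is rejected first.
    if tokens.is_empty() {
        return Err(SketchError::EmptyInput);
    }
    // The sequence must fit into the positional embedding table.
    if tokens.len() > n_ctx {
        return Err(SketchError::Internal(format!(
            "sequence length {} exceeds n_ctx {}",
            tokens.len(),
            n_ctx
        )));
    }
    // Every id must index a row of the token embedding table.
    if let Some(&bad) = tokens.iter().find(|&&t| t >= vocab_size) {
        return Err(SketchError::Internal(format!(
            "token id {} is outside the vocabulary of size {}",
            bad, vocab_size
        )));
    }
    Ok(())
}
```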
pub fn encode(&self, tokens: &[usize]) -> VisionResult<Vec<f32>>
Encode a token sequence to a unit-norm joint-space embedding.
§Returns
[embed_dim] L2-normalised text embedding.
§Errors
Propagates errors from Self::hidden_states.
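The unit-norm property of the returned embedding corresponds to a final L2 normalisation step, sketched below. The helper l2_normalise is hypothetical and not part of the crate; leaving an all-zero vector unchanged is an assumption made here to avoid division by zero.

```rust
/// Hypothetical helper: scale a vector in place to unit L2 norm.
fn l2_normalise(v: &mut [f32]) {
    // Euclidean norm: square root of the sum of squares.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Assumption: a zero vector is left as-is rather than divided by zero.
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}
```

After normalisation, the dot product of two such embeddings equals their cosine similarity.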