pub struct ClipEncoder<B: Backend> { /* private fields */ }
CLIP vision encoder for DeepSeek-OCR.
Implements the CLIP-Large architecture (or configurable variants) for global semantic understanding of visual features. Processes patch embeddings through a stack of transformer blocks with pre-LayerNorm and Quick GELU activation.
§Architecture Details
- Input: RGB images [B, 3, H, W]
- Patch embedding: Splits image into non-overlapping patches
- Class token: Learnable token prepended to sequence
- Position encoding: Added to all tokens
- Transformer: Stack of attention + FFN blocks
- Output: Sequence of embeddings [B, num_patches+1, hidden_size]
§Default Configuration (CLIP-Large)
- 1024-dimensional embeddings
- 24 transformer layers
- 16 attention heads per layer
- 4096-dimensional feed-forward hidden layer
- 14×14 patch size on 224×224 images (256 patches)
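The patch and sequence counts above follow directly from the configuration. A quick sanity check of the arithmetic in plain Rust (independent of the crate itself):

```rust
// CLIP-Large geometry: a 224x224 image cut into non-overlapping
// 14x14 patches yields a 16x16 grid, i.e. 256 patch tokens,
// plus one class token for a sequence length of 257.
fn main() {
    let (image_size, patch_size) = (224usize, 14usize);
    let patches_per_side = image_size / patch_size;
    let num_patches = patches_per_side * patches_per_side;
    let seq_len = num_patches + 1; // class token prepended
    println!("patches: {num_patches}, sequence length: {seq_len}");
    assert_eq!(num_patches, 256);
    assert_eq!(seq_len, 257);
}
```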
§Example
let config = ClipConfig::large();
let encoder = ClipEncoder::new(&config, &device);
let image = Tensor::zeros([1, 3, 224, 224], &device);
let features = encoder.forward(image, None);
// features shape: [1, 257, 1024] (256 patches + 1 class token)
Implementations§
impl<B: Backend> ClipEncoder<B>
pub fn new(cfg: &ClipConfig, device: &B::Device) -> Self
Creates a new CLIP encoder.
Initializes all layers including embeddings, pre-LayerNorm, and transformer blocks according to the provided configuration.
§Arguments
- cfg - CLIP configuration specifying model architecture
- device - Device for tensor allocation (CPU/GPU)
§Returns
Fully initialized CLIP encoder ready for inference
pub fn forward(
    &self,
    x: Tensor<B, 4>,
    _patch_embeds: Option<Tensor<B, 4>>,
) -> Tensor<B, 3>
Forward pass through CLIP encoder.
Processes input images through patch embedding, position encoding, and transformer layers to produce semantic visual features.
§Arguments
- x - Input image tensor of shape [B, 3, H, W]
- _patch_embeds - Optional pre-computed patch embeddings (currently unused, reserved for future integration with SAM features)
§Returns
Feature tensor of shape [B, num_patches+1, hidden_size] where the first
token (index 0) is the class token and remaining tokens are patch features.
§Note
The _patch_embeds parameter is currently ignored. Full integration with
SAM features would require a projection layer to match embedding dimensions.
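To picture the output layout of forward, here is a plain-Rust stand-in for the returned sequence (not the crate's tensor API; names and sizes are illustrative): index 0 holds the class token, and the remaining entries hold per-patch features.

```rust
// Sketch of a [seq_len, hidden] output sequence for one image.
// Index 0 is the class token; indices 1.. are patch features.
// Hidden size is shrunk from 1024 to 4 for brevity.
fn main() {
    let (num_patches, hidden) = (256usize, 4usize);
    let seq: Vec<Vec<f32>> = (0..num_patches + 1)
        .map(|i| vec![i as f32; hidden])
        .collect();

    let class_token = &seq[0];      // global summary embedding
    let patch_features = &seq[1..]; // per-patch embeddings

    assert_eq!(seq.len(), 257);
    assert_eq!(class_token.len(), hidden);
    assert_eq!(patch_features.len(), 256);
}
```

Dropping the leading class token in this way is exactly what forward_features (below) does for you.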
pub fn forward_features(
    &self,
    x: Tensor<B, 4>,
    patch_embeds: Option<Tensor<B, 4>>,
) -> Tensor<B, 3>
Forward pass excluding class token.
Convenience method that runs the encoder and strips the class token from the output sequence, returning only patch feature embeddings.
§Arguments
- x - Input image tensor of shape [B, 3, H, W]
- patch_embeds - Optional pre-computed patch embeddings
§Returns
Patch features of shape [B, num_patches, hidden_size] with the
class token (the first token in the sequence) removed.
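With the class token stripped, the 256 patch tokens correspond row-major to a 16×16 spatial grid, which is the layout downstream spatial processing would typically assume. A plain-Rust sketch of that indexing (illustrative only; the grid-layout assumption is not stated by this API):

```rust
// Map a flat patch-token index back to its (row, col) grid position,
// assuming row-major order over a 16x16 grid of 14x14 patches.
fn main() {
    let num_patches = 256usize;
    let side = 16usize; // sqrt(256)
    assert_eq!(side * side, num_patches);

    // Token for grid cell (row, col) sits at index row * side + col.
    let (row, col) = (3usize, 5usize);
    let flat = row * side + col;
    assert_eq!(flat, 53);
    // And back again:
    assert_eq!((flat / side, flat % side), (row, col));
}
```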
Trait Implementations§
impl<B> AutodiffModule<B> for ClipEncoder<B>
type InnerModule = ClipEncoder<<B as AutodiffBackend>::InnerBackend>
fn valid(&self) -> Self::InnerModule
impl<B: Backend> Clone for ClipEncoder<B>
impl<B: Backend> Display for ClipEncoder<B>
impl<B: Backend> Module<B> for ClipEncoder<B>
type Record = ClipEncoderRecord<B>
fn load_record(self, record: Self::Record) -> Self
fn into_record(self) -> Self::Record
fn num_params(&self) -> usize
fn visit<Visitor: ModuleVisitor<B>>(&self, visitor: &mut Visitor)
fn map<Mapper: ModuleMapper<B>>(self, mapper: &mut Mapper) -> Self
fn collect_devices(&self, devices: Devices<B>) -> Devices<B>
fn to_device(self, device: &B::Device) -> Self
fn fork(self, device: &B::Device) -> Self
fn devices(&self) -> Vec<<B as Backend>::Device>
fn save_file<FR, PB>(
    self,
    file_path: PB,
    recorder: &FR,
) -> Result<(), RecorderError>
fn load_file<FR, PB>(
    self,
    file_path: PB,
    recorder: &FR,
    device: &<B as Backend>::Device,
) -> Result<Self, RecorderError>
fn quantize_weights(self, quantizer: &mut Quantizer) -> Self
impl<B: Backend> ModuleDisplay for ClipEncoder<B>
fn format(&self, passed_settings: DisplaySettings) -> String
fn custom_settings(&self) -> Option<DisplaySettings>
Auto Trait Implementations§
impl<B> !Freeze for ClipEncoder<B>
impl<B> !RefUnwindSafe for ClipEncoder<B>
impl<B> Send for ClipEncoder<B>
impl<B> !Sync for ClipEncoder<B>
impl<B> Unpin for ClipEncoder<B>
where
    <B as Backend>::FloatTensorPrimitive: Unpin,
    <B as Backend>::QuantizedTensorPrimitive: Unpin,
    <B as Backend>::Device: Unpin,
impl<B> UnwindSafe for ClipEncoder<B>
where
    <B as Backend>::FloatTensorPrimitive: UnwindSafe,
    <B as Backend>::QuantizedTensorPrimitive: UnwindSafe,
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true.
Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self> otherwise.