Optical Embeddings
A Rust implementation of DeepSeek-OCR, a vision-language model that compresses long text documents through optical encoding using the Burn deep learning framework.
📖 About DeepSeek-OCR
This implementation is based on the paper:
DeepSeek-OCR: Contexts Optical Compression
Haoran Wei, Yaofeng Sun, Yukun Li (DeepSeek-AI)
arXiv:2510.18234v1 [cs.CV], 21 Oct 2025
Paper (arXiv) | Official Repository
Key Innovation
DeepSeek-OCR addresses the computational challenges of processing long textual contexts in Large Language Models (LLMs) by leveraging context optical compression: a novel approach that treats rendered text images as an efficient compression medium. Instead of processing thousands of text tokens, the model encodes document images into a compact set of vision tokens.
Architecture Highlights
The system consists of two main components:
- DeepEncoder (~380M parameters): A hybrid vision encoder combining:
  - SAM-base (80M): Window attention for efficient local feature extraction
  - 16× Convolutional Compressor: Reduces spatial dimensions via two stride-2 conv layers
  - CLIP-large (300M): Global attention for semantic understanding
- DeepSeek3B-MoE Decoder (570M activated): A Mixture-of-Experts language model that reconstructs text from compressed vision tokens
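The vision-token budget falls out of simple arithmetic: a square input is divided into 16×16 patches, and the compressor's two stride-2 convolutions shrink each spatial side by 4, i.e. 4 × 4 = 16× fewer tokens. A minimal sketch of that math (function names are illustrative assumptions, not this crate's API):

```rust
/// Number of patch tokens the SAM stage produces for a square input,
/// assuming the common 16x16 patch size (illustrative helper).
fn sam_patch_tokens(resolution: u32, patch_size: u32) -> u32 {
    let side = resolution / patch_size;
    side * side
}

/// The 16x convolutional compressor: two stride-2 convs halve each
/// spatial side twice, so the token count drops by (2 * 2)^2 = 16.
fn compressed_tokens(patch_tokens: u32) -> u32 {
    patch_tokens / 16
}

fn main() {
    // Base mode: 1024x1024 -> 64x64 = 4096 patch tokens -> 256 vision tokens.
    let patches = sam_patch_tokens(1024, 16);
    println!(
        "{} patch tokens -> {} vision tokens",
        patches,
        compressed_tokens(patches)
    );
}
```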
Compression Performance
According to the paper's findings on the Fox benchmark:
- ~10× compression: Achieves 97% OCR decoding precision
- ~20× compression: Maintains 60% accuracy
- Token efficiency: Processes 1000 text words using only 100 vision tokens
The model supports multiple resolution modes optimized for different compression ratios:
| Mode | Resolution | Vision Tokens | Compression Target |
|---|---|---|---|
| Tiny | 512×512 | 64 | Ultra-fast inference |
| Small | 640×640 | 100 | Balanced performance |
| Base | 1024×1024 | 256 | Default (10× target) |
| Large | 1280×1280 | 400 | High-precision OCR |
| Gundam | Dynamic | <800 | Complex documents |
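Every fixed mode in the table obeys the same formula, (side / 16)² / 16. A hedged sketch of how those budgets could be encoded; the enum and method names are assumptions for illustration, not this crate's actual types, and Gundam's dynamic tiling is omitted:

```rust
/// Fixed resolution modes and their vision-token budgets.
/// Illustrative only; the real crate may model this differently.
#[derive(Clone, Copy, Debug)]
enum ResolutionMode {
    Tiny,  // 512x512
    Small, // 640x640
    Base,  // 1024x1024
    Large, // 1280x1280
}

impl ResolutionMode {
    /// Square input resolution in pixels.
    fn resolution(self) -> u32 {
        match self {
            Self::Tiny => 512,
            Self::Small => 640,
            Self::Base => 1024,
            Self::Large => 1280,
        }
    }

    /// Vision tokens after 16x16 patching and 16x compression.
    fn vision_tokens(self) -> u32 {
        (self.resolution() / 16).pow(2) / 16
    }
}

fn main() {
    use ResolutionMode::*;
    for mode in [Tiny, Small, Base, Large] {
        println!("{:?}: {} vision tokens", mode, mode.vision_tokens());
    }
}
```

Deriving the token count from the resolution, rather than storing it, keeps the table and the formula from drifting apart.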
🎯 Implementation Features
This Rust implementation provides:
- ✅ Complete DeepEncoder architecture with SAM and CLIP encoders
- ✅ Window & global attention mechanisms for efficient processing
- ✅ 16× spatial compression via convolutional layers
- ✅ Multi-resolution support (Tiny/Small/Base/Large modes)
- ✅ GPU acceleration via WGPU (cross-platform) or CUDA (NVIDIA)
- ✅ Information-theoretic compression metrics for analysis
- ✅ Production-ready with proper error handling and logging
Run
- Download your favourite font to `assets/font.ttf`
- Run `cargo run --release`
- Check the output image
Tests
cargo test --all-features -- --nocapture
Build Commands
CPU only (default):

cargo build --release

With WGPU (GPU - works with NVIDIA, AMD, Intel, Apple Silicon):

cargo build --release --features wgpu

With CUDA (NVIDIA only - fastest):

# Make sure the CUDA toolkit is installed first:
# Ubuntu/Debian: sudo apt install nvidia-cuda-toolkit
# Or download from: https://developer.nvidia.com/cuda-downloads
cargo build --release --features cuda

(The `wgpu`/`cuda` feature names are assumed here; check Cargo.toml for the exact flags.)
Check which GPU you have:

# NVIDIA
nvidia-smi

# AMD/Intel/General (Linux)
lspci | grep -i vga

# Or just try WGPU (works with most GPUs)
Metrics
Try: cargo test -- test_information_compression_pipeline --nocapture
Information compression:
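As a rough, self-contained sketch of the kind of quantity such a pipeline can report (helper names here are illustrative assumptions, not the crate's API): the empirical Shannon entropy of the source text, spread over the vision-token budget it is compressed into.

```rust
use std::collections::HashMap;

/// Empirical Shannon entropy of a byte string, in bits per byte.
fn entropy_bits_per_byte(data: &[u8]) -> f64 {
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for &b in data {
        *counts.entry(b).or_insert(0) += 1;
    }
    let n = data.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Total information content of `text`, divided across the
/// vision-token budget it is squeezed into (illustrative metric).
fn bits_per_vision_token(text: &str, vision_tokens: u32) -> f64 {
    entropy_bits_per_byte(text.as_bytes()) * text.len() as f64 / f64::from(vision_tokens)
}

fn main() {
    let text = "The quick brown fox jumps over the lazy dog.";
    println!(
        "{:.2} bits per vision token",
        bits_per_vision_token(text, 10)
    );
}
```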