# Optical Embeddings
A Rust implementation of DeepSeek-OCR, a vision-language model that compresses long text documents through optical encoding using the [Burn](https://burn.dev) deep learning framework.
## About DeepSeek-OCR
This implementation is based on the paper:
> **DeepSeek-OCR: Contexts Optical Compression**
> Haoran Wei, Yaofeng Sun, Yukun Li
> DeepSeek-AI
> arXiv:2510.18234v1 [cs.CV] 21 Oct 2025
> [Paper (arXiv)](https://arxiv.org/abs/2510.18234) | [Official Repository](https://github.com/deepseek-ai/DeepSeek-OCR)
### Key Innovation
DeepSeek-OCR addresses the computational challenges of processing long textual contexts in Large Language Models (LLMs) by leveraging **context optical compression**: a novel approach that treats rendered text images as an efficient compression medium. Instead of processing thousands of text tokens, the model encodes document images into a compact set of vision tokens.
### Architecture Highlights
The system consists of two main components:
1. **DeepEncoder** (~380M parameters): A hybrid vision encoder combining:
- **SAM-base** (80M): Window attention for efficient local feature extraction
   - **16× Convolutional Compressor**: Reduces spatial dimensions via two stride-2 conv layers
- **CLIP-large** (300M): Global attention for semantic understanding
2. **DeepSeek3B-MoE** Decoder (570M activated): A Mixture-of-Experts language model that reconstructs text from compressed vision tokens
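The 16× figure is pure geometry: each stride-2 convolution halves both spatial dimensions, so two of them cut the patch count by 4 × 4 = 16. A quick shape check in plain Rust (the 16-pixel patch size is inferred from the numbers in this README, not read from the crate):

```rust
// Each stride-2 convolution halves height and width, so `n` of them
// shrink the patch grid by 2^n per dimension.
fn compressed_tokens(patch_grid: (usize, usize), stride2_convs: u32) -> usize {
    let factor = 1usize << stride2_convs; // 2^n per dimension
    (patch_grid.0 / factor) * (patch_grid.1 / factor)
}

fn main() {
    // Base mode: 1024x1024 image with (assumed) 16x16 patches -> 64x64 grid.
    let grid = (1024 / 16, 1024 / 16);
    let tokens = compressed_tokens(grid, 2);
    println!("{} patches -> {} vision tokens", grid.0 * grid.1, tokens);
}
```

For the Base mode this yields 4096 patches compressed to 256 vision tokens, consistent with the resolution table below.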
### Compression Performance
According to the paper's findings on the Fox benchmark:
- **~10× compression**: Achieves **97% OCR decoding precision**
- **~20× compression**: Maintains **60% accuracy**
- **Token efficiency**: Processes 1000 text words using only 100 vision tokens
The model supports multiple resolution modes optimized for different compression ratios:
| Mode | Resolution | Vision Tokens | Use Case |
|--------|------------|---------------|----------------------|
| Tiny | 512×512 | 64 | Ultra-fast inference |
| Small | 640×640 | 100 | Balanced performance |
| Base | 1024×1024 | 256 | Default (10× target) |
| Large | 1280×1280 | 400 | High-precision OCR |
| Gundam | Dynamic | <800 | Complex documents |
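The fixed-mode token counts all follow from the same arithmetic: (resolution / 16)² patches, divided by the 16× compressor. A sketch of that relationship as a Rust enum (the type name and the 16-pixel patch-size assumption are illustrative, not this crate's actual API):

```rust
/// Fixed resolution modes from the table above (Gundam's dynamic tiling omitted).
#[derive(Debug, Clone, Copy)]
enum ResolutionMode {
    Tiny,
    Small,
    Base,
    Large,
}

impl ResolutionMode {
    fn resolution(self) -> u32 {
        match self {
            ResolutionMode::Tiny => 512,
            ResolutionMode::Small => 640,
            ResolutionMode::Base => 1024,
            ResolutionMode::Large => 1280,
        }
    }

    /// (resolution / 16)^2 patches, then the 16x convolutional compressor.
    fn vision_tokens(self) -> u32 {
        let patches_per_side = self.resolution() / 16;
        patches_per_side * patches_per_side / 16
    }
}

fn main() {
    for mode in [
        ResolutionMode::Tiny,
        ResolutionMode::Small,
        ResolutionMode::Base,
        ResolutionMode::Large,
    ] {
        println!("{:?}: {} tokens", mode, mode.vision_tokens());
    }
}
```

Running this reproduces the 64 / 100 / 256 / 400 token counts in the table.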
## Implementation Features
This Rust implementation provides:
- ✅ **Complete DeepEncoder architecture** with SAM and CLIP encoders
- ✅ **Window & global attention mechanisms** for efficient processing
- ✅ **16× spatial compression** via convolutional layers
- ✅ **Multi-resolution support** (Tiny/Small/Base/Large modes)
- ✅ **GPU acceleration** via WGPU (cross-platform) or CUDA (NVIDIA)
- ✅ **Information-theoretic compression metrics** for analysis
- ✅ **Production-ready** with proper error handling and logging
## Run
1. Download a font of your choice to `assets/font.ttf`
2. Run `cargo run --release`
3. Check the output image
## Tests
`cargo test --all-features -- --nocapture`
## Build Commands
CPU only (default):
```bash
cargo build --release
cargo run --release
```
With WGPU (GPU - works with NVIDIA, AMD, Intel, Apple Silicon):
```bash
cargo build --release --features wgpu
cargo run --release --features wgpu
```
With CUDA (NVIDIA only - fastest):
```bash
# Make sure CUDA toolkit is installed first:
# Ubuntu/Debian: sudo apt install nvidia-cuda-toolkit
# Or download from: https://developer.nvidia.com/cuda-downloads
cargo build --release --features cuda
cargo run --release --features cuda
```
Check which GPU you have:
```bash
# NVIDIA
nvidia-smi
# AMD/Intel/General
# Or just try WGPU (works with most GPUs)
cargo run --release --features wgpu
```
## Metrics
Try: `cargo test -- test_information_compression_pipeline --nocapture`
Information compression:
```text
┌─────────────────────────────────────────────────────────────┐
│           Optical Embeddings Information Analysis           │
└─────────────────────────────────────────────────────────────┘
TEXT INFORMATION:
  ├─ Bytes: 641
  ├─ Characters: 641
  ├─ Words: 87
  ├─ Unique chars: 42
  └─ Entropy (bits): 4.3794
IMAGE INFORMATION:
  ├─ Bytes: 786432
  ├─ Pixels: 262144
  ├─ Unique colors: 2
  └─ Entropy (bits): 0.1507
VISION TOKENS:
  ├─ Token count: 64
  ├─ Embedding dim: 1024
  └─ Total values: 65536
COMPRESSION METRICS:
  ├─ Text→Image: 0.0008× (smaller)
  ├─ Text→Tokens: 0.0098× (smaller)
  ├─ Image→Tokens: 12.0000× (compressed)
  └─ Effective (ent): 39.5030× (compression)
INFORMATION FLOW:
  Original text:  641 bytes (4.3793884228266 bits entropy)
  Rendered image: 786432 bytes (0.15070330510950625 bits entropy)
  Vision tokens:  64 tokens × 1024 dims = 65536 values
  Effective rate: 80.12 bits/token
COMPRESSION RESULTS:
  ├─ Text words: 87
  ├─ Vision tokens: 64
  ├─ Words/token: 1.36
  └─ Spatial compression: 1024 patches → 64 tokens = 16.0× reduction
✅ Compression test passed!
  - Achieved 16× spatial compression (1024 → 64 tokens)
  - Word-to-token ratio: 1.36
  - ✅ Effective compression: 1.36 words per vision token
test tests::tests::test_information_compression_pipeline ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 7 filtered out; finished in 1.28s
```
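The entropy lines in the trace are standard Shannon entropy over the symbol distribution, and the effective rate is just raw text bits divided by token count (641 × 8 / 64 = 80.125). A minimal, self-contained sketch of both calculations (not the crate's actual metrics code):

```rust
use std::collections::HashMap;

/// Shannon entropy in bits per symbol: -sum(p * log2(p)).
fn shannon_entropy(symbols: impl Iterator<Item = char>) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for s in symbols {
        *counts.entry(s).or_insert(0) += 1;
        total += 1;
    }
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    let text = "the quick brown fox jumps over the lazy dog";
    println!("entropy: {:.4} bits/char", shannon_entropy(text.chars()));

    // Effective rate from the trace: 641 raw bytes over 64 vision tokens.
    let bits_per_token = 641.0 * 8.0 / 64.0; // = 80.125
    println!("effective rate: {:.3} bits/token", bits_per_token);
}
```

The same formula applied per pixel value gives the image entropy; with only 2 unique colors it is necessarily below 1 bit, which is why the rendered image scores just 0.1507 bits.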