Optical Embeddings
A Rust implementation of DeepSeek-OCR, a vision-language model that compresses long text documents through optical encoding using the Burn deep learning framework.
📖 About DeepSeek-OCR
This implementation is based on the paper:
DeepSeek-OCR: Contexts Optical Compression
Haoran Wei, Yaofeng Sun, Yukun Li (DeepSeek-AI)
arXiv:2510.18234v1 [cs.CV], 21 Oct 2025
Paper (arXiv) | Official Repository
Key Innovation
DeepSeek-OCR addresses the computational challenges of processing long textual contexts in Large Language Models (LLMs) by leveraging context optical compression: a novel approach that treats rendered text images as an efficient compression medium. Instead of processing thousands of text tokens, the model encodes document images into a compact set of vision tokens.
Architecture Highlights
The system consists of two main components:
- DeepEncoder (~380M parameters): A hybrid vision encoder combining:
  - SAM-base (80M): Window attention for efficient local feature extraction
  - 16× Convolutional Compressor: Reduces spatial dimensions via two stride-2 conv layers
  - CLIP-large (300M): Global attention for semantic understanding
- DeepSeek3B-MoE Decoder (570M activated): A Mixture-of-Experts language model that reconstructs text from compressed vision tokens
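The vision-token budget falls out of simple arithmetic: a square input is divided into 16×16 patches, and the compressor's two stride-2 convolutions shrink each spatial side by 4, i.e. 4 × 4 = 16× fewer tokens. A minimal sketch of that math (function names are illustrative assumptions, not this crate's API):

```rust
/// Number of patch tokens the SAM stage produces for a square input,
/// assuming the common 16x16 patch size (illustrative helper).
fn sam_patch_tokens(resolution: u32, patch_size: u32) -> u32 {
    let side = resolution / patch_size;
    side * side
}

/// The 16x convolutional compressor: two stride-2 convs halve each
/// spatial side twice, so the token count drops by (2 * 2)^2 = 16.
fn compressed_tokens(patch_tokens: u32) -> u32 {
    patch_tokens / 16
}

fn main() {
    // Base mode: 1024x1024 -> 64x64 = 4096 patch tokens -> 256 vision tokens.
    let patches = sam_patch_tokens(1024, 16);
    println!(
        "{} patch tokens -> {} vision tokens",
        patches,
        compressed_tokens(patches)
    );
}
```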
Compression Performance
According to the paper's findings on the Fox benchmark:
- ~10× compression: Achieves 97% OCR decoding precision
- ~20× compression: Maintains 60% accuracy
- Token efficiency: Processes 1000 text words using only 100 vision tokens
The model supports multiple resolution modes optimized for different compression ratios:
| Mode | Resolution | Vision Tokens | Compression Target |
|---|---|---|---|
| Tiny | 512×512 | 64 | Ultra-fast inference |
| Small | 640×640 | 100 | Balanced performance |
| Base | 1024×1024 | 256 | Default (10× target) |
| Large | 1280×1280 | 400 | High-precision OCR |
| Gundam | Dynamic | <800 | Complex documents |
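Every fixed mode in the table obeys the same formula, (side / 16)² / 16. A hedged sketch of how those budgets could be encoded; the enum and method names are assumptions for illustration, not this crate's actual types, and Gundam's dynamic tiling is omitted:

```rust
/// Fixed resolution modes and their vision-token budgets.
/// Illustrative only; the real crate may model this differently.
#[derive(Clone, Copy, Debug)]
enum ResolutionMode {
    Tiny,  // 512x512
    Small, // 640x640
    Base,  // 1024x1024
    Large, // 1280x1280
}

impl ResolutionMode {
    /// Square input resolution in pixels.
    fn resolution(self) -> u32 {
        match self {
            Self::Tiny => 512,
            Self::Small => 640,
            Self::Base => 1024,
            Self::Large => 1280,
        }
    }

    /// Vision tokens after 16x16 patching and 16x compression.
    fn vision_tokens(self) -> u32 {
        (self.resolution() / 16).pow(2) / 16
    }
}

fn main() {
    use ResolutionMode::*;
    for mode in [Tiny, Small, Base, Large] {
        println!("{:?}: {} vision tokens", mode, mode.vision_tokens());
    }
}
```

Deriving the token count from the resolution, rather than storing it, keeps the table and the formula from drifting apart.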
🎯 Implementation Features
This Rust implementation provides:
- ✅ Complete DeepEncoder architecture with SAM and CLIP encoders
- ✅ Window & global attention mechanisms for efficient processing
- ✅ 16× spatial compression via convolutional layers
- ✅ Multi-resolution support (Tiny/Small/Base/Large modes)
- ✅ GPU acceleration via WGPU (cross-platform) or CUDA (NVIDIA)
- ✅ Information-theoretic compression metrics for analysis
- ✅ Production-ready with proper error handling and logging
Run
- Download your favourite font to `assets/font.ttf`
- Run `cargo run --release`
- Check the output image
Tests
cargo test --all-features -- --nocapture
Build Commands
CPU only (default):

cargo build --release

With WGPU (GPU - works with NVIDIA, AMD, Intel, Apple Silicon):

cargo build --release --features wgpu

With CUDA (NVIDIA only - fastest):

# Make sure the CUDA toolkit is installed first:
# Ubuntu/Debian: sudo apt install nvidia-cuda-toolkit
# Or download from: https://developer.nvidia.com/cuda-downloads
cargo build --release --features cuda

(The `wgpu`/`cuda` feature names are assumed here; check Cargo.toml for the exact flags.)
Check which GPU you have:

# NVIDIA
nvidia-smi

# AMD/Intel/General (Linux)
lspci | grep -i vga

# Or just try WGPU (works with most GPUs)
Metrics
Try: cargo test -- test_information_compression_pipeline --nocapture
Information compression:
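As a rough, self-contained sketch of the kind of quantity such a pipeline can report (helper names here are illustrative assumptions, not the crate's API): the empirical Shannon entropy of the source text, spread over the vision-token budget it is compressed into.

```rust
use std::collections::HashMap;

/// Empirical Shannon entropy of a byte string, in bits per byte.
fn entropy_bits_per_byte(data: &[u8]) -> f64 {
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for &b in data {
        *counts.entry(b).or_insert(0) += 1;
    }
    let n = data.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Total information content of `text`, divided across the
/// vision-token budget it is squeezed into (illustrative metric).
fn bits_per_vision_token(text: &str, vision_tokens: u32) -> f64 {
    entropy_bits_per_byte(text.as_bytes()) * text.len() as f64 / f64::from(vision_tokens)
}

fn main() {
    let text = "The quick brown fox jumps over the lazy dog.";
    println!(
        "{:.2} bits per vision token",
        bits_per_vision_token(text, 10)
    );
}
```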