optical-embeddings 0.3.0

DeepSeek-OCR - compress text into images
# Optical Embeddings

A Rust implementation of DeepSeek-OCR, a vision-language model that compresses long text documents through optical encoding using the [Burn](https://burn.dev) deep learning framework.

## πŸ“„ About DeepSeek-OCR

This implementation is based on the paper:

> **DeepSeek-OCR: Contexts Optical Compression**
> Haoran Wei, Yaofeng Sun, Yukun Li
> DeepSeek-AI
> arXiv:2510.18234v1 [cs.CV] 21 Oct 2025
> [Paper (arXiv)](https://arxiv.org/abs/2510.18234) | [Official Repository](https://github.com/deepseek-ai/DeepSeek-OCR)

### Key Innovation

DeepSeek-OCR addresses the computational challenges of processing long textual contexts in Large Language Models (LLMs) by leveraging **context optical compression**β€”a novel approach that treats rendered text images as an efficient compression medium. Instead of processing thousands of text tokens, the model encodes document images into a compact set of vision tokens.

### Architecture Highlights

The system consists of two main components:

1. **DeepEncoder** (~380M parameters): A hybrid vision encoder combining:
    - **SAM-base** (80M): Window attention for efficient local feature extraction
    - **16Γ— Convolutional Compressor**: Reduces spatial dimensions via two stride-2 conv layers
    - **CLIP-large** (300M): Global attention for semantic understanding
2. **DeepSeek3B-MoE decoder** (570M activated parameters): A Mixture-of-Experts language model that reconstructs the original text from the compressed vision tokens

### Compression Performance

According to the paper's findings on the Fox benchmark:

- **~10Γ— compression**: Achieves **97% OCR decoding precision**
- **~20Γ— compression**: Maintains **60% accuracy**
- **Token efficiency**: Processes 1000 text words using only 100 vision tokens

The model supports multiple resolution modes optimized for different compression ratios:


| Mode | Resolution | Vision Tokens | Compression Target |
| :-- | :-- | :-- | :-- |
| Tiny | 512Γ—512 | 64 | Ultra-fast inference |
| Small | 640Γ—640 | 100 | Balanced performance |
| Base | 1024Γ—1024 | 256 | Default (10Γ— target) |
| Large | 1280Γ—1280 | 400 | High-precision OCR |
| Gundam | Dynamic | <800 | Complex documents |

## 🎯 Implementation Features

This Rust implementation provides:

- βœ… **Complete DeepEncoder architecture** with SAM and CLIP encoders
- βœ… **Window \& global attention mechanisms** for efficient processing
- βœ… **16Γ— spatial compression** via convolutional layers
- βœ… **Multi-resolution support** (Tiny/Small/Base/Large modes)
- βœ… **GPU acceleration** via WGPU (cross-platform) or CUDA (NVIDIA)
- βœ… **Information-theoretic compression metrics** for analysis
- βœ… **Production-ready** with proper error handling and logging

## Run
1. Download a font of your choice to `assets/font.ttf`
2. Run `cargo run --release`
3. Check the output image

## Tests
`cargo test --all-features -- --nocapture`

## Build Commands
CPU only (default):
```bash
cargo build --release
cargo run --release
```
With WGPU (GPU - works with NVIDIA, AMD, Intel, Apple Silicon):
```bash
cargo build --release --features wgpu
cargo run --release --features wgpu
```
With CUDA (NVIDIA only - fastest):
```bash
# Make sure CUDA toolkit is installed first:
# Ubuntu/Debian: sudo apt install nvidia-cuda-toolkit
# Or download from: https://developer.nvidia.com/cuda-downloads

cargo build --release --features cuda
cargo run --release --features cuda
```
Check which GPU you have:
```bash
# NVIDIA
nvidia-smi

# AMD/Intel/General
vulkaninfo | grep -i "device name"

# Or just try WGPU (works with most GPUs)
cargo run --release --features wgpu
```

## Metrics
Try: `cargo test -- test_information_compression_pipeline --nocapture`


Information compression (sample output):
```text
╔═══════════════════════════════════════════════════════════╗
β•‘          Optical Embeddings Information Analysis          β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

πŸ“ TEXT INFORMATION:
  β”œβ”€ Bytes:                   641
  β”œβ”€ Characters:              641
  β”œβ”€ Words:                    87
  β”œβ”€ Unique chars:             42
  └─ Entropy (bits):       4.3794

πŸ–ΌοΈ  IMAGE INFORMATION:
  β”œβ”€ Bytes:                786432
  β”œβ”€ Pixels:               262144
  β”œβ”€ Unique colors:             2
  └─ Entropy (bits):       0.1507

🎯 VISION TOKENS:
  β”œβ”€ Token count:              64
  β”œβ”€ Embedding dim:          1024
  └─ Total values:          65536

πŸ“Š COMPRESSION METRICS:
  β”œβ”€ Textβ†’Image:           0.0008Γ— (smaller)
  β”œβ”€ Textβ†’Tokens:          0.0098Γ— (smaller)
  β”œβ”€ Imageβ†’Tokens:        12.0000Γ— (compressed)
  └─ Effective (ent):     39.5030Γ— (compression)

πŸ“ˆ INFORMATION FLOW:
  Original text:    641 bytes (4.3794 bits entropy)
  Rendered image:   786432 bytes (0.1507 bits entropy)
  Vision tokens:    64 tokens Γ— 1024 dims = 65536 values
  Effective rate:   80.12 bits/token

πŸ“Š COMPRESSION RESULTS:
  β”œβ”€ Text words: 87
  β”œβ”€ Vision tokens: 64
  β”œβ”€ Words/token: 1.36
  └─ Spatial compression: 1024 patches β†’ 64 tokens = 16.0Γ— reduction

βœ… Compression test passed!
   - Achieved 16Γ— spatial compression (1024 β†’ 64 tokens)
   - Word-to-token ratio: 1.36
   - βœ… Effective compression: 1.36 words per vision token
test tests::tests::test_information_compression_pipeline ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 7 filtered out; finished in 1.28s
```