# Optical Embeddings
A Rust implementation of DeepSeek-OCR, a vision-language model that compresses long text documents through optical encoding using the [Burn](https://burn.dev) deep learning framework.
## About DeepSeek-OCR
This implementation is based on the paper:
> **DeepSeek-OCR: Contexts Optical Compression**
> Haoran Wei, Yaofeng Sun, Yukun Li
> DeepSeek-AI
> arXiv:2510.18234v1 [cs.CV] 21 Oct 2025
> [Paper (arXiv)](https://arxiv.org/abs/2510.18234) | [Official Repository](https://github.com/deepseek-ai/DeepSeek-OCR)
### Key Innovation
DeepSeek-OCR addresses the computational challenges of processing long textual contexts in Large Language Models (LLMs) by leveraging **context optical compression**: a novel approach that treats rendered text images as an efficient compression medium. Instead of processing thousands of text tokens, the model encodes document images into a compact set of vision tokens.
### Architecture Highlights
The system consists of two main components:
1. **DeepEncoder** (~380M parameters): A hybrid vision encoder combining:
- **SAM-base** (80M): Window attention for efficient local feature extraction
   - **16× Convolutional Compressor**: Reduces spatial dimensions via two stride-2 conv layers
- **CLIP-large** (300M): Global attention for semantic understanding
2. **DeepSeek3B-MoE** Decoder (570M activated): A Mixture-of-Experts language model that reconstructs text from compressed vision tokens
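The 16× figure is pure geometry: each stride-2 convolution halves both spatial dimensions, so two of them cut the patch count by 4 × 4 = 16. A quick shape check in plain Rust (the 16-pixel patch size is inferred from the numbers in this README, not read from the crate):

```rust
// Each stride-2 convolution halves height and width, so `n` of them
// shrink the patch grid by 2^n per dimension.
fn compressed_tokens(patch_grid: (usize, usize), stride2_convs: u32) -> usize {
    let factor = 1usize << stride2_convs; // 2^n per dimension
    (patch_grid.0 / factor) * (patch_grid.1 / factor)
}

fn main() {
    // Base mode: 1024x1024 image with (assumed) 16x16 patches -> 64x64 grid.
    let grid = (1024 / 16, 1024 / 16);
    let tokens = compressed_tokens(grid, 2);
    println!("{} patches -> {} vision tokens", grid.0 * grid.1, tokens);
}
```

For the Base mode this yields 4096 patches compressed to 256 vision tokens, consistent with the resolution table below.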
### Compression Performance
According to the paper's findings on the Fox benchmark:
- **~10× compression**: Achieves **97% OCR decoding precision**
- **~20× compression**: Maintains **60% accuracy**
- **Token efficiency**: Processes 1000 text words using only 100 vision tokens
The model supports multiple resolution modes optimized for different compression ratios:
| Mode | Resolution | Vision Tokens | Use Case |
|--------|------------|---------------|----------------------|
| Tiny | 512×512 | 64 | Ultra-fast inference |
| Small | 640×640 | 100 | Balanced performance |
| Base | 1024×1024 | 256 | Default (10× target) |
| Large | 1280×1280 | 400 | High-precision OCR |
| Gundam | Dynamic | <800 | Complex documents |
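The fixed-mode token counts all follow from the same arithmetic: (resolution / 16)² patches, divided by the 16× compressor. A sketch of that relationship as a Rust enum (the type name and the 16-pixel patch-size assumption are illustrative, not this crate's actual API):

```rust
/// Fixed resolution modes from the table above (Gundam's dynamic tiling omitted).
#[derive(Debug, Clone, Copy)]
enum ResolutionMode {
    Tiny,
    Small,
    Base,
    Large,
}

impl ResolutionMode {
    fn resolution(self) -> u32 {
        match self {
            ResolutionMode::Tiny => 512,
            ResolutionMode::Small => 640,
            ResolutionMode::Base => 1024,
            ResolutionMode::Large => 1280,
        }
    }

    /// (resolution / 16)^2 patches, then the 16x convolutional compressor.
    fn vision_tokens(self) -> u32 {
        let patches_per_side = self.resolution() / 16;
        patches_per_side * patches_per_side / 16
    }
}

fn main() {
    for mode in [
        ResolutionMode::Tiny,
        ResolutionMode::Small,
        ResolutionMode::Base,
        ResolutionMode::Large,
    ] {
        println!("{:?}: {} tokens", mode, mode.vision_tokens());
    }
}
```

Running this reproduces the 64 / 100 / 256 / 400 token counts in the table.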
## Implementation Features
This Rust implementation provides:
- ✅ **Complete DeepEncoder architecture** with SAM and CLIP encoders
- ✅ **Window & global attention mechanisms** for efficient processing
- ✅ **16× spatial compression** via convolutional layers
- ✅ **Multi-resolution support** (Tiny/Small/Base/Large modes)
- ✅ **GPU acceleration** via WGPU (cross-platform) or CUDA (NVIDIA)
- ✅ **Information-theoretic compression metrics** for analysis
- ✅ **Production-ready** with proper error handling and logging
## Run
1. Download a font of your choice to `assets/font.ttf`
2. Run `cargo run --release`
3. Check the output image
## Tests
`cargo test --all-features -- --nocapture`
## Build Commands
CPU only (default):
```bash
cargo build --release
cargo run --release
```
With WGPU (GPU - works with NVIDIA, AMD, Intel, Apple Silicon):
```bash
cargo build --release --features wgpu
cargo run --release --features wgpu
```
With CUDA (NVIDIA only - fastest):
```bash
# Make sure CUDA toolkit is installed first:
# Ubuntu/Debian: sudo apt install nvidia-cuda-toolkit
# Or download from: https://developer.nvidia.com/cuda-downloads
cargo build --release --features cuda
cargo run --release --features cuda
```
Check which GPU you have:
```bash
# NVIDIA
nvidia-smi
# AMD/Intel/General
# Or just try WGPU (works with most GPUs)
cargo run --release --features wgpu
```
## Metrics
Try: `cargo test -- test_information_compression_pipeline --nocapture`
Information compression:
```text
┌─────────────────────────────────────────────────────────────┐
│           Optical Embeddings Information Analysis           │
└─────────────────────────────────────────────────────────────┘
TEXT INFORMATION:
  ├─ Bytes: 641
  ├─ Characters: 641
  ├─ Words: 87
  ├─ Unique chars: 42
  └─ Entropy (bits): 4.3794
IMAGE INFORMATION:
  ├─ Bytes: 786432
  ├─ Pixels: 262144
  ├─ Unique colors: 2
  └─ Entropy (bits): 0.1507
VISION TOKENS:
  ├─ Token count: 64
  ├─ Embedding dim: 1024
  └─ Total values: 65536
COMPRESSION METRICS:
  ├─ Text→Image: 0.0008× (smaller)
  ├─ Text→Tokens: 0.0098× (smaller)
  ├─ Image→Tokens: 12.0000× (compressed)
  └─ Effective (ent): 39.5030× (compression)
INFORMATION FLOW:
  Original text:  641 bytes (4.3793884228266 bits entropy)
  Rendered image: 786432 bytes (0.15070330510950625 bits entropy)
  Vision tokens:  64 tokens × 1024 dims = 65536 values
  Effective rate: 80.12 bits/token
COMPRESSION RESULTS:
  ├─ Text words: 87
  ├─ Vision tokens: 64
  ├─ Words/token: 1.36
  └─ Spatial compression: 1024 patches → 64 tokens = 16.0× reduction
✅ Compression test passed!
  - Achieved 16× spatial compression (1024 → 64 tokens)
  - Word-to-token ratio: 1.36
  - ✅ Effective compression: 1.36 words per vision token
test tests::tests::test_information_compression_pipeline ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 7 filtered out; finished in 1.28s
```
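The entropy lines in the trace are standard Shannon entropy over the symbol distribution, and the effective rate is just raw text bits divided by token count (641 × 8 / 64 = 80.125). A minimal, self-contained sketch of both calculations (not the crate's actual metrics code):

```rust
use std::collections::HashMap;

/// Shannon entropy in bits per symbol: -sum(p * log2(p)).
fn shannon_entropy(symbols: impl Iterator<Item = char>) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for s in symbols {
        *counts.entry(s).or_insert(0) += 1;
        total += 1;
    }
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    let text = "the quick brown fox jumps over the lazy dog";
    println!("entropy: {:.4} bits/char", shannon_entropy(text.chars()));

    // Effective rate from the trace: 641 raw bytes over 64 vision tokens.
    let bits_per_token = 641.0 * 8.0 / 64.0; // = 80.125
    println!("effective rate: {:.3} bits/token", bits_per_token);
}
```

The same formula applied per pixel value gives the image entropy; with only 2 unique colors it is necessarily below 1 bit, which is why the rendered image scores just 0.1507 bits.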