This document demonstrates Optical Embeddings, a system for optical compression of long text contexts.
The system implements a novel approach: it compresses long text documents into compact visual
representations using a multi-stage vision encoder pipeline. The architecture, called DeepEncoder,
combines the best aspects of SAM (Segment Anything Model) and CLIP (Contrastive Language-Image
Pre-training) to achieve remarkable compression ratios while maintaining high decoding accuracy.
The process begins by rendering text documents as high-resolution images. These images are then
divided into 16x16 patches, similar to Vision Transformers. The patches first pass through a
window attention block inspired by SAM-base, which efficiently processes local features using
sparse attention patterns. This design choice significantly reduces activation memory while
processing high-resolution inputs.
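Before any compression, the token count is simply the number of 16x16 tiles in the rendered image. A minimal sketch of that accounting (the 1024x1024 input resolution is an assumption for illustration, not a figure from the text above):

```python
def num_patches(height, width, patch=16):
    """Number of non-overlapping patch tokens produced from an image."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)

# A hypothetical 1024x1024 rendered page yields a 64x64 grid of patches,
# i.e. 4096 tokens entering the window-attention block.
print(num_patches(1024, 1024))  # → 4096
```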
Next, a convolutional compressor performs a 16x token reduction through two convolutional layers
with kernel size 3, stride 2, and appropriate padding. Each stride-2 layer halves both spatial
dimensions and therefore quarters the token count, so the two layers together give the 16x
reduction. This compression stage is crucial for managing the number of vision tokens that will
be processed by subsequent layers.
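The 16x figure follows directly from convolution shape arithmetic. A sketch assuming padding 1 (the exact padding value is not stated above) and a 64x64 grid of patch tokens:

```python
def conv_out(n, kernel=3, stride=2, pad=1):
    """Output length along one spatial axis after a single conv layer."""
    return (n + 2 * pad - kernel) // stride + 1

side = 64                      # hypothetical 64x64 grid of patch tokens
for _ in range(2):             # two stride-2 convolutional layers
    side = conv_out(side)      # 64 -> 32 -> 16
print(64 * 64, "->", side * side)  # → 4096 -> 256: a 16x token reduction
```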
Finally, a global attention block based on CLIP-large applies dense attention across all
compressed tokens, capturing long-range dependencies and semantic relationships within the
document. This produces the final vision token representation.
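Because dense attention attends across every pair of tokens, running it only after compression keeps its quadratic cost affordable. A stand-in sketch of single-head scaled dot-product self-attention in pure Python (the head count, dimensions, and token counts here are illustrative, not taken from CLIP-large):

```python
import math
import random

def dense_attention(x):
    """Single-head scaled dot-product self-attention over all tokens:
    each output row is a softmax-weighted mix of every input row."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)                      # subtract max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [wi / z for wi in w]             # softmax over all keys
        out.append([sum(wi * k[j] for wi, k in zip(w, x)) for j in range(d)])
    return out

random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]  # 16 compressed tokens
mixed = dense_attention(tokens)
print(len(mixed), len(mixed[0]))  # → 16 8
```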
According to experiments on the Fox benchmark, the system achieves 97% OCR decoding precision
at approximately 10x compression ratio. Even at 20x compression, accuracy remains around 60%.
This demonstrates the feasibility of using optical compression for long-context processing in
large language models.
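For clarity, the compression ratio quoted above is just the ratio of original text tokens to the vision tokens that encode them; the token counts below are hypothetical, chosen only to produce the 10x figure:

```python
def compression_ratio(n_text_tokens, n_vision_tokens):
    """Ratio of original text tokens to the vision tokens encoding them."""
    return n_text_tokens / n_vision_tokens

# A hypothetical page whose text tokenizes to 2,560 tokens,
# rendered and encoded into 256 vision tokens:
print(compression_ratio(2560, 256))  # → 10.0
```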
The practical applications are significant. Optical Embeddings can generate training data for LLMs
and VLMs at a scale of 200,000+ pages per day using a single A100-40G GPU. On OmniDocBench,
it outperforms models like GOT-OCR2.0 while using fewer vision tokens, showcasing both
efficiency and effectiveness.
This technology opens new possibilities for handling ultra-long contexts in language models
by treating visual modality as an efficient compression medium. Historical conversation rounds
could be optically compressed to reduce token consumption, with older contexts progressively
downsampled to implement a forgetting mechanism that mirrors human memory decay.
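The progressive-downsampling idea can be sketched as a token-budget schedule over conversation age. The halve-per-round policy and the 16-token floor below are illustrative assumptions, not part of the system described:

```python
def token_budget(rounds_old, base=256, floor=16):
    """Vision-token budget for a round that is `rounds_old` rounds in the
    past: halve the budget each round, never dropping below a floor."""
    return max(base >> rounds_old, floor)

# The current round keeps full fidelity; older rounds fade progressively,
# mimicking memory decay.
print([token_budget(age) for age in range(6)])  # → [256, 128, 64, 32, 16, 16]
```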