compression-prompt 0.1.2

Fast statistical compression for LLM prompts - 50% token reduction with 91% quality retention
Documentation
# Compression Prompt


> Fast, intelligent prompt compression for LLMs - Save 50% tokens while maintaining 91% quality

A Rust implementation of **statistical filtering** for prompt compression. Achieves **50% token reduction** with **91% quality retention** (Claude Sonnet) in <1ms, validated across 6 flagship LLMs with 350+ test pairs.

## ๐ŸŽฏ Why Use This?


- **๐Ÿ’ฐ Save Money**: 50% fewer tokens = 50% lower LLM costs ($2.50 saved per million tokens)
- **โšก Ultra Fast**: <1ms compression time (10.58 MB/s throughput)
- **๐ŸŽ“ Proven Quality**: 91% quality with Claude Sonnet, 93% with Grok-4
- **โœ… LLM Validated**: A/B tested on 6 flagship models (Grok-4, Claude, GPT-5, Gemini)
- **๐Ÿš€ Production Ready**: No external models, pure Rust, minimal dependencies
- **๐Ÿ“Š Battle Tested**: 350+ test pairs, 1.6M tokens validated

## Quick Results


Validated on 200 real arXiv papers (1.6M tokens):

```
โœ… COMPRESSION SUCCESSFUL!

โœ“ Original: 1,662,729 tokens
โœ“ Compressed: 831,364 tokens
โœ“ Savings: 831,365 tokens (50.0%)
โœ“ Time: 0.92s (10.58 MB/s)
โœ“ Quality Score: 88.6%
โœ“ Keyword Retention: 100.0%
โœ“ Entity Retention: 91.8%
```

## How It Works


Statistical filtering uses **intelligent token scoring** to remove low-value words while preserving meaning:

1. **IDF Scoring**: Rare words get higher scores (technical terms preserved)
2. **Position Weight**: Start/end of text prioritized
3. **POS Heuristics**: Content words over function words
4. **Entity Detection**: Names, numbers, URLs preserved
5. **Entropy Analysis**: Vocabulary diversity maintained

**What gets removed:** "the" (75K), "and" (36K), "of" (35K), "a" (28K)  
**What stays:** Keywords, entities, technical terms, numbers (100% retention)

### โœจ NEW: JSON & Structured Data Protection (v0.1.2)


**Critical Fix**: JSON structures, code blocks, and structured data are now **100% preserved** during compression.

**Available in all implementations:**
- ๐Ÿ“ฆ Rust v0.1.2 - 41 tests โœ…
- ๐Ÿ Python v0.1.2 - 41 tests โœ…
- ๐Ÿ“˜ TypeScript v0.1.2 - 41 tests โœ…

```rust
// Before fix: JSON could be partially removed โŒ
{"user": {"name": "Alice", "age": 30}} โ†’ {"user": {"name": "Alice" 30}}

// After fix: JSON completely preserved โœ…
{"user": {"name": "Alice", "age": 30}} โ†’ {"user": {"name": "Alice", "age": 30}}
```

**Protected Content:**
- โœ… JSON objects & arrays (nested, multiline, escaped characters)
- โœ… Code blocks (`` ```code``` ``)
- โœ… File paths (`/path/to/file.ext`)
- โœ… Identifiers (`camelCase`, `snake_case`, `UPPER_CASE`)
- โœ… Domain terms (configurable)

**Use Cases:**
- API responses in prompts
- Configuration examples
- Technical documentation with code
- Any structured data in prompts

See [JSON Preservation Fix Documentation](docs/JSON_PRESERVATION_FIX.md) for details.

## Quick Start


### Installation


```bash
cd rust && cargo build --release
```

### Basic Usage


```rust
use compression_prompt::statistical_filter::{StatisticalFilter, StatisticalFilterConfig};
use compression_prompt::tokenizer::{MockTokenizer, Tokenizer};

// Use the recommended default (50% compression, 89% quality)
let config = StatisticalFilterConfig::default();
let filter = StatisticalFilter::new(config);
let tokenizer = MockTokenizer;

let compressed = filter.compress(&text, &tokenizer);

// Calculate savings
let savings = 1.0 - (tokenizer.count_tokens(&compressed) as f32 / 
                     tokenizer.count_tokens(&text) as f32);
println!("Savings: {:.1}%", savings * 100.0);
```

### Configuration Options


```rust
// Balanced (default) - 50% compression, 89% quality โญ
let balanced = StatisticalFilterConfig::default();

// Conservative - 30% compression, 96% quality
let conservative = StatisticalFilterConfig {
    compression_ratio: 0.7,
    ..Default::default()
};

// Aggressive - 70% compression, 71% quality
let aggressive = StatisticalFilterConfig {
    compression_ratio: 0.3,
    ..Default::default()
};
```

### Image Output Format (Optical Context Compression) ๐Ÿงช BETA


**NEW**: Inspired by [DeepSeek-OCR's optical context compression](https://arxiv.org/html/2510.18234v1), compress text into 1024x1024 images for vision model consumption.

```rust
use compression_prompt::{StatisticalFilter, ImageRenderer};

// Compress text with statistical filtering
let filter = StatisticalFilter::default();
let compressed = filter.compress(&text);

// Render to PNG image
let renderer = ImageRenderer::default();
let png_data = renderer.render_to_png(&compressed)?;
std::fs::write("compressed.png", png_data)?;

// Or render to JPEG (66% smaller than PNG)
let jpeg_data = renderer.render_to_jpeg(&compressed, 85)?; // quality: 85
std::fs::write("compressed.jpg", jpeg_data)?;
```

**Benefits:**
- ๐Ÿ“Š **Token Efficiency**: Use vision tokens instead of text tokens
- ๐Ÿ–ผ๏ธ **High Density**: 1024x1024 monospace rendering with 12.5pt font
- ๐ŸŽฏ **Vision Model Ready**: Compatible with GPT-4V, Claude 3, Gemini Vision
- โšก **Fast Rendering**: < 50ms per image
- ๐Ÿ’พ **JPEG Support**: 66% smaller files (vs PNG) with quality 85
- ๐Ÿ“„ **Auto-pagination**: Splits into multiple pages if text doesn't fit

**Image Formats:**
- **PNG**: Lossless, ~1.4 MB per page, perfect quality
- **JPEG Quality 85**: ~460 KB per page, 66% smaller, excellent readability

**Example:**
```bash
# Generate PNG images (50% compression)

cargo run --release --example paper_to_png_50pct

# Output: rnn_paper_compressed_page1.png, page2.png, page3.png...


# Compare PNG vs JPEG formats

cargo run --release --example compare_image_formats

# Tests different JPEG quality levels

```

**Use Cases:**
- ๐Ÿ“š Dense document compression for vision models
- ๐Ÿ’ฐ Reduce API costs using vision tokens vs text tokens
- ๐Ÿ”ฌ Research on optical context compression
- ๐Ÿ“Š Large-scale document processing

**Status:** Beta - Works well, pending extensive validation with vision models

## ๐Ÿ”ฌ Real Example


**Original (1.6M tokens):**
```
Bayesian Active Learning for Classification... Information theoretic 
active learning has been widely studied for probabilistic models...
[1.6 million more tokens...]
```

**Compressed (831K tokens - 50% reduction in 0.92s):**
```
Bayesian Active Classification... Information been widely studied 
probabilistic models...
[compressed to 831K tokens...]
```

**Removed:** 831,365 tokens (mainly "the", "and", "of", "a", "to")  
**Preserved:** 100% of keywords, 92% of entities  

## ๐Ÿ“Š LLM Validation Results


**Tested across 6 flagship LLMs with 350+ A/B test pairs:**

### Statistical 50% (Recommended Default โญ)


| LLM | Quality | Token Savings | Use Case |
|-----|---------|--------------|----------|
| **Grok-4** | **93%** | **50%** | Best overall performance |
| **Claude 3.5 Sonnet** | **91%** | **50%** | Best cost-benefit โญ |
| **Gemini Pro** | **89%** | **50%** | Balanced production |
| **GPT-5** | **89%** | **50%** | Keyword retention |
| **Grok** | **88%** | **50%** | Technical content |
| **Claude Haiku** | **87%** | **50%** | Cost-optimized |

### Statistical 70% (High Fidelity)


| LLM | Quality | Token Savings | Use Case |
|-----|---------|--------------|----------|
| **Grok-4** | **98%** | **30%** | Critical tasks |
| **Claude 3.5 Sonnet** | **97%** | **30%** | High precision |
| **GPT-5** | **96%** | **30%** | Legal/Medical |
| **Gemini Pro** | **96%** | **30%** | Near-perfect |
| **Grok** | **95%** | **30%** | Complex reasoning |
| **Claude Haiku** | **94%** | **30%** | Recommended for Haiku |

### Performance Characteristics


| Compression | Token Savings | Speed | Keyword Retention | Entity Retention |
|-------------|--------------|-------|-------------------|------------------|
| **50% (statistical_50)** โญ | **50%** | **0.16ms** | **92.0%** | **89.5%** |
| 70% (statistical_70) | 30% | 0.15ms | 99.2% | 98.4% |
| 30% (statistical_30) | 70% | 0.17ms | 72.4% | 71.5% |

## ๐Ÿ’ฐ Cost Savings (Validated Quality)


**For 1 million tokens with statistical_50:**

| LLM | Cost Before | Cost After | Savings | Quality Retained |
|-----|-------------|------------|---------|------------------|
| Grok-4 | $5.00 | $2.50 | **$2.50 (50%)** | **93%** |
| Claude Sonnet | $15.00 | $7.50 | **$7.50 (50%)** | **91%** โญ |
| GPT-5 | $5.00 | $2.50 | **$2.50 (50%)** | **89%** |
| Gemini Pro | $3.50 | $1.75 | **$1.75 (50%)** | **89%** |

**Annual savings** for high-volume applications (Claude Sonnet):
- **100M tokens/month**: $7,500/month = **$90,000/year** ๐Ÿ’ฐ
- **1B tokens/month**: $75,000/month = **$900,000/year** ๐Ÿ’ฐ

**ROI**: 91% quality with 50% cost reduction = **Excellent cost-benefit**

## ๐Ÿš€ Benchmarks


```bash
# Test on full dataset (200 papers, 1.6M tokens)

cargo run --release --bin test_statistical

# Quality benchmark (20 papers with detailed metrics)

cargo run --release --bin bench_quality

# Generate LLM evaluation dataset (63 prompt pairs)

cargo run --release --bin generate_llm_dataset
```

## ๐Ÿ“Š Complete A/B Test Results


**350+ test pairs validated across 6 LLMs:**

```bash
# View aggregated results

cat benchmarks/ab_tests/ab_test_comparison.md

# View LLM-specific reports

cat benchmarks/CLAUDE-SONNET-TEST-AB.md
cat benchmarks/GROK-4-TEST-AB.md
cat benchmarks/GPT5-TEST-AB.md
cat benchmarks/GEMINI-TEST-AB.md

# Access individual test files

ls benchmarks/llm_tests/100papers_statistical_50/  # 150 files
ls benchmarks/llm_tests/200papers_statistical_50/  # 300 files
```

**Test Coverage:**
- 100 papers dataset: 50 pairs per technique (150 pairs total)
- 200 papers dataset: 100 pairs per technique (300 pairs total)
- Techniques: statistical_50, statistical_70, hybrid
- All pairs include original + compressed + quality metrics

## ๐Ÿ“š Documentation


- **LLM Validation Reports:**
  - [Claude Sonnet A/B Test]benchmarks/CLAUDE-SONNET-TEST-AB.md - Recommended โญ
  - [Grok-4 A/B Test]benchmarks/GROK-4-TEST-AB.md - Best performance
  - [GPT-5 A/B Test]benchmarks/GPT5-TEST-AB.md
  - [Gemini A/B Test]benchmarks/GEMINI-TEST-AB.md
  - [Grok A/B Test]benchmarks/GROK-TEST-AB.md
  - [Claude Haiku A/B Test]benchmarks/CLAUDE-HAIKU-TEST-AB.md
- [A/B Test Comparison]benchmarks/ab_tests/ab_test_comparison.md - All results
- [Examples]docs/EXAMPLES.md - Practical use cases
- [Architecture]docs/ARCHITECTURE.md - System design
- [Roadmap]docs/ROADMAP.md - Future plans

## ๐ŸŽฏ Use Cases


### โœ… Perfect For:


- **RAG Systems**: Compress retrieved context (50% token savings)
- **Q&A Systems**: Reduce prompt size while preserving semantics
- **Long Document Processing**: Pre-compress before sending to LLM
- **Cost Optimization**: 50% fewer tokens = 50% lower API costs
- **Real-time Applications**: <1ms latency impact

### โš ๏ธ Not Ideal For:


- Creative writing (may lose style/voice)
- Poetry or literary text
- Very short texts (< 100 tokens)
- When every word matters (legal contracts, exact quotes)

## ๐Ÿงช Reproducing Our Results


All test pairs are available for independent validation:

```bash
# View a specific test pair

cat benchmarks/llm_tests/100papers_statistical_50/test_001_original.txt
cat benchmarks/llm_tests/100papers_statistical_50/test_001_compressed.txt

# Test with your LLM

python3 scripts/test_with_llm.py \
  --original benchmarks/llm_tests/100papers_statistical_50/test_001_original.txt \
  --compressed benchmarks/llm_tests/100papers_statistical_50/test_001_compressed.txt \
  --model claude-3-5-sonnet

# Expected results based on our validation:

# - Claude Sonnet: 91% quality, 50% savings

# - Grok-4: 93% quality, 50% savings

# - GPT-5: 89% quality, 50% savings

```

## ๐Ÿ”ง Advanced Configuration


Customize scoring weights for your use case:

```rust
let config = StatisticalFilterConfig {
    compression_ratio: 0.5,
    idf_weight: 0.3,         // Rare word importance (default: 0.3)
    position_weight: 0.2,    // Start/end prioritization (default: 0.2)
    pos_weight: 0.2,         // Content word importance (default: 0.2)
    entity_weight: 0.2,      // Named entity importance (default: 0.2)
    entropy_weight: 0.1,     // Vocabulary diversity (default: 0.1)
};
```

## ๐Ÿค Contributing


See [ROADMAP.md](docs/ROADMAP.md) for planned features.

## ๐Ÿ“„ License


MIT

## ๐Ÿ™ Acknowledgments


- Statistical filtering inspired by [LLMLingua]https://github.com/microsoft/LLMLingua
- Validated on arXiv papers from machine learning research
- A/B testing performed with: Grok-4, Claude 3.5 Sonnet, Claude 3.5 Haiku, GPT-5, Gemini Pro, Grok
- Built with Rust for maximum performance

## ๐ŸŽฏ Quick Recommendations


| Your LLM | Recommended Config | Quality | Savings | Why |
|----------|-------------------|---------|---------|-----|
| Grok-4 | statistical_50 | 93% | 50% | Best overall |
| Claude Sonnet | statistical_50 | 91% | 50% | Best cost-benefit โญ |
| GPT-5 | statistical_50 | 89% | 50% | Good balance |
| Gemini Pro | statistical_50 | 89% | 50% | Production ready |
| Claude Haiku | statistical_70 | 94% | 30% | Needs structure |
| Grok | statistical_70 | 95% | 30% | Conservative |

**Don't know which to choose?** โ†’ Use **Claude Sonnet + statistical_50** for the best cost-benefit ratio.