# oxify-connect-vision
🔍 **Vision/OCR connector for OxiFY workflow automation engine**
## Overview
High-performance OCR (Optical Character Recognition) library supporting multiple backends with GPU acceleration, async processing, and comprehensive output formats. Designed for production workflows requiring reliable document processing at scale.
## Features
- 🚀 **Multiple OCR Providers**: Mock (testing), Tesseract (traditional), Surya (modern), PaddleOCR (multilingual)
- ⚡ **GPU Acceleration**: CUDA and CoreML support via ONNX Runtime
- 🔄 **Async/Await**: Non-blocking processing for high throughput
- 💾 **Smart Caching**: Configurable LRU cache with TTL
- 📊 **Rich Output**: Text, Markdown, and structured JSON with bounding boxes
- 🌍 **Multi-language**: 100+ languages supported (provider-dependent)
- 🎯 **Layout Analysis**: Preserve document structure and hierarchy
- 🛡️ **Production Ready**: Zero warnings, comprehensive error handling
## Providers Comparison
| Provider | Backend | GPU | Languages | Accuracy | Dependencies |
|----------|---------|-----|-----------|----------|--------------|
| **Mock** | In-memory | ❌ | Any | Low | None |
| **Tesseract** | leptess | ❌ | 100+ | Medium | System package |
| **Surya** | ONNX Runtime | ✅ | 6+ | High | ONNX models |
| **PaddleOCR** | ONNX Runtime | ✅ | 80+ | High | ONNX models |
### Provider Details
#### Mock Provider
- **Purpose**: Testing and development
- **Performance**: <1ms per image
- **Use Cases**: Unit tests, CI/CD pipelines, demos
- **Limitations**: Returns placeholder text
#### Tesseract Provider
- **Purpose**: General-purpose OCR
- **Performance**: 200-500ms per page
- **Use Cases**: Printed documents, forms, simple layouts
- **Strengths**: Mature, widely used, no GPU required
- **Limitations**: Struggles with complex layouts
#### Surya Provider
- **Purpose**: Modern document understanding
- **Performance**: 50-300ms (GPU), 200-500ms (CPU)
- **Use Cases**: Complex layouts, academic papers, reports
- **Strengths**: Excellent layout analysis, good quality
- **Requirements**: ONNX detection & recognition models
#### PaddleOCR Provider
- **Purpose**: Multilingual document processing
- **Performance**: 60-400ms (GPU), 300-600ms (CPU)
- **Use Cases**: Asian languages, mixed scripts
- **Strengths**: 80+ languages, production-proven
- **Requirements**: ONNX detection, recognition, & classification models
## Installation
### Basic Installation
Add to your `Cargo.toml`:
```toml
[dependencies]
oxify-connect-vision = { path = "../oxify-connect-vision" }
```
### With Specific Providers
```toml
[dependencies]
oxify-connect-vision = { path = "../oxify-connect-vision", features = ["mock", "tesseract"] }
```
### All Features (Development)
```toml
[dependencies]
oxify-connect-vision = { path = "../oxify-connect-vision", features = ["mock", "tesseract", "surya", "paddle", "cuda"] }
```
### Feature Flags
| Flag | Enables | Dependencies |
|------|---------|--------------|
| `mock` | Mock provider | None (default) |
| `tesseract` | Tesseract OCR | leptess, tesseract-sys |
| `surya` | Surya ONNX | ort |
| `paddle` | PaddleOCR ONNX | ort, ndarray |
| `onnx` | ONNX Runtime base | ort |
| `cuda` | CUDA GPU support | CUDA toolkit |
| `coreml` | CoreML (macOS) | CoreML |
## Quick Start
### 1. Simple OCR with Mock Provider
```rust
use oxify_connect_vision::{create_provider, VisionProviderConfig};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create provider
let config = VisionProviderConfig::mock();
let provider = create_provider(&config)?;
// Load model (idempotent)
provider.load_model().await?;
// Process image
let image_data = std::fs::read("document.png")?;
let result = provider.process_image(&image_data).await?;
println!("📄 Text: {}", result.text);
println!("📝 Markdown:\n{}", result.markdown);
println!("📊 Blocks: {}", result.blocks.len());
Ok(())
}
```
### 2. Production OCR with Tesseract
```rust
use oxify_connect_vision::{create_provider, VisionProviderConfig};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Configure Tesseract for Japanese
let config = VisionProviderConfig::tesseract(Some("jpn"));
let provider = create_provider(&config)?;
provider.load_model().await?;
// Process image
let image_data = std::fs::read("japanese_doc.png")?;
let result = provider.process_image(&image_data).await?;
// Access structured results
for block in &result.blocks {
println!(
"🔤 {} (role: {}, confidence: {:.2}%)",
block.text,
block.role,
block.confidence * 100.0
);
}
Ok(())
}
```
### 3. GPU-Accelerated OCR with Surya
```rust
use oxify_connect_vision::{create_provider, VisionProviderConfig};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Configure Surya with GPU
let config = VisionProviderConfig::surya(
"/path/to/models", // Model directory
true // Enable GPU
);
let provider = create_provider(&config)?;
provider.load_model().await?;
let image_data = std::fs::read("complex_layout.png")?;
let start = std::time::Instant::now();
let result = provider.process_image(&image_data).await?;
let duration = start.elapsed();
println!("⚡ Processed in {:?}", duration);
println!("📊 Found {} text blocks", result.blocks.len());
Ok(())
}
```
### 4. Using the Cache
```rust
use oxify_connect_vision::{VisionCache, create_provider, VisionProviderConfig};
use std::time::Duration;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create cache
let mut cache = VisionCache::new();
cache.set_max_entries(1000);
cache.set_ttl(Duration::from_secs(3600));
let provider = create_provider(&VisionProviderConfig::mock())?;
provider.load_model().await?;
let image_data = std::fs::read("document.png")?;
let cache_key = format!("doc_{}", compute_hash(&image_data));
// Check cache first
let result = if let Some(cached) = cache.get(&cache_key) {
println!("💾 Cache hit!");
cached
} else {
println!("🔄 Processing image...");
let result = provider.process_image(&image_data).await?;
cache.put(cache_key.clone(), result.clone());
result
};
println!("📄 Text: {}", result.text);
Ok(())
}
// NOTE: DefaultHasher output is not guaranteed to be stable across Rust
// versions; use a fixed algorithm (e.g. SHA-256) for persistent caches.
fn compute_hash(data: &[u8]) -> String {
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
let mut hasher = DefaultHasher::new();
data.hash(&mut hasher);
format!("{:x}", hasher.finish())
}
```
## CLI Usage
The Oxify CLI provides convenient commands for OCR operations:
```bash
# List available providers
oxify vision list
# Process an image with specific provider
oxify vision process document.png \
--provider tesseract \
--format markdown \
--output output.md
# Process with language specification
oxify vision process japanese.png \
--provider tesseract \
--language jpn
# Get detailed provider information
oxify vision info surya
# Benchmark multiple providers
oxify vision benchmark test.png \
--providers tesseract,surya,paddle \
--iterations 10
# Extract structured data
oxify vision extract receipt.png \
--data-type receipt \
--provider paddle
```
## Workflow Integration
### Using in JSON Workflows
```json
{
"nodes": [
{
"id": "ocr-node",
"name": "Document OCR",
"kind": {
"type": "Vision",
"config": {
"provider": "surya",
"model_path": "/models/surya",
"output_format": "markdown",
"use_gpu": true,
"language": "en",
"image_input": "{{input.document_image}}"
}
}
}
]
}
```
### Chaining with LLM Nodes
```json
{
"nodes": [
{
"id": "ocr",
"name": "Extract Text",
"kind": {
"type": "Vision",
"config": {
"provider": "tesseract",
"image_input": "{{input.image}}"
}
}
},
{
"id": "analyze",
"name": "Analyze Content",
"kind": {
"type": "LLM",
"config": {
"provider": "openai",
"model": "gpt-4",
"prompt_template": "Analyze this document:\n\n{{ocr.markdown}}"
}
}
}
],
"edges": [
{"from": "ocr", "to": "analyze"}
]
}
```
## Output Formats
### Text Output
```
Simple text extraction with whitespace preservation.
Suitable for full-text search and basic NLP.
```
### Markdown Output
```markdown
# Document Title
## Section Header
Regular text content with **formatting** preserved.
- List item 1
- List item 2
| Data 1 | Data 2 |
```
### JSON Output
```json
{
"text": "Full document text...",
"markdown": "# Document Title\n\n...",
"blocks": [
{
"text": "Document Title",
"bbox": [0.1, 0.1, 0.9, 0.2],
"confidence": 0.98,
"role": "Title"
}
],
"metadata": {
"provider": "surya",
"processing_time_ms": 145,
"image_width": 1920,
"image_height": 1080,
"languages": ["en"],
"page_count": 1
}
}
```
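The `bbox` values in the JSON above appear to be normalized `[x0, y0, x1, y1]` coordinates in the 0.0–1.0 range. Assuming that convention, a small self-contained sketch (not part of the crate's API) converts a block's box to pixel coordinates using the `image_width`/`image_height` metadata:

```rust
/// Convert a normalized `[x0, y0, x1, y1]` bounding box (0.0–1.0)
/// into absolute pixel coordinates for a given image size.
fn bbox_to_pixels(bbox: [f64; 4], width: u32, height: u32) -> (u32, u32, u32, u32) {
    let w = width as f64;
    let h = height as f64;
    (
        (bbox[0] * w).round() as u32,
        (bbox[1] * h).round() as u32,
        (bbox[2] * w).round() as u32,
        (bbox[3] * h).round() as u32,
    )
}

fn main() {
    // The "Document Title" block above, on a 1920x1080 image.
    let (x0, y0, x1, y1) = bbox_to_pixels([0.1, 0.1, 0.9, 0.2], 1920, 1080);
    println!("({x0}, {y0}) -> ({x1}, {y1})"); // (192, 108) -> (1728, 216)
}
```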
## Provider Setup
### Tesseract Installation
**Ubuntu/Debian:**
```bash
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-eng tesseract-ocr-jpn
```
**macOS:**
```bash
brew install tesseract
brew install tesseract-lang # Additional languages
```
**Windows:**
Download installer from: https://github.com/UB-Mannheim/tesseract/wiki
**Verify:**
```bash
tesseract --version
tesseract --list-langs
```
### Surya Models
1. Download models from Surya releases
2. Place in a directory:
```
models/surya/
├── detection.onnx
└── recognition.onnx
```
3. Set path in configuration:
```rust
VisionProviderConfig::surya("/path/to/models/surya", false)
```
### PaddleOCR Models
1. Download from PaddlePaddle releases
2. Structure:
```
models/paddle/
├── det.onnx # Detection model
├── rec.onnx # Recognition model
└── cls.onnx # Classification model
```
3. Configure:
```rust
VisionProviderConfig::paddle("/path/to/models/paddle", true)
```
## Performance Benchmarks
Tested on: AMD Ryzen 9 5950X, NVIDIA RTX 3090, 1920x1080 images
| Provider | CPU Time | GPU Time | Memory | Accuracy* |
|----------|----------|----------|--------|-----------|
| Mock | <1ms | - | <1MB | N/A |
| Tesseract | 450ms | - | ~200MB | 85% |
| Surya | 320ms | 45ms | ~1.5GB | 92% |
| PaddleOCR | 380ms | 55ms | ~1.8GB | 90% |

\*Accuracy measured on a standard document dataset.
### Optimization Tips
1. **Enable Caching**: 10-1000x speedup for repeated images
2. **Use GPU**: 5-10x speedup for ONNX providers
3. **Batch Processing**: Process multiple images concurrently
4. **Image Preprocessing**: Resize large images before processing
5. **Choose Right Provider**: Match provider capabilities to use case
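Tip 3 (batch processing) can be sketched with plain `std::thread`; in an async application the equivalent is awaiting the `process_image` futures concurrently, e.g. with `futures::future::join_all`. The `ocr_stub` below is a hypothetical placeholder standing in for a real provider call, not part of this crate:

```rust
use std::thread;

// Stand-in for a per-image OCR call; real code would call
// `provider.process_image(&image).await` here instead.
fn ocr_stub(image: Vec<u8>) -> usize {
    image.len()
}

// Process a batch of images concurrently, one thread per image,
// and return the results in the original submission order.
fn process_batch(images: Vec<Vec<u8>>) -> Vec<usize> {
    let handles: Vec<_> = images
        .into_iter()
        .map(|img| thread::spawn(move || ocr_stub(img)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let images = vec![vec![0u8; 100], vec![0u8; 200], vec![0u8; 300]];
    let results = process_batch(images);
    println!("{results:?}"); // [100, 200, 300]
}
```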
## Error Handling
```rust
use oxify_connect_vision::{VisionError, create_provider, VisionProviderConfig};
async fn safe_ocr(image: &[u8]) -> Result<String, String> {
let config = VisionProviderConfig::tesseract(None);
let provider = create_provider(&config)
.map_err(|e| format!("Provider creation failed: {}", e))?;
provider.load_model().await
.map_err(|e| format!("Model loading failed: {}", e))?;
match provider.process_image(image).await {
Ok(result) => Ok(result.text),
Err(VisionError::InvalidImage(msg)) => {
Err(format!("Invalid image: {}", msg))
}
Err(VisionError::ProcessingFailed(msg)) => {
Err(format!("Processing failed: {}", msg))
}
Err(e) => Err(format!("Unknown error: {}", e))
}
}
```
## Testing
Run tests:
```bash
# Unit tests (mock provider)
cargo test
# Integration tests (requires provider setup)
cargo test --features tesseract -- --ignored
# All tests across all feature flags
cargo test --all-features
```
Example test:
```rust
use oxify_connect_vision::{create_provider, VisionProviderConfig};

#[tokio::test]
async fn test_mock_ocr() {
let config = VisionProviderConfig::mock();
let provider = create_provider(&config).unwrap();
provider.load_model().await.unwrap();
let result = provider.process_image(b"test").await.unwrap();
assert!(!result.text.is_empty());
assert_eq!(result.metadata.provider, "mock");
}
```
## Troubleshooting
### "Model not loaded" Error
```rust
// Always call load_model() before processing
provider.load_model().await?;
```
### Poor OCR Quality
- Check image quality (DPI, contrast, noise)
- Try different provider (Surya for complex layouts)
- Specify correct language
- Preprocess image (denoise, deskew)
### ONNX Runtime Errors
- Verify model files are compatible ONNX format
- Check ONNX Runtime version: `cargo tree | grep ort`
- Ensure GPU drivers are installed (for CUDA/CoreML)
### Memory Issues
- Reduce cache size: `cache.set_max_entries(100)`
- Process images in batches
- Resize large images before processing
## Contributing
We welcome contributions! Areas of interest:
- Additional provider integrations
- Performance optimizations
- Language-specific improvements
- Documentation and examples
- Bug fixes and tests
See [TODO.md](./TODO.md) for planned enhancements.
## License
Apache-2.0 - See LICENSE file in the root directory.
## Links
- [Main Repository](https://github.com/cool-japan/oxify)
- [Documentation](https://docs.rs/oxify-connect-vision)
- [Issue Tracker](https://github.com/cool-japan/oxify/issues)
- [Changelog](./CHANGELOG.md)
---
**Built with ❤️ for the Oxify workflow automation platform**