oxify-connect-vision 0.1.0

Vision/OCR connector for OxiFY workflows

Overview

High-performance OCR (Optical Character Recognition) library supporting multiple backends with GPU acceleration, async processing, and comprehensive output formats. Designed for production workflows requiring reliable document processing at scale.

Features

  • 🚀 Multiple OCR Providers: Mock (testing), Tesseract (traditional), Surya (modern), PaddleOCR (multilingual)
  • ⚡ GPU Acceleration: CUDA and CoreML support via ONNX Runtime
  • 🔄 Async/Await: Non-blocking processing for high throughput
  • 💾 Smart Caching: Configurable LRU cache with TTL
  • 📊 Rich Output: Text, Markdown, and structured JSON with bounding boxes
  • 🌍 Multi-language: 100+ languages supported (provider-dependent)
  • 🎯 Layout Analysis: Preserve document structure and hierarchy
  • 🛡️ Production Ready: Zero warnings, comprehensive error handling

Provider Comparison

| Provider  | Backend      | GPU | Languages | Quality | Setup          |
|-----------|--------------|-----|-----------|---------|----------------|
| Mock      | In-memory    | -   | Any       | Low     | None           |
| Tesseract | leptess      | -   | 100+      | Medium  | System package |
| Surya     | ONNX Runtime | ✓   | 6+        | High    | ONNX models    |
| PaddleOCR | ONNX Runtime | ✓   | 80+       | High    | ONNX models    |

Provider Details

Mock Provider

  • Purpose: Testing and development
  • Performance: <1ms per image
  • Use Cases: Unit tests, CI/CD pipelines, demos
  • Limitations: Returns placeholder text

Tesseract Provider

  • Purpose: General-purpose OCR
  • Performance: 200-500ms per page
  • Use Cases: Printed documents, forms, simple layouts
  • Strengths: Mature, widely used, no GPU required
  • Limitations: Struggles with complex layouts

Surya Provider

  • Purpose: Modern document understanding
  • Performance: 50-300ms (GPU), 200-500ms (CPU)
  • Use Cases: Complex layouts, academic papers, reports
  • Strengths: Excellent layout analysis, good quality
  • Requirements: ONNX detection & recognition models

PaddleOCR Provider

  • Purpose: Multilingual document processing
  • Performance: 60-400ms (GPU), 300-600ms (CPU)
  • Use Cases: Asian languages, mixed scripts
  • Strengths: 80+ languages, production-proven
  • Requirements: ONNX detection, recognition, & classification models

Installation

Basic Installation

Add to your Cargo.toml:

[dependencies]
oxify-connect-vision = { path = "../oxify-connect-vision" }

With Specific Providers

[dependencies]
oxify-connect-vision = { path = "../oxify-connect-vision", features = ["mock", "tesseract"] }

All Features (Development)

[dependencies]
oxify-connect-vision = { path = "../oxify-connect-vision", features = ["mock", "tesseract", "surya", "paddle", "cuda"] }

Feature Flags

| Feature   | Description       | Dependencies           |
|-----------|-------------------|------------------------|
| mock      | Mock provider     | None (default)         |
| tesseract | Tesseract OCR     | leptess, tesseract-sys |
| surya     | Surya ONNX        | ort                    |
| paddle    | PaddleOCR ONNX    | ort, ndarray           |
| onnx      | ONNX Runtime base | ort                    |
| cuda      | CUDA GPU support  | CUDA toolkit           |
| coreml    | CoreML (macOS)    | CoreML                 |
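Feature flags can also drive runtime decisions in your own code. A minimal sketch of selecting an execution backend based on which flags were compiled in, using the flag names from the table above (the function itself is illustrative, not part of the crate's API):

```rust
/// Pick an execution backend based on enabled Cargo features,
/// checking GPU backends first and falling back to CPU.
fn execution_backend() -> &'static str {
    if cfg!(feature = "cuda") {
        "cuda"
    } else if cfg!(feature = "coreml") {
        "coreml"
    } else {
        "cpu"
    }
}

fn main() {
    println!("backend: {}", execution_backend());
}
```

With no GPU features enabled, this prints `backend: cpu`.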

Quick Start

1. Simple OCR with Mock Provider

use oxify_connect_vision::{create_provider, VisionProviderConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create provider
    let config = VisionProviderConfig::mock();
    let provider = create_provider(&config)?;

    // Load model (idempotent)
    provider.load_model().await?;

    // Process image
    let image_data = std::fs::read("document.png")?;
    let result = provider.process_image(&image_data).await?;

    println!("📄 Text: {}", result.text);
    println!("📝 Markdown:\n{}", result.markdown);
    println!("📊 Blocks: {}", result.blocks.len());

    Ok(())
}

2. Production OCR with Tesseract

use oxify_connect_vision::{create_provider, VisionProviderConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Configure Tesseract for Japanese
    let config = VisionProviderConfig::tesseract(Some("jpn"));
    let provider = create_provider(&config)?;
    provider.load_model().await?;

    // Process image
    let image_data = std::fs::read("japanese_doc.png")?;
    let result = provider.process_image(&image_data).await?;

    // Access structured results
    for block in &result.blocks {
        println!(
            "🔤 {} (role: {}, confidence: {:.2}%)",
            block.text,
            block.role,
            block.confidence * 100.0
        );
    }

    Ok(())
}

3. GPU-Accelerated OCR with Surya

use oxify_connect_vision::{create_provider, VisionProviderConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Configure Surya with GPU
    let config = VisionProviderConfig::surya(
        "/path/to/models",  // Model directory
        true                // Enable GPU
    );

    let provider = create_provider(&config)?;
    provider.load_model().await?;

    let image_data = std::fs::read("complex_layout.png")?;

    let start = std::time::Instant::now();
    let result = provider.process_image(&image_data).await?;
    let duration = start.elapsed();

    println!("⚡ Processed in {:?}", duration);
    println!("📊 Found {} text blocks", result.blocks.len());

    Ok(())
}

4. Using the Cache

use oxify_connect_vision::{VisionCache, create_provider, VisionProviderConfig};
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create cache
    let mut cache = VisionCache::new();
    cache.set_max_entries(1000);
    cache.set_ttl(Duration::from_secs(3600));

    let provider = create_provider(&VisionProviderConfig::mock())?;
    provider.load_model().await?;

    let image_data = std::fs::read("document.png")?;
    let cache_key = format!("doc_{}", compute_hash(&image_data));

    // Check cache first
    let result = if let Some(cached) = cache.get(&cache_key) {
        println!("💾 Cache hit!");
        cached
    } else {
        println!("🔄 Processing image...");
        let result = provider.process_image(&image_data).await?;
        cache.put(cache_key.clone(), result.clone());
        result
    };

    println!("📄 {}", result.text);

    Ok(())
}

fn compute_hash(data: &[u8]) -> String {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    format!("{:x}", hasher.finish())
}
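The cache settings above (max entries, TTL) can be illustrated with a std-only sketch. This is not the crate's VisionCache implementation, just the shape of an entry-capped, time-expiring map:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal TTL cache: entries expire after `ttl`, and the map is
/// capped at `max_entries` (evicting an arbitrary entry when full;
/// a real LRU would evict the least recently used instead).
struct TtlCache {
    entries: HashMap<String, (Instant, String)>,
    max_entries: usize,
    ttl: Duration,
}

impl TtlCache {
    fn new(max_entries: usize, ttl: Duration) -> Self {
        Self { entries: HashMap::new(), max_entries, ttl }
    }

    fn get(&self, key: &str) -> Option<&String> {
        self.entries
            .get(key)
            .filter(|(inserted, _)| inserted.elapsed() < self.ttl)
            .map(|(_, v)| v)
    }

    fn put(&mut self, key: String, value: String) {
        if self.entries.len() >= self.max_entries && !self.entries.contains_key(&key) {
            // Evict an arbitrary entry to stay under the cap.
            if let Some(k) = self.entries.keys().next().cloned() {
                self.entries.remove(&k);
            }
        }
        self.entries.insert(key, (Instant::now(), value));
    }
}

fn main() {
    let mut cache = TtlCache::new(1000, Duration::from_secs(3600));
    cache.put("doc_abc".into(), "OCR result text".into());
    assert!(cache.get("doc_abc").is_some());
    assert!(cache.get("doc_missing").is_none());
}
```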

CLI Usage

The Oxify CLI provides convenient commands for OCR operations:

# List available providers
oxify vision list

# Process an image with specific provider
oxify vision process document.png \
  --provider tesseract \
  --format markdown \
  --output output.md

# Process with language specification
oxify vision process japanese.png \
  --provider tesseract \
  --language jpn

# Get detailed provider information
oxify vision info surya

# Benchmark multiple providers
oxify vision benchmark test.png \
  --providers tesseract,surya,paddle \
  --iterations 10

# Extract structured data
oxify vision extract receipt.png \
  --data-type receipt \
  --provider paddle

Workflow Integration

Using in JSON Workflows

{
  "nodes": [
    {
      "id": "ocr-node",
      "name": "Document OCR",
      "kind": {
        "type": "Vision",
        "config": {
          "provider": "surya",
          "model_path": "/models/surya",
          "output_format": "markdown",
          "use_gpu": true,
          "language": "en",
          "image_input": "{{input.document_image}}"
        }
      }
    }
  ]
}

Chaining with LLM Nodes

{
  "nodes": [
    {
      "id": "ocr",
      "name": "Extract Text",
      "kind": {
        "type": "Vision",
        "config": {
          "provider": "tesseract",
          "image_input": "{{input.image}}"
        }
      }
    },
    {
      "id": "analyze",
      "name": "Analyze Content",
      "kind": {
        "type": "LLM",
        "config": {
          "provider": "openai",
          "model": "gpt-4",
          "prompt_template": "Analyze this document:\n\n{{ocr.markdown}}"
        }
      }
    }
  ],
  "edges": [
    {"from": "ocr", "to": "analyze"}
  ]
}

Output Formats

Text Output

Simple text extraction with whitespace preservation.
Suitable for full-text search and basic NLP.

Markdown Output

# Document Title

## Section Header

Regular text content with **formatting** preserved.

- List item 1
- List item 2

| Column 1 | Column 2 |
|----------|----------|
| Data 1   | Data 2   |

JSON Output

{
  "text": "Full document text...",
  "markdown": "# Document Title\n\n...",
  "blocks": [
    {
      "text": "Document Title",
      "bbox": [0.1, 0.1, 0.9, 0.2],
      "confidence": 0.98,
      "role": "Title"
    }
  ],
  "metadata": {
    "provider": "surya",
    "processing_time_ms": 145,
    "image_width": 1920,
    "image_height": 1080,
    "languages": ["en"],
    "page_count": 1
  }
}
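The bbox values in the sample above appear to be normalized to the [0, 1] range as [x0, y0, x1, y1]. Assuming that convention, a sketch of converting a block's box to pixel coordinates using the metadata's image dimensions:

```rust
/// Convert a normalized [x0, y0, x1, y1] bounding box to pixel
/// coordinates, assuming values are fractions of the image size.
fn bbox_to_pixels(bbox: [f64; 4], width: u32, height: u32) -> [u32; 4] {
    let (w, h) = (width as f64, height as f64);
    [
        (bbox[0] * w).round() as u32,
        (bbox[1] * h).round() as u32,
        (bbox[2] * w).round() as u32,
        (bbox[3] * h).round() as u32,
    ]
}

fn main() {
    // The "Title" block from the sample output, on a 1920x1080 image.
    let px = bbox_to_pixels([0.1, 0.1, 0.9, 0.2], 1920, 1080);
    println!("{:?}", px); // [192, 108, 1728, 216]
}
```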

Provider Setup

Tesseract Installation

Ubuntu/Debian:

sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-eng tesseract-ocr-jpn

macOS:

brew install tesseract
brew install tesseract-lang  # Additional languages

Windows: Download installer from: https://github.com/UB-Mannheim/tesseract/wiki

Verify:

tesseract --version
tesseract --list-langs

Surya Models

  1. Download models from Surya releases
  2. Place in a directory:
    models/surya/
    ├── detection.onnx
    └── recognition.onnx
    
  3. Set path in configuration:
    VisionProviderConfig::surya("/path/to/models/surya", false)
    

PaddleOCR Models

  1. Download from PaddlePaddle releases
  2. Structure:
    models/paddle/
    ├── det.onnx    # Detection model
    ├── rec.onnx    # Recognition model
    └── cls.onnx    # Classification model
    
  3. Configure:
    VisionProviderConfig::paddle("/path/to/models/paddle", true)
    

Performance Benchmarks

Tested on: AMD Ryzen 9 5950X, NVIDIA RTX 3090, 1920x1080 images

| Provider  | CPU Time | GPU Time | Memory | Accuracy* |
|-----------|----------|----------|--------|-----------|
| Mock      | <1ms     | -        | <1MB   | N/A       |
| Tesseract | 450ms    | -        | ~200MB | 85%       |
| Surya     | 320ms    | 45ms     | ~1.5GB | 92%       |
| PaddleOCR | 380ms    | 55ms     | ~1.8GB | 90%       |

*Accuracy measured on standard document dataset
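Numbers like these come from a simple averaged timing loop. A sketch of such a harness, with a stand-in workload where the `provider.process_image(...)` call would go:

```rust
use std::time::{Duration, Instant};

/// Run `workload` `iterations` times and return the mean duration.
fn bench<F: FnMut()>(iterations: u32, mut workload: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        workload();
    }
    start.elapsed() / iterations
}

fn main() {
    // Stand-in for a real OCR call; black_box keeps the work
    // from being optimized away.
    let mean = bench(10, || {
        std::hint::black_box((0..10_000u64).sum::<u64>());
    });
    println!("mean: {:?}", mean);
}
```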

Optimization Tips

  1. Enable Caching: 10-1000x speedup for repeated images
  2. Use GPU: 5-10x speedup for ONNX providers
  3. Batch Processing: Process multiple images concurrently
  4. Image Preprocessing: Resize large images before processing
  5. Choose Right Provider: Match provider capabilities to use case
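Tip 3 (batch processing): the crate's API is async, but the fan-out pattern can be sketched with scoped std threads, using a stand-in for the per-image OCR call:

```rust
use std::thread;

/// Process several images concurrently, one thread per image.
/// `ocr` is a placeholder for a real provider call.
fn process_batch(images: Vec<Vec<u8>>) -> Vec<String> {
    fn ocr(image: &[u8]) -> String {
        format!("{} bytes processed", image.len())
    }

    thread::scope(|s| {
        // Spawn one scoped thread per image...
        let handles: Vec<_> = images
            .iter()
            .map(|img| s.spawn(move || ocr(img)))
            .collect();
        // ...then join in order, preserving input order in the output.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let results = process_batch(vec![vec![0u8; 100], vec![0u8; 200]]);
    println!("{:?}", results);
}
```

With the async API, the same fan-out would typically use concurrent futures instead of threads.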

Error Handling

use oxify_connect_vision::{VisionError, create_provider, VisionProviderConfig};

async fn safe_ocr(image: &[u8]) -> Result<String, String> {
    let config = VisionProviderConfig::tesseract(None);
    let provider = create_provider(&config)
        .map_err(|e| format!("Provider creation failed: {}", e))?;

    provider.load_model().await
        .map_err(|e| format!("Model loading failed: {}", e))?;

    match provider.process_image(image).await {
        Ok(result) => Ok(result.text),
        Err(VisionError::InvalidImage(msg)) => {
            Err(format!("Invalid image: {}", msg))
        }
        Err(VisionError::ProcessingFailed(msg)) => {
            Err(format!("Processing failed: {}", msg))
        }
        Err(e) => Err(format!("Unknown error: {}", e))
    }
}

Testing

Run tests:

# Unit tests (mock provider)
cargo test

# Integration tests (requires setup)
cargo test --features tesseract -- --ignored

# All tests, all features
cargo test --all-features

Example test:

#[tokio::test]
async fn test_mock_ocr() {
    let config = VisionProviderConfig::mock();
    let provider = create_provider(&config).unwrap();
    provider.load_model().await.unwrap();

    let result = provider.process_image(b"test").await.unwrap();
    assert!(!result.text.is_empty());
    assert_eq!(result.metadata.provider, "mock");
}

Troubleshooting

"Model not loaded" Error

// Always call load_model() before processing
provider.load_model().await?;

Poor OCR Quality

  • Check image quality (DPI, contrast, noise)
  • Try different provider (Surya for complex layouts)
  • Specify correct language
  • Preprocess image (denoise, deskew)
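Preprocessing (last bullet) can be as simple as thresholding. A sketch of fixed-threshold binarization on a raw grayscale buffer; a real pipeline would use an adaptive method such as Otsu and an imaging crate for decoding:

```rust
/// Binarize a grayscale buffer in place: pixels at or above
/// `threshold` become white (255), the rest black (0).
fn binarize(pixels: &mut [u8], threshold: u8) {
    for p in pixels.iter_mut() {
        *p = if *p >= threshold { 255 } else { 0 };
    }
}

fn main() {
    let mut pixels = vec![10, 120, 130, 250];
    binarize(&mut pixels, 128);
    println!("{:?}", pixels); // [0, 0, 255, 255]
}
```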

ONNX Runtime Errors

  • Verify model files are compatible ONNX format
  • Check ONNX Runtime version: cargo tree | grep ort
  • Ensure GPU drivers are installed (for CUDA/CoreML)

Memory Issues

  • Reduce cache size: cache.set_max_entries(100)
  • Process images in batches
  • Resize large images before processing
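Resizing before OCR (last bullet) can be sketched with nearest-neighbor downsampling on a row-major grayscale buffer; production code would use an imaging crate with a better resampling filter:

```rust
/// Nearest-neighbor downscale of a grayscale image stored row-major.
fn downscale(src: &[u8], sw: usize, sh: usize, dw: usize, dh: usize) -> Vec<u8> {
    let mut dst = Vec::with_capacity(dw * dh);
    for y in 0..dh {
        for x in 0..dw {
            // Map each destination pixel to its nearest source pixel.
            let sx = x * sw / dw;
            let sy = y * sh / dh;
            dst.push(src[sy * sw + sx]);
        }
    }
    dst
}

fn main() {
    // A 4x4 gradient reduced to 2x2.
    let src: Vec<u8> = (0..16).collect();
    let out = downscale(&src, 4, 4, 2, 2);
    println!("{:?}", out); // [0, 2, 8, 10]
}
```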

Contributing

We welcome contributions! Areas of interest:

  • Additional provider integrations
  • Performance optimizations
  • Language-specific improvements
  • Documentation and examples
  • Bug fixes and tests

See TODO.md for planned enhancements.

License

Apache-2.0 - See LICENSE file in the root directory.

Built with ❤️ for the Oxify workflow automation platform