oxify-connect-vision 0.1.0

Vision/OCR connector for OxiFY workflows

Overview

High-performance OCR (Optical Character Recognition) library supporting multiple backends with GPU acceleration, async processing, and comprehensive output formats. Designed for production workflows requiring reliable document processing at scale.

Features

  • 🚀 Multiple OCR Providers: Mock (testing), Tesseract (traditional), Surya (modern), PaddleOCR (multilingual)
  • ⚡ GPU Acceleration: CUDA and CoreML support via ONNX Runtime
  • 🔄 Async/Await: Non-blocking processing for high throughput
  • 💾 Smart Caching: Configurable LRU cache with TTL
  • 📊 Rich Output: Text, Markdown, and structured JSON with bounding boxes
  • 🌍 Multi-language: 100+ languages supported (provider-dependent)
  • 🎯 Layout Analysis: Preserve document structure and hierarchy
  • 🛡️ Production Ready: Zero warnings, comprehensive error handling

Provider Comparison

| Provider  | Backend      | GPU | Languages | Quality | Setup          |
|-----------|--------------|-----|-----------|---------|----------------|
| Mock      | In-memory    | -   | Any       | Low     | None           |
| Tesseract | leptess      | -   | 100+      | Medium  | System package |
| Surya     | ONNX Runtime | ✓   | 6+        | High    | ONNX models    |
| PaddleOCR | ONNX Runtime | ✓   | 80+       | High    | ONNX models    |

Provider Details

Mock Provider

  • Purpose: Testing and development
  • Performance: <1ms per image
  • Use Cases: Unit tests, CI/CD pipelines, demos
  • Limitations: Returns placeholder text

Tesseract Provider

  • Purpose: General-purpose OCR
  • Performance: 200-500ms per page
  • Use Cases: Printed documents, forms, simple layouts
  • Strengths: Mature, widely used, no GPU required
  • Limitations: Struggles with complex layouts

Surya Provider

  • Purpose: Modern document understanding
  • Performance: 50-300ms (GPU), 200-500ms (CPU)
  • Use Cases: Complex layouts, academic papers, reports
  • Strengths: Excellent layout analysis, good quality
  • Requirements: ONNX detection & recognition models

PaddleOCR Provider

  • Purpose: Multilingual document processing
  • Performance: 60-400ms (GPU), 300-600ms (CPU)
  • Use Cases: Asian languages, mixed scripts
  • Strengths: 80+ languages, production-proven
  • Requirements: ONNX detection, recognition, & classification models

Installation

Basic Installation

Add to your Cargo.toml:

[dependencies]
oxify-connect-vision = { path = "../oxify-connect-vision" }

With Specific Providers

[dependencies]
oxify-connect-vision = { path = "../oxify-connect-vision", features = ["mock", "tesseract"] }

All Features (Development)

[dependencies]
oxify-connect-vision = { path = "../oxify-connect-vision", features = ["mock", "tesseract", "surya", "paddle", "cuda"] }

Feature Flags

| Feature   | Description       | Dependencies           |
|-----------|-------------------|------------------------|
| mock      | Mock provider     | None (default)         |
| tesseract | Tesseract OCR     | leptess, tesseract-sys |
| surya     | Surya ONNX        | ort                    |
| paddle    | PaddleOCR ONNX    | ort, ndarray           |
| onnx      | ONNX Runtime base | ort                    |
| cuda      | CUDA GPU support  | CUDA toolkit           |
| coreml    | CoreML (macOS)    | CoreML                 |
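Feature flags can also drive runtime decisions in your own code. A minimal sketch of selecting an execution backend based on which flags were compiled in, using the flag names from the table above (the function itself is illustrative, not part of the crate's API):

```rust
/// Pick an execution backend based on enabled Cargo features,
/// checking GPU backends first and falling back to CPU.
fn execution_backend() -> &'static str {
    if cfg!(feature = "cuda") {
        "cuda"
    } else if cfg!(feature = "coreml") {
        "coreml"
    } else {
        "cpu"
    }
}

fn main() {
    println!("backend: {}", execution_backend());
}
```

With no GPU features enabled, this prints `backend: cpu`.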

Quick Start

1. Simple OCR with Mock Provider

use oxify_connect_vision::{create_provider, VisionProviderConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create provider
    let config = VisionProviderConfig::mock();
    let provider = create_provider(&config)?;

    // Load model (idempotent)
    provider.load_model().await?;

    // Process image
    let image_data = std::fs::read("document.png")?;
    let result = provider.process_image(&image_data).await?;

    println!("📄 Text: {}", result.text);
    println!("📝 Markdown:\n{}", result.markdown);
    println!("📊 Blocks: {}", result.blocks.len());

    Ok(())
}

2. Production OCR with Tesseract

use oxify_connect_vision::{create_provider, VisionProviderConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Configure Tesseract for Japanese
    let config = VisionProviderConfig::tesseract(Some("jpn"));
    let provider = create_provider(&config)?;
    provider.load_model().await?;

    // Process image
    let image_data = std::fs::read("japanese_doc.png")?;
    let result = provider.process_image(&image_data).await?;

    // Access structured results
    for block in &result.blocks {
        println!(
            "🔤 {} (role: {}, confidence: {:.2}%)",
            block.text,
            block.role,
            block.confidence * 100.0
        );
    }

    Ok(())
}

3. GPU-Accelerated OCR with Surya

use oxify_connect_vision::{create_provider, VisionProviderConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Configure Surya with GPU
    let config = VisionProviderConfig::surya(
        "/path/to/models",  // Model directory
        true                // Enable GPU
    );

    let provider = create_provider(&config)?;
    provider.load_model().await?;

    let image_data = std::fs::read("complex_layout.png")?;

    let start = std::time::Instant::now();
    let result = provider.process_image(&image_data).await?;
    let duration = start.elapsed();

    println!("⚡ Processed in {:?}", duration);
    println!("📊 Found {} text blocks", result.blocks.len());

    Ok(())
}

4. Using the Cache

use oxify_connect_vision::{VisionCache, create_provider, VisionProviderConfig};
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create cache
    let mut cache = VisionCache::new();
    cache.set_max_entries(1000);
    cache.set_ttl(Duration::from_secs(3600));

    let provider = create_provider(&VisionProviderConfig::mock())?;
    provider.load_model().await?;

    let image_data = std::fs::read("document.png")?;
    let cache_key = format!("doc_{}", compute_hash(&image_data));

    // Check cache first
    let result = if let Some(cached) = cache.get(&cache_key) {
        println!("💾 Cache hit!");
        cached
    } else {
        println!("🔄 Processing image...");
        let result = provider.process_image(&image_data).await?;
        cache.put(cache_key.clone(), result.clone());
        result
    };

    println!("📄 {}", result.text);

    Ok(())
}

fn compute_hash(data: &[u8]) -> String {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    format!("{:x}", hasher.finish())
}
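The cache settings above (max entries, TTL) can be illustrated with a std-only sketch. This is not the crate's VisionCache implementation, just the shape of an entry-capped, time-expiring map:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal TTL cache: entries expire after `ttl`, and the map is
/// capped at `max_entries` (evicting an arbitrary entry when full;
/// a real LRU would evict the least recently used instead).
struct TtlCache {
    entries: HashMap<String, (Instant, String)>,
    max_entries: usize,
    ttl: Duration,
}

impl TtlCache {
    fn new(max_entries: usize, ttl: Duration) -> Self {
        Self { entries: HashMap::new(), max_entries, ttl }
    }

    fn get(&self, key: &str) -> Option<&String> {
        self.entries
            .get(key)
            .filter(|(inserted, _)| inserted.elapsed() < self.ttl)
            .map(|(_, v)| v)
    }

    fn put(&mut self, key: String, value: String) {
        if self.entries.len() >= self.max_entries && !self.entries.contains_key(&key) {
            // Evict an arbitrary entry to stay under the cap.
            if let Some(k) = self.entries.keys().next().cloned() {
                self.entries.remove(&k);
            }
        }
        self.entries.insert(key, (Instant::now(), value));
    }
}

fn main() {
    let mut cache = TtlCache::new(1000, Duration::from_secs(3600));
    cache.put("doc_abc".into(), "OCR result text".into());
    assert!(cache.get("doc_abc").is_some());
    assert!(cache.get("doc_missing").is_none());
}
```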

CLI Usage

The Oxify CLI provides convenient commands for OCR operations:

# List available providers
oxify vision list

# Process an image with specific provider
oxify vision process document.png \
  --provider tesseract \
  --format markdown \
  --output output.md

# Process with language specification
oxify vision process japanese.png \
  --provider tesseract \
  --language jpn

# Get detailed provider information
oxify vision info surya

# Benchmark multiple providers
oxify vision benchmark test.png \
  --providers tesseract,surya,paddle \
  --iterations 10

# Extract structured data
oxify vision extract receipt.png \
  --data-type receipt \
  --provider paddle

Workflow Integration

Using in JSON Workflows

{
  "nodes": [
    {
      "id": "ocr-node",
      "name": "Document OCR",
      "kind": {
        "type": "Vision",
        "config": {
          "provider": "surya",
          "model_path": "/models/surya",
          "output_format": "markdown",
          "use_gpu": true,
          "language": "en",
          "image_input": "{{input.document_image}}"
        }
      }
    }
  ]
}

Chaining with LLM Nodes

{
  "nodes": [
    {
      "id": "ocr",
      "name": "Extract Text",
      "kind": {
        "type": "Vision",
        "config": {
          "provider": "tesseract",
          "image_input": "{{input.image}}"
        }
      }
    },
    {
      "id": "analyze",
      "name": "Analyze Content",
      "kind": {
        "type": "LLM",
        "config": {
          "provider": "openai",
          "model": "gpt-4",
          "prompt_template": "Analyze this document:\n\n{{ocr.markdown}}"
        }
      }
    }
  ],
  "edges": [
    {"from": "ocr", "to": "analyze"}
  ]
}

Output Formats

Text Output

Simple text extraction with whitespace preservation.
Suitable for full-text search and basic NLP.

Markdown Output

# Document Title

## Section Header

Regular text content with **formatting** preserved.

- List item 1
- List item 2

| Column 1 | Column 2 |
|----------|----------|
| Data 1   | Data 2   |

JSON Output

{
  "text": "Full document text...",
  "markdown": "# Document Title\n\n...",
  "blocks": [
    {
      "text": "Document Title",
      "bbox": [0.1, 0.1, 0.9, 0.2],
      "confidence": 0.98,
      "role": "Title"
    }
  ],
  "metadata": {
    "provider": "surya",
    "processing_time_ms": 145,
    "image_width": 1920,
    "image_height": 1080,
    "languages": ["en"],
    "page_count": 1
  }
}
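The bbox values in the sample above appear to be normalized to the [0, 1] range as [x0, y0, x1, y1]. Assuming that convention, a sketch of converting a block's box to pixel coordinates using the metadata's image dimensions:

```rust
/// Convert a normalized [x0, y0, x1, y1] bounding box to pixel
/// coordinates, assuming values are fractions of the image size.
fn bbox_to_pixels(bbox: [f64; 4], width: u32, height: u32) -> [u32; 4] {
    let (w, h) = (width as f64, height as f64);
    [
        (bbox[0] * w).round() as u32,
        (bbox[1] * h).round() as u32,
        (bbox[2] * w).round() as u32,
        (bbox[3] * h).round() as u32,
    ]
}

fn main() {
    // The "Title" block from the sample output, on a 1920x1080 image.
    let px = bbox_to_pixels([0.1, 0.1, 0.9, 0.2], 1920, 1080);
    println!("{:?}", px); // [192, 108, 1728, 216]
}
```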

Provider Setup

Tesseract Installation

Ubuntu/Debian:

sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-eng tesseract-ocr-jpn

macOS:

brew install tesseract
brew install tesseract-lang  # Additional languages

Windows: Download installer from: https://github.com/UB-Mannheim/tesseract/wiki

Verify:

tesseract --version
tesseract --list-langs

Surya Models

  1. Download models from Surya releases
  2. Place in a directory:
    models/surya/
    ├── detection.onnx
    └── recognition.onnx
    
  3. Set path in configuration:
    VisionProviderConfig::surya("/path/to/models/surya", false)
    

PaddleOCR Models

  1. Download from PaddlePaddle releases
  2. Structure:
    models/paddle/
    ├── det.onnx    # Detection model
    ├── rec.onnx    # Recognition model
    └── cls.onnx    # Classification model
    
  3. Configure:
    VisionProviderConfig::paddle("/path/to/models/paddle", true)
    

Performance Benchmarks

Tested on: AMD Ryzen 9 5950X, NVIDIA RTX 3090, 1920x1080 images

| Provider  | CPU Time | GPU Time | Memory | Accuracy* |
|-----------|----------|----------|--------|-----------|
| Mock      | <1ms     | -        | <1MB   | N/A       |
| Tesseract | 450ms    | -        | ~200MB | 85%       |
| Surya     | 320ms    | 45ms     | ~1.5GB | 92%       |
| PaddleOCR | 380ms    | 55ms     | ~1.8GB | 90%       |

*Accuracy measured on standard document dataset
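Numbers like these come from a simple averaged timing loop. A sketch of such a harness, with a stand-in workload where the `provider.process_image(...)` call would go:

```rust
use std::time::{Duration, Instant};

/// Run `workload` `iterations` times and return the mean duration.
fn bench<F: FnMut()>(iterations: u32, mut workload: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        workload();
    }
    start.elapsed() / iterations
}

fn main() {
    // Stand-in for a real OCR call; black_box keeps the work
    // from being optimized away.
    let mean = bench(10, || {
        std::hint::black_box((0..10_000u64).sum::<u64>());
    });
    println!("mean: {:?}", mean);
}
```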

Optimization Tips

  1. Enable Caching: 10-1000x speedup for repeated images
  2. Use GPU: 5-10x speedup for ONNX providers
  3. Batch Processing: Process multiple images concurrently
  4. Image Preprocessing: Resize large images before processing
  5. Choose Right Provider: Match provider capabilities to use case
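Tip 3 (batch processing): the crate's API is async, but the fan-out pattern can be sketched with scoped std threads, using a stand-in for the per-image OCR call:

```rust
use std::thread;

/// Process several images concurrently, one thread per image.
/// `ocr` is a placeholder for a real provider call.
fn process_batch(images: Vec<Vec<u8>>) -> Vec<String> {
    fn ocr(image: &[u8]) -> String {
        format!("{} bytes processed", image.len())
    }

    thread::scope(|s| {
        // Spawn one scoped thread per image...
        let handles: Vec<_> = images
            .iter()
            .map(|img| s.spawn(move || ocr(img)))
            .collect();
        // ...then join in order, preserving input order in the output.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let results = process_batch(vec![vec![0u8; 100], vec![0u8; 200]]);
    println!("{:?}", results);
}
```

With the async API, the same fan-out would typically use concurrent futures instead of threads.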

Error Handling

use oxify_connect_vision::{VisionError, create_provider, VisionProviderConfig};

async fn safe_ocr(image: &[u8]) -> Result<String, String> {
    let config = VisionProviderConfig::tesseract(None);
    let provider = create_provider(&config)
        .map_err(|e| format!("Provider creation failed: {}", e))?;

    provider.load_model().await
        .map_err(|e| format!("Model loading failed: {}", e))?;

    match provider.process_image(image).await {
        Ok(result) => Ok(result.text),
        Err(VisionError::InvalidImage(msg)) => {
            Err(format!("Invalid image: {}", msg))
        }
        Err(VisionError::ProcessingFailed(msg)) => {
            Err(format!("Processing failed: {}", msg))
        }
        Err(e) => Err(format!("Unknown error: {}", e))
    }
}

Testing

Run tests:

# Unit tests (mock provider)
cargo test

# Integration tests (requires setup)
cargo test --features tesseract -- --ignored

# All tests, all features
cargo test --all-features

Example test:

#[tokio::test]
async fn test_mock_ocr() {
    let config = VisionProviderConfig::mock();
    let provider = create_provider(&config).unwrap();
    provider.load_model().await.unwrap();

    let result = provider.process_image(b"test").await.unwrap();
    assert!(!result.text.is_empty());
    assert_eq!(result.metadata.provider, "mock");
}

Troubleshooting

"Model not loaded" Error

// Always call load_model() before processing
provider.load_model().await?;

Poor OCR Quality

  • Check image quality (DPI, contrast, noise)
  • Try different provider (Surya for complex layouts)
  • Specify correct language
  • Preprocess image (denoise, deskew)
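Preprocessing (last bullet) can be as simple as thresholding. A sketch of fixed-threshold binarization on a raw grayscale buffer; a real pipeline would use an adaptive method such as Otsu and an imaging crate for decoding:

```rust
/// Binarize a grayscale buffer in place: pixels at or above
/// `threshold` become white (255), the rest black (0).
fn binarize(pixels: &mut [u8], threshold: u8) {
    for p in pixels.iter_mut() {
        *p = if *p >= threshold { 255 } else { 0 };
    }
}

fn main() {
    let mut pixels = vec![10, 120, 130, 250];
    binarize(&mut pixels, 128);
    println!("{:?}", pixels); // [0, 0, 255, 255]
}
```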

ONNX Runtime Errors

  • Verify model files are compatible ONNX format
  • Check ONNX Runtime version: cargo tree | grep ort
  • Ensure GPU drivers are installed (for CUDA/CoreML)

Memory Issues

  • Reduce cache size: cache.set_max_entries(100)
  • Process images in batches
  • Resize large images before processing
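Resizing before OCR (last bullet) can be sketched with nearest-neighbor downsampling on a row-major grayscale buffer; production code would use an imaging crate with a better resampling filter:

```rust
/// Nearest-neighbor downscale of a grayscale image stored row-major.
fn downscale(src: &[u8], sw: usize, sh: usize, dw: usize, dh: usize) -> Vec<u8> {
    let mut dst = Vec::with_capacity(dw * dh);
    for y in 0..dh {
        for x in 0..dw {
            // Map each destination pixel to its nearest source pixel.
            let sx = x * sw / dw;
            let sy = y * sh / dh;
            dst.push(src[sy * sw + sx]);
        }
    }
    dst
}

fn main() {
    // A 4x4 gradient reduced to 2x2.
    let src: Vec<u8> = (0..16).collect();
    let out = downscale(&src, 4, 4, 2, 2);
    println!("{:?}", out); // [0, 2, 8, 10]
}
```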

Contributing

We welcome contributions! Areas of interest:

  • Additional provider integrations
  • Performance optimizations
  • Language-specific improvements
  • Documentation and examples
  • Bug fixes and tests

See TODO.md for planned enhancements.

License

Apache-2.0 - See LICENSE file in the root directory.

Built with ❤️ for the Oxify workflow automation platform