ruvector-scipix 2.0.4

Rust OCR engine for scientific documents - extract LaTeX, MathML from math equations, research papers, and technical diagrams with ONNX GPU acceleration
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
# Performance Optimizations Guide

This document describes the performance optimizations available in ruvector-scipix and how to use them effectively.

## Overview

The optimization module provides multiple strategies to improve performance:

1. **SIMD Operations**: Vectorized image processing (AVX2, AVX-512, NEON)
2. **Parallel Processing**: Multi-threaded execution using Rayon
3. **Memory Optimizations**: Object pooling, memory mapping, zero-copy views
4. **Model Quantization**: INT8 quantization for reduced memory and faster inference
5. **Dynamic Batching**: Intelligent batching for throughput optimization

## Feature Detection

The library automatically detects CPU capabilities at runtime:

```rust
use ruvector_scipix::optimize::{detect_features, get_features};

// Detect CPU features
let features = detect_features();
println!("AVX2: {}", features.avx2);
println!("AVX-512: {}", features.avx512f);
println!("NEON: {}", features.neon);
println!("SSE4.2: {}", features.sse4_2);
```

## SIMD Operations

### Grayscale Conversion

Convert RGBA images to grayscale using SIMD:

```rust
use ruvector_scipix::optimize::simd;

let rgba: Vec<u8> = /* your RGBA data */;
let mut gray = vec![0u8; rgba.len() / 4];

// Automatically uses best SIMD implementation available
simd::simd_grayscale(&rgba, &mut gray);
```

**Performance**: Up to 4x faster than scalar implementation on AVX2 systems.

### Threshold Operation

Fast binary thresholding:

```rust
simd::simd_threshold(&gray, 128, &mut binary);
```

**Performance**: Up to 8x faster on AVX2 systems.

### Normalization

Fast tensor normalization for model inputs:

```rust
let mut tensor_data: Vec<f32> = /* your data */;
simd::simd_normalize(&mut tensor_data);
```

**Performance**: Up to 3x faster on AVX2 systems.

## Parallel Processing

### Parallel Image Preprocessing

Process multiple images in parallel:

```rust
use ruvector_scipix::optimize::parallel;
use image::DynamicImage;

let images: Vec<DynamicImage> = /* your images */;

let processed = parallel::parallel_preprocess(images, |img| {
    // Your preprocessing function
    preprocess_image(img)
});
```

### Pipeline Execution

Create processing pipelines with parallel stages:

```rust
use ruvector_scipix::optimize::parallel::Pipeline3;

let pipeline = Pipeline3::new(
    |img| preprocess(img),
    |img| detect_regions(img),
    |regions| recognize_text(regions),
);

let results = pipeline.execute_batch(images);
```

### Async Parallel Execution

Execute async operations with concurrency limits:

```rust
use ruvector_scipix::optimize::parallel::AsyncParallelExecutor;

let executor = AsyncParallelExecutor::new(4); // Max 4 concurrent

let results = executor.execute(tasks, |task| async move {
    process_async(task).await
}).await;
```

## Memory Optimizations

### Buffer Pooling

Reuse buffers to reduce allocations:

```rust
use ruvector_scipix::optimize::memory::{BufferPool, GlobalPools};

// Use global pools
let pools = GlobalPools::get();
let mut buffer = pools.acquire_large(); // 1MB buffer
buffer.extend_from_slice(&data);
// Buffer automatically returns to pool when dropped

// Or create custom pool
let pool = BufferPool::new(
    || Vec::with_capacity(1024),
    initial_size: 10,
    max_size: 100
);
```

**Benefits**: Reduces allocation overhead, improves cache locality.

### Memory-Mapped Models

Load large models without copying to memory:

```rust
use ruvector_scipix::optimize::memory::MmapModel;

let model = MmapModel::from_file("model.bin")?;
let data = model.as_slice(); // Zero-copy access
```

**Benefits**: Faster loading, lower memory usage, shared across processes.

### Zero-Copy Image Views

Work with image data without copying:

```rust
use ruvector_scipix::optimize::memory::ImageView;

let view = ImageView::new(&data, width, height, channels)?;
let pixel = view.pixel(x, y);

// Create subview without copying
let roi = view.subview(x, y, width, height)?;
```

### Arena Allocation

Fast temporary allocations:

```rust
use ruvector_scipix::optimize::memory::Arena;

let mut arena = Arena::with_capacity(1024 * 1024);

for _ in 0..iterations {
    let buffer = arena.alloc(size, alignment);
    // Use buffer...
    arena.reset(); // Reuse capacity
}
```

## Model Quantization

### Basic Quantization

Quantize f32 weights to INT8:

```rust
use ruvector_scipix::optimize::quantize;

let weights: Vec<f32> = /* your model weights */;
let (quantized, params) = quantize::quantize_weights(&weights);

// Later, dequantize for inference
let restored = quantize::dequantize(&quantized, params);
```

**Benefits**: 4x memory reduction, faster inference on some hardware.

### Quantized Tensors

Work with quantized tensor representations:

```rust
use ruvector_scipix::optimize::quantize::QuantizedTensor;

let tensor = QuantizedTensor::from_f32(&data, vec![batch, channels, height, width]);
println!("Compression ratio: {:.2}x", tensor.compression_ratio());

// Dequantize when needed
let f32_data = tensor.to_f32();
```

### Per-Channel Quantization

Better accuracy for convolutional/linear layers:

```rust
use ruvector_scipix::optimize::quantize::PerChannelQuant;

// For weight tensor [out_channels, in_channels, ...]
let quant = PerChannelQuant::from_f32(&weights, shape);

// Each output channel has its own scale/zero-point
```

### Quality Metrics

Measure quantization quality:

```rust
use ruvector_scipix::optimize::quantize::{quantization_error, sqnr};

let (quantized, params) = quantize::quantize_weights(&original);

let mse = quantization_error(&original, &quantized, params);
let signal_noise_ratio = sqnr(&original, &quantized, params);

println!("MSE: {:.6}, SQNR: {:.2} dB", mse, signal_noise_ratio);
```

## Dynamic Batching

### Basic Batching

Automatically batch requests for better throughput:

```rust
use ruvector_scipix::optimize::batch::{DynamicBatcher, BatchConfig};

let config = BatchConfig {
    max_batch_size: 32,
    max_wait_ms: 50,
    max_queue_size: 1000,
    preferred_batch_size: 16,
};

let batcher = Arc::new(DynamicBatcher::new(config, |items: Vec<Image>| {
    process_batch(items) // Your batch processing logic
}));

// Start processing loop
tokio::spawn({
    let batcher = batcher.clone();
    async move { batcher.run().await }
});

// Add items
let result = batcher.add(image).await?;
```

### Adaptive Batching

Automatically adjust batch size based on latency:

```rust
use ruvector_scipix::optimize::batch::AdaptiveBatcher;
use std::time::Duration;

let batcher = Arc::new(AdaptiveBatcher::new(
    config,
    Duration::from_millis(100), // Target latency
    processor,
));

// Batch size adapts to maintain target latency
```

## Optimization Levels

Control which optimizations are enabled:

```rust
use ruvector_scipix::optimize::{OptLevel, set_opt_level};

// Set optimization level at startup
set_opt_level(OptLevel::Full); // All optimizations

// Available levels:
// - OptLevel::None:     No optimizations
// - OptLevel::Simd:     SIMD only
// - OptLevel::Parallel: SIMD + parallel
// - OptLevel::Full:     All optimizations (default)
```

## Benchmarking

Run benchmarks to compare optimized vs non-optimized implementations:

```bash
# Run all optimization benchmarks
cargo bench --bench optimization_bench

# Run specific benchmark group
cargo bench --bench optimization_bench -- grayscale

# Generate detailed reports
cargo bench --bench optimization_bench -- --verbose
```

### Expected Performance Improvements

Based on benchmarks on modern x86_64 systems with AVX2:

| Operation | Speedup | Notes |
|-----------|---------|-------|
| Grayscale conversion | 3-4x | AVX2 vs scalar |
| Threshold | 6-8x | AVX2 vs scalar |
| Normalization | 2-3x | AVX2 vs scalar |
| Parallel preprocessing (8 cores) | 6-7x | vs sequential |
| Buffer pooling | 2-3x | vs direct allocation |
| Quantization | 4x memory | INT8 vs FP32 |

## Best Practices

1. **Enable optimizations by default**: Use the `optimize` feature in production
2. **Profile first**: Use benchmarks to identify bottlenecks
3. **Use appropriate batch sizes**: Larger batches = better throughput, higher latency
4. **Pool buffers for hot paths**: Reduces allocation overhead significantly
5. **Quantize models**: 4x memory reduction with minimal accuracy loss
6. **Match parallelism to workload**: Use thread count ≤ CPU cores

## Platform-Specific Notes

### x86_64

- **AVX2**: Widely available on modern CPUs (2013+)
- **AVX-512**: Available on newer server CPUs, provides marginal improvements
- Best performance on CPUs with good SIMD execution units

### ARM (AArch64)

- **NEON**: Available on all ARMv8+ CPUs
- Good SIMD performance, especially on Apple Silicon
- Some operations may be faster with scalar code due to different execution units

### WebAssembly

- SIMD support is limited and experimental
- Optimizations gracefully degrade to scalar implementations
- Focus on algorithmic optimizations and caching

## Troubleshooting

### Low SIMD Performance

If SIMD optimizations are not providing expected speedup:

1. Check CPU features: `cargo run -- detect-features`
2. Ensure data is properly aligned (16-byte alignment for SIMD)
3. Profile to ensure SIMD code paths are being used
4. Try different optimization levels

### High Memory Usage

If memory usage is too high:

1. Enable buffer pooling for frequently allocated buffers
2. Use memory-mapped models instead of loading into RAM
3. Enable model quantization
4. Reduce batch sizes

### Thread Contention

If parallel performance is poor:

1. Reduce thread count: `set_thread_count(cores - 1)`
2. Use chunked parallel processing for better load balancing
3. Avoid fine-grained parallelism (prefer coarser chunks)
4. Profile mutex/lock contention

## Integration Example

Complete example using multiple optimizations:

```rust
use ruvector_scipix::optimize::*;
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<()> {
    // Set optimization level
    set_opt_level(OptLevel::Full);

    // Detect features
    let features = detect_features();
    println!("Features: {:?}", features);

    // Create buffer pools
    let pools = memory::GlobalPools::get();

    // Create adaptive batcher
    let batcher = Arc::new(batch::AdaptiveBatcher::new(
        batch::BatchConfig::default(),
        Duration::from_millis(100),
        |images| process_images(images),
    ));

    // Start batcher
    let batcher_clone = batcher.clone();
    tokio::spawn(async move { batcher_clone.run().await });

    // Process images
    let result = batcher.add(image).await?;

    Ok(())
}

fn process_images(images: Vec<Image>) -> Vec<Result<Output, String>> {
    // Use parallel processing
    parallel::parallel_map_chunked(images, 8, |img| {
        // Get pooled buffer
        let mut buffer = memory::GlobalPools::get().acquire_large();

        // Use SIMD operations
        let mut gray = vec![0u8; img.width() * img.height()];
        simd::simd_grayscale(img.as_rgba8(), &mut gray);

        // Process...
        Ok(output)
    })
}
```

## Future Optimizations

Planned improvements:

- GPU acceleration using wgpu
- Custom ONNX runtime integration
- Advanced quantization (INT4, mixed precision)
- Streaming processing for video
- Distributed inference

## References

- [SIMD in Rust]https://doc.rust-lang.org/std/arch/
- [Rayon Parallel Processing]https://docs.rs/rayon/
- [Quantization Techniques]https://arxiv.org/abs/2103.13630
- Benchmark results: See `benches/optimization_bench.rs`