omniparse 0.4.0

A Rust toolkit for detecting and extracting metadata, text, and content from various file formats
# Omniparse OCR Guide

Single canonical reference for the OCR subsystem. The README points here.

## 30-second start

ML backend (recommended for photos, screenshots, unknown typography):

```sh
cargo install omniparse --features ocr-ml
omniparse models download                 # one-time, ~12 MB to cache
OMNIPARSE_OCR=ml omniparse photo.jpg      # text appears in stdout
```

Classical backend (pure-Rust, no downloads, fast — but only good on clean
printed scans with a matched font):

```sh
cargo install omniparse --features ocr
OMNIPARSE_OCR=classical omniparse scan.png
```

That's it. Skip to [PDF OCR](#pdf-ocr) for scanned-PDF specifics, or read on
for the backend-chooser table and tuning knobs.

## Contents

1. [The OMNIPARSE_OCR env var](#the-omniparse_ocr-env-var)
2. [When to enable OCR](#when-to-enable-ocr)
3. [Classical vs ML backend](#classical-vs-ml-backend)
4. [Managing the model cache](#managing-the-model-cache)
5. [Quickstart: ML backend](#quickstart-ml-backend)
6. [Quickstart: classical backend](#quickstart-classical-backend)
7. [Training custom prototypes](#training-custom-prototypes)
8. [Validating the pipeline](#validating-the-pipeline)
9. [Tuning the classical pipeline](#tuning-the-classical-pipeline)
10. [Debugging OCR output](#debugging-ocr-output)
11. [PDF OCR](#pdf-ocr)
12. [Library API](#library-api)
13. [FAQ](#faq)

## The OMNIPARSE_OCR env var

A single switch controls which backend runs:

| Value           | Effect                                                  |
| --------------- | ------------------------------------------------------- |
| unset / `off`   | OCR disabled (image parsers extract EXIF only)          |
| `classical`     | Pure-Rust classical pipeline (requires `ocr` feature)   |
| `ml`            | ocrs + rten ML backend (requires `ocr-ml` feature)      |

> The previous two-variable scheme (`OMNIPARSE_OCR=1` plus
> `OMNIPARSE_OCR_ML=1`) still works but prints a one-shot deprecation
> warning. It will be removed in 0.5.
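
To make the table concrete, here is a minimal, std-only sketch of how the switch could be resolved. `Backend` and `resolve_backend` are hypothetical names, not part of the omniparse API. The mapping of the legacy scheme (`OMNIPARSE_OCR=1` alone means classical, adding `OMNIPARSE_OCR_ML=1` means ML) and the handling of unknown values are assumptions.

```rust
/// Which OCR backend a run should use (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Off,
    Classical,
    Ml,
}

/// Resolve the backend from the raw values of `OMNIPARSE_OCR` and the
/// deprecated `OMNIPARSE_OCR_ML`. The returned flag is true when the
/// deprecated two-variable scheme was used, so the caller can warn once.
fn resolve_backend(ocr: Option<&str>, ocr_ml: Option<&str>) -> (Backend, bool) {
    match ocr.map(str::trim) {
        None | Some("") | Some("off") => (Backend::Off, false),
        Some("classical") => (Backend::Classical, false),
        Some("ml") => (Backend::Ml, false),
        // Assumption: legacy `=1` selects classical unless the ML flag is also set.
        Some("1") => {
            let backend = if ocr_ml.map(str::trim) == Some("1") {
                Backend::Ml
            } else {
                Backend::Classical
            };
            (backend, true)
        }
        // Assumption: unknown values leave OCR disabled in this sketch.
        Some(_) => (Backend::Off, false),
    }
}
```

A caller would pass `std::env::var("OMNIPARSE_OCR").ok().as_deref()` and the equivalent for `OMNIPARSE_OCR_ML`.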

## When to enable OCR

OCR is **off by default**. Enable it when:

- You're extracting from images that contain text but have no embedded
  text layer (photographs, screenshots, scanned PDFs).
- You're extracting from image/PDF inputs where text-layer extraction
  returns empty.

Skip it when:

- Your inputs are digital PDFs with intact text layers (`lopdf` already
  handles those — OCR adds no value and costs time).
- Your inputs are text-based formats (HTML, Markdown, Office documents).

## Classical vs ML backend

| Aspect               | Classical (`ocr`)               | ML (`ocr-ml`)                       |
| -------------------- | ------------------------------- | ----------------------------------- |
| Accuracy (clean scan)| Good                            | Very good                           |
| Accuracy (photo text)| Poor                            | Very good                           |
| Handwriting          | Useless                         | Limited                             |
| Non-Latin scripts    | Needs custom prototypes         | English model only out-of-box       |
| First-run cost       | Zero (algorithms only)          | ~12 MB model download               |
| Per-image cost       | Milliseconds                    | Seconds                             |
| Build size           | +imageproc / symspell           | +rten + rten-imageproc              |
| Offline-friendly     | Yes (pure algorithms)           | Only after first download           |
| Customization        | Extensive (train your own font) | Fixed models                        |

**Pick the classical backend when** you need full control, have
matched-font prototypes, are running on minimal infrastructure, or can't
accept a network download.

**Pick the ML backend when** your inputs are photographs, screenshots of
real applications, or anything with unknown typography — and you can accept
a one-time model download.

## Managing the model cache

The CLI ships dedicated subcommands for the ML model cache. They're useful
for pre-fetching in CI, baking models into a container image, debugging
"why isn't OCR working", and air-gapped installs.

```sh
omniparse models path        # print cache directory
omniparse models download    # fetch any missing models (~12 MB)
omniparse models list        # show name / size / sha256 / status per model
omniparse models verify      # re-hash cached files; exit 0 only if all match
```

All four commands require the `ocr-ml` feature; without it they exit with
a clear error message.

Override the cache directory with the `OMNIPARSE_OCR_MODELS` env var:

```sh
OMNIPARSE_OCR_MODELS=/opt/omniparse/models omniparse models download
```

Useful for shared installs or air-gapped environments where you want to
pre-populate the cache manually, or for Docker images that bake models
into a known path (see the project `Dockerfile`).
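
The override rule can be sketched in a few lines of std-only Rust. `resolve_model_dir` is a hypothetical helper, not the crate's `model_dir` function. The default layout (`<platform cache>/omniparse/ocrs-models`) is inferred from the macOS path quoted in this guide, and treating an empty override as unset is an assumption.

```rust
use std::path::PathBuf;

/// Pick the model cache directory: an explicit override wins, otherwise
/// fall back to `<platform cache>/omniparse/ocrs-models`.
fn resolve_model_dir(override_dir: Option<&str>, platform_cache: &str) -> PathBuf {
    match override_dir.map(str::trim) {
        // A non-empty OMNIPARSE_OCR_MODELS value is used verbatim.
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        // Assumption: an unset or empty override means "use the default".
        _ => PathBuf::from(platform_cache)
            .join("omniparse")
            .join("ocrs-models"),
    }
}
```

In a real program the first argument would come from `std::env::var("OMNIPARSE_OCR_MODELS").ok().as_deref()`.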

Programmatic equivalents live on [`omniparse::ocr::ml`]:
[`prefetch_all`][prefetch_all], [`verify_all`][verify_all],
[`list_models`][list_models], [`model_dir`][model_dir].

[prefetch_all]: https://docs.rs/omniparse/latest/omniparse/ocr/ml/fn.prefetch_all.html
[verify_all]: https://docs.rs/omniparse/latest/omniparse/ocr/ml/fn.verify_all.html
[list_models]: https://docs.rs/omniparse/latest/omniparse/ocr/ml/fn.list_models.html
[model_dir]: https://docs.rs/omniparse/latest/omniparse/ocr/ml/fn.model_dir.html

## Quickstart: ML backend

### Install

```toml
[dependencies]
omniparse = { version = "0.4", features = ["ocr-ml"] }
```

### Command line

```sh
# One-time setup
omniparse models download

# Every recognition run
OMNIPARSE_OCR=ml omniparse photo.jpg
```

Models live at `~/Library/Caches/omniparse/ocrs-models/` (macOS) or the
platform-appropriate cache dir. Override with
`OMNIPARSE_OCR_MODELS=<path>`.

### Library

```rust
use omniparse::ocr::ml::MlOcrEngine;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = MlOcrEngine::new()?;          // downloads + sha256-verifies on first call
    let image = image::open("photo.jpg")?;     // decode with the `image` crate
    let output = engine.recognize(image)?;
    println!("{}", output.text);
    Ok(())
}
```

## Quickstart: classical backend

### Install

```toml
[dependencies]
omniparse = { version = "0.4", features = ["ocr", "ocr-train"] }
```

(`ocr-train` is only needed for prototype generation; drop it for
runtime-only deployments that ship their own prototype JSON.)

### Command line with bundled prototypes

```sh
OMNIPARSE_OCR=classical omniparse image.png
```

The bundled prototypes are hand-authored 7×9 bitmap glyphs for uppercase
A–Z and digits 0–9. They're a smoke-test set — good enough to verify
the pipeline runs, **not good enough for real-world OCR**. Train custom
prototypes before production use.

### Command line with trained prototypes

```sh
# One-time training
cargo run --features ocr-train --example train_prototypes -- \
    /System/Library/Fonts/Supplemental/Arial.ttf ./arial.json 24,48,96

# Every recognition run
OMNIPARSE_OCR=classical OMNIPARSE_OCR_PROTOTYPES=./arial.json \
    omniparse image.png
```

## Training custom prototypes

### Single font, multiple sizes

```sh
cargo run --features ocr-train --example train_prototypes -- \
    Arial.ttf out.json 24,48,96,128
```

Produces one prototype per character per pixel size. Include sizes that
match the heights of glyphs in your real images. Sizes far from those
heights are harmless but waste prototype entries.

### Multiple fonts

```sh
cargo run --features ocr-train --example train_prototypes -- \
    Arial.ttf:ArialBold.ttf:Helvetica.ttf:Verdana.ttf \
    out.json 24,48,96
```

Colon-separated font paths. Useful when the target input may contain any
of several similar typefaces.

### Custom character set

```sh
cargo run --features ocr-train --example train_prototypes -- \
    Arial.ttf out.json 48 "0123456789.,:"
```

Fourth argument is the literal string of characters to train. Defaults to
uppercase + lowercase + digits + common punctuation.
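
The argument conventions above (colon-separated fonts, comma-separated sizes, optional literal character set) can be sketched as follows. This is an illustration, not the source of the `train_prototypes` example: `TrainArgs`, `parse_train_args`, and `prototype_count` are hypothetical names. Multiplying the count by the number of fonts is an assumption that extends the per-character, per-size rule to multi-font runs.

```rust
/// Parsed positional arguments (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
struct TrainArgs {
    fonts: Vec<String>,
    output: String,
    sizes: Vec<u32>,
    charset: Option<String>,
}

/// Parse `<fonts> <out.json> <sizes> [charset]`.
fn parse_train_args(args: &[&str]) -> Result<TrainArgs, String> {
    if args.len() < 3 || args.len() > 4 {
        return Err("usage: <fonts> <out.json> <sizes> [charset]".to_string());
    }
    // Font paths are separated by ':'.
    let fonts: Vec<String> = args[0]
        .split(':')
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect();
    if fonts.is_empty() {
        return Err("no font paths given".to_string());
    }
    // Pixel sizes are separated by ','.
    let sizes = args[2]
        .split(',')
        .map(|s| s.trim().parse::<u32>().map_err(|e| format!("bad size {:?}: {}", s, e)))
        .collect::<Result<Vec<u32>, String>>()?;
    Ok(TrainArgs {
        fonts,
        output: args[1].to_string(),
        sizes,
        charset: args.get(3).map(|s| (*s).to_owned()),
    })
}

/// Expected number of prototypes: fonts x sizes x characters (assumption).
fn prototype_count(args: &TrainArgs, default_charset: &str) -> usize {
    let chars = args.charset.as_deref().unwrap_or(default_charset).chars().count();
    args.fonts.len() * args.sizes.len() * chars
}
```

Estimating the count before training is a cheap way to decide whether the deduplication step described next is worth running.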

### Deduplicating a large prototype set

Multi-font × multi-scale sets can balloon into thousands of entries.
Reduce them via k-medoids per label:

```rust
use omniparse::ocr::prototypes::{load_prototypes_json, save_prototypes_json,
                                  dedupe_prototypes};
let protos = load_prototypes_json("multifont.json")?;
let reduced = dedupe_prototypes(protos, 4); // at most 4 per label
save_prototypes_json(&reduced, "multifont_small.json")?;
```

## Validating the pipeline

Before debugging a failing OCR run on a real image, confirm the pipeline
works end-to-end on a controlled input.

```sh
cargo run --features ocr-train --example ocr_validate -- \
    Arial.ttf "HELLO WORLD" 48
```

Expected output:

```
validator: font=Arial.ttf text="HELLO WORLD" px_size=48
trained 90 prototypes
recognized: "HELLO WORLD"
mean_confidence: 0.56
accuracy: 10/10 (100.0%)
```

100% accuracy confirms the pipeline is correct. Poor accuracy on real
images after this check is a font/input mismatch, not a code bug.
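
For reference, an accuracy figure such as `10/10` can be reproduced with a position-wise comparison that ignores whitespace. This is a sketch under that assumption; the guide does not specify how `ocr_validate` scores its output, and `char_accuracy` is a hypothetical helper.

```rust
/// Count characters of `expected` (whitespace ignored) that are matched
/// at the same position in `recognized`. Returns (correct, total).
fn char_accuracy(expected: &str, recognized: &str) -> (usize, usize) {
    let exp: Vec<char> = expected.chars().filter(|c| !c.is_whitespace()).collect();
    let rec: Vec<char> = recognized.chars().filter(|c| !c.is_whitespace()).collect();
    // Position-wise match; a real validator might use edit distance instead.
    let correct = exp.iter().zip(rec.iter()).filter(|(a, b)| a == b).count();
    (correct, exp.len())
}
```

With this definition `"HELLO WORLD"` has ten scored characters, which matches the denominator in the sample output.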

## Tuning the classical pipeline

Every stage has sensible defaults; override via env vars (CLI-friendly)
or the `OcrConfig` struct + `OcrEngineBuilder` (library-friendly).

### When the image has non-uniform lighting

```sh
export OMNIPARSE_OCR_BINARIZE=sauvola
export OMNIPARSE_OCR_CLAHE=1
```

Sauvola thresholds each pixel against its local neighborhood. CLAHE
normalizes contrast in overlapping tiles before binarization.
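
As an illustration of local thresholding, the sketch below applies the standard Sauvola formula, `T = mean * (1 + k * (stddev / r - 1))`, over a square neighborhood. It is a naive std-only version, not omniparse's implementation. `k` and `r` correspond to the fields of `BinarizeMode::Sauvola` shown under Library API; the `radius` parameter, the row-major layout, and dark-text-on-light polarity are assumptions.

```rust
/// Sauvola threshold for one pixel, from the grey values in its window.
fn sauvola_threshold(window: &[u8], k: f64, r: f64) -> f64 {
    let n = window.len() as f64;
    let mean = window.iter().map(|&v| v as f64).sum::<f64>() / n;
    let var = window.iter().map(|&v| (v as f64 - mean).powi(2)).sum::<f64>() / n;
    mean * (1.0 + k * (var.sqrt() / r - 1.0))
}

/// Binarize a row-major greyscale image. `true` marks foreground (ink),
/// i.e. pixels darker than their local threshold.
fn sauvola_binarize(
    pixels: &[u8],
    width: usize,
    height: usize,
    radius: usize,
    k: f64,
    r: f64,
) -> Vec<bool> {
    let mut out = vec![false; pixels.len()];
    for y in 0..height {
        for x in 0..width {
            // Clamp the window to the image borders.
            let (x0, x1) = (x.saturating_sub(radius), (x + radius).min(width - 1));
            let (y0, y1) = (y.saturating_sub(radius), (y + radius).min(height - 1));
            let mut window = Vec::new();
            for wy in y0..=y1 {
                window.extend_from_slice(&pixels[wy * width + x0..=wy * width + x1]);
            }
            let t = sauvola_threshold(&window, k, r);
            out[y * width + x] = (pixels[y * width + x] as f64) < t;
        }
    }
    out
}
```

Because the threshold follows the local mean, a dark glyph stays foreground whether it sits in a bright or a shadowed part of the page. A production version would use integral images instead of rebuilding each window.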

### When the image is a photograph with text overlay

```sh
export OMNIPARSE_OCR_LAYOUT=mser
export OMNIPARSE_OCR_SW_CV_MAX=0.5
export OMNIPARSE_OCR_NEIGHBOR_MIN=2
export OMNIPARSE_OCR_LINE_FILTER=1
```

MSER detects blob-like structures better than connected components on
photographic input. Stroke-width constancy rejects photo edges. Neighbor
density rejects isolated components. Line filter rejects clusters with
mismatched heights.
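
The stroke-width constancy test can be sketched as a coefficient-of-variation check. Text strokes have near-constant width, so their CV (standard deviation divided by mean) is low, while photo edges vary widely. The function names are hypothetical, and how omniparse measures per-component stroke widths is not covered here; the sketch assumes the widths are already sampled.

```rust
/// Coefficient of variation (stddev / mean) of sampled stroke widths.
/// Returns `None` when there are no samples or the mean is not positive.
fn stroke_width_cv(widths: &[f64]) -> Option<f64> {
    if widths.is_empty() {
        return None;
    }
    let n = widths.len() as f64;
    let mean = widths.iter().sum::<f64>() / n;
    if mean <= 0.0 {
        return None;
    }
    let var = widths.iter().map(|w| (w - mean).powi(2)).sum::<f64>() / n;
    Some(var.sqrt() / mean)
}

/// Keep a component only if its stroke widths are consistent enough.
/// `cv_max` plays the role of `OMNIPARSE_OCR_SW_CV_MAX`.
fn passes_stroke_filter(widths: &[f64], cv_max: f64) -> bool {
    matches!(stroke_width_cv(widths), Some(cv) if cv <= cv_max)
}
```

With `cv_max = 0.5`, widths like `[3, 3, 4, 3]` pass (CV about 0.13) while `[1, 9, 2, 12]` are rejected (CV about 0.77).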

### When the image might be rotated

```sh
export OMNIPARSE_OCR_AUTO_ROTATE=1
```

Runs the pipeline four times (original + 90° + 180° + 270°) and keeps
the orientation with the highest `text_length × mean_confidence` score.
4× runtime cost.
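
The selection rule is simple enough to sketch directly. `Candidate` and `best_orientation` are hypothetical names, and counting text length in Unicode scalar values rather than bytes is an assumption.

```rust
/// Result of one pipeline run at a given rotation (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    degrees: u32,
    text: String,
    mean_confidence: f64,
}

/// Orientation score: text length times mean confidence.
fn score(c: &Candidate) -> f64 {
    c.text.chars().count() as f64 * c.mean_confidence
}

/// Pick the candidate with the highest score, if any.
fn best_orientation(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .max_by(|a, b| score(a).total_cmp(&score(b)))
}
```

Multiplying by length keeps a short, confidently misread fragment from beating a long, reasonably confident line.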

### When the recognizer's top label is sometimes wrong

```sh
export OMNIPARSE_OCR_K=5
export OMNIPARSE_OCR_BIGRAM=1
```

k-NN voting uses the 5 nearest prototypes with inverse-distance weights.
Bigram re-ranking picks the character per position that maximizes English
character-bigram probability given the preceding character.
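
Inverse-distance voting can be sketched as below. This is a generic k-NN vote, not omniparse's recognizer: `knn_vote` is a hypothetical function, it takes precomputed (label, distance) pairs, and the small epsilon guarding against division by zero is an assumption.

```rust
use std::collections::HashMap;

/// Vote among the `k` nearest prototypes, weighting each by 1 / distance.
/// `neighbors` holds (label, distance) for every prototype.
fn knn_vote(neighbors: &[(char, f64)], k: usize) -> Option<char> {
    let mut sorted: Vec<(char, f64)> = neighbors.to_vec();
    sorted.sort_by(|a, b| a.1.total_cmp(&b.1));

    let mut weights: HashMap<char, f64> = HashMap::new();
    for &(label, dist) in sorted.iter().take(k) {
        // Epsilon avoids division by zero on an exact prototype match.
        *weights.entry(label).or_insert(0.0) += 1.0 / (dist + 1e-9);
    }
    weights
        .into_iter()
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(label, _)| label)
}
```

With `k = 1` the single nearest prototype always wins. With `k = 5`, two slightly farther prototypes that agree on a label can outvote it, which is what corrects occasional top-1 errors.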

### When you want dictionary-constrained output

```sh
export OMNIPARSE_OCR_BEAM=1
export OMNIPARSE_OCR_BEAM_WIDTH=12
```

Word-level beam search. Picks the line-global string that jointly
maximizes recognition confidence + bigram fluency + dictionary membership.
Trade-off: will force-fit unknown words to the closest dictionary entry.

### When inputs are mixed-scale

```sh
export OMNIPARSE_OCR_NORMALIZE_HEIGHT=32
```

Resizes each detected region to 32px tall (aspect-preserving) before
feature extraction. Compensates for residual scale sensitivity in the
55-dim feature vector.
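
The aspect-preserving resize reduces to one rounded division. `normalized_size` is a hypothetical helper. Rounding to the nearest pixel and clamping the width to at least 1 are assumptions.

```rust
/// Target dimensions for resizing a `width` x `height` region to
/// `target_height` while preserving the aspect ratio.
fn normalized_size(width: u32, height: u32, target_height: u32) -> Option<(u32, u32)> {
    if width == 0 || height == 0 || target_height == 0 {
        return None;
    }
    // Rounded integer division: width * target_height / height.
    let scaled = (width as u64 * target_height as u64 + height as u64 / 2) / height as u64;
    Some((scaled.max(1) as u32, target_height))
}
```

A 40x64 crop and a 10x16 crop of the same glyph both map to 20x32, so the feature extractor sees them at the same scale.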

### When you have thousands of prototypes

```sh
export OMNIPARSE_OCR_KDTREE=1
```

Swaps the linear NN scan for a k-d tree. Recommended above ~500
prototypes.

### When input polarity is unknown

```sh
export OMNIPARSE_OCR_POLARITY=1
```

Extracts features from both the crop and its inverse; keeps whichever
produced the smaller nearest-neighbor distance.
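
A minimal sketch of the both-polarities check, assuming 8-bit greyscale crops. The nearest-neighbor distance is passed in as a closure so the example stays self-contained. `invert` and `best_polarity` are hypothetical names.

```rust
/// Invert an 8-bit greyscale crop (dark-on-light <-> light-on-dark).
fn invert(crop: &[u8]) -> Vec<u8> {
    crop.iter().map(|&v| 255 - v).collect()
}

/// Evaluate the crop and its inverse; return whether the inverse was
/// chosen, together with the smaller nearest-neighbor distance.
fn best_polarity<F>(crop: &[u8], nn_distance: F) -> (bool, f64)
where
    F: Fn(&[u8]) -> f64,
{
    let direct = nn_distance(crop);
    let inverted = nn_distance(&invert(crop));
    if inverted < direct {
        (true, inverted)
    } else {
        (false, direct)
    }
}
```

The cost is one extra feature extraction and lookup per region, so it is cheaper than re-running the whole pipeline on an inverted image.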

## Debugging OCR output

### The three diagnostic fields

Every image parser output (with OCR on) includes:

```json
{
  "ocr_applied": true,
  "ocr_status": "recognized",
  "ocr_confidence": 0.82
}
```

Possible `ocr_status` values:

- `recognized` — text extracted. Content field is populated.
- `no_text_found` — pipeline ran, nothing passed the confidence filter.
  Also see `ocr_regions` (how many candidates the layout stage found).
- `error` — engine error. See `ocr_error` metadata.
- (field absent) — OCR didn't run. Check `OMNIPARSE_OCR=classical` (or
  `=ml`) and that the `ocr` or `ocr-ml` feature is compiled in.

### Visual debugging

```sh
export OMNIPARSE_OCR_DEBUG_DIR=/tmp/omniparse_debug
OMNIPARSE_OCR=classical omniparse image.jpg

open /tmp/omniparse_debug/01_input.png         # original (grayscale)
open /tmp/omniparse_debug/02_preprocessed.png  # after binarize + despeckle
open /tmp/omniparse_debug/03_layout.png        # red bboxes on detected regions
```

Interpretation:

- `02_preprocessed.png` shows muddy grey → binarization is failing.
  Switch to `sauvola` or enable `CLAHE`.
- `03_layout.png` shows red bboxes scattered across the photo background
  → layout analyzer is firing on image edges. Switch to `mser`, add
  `SW_CV_MAX` + `NEIGHBOR_MIN` filters.
- `03_layout.png` shows no bboxes at all → layout stage rejected every
  candidate. Relax filter thresholds or try a different layout analyzer.

## PDF OCR

When a PDF's text-layer extraction returns empty, the PDF parser
automatically OCRs every embedded `DCTDecode` (JPEG) image. This requires
the same `OMNIPARSE_OCR=classical|ml` gate as the image parsers and no
extra configuration.

```sh
OMNIPARSE_OCR=ml omniparse scanned.pdf
```

Output concatenates per-image recognized text with `[image N of M]`
headers. Metadata includes `ocr_images_total` and
`ocr_images_recognized`.
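
The shape of that output can be sketched as follows. The exact whitespace between sections, and whether unrecognized images still get a header, are not specified in this guide. The sketch assumes they are omitted and that `N` is the image's 1-based position among all images. `join_pdf_ocr` is a hypothetical helper.

```rust
/// Join per-image OCR results into one string with `[image N of M]`
/// headers. Returns (text, images_total, images_recognized).
fn join_pdf_ocr(images: &[Option<&str>]) -> (String, usize, usize) {
    let total = images.len();
    let mut out = String::new();
    let mut recognized = 0;
    for (i, text) in images.iter().enumerate() {
        // `None` stands for an image that was skipped or yielded no text.
        if let Some(t) = text {
            if !out.is_empty() {
                out.push('\n');
            }
            out.push_str(&format!("[image {} of {}]\n{}\n", i + 1, total, t.trim()));
            recognized += 1;
        }
    }
    (out, total, recognized)
}
```

Comparing the two counters is a quick way to spot PDFs where most images were skipped, for example because they use one of the unsupported filters.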

Limitations:

- `FlateDecode`, `JPXDecode`, `CCITTFaxDecode` image filters are not yet
  supported (require pixel reconstruction from the ColorSpace +
  BitsPerComponent dictionary). Affected images are silently skipped.
- No per-page rasterization of vector-only scanned PDFs. Pure-Rust PDF
  rasterization is not currently feasible.

## Library API

### OcrEngine — classical

```rust
use omniparse::ocr::{OcrEngine, OcrEngineBuilder, OcrConfig};
use omniparse::ocr::preprocess::{ImageprocPreprocessor, PreprocessConfig, BinarizeMode};
use omniparse::ocr::layout::ConnectedComponentAnalyzer;
use omniparse::ocr::recognize::FeatureRecognizer;
use omniparse::ocr::prototypes::load_prototypes_json;

let engine: OcrEngine = OcrEngineBuilder::default()
    .preprocessor(ImageprocPreprocessor::with_config(PreprocessConfig {
        binarize: BinarizeMode::Sauvola { window: 25, k: 0.2, r: 128.0 },
        clahe: true,
        bilateral_radius: 2,
        ..Default::default()
    }))
    .layout(ConnectedComponentAnalyzer::default())
    .recognizer(
        FeatureRecognizer::new(load_prototypes_json("/tmp/arial.json")?)
            .with_k(5)
            .with_both_polarities(true)
            .with_normalize_height(Some(32))
            .build_kdtree(),
    )
    .config(OcrConfig {
        min_confidence: 0.2,
        bigram_rerank: true,
        auto_rotate: true,
        stroke_width_cv_max: Some(0.5),
        neighbor_density_min: Some(2),
        ..Default::default()
    })
    .build();

let output = engine.recognize(image::open("page.png")?)?;
```

### MlOcrEngine — ML

```rust
#[cfg(feature = "ocr-ml")]
{
    // The attribute gates the whole block, not just the first statement.
    let engine = omniparse::ocr::ml::MlOcrEngine::new()?;
    let output = engine.recognize(image::open("photo.jpg")?)?;
}
```

### Plug a custom stage

Every stage is a trait:

```rust
use omniparse::ocr::OcrEngineBuilder;
use omniparse::ocr::preprocess::Preprocessor;
use omniparse::ocr::error::OcrResult;
use image::{DynamicImage, GrayImage};

struct MyPreprocessor;
impl Preprocessor for MyPreprocessor {
    fn process(&self, img: DynamicImage) -> OcrResult<GrayImage> {
        Ok(img.into_luma8())
    }
}

let engine = OcrEngineBuilder::default()
    .preprocessor(MyPreprocessor)
    .build();
```

Same pattern for `LayoutAnalyzer`, `Recognizer`, `PostProcessor`.

### Standalone OCR helper

```rust
#[cfg(feature = "ocr")]
let text = omniparse::ocr::extract_text_from_image("image.png")?;
```

Convenience wrapper that builds the default engine and returns the
recognized text string.

## FAQ

**Q: Do I need the `ocr-train` feature at runtime?**
A: No. Prototype generation is a one-off build step. Production binaries
only need the `ocr` or `ocr-ml` feature plus a JSON prototype file (if
classical) or the auto-downloaded models (if ML).

**Q: Can I use the ML backend without internet access?**
A: Yes. On a machine that does have internet, run `omniparse models
download` (or `omniparse models download --force` to refresh) and then
copy the resulting cache directory to the offline host. Point the offline
binary at the copy via `OMNIPARSE_OCR_MODELS=/some/path`. The pre-built
Docker image (see project `Dockerfile`) already bakes the models in.

**Q: Why does the classical pipeline produce garbage on my photograph?**
A: Likely font mismatch. The classical recognizer shape-matches input
glyphs against trained prototypes. If the prototypes were trained from
Arial but the image is in Helvetica Neue, Futura, or the site's custom
brand font, shape distances won't match well. Identify the real font
(using WhatFont, Font Squirrel Matcherator, etc.) and retrain.

**Q: How much slower is ML OCR?**
A: Roughly 10–100× slower per image than the classical pipeline, but
still sub-second for most inputs on modern hardware. Enable
`ocr-parallel` if you're batching many images.

**Q: Can I train prototypes for non-Latin scripts?**
A: The training pipeline accepts any Unicode characters as the character
set argument. The feature extractor is script-agnostic. In practice the
bundled bigram table is English-only, so disable bigram rerank / beam
search for non-Latin scripts. The ML backend's bundled models are
English-only; multi-lingual models are on the ocrs-models roadmap.

**Q: Can I bundle models with my binary?**
A: Not on crates.io (10 MB package cap). For a private build, point
`OMNIPARSE_OCR_MODELS` at a known location and ship the `.rten` files
alongside your binary.

**Q: How do I disable the result cache?**
A: Set `OMNIPARSE_OCR_CACHE` to `0`:

```sh
export OMNIPARSE_OCR_CACHE=0
```

**Q: Is there a version that runs in the browser / WASM?**
A: Not currently. The `image` crate decode path is WASM-compatible, and
`rten` has WASM support upstream, but the full omniparse build has not
been exercised against WASM targets. Try `cargo build --target wasm32-*
--features ocr-ml` and report issues.