manga-ocr-rs 0.1.5

Japanese manga OCR — ViT encoder + BERT decoder via ONNX Runtime

manga-ocr-rs

Japanese manga OCR in pure Rust — no Python, no pip.

Runs mayocream/manga-ocr-onnx (an ONNX export of the original kha-white/manga-ocr-base) via ONNX Runtime. Returns raw Japanese text from an image crop: no translation and no furigana stripping, only image-to-text.

Handles yokogaki (horizontal), tategaki (vertical), and tegaki (handwritten) text. Images are squish-resized to 224×224 (stretched to the target size without preserving aspect ratio), matching the original training pipeline.


Quick start

# Cargo.toml
[dependencies]
manga-ocr-rs = "0.1"

use manga_ocr_rs::MangaOcr;

// Models are downloaded automatically on first `cargo build` (~441 MB).
let ocr = MangaOcr::new(manga_ocr_rs::default_model_dir())?;
let img = image::open("panel.png")?;

// Simple: just the text
println!("{}", ocr.recognize(&img)?);

// With confidence scores
let r = ocr.recognize_with_score(&img)?;
if r.confidence > 0.80 {
    println!("accepted: {} (confidence: {:.4})", r.text, r.confidence);
}

The first cargo build downloads three files (~441 MB total) from HuggingFace into ~/.cache/manga-ocr-rs/ via curl. Subsequent builds reuse the cached files and skip the download.

To use a pre-downloaded copy:

MANGA_OCR_MODELS_DIR=/path/to/models cargo build

CLI

cargo install manga-ocr-rs

manga-ocr panel.png
manga-ocr inspect # print model I/O names

Test results (debug build, beam search k=4)

Unit-test fixtures (rescaled 2026-04-15)

| Fixture | Size | Expected | Result | Confidence | Tokens | Time |
|---|---|---|---|---|---|---|
| Unit-test-yokogaki.png | 360×197 | データを正確に読み取る | PASS (exact) | 0.9997 | 12 | ~1.8 s |
| Unit-test-tategaki.png | 480×262 | 『言語モデルのテスト』 | PASS (「」 bracket variant) | 0.7967 | 12 | ~2.0 s |
| Unit-test-tegaki.png | 480×262 | 手書きの文字サンプル | PASS (exact) | 0.9552 | 11 | ~1.9 s |

All three pass. Tategaki reads the correct text but uses single corner brackets 「」 instead of double 『』 — the bracket style is ambiguous at this resolution. When cropped tighter by a text detector (e.g. DBNet → 142×262), the model returns the correct 『』.

Confidence is the dimension-adjusted geometric mean of per-token probabilities, in the range 0.0–1.0. No calibration penalty is applied at these realistic manga-bubble sizes.
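As a rough illustration of the geometric-mean part, here is a minimal sketch. The dimension adjustment is not specified in this README, so it is left out, and the function name `geometric_mean_confidence` is hypothetical rather than part of the crate's API.

```rust
/// Geometric mean of per-token probabilities, computed in log space
/// so that long sequences do not underflow.
///
/// Returns 0.0 for an empty slice, or if any value is outside (0, 1]
/// (including NaN). This covers only the raw mean; the crate's
/// dimension adjustment is not reproduced here.
fn geometric_mean_confidence(token_probs: &[f32]) -> f32 {
    if token_probs.is_empty() {
        return 0.0;
    }
    let mut log_sum = 0.0f64;
    for &p in token_probs {
        if !(p > 0.0 && p <= 1.0) {
            return 0.0;
        }
        log_sum += f64::from(p).ln();
    }
    (log_sum / token_probs.len() as f64).exp() as f32
}
```

Because the mean is taken over logarithms, a single very unlikely token pulls the whole score down sharply, which is one reason runaway decoder output tends to score low.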

See unified benchmark for cross-engine comparison (manga-ocr-rs vs DBNet+manga-ocr-rs vs PaddleOCR-VL vs Umi-OCR).

Real manga — ubunchu01_02.png (9 speech bubbles)

| Bubble | Expected | Result | Score | Tokens | Trunc | Time |
|---|---|---|---|---|---|---|
| Top right, line 1 | あ あたしの オススメは | PASS | 0.9791 | 11 | | 32,270 ms |
| Top right, large text | うぶんちゅ | FAIL: prefix leak from neighbour | 0.9799 | 8 | | 26,976 ms |
| Top left bubble | 最近人気の デスクトップな リナックスです! | PASS | 0.9998 | 21 | | 36,087 ms |
| Center caption | ※ うぶんちゅではなくウブントゥです | FAIL: tiny text, hallucination | 0.1429 | 300 | YES | 39,811 ms |
| Middle bubble | 却下! | PASS | 0.9147 | 4 | | 5,275 ms |
| Bottom center | マジ いてえ んだぞ! | FAIL: slanted action text | 0.0763 | 300 | YES | 25,298 ms |
| Bottom right | よけんな このっ! | FAIL: screaming/action text | 0.0679 | 248 | | 18,689 ms |
| Bottom left, top | ハモリながら ケンカしないでーっ | FAIL: ケンカ read as ケアカ | 0.7022 | 16 | | 1,213 ms |
| Bottom left, bottom | 一瞬くらい 検討して くださいよー! | PASS | 0.9911 | 17 | | 39,629 ms |

4/9 pass on real manga. Failures are documented in the test source. The comparison normalises whitespace and treats full-width ！？ as equivalent to ASCII !?.
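A comparison of this kind could look like the sketch below. It assumes that "normalising whitespace" means dropping it entirely and that full-width marks are mapped to ASCII; the actual test source may differ, and both function names are hypothetical.

```rust
/// Normalise a string for comparison: drop all whitespace (including the
/// ideographic space U+3000) and map full-width '！' / '？' to ASCII.
fn normalize_for_compare(s: &str) -> String {
    s.chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| match c {
            '\u{FF01}' => '!', // full-width exclamation mark
            '\u{FF1F}' => '?', // full-width question mark
            other => other,
        })
        .collect()
}

/// True if the two strings are equal after normalisation.
fn texts_match(expected: &str, actual: &str) -> bool {
    normalize_for_compare(expected) == normalize_for_compare(actual)
}
```

With this rule, an expected string annotated with spaces between lines of a bubble matches OCR output that has none, while a single wrong character still counts as a failure.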

Score is the dimension-adjusted confidence value. Trunc indicates the decoder hit the 300-step limit without emitting EOS — a strong hallucination signal. Notice the pattern:

  • Hallucinations (center caption, bottom center/right): score < 0.15, tokens 248–300, two truncated. These are runaway decoder loops.
  • Correct results: score > 0.91, tokens 4–21, none truncated.
  • Minor errors (bottom left top, ケンカ read as ケアカ): score 0.70, 16 tokens. The model is fairly confident but confuses a single character.
  • Prefix leak (top right large): score 0.98, 8 tokens — model is confident but saw neighboring text in the crop. Confidence alone won't catch this; crop quality matters.

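A threshold of confidence >= 0.80 && !truncated would reject the three hallucinations and the ケンカ misread while keeping all four correct results in this table. The prefix leak (score 0.98) would still pass, and the tategaki unit-test fixture (0.7967) would fall just below the cutoff.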
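The threshold can be sketched as a small filter. The `OcrResult` struct and its `truncated` field are hypothetical stand-ins for illustration; they are not the crate's `Recognition` type, whose exact fields should be checked in the crate documentation.

```rust
/// Hypothetical view of one OCR result; field names are illustrative only.
struct OcrResult<'a> {
    text: &'a str,
    confidence: f32,
    /// True if decoding hit the step limit without emitting EOS.
    truncated: bool,
}

/// Cutoff taken from the discussion above.
const MIN_CONFIDENCE: f32 = 0.80;

/// Accept a result only if it is confident and ended on EOS.
fn is_accepted(r: &OcrResult) -> bool {
    r.confidence >= MIN_CONFIDENCE && !r.truncated
}

/// Keep the text of every accepted result, in input order.
fn accepted_texts<'a>(results: &[OcrResult<'a>]) -> Vec<&'a str> {
    results
        .iter()
        .filter(|r| is_accepted(r))
        .map(|r| r.text)
        .collect()
}
```

A filter like this only screens out low-confidence and runaway output. A confidently wrong result, such as a crop that includes neighbouring text, passes unchanged, so crop quality still has to be handled upstream.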

Note: test_ubunchu_annotations is currently #[ignore]d in CI because several annotations fail due to bounding-box overlap, tiny crops, and decorative action text that the model hallucinates on. Run it manually with cargo test test_ubunchu_annotations -- --ignored. See docs/ubunchu-test-analysis.md for details.

Times are from unoptimised debug builds; cargo test --release is significantly faster.


Architecture

DynamicImage
    │
    ▼  preprocess()
    │  grayscale → RGB, squish-resize 224×224 Bilinear
    │  normalize mean=0.5 std=0.5
    │  shape: [1, 3, 224, 224]
    │
    ▼  encoder_model.onnx  (ViT, ~328 MB)
    │  last_hidden_state: [1, 196, 768]
    │
    ▼  decoder_model.onnx  (BERT, ~113 MB)
    │  beam search: 4 beams, batched per step
    │  no_repeat_ngram_size=3, length_penalty=2.0
    │  stops at EOS or 300 steps
    │
    ▼  vocab.txt  (~30 KB, line-indexed)
    │
    Recognition { text, score, confidence, raw_confidence }
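
The normalisation and layout step in the diagram can be sketched in plain Rust as follows. This is an illustration under assumptions, not the crate's `preprocess()`: it assumes the image has already been converted to RGB and resized to 224×224 (for example with the image crate), and the function name `to_chw_normalized` is hypothetical.

```rust
/// Side length of the model input, from the diagram above.
const SIDE: usize = 224;

/// Convert an interleaved 8-bit RGB buffer (HWC order, already 224×224)
/// into a planar f32 buffer in CHW order, normalised with mean 0.5 and
/// std 0.5 so that values land in [-1.0, 1.0].
///
/// Returns None if the buffer length is not 224 * 224 * 3.
fn to_chw_normalized(rgb: &[u8]) -> Option<Vec<f32>> {
    if rgb.len() != SIDE * SIDE * 3 {
        return None;
    }
    let plane = SIDE * SIDE;
    let mut out = vec![0.0f32; 3 * plane];
    for (i, px) in rgb.chunks_exact(3).enumerate() {
        for c in 0..3 {
            let v = f32::from(px[c]) / 255.0; // scale to [0, 1]
            out[c * plane + i] = (v - 0.5) / 0.5; // (x - mean) / std
        }
    }
    Some(out)
}
```

The resulting buffer holds 3 × 224 × 224 values and can be viewed as a tensor of shape [1, 3, 224, 224] by adding a batch dimension of one.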

Beam search parameters match generation_config.json from the original model.
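
To make the two decoding parameters concrete, here is a minimal sketch of n-gram blocking and length-penalised scoring. It assumes the convention used by Hugging Face transformers, where a finished beam is ranked by its summed log-probability divided by length raised to `length_penalty`; whether this crate implements it identically is not stated here. Both function names are hypothetical, and batching across beams is not shown.

```rust
/// Tokens that must not be generated next, because appending them would
/// repeat an n-gram already present in `seq` (no_repeat_ngram_size = n).
fn banned_next_tokens(seq: &[u32], n: usize) -> Vec<u32> {
    // Too short to contain a full n-gram after one more token.
    if n == 0 || seq.len() + 1 < n {
        return Vec::new();
    }
    // The last n-1 tokens form the prefix of the n-gram being built.
    let prefix = &seq[seq.len() - (n - 1)..];
    let mut banned = Vec::new();
    for w in seq.windows(n) {
        if &w[..n - 1] == prefix && !banned.contains(&w[n - 1]) {
            banned.push(w[n - 1]);
        }
    }
    banned
}

/// Beam score: summed log-probability divided by len^length_penalty.
/// Log-probabilities are negative, so a penalty above 1.0 favours
/// longer hypotheses.
fn length_penalized_score(sum_log_prob: f32, len: usize, length_penalty: f32) -> f32 {
    sum_log_prob / (len.max(1) as f32).powf(length_penalty)
}
```

For example, with n = 3 and a sequence ending in tokens 1, 2 that already contains 1, 2, 3, token 3 is banned at the next step. In a real decoder the banned tokens would have their logits set to negative infinity before the top-k selection for each beam.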


License

MIT — see LICENSE.

Credits and Citations

Model: mayocream/manga-ocr-onnx
Original: kha-white/manga-ocr (MIT)

Manga109: manga109-dataset

@article{multimedia_aizawa_2020,
    author={Kiyoharu Aizawa and Azuma Fujimoto and Atsushi Otsubo and Toru Ogawa and Yusuke Matsui and Koki Tsubota and Hikaru Ikuta},
    title={Building a Manga Dataset ``Manga109'' with Annotations for Multimedia Applications},
    journal={IEEE MultiMedia},
    volume={27},
    number={2},
    pages={8--18},
    doi={10.1109/mmul.2020.2987895},
    year={2020}
}

Ubunchu: Ubunchu manga