manga-ocr-rs

Japanese manga OCR in pure Rust — no Python, no pip.

Runs mayocream/manga-ocr-onnx (an ONNX export of the original kha-white/manga-ocr-base model) via ONNX Runtime. Returns raw Japanese text from an image crop: no translation, no furigana stripping, only image-to-text.

Handles yokogaki (horizontal), tategaki (vertical), and tegaki (handwritten) text. Images are squish-resized (stretched to fit, ignoring aspect ratio) to 224×224, matching the original training pipeline.

Quick start

# Cargo.toml
[dependencies]
manga-ocr-rs = "0.1"

use manga_ocr_rs::MangaOcr;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Models are downloaded automatically on first `cargo build` (~441 MB).
    let ocr = MangaOcr::new(manga_ocr_rs::default_model_dir())?;
    // `image::open` comes from the `image` crate, which must also be a dependency.
    let img = image::open("panel.png")?;

    // Simple: just the text
    println!("{}", ocr.recognize(&img)?);

    // With confidence scores
    let r = ocr.recognize_with_score(&img)?;
    if r.confidence > 0.80 {
        println!("accepted: {} (confidence: {:.4})", r.text, r.confidence);
    }
    Ok(())
}

The first cargo build downloads three files (~441 MB total) from Hugging Face into ~/.cache/manga-ocr-rs/ via curl. Subsequent builds reuse the cached files and skip the download.

To use a pre-downloaded copy:

MANGA_OCR_MODELS_DIR=/path/to/models cargo build
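
The lookup order implied above (environment override first, then the cache directory) can be sketched roughly as follows. This is an illustration only: `resolve_model_dir` is a hypothetical helper, not part of the crate, and the actual build script may resolve paths differently.

```rust
use std::ffi::OsString;
use std::path::PathBuf;

/// Hypothetical helper (not part of manga-ocr-rs): pick the model directory.
/// `override_dir` stands in for the value of MANGA_OCR_MODELS_DIR,
/// `home` for the value of HOME.
pub fn resolve_model_dir(
    override_dir: Option<OsString>,
    home: Option<OsString>,
) -> Option<PathBuf> {
    // A non-empty override always wins.
    if let Some(dir) = override_dir.filter(|d| !d.is_empty()) {
        return Some(PathBuf::from(dir));
    }
    // Otherwise fall back to ~/.cache/manga-ocr-rs.
    home.map(|h| PathBuf::from(h).join(".cache").join("manga-ocr-rs"))
}
```

A caller would pass `std::env::var_os("MANGA_OCR_MODELS_DIR")` and `std::env::var_os("HOME")`; keeping the function free of direct environment access makes it easy to test.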

CLI

cargo install manga-ocr-rs

manga-ocr panel.png
manga-ocr inspect # print model I/O names

Test results (debug build, beam search k=4)

Unit-test fixtures (rescaled 2026-04-15)

Fixture	Size	Expected	Result	Confidence	Tokens	Time
`Unit-test-yokogaki.png`	360×197	`データを正確に読み取る`	PASS (exact)	0.9997	12	~1.8 s
`Unit-test-tategaki.png`	480×262	`『言語モデルのテスト』`	PASS (`「」` bracket variant)	0.7967	12	~2.0 s
`Unit-test-tegaki.png`	480×262	`手書きの文字サンプル`	PASS (exact)	0.9552	11	~1.9 s

All three pass. Tategaki reads the correct text but uses single corner brackets 「」 instead of double 『』 — the bracket style is ambiguous at this resolution. When cropped tighter by a text detector (e.g. DBNet → 142×262), the model returns the correct 『』.

Confidence is the dimension-adjusted geometric mean of per-token probabilities (0.0–1.0). No calibration penalty is applied at these realistic manga-bubble sizes.
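
As a rough illustration of the geometric-mean part only, here is a minimal sketch. The dimension adjustment is not specified in this README, so it is not reproduced; `geometric_mean` is a hypothetical name, not the crate's API.

```rust
/// Geometric mean of per-token probabilities, computed in log space
/// so that long sequences do not underflow.
/// Returns 0.0 for an empty slice or if any probability is not positive.
pub fn geometric_mean(probs: &[f32]) -> f32 {
    if probs.is_empty() || probs.iter().any(|&p| p <= 0.0) {
        return 0.0;
    }
    // mean(ln p_i), accumulated in f64 for stability.
    let mean_log: f64 =
        probs.iter().map(|&p| (p as f64).ln()).sum::<f64>() / probs.len() as f64;
    mean_log.exp() as f32
}
```

Unlike an arithmetic mean, a single very unlikely token pulls the geometric mean down sharply, which is why one bad character can noticeably lower the score of an otherwise confident line.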

See unified benchmark for cross-engine comparison (manga-ocr-rs vs DBNet+manga-ocr-rs vs PaddleOCR-VL vs Umi-OCR).

Real manga — `ubunchu01_02.png` (9 speech bubbles)

Bubble	Expected	Result	Score	Tokens	Trunc	Time
Top right, line 1	`ああたしのオススメは`	PASS	0.9791	11		32,270 ms
Top right, large text	`うぶんちゅ`	FAIL — prefix leak from neighbour	0.9799	8		26,976 ms
Top left bubble	`最近人気のデスクトップなリナックスです！`	PASS	0.9998	21		36,087 ms
Center caption	`※ うぶんちゅではなくウブントゥです`	FAIL — tiny text, hallucination	0.1429	300	YES	39,811 ms
Middle bubble	`却下！`	PASS	0.9147	4		5,275 ms
Bottom center	`マジいてえんだぞ！`	FAIL — slanted action text	0.0763	300	YES	25,298 ms
Bottom right	`よけんなこのっ！`	FAIL — screaming/action text	0.0679	248		18,689 ms
Bottom left, top	`ハモリながらケンカしないでーっ`	FAIL — `ケンカ` → `ケアカ`	0.7022	16		1,213 ms
Bottom left, bottom	`一瞬くらい検討してくださいよー！`	PASS	0.9911	17		39,629 ms

4/9 pass on real manga. Failures are documented in the test source. Comparison normalises whitespace and full-width ！？ → !?.
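
The comparison rule can be sketched as below. This is an assumption-laden illustration: `normalize_for_compare` is a hypothetical name, and "normalises whitespace" is read here as dropping all whitespace, whereas the real test code might collapse it instead.

```rust
/// Hypothetical normaliser for comparing OCR output with an expected string:
/// drops all whitespace (assumption) and maps full-width ！？ to ASCII !?.
pub fn normalize_for_compare(s: &str) -> String {
    s.chars()
        .filter(|c| !c.is_whitespace()) // also removes the ideographic space U+3000
        .map(|c| match c {
            '！' => '!',
            '？' => '?',
            other => other,
        })
        .collect()
}
```

Two strings then count as equal when `normalize_for_compare(expected) == normalize_for_compare(result)`.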

Score is the dimension-adjusted confidence value. Trunc indicates the decoder hit the 300-step limit without emitting EOS — a strong hallucination signal. Notice the pattern:

Hallucinations (center caption, bottom center/right): score < 0.15, tokens 248–300, two truncated. These are runaway decoder loops.
Correct results: score > 0.91, tokens 4–21, none truncated.
Minor errors (bottom left top, ケンカ→ケアカ): score 0.70, 16 tokens. The model is partially confident but misreads a single character.
Prefix leak (top right large): score 0.98, 8 tokens — model is confident but saw neighboring text in the crop. Confidence alone won't catch this; crop quality matters.

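A threshold of confidence >= 0.80 && !truncated rejects every hallucination and the ケアカ misread in this table while keeping all four correct results. It does not catch the prefix leak (0.98), and it would also reject the tategaki unit-test fixture (0.7967), so treat 0.80 as a starting point rather than a guarantee.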
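
A minimal sketch of that filter, using a hypothetical `accept` helper that takes the confidence and truncation flag as plain values (how the crate exposes truncation is not shown in this README, so that part is an assumption):

```rust
/// Hypothetical acceptance rule: keep a result only if it is confident
/// enough and the decoder stopped on EOS rather than the step limit.
pub fn accept(confidence: f32, truncated: bool, min_confidence: f32) -> bool {
    confidence >= min_confidence && !truncated
}

/// Count how many (confidence, truncated) rows pass the rule.
pub fn count_accepted(rows: &[(f32, bool)], min_confidence: f32) -> usize {
    rows.iter()
        .filter(|&&(confidence, truncated)| accept(confidence, truncated, min_confidence))
        .count()
}
```

Applied to the nine rows of the table above with a 0.80 threshold, this keeps five results: the four passes plus the prefix-leak row, which scores 0.98 and can only be fixed by a better crop.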

Note: test_ubunchu_annotations is currently #[ignore]d in CI because several annotations fail due to bounding-box overlap, tiny crops, and decorative action text that the model hallucinates on. Run it manually with cargo test test_ubunchu_annotations -- --ignored. See docs/ubunchu-test-analysis.md for details.

Times are from unoptimised debug builds; cargo test --release is significantly faster.

Architecture

DynamicImage
    │
    ▼  preprocess()
    │  grayscale → RGB, squish-resize 224×224 Bilinear
    │  normalize mean=0.5 std=0.5
    │  shape: [1, 3, 224, 224]
    │
    ▼  encoder_model.onnx  (ViT, ~328 MB)
    │  last_hidden_state: [1, 196, 768]
    │
    ▼  decoder_model.onnx  (BERT, ~113 MB)
    │  beam search: 4 beams, batched per step
    │  no_repeat_ngram_size=3, length_penalty=2.0
    │  stops at EOS or 300 steps
    │
    ▼  vocab.txt  (~30 KB, line-indexed)
    │
    Recognition { text, score, confidence, raw_confidence }
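
The preprocessing stage in the diagram can be approximated with the standard library alone. The sketch below is not the crate's code: `preprocess_gray` is a hypothetical function, it takes an already-grayscale buffer, and its bilinear sampling convention (half-pixel centres, no antialiasing) is an assumption that may differ from the resize filter the crate actually uses.

```rust
pub const SIDE: usize = 224;

/// Squish-resize a grayscale image (row-major, `w` x `h`) to 224x224 with
/// bilinear interpolation, replicate it to three channels, and normalise
/// with mean = 0.5, std = 0.5. The output is CHW, i.e. [1, 3, 224, 224]
/// flattened. Panics if the image is empty or `gray.len() != w * h`.
pub fn preprocess_gray(gray: &[u8], w: usize, h: usize) -> Vec<f32> {
    assert!(w > 0 && h > 0 && gray.len() == w * h, "bad dimensions");
    let mut plane = vec![0.0f32; SIDE * SIDE];
    // Independent x/y scale factors: aspect ratio is deliberately not preserved.
    let sx = w as f32 / SIDE as f32;
    let sy = h as f32 / SIDE as f32;
    let px = |x: usize, y: usize| gray[y * w + x] as f32;
    for oy in 0..SIDE {
        // Half-pixel-centre sampling, clamped to the image borders.
        let fy = ((oy as f32 + 0.5) * sy - 0.5).clamp(0.0, (h - 1) as f32);
        let y0 = fy.floor() as usize;
        let y1 = (y0 + 1).min(h - 1);
        let ty = fy - y0 as f32;
        for ox in 0..SIDE {
            let fx = ((ox as f32 + 0.5) * sx - 0.5).clamp(0.0, (w - 1) as f32);
            let x0 = fx.floor() as usize;
            let x1 = (x0 + 1).min(w - 1);
            let tx = fx - x0 as f32;
            let top = px(x0, y0) * (1.0 - tx) + px(x1, y0) * tx;
            let bottom = px(x0, y1) * (1.0 - tx) + px(x1, y1) * tx;
            let v = top * (1.0 - ty) + bottom * ty;
            // Scale to [0, 1], then (x - mean) / std with mean = std = 0.5.
            plane[oy * SIDE + ox] = (v / 255.0 - 0.5) / 0.5;
        }
    }
    // Grayscale to RGB: the same plane repeated for R, G and B.
    let mut chw = Vec::with_capacity(3 * SIDE * SIDE);
    for _ in 0..3 {
        chw.extend_from_slice(&plane);
    }
    chw
}
```

With mean and std both 0.5, pixel values map from [0, 255] to [-1, 1], so a white page becomes a tensor of ones and solid black becomes minus ones.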

Beam search parameters match generation_config.json from the original model.
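
To make no_repeat_ngram_size=3 concrete, here is a minimal sketch of n-gram blocking as the option is commonly understood in Hugging Face-style generation. `banned_next_tokens` is a hypothetical helper; the crate's own beam search may implement the rule differently.

```rust
/// Tokens that may not be generated next, because appending any of them
/// would repeat an n-gram that already occurs in `tokens`.
pub fn banned_next_tokens(tokens: &[u32], n: usize) -> Vec<u32> {
    // Need at least n-1 previous tokens to form the start of an n-gram.
    if n == 0 || tokens.len() + 1 < n {
        return Vec::new();
    }
    // The last n-1 tokens are the prefix of the n-gram about to be completed.
    let prefix = &tokens[tokens.len() - (n - 1)..];
    let mut banned: Vec<u32> = tokens
        .windows(n)
        .filter(|w| &w[..n - 1] == prefix)
        .map(|w| w[n - 1])
        .collect();
    banned.sort_unstable();
    banned.dedup();
    banned
}
```

During decoding, the logits of the banned tokens would be set to negative infinity before the next beam step, so no trigram can appear twice in one hypothesis.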

License

MIT — see LICENSE.

Credits and Citations

Model: mayocream/manga-ocr-onnx
Original: kha-white/manga-ocr (MIT)

Manga109: manga109-dataset

@article{multimedia_aizawa_2020,
    author={Kiyoharu Aizawa and Azuma Fujimoto and Atsushi Otsubo and Toru Ogawa and Yusuke Matsui and Koki Tsubota and Hikaru Ikuta},
    title={Building a Manga Dataset ``Manga109'' with Annotations for Multimedia Applications},
    journal={IEEE MultiMedia},
    volume={27},
    number={2},
    pages={8--18},
    doi={10.1109/mmul.2020.2987895},
    year={2020}
}

Ubunchu: Ubunchu manga

manga-ocr-rs 0.1.5