manga-ocr-rs
Japanese manga OCR in pure Rust — no Python, no pip.
Runs mayocream/manga-ocr-onnx (the original kha-white/manga-ocr-base ONNX export) via ONNX Runtime. Returns raw Japanese text from an image crop; no translation, no furigana stripping — pure image-to-text.
Handles yokogaki (horizontal), tategaki (vertical), and tegaki (handwritten) text. Images are squish-resized to 224×224 matching the original training pipeline.
Quick start
# Cargo.toml
[dependencies]
manga-ocr-rs = "0.1"
use MangaOcr;
// Models are downloaded automatically on first `cargo build` (~441 MB).
let ocr = new?;
let img = open?;
// Simple: just the text
println!;
// With confidence scores
let r = ocr.recognize_with_score?;
if r.confidence > 0.80
The first cargo build downloads three files (encoder, decoder, and vocabulary; ~441 MB total) from Hugging Face
into ~/.cache/manga-ocr-rs/ via curl. Subsequent builds reuse the cached files and skip the download.
To use a pre-downloaded copy:
MANGA_OCR_MODELS_DIR=/path/to/models
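The lookup order described above (environment override first, cache directory second) could be sketched as follows. This is an illustration only: `resolve_models_dir` is a hypothetical helper, not a function of this crate, and the caller is assumed to pass in the value of `MANGA_OCR_MODELS_DIR` and the home directory.

```rust
use std::path::PathBuf;

/// Hypothetical helper (not part of the crate's API): choose the model
/// directory from an optional override and the user's home directory.
/// `override_dir` stands for the value of MANGA_OCR_MODELS_DIR, if set.
fn resolve_models_dir(override_dir: Option<&str>, home: &str) -> PathBuf {
    match override_dir {
        // A non-empty override wins.
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        // Otherwise fall back to ~/.cache/manga-ocr-rs/.
        _ => PathBuf::from(home).join(".cache").join("manga-ocr-rs"),
    }
}
```

A caller would typically pass `std::env::var("MANGA_OCR_MODELS_DIR").ok().as_deref()` as the first argument.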
CLI
Test results (debug build, beam search k=4)
Unit-test fixtures (rescaled 2026-04-15)
| Fixture | Size | Expected | Result | Confidence | Tokens | Time |
|---|---|---|---|---|---|---|
| Unit-test-yokogaki.png | 360×197 | データを正確に読み取る | PASS (exact) | 0.9997 | 12 | ~1.8 s |
| Unit-test-tategaki.png | 480×262 | 『言語モデルのテスト』 | PASS (「」 bracket variant) | 0.7967 | 12 | ~2.0 s |
| Unit-test-tegaki.png | 480×262 | 手書きの文字サンプル | PASS (exact) | 0.9552 | 11 | ~1.9 s |
All three pass. Tategaki reads the correct text but uses single corner brackets
「」 instead of double 『』 — the bracket style is ambiguous at this resolution.
When cropped tighter by a text detector (e.g. DBNet → 142×262), the model returns
the correct 『』.
Confidence is the dimension-adjusted geometric mean of per-token probabilities (0.0–1.0). No calibration penalty is applied at these realistic manga-bubble sizes.
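For reference, the geometric mean of per-token probabilities can be computed in log space as in the sketch below. It covers only the unadjusted mean: the dimension adjustment is not specified in this README and is left out, and the function name `geometric_mean` is illustrative rather than taken from the crate.

```rust
/// Geometric mean of per-token probabilities, computed in log space to
/// avoid underflow on long sequences.
/// Returns 0.0 for an empty slice or if any probability is not positive.
fn geometric_mean(probs: &[f64]) -> f64 {
    if probs.is_empty() || probs.iter().any(|&p| p <= 0.0) {
        return 0.0;
    }
    let mean_log = probs.iter().map(|p| p.ln()).sum::<f64>() / probs.len() as f64;
    mean_log.exp()
}
```

Because the mean is geometric, a single very unlikely token pulls the value down much more than it would with an arithmetic mean.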
See unified benchmark for cross-engine comparison (manga-ocr-rs vs DBNet+manga-ocr-rs vs PaddleOCR-VL vs Umi-OCR).
Real manga — ubunchu01_02.png (9 speech bubbles)
| Bubble | Expected | Result | Score | Tokens | Trunc | Time |
|---|---|---|---|---|---|---|
| Top right, line 1 | あ あたしの オススメは | PASS | 0.9791 | 11 | | 32,270 ms |
| Top right, large text | うぶんちゅ | FAIL (prefix leak from neighbour) | 0.9799 | 8 | | 26,976 ms |
| Top left bubble | 最近人気の デスクトップな リナックスです! | PASS | 0.9998 | 21 | | 36,087 ms |
| Center caption | ※ うぶんちゅではなくウブントゥです | FAIL (tiny text, hallucination) | 0.1429 | 300 | YES | 39,811 ms |
| Middle bubble | 却下! | PASS | 0.9147 | 4 | | 5,275 ms |
| Bottom center | マジ いてえ んだぞ! | FAIL (slanted action text) | 0.0763 | 300 | YES | 25,298 ms |
| Bottom right | よけんな このっ! | FAIL (screaming/action text) | 0.0679 | 248 | | 18,689 ms |
| Bottom left, top | ハモリながら ケンカしないでーっ | FAIL (ケンカ → ケアカ) | 0.7022 | 16 | | 1,213 ms |
| Bottom left, bottom | 一瞬くらい 検討して くださいよー! | PASS | 0.9911 | 17 | | 39,629 ms |
4/9 pass on real manga. Failures are documented in the test source.
Comparison normalises whitespace and maps full-width ！？ to ASCII !?.
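A minimal sketch of such a comparison is shown below. It assumes that "normalising whitespace" means dropping all whitespace (including the ideographic space U+3000) before comparing; the actual test source may differ, and both function names are hypothetical.

```rust
/// Drop all whitespace and map full-width ！ and ？ to their ASCII forms.
/// Assumption: whitespace is removed entirely rather than collapsed.
fn normalize_for_compare(s: &str) -> String {
    s.chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| match c {
            '！' => '!',
            '？' => '?',
            other => other,
        })
        .collect()
}

/// Two strings match if they are equal after normalisation.
fn texts_match(expected: &str, actual: &str) -> bool {
    normalize_for_compare(expected) == normalize_for_compare(actual)
}
```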
Score is the dimension-adjusted confidence value. Trunc indicates the
decoder hit the 300-step limit without emitting EOS — a strong hallucination
signal. Notice the pattern:
- Hallucinations (center caption, bottom center/right): score < 0.15, tokens 248–300, two truncated. These are runaway decoder loops.
- Correct results: score > 0.91, tokens 4–21, none truncated.
- Minor errors (bottom left top, ケンカ→ケアカ): score 0.70, 16 tokens. The model is partially confident but confuses a single character.
- Prefix leak (top right large): score 0.98, 8 tokens. The model is confident but saw neighbouring text in the crop. Confidence alone won't catch this; crop quality matters.
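On this page, a threshold of confidence >= 0.80 && !truncated would reject every
hallucination and the ケンカ→ケアカ error while keeping all four passing results. It would not catch the prefix leak (score 0.98).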
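The threshold can be expressed as a small predicate. The sketch below uses a hypothetical `OcrResult` struct for illustration; in particular, the `truncated` flag is an assumption (it stands for "the decoder hit the 300-step limit without emitting EOS") and is not necessarily a field of the crate's `Recognition` type.

```rust
/// Hypothetical result type for illustration only.
struct OcrResult {
    text: String,
    /// Dimension-adjusted confidence in 0.0..=1.0.
    confidence: f32,
    /// Assumed flag: decoder reached the step limit without EOS.
    truncated: bool,
}

/// Accept a result only if it is confident and the decoder stopped on EOS.
fn is_trustworthy(r: &OcrResult) -> bool {
    r.confidence >= 0.80 && !r.truncated
}

/// Return the text if the result passes the filter, otherwise None.
fn accepted_text(r: &OcrResult) -> Option<&str> {
    if is_trustworthy(r) {
        Some(r.text.as_str())
    } else {
        None
    }
}
```

Applied to the table above, this rejects the rows scoring 0.1429, 0.0763, 0.0679, and 0.7022, and accepts the prefix-leak row (0.9799), which is why crop quality still has to be checked separately.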
Note: test_ubunchu_annotations is currently #[ignore]d in CI because several annotations fail due to bounding-box overlap, tiny crops, and decorative action text that the model hallucinates on. Run it manually with cargo test test_ubunchu_annotations -- --ignored. See docs/ubunchu-test-analysis.md for details.
Times are from unoptimised debug builds; cargo test --release is significantly
faster.
Architecture
DynamicImage
│
▼ preprocess()
│ grayscale → RGB, squish-resize 224×224 Bilinear
│ normalize mean=0.5 std=0.5
│ shape: [1, 3, 224, 224]
│
▼ encoder_model.onnx (ViT, ~328 MB)
│ last_hidden_state: [1, 196, 768]
│
▼ decoder_model.onnx (BERT, ~113 MB)
│ beam search: 4 beams, batched per step
│ no_repeat_ngram_size=3, length_penalty=2.0
│ stops at EOS or 300 steps
│
▼ vocab.txt (~30 KB, line-indexed)
│
Recognition { text, score, confidence, raw_confidence }
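The last part of the preprocess() step (normalisation and the [1, 3, 224, 224] layout) can be sketched with the standard library alone. The sketch assumes the image has already been converted to RGB and squish-resized, and that pixel values are scaled to [0, 1] before mean/std normalisation; `to_normalized_chw` is an illustrative name, not the crate's function.

```rust
/// Convert a tightly packed RGB8 buffer (height x width x 3, interleaved)
/// into a planar f32 tensor (3 x height x width).
/// Each value is scaled to [0, 1] and then normalised with mean 0.5 and
/// std 0.5, so the output lies in [-1.0, 1.0].
fn to_normalized_chw(rgb: &[u8], width: usize, height: usize) -> Vec<f32> {
    assert_eq!(rgb.len(), width * height * 3, "expected packed RGB8 data");
    let plane = width * height;
    let mut out = vec![0.0f32; 3 * plane];
    for (i, px) in rgb.chunks_exact(3).enumerate() {
        for (c, &v) in px.iter().enumerate() {
            // Channel c goes to its own plane; pixel i keeps its position.
            out[c * plane + i] = (v as f32 / 255.0 - 0.5) / 0.5;
        }
    }
    out
}
```

For a 224×224 input the returned vector has 3 × 224 × 224 = 150,528 elements, which matches the encoder input shape once a batch dimension of 1 is added.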
Beam search parameters match generation_config.json from the original model.
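To make the two decoding parameters concrete, the sketch below shows one common way they are applied, following the convention used by Hugging Face beam search: beam scores are divided by length raised to length_penalty, and any token that would complete an already-seen n-gram is banned. This is not taken from the crate's source; both function names are hypothetical, and the exact definition of "length" varies between implementations.

```rust
/// Length-normalised beam score: the sum of token log-probabilities
/// divided by length^length_penalty.
fn beam_score(sum_log_probs: f32, length: usize, length_penalty: f32) -> f32 {
    sum_log_probs / (length as f32).powf(length_penalty)
}

/// Tokens that must not be generated next because they would repeat an
/// n-gram already present in `tokens` (no_repeat_ngram_size = n).
fn banned_next_tokens(tokens: &[u32], n: usize) -> Vec<u32> {
    if n == 0 || tokens.len() + 1 < n {
        return Vec::new();
    }
    // The last n-1 tokens form the prefix of the n-gram being completed.
    let prefix = &tokens[tokens.len() - (n - 1)..];
    let mut banned = Vec::new();
    for w in tokens.windows(n) {
        if &w[..n - 1] == prefix && !banned.contains(&w[n - 1]) {
            banned.push(w[n - 1]);
        }
    }
    banned
}
```

With n = 3 and the sequence 1, 2, 3, 1, 2, the next token 3 is banned because the trigram 1, 2, 3 has already been produced.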
License
MIT — see LICENSE.
Credits and Citations
Model: mayocream/manga-ocr-onnx
Original: kha-white/manga-ocr (MIT)
Manga109: manga109-dataset
Ubunchu: Ubunchu manga