harumi
A pure-Rust PDF library for CJK font embedding, OCR text overlay, text extraction, HTML→PDF conversion, and page merge/split.
Why harumi?
| Need | API |
|---|---|
| OCR invisible text layer on scanned PDFs | add_invisible_text · add_invisible_text_runs |
| Extract searchable text from existing PDFs | extract_text_runs · extract_text_chunks · extract_as_markdown |
| Watermark / stamp visible text | add_text · add_text_with_rotation |
| Merge or split PDFs | merge_from · extract_pages |
| Draw shapes (rect, line, ellipse, polygon, path) | add_rect · add_line · add_ellipse · add_polygon |
| Embed JPEG/PNG images with transparency | add_image · add_image_with_opacity |
| Convert HTML to PDF | render_html_to_pdf (html feature) |
| Use in WASM / Lambda / Edge (no C/C++ deps) | All APIs work cross-platform |
Quick Start
Overlay invisible OCR text
use harumi::Document;
let mut doc = from_file?;
let font = doc.embed_font?;
// Invisible layer for search/copy
doc.page?
.add_invisible_text?;
doc.save?;
Add visible watermark
// Red "CONFIDENTIAL" stamp centered on page
let = doc.page?.size?;
doc.page?.add_text?;
Extract searchable text
let runs = doc.extract_text_runs?;
for run in runs
Feature Flags
| Flag | What it enables | Dependencies |
|---|---|---|
| (default) | Text overlay, font embedding, text boxes, text extraction | lopdf, ttf-parser, getrandom |
| draw | Shapes: rect, line, ellipse, polygon, polyline | none |
| image | JPEG/PNG embed, extract from scanned PDFs | png crate |
| ocr | Tesseract coordinate conversion helpers | none |
| flow | FlowDocument: auto-pagination, headers/footers | none |
| html | HTML→PDF conversion (h1–h6, p, table, ul/ol) | none |
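The `ocr` flag above covers Tesseract coordinate conversion. As a rough illustration of what such a conversion involves, here is a standard-library-only sketch. It is not harumi's API: `PixelBox`, `PdfRect`, and `pixel_box_to_pdf` are hypothetical names. It assumes the scanned image covers the whole page at a known DPI, and that the page is unrotated with its origin at the bottom-left corner.

```rust
/// Bounding box as an OCR engine typically reports it:
/// pixels, origin at the top-left corner, y growing downward.
/// (Hypothetical type for illustration, not part of harumi.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PixelBox {
    pub left: f64,
    pub top: f64,
    pub width: f64,
    pub height: f64,
}

/// Rectangle in default PDF user space:
/// points (1/72 inch), origin at the bottom-left corner, y growing upward.
/// (Hypothetical type for illustration, not part of harumi.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PdfRect {
    pub x: f64,
    pub y: f64,
    pub width: f64,
    pub height: f64,
}

/// Convert a pixel-space box to PDF points.
/// Assumes the image was rendered at `dpi` and spans the full page,
/// whose height in points is `page_height_pt`.
pub fn pixel_box_to_pdf(b: PixelBox, dpi: f64, page_height_pt: f64) -> PdfRect {
    let scale = 72.0 / dpi; // points per pixel
    let bottom_px = b.top + b.height; // lower edge of the box, measured from the top
    PdfRect {
        x: b.left * scale,
        // Flip the y axis: distance from the page bottom instead of the page top.
        y: page_height_pt - bottom_px * scale,
        width: b.width * scale,
        height: b.height * scale,
    }
}
```

With a 300 DPI scan of a US Letter page (792 pt tall), a box at `left: 300, top: 150` that is 600 × 75 pixels maps to a rectangle of roughly 144 × 18 pt whose lower-left corner sits near (72, 738).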
Supported Fonts
| Format | Status |
|---|---|
| TrueType (.ttf) | ✅ Full support — pure-Rust subsetting |
| TTC collections (.ttc, multiple faces) | ✅ Full support — embed_font_at(bytes, face_index) |
| OpenType CFF (.otf) | ⚠️ Accepted, no subsetting (full font embedded) |
Recommended fonts (CJK):
- NotoSansCJKjp-Regular.ttf (Japanese)
- NotoSansCJKsc-Regular.ttf (Simplified Chinese)
- NotoSansCJKtc-Regular.ttf (Traditional Chinese)
- NotoSansCJKkr-Regular.ttf (Korean)
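When working with TTC collections, it can help to know how many faces a file contains before choosing a `face_index` for `embed_font_at`. The sketch below reads the face count from the TTC header as defined by the OpenType specification (a `ttcf` tag, two 16-bit version fields, then a big-endian 32-bit `numFonts`). The helper `ttc_face_count` is a hypothetical name, not part of harumi. Whether harumi validates the index itself, and whether indices start at 0, is an assumption to check against the crate docs.

```rust
/// Return the number of faces in a TrueType Collection, or `None` if the
/// bytes do not start with a TTC header (for example a plain .ttf or .otf).
/// (Hypothetical helper for illustration, not part of harumi.)
pub fn ttc_face_count(bytes: &[u8]) -> Option<u32> {
    // A TTC file begins with the 4-byte tag "ttcf".
    if !bytes.starts_with(b"ttcf") {
        return None;
    }
    // Bytes 4..8 hold the major/minor version; bytes 8..12 hold numFonts (big-endian).
    let count = bytes.get(8..12)?;
    Some(u32::from_be_bytes([count[0], count[1], count[2], count[3]]))
}
```

A caller could read the font file with `std::fs::read`, call `ttc_face_count`, and fall back to single-font embedding when it returns `None`.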
Installation
[dependencies]
harumi = "1"
To enable the image or HTML features:
harumi = { version = "1", features = ["image", "html"] }
Use via MCP Server
Use harumi's PDF tools directly from Claude Code, Cursor, or Continue IDE:
# Build the MCP server (pure Rust, no runtime dependency)
# Configure in your IDE and use tools:
# - pdf_extract_text: Extract text with positions
# - pdf_add_invisible_text: Add searchable OCR layer
# - pdf_html_to_pdf: HTML to PDF conversion
# - pdf_merge: Merge PDFs
# - pdf_page_info: Get page count & dimensions
Registry listings on smithery.ai and mcp.so: coming soon.
Why choose harumi?
✅ Pure Rust — zero C/C++ dependencies; works in WASM, on Lambda, and in cross-compiled builds
✅ CJK-native — full support for Chinese, Japanese, Korean fonts
✅ Simple API — complex font subsetting happens automatically at save time
✅ Text extraction — decode CID fonts + standard fonts (Type1, TrueType, WinAnsi)
✅ Text replacement — rewrite Tj/TJ operators with automatic re-subsetting
✅ Rich features — draw shapes, embed images, page merge/split, HTML→PDF
✅ Well-tested — 100+ unit, integration, and E2E tests
More Info
- Full documentation — docs.rs/harumi
- Live demo — Browser annotation editor (WASM)
- Source code — github.com/kent-tokyo/harumi
- License — MIT OR Apache-2.0
Roadmap
- v1.x — Current stable
- v2.0 — PDF/A compliance, true digital signature verification (RSA/ECDSA)
- Future — AES-256 write encryption, RTL text (Arabic/Hebrew)
See the full README on GitHub for extensive examples, the API reference, and an explanation of the internals.