harumi 1.3.2

Pure-Rust PDF library: CJK font embedding (Chinese/Japanese/Korean), OCR text overlay, text extraction, HTML→PDF, page merge/split. WASM-ready, zero C dependencies.

License: MIT OR Apache-2.0


Why harumi?

| Need | API |
|---|---|
| OCR invisible text layer on scanned PDFs | add_invisible_text · add_invisible_text_runs |
| Extract searchable text from existing PDFs | extract_text_runs · extract_text_chunks · extract_as_markdown |
| Watermark / stamp visible text | add_text · add_text_with_rotation |
| Merge or split PDFs | merge_from · extract_pages |
| Draw shapes (rect, line, ellipse, polygon, path) | add_rect · add_line · add_ellipse · add_polygon |
| Embed JPEG/PNG images with transparency | add_image · add_image_with_opacity |
| Convert HTML to PDF | render_html_to_pdf (html feature) |
| Use in WASM / Lambda / Edge (no C/C++ deps) | All APIs work cross-platform |

Quick Start

Overlay invisible OCR text

use harumi::Document;

let mut doc = Document::from_file("scanned.pdf")?;
let font = doc.embed_font(include_bytes!("NotoSansCJK-Regular.ttf"))?;

// Invisible layer for search/copy
doc.page(1)?
    .add_invisible_text("日本語テキスト", font, [72.0, 700.0], 12.0)?;

doc.save("searchable.pdf")?;

Add visible watermark

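// Reuses `doc` and `font` from the previous example.
// Red "CONFIDENTIAL" stamp, roughly centered on the page
// (the 60 pt x-offset approximates half the text width).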
let (w, h) = doc.page(1)?.size()?;
doc.page(1)?.add_text(
    "CONFIDENTIAL",
    font,
    [w / 2.0 - 60.0, h / 2.0],
    24.0,
    [0.8, 0.0, 0.0], // RGB red
)?;
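
The fixed 60 pt offset above only centers the stamp if the text happens to be about 120 pt wide. As a minimal sketch of computing the position instead, the helpers below are hypothetical and not part of harumi: `estimate_width` assumes an average advance of 0.6 em per character (a crude figure; summing real glyph advances from the font would be more accurate), and `f64` coordinates are an assumption about the numeric type.

```rust
/// Hypothetical helper, not part of harumi: crude width estimate in points.
/// Assumes an average advance of 0.6 em per character; real code should
/// sum the glyph advances reported by the font.
fn estimate_width(text: &str, font_size: f64) -> f64 {
    text.chars().count() as f64 * font_size * 0.6
}

/// Hypothetical helper, not part of harumi: returns the text origin that
/// horizontally centers a run of `text_width` points, with the baseline
/// placed at mid-page height (PDF origin is bottom-left, units are points).
fn centered_origin(page_w: f64, page_h: f64, text_width: f64) -> [f64; 2] {
    [(page_w - text_width) / 2.0, page_h / 2.0]
}
```

The resulting array could be passed where the example above writes `[w / 2.0 - 60.0, h / 2.0]`.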

Extract searchable text

let runs = doc.extract_text_runs(1)?;
for run in runs {
    println!("Text: {} at ({}, {})", run.text, run.x, run.y);
}
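
Extracted runs arrive as positioned fragments, so a common follow-up step is grouping them into lines. The sketch below is one way to do that under stated assumptions: `Run` is a stand-in that mirrors only the `text`, `x`, and `y` fields used above (the real harumi type may differ), `group_into_lines` is a hypothetical helper, and joining fragments with a single space is a simplification.

```rust
/// Stand-in for an extracted run; mirrors only the `text`, `x`, `y`
/// fields used above. The real harumi type may differ.
#[derive(Debug, Clone)]
struct Run {
    text: String,
    x: f64,
    y: f64,
}

/// Hypothetical helper: groups runs into lines. A run joins the current
/// line when its y is within `tol` points of that line's first run.
/// Lines come out top to bottom (PDF y grows upward), runs left to right.
fn group_into_lines(mut runs: Vec<Run>, tol: f64) -> Vec<String> {
    // Highest y first, so lines are visited top to bottom.
    runs.sort_by(|a, b| b.y.total_cmp(&a.y));

    let mut lines: Vec<Vec<Run>> = Vec::new();
    for run in runs {
        let same_line = lines
            .last()
            .map_or(false, |line| (line[0].y - run.y).abs() <= tol);
        if same_line {
            lines.last_mut().unwrap().push(run);
        } else {
            lines.push(vec![run]);
        }
    }

    lines
        .into_iter()
        .map(|mut line| {
            // Order fragments within a line by their x position.
            line.sort_by(|a, b| a.x.total_cmp(&b.x));
            line.iter()
                .map(|r| r.text.as_str())
                .collect::<Vec<_>>()
                .join(" ")
        })
        .collect()
}
```

A tolerance of a point or two usually absorbs small baseline jitter; text with superscripts or very tight leading may need a different rule.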

Feature Flags

| Flag | What it enables | Dependencies |
|---|---|---|
| (default) | Text overlay, font embedding, text boxes, text extraction | lopdf, ttf-parser, getrandom |
| draw | Shapes: rect, line, ellipse, polygon, polyline | none |
| image | JPEG/PNG embed, extract from scanned PDFs | png crate |
| ocr | Tesseract coordinate conversion helpers | none |
| flow | FlowDocument: auto-pagination, headers/footers | none |
| html | HTML→PDF conversion (h1–h6, p, table, ul/ol) | none |
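
To illustrate the kind of conversion the ocr flag refers to: Tesseract reports word boxes in image pixels with the origin at the top-left, while PDF coordinates are in points (1/72 inch) with the origin at the bottom-left. The sketch below shows the general arithmetic only. `PixelBox`, `PdfBox`, and `pixel_box_to_pdf` are hypothetical names, not harumi's actual `ocr` API, and the scan DPI is assumed to be known.

```rust
/// A word box as reported by Tesseract: pixels, origin at the top-left.
struct PixelBox {
    left: f64,
    top: f64,
    width: f64,
    height: f64,
}

/// The same box in PDF space: points, origin at the bottom-left.
#[derive(Debug)]
struct PdfBox {
    /// Bottom-left corner of the box, in points.
    origin: [f64; 2],
    width: f64,
    /// Box height in points; often usable as an approximate font size.
    height: f64,
}

/// Hypothetical helper, not harumi's API: scales pixels to points
/// (72 / dpi) and flips the y axis against the page height in points.
fn pixel_box_to_pdf(b: &PixelBox, dpi: f64, page_height_pt: f64) -> PdfBox {
    let scale = 72.0 / dpi;
    PdfBox {
        origin: [
            b.left * scale,
            // Bottom edge of the box, measured up from the page bottom.
            page_height_pt - (b.top + b.height) * scale,
        ],
        width: b.width * scale,
        height: b.height * scale,
    }
}
```

For a 300 DPI scan of a US Letter page (792 pt tall), a box at left 300, top 150, size 600×50 pixels maps to an origin near (72, 744) with a height of about 12 pt.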

Supported Fonts

| Format | Status |
|---|---|
| TrueType (.ttf) | ✅ Full support (pure-Rust subsetting) |
| TTC collections (.ttc, multiple faces) | ✅ Full support via embed_font_at(bytes, face_index) |
| OpenType CFF (.otf) | ⚠️ Accepted, no subsetting (full font embedded) |
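
File extensions are not always reliable, so it can help to check which of the formats above a font file actually is before embedding it. The sketch below inspects the first four bytes, which the OpenType specification defines as the sfnt version tag. `FontKind` and `sniff_font_kind` are hypothetical names, not part of harumi.

```rust
#[derive(Debug, PartialEq)]
enum FontKind {
    /// TrueType outlines (`glyf` table).
    TrueType,
    /// CFF outlines, typically shipped as .otf.
    OpenTypeCff,
    /// A collection (.ttc) holding several faces.
    Collection,
    Unknown,
}

/// Hypothetical helper, not part of harumi: classifies a font file by
/// the 4-byte tag at its start. A collection's faces may themselves use
/// either outline format; this check does not look inside them.
fn sniff_font_kind(bytes: &[u8]) -> FontKind {
    match bytes.get(..4) {
        // 0x00010000 is the standard TrueType tag; "true" is a legacy Apple variant.
        Some([0x00, 0x01, 0x00, 0x00]) | Some(b"true") => FontKind::TrueType,
        Some(b"OTTO") => FontKind::OpenTypeCff,
        Some(b"ttcf") => FontKind::Collection,
        _ => FontKind::Unknown,
    }
}
```

Per the table above, a `Collection` result would suggest embed_font_at with a face index, and an `OpenTypeCff` result means the full font is embedded without subsetting.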

Recommended fonts (CJK):

  • NotoSansCJKjp-Regular.ttf (Japanese)
  • NotoSansCJKsc-Regular.ttf (Simplified Chinese)
  • NotoSansCJKtc-Regular.ttf (Traditional Chinese)
  • NotoSansCJKkr-Regular.ttf (Korean)

Installation

[dependencies]
harumi = "1"

To enable the image and HTML features:

harumi = { version = "1", features = ["image", "html"] }

Use via MCP Server

Use harumi's PDF tools directly from Claude Code, Cursor, or Continue:

# Build the MCP server (pure Rust, no runtime dependency)
cargo build -p harumi-mcp

# Configure the server in your IDE; it exposes these tools:
# - pdf_extract_text: Extract text with positions
# - pdf_add_invisible_text: Add searchable OCR layer
# - pdf_html_to_pdf: HTML to PDF conversion
# - pdf_merge: Merge PDFs
# - pdf_page_info: Get page count & dimensions

Registry listings on smithery.ai and mcp.so: coming soon.


Why choose harumi?

  • Pure Rust: zero C/C++ dependencies; works in WASM, on Lambda, and when cross-compiling
  • CJK-native: full support for Chinese, Japanese, and Korean fonts
  • Simple API: font subsetting happens automatically at save time
  • Text extraction: decodes CID fonts and standard fonts (Type1, TrueType, WinAnsi)
  • Text replacement: rewrites Tj/TJ operators with automatic re-subsetting
  • Rich features: draw shapes, embed images, page merge/split, HTML→PDF
  • Well-tested: 100+ unit, integration, and E2E tests


More Info


Roadmap

  • v1.x — Current stable
  • v2.0 — PDF/A compliance, true digital signature verification (RSA/ECDSA)
  • Future — AES-256 write encryption, RTL text (Arabic/Hebrew)

See the full README on GitHub for extensive examples, the API reference, and an explanation of the internals.