harumi

Overlay text, extract content, merge/split pages, draw shapes — all in pure Rust.
Full CJK (Japanese / Chinese / Korean) font support. Zero C dependencies. WASM-ready.

日本語 | 中文 | 한국어

Try the live browser demo: an annotation editor (text · rect · line · freehand pen) that runs entirely in your browser via WASM.

What harumi solves

Before (without harumi):
Hand-assemble CID font objects from the PDF spec. Implement CMap generation, GID mapping, and subsetting in hundreds of lines. Still fight character rendering bugs.

After (with harumi):

let mut doc = Document::from_file("scanned.pdf")?;
let font = doc.embed_font(include_bytes!("NotoSansCJK-Regular.ttf"))?;
doc.page(1)?.add_invisible_text("検索対象テキスト", font, [72.0, 700.0], 12.0)?;
doc.save("searchable.pdf")?;

harumi handles font subsetting, CID encoding, and ToUnicode CMap generation automatically.

What you get

Challenge	harumi's answer
CJK font subsetting is complex	One `embed_font()` call — only used glyphs are included, GIDs correctly remapped
Don't want to corrupt existing PDF structure	Append-only: harumi never touches the original object graph
Need to run in WASM / Lambda / cross-compile	Pure Rust — zero C/C++ dependencies
Need OCR text at specific coordinates	`add_invisible_text` / batch `add_invisible_text_runs`
Need to stamp a watermark on PDFs	`add_text(color)` overlays visible text in any RGB color
Need to position text relative to page size	`page.size()` reads the MediaBox
Need in-memory output for Tauri / WASM	`save_to_bytes()` returns a `Vec<u8>` directly
Need to draw highlight rectangles or lines	`add_rect` / `add_line` (`draw` feature, no extra deps)
Need to draw a box border or polygon (callout)	`add_rect_stroke` / `add_polygon` (`draw` feature)
Need multi-line wrapped text in a box	`add_text_box` (no feature gate needed)
Need to embed JPEG / PNG images	`add_image` / `add_image_with_opacity` (`image` feature)
Need PNG transparency (signatures, watermarks)	Transparent PNGs use PDF SMask automatically — no white background
Need to rotate, remove, or reorder pages	`rotate_page` / `remove_page` / `insert_blank_page` / `reorder_pages` (no feature gate)
Need to merge two PDFs into one	`merge_from` appends all pages from another document; content and fonts preserved
Need to create a PDF from scratch (no existing file)	`Document::new(size)` creates a blank 1-page PDF; add pages with `insert_blank_page`
Need to split a PDF into separate files	`extract_pages` returns a new `Document` with the specified pages in any order
Need to extract text positions from an existing PDF	`extract_text_runs` decodes CID fonts and standard simple fonts (Type1, TrueType, WinAnsi, etc.)
Need to read or write PDF metadata (title, author…)	`doc.metadata()` reads `/Info`; `doc.set_metadata(&meta)` writes it
Need to replace text in an existing PDF (new font)	`page.replace_text(old, new, font)` rewrites the content stream in-place; returns the match count as `usize`; automatic font-switching and width compensation
Need to replace text using the original font	`page.replace_text_preserve_font(old, new)` — no `FontHandle` needed; returns match count; validates glyphs eagerly (not at `save()`)
Need to check replaceability without modifying	`page.can_replace_text(old, new)` — pure read-only scan; returns match count or `Err(FontCharNotMapped)`
Need to draw an ellipse or circle	`add_ellipse(rect, color, opacity, filled, stroke_width)` (`draw` feature)
Need fill + stroke on same shape	pass `filled=true` and `stroke_width>0` to `add_ellipse` / `add_polygon` / `add_path` — uses PDF `B` operator
Need open or closed path (polyline + polygon unified)	`add_path(points, closed, color, filled, stroke_width, opacity)` (`draw` feature)
Need rotated text (watermarks, stamps at an angle)	`add_text_with_rotation(text, font, pos, size, color, opacity, degrees)`
Need to replace text spanning multiple Tj operators	`replace_text` / `replace_text_preserve_font` — cross-operator matching supported
Need to extract an embedded image from a scanned PDF	`extract_page_image` returns JPEG or PNG bytes (`image` feature); scanned PDFs only
Need clickable URL links in a PDF	`add_link_url([x, y, w, h], url)` — invisible URI annotation; click opens the URL in any viewer
Need internal navigation links (TOC)	`add_link_internal([x, y, w, h], target_page)` — jumps to a page within the same document
Need a bookmarks / navigation outline	`add_bookmark(title, page, y)` — flat PDF outline entries; CJK titles stored as UTF-16BE automatically
Need page numbers / running headers–footers on every page	`FlowOptions { header: Some(hf), footer: Some(hf), .. }` with `HeaderFooter` (`flow` feature); `{{page}}` / `{{total}}` substituted at render
Need headings to auto-generate outline entries	`FlowOptions { auto_bookmarks: true, .. }` (default) — every `push_heading` creates a bookmark
Need to load a password-protected PDF	`Document::from_file_with_password(path, pw)` / `from_bytes_with_password(bytes, pw)` — decrypts on load; both user and owner passwords accepted
Need to save a PDF with password protection	`doc.set_encryption(user_pw, owner_pw)` — encrypts at `save()` time with 128-bit RC4
Need to check if a PDF was originally encrypted	`doc.is_encrypted()` — `true` even after successful decryption
Need to highlight / underline / strike through text	`add_highlight` / `add_underline` / `add_strikeout` / `add_squiggly` with color — standard PDF markup annotations with QuadPoints
Need to add a sticky-note comment to a page	`add_sticky_note([x, y], "note text")` — Text annotation, Unicode contents
Need to read PDF form field values	`doc.form_fields()` — returns `Vec<FormField>` with name, type, and current value
Need to fill in a PDF form programmatically	`doc.fill_form(&[("FieldName", "value")])` — sets values and triggers NeedAppearances
Need to set/read page crop or print boxes	`page.crop_box()` / `set_crop_box(rect)` / `trim_box()` / `bleed_box()` — all box types in `[x,y,w,h]` format
Need to use CMYK colors (print workflow)	`Color::Cmyk([c, m, y, k])` — unified `Color` enum; `Color::Rgb()` still works via `From<[f32; 3]>` (v1.0+, breaking change)
Need to verify digital signatures on a PDF	`doc.verify_signatures(&pdf_bytes)` — extracts signature metadata (signer, timestamp, field name); cryptographic validation TBD (`digital-signature` feature)

Comparison with similar tools

Feature	harumi	pdf-lib (JS)	printpdf (Rust)	lopdf (Rust)	pdfium-render (Rust)
Pure Rust — no C/C++ deps	Yes	N/A	Yes	Yes	No (C++ PDFium)
WASM / cross-platform	Yes	Yes	Yes	Yes	Partial (complex setup)
CJK text on existing PDF	Yes	Yes	No (new PDFs only)	No (manual)	Yes
Text extraction	Yes (CID + simple)	Partial (basic)	No	Partial (basic)	Yes (full)
Text replacement (with re-subsetting)	Yes	No	No	No	No
Page manipulation	Yes	Yes	Partial (limited)	Yes (low-level)	Yes
Draw shapes	Yes	Yes	Yes	No (manual)	Yes
Flow document / auto-pagination	Yes	No	No	No	No
HTML → PDF	Yes	No	No	No	No
Inline bold / italic / color	Yes (synthetic)	No	No	No	Yes
Encryption (read)	Yes (RC4)	Yes	No	Partial	Yes
Encryption (write)	Yes (RC4-128)	Yes	No	No	Yes
Markup annotations	Yes	Partial (basic)	No	No	Yes
CMYK color support	Yes (v1.0+)	Yes	Yes	No	Yes
Digital signature verification	Partial (metadata)	Partial (basic)	No	No	Yes

Yes = supported · Partial = partial / limited · No = not supported · N/A = language-level feature

Comparison with modern Rust PDF alternatives

Feature	harumi	unpdf	pdf_oxide	justpdf-core
Direction	Read + Write	Read only	Full lifecycle	Full lifecycle
Primary use case	CJK text overlay on existing PDFs	PDF → Markdown/text extraction	Multi-language PDF ops	Comprehensive PDF engine
Pure Rust (zero C/C++ deps)	Yes	Yes	Likely	Yes
WASM support	Yes (verified)	Yes	Yes	Not documented
Text extraction
— CID fonts (ToUnicode CMap)	Yes	Yes ⭐	Yes	Yes
— Simple fonts (Type1/TrueType)	Yes	Yes	Yes	Yes
— Form XObject recursion	No (v1.3)	Yes ⭐	Yes	Unknown
— Graphic state preservation	No (v1.3)	Yes ⭐	Yes	Unknown
— `uni<XXXX>` glyph names	No (v1.3)	Yes ⭐	Unknown	Unknown
— Reading order / XY-Cut	No	Yes ⭐	Yes	Unknown
— RTL / BiDi support	No	Yes ⭐	Unknown	Unknown
Text writing
— CJK font embedding	Yes ⭐	N/A	Partial	Yes
— Font subsetting	Yes ⭐ (deferred)	N/A	Unknown	Yes
— Identity-H / Identity-V	Yes ⭐	N/A	Unknown	Yes
— Type0 CID generation	Yes ⭐	N/A	Unknown	Yes
Page operations	Yes	No	Yes	Yes
Drawing (shapes, images)	Yes	No	Yes (partial)	Yes
Encryption (read)	Yes (RC4)	Yes (RC4)	Yes	Yes (RC4, AES)
Encryption (write)	Yes (RC4-128)	No	Yes	Yes (RC4, AES-256)
Digital signatures	Partial (metadata)	No	Yes	Yes (PKCS#7/CMS)
PDF/A compliance	Planned (v1.3)	No	Yes (validate)	Yes (validate)
Performance focus	Correctness	Speed (specialized)	Speed (5× PyMuPDF)	Comprehensive
Multi-language bindings	WASM only	None	7 languages	C FFI only

Key differences:

harumi — Specialized for writing CJK text onto existing PDFs; explicit deferred subsetting strategy; confirmed WASM support
unpdf — Specialized for reading PDFs and extracting clean Markdown/text; superior CJK extraction quality (XY-Cut, RTL, Form XObject)
pdf_oxide — General-purpose PDF engine with multi-language bindings; 5× faster extraction via zero-copy tokenization; Rust core with Python/JS/Go/C#/Java bindings
justpdf-core — Full PDF engine; uses region-specific CID orderings (Japan1/GB1/CNS1/Korea1) for legacy PDF compatibility

Recommendation: Use harumi if you're overlaying CJK text onto existing PDFs (OCR layers, stamps, watermarks). Use unpdf if you need to extract text from CJK PDFs and fix garbled characters. Use pdf_oxide if you need multi-language bindings and fast extraction. Use justpdf-core if you need a comprehensive PDF engine without a specialized CJK focus.

⭐ = unique strength in this category

Why this gap existed

JS has pdf-lib — it handles font subsetting, CMap generation, and text layer composition transparently. In Rust, the existing options force you to choose between:

lopdf — low-level binary surgery; you hand-assemble CID font objects from the PDF spec
printpdf — create-only; cannot modify existing PDFs
pdfium-render — C++ bindings that break WASM, cross-compilation, and Lambda deploys

harumi fills the gap.

Quick Start

[dependencies]
harumi = "1.1"

Invisible OCR text layer

use harumi::Document;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = Document::from_file("scanned.pdf")?;

    // Embed a font — subsetting and CMap generation happen automatically at save()
    let font = doc.embed_font(include_bytes!("NotoSansCJK-Regular.ttf"))?;

    // Overlay invisible OCR text on page 1
    doc.page(1)?.add_invisible_text(
        "ここにOCRで読み取った日本語テキスト",
        font,
        [100.0, 250.0], // x, y in PDF points (origin: bottom-left)
        12.0,
    )?;

    // Save — the original PDF structure is preserved
    doc.save("searchable_japanese.pdf")?;
    Ok(())
}

Visible text overlay

// Overlay a red stamp centered on the page
let (w, h) = doc.page(1)?.size()?;
doc.page(1)?.add_text(
    "CONFIDENTIAL",
    font,
    [w / 2.0 - 60.0, h / 2.0],
    24.0,
    [0.8, 0.0, 0.0], // red (RGB 0.0–1.0)
)?;

In-memory output

// For Tauri commands, WASM, or any in-memory pipeline
let pdf_bytes: Vec<u8> = doc.save_to_bytes()?;

Multi-line text box (no feature gate)

// Wraps at word boundaries (Latin) or any character (CJK); clips at box bottom
doc.page(1)?.add_text_box(
    "This is a long sentence that wraps inside a 200pt-wide bounding box.",
    font,
    [72.0, 400.0, 200.0, 120.0], // [x, y, width, height]
    12.0,
    [0.0, 0.0, 0.0],              // black
    0.0,                          // 0.0 = use font_size * 1.2 line height
)?;
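
The wrapping rule described in the comment above (break at word boundaries for Latin, at any character for CJK, clip at the box bottom, default line height of font_size * 1.2) can be sketched in std-only Rust. This is an illustration, not harumi's implementation: `is_cjk`, `advance`, `units`, and `wrap_lines` are hypothetical names, and the advance widths (1 em for CJK, 0.5 em otherwise) are an assumption standing in for real glyph metrics.

```rust
/// Rough CJK test: Hiragana/Katakana, CJK Unified Ideographs, Hangul syllables.
fn is_cjk(c: char) -> bool {
    matches!(c as u32, 0x3040..=0x30FF | 0x4E00..=0x9FFF | 0xAC00..=0xD7A3)
}

/// ASSUMED advance width in points: 1 em for CJK, 0.5 em for everything else.
fn advance(c: char, font_size: f32) -> f32 {
    if is_cjk(c) { font_size } else { font_size * 0.5 }
}

/// Split text into breakable units: each CJK character alone,
/// Latin words with their trailing space attached.
fn units(text: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut word = String::new();
    for c in text.chars() {
        if is_cjk(c) {
            if !word.is_empty() {
                out.push(std::mem::take(&mut word));
            }
            out.push(c.to_string());
        } else {
            word.push(c);
            if c == ' ' {
                out.push(std::mem::take(&mut word));
            }
        }
    }
    if !word.is_empty() {
        out.push(word);
    }
    out
}

/// Greedy line filling; `line_height == 0.0` selects font_size * 1.2.
/// Lines that do not fit in `box_h` are dropped (clipped).
fn wrap_lines(text: &str, box_w: f32, box_h: f32, font_size: f32, line_height: f32) -> Vec<String> {
    let lh = if line_height == 0.0 { font_size * 1.2 } else { line_height };
    let max_lines = (box_h / lh).floor() as usize;
    let mut lines = Vec::new();
    let mut cur = String::new();
    let mut cur_w: f32 = 0.0;
    for u in units(text) {
        // A trailing space does not count against the box width.
        let w: f32 = u.trim_end().chars().map(|c| advance(c, font_size)).sum();
        if !cur.is_empty() && cur_w + w > box_w {
            lines.push(cur.trim_end().to_string());
            cur = String::new();
            cur_w = 0.0;
        }
        cur_w += u.chars().map(|c| advance(c, font_size)).sum::<f32>();
        cur.push_str(&u);
    }
    if !cur.is_empty() {
        lines.push(cur.trim_end().to_string());
    }
    lines.truncate(max_lines);
    lines
}
```

With a 30pt-wide box and a 10pt font, `wrap_lines("aa bb cc", 30.0, 100.0, 10.0, 0.0)` yields two lines, "aa bb" and "cc"; a CJK string breaks after every third character under the same assumptions.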

Page manipulation

// Rotate all pages 90° clockwise
for page_num in 1..=doc.page_count() {
    doc.rotate_page(page_num, 90)?;
}

// Remove a blank cover page
doc.remove_page(1)?;

// Insert a blank A4 title page before page 1
doc.insert_blank_page(0, (595.0, 842.0))?;

// Reverse page order in a 3-page document
doc.reorder_pages(&[3, 2, 1])?;

doc.save("output.pdf")?;

Merge PDFs

let mut base = Document::from_file("a.pdf")?;
let appendix = Document::from_file("b.pdf")?;
base.merge_from(appendix)?;
base.save("merged.pdf")?;

Preserved: all page content, embedded fonts, images, resources.
Not preserved: Outlines/Bookmarks, AcroForm, /Info metadata (author, creation date).

Precondition: other must have no unflushed pending operations (freshly loaded, or reloaded after save_to_bytes()).

Create a blank PDF

let mut doc = Document::new((595.0, 842.0))?;   // blank A4
let font = doc.embed_font(include_bytes!("NotoSansCJK-Regular.ttf"))?;
doc.page(1)?.add_text("Hello, world!", font, [72.0, 700.0], 24.0, [0.0, 0.0, 0.0])?;
doc.save("output.pdf")?;

Extract pages

let doc = Document::from_file("large.pdf")?;
let mut excerpt = doc.extract_pages(&[3, 5, 7])?;  // pages 3, 5, 7 in that order
excerpt.save("excerpt.pdf")?;

Extract text runs from an existing PDF

let doc = Document::from_file("existing.pdf")?;
let runs = doc.extract_text_runs(1)?;
for frag in &runs {
    println!(
        "{:?} at ({:.1}, {:.1}) font={} color={:?} invisible={}",
        frag.text, frag.x, frag.y, frag.font_name, frag.color, frag.invisible,
    );
}

Each TextFragment carries: text, x/y (PDF-point coordinates), width, font_size, font_name (PDF resource name e.g. "HR0"), color (RGB fill [f32; 3]), and invisible (true for OCR Tr 3 text).

Works on arbitrary PDFs — Identity-H CID fonts (harumi output) and standard simple fonts (Type1, TrueType) with WinAnsiEncoding, MacRomanEncoding, StandardEncoding, or /Differences encoding dicts.

Replace text in an existing PDF

let mut doc = Document::from_file("contract.pdf")?;
let font = doc.embed_font(include_bytes!("NotoSansJP-Regular.ttf"))?;
// Returns the number of matches found (0 means old_text was not present)
let n = doc.page(1)?.replace_text("Hello", "こんにちは", font)?;
doc.save("translated.pdf")?;

Matches may span consecutive Tj/TJ operators within the same font context (cross-operator matching). Text that is split across a positional operator (Td, Tm) is not matched.
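
To make the matching rule concrete, here is a minimal std-only sketch of the idea: consecutive text-showing operators are joined into one run, and a positional operator ends the run. The `Op` enum and `count_matches` are hypothetical names for illustration; the sketch ignores font changes and real content-stream parsing, so it should not be read as harumi's actual algorithm.

```rust
/// Simplified content-stream operators (ASSUMED model, not harumi's types):
/// `Show` stands for Tj/TJ, `Position` for Td/Tm.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Show(String),
    Position,
}

/// Count non-overlapping occurrences of `needle`. A match may span
/// consecutive `Show` operators but never crosses a `Position` operator.
fn count_matches(ops: &[Op], needle: &str) -> usize {
    if needle.is_empty() {
        return 0;
    }
    let mut total = 0;
    let mut run = String::new();
    for op in ops {
        match op {
            Op::Show(s) => run.push_str(s),
            Op::Position => {
                // The run ends here: count what it contains, then start over.
                total += run.matches(needle).count();
                run.clear();
            }
        }
    }
    total + run.matches(needle).count()
}
```

In this model, "Hel" followed directly by "lo" matches "Hello", while "He", a positional operator, then "llo" does not.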

Replace text using the original embedded font

When you don't have the font file but know the replacement text uses only glyphs already in the PDF. Glyph validation is eager: Err(FontCharNotMapped) is returned immediately at call time if a glyph is missing, so you can fall back in one pass:

let mut doc = Document::from_file("contract.pdf")?;
let replacement = "Final"; // example replacement text
match doc.page(1)?.replace_text_preserve_font("Draft", replacement) {
    Ok(n) if n > 0 => { /* n replacements queued — no extra font needed */ }
    Ok(_) => { /* old_text not found */ }
    Err(_) => {
        // glyph missing from subset — fall back to explicit font
        let font = doc.embed_font(include_bytes!("font.ttf"))?;
        doc.page(1)?.replace_text("Draft", replacement, font)?;
    }
}
doc.save("output.pdf")?;

Pre-flight check without modifying the document

Use can_replace_text to inspect replaceability before queuing any operations:

let mut doc = Document::from_file("contract.pdf")?;
match doc.page(1)?.can_replace_text("Draft", "Final") {
    Ok(0) => println!("'Draft' not found on page 1"),
    Ok(n) => println!("{n} occurrence(s) found; glyphs OK"),
    Err(e) => println!("glyph missing: {e}"),
}

Replace text with font subset expansion

When the new text contains characters not present in the original font subset, use replace_text_resubset. Pass the original (unsubsetted) TTF/OTF bytes — harumi expands the subset, re-encodes all content streams, and performs the replacement in one save() call.

let font_bytes = include_bytes!("NotoSansJP-Regular.ttf");
let mut doc = Document::from_file("contract.pdf")?;

// replace_text_preserve_font would fail with FontCharNotMapped here
let n = doc.page(1)?.replace_text_resubset("Hello", "日本語", font_bytes)?;
doc.save("output.pdf")?;

Works for any language — Chinese, Korean, Arabic — as long as the supplied font contains the characters.

Note: Requires the original unsubsetted font file, not the subset embedded in the PDF. Only CIDFontType2 fonts with CIDToGIDMap /Identity are supported (what harumi embeds).

Read/write PDF metadata

use harumi::{Document, PdfMetadata};

let mut doc = Document::from_file("report.pdf")?;

// Read existing metadata
let meta = doc.metadata()?;
println!("Title: {:?}", meta.title);

// Write new metadata (None fields are omitted from /Info)
doc.set_metadata(&PdfMetadata {
    title: Some("Annual Report 2026".into()),
    author: Some("Harumi Team".into()),
    subject: None,
    keywords: None,
    creator: None,
})?;
doc.save("report_with_meta.pdf")?;

Draw shapes (`draw` feature)

harumi = { version = "1.1", features = ["draw"] }

// Yellow filled highlight rectangle (x, y, width, height in PDF points)
doc.page(1)?.add_rect([72.0, 690.0, 200.0, 14.0], [1.0, 1.0, 0.0], 0.4)?;

// Blue border rectangle (stroke only, no fill)
doc.page(1)?.add_rect_stroke([72.0, 400.0, 200.0, 100.0], [0.0, 0.0, 1.0], 1.5, 1.0)?;

// Filled triangle (callout arrow tip) — last arg is stroke_width (0.0 = no stroke)
doc.page(1)?.add_polygon(
    &[[100.0, 500.0], [150.0, 600.0], [200.0, 500.0]],
    [1.0, 0.5, 0.0], 1.0, true, 0.0,
)?;

// Filled + stroked triangle simultaneously (fill-then-stroke, PDF `B` operator)
doc.page(1)?.add_polygon(
    &[[100.0, 500.0], [150.0, 600.0], [200.0, 500.0]],
    [0.0, 0.6, 1.0], 1.0, true, 2.0,
)?;

// Black underline stroke
doc.page(1)?.add_line([72.0, 600.0], [300.0, 600.0], [0.0, 0.0, 0.0], 1.5, 1.0)?;

// Semi-transparent blue filled ellipse
doc.page(1)?.add_ellipse([200.0, 300.0, 150.0, 100.0], [0.0, 0.4, 1.0], 0.7, true, 0.0)?;

// Circle outline only (no fill, 2pt border)
doc.page(1)?.add_ellipse([100.0, 100.0, 80.0, 80.0], [1.0, 0.0, 0.0], 1.0, false, 2.0)?;

// Open polyline path (triangle without closing edge)
doc.page(1)?.add_path(
    &[[100.0, 500.0], [150.0, 600.0], [200.0, 500.0]],
    false,               // open path (no closepath)
    [0.2, 0.8, 0.2],    // green
    false, 1.5, 1.0,    // stroke only, 1.5pt line width, full opacity
)?;

// Rotated watermark text (45° counter-clockwise)
let font = doc.embed_font(include_bytes!("NotoSansCJK.ttf"))?;
let (w, h) = doc.page(1)?.size()?;
doc.page(1)?.add_text_with_rotation(
    "CONFIDENTIAL",
    font,
    [w / 2.0, h / 2.0],
    48.0,
    [0.8, 0.0, 0.0],   // red
    0.3,               // 30 % opacity
    45.0,              // degrees (counter-clockwise)
)?;
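
For readers curious how a counter-clockwise rotation maps onto PDF, the sketch below builds the standard rotation matrix and formats it as a `Tm` (set text matrix) operator. It is a std-only illustration under the assumption that rotated text is positioned through a text matrix; `rotation_tm` and `tm_operator` are hypothetical helpers, and harumi's emitted content stream may differ.

```rust
/// Text matrix [a b c d e f] for a counter-clockwise rotation by `degrees`,
/// with the text origin placed at (x, y).
fn rotation_tm(degrees: f32, x: f32, y: f32) -> [f32; 6] {
    let (s, c) = degrees.to_radians().sin_cos();
    [c, s, -s, c, x, y]
}

/// Format the matrix as a PDF `Tm` operator line.
fn tm_operator(m: [f32; 6]) -> String {
    format!(
        "{:.4} {:.4} {:.4} {:.4} {:.2} {:.2} Tm",
        m[0], m[1], m[2], m[3], m[4], m[5]
    )
}
```

For 45° at (100, 200), `tm_operator(rotation_tm(45.0, 100.0, 200.0))` produces `0.7071 0.7071 -0.7071 0.7071 100.00 200.00 Tm`.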

Embed images (`image` feature)

harumi = { version = "1.1", features = ["image"] }

let jpeg = std::fs::read("stamp.jpg")?;
// Place at [x, y, width, height]; supports JPEG (no decode) and PNG
doc.page(1)?.add_image(&jpeg, [72.0, 500.0, 100.0, 100.0])?;

// With opacity (0.0 = transparent, 1.0 = opaque)
doc.page(1)?.add_image_with_opacity(&jpeg, [72.0, 400.0, 100.0, 100.0], 0.75)?;

// PNG with alpha channel — transparent regions use PDF SMask, no white background
let sig_png = std::fs::read("signature.png")?;
doc.page(1)?.add_image(&sig_png, [72.0, 300.0, 200.0, 80.0])?;

Extract an embedded image from a scanned PDF (`image` feature)

Designed for OCR workflows: load a scanned PDF, extract the raster image, run OCR, then write the invisible text layer back.

use harumi::{Document, PageImageFormat};

let doc = Document::from_file("scanned.pdf")?;
let img = doc.extract_page_image(1)?;

match img.format {
    PageImageFormat::Jpeg => std::fs::write("page1.jpg", &img.bytes)?,
    PageImageFormat::Png  => std::fs::write("page1.png", &img.bytes)?,
}
println!("{}×{} pixels", img.width, img.height);

Scanned PDFs only. This extracts an existing Image XObject — it does not rasterize the page. Text and vector PDFs have no Image XObject and will return Error::InvalidInput.

Build a structured document with auto-pagination (`flow` feature)

harumi = { version = "1.1", features = ["flow"] }

use harumi::{FlowDocument, FlowOptions};

let font = include_bytes!("NotoSansCJK-Regular.ttf");
let mut doc = FlowDocument::new(font.as_ref(), FlowOptions::default())?;

doc.push_heading("Annual Report", 1)?;
doc.push_paragraph("This document summarizes our performance.")?;
doc.push_key_value_table(&[
    ("Revenue", "$1,000,000"),
    ("Expenses", "$800,000"),
    ("Profit", "$200,000"),
])?;
doc.push_list(&["Expanded to 3 new markets", "Launched 2 new products"], false)?;

// Page breaks are inserted automatically when content overflows.
// Call push_page_break() to force a manual break.

let pdf_bytes = doc.render()?;

Supports Japanese / Chinese / Korean out of the box — pass a CJK TTF font and text wraps at any character boundary.

Inline text styling in FlowDocument (`flow` feature)

Bold, italic, and color can be mixed inline within a paragraph:

use harumi::{FlowDocument, FlowOptions, InlineSpan};

let mut doc = FlowDocument::new(font_bytes, FlowOptions::default())?;
doc.push_paragraph_styled(&[
    InlineSpan::plain("Normal text, "),
    InlineSpan::bold("bold text, "),
    InlineSpan::italic("italic text, "),
    InlineSpan::colored("and red.", [0.8, 0.0, 0.0]),
])?;
let pdf = doc.render()?;

Bold and italic are synthetic (fill+stroke and 12° shear respectively) — no separate bold/italic font file is required.
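
As a rough illustration of what "fill+stroke" and "12° shear" mean at the content-stream level, the sketch below emits text render mode 2 (fill, then stroke) for bold and a sheared text matrix for italic. `synthetic_style_ops` is a hypothetical helper, and the stroke width of 3 % of the font size is an assumption chosen for the example, not a value taken from harumi.

```rust
/// Build content-stream operators for synthetic bold and italic.
/// Bold: `2 Tr` (fill then stroke) plus a stroke width (ASSUMED 3% of font size).
/// Italic: text matrix with a horizontal shear of tan(12°).
fn synthetic_style_ops(bold: bool, italic: bool, x: f32, y: f32, font_size: f32) -> String {
    let mut ops = String::new();
    if bold {
        ops.push_str(&format!("2 Tr {:.3} w\n", font_size * 0.03));
    }
    let shear = if italic { 12.0_f32.to_radians().tan() } else { 0.0 };
    ops.push_str(&format!("1 0 {:.4} 1 {:.2} {:.2} Tm", shear, x, y));
    ops
}
```

Because both effects are plain graphics-state changes, the same embedded font program serves regular, bold, and italic spans.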

Header / footer with page numbers (`flow` feature)

use harumi::{FlowDocument, FlowOptions, HeaderFooter};

let opts = FlowOptions {
    // Left "harumi docs", right "v0.5" on every page
    header: Some(HeaderFooter {
        left:  Some("harumi docs".into()),
        right: Some("v0.5".into()),
        ..Default::default()
    }),
    // Centred "1 / 3" page counter
    footer: Some(HeaderFooter::page_number()),
    // push_heading() automatically creates a bookmark entry (default: true)
    auto_bookmarks: true,
    ..Default::default()
};

let mut doc = FlowDocument::new(font, opts)?;
doc.push_heading("Chapter 1", 1)?;
doc.push_paragraph("Body text here.")?;
let pdf_bytes = doc.render()?;
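
The `{{page}}` / `{{total}}` placeholders are substituted at render time, presumably because the total page count is only known once pagination has finished. A minimal std-only sketch of that substitution follows; `render_template` and `render_all` are hypothetical names, not part of harumi's API.

```rust
/// Substitute the `{{page}}` and `{{total}}` placeholders for one page.
fn render_template(template: &str, page: usize, total: usize) -> String {
    template
        .replace("{{page}}", &page.to_string())
        .replace("{{total}}", &total.to_string())
}

/// Render the same template for every page once the total is known.
fn render_all(template: &str, total: usize) -> Vec<String> {
    (1..=total)
        .map(|page| render_template(template, page, total))
        .collect()
}
```

For a 3-page document, `render_all("{{page}} / {{total}}", 3)` gives "1 / 3", "2 / 3", and "3 / 3".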

Link annotations

// Clickable URL region (x, y, width, height)
doc.page(1)?.add_link_url([72.0, 40.0, 200.0, 18.0], "https://example.com")?;

// Internal link: clicking the area jumps to page 3 of the same document
doc.page(1)?.add_link_internal([72.0, 700.0, 150.0, 18.0], 3)?;

Markup annotations (highlight, underline, strikeout, squiggly)

// Yellow highlight
doc.page(1)?.add_highlight([72.0, 690.0, 200.0, 14.0], [1.0, 1.0, 0.0])?;

// Red underline
doc.page(1)?.add_underline([72.0, 640.0, 200.0, 12.0], [1.0, 0.0, 0.0])?;

// Strikethrough
doc.page(1)?.add_strikeout([72.0, 590.0, 200.0, 12.0], [0.0, 0.0, 0.0])?;

// Squiggly (wavy) underline
doc.page(1)?.add_squiggly([72.0, 540.0, 200.0, 12.0], [0.0, 0.6, 0.2])?;

// Sticky-note comment
doc.page(1)?.add_sticky_note([500.0, 700.0], "Review this section")?;
doc.save("annotated.pdf")?;

Password-protected PDFs

// Load an encrypted PDF
let mut doc = Document::from_file_with_password("protected.pdf", "secret")?;
assert!(doc.is_encrypted());

// Wrong password returns Error::WrongPassword
match Document::from_bytes_with_password(&bytes, "wrong") {
    Err(harumi::Error::WrongPassword) => println!("Bad password"),
    _ => {}
}

// Save with password protection
let mut doc = Document::new((595.0, 842.0))?;
doc.set_encryption("userpass", "ownerpass")?;
doc.save("protected_output.pdf")?;

AcroForm: read and fill form fields

// Read all form fields
let mut doc = Document::from_file("form.pdf")?;
for field in doc.form_fields()? {
    println!("{}: {:?} = {:?}", field.name, field.field_type, field.value);
}

// Fill fields by name
let updated = doc.fill_form(&[
    ("FullName",    "Jane Doe"),
    ("Agree",       "yes"),       // checkbox → /Yes
    ("Department",  "Engineering"),
])?;
println!("{updated} fields updated");
doc.save("filled_form.pdf")?;

Page boxes (print workflow)

// Read/write CropBox (visible area clip)
let cb = doc.page(1)?.crop_box()?;   // Option<[f32;4]>

doc.page(1)?.set_crop_box([10.0, 10.0, 575.0, 822.0])?;   // [x,y,w,h]
doc.page(1)?.set_trim_box([0.0, 0.0, 595.0, 842.0])?;
doc.page(1)?.set_bleed_box([0.0, 0.0, 601.0, 848.0])?;
doc.save("print_ready.pdf")?;
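
The box methods above take `[x, y, w, h]`, whereas a PDF rectangle is stored as two diagonally opposite corners, normally lower-left and upper-right. The sketch below shows the conversion in both directions; `xywh_to_pdf_rect` and `pdf_rect_to_xywh` are hypothetical helpers for illustration, not harumi functions.

```rust
/// Convert [x, y, width, height] to a PDF rectangle [llx, lly, urx, ury].
fn xywh_to_pdf_rect([x, y, w, h]: [f32; 4]) -> [f32; 4] {
    [x, y, x + w, y + h]
}

/// Convert a PDF rectangle back to [x, y, width, height].
/// The corners may be given in either order, so normalise with min/abs.
fn pdf_rect_to_xywh([x1, y1, x2, y2]: [f32; 4]) -> [f32; 4] {
    [x1.min(x2), y1.min(y2), (x2 - x1).abs(), (y2 - y1).abs()]
}
```

The crop box from the example, `[10.0, 10.0, 575.0, 822.0]`, corresponds to the PDF rectangle `[10 10 585 832]`.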

Document bookmarks (outline)

// Builds the bookmarks panel in PDF viewers.
// Non-ASCII titles (CJK, accented Latin…) are encoded as UTF-16BE automatically.
doc.add_bookmark("Chapter 1",   1, 800.0)?;   // title, page (1-indexed), y coord
doc.add_bookmark("第2章 概要",  2, 800.0)?;
doc.save("report.pdf")?;
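
The UTF-16BE encoding mentioned in the comment follows the PDF convention for text strings: a byte-order mark (0xFE 0xFF) followed by big-endian UTF-16 code units. A std-only sketch of that encoding is shown below; `pdf_text_string` is a hypothetical name, and keeping ASCII titles as plain bytes is an assumption based on the comment above. Escaping of parentheses and backslashes for literal strings is left out.

```rust
/// Encode a bookmark title as PDF text-string bytes.
/// ASCII titles stay as-is (ASSUMED); anything else becomes BOM + UTF-16BE.
fn pdf_text_string(title: &str) -> Vec<u8> {
    if title.is_ascii() {
        return title.as_bytes().to_vec();
    }
    let mut out = vec![0xFE, 0xFF]; // UTF-16BE byte-order mark
    for unit in title.encode_utf16() {
        out.extend_from_slice(&unit.to_be_bytes());
    }
    out
}
```

Characters outside the Basic Multilingual Plane become surrogate pairs automatically, since `encode_utf16` yields code units rather than code points.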

Convert HTML to PDF (`html` feature)

harumi = { version = "1.1", features = ["html"] }

use harumi::{render_html_to_pdf, HtmlRenderOptions};

let font = include_bytes!("NotoSansCJK-Regular.ttf").to_vec();
let html = r#"
    <h1>Annual Report</h1>
    <p>Introduction paragraph.</p>
    <table>
      <tr><th>Revenue</th><td>$1,000,000</td></tr>
      <tr><th>Profit</th><td>$200,000</td></tr>
    </table>
    <h2>Highlights</h2>
    <ul><li>Expanded to 3 new markets</li><li>Launched 2 new products</li></ul>
    <div style="page-break-after: always"></div>
    <h1>Page Two</h1>
"#;

let pdf_bytes = render_html_to_pdf(html, HtmlRenderOptions {
    font_bytes: font,
    ..HtmlRenderOptions::default()
})?;

Supported elements: <h1>–<h6>, <p>, <table>/<tr>/<th>/<td>, <ul>/<ol>/<li>, <div>/<section>/<article> (block containers).
Page breaks: style="page-break-after: always" or class="page-break".
Skipped: <script>, <style>, <head>.
Inline styles: <strong>/<b> (bold), <em>/<i> (italic), <span style="color: #RRGGBB"> (color), <a href> (blue link color).
Handles deeply nested HTML without stack overflow (iterative parser, tested with 5 000 nested <div>s).

API Overview

// Load
let mut doc = Document::from_file("path/to/file.pdf")?;
let mut doc = Document::from_bytes(&bytes)?;

// Font embedding (one per font file; reuse the handle across pages)
let font: FontHandle = doc.embed_font(ttf_bytes)?;

// Page size (PDF points, width × height)
let (width, height) = doc.page(1)?.size()?;

// Invisible text — for OCR text layers
doc.page(1)?.add_invisible_text(text, font, [x, y], size)?;

// Visible text — for watermarks, stamps, annotations
doc.page(1)?.add_text(text, font, [x, y], size, [r, g, b])?;

// Batch placement (one subsetting pass — efficient for OCR output)
doc.page(1)?.add_invisible_text_runs(&[
    TextRun { text: "line one".into(), font, x: 72.0, y: 700.0, font_size: 11.0, render_mode: 3, color: [0.0; 3] },
    TextRun { text: "line two".into(), font, x: 72.0, y: 685.0, font_size: 11.0, render_mode: 3, color: [0.0; 3] },
])?;

// Page structure (no feature gate)
doc.page_count()                          // u32
doc.rotate_page(n, degrees)?;             // multiple of 90; accumulates
doc.remove_page(n)?;                      // cannot remove the last page
doc.insert_blank_page(after, (w, h))?;    // after=0 prepends
doc.reorder_pages(&[new_order...])?;      // 1-indexed old page numbers
doc.extract_pages(&[n1, n2, ...])?;       // new Document with selected pages

// Create from scratch
Document::new((w, h))?;                   // blank 1-page PDF

// Merge documents (no pending ops in other)
doc.merge_from(other)?;             // append other's pages to end

// Save
doc.save("output.pdf")?;
doc.save_to_bytes()?;   // in-memory variant

// Extract text from existing PDFs (CID + standard simple fonts)
let runs: Vec<TextFragment> = doc.extract_text_runs(page_number)?;

// PDF metadata (/Info dictionary)
let meta: PdfMetadata = doc.metadata()?;
doc.set_metadata(&PdfMetadata { title: Some("...".into()), ..Default::default() })?;

// Replace text in existing content stream (matches may span consecutive Tj/TJ operators); returns match count
let n: usize = doc.page(1)?.replace_text(old_text, new_text, font)?;
// Replace using the original embedded font; eager glyph validation; returns match count
let n: usize = doc.page(1)?.replace_text_preserve_font(old_text, new_text)?;
// Read-only scan: returns match count or Err(FontCharNotMapped)
let n: usize = doc.page(1)?.can_replace_text(old_text, new_text)?;
// Replace text + expand font subset to include new characters
let n: usize = doc.page(1)?.replace_text_resubset(old, new, font_bytes)?;

// Styled visible text (bold/italic synthetic effects, no extra font file needed)
doc.page(1)?.add_text_styled(text, font, [x, y], size, [r, g, b], bold, italic)?;

// Link annotations (no feature gate)
doc.page(1)?.add_link_url([x, y, w, h], "https://example.com")?;   // URL link
doc.page(1)?.add_link_internal([x, y, w, h], target_page)?;         // in-document link

// Document outline / bookmarks (no feature gate)
doc.add_bookmark("Section Title", page, y)?;  // appends a flat outline entry

// Markup annotations (no feature gate)
doc.page(1)?.add_highlight([x, y, w, h], [r, g, b])?;
doc.page(1)?.add_underline([x, y, w, h], [r, g, b])?;
doc.page(1)?.add_strikeout([x, y, w, h], [r, g, b])?;
doc.page(1)?.add_squiggly([x, y, w, h], [r, g, b])?;
doc.page(1)?.add_sticky_note([x, y], "comment text")?;

// AcroForm (no feature gate)
let fields: Vec<FormField> = doc.form_fields()?;
let n: usize = doc.fill_form(&[("field_name", "value")])?;

// Page boxes (no feature gate)
let cb: Option<[f32; 4]> = doc.page(1)?.crop_box()?;
doc.page(1)?.set_crop_box([x, y, w, h])?;
doc.page(1)?.set_trim_box([x, y, w, h])?;
doc.page(1)?.set_bleed_box([x, y, w, h])?;
let mb: [f32; 4] = doc.page(1)?.media_box()?;
doc.page(1)?.set_media_box([x, y, w, h])?;

// Password protection (no feature gate)
Document::from_file_with_password(path, password)?;
Document::from_bytes_with_password(bytes, password)?;
doc.is_encrypted()                     // true if PDF was encrypted when loaded
doc.set_encryption(user_pw, owner_pw)?; // encrypt on next save()

Feature flags

Flag	What it enables	Extra dependencies
(default)	Text overlay, font embedding, `add_text_box`, `add_text_box_aligned`, `add_text_with_opacity`, `add_text_box_with_opacity`	lopdf, allsorts, ttf-parser
`draw`	`add_rect`, `add_line`, `add_rect_stroke`, `add_polygon`, `add_polyline`, `add_ellipse`, `add_path` — shapes	none
`image`	`add_image`, `add_image_with_opacity` — JPEG/PNG raster images; `extract_page_image` — extract embedded image from scanned PDF (enables `draw`)	`image` crate
`ocr`	`ocr::hocr_y_to_pdf`, `ocr::hocr_x_to_pdf`, `ocr::pixel_size_to_pt` — Tesseract coordinate conversion	none
`flow`	`FlowDocument` push-style builder with automatic pagination (`push_heading`, `push_paragraph`, `push_paragraph_styled`, `push_key_value_table`, `push_list`, `push_page_break`, `render`); `InlineSpan` for inline bold/italic/color within a paragraph; `HeaderFooter` for per-page header/footer with `{{page}}`/`{{total}}` substitution; `auto_bookmarks` for automatic outline from headings	none
`html`	`render_html_to_pdf` — HTML → PDF (h1–h6, p, table, ul/ol, page-break; enables `flow`)	`scraper`

Coordinate system

Coordinates are in PDF points (1 pt = 1/72 inch), with the origin at the bottom-left of the page. If your OCR engine (e.g. Tesseract / hOCR) reports pixel coordinates measured from the top-left, enable the ocr feature and use its conversion helpers:

harumi = { version = "1.1", features = ["ocr"] }

let pdf_y = harumi::ocr::hocr_y_to_pdf(pixel_y, page_height_pts, image_dpi);
let pdf_x = harumi::ocr::hocr_x_to_pdf(pixel_x, image_dpi);
let pt    = harumi::ocr::pixel_size_to_pt(pixel_size, image_dpi);
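
The arithmetic behind such a conversion is short: scale pixels to points by 72 / DPI, then flip the y axis against the page height. The std-only sketch below shows that arithmetic under those assumptions; `px_to_pt` and `top_left_px_to_pdf` are hypothetical names, and the harumi helpers may differ in details such as rounding or which edge of an hOCR bounding box they expect.

```rust
/// Convert a pixel length to PDF points (1 pt = 1/72 inch).
fn px_to_pt(px: f32, dpi: f32) -> f32 {
    px * 72.0 / dpi
}

/// Convert a top-left-origin pixel position to a bottom-left-origin
/// PDF position, given the page height in points.
fn top_left_px_to_pdf(px_x: f32, px_y: f32, page_height_pts: f32, dpi: f32) -> [f32; 2] {
    [px_to_pt(px_x, dpi), page_height_pts - px_to_pt(px_y, dpi)]
}
```

On an A4 page (842 pt tall) scanned at 300 DPI, the pixel (300, 600) maps to the PDF point (72, 698). For text placement, the bottom edge of an OCR bounding box is usually the better y input, because PDF text is positioned from its baseline.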

Supported Fonts

Font format	Status
TrueType (`.ttf`, `sfntVersion = 0x00010000`)	Supported
OpenType with CFF outlines (`.otf`, `OTTO`)	Accepted; subsetting depends on allsorts
TTC collections	Supported (index 0)

For Japanese/Chinese/Korean, use the TrueType variant of Noto Sans CJK — end-to-end verified:

NotoSansCJKjp-Regular.ttf  (Japanese)
NotoSansCJKsc-Regular.ttf  (Simplified Chinese)
NotoSansCJKtc-Regular.ttf  (Traditional Chinese)
NotoSansCJKkr-Regular.ttf  (Korean)

OTF note: harumi accepts .otf files and routes them through FontFile3 /OpenType embedding. However, allsorts v0.17 cannot subset all CFF variants (e.g. CFF2 variable fonts). If subsetting fails you will get a FontParse error at save() time. Use the TTF variants above for guaranteed compatibility.

Internals

harumi
├── lopdf v0.40          — parse and modify existing PDF object graph
├── allsorts v0.17+      — TrueType font subsetting (used in Prince typesetter)
└── ttf-parser           — font metadata (bbox, units_per_em, ascender)

The font pipeline:

1. Parse used characters → collect Unicode code points
2. Map code points → original Glyph IDs via the font's cmap table (ttf-parser)
3. Subset the TTF to used glyphs only (allsorts); GIDs are compacted to 0..N
4. Remap gid_to_char and advance widths from original GIDs to the new compact GIDs
5. Build the CID font object graph: Type0 → CIDFontType2 → FontDescriptor → FontFile2
6. Generate a /ToUnicode CMap stream so viewers can copy/search the text
7. Append a new content stream to the page's /Contents array
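
The GID compaction, width remapping, and ToUnicode steps above can be illustrated with a toy model. The sketch uses plain maps in place of a real cmap table and font subsetter, so it is not what allsorts or harumi do internally; `Subset`, `build_subset`, and `bfchar_lines` are hypothetical names, and reserving new GID 0 for .notdef is an assumption that follows common TrueType practice.

```rust
use std::collections::BTreeMap;

/// Result of the toy subsetting step.
struct Subset {
    /// (new GID, original GID, character), ordered by new GID.
    glyphs: Vec<(u16, u16, char)>,
    /// Advance widths indexed by new GID; index 0 is .notdef.
    widths: Vec<u16>,
}

/// Collect used characters, look up original GIDs in a toy cmap,
/// compact them to 1..=N, and carry the advance widths along.
fn build_subset(text: &str, cmap: &BTreeMap<char, u16>, advances: &BTreeMap<u16, u16>) -> Subset {
    let mut used: BTreeMap<u16, char> = BTreeMap::new();
    for c in text.chars() {
        match cmap.get(&c) {
            Some(&gid) if gid != 0 => {
                used.entry(gid).or_insert(c);
            }
            _ => {} // unmapped characters are skipped in this sketch
        }
    }
    let mut glyphs = Vec::new();
    let mut widths = vec![advances.get(&0).copied().unwrap_or(0)];
    for (i, (&old_gid, &c)) in used.iter().enumerate() {
        glyphs.push(((i + 1) as u16, old_gid, c));
        widths.push(advances.get(&old_gid).copied().unwrap_or(0));
    }
    Subset { glyphs, widths }
}

/// One `<CID> <UTF-16BE>` line per glyph, as used inside a
/// `beginbfchar` / `endbfchar` section of a ToUnicode CMap.
fn bfchar_lines(subset: &Subset) -> Vec<String> {
    subset
        .glyphs
        .iter()
        .map(|&(new_gid, _, c)| {
            let mut buf = [0u16; 2];
            let hex: String = c
                .encode_utf16(&mut buf)
                .iter()
                .map(|u| format!("{:04X}", u))
                .collect();
            format!("<{:04X}> <{}>", new_gid, hex)
        })
        .collect()
}
```

With a toy cmap mapping 'A' to GID 36 and 'あ' to GID 1200, the text "あAあ" compacts to new GIDs 1 and 2, and the bfchar lines read `<0001> <0041>` and `<0002> <3042>`.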

Subsetting is deferred: embed_font() stores the raw TTF bytes; at save() time, harumi collects all characters used across every page, subsets once per font, and writes everything in one pass.
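
The deferred strategy can be pictured as a queue of text operations that is only resolved at save time. The sketch below is a std-only model of that idea, with hypothetical types (`FontId`, `PendingDoc`) that do not mirror harumi's internals: embedding stores raw bytes, adding text only records an operation, and the used-character set is computed once per font across all pages.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical font handle: an index into the stored font list.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct FontId(usize);

#[derive(Default)]
struct PendingDoc {
    /// Raw font bytes, stored untouched at embed time.
    fonts: Vec<Vec<u8>>,
    /// Queued text operations: (page number, font, text).
    ops: Vec<(u32, FontId, String)>,
}

impl PendingDoc {
    fn embed_font(&mut self, bytes: &[u8]) -> FontId {
        self.fonts.push(bytes.to_vec());
        FontId(self.fonts.len() - 1)
    }

    fn add_text(&mut self, page: u32, font: FontId, text: &str) {
        self.ops.push((page, font, text.to_string()));
    }

    /// At save time: one used-character set per font, gathered from every page.
    fn used_chars(&self) -> BTreeMap<FontId, BTreeSet<char>> {
        let mut map: BTreeMap<FontId, BTreeSet<char>> = BTreeMap::new();
        for (_, font, text) in &self.ops {
            map.entry(*font).or_default().extend(text.chars());
        }
        map
    }
}
```

Collecting first and subsetting once means a font used on many pages is embedded as a single subset rather than one subset per call.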

Why "harumi"

晴海 — haru (clear sky) + umi (sea). Calm on the surface, a lot going on underneath.

Roadmap

Version	Scope
v0.1	TrueType fonts, invisible + visible text, batch placement, `page.size()`, `save_to_bytes()`, GID remapping, OTF accepted
v0.2	`draw` feature (`add_rect`, `add_line`), `image` feature (`add_image`, PNG SMask transparency), page manipulation (`rotate_page`, `remove_page`, `insert_blank_page`, `reorder_pages`)
v0.3	`add_text_box`, `add_rect_stroke`, `add_polygon`, `add_ellipse`, `add_path`; `add_text_with_rotation`; security hardening; `merge_from`; `Document::new`; `extract_pages`
v0.4	`extract_text_runs` (CID + standard fonts), PDF metadata r/w, `replace_text` (Tj/TJ rewrite, cross-operator matching, width compensation, preserve-font mode), `flow` feature (`FlowDocument`, CJK auto-pagination), `html` feature, `extract_page_image`
v0.5	`add_link_url`, `add_link_internal` — clickable PDF link annotations; `add_bookmark` — document outline/bookmarks with CJK UTF-16BE titles; `HeaderFooter` + `{{page}}`/`{{total}}` for `FlowDocument`; `auto_bookmarks` from headings; security fixes
v0.6	`from_file_with_password` / `from_bytes_with_password` / `is_encrypted` / `Error::WrongPassword`; markup annotations (highlight, underline, strikeout, sticky-note); AcroForm `form_fields()` / `fill_form()`; AGL table +116 entries (Central EU, ligatures, euro); Identity-H text extraction fallback
v0.7	`set_encryption` — write password-protected PDFs; `add_squiggly` — wavy underline annotation; full page-box API (`crop_box`, `trim_box`, `bleed_box`, `media_box` read/write)
v0.8	`replace_text_resubset` — expand font subset at replacement time (any language); `InlineSpan` bold/italic/color in `FlowDocument` + HTML `<strong>`/`<em>`/`<span>` inline styles; nested `/Pages` tree inherited-attribute fix; TTC E2E tests; `wasm-pack test --node` CI; `cargo semver-checks` CI
Next	AES-256 write encryption

Contributing

Issues and PRs welcome at github.com/kent-tokyo/harumi.

The most complex part of this codebase is src/font/embed.rs — the CID font object graph construction. When reporting rendering bugs in a specific PDF viewer, include the viewer name and version in your issue.

License

MIT OR Apache-2.0

harumi 1.0.0