§harumi
Pure-Rust PDF library — CJK font embedding (Chinese/Japanese/Korean), OCR text overlay, text extraction, HTML→PDF, page merge/split. Zero C/C++ dependencies. WASM-compatible.
§Use cases
| Scenario | Key API |
|---|---|
| OCR invisible text layer | add_invisible_text · add_invisible_text_runs |
| AI / RAG text extraction | extract_text_runs · extract_text_chunks · extract_as_markdown |
| PDF watermark / stamp | add_text · add_text_with_rotation |
| Scanned PDF → searchable | add_invisible_text + hOCR helpers (ocr feature) |
| HTML → PDF | render_html_to_pdf (html feature) |
| PDF text replacement | replace_text · replace_text_resubset |
| Page merge / split | merge_from · extract_pages |
| WASM / Edge / Lambda | All APIs — zero C/C++ dependencies |
§Motivation
Rust lacks a high-level, zero-C-dependency library for injecting text into
existing PDFs. Low-level crates like lopdf expose the raw PDF object graph
and require manual CID font assembly. harumi wraps that complexity behind
a simple, ergonomic API.
§Quick start
use harumi::{Document, TextRun};
let mut doc = Document::from_file("scanned.pdf")?;
let font = doc.embed_font(include_bytes!("../tests/fixtures/NotoSansJP-Regular.ttf"))?;
// Invisible OCR text layer
doc.page(1)?.add_invisible_text("日本語テキスト", font, [72.0, 700.0], 12.0)?;
// Visible red label
doc.page(1)?.add_text("CONFIDENTIAL", font, [72.0, 750.0], 18.0, [0.8, 0.0, 0.0])?;
doc.save("output.pdf")?;
§Coordinate system
All coordinates are in PDF points (1 pt = 1/72 inch). The origin is at
the bottom-left of the page. Use page.size() to
query the page dimensions and position text relative to them.
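OCR engines usually report positions in pixels measured from the top-left corner, so they have to be scaled and flipped before they match the convention above. The following is a minimal, self-contained sketch of that arithmetic. The function name `pixel_to_pdf_point`, the use of `f64`, and the hard-coded page height are assumptions for illustration and not part of harumi's API; in practice the height would come from `page.size()`.

```rust
/// Convert a point given in pixels from the top-left corner of a scanned
/// image into PDF points measured from the bottom-left corner of the page.
/// Hypothetical helper, not a harumi function.
fn pixel_to_pdf_point(px: [f64; 2], dpi: f64, page_height_pt: f64) -> [f64; 2] {
    // 72 points per inch, `dpi` pixels per inch.
    let x = px[0] * 72.0 / dpi;
    // Flip the vertical axis: PDF y grows upward from the bottom edge.
    let y = page_height_pt - px[1] * 72.0 / dpi;
    [x, y]
}

fn main() {
    // Assumed page height: US Letter is 612 x 792 pt.
    let page_height_pt = 792.0;
    // A point one inch right and one inch down in a 300 dpi scan.
    let pos = pixel_to_pdf_point([300.0, 300.0], 300.0, page_height_pt);
    println!("{:?}", pos); // prints [72.0, 720.0]
}
```

The resulting pair has the same `[x, y]` shape as the position arguments in the quick start.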
§Font subsetting
embed_font stores the raw TTF bytes without
processing. At save time, harumi collects every
character used across all pages, runs a single subset per font, and embeds
the result. This means subsetting overhead is paid once regardless of how
many pages or text runs reference the same font.
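The idea of collecting characters first and subsetting once per font can be sketched as follows. This is an illustration of the general technique under stated assumptions, not harumi's internal code: `QueuedText`, `font_id`, and `chars_per_font` are hypothetical names, and the actual subsetting step is left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for a queued text placement.
struct QueuedText {
    font_id: usize,
    text: String,
}

/// Gather the distinct characters each font needs across all pages.
fn chars_per_font(pages: &[Vec<QueuedText>]) -> BTreeMap<usize, BTreeSet<char>> {
    let mut used: BTreeMap<usize, BTreeSet<char>> = BTreeMap::new();
    for page in pages {
        for run in page {
            used.entry(run.font_id).or_default().extend(run.text.chars());
        }
    }
    used
}

fn main() {
    let pages = vec![
        vec![QueuedText { font_id: 0, text: "日本語".to_string() }],
        vec![
            QueuedText { font_id: 0, text: "日本".to_string() },
            QueuedText { font_id: 1, text: "OK".to_string() },
        ],
    ];
    let used = chars_per_font(&pages);
    // A real implementation would run one subsetting pass per entry here.
    for (font_id, chars) in &used {
        println!("font {font_id}: {} distinct characters", chars.len());
    }
}
```

Because the character set is keyed by font, repeated text on later pages adds nothing to the work done at save time.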
§Feature flags
| Flag | Enables | Extra deps |
|---|---|---|
| ocr | hOCR pixel→PDF coordinate helpers | none |
| draw | Shapes: rect, line, ellipse, polygon, path | none |
| image | JPEG/PNG embed + extraction; enables draw | png crate |
| flow | FlowDocument auto-pagination builder + headers/footers | none |
| html | HTML→PDF renderer; enables flow | none (internal tokenizer) |
Re-exports§
pub use signature::SignatureInfo;
Modules§
- signature
- Digital signature verification for PDFs.
Structs§
- Document
- An existing PDF document that can be annotated with text overlays.
- FontHandle
- Opaque handle to a font registered with crate::Document::embed_font.
- FormField
- A PDF form field returned by Document::form_fields.
- PageHandle
- A handle to a specific page for queuing text overlays.
- PdfMetadata
- PDF /Info dictionary fields.
- TextChunk
- A semantic text chunk extracted from a page.
- TextFieldOptions
- Options for creating a text field via Document::add_text_field.
- TextFragment
- A text fragment extracted from a page content stream.
- TextRun
- A single text placement descriptor for use with PageHandle::add_invisible_text_runs.
Enums§
- ChunkType
- The semantic type of a text chunk.
- Color
- A color value that can be either RGB or CMYK.
- Error
- Errors returned by harumi operations.
- FieldType
- The type of a PDF form field.
- VerticalAlign
- Vertical alignment for PageHandle::add_text_box_aligned.
Functions§
- sort_by_reading_order
- Sort text fragments by reading order: top-to-bottom, then left-to-right.
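As a rough illustration of what "top-to-bottom, then left-to-right" means in a bottom-left-origin coordinate system, here is a minimal sketch. The `Fragment` struct and `sort_top_down_left_right` function are hypothetical stand-ins, not harumi's `TextFragment` or its actual sorting logic, and the sketch compares exact `y` values, whereas a real implementation may group fragments whose baselines differ only slightly.

```rust
/// Hypothetical fragment type for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Fragment {
    text: String,
    x: f64,
    y: f64,
}

/// Sort by descending y (higher on the page first), then ascending x.
fn sort_top_down_left_right(frags: &mut [Fragment]) {
    frags.sort_by(|a, b| b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x)));
}

fn main() {
    let mut frags = vec![
        Fragment { text: "footer".into(), x: 72.0, y: 40.0 },
        Fragment { text: "world".into(), x: 150.0, y: 700.0 },
        Fragment { text: "hello".into(), x: 72.0, y: 700.0 },
    ];
    sort_top_down_left_right(&mut frags);
    let order: Vec<&str> = frags.iter().map(|f| f.text.as_str()).collect();
    println!("{:?}", order); // prints ["hello", "world", "footer"]
}
```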
Type Aliases§
- Result
- Alias for std::result::Result<T, harumi::Error>.