Crate harumi

§harumi

Pure-Rust PDF library — CJK font embedding (Chinese/Japanese/Korean), OCR text overlay, text extraction, HTML→PDF, page merge/split. Zero C/C++ dependencies. WASM-compatible.

§Use cases

Scenario | Key API
OCR invisible text layer | add_invisible_text · add_invisible_text_runs
AI / RAG text extraction | extract_text_runs · extract_text_chunks · extract_as_markdown
PDF watermark / stamp | add_text · add_text_with_rotation
Scanned PDF → searchable | add_invisible_text + hOCR helpers (ocr feature)
HTML → PDF | render_html_to_pdf (html feature)
PDF text replacement | replace_text · replace_text_resubset
Page merge / split | merge_from · extract_pages
WASM / Edge / Lambda | All APIs (zero C/C++ dependencies)

§Motivation

Rust lacks a high-level, zero-C-dependency library for injecting text into existing PDFs. Low-level crates like lopdf expose the raw PDF object graph and require manual CID font assembly. harumi wraps that complexity behind a simple, ergonomic API.

§Quick start

use harumi::{Document, TextRun};

let mut doc = Document::from_file("scanned.pdf")?;
let font = doc.embed_font(include_bytes!("../tests/fixtures/NotoSansJP-Regular.ttf"))?;

// Invisible OCR text layer
doc.page(1)?.add_invisible_text("日本語テキスト", font, [72.0, 700.0], 12.0)?;

// Visible red label
doc.page(1)?.add_text("CONFIDENTIAL", font, [72.0, 750.0], 18.0, [0.8, 0.0, 0.0])?;

doc.save("output.pdf")?;

§Coordinate system

All coordinates are in PDF points (1 pt = 1/72 inch). The origin is at the bottom-left of the page. Use page.size() to query the page dimensions and position text relative to them.
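
Because the origin is at the bottom-left, positions measured from the top of the page (as OCR engines and image tools usually report them) need a vertical flip. The following is a minimal, library-independent sketch of that arithmetic. The function names, the `f32` coordinate type, and the `[x, y]` return shape are assumptions made for illustration; they are not part of harumi's API, and the page height would come from whatever `page.size()` reports.

```rust
/// Convert a point measured in image pixels from the top-left corner
/// (the usual OCR convention) into PDF points measured from the bottom-left.
///
/// `dpi` is the resolution the page image was rendered at, and
/// `page_height_pt` is the page height in PDF points.
/// Hypothetical helper, not a harumi function.
fn pixel_to_pdf_point(x_px: f32, y_px: f32, dpi: f32, page_height_pt: f32) -> [f32; 2] {
    let scale = 72.0 / dpi; // 1 pt = 1/72 inch
    let x_pt = x_px * scale;
    let y_pt = page_height_pt - y_px * scale; // flip the vertical axis
    [x_pt, y_pt]
}

/// Position a point `margin_pt` in from the left edge and `margin_pt`
/// down from the top edge of a page that is `page_height_pt` tall.
/// Hypothetical helper, not a harumi function.
fn from_top_left(margin_pt: f32, page_height_pt: f32) -> [f32; 2] {
    [margin_pt, page_height_pt - margin_pt]
}
```

On a US Letter page (792 pt tall), a one-inch margin from the top-left corner lands at `[72.0, 720.0]`, which could then be passed as the position argument of a call such as `add_text`.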

§Font subsetting

embed_font stores the raw TTF bytes without processing them. At save time, harumi collects every character used across all pages, runs a single subsetting pass per font, and embeds the result. The subsetting cost is therefore paid once per font, regardless of how many pages or text runs reference it.
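
The deferred approach described above can be pictured as a per-font set of used characters that is filled while text is queued and consumed once at save time. The sketch below illustrates only that bookkeeping pattern; `FontId`, `GlyphUsage`, `record`, and `subset_plan` are hypothetical names, and this is not harumi's internal implementation.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a font handle; not harumi's `FontHandle`.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct FontId(usize);

/// Records which characters each font has been asked to draw.
#[derive(Default)]
struct GlyphUsage {
    used: HashMap<FontId, BTreeSet<char>>,
}

impl GlyphUsage {
    /// Called once per text run. Cheap: it only records characters,
    /// and the set discards duplicates automatically.
    fn record(&mut self, font: FontId, text: &str) {
        self.used.entry(font).or_default().extend(text.chars());
    }

    /// Called once at save time: yields one deduplicated, sorted character
    /// list per font. A real implementation would hand each list to a
    /// font subsetter exactly once.
    fn subset_plan(&self) -> Vec<(FontId, Vec<char>)> {
        let mut plan: Vec<(FontId, Vec<char>)> = self
            .used
            .iter()
            .map(|(font, chars)| (*font, chars.iter().copied().collect()))
            .collect();
        // HashMap iteration order is unspecified, so sort for stable output.
        plan.sort_by_key(|(font, _)| font.0);
        plan
    }
}
```

Recording the same text on a hundred pages grows the set no further than recording it once, which is why the work done at save time depends on the number of distinct characters rather than on the number of text runs.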

§Feature flags

Flag | Enables | Extra deps
ocr | hOCR pixel→PDF coordinate helpers | none
draw | Shapes: rect, line, ellipse, polygon, path | none
image | JPEG/PNG embed + extraction; enables draw | png crate
flow | FlowDocument auto-pagination builder + headers/footers | none
html | HTML→PDF renderer; enables flow | none (internal tokenizer)

Re-exports§

pub use signature::SignatureInfo;

Modules§

signature
Digital signature verification for PDFs.

Structs§

Document
An existing PDF document that can be annotated with text overlays.
FontHandle
Opaque handle to a font registered with Document::embed_font.
FormField
A PDF form field returned by Document::form_fields.
PageHandle
A handle to a specific page for queuing text overlays.
PdfMetadata
PDF /Info dictionary fields.
TextChunk
A semantic text chunk extracted from a page.
TextFieldOptions
Options for creating a text field via Document::add_text_field.
TextFragment
A text fragment extracted from a page content stream.
TextRun
A single text placement descriptor for use with PageHandle::add_invisible_text_runs.

Enums§

ChunkType
The semantic type of a text chunk.
Color
A color value that can be either RGB or CMYK.
Error
Errors returned by harumi operations.
FieldType
The type of a PDF form field.
VerticalAlign
Vertical alignment for PageHandle::add_text_box_aligned.

Functions§

sort_by_reading_order
Sort text fragments by reading order: top-to-bottom, then left-to-right.
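
Since the PDF origin is at the bottom-left, "top-to-bottom" means descending y, with ascending x as the tie-breaker. A minimal sketch of that ordering follows, using a hypothetical `Fragment` type and a hypothetical `sort_reading_order` function; the fields of harumi's `TextFragment` and the exact behavior of `sort_by_reading_order` may differ.

```rust
/// Hypothetical fragment type; harumi's `TextFragment` fields may differ.
#[derive(Debug, Clone, PartialEq)]
struct Fragment {
    text: String,
    x: f32, // left edge, in PDF points
    y: f32, // baseline, in PDF points (larger = higher on the page)
}

/// Sort fragments top-to-bottom, then left-to-right.
fn sort_reading_order(fragments: &mut [Fragment]) {
    fragments.sort_by(|a, b| {
        // Descending y first (compare b to a), then ascending x.
        b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x))
    });
}
```

This sketch compares y values exactly. Fragments on the same visual line often have slightly different baselines, so a practical implementation would probably group fragments whose y values fall within a small tolerance before ordering by x.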

Type Aliases§

Result
Alias for std::result::Result<T, harumi::Error>.