§harumi
Pure-Rust PDF library — CJK font embedding (Chinese/Japanese/Korean), OCR text overlay, text extraction, HTML→PDF, page merge/split. Zero C/C++ dependencies. WASM-compatible.
§Use cases
| Scenario | Key API |
|---|---|
| OCR invisible text layer | add_invisible_text · add_invisible_text_runs |
| AI / RAG text extraction | extract_text_runs · extract_text_chunks · extract_as_markdown |
| PDF watermark / stamp | add_text · add_text_with_rotation |
| Scanned PDF → searchable | add_invisible_text + hOCR helpers (ocr feature) |
| HTML → PDF | render_html_to_pdf (html feature) |
| PDF text replacement | replace_text · replace_text_resubset |
| Page merge / split | merge_from · extract_pages |
| Digital signature creation | sign_document · add_signature_field (digital-signature feature) |
| WASM / Edge / Lambda | All APIs — zero C/C++ dependencies |
§Motivation
Rust lacks a high-level, zero-C-dependency library for injecting text into
existing PDFs. Low-level crates like lopdf expose the raw PDF object graph
and require manual CID font assembly. harumi wraps that complexity behind
a simple, ergonomic API.
§Quick start
use harumi::{Document, TextRun};

fn main() -> harumi::Result<()> {
    let mut doc = Document::from_file("scanned.pdf")?;
    let font = doc.embed_font(include_bytes!("../tests/fixtures/NotoSansJP-Regular.ttf"))?;
    // Invisible OCR text layer
    doc.page(1)?.add_invisible_text("日本語テキスト", font, [72.0, 700.0], 12.0)?;
    // Visible red label
    doc.page(1)?.add_text("CONFIDENTIAL", font, [72.0, 750.0], 18.0, [0.8, 0.0, 0.0])?;
    doc.save("output.pdf")?;
    Ok(())
}
§Coordinate system
All coordinates are in PDF points (1 pt = 1/72 inch). The origin is at
the bottom-left of the page. Use page.size() to
query the page dimensions and position text relative to them.
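Most OCR engines and image tools report positions in pixels with the origin at the top-left, so a conversion step is usually needed before placing text. The following is a minimal sketch of that arithmetic only; `pixel_to_pdf_point` is a hypothetical helper written for illustration, not part of harumi's API (the `ocr` feature ships its own helpers), and the page height is assumed to come from something like `page.size()`.

```rust
/// Convert a top-left-origin pixel position into bottom-left-origin
/// PDF points.
///
/// `dpi` is the resolution the page image was rendered at, and
/// `page_height_pt` is the page height in points.
/// Hypothetical helper for illustration; not a harumi function.
fn pixel_to_pdf_point(px: [f64; 2], dpi: f64, page_height_pt: f64) -> [f64; 2] {
    // 1 pt = 1/72 inch, so pixels * 72 / dpi gives points.
    let x_pt = px[0] * 72.0 / dpi;
    // Flip the y axis: pixel y grows downward, PDF y grows upward.
    let y_pt = page_height_pt - px[1] * 72.0 / dpi;
    [x_pt, y_pt]
}
```

For a 300 DPI scan of a page that is 792 pt tall, the pixel at (300, 300) maps to (72, 720) in PDF points.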
§Font subsetting
embed_font stores the raw TTF bytes without
processing. At save time, harumi collects every
character used across all pages, runs a single subset per font, and embeds
the result. This means subsetting overhead is paid once regardless of how
many pages or text runs reference the same font.
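The "collect once, subset once" idea can be illustrated with a small standalone sketch. This is not harumi's internal code: `QueuedText` and `chars_per_font` are hypothetical names, and the font is represented by a plain integer id rather than a real handle. The sketch only shows how the set of used characters might be gathered per font before a single subsetting pass.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for a queued text placement: which font it
/// uses and the text it draws.
struct QueuedText<'a> {
    font_id: usize,
    text: &'a str,
}

/// Gather the distinct characters used per font across every queued
/// placement, so that each font needs to be subset only once.
fn chars_per_font(queue: &[QueuedText<'_>]) -> BTreeMap<usize, BTreeSet<char>> {
    let mut used: BTreeMap<usize, BTreeSet<char>> = BTreeMap::new();
    for item in queue {
        // Duplicate characters collapse automatically in the set.
        used.entry(item.font_id).or_default().extend(item.text.chars());
    }
    used
}
```

Because the sets are merged before any subsetting happens, adding more pages or runs that reuse the same characters does not add more subsetting work.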
§Feature flags
| Flag | Enables | Extra deps |
|---|---|---|
| ocr | hOCR pixel→PDF coordinate helpers | none |
| draw | Shapes: rect, line, ellipse, polygon, path | none |
| image | JPEG/PNG embed + extraction; enables draw | png crate |
| flow | FlowDocument auto-pagination builder + headers/footers | none |
| html | HTML→PDF renderer; enables flow | none (internal tokenizer) |
| digital-signature | Create and verify PKCS#7/CMS signatures | RustCrypto crates |
Re-exports§
pub use flow::FlowDocument;
pub use flow::FlowOptions;
pub use flow::InlineSpan;
pub use flow::Margins;
pub use flow::html::HtmlRenderOptions;
pub use flow::html::render_html_to_pdf;
pub use signature::SignatureInfo;
pub use signature_create::CertificateInput;
pub use signature_create::PrivateKeyInput;
pub use signature_create::SignatureFieldOptions;
pub use signature_create::SigningContext;
Modules§
- flow
- High-level flow-based document builder for generating structured PDFs.
- ocr
- Helpers for converting OCR engine output coordinates to PDF coordinates.
- signature
- Digital signature verification for PDFs.
- signature_create
- Digital signature creation for PDFs.
Structs§
- AttachmentInfo
- Information about a file attached to a PDF.
- BatchEntry
- One entry for PageHandle::replace_text_fragments_batch_opts.
- BoxFitOptions
- Options for Document::fit_text_to_box.
- ClassifiedCollision
- A Collision annotated with the structural relationship between the two overlapping LayoutRegions.
- Collision
- A pair of overlapping PlacedBoxes returned by detect_collisions.
- ColumnZone
- A horizontal text zone returned by detect_text_columns.
- DebugOverlayOptions
- Display options for PageHandle::add_fit_debug_overlay.
- Document
- An existing PDF document that can be annotated with text overlays.
- ExtractionWarning
- A non-fatal issue encountered while extracting text from a page.
- FitOptions
- Options for PageHandle::replace_fragments_fit_to_bbox.
- FitResult
- Result of Document::fit_text_to_box.
- FontHandle
- Opaque handle to a font registered with crate::Document::embed_font.
- FormField
- A PDF form field returned by Document::form_fields.
- FragmentReplaceOpts
- Placement options for PageHandle::replace_text_fragments_opts.
- LabelValuePair
- A matched label/value region pair extracted from a form or table layout.
- LayoutIssue
- One concrete layout issue in a page-level quality report.
- LayoutRegion
- A detected layout region on a page, with both source-text bounds and the inferred available rectangle for replacement text.
- LayoutRegionOptions
- Options for extract_layout_regions.
- PageFitSummary
- Page-level aggregate quality summary derived from a batch of RegionFitPlans.
- PageHandle
- A handle to a specific page for queuing text overlays.
- PageImage
- A raster image extracted from a PDF page.
- PageLayoutQuality
- Page-level layout quality report for translated or replacement text.
- PdfMetadata
- PDF /Info dictionary fields.
- PlacedBox
- A positioned rectangle for collision detection.
- RegionFitPlan
- Combines a LayoutRegion with the crate::FitResult for its planned replacement text and any Collisions against neighbouring regions.
- RegionTextFitOptions
- Per-region fitting policy for crate::Document::plan_text_for_regions_with_policy.
- ReplaceOptions
- Options for PageHandle::replace_text_opts.
- TableCell
- A text cell detected by extract_table_cells.
- TextChunk
- A semantic text chunk extracted from a page.
- TextFieldOptions
- Options for creating a text field via Document::add_text_field.
- TextFragment
- A text fragment extracted from a page content stream.
- TextGroup
- A group of TextFragments merged into a single logical text block.
- TextRun
- A single text placement descriptor for use with PageHandle::add_invisible_text_runs.
Enums§
- BaselinePolicy
- How to anchor replacement text vertically within a layout region.
- ChunkType
- The semantic type of a text chunk.
- CollisionKind
- Structural relationship between two overlapping LayoutRegions.
- CollisionSeverity
- How bad a collision is, based on how much of the smaller box it covers.
- Color
- A color value that can be either RGB or CMYK.
- Error
- Errors returned by harumi operations.
- FieldType
- The type of a PDF form field.
- FragmentReplaceFailureReason
- Reason why a TextFragment cannot be suppressed by PageHandle::replace_text_fragments.
- GroupingStrategy
- Controls how group_text_fragments merges individual TextFragments.
- LayoutIssueKind
- Type of layout problem found while checking planned replacement text.
- LayoutIssueSeverity
- Severity of a layout issue.
- LayoutRegionKind
- Classifies the structural role of a LayoutRegion.
- LayoutRegionRole
- Functional role of a LayoutRegion in a translation or editing workflow.
- OverflowPolicy
- How Document::fit_text_to_box handles text that does not fit.
- PageImageFormat
- Format of the bytes stored in PageImage::bytes.
- PlacementStatus
- Outcome of a Document::fit_text_to_box placement.
- VerticalAlign
- Vertical alignment for PageHandle::add_text_box_aligned.
- WarningKind
- Why a content stream or Form XObject was not fully decoded during text extraction.
- WidthPolicy
- How to determine the available width for replacement text.
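CollisionSeverity is described above as depending on how much of the smaller box an overlap covers. A rough sketch of that measurement follows, assuming boxes are given as [x, y, width, height] in PDF points with a bottom-left origin. `Rect`, `overlap_area`, and `coverage_of_smaller` are hypothetical names for illustration; the thresholds harumi uses to map a ratio to a severity level are not documented here, so the sketch stops at the ratio.

```rust
/// Axis-aligned box as [x, y, width, height] in PDF points
/// (bottom-left origin). Hypothetical alias for illustration.
type Rect = [f64; 4];

/// Area of the intersection of two boxes, or 0.0 when they do not overlap.
fn overlap_area(a: Rect, b: Rect) -> f64 {
    let w = (a[0] + a[2]).min(b[0] + b[2]) - a[0].max(b[0]);
    let h = (a[1] + a[3]).min(b[1] + b[3]) - a[1].max(b[1]);
    if w > 0.0 && h > 0.0 { w * h } else { 0.0 }
}

/// Fraction (0.0 to 1.0) of the smaller box that the overlap covers.
fn coverage_of_smaller(a: Rect, b: Rect) -> f64 {
    let smaller = (a[2] * a[3]).min(b[2] * b[3]);
    if smaller > 0.0 { overlap_area(a, b) / smaller } else { 0.0 }
}
```

Normalising by the smaller box means a small label fully hidden under a large region scores 1.0, even though the overlap is a tiny share of the large region.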
Functions§
- calculate_text_width
- Calculate the total width of a text string in PDF points from raw TTF bytes.
- classify_collisions
- Annotate each Collision with a CollisionKind by comparing the LayoutRegion metadata (row, col, role) at the collision indices.
- collision_severity
- Compute a CollisionSeverity from raw box areas.
- detect_collisions
- Detect pairwise axis-aligned bounding-box overlaps between boxes.
- detect_text_columns
- Estimate column layout from a set of text fragments.
- extract_label_value_pairs
- Pair LayoutRegionRole::LeftLabel regions with their same-row LayoutRegionRole::RightValue siblings.
- extract_layout_regions
- Detect layout regions on a page, inferring the usable area for each cell.
- extract_table_cells
- Detect table structure in a flat list of text fragments.
- font_covers_char
- Return true when font_bytes contains a glyph for ch.
- glyph_advance_pt
- Width of one character in PDF points given the font face and font size. Returns None if the character is not present in the font (no glyph mapping).
- group_text_fragments
- Group text fragments into logical blocks according to strategy.
- merge_short_cjk_tails
- Merge short CJK “tail” fragments into the preceding fragment.
- sort_by_reading_order
- Sort text fragments by reading order: top-to-bottom, then left-to-right.
- text_fragment_bounds
- Return the axis-aligned bounding box that covers all fragments in fragments as [x, y, width, height] in PDF points (origin: bottom-left of the page).
- wrap_paragraph
- Greedy line-breaking for a single paragraph (no embedded newlines).
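For readers unfamiliar with greedy line-breaking, the general technique can be sketched as below. This is not wrap_paragraph itself, whose signature is not shown on this page: `greedy_wrap` is a hypothetical function, the `width_of` closure stands in for real font metrics, and words are split on whitespace, which is an assumption that does not hold for CJK text.

```rust
/// Greedily break `text` into lines no wider than `max_width`.
///
/// `width_of` measures a candidate line; real code would use font
/// metrics. Hypothetical sketch, not harumi's wrap_paragraph.
fn greedy_wrap(text: &str, max_width: f64, width_of: impl Fn(&str) -> f64) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        if current.is_empty() {
            // First word on a line is always accepted, even if too wide.
            current.push_str(word);
            continue;
        }
        let candidate = format!("{current} {word}");
        if width_of(&candidate) <= max_width {
            current = candidate;
        } else {
            // The word does not fit: close this line and start a new one.
            lines.push(std::mem::take(&mut current));
            current.push_str(word);
        }
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}
```

Greedy wrapping fills each line as far as possible before moving on, so it runs in a single pass but can leave a ragged last line compared with optimal-fit algorithms.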
Type Aliases§
- Result
- Alias for std::result::Result<T, harumi::Error>.