Crate harumi

§harumi

Pure-Rust PDF library — CJK font embedding (Chinese/Japanese/Korean), OCR text overlay, text extraction, HTML→PDF, page merge/split. Zero C/C++ dependencies. WASM-compatible.

§Use cases

| Scenario | Key API |
| --- | --- |
| OCR invisible text layer | add_invisible_text · add_invisible_text_runs |
| AI / RAG text extraction | extract_text_runs · extract_text_chunks · extract_as_markdown |
| PDF watermark / stamp | add_text · add_text_with_rotation |
| Scanned PDF → searchable | add_invisible_text + hOCR helpers (ocr feature) |
| HTML → PDF | render_html_to_pdf (html feature) |
| PDF text replacement | replace_text · replace_text_resubset |
| Page merge / split | merge_from · extract_pages |
| Digital signature creation | sign_document · add_signature_field (digital-signature feature) |
| WASM / Edge / Lambda | All APIs (zero C/C++ dependencies) |

§Motivation

Rust lacks a high-level, zero-C-dependency library for injecting text into existing PDFs. Low-level crates like lopdf expose the raw PDF object graph and require manual CID font assembly. harumi wraps that complexity behind a simple, ergonomic API.

§Quick start

use harumi::Document;

fn main() -> harumi::Result<()> {
    let mut doc = Document::from_file("scanned.pdf")?;
    let font = doc.embed_font(include_bytes!("../tests/fixtures/NotoSansJP-Regular.ttf"))?;

    // Invisible OCR text layer
    doc.page(1)?.add_invisible_text("日本語テキスト", font, [72.0, 700.0], 12.0)?;

    // Visible red label
    doc.page(1)?.add_text("CONFIDENTIAL", font, [72.0, 750.0], 18.0, [0.8, 0.0, 0.0])?;

    doc.save("output.pdf")?;
    Ok(())
}

§Coordinate system

All coordinates are in PDF points (1 pt = 1/72 inch). The origin is at the bottom-left of the page. Use page.size() to query the page dimensions and position text relative to them.
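Because OCR engines and image tools usually report pixel positions with a top-left origin, a conversion step is typically needed before placing text. The following is a minimal sketch of that arithmetic, not harumi's own API: `pixel_to_pdf_point` is a hypothetical helper, and the page height is passed in as a plain `f32` on the assumption that it has already been obtained (for example from page.size(), whose return type is not shown here).

```rust
/// Hypothetical helper (not part of harumi): convert a pixel position with a
/// top-left origin into PDF points with a bottom-left origin.
fn pixel_to_pdf_point(px: [f32; 2], dpi: f32, page_height_pt: f32) -> [f32; 2] {
    // 1 pt = 1/72 inch, so pixels * 72 / dpi gives points.
    let x_pt = px[0] * 72.0 / dpi;
    let y_from_top_pt = px[1] * 72.0 / dpi;
    // Flip the vertical axis: PDF measures y upward from the bottom edge.
    [x_pt, page_height_pt - y_from_top_pt]
}

fn main() {
    // Assumed example values: a US Letter page (612 x 792 pt) scanned at 300 DPI.
    let point = pixel_to_pdf_point([300.0, 300.0], 300.0, 792.0);
    println!("x = {} pt, y = {} pt", point[0], point[1]);
}
```

A point one inch from the top-left corner of the scan ends up one inch from the left edge and one inch below the top edge of the page, expressed as a distance from the bottom.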

§Font subsetting

embed_font stores the raw TTF bytes without processing. At save time, harumi collects every character used across all pages, runs a single subsetting pass per font, and embeds the result. Subsetting overhead is therefore paid once per font, regardless of how many pages or text runs reference it.
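The deferred approach described above can be pictured as simple bookkeeping: record characters as text is queued, then hand one deduplicated set per font to the subsetter at save time. The sketch below illustrates only that bookkeeping. `GlyphUsage`, `FontId`, `record`, and `subset_inputs` are hypothetical names, not harumi internals, and no real subsetting is performed.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a font handle (harumi's FontHandle is opaque).
type FontId = usize;

/// Records which characters each font needs; nothing is subset yet.
#[derive(Default)]
struct GlyphUsage {
    used: HashMap<FontId, BTreeSet<char>>,
}

impl GlyphUsage {
    /// Called for every queued text run, on any page.
    fn record(&mut self, font: FontId, text: &str) {
        self.used.entry(font).or_default().extend(text.chars());
    }

    /// Called once at save time: one deduplicated, sorted character set per
    /// font, which is what a subsetter would consume.
    fn subset_inputs(&self) -> Vec<(FontId, String)> {
        let mut out: Vec<(FontId, String)> = self
            .used
            .iter()
            .map(|(font, chars)| (*font, chars.iter().collect()))
            .collect();
        out.sort_by_key(|(font, _)| *font);
        out
    }
}

fn main() {
    let mut usage = GlyphUsage::default();
    usage.record(0, "日本語"); // page 1
    usage.record(0, "日本"); // page 2, same font: adds no new characters
    usage.record(1, "CONFIDENTIAL");
    for (font, chars) in usage.subset_inputs() {
        println!("font {font}: subset over {:?}", chars);
    }
}
```

However many runs reference a font, the subsetter sees each font exactly once, which is the property the paragraph above relies on.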

§Feature flags

| Flag | Enables | Extra deps |
| --- | --- | --- |
| ocr | hOCR pixel→PDF coordinate helpers | none |
| draw | Shapes: rect, line, ellipse, polygon, path | none |
| image | JPEG/PNG embed + extraction; enables draw | png crate |
| flow | FlowDocument auto-pagination builder + headers/footers | none |
| html | HTML→PDF renderer; enables flow | none (internal tokenizer) |
| digital-signature | Create and verify PKCS#7/CMS signatures | RustCrypto crates |

Re-exports§

pub use flow::FlowDocument;
pub use flow::FlowOptions;
pub use flow::HeaderFooter;
pub use flow::InlineSpan;
pub use flow::Margins;
pub use flow::html::HtmlRenderOptions;
pub use flow::html::render_html_to_pdf;
pub use signature::SignatureInfo;
pub use signature_create::CertificateInput;
pub use signature_create::PrivateKeyInput;
pub use signature_create::SignatureFieldOptions;
pub use signature_create::SigningContext;

Modules§

flow
High-level flow-based document builder for generating structured PDFs.
ocr
Helpers for converting OCR engine output coordinates to PDF coordinates.
signature
Digital signature verification for PDFs.
signature_create
Digital signature creation for PDFs.

Structs§

AttachmentInfo
Information about a file attached to a PDF.
BatchEntry
One entry for PageHandle::replace_text_fragments_batch_opts.
BoxFitOptions
Options for Document::fit_text_to_box.
ClassifiedCollision
A Collision annotated with the structural relationship between the two overlapping LayoutRegions.
Collision
A pair of overlapping PlacedBoxes returned by detect_collisions.
ColumnZone
A horizontal text zone returned by detect_text_columns.
DebugOverlayOptions
Display options for PageHandle::add_fit_debug_overlay.
Document
An existing PDF document that can be annotated with text overlays.
ExtractionWarning
A non-fatal issue encountered while extracting text from a page.
FitOptions
Options for PageHandle::replace_fragments_fit_to_bbox.
FitResult
Result of Document::fit_text_to_box.
FontHandle
Opaque handle to a font registered with Document::embed_font.
FormField
A PDF form field returned by Document::form_fields.
FragmentReplaceOpts
Placement options for PageHandle::replace_text_fragments_opts.
LabelValuePair
A matched label/value region pair extracted from a form or table layout.
LayoutIssue
One concrete layout issue in a page-level quality report.
LayoutRegion
A detected layout region on a page, with both source-text bounds and the inferred available rectangle for replacement text.
LayoutRegionOptions
Options for extract_layout_regions.
PageFitSummary
Page-level aggregate quality summary derived from a batch of RegionFitPlans.
PageHandle
A handle to a specific page for queuing text overlays.
PageImage
A raster image extracted from a PDF page.
PageLayoutQuality
Page-level layout quality report for translated or replacement text.
PdfMetadata
PDF /Info dictionary fields.
PlacedBox
A positioned rectangle for collision detection.
RegionFitPlan
Combines a LayoutRegion with the FitResult for its planned replacement text and any Collisions against neighbouring regions.
RegionTextFitOptions
Per-region fitting policy for Document::plan_text_for_regions_with_policy.
ReplaceOptions
Options for PageHandle::replace_text_opts.
TableCell
A text cell detected by extract_table_cells.
TextChunk
A semantic text chunk extracted from a page.
TextFieldOptions
Options for creating a text field via Document::add_text_field.
TextFragment
A text fragment extracted from a page content stream.
TextGroup
A group of TextFragments merged into a single logical text block.
TextRun
A single text placement descriptor for use with PageHandle::add_invisible_text_runs.

Enums§

BaselinePolicy
How to anchor replacement text vertically within a layout region.
ChunkType
The semantic type of a text chunk.
CollisionKind
Structural relationship between two overlapping LayoutRegions.
CollisionSeverity
How bad a collision is, based on how much of the smaller box it covers.
Color
A color value that can be either RGB or CMYK.
Error
Errors returned by harumi operations.
FieldType
The type of a PDF form field.
FragmentReplaceFailureReason
Reason why a TextFragment cannot be suppressed by PageHandle::replace_text_fragments.
GroupingStrategy
Controls how group_text_fragments merges individual TextFragments.
LayoutIssueKind
Type of layout problem found while checking planned replacement text.
LayoutIssueSeverity
Severity of a layout issue.
LayoutRegionKind
Classifies the structural role of a LayoutRegion.
LayoutRegionRole
Functional role of a LayoutRegion in a translation or editing workflow.
OverflowPolicy
How Document::fit_text_to_box handles text that does not fit.
PageImageFormat
Format of the bytes stored in PageImage::bytes.
PlacementStatus
Outcome of a Document::fit_text_to_box placement.
VerticalAlign
Vertical alignment for PageHandle::add_text_box_aligned.
WarningKind
Why a content stream or Form XObject was not fully decoded during text extraction.
WidthPolicy
How to determine the available width for replacement text.

Functions§

calculate_text_width
Calculate the total width of a text string in PDF points from raw TTF bytes.
classify_collisions
Annotate each Collision with a CollisionKind by comparing the LayoutRegion metadata (row, col, role) at the collision indices.
collision_severity
Compute a CollisionSeverity from raw box areas.
detect_collisions
Detect pairwise axis-aligned bounding-box overlaps between boxes.
detect_text_columns
Estimate column layout from a set of text fragments.
extract_label_value_pairs
Pair LayoutRegionRole::LeftLabel regions with their same-row LayoutRegionRole::RightValue siblings.
extract_layout_regions
Detect layout regions on a page, inferring the usable area for each cell.
extract_table_cells
Detect table structure in a flat list of text fragments.
font_covers_char
Return true when font_bytes contains a glyph for ch.
glyph_advance_pt
Width of one character in PDF points given the font face and font size. Returns None if the character is not present in the font (no glyph mapping).
group_text_fragments
Group text fragments into logical blocks according to strategy.
merge_short_cjk_tails
Merge short CJK “tail” fragments into the preceding fragment.
sort_by_reading_order
Sort text fragments by reading order: top-to-bottom, then left-to-right.
text_fragment_bounds
Return the axis-aligned bounding box that covers all fragments in fragments as [x, y, width, height] in PDF points (origin: bottom-left of the page).
wrap_paragraph
Greedy line-breaking for a single paragraph (no embedded newlines).
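
As an illustration of what greedy line-breaking means, here is a minimal sketch of the general technique. It is not wrap_paragraph itself, whose signature is not shown on this page: `greedy_wrap` and the `measure` callback are hypothetical, and the sketch breaks only at whitespace, so CJK text without spaces would need a different break rule.

```rust
/// Hypothetical greedy wrapper: `measure` returns the width of a string in
/// points. Each line takes as many words as fit; no lookahead, no backtracking.
fn greedy_wrap(text: &str, max_width: f32, measure: impl Fn(&str) -> f32) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        let candidate = if current.is_empty() {
            word.to_string()
        } else {
            format!("{current} {word}")
        };
        // A word that is too wide on its own still gets a line to itself.
        if current.is_empty() || measure(&candidate) <= max_width {
            current = candidate;
        } else {
            // Close the current line and start a new one with this word.
            lines.push(std::mem::replace(&mut current, word.to_string()));
        }
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}

fn main() {
    // Assumed measurement for the demo: every character is 6 pt wide.
    let fixed_width = |s: &str| s.chars().count() as f32 * 6.0;
    for line in greedy_wrap("the quick brown fox jumps", 60.0, fixed_width) {
        println!("{line}");
    }
}
```

In practice the fixed-width closure would be replaced by a real font measurement, such as one built on calculate_text_width or glyph_advance_pt.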

Type Aliases§

Result
Alias for std::result::Result<T, harumi::Error>.