pub struct ExtractionOptions {
    pub preserve_layout: bool,
    pub space_threshold: f64,
    pub tj_space_threshold: f64,
    pub newline_threshold: f64,
    pub sort_by_position: bool,
    pub detect_columns: bool,
    pub column_threshold: f64,
    pub merge_hyphenated: bool,
    pub track_space_decisions: bool,
    pub reconstruct_paragraphs: bool,
    pub include_artifacts: bool,
}
Text extraction options
Fields
preserve_layout: bool
Preserve the original layout (spacing and positioning)
space_threshold: f64
Minimum gap width at which a space character is inserted (in text space units)
tj_space_threshold: f64
Threshold for synthesising an implicit U+0020 from a TJ numeric
kerning offset, expressed as a fraction of the current font size.
A TJ kern advances the text matrix by -adjustment/1000 * font_size
without rendering any glyph; many PDFs (academic publishers, LaTeX,
kerned typography) encode inter-word gaps purely as wide negative
kerns rather than literal space bytes. When the synthesised advance
exceeds tj_space_threshold * font_size, the extractor inserts one
U+0020. Default 0.2 (200 milli-em) sits well between typical
intra-word kerning (10-50 milli-em) and the width of a space
glyph in most fonts (250-300 milli-em). Lower values catch tighter
spaces; higher values reduce false positives in fonts with unusually
wide kerning. This option is separate from space_threshold (which governs the
post-glyph gap between separate text-show operators) because the TJ
numeric kern is measured without any glyph advance baseline and
needs a more sensitive threshold (issue #272).
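The rule described for tj_space_threshold can be sketched as follows. This is a minimal illustration of the documented arithmetic, not the crate's implementation; the function names `tj_advance` and `tj_kern_implies_space` are hypothetical and do not appear in the API.

```rust
/// Text-space advance produced by one TJ numeric adjustment.
/// A negative adjustment moves the next glyph forward, widening the gap.
/// (Hypothetical helper, named here for illustration only.)
fn tj_advance(adjustment: f64, font_size: f64) -> f64 {
    -adjustment / 1000.0 * font_size
}

/// Returns true when the advance is wide enough to count as an implicit
/// U+0020 under the documented rule: advance > tj_space_threshold * font_size.
/// (Hypothetical helper, named here for illustration only.)
fn tj_kern_implies_space(adjustment: f64, font_size: f64, tj_space_threshold: f64) -> bool {
    tj_advance(adjustment, font_size) > tj_space_threshold * font_size
}
```

With the default threshold of 0.2 and a 10-unit font, an adjustment of -250 (250 milli-em) clears the 2.0-unit cutoff and would be treated as a space, while an intra-word kern of -30 would not.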
newline_threshold: f64
Minimum vertical distance to insert newline (in text space units)
sort_by_position: bool
Sort text fragments by position (useful for multi-column layouts)
detect_columns: bool
Detect and handle columns
column_threshold: f64
Column separation threshold (in page units)
merge_hyphenated: bool
Merge hyphenated words at line ends
track_space_decisions: bool
Track space insertion decisions in each TextFragment (default: false).
When false: zero overhead. When true: populates TextFragment::space_decisions.
reconstruct_paragraphs: bool
Reconstruct visual lines and paragraphs from the raw text fragments
produced by PDF text-show operators. When true, the extractor groups
fragments by baseline into single-line fragments, then groups
consecutive lines with normal leading into paragraph-level fragments.
This is what the partition pipeline needs to produce Element values at
paragraph granularity rather than at per-Tj granularity (see
issue #261).
Default false for backward compatibility with direct extract_text
callers. The PdfDocument::partition* entry points force this to
true.
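The two grouping passes described for reconstruct_paragraphs might look roughly like the sketch below. Everything in it is an assumption made for illustration: `Frag` is a hypothetical stand-in (not the crate's TextFragment), and the baseline tolerance `tol` and the leading factor `max_leading` are invented parameters, since the source does not say how "same baseline" or "normal leading" are measured.

```rust
/// Hypothetical stand-in for a positioned fragment; not the crate's TextFragment.
#[derive(Debug, Clone)]
struct Frag {
    x: f64,
    y: f64, // baseline, with y growing upward as in PDF user space
    font_size: f64,
    text: String,
}

/// Pass 1: group fragments whose baselines lie within `tol` into lines,
/// ordered top to bottom, each line ordered left to right.
fn group_lines(mut frags: Vec<Frag>, tol: f64) -> Vec<Vec<Frag>> {
    frags.sort_by(|a, b| b.y.total_cmp(&a.y));
    let mut lines: Vec<Vec<Frag>> = Vec::new();
    for f in frags {
        // Compare against the first fragment placed on the current line.
        let same_line = lines
            .last()
            .map_or(false, |line| (line[0].y - f.y).abs() <= tol);
        if same_line {
            lines.last_mut().unwrap().push(f);
        } else {
            lines.push(vec![f]);
        }
    }
    for line in &mut lines {
        line.sort_by(|a, b| a.x.total_cmp(&b.x));
    }
    lines
}

/// Pass 2: merge consecutive lines into one paragraph while the baseline
/// distance stays within `max_leading` times the previous line's font size.
fn group_paragraphs(lines: &[Vec<Frag>], max_leading: f64) -> Vec<String> {
    let mut paragraphs: Vec<String> = Vec::new();
    let mut prev: Option<(f64, f64)> = None; // (baseline, font size) of previous line
    for line in lines {
        let text = line.iter().map(|f| f.text.as_str()).collect::<Vec<_>>().join(" ");
        let (y, size) = (line[0].y, line[0].font_size);
        let continues = prev.map_or(false, |(py, psize)| py - y <= max_leading * psize);
        if continues {
            let last = paragraphs.last_mut().unwrap();
            last.push(' ');
            last.push_str(&text);
        } else {
            paragraphs.push(text);
        }
        prev = Some((y, size));
    }
    paragraphs
}
```

Joining with a single space is a simplification; a real extractor would also have to decide on spacing within a line and on hyphenation at line ends.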
include_artifacts: bool
Include content inside /Artifact marked-content scopes (page headers,
footers, watermarks, decorative content). Default false: Artifact
content is filtered out, as the PDF/UA conformance level recommends
for accessibility tooling and as RAG callers consistently want
(issue #269 Phase 1). Opt in by setting this to true when extracting
page furniture matters (e.g. forensic auditing, redaction tools).
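Because every field is public, a value can be built with a plain struct literal. The sketch below mirrors the declaration locally so that it compiles on its own; in real code the struct would be imported from the crate. The helper name `audit_options` is hypothetical, and the values marked as placeholders are illustrative guesses, not the crate's defaults (only the defaults stated in the field descriptions are taken from the source).

```rust
// Local mirror of the declaration shown at the top of this page,
// included only to keep the snippet self-contained.
#[derive(Clone, Debug)]
pub struct ExtractionOptions {
    pub preserve_layout: bool,
    pub space_threshold: f64,
    pub tj_space_threshold: f64,
    pub newline_threshold: f64,
    pub sort_by_position: bool,
    pub detect_columns: bool,
    pub column_threshold: f64,
    pub merge_hyphenated: bool,
    pub track_space_decisions: bool,
    pub reconstruct_paragraphs: bool,
    pub include_artifacts: bool,
}

/// Hypothetical helper: options for an audit that must also see headers,
/// footers and watermarks.
fn audit_options() -> ExtractionOptions {
    ExtractionOptions {
        preserve_layout: false,
        space_threshold: 0.3,         // placeholder value
        tj_space_threshold: 0.2,      // documented default
        newline_threshold: 10.0,      // placeholder value
        sort_by_position: true,
        detect_columns: false,
        column_threshold: 50.0,       // placeholder value
        merge_hyphenated: true,
        track_space_decisions: false, // documented default
        reconstruct_paragraphs: true, // paragraph-level fragments
        include_artifacts: true,      // opt in to /Artifact content
    }
}
```

Since the struct implements Clone, a base configuration like this can be cloned and adjusted per document rather than rebuilt field by field.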
Trait Implementations
impl Clone for ExtractionOptions
fn clone(&self) -> ExtractionOptions
fn clone_from(&mut self, source: &Self)