Struct ExtractionOptions 

pub struct ExtractionOptions {
    pub preserve_layout: bool,
    pub space_threshold: f64,
    pub tj_space_threshold: f64,
    pub newline_threshold: f64,
    pub sort_by_position: bool,
    pub detect_columns: bool,
    pub column_threshold: f64,
    pub merge_hyphenated: bool,
    pub track_space_decisions: bool,
    pub reconstruct_paragraphs: bool,
    pub include_artifacts: bool,
}

Text extraction options

Fields§

§preserve_layout: bool

Preserve the original layout (spacing and positioning)

§space_threshold: f64

Minimum gap width at which a space character is inserted (in text space units)

§tj_space_threshold: f64

Threshold for synthesising an implicit U+0020 from a TJ numeric kerning offset, expressed as a fraction of the current font size.

A TJ kern advances the text matrix by -adjustment/1000 * font_size without rendering any glyph; many PDFs (academic publishers, LaTeX, kerned typography) encode inter-word gaps purely as wide negative kerns rather than literal space bytes. When the synthesised advance exceeds tj_space_threshold * font_size, the extractor inserts one U+0020.

Default 0.2 (200 milli-em) sits well between typical intra-word kerning (10-50 milli-em) and the width of a space glyph in most fonts (250-300 milli-em). Lower values catch tighter spaces; higher values reduce false positives in fonts with unusually wide kerning.

Separate from space_threshold (which governs the post-glyph gap between separate text-show operators) because the TJ numeric kern is measured without any glyph advance baseline and needs a more sensitive threshold (issue #272).
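As a worked illustration of the rule above, the following sketch evaluates the advance and the threshold comparison for a few kerning values. The function names tj_advance and kern_implies_space are hypothetical and not part of this crate's API; only the formula and the 0.2 default are taken from the description.

```rust
/// Horizontal advance produced by a TJ numeric adjustment, using the
/// relation quoted in the field description: -adjustment / 1000 * font_size.
fn tj_advance(adjustment: f64, font_size: f64) -> f64 {
    -adjustment / 1000.0 * font_size
}

/// Hypothetical helper: true when the advance exceeds
/// tj_space_threshold * font_size, i.e. when one U+0020 would be synthesised.
fn kern_implies_space(adjustment: f64, font_size: f64, tj_space_threshold: f64) -> bool {
    tj_advance(adjustment, font_size) > tj_space_threshold * font_size
}

fn main() {
    let font_size = 12.0;
    let threshold = 0.2; // documented default

    // A tight intra-word kern, a wide negative kern, and a positive kern
    // (which moves the next glyph backwards rather than opening a gap).
    for adjustment in [-30.0, -250.0, 120.0] {
        println!(
            "adjustment {adjustment}: advance {:.2}, space: {}",
            tj_advance(adjustment, font_size),
            kern_implies_space(adjustment, font_size, threshold)
        );
    }
}
```

Because font_size appears on both sides of the comparison, for a positive font size the test reduces to -adjustment / 1000 > tj_space_threshold. With the default, any negative kern wider than 200 units triggers a space.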

§newline_threshold: f64

Minimum vertical distance at which a newline is inserted (in text space units)

§sort_by_position: bool

Sort text fragments by position (useful for multi-column layouts)
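Position-based sorting can be pictured with the minimal sketch below. It is an illustration only: the Fragment type and sort_reading_order function are hypothetical stand-ins, and the crate's actual ordering logic is not described on this page. The sketch assumes the default PDF coordinate system, where y grows upward, so a larger y means higher on the page.

```rust
/// Hypothetical stand-in for a positioned text fragment.
#[derive(Debug, Clone, PartialEq)]
struct Fragment {
    text: String,
    x: f64,
    y: f64,
}

/// Convenience constructor used by the example.
fn frag(text: &str, x: f64, y: f64) -> Fragment {
    Fragment { text: text.to_string(), x, y }
}

/// Sort top-to-bottom (descending y), then left-to-right (ascending x).
fn sort_reading_order(fragments: &mut [Fragment]) {
    fragments.sort_by(|a, b| b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x)));
}

fn main() {
    let mut fragments = vec![
        frag("right", 300.0, 700.0),
        frag("below", 72.0, 650.0),
        frag("left", 72.0, 700.0),
    ];
    sort_reading_order(&mut fragments);
    for f in &fragments {
        println!("{:>6.1} {:>6.1} {}", f.x, f.y, f.text);
    }
}
```

The comparison on y is exact here, so fragments whose baselines differ by a fraction of a unit would be treated as separate rows; a practical implementation would likely compare baselines within a tolerance.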

§detect_columns: bool

Detect and handle columns

§column_threshold: f64

Column separation threshold (in page units)

§merge_hyphenated: bool

Merge hyphenated words at line ends
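The idea behind merging can be sketched as follows. This is a deliberately naive illustration, not the crate's algorithm: merge_hyphenated_lines is a hypothetical function that joins a line ending in a hyphen directly to the next line.

```rust
/// Hypothetical sketch: join lines, dropping a trailing hyphen when another
/// line follows so the split word is emitted whole.
fn merge_hyphenated_lines(lines: &[&str]) -> String {
    let mut out = String::new();
    for (i, line) in lines.iter().enumerate() {
        let line = line.trim_end();
        let is_last = i + 1 == lines.len();
        match line.strip_suffix('-') {
            // Trailing hyphen before another line: drop it and join directly.
            Some(stem) if !is_last => out.push_str(stem),
            // Otherwise keep the line as-is and separate it from the next one.
            _ => {
                out.push_str(line);
                if !is_last {
                    out.push('\n');
                }
            }
        }
    }
    out
}

fn main() {
    let lines = ["The extrac-", "tion options", "are documented."];
    println!("{}", merge_hyphenated_lines(&lines));
}
```

A rule this simple also removes hyphens that belong to genuine compounds broken at a line end (for example "well-" followed by "known"), so a real implementation needs a more careful heuristic.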

§track_space_decisions: bool

Track space insertion decisions in each TextFragment (default: false). When false: zero overhead. When true: populates TextFragment::space_decisions.

§reconstruct_paragraphs: bool

Reconstruct visual lines and paragraphs from the raw text fragments produced by PDF text-show operators. When true, the extractor groups fragments by baseline into single-line fragments, then groups consecutive lines with normal leading into paragraph-level fragments. This is what the partition pipeline needs to produce Element values at paragraph granularity rather than at per-Tj granularity (see issue #261).

Default false for backward compatibility with direct extract_text callers. The PdfDocument::partition* entry points force this to true.
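The two grouping steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the crate's implementation: Fragment, Line, the baseline tolerance tol, and the max_leading factor are all hypothetical, fragments are assumed to arrive in reading order, and fragments on one line are simply joined with a single space.

```rust
/// Hypothetical stand-in for a raw fragment from one text-show operator.
#[derive(Debug, Clone)]
struct Fragment {
    text: String,
    y: f64, // baseline
    font_size: f64,
}

/// Hypothetical single-line fragment produced by step 1.
#[derive(Debug, Clone)]
struct Line {
    text: String,
    y: f64,
    font_size: f64,
}

fn frag(text: &str, y: f64, font_size: f64) -> Fragment {
    Fragment { text: text.to_string(), y, font_size }
}

/// Step 1: merge consecutive fragments that share a baseline (within `tol`).
fn group_into_lines(fragments: &[Fragment], tol: f64) -> Vec<Line> {
    let mut lines: Vec<Line> = Vec::new();
    for f in fragments {
        let same_line = lines.last().map_or(false, |l| (l.y - f.y).abs() <= tol);
        if same_line {
            let line = lines.last_mut().expect("same_line implies a previous line");
            line.text.push(' ');
            line.text.push_str(&f.text);
        } else {
            lines.push(Line { text: f.text.clone(), y: f.y, font_size: f.font_size });
        }
    }
    lines
}

/// Step 2: join consecutive lines whose downward baseline gap is at most
/// `max_leading` times the previous line's font size.
fn group_into_paragraphs(lines: &[Line], max_leading: f64) -> Vec<String> {
    let mut paragraphs: Vec<String> = Vec::new();
    let mut prev: Option<&Line> = None;
    for line in lines {
        let continues = prev.map_or(false, |p| {
            let gap = p.y - line.y;
            gap > 0.0 && gap <= max_leading * p.font_size
        });
        if continues {
            let para = paragraphs.last_mut().expect("continues implies a previous paragraph");
            para.push(' ');
            para.push_str(&line.text);
        } else {
            paragraphs.push(line.text.clone());
        }
        prev = Some(line);
    }
    paragraphs
}

fn main() {
    let fragments = vec![
        frag("Hello", 700.0, 10.0),
        frag("world", 700.0, 10.0),
        frag("second line", 688.0, 10.0),
        frag("New paragraph", 650.0, 10.0),
    ];
    let lines = group_into_lines(&fragments, 0.5);
    for p in group_into_paragraphs(&lines, 1.5) {
        println!("{p}");
    }
}
```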

§include_artifacts: bool

Include content inside /Artifact marked-content scopes (page headers, footers, watermarks, decorative content). Default false: Artifact content is filtered out, as the PDF/UA conformance level recommends for accessibility tooling and as RAG callers consistently want (issue #269 Phase 1). Opt in by setting this to true when extracting page furniture matters (e.g. forensic auditing, redaction tools).
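The filtering behaviour can be pictured with the sketch below. It is an assumption-laden illustration rather than the crate's implementation: the Event enum and visible_text function are hypothetical, and the content stream is reduced to three event kinds (begin marked content, end marked content, text).

```rust
/// Hypothetical, simplified view of a content stream.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    /// Start of a marked-content scope (BMC/BDC) with its tag, e.g. "Artifact" or "P".
    BeginMarked(String),
    /// End of the innermost marked-content scope (EMC).
    EndMarked,
    /// Text shown by a text-show operator.
    Text(String),
}

/// Collect text, skipping anything nested inside an Artifact scope unless
/// `include_artifacts` is true.
fn visible_text(events: &[Event], include_artifacts: bool) -> Vec<String> {
    // One entry per open scope; true when that scope is tagged Artifact.
    let mut scopes: Vec<bool> = Vec::new();
    let mut out = Vec::new();
    for event in events {
        match event {
            Event::BeginMarked(tag) => scopes.push(tag.as_str() == "Artifact"),
            Event::EndMarked => {
                scopes.pop();
            }
            Event::Text(text) => {
                let in_artifact = scopes.iter().any(|&is_artifact| is_artifact);
                if include_artifacts || !in_artifact {
                    out.push(text.clone());
                }
            }
        }
    }
    out
}

/// Small sample page: a running header marked as an artifact, then body text.
fn sample_events() -> Vec<Event> {
    vec![
        Event::BeginMarked("Artifact".to_string()),
        Event::Text("Page 3".to_string()),
        Event::EndMarked,
        Event::BeginMarked("P".to_string()),
        Event::Text("Body text.".to_string()),
        Event::EndMarked,
    ]
}

fn main() {
    let events = sample_events();
    println!("default: {:?}", visible_text(&events, false));
    println!("opt-in:  {:?}", visible_text(&events, true));
}
```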

Trait Implementations§

impl Clone for ExtractionOptions

fn clone(&self) -> ExtractionOptions

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for ExtractionOptions

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for ExtractionOptions

fn default() -> Self

Returns the “default value” for a type.
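Because every field is public and the struct implements Default and Clone, a configuration can be built with struct update syntax. The sketch below uses a local stand-in for ExtractionOptions so that it compiles on its own; in real code the type would be imported from the crate. The defaults marked as placeholders are not stated on this page, and rag_options is a hypothetical helper.

```rust
/// Local stand-in mirroring the documented fields.
#[derive(Clone, Debug)]
pub struct ExtractionOptions {
    pub preserve_layout: bool,
    pub space_threshold: f64,
    pub tj_space_threshold: f64,
    pub newline_threshold: f64,
    pub sort_by_position: bool,
    pub detect_columns: bool,
    pub column_threshold: f64,
    pub merge_hyphenated: bool,
    pub track_space_decisions: bool,
    pub reconstruct_paragraphs: bool,
    pub include_artifacts: bool,
}

impl Default for ExtractionOptions {
    fn default() -> Self {
        Self {
            // Defaults stated in the field descriptions.
            tj_space_threshold: 0.2,
            track_space_decisions: false,
            reconstruct_paragraphs: false,
            include_artifacts: false,
            // Placeholders: the real defaults are not given on this page.
            preserve_layout: false,
            space_threshold: 0.0,
            newline_threshold: 0.0,
            sort_by_position: false,
            detect_columns: false,
            column_threshold: 0.0,
            merge_hyphenated: false,
        }
    }
}

/// Hypothetical preset: paragraph-level output with hyphenated words merged,
/// everything else left at its default.
fn rag_options() -> ExtractionOptions {
    ExtractionOptions {
        reconstruct_paragraphs: true,
        merge_hyphenated: true,
        ..Default::default()
    }
}

fn main() {
    let opts = rag_options();
    // Clone lets a second configuration be derived without consuming the first.
    let forensic = ExtractionOptions {
        include_artifacts: true,
        ..opts.clone()
    };
    println!("{opts:#?}");
    println!("{forensic:#?}");
}
```

Overriding only the fields of interest keeps call sites stable if further options are added to the struct later.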

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

This is a nightly-only experimental API. (clone_to_uninit)

Performs copy-assignment from self to dest.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> Same for T

type Output = T

Should always be Self

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.