Struct PdfStream 

pub struct PdfStream { /* private fields */ }

On-demand page extractor backed by a single parsed pdf_oxide::PdfDocument.

pdf_oxide parses the file lazily: open reads only the xref table and the catalog, and returns in tens of milliseconds even on the 31 MB, 1310-page PDF reference. On the same file, lopdf::Document::load (the old backend) took ~40 s because it eagerly decompressed every content stream. Per-page extraction takes under a millisecond when warm and hundreds of microseconds when cold.

pdf_oxide::PdfDocument is Send + Sync (its interior-mutable caches are Mutex-guarded), so a PdfStream can be wrapped in an Arc and shared between the main thread, which renders the first visible page, and the background loader thread, which extracts the rest of the document. This is the same sharing pattern the lopdf-backed version used.
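The sharing pattern described above can be sketched with a stand-in type. FakeStream, its fields, and load_all are hypothetical names used only for illustration; they are not part of hygg or pdf_oxide. The sketch only shows the Arc plus Mutex-guarded cache arrangement, with upper-casing standing in for real extraction work.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for PdfStream: immutable page data plus a
// Mutex-guarded cache, which is what makes the type Send + Sync.
struct FakeStream {
    pages: Vec<String>,
    cache: Mutex<HashMap<usize, String>>,
}

impl FakeStream {
    fn new(pages: Vec<String>) -> Self {
        FakeStream { pages, cache: Mutex::new(HashMap::new()) }
    }

    fn total_pages(&self) -> usize {
        self.pages.len()
    }

    // 1-based index, mirroring the documented extract_page contract.
    fn extract_page(&self, page_index: usize) -> Option<String> {
        if page_index == 0 || page_index > self.pages.len() {
            return None;
        }
        let mut cache = self.cache.lock().ok()?;
        let text = cache
            .entry(page_index)
            // Upper-casing is a placeholder for the real extraction step.
            .or_insert_with(|| self.pages[page_index - 1].to_uppercase());
        Some(text.clone())
    }
}

// The main thread takes page 1 while a background thread extracts the rest.
fn load_all(stream: Arc<FakeStream>) -> (Option<String>, Vec<Option<String>>) {
    let loader = {
        let stream = Arc::clone(&stream);
        thread::spawn(move || {
            (2..=stream.total_pages())
                .map(|i| stream.extract_page(i))
                .collect::<Vec<_>>()
        })
    };
    let first = stream.extract_page(1);
    let rest = loader.join().expect("loader thread panicked");
    (first, rest)
}

fn main() {
    let pages = vec!["one".to_string(), "two".to_string(), "three".to_string()];
    let (first, rest) = load_all(Arc::new(FakeStream::new(pages)));
    println!("first page: {:?}", first);
    println!("remaining pages: {:?}", rest);
}
```

Because both threads hold only a shared reference through the Arc, every method on the shared type has to take &self, which matches the signatures listed for PdfStream.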

Implementations

impl PdfStream

pub fn open(pdf_path: &str) -> Result<Self, Box<dyn Error>>

Open a PDF and parse its catalog. Does not extract any page text.

pub fn open_with_bundled_ocr(pdf_path: &str) -> Result<Self, Box<dyn Error>>

pub fn total_pages(&self) -> usize

pub fn canonical_path(&self) -> &Path

pub fn extract_page(&self, page_index: usize) -> Option<String>

Extract sanitized text for a single page.

page_index is 1-based to match the historical lopdf-backed API (the rest of hygg counts pages from 1 in saved progress, the status line, and so on). Returns None if the index is out of range, the page has no extractable text, or extraction panicked. pdf_oxide claims a 100% pass rate on its 3,830-PDF corpus, but extraction is still wrapped in catch_unwind so that a misbehaving page can’t take down the background loader thread and leave every later page stuck on “loading”.
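A minimal sketch of the guarded-extraction contract described above, using only the standard library. extract_guarded and its closure parameter are hypothetical; the real method calls into pdf_oxide, whose API is not shown here, and the conversion to a 0-based index is an assumption about the backend.

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical helper: validates a 1-based index, runs the extractor
// behind catch_unwind, and maps every failure mode to None.
fn extract_guarded<F>(page_index: usize, total_pages: usize, extract: F) -> Option<String>
where
    F: FnOnce(usize) -> Option<String>,
{
    // Out-of-range indices (including 0) yield None.
    if page_index == 0 || page_index > total_pages {
        return None;
    }
    // Assumption: the underlying extractor counts pages from 0.
    let zero_based = page_index - 1;

    // A panic inside the extractor becomes Err, which .ok() turns into None.
    let text = panic::catch_unwind(AssertUnwindSafe(|| extract(zero_based)))
        .ok()
        .flatten()?;

    // A page with no extractable text also yields None.
    if text.trim().is_empty() {
        None
    } else {
        Some(text)
    }
}

fn main() {
    let ok = extract_guarded(1, 3, |i| Some(format!("text of page {}", i)));
    println!("{:?}", ok);

    // The panic message still goes to stderr, but the caller only sees None.
    let bad = extract_guarded(2, 3, |_| -> Option<String> { panic!("corrupt stream") });
    println!("{:?}", bad);
}
```

Collapsing all three failure modes into None keeps the loader loop simple: it can record the page as empty and move on to the next one.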

Uses pdf_oxide’s positional extract_text_lines rather than the simpler extract_text. extract_text_lines returns each visual line with its bounding box; lines that share a row (overlapping y ranges) are grouped and joined left to right. Without that step, pdf_oxide can interleave adjacent TOC entries (for example “1.3 Foo1.4 Bar 3231”), and the downstream sanitizer can’t recover them.
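The row-grouping step can be sketched as follows. The Line struct and its fields are hypothetical and do not mirror pdf_oxide's actual types. The sketch also assumes PDF-style coordinates where y grows upward, and it compares each line only against the first line of the current row, which is one of several reasonable overlap rules.

```rust
// Hypothetical line record: text plus the left edge and vertical extent
// of its bounding box.
#[derive(Debug, Clone)]
struct Line {
    text: String,
    x: f32,
    y_min: f32,
    y_max: f32,
}

// Two lines share a row when their vertical ranges overlap.
fn overlaps(a: &Line, b: &Line) -> bool {
    a.y_min < b.y_max && b.y_min < a.y_max
}

fn join_rows(mut lines: Vec<Line>) -> Vec<String> {
    // Top of the page first (assumes y grows upward).
    lines.sort_by(|a, b| b.y_max.total_cmp(&a.y_max));

    let mut rows: Vec<Vec<Line>> = Vec::new();
    for line in lines {
        let same_row = rows.last().map_or(false, |row| overlaps(&row[0], &line));
        if same_row {
            if let Some(row) = rows.last_mut() {
                row.push(line);
            }
        } else {
            rows.push(vec![line]);
        }
    }

    // Within each row, order fragments left to right and join with a space.
    rows.into_iter()
        .map(|mut row| {
            row.sort_by(|a, b| a.x.total_cmp(&b.x));
            row.iter().map(|l| l.text.as_str()).collect::<Vec<_>>().join(" ")
        })
        .collect()
}

fn main() {
    let mk = |text: &str, x: f32, y_min: f32, y_max: f32| Line {
        text: text.to_string(),
        x,
        y_min,
        y_max,
    };
    // Two TOC entries whose titles and page numbers arrive out of order.
    let lines = vec![
        mk("1.3 Foo", 50.0, 700.0, 712.0),
        mk("1.4 Bar", 50.0, 686.0, 698.0),
        mk("14", 500.0, 686.0, 698.0),
        mk("12", 500.0, 700.0, 712.0),
    ];
    for row in join_rows(lines) {
        println!("{}", row);
    }
}
```

Sorting by position before joining means each TOC title stays next to its own page number regardless of the order in which the fragments were emitted.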

pub fn extract_page_with_images(&self, page_index: usize, col: usize) -> Option<PdfRenderedPage>

Auto Trait Implementations

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> Same for T

type Output = T

Should always be Self

impl<U, T> ToOwnedObj<U> for T
where U: FromObjRef<T>,

fn to_owned_obj(&self, data: FontData<'_>) -> U

Convert this type into T, using the provided data to resolve any offsets.

impl<U, T> ToOwnedTable<U> for T
where U: FromTableRef<T>,

fn to_owned_table(&self) -> U

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V