pub struct PdfStream { /* private fields */ }
On-demand page extractor backed by a single parsed pdf_oxide::PdfDocument.
pdf_oxide parses the file lazily: open reads only the xref table and the
catalog, and returns in tens of milliseconds even on the 31 MB / 1310-page PDF
reference, where lopdf::Document::load (the old backend) took ~40 s
because it eagerly decompressed every content stream. Per-page
extraction is sub-millisecond when warm and takes hundreds of microseconds
when cold.
pdf_oxide::PdfDocument is Send + Sync (its interior-mutable caches
are Mutex-guarded), so a PdfStream can be wrapped in Arc and
shared between the main thread (rendering the first visible page) and
the background loader thread (extracting the rest of the document) the
same way the lopdf-backed version was.
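The sharing pattern described above can be sketched with the standard library alone. FakeStream below is a hypothetical stand-in for PdfStream (it is not part of hygg or pdf_oxide); it only mimics the shape that matters here: immutable page data, a Mutex-guarded cache, and a 1-based extract_page that takes &self. The load_all helper is likewise an illustrative name.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for `PdfStream`: immutable page data plus a
/// Mutex-guarded cache, so extraction through `&self` is thread-safe.
struct FakeStream {
    pages: Vec<String>,
    cache: Mutex<HashMap<usize, String>>,
}

impl FakeStream {
    fn new(pages: Vec<String>) -> Self {
        Self { pages, cache: Mutex::new(HashMap::new()) }
    }

    fn total_pages(&self) -> usize {
        self.pages.len()
    }

    /// 1-based, like `PdfStream::extract_page`.
    fn extract_page(&self, page_index: usize) -> Option<String> {
        let mut cache = self.cache.lock().ok()?;
        if let Some(hit) = cache.get(&page_index) {
            return Some(hit.clone());
        }
        // Index 0 is rejected by `checked_sub`; indices past the end by `get`.
        let text = self.pages.get(page_index.checked_sub(1)?)?.clone();
        cache.insert(page_index, text.clone());
        Some(text)
    }
}

/// Extracts page 1 on the calling thread while a background thread
/// extracts pages 2..=N from the same shared stream.
fn load_all(stream: Arc<FakeStream>) -> Vec<Option<String>> {
    let background = {
        let stream = Arc::clone(&stream);
        thread::spawn(move || {
            (2..=stream.total_pages())
                .map(|i| stream.extract_page(i))
                .collect::<Vec<_>>()
        })
    };
    let first = stream.extract_page(1);
    let mut all = vec![first];
    all.extend(background.join().expect("loader thread panicked"));
    all
}
```

Because every method takes &self, neither thread needs exclusive access; the Arc only keeps the stream alive until both sides are done with it.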
Implementations
impl PdfStream
pub fn open(pdf_path: &str) -> Result<Self, Box<dyn Error>>
Open a PDF and parse its catalog. Does not extract any page text.
pub fn open_with_bundled_ocr(pdf_path: &str) -> Result<Self, Box<dyn Error>>
pub fn total_pages(&self) -> usize
pub fn canonical_path(&self) -> &Path
pub fn extract_page(&self, page_index: usize) -> Option<String>
Extract sanitized text for a single page.
page_index is 1-based to match the historical lopdf-backed API
(the rest of hygg counts pages from 1 in saved progress, status
line, etc.). Returns None if the index is out of range, the page
has no extractable text, or extraction panicked. pdf_oxide claims a
100 % pass rate on its 3 830-PDF corpus, but we still wrap in
catch_unwind so a misbehaving page can’t take down the background
loader thread and leave every later page stuck on “loading”.
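A minimal sketch of that contract, using only std::panic. The raw_extract function is a hypothetical placeholder for the real pdf_oxide call, and the slice of strings stands in for the document; only the 1-based indexing, the three None cases, and the catch_unwind wrapper are taken from the description above. Note that catch_unwind only catches unwinding panics, so this assumes the binary is not built with panic = "abort".

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical extractor standing in for the pdf_oxide call.
/// Takes a 0-based index and panics on a "malformed" page.
fn raw_extract(pages: &[&str], zero_based: usize) -> String {
    let text = pages[zero_based];
    if text == "<corrupt>" {
        panic!("malformed content stream");
    }
    text.to_string()
}

/// 1-based wrapper: returns `None` when the index is out of range,
/// the page has no text, or extraction panicked.
fn extract_page(pages: &[&str], page_index: usize) -> Option<String> {
    // Page 0 does not exist in a 1-based scheme.
    let zero_based = page_index.checked_sub(1)?;
    if zero_based >= pages.len() {
        return None;
    }
    // A panic inside the closure becomes `Err`, which `.ok()?` maps to `None`
    // instead of unwinding through the calling thread.
    let text = catch_unwind(AssertUnwindSafe(|| raw_extract(pages, zero_based))).ok()?;
    if text.trim().is_empty() {
        None
    } else {
        Some(text)
    }
}
```

With this shape, a loader loop can keep calling extract_page for later pages after one page fails, which is the failure mode the wrapper is meant to prevent.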
Uses pdf_oxide’s positional extract_text_lines rather than the
simpler extract_text. The former returns each visual line with
its bounding box; we group lines that share a row (overlapping y
ranges) and join them left-to-right. Without that step pdf_oxide
can interleave adjacent TOC entries — “1.3 Foo1.4 Bar 3231” — and
the downstream sanitizer can’t recover them.
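The row-grouping step might look roughly like the sketch below. The Line struct and its field names are assumptions made for illustration; the type pdf_oxide actually returns from extract_text_lines may be shaped differently. The sketch also assumes PDF-style coordinates in which y grows upward, and compares each line only against the first line of the current row, which is a simplification.

```rust
/// Hypothetical line record; pdf_oxide's real type and field names may differ.
#[derive(Debug, Clone)]
struct Line {
    text: String,
    x: f32,
    y_min: f32,
    y_max: f32,
}

/// True when the two vertical ranges intersect.
fn overlaps(a: &Line, b: &Line) -> bool {
    a.y_min < b.y_max && b.y_min < a.y_max
}

/// Groups lines that share a visual row and joins each row left-to-right.
fn group_rows(mut lines: Vec<Line>) -> Vec<String> {
    // Top of the page first (assumes y grows upward).
    lines.sort_by(|a, b| b.y_max.total_cmp(&a.y_max));

    let mut rows: Vec<Vec<Line>> = Vec::new();
    for line in lines {
        // Compare against the first line of the most recent row.
        let same_row = rows.last().map_or(false, |row| overlaps(&row[0], &line));
        if same_row {
            rows.last_mut().expect("row exists when same_row is true").push(line);
        } else {
            rows.push(vec![line]);
        }
    }

    rows.into_iter()
        .map(|mut row| {
            // Within a row, order fragments by their left edge.
            row.sort_by(|a, b| a.x.total_cmp(&b.x));
            row.iter().map(|l| l.text.as_str()).collect::<Vec<_>>().join(" ")
        })
        .collect()
}
```

Fed four fragments from two table-of-contents rows in scrambled order, this yields one string per row with the entry title before its page number, rather than titles and page numbers run together.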
pub fn extract_page_with_images(&self, page_index: usize, col: usize) -> Option<PdfRenderedPage>
Auto Trait Implementations
impl !Freeze for PdfStream
impl RefUnwindSafe for PdfStream
impl Send for PdfStream
impl Sync for PdfStream
impl Unpin for PdfStream
impl UnsafeUnpin for PdfStream
impl UnwindSafe for PdfStream
Blanket Implementations
impl<T> BorrowMut<T> for T where
T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
impl<T> Pointable for T
impl<U, T> ToOwnedObj<U> for T where
U: FromObjRef<T>,
fn to_owned_obj(&self, data: FontData<'_>) -> U
Convert this type into T, using the provided data to resolve any offsets.