pub struct Extractor { /* private fields */ }
PDF text extractor with configurable behavior.
Use the builder pattern to configure extraction, then call extract, extract_document, or pages.
§Example
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let text = Extractor::new()
        .parallel(false)
        .normalize_whitespace(true)
        .extract(&data)?;
    println!("{text}");
    Ok(())
}
Implementations§
impl Extractor
pub fn new() -> Self
Creates a new extractor with default configuration.
§Example
use pdfvec::Extractor;
let extractor = Extractor::new();
pub fn with_config(config: Config) -> Self
Creates a new extractor with the given configuration.
§Example
use pdfvec::{Extractor, Config};
let config = Config::default().with_parallel(false);
let extractor = Extractor::with_config(config);
pub fn parallel(self, enabled: bool) -> Self
Enables or disables parallel page processing.
Default: true
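To make the trade-off behind this switch concrete, here is a minimal standard-library sketch of sequential versus parallel per-page work. It is not pdfvec's implementation: `process_page` and `process_pages` are hypothetical stand-ins, and a real extractor would more likely use a thread pool than one thread per page.

```rust
use std::thread;

/// Hypothetical stand-in for the work done on a single page.
fn process_page(raw: &str) -> String {
    raw.trim().to_uppercase()
}

/// Processes pages sequentially or in parallel.
/// The output order matches the input order in both modes.
fn process_pages(pages: &[&str], parallel: bool) -> Vec<String> {
    if !parallel {
        return pages.iter().map(|p| process_page(p)).collect();
    }
    thread::scope(|scope| {
        // Spawn one scoped thread per page (assumption: fine for a sketch,
        // wasteful for documents with thousands of pages).
        let handles: Vec<_> = pages
            .iter()
            .map(|p| scope.spawn(move || process_page(p)))
            .collect();
        // Joining in spawn order keeps the results in document order.
        handles
            .into_iter()
            .map(|h| h.join().expect("page worker panicked"))
            .collect::<Vec<String>>()
    })
}
```

Calling `process_pages(&pages, false)` and `process_pages(&pages, true)` on the same input yields the same vector; only the scheduling differs.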
§Example
use pdfvec::Extractor;
let extractor = Extractor::new().parallel(false);
pub fn page_separator(self, separator: impl Into<String>) -> Self
Sets the string inserted between pages.
Default: "\n\n"
§Example
use pdfvec::Extractor;
let extractor = Extractor::new().page_separator("\n---\n");
pub fn normalize_whitespace(self, enabled: bool) -> Self
Enables or disables whitespace normalization.
When enabled, consecutive whitespace is collapsed to single spaces.
Default: true
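As a rough illustration of what collapsing whitespace means, the sketch below uses only the standard library. `collapse_whitespace` is a hypothetical helper, not part of pdfvec, and it makes one assumption the description above does not confirm: leading and trailing whitespace is dropped entirely.

```rust
/// Collapses every run of whitespace (spaces, tabs, newlines) into one space.
/// Assumption: leading and trailing whitespace is removed, which is a side
/// effect of `split_whitespace` and may differ from the library's behavior.
fn collapse_whitespace(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

For example, `collapse_whitespace("Hello \t\n  world  ")` returns `"Hello world"`.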
§Example
use pdfvec::Extractor;
let extractor = Extractor::new().normalize_whitespace(false);
pub fn config(&self) -> &Config
Returns a reference to the extractor’s configuration.
§Example
use pdfvec::Extractor;
let extractor = Extractor::new();
assert!(extractor.config().parallel());
pub fn extract(&self, data: &[u8]) -> Result<String>
Extracts all text from PDF data as a single string.
Pages are joined with the configured separator.
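The joining step can be pictured with a small standard-library sketch. `join_pages` is a hypothetical helper, not a pdfvec function; it only shows that the separator goes between pages and is not appended after the last one.

```rust
/// Joins per-page text with `separator` placed between consecutive pages.
/// No separator is added before the first page or after the last.
fn join_pages(pages: &[String], separator: &str) -> String {
    pages.join(separator)
}
```

With the documented default separator `"\n\n"`, two pages `"Page one"` and `"Page two"` become `"Page one\n\nPage two"`.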
§Errors
Returns an error if:
- The data is empty (Error::EmptyDocument)
- The PDF structure is invalid (Error::InvalidStructure)
- The PDF is encrypted (Error::EncryptedDocument)
- Page extraction fails (Error::PageExtractionFailed)
§Example
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let text = Extractor::new().extract(&data)?;
    println!("Extracted {} characters", text.len());
    Ok(())
}
pub fn extract_document(&self, data: &[u8]) -> Result<Document>
Extracts a structured Document from PDF data.
Returns a document with individual page access.
§Errors
Same as extract.
§Example
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let doc = Extractor::new().extract_document(&data)?;
    for page in doc.pages() {
        println!("Page {}: {}", page.number(), page.char_count());
    }
    Ok(())
}
pub fn pages<'a>(&'a self, data: &'a [u8]) -> PageIterator<'a>
Returns a streaming iterator over pages.
Pages are extracted on demand, keeping memory usage constant regardless of PDF size. This is ideal for processing large documents.
Note: Streaming extraction is always sequential (not parallel).
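The on-demand behavior described above can be sketched with a plain Rust iterator. Everything here is hypothetical and only mirrors the shape of the API: `SketchPage`, `SketchPages`, and `stream_pages` are illustrative names, the error type is a plain `String`, and the "pages" are pre-split strings instead of PDF data.

```rust
/// Hypothetical page type for illustration; not pdfvec's page type.
struct SketchPage {
    number: usize,
    text: String,
}

/// Lazily yields one page at a time from a slice of raw page sources.
struct SketchPages<'a> {
    raw: &'a [&'a str],
    next: usize,
}

impl<'a> Iterator for SketchPages<'a> {
    type Item = Result<SketchPage, String>;

    fn next(&mut self) -> Option<Self::Item> {
        // Stop once every raw page has been visited.
        let raw = self.raw.get(self.next)?;
        self.next += 1; // `self.next` now holds the 1-based page number.
        // The per-page work happens here, only when the caller asks for it;
        // previously yielded pages are not retained by the iterator.
        if raw.is_empty() {
            return Some(Err(format!("page {} is empty", self.next)));
        }
        Some(Ok(SketchPage {
            number: self.next,
            text: (*raw).to_string(),
        }))
    }
}

/// Creates the streaming iterator; no page is processed until `next` is called.
fn stream_pages<'a>(raw: &'a [&'a str]) -> SketchPages<'a> {
    SketchPages { raw, next: 0 }
}
```

Because each item is a `Result`, a caller can stop at the first failing page with `?` or skip it and continue, without ever holding more than one page's text.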
§Example
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("large.pdf")?;
    for page_result in Extractor::new().pages(&data) {
        let page = page_result?;
        if page.char_count() > 100 {
            println!("Page {} has content", page.number());
        }
    }
    Ok(())
}
pub fn extract_metadata(&self, data: &[u8]) -> Result<Metadata>
Extracts document metadata without processing page content.
This is faster than full extraction when you only need metadata like title, author, or creation date.
§Errors
Returns an error if:
- The data is empty (Error::EmptyDocument)
- The PDF structure is invalid (Error::PdfLibrary)
§Example
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let meta = Extractor::new().extract_metadata(&data)?;
    println!("Title: {:?}", meta.title());
    println!("Author: {:?}", meta.author());
    println!("Pages: {}", meta.page_count());
    Ok(())
}
Trait Implementations§
Auto Trait Implementations§
impl Freeze for Extractor
impl RefUnwindSafe for Extractor
impl Send for Extractor
impl Sync for Extractor
impl Unpin for Extractor
impl UnwindSafe for Extractor
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.