pub struct PdfLoaderConfig {
pub min_chunk_size: usize,
pub max_chunk_size: usize,
pub chunk_overlap: usize,
pub default_weight: f32,
pub include_metadata: bool,
pub split_by_sentence: bool,
pub split_by_paragraph: bool,
pub clean_text: bool,
pub remove_page_numbers: bool,
pub dehyphenate: bool,
}
Configuration for how PDFs are converted to training data.
Fields§
min_chunk_size: usize
Minimum length of a chunk (in characters)
max_chunk_size: usize
Maximum length of a chunk (in characters)
chunk_overlap: usize
Overlap between chunks (in characters)
default_weight: f32
Default weight for generated training examples
include_metadata: bool
Whether metadata such as page number and position in the document should be added
split_by_sentence: bool
Whether chunks should be split at sentence boundaries
split_by_paragraph: bool
Whether to split primarily at paragraph boundaries (\n\n)
clean_text: bool
Whether to apply the text cleaning pipeline
remove_page_numbers: bool
Whether to remove page number lines
dehyphenate: bool
Whether to rejoin words hyphenated across line breaks
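The fields can be easier to read with concrete values. The sketch below is only an illustration under stated assumptions: it declares a local mirror of the struct so that it compiles on its own, and `example_config`, `check_sizes`, and `window_chunks` are hypothetical helpers that are not part of the documented API. The values are not the crate's defaults, and the sliding-window reading of `max_chunk_size` and `chunk_overlap` is an assumed interpretation that ignores the sentence, paragraph, and cleaning options.

```rust
// Local mirror of the documented struct so the sketch is self-contained.
// In real code, import PdfLoaderConfig from the crate instead.
pub struct PdfLoaderConfig {
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
    pub chunk_overlap: usize,
    pub default_weight: f32,
    pub include_metadata: bool,
    pub split_by_sentence: bool,
    pub split_by_paragraph: bool,
    pub clean_text: bool,
    pub remove_page_numbers: bool,
    pub dehyphenate: bool,
}

/// Hypothetical example values (assumption: these are NOT the crate's defaults).
pub fn example_config() -> PdfLoaderConfig {
    PdfLoaderConfig {
        min_chunk_size: 100,
        max_chunk_size: 1000,
        chunk_overlap: 100,
        default_weight: 1.0,
        include_metadata: true,
        split_by_sentence: true,
        split_by_paragraph: true,
        clean_text: true,
        remove_page_numbers: true,
        dehyphenate: true,
    }
}

/// Hypothetical sanity check on the size fields (not part of the documented API).
pub fn check_sizes(cfg: &PdfLoaderConfig) -> Result<(), String> {
    if cfg.max_chunk_size == 0 {
        return Err("max_chunk_size must be greater than zero".to_string());
    }
    if cfg.min_chunk_size > cfg.max_chunk_size {
        return Err("min_chunk_size must not exceed max_chunk_size".to_string());
    }
    if cfg.chunk_overlap >= cfg.max_chunk_size {
        return Err("chunk_overlap must be smaller than max_chunk_size".to_string());
    }
    Ok(())
}

/// Assumed sliding-window reading of the size fields: each window holds at most
/// `max_chunk_size` characters, consecutive windows share `chunk_overlap`
/// characters, and a trailing window shorter than `min_chunk_size` is dropped.
pub fn window_chunks(text: &str, cfg: &PdfLoaderConfig) -> Result<Vec<String>, String> {
    check_sizes(cfg)?;
    // Work on chars, since the documented sizes are given in characters.
    let chars: Vec<char> = text.chars().collect();
    // check_sizes guarantees chunk_overlap < max_chunk_size, so step > 0.
    let step = cfg.max_chunk_size - cfg.chunk_overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + cfg.max_chunk_size).min(chars.len());
        if end - start >= cfg.min_chunk_size {
            chunks.push(chars[start..end].iter().collect());
        }
        if end == chars.len() {
            break;
        }
        start += step;
    }
    Ok(chunks)
}
```

Rejecting an overlap that is not smaller than the maximum chunk size keeps the window step positive, so the loop in this sketch always terminates.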
Trait Implementations§
Auto Trait Implementations§
impl Freeze for PdfLoaderConfig
impl RefUnwindSafe for PdfLoaderConfig
impl Send for PdfLoaderConfig
impl Sync for PdfLoaderConfig
impl Unpin for PdfLoaderConfig
impl UnsafeUnpin for PdfLoaderConfig
impl UnwindSafe for PdfLoaderConfig
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.