Struct DocumentConfig 

pub struct DocumentConfig {
    pub enabled: bool,
    pub max_pages: usize,
    pub attempt_scanned: bool,
    pub max_upload_bytes: usize,
    pub upload_concurrency: usize,
    pub max_concurrent_parses: usize,
    pub parse_timeout_ms: u64,
    pub max_decompressed_bytes: usize,
    pub sandbox: bool,
    pub sandbox_memory_bytes: u64,
}
Configuration for the [document] section, which controls PDF (and future binary-document) parsing. Every field can be overridden through a CRW_DOCUMENT__* environment variable.
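
As a rough illustration of the override convention, the sketch below derives a variable name from a field name and reads a numeric override with a fallback. The exact mapping (prefix plus upper-cased field name) is an assumption inferred from the CRW_DOCUMENT__* pattern, and env_override_name and usize_override are hypothetical helpers, not part of the crate.

```rust
use std::env;

/// Hypothetical helper: builds the override variable name for a `[document]` field.
/// ASSUMPTION: the suffix is the upper-cased field name,
/// e.g. `max_pages` -> `CRW_DOCUMENT__MAX_PAGES`.
pub fn env_override_name(field: &str) -> String {
    format!("CRW_DOCUMENT__{}", field.to_ascii_uppercase())
}

/// Hypothetical helper: returns the override for `field` when the variable is
/// set and parses as a `usize`; otherwise returns `default`.
pub fn usize_override(field: &str, default: usize) -> usize {
    env::var(env_override_name(field))
        .ok()
        .and_then(|raw| raw.trim().parse::<usize>().ok())
        .unwrap_or(default)
}
```

The real loader may treat unparseable values as an error rather than silently falling back; the fallback here only keeps the sketch short.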

Fields§

§enabled: bool

Master switch for document parsing at runtime (independent of the compile-time pdf cargo feature). When false, PDFs are left unparsed.

§max_pages: usize

Cap on the number of pages converted per document. 0 = no limit.
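
The "0 = no limit" convention can be sketched as follows. pages_to_convert is a hypothetical helper written for illustration; the crate's own logic may be structured differently.

```rust
/// Hypothetical helper: how many pages to convert for a document with
/// `total_pages` pages under a `max_pages` cap, where 0 means "no limit".
pub fn pages_to_convert(total_pages: usize, max_pages: usize) -> usize {
    if max_pages == 0 {
        total_pages
    } else {
        total_pages.min(max_pages)
    }
}
```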

§attempt_scanned: bool

Attempt best-effort text extraction from scanned or image-only PDFs. No OCR is performed, so the result is usually empty.

§max_upload_bytes: usize

Maximum upload size in bytes for POST /v2/parse. Defaults to 50 MB, matching the HTTP renderer’s response cap.

§upload_concurrency: usize

Maximum number of uploads parsed concurrently. This bounds peak memory, because each in-flight upload buffers up to max_upload_bytes.
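
The memory bound follows from simple arithmetic: the worst case is upload_concurrency times max_upload_bytes. The helper below is a hypothetical sketch of that calculation, useful when sizing a host. For example, 4 concurrent uploads of at most 50,000,000 bytes each buffer at most 200,000,000 bytes (the value 4 is illustrative, not a documented default).

```rust
/// Hypothetical helper: worst-case bytes held by upload buffers when every
/// in-flight upload is at the size cap. Saturates instead of overflowing.
pub fn peak_upload_buffer_bytes(upload_concurrency: usize, max_upload_bytes: usize) -> usize {
    upload_concurrency.saturating_mul(max_upload_bytes)
}
```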

§max_concurrent_parses: usize

Process-wide cap on concurrent PDF parses across ALL surfaces (URL scrape, crawl, batch, upload). Bounds peak CPU + decompressed memory: a malicious PDF can decompress far beyond its on-wire size, so this is the primary memory-DoS guard. Independent of upload_concurrency (which only bounds upload body buffering).
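
One common way to enforce a process-wide cap like this is a counting semaphore shared by every surface. The sketch below uses only the standard library; ParseLimiter and ParsePermit are hypothetical names, and the crate may well use an async semaphore instead. The source does not say what a value of 0 means for this field, so the sketch simply rejects it.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical process-wide limiter: at most `max_concurrent_parses`
/// permits exist at any moment.
pub struct ParseLimiter {
    in_flight: Mutex<usize>,
    freed: Condvar,
    max_concurrent_parses: usize,
}

/// RAII permit: dropping it releases the slot and wakes one waiter.
pub struct ParsePermit {
    limiter: Arc<ParseLimiter>,
}

impl ParseLimiter {
    pub fn new(max_concurrent_parses: usize) -> Arc<Self> {
        // ASSUMPTION: 0 is treated as invalid here; the real semantics are not documented.
        assert!(max_concurrent_parses > 0, "limit must be at least 1 in this sketch");
        Arc::new(Self {
            in_flight: Mutex::new(0),
            freed: Condvar::new(),
            max_concurrent_parses,
        })
    }

    /// Blocks until a slot is free, then claims it.
    pub fn acquire(self: &Arc<Self>) -> ParsePermit {
        let mut in_flight = self.in_flight.lock().unwrap();
        while *in_flight >= self.max_concurrent_parses {
            in_flight = self.freed.wait(in_flight).unwrap();
        }
        *in_flight += 1;
        ParsePermit { limiter: Arc::clone(self) }
    }

    /// Number of permits currently held.
    pub fn in_flight(&self) -> usize {
        *self.in_flight.lock().unwrap()
    }
}

impl Drop for ParsePermit {
    fn drop(&mut self) {
        *self.limiter.in_flight.lock().unwrap() -= 1;
        self.limiter.freed.notify_one();
    }
}
```

Every code path that parses a PDF (scrape, crawl, batch, upload) would hold a permit for the duration of the parse, which is what makes the cap process-wide rather than per-endpoint.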

§parse_timeout_ms: u64

Wall-clock timeout (ms) for a single PDF parse. A parse exceeding this returns a timeout error to the caller; protects against pathological documents that spin the parser. 0 disables the timeout.
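
A wall-clock timeout with a "0 disables" convention can be sketched with a worker thread and a channel. parse_with_timeout and ParseFailure are hypothetical names for illustration only. Note that a thread cannot be forcibly stopped, so in this sketch a timed-out worker keeps running in the background; that limitation is one reason a killable child process (see sandbox) is the stronger option.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum ParseFailure {
    /// The parse did not finish within `parse_timeout_ms`.
    Timeout,
    /// The worker thread ended (for example, panicked) without a result.
    WorkerDied,
}

/// Hypothetical helper: runs `parse` on a worker thread and waits at most
/// `parse_timeout_ms` milliseconds for its result. 0 waits indefinitely.
pub fn parse_with_timeout<T, F>(parse_timeout_ms: u64, parse: F) -> Result<T, ParseFailure>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may be gone after a timeout; ignore the send error.
        let _ = tx.send(parse());
    });

    if parse_timeout_ms == 0 {
        return rx.recv().map_err(|_| ParseFailure::WorkerDied);
    }
    match rx.recv_timeout(Duration::from_millis(parse_timeout_ms)) {
        Ok(value) => Ok(value),
        Err(RecvTimeoutError::Timeout) => Err(ParseFailure::Timeout),
        Err(RecvTimeoutError::Disconnected) => Err(ParseFailure::WorkerDied),
    }
}
```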

§max_decompressed_bytes: usize

Decompression-bomb guard: the maximum total number of decompressed bytes that a document’s FlateDecode streams may inflate to. The check runs in bounded memory before the parser starts, so a small file that would expand to many GB is rejected with pdf_too_large after only kilobytes have been allocated. This is the primary guard against OOM-crashing the host. 0 disables it. The default is 100 MiB, which is very large for text extraction (millions of words) yet small next to host RAM. Raise it only if you must parse image-heavy PDFs.
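
The "bounded memory" property comes from counting inflated bytes through a small fixed buffer and stopping as soon as the budget is exceeded, instead of materializing the output. The sketch below shows that counting loop over any Read; in a real guard the reader would be an inflater wrapped around each FlateDecode stream, which needs a decompression crate and is not shown. fits_decompression_budget is a hypothetical name, and this version handles a single stream rather than a per-document total.

```rust
use std::io::{self, ErrorKind, Read};

/// Hypothetical helper: returns Ok(true) if `inflated` yields at most
/// `max_decompressed_bytes` bytes, Ok(false) as soon as it yields more.
/// A limit of 0 disables the check. Memory use is one 8 KiB buffer.
pub fn fits_decompression_budget<R: Read>(
    mut inflated: R,
    max_decompressed_bytes: usize,
) -> io::Result<bool> {
    let mut buf = [0u8; 8 * 1024];
    let mut total: usize = 0;
    loop {
        let n = match inflated.read(&mut buf) {
            Ok(0) => return Ok(true), // end of stream, still within budget
            Ok(n) => n,
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        total = total.saturating_add(n);
        if max_decompressed_bytes != 0 && total > max_decompressed_bytes {
            return Ok(false); // stop early; the rest is never read
        }
    }
}
```

Because the loop returns on the first read that crosses the limit, even an endless stream is rejected after roughly max_decompressed_bytes bytes have been counted.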

§sandbox: bool

Run each PDF parse in an isolated child process (Unix only) instead of in-process. The child gets a hard OS memory ceiling (RLIMIT_AS) and a CPU limit, inherits no environment variables or secrets, and is killed on timeout. A crash, an OOM, or even a hypothetical parser RCE is contained to the child, so the main server (scrape/crawl) keeps running. Costs roughly 1-3 ms of spawn overhead per parse. Recommended for hosts that accept untrusted uploads. Off by default.
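
Two of the properties described above, an empty environment and kill-on-timeout, can be sketched with std::process alone. run_isolated and ChildOutcome are hypothetical names. This sketch does not set RLIMIT_AS or a CPU limit; doing so requires a setrlimit call in the child before exec, which is outside the standard library and is omitted here.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug)]
pub enum ChildOutcome {
    /// The child finished on its own within the deadline.
    Exited(ExitStatus),
    /// The deadline passed and the child was killed.
    TimedOut,
}

/// Hypothetical helper: spawns `cmd` with a cleared environment and no
/// inherited stdio, then polls until it exits or `timeout` elapses.
pub fn run_isolated(mut cmd: Command, timeout: Duration) -> io::Result<ChildOutcome> {
    let mut child = cmd
        .env_clear() // the child inherits no env vars (and so no secrets held in them)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;

    let deadline = Instant::now() + timeout;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(ChildOutcome::Exited(status));
        }
        if Instant::now() >= deadline {
            child.kill()?;
            child.wait()?; // reap the killed child so it does not linger as a zombie
            return Ok(ChildOutcome::TimedOut);
        }
        thread::sleep(Duration::from_millis(5));
    }
}
```

A crash in the child surfaces as a non-success ExitStatus in the parent rather than taking the server down, which is the containment property the field description relies on.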

§sandbox_memory_bytes: u64

Hard address-space limit (bytes) for a sandbox child (RLIMIT_AS). The child is aborted by the OS if it allocates beyond this — the ultimate backstop against memory-DoS even if the decompression guard is bypassed. Default 512 MiB.

Trait Implementations§

impl Clone for DocumentConfig

fn clone(&self) -> DocumentConfig

Returns a duplicate of the value.

1.0.0 (const: unstable)
fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for DocumentConfig

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for DocumentConfig

fn default() -> Self

Returns the “default value” for a type.

impl<'de> Deserialize<'de> for DocumentConfig

fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where __D: Deserializer<'de>,

Deserialize this value from the given Serde deserializer.

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest.

impl<T> DeserializeOwned for T
where T: for<'de> Deserialize<'de>,

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> PolicyExt for T
where T: ?Sized,

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Sized + Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow.

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Sized + Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.