pub struct DocumentConfig {
    pub enabled: bool,
    pub max_pages: usize,
    pub attempt_scanned: bool,
    pub max_upload_bytes: usize,
    pub upload_concurrency: usize,
    pub max_concurrent_parses: usize,
    pub parse_timeout_ms: u64,
    pub max_decompressed_bytes: usize,
    pub sandbox: bool,
    pub sandbox_memory_bytes: u64,
}
[document] section — controls PDF (and future binary-document) parsing.
All knobs honor CRW_DOCUMENT__* env overrides.
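The exact variable names behind `CRW_DOCUMENT__*` are not spelled out here. As a rough sketch, assuming the suffix is simply the upper-cased field name (so `max_pages` becomes `CRW_DOCUMENT__MAX_PAGES`) and that values parse via `FromStr`, an override lookup might look like the following. The helpers `env_key`, `apply_override`, and `with_env_override` are hypothetical and not part of the crate.

```rust
use std::env;
use std::str::FromStr;

/// Build the ASSUMED env-variable name for a `[document]` field,
/// e.g. "max_pages" -> "CRW_DOCUMENT__MAX_PAGES".
/// The upper-cased suffix is an assumption, not confirmed by the docs.
fn env_key(field: &str) -> String {
    format!("CRW_DOCUMENT__{}", field.to_uppercase())
}

/// Prefer the raw override when it is present and parses as `T`;
/// otherwise keep the value that came from the config file.
/// A real loader might report a parse failure instead of ignoring it.
fn apply_override<T: FromStr>(raw: Option<&str>, file_value: T) -> T {
    raw.and_then(|s| s.trim().parse::<T>().ok())
        .unwrap_or(file_value)
}

/// Look up the assumed env variable for `field` and apply it on top
/// of the file value.
fn with_env_override<T: FromStr>(field: &str, file_value: T) -> T {
    let raw = env::var(env_key(field)).ok();
    apply_override(raw.as_deref(), file_value)
}
```

Keeping the parsing step (`apply_override`) separate from the environment read makes the precedence rule easy to check without mutating process-wide state.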
Fields
enabled: bool
Master switch for document parsing at runtime (independent of the
compile-time pdf cargo feature). When false, PDFs are left unparsed.
max_pages: usize
Cap on the number of pages converted per document. 0 = no limit.
attempt_scanned: bool
Best-effort extraction from scanned/image PDFs. No OCR is performed, so the
result is usually empty.
max_upload_bytes: usize
Maximum upload size in bytes for POST /v2/parse. Defaults to 50 MB,
matching the HTTP renderer’s response cap.
upload_concurrency: usize
Maximum number of uploads parsed concurrently. Bounds peak memory, since
each in-flight upload buffers up to max_upload_bytes.
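The memory bound described above is a simple product. A minimal sketch of the arithmetic follows, using an illustrative concurrency of 4 (not a documented default) and reading "50 MB" as 50 × 1024 × 1024 bytes, which is also an assumption. The function name is hypothetical.

```rust
/// Worst-case bytes held in upload buffers when every upload slot is
/// occupied by a maximum-size body. Saturates instead of overflowing
/// if the settings are extreme.
fn peak_upload_buffer_bytes(upload_concurrency: usize, max_upload_bytes: usize) -> usize {
    upload_concurrency.saturating_mul(max_upload_bytes)
}
```

With 4 slots and a 50 MiB cap, the buffers alone can reach 200 MiB, before any memory used by parsing itself.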
max_concurrent_parses: usize
Process-wide cap on concurrent PDF parses across ALL surfaces (URL
scrape, crawl, batch, upload). Bounds peak CPU and decompressed memory: a
malicious PDF can decompress far beyond its on-wire size, so this is the
primary memory-DoS guard. Independent of upload_concurrency, which
only bounds upload body buffering.
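A process-wide cap of this kind is commonly enforced with a counting semaphore shared by every entry point. The sketch below uses only the standard library. `ParseGate` and `ParsePermit` are hypothetical names, and the crate may well use an async semaphore instead.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Minimal counting semaphore shared by all parse call sites.
pub struct ParseGate {
    available: Mutex<usize>,
    freed: Condvar,
}

/// RAII permit: its slot is returned to the gate when it is dropped.
pub struct ParsePermit {
    gate: Arc<ParseGate>,
}

impl ParseGate {
    pub fn new(max_concurrent_parses: usize) -> Arc<Self> {
        Arc::new(Self {
            available: Mutex::new(max_concurrent_parses),
            freed: Condvar::new(),
        })
    }

    /// Block until a slot is free, then take it.
    /// Note: in this sketch a cap of 0 would block forever.
    pub fn acquire(self: &Arc<Self>) -> ParsePermit {
        let mut slots = self.available.lock().unwrap();
        while *slots == 0 {
            slots = self.freed.wait(slots).unwrap();
        }
        *slots -= 1;
        ParsePermit { gate: Arc::clone(self) }
    }

    /// Take a slot only if one is free right now.
    pub fn try_acquire(self: &Arc<Self>) -> Option<ParsePermit> {
        let mut slots = self.available.lock().unwrap();
        if *slots == 0 {
            return None;
        }
        *slots -= 1;
        Some(ParsePermit { gate: Arc::clone(self) })
    }
}

impl Drop for ParsePermit {
    fn drop(&mut self) {
        // Return the slot and wake one waiter, if any.
        *self.gate.available.lock().unwrap() += 1;
        self.gate.freed.notify_one();
    }
}
```

Because the permit releases its slot on drop, an early return or a panic inside a parse does not leak capacity.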
parse_timeout_ms: u64
Wall-clock timeout (ms) for a single PDF parse. A parse exceeding this
returns a timeout error to the caller; protects against pathological
documents that spin the parser. 0 disables the timeout.
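One way to picture the timeout semantics, including the "0 disables" rule, is a worker thread whose result is awaited with a deadline. This is only a sketch: `ParseError`, `timeout_from_ms`, and `run_with_timeout` are hypothetical names, and the crate's actual mechanism is not documented here.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum ParseError {
    /// The wall-clock limit elapsed before the parse finished.
    Timeout,
    /// The worker ended (for example by panicking) without a result.
    WorkerFailed,
}

/// Map the config value to an optional deadline: 0 means "no timeout".
fn timeout_from_ms(parse_timeout_ms: u64) -> Option<Duration> {
    if parse_timeout_ms == 0 {
        None
    } else {
        Some(Duration::from_millis(parse_timeout_ms))
    }
}

/// Run `work` on a separate thread and wait at most `parse_timeout_ms`.
fn run_with_timeout<T, F>(parse_timeout_ms: u64, work: F) -> Result<T, ParseError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that.
        let _ = tx.send(work());
    });
    match timeout_from_ms(parse_timeout_ms) {
        Some(limit) => rx.recv_timeout(limit).map_err(|e| match e {
            mpsc::RecvTimeoutError::Timeout => ParseError::Timeout,
            mpsc::RecvTimeoutError::Disconnected => ParseError::WorkerFailed,
        }),
        None => rx.recv().map_err(|_| ParseError::WorkerFailed),
    }
}
```

A thread cannot be forcibly stopped, so in this sketch a timed-out parse keeps consuming CPU until it finishes on its own. The caller gets its error promptly, but reclaiming the resources requires process-level isolation.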
max_decompressed_bytes: usize
Decompression-bomb guard: maximum total DECOMPRESSED bytes a document’s
FlateDecode streams may inflate to. Checked in bounded memory BEFORE the
parser runs, so a small file that explodes to many GB is rejected with
pdf_too_large having allocated only kilobytes. This is the primary
guard against OOM-crashing the host. 0 disables it. The default is 100 MiB,
which is huge for text extraction (millions of words) yet tiny next to host
RAM. Raise it only if you must parse image-heavy PDFs.
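The "bounded memory" property can be illustrated by draining a decompressing reader through a small fixed buffer and counting bytes, stopping as soon as the limit is crossed. The sketch below is generic over any `Read`. In practice the reader would be an inflate decoder over each FlateDecode stream (which decoder the crate uses is not stated here), with the count summed across streams. `GuardError` and `check_decompressed_size` are hypothetical names.

```rust
use std::io::{self, Read};

#[derive(Debug)]
pub enum GuardError {
    /// The stream inflated past the configured limit.
    TooLarge { limit: usize },
    Io(io::Error),
}

/// Count the bytes `reader` yields using a fixed 8 KiB scratch buffer,
/// so memory use stays constant however large the output would be.
/// A limit of 0 disables the check, mirroring the config semantics.
pub fn check_decompressed_size<R: Read>(
    mut reader: R,
    max_decompressed_bytes: usize,
) -> Result<usize, GuardError> {
    let mut buf = [0u8; 8192];
    let mut total: usize = 0;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => return Ok(total), // end of stream
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(GuardError::Io(e)),
        };
        total = total.saturating_add(n);
        if max_decompressed_bytes != 0 && total > max_decompressed_bytes {
            // Stop early: the rest of the stream is never inflated.
            return Err(GuardError::TooLarge { limit: max_decompressed_bytes });
        }
    }
}
```

The decompressed bytes are discarded as they are counted, which is why a stream that would expand to gigabytes costs only the scratch buffer before being rejected.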
sandbox: bool
Run each PDF parse in an isolated child PROCESS (Unix only) instead of
in-process. The child gets a hard OS memory ceiling (RLIMIT_AS) and CPU
limit, inherits no env/secrets, and is killed on timeout. A crash, OOM,
or even a hypothetical parser RCE is contained to the child, so the main
server (scrape/crawl) keeps running. Costs ~1-3ms spawn overhead per
parse. Recommended for hosts that accept untrusted uploads. Default off.
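Two of the properties listed above, a clean environment and kill-on-timeout, can be sketched with the standard library alone. The example assumes a Unix host with `/bin/sh` when used as shown in the note below. `run_isolated` and `ChildOutcome` are hypothetical names. Setting RLIMIT_AS and the CPU limit is not shown, since std has no API for it; it would typically be done with `setrlimit` in a pre-exec hook.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of one isolated run.
#[derive(Debug)]
pub enum ChildOutcome {
    Exited(ExitStatus),
    TimedOut,
}

/// Spawn `program` with an empty environment and kill it if it is
/// still running once `timeout` has elapsed.
pub fn run_isolated(
    program: &str,
    args: &[&str],
    timeout: Duration,
) -> io::Result<ChildOutcome> {
    let mut child = Command::new(program)
        .args(args)
        .env_clear() // the child inherits no env vars, hence no secrets
        .stdin(Stdio::null())
        .spawn()?;

    let deadline = Instant::now() + timeout;
    loop {
        // Non-blocking check: has the child exited yet?
        if let Some(status) = child.try_wait()? {
            return Ok(ChildOutcome::Exited(status));
        }
        if Instant::now() >= deadline {
            let _ = child.kill(); // SIGKILL on Unix
            child.wait()?; // reap so no zombie is left behind
            return Ok(ChildOutcome::TimedOut);
        }
        thread::sleep(Duration::from_millis(5));
    }
}
```

For instance, `run_isolated("/bin/sh", &["-c", "exec sleep 5"], Duration::from_millis(100))` should come back as `TimedOut` while the calling process carries on unaffected.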
sandbox_memory_bytes: u64
Hard address-space limit (bytes) for a sandbox child (RLIMIT_AS). The
child is aborted by the OS if it allocates beyond this. It is the ultimate
backstop against memory-DoS even if the decompression guard is bypassed.
Default 512 MiB.
Trait Implementations
impl Clone for DocumentConfig
fn clone(&self) -> DocumentConfig
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.