Dongler is built around a path-first workflow: load a file, inspect the document object when you need to, then render the output format your pipeline wants. One Rust core powers the CLI, Python, TypeScript, and Rust APIs, so the extraction model is identical everywhere.
Install
For the Rust library, depend on dongler-core. The public dongler crate is the CLI package.
Parse a PDF
Python
=
=
=
=
TypeScript
import { load } from "@cristianexer/dongler";
const doc = load("report.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();
const data = doc.toObject();
Rust
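use dongler_core::load_path; // crate path assumed from the dongler-core package name
let doc = load_path("report.pdf")?;
println!("{}", doc.to_markdown()); // method name assumed by analogy with toMarkdown() above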
What you get
- Markdown · LaTeX · JSON: three renderers from one document object, covering headings, tables, lists, figures, and emphasis.
- Native speed, local runtime: a custom Rust PDF parser with rayon page-parallelism. No hosted service, API key, LLM, or OCR for born-digital PDFs.
- Structured document model: page, block, table, image, span, warning, and metadata fields, with source anchors back to PDF objects.
- One API across stacks: the same extraction model in Python, Node.js, Rust, and the CLI.
- Pipeline-friendly batches: batch APIs return one result per file, so a single bad document never stops the job.
- Beyond PDF: native extraction for DOCX/XLSX/PPTX, ODT/ODS/ODP, HTML/XML, EML, JSON/JSONL, CSV/TSV, images, and archives.
Why Dongler
Use Dongler when the job starts with a document path and the next step needs useful text quickly:
- Convert PDFs to Markdown for indexing, review, or RAG ingestion.
- Keep page/block/table/image metadata available through JSON.
- Run locally in scripts, services, queues, notebooks, and shell workflows.
- Use the same extraction model across Python, Node.js, Rust, and the CLI.
Supported inputs
Dongler focuses on born-digital PDFs and also supports native extraction for DOCX, XLSX, PPTX, ODT/ODS/ODP, HTML/XML, EML, JSON/JSONL, CSV/TSV, image metadata including TIFF, and plain text/Markdown/TeX. It also reads gzip-compressed text/JSON/XML/CSV corpus files, bare gzip source files, and zip/tar/tar.gz source packages. Legacy binary Office and Outlook containers are detected and return explicit planned-format errors until their engines land.
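To make "detected and return explicit planned-format errors" concrete, here is a minimal sketch of magic-byte sniffing in Python. It is not Dongler's implementation: `sniff`, `check_supported`, and `PlannedFormatError` are hypothetical names chosen for illustration. Only the byte signatures themselves (PDF, ZIP, gzip, and the OLE2 compound file used by legacy .doc/.xls/.ppt/.msg) are standard.

```python
class PlannedFormatError(Exception):
    """Hypothetical error for formats that are recognised but not yet extractable."""


# Well-known leading byte signatures. The labels are illustrative only.
MAGIC = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-container"),  # DOCX/XLSX/PPTX, ODF, and plain zip archives
    (b"\x1f\x8b", "gzip"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy Office and Outlook files
]


def sniff(header: bytes) -> str:
    """Return a coarse container label for the first bytes of a file."""
    for magic, kind in MAGIC:
        if header.startswith(magic):
            return kind
    return "unknown"


def check_supported(header: bytes) -> str:
    """Return the container label, or raise for recognised-but-planned formats."""
    kind = sniff(header)
    if kind == "ole2":
        raise PlannedFormatError("legacy OLE2 container: extraction is planned")
    return kind
```

Failing with a named error, rather than returning empty text, lets a caller tell "this format is not handled yet" apart from "this file is empty or corrupt".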
Batch processing
One result per file: a bad or unsupported document does not stop the batch.
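The one-result-per-file contract can be sketched in plain Python. This illustrates the pattern only and is not Dongler's batch API: `BatchResult` and `run_batch` are hypothetical names, and `extract` stands in for whatever function turns a path into text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class BatchResult:
    """Outcome for one input file (hypothetical shape, not Dongler's type)."""
    path: str
    output: Optional[str] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def run_batch(paths: Iterable[str], extract: Callable[[str], str]) -> List[BatchResult]:
    """Call `extract` on every path and record exactly one result per file."""
    results = []
    for path in paths:
        try:
            results.append(BatchResult(path=path, output=extract(path)))
        except Exception as exc:
            # Keep the failure local to this file and carry on with the rest.
            results.append(BatchResult(path=path, error=f"{type(exc).__name__}: {exc}"))
    return results
```

A caller can then split the results with `[r for r in results if r.ok]` and log the remainder, instead of wrapping the whole job in a single try/except.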
CLI
PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
Documentation
Benchmarks
Generated by scripts/run-benchmarks.py on 2026-05-28 19:56:50 BST. Local cache: 1894.9 MB. All discovered files per dataset.
Coverage is parse / bbox / anchors. Ground-truth accuracy is token-F1, olmOCR unit-check pass rate, or full-image IoU; n/a means no local target signal. Detailed task names, discovery counts, native scores, and notes are recorded in eval/out/benchmarks/latest.json.
| Dataset | Status | Local data | Docs eval | Coverage | Pages/sec | GT accuracy |
|---|---|---|---|---|---|---|
| DocLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| DocBank | ok | 735.6 MB | 200 | 100.0% / 100.0% / 100.0% | 81.94 | 89.5% |
| PubTabNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubTables-1M | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| TableBank | ok | 1.6 MB | 10 | 100.0% / 100.0% / 100.0% | 193.45 | 100.0% |
| FUNSD | ok | 42.6 MB | 200 | 100.0% / 48.9% / 100.0% | 96.09 | 100.0% |
| SROIE | ok | 627.3 MB | 1264 | 100.0% / 92.7% / 100.0% | 231.85 | 100.0% |
| RVL-CDIP | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| READoc | ok | 39.9 MB | 959 | 100.0% / n/a / n/a | 96.86 | 100.0% |
| OmniDocBench | ok | 40.3 MB | 1 | 100.0% / 100.0% / 100.0% | 1030.96 | 88.5% |
| olmOCR-Bench | ok | 340.5 MB | 1403 | 100.0% / 100.0% / 100.0% | 20.97 | 20.3% |
| ckorzen benchmark | ok | 67.1 MB | 192 | 100.0% / 15.4% / 100.0% | 100.37 | 88.4% |
| S2ORC | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PMC OA | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| arXiv source/PDF | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
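For readers unfamiliar with the token-F1 metric named above, the following is a minimal sketch of one common definition: the harmonic mean of token precision and recall over a multiset overlap. It assumes whitespace tokenization and no text normalization. The benchmark script may tokenize or normalize differently, so this will not necessarily reproduce the table's numbers.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between two strings, using whitespace tokens (an assumption)."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not pred_tokens and not ref_tokens:
        return 1.0  # two empty texts count as a perfect match
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, extra tokens in the extracted text lower precision and missing tokens lower recall, so both over-extraction and dropped content reduce the score.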
License
Dongler is MIT licensed. Copyright (c) 2026 Daniel Fat. See LICENSE and NOTICE
for the full notice text.