dongler-core 0.3.10

Dongler is built around a path-first workflow: load a file, inspect the document object when you need to, then render the output format your pipeline wants. One Rust core powers the CLI, Python, TypeScript, and Rust APIs, so the extraction model is identical everywhere.

Install

cargo install dongler                  # CLI binary
pip install dongler                    # Python
npm install @cristianexer/dongler      # Node / TypeScript

For the Rust library, depend on dongler-core. The public dongler crate is the CLI package.

Parse a PDF

Python

import dongler

doc = dongler.load("report.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
data = doc.to_dict()
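
The three renders above can be written to disk with a small helper. This is a minimal sketch, not part of Dongler's API: `export_renders`, `out_dir`, and `stem` are hypothetical names, and it assumes `to_dict()` returns only JSON-serializable values.

```python
import json
from pathlib import Path


def export_renders(doc, out_dir, stem):
    """Write the Markdown, LaTeX, and JSON renders of `doc` under `out_dir`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = {
        "markdown": out / f"{stem}.md",
        "latex": out / f"{stem}.tex",
        "json": out / f"{stem}.json",
    }
    written["markdown"].write_text(doc.to_markdown(), encoding="utf-8")
    written["latex"].write_text(doc.to_latex(), encoding="utf-8")
    # Assumption: to_dict() contains only JSON-serializable values.
    written["json"].write_text(
        json.dumps(doc.to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return written
```

With the package installed, a call would look like `export_renders(dongler.load("report.pdf"), "out", "report")`.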

TypeScript

import { load } from "@cristianexer/dongler";

const doc = load("report.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();
const data = doc.toObject();

Rust

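use dongler_core::load_path;

// Wrapped in main so `?` compiles; assumes dongler-core's errors implement std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = load_path("report.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}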

What you get

📄 Markdown · LaTeX · JSON: Three renderers from one document object, covering headings, tables, lists, figures, and emphasis.

⚡ Native speed, local runtime: A custom Rust PDF parser with rayon page-parallelism. No hosted service, API key, LLM, or OCR for born-digital PDFs.

🧱 Structured document model: Page, block, table, image, span, warning, and metadata fields, with source anchors back to PDF objects.

🧩 One API across stacks: The same extraction model in Python, Node.js, Rust, and the CLI.

📦 Pipeline-friendly batches: Batch APIs return one result per file, so a single bad document never stops the job.

🔌 Beyond PDF: Native extraction for DOCX/XLSX/PPTX, ODT/ODS/ODP, HTML/XML, EML, JSON/JSONL, CSV/TSV, images, and archives.

Why Dongler

Use Dongler when the job starts with a document path and the next step needs useful text quickly:

Convert PDFs to Markdown for indexing, review, or RAG ingestion.
Keep page/block/table/image metadata available through JSON.
Run locally in scripts, services, queues, notebooks, and shell workflows.
Use the same extraction model across Python, Node.js, Rust, and the CLI.

Supported inputs

Dongler focuses on born-digital PDFs and also supports native extraction for DOCX, XLSX, PPTX, ODT/ODS/ODP, HTML/XML, EML, JSON/JSONL, CSV/TSV, image metadata including TIFF, and plain text/Markdown/TeX. It also reads gzip-compressed text/JSON/XML/CSV corpus files, bare gzip source files, and zip/tar/tar.gz source packages. Legacy binary Office and Outlook containers are detected and return explicit planned-format errors until their engines land.

Batch processing

One result per file — a bad or unsupported document does not stop the batch.

import dongler

for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")
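
For pipelines that need to retry or log failures separately, the per-file results can be partitioned first. A minimal sketch, assuming each result is a dict with the `ok`, `document`, `path`, and `error` keys used above; `split_results` is a hypothetical helper, and the presence of `path` on successful results is an assumption.

```python
def split_results(results):
    """Partition batch results into (successes, failures)."""
    successes, failures = [], []
    for result in results:
        if result["ok"]:
            # Assumption: successful results may also carry "path"; fall back to None.
            successes.append((result.get("path"), result["document"]))
        else:
            failures.append((result["path"], result["error"]))
    return successes, failures
```

A call such as `split_results(dongler.load_many(paths))` would then give two lists that can be handled independently.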

CLI

dongler --version
dongler inspect invoice.pdf
dongler extract report.docx --format markdown
dongler extract book.xlsx   --format json
dongler extract notes.txt   --format latex

PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
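
The `dongler extract` commands above can also be driven from a script. A minimal sketch: `extract_command` and `run_extract` are hypothetical helpers, the format list is limited to the three values shown above, and it is an assumption that the CLI writes the rendered output to stdout and exits non-zero on failure.

```python
import subprocess

# Formats that appear in the CLI examples; the CLI may accept others.
FORMATS = ("markdown", "json", "latex")


def extract_command(path, fmt="markdown"):
    """Build the argv list for `dongler extract <path> --format <fmt>`."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["dongler", "extract", str(path), "--format", fmt]


def run_extract(path, fmt="markdown"):
    """Run the CLI and return its stdout as text."""
    # Assumption: rendered output goes to stdout; check=True raises on a non-zero exit.
    completed = subprocess.run(
        extract_command(path, fmt),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.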

Benchmarks

Generated by scripts/run-benchmarks.py on 2026-05-28 19:56:50 BST. Local cache: 1894.9 MB. Each run evaluates all discovered files per dataset.

The Coverage column lists three rates in order: parse / bbox / anchors. Ground-truth (GT) accuracy is token-F1, olmOCR unit-check pass rate, or full-image IoU, depending on the dataset; n/a means no local target signal. Detailed task names, discovery counts, native scores, and notes are recorded in eval/out/benchmarks/latest.json.
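
For readers unfamiliar with token-F1, the metric can be sketched as the harmonic mean of token-level precision and recall. This is a generic illustration, not the benchmark harness itself: whitespace tokenization and the `token_f1` name are assumptions, and the real scripts may normalize text differently.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Token-level F1 between two strings, using whitespace tokens."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Two empty texts match perfectly; one empty text scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

When the extracted text and the reference share two of three tokens on each side, precision and recall are both 2/3, so the score is 2/3.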

Dataset	Status	Local data	Docs evaluated	Coverage	Pages/sec	GT accuracy
DocLayNet	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
PubLayNet	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
DocBank	ok	735.6 MB	200	100.0% / 100.0% / 100.0%	81.94	89.5%
PubTabNet	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
PubTables-1M	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
TableBank	ok	1.6 MB	10	100.0% / 100.0% / 100.0%	193.45	100.0%
FUNSD	ok	42.6 MB	200	100.0% / 48.9% / 100.0%	96.09	100.0%
SROIE	ok	627.3 MB	1264	100.0% / 92.7% / 100.0%	231.85	100.0%
RVL-CDIP	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
READoc	ok	39.9 MB	959	100.0% / n/a / n/a	96.86	100.0%
OmniDocBench	ok	40.3 MB	1	100.0% / 100.0% / 100.0%	1030.96	88.5%
olmOCR-Bench	ok	340.5 MB	1403	100.0% / 100.0% / 100.0%	20.97	20.3%
ckorzen benchmark	ok	67.1 MB	192	100.0% / 15.4% / 100.0%	100.37	88.4%
S2ORC	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
PMC OA	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
arXiv source/PDF	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a

Extraction-quality improvements

A controlled A/B of the current parser against the previous release baseline — run on the full olmOCR-Bench corpus (1403 real PDFs, identical benchmark harness and release build) — isolates the gains from the recent reading-order, heading, and ligature work:

Signal	Before	After
olmOCR reading-order checks passed	320 / 1061 (30.2%)	327 / 1061 (30.8%)
Documents improved vs. regressed	—	5 improved, 0 regressed
Unexpanded ligature glyphs left in text (ﬁ ﬂ ﬃ …)	2586 across 101 docs	0
Headings emitted with a semantic level (H1/H2/H3)	0 (single flat kind)	1851 across 327 docs
Parse success	1403 / 1403	1403 / 1403

Every other olmOCR check type (text presence, tables, math, absences) is identical between the two builds, which indicates the performance refactors are output-preserving for those checks.
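
The ligature row in the table above counts presentation-form glyphs such as ﬁ and ﬂ that survive in extracted text. A minimal sketch of how such glyphs can be counted and expanded in Python follows; it illustrates the general technique and is not Dongler's implementation. The `LIGATURES` table, `count_ligatures`, and `expand_ligatures` are hypothetical names covering the Latin ligatures at U+FB00 to U+FB06.

```python
# Latin ligatures from the Unicode Alphabetic Presentation Forms block.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t, folded to plain "st"
    "\ufb06": "st",
}
_TABLE = str.maketrans(LIGATURES)


def count_ligatures(text):
    """Count ligature glyphs that remain unexpanded in `text`."""
    return sum(1 for ch in text if ch in LIGATURES)


def expand_ligatures(text):
    """Replace each ligature glyph with its plain-letter sequence."""
    return text.translate(_TABLE)
```

An explicit table is used here instead of full NFKC normalization so that other compatibility characters, such as superscripts, are left untouched.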