Dongler

Dongler is a Rust-native document extraction engine with Python and TypeScript bindings. It targets a common workflow: load a document from a path, extract its structure once, then render clean Markdown or LaTeX from the same document object.

Install

cargo install dongler
pip install dongler
npm install @cristianexer/dongler

For Rust library usage, depend on dongler-core. The public dongler crate is the CLI package.

How to Use

Dongler natively extracts PDF, DOCX, XLSX, PPTX, ODT/ODS/ODP, HTML/XML, EML, JSON/JSONL, CSV/TSV, image metadata (including TIFF), and plain text/Markdown/TeX. It also reads gzip-compressed text/JSON/XML/CSV corpus files, bare gzip source files, and zip/tar/tar.gz source packages. Legacy binary Office and Outlook containers are detected but not yet extracted; they return explicit planned-format errors until their engines land. The same API works across all supported formats, so the same code extracts Markdown from a PDF invoice, a spreadsheet, a web page, an email, a dataset annotation, or a plain text note.

Python:

import dongler

doc = dongler.load("invoice.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()

TypeScript:

import { load } from "@cristianexer/dongler";

const doc = load("invoice.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();

Rust:

use dongler_core::load_path;

fn main() -> dongler_core::Result<()> {
    let doc = load_path("invoice.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}

Batch Processing

Batch processing returns one result per file. One bad or unsupported document does not stop the batch.

Python:

import dongler

for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")

TypeScript:

import { loadMany } from "@cristianexer/dongler";

for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
  if (result.ok) {
    console.log(result.document!.toMarkdown());
  } else {
    console.error(`${result.path}: ${result.error}`);
  }
}

Rust:

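use dongler_core::load_many;

fn main() {
    // Each input path yields its own result, so one failure does not abort the loop.
    for result in load_many(["notes.txt", "invoice.pdf"]) {
        if result.ok {
            // On success the document is present and can be rendered.
            println!("{}", result.document.unwrap().to_markdown().unwrap());
        } else {
            // On failure the error is present; report it next to the path.
            eprintln!("{}: {}", result.path, result.error.unwrap());
        }
    }
}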
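
Because every file gets its own result, a batch can be split into rendered output and failures in a single pass. The sketch below is one possible way to do that in Python. The helper name partition_results is hypothetical, and it assumes that successful results also carry a "path" key (the examples above only show "path" on the error branch). It works on any iterable of result mappings, so it can be tried without real documents.

```python
from collections.abc import Iterable, Mapping
from typing import Any


def partition_results(
    results: Iterable[Mapping[str, Any]],
) -> tuple[dict[str, str], dict[str, str]]:
    """Split batch results into (rendered Markdown, error messages), keyed by path.

    Hypothetical helper: assumes each result exposes "ok", "path", and either
    "document" (with a to_markdown() method) or "error".
    """
    rendered: dict[str, str] = {}
    failed: dict[str, str] = {}
    for result in results:
        path = str(result["path"])
        if result["ok"]:
            # Render only the documents that loaded successfully.
            rendered[path] = result["document"].to_markdown()
        else:
            # Keep the error text so failures can be reported after the batch.
            failed[path] = str(result["error"])
    return rendered, failed


# Intended usage with the real bindings (not executed here):
# rendered, failed = partition_results(dongler.load_many(["notes.txt", "invoice.pdf"]))
```

Collecting failures separately keeps the reporting step out of the extraction loop, which is convenient when the batch is large and only a handful of files fail.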

CLI

dongler --version
dongler inspect notes.txt
dongler inspect invoice.pdf
dongler extract report.docx --format markdown
dongler extract book.xlsx --format json
dongler extract deck.pptx --format markdown
dongler extract notes.odt --format markdown
dongler extract annotations.json --format markdown
dongler extract boxes.csv --format json
dongler extract notes.txt --format markdown
dongler extract notes.txt --format latex
dongler extract notes.txt --format json

PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
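
The extract commands above can also be driven from a script. The following is a minimal sketch, not part of Dongler itself: build_extract_command and extract are hypothetical helper names, the format list contains only the values shown in the examples above (others may exist), and the wrapper assumes the CLI writes its result to standard output and exits non-zero on failure.

```python
import subprocess

# Formats that appear in the CLI examples above; the CLI may accept others.
KNOWN_FORMATS = ("markdown", "latex", "json")


def build_extract_command(path: str, fmt: str = "markdown") -> list[str]:
    """Build the argv list for `dongler extract <path> --format <fmt>`."""
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"format {fmt!r} is not one of {KNOWN_FORMATS}")
    return ["dongler", "extract", path, "--format", fmt]


def extract(path: str, fmt: str = "markdown") -> str:
    """Run the CLI and return its standard output.

    Assumes output goes to stdout and that failures produce a non-zero
    exit code, in which case CalledProcessError is raised.
    """
    completed = subprocess.run(
        build_extract_command(path, fmt),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.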

Benchmarks

Generated by scripts/run-benchmarks.py on 2026-05-28 19:56:50 BST. Local cache: 1894.9 MB. The run covers all discovered files per dataset.

Coverage is reported as parse / bbox / anchors. Ground-truth (GT) accuracy is token-F1, olmOCR unit-check pass rate, or full-image IoU, depending on the dataset; n/a means no local target signal. Detailed task names, discovery counts, native scores, and notes are recorded in eval/out/benchmarks/latest.json.
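
For readers unfamiliar with token-F1, here is a minimal sketch of the usual definition: the harmonic mean of token precision and token recall, with overlap counted on token multisets. It assumes whitespace tokenization and exact matching. The tokenization and normalization that scripts/run-benchmarks.py actually uses are not stated here, so this illustrates the metric and does not reproduce the benchmark code.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between two strings, assuming whitespace tokenization."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not pred_tokens and not ref_tokens:
        # Two empty texts agree perfectly.
        return 1.0
    # Multiset intersection: a token counts at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

With this definition, extra tokens in the extracted text lower precision and missing tokens lower recall, so the score penalizes both over-extraction and dropped content.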

Dataset	Status	Local data	Docs evaluated	Coverage (parse / bbox / anchors)	Pages/sec	GT accuracy
DocLayNet	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
PubLayNet	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
DocBank	ok	735.6 MB	200	100.0% / 100.0% / 100.0%	81.94	89.5%
PubTabNet	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
PubTables-1M	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
TableBank	ok	1.6 MB	10	100.0% / 100.0% / 100.0%	193.45	100.0%
FUNSD	ok	42.6 MB	200	100.0% / 48.9% / 100.0%	96.09	100.0%
SROIE	ok	627.3 MB	1264	100.0% / 92.7% / 100.0%	231.85	100.0%
RVL-CDIP	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
READoc	ok	39.9 MB	959	100.0% / n/a / n/a	96.86	100.0%
OmniDocBench	ok	40.3 MB	1	100.0% / 100.0% / 100.0%	1030.96	88.5%
olmOCR-Bench	ok	340.5 MB	1403	100.0% / 100.0% / 100.0%	20.97	20.3%
ckorzen benchmark	ok	67.1 MB	192	100.0% / 15.4% / 100.0%	100.37	88.4%
S2ORC	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
PMC OA	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
arXiv source/PDF	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a

License

Dongler is licensed under the MIT License. See LICENSE and NOTICE.