dongler-core 0.3.2

Rust-native PDF and document extraction core for Markdown, LaTeX, and JSON output.
<p align="center">
  <img src="https://cristianexer.github.io/dongler/img/dongler-logo.png" alt="Dongler logo" width="132">
</p>

# Dongler

Dongler is a fast, Rust-native document extraction package for developers who
need to parse PDFs and other documents into Markdown, LaTeX, or structured
JSON.

It is designed around a practical, path-first workflow: load a file, inspect
the document object, then render the output format your pipeline needs. The same
core engine powers the CLI, Python package, TypeScript package, and Rust API.

## Install

```bash
cargo install dongler
pip install dongler
npm install @cristianexer/dongler
```

For Rust library usage, depend on `dongler-core`. The public `dongler` crate is
the CLI package.

## Parse a PDF

Python:

```python
import dongler

doc = dongler.load("report.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
data = doc.to_dict()
```

TypeScript:

```ts
import { load } from "@cristianexer/dongler";

const doc = load("report.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();
const data = doc.toObject();
```

Rust:

```rust
use dongler_core::load_path;

fn main() -> dongler_core::Result<()> {
    let doc = load_path("report.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}
```

## What You Get

- Markdown, LaTeX, and JSON renderers from the same document object.
- Page, block, table, image, warning, and metadata fields for downstream code.
- Rust-native PDF extraction with no hosted service dependency.
- Python and TypeScript bindings over the same Rust core.
- Batch APIs that return one result per file, so one unsupported document does
  not stop a job.

## Supported Inputs

Dongler currently extracts natively from PDF, DOCX, XLSX, PPTX, ODT/ODS/ODP,
HTML/XML, EML, JSON/JSONL, CSV/TSV, and plain text/Markdown/TeX files, and reads
image metadata (including TIFF). It also accepts gzip-compressed
text/JSON/XML/CSV corpus files, bare gzip source files, and zip/tar/tar.gz
source packages.

Legacy binary Office and Outlook containers are detected and return explicit
planned-format errors until their engines land.

## More Examples

Plain text, Markdown, office files, and data files use the same API:

```python
import dongler

doc = dongler.load("invoice.docx")
markdown = doc.to_markdown()
latex = doc.to_latex()
```

## Batch Processing

Batch processing returns one result per file. One bad or unsupported document
does not stop the batch.

Python:

```python
import dongler

for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")
```

TypeScript:

```ts
import { loadMany } from "@cristianexer/dongler";

for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
  if (result.ok) {
    console.log(result.document!.toMarkdown());
  } else {
    console.error(`${result.path}: ${result.error}`);
  }
}
```

Rust:

```rust
use dongler_core::load_many;

fn main() {
    // Each input yields its own result, so a failure is reported per file.
    for result in load_many(["notes.txt", "invoice.pdf"]) {
        if result.ok {
            println!("{}", result.document.unwrap().to_markdown().unwrap());
        } else {
            eprintln!("{}: {}", result.path, result.error.unwrap());
        }
    }
}
```
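
Building on the Python example above, a batch job often wants the successes and failures collected separately instead of printed as they arrive. The helper below is a minimal sketch, not part of Dongler: `summarize_batch` is a hypothetical name, and it assumes each result is a dict with the `ok`, `path`, `document`, and `error` keys shown above, with `path` present on successful results as well.

```python
from typing import Any, Iterable


def summarize_batch(results: Iterable[dict[str, Any]]) -> dict[str, dict[str, str]]:
    """Split batch results into rendered Markdown and per-file error messages.

    Assumes every result carries "ok" and "path", plus "document" on success
    and "error" on failure (an assumption based on the example above).
    """
    rendered: dict[str, str] = {}
    failed: dict[str, str] = {}
    for result in results:
        if result["ok"]:
            # Render only documents that loaded successfully.
            rendered[result["path"]] = result["document"].to_markdown()
        else:
            # Keep the error text so the caller can report or retry later.
            failed[result["path"]] = str(result["error"])
    return {"rendered": rendered, "failed": failed}


# Intended usage with Dongler installed:
#   import dongler
#   summary = summarize_batch(dongler.load_many(["notes.txt", "invoice.pdf"]))
#   print(sorted(summary["failed"]))
```

Keeping failures in their own mapping makes it straightforward to log them or retry them once the job finishes.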

## CLI

```bash
dongler --version
dongler inspect notes.txt
dongler inspect invoice.pdf
dongler extract report.docx --format markdown
dongler extract book.xlsx --format json
dongler extract deck.pptx --format markdown
dongler extract notes.odt --format markdown
dongler extract annotations.json --format markdown
dongler extract boxes.csv --format json
dongler extract notes.txt --format markdown
dongler extract notes.txt --format latex
dongler extract notes.txt --format json
```

PDF extraction through the CLI uses the same Rust-native engine as the Rust,
Python, and TypeScript packages.
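
When the language bindings are not available, the CLI can be driven from a script. The sketch below wraps the `dongler extract <path> --format <format>` invocation shown above. The helper names are hypothetical, and two behaviours are assumptions not confirmed here: that the rendered output is written to standard output, and that a failed extraction exits with a non-zero status.

```python
import subprocess

# The three --format values used in the CLI examples above.
FORMATS = ("markdown", "latex", "json")


def build_extract_command(path: str, fmt: str = "markdown") -> list[str]:
    """Build the argument list for `dongler extract`, rejecting unknown formats."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format {fmt!r}; expected one of {FORMATS}")
    return ["dongler", "extract", path, "--format", fmt]


def extract(path: str, fmt: str = "markdown") -> str:
    """Run the CLI and return its standard output.

    Assumes the CLI prints the rendered document to stdout and signals
    failure with a non-zero exit code, which raises CalledProcessError.
    """
    completed = subprocess.run(
        build_extract_command(path, fmt),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.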

## Developer Docs

- [Documentation](https://cristianexer.github.io/dongler/docs/intro)
- [Quick start](https://cristianexer.github.io/dongler/docs/quickstart)
- [Developer guide](https://cristianexer.github.io/dongler/docs/developer-guide)
- [API reference](https://cristianexer.github.io/dongler/docs/api)

## Benchmarks

<!-- BENCHMARKS:START -->
_Generated by `scripts/run-benchmarks.py` on 2026-05-28 19:56:50 BST. Local cache: 1894.9 MB. All discovered files per dataset._

Coverage is `parse / bbox / anchors`. Ground-truth accuracy is token-F1, olmOCR unit-check pass rate, or full-image IoU; `n/a` means no local target signal. Detailed task names, discovery counts, native scores, and notes are recorded in `eval/out/benchmarks/latest.json`.

| Dataset | Status | Local data | Docs eval | Coverage | Pages/sec | GT accuracy |
| --- | --- | ---: | ---: | --- | ---: | ---: |
| DocLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| DocBank | ok | 735.6 MB | 200 | 100.0% / 100.0% / 100.0% | 81.94 | 89.5% |
| PubTabNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubTables-1M | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| TableBank | ok | 1.6 MB | 10 | 100.0% / 100.0% / 100.0% | 193.45 | 100.0% |
| FUNSD | ok | 42.6 MB | 200 | 100.0% / 48.9% / 100.0% | 96.09 | 100.0% |
| SROIE | ok | 627.3 MB | 1264 | 100.0% / 92.7% / 100.0% | 231.85 | 100.0% |
| RVL-CDIP | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| READoc | ok | 39.9 MB | 959 | 100.0% / n/a / n/a | 96.86 | 100.0% |
| OmniDocBench | ok | 40.3 MB | 1 | 100.0% / 100.0% / 100.0% | 1030.96 | 88.5% |
| olmOCR-Bench | ok | 340.5 MB | 1403 | 100.0% / 100.0% / 100.0% | 20.97 | 20.3% |
| ckorzen benchmark | ok | 67.1 MB | 192 | 100.0% / 15.4% / 100.0% | 100.37 | 88.4% |
| S2ORC | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PMC OA | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| arXiv source/PDF | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
<!-- BENCHMARKS:END -->
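
For readers unfamiliar with token-F1, the sketch below shows one common way to compute it: the harmonic mean of token-level precision and recall over the multiset of tokens. It is an illustration only. The whitespace tokenization and the handling of empty inputs are assumptions and may differ from what `scripts/run-benchmarks.py` does.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between extracted text and ground-truth text.

    Tokens are whitespace-separated (an assumption). Repeated tokens count
    once per occurrence, so overlap is a multiset intersection.
    """
    pred_tokens = Counter(predicted.split())
    ref_tokens = Counter(reference.split())
    # Counter & Counter keeps the minimum count of each shared token.
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        # Covers disjoint texts and empty inputs alike.
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)
```

Under this definition, extra tokens in the extracted text lower precision and missing tokens lower recall, so the score falls below 100% for either kind of error.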

## License

Dongler is MIT licensed. Copyright (c) 2026 Daniel Fat. See `LICENSE` and
`NOTICE` for the full notice text.