Dongler
Dongler is a Rust-native document extraction engine with Python and TypeScript bindings. It is built for the workflow developers actually need: load a document path, extract structure once, then render clean Markdown or LaTeX from the same document object.
Created by Daniel Fat.
Status
Dongler ships .txt extraction and a native Rust PDF extraction path with page
geometry, text source anchors, basic table reconstruction, and image object
positions.
| Format | Detection | Extraction |
|---|---|---|
| .txt, .text | yes | supported |
| .pdf | yes | supported |
| Word, Excel, HTML, images, email | yes | planned |
Current outputs:
- Markdown
- LaTeX
- JSON
- Dongler's typed document IR
Install
For Rust library usage, depend on dongler-core. The public dongler crate is
the CLI package.
API
Document extraction is a two-step process: load a document path, then render the extracted structure in the format you need. The same document object can be rendered to Markdown, LaTeX, or JSON without re-extracting the document.
Python:
import dongler
doc = dongler.load("invoice.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
TypeScript:
import { load } from "@cristianexer/dongler";
const doc = load("invoice.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();
Rust:
use load_path;
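The extract-once, render-many idea described above can be illustrated with a small self-contained sketch. It does not use Dongler and does not reflect its internals: `SketchBlock`, `SketchDocument`, and `sketch_extract` are hypothetical names invented for this example, and the "extraction" is a toy rule (first chunk is a heading, the rest are paragraphs). Only the method names `to_markdown`, `to_latex`, and `to_json` mirror the documented API.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchBlock:
    kind: str  # "heading" or "paragraph"
    text: str


@dataclass
class SketchDocument:
    """Hypothetical stand-in for an extracted document (not Dongler's IR)."""
    path: str
    blocks: List[SketchBlock] = field(default_factory=list)

    def to_markdown(self) -> str:
        parts = [f"# {b.text}" if b.kind == "heading" else b.text
                 for b in self.blocks]
        return "\n\n".join(parts)

    def to_latex(self) -> str:
        # This sketch does not escape LaTeX special characters.
        parts = ["\\section{" + b.text + "}" if b.kind == "heading" else b.text
                 for b in self.blocks]
        return "\n\n".join(parts)

    def to_json(self) -> str:
        return json.dumps({
            "path": self.path,
            "blocks": [{"kind": b.kind, "text": b.text} for b in self.blocks],
        })


def sketch_extract(path: str, text: str) -> SketchDocument:
    """Toy extraction step: runs once and produces the shared structure."""
    chunks = [c.strip() for c in text.split("\n\n") if c.strip()]
    blocks = [SketchBlock("heading" if i == 0 else "paragraph", c)
              for i, c in enumerate(chunks)]
    return SketchDocument(path=path, blocks=blocks)


# Extract once, then render the same object in several formats.
doc = sketch_extract("invoice.txt", "Invoice\n\nTotal due: 42 EUR")
markdown = doc.to_markdown()
latex = doc.to_latex()
as_json = doc.to_json()
```

Every renderer reads the same `blocks` list, so adding an output format in this sketch means adding a method, not a second extraction pass.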
Batch Processing
Batch processing returns one result per file. One bad or unsupported document does not stop the batch.
Python:
TypeScript:
import { loadMany } from "@cristianexer/dongler";
for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
  if (result.ok) {
    console.log(result.document!.toMarkdown());
  } else {
    console.error(`${result.path}: ${result.error}`);
  }
}
Rust:
use load_many;
for result in load_many
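The "one result per file" behaviour can be sketched in a few lines of Python. This is an illustration of the pattern only, not Dongler's implementation: `SketchResult`, `sketch_load_many`, and `toy_loader` are hypothetical names, and the result fields (`path`, `ok`, `document`, `error`) are assumed from the TypeScript example above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class SketchResult:
    """Hypothetical per-file result; fields assumed from the TypeScript example."""
    path: str
    ok: bool
    document: Optional[str] = None
    error: Optional[str] = None


def sketch_load_many(paths: Iterable[str],
                     loader: Callable[[str], str]) -> List[SketchResult]:
    """Run the loader on every path and capture failures per file."""
    results = []
    for path in paths:
        try:
            results.append(SketchResult(path=path, ok=True, document=loader(path)))
        except Exception as exc:
            # A failure is recorded for this file only; the loop continues.
            results.append(SketchResult(path=path, ok=False, error=str(exc)))
    return results


def toy_loader(path: str) -> str:
    """Toy loader: accepts .txt and .pdf names, rejects everything else."""
    if not path.endswith((".txt", ".pdf")):
        raise ValueError("unsupported format")
    return f"document from {path}"


results = sketch_load_many(["notes.txt", "photo.xyz", "invoice.pdf"], toy_loader)
for result in results:
    if result.ok:
        print(result.document)
    else:
        print(f"{result.path}: {result.error}")
```

Because errors are captured as data, the caller decides how to handle them (log, retry, or skip) after the whole batch has run.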
CLI
PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
API Surface
The high-level object API:
- Rust: load_path(path), load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
- Python: dongler.load(path), dongler.load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
- TypeScript: load(path), loadMany(paths), doc.toMarkdown(), doc.toLatex(), doc.toJson()
Compatibility functions remain available:
- parse_text
- to_markdown
- to_latex
- to_json
- detect_format
Documentation
The Docusaurus documentation site lives in website/ and builds from docs/.
Development
Focused commands:
License
Dongler is licensed under the MIT License. See LICENSE and NOTICE.