Dongler
Dongler is a Rust-native document extraction engine with Python and TypeScript bindings. It is built for the workflow developers actually need: load a document path, extract structure once, then render clean Markdown or LaTeX from the same document object.
Created by Daniel Fat.
Status
Dongler 0.1.0 ships the stable package shape and a real .txt extraction
path. PDF is the primary product target and the public API is designed for that
workflow, but PDF extraction is not implemented yet.
| Format | Detection | Extraction |
|---|---|---|
| .txt, .text | yes | supported |
| .pdf | yes | planned |
| Word, Excel, HTML, images, email | yes | planned |
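The split between detection and extraction in the table can be pictured as a lookup from file extension to a format name and a support status. This is a minimal sketch, not Dongler's implementation: SUPPORT, detect_format_status, and every extension other than .txt, .text, and .pdf are assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical table mirroring the README: every listed format is detected,
# but only plain text is extracted today. The extensions chosen for the
# "planned" formats are assumptions, not Dongler's actual list.
SUPPORT = {
    ".txt": ("text", "supported"),
    ".text": ("text", "supported"),
    ".pdf": ("pdf", "planned"),
    ".docx": ("word", "planned"),
    ".xlsx": ("excel", "planned"),
    ".html": ("html", "planned"),
    ".png": ("image", "planned"),
    ".eml": ("email", "planned"),
}


def detect_format_status(path: str) -> tuple[str, str]:
    """Return (format, extraction status) for a path, by extension only."""
    suffix = Path(path).suffix.lower()
    return SUPPORT.get(suffix, ("unknown", "unsupported"))


for name in ["notes.txt", "invoice.PDF", "archive.zip"]:
    fmt, status = detect_format_status(name)
    print(f"{name}: {fmt} ({status})")
```

Keeping the status next to the detected format is one way to let a caller report "planned" separately from "unknown", which is the distinction the table draws.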
Current outputs:
- Markdown
- LaTeX
- JSON
- Dongler's typed document IR
Install
For Rust library usage, depend on dongler-core. The public dongler crate is
the CLI package.
Planned PDF Workflow
This is the API Dongler is building toward. Today, the same calls detect PDFs and return a clear planned-format error until the PDF engine lands.
Python:
import dongler
doc = dongler.load("invoice.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
TypeScript:
import { load } from "dongler";
const doc = load("invoice.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();
Rust:
use dongler_core::load_path;
Works Today
The same object API works today for text files.
Python:
import dongler
doc = dongler.load("notes.txt")
print(doc.to_markdown())
print(doc.to_latex())
TypeScript:
import { load } from "dongler";
const doc = load("notes.txt");
console.log(doc.metadata.block_count);
console.log(doc.toMarkdown());
console.log(doc.toLatex());
Rust:
use dongler_core::load_path;
Batch Processing
Batch processing returns one result per file. One bad or unsupported document does not stop the batch.
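Python:
import dongler
for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result.ok:
        print(result.document.to_markdown())
    else:
        print(f"{result.path}: {result.error}")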
TypeScript:
import { loadMany } from "dongler";
for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
  if (result.ok) {
    console.log(result.document!.toMarkdown());
  } else {
    console.error(`${result.path}: ${result.error}`);
  }
}
Rust:
use dongler_core::load_many;
for result in load_many(paths) {
    // One result per input path; check each for success or error.
}
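The per-file contract described in this section (one result per input, failures recorded instead of raised) can be sketched without the library. Everything below is illustrative: BatchResult, load_many_sketch, and fake_loader are hypothetical names, and the field names simply follow the TypeScript example above rather than a confirmed Python API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class BatchResult:
    """Hypothetical per-file result; fields mirror the TypeScript example."""
    path: str
    ok: bool
    document: Optional[str] = None
    error: Optional[str] = None


def load_many_sketch(
    paths: Iterable[str], load_one: Callable[[str], str]
) -> list[BatchResult]:
    """Call load_one on every path, turning exceptions into failed results."""
    results = []
    for path in paths:
        try:
            results.append(BatchResult(path=path, ok=True, document=load_one(path)))
        except Exception as exc:  # a failure is recorded, not re-raised
            results.append(BatchResult(path=path, ok=False, error=str(exc)))
    return results


def fake_loader(path: str) -> str:
    """Stand-in loader: accepts .txt and rejects everything else."""
    if not path.endswith(".txt"):
        raise ValueError("format is planned but not implemented")
    return f"# {path}"


for result in load_many_sketch(["notes.txt", "invoice.pdf", "todo.txt"], fake_loader):
    print(result.path, "ok" if result.ok else f"error: {result.error}")
```

Catching the error inside the loop is what keeps the unsupported PDF in the middle from preventing todo.txt from being processed.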
CLI
PDF extraction through the CLI will use the same engine as the Rust, Python, and TypeScript packages once it is implemented.
API Surface
The high-level object API:
- Rust: load_path(path), load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
- Python: dongler.load(path), dongler.load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
- TypeScript: load(path), loadMany(paths), doc.toMarkdown(), doc.toLatex(), doc.toJson()
Compatibility functions remain available:
- parse_text
- to_markdown
- to_latex
- to_json
- detect_format
Architecture
Rust is the source of truth. Python and TypeScript are thin native bindings over the Rust core.
flowchart LR
Path["Document path"] --> Format["Format detection"]
Format --> Loader["Source loader"]
Loader --> Engine["Extraction engine"]
Engine --> IR["Document IR"]
IR --> Markdown["Markdown"]
IR --> Latex["LaTeX"]
IR --> Json["JSON"]
IR --> Python["Python object API"]
IR --> TypeScript["TypeScript object API"]
IR --> CLI["CLI"]
The current text engine proves the pipeline. The PDF engine will plug into the same loader, engine, IR, and renderer boundaries.
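The boundaries in the flowchart can be illustrated with a toy pipeline in which every renderer reads the same intermediate representation and engines are looked up by format. This is a sketch under stated assumptions, not Dongler's Rust core: Block, Document, ENGINES, and the function names here are hypothetical, and the renderers are deliberately simplistic.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Block:
    kind: str
    text: str


@dataclass
class Document:
    source_format: str
    blocks: list[Block]


def extract_text(raw: str) -> list[Block]:
    """Toy text engine: each blank-line-separated chunk becomes a paragraph."""
    chunks = [chunk.strip() for chunk in raw.split("\n\n")]
    return [Block("paragraph", chunk) for chunk in chunks if chunk]


# Engine registry: a future PDF engine would be one more entry here.
ENGINES = {"text": extract_text}


def build_document(source_format: str, raw: str) -> Document:
    """Run the engine registered for the format and wrap its blocks in the IR."""
    if source_format not in ENGINES:
        raise NotImplementedError(f"{source_format} extraction is planned")
    return Document(source_format, ENGINES[source_format](raw))


def to_markdown(doc: Document) -> str:
    return "\n\n".join(block.text for block in doc.blocks)


def to_latex(doc: Document) -> str:
    # No LaTeX escaping here; a real renderer would need it.
    return "\n\n".join(block.text + r"\par" for block in doc.blocks)


def to_json(doc: Document) -> str:
    return json.dumps(asdict(doc))


doc = build_document("text", "First paragraph.\n\nSecond paragraph.")
print(to_markdown(doc))
print(to_latex(doc))
print(to_json(doc))
```

Because the renderers depend only on Document, adding an engine does not touch them, which is the property the paragraph above attributes to the loader, engine, IR, and renderer boundaries.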
Documentation
The Docusaurus documentation site lives in website/ and builds from docs/.
Development
Focused commands:
License
Dongler is licensed under the MIT License. See LICENSE and NOTICE.