dongler-core 0.2.0

Created by Daniel Fat. Rust-native document extraction core for structured Markdown and LaTeX output.

Dongler

Dongler is a Rust-native document extraction engine with Python and TypeScript bindings. It is built for the workflow developers actually need: load a document path, extract structure once, then render clean Markdown or LaTeX from the same document object.

Status

Dongler ships .txt extraction and a native Rust PDF extraction path with page geometry, text source anchors, basic table reconstruction, and image object positions.

Format                              Detection   Extraction
.txt, .text                         yes         supported
.pdf                                yes         supported
Word, Excel, HTML, images, email    yes         planned

Current outputs:

  • Markdown
  • LaTeX
  • JSON
  • Dongler's typed document IR
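
Because only some detected formats can be extracted today, it can help to sort input paths before handing them to Dongler. The sketch below is a hypothetical helper, not part of the Dongler API: it assumes matching on the lowercase file suffix and uses only the extensions the Status table lists as supported. Dongler's own detection may work differently, for example by inspecting file contents.

```python
from pathlib import Path

# Extensions the Status table lists as supported for extraction.
# Assumption: matching is done on the lowercase file suffix only.
EXTRACTABLE_SUFFIXES = {".txt", ".text", ".pdf"}


def is_extractable(path: str) -> bool:
    """Return True if the path's suffix is listed as supported."""
    return Path(path).suffix.lower() in EXTRACTABLE_SUFFIXES


def split_by_support(paths):
    """Split paths into (extractable, not_yet_supported), keeping order."""
    ready, pending = [], []
    for p in paths:
        (ready if is_extractable(p) else pending).append(p)
    return ready, pending
```

The names is_extractable and split_by_support are illustrative only. If more formats move from planned to supported, the suffix set is the one place to update.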

Install

cargo install dongler
pip install dongler
npm install @cristianexer/dongler

For Rust library usage, depend on dongler-core. The public dongler crate is the CLI package.

API

Document extraction is a two-step process: load a document path, then render the extracted structure in the format you need. The same document object can be rendered to Markdown, LaTeX, or JSON without re-extracting the document.

Python:

import dongler

doc = dongler.load("invoice.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()

TypeScript:

import { load } from "@cristianexer/dongler";

const doc = load("invoice.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();

Rust:

use dongler_core::load_path;

fn main() -> dongler_core::Result<()> {
    let doc = load_path("invoice.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}
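
To illustrate the load-once, render-many idea, here is a minimal Python sketch that renders one already-loaded document to each text output. The helper name render_all is hypothetical. It relies only on the to_markdown(), to_latex(), and to_json() methods listed for the Python binding and makes no assumption about their return types.

```python
def render_all(doc):
    """Render one already-loaded document to every text output.

    `doc` is any object exposing to_markdown(), to_latex(), and to_json().
    Nothing is re-extracted here; only the render methods are called.
    """
    return {
        "markdown": doc.to_markdown(),
        "latex": doc.to_latex(),
        "json": doc.to_json(),
    }


# Usage, assuming the dongler package is installed:
#   import dongler
#   outputs = render_all(dongler.load("invoice.pdf"))
#   print(outputs["markdown"])
```

Keeping the helper independent of the import makes it easy to exercise with a stand-in object in tests.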

Batch Processing

Batch processing returns one result per file. One bad or unsupported document does not stop the batch.

Python:

import dongler

for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")

TypeScript:

import { loadMany } from "@cristianexer/dongler";

for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
  if (result.ok) {
    console.log(result.document!.toMarkdown());
  } else {
    console.error(`${result.path}: ${result.error}`);
  }
}

Rust:

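use dongler_core::load_many;

fn main() {
    // Each result reports success or failure for one input path.
    for result in load_many(["notes.txt", "invoice.pdf"]) {
        if result.ok {
            // `document` is unwrapped only when `ok` is true.
            println!("{}", result.document.unwrap().to_markdown().unwrap());
        } else {
            // `error` is unwrapped only when `ok` is false.
            eprintln!("{}: {}", result.path, result.error.unwrap());
        }
    }
}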
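
Since each batch entry succeeds or fails on its own, callers often want the successes and failures collected separately. The following Python sketch shows one way to do that. The helper partition_results is hypothetical, and it assumes each result is a mapping with the keys used in the Python example above ("ok", "path", and either "document" or "error"), with "path" present on successful results as well.

```python
def partition_results(results):
    """Split batch results into rendered Markdown and error messages.

    Assumption: every result has "ok" and "path"; successful results
    carry "document", failed results carry "error".
    """
    rendered, failed = {}, {}
    for result in results:
        if result["ok"]:
            rendered[result["path"]] = result["document"].to_markdown()
        else:
            failed[result["path"]] = str(result["error"])
    return rendered, failed


# Usage, assuming the dongler package is installed:
#   import dongler
#   rendered, failed = partition_results(
#       dongler.load_many(["notes.txt", "invoice.pdf"])
#   )
```

Returning two dictionaries keyed by path keeps the failure report available after the loop, instead of printing errors as they occur.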

CLI

dongler --version
dongler inspect notes.txt
dongler inspect invoice.pdf
dongler extract notes.txt --format markdown
dongler extract notes.txt --format latex
dongler extract notes.txt --format json

PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
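
Where the language bindings are not installed, the CLI can be driven from a script. The Python sketch below builds the `dongler extract` command shown above and runs it with the standard library. The function names are hypothetical, and it assumes the CLI writes the rendered output to stdout and exits with a non-zero status on failure; neither behavior is confirmed here.

```python
import subprocess

# Formats shown in the CLI examples above.
FORMATS = ("markdown", "latex", "json")


def extract_command(path: str, fmt: str = "markdown") -> list[str]:
    """Build the argv for `dongler extract <path> --format <fmt>`."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["dongler", "extract", path, "--format", fmt]


def extract_via_cli(path: str, fmt: str = "markdown") -> str:
    """Run the CLI and return its stdout.

    Assumption: output goes to stdout and failures give a non-zero exit
    code, in which case check=True raises CalledProcessError.
    """
    completed = subprocess.run(
        extract_command(path, fmt),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.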

API Surface

The high-level object API:

  • Rust: load_path(path), load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
  • Python: dongler.load(path), dongler.load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
  • TypeScript: load(path), loadMany(paths), doc.toMarkdown(), doc.toLatex(), doc.toJson()

Compatibility functions remain available:

  • parse_text
  • to_markdown
  • to_latex
  • to_json
  • detect_format

Documentation

The Docusaurus documentation site lives in website/ and builds from docs/.

cd website
npm install
npm run start
npm run build

Development

make test
make build

Focused commands:

make test-rust
make test-python
make test-js
make build-docs

License

Dongler is licensed under the MIT License. See LICENSE and NOTICE.