mdka 2.0.2

HTML to Markdown converter
# Architecture

## Workspace Layout

```
mdka/
├── src/               mdka library crate (lib only)
│   ├── lib.rs             Public API surface
│   ├── options.rs         ConversionMode, ConversionOptions
│   ├── traversal.rs       Markdown conversion traversal
│   ├── renderer.rs        MarkdownRenderer state machine
│   ├── utils.rs           Whitespace normalisation + escaping
│   └── alloc_counter.rs   Custom allocator (for benchmarks)
├── tests/             integration test modules
│   └── utils/preprocessor.rs    DOM pre-processing pipeline
├── cli/               mdka-cli binary crate
│   └── src/main.rs        Argument parsing + dispatch
├── node/              Node.js bindings (napi-rs v3)
├── python/            Python bindings (PyO3 v0)
├── benches/           criterion benchmarks
└── examples/          Allocation measurement tool
```

## Conversion Pipeline

Each call to `html_to_markdown_with` follows these steps:

```
HTML string
[1] Parse          scraper::Html::parse_document()
    │               → html5ever DOM tree (tolerant HTML5 parsing)
[2] Pre-process    preprocessor::preprocess(&doc, opts)
    │               → filtered HTML string
    │               Non-recursive DFS over ego-tree nodes
    │               Drops: script, style, iframe, …
    │               Filters attributes per ConversionOptions
    │               Removes shell elements (if opted in)
    │               Unwraps anonymous wrappers (if opted in)
[3] Re-parse       scraper::Html::parse_document(&cleaned)
    │               → clean DOM for conversion
[4] Convert        traversal::traverse(&doc)
    │               → Markdown string
    │               Non-recursive DFS with Enter/Leave events
    │               Drives MarkdownRenderer via event callbacks
[5] Finalise       renderer.finish()
                    → trim leading/trailing whitespace
                    → ensure single trailing newline
```
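The Enter/Leave traversal in step [4] can be sketched with an explicit stack. This is a minimal illustration, not the crate's code: `Node`, `Event`, `Work`, and `collect_events` are hypothetical names invented here, and the toy tree stands in for the scraper/ego-tree nodes the real traversal walks.

```rust
/// Toy DOM node (hypothetical); the real pipeline walks ego-tree nodes.
#[derive(Debug)]
enum Node {
    Element { tag: &'static str, children: Vec<Node> },
    Text(&'static str),
}

/// Events a renderer would receive during traversal (hypothetical type).
#[derive(Debug, PartialEq)]
enum Event<'a> {
    Enter(&'a str),
    Text(&'a str),
    Leave(&'a str),
}

/// Items kept on the explicit stack in place of call-stack frames.
enum Work<'a> {
    Visit(&'a Node),
    Leave(&'a str),
}

/// Non-recursive depth-first traversal that brackets every element
/// with an Enter event and a Leave event.
fn traverse<'a>(root: &'a Node, mut on_event: impl FnMut(Event<'a>)) {
    let mut stack = vec![Work::Visit(root)];
    while let Some(item) = stack.pop() {
        match item {
            Work::Visit(Node::Text(text)) => on_event(Event::Text(*text)),
            Work::Visit(Node::Element { tag, children }) => {
                on_event(Event::Enter(*tag));
                // Push Leave first so it pops only after all children.
                stack.push(Work::Leave(*tag));
                // Push children in reverse so the first child pops first.
                for child in children.iter().rev() {
                    stack.push(Work::Visit(child));
                }
            }
            Work::Leave(tag) => on_event(Event::Leave(tag)),
        }
    }
}

/// Convenience helper: collects the event stream into a Vec.
fn collect_events(root: &Node) -> Vec<Event<'_>> {
    let mut events = Vec::new();
    traverse(root, |event| events.push(event));
    events
}
```

Because the pending work lives in a heap-allocated `Vec` rather than in call frames, deeply nested input cannot overflow the call stack, which is the usual motivation for a non-recursive traversal.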

## MarkdownRenderer

`MarkdownRenderer` is a state machine that maintains:

- **`output`**: the accumulated Markdown string
- **`list_stack`**: tracks nested ordered/unordered lists
- **`blockquote_depth`**: counts blockquote nesting level
- **`in_pre`**: whether inside a `<pre>` block
- **`at_line_start`**: flag that defers emission of the blockquote `> ` prefix until content is written
- **`newlines_emitted`**: prevents double-blank-line accumulation

The `at_line_start` flag is key: rather than emitting `> ` prefixes
immediately when entering a blockquote, the renderer defers them until
actual content is written. This ensures nested blockquotes emit the
correct number of `>` characters regardless of how many block elements
intervene.
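
The deferred-prefix idea can be sketched as follows. This is a simplified, hypothetical `MiniRenderer`, not the crate's `MarkdownRenderer`: the field names follow the list above, but the method names, the treatment of `newlines_emitted` as a counter, and the `finish` body are assumptions made for illustration. It also leaves blank lines inside a blockquote unprefixed, which a full renderer would need to handle.

```rust
/// Hypothetical, simplified renderer; only the fields needed for the sketch.
#[derive(Default)]
struct MiniRenderer {
    output: String,
    blockquote_depth: usize,
    at_line_start: bool,
    newlines_emitted: usize,
}

impl MiniRenderer {
    fn new() -> Self {
        // Start "at a line start" with blank-line credit so that no
        // leading newlines are written before the first content.
        Self { at_line_start: true, newlines_emitted: 2, ..Self::default() }
    }

    /// Entering a blockquote only bumps the depth; no `> ` is written yet.
    fn enter_blockquote(&mut self) {
        self.blockquote_depth += 1;
    }

    fn leave_blockquote(&mut self) {
        self.blockquote_depth = self.blockquote_depth.saturating_sub(1);
        self.blank_line();
    }

    /// Writes text, emitting one deferred `> ` per nesting level first.
    fn write_text(&mut self, text: &str) {
        if text.is_empty() {
            return;
        }
        if self.at_line_start {
            for _ in 0..self.blockquote_depth {
                self.output.push_str("> ");
            }
            self.at_line_start = false;
        }
        self.output.push_str(text);
        self.newlines_emitted = 0;
    }

    /// Ends the current block with a blank line, never stacking a second one.
    fn blank_line(&mut self) {
        while self.newlines_emitted < 2 {
            self.output.push('\n');
            self.newlines_emitted += 1;
        }
        self.at_line_start = true;
    }

    /// Trims surrounding whitespace and ensures a single trailing newline.
    fn finish(self) -> String {
        format!("{}\n", self.output.trim())
    }
}
```

In this sketch, two consecutive `enter_blockquote` calls followed by `write_text("inner")` produce the line `> > inner`: the prefix count is read from `blockquote_depth` at the moment content is written, not when each blockquote is entered.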

## Language Bindings

Both the Node.js and Python bindings are thin wrappers:

- **Node.js** (napi-rs): exposes sync and async variants. The async variants
  run the conversion via `tokio::task::spawn_blocking`, so the Node.js event
  loop is not blocked during conversion.
- **Python** (PyO3): the batch function `html_to_markdown_many` calls
  `py.detach()` to release the GIL while rayon converts the inputs in parallel.

The binding crates (`mdka-node`, `mdka-python`) have no conversion logic
of their own — they call the same Rust functions as the library and CLI.