markdown2pdf 0.4.0

Create PDF with Markdown files (a md to pdf transpiler)
# PDF Renderer Handoff — bringing the renderer up to the lexer's coverage

## Mission

The markdown lexer is now feature-complete: every CommonMark 0.31.2
construct plus GFM tables / task lists / strikethrough parses cleanly
into a well-typed AST. The PDF rendering layer at `src/lib/pdf.rs`
has not kept up. **Your job is to design and implement the PDF-side
work so every `Token` variant the lexer emits has a faithful PDF
rendering.**

This document is the brief. It is **not** an implementation plan —
producing the plan is part of the task. Before writing any code you
must read both the lexer's public API (`src/lib/markdown.rs` — the
`Token` enum and its variants) **and** the underlying PDF library
(`genpdfi`, our fork of `genpdf` that ships drawable primitives:
`Document`, `Paragraph`, `Style`, list layouts, page breaks, etc.).
The decisions about how to represent each construct in a PDF — what
spacing, what fonts, what fallbacks, what gets dropped vs. preserved —
are yours to make based on what `genpdfi` can actually do.

## Current state of the lexer (what you can rely on)

- **CommonMark 0.31.2 spec compliance: 100% (652/652 examples).**
- **GFM extensions:** tables, strikethrough, task list items,
  autolinks (URL + email). All at 100% non-HTML.
- **Test coverage:** ~750 inline integration tests + 652 spec
  examples + 43 stress / robustness tests. All green, zero warnings.
- **Public AST entry-point:**
  - `markdown2pdf::markdown::Lexer::new(input: String)` — constructor.
  - `Lexer::parse() -> Result<Vec<Token>, LexerError>` — main entry.
- **Token variants** (all `pub` — read the doc-comments in
  `src/lib/markdown.rs` for full semantics):
  - `Text(String)`
  - `Heading(Vec<Token>, usize)` — level 1–6
  - `Emphasis { level, content }` — level 1 (italic) or 2 (bold)
  - `StrongEmphasis(Vec<Token>)`
  - `Code { language, content, block }` — inline (`block: false`) or
    fenced/indented block (`block: true`)
  - `BlockQuote(Vec<Token>)` — body is itself parsed tokens, can nest
  - `ListItem { content, ordered, number, marker, checked, loose }`
    (`checked: Option<bool>` for GFM task lists; `loose` is the
    blank-line-separation flag)
  - `Link { content, url, title }`
  - `Image { alt, url, title }` — already parsed; renderer currently
    flattens to a styled link
  - `Table { headers, aligns, rows }` — GFM tables with column
    alignment
  - `Strikethrough(Vec<Token>)`
  - `HtmlBlock(String)` — verbatim block-level HTML
  - `HtmlInline(String)` — inline raw HTML tag, comment, etc.
  - `HtmlComment(String)` — inline HTML comment payload (no
    delimiters)
  - `HardBreak`, `Newline`, `HorizontalRule`, `DelimRun`,
    `TableAlignment`, `Unknown`

There is also a `Token::collect_all_text(tokens)` utility that
recursively pulls every visible character out of a token slice — use
it for font-subsetting and for fallback "render as plain text"
behavior on tokens you don't yet support.
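
As a rough sketch of the "render as plain text" fallback, the snippet below uses a cut-down stand-in `Token` enum (four variants only, not the crate's real definition) and a hypothetical `plain_text` helper that mimics what `Token::collect_all_text` is described as doing. The real function's signature and exact output may differ; check `src/lib/markdown.rs` before relying on it.

```rust
// Stand-in for the crate's `Token` enum: a hypothetical subset for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Text(String),
    StrongEmphasis(Vec<Token>),
    BlockQuote(Vec<Token>),
    HtmlBlock(String),
}

// Hypothetical helper: recursively gathers the text payload of a token slice,
// descending into nested bodies in document order.
fn plain_text(tokens: &[Token]) -> String {
    let mut out = String::new();
    for token in tokens {
        match token {
            Token::Text(s) | Token::HtmlBlock(s) => out.push_str(s),
            Token::StrongEmphasis(inner) | Token::BlockQuote(inner) => {
                out.push_str(&plain_text(inner));
            }
        }
    }
    out
}

fn main() {
    let tokens = vec![
        Token::Text("see ".to_string()),
        Token::BlockQuote(vec![Token::StrongEmphasis(vec![Token::Text(
            "this".to_string(),
        )])]),
        Token::HtmlBlock("<div>".to_string()),
    ];
    println!("{}", plain_text(&tokens));
}
```

A renderer arm for a not-yet-supported token could feed the result of such a helper into an ordinary paragraph, so the content stays visible until a dedicated rendering exists.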

## How to inspect what the lexer produces

There is an AST dumper in `examples/dump_ast.rs`:

```bash
cargo run --example dump_ast -- path/to/file.md
```

It prints one token per line, with nested bodies indented, plus a
tally of HTML constructs. Use it to verify the AST for any test
fixture before deciding how the renderer should handle it.

Fixture documents covering every construct live in `test_fixtures/`
(numbered `01_*.md` through `19_*.md`). Render any of them with:

```bash
cargo run --bin markdown2pdf -- -p test_fixtures/19_html_kitchen_sink.md -o out.pdf && open out.pdf
```

That command pair (dump_ast + render) is your feedback loop.

## What the PDF renderer does today (high-level, not prescriptive)

Located at `src/lib/pdf.rs` (~900 lines).

- Has dedicated rendering for: paragraph text, headings, code blocks,
  emphasis, strong emphasis, links, inline code, horizontal rules,
  bullet / ordered / task lists, GFM tables, blockquotes (with
  caveats), and a small subset of inline HTML (`<b>`, `<i>`, `<u>`,
  `<s>` toggles plus `<br>`).
- Reads styling from `Pdf::style` (see `src/lib/styling.rs`). Every
  visual decision (font size, spacing, colors, italic/bold flags,
  underline) is driven by configurable style fields. Preserve that
  pattern — don't hard-code aesthetics into the renderer.
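
To illustrate what "plugging into the style config" could look like for a new construct, here is a minimal sketch. `TextStyle`, `StyleSheet`, and `html_block_style` are hypothetical names invented for this example; the real field names and types are in `src/lib/styling.rs` and will differ.

```rust
// Hypothetical style types for illustration; the real fields live in `src/lib/styling.rs`.
#[derive(Debug, Clone, PartialEq)]
struct TextStyle {
    size: u8,
    bold: bool,
    monospace: bool,
}

struct StyleSheet {
    code: TextStyle,
    // A new construct gets its own optional field instead of a hard-coded look.
    html_block: Option<TextStyle>,
}

// Picks the configured HtmlBlock style, falling back to the code-block style
// when the user has not set one.
fn html_block_style(sheet: &StyleSheet) -> &TextStyle {
    sheet.html_block.as_ref().unwrap_or(&sheet.code)
}

fn main() {
    let sheet = StyleSheet {
        code: TextStyle { size: 10, bold: false, monospace: true },
        html_block: None,
    };
    println!("{:?}", html_block_style(&sheet));
}
```

The point of the fallback is that existing user configs keep working while the new field stays overridable.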

## Known gaps (the starting points — there will be more)

You should **not** treat this list as a spec. It's what surfaced
during the lexer's verification pass. Your own audit will likely
turn up more.

1. **`Token::HtmlBlock` has no renderer arm.** Falls into the
   catch-all `_ => {}` and is silently dropped. So every `<div>`,
   `<table>`, `<script>`, `<!DOCTYPE>`, `<![CDATA[…]]>`, etc. that
   the lexer correctly parses ends up invisible in PDF output.
2. **`Token::Image` renders as a styled link**, not as an embedded
   image. Tracked by [#13](https://github.com/theiskaa/markdown2pdf/issues/13).
   Open questions: where does the image come from (local path,
   URL fetch, configurable defaults), how is size determined, and
   how does it interact with page layout / page breaks?
3. **Nested `BlockQuote` inside `BlockQuote` is dropped.**
   `render_blockquote` only recognizes `Heading`, `Code{block:true}`,
   `HorizontalRule`, `ListItem` as block-level inside a quote. A
   nested quote (or a `Token::HtmlBlock`, or a `Token::Table`) falls
   into the inline buffer and renders as nothing (or just the `> `
   prefix).
4. **`Token::Table` inside blockquotes and list items is dropped.**
   Same root cause as the nested-blockquote issue.
5. **`Token::HardBreak` inside a paragraph renders as a space.**
   `genpdfi`'s `Paragraph` has no forced-linefeed primitive. Verify
   whether this is acceptable, or whether the renderer should
   close+reopen paragraphs to force a real break.
6. **`Token::HtmlInline` for tags outside the small whitelist
   (`<b>`, `<i>`, `<u>`, `<s>`/`<del>`/`<strike>`, `<strong>`,
   `<em>`, `<br>`) falls through to literal text.** Decide whether to
   render unknown inline HTML verbatim (current behavior), strip it,
   or grow the whitelist.
7. **`Token::HtmlComment` inline is silently dropped on render.**
   The text-renderer doesn't have an arm for it. Decide whether to
   keep that (sensible for "comments are invisible") or expose it
   somehow (e.g. as a note).
8. **`Token::Link.title` is parsed but never displayed.** PDF has no
   hover-tooltip equivalent. Decide: ignore, render as superscript,
   render as marginalia, or invent a config flag.
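
For gap 5, the "close+reopen paragraphs" option can be sketched as a pure splitting step that runs before any `genpdfi` call. The `Token` enum here is a two-variant stand-in and `split_on_hard_breaks` is a hypothetical helper, not something that exists in `src/lib/pdf.rs`. Whether emitting each segment as its own `Paragraph` yields acceptable line spacing is an open assumption that needs checking against `genpdfi`.

```rust
// Stand-in for the crate's `Token` enum: a hypothetical subset for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Text(String),
    HardBreak,
}

// Splits an inline run into segments at every `HardBreak`.
// The breaks themselves are consumed; empty segments are kept, so two
// consecutive breaks produce an empty segment between them.
fn split_on_hard_breaks(inline: &[Token]) -> Vec<Vec<Token>> {
    inline
        .split(|t| matches!(t, Token::HardBreak))
        .map(|segment| segment.to_vec())
        .collect()
}

fn main() {
    let inline = vec![
        Token::Text("line one".to_string()),
        Token::HardBreak,
        Token::Text("line two".to_string()),
    ];
    for segment in split_on_hard_breaks(&inline) {
        // Each segment would become its own paragraph in the renderer.
        println!("{:?}", segment);
    }
}
```

Keeping the split separate from the drawing code makes it testable without producing a PDF.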

## What you must do before writing any code

1. **Read the full `Token` enum** in `src/lib/markdown.rs`
   (lines ~52–156 for the enum; ~178–268 for `collect_all_text`;
   the `#[cfg(test)]` blocks are in `tests/markdown/`). Make sure
   you understand every variant's invariants.
2. **Read the entire `src/lib/pdf.rs`** end to end. Note which
   variants have dedicated arms, which fall through, and which are
   handled inconsistently across contexts (the same `Token` may
   render correctly at top level but be dropped inside a blockquote
   or list item — find every such asymmetry).
3. **Audit `genpdfi`** — the crate is on crates.io as `genpdfi`
   (this project's fork of `genpdf`). Find the
   `Document`, `Paragraph`, `Style`, `Image`, `LinearLayout`,
   `OrderedList` / `UnorderedList`, `TableLayout`, etc. primitives.
   List which AST tokens have a clean mapping and which require
   workarounds, composition, or genpdfi extensions.
4. **Build a coverage matrix.** Rows = every `Token` variant; columns
   = `Root`, `inside BlockQuote`, `inside ListItem`, `inside Table
   cell` (for inline tokens). Mark every cell: implemented, dropped,
   inconsistent. The matrix is your spec for what's missing.
5. **Inspect the existing styling surface** in `src/lib/styling.rs`
   so any new rendering you add plugs into the same config story.
   Don't bypass it.
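
The coverage matrix from step 4 can be kept as data and printed as a GFM table for the PR description. The sketch below is one possible shape: `Status`, `CONTEXTS`, and `render_matrix` are hypothetical names, and the single example row only restates what this brief already says about `Table` (dedicated rendering at root, dropped inside blockquotes and list items). It is not an audit result.

```rust
use std::collections::BTreeMap;

// Hypothetical cell states; `NotApplicable` covers block tokens in the table-cell column.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Implemented,
    Dropped,
    Inconsistent,
    NotApplicable,
}

const CONTEXTS: [&str; 4] = ["Root", "BlockQuote", "ListItem", "Table cell"];

// Renders the matrix as a GFM table; rows come out sorted by variant name.
fn render_matrix(rows: &BTreeMap<&str, [Status; 4]>) -> String {
    let mut out = format!("| Token | {} |\n", CONTEXTS.join(" | "));
    out.push_str("|---|---|---|---|---|\n");
    for (variant, cells) in rows {
        let cells: Vec<String> = cells.iter().map(|s| format!("{:?}", s)).collect();
        out.push_str(&format!("| {} | {} |\n", variant, cells.join(" | ")));
    }
    out
}

fn main() {
    let mut rows = BTreeMap::new();
    // Example row only, taken from the known-gaps list in this brief.
    rows.insert(
        "Table",
        [
            Status::Implemented,
            Status::Dropped,
            Status::Dropped,
            Status::NotApplicable,
        ],
    );
    print!("{}", render_matrix(&rows));
}
```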

## What to deliver (the plan)

A document or PR description containing:

1. The coverage matrix from step 4 above, with every gap labeled.
2. Per gap: the proposed `genpdfi` primitive(s) to use, any
   composition logic required, and which `Pdf::style` fields are
   relevant.
3. An ordering: which gap to close first, dependencies between gaps,
   and an estimate of how many gaps are quick wins vs. structural
   work. (Image embedding is structural — fetching, decoding, sizing,
   pagination. HtmlBlock is closer to a quick win — verbatim
   monospace passthrough is one option.)
4. A verification plan. Reuse the existing fixtures in
   `test_fixtures/` and `examples/dump_ast.rs`. Add new fixtures or
   golden PDFs as needed. Whatever you propose, every renderer change
   should be backed by a fixture you can visually diff before and
   after.
5. Test additions. The lexer side has ~750 integration tests and a
   stress suite — the renderer should have comparable depth. Today
   the renderer's test coverage is much thinner. Identify what kind
   of tests (snapshot, golden-pdf, layout assertion, smoke) make
   sense here, and put a number on how many you intend to add.

## Out-of-scope (intentionally)

- **Changing the lexer.** Parsing is done. If a renderer concern
  surfaces a real lexer bug, file an issue — don't paper over it in
  the renderer.
- **New markdown extensions** (math, footnotes, mermaid,
  admonitions). The branch that just shipped takes the lexer to a
  clean 100% CommonMark + GFM. New constructs are a different
  conversation; resist scope creep.
- **Migrating off `genpdfi`.** If `genpdfi` can't do something we
  need, the answer is to extend `genpdfi` (it's our fork) or to
  layer composition on top in this crate — not to swap libraries.

## Final checklist before you say "ready to merge"

- [ ] Coverage matrix is committed in the PR description.
- [ ] Every `Token` variant has documented PDF behavior, even if that
      behavior is "intentionally skipped" (record that decision).
- [ ] All fixtures in `test_fixtures/` produce a PDF that visually
      represents the markdown content the user wrote.
- [ ] `cargo test --all-targets` is green with zero warnings.
- [ ] `cargo run --example dump_ast -- <fixture>` and
      `cargo run --bin markdown2pdf -- -p <fixture> -o <out>` give
      consistent output (token tally aligns with what's in the PDF).
- [ ] The PR description tells reviewers, in plain language, which
      gaps closed, which were deliberately left open, and why.

---

**Bottom line:** the parser produces clean, faithful tokens. The
PDF renderer is the next surface. Read both layers, build the
matrix, decide the rendering story per construct, then implement.
Don't accept any inline raw-HTML or block-HTML construct silently
disappearing from the output by default — at minimum, render
verbatim or document the skip.
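
One way to enforce "nothing disappears silently" is to drop the wildcard arm so the compiler rejects any unhandled variant. The sketch below uses a three-variant stand-in `Token` and a hypothetical `Rendered` type; the choice to skip comments is just one of the options listed under gap 7, shown here as an explicit, documented decision rather than a recommendation.

```rust
// Stand-in for the crate's `Token` enum: a hypothetical subset for illustration only.
enum Token {
    Text(String),
    HtmlBlock(String),
    HtmlComment(String),
}

// Hypothetical description of what the renderer decided to do with a token.
#[derive(Debug, PartialEq)]
enum Rendered {
    Paragraph(String),
    Verbatim(String),
    Skipped(&'static str),
}

fn render(token: &Token) -> Rendered {
    // No `_ => {}` arm: adding a variant to `Token` becomes a compile error
    // here until someone decides how it renders.
    match token {
        Token::Text(s) => Rendered::Paragraph(s.clone()),
        // Raw HTML is kept visible as verbatim text.
        Token::HtmlBlock(raw) => Rendered::Verbatim(raw.clone()),
        // An explicit skip carries its reason instead of vanishing.
        Token::HtmlComment(_) => Rendered::Skipped("HTML comments are invisible by design"),
    }
}

fn main() {
    let tokens = vec![
        Token::Text("hello".to_string()),
        Token::HtmlBlock("<div>raw</div>".to_string()),
        Token::HtmlComment(" note ".to_string()),
    ];
    for token in &tokens {
        println!("{:?}", render(token));
    }
}
```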