# PDF Renderer Handoff — bringing the renderer up to the lexer's coverage
## Mission
The markdown lexer is now feature-complete: every CommonMark 0.31.2
construct plus GFM tables / task lists / strikethrough parses cleanly
into a well-typed AST. The PDF rendering layer at `src/lib/pdf.rs`
has not kept up. **Your job is to design and implement the PDF-side
work so every `Token` variant the lexer emits has a faithful PDF
rendering.**
This document is the brief. It is **not** an implementation plan —
producing the plan is part of the task. Before writing any code you
must read both the lexer's public API (`src/lib/markdown.rs` — the
`Token` enum and its variants) **and** the underlying PDF library
(`genpdfi`, our fork of `genpdf` that ships drawable primitives:
`Document`, `Paragraph`, `Style`, list layouts, page breaks, etc.).
The decisions about how to represent each construct in a PDF — what
spacing, what fonts, what fallbacks, what gets dropped vs. preserved —
are yours to make based on what `genpdfi` can actually do.
## Current state of the lexer (what you can rely on)
- **CommonMark 0.31.2 spec compliance: 100% (652/652 examples).**
- **GFM extensions:** tables, strikethrough, task list items,
autolinks (URL + email). All at 100% non-HTML.
- **Test coverage:** ~750 inline integration tests + 652 spec
examples + 43 stress / robustness tests. All green, zero warnings.
- **Public AST entry-point:**
- `markdown2pdf::markdown::Lexer::new(input: String)` — constructor.
- `Lexer::parse() -> Result<Vec<Token>, LexerError>` — main entry.
- **Token variants** (all `pub` — read the doc-comments in
`src/lib/markdown.rs` for full semantics):
- `Text(String)`
- `Heading(Vec<Token>, usize)` — level 1–6
- `Emphasis { level, content }` — level 1 (italic) or 2 (bold)
- `StrongEmphasis(Vec<Token>)`
- `Code { language, content, block }` — inline (`block: false`) or
fenced/indented block (`block: true`)
- `BlockQuote(Vec<Token>)` — body is itself parsed tokens, can nest
- `ListItem { content, ordered, number, marker, checked, loose }` —
`checked: Option<bool>` for GFM task lists; `loose` is the
blank-line-separation flag
- `Link { content, url, title }`
- `Image { alt, url, title }` — already parsed; renderer currently
flattens to a styled link
- `Table { headers, aligns, rows }` — GFM tables with column
alignment
- `Strikethrough(Vec<Token>)`
- `HtmlBlock(String)` — verbatim block-level HTML
- `HtmlInline(String)` — inline raw HTML tag, comment, etc.
- `HtmlComment(String)` — inline HTML comment payload (no
delimiters)
- `HardBreak`, `Newline`, `HorizontalRule`, `DelimRun`,
`TableAlignment`, `Unknown`

There is also a `Token::collect_all_text(tokens)` utility that
recursively pulls every visible character out of a token slice. Use
it for font subsetting and as a fallback ("render as plain text") for
tokens you don't yet support.
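
A minimal sketch of the recursive walk such a utility performs, meant as a mental model before reading the real `collect_all_text`. The `Token` enum here is a reduced, hypothetical stand-in (its field types are assumed), `collect_visible_text` is an illustrative name rather than the crate's function, and treating `HardBreak` as a space is also an assumption.

```rust
// Hypothetical, reduced mirror of the lexer's `Token` enum: only enough
// variants to show the recursion. The real definition lives in
// `src/lib/markdown.rs` and may differ in field types.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Text(String),
    Heading(Vec<Token>, usize),
    Emphasis { level: usize, content: Vec<Token> },
    BlockQuote(Vec<Token>),
    HtmlComment(String),
    HardBreak,
}

/// Recursively gathers the visible text of a token slice.
/// Stand-in for `Token::collect_all_text`; not the crate's implementation.
pub fn collect_visible_text(tokens: &[Token]) -> String {
    let mut out = String::new();
    for token in tokens {
        match token {
            Token::Text(s) => out.push_str(s),
            // Container variants contribute whatever their children contain.
            Token::Heading(children, _)
            | Token::Emphasis { content: children, .. }
            | Token::BlockQuote(children) => out.push_str(&collect_visible_text(children)),
            // Comments carry no visible characters.
            Token::HtmlComment(_) => {}
            // Assumption: a break contributes whitespace so words stay separate.
            Token::HardBreak => out.push(' '),
        }
    }
    out
}
```

The same shape works as a last-resort renderer arm: a variant with no dedicated rendering can still surface its text instead of vanishing.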
## How to inspect what the lexer produces
There is an AST dumper in `examples/dump_ast.rs`:
```bash
cargo run --example dump_ast -- path/to/file.md
```
It prints one token per line, with nested bodies indented, plus a
tally of HTML constructs. Use it to verify the AST for any test
fixture before deciding how the renderer should handle it.
Fixture documents covering every construct live in `test_fixtures/`
(numbered `01_*.md` through `19_*.md`). Render any of them with:
```bash
# `open` is macOS-specific; substitute `xdg-open` on Linux.
cargo run --bin markdown2pdf -- -p test_fixtures/19_html_kitchen_sink.md -o out.pdf && open out.pdf
```
That command pair (dump_ast + render) is your feedback loop.
## What the PDF renderer does today (high-level, not prescriptive)
Located at `src/lib/pdf.rs` (~900 lines).
- Has dedicated rendering for: paragraph text, headings, code blocks,
emphasis, strong emphasis, links, inline code, horizontal rules,
bullet / ordered / task lists, GFM tables, blockquotes (with
caveats), and a small whitelist of inline HTML: the style toggles
`<b>`, `<i>`, `<u>`, `<s>` (with the aliases listed under gap 6
below) plus `<br>`.
- Reads styling from `Pdf::style` (see `src/lib/styling.rs`). Every
visual decision (font size, spacing, colors, italic/bold flags,
underline) is driven by configurable style fields. Preserve that
pattern — don't hard-code aesthetics into the renderer.
## Known gaps (the starting points — there will be more)
You should **not** treat this list as a spec. It's what surfaced
during the lexer's verification pass. Your own audit will likely
turn up more.
1. **`Token::HtmlBlock` has no renderer arm.** Falls into the
catch-all `_ => {}` and is silently dropped. So every `<div>`,
`<table>`, `<script>`, `<!DOCTYPE>`, `<![CDATA[…]]>`, etc. that
the lexer correctly parses ends up invisible in PDF output.
2. **`Token::Image` renders as a styled link**, not as an embedded
image. Tracked by [#13](https://github.com/theiskaa/markdown2pdf/issues/13).
Open questions: where does the image come from (local path,
URL fetch, configurable defaults), how is its size determined, and
how does it interact with page layout and page breaks?
3. **Nested `BlockQuote` inside `BlockQuote` is dropped.**
`render_blockquote` only recognizes `Heading`, `Code{block:true}`,
`HorizontalRule`, `ListItem` as block-level inside a quote. A
nested quote (or a `Token::HtmlBlock`, or a `Token::Table`) falls
into the inline buffer and renders as nothing (or just the `> `
prefix).
4. **`Token::Table` inside blockquotes and list items is dropped.**
Same root cause as the nested-blockquote issue.
5. **`Token::HardBreak` inside a paragraph renders as a space.**
`genpdfi`'s `Paragraph` has no forced-linefeed primitive. Verify
whether this is acceptable, or whether the renderer should
close+reopen paragraphs to force a real break.
6. **`Token::HtmlInline` for tags outside the small whitelist
(`<b>`, `<i>`, `<u>`, `<s>`/`<del>`/`<strike>`, `<strong>`,
`<em>`, `<br>`) falls through to literal text.** Decide whether to
render unknown inline HTML verbatim (current behavior), strip it,
or grow the whitelist.
7. **`Token::HtmlComment` inline is silently dropped on render.**
The text-renderer doesn't have an arm for it. Decide whether to
keep that (sensible for "comments are invisible") or expose it
somehow (e.g. as a note).
8. **`Token::Link.title` is parsed but never displayed.** PDF viewers
offer no dependable hover-tooltip equivalent. Decide: ignore, render
as superscript, render as marginalia, or invent a config flag.
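
Gaps 1, 3, and 4 share a shape: a container-specific function keeps its own short list of block-level variants, and a catch-all arm swallows the rest. One possible remedy, sketched here under assumptions, is a single recursive dispatcher with an exhaustive `match`. The `Token` enum is a reduced stand-in, the `Table` payload is simplified to rows of strings, and the output is a plain-text outline rather than `genpdfi` elements; `render_outline` is a hypothetical name.

```rust
// Hypothetical stand-in for a few `Token` variants; the real enum is in
// `src/lib/markdown.rs` and its `Table` variant has more fields.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Text(String),
    BlockQuote(Vec<Token>),
    Table { rows: Vec<Vec<String>> },
    HtmlBlock(String),
    HorizontalRule,
}

/// Renders tokens to a plain-text outline (a stand-in for PDF elements).
/// The same function handles every nesting depth, so a construct that is
/// supported at the root is supported inside a quote as well.
pub fn render_outline(tokens: &[Token], depth: usize, out: &mut Vec<String>) {
    let pad = "> ".repeat(depth);
    for token in tokens {
        // No `_ => {}` arm: adding a variant to `Token` becomes a compile
        // error here instead of a silently dropped construct.
        match token {
            Token::Text(s) => out.push(format!("{}{}", pad, s)),
            Token::BlockQuote(children) => render_outline(children, depth + 1, out),
            Token::Table { rows } => {
                for row in rows {
                    out.push(format!("{}| {} |", pad, row.join(" | ")));
                }
            }
            // Verbatim passthrough, one output line per source line.
            Token::HtmlBlock(html) => {
                for line in html.lines() {
                    out.push(format!("{}{}", pad, line));
                }
            }
            Token::HorizontalRule => out.push(format!("{}---", pad)),
        }
    }
}
```

Whether the real renderer can adopt this shape depends on how `genpdfi` containers nest, which is part of the audit.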
## What you must do before writing any code
1. **Read the full `Token` enum** in `src/lib/markdown.rs`
(lines ~52–156 for the enum; ~178–268 for `collect_all_text`;
the `#[cfg(test)]` blocks are in `tests/markdown/`). Make sure
you understand every variant's invariants.
2. **Read the entire `src/lib/pdf.rs`** end to end. Note which
variants have dedicated arms, which fall through, and which are
handled inconsistently across contexts (the same `Token` may
render correctly at top level but be dropped inside a blockquote
or list item — find every such asymmetry).
3. **Audit `genpdfi`** — the crate is on crates.io as `genpdfi`
(this project's fork of `genpdf`). Find the
`Document`, `Paragraph`, `Style`, `Image`, `LinearLayout`,
`OrderedList` / `UnorderedList`, `TableLayout`, etc. primitives.
List which AST tokens have a clean mapping and which require
workarounds, composition, or genpdfi extensions.
4. **Build a coverage matrix.** Rows = every `Token` variant; columns
= `Root`, `inside BlockQuote`, `inside ListItem`, `inside Table
cell` (for inline tokens). Mark every cell: implemented, dropped,
inconsistent. The matrix is your spec for what's missing.
5. **Inspect the existing styling surface** in `src/lib/styling.rs`
so any new rendering you add plugs into the same config story.
Don't bypass it.
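
The coverage matrix from step 4 can be kept as data instead of a hand-edited table, so the gap list and the PR-description table are both derived from one source. This is only a sketch: `Context`, `Status`, `Matrix`, `gaps`, and `to_markdown` are hypothetical names, and variant names are plain strings so the example does not depend on the crate.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Context {
    Root,
    BlockQuote,
    ListItem,
    TableCell,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Implemented,
    Dropped,
    Inconsistent,
}

/// (variant name, context) -> status. A missing key means "not applicable".
pub type Matrix = BTreeMap<(&'static str, Context), Status>;

/// Returns every cell that is not `Implemented`, in key order.
pub fn gaps(matrix: &Matrix) -> Vec<(&'static str, Context, Status)> {
    matrix
        .iter()
        .filter(|(_, status)| **status != Status::Implemented)
        .map(|((variant, ctx), status)| (*variant, *ctx, *status))
        .collect()
}

/// Renders the matrix as a markdown table, one row per variant.
pub fn to_markdown(matrix: &Matrix) -> String {
    let contexts = [Context::Root, Context::BlockQuote, Context::ListItem, Context::TableCell];
    let mut table = String::from(
        "| Token | Root | BlockQuote | ListItem | TableCell |\n|---|---|---|---|---|\n",
    );
    let mut variants: Vec<&'static str> = matrix.keys().map(|(v, _)| *v).collect();
    variants.dedup(); // keys are sorted, so duplicates are adjacent
    for variant in variants {
        table.push_str(&format!("| `{}` |", variant));
        for ctx in contexts {
            let cell = match matrix.get(&(variant, ctx)) {
                Some(Status::Implemented) => "implemented",
                Some(Status::Dropped) => "dropped",
                Some(Status::Inconsistent) => "inconsistent",
                None => "n/a",
            };
            table.push_str(&format!(" {} |", cell));
        }
        table.push('\n');
    }
    table
}
```

Filling the map is the audit itself; the code only guarantees that the table and the gap list never disagree.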
## What to deliver (the plan)
A document or PR description containing:
1. The coverage matrix from step 4 above, with every gap labeled.
2. Per gap: the proposed `genpdfi` primitive(s) to use, any
composition logic required, and which `Pdf::style` fields are
relevant.
3. An ordering: which gap to close first, dependencies between gaps,
and an estimate of how many gaps are quick wins vs. structural
work. (Image embedding is structural — fetching, decoding, sizing,
pagination. HtmlBlock is closer to a quick win — verbatim
monospace passthrough is one option.)
4. A verification plan. Reuse the existing fixtures in
`test_fixtures/` and `examples/dump_ast.rs`. Add new fixtures or
golden PDFs as needed. Whatever you propose, every renderer change
should be backed by a fixture you can visually diff before and
after.
5. Test additions. The lexer side has ~750 integration tests and a
stress suite — the renderer should have comparable depth. Today
the renderer's test coverage is much thinner. Identify what kind
of tests (snapshot, golden-pdf, layout assertion, smoke) make
sense here, and put a number on how many you intend to add.
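
For the smoke tier mentioned in item 5, one cheap assertion is that the renderer produced something structurally PDF-shaped. A rough sketch, assuming the test already has the output bytes in hand: `looks_like_pdf` is a hypothetical helper, and it checks only the `%PDF-` header and the `%%EOF` end marker, not layout or content.

```rust
/// Cheap structural check for a smoke test: a PDF file starts with a
/// `%PDF-` header and ends with an `%%EOF` marker, possibly followed by
/// trailing whitespace.
pub fn looks_like_pdf(bytes: &[u8]) -> bool {
    let has_header = bytes.starts_with(b"%PDF-");
    // Only scan the tail; the marker sits at the very end of the file.
    let tail_start = bytes.len().saturating_sub(32);
    let has_eof = bytes[tail_start..].windows(5).any(|w| w == b"%%EOF");
    has_header && has_eof
}
```

A check like this catches empty or truncated output across all fixtures in one loop, but it says nothing about whether a given token was rendered, so it complements snapshot or golden-PDF tests and does not replace them.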
## Out-of-scope (intentionally)
- **Changing the lexer.** Parsing is done. If a renderer concern
surfaces a real lexer bug, file an issue — don't paper over it in
the renderer.
- **New markdown extensions** (math, footnotes, mermaid,
admonitions). The branch that just shipped takes the lexer to a
clean 100% CommonMark + GFM. New constructs are a different
conversation; resist scope creep.
- **Migrating off `genpdfi`.** If `genpdfi` can't do something we
need, the answer is to extend `genpdfi` (it's our fork) or to
layer composition on top in this crate — not to swap libraries.
## Final checklist before you say "ready to merge"
- [ ] Coverage matrix is included in the PR description (or a linked document).
- [ ] Every `Token` variant has documented PDF behavior, even if that
behavior is "intentionally skipped" (record that decision).
- [ ] All fixtures in `test_fixtures/` produce a PDF that visually
represents the markdown content the user wrote.
- [ ] `cargo test --all-targets` is green with zero warnings.
- [ ] `cargo run --example dump_ast -- <fixture>` and
`cargo run --bin markdown2pdf -- -p <fixture> -o <out>` give
consistent output (token tally aligns with what's in the PDF).
- [ ] The PR description tells reviewers, in plain language, which
gaps closed, which were deliberately left open, and why.
---
**Bottom line:** the parser produces clean, faithful tokens. The
PDF renderer is the next surface. Read both layers, build the
matrix, decide the rendering story per construct, then implement.
Don't accept any inline raw-HTML or block-HTML construct silently
disappearing from the output by default — at minimum, render
verbatim or document the skip.