rbook-utils 0.0.1

A high-level wrapper over `rbook` for easy ebook parsing/conversion/rendering
Documentation
# EPUB Parser TODO

This file is the execution board for the lockstep roadmap across:
- `epub-utils`
- `rbook-utils`

Legend:
- `[x]` done
- `[ ]` pending
- `~` partial/in progress

## M0: Milestone Skeleton

- `[x]` Converted roadmap into milestone IDs (`M1..M8`)
- `[x]` Added explicit parity requirement for each milestone
- `[x]` Added acceptance criteria per milestone

## M1: Internal Link Rewrite + Validation

[P0]

Status:
- `epub-utils`: `[x]`
- `rbook-utils`: `[x]`
- Tests/parity: `~`

Scope:
- Rewrite internal links for single-file and split-chapter outputs.
- Validate unresolved targets and report broken internal links.
- Keep unknown links unchanged (warn + report).

Acceptance:
- Internal anchor links resolve in split output.
- Cross-chapter links are rewritten to chapter-relative paths in split mode.
- Unresolved links are counted in `report.v1.json`.

## M2: Footnotes/Endnotes Modes

[P0]

Status:
- `epub-utils`: `[x]` (markdown-footnote pattern support)
- `rbook-utils`: `[x]` (markdown-footnote pattern support)
- Tests/parity: `~`

Scope:
- Add `--notes-mode inline|chapter-end|global`.
- `inline`: leave notes unchanged.
- `chapter-end`: collect markdown note definitions at chapter end.
- `global`: collect markdown note definitions into `results/<book_slug>/notes.md`.
- Expand note detection beyond markdown-footnote patterns:
  - EPUB semantic note refs (`epub:type="noteref"`, common note/backlink conventions)
  - endnote sections that are not converted into markdown-footnote syntax

Acceptance:
- No-op with `inline`.
- Chapter-end/global outputs produce stable note IDs and backlinks for markdown footnote refs.
- Semantic note patterns (when present in source XHTML) resolve into the selected notes mode.

## M3: TOC + Landmarks + Page-List Export

[P1]

Status:
- `epub-utils`: `[x]` (`manifest.v1.json`)
- `rbook-utils`: `[x]` (`manifest.v1.json`)
- Tests/parity: `~`

Scope:
- Add `--export-manifest off|v1`.
- Emit `manifest.v1.json` with:
  - `schema_version`, `book`, `spine`, `toc_tree`, `sections`,
  - `landmarks`, `page_list`, `assets`, `build`.
- Include stable `section_id`, source boundary mapping, and output path.
- Add schema contract docs + compatibility policy:
  - versioning rules (`v1` evolution)
  - required/optional fields
  - forward/backward compatibility expectations
- Replace placeholder-empty `landmarks`/`page_list` with parsed values when source EPUB provides them.

Acceptance:
- Manifest exists when enabled and contains deterministic section IDs.

## M4: Navigation Dedupe + OCR Cleanup

[P1]

Status:
- `epub-utils`: `[x]`
- `rbook-utils`: `[x]`
- Tests/parity: `~`

Scope:
- Add `--nav-cleanup off|auto` (default `auto`).
- Add `--ocr-cleanup off|basic|aggressive` (default `off`).
- Dedupe noisy/duplicate TOC entries.
- Apply optional OCR cleanup (dehyphenation + repeated/noise line suppression).

Acceptance:
- TOC duplicate suppression counts are reported.
- OCR cleanup changes are counted in `report.v1.json`.

## M5: Asset Completeness + EPUB3 Media Handling

[P1]

Status:
- `epub-utils`: `~` (images + manifest audio/video extraction)
- `rbook-utils`: `~` (images + manifest audio/video extraction)
- Tests/parity: `~`

Scope:
- Preserve existing image behavior.
- Expand `--media` to support optional audio/video extraction.
- Keep external/data URIs unchanged.
- Add richer in-content asset reference handling:
  - `srcset`
  - `<picture><source ...>`
  - SVG-linked image references
  - CSS `url(...)` references where applicable in rich/external style paths

Acceptance:
- Image extraction remains backward compatible.
- Audio/video extracted when discoverable from manifest and `--media all` is enabled.
- Richer asset refs resolve/extract with path rewriting in both split and single-file modes.

## M6: Typography/Semantics Normalization

[P2]

Status:
- `epub-utils`: `[ ]`
- `rbook-utils`: `[ ]`
- Tests/parity: `[ ]`

Scope:
- Improve normalization for lists/quotes/code/tables/epigraph conventions.
- Preserve safe rich fallback where markdown conversion is lossy.

Acceptance:
- No regressions in existing books while improving semantic fidelity.

## M7: Deterministic IDs + Stable Filename Scheme

[P1]

Status:
- `epub-utils`: `[x]`
- `rbook-utils`: `[x]`
- Tests/parity: `~`

Scope:
- Canonical section boundary hashing -> stable `section_id`.
- Add `--filename-scheme index|hash` (default `index`) for split output.

Acceptance:
- Stable mode produces deterministic chapter filenames independent of label order churn.

## M8: Quality Report + Parity Gate

[P0]

Status:
- `epub-utils`: `[x]` (`report.v1.json`)
- `rbook-utils`: `[x]` (`report.v1.json`)
- Tests/parity gate: `~`

Scope:
- Add `--quality-report off|v1`.
- Emit report including:
  - TOC/fallback/link/asset/OCR/cleanup/notes stats
  - warnings/errors arrays
- Add parity checks across both implementations.
- Add automated parity fixture runner in CI:
  - deterministic fixture corpus
  - golden checks for section counts/titles/IDs
  - unresolved-link and missing-asset thresholds
- Track unresolved-link regressions:
  - baseline known unresolveds per fixture
  - fail CI on unexpected increases

Acceptance:
- Both parsers generate machine-readable report files with matching key metrics.
- Parity runs compare section counts + unresolved link counts + missing asset counts.

## Final Integration Gate

- `[ ]` [P0] Run lockstep commands for both parsers on reference books (`Alice`, `Meditations`, `Algebra`, `CBT`, `Ultralearning`).
- `[ ]` [P0] Compare:
  - section counts
  - first N section titles/IDs
  - unresolved link counts
  - missing asset counts
- `[ ]` [P1] Document known parity differences.
- `[ ]` [P1] Publish schema docs for `manifest.v1.json` and `report.v1.json` as part of release notes.
- `[ ]` [P0] Run fixture-based CI parity job successfully on target platforms.