BookForge
BookForge is the EPUB translation engine that keeps the LLM away from your document structure. It parses EPUBs into validated JSON payloads, checkpoints every segment, preserves markup/footnotes/links, and rebuilds valid EPUBs.
I built this to translate books for my partner. It's MIT-licensed in case it's useful to you.
Status
MVP functionality is implemented:
- EPUB inspect, parse, segment, and rebuild
- Plain, marker-safe, and run-preserving translation contracts
- Mock provider for deterministic tests
- OpenAI-compatible provider
- DeepSeek and OpenRouter presets
- Bounded parallel segment translation with --concurrency
- SQLite checkpoint store
- Resume and retry commands
- Status and tail commands for persisted jobs
- Segment-level cache reuse for compatible prior translations
- Static side-by-side review HTML with flag export/import
- QA reports in JSON and Markdown
- Optional LLM QA review pass
- Cost estimates for known provider/model pairs
Install
The binary is:
For development, use:
Commands
Convert a PDF to a translatable EPUB (requires poppler
command-line tools on PATH, or POPPLER_PATH pointing at their bin
directory; on Windows use the poppler-windows release zip):
The converter detects two-column layouts per page (scientific papers),
repairs hyphenated line breaks, joins paragraphs across pages, and maps
oversized fonts to headings. It prints a fidelity report comparing the
reconstructed text against the raw pdftotext baseline — check that
coverage number (and run inspect on the result) before spending tokens.
Figures, tables-as-images, and low-confidence page fallbacks are
roadmap items (ROADMAP §9b, phases P2–P4); for image-heavy PDFs expect
text-only output for now.
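As a rough illustration of the hyphenation repair mentioned above, here is a minimal sketch. It is not BookForge's converter code: the function name `repair_hyphenation` and the heuristic (a trailing hyphen followed by a lowercase letter on the next line is treated as a soft break) are assumptions made for the example.

```rust
/// Hypothetical helper: join extracted lines into one paragraph, merging
/// words that were split across a line break with a trailing hyphen.
/// Assumed heuristic: a trailing '-' is a soft break only when the next
/// line starts with a lowercase letter.
fn repair_hyphenation(lines: &[&str]) -> String {
    let mut out = String::new();
    for (i, raw) in lines.iter().enumerate() {
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        let next_starts_lower = lines
            .get(i + 1)
            .and_then(|next| next.trim().chars().next())
            .map_or(false, |c| c.is_lowercase());
        if line.ends_with('-') && next_starts_lower {
            // Drop the hyphen and glue the next line on directly.
            out.push_str(&line[..line.len() - 1]);
        } else {
            // Ordinary line break: separate with a single space.
            out.push_str(line);
            out.push(' ');
        }
    }
    out.trim_end().to_string()
}
```

A heuristic like this will occasionally merge a genuinely hyphenated compound, which is one reason to compare the result against the raw baseline before translating.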
Inspect an EPUB:
The inspect output includes a text-coverage metric: the percentage of
visible body text that lands in translatable blocks. Files with low
coverage (text in unsupported markup such as bare <div>s) are listed
individually — that text would ship untranslated, so check coverage
before spending tokens on a full run.
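The coverage metric described above can be pictured with a small sketch. The `Block` type, the choice to count non-whitespace characters, and the 100% result for an empty document are all assumptions for illustration; the actual inspect implementation may measure text differently.

```rust
/// Hypothetical stand-in for a parsed run of visible body text.
struct Block<'a> {
    text: &'a str,
    /// True when the text sits in markup the parser can translate.
    translatable: bool,
}

/// Count visible characters, ignoring whitespace (an assumed definition).
fn visible_len(text: &str) -> usize {
    text.chars().filter(|c| !c.is_whitespace()).count()
}

/// Percentage of visible text that lands in translatable blocks.
/// An empty document is reported as fully covered (assumption).
fn text_coverage_percent(blocks: &[Block<'_>]) -> f64 {
    let total: usize = blocks.iter().map(|b| visible_len(b.text)).sum();
    if total == 0 {
        return 100.0;
    }
    let covered: usize = blocks
        .iter()
        .filter(|b| b.translatable)
        .map(|b| visible_len(b.text))
        .sum();
    covered as f64 * 100.0 / total as f64
}
```

With six visible characters in a translatable paragraph and two in a bare container, this sketch reports 75% coverage, meaning a quarter of the text would ship untranslated.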
Estimate tokens and approximate cost:
Translate with OpenRouter:
Translate with the default fast profile:
Translate with a glossary:
Check provider and storage health:
Translate with DeepSeek:
Use any OpenAI-compatible endpoint:
Resume a job:
Generate a side-by-side review page:
Ingest exported review flags and mark bad translations for retry:
Manage glossary terms:
Inspect persisted job state and recent events:
Retry failed or review-needed segments:
Validate a translated EPUB and report:
QA Modes
Translation always runs hard validators before committing a segment. The optional LLM QA pass is controlled with:
off is the default. Reports still include deterministic soft warnings such as changed URLs, changed numbers, suspicious length ratios, model commentary, and repeated text.
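To make the deterministic soft warnings more concrete, the sketch below checks two of them: changed numbers and suspicious length ratios. The `SoftWarning` enum, the function names, and the 0.5 to 2.0 ratio bounds are hypothetical, chosen only for the example and not taken from BookForge's validators.

```rust
/// Hypothetical warning kinds, mirroring two of the checks named above.
#[derive(Debug, PartialEq)]
enum SoftWarning {
    ChangedNumbers,
    SuspiciousLengthRatio,
}

/// Collect every run of ASCII digits, sorted so word order does not matter.
fn digit_runs(text: &str) -> Vec<String> {
    let mut runs: Vec<String> = text
        .split(|c: char| !c.is_ascii_digit())
        .filter(|s| !s.is_empty())
        .map(str::to_string)
        .collect();
    runs.sort();
    runs
}

/// Compare a source segment with its translation and return soft warnings.
/// The 0.5..=2.0 bounds are illustrative, not BookForge's real thresholds.
fn soft_warnings(source: &str, translated: &str) -> Vec<SoftWarning> {
    let mut warnings = Vec::new();
    if digit_runs(source) != digit_runs(translated) {
        warnings.push(SoftWarning::ChangedNumbers);
    }
    let source_len = source.chars().count().max(1) as f64;
    let ratio = translated.chars().count() as f64 / source_len;
    if !(0.5..=2.0).contains(&ratio) {
        warnings.push(SoftWarning::SuspiciousLengthRatio);
    }
    warnings
}
```

Checks of this kind need no model call, which is why they can appear in reports even when the LLM QA pass is off.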
Two structural defaults to know about:
- pre/code blocks are never sent to the model. They are copied through to the output byte-for-byte, preserving internal whitespace.
- The sliding context window (--context-window, default 3) is best-effort: a segment uses whichever predecessors have already finished and never waits for them. Pass --context-strict to restore the v1.3 fence behavior, which guarantees a complete context block but serializes segments within the context scope.
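The best-effort context rule can be sketched as follows. This is one possible reading, not the scheduler's actual code: the function name `best_effort_context` and the representation of segment results as `Option<String>` slots are assumptions.

```rust
/// Hypothetical sketch: gather context for the segment at `index` from the
/// `window` slots immediately before it, taking only translations that have
/// already finished. `None` marks a predecessor that is still pending; it is
/// skipped rather than waited on.
fn best_effort_context(finished: &[Option<String>], index: usize, window: usize) -> Vec<&str> {
    let end = index.min(finished.len());
    let start = end.saturating_sub(window);
    finished[start..end]
        .iter()
        .filter_map(|slot| slot.as_deref())
        .collect()
}
```

Under this reading a segment may see fewer than `window` predecessors, and possibly none. A strict variant would instead block until every slot in the range is filled, which matches the serialization cost described for --context-strict.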
Checkpoints And Cache
Runtime state is stored in:
.bookforge/jobs.sqlite
That path is ignored by git. Segment translations are persisted as each segment completes. New jobs reuse compatible cached translations when the source hash, prompt version, provider, model, source language, and target language match.
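One way to picture the cache-compatibility rule is a composite key built from the six fields listed above. The sketch uses an in-memory `HashMap` as a stand-in for the SQLite store, and the `CacheKey` and `SegmentCache` types and their field names are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical composite key: a cached translation is reused only when
/// every field matches.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CacheKey {
    source_hash: String,
    prompt_version: String,
    provider: String,
    model: String,
    source_lang: String,
    target_lang: String,
}

impl CacheKey {
    fn new(
        source_hash: &str,
        prompt_version: &str,
        provider: &str,
        model: &str,
        source_lang: &str,
        target_lang: &str,
    ) -> Self {
        Self {
            source_hash: source_hash.to_string(),
            prompt_version: prompt_version.to_string(),
            provider: provider.to_string(),
            model: model.to_string(),
            source_lang: source_lang.to_string(),
            target_lang: target_lang.to_string(),
        }
    }
}

/// In-memory stand-in for the persisted segment cache.
#[derive(Default)]
struct SegmentCache {
    entries: HashMap<CacheKey, String>,
}

impl SegmentCache {
    /// Record a finished translation under its key.
    fn store(&mut self, key: CacheKey, translation: &str) {
        self.entries.insert(key, translation.to_string());
    }

    /// Return a cached translation only on an exact key match.
    fn lookup(&self, key: &CacheKey) -> Option<&str> {
        self.entries.get(key).map(String::as_str)
    }
}
```

Because the model and prompt version are part of the key, changing either one causes a miss and forces a fresh translation rather than reusing output produced under different conditions.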
Progress events can be written in every UI mode:
Review artifacts contain the full source and translated text of the book. They are written locally under .bookforge/runs/<job-id>/review/; treat them as private user data.
Known limitations: provider API keys are read from environment variables, and validation is intentionally pragmatic rather than a full EPUBCheck replacement.
Benchmarks
Run the mock release smoke benchmark with:
See docs/benchmarks.md for metrics to capture in real-provider runs.
Secrets And Local Tests
Do not commit API keys or ad hoc test books. The repository ignores:
test/
.bookforge/
.claude/
.codex
*.env
*.key
key.txt
For local OpenRouter testing, place the key outside tracked paths or export it directly:
Development Checks
See CONTRIBUTING.md for what's expected in issues
and pull requests, and the architectural invariants any change has to
respect.
Repository Layout
crates/bookforge-core         IR, segmentation, shared config
crates/bookforge-epub         EPUB inspect/read/rebuild
crates/bookforge-llm          prompts, providers, scheduler, validators
crates/bookforge-llm/prompts  Versioned prompt templates
crates/bookforge-store        SQLite checkpoint store
crates/bookforge-cli          CLI commands and reports
docs/                         Architecture notes