bookforge-cli 1.8.0

CLI-first EPUB translation engine with deterministic structure rebuild and review loop.

BookForge

BookForge is the EPUB translation engine that keeps the LLM away from your document structure. It parses EPUBs into validated JSON payloads, checkpoints every segment, preserves markup/footnotes/links, and rebuilds valid EPUBs.

I built this to translate books for my partner. It's MIT-licensed in case it's useful to you.

Why BookForge

EPUB structure is program-owned. Models receive prose-only JSON payloads and never see or regenerate raw XHTML. Inline markers, protected spans, package metadata, resources, and document ordering are validated and reassembled by deterministic Rust code.

That boundary is the point of the project: malformed model output can be retried without asking another model to repair the book. The result is a checkpointed translation workflow whose structure can be tested independently of translation quality.
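As a rough illustration of what a deterministic check at that boundary could look like, the sketch below rejects a translated segment whose inline markers differ from the source. The `[[m1]]` marker syntax and both function names are assumptions made for this example; they are not BookForge's actual marker format or API.

```rust
/// Collects the body of every `[[...]]` marker in order of appearance.
/// The `[[mN]]` syntax is hypothetical, chosen only for this sketch.
fn extract_markers(text: &str) -> Vec<String> {
    let mut markers = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find("[[") {
        let after_open = &rest[start + 2..];
        match after_open.find("]]") {
            Some(end) => {
                markers.push(after_open[..end].to_string());
                rest = &after_open[end + 2..];
            }
            // Unterminated marker: stop scanning, so the count will not match.
            None => break,
        }
    }
    markers
}

/// Accepts a translation only if it carries the same markers as the source,
/// each the same number of times. Order is allowed to differ because word
/// order changes between languages.
fn markers_preserved(source: &str, translated: &str) -> bool {
    let mut expected = extract_markers(source);
    let mut actual = extract_markers(translated);
    expected.sort();
    actual.sort();
    expected == actual
}
```

A segment that fails a check like this can simply be retried, since the source payload and the surrounding structure are untouched.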

Status

BookForge v1.8 is usable for EPUB translation and PDF-to-EPUB ingestion:

  • EPUB inspect, parse, segment, and rebuild
  • EPUBCheck-backed standalone and post-translation validation
  • EPUBCheck-clean structural regression against a pinned nine-book Standard Ebooks corpus; see docs/corpus.md
  • Plain, marker-safe, and run-preserving translation contracts
  • Mock provider for deterministic tests
  • OpenAI-compatible provider
  • DeepSeek and OpenRouter presets
  • Ollama and llama.cpp local-model presets
  • Bounded parallel segment translation with --concurrency
  • SQLite checkpoint store
  • Resume and retry commands
  • Status and tail commands for persisted jobs
  • Segment-level cache reuse for compatible prior translations
  • Static side-by-side review HTML with flag export/import
  • QA reports in JSON and Markdown
  • Optional LLM QA review pass
  • Cost estimates for known provider/model pairs
  • Externalized, overridable provider pricing

Install

Install from crates.io:

cargo install bookforge-cli

Or build the current checkout:

cargo build --release

The binary is:

target/release/bookforge

For development, use:

cargo run -p bookforge-cli -- <command>

Quick start

export DEEPSEEK_API_KEY=...
bookforge inspect book.epub
bookforge translate book.epub --target Italian --provider-preset deep-seek-paid --validate-output

Use cargo run -p bookforge-cli -- in front of commands when running from a source checkout. Provider preset names are shown by bookforge translate --help.

Commands

Convert a PDF to a translatable EPUB (requires poppler command-line tools on PATH, or POPPLER_PATH pointing at their bin directory; on Windows use the poppler-windows release zip):

cargo run -p bookforge-cli -- convert paper.pdf --out paper.epub

The converter detects two-column layouts per page (scientific papers), repairs hyphenated line breaks, joins paragraphs across pages, and maps oversized fonts to headings. It prints a fidelity report comparing the reconstructed text against the raw pdftotext baseline. Check that coverage number, and run inspect on the result, before spending tokens. Figures, tables-as-images, and low-confidence page fallbacks are roadmap items (ROADMAP §9b, phases P2–P4); for image-heavy PDFs, expect text-only output for now.

Inspect an EPUB:

cargo run -p bookforge-cli -- inspect book.epub

The inspect output includes a text-coverage metric: the percentage of visible body text that lands in translatable blocks. Files with low coverage (text in unsupported markup such as bare <div>s) are listed individually — that text would ship untranslated, so check coverage before spending tokens on a full run.
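One plausible way to compute such a metric is sketched below, assuming coverage is measured in non-whitespace characters. The `TextRun` type and the function name are hypothetical; the real inspect command may count text differently.

```rust
/// One run of visible body text. `translatable` is true when the run sits
/// inside a block the segmenter supports. Hypothetical type for this sketch.
struct TextRun<'a> {
    text: &'a str,
    translatable: bool,
}

/// Percentage of visible (non-whitespace) characters that fall in
/// translatable runs. A file with no visible text reports 100.0, since
/// nothing would ship untranslated.
fn text_coverage_percent(runs: &[TextRun]) -> f64 {
    let visible = |s: &str| s.chars().filter(|c| !c.is_whitespace()).count();
    let total: usize = runs.iter().map(|r| visible(r.text)).sum();
    if total == 0 {
        return 100.0;
    }
    let covered: usize = runs
        .iter()
        .filter(|r| r.translatable)
        .map(|r| visible(r.text))
        .sum();
    covered as f64 * 100.0 / total as f64
}
```

Under this definition, a file with six characters inside a paragraph and two inside a bare `<div>` would report 75% coverage.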

Estimate tokens and approximate cost:

cargo run -p bookforge-cli -- estimate book.epub \
  --source English \
  --target Italian \
  --provider openrouter \
  --model deepseek/deepseek-v4-flash

Pricing is loaded from the bundled pricing/providers.json. Override it with --pricing custom.json or BOOKFORGE_PRICING_PATH.
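The arithmetic behind an estimate of this kind is simple enough to sketch. The example assumes prices are quoted per million tokens, separately for input and output; the `ModelPrice` struct and its field names are hypothetical and do not describe the actual schema of pricing/providers.json.

```rust
/// Prices in USD per one million tokens. Hypothetical shape for this sketch;
/// the bundled pricing file may use different fields and units.
struct ModelPrice {
    input_per_mtok: f64,
    output_per_mtok: f64,
}

/// Approximate cost of a run given estimated input and output token counts.
fn estimate_cost_usd(price: &ModelPrice, input_tokens: u64, output_tokens: u64) -> f64 {
    let input_cost = input_tokens as f64 * price.input_per_mtok;
    let output_cost = output_tokens as f64 * price.output_per_mtok;
    (input_cost + output_cost) / 1_000_000.0
}
```

With placeholder prices of 0.50 (input) and 2.00 (output) per million tokens, a run of 2,000,000 input tokens and 1,000,000 output tokens comes to 3.00. These numbers are illustrative only and are not real provider prices.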

Translate with OpenRouter:

export OPENROUTER_API_KEY=sk-or-...


cargo run -p bookforge-cli -- translate book.epub \
  --source English \
  --target Italian \
  --provider openrouter \
  --model deepseek/deepseek-v4-flash \
  --concurrency 4 \
  --timeout-seconds 120 \
  --qa off \
  --out book.it.epub

Translate with the default fast profile:

cargo run -p bookforge-cli -- translate book.epub \
  --target Italian \
  --provider-preset open-router-paid-fast \
  --ui progress \
  --out book.it.epub

Translate with a glossary:

cargo run -p bookforge-cli -- glossary import glossary.series.toml

cargo run -p bookforge-cli -- translate book.epub \
  --source English \
  --target Italian \
  --provider-preset open-router-paid-fast \
  --book-id fellowship \
  --series-id lord-of-the-rings \
  --glossary glossary.series.toml \
  --glossary-budget-tokens 800 \
  --glossary-format json \
  --prompt-extra "Maintain a literary register." \
  --out book.it.epub

Check provider and storage health:

cargo run -p bookforge-cli -- doctor --storage

cargo run -p bookforge-cli -- doctor \
  --provider openrouter \
  --model google/gemini-2.5-flash-lite

Translate with DeepSeek:

export DEEPSEEK_API_KEY=...


cargo run -p bookforge-cli -- translate book.epub \
  --source English \
  --target Italian \
  --provider deepseek \
  --model deepseek-v4-flash \
  --concurrency 4 \
  --out book.it.epub

Use any OpenAI-compatible endpoint:

export OPENAI_API_KEY=...


cargo run -p bookforge-cli -- translate book.epub \
  --source English \
  --target Italian \
  --provider openai-compatible \
  --base-url https://api.example.com/v1 \
  --api-key-env OPENAI_API_KEY \
  --model provider/model \
  --timeout-seconds 120 \
  --out book.it.epub

Local Ollama and llama.cpp recipes are documented in docs/local-models.md.

Resume a job:

cargo run -p bookforge-cli -- resume <job-id> --timeout-seconds 120

Generate a side-by-side review page:

cargo run -p bookforge-cli -- review <job-id> --open

Ingest exported review flags and mark bad translations for retry:

cargo run -p bookforge-cli -- ingest-flags <job-id> --flags flags.json

cargo run -p bookforge-cli -- retry <job-id> --only needs-review

Manage glossary terms:

cargo run -p bookforge-cli -- glossary list --language 'English->Italian'

cargo run -p bookforge-cli -- glossary add "Aragorn" "Aragorn" \
  --category person \
  --scope series \
  --scope-id lord-of-the-rings \
  --source-lang English \
  --target-lang Italian \
  --case-sensitive

cargo run -p bookforge-cli -- glossary export glossary.series.toml \
  --scope series \
  --scope-id lord-of-the-rings \
  --language 'English->Italian'

Inspect persisted job state and recent events:

cargo run -p bookforge-cli -- status <job-id>
cargo run -p bookforge-cli -- tail <job-id> --lines 40

Retry failed or review-needed segments:

cargo run -p bookforge-cli -- retry <job-id> --only failed

cargo run -p bookforge-cli -- retry <job-id> --only needs-review

cargo run -p bookforge-cli -- retry <job-id> --only all

Validate a translated EPUB and report:

cargo run -p bookforge-cli -- validate book.it.epub \
  --report book.it.validation.json

BookForge invokes EPUBCheck when it is available. Set BOOKFORGE_EPUBCHECK to an executable, its containing directory, or an epubcheck.jar. Missing EPUBCheck is reported as status: unavailable and is non-fatal. Use --strict-epubcheck to make warnings fail validation.

QA Modes

Translation always runs hard validators before committing a segment. The optional LLM QA pass is controlled with:

--qa off

--qa suspicious

--qa all

off is the default. Reports still include deterministic soft warnings such as changed URLs, changed numbers, suspicious length ratios, model commentary, and repeated text.
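Two of these soft warnings are easy to picture as plain string checks. The sketch below is an assumption about how such checks might work, not BookForge's validator code; the function names and the 0.5 to 2.0 ratio bounds are invented for illustration.

```rust
/// Returns every run of ASCII digits in the text, sorted so that two texts
/// can be compared regardless of word order.
fn digit_runs(text: &str) -> Vec<String> {
    let mut runs = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if c.is_ascii_digit() {
            current.push(c);
        } else if !current.is_empty() {
            runs.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        runs.push(current);
    }
    runs.sort();
    runs
}

/// Flags a translation whose digit runs differ from the source.
fn numbers_changed(source: &str, translated: &str) -> bool {
    digit_runs(source) != digit_runs(translated)
}

/// Flags a translation that is much shorter or longer than the source.
/// The 0.5..=2.0 bounds are placeholder thresholds for this sketch.
fn suspicious_length_ratio(source: &str, translated: &str) -> bool {
    let source_len = source.chars().count();
    let translated_len = translated.chars().count();
    if source_len == 0 {
        return translated_len != 0;
    }
    let ratio = translated_len as f64 / source_len as f64;
    !(0.5..=2.0).contains(&ratio)
}
```

Comparing digit runs rather than whole numbers means locale differences such as 1,000 versus 1.000 do not trigger a warning, while a transposed year does.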

Two structural defaults to know about:

  • pre/code blocks are never sent to the model. They are copied through to the output byte-for-byte, preserving internal whitespace.
  • The sliding context window (--context-window, default 3) is best-effort: a segment uses whichever predecessors have already finished and never waits for them. Pass --context-strict to restore the v1.3 fence behavior, which guarantees a complete context block but serializes segments within the context scope.
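The difference between the best-effort and strict behaviors can be sketched with a map of finished segments. This is a simplified model under stated assumptions: segments are identified by their document index, and both function names are hypothetical rather than taken from the BookForge scheduler.

```rust
use std::collections::HashMap;

/// Best-effort context: the translations of up to `window` immediate
/// predecessors of `index` that have already finished, in document order.
/// Predecessors that are still in flight are skipped, never waited for.
fn best_effort_context(
    finished: &HashMap<usize, String>,
    index: usize,
    window: usize,
) -> Vec<&str> {
    let start = index.saturating_sub(window);
    (start..index)
        .filter_map(|i| finished.get(&i).map(String::as_str))
        .collect()
}

/// Strict context: `None` until every predecessor in the window has finished.
/// A scheduler using this would have to hold the segment back until it
/// returns `Some`, which is what serializes the work.
fn strict_context(
    finished: &HashMap<usize, String>,
    index: usize,
    window: usize,
) -> Option<Vec<&str>> {
    let start = index.saturating_sub(window);
    (start..index)
        .map(|i| finished.get(&i).map(String::as_str))
        .collect()
}
```

With segments 0 and 2 finished and segment 1 still running, segment 3 would get a two-entry context in best-effort mode and nothing at all in strict mode until segment 1 completes.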

Checkpoints And Cache

Runtime state is stored in:

.bookforge/jobs.sqlite

That path is ignored by git. Segment translations are persisted as each segment completes. New jobs reuse compatible cached translations when the source hash, prompt version, provider, model, source language, and target language match.
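The reuse rule amounts to a composite key: a cached translation is a hit only when all six fields match. The sketch below models that with an in-memory map standing in for the SQLite store; the `CacheKey` struct, its field names, and `lookup` are assumptions for illustration, not the actual schema.

```rust
use std::collections::HashMap;

/// The six fields that must all match for a cached translation to be reused.
/// Hypothetical struct; the real store keeps this state in SQLite.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct CacheKey {
    source_hash: String,
    prompt_version: String,
    provider: String,
    model: String,
    source_lang: String,
    target_lang: String,
}

/// Returns the cached translation for `key`, if one exists.
fn lookup<'a>(cache: &'a HashMap<CacheKey, String>, key: &CacheKey) -> Option<&'a str> {
    cache.get(key).map(String::as_str)
}
```

Because the model and prompt version are part of the key, switching either one yields a miss and the segment is translated again instead of reusing output produced under different conditions.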

Progress events can be written in every UI mode:

cargo run -p bookforge-cli -- translate book.epub \
  --target Italian \
  --provider mock \
  --model mock-prefix-target \
  --ui json \
  --progress-jsonl .bookforge/runs/example/events.jsonl

Review artifacts contain the full source and translated text of the book. They are written locally under .bookforge/runs/<job-id>/review/; treat them as private user data.

Known limitations: provider API keys are read from environment variables. PDF ingestion currently prioritizes text reconstruction; complex figures and tables may require review of the conversion report.

Benchmarks

Run the mock release smoke benchmark with:

scripts/bench-mock.sh

See docs/benchmarks.md for metrics to capture in real-provider runs.

Run the pinned structural corpus with:

bash scripts/corpus-fetch.sh small
bash scripts/corpus-smoke.sh small

Secrets And Local Tests

Do not commit API keys or ad hoc test books. The repository ignores:

test/
.bookforge/
.claude/
.codex
*.env
*.key
key.txt

For local OpenRouter testing, place the key outside tracked paths or export it directly:

export OPENROUTER_API_KEY=...

Development Checks

cargo fmt

cargo test

cargo clippy --all-targets --all-features

See CONTRIBUTING.md for what's expected in issues and pull requests, and the architectural invariants any change has to respect.

Repository Layout

crates/bookforge-core         IR, segmentation, shared config
crates/bookforge-epub         EPUB inspect/read/rebuild
crates/bookforge-llm          prompts, providers, scheduler, validators
crates/bookforge-llm/prompts  versioned prompt templates
crates/bookforge-store        SQLite checkpoint store
crates/bookforge-cli          CLI commands and reports
docs/                         architecture notes
pricing/                      bundled provider/model pricing
tests/corpus/                 pinned Standard Ebooks corpus manifest

BookForge remains a tool built for one reader and shared under MIT. Bug reports should include the BookForge version, operating system, provider/model, and a redacted validation or QA report where possible. The sequenced project plan is in docs/ROADMAP.md.