# CLAUDE.md
## Project Overview
**anytomd** (repository: `anytomd-rs`) is a pure Rust reimplementation of Microsoft's [MarkItDown](https://github.com/microsoft/markitdown) Python library. It converts various document formats (DOCX, PPTX, XLSX, PDF, HTML, CSV, JSON, etc.) into Markdown, targeting LLM consumption. A single `cargo add anytomd` with zero external runtime.
**Current phase:** 9 converters (DOCX, PPTX, XLSX, XLS, HTML, CSV, JSON, XML, Images, Plain Text) + async image description + Gemini integration. See [TECH_SPEC.md](./TECH_SPEC.md) for full architecture and milestones.
---
## Reference-Driven Development
MarkItDown Python source is at `reference/markitdown/` (gitignored) for analysis during development.
**Reference paths:**
- Converters: `reference/markitdown/packages/markitdown/src/markitdown/converters/`
- Utilities: `reference/markitdown/packages/markitdown/src/markitdown/converter_utils/`
- Main engine: `reference/markitdown/packages/markitdown/src/markitdown/_markitdown.py`
**Per-converter workflow:** Read the Python converter → identify extracted elements → create test fixtures → TDD (red/green/refactor) → compare output against MarkItDown for content parity (not exact match).
Do NOT translate Python line-by-line — understand the *intent*, then implement idiomatically in Rust.
---
## Critical Principle: Native Rust Implementation
**Every converter MUST be pure Rust. No Python. No C bindings. No subprocess calls. No external runtime.**
- DOCX/PPTX: `zip` + `quick-xml` (direct OOXML→Markdown, no intermediate HTML)
- XLSX: `calamine` — PDF: `pdf-extract`/`lopdf` — HTML: `scraper`
- If no pure Rust solution exists, **implement in Rust** or **defer** — never add a non-Rust dependency
**Output goal:** LLM-readable text extraction, not pixel-perfect Markdown. Simpler output that captures content is preferred. Must preserve Unicode (CJK, emoji) without corruption. `ConversionResult` provides both `.markdown` and `.plain_text` fields — each converter populates plain text directly from the source document (no post-processing).
---
## LLM Integration — Gemini
Optional LLM-based image description via the `ImageDescriber` trait (Technical Spec §4.9). The library makes no HTTP calls — callers inject their own implementation. Trait is provider-agnostic; Gemini is the default.
- Default model: **`gemini-3-flash-preview`** (production) / **`gemini-2.5-flash-lite`** (CI, cost savings)
- Always consult the [official Gemini API docs](https://ai.google.dev/gemini-api/docs) — do NOT rely on cached knowledge
- `GeminiDescriber`: `new(api_key)` or `from_env()` (reads `GEMINI_API_KEY`). Never hardcode/log/persist API keys.
- `AsyncGeminiDescriber`: async counterpart using `reqwest` (behind `async-gemini` feature). Same API: `new()`, `from_env()`, `with_model()`.
### Async Image Description
The library supports concurrent image description via feature flags:
| `async` | `futures-util` | `AsyncImageDescriber` trait, `AsyncConversionOptions`, `convert_file_async()`, `convert_bytes_async()` |
| `async-gemini` | `async` + `reqwest` | `AsyncGeminiDescriber` for concurrent Gemini API calls |
**Architecture:** Two-phase conversion — `convert_inner()` parses the document and collects image placeholders, then `resolve_image_placeholders_async()` resolves all descriptions concurrently via `futures_util::future::join_all`. No `tokio` in library deps — caller provides the runtime.
**Key types:**
- `AsyncImageDescriber` trait: uses `Pin<Box<dyn Future>>` for dyn-compatibility (async fn in traits not dyn-safe in Rust 1.90)
- `AsyncConversionOptions`: wraps `ConversionOptions` with `async_image_describer: Option<Arc<dyn AsyncImageDescriber>>`
### CI Gemini Testing
Gemini CI tests do NOT run on every PR to prevent API quota abuse.
| `push` (any branch) | Yes | Owner/collaborators only — trusted |
| `pull_request` (default) | No | External PRs — gated |
| `pull_request` + `ci:gemini` label | Yes | Owner explicitly approved after code review |
**Key rules:**
- `GEMINI_API_KEY` stored as GitHub Actions repository secret
- CI condition: `if: github.event_name == 'push' || contains(github.event.pull_request.labels.*.name, 'ci:gemini')`
- Fork PRs with `ci:gemini`: use `pull_request_target` with `ref: ${{ github.event.pull_request.head.sha }}` — only after code review
- CI tests use `GeminiDescriber::new(api_key).with_model("gemini-2.5-flash-lite".to_string())` and only assert non-empty response (LLM output is non-deterministic)
- Gemini tests must be **additive** — existing tests must pass without the secret
- Gemini test failures (rate limits, transient errors) must NOT block CI — allowed-to-fail
- **Never add `ci:gemini` label without reviewing the PR diff first**
---
## Language Rules
**All project artifacts MUST be written in English.** No exceptions — source code, comments, commit messages, docs, error messages, test names, issues, and PRs.
---
## Git Configuration
- All commits must use the local git config `user.name` and `user.email` for both author and committer. Verify with `git config user.name` and `git config user.email` before committing.
- All commits must include `Signed-off-by` line to pass DCO check (always use `git commit -s`). The `Signed-off-by` name must match the commit author.
- The expected git `user.name` is `Yonghye Kwon`. If the local git config `user.name` does not match, you **MUST** ask the user to confirm their identity before the first commit or push in the session. Once confirmed, do not ask again for the rest of the session.
## Branching & PR Workflow
- **All changes MUST go through a PR** — never commit directly to `main`, including doc-only edits
- Branch naming: `<type>/<short-description>` (e.g., `feat/add-docx-converter`, `fix/table-parsing-bug`)
- One focused unit of work per branch. For existing PRs, push to that branch instead of creating a new PR.
- For fork PRs: `git remote add <user> <fork-url>` and push to that remote's branch
**Worktree workflow (mandatory for PR branch changes):**
- Create: `git worktree add ../anytomd-rs-<branch-name> -b <type>/<short-description>`
- Work and run all PR commands (`gh pr create`, `git push`, etc.) **from inside the worktree**, not the main repo
- Do NOT remove a worktree while your working directory is inside it — return to main repo first: `cd /Users/yhkwon/Documents/Projects/anytomd-rs && git worktree remove ../anytomd-rs-<branch-name>`
- Do NOT remove a worktree immediately after completing a task — only when starting a new task or user confirms
- `git checkout`/`git switch` may be used only for local-only inspection tasks (no PR changes)
### PR Merge Procedure
Follow all steps in order — do not skip any.
1. **Review PR description** — rewrite with `gh pr edit` if empty/lacking. Include what changed, why, key changes.
2. **Search related issues** — `gh issue list`, reference with "Related: #N" (no auto-close keywords unless instructed)
3. **Check conflicts** — if `main` advanced, use `git merge-tree` to check; rebase/merge to resolve if needed
4. **Wait for CI** — `gh pr checks <number> --watch`. If CI fails, do NOT merge.
5. **Final review** — `gh pr diff <number>`, check for debug code, hardcoded paths, secrets, unused imports. Mandatory even if CI is green.
6. **Merge** — `gh pr merge <number> --merge` (**NEVER** use `--delete-branch` — worktree still uses the branch)
7. **Update local main** — `cd /Users/yhkwon/Documents/Projects/anytomd-rs && git pull`
---
## Toolchain
- MSRV is pinned by `rust-version` in `Cargo.toml` — stable only, no nightly features
- Do not bump `rust-version` in unrelated PRs — use a dedicated chore PR
### MSRV Policy — 6-Month Rolling Minimum
This project follows a **6-month rolling MSRV policy** (aligned with [tokio](https://crates.io/crates/tokio) and other major crates):
- The `rust-version` in `Cargo.toml` MUST target a Rust stable release that was published **at least 6 months ago**
- Rust stable releases ship every 6 weeks — consult [releases.rs](https://releases.rs/) for exact dates
- When a newer Rust version crosses the 6-month threshold, updating the MSRV is **allowed but not required** — only bump when a newer language feature or dependency demands it
- **Floor:** the MSRV can never go below the minimum required by `edition` in `Cargo.toml` (edition 2024 = Rust 1.85)
**Before any MSRV change:**
1. Verify no language features or APIs exclusive to versions above the target are used
2. Confirm all dependencies compile on the target version (`cargo check` with the target toolchain, or review dependency MSRV metadata)
3. Update CI matrix to include the new MSRV version
4. Update Dockerfile `RUST_VERSION` ARG to match
---
## Docker Development Environment
Optional Docker setup for reproducible Linux builds. Native `cargo` is the primary workflow; Docker is for cross-platform verification and CI simulation.
**Services:** `docker compose run --rm <service>`
| `verify` | Full loop: fmt + clippy + test + release build |
| `test` / `test-lib` / `test-integration` | All / unit / integration tests |
| `lint` / `fmt` | clippy+fmt check / auto-format |
| `build` / `release` | Debug / release build |
| `shell` | Interactive bash |
**Key details:**
- Source is bind-mounted at `/app`; `cargo-chef` caches deps; named volumes persist `target/`, cargo registry/git
- Dockerfile `RUST_VERSION` ARG **MUST match** `rust-version` in `Cargo.toml` — update both in the same commit
- Cleanup: `docker compose down -v` / rebuild: `docker compose build --no-cache`
---
## WebAssembly (WASM) Support
anytomd compiles to `wasm32-unknown-unknown` for client-side document conversion in browsers and edge runtimes.
### Architecture — Native vs WASM
| `convert_bytes` / `convert_bytes_async` | Yes | Yes |
| `convert_file` / `convert_file_async` | Yes | No (no filesystem) |
| `GeminiDescriber` (sync, uses `ureq`) | Yes | No |
| `AsyncGeminiDescriber` / `convertBytesWithGemini` | Yes | Yes (`wasm` + `async-gemini`) |
| JS bindings (`convertBytes`) | No | Yes (`wasm` feature) |
### WASM-Critical Dependency Rules
- `zip` crate uses `default-features = false, features = ["deflate"]` — default features pull in C-binding codecs (`zstd-sys`, `lzma-sys`) that break WASM
- `ureq` and `clap` are behind `[target.'cfg(not(target_arch = "wasm32"))'.dependencies]`
- Native-only dev-deps (`assert_cmd`, `tempfile`, `tokio`) are behind `[target.'cfg(not(target_arch = "wasm32"))'.dev-dependencies]`
- **New dependencies must be verified for `wasm32-unknown-unknown` compatibility before merging** — run `cargo check --lib --target wasm32-unknown-unknown --no-default-features`
### Build & Test Commands
```bash
# WASM compilation checks
cargo check --lib --target wasm32-unknown-unknown --no-default-features
cargo check --lib --target wasm32-unknown-unknown --no-default-features --features wasm
cargo clippy --lib --target wasm32-unknown-unknown --no-default-features --features wasm -- -D warnings
# WASM tests (requires wasm-pack)
wasm-pack test --node --no-default-features --features wasm
# Build WASM package for distribution
wasm-pack build --target web --no-default-features --features wasm
```
### CI WASM Checks
The `wasm` job in CI runs on every push/PR:
- `cargo check` with no features, `async`, and `wasm` features
- `cargo clippy` for WASM target
- `wasm-pack test` for integration tests
All checks use `--no-default-features` because the default `async-gemini` feature pulls in `reqwest` with `rustls-tls` which cannot compile for WASM.
---
## Code Conventions
### Rust Style
- Follow standard Rust conventions (`rustfmt` defaults, `clippy` clean)
- Use `thiserror` for error types — see `ConvertError` in Technical Spec Section 7
- Prefer returning `Result<T, ConvertError>` over panicking
- Conversion should be **best-effort**: if a single element (e.g., one corrupted table) fails to parse, skip it and continue — do not fail the entire document
- Best-effort behavior must be observable: append structured warnings to `ConversionResult.warnings` instead of silently dropping parse failures
### Rustdoc
- The crate has `#![warn(missing_docs)]` enabled — all new public items (modules, structs, traits, enums, functions, fields) **MUST** have `///` or `//!` doc comments
- When adding a new converter, include: module-level `//!` doc explaining the format and approach, and a `///` doc on the converter struct
- When adding new public types or enum variants, add per-item `///` docs
- Feature-gated items (behind `async`, `async-gemini`): use backtick-quoted names instead of `[`link`]` syntax in doc comments to avoid unresolved link warnings
- After adding new public items, verify with `cargo doc --no-deps 2>&1 | grep warning` — must produce zero warnings
### Crate Structure
- `src/lib.rs` — public API (`convert_file`, `convert_bytes`)
- `src/converter/` — one module per format (`docx.rs`, `pptx.rs`, `xlsx.rs`, ...)
- Each converter implements the `Converter` trait (see Technical Spec Section 3.2)
- Public API must include conversion options (resource limits, strict mode) and warning output
- `src/markdown.rs` — shared Markdown generation utilities (table builder, heading formatter)
- `src/detection.rs` — file format detection by extension and magic bytes
- `src/error.rs` — `ConvertError` enum
### Dependencies
- MVP: `zip`, `quick-xml`, `calamine`, `csv`, `serde_json`, `thiserror` — all pure Rust (no C bindings)
- Minimize deps — do not add a crate for something achievable in <50 lines
- **Always verify latest stable version** on [crates.io](https://crates.io/) before adding/upgrading. If latest fails (MSRV, breaking changes), downgrade and add a comment in `Cargo.toml` explaining why.
- **Opportunistic upgrades:** when touching a module, check if its deps have newer stable versions and upgrade as part of the work
### Testing — TDD Required
**TDD is mandatory for all features and bug fixes:** write failing test → implement minimum code to pass → refactor. Bug fixes must always include a regression test.
**Test integrity:** NEVER delete/modify passing tests or use `#[ignore]` to work around failures — fix the code. Obsolete tests require documented justification in the commit message.
**Naming:** `test_<what>_<condition>_<expected>` or `test_<what>_<scenario>` (e.g., `test_table_parsing_empty_cells_preserved`)
**Unit tests** (`#[cfg(test)] mod tests` in each module):
- Every converter: heading extraction, table parsing, bold/italic, images, links, lists, etc.
- Edge cases: empty docs, malformed content, deeply nested structures, CJK/emoji Unicode
- Every public function and non-trivial private function must have at least one test
**Integration tests** (`tests/` with fixtures in `tests/fixtures/`):
- One test file per format minimum, testing end-to-end: file in → Markdown out
- **Golden tests:** expected output in `tests/fixtures/expected/`. Normalize whitespace before `assert_eq!`. Update golden files with documented reason when output intentionally changes.
- **Content coverage tests:** use `contains`/regex for key elements (more resilient to formatting changes)
- Include comparison against MarkItDown output for content parity (not exact match)
**Fixtures:** minimal, representative files per format. Binary formats (DOCX/PPTX/XLSX): build programmatically or create in LibreOffice/Google Docs. Document contents in test comments.
**Commands:** `cargo test` (all) / `cargo test --lib` (unit) / `cargo test --test '*'` (integration)
---
## Development Workflow — Build-Test-Verify Loop
**Run after every code change in `src/` or `tests/`** (optional for doc-only `*.md` changes):
- For every feature update, review `README.md` and update it in the same PR. Analyze what must be modified, added, or removed so README stays aligned with the current behavior.
```bash
cargo build && cargo test && cargo clippy -- -D warnings
```
After completing a full converter or adding new public items, also run `cargo fmt --check`, `cargo build --release`, and `cargo doc --no-deps` (verify zero warnings).
**Non-negotiable:** Do NOT proceed if any step fails — fix first, re-run, then continue. Never delete/ignore/weaken tests to pass the loop.
---
## Release Management
GitHub releases and crates.io versions **MUST stay in sync**. The single source of truth for the version is `Cargo.toml`.
### Release Procedure
Follow all steps in order — do not skip any.
1. **Verify unreleased changes exist** — `git log v<latest-tag>..HEAD --oneline`. If empty, there is nothing to release.
2. **Determine version bump** — follow [SemVer](https://semver.org/):
- **patch** (`0.x.Y`): bug fixes, test-only changes, doc fixes
- **minor** (`0.X.0`): new features, new converters, new public API, non-breaking changes
- **major** (`X.0.0`): breaking API changes (defer until post-1.0)
3. **Bump version in `Cargo.toml`** — update `version = "..."` in `[package]`
4. **Create PR** — branch `chore/release-vX.Y.Z`, commit message `chore: release vX.Y.Z`
5. **Merge PR** — follow standard PR Merge Procedure above
6. **Publish to crates.io** — from `main` after merge: `cargo publish`
7. **Create GitHub release** — `gh release create vX.Y.Z --title "vX.Y.Z" --generate-notes --latest` (from `main` at the merge commit)
8. **Verify sync** — confirm `cargo search anytomd` matches the GitHub release tag and `Cargo.toml`
### Version Sync Rules
- **One version, three places**: `Cargo.toml` (source of truth) = crates.io = GitHub release tag
- **Never publish to crates.io without a matching GitHub release** and vice versa
- **Never manually edit crates.io metadata** — always go through `Cargo.toml` + `cargo publish`
- **Tag format**: always `vX.Y.Z` (e.g., `v0.6.0`), created automatically by `gh release create`
- **Cargo.lock**: committed to the repo for reproducible builds; updated automatically by version bump
- If a release is partially completed (e.g., crates.io published but GitHub release missing), fix immediately — do not leave versions out of sync
- **Yanking**: if a broken version is published to crates.io, yank it with `cargo yank --version X.Y.Z` and document in the next release notes
---
## CI — GitHub Actions
CI must pass on every push/PR. Matrix: `ubuntu-latest`, `macos-latest`, `windows-latest`. Stable Rust matching `rust-version`.
**Required checks:** `cargo fmt --check` → `cargo clippy -- -D warnings` → `cargo test` → `cargo build --release`
**Async feature checks** (every push/PR):
`cargo clippy --features async -- -D warnings` → `cargo test --features async` → `cargo clippy --features async-gemini -- -D warnings` → `cargo test --features async-gemini`
**Gemini checks** (on `push` or `ci:gemini` labeled PRs only):
`cargo test --test test_gemini_live` (allowed-to-fail — depends on `GEMINI_API_KEY` secret)
**Rules:** Never merge code that breaks CI. Gemini live API failures do not block merging. New converters without tests = incomplete CI.