# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Panlabel is a Rust library and CLI tool for converting between different object detection annotation formats (COCO, TensorFlow Object Detection, etc.). The project is structured as both a library (`src/lib.rs`) and a binary (`src/main.rs`), allowing use as a dependency or standalone CLI.
**Status:** Active development (v0.6.0) - Full CLI with convert, validate, stats, diff, sample, and list-formats commands. Supports COCO JSON, CVAT XML, Label Studio JSON, LabelMe JSON, CreateML JSON, KITTI, VIA JSON, RetinaNet Keras CSV, TFOD CSV, YOLO directory format (flat Darknet-style and split-aware layouts, with optional confidence token), Pascal VOC XML directory format, HF ImageFolder, and IR JSON with lossiness tracking.
## Common Commands
### Build & Run
```bash
cargo build            # Build debug version
cargo build --release  # Build optimized release version
cargo run              # Run the CLI
cargo run -- -V        # Run with arguments (e.g., version flag)
```
### Testing
```bash
cargo test                        # Run all tests (unit + integration + proptests)
cargo test --test cli             # Run only CLI integration tests
cargo test --test proptest_ir_json
cargo test --test proptest_coco
cargo test --test proptest_tfod
cargo test --test proptest_label_studio
cargo test --test proptest_voc
cargo test --test proptest_yolo
cargo test --test proptest_labelme
cargo test --test proptest_createml
cargo test --test proptest_cross_format
cargo test runs                   # Run tests whose names contain a substring (here, "runs")
PROPTEST_CASES=1000 cargo test --test proptest_ir_json  # Deeper local run
```
### Development
```bash
cargo check       # Fast syntax/type checking without building
cargo fmt         # Format code
cargo clippy      # Run linter
cargo doc --open  # Generate and view documentation
```
### Benchmarking
```bash
cargo bench            # Run all Criterion benchmarks
cargo bench -- --test  # Smoke-test benchmarks (no timing)
```
### Fuzzing (requires nightly)
```bash
cargo +nightly fuzz run coco_json_parse          # Fuzz COCO JSON parser
cargo +nightly fuzz run voc_xml_parse            # Fuzz VOC XML parser
cargo +nightly fuzz run tfod_csv_parse           # Fuzz TFOD CSV parser
cargo +nightly fuzz run label_studio_json_parse  # Fuzz Label Studio parser
cargo +nightly fuzz run ir_json_parse            # Fuzz IR JSON parser
cargo +nightly fuzz run yolo_label_line_parse    # Fuzz YOLO line parser
cargo +nightly fuzz run labelme_json_parse       # Fuzz LabelMe JSON parser
cargo +nightly fuzz run createml_json_parse      # Fuzz CreateML JSON parser
cargo +nightly fuzz run kitti_txt_parse          # Fuzz KITTI parser
cargo +nightly fuzz run via_json_parse           # Fuzz VIA JSON parser
cargo +nightly fuzz run retinanet_csv_parse      # Fuzz RetinaNet CSV parser
```
`fuzz/Cargo.toml` enables panlabel's `fuzzing` feature so the fuzz-only YOLO parser wrapper is available from the fuzz crate.
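The pattern is a feature-gated public wrapper around an otherwise-private parser. A minimal sketch (illustrative names and a simplified line format, not the actual panlabel internals):

```rust
// Sketch of a feature-gated fuzz wrapper: normal builds keep the parser
// private, while `--features fuzzing` exposes it to the fuzz crate.
// The parser shown here is a simplified stand-in for the real YOLO one.

/// Parses a hypothetical `<class> <cx> <cy> <w> <h>` label line.
fn parse_yolo_line_inner(line: &str) -> Option<(u32, [f32; 4])> {
    let mut fields = line.split_whitespace();
    let class: u32 = fields.next()?.parse().ok()?;
    let mut coords = [0.0f32; 4];
    for c in coords.iter_mut() {
        *c = fields.next()?.parse().ok()?;
    }
    Some((class, coords))
}

/// Public only when the `fuzzing` feature is enabled, so fuzz targets can
/// reach it without widening the crate's normal API surface.
#[cfg(feature = "fuzzing")]
pub fn fuzz_parse_yolo_line(line: &str) -> Option<(u32, [f32; 4])> {
    parse_yolo_line_inner(line)
}
```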
### Releasing
```bash
# Check what dist will build (dry run)
dist plan
# Regenerate release workflow after config changes
dist generate
# To release:
# 1. Bump version in Cargo.toml, update CHANGELOG.md
# 2. IMPORTANT: run `dist generate` to ensure release.yml is in sync
# 3. Commit, tag, push:
git tag vX.Y.Z
git push && git push --tags
# The .github/workflows/release.yml workflow handles the rest:
# - Builds binaries for 5 platforms
# - Creates GitHub Release with archives + checksums
# - Publishes Homebrew formula to strickvl/homebrew-tap
```
Release infrastructure:
- **cargo-dist** (`dist-workspace.toml`) manages cross-platform binary builds and GitHub Releases
- **Homebrew tap**: `strickvl/homebrew-tap` — auto-updated by release CI (requires `HOMEBREW_TAP_TOKEN` secret)
- **Tag convention**: `vX.Y.Z` (e.g., `v0.6.0`) — annotated tags trigger the release workflow
### Generate Test Data
```bash
# Requires numpy: pip install numpy (or uv pip install numpy)
python scripts/dataset_generator.py --num_images 1000 --annotations_per_image 10 --output_dir ./assets
```
## Architecture
```
src/
├── lib.rs                        # Library entry point - CLI parsing and command dispatch
├── main.rs                       # CLI binary - thin wrapper calling lib.rs
├── error.rs                      # Error types (PanlabelError)
├── ir/                           # Intermediate Representation module
│   ├── mod.rs                    # IR module exports
│   ├── model.rs                  # Core types: Dataset, Image, Annotation, Category
│   ├── bbox.rs                   # BBoxXYXY with coordinate space type safety
│   ├── coord.rs                  # Coord type for 2D points
│   ├── space.rs                  # Pixel/Normalized coordinate space markers
│   ├── ids.rs                    # Strongly-typed IDs (ImageId, AnnotationId, etc.)
│   ├── io_coco_json.rs           # COCO JSON reader/writer
│   ├── io_label_studio_json.rs   # Label Studio JSON reader/writer
│   ├── io_labelme_json.rs        # LabelMe JSON reader/writer (file + directory)
│   ├── io_createml_json.rs       # Apple CreateML JSON reader/writer
│   ├── io_kitti.rs               # KITTI object detection reader/writer (directory-based)
│   ├── io_via_json.rs            # VGG Image Annotator (VIA) JSON reader/writer
│   ├── io_retinanet_csv.rs       # RetinaNet Keras CSV reader/writer
│   ├── io_tfod_csv.rs            # TFOD CSV reader/writer
│   ├── io_yolo.rs                # Ultralytics YOLO reader/writer (directory-based)
│   ├── io_voc_xml.rs             # Pascal VOC XML reader/writer (directory-based)
│   └── io_json.rs                # IR JSON format (canonical serialization)
├── validation/                   # Dataset validation
│   ├── mod.rs                    # validate_dataset() function
│   └── report.rs                 # ValidationReport formatting
├── conversion/                   # Format conversion reporting
│   ├── mod.rs                    # build_conversion_report(), Format enum, IrLossiness
│   └── report.rs                 # ConversionReport with lossiness warnings
└── stats/                        # Dataset statistics + HTML/text/JSON reporting
    ├── mod.rs                    # stats_dataset() function
    ├── report.rs                 # StatsReport with terminal formatting
    └── html.rs                   # Self-contained HTML report renderer
tests/
├── cli.rs                        # CLI integration tests using assert_cmd
├── common/mod.rs                 # Shared BMP helpers for YOLO-related tests
├── proptest_helpers/mod.rs       # Shared proptest strategies + semantic assertions
├── proptest_*.rs                 # Property tests per adapter + cross-format subset checks
├── tfod_csv_roundtrip.rs         # TFOD format roundtrip tests
├── yolo_roundtrip.rs             # YOLO format roundtrip tests
├── voc_roundtrip.rs              # VOC format roundtrip tests
├── label_studio_roundtrip.rs     # Label Studio format roundtrip tests
├── labelme_roundtrip.rs          # LabelMe format roundtrip tests
├── createml_roundtrip.rs         # CreateML format roundtrip tests
├── kitti_roundtrip.rs            # KITTI format roundtrip tests
├── via_roundtrip.rs              # VIA JSON format roundtrip tests
├── retinanet_csv_roundtrip.rs    # RetinaNet CSV format roundtrip tests
└── fixtures/                     # Test fixture files
proptest-regressions/
└── ...                           # Committed proptest shrinking regressions
benches/
└── microbenches.rs               # Criterion benchmarks for parsing/writing
fuzz/
├── corpus/                       # Seed corpora per fuzz target
└── fuzz_targets/                 # cargo-fuzz targets for parser fuzzing
scripts/
└── dataset_generator.py          # Generates COCO and TFOD synthetic datasets
docs/
├── README.md                     # Documentation hub
├── cli.md                        # CLI contract and examples
├── formats.md                    # Supported format behavior reference
├── tasks.md                      # Task/use-case support matrix
└── conversion.md                 # Lossiness + report schema and issue codes
```
## Documentation Topology
- Root `README.md` is a quick project gateway.
- `CONTRIBUTING.md` lives at repo root (GitHub auto-links it in issues/PRs).
- User-facing reference docs live in `docs/`.
- Forward-looking priorities live in `ROADMAP.md`.
- `design/` documents are historical context only and may be stale after implementation.
- Source of truth for docs accuracy:
- CLI and auto-detection: `src/lib.rs`
- Format adapters: `src/ir/io_*.rs`
- Lossiness/report codes: `src/conversion/*`
- User-visible behavior checks: `tests/cli.rs`, `tests/yolo_roundtrip.rs`, `tests/voc_roundtrip.rs`, `tests/label_studio_roundtrip.rs`, `tests/labelme_roundtrip.rs`, `tests/createml_roundtrip.rs`, `tests/kitti_roundtrip.rs`, `tests/via_roundtrip.rs`, `tests/retinanet_csv_roundtrip.rs`
If command behavior, format semantics, or conversion issue codes change, update `docs/` in the same change.
## CLI Commands
| Command | Description |
|---------|-------------|
| `validate` | Check dataset for errors (duplicate IDs, missing refs, invalid bboxes) |
| `convert` | Convert between formats with lossiness tracking |
| `stats` | Display statistics (counts, label histogram, bbox quality metrics) |
| `diff` | Compare two datasets semantically |
| `sample` | Create subset datasets (random or stratified), with JSON report output available |
| `list-formats` | Show supported formats with read/write and lossiness info, including JSON discovery output |
### Machine-readable output
- `--output-format json` is the consistent cross-command spelling for structured stdout.
- Read-only commands (`validate`, `stats`, `diff`, `list-formats`) also accept `--output json`.
- `convert` and `sample` keep `--report json` as a compatibility alias.
- JSON/report payloads go to stdout; fatal errors go to stderr.
### Convert with Auto-Detection
The `--from auto` flag detects the format from file extension/content for files and from layout markers for directories:
- `.csv` → content-based: 8 columns → TFOD, 6 columns → RetinaNet (or detected by header match)
- `.json`:
- empty array-root JSON (`[]`) → ambiguous (Label Studio or CreateML); requires explicit `--from`
- non-empty array-root: Label Studio task shape (`data.image`) → Label Studio; CreateML shape (`image` + `annotations`) → CreateML
- object-root with `shapes` array → LabelMe
- object-root with entries containing `filename` + `regions` → VIA
- object-root: requires a non-empty `annotations` array, then peek at `annotations[0].bbox`: array = COCO, object = IR JSON (empty datasets cannot be auto-detected)
- directory with `labels/` containing `.txt` files AND sibling `images/` → YOLO (labels without images is reported as an incomplete layout)
- directory with `Annotations/` containing top-level `.xml` files → VOC (`JPEGImages/` is optional, matching the reader)
- directory with `annotations.xml` at root → CVAT
- directory with `annotations/` containing LabelMe `.json` files (or co-located `.json` files with `shapes` key) → LabelMe
- directory with `label_2/` containing `.txt` files AND sibling `image_2/` → KITTI
- directory with `metadata.jsonl`, `metadata.parquet`, or parquet shard files → HF
- detection uses evidence-based probing (`FormatProbe` + `probe_dir_formats()`) that reports what was found/missing
- `stats` falls back to `ir-json` for parseable JSON files but surfaces malformed JSON errors directly
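The CSV branch of this heuristic can be sketched as follows (an illustrative helper with hypothetical names, not the actual probing code; the real detection also matches known headers):

```rust
// Sketch of the column-count heuristic for CSV auto-detection:
// 8 columns suggest TFOD, 6 suggest RetinaNet Keras CSV.

#[derive(Debug, PartialEq)]
enum CsvGuess {
    Tfod,
    Retinanet,
    Unknown,
}

/// Guesses a CSV annotation format from one row's comma-separated field count.
fn guess_csv_format(row: &str) -> CsvGuess {
    match row.split(',').count() {
        8 => CsvGuess::Tfod,      // e.g. filename,width,height,class,xmin,ymin,xmax,ymax
        6 => CsvGuess::Retinanet, // e.g. path,x1,y1,x2,y2,class
        _ => CsvGuess::Unknown,
    }
}
```

Note that a naive split like this misreads quoted fields containing commas; a real detector would use a CSV parser.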
**Key design:** The CLI binary (`main.rs`) is intentionally minimal—it calls `panlabel::run()` from the library and handles errors. All business logic belongs in `lib.rs` (or modules it imports). The IR module uses Rust's type system (phantom types for coordinate spaces, newtypes for IDs) to prevent common annotation bugs at compile time.
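The phantom-type idea looks roughly like this (the names echo `bbox.rs`/`space.rs`, but the exact API may differ):

```rust
use std::marker::PhantomData;

// Sketch of coordinate-space type safety: the space is a zero-sized type
// parameter, so mixing pixel and normalized boxes is a compile-time error.

struct Pixel;
struct Normalized;

#[derive(Debug)]
struct BBoxXYXY<S> {
    x1: f64,
    y1: f64,
    x2: f64,
    y2: f64,
    _space: PhantomData<S>,
}

impl BBoxXYXY<Normalized> {
    /// The only way to get a pixel-space box is an explicit conversion
    /// that requires the image dimensions.
    fn to_pixels(self, width: f64, height: f64) -> BBoxXYXY<Pixel> {
        BBoxXYXY {
            x1: self.x1 * width,
            y1: self.y1 * height,
            x2: self.x2 * width,
            y2: self.y2 * height,
            _space: PhantomData,
        }
    }
}
```

Passing a `BBoxXYXY<Normalized>` where a `BBoxXYXY<Pixel>` is expected fails to compile, which is exactly the class of bug the IR is designed to rule out.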
## Annotation Format Reference
Use `docs/formats.md` as the primary format reference.
Quick links:
- [`docs/formats.md`](docs/formats.md)
- [`docs/tasks.md`](docs/tasks.md)
- [`docs/conversion.md`](docs/conversion.md)
- [`docs/cli.md`](docs/cli.md)
Scope reminder: current support is object detection bboxes (not segmentation, keypoints/pose, OBB, or classification-only pipelines).
See `scripts/dataset_generator.py` for synthetic data generation.
## Testing Notes
- Integration tests use the `assert_cmd` crate for CLI testing
- Test data goes in the `assets/` directory (gitignored)
- Use a deterministic seed (42) for reproducible test data generation