hotcoco
Fast enough for every epoch, lean enough for every dataset. A drop-in replacement for pycocotools that doesn't become the bottleneck — in your training loop or at foundation model scale. Up to 23× faster on standard COCO, 39× faster on Objects365, and fits comfortably in memory where alternatives run out.
Available as a Python package, CLI tool, and Rust library. Pure Rust — no Cython, no C compiler, no Microsoft Build Tools. Prebuilt wheels for Linux, macOS, and Windows.
Documentation | Changelog | Roadmap
Performance
Benchmarked on COCO val2017 (5,000 images, 36,781 synthetic detections) on an Apple M1 MacBook Air:
| Eval Type | pycocotools | faster-coco-eval | hotcoco |
|---|---|---|---|
| bbox | 9.46s | 2.45s (3.9x) | 0.41s (23.0x) |
| segm | 9.16s | 4.36s (2.1x) | 0.49s (18.6x) |
| keypoints | 2.62s | 1.78s (1.5x) | 0.21s (12.7x) |
Speedups in parentheses are vs pycocotools. Results verified against pycocotools on COCO val2017 with a 10,000+ case parity test suite — your AP scores won't change.
At scale (Objects365 val — 80k images, 365 categories, 1.2M detections), hotcoco completes in 18s vs 721s for pycocotools (39x) and 251s for faster-coco-eval (14x) — while using half the memory. See the full benchmarks.
Quick Start
Python
```python
from hotcoco import COCO, COCOeval

# API mirrors pycocotools; file names are illustrative
coco_gt = COCO("instances_val2017.json")
coco_dt = coco_gt.load_res("detections.json")
coco_eval = COCOeval(coco_gt, coco_dt, iou_type="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
```
Drop-in replacement for pycocotools
If you use Detectron2, Ultralytics YOLO, mmdetection, or any other pycocotools-based pipeline, call init_as_pycocotools() once at startup — no other code changes needed:
```python
import hotcoco
hotcoco.init_as_pycocotools()

# Existing code works unchanged
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
```
LVIS evaluation
hotcoco supports LVIS federated evaluation with all 13 metrics (AP, APr, APc, APf, AR@300, and more). Use LVISeval directly or call init_as_lvis() to drop into any existing lvis-api pipeline:
```python
from hotcoco import LVIS, LVISeval

# File names are illustrative; exact signatures may differ
lvis_gt = LVIS("lvis_v1_val.json")
lvis_dt = lvis_gt.load_res("detections.json")
lvis_eval = LVISeval(lvis_gt, lvis_dt, iou_type="bbox")
lvis_eval.evaluate()
lvis_eval.accumulate()
lvis_eval.summarize()
lvis_eval.get_results()
# {"AP": ..., "APr": ..., "APc": ..., "APf": ..., "AR@300": ...}
```

```python
# Or as a drop-in for Detectron2 / MMDetection lvis-api pipelines
import hotcoco
hotcoco.init_as_lvis()

from lvis import LVIS, LVISEval  # resolves to hotcoco
```
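For context, LVIS's federated metrics group categories by how many training images each appears in: rare (at most 10), common (11 to 100), and frequent (more than 100); APr, APc, and APf average AP within those groups. A standalone sketch of the bucketing, independent of hotcoco, with thresholds as in the LVIS v1 paper:

```python
def lvis_frequency_bucket(image_count: int) -> str:
    """Classify a category into its LVIS frequency group by training-image count."""
    if image_count <= 10:
        return "rare"       # contributes to APr
    if image_count <= 100:
        return "common"     # contributes to APc
    return "frequent"       # contributes to APf

# Hypothetical category counts, purely for illustration
counts = {"unicycle": 4, "snowshoe": 57, "dog": 3200}
buckets = {name: lvis_frequency_bucket(n) for name, n in counts.items()}
# {'unicycle': 'rare', 'snowshoe': 'common', 'dog': 'frequent'}
```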
Format conversion
Convert between COCO JSON and YOLO label format in either direction:
```python
from hotcoco import coco_to_yolo, yolo_to_coco  # function names assumed for illustration

# COCO → YOLO
stats = coco_to_yolo("instances_val2017.json", "labels/")
# {'images': 5000, 'annotations': 36781, 'skipped_crowd': 12, 'missing_bbox': 0}

# YOLO → COCO (with Pillow to read image dims)
coco = yolo_to_coco("labels/", images_dir="images/")
```
Or from the CLI (both conversions are exposed as `hotcoco` subcommands; see `hotcoco --help` for the exact flags).
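The per-box arithmetic behind the conversion is simple: COCO stores `[x_min, y_min, width, height]` in absolute pixels, while YOLO labels use `[x_center, y_center, width, height]` normalised to the image size. A minimal sketch of that step (not hotcoco's internals):

```python
def coco_bbox_to_yolo(bbox, img_w, img_h):
    """[x_min, y_min, w, h] in pixels -> [cx, cy, w, h] normalised to [0, 1]."""
    x, y, w, h = bbox
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]

coco_bbox_to_yolo([100, 50, 200, 100], img_w=640, img_h=480)
# [0.3125, 0.2083..., 0.3125, 0.2083...]
```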
F-scores
f_scores() computes F-beta scores from the precision/recall curves. For each IoU threshold and category it finds the operating point that maximises F-beta, then averages — analogous to mAP:
```python
coco_eval.f_scores()                  # after evaluate/accumulate; argument names assumed
# {"F1": 0.523, "F150": 0.712, "F175": 0.581}
coco_eval.f_scores(beta=0.5)          # precision-weighted F-score
coco_eval.f_scores(beta=2.0)          # recall-weighted F-score
```
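For reference, the score at a single operating point is F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta < 1 weights precision, beta > 1 weights recall. A standalone sketch of picking the best operating point along one precision/recall curve, the same idea described above (not hotcoco's implementation):

```python
def best_f_beta(precisions, recalls, beta=1.0):
    """Max F-beta over the operating points of a PR curve."""
    b2 = beta * beta
    best = 0.0
    for p, r in zip(precisions, recalls):
        if p + r > 0:
            best = max(best, (1 + b2) * p * r / (b2 * p + r))
    return best

best_f_beta([0.9, 0.8, 0.6], [0.2, 0.5, 0.8])
# ≈ 0.686 (the operating point P=0.6, R=0.8 maximises F1)
```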
Logging metrics
get_results() accepts an optional prefix and per-class flag, returning a flat dict that plugs directly into any experiment tracker:
```python
coco_eval.get_results(prefix="val/bbox/", per_class=True)  # parameter names assumed
# {"val/bbox/AP": 0.578, ..., "val/bbox/AP/person": 0.82, "val/bbox/AP/car": 0.71, ...}
```
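If your tracker expects a different nesting, the flat key scheme is easy to reproduce yourself; a sketch of the same prefix-plus-per-class flattening (key layout assumed from the example output above):

```python
def flatten_metrics(metrics, per_class, prefix=""):
    """Flatten overall metrics and a per-class dict into tracker-friendly keys."""
    flat = {f"{prefix}{name}": value for name, value in metrics.items()}
    for cls, values in per_class.items():
        for name, value in values.items():
            flat[f"{prefix}{name}/{cls}"] = value
    return flat

flatten_metrics({"AP": 0.578}, {"person": {"AP": 0.82}}, prefix="val/bbox/")
# {'val/bbox/AP': 0.578, 'val/bbox/AP/person': 0.82}
```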
Saving results
results() returns a serializable dict; save_results() writes it as JSON:
```python
coco_eval.results()
# {"params": {"iou_type": "bbox", ...}, "metrics": {"AP": 0.378, ...}, "per_class": {...}}
coco_eval.save_results("eval.json")  # path illustrative
```
TIDE error analysis
tide_errors() decomposes every false positive and false negative into six error types — Localization, Classification, Duplicate, Background, Both, and Miss — and reports the ΔAP for each. Use it to understand why your model falls short, not just how much:
```python
errors = coco_eval.tide_errors()    # usage sketch; exact signature may differ
delta_ap = errors["Localization"]   # ΔAP attributable to localization errors
```
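TIDE's decision rule for a single detection can be sketched from the paper's thresholds (foreground IoU 0.5, background IoU 0.1). This simplified version, independent of hotcoco, shows how a detection's overlaps with ground truth map to an error type:

```python
def classify_detection(iou_same_cls, iou_other_cls, gt_already_matched,
                       t_fg=0.5, t_bg=0.1):
    """Map a detection to a TIDE error type (simplified sketch).

    iou_same_cls / iou_other_cls: max IoU with ground truth of the
    same / a different class. Thresholds follow the TIDE paper.
    """
    if iou_same_cls >= t_fg:
        # Good overlap, right class: either a true positive or a duplicate
        return "Duplicate" if gt_already_matched else "TruePositive"
    if iou_other_cls >= t_fg:
        return "Classification"   # well localised, wrong class
    if iou_same_cls >= t_bg:
        return "Localization"     # right class, poorly localised
    if iou_other_cls >= t_bg:
        return "Both"             # wrong class and poorly localised
    return "Background"           # no meaningful overlap with any ground truth
```

Unmatched ground-truth boxes that no detection covers account for the remaining "Miss" category.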
Or from the CLI (TIDE analysis is exposed as a `hotcoco` subcommand; see `hotcoco --help` for the exact flags).
CLI
Evaluation, format conversion, and error analysis are also available from the command line; run `hotcoco --help` for the full list of subcommands and flags.
Rust
```rust
use hotcoco::{COCO, COCOeval, IouType};
use std::path::Path;

// File names are illustrative; type names follow the Python API.
let coco_gt = COCO::new(Path::new("instances_val2017.json"))?;
let coco_dt = coco_gt.load_res(Path::new("detections.json"))?;
let mut eval = COCOeval::new(&coco_gt, &coco_dt, IouType::Bbox);
eval.evaluate();
eval.accumulate();
eval.summarize();
```
License
MIT