module-harness 0.2.0

# module-harness Specification

## Purpose

Provide deterministic tooling for managing module-level harness context — structured documentation that bridges the gap between source code and AI agent understanding.

## Architecture

```
module-harness (binary — deterministic)
├── parse <file>          → JSON of existing harness sections
├── inventory [dir]       → list all evals across modules
├── diff <file>           → show harness summary / drift
├── coverage [dir]        → match evals to test functions (planned)
└── format <file> --lang  → write harness from JSON input (planned)

module-harness (skill — non-deterministic, LLM-powered)
├── /module-harness <file>        → generate/update harness for one module
├── /module-harness batch <dir>   → all modules in directory
└── /module-harness audit <dir>   → report drift and coverage
```

**Binary** handles structure: parsing, validation, inventory, metrics.
**Skill** handles understanding: reading code, writing specs, inferring contracts.

## Harness Sections

### Spec

Behavioral specification. One entry per public function, type, or trait. Describes *what* the module does, not *how*.

- Must cover every public symbol
- Should be verifiable (testable claims, not aspirations)
- Ordered by importance / call frequency

### Agentic Contracts

Invariants and guarantees that callers can rely on without reading implementation.

- Error handling behavior (panics, returns, logs)
- Atomicity and concurrency guarantees
- Idempotency claims
- Side effect boundaries (what the module touches)

### Evals

Named testable scenarios. Each eval has:
- **Name**: snake_case identifier (could be a test function name)
- **Description**: scenario → expected outcome

Evals serve dual purpose:
1. **Documentation** — what should be tested
2. **Metrics** — quantifiable coverage for comparing harness contexts

## Eval Metrics (Planned)

Evals create a quantifiable signal for comparing different harness contexts via backtesting:

- **Coverage ratio**: evals with matching tests / total evals
- **Drift score**: public symbols not mentioned in spec / total public symbols
- **Completeness**: modules with harness / total modules

These metrics enable A/B comparison: given two harness context versions, which one leads to better agent outcomes when working on the codebase?

## Future: Backtesting

The backtesting workflow (not yet implemented):

1. Select a git revision as baseline
2. Present the agent with a target spec (feature to implement, bug to fix)
3. Run with harness context A, measure outcome
4. Run with harness context B, measure outcome
5. Compare: which harness led to fewer errors, less drift, faster completion?

This creates a feedback loop: harness quality improves agent performance, which is measurable.