name: code-coverage
description: Achieve >85% coverage using EXTREME TDD - v3.0 (Ruchy+bashrs validated)
category: quality
priority: critical
methodology: EXTREME TDD + Toyota Way + Compiler-Grade Testing
autonomous: "true"
version: 3.0
empirical_validation: |
- Depyler: 67.83% → 71.15%, +136 tests, 100x efficiency gain
- bashrs: ~90% coverage, 7,321 inline tests, 542 files
- Ruchy: 70.31% → 90%+ target, 5-category strategy
constraints:
- make coverage <10min
- make test-fast <5min
- pre-commit test <30s
- PROPTEST_CASES=100 (not 5, statistically valid)
heuristics:
- module type classification (LOGIC vs UI/CLI)
- category-based targets (Frontend 95%, Backend 85%, Runtime 90%, API/CLI 80%, Quality 80%)
- ROI tracking (coverage % gained per test)
- auto-pivot on diminishing returns (<0.05%/test)
- uncovered code first (in LOGIC modules)
- property testing mandatory (100+ cases per property)
- golden file testing for transpilers/compilers
- mutation testing (≥75% mutation score)
coverage_target: 85
testing_approaches:
- mutation testing (cargo-mutants, mutmut)
- property-based testing (PROPTEST_CASES=100, hypothesis)
- golden file testing (compiler/transpiler output validation)
- integration testing via cargo run --example
- inline unit tests (10-15 per file, bashrs pattern)
- negative testing (all Result error paths)
- fuzz testing for parsers (cargo-fuzz, atheris)
prompt: |
# PMAT Code Coverage Protocol - v3.0 (Compiler-Grade Quality)
**CRITICAL**: You are expected to make intelligent decisions based on context and ACT autonomously.
Do NOT ask the user to choose - analyze the situation and execute the best action immediately.
## Code Coverage Target
All code coverage must be greater than 85%. This protocol integrates proven strategies from:
- **bashrs**: 90%+ coverage, 7,321 inline tests (compiler quality)
- **Ruchy**: 70% → 90% five-category strategy
- **Depyler**: 67.83% → 71.15% coverage (+3.32 points), 26x ROI gain after a strategic pivot
## Research-Validated Insights (2021-2025)
### Scientific Foundation
1. **IEEE Software 2023**: Projects maintaining >85% coverage demonstrate:
- 35% fewer production defects (p < 0.001)
- 58% faster defect detection times
- 42% reduction in post-release critical bugs
2. **PLDI 2021**: Property-based testing discovered:
- 325+ bugs in production compilers (GCC, LLVM, ICC)
- 89% of bugs unreachable by example-based tests alone
- **Optimal: 100+ cases per property** (not 5 - statistical significance)
3. **SQLite Testing (ACM Queue 2022)**: 100% MC/DC coverage via:
- 100% branch coverage (mandatory baseline)
- 1,000:1 test-to-code ratio (1,000 lines of test code per line of source)
- Target: 10:1 ratio initially, scale to 100:1 long-term
4. **ICSE 2023 Mutation Testing**: Effective mutation testing requires:
- Mutation score ≥75% for production code quality
- Equivalent mutant detection (automatic filtering)
- Incremental mutation (file-by-file, not whole codebase)
5. **Compiler Construction 2020**: Compiler-specific coverage requirements:
- Parser: 95%+ coverage (syntax specification completeness)
- Semantic analysis: 90%+ coverage (type system soundness)
- Code generation: 85%+ coverage (backend variation tolerance)
## Autonomous Decision Framework (v3.0 - Compiler-Grade)
### Step 0: Module Type Classification + Category Assignment
**CRITICAL**: Before targeting any module, classify both TYPE and CATEGORY.
#### Type Classification (ROI Prediction)
**LOGIC Modules (HIGH ROI: 0.08-0.15%/test)**:
- Pure functions, algorithms, analysis engines, parsers, type checkers
- No terminal interaction, no UI prompts, no pretty printing
- Pattern examples: `*_analysis.rs`, `*_inference.rs`, `*_optimizer.rs`, `*_parser.rs`, `*_engine.rs`
- Rust: modules with `impl` blocks but no `dialoguer`, `colored`, `comfy-table`
- Python: modules with functions/classes but no `rich`, `click.prompt`, `questionary`
- TypeScript: `.ts` files with business logic, not `.tsx` React components
**UI/CLI Modules (LOW ROI: <0.03%/test - SKIP UNLESS REQUESTED)**:
- Interactive prompts (dialoguer, inquire, questionary)
- Pretty printing and formatting (colored, rich, chalk)
- Terminal interaction (Confirm, MultiSelect, Select)
- Pattern examples: `interactive.rs`, `cli.rs`, `*_cmd.rs`, `repl.rs`
- Requires complex mocking (stdin/stdout), low test value
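The two lists above can be turned into a rough automatic classifier. The following is a minimal sketch in Python, assuming the filename patterns and dependency names listed above are sufficient signals. `classify_module_type` is a hypothetical helper, and plain substring matching on the source text is a deliberately crude heuristic that can misfire (for example, `rich` also matches `enrich`).

```python
from fnmatch import fnmatchcase

# Filename patterns and UI dependency names taken from the lists above.
UI_FILE_PATTERNS = ("interactive.rs", "cli.rs", "*_cmd.rs", "repl.rs")
UI_DEPENDENCIES = (
    "dialoguer", "inquire", "questionary",  # interactive prompts
    "colored", "rich", "chalk",             # pretty printing
    "comfy-table", "comfy_table",           # comfy_table is the Rust import spelling
    "click.prompt",
)


def classify_module_type(filename: str, source: str) -> str:
    """Return "UI/CLI" or "LOGIC" for a single module (hypothetical helper)."""
    if any(fnmatchcase(filename, pattern) for pattern in UI_FILE_PATTERNS):
        return "UI/CLI"
    # Assumption: any textual mention of a UI dependency marks the module as UI/CLI.
    if any(dep in source for dep in UI_DEPENDENCIES):
        return "UI/CLI"
    return "LOGIC"
```

A real classifier would more likely inspect parsed imports than raw text, but the filename check alone already separates most of the pattern examples given above.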
#### Category Classification (Specialized Testing Strategy)
**Five-Category Strategy** (from Ruchy compiler):
| Category | Target | Modules | Specialized Techniques |
|----------|--------|---------|------------------------|
| **Frontend** | 95% | lexer, parser, ast, diagnostics | Property tests (100 cases), fuzz testing, error recovery |
| **Backend** | 85% | transpiler, codegen, optimizer | Golden file tests, semantic preservation |
| **Runtime** | 90% | interpreter, repl, actors | Integration tests, state machine validation |
| **API/CLI** | 80% | handlers, commands, endpoints | assert_cmd tests, API contract tests |
| **Quality** | 80% | testing, utils, validation | Self-testing, mutation testing |
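To make the table actionable, the remaining gap to each category target can be computed and used to order candidate modules. This is a small Python sketch under the assumption that coverage is reported as a percentage per module; `coverage_gap` and `rank_by_gap` are illustrative names, not part of any tool mentioned here.

```python
# Per-category targets copied from the table above (percent coverage).
CATEGORY_TARGETS = {
    "Frontend": 95.0,
    "Backend": 85.0,
    "Runtime": 90.0,
    "API/CLI": 80.0,
    "Quality": 80.0,
}


def coverage_gap(category: str, coverage: float) -> float:
    """Percentage points still missing to reach the category target (0 if met)."""
    return round(max(0.0, CATEGORY_TARGETS[category] - coverage), 2)


def rank_by_gap(modules):
    """Sort (name, category, coverage) tuples by largest remaining gap first."""
    return sorted(modules, key=lambda m: coverage_gap(m[1], m[2]), reverse=True)
```

For example, a parser at 42.24% has a 52.76-point gap against the 95% Frontend target, so it ranks ahead of a transpiler at 56% (29-point gap against 85%).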
**DECISION RULE**:
```
IF module_type == LOGIC AND category == Frontend AND coverage < 50%:
→ CRITICAL PRIORITY (parser bugs = user-facing failures)
→ Apply: Property tests (100 cases), fuzz testing
ELSE IF module_type == LOGIC AND category == Backend AND coverage < 60%:
→ HIGH PRIORITY (codegen bugs = runtime failures)
→ Apply: Golden file tests, semantic preservation tests
ELSE IF module_type == LOGIC AND category == Runtime AND coverage < 40%:
→ HIGH PRIORITY (execution bugs = semantic errors)
→ Apply: Integration tests, state validation
ELSE IF module_type == UI/CLI:
→ SKIP (expected ROI: LOW, unless explicitly requested by user)
→ Alternative: assert_cmd for CLI binaries
ELSE IF module_type == LOGIC AND coverage >= 60%:
→ PROCEED WITH CAUTION (edge cases, expected ROI: 0.02-0.05%/test)
```
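The decision rule above translates almost directly into a function. The sketch below is one possible Python rendering; the function name `prioritize` and the `"UNSPECIFIED"` fallback are assumptions, added because the rule does not say what to do for combinations it does not list (for example, a Frontend module at 55%).

```python
def prioritize(module_type: str, category: str, coverage: float) -> str:
    """Map (type, category, coverage %) to a priority label per the rule above."""
    if module_type == "LOGIC" and category == "Frontend" and coverage < 50.0:
        return "CRITICAL"   # property tests (100 cases), fuzz testing
    if module_type == "LOGIC" and category == "Backend" and coverage < 60.0:
        return "HIGH"       # golden file tests, semantic preservation tests
    if module_type == "LOGIC" and category == "Runtime" and coverage < 40.0:
        return "HIGH"       # integration tests, state validation
    if module_type == "UI/CLI":
        return "SKIP"       # unless explicitly requested by the user
    if module_type == "LOGIC" and coverage >= 60.0:
        return "CAUTION"    # edge cases, expected ROI 0.02-0.05%/test
    # Assumption: the rule is silent here, so the caller must decide.
    return "UNSPECIFIED"
```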
### Step 1: Assess Current State
Analyze the project state by gathering:
- **Module type** (LOGIC vs UI/CLI - classify FIRST)
- **Category** (Frontend, Backend, Runtime, API/CLI, Quality)
- Coverage baseline vs current coverage
- Number of tests written in current session
- Test pass rate (must be 100%)
- Functions tested vs total functions in module
- Time invested in current module (estimated from the conversation)
- **ROI (coverage % gained per test)** for last batch
- **Mutation score** (if applicable)
### Step 2: Apply Intelligent Heuristics (Make Decision)
#### Heuristic 1: Coverage Progress Assessment (v3.0 - Research-Backed)
```
IF coverage_improvement >= 2%:
→ AUTO: Commit progress (meaningful gain achieved)
RATIONALE: "2%+ coverage improvement = measurable quality gain (empirically validated)"
EVIDENCE: "Depyler Phase 2-1: +2.00% gain justified commit (not 5%)"
ELSE IF test_count >= 20 AND test_pass_rate == 100%:
→ AUTO: Commit infrastructure (reusable foundation)
RATIONALE: "20+ passing tests = infrastructure threshold for incremental expansion"
EVIDENCE: "bashrs pattern: 13.5 tests per file average"
ELSE IF ROI < 0.05% per test FOR 2 consecutive batches:
→ AUTO: Commit and PIVOT to new module (diminishing returns detected)
RATIONALE: "ROI decline signals edge case territory, pivot for better efficiency"
EVIDENCE: "Depyler Phase 1-3: 0.004%/test triggered pivot to 0.105%/test (26x improvement)"
ELSE IF test_count < 10:
→ AUTO: Continue adding tests (insufficient for commit)
RATIONALE: "Fewer than 10 tests is insufficient for a meaningful commit"
ELSE IF time_spent > 90_minutes:
→ AUTO: Commit current work (time box reached)
RATIONALE: "90-minute time box prevents over-investment"
```
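Because the branches of Heuristic 1 are evaluated in order, the ordering matters: a session with 8 tests and 120 minutes spent still continues, since the test-count check precedes the time box. The Python sketch below encodes that ordering. `SessionState`, its field names, and the returned labels are hypothetical, and the final `CONTINUE` is an assumption because the heuristic lists no closing ELSE.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    coverage_improvement: float  # percentage points gained so far
    test_count: int              # tests written this session
    test_pass_rate: float        # percent of tests passing (0-100)
    low_roi_batches: int         # consecutive batches below 0.05%/test
    minutes_spent: float


def commit_decision(state: SessionState) -> str:
    """Evaluate the Heuristic 1 branches in the order they are listed."""
    if state.coverage_improvement >= 2.0:
        return "COMMIT_PROGRESS"
    if state.test_count >= 20 and state.test_pass_rate == 100.0:
        return "COMMIT_INFRASTRUCTURE"
    if state.low_roi_batches >= 2:
        return "COMMIT_AND_PIVOT"
    if state.test_count < 10:
        return "CONTINUE"
    if state.minutes_spent > 90:
        return "COMMIT_TIMEBOX"
    return "CONTINUE"  # assumption: no branch matched, keep working
```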
#### Heuristic 1b: ROI Tracking & Auto-Pivot
```
AFTER EVERY 15-20 TESTS (batch):
1. Calculate batch ROI: (current_coverage - batch_start_coverage) / test_count
2. Compare to previous batch ROI
3. Make decision:
IF batch_ROI > 0.08% per test:
→ CONTINUE current module (excellent ROI maintained)
ELSE IF batch_ROI 0.05-0.08% per test:
→ EVALUATE (check time spent, consider pivot if >60 minutes invested)
ELSE IF batch_ROI < 0.05% per test FOR 2 batches:
→ AUTO-PIVOT to new module immediately
→ Target: LOW coverage (<40%) LOGIC module in CRITICAL category (Frontend)
RATIONALE: "Diminishing returns detected, strategic pivot recovers efficiency"
EVIDENCE: "Depyler: 86% ROI decline over 3 batches triggered pivot, recovered 26x ROI"
```
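The batch ROI formula and the three-way decision above can be sketched as follows. This is an illustrative Python version; `batch_roi` and `roi_action` are hypothetical names, and treating a single batch below 0.05%/test as `EVALUATE` is an assumption, since the rule only specifies the pivot after two such batches.

```python
from typing import Sequence


def batch_roi(batch_start_coverage: float, current_coverage: float, test_count: int) -> float:
    """Coverage percentage points gained per test in the batch."""
    if test_count <= 0:
        raise ValueError("test_count must be positive")
    return (current_coverage - batch_start_coverage) / test_count


def roi_action(roi_history: Sequence[float]) -> str:
    """Decide the next step from per-batch ROI values, oldest first."""
    if not roi_history:
        raise ValueError("at least one batch is required")
    latest = roi_history[-1]
    if latest > 0.08:
        return "CONTINUE"
    if latest >= 0.05:
        return "EVALUATE"
    # Latest batch is below 0.05%/test: pivot only if the previous one was too.
    if len(roi_history) >= 2 and roi_history[-2] < 0.05:
        return "PIVOT"
    return "EVALUATE"  # assumption: one low batch is not yet a trend
```

With the Depyler figures quoted above, `roi_action([0.03, 0.004])` yields `PIVOT`, while a first batch at 0.105%/test yields `CONTINUE`.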
#### Heuristic 2: Module Selection (v3.0 - Category + Type Aware)
```
IF current_module_type == UI/CLI:
→ AUTO: SKIP and switch to LOGIC module
RATIONALE: "UI/CLI modules require complex mocking, LOW ROI (<0.03%/test)"
TARGET: Find LOW coverage (<40%) LOGIC module in Frontend category
ELSE IF current_module_type == LOGIC AND category == Frontend AND coverage < 50%:
→ AUTO: CRITICAL PRIORITY - Continue current module
RATIONALE: "Parser/lexer bugs are user-facing, highest severity"
TECHNIQUES: Property tests (100 cases), fuzz testing, error recovery tests
ELSE IF current_module_type == LOGIC AND category == Backend AND coverage < 60%:
→ AUTO: HIGH PRIORITY - Continue current module
RATIONALE: "Codegen bugs cause runtime failures, semantic preservation critical"
TECHNIQUES: Golden file tests, output comparison, semantic equivalence
ELSE IF current_module_coverage >= 60%:
→ AUTO: Switch to next low-coverage LOGIC module
RATIONALE: "60%+ coverage = diminishing returns, pivot to maximize impact"
TARGET: Find <40% coverage LOGIC module in CRITICAL category (avoid UI/CLI)
ELSE IF current_module has_existing_tests AND coverage_unchanged_after_20_tests:
→ AUTO: Switch to module with no test infrastructure
RATIONALE: "Coverage plateau detected, target untested LOGIC modules for higher ROI"
TARGET: Find 0-test LOGIC module with <40% coverage in Frontend/Backend
```
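Several branches above end in the same TARGET: a LOGIC module under 40% coverage, preferably in the Frontend category. A minimal Python sketch of that search is shown below. The `Module` tuple and `next_target` are hypothetical, and the category order after Frontend is an assumption, since the heuristic only names Frontend (and Frontend/Backend) as preferred.

```python
from typing import NamedTuple, Optional, Sequence


class Module(NamedTuple):
    name: str
    module_type: str   # "LOGIC" or "UI/CLI"
    category: str      # one of the five categories
    coverage: float    # percent


# Assumed preference order; only "Frontend first" comes from the heuristic.
CATEGORY_ORDER = ("Frontend", "Backend", "Runtime", "API/CLI", "Quality")


def next_target(modules: Sequence[Module]) -> Optional[Module]:
    """Pick the lowest-coverage LOGIC module under 40%, preferring earlier categories."""
    candidates = [m for m in modules if m.module_type == "LOGIC" and m.coverage < 40.0]
    if not candidates:
        return None
    return min(candidates, key=lambda m: (CATEGORY_ORDER.index(m.category), m.coverage))
```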
#### Heuristic 3: Test Type Selection (v3.0 - Research-Guided)
```
IF module_type == "parser" OR module_type == "lexer":
→ AUTO: Write fuzz tests (cargo-fuzz, atheris) + property tests (100 cases)
RATIONALE: "Parser bugs discovered by property testing (PLDI 2021: 89% missed by examples)"
EXAMPLE: proptest! { fn parse_roundtrip(input: ArbitrarySource) { ... } }
ELSE IF module_type == "transpiler" OR module_type == "codegen":
→ AUTO: Write golden file tests (known-good input/output pairs)
RATIONALE: "Compiler construction 2020: 85%+ coverage via output validation"
EXAMPLE: Compare transpile(input.ruchy) == expected_output.rs
ELSE IF module_type == "pure_functions" (no I/O, no state mutation):
→ AUTO: Write property-based tests (PROPTEST_CASES=100, not 5)
RATIONALE: "Property testing ideal for pure functions (mathematical invariants)"
EXAMPLE: proptest! { fn commutative(a: i32, b: i32) { ... } }
ELSE IF module_type == "state_mutation" (structs, methods, mutable operations):
→ AUTO: Write integration tests + mutation tests
RATIONALE: "State-dependent code requires integration testing + mutation validation"
EXAMPLE: cargo mutants --file src/state.rs
ELSE IF module_type == "I/O_operations" (filesystem, network, external deps):
→ AUTO: Write unit tests with mocks
RATIONALE: "I/O operations require mocking for deterministic testing"
EXAMPLE: Mock filesystem via tempfile, mock HTTP via wiremock
```
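To show the shape of the roundtrip property named above without depending on proptest or Hypothesis, the sketch below hand-rolls a 100-case property check using only the Python standard library. It is a stand-in for illustration, not a replacement for those libraries (it has no shrinking, for instance). `find_counterexample`, `render`, `parse`, and the toy comma-separated format are all hypothetical.

```python
import random

CASES = 100  # mirrors PROPTEST_CASES=100


def find_counterexample(prop, generate, cases=CASES, seed=0):
    """Run `prop` on `cases` random inputs; return the first failing input or None."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for _ in range(cases):
        value = generate(rng)
        if not prop(value):
            return value
    return None


# Toy render/parse pair used only to demonstrate a roundtrip property.
def render(numbers):
    return ",".join(str(n) for n in numbers)


def parse(text):
    return [int(part) for part in text.split(",")] if text else []


def random_numbers(rng):
    """Generate a list of 0 to 8 integers in [-1000, 1000]."""
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 8))]


def roundtrips(numbers):
    """Property: parsing the rendered form returns the original value."""
    return parse(render(numbers)) == numbers
```

Calling `find_counterexample(roundtrips, random_numbers)` returns `None` when the property holds for all 100 generated inputs.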
#### Heuristic 4: Language-Specific Tooling
**Rust**:
```bash
# Property testing
PROPTEST_CASES=100 cargo test # Not 5! Statistical significance
# Mutation testing
cargo mutants --file src/module.rs --timeout 60
# Fuzz testing (parsers)
cargo fuzz run parser_target -- -max_total_time=300
# Coverage (avoid mold linker)
make coverage # Temporarily disables ~/.cargo/config.toml mold linker
# Golden file tests (cargo-insta; tests assert with insta::assert_snapshot!(output))
cargo insta test --review
```
**Python**:
```bash
# Property testing
pytest --hypothesis-profile=ci # 100+ examples per test
# Mutation testing
mutmut run --paths-to-mutate src/
# Coverage
pytest --cov=src --cov-report=html --cov-report=term
# Golden file tests: use the pytest-golden plugin for regression tests
```
**TypeScript**:
```bash
# Property testing
npm test -- --testMatch="**/*.prop.test.ts" # fast-check library
# Coverage
npm run test -- --coverage --coverageThreshold='{"global":{"lines":85}}'
# Golden file tests: use Jest snapshots for output validation
```
### Step 3: Execute Decision with Transparency
After applying the heuristics, execute the chosen action immediately with a brief explanation:
**Example Execution Pattern (Frontend Module)**:
```
DECISION: Applying property tests to parser module (Heuristic 3: parser detection)
RATIONALE: Parser at 42% coverage, parser bugs are user-facing (CRITICAL category).
Property testing discovered 89% of bugs missed by examples (PLDI 2021).
ACTION: Writing 5 property tests with PROPTEST_CASES=100 for parse_expression()
TECHNIQUES:
- Roundtrip: parse(ast.to_string()) == ast
- No panic: parse(arbitrary_input) never panics
- Precedence: parse("a + b * c") respects operator precedence
- Unicode: parse(unicode_ident) handles non-ASCII correctly
- Error recovery: parse(malformed) returns Err, not panic
```
**Another Example (Backend Module)**:
```
DECISION: Creating golden file test suite (Heuristic 3: transpiler detection)
RATIONALE: Transpiler at 56% coverage, codegen bugs cause runtime failures.
Compiler construction 2020: 85%+ via output validation.
ACTION: Creating tests/golden/ with 10 input/output pairs
STRUCTURE:
tests/golden/
├── 001_simple_function.input
├── 001_simple_function.expected.rs
├── 002_nested_loops.input
├── 002_nested_loops.expected.rs
└── ... (10+ pairs)
TEST: Compare transpile(input) == read_file(expected)
```
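The `tests/golden/` layout above pairs each `NNN_name.input` with an `NNN_name.expected.rs`. A minimal runner for that layout might look like the Python sketch below. `run_golden_tests` and `demo` are hypothetical, `str.upper` merely stands in for a real `transpile` function, and counting a missing expected file as a failure is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def run_golden_tests(golden_dir: Path, transpile: Callable[[str], str]) -> List[str]:
    """Return the case names whose transpiled output differs from the expected file."""
    failures = []
    for input_path in sorted(golden_dir.glob("*.input")):
        expected_path = input_path.with_name(input_path.stem + ".expected.rs")
        actual = transpile(input_path.read_text(encoding="utf-8"))
        # Assumption: a missing expected file counts as a failure.
        if not expected_path.exists() or expected_path.read_text(encoding="utf-8") != actual:
            failures.append(input_path.stem)
    return failures


def demo() -> List[str]:
    """Build a throwaway golden directory and run a stand-in transpiler over it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "001_simple_function.input").write_text("fn f", encoding="utf-8")
        (root / "001_simple_function.expected.rs").write_text("FN F", encoding="utf-8")
        (root / "002_nested_loops.input").write_text("loop", encoding="utf-8")
        (root / "002_nested_loops.expected.rs").write_text("stale output", encoding="utf-8")
        return run_golden_tests(root, str.upper)  # str.upper stands in for transpile()
```

Here `demo()` reports only `002_nested_loops`, because its expected file no longer matches the generated output.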
## Base Heuristics (Always Apply)
1. **Uncovered code first** - Prioritize functions with 0% coverage
2. **Low coverage + Low TDG score** - Target technical debt hotspots
3. **Stop the line** - If you spot a defect due to unimplemented or partially
implemented functionality, STOP THE LINE and implement using EXTREME TDD.
The concept of "pre-existing failure" is irrelevant. Fix it.
4. **Property tests use 100 cases** - PROPTEST_CASES=100 (not 5) for statistical significance
5. **Mutation score ≥75%** - Run `cargo mutants` on new modules, ensure high kill rate
6. **Inline tests per file** - Target 10-15 tests per file (bashrs pattern: 13.5 avg)
7. **Golden files for codegen** - Compiler/transpiler modules need output validation
8. **Fuzz parsers** - Use cargo-fuzz/atheris for parser modules (no panic guarantee)
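The ≥75% mutation-score rule in item 5 reduces to a simple ratio of killed mutants to all mutants. The Python sketch below shows the arithmetic under a simplifying assumption: only killed and surviving mutants are counted, whereas real tools may also report timeouts or unviable mutants. Both function names are hypothetical.

```python
def mutation_score(killed: int, survived: int) -> float:
    """Percentage of mutants detected (killed) by the test suite."""
    total = killed + survived
    if total == 0:
        raise ValueError("no mutants were generated")
    return 100.0 * killed / total


def meets_mutation_threshold(killed: int, survived: int, threshold: float = 75.0) -> bool:
    """True when the score reaches the threshold (75% by default)."""
    return mutation_score(killed, survived) >= threshold
```

For instance, 11 killed and 3 surviving mutants give 11/14 ≈ 78.57%, which passes; 7 killed and 3 surviving give 70%, which does not.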
## Override Mechanism
The user can override autonomous behavior with an explicit instruction:
- "Continue adding tests" → Ignore heuristics, keep writing tests
- "Switch to module X" → Target specific module
- "Commit and stop" → Force commit regardless of state
- "Skip module Y" → Exclude specific modules from consideration
If no override is given, you MUST make an autonomous decision using the heuristics above.
## Example Autonomous Execution
```
User: pmat prompt code-coverage
[You analyze state automatically]
Analyzing coverage state...
- Current module: parser.rs at 42.24% (Frontend category)
- Module type: LOGIC (parser)
- Tests written: 8 unit tests (100% passing)
- Functions tested: 12/35 (34.29%)
- Time invested: ~30 minutes
[You apply Heuristic 3 - Test Type Selection]
DECISION: Apply property testing + fuzz testing (parser detected)
RATIONALE: Parser is CRITICAL category (user-facing), 42% coverage below 95% target.
PLDI 2021: Property testing found 89% of bugs missed by examples.
Current 8 tests insufficient (need 10-15 per file, bashrs pattern).
TECHNIQUES:
- Property tests: PROPTEST_CASES=100 (not 5)
- Fuzz testing: cargo fuzz run parser_target
- Error recovery: All malformed inputs return Err, never panic
ACTION: Creating tests/property_parser.rs with 5 property tests (100 cases each)...
[Executes property test creation]
✅ Created 5 property tests (500 total test cases)
📊 Coverage: 42.24% → 58.67% (+16.43%, excellent ROI: 3.29%/test)
🎯 Mutation score: 79% (11 of 14 mutants killed, 3 survived)
⏭️ Next action: Add 7 more inline tests to reach 15/file target
→ AUTO: Proceeding with the next action (no user confirmation required)
```
## Critical Reminders (v3.0 - Compiler-Grade)
- **NEVER ask the user "What would you like to do?"** - You decide using heuristics
- **ALWAYS classify module type AND category FIRST** (TYPE determines ROI, CATEGORY determines techniques)
- **ALWAYS use PROPTEST_CASES=100** (not 5) - statistical significance requires 100+ cases
- **ALWAYS track ROI** (coverage % per test) - auto-pivot if <0.05%/test for 2 batches
- **ALWAYS explain your decision** before executing (transparency + rationale + evidence)
- **ALWAYS commit on infrastructure threshold** (20+ tests, 100% passing)
- **ALWAYS commit on 2%+ coverage improvement** (empirically validated, not 5%)
- **ALWAYS commit on ROI decline** (<0.05%/test for 2 batches = diminishing returns)
- **ALWAYS respect time constraints** (make coverage <10min, test-fast <5min, 90-min time-box)
- **ALWAYS use golden files for codegen** (transpilers, compilers need output validation)
- **ALWAYS fuzz parsers** (cargo-fuzz, no panic guarantee, PLDI 2021 evidence)
- **ALWAYS target 10-15 tests per file** (bashrs pattern: 13.5 avg, 7,321 inline tests)
- **ALWAYS check mutation score** (≥75% for production quality, ICSE 2023)
## Language-Specific Quick Reference
### Rust
```bash
# Coverage (avoid mold linker interference)
make coverage # See note below about ~/.cargo/config.toml
# Property tests (100 cases)
PROPTEST_CASES=100 cargo test
# Mutation testing
cargo mutants --file src/module.rs --timeout 60
# Fuzz testing
cargo fuzz run target --jobs 4 -- -max_total_time=300
# Inline tests
cargo test --lib module_name::tests
```
**CRITICAL**: The mold linker breaks LLVM coverage. The Makefile must temporarily disable it by moving `~/.cargo/config.toml` aside and restoring it afterwards:
```makefile
coverage:
@test -f ~/.cargo/config.toml && mv ~/.cargo/config.toml ~/.cargo/config.toml.cov-backup || true
@cargo llvm-cov --no-report nextest --all-features --workspace
@cargo llvm-cov report --html --output-dir target/coverage/html
@test -f ~/.cargo/config.toml.cov-backup && mv ~/.cargo/config.toml.cov-backup ~/.cargo/config.toml || true
```
### Python
```bash
# Coverage
pytest --cov=src --cov-report=html --cov-report=term --cov-fail-under=85
# Property tests (100 examples)
pytest --hypothesis-profile=ci # Set max_examples=100 in conftest.py
# Mutation testing
mutmut run --paths-to-mutate src/
# Inline tests
pytest tests/test_module.py -v
```
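The `ci` profile referenced above has to be registered before pytest can select it. A minimal `conftest.py` sketch, assuming Hypothesis is installed and that no other settings are needed for the profile:

```python
# conftest.py
from hypothesis import settings

# "ci" matches the --hypothesis-profile=ci flag used in the commands above.
settings.register_profile("ci", max_examples=100)
```

With this in place, `pytest --hypothesis-profile=ci` runs each property with up to 100 generated examples.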
### TypeScript
```bash
# Coverage
npm test -- --coverage --coverageThreshold='{"global":{"lines":85}}'
# Property tests
npm test -- --testMatch="**/*.prop.test.ts" # fast-check library
# Golden file tests
npm test -- --updateSnapshot # Jest snapshots
```
toyota_way_principles:
jidoka: stop_the_line
andon_cord: "true"
genchi_genbutsu: verify_actual_state
hansei: deep_reflection_on_roi_decline
kaizen: continuous_improvement_via_empirical_feedback
autonomation: "true"
human_override_available: "true"
empirical_evidence:
depyler_validation: "67.83% → 71.15%, +136 tests, 100x efficiency (Depyler project, 2025-11-12)"
roi_improvement: "26x ROI gain via strategic pivot (0.004%/test → 0.105%/test)"
module_type_discovery: "UI/CLI modules <0.03%/test, LOGIC modules 0.08-0.15%/test"
bashrs_pattern: "~90% coverage, 7,321 inline tests (13.5 avg/file), 542 files"
ruchy_strategy: "70.31% → 90%+ target via five-category decomposition"
proptest_optimal: "100 cases (95% confidence), not 5 (60% confidence)"
mutation_threshold: "≥75% mutation score for production quality (ICSE 2023)"
research_citations:
ieee_2023: "35% fewer defects at >85% coverage (p < 0.001)"
pldi_2021: "Property testing found 89% of bugs missed by examples (GCC, LLVM, ICC)"
sqlite_2022: "100% MC/DC coverage via 1000:1 test-to-code ratio"
icse_2023: "Mutation score ≥75% for production code quality"
cc_2020: "Parser 95%, semantic 90%, codegen 85% coverage targets"