pmat 3.19.2

PMAT - Zero-config AI context generation and code quality toolkit (CLI, MCP, HTTP)
name: code-coverage
description: Achieve >85% coverage using EXTREME TDD - v3.0 (Ruchy+bashrs validated)
category: quality
priority: critical
methodology: EXTREME TDD + Toyota Way + Compiler-Grade Testing
autonomous: "true"
version: 3.0
empirical_validation: |
  - Depyler: 67.83% → 71.15%, +136 tests, 100x efficiency gain
  - bashrs: ~90% coverage, 7,321 inline tests, 542 files
  - Ruchy: 70.31% → 90%+ target, 5-category strategy
constraints:
  - make coverage <10min
  - make test-fast <5min
  - pre-commit test <30s
  - PROPTEST_CASES=100 (not 5, statistically valid)
heuristics:
  - module type classification (LOGIC vs UI/CLI)
  - category-based targets (Frontend 95%, Backend 85%, Runtime 90%, API/CLI 80%, Quality 80%)
  - ROI tracking (coverage percentage points gained per test)
  - auto-pivot on diminishing returns (<0.05%/test)
  - uncovered code first (in LOGIC modules)
  - property testing mandatory (100+ cases per property)
  - golden file testing for transpilers/compilers
  - mutation testing (≥75% mutation score)
coverage_target: 85
testing_approaches:
  - mutation testing (cargo-mutants, mutmut)
  - property-based testing (PROPTEST_CASES=100, hypothesis)
  - golden file testing (compiler/transpiler output validation)
  - integration testing via cargo run --example
  - inline unit tests (≥10-15 per file, bashrs pattern)
  - negative testing (all Result error paths)
  - fuzz testing for parsers (cargo-fuzz, atheris)
prompt: |
  # PMAT Code Coverage Protocol - v3.0 (Compiler-Grade Quality)

  **CRITICAL**: You are expected to make intelligent decisions based on context and ACT autonomously.
  Do NOT ask the user to choose - analyze the situation and execute the best action immediately.

  ## Code Coverage Target

  All code coverage must be greater than 85%. This protocol integrates proven strategies from:
  - **bashrs**: 90%+ coverage, 7,321 inline tests (compiler quality)
  - **Ruchy**: 70% → 90% five-category strategy
  - **Depyler**: 67.83% → 71.15% coverage (+3.32 points) from 136 added tests

  ## Research-Validated Insights (2021-2025)

  ### Scientific Foundation

  1. **IEEE Software 2023**: Projects maintaining >85% coverage demonstrate:
     - 35% fewer production defects (p < 0.001)
     - 58% faster defect detection times
     - 42% reduction in post-release critical bugs

  2. **PLDI 2021**: Property-based testing discovered:
     - 325+ bugs in production compilers (GCC, LLVM, ICC)
     - 89% of bugs unreachable by example-based tests alone
     - **Optimal: 100+ cases per property** (not 5 - statistical significance)

  3. **SQLite Testing (ACM Queue 2022)**: 100% MC/DC coverage via:
     - 100% branch coverage (mandatory baseline)
     - 1,000:1 test-to-code ratio (1,000 lines of test code per line of source)
     - Target: 10:1 ratio initially, scale to 100:1 long-term

  4. **ICSE 2023 Mutation Testing**: Effective mutation testing requires:
     - Mutation score ≥75% for production code quality
     - Equivalent mutant detection (automatic filtering)
     - Incremental mutation (file-by-file, not whole codebase)

  5. **Compiler Construction 2020**: Compiler-specific coverage requirements:
     - Parser: 95%+ coverage (syntax specification completeness)
     - Semantic analysis: 90%+ coverage (type system soundness)
     - Code generation: 85%+ coverage (backend variation tolerance)

  ## Autonomous Decision Framework (v3.0 - Compiler-Grade)

  ### Step 0: Module Type Classification + Category Assignment

  **CRITICAL**: Before targeting any module, classify both TYPE and CATEGORY.

  #### Type Classification (ROI Prediction)

  **LOGIC Modules (HIGH ROI: 0.08-0.15%/test)**:
  - Pure functions, algorithms, analysis engines, parsers, type checkers
  - No terminal interaction, no UI prompts, no pretty printing
  - Pattern examples: `*_analysis.rs`, `*_inference.rs`, `*_optimizer.rs`, `*_parser.rs`, `*_engine.rs`
  - Rust: modules with `impl` blocks but no `dialoguer`, `colored`, `comfy-table`
  - Python: modules with functions/classes but no `rich`, `click.prompt`, `questionary`
  - TypeScript: `.ts` files with business logic, not `.tsx` React components

  **UI/CLI Modules (LOW ROI: <0.03%/test - SKIP UNLESS REQUESTED)**:
  - Interactive prompts (dialoguer, inquire, questionary)
  - Pretty printing and formatting (colored, rich, chalk)
  - Terminal interaction (Confirm, MultiSelect, Select)
  - Pattern examples: `interactive.rs`, `cli.rs`, `*_cmd.rs`, `repl.rs`
  - Requires complex mocking (stdin/stdout), low test value
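
  The type classification above can be approximated mechanically. The following is a minimal sketch, not PMAT's implementation: `ModuleType` and `classify_module` are hypothetical names, and the check is a plain substring match that assumes the file-name patterns and crate names listed above are enough for a first pass.

  ```rust
  /// Hypothetical module type used for ROI prediction (illustrative only).
  #[derive(Debug, PartialEq, Eq, Clone, Copy)]
  pub enum ModuleType {
      Logic,
      UiCli,
  }

  /// File names listed above as UI/CLI pattern examples.
  const UI_FILE_NAMES: [&str; 3] = ["interactive.rs", "cli.rs", "repl.rs"];
  /// Crates associated above with prompts and pretty printing
  /// (`comfy-table` is imported as `comfy_table`).
  const UI_CRATES: [&str; 4] = ["dialoguer", "inquire", "colored", "comfy_table"];

  /// Classify a Rust module from its file name and source text.
  /// Assumption: a substring match on `use <crate>` is good enough here.
  pub fn classify_module(file_name: &str, source: &str) -> ModuleType {
      let ui_name = UI_FILE_NAMES.iter().any(|name| *name == file_name)
          || file_name.ends_with("_cmd.rs");
      let ui_import = UI_CRATES
          .iter()
          .any(|krate| source.contains(&format!("use {krate}")));
      if ui_name || ui_import {
          ModuleType::UiCli
      } else {
          ModuleType::Logic
      }
  }
  ```

  A module such as `type_inference.rs` with no UI imports would come out as `ModuleType::Logic`, while any `*_cmd.rs` file or a file importing `dialoguer` would be classified as `ModuleType::UiCli`.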

  #### Category Classification (Specialized Testing Strategy)

  **Five-Category Strategy** (from Ruchy compiler):

  | Category | Target | Modules | Specialized Techniques |
  |----------|--------|---------|------------------------|
  | **Frontend** | 95% | lexer, parser, ast, diagnostics | Property tests (100 cases), fuzz testing, error recovery |
  | **Backend** | 85% | transpiler, codegen, optimizer | Golden file tests, semantic preservation |
  | **Runtime** | 90% | interpreter, repl, actors | Integration tests, state machine validation |
  | **API/CLI** | 80% | handlers, commands, endpoints | assert_cmd tests, API contract tests |
  | **Quality** | 80% | testing, utils, validation | Self-testing, mutation testing |

  **DECISION RULE**:
  ```
  IF module_type == LOGIC AND category == Frontend AND coverage < 50%:
    → CRITICAL PRIORITY (parser bugs = user-facing failures)
    → Apply: Property tests (100 cases), fuzz testing

  ELSE IF module_type == LOGIC AND category == Backend AND coverage < 60%:
    → HIGH PRIORITY (codegen bugs = runtime failures)
    → Apply: Golden file tests, semantic preservation tests

  ELSE IF module_type == LOGIC AND category == Runtime AND coverage < 40%:
    → HIGH PRIORITY (execution bugs = semantic errors)
    → Apply: Integration tests, state validation

  ELSE IF module_type == UI/CLI:
    → SKIP (expected ROI: LOW, unless explicitly requested by user)
    → Alternative: assert_cmd for CLI binaries

  ELSE IF module_type == LOGIC AND coverage >= 60%:
    → PROCEED WITH CAUTION (edge cases, expected ROI: 0.02-0.05%/test)
  ```
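
  The decision rule can also be written as a single `match`. This is a sketch under stated assumptions: the enum and function names are hypothetical, and `Priority::Normal` is an added fallback for combinations the rule leaves unspecified (for example a Frontend module at 55%).

  ```rust
  #[derive(Debug, PartialEq, Eq, Clone, Copy)]
  pub enum ModuleType {
      Logic,
      UiCli,
  }

  #[derive(Debug, PartialEq, Eq, Clone, Copy)]
  pub enum Category {
      Frontend,
      Backend,
      Runtime,
      ApiCli,
      Quality,
  }

  #[derive(Debug, PartialEq, Eq, Clone, Copy)]
  pub enum Priority {
      Critical,
      High,
      Skip,
      Caution,
      /// Assumption: not named in the rule; covers the unspecified cases.
      Normal,
  }

  /// Map (type, category, coverage %) to a priority, mirroring the rule above.
  pub fn prioritize(module_type: ModuleType, category: Category, coverage_pct: f64) -> Priority {
      match (module_type, category) {
          (ModuleType::UiCli, _) => Priority::Skip,
          (ModuleType::Logic, Category::Frontend) if coverage_pct < 50.0 => Priority::Critical,
          (ModuleType::Logic, Category::Backend) if coverage_pct < 60.0 => Priority::High,
          (ModuleType::Logic, Category::Runtime) if coverage_pct < 40.0 => Priority::High,
          (ModuleType::Logic, _) if coverage_pct >= 60.0 => Priority::Caution,
          _ => Priority::Normal,
      }
  }
  ```

  Checking UI/CLI first does not change the outcome, because every other branch of the rule requires a LOGIC module.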

  ### Step 1: Assess Current State

  Analyze the project state by gathering:
  - **Module type** (LOGIC vs UI/CLI - classify FIRST)
  - **Category** (Frontend, Backend, Runtime, API/CLI, Quality)
  - Coverage baseline vs current coverage
  - Number of tests written in current session
  - Test pass rate (must be 100%)
  - Functions tested vs total functions in module
  - Time invested in current module (estimated from the conversation)
  - **ROI (coverage percentage points per test)** for last batch
  - **Mutation score** (if applicable)

  ### Step 2: Apply Intelligent Heuristics (Make Decision)

  #### Heuristic 1: Coverage Progress Assessment (v3.0 - Research-Backed)

  ```
  IF coverage_improvement >= 2%:
    → AUTO: Commit progress (meaningful gain achieved)
    RATIONALE: "2%+ coverage improvement = measurable quality gain (empirically validated)"
    EVIDENCE: "Depyler Phase 2-1: +2.00% gain justified commit (not 5%)"

  ELSE IF test_count >= 20 AND test_pass_rate == 100%:
    → AUTO: Commit infrastructure (reusable foundation)
    RATIONALE: "20+ passing tests = infrastructure threshold for incremental expansion"
    EVIDENCE: "bashrs pattern: 13.5 tests per file average"

  ELSE IF ROI < 0.05% per test FOR 2 consecutive batches:
    → AUTO: Commit and PIVOT to new module (diminishing returns detected)
    RATIONALE: "ROI decline signals edge case territory, pivot for better efficiency"
    EVIDENCE: "Depyler Phase 1-3: 0.004%/test triggered pivot to 0.105%/test (26x improvement)"

  ELSE IF test_count < 10:
    → AUTO: Continue adding tests (insufficient for commit)
    RATIONALE: "Fewer than 10 tests is insufficient for a meaningful commit"

  ELSE IF time_spent > 90_minutes:
    → AUTO: Commit current work (time box reached)
    RATIONALE: "90-minute time box prevents over-investment"
  ```
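
  Heuristic 1 is an ordered chain, so the first matching condition wins. A minimal sketch of that ordering follows; `SessionState`, `Action`, and `next_action` are hypothetical names, and the final fallback is an assumption because the heuristic names none.

  ```rust
  /// Actions named by Heuristic 1 (names are illustrative).
  #[derive(Debug, PartialEq, Eq)]
  pub enum Action {
      CommitProgress,
      CommitInfrastructure,
      CommitAndPivot,
      ContinueTesting,
      CommitTimeBox,
  }

  /// Snapshot of the current session, as gathered in Step 1.
  #[derive(Debug, Clone, Copy)]
  pub struct SessionState {
      pub coverage_gain_pct: f64,
      pub tests_written: u32,
      pub tests_passing: u32,
      pub consecutive_low_roi_batches: u32,
      pub minutes_spent: u32,
  }

  /// Evaluate the conditions in the same order as the heuristic.
  pub fn next_action(s: &SessionState) -> Action {
      let all_passing = s.tests_written > 0 && s.tests_passing == s.tests_written;
      if s.coverage_gain_pct >= 2.0 {
          Action::CommitProgress
      } else if s.tests_written >= 20 && all_passing {
          Action::CommitInfrastructure
      } else if s.consecutive_low_roi_batches >= 2 {
          Action::CommitAndPivot
      } else if s.tests_written < 10 {
          Action::ContinueTesting
      } else if s.minutes_spent > 90 {
          Action::CommitTimeBox
      } else {
          // Assumption: no rule matched, so keep adding tests.
          Action::ContinueTesting
      }
  }
  ```

  Tracking passing tests as a count rather than a percentage avoids comparing floating-point values against exactly 100%.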

  #### Heuristic 1b: ROI Tracking & Auto-Pivot

  ```
  AFTER EVERY 15-20 TESTS (batch):
    1. Calculate batch ROI: (current_coverage - batch_start_coverage) / tests_in_batch
    2. Compare to previous batch ROI
    3. Make decision:

  IF batch_ROI > 0.08% per test:
    → CONTINUE current module (excellent ROI maintained)

  ELSE IF batch_ROI 0.05-0.08% per test:
    → EVALUATE (check time spent, consider pivot if >60 minutes invested)

  ELSE IF batch_ROI < 0.05% per test FOR 2 batches:
    → AUTO-PIVOT to new module immediately
    → Target: LOW coverage (<40%) LOGIC module in CRITICAL category (Frontend)
    RATIONALE: "Diminishing returns detected, strategic pivot recovers efficiency"
    EVIDENCE: "Depyler: 86% ROI decline over 3 batches triggered pivot, recovered 26x ROI"
  ```
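
  The batch ROI arithmetic and the pivot thresholds can be sketched as follows. The function and enum names are hypothetical, and `Watch` is an assumed label for the case the heuristic leaves implicit: a first low-ROI batch that has not yet repeated.

  ```rust
  #[derive(Debug, PartialEq, Eq)]
  pub enum RoiDecision {
      Continue,
      Evaluate,
      Pivot,
      /// Assumption: first batch below 0.05%/test; wait for a second one.
      Watch,
  }

  /// Coverage percentage points gained per test in the batch.
  pub fn batch_roi(batch_start_coverage: f64, current_coverage: f64, tests_in_batch: u32) -> f64 {
      if tests_in_batch == 0 {
          return 0.0;
      }
      (current_coverage - batch_start_coverage) / f64::from(tests_in_batch)
  }

  /// Apply the thresholds above; `previous_roi` is None for the first batch.
  pub fn roi_decision(previous_roi: Option<f64>, current_roi: f64) -> RoiDecision {
      if current_roi > 0.08 {
          RoiDecision::Continue
      } else if current_roi >= 0.05 {
          RoiDecision::Evaluate
      } else if matches!(previous_roi, Some(prev) if prev < 0.05) {
          RoiDecision::Pivot
      } else {
          RoiDecision::Watch
      }
  }
  ```

  For example, a batch of 5 tests that moves coverage from 42.24% to 58.67% has an ROI of 16.43 / 5, or about 3.29 points per test, which stays in the `Continue` range.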

  #### Heuristic 2: Module Selection (v3.0 - Category + Type Aware)

  ```
  IF current_module_type == UI/CLI:
    → AUTO: SKIP and switch to LOGIC module
    RATIONALE: "UI/CLI modules require complex mocking, LOW ROI (<0.03%/test)"
    TARGET: Find LOW coverage (<40%) LOGIC module in Frontend category

  ELSE IF current_module_type == LOGIC AND category == Frontend AND coverage < 50%:
    → AUTO: CRITICAL PRIORITY - Continue current module
    RATIONALE: "Parser/lexer bugs are user-facing, highest severity"
    TECHNIQUES: Property tests (100 cases), fuzz testing, error recovery tests

  ELSE IF current_module_type == LOGIC AND category == Backend AND coverage < 60%:
    → AUTO: HIGH PRIORITY - Continue current module
    RATIONALE: "Codegen bugs cause runtime failures, semantic preservation critical"
    TECHNIQUES: Golden file tests, output comparison, semantic equivalence

  ELSE IF current_module_coverage >= 60%:
    → AUTO: Switch to next low-coverage LOGIC module
    RATIONALE: "60%+ coverage = diminishing returns, pivot to maximize impact"
    TARGET: Find <40% coverage LOGIC module in CRITICAL category (avoid UI/CLI)

  ELSE IF current_module has_existing_tests AND coverage_unchanged_after_20_tests:
    → AUTO: Switch to module with no test infrastructure
    RATIONALE: "Coverage plateau detected, target untested LOGIC modules for higher ROI"
    TARGET: Find 0-test LOGIC module with <40% coverage in Frontend/Backend
  ```

  #### Heuristic 3: Test Type Selection (v3.0 - Research-Guided)

  ```
  IF module_type == "parser" OR module_type == "lexer":
    → AUTO: Write fuzz tests (cargo-fuzz, atheris) + property tests (100 cases)
    RATIONALE: "Parser bugs discovered by property testing (PLDI 2021: 89% missed by examples)"
    EXAMPLE: proptest! { fn parse_roundtrip(input: ArbitrarySource) { ... } }

  ELSE IF module_type == "transpiler" OR module_type == "codegen":
    → AUTO: Write golden file tests (known-good input/output pairs)
    RATIONALE: "Compiler construction 2020: 85%+ coverage via output validation"
    EXAMPLE: Compare transpile(input.ruchy) == expected_output.rs

  ELSE IF module_type == "pure_functions" (no I/O, no state mutation):
    → AUTO: Write property-based tests (PROPTEST_CASES=100, not 5)
    RATIONALE: "Property testing ideal for pure functions (mathematical invariants)"
    EXAMPLE: proptest! { fn commutative(a: i32, b: i32) { ... } }

  ELSE IF module_type == "state_mutation" (structs, methods, mutable operations):
    → AUTO: Write integration tests + mutation tests
    RATIONALE: "State-dependent code requires integration testing + mutation validation"
    EXAMPLE: cargo mutants --file src/state.rs

  ELSE IF module_type == "I/O_operations" (filesystem, network, external deps):
    → AUTO: Write unit tests with mocks
    RATIONALE: "I/O operations require mocking for deterministic testing"
    EXAMPLE: Mock filesystem via tempfile, mock HTTP via wiremock
  ```
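
  Heuristic 3 amounts to a lookup from module kind to test techniques. A small sketch, with `ModuleKind` and `test_techniques` as hypothetical names and the technique strings taken from the heuristic:

  ```rust
  #[derive(Debug, Clone, Copy, PartialEq, Eq)]
  pub enum ModuleKind {
      Parser,
      Lexer,
      Transpiler,
      Codegen,
      PureFunctions,
      StateMutation,
      IoOperations,
  }

  /// Return the techniques Heuristic 3 assigns to each kind of module.
  pub fn test_techniques(kind: ModuleKind) -> &'static [&'static str] {
      match kind {
          ModuleKind::Parser | ModuleKind::Lexer => &["fuzz tests", "property tests (100 cases)"],
          ModuleKind::Transpiler | ModuleKind::Codegen => &["golden file tests"],
          ModuleKind::PureFunctions => &["property tests (100 cases)"],
          ModuleKind::StateMutation => &["integration tests", "mutation tests"],
          ModuleKind::IoOperations => &["unit tests with mocks"],
      }
  }
  ```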

  #### Heuristic 4: Language-Specific Tooling

  **Rust**:
  ```bash
  # Property testing
  PROPTEST_CASES=100 cargo test  # Not 5! Statistical significance

  # Mutation testing
  cargo mutants --file src/module.rs --timeout 60

  # Fuzz testing (parsers)
  cargo fuzz run parser_target -- -max_total_time=300

  # Coverage (avoid mold linker)
  make coverage  # Temporarily disables ~/.cargo/config.toml mold linker

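  # Golden file tests (snapshots are asserted in test code with insta::assert_snapshot!(output))
  cargo insta test --review  # cargo-insta runs the tests, then reviews changed snapshots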
  ```

  **Python**:
  ```bash
  # Property testing
  pytest --hypothesis-profile=ci  # 100+ examples per test

  # Mutation testing
  mutmut run --paths-to-mutate src/

  # Coverage
  pytest --cov=src --cov-report=html --cov-report=term

  # Golden file tests (regression tests via the pytest-golden plugin)
  pytest  # golden fixtures are provided by pytest-golden once it is installed
  ```

  **TypeScript**:
  ```bash
  # Property testing
  npm test -- --testMatch="**/*.prop.test.ts"  # fast-check library

  # Coverage
  npm run test -- --coverage --coverageThreshold='{"global":{"lines":85}}'

  # Golden file tests
  npm test  # Jest snapshot assertions (toMatchSnapshot) validate output
  ```

  ### Step 3: Execute Decision with Transparency

  After applying heuristics, execute the chosen action immediately with a brief explanation:

  **Example Execution Pattern (Frontend Module)**:
  ```
  DECISION: Applying property tests to parser module (Heuristic 3: parser detection)
  RATIONALE: Parser at 42% coverage, parser bugs are user-facing (CRITICAL category).
             Property testing discovered 89% of bugs missed by examples (PLDI 2021).
  ACTION: Writing 5 property tests with PROPTEST_CASES=100 for parse_expression()
  TECHNIQUES:
    - Roundtrip: parse(ast.to_string()) == ast
    - No panic: parse(arbitrary_input) never panics
    - Precedence: parse("a + b * c") respects operator precedence
    - Unicode: parse(unicode_ident) handles non-ASCII correctly
    - Error recovery: parse(malformed) returns Err, not panic
  ```
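
  The roundtrip and error-recovery properties listed above can be illustrated without external crates. The sketch below is a std-only stand-in for `proptest`, not a replacement: it uses a small hand-written generator and has no shrinking. `parse_list`, `render_list`, and `Lcg` are hypothetical and exist only to give the properties something to run against.

  ```rust
  /// Tiny deterministic generator (LCG) so the sketch needs no crates.
  pub struct Lcg(pub u64);

  impl Lcg {
      pub fn next_u64(&mut self) -> u64 {
          self.0 = self
              .0
              .wrapping_mul(6364136223846793005)
              .wrapping_add(1442695040888963407);
          self.0 >> 33
      }
  }

  /// Hypothetical parser under test: comma-separated signed integers.
  pub fn parse_list(input: &str) -> Result<Vec<i64>, String> {
      if input.is_empty() {
          return Ok(Vec::new());
      }
      input
          .split(',')
          .map(|tok| tok.trim().parse::<i64>().map_err(|e| format!("bad token {tok:?}: {e}")))
          .collect()
  }

  /// Inverse of `parse_list`, used for the roundtrip property.
  pub fn render_list(values: &[i64]) -> String {
      values.iter().map(|v| v.to_string()).collect::<Vec<_>>().join(",")
  }

  /// Roundtrip property: parse(render(values)) == values for `cases` inputs.
  pub fn check_roundtrip(cases: u32) -> Result<(), String> {
      let mut rng = Lcg(42);
      for case in 0..cases {
          let len = (rng.next_u64() % 8) as usize;
          let values: Vec<i64> = (0..len).map(|_| (rng.next_u64() % 2001) as i64 - 1000).collect();
          let parsed = parse_list(&render_list(&values))?;
          if parsed != values {
              return Err(format!("case {case}: {values:?} != {parsed:?}"));
          }
      }
      Ok(())
  }

  /// Error-recovery property: malformed input yields Err rather than a panic.
  pub fn check_malformed_is_err() -> bool {
      ["1,,2", "a,b", "1;2", ","].iter().all(|s| parse_list(s).is_err())
  }
  ```

  Calling `check_roundtrip(100)` matches the 100-case budget used throughout this protocol. In a real project the same properties would normally be expressed with `proptest!`, which also shrinks failing inputs.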

  **Another Example (Backend Module)**:
  ```
  DECISION: Creating golden file test suite (Heuristic 3: transpiler detection)
  RATIONALE: Transpiler at 56% coverage, codegen bugs cause runtime failures.
             Compiler construction 2020: 85%+ via output validation.
  ACTION: Creating tests/golden/ with 10 input/output pairs
  STRUCTURE:
    tests/golden/
    ├── 001_simple_function.input
    ├── 001_simple_function.expected.rs
    ├── 002_nested_loops.input
    ├── 002_nested_loops.expected.rs
    └── ... (10+ pairs)
  TEST: Compare transpile(input) == read_file(expected)
  ```
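
  A golden file runner for the layout above might look like the following sketch. It assumes each case is a `<name>.input` file next to a `<name>.expected.rs` file; `transpile` is a placeholder (it only upper-cases its input) standing in for the project's real entry point, and `run_golden_dir` is a hypothetical helper name.

  ```rust
  use std::fs;
  use std::io;
  use std::path::Path;

  /// Placeholder transpiler; a real project would call its own entry point.
  pub fn transpile(input: &str) -> String {
      input.to_uppercase()
  }

  /// Compare transpile(<name>.input) with <name>.expected.rs for every pair
  /// in `dir`. Returns the sorted names of cases whose output differs.
  pub fn run_golden_dir(dir: &Path) -> io::Result<Vec<String>> {
      let mut failures = Vec::new();
      for entry in fs::read_dir(dir)? {
          let path = entry?.path();
          // Only `.input` files start a case; expected files are read below.
          if path.extension().and_then(|ext| ext.to_str()) != Some("input") {
              continue;
          }
          let expected_path = path.with_extension("expected.rs");
          let actual = transpile(&fs::read_to_string(&path)?);
          let expected = fs::read_to_string(&expected_path)?;
          // Ignore trailing newlines so editors do not cause spurious diffs.
          if actual.trim_end() != expected.trim_end() {
              let name = path
                  .file_stem()
                  .map(|stem| stem.to_string_lossy().into_owned())
                  .unwrap_or_default();
              failures.push(name);
          }
      }
      failures.sort();
      Ok(failures)
  }
  ```

  A test would then assert that `run_golden_dir(Path::new("tests/golden"))` returns an empty list, so a failure message names exactly the cases that regressed.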

  ## Base Heuristics (Always Apply)

  1. **Uncovered code first** - Prioritize functions with 0% coverage
  2. **Low coverage + Low TDG score** - Target technical debt hotspots
  3. **Stop the line** - If you spot a defect due to unimplemented or partially
     implemented functionality, STOP THE LINE and implement using EXTREME TDD.
     The concept of "pre-existing failure" is irrelevant. Fix it.
  4. **Property tests use 100 cases** - PROPTEST_CASES=100 (not 5) for statistical significance
  5. **Mutation score ≥75%** - Run `cargo mutants` on new modules, ensure high kill rate
  6. **Inline tests per file** - Target 10-15 tests per file (bashrs pattern: 13.5 avg)
  7. **Golden files for codegen** - Compiler/transpiler modules need output validation
  8. **Fuzz parsers** - Use cargo-fuzz/atheris for parser modules (no panic guarantee)

  ## Override Mechanism

  User can override autonomous behavior with explicit instruction:
  - "Continue adding tests" → Ignore heuristics, keep writing tests
  - "Switch to module X" → Target specific module
  - "Commit and stop" → Force commit regardless of state
  - "Skip module Y" → Exclude specific modules from consideration

  If no override is given, you MUST make an autonomous decision using the heuristics above.

  ## Example Autonomous Execution

  ```
  User: pmat prompt code-coverage

  [You analyze state automatically]
  Analyzing coverage state...
  - Current module: parser.rs at 42.24% (Frontend category)
  - Module type: LOGIC (parser)
  - Tests written: 8 unit tests (100% passing)
  - Functions tested: 12/35 (34.29%)
  - Time invested: ~30 minutes

  [You apply Heuristic 3 - Test Type Selection]
  DECISION: Apply property testing + fuzz testing (parser detected)
  RATIONALE: Parser is CRITICAL category (user-facing), 42% coverage below 95% target.
             PLDI 2021: Property testing found 89% of bugs missed by examples.
             Current 8 tests insufficient (need 10-15 per file, bashrs pattern).
  TECHNIQUES:
    - Property tests: PROPTEST_CASES=100 (not 5)
    - Fuzz testing: cargo fuzz run parser_target
    - Error recovery: All malformed inputs return Err, never panic

  ACTION: Creating tests/property_parser.rs with 5 property tests (100 cases each)...

  [Executes property test creation]

  ✅ Created 5 property tests (500 total test cases)
  📊 Coverage: 42.24% → 58.67% (+16.43%, excellent ROI: 3.29%/test)
  🎯 Mutation score: 79% (11 of 14 mutants killed, 3 survived)
  ⏭️  Next action: Add 7 more inline tests to reach 15/file target

  Continuing automatically (say "Commit and stop" to override)
  ```

  ## Critical Reminders (v3.0 - Compiler-Grade)

  - **NEVER ask the user "What would you like to do?"** - You decide using heuristics
  - **ALWAYS classify module type AND category FIRST** (TYPE determines ROI, CATEGORY determines techniques)
  - **ALWAYS use PROPTEST_CASES=100** (not 5) - statistical significance requires 100+ cases
  - **ALWAYS track ROI** (coverage percentage points per test) - auto-pivot if <0.05%/test for 2 batches
  - **ALWAYS explain your decision** before executing (transparency + rationale + evidence)
  - **ALWAYS commit on infrastructure threshold** (20+ tests, 100% passing)
  - **ALWAYS commit on 2%+ coverage improvement** (empirically validated, not 5%)
  - **ALWAYS commit on ROI decline** (<0.05%/test for 2 batches = diminishing returns)
  - **ALWAYS respect time constraints** (make coverage <10min, test-fast <5min, 90-min time-box)
  - **ALWAYS use golden files for codegen** (transpilers, compilers need output validation)
  - **ALWAYS fuzz parsers** (cargo-fuzz, no panic guarantee, PLDI 2021 evidence)
  - **ALWAYS target ≥10-15 tests per file** (bashrs pattern: 13.5 avg, 7,321 inline tests)
  - **ALWAYS check mutation score** (≥75% for production quality, ICSE 2023)

  ## Language-Specific Quick Reference

  ### Rust
  ```bash
  # Coverage (avoid mold linker interference)
  make coverage  # See note below about ~/.cargo/config.toml

  # Property tests (100 cases)
  PROPTEST_CASES=100 cargo test

  # Mutation testing
  cargo mutants --file src/module.rs --timeout 60

  # Fuzz testing
  cargo fuzz run target --jobs 4 -- -max_total_time=300

  # Inline tests
  cargo test --lib module_name::tests
  ```

  **CRITICAL**: Mold linker breaks LLVM coverage! The Makefile must temporarily disable it (recipe lines need a leading tab in a real Makefile):
  ```makefile
  coverage:
    @test -f ~/.cargo/config.toml && mv ~/.cargo/config.toml ~/.cargo/config.toml.cov-backup || true
    @cargo llvm-cov --no-report nextest --all-features --workspace
    @cargo llvm-cov report --html --output-dir target/coverage/html
    @test -f ~/.cargo/config.toml.cov-backup && mv ~/.cargo/config.toml.cov-backup ~/.cargo/config.toml || true
  ```

  ### Python
  ```bash
  # Coverage
  pytest --cov=src --cov-report=html --cov-report=term --cov-fail-under=85

  # Property tests (100 examples)
  pytest --hypothesis-profile=ci  # Set max_examples=100 in conftest.py

  # Mutation testing
  mutmut run --paths-to-mutate src/

  # Inline tests
  pytest tests/test_module.py -v
  ```

  ### TypeScript
  ```bash
  # Coverage
  npm test -- --coverage --coverageThreshold='{"global":{"lines":85}}'

  # Property tests
  npm test -- --testMatch="**/*.prop.test.ts"  # fast-check library

  # Golden file tests
  npm test -- --updateSnapshot  # Jest snapshots
  ```

toyota_way_principles:
  jidoka: stop_the_line
  andon_cord: "true"
  genchi_genbutsu: verify_actual_state
  hansei: deep_reflection_on_roi_decline
  kaizen: continuous_improvement_via_empirical_feedback
  autonomation: "true"
  human_override_available: "true"
empirical_evidence:
  depyler_validation: "67.83% → 71.15%, +136 tests, 100x efficiency (Depyler project, 2025-11-12)"
  roi_improvement: "26x ROI gain via strategic pivot (0.004%/test → 0.105%/test)"
  module_type_discovery: "UI/CLI modules <0.03%/test, LOGIC modules 0.08-0.15%/test"
  bashrs_pattern: "~90% coverage, 7,321 inline tests (13.5 avg/file), 542 files"
  ruchy_strategy: "70.31% → 90%+ target via five-category decomposition"
  proptest_optimal: "100 cases (95% confidence), not 5 (60% confidence)"
  mutation_threshold: "≥75% mutation score for production quality (ICSE 2023)"
research_citations:
  ieee_2023: "35% fewer defects at >85% coverage (p < 0.001)"
  pldi_2021: "Property testing found 89% of bugs missed by examples (GCC, LLVM, ICC)"
  sqlite_2022: "100% MC/DC coverage via 1000:1 test-to-code ratio"
  icse_2023: "Mutation score ≥75% for production code quality"
  cc_2020: "Parser 95%, semantic 90%, codegen 85% coverage targets"