# SWE-CI & Evolution
> Sub-spec of [pmat-spec.md](../pmat-spec.md) | Component 20
> Based on: SWE-CI (arxiv:2603.03823, March 2026)
## Overview
Evolution-based code quality evaluation that measures long-term maintainability
through iterative CI loops rather than one-shot functional correctness.
## Core Concepts
### Normalized Change Metric a(c)
Quantifies progress toward the oracle (ideal) codebase:
```
a(c) = {
  [n(c) - n(c₀)] / [n(c*) - n(c₀)]   if n(c) >= n(c₀)   (improvement)
  [n(c) - n(c₀)] / n(c₀)             if n(c) <  n(c₀)   (regression)
}
```
Where:
- `n(c)` = number of passing tests in codebase state c
- `n(c₀)` = passing tests in base state (first commit in window)
- `n(c*)` = passing tests in oracle state (max observed across all commits)
**Properties**:
- `a(c) = 1` means complete gap closure (reached oracle level)
- `a(c) = 0` means no progress from baseline
- `a(c) = -1` means all originally-passing tests now fail
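
The piecewise definition can be sketched as a small Rust function. This is an illustrative sketch, not the CB-142 source: the function name `normalized_change` and its parameter names are assumptions, and the `1.0` result for a zero gap follows the edge case listed under Computation.

```rust
/// Normalized change a(c) from passing-test counts.
/// `n_c`, `n_base`, `n_oracle` correspond to n(c), n(c₀), n(c*).
/// Hypothetical helper for illustration only.
fn normalized_change(n_c: u64, n_base: u64, n_oracle: u64) -> f64 {
    if n_c >= n_base {
        // Improvement branch: fraction of the base-to-oracle gap that is closed.
        let gap = n_oracle.saturating_sub(n_base);
        if gap == 0 {
            // No gap to close: holding the baseline counts as 1.0.
            return 1.0;
        }
        (n_c - n_base) as f64 / gap as f64
    } else {
        // Regression branch: fraction of originally passing tests lost (negative).
        // n_c < n_base implies n_base > 0, so the division is safe.
        -((n_base - n_c) as f64) / n_base as f64
    }
}

fn main() {
    let (base, oracle) = (18_000, 19_500);
    for n in [19_500, 18_000, 9_000, 0] {
        println!("n(c) = {:>5} -> a(c) = {:+.3}", n, normalized_change(n, base, oracle));
    }
}
```

The two branches use different denominators on purpose: improvements are scaled by the gap to the oracle, regressions by the size of the baseline, which is what bounds the metric to the range from -1 to 1.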
### EvoScore
Aggregates performance across N iterations using future-weighted mean:
```
e = [sum_{i=1}^{N} gamma^i * a(c_i)] / [sum_{i=1}^{N} gamma^i]
```
Where:
- `gamma >= 1` weights later iterations more heavily
- `gamma = 1` reduces to simple average
- Higher gamma favors long-term stability over short-term gains
**Interpretation**: A truly maintainable codebase remains easy to modify as
evolution progresses. EvoScore penalizes early gains that create technical debt.
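
A minimal sketch of the future-weighted mean, assuming the per-iteration values `a(c_i)` are already available as a slice. The name `evo_score` and the `Option` return for empty input are illustrative choices, not taken from the implementation.

```rust
/// Future-weighted mean of a(c_1)..a(c_N) with weights gamma^i.
/// Returns None for an empty slice, where the mean is undefined.
/// Hypothetical helper for illustration only.
fn evo_score(changes: &[f64], gamma: f64) -> Option<f64> {
    if changes.is_empty() {
        return None;
    }
    let (mut num, mut den, mut weight) = (0.0, 0.0, 1.0);
    for &a in changes {
        weight *= gamma; // gamma^1 for the first iteration, gamma^2 for the second, ...
        num += weight * a;
        den += weight;
    }
    Some(num / den)
}

fn main() {
    let early_gain = [1.0, 0.0]; // reaches the oracle, then falls back to baseline
    let late_gain = [0.0, 1.0]; // stalls first, then reaches the oracle
    for gamma in [1.0, 2.0] {
        println!(
            "gamma = {}: early = {:.3}, late = {:.3}",
            gamma,
            evo_score(&early_gain, gamma).unwrap(),
            evo_score(&late_gain, gamma).unwrap()
        );
    }
}
```

With `gamma = 1` both histories score the same; with `gamma = 2` the history that ends at the oracle scores higher, which is the "long-term stability over short-term gains" behavior described above.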
### CI Loop Model
Each iteration follows the require-code cycle:
```
r_i = require(c_i, c*) -- derive requirements from test gap
c_{i+1} = code(r_i, c_i) -- implement changes based on requirements
```
This iterative loop ensures consequences of earlier modifications propagate
into subsequent iterations, making long-term decision quality observable.
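
The require-code cycle can be sketched as a generic loop in which each new state is built from the previous one. Everything here is hypothetical scaffolding: the `State` struct, `run_ci_loop`, and the toy closures standing in for `require` and `code` are assumptions made for illustration.

```rust
/// Hypothetical codebase state; only the passing-test count is modeled.
#[derive(Debug, Clone, PartialEq)]
struct State {
    passing: u64,
}

/// Runs `n` require-code iterations from `c0` and returns c_1..c_n.
/// `require` derives requirements from the gap to the oracle;
/// `code` produces the next state from those requirements.
fn run_ci_loop<R, Req, Code>(
    c0: State,
    oracle: &State,
    n: usize,
    require: Req,
    code: Code,
) -> Vec<State>
where
    Req: Fn(&State, &State) -> R,
    Code: Fn(&R, &State) -> State,
{
    let mut history = Vec::with_capacity(n);
    let mut current = c0;
    for _ in 0..n {
        let requirements = require(&current, oracle); // r_i = require(c_i, c*)
        let next = code(&requirements, &current); // c_{i+1} = code(r_i, c_i)
        history.push(next.clone());
        current = next; // earlier changes carry into the next iteration
    }
    history
}

fn main() {
    let oracle = State { passing: 100 };
    // Toy agents: the requirement is the test gap; the programmer closes half of it.
    let history = run_ci_loop(
        State { passing: 60 },
        &oracle,
        3,
        |c, o| o.passing - c.passing,
        |gap: &u64, c| State { passing: c.passing + *gap / 2 },
    );
    println!("{:?}", history);
}
```

Because `current` is overwritten on every pass, a poor decision in one iteration changes the input to every later one, which is the propagation effect the model relies on.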
## Architect-Programmer Protocol
**Status**: Planned (not yet implemented). Will be a dual-agent protocol.
### Architect Agent
Three-step analysis:
1. **Summarize**: Review failing tests, identify root causes
2. **Locate**: Examine source code, attribute failures to implementation gaps
3. **Design**: Devise improvement plan, produce requirements document
### Programmer Agent
Three-step implementation:
1. **Comprehend**: Translate requirements into code specifications
2. **Plan**: Outline programming effort needed
3. **Code**: Implement specifications
## PMAT Integration
### Data Recording
**Status**: No automated recording command exists yet. Data must be written
manually or by CI scripts.
Test results are stored as JSON files in `.pmat-metrics/`:
**Primary format** (`commit-<sha>-tests.json`):
```json
{"commit": "abc123", "pass": 19500, "total": 19700}
```
**Fallback format** (`commit-<sha>-meta.json` with `tests` key):
```json
{"commit": "abc123", "tests": {"pass": 19500, "total": 19700}}
```
Files are sorted lexicographically by filename, and that order is treated as
chronological. Git SHAs do not sort by commit date, so use zero-padded or
timestamp-prefixed names to make the sort match commit order.
### Recording Methods
**CI pipeline (recommended)**:
```bash
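# In .github/workflows/ci.yml or post-test hook:
SHA=$(git rev-parse --short HEAD)
# Assumption: results come from cargo test; keep only the "test result:" summary lines
RESULT=$(cargo test --no-fail-fast 2>&1 | grep "test result")
# Sum counts across all summary lines, one per test binary (grep -P needs GNU grep)
PASS=$(echo "$RESULT" | grep -oP '\d+(?= passed)' | awk '{s+=$1} END {print s+0}')
FAIL=$(echo "$RESULT" | grep -oP '\d+(?= failed)' | awk '{s+=$1} END {print s+0}')
TOTAL=$((PASS + FAIL))
mkdir -p .pmat-metrics
echo "{\"commit\":\"$SHA\",\"pass\":$PASS,\"total\":$TOTAL}" \
  > .pmat-metrics/commit-${SHA}-tests.json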
```
**Manual seeding** (for bootstrapping):
```bash
# Record current test state
cargo test 2>&1 | grep "test result"
# → test result: ok. 21841 passed; 0 failed; 203 ignored (as of v3.11.1)
echo '{"commit":"current","pass":21841,"total":21841}' \
> .pmat-metrics/commit-$(date +%Y%m%d)-tests.json
```
### Configuration
CB-142 is registered in `comply_config_defaults.rs` with:
- **Severity**: Info
- **Threshold**: 0.5 (minimum EvoScore for Pass)
The following are hardcoded in the implementation (not yet configurable):
- **gamma**: 1.5
- **min_commits**: 3
## Comply Check: CB-142
### Data Loading
1. Scan `.pmat-metrics/` for `commit-*-tests.json` files
2. Parse each: extract `pass` and `total` fields, skip if `total == 0`
3. If no `-tests.json` found, fall back to `commit-*-meta.json` files
with a `tests` sub-object containing `pass` and `total`
4. Sort files by filename (lexicographic) for chronological order
5. If fewer than 3 data points, return Skip
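
The file-selection part of these steps can be sketched over a list of filenames. This is a simplified illustration, not the code in `check_mono_spec.rs`: `select_data_files` is a hypothetical name, it works on names only, and it leaves out JSON parsing and the `total == 0` filter from step 2.

```rust
/// Picks the data files CB-142 would consider, in sorted order.
/// Returns None (Skip) when fewer than 3 files qualify.
/// Hypothetical helper for illustration only.
fn select_data_files(names: &[&str]) -> Option<Vec<String>> {
    // Collect and sort all names matching `commit-*<suffix>`.
    let pick = |suffix: &str| -> Vec<String> {
        let mut matched: Vec<String> = names
            .iter()
            .filter(|n| n.starts_with("commit-") && n.ends_with(suffix))
            .map(|n| n.to_string())
            .collect();
        matched.sort(); // lexicographic order stands in for chronological order
        matched
    };

    let mut files = pick("-tests.json");
    if files.is_empty() {
        // Fall back to meta files only when no primary files exist at all.
        files = pick("-meta.json");
    }
    if files.len() < 3 {
        None
    } else {
        Some(files)
    }
}

fn main() {
    let names = [
        "commit-0003-tests.json",
        "commit-0001-tests.json",
        "commit-0002-tests.json",
        "commit-0001-meta.json",
        "README.md",
    ];
    println!("{:?}", select_data_files(&names));
}
```

Note that the fallback is all-or-nothing in this sketch: two `-tests.json` files plus any number of `-meta.json` files still result in Skip, matching the wording of step 3.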
### Computation
1. **Base state** (`c₀`): first file's pass count
2. **Oracle** (`c*`): max pass count across all files
3. For each subsequent commit `c_i` (i=1..N):
   - If `pass >= base_pass`: `a(c) = (pass - base) / (oracle - base)`
     (or 1.0 if oracle == base)
   - If `pass < base_pass`: `a(c) = (pass - base) / base`
     (or 0.0 if base == 0)
4. Weight: `w_i = gamma^i`, where i is the commit's 0-based index in the sorted
   sequence. The base state is index 0 and is not scored, so the first scored
   commit gets `gamma^1`.
5. EvoScore: `sum(w_i * a_i) / sum(w_i)`
### Scoring
| EvoScore | Status | Severity |
|----------|--------|----------|
| >= 0.5 | Pass | Info |
| 0.0 - 0.5 | Warn | Warning |
| < 0.0 | Fail | Error |
| < 3 data points | Skip | Info |
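
The status mapping can be sketched as a single `match`. The `Status` enum and `classify` function are hypothetical names; `None` stands for the "fewer than 3 data points" case, and the threshold is passed in rather than hardcoded so the 0.5 default stays visible at the call site.

```rust
/// Outcome of the check (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum Status {
    Pass,
    Warn,
    Fail,
    Skip,
}

/// Maps an optional EvoScore to a status.
/// `None` means there were too few data points to compute a score.
fn classify(score: Option<f64>, threshold: f64) -> Status {
    match score {
        None => Status::Skip,
        Some(s) if s >= threshold => Status::Pass,
        Some(s) if s >= 0.0 => Status::Warn,
        Some(_) => Status::Fail,
    }
}

fn main() {
    for score in [Some(0.8), Some(0.5), Some(0.2), Some(-0.1), None] {
        println!("{:?} -> {:?}", score, classify(score, 0.5));
    }
}
```

The guard order resolves the boundary values: exactly 0.5 is Pass and exactly 0.0 is Warn.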
### Score Interpretation
| EvoScore | Interpretation |
|----------|----------------|
| 0.8 - 1.0 | Excellent: consistent improvement, no regressions |
| 0.5 - 0.8 | Good: net positive with minor regressions |
| 0.0 - 0.5 | Fair: improvements offset by regressions |
| -0.5 - 0.0 | Poor: net regression trend |
| -1.0 - -0.5 | Critical: systemic quality degradation |
### Numerical Example
5 commits: pass counts [18000, 18500, 18200, 19000, 19500], total=19700 each.
```
base = 18000, oracle = 19500, gap = 1500
a(c₁) = (18500 - 18000) / 1500 = 0.333
a(c₂) = (18200 - 18000) / 1500 = 0.133
a(c₃) = (19000 - 18000) / 1500 = 0.667
a(c₄) = (19500 - 18000) / 1500 = 1.000
weights (γ=1.5): γ¹=1.5, γ²=2.25, γ³=3.375, γ⁴=5.0625
numerator: 1.5(0.333) + 2.25(0.133) + 3.375(0.667) + 5.0625(1.000)
         = 0.5 + 0.3 + 2.25 + 5.0625 = 8.1125 (unrounded a values)
denominator: 1.5 + 2.25 + 3.375 + 5.0625 = 12.1875
EvoScore: 8.1125 / 12.1875 = 0.666 → Pass
```
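
The arithmetic in this example can be checked with a compact end-to-end sketch that follows the Computation steps. It is an independent illustration rather than the code in `check_mono_spec.rs`; the function name and the `MIN_COMMITS` constant are assumptions that mirror the hardcoded `min_commits` of 3.

```rust
/// Computes EvoScore from chronologically ordered pass counts.
/// Returns None when there are fewer than MIN_COMMITS data points.
/// Hypothetical helper for illustration only.
fn evoscore_from_pass_counts(pass: &[u64], gamma: f64) -> Option<f64> {
    const MIN_COMMITS: usize = 3;
    if pass.len() < MIN_COMMITS {
        return None;
    }
    let base = pass[0] as f64; // c₀: first data point
    let oracle = *pass.iter().max()? as f64; // c*: best observed pass count
    let (mut num, mut den) = (0.0, 0.0);
    // The base state (index 0) is not scored; index i gets weight gamma^i.
    for (i, &p) in pass.iter().enumerate().skip(1) {
        let p = p as f64;
        let a = if p >= base {
            if oracle == base { 1.0 } else { (p - base) / (oracle - base) }
        } else if base == 0.0 {
            0.0
        } else {
            (p - base) / base
        };
        let w = gamma.powi(i as i32);
        num += w * a;
        den += w;
    }
    Some(num / den)
}

fn main() {
    let pass = [18_000, 18_500, 18_200, 19_000, 19_500];
    let score = evoscore_from_pass_counts(&pass, 1.5).unwrap();
    println!("EvoScore = {:.3}", score); // prints "EvoScore = 0.666"
}
```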
### Gamma Selection Guide
| Project phase | gamma | Rationale |
|---------------|-------|-----------|
| Greenfield | 1.0 | Equal weight, expect volatile early history |
| Growth | 1.2 | Slight forward bias, reward stabilization |
| Mature | 1.5 | Penalize regressions in established codebase |
| Legacy rescue | 2.0 | Heavily reward sustained improvement |
## Relationship to TDG
EvoScore and TDG are **independent, complementary metrics** — not candidates for merging.
| Dimension | TDG | EvoScore |
|-----------|-----|----------|
| Question answered | "Is this code well-structured now?" | "Is the project improving over time?" |
| Granularity | Per-file (4435 files) | Per-project (single scalar) |
| Input | AST, source code | Test pass/fail across commits |
| Timescale | Instantaneous snapshot | Rolling window (90 days) |
| Determinism | Same source = same score | Depends on git history + CI state |
| Toyota Way | Jidoka (stop-the-line) | Kaizen (continuous improvement) |
**Why they must stay separate:**
1. TDG is per-file; EvoScore is per-project — no meaningful per-file EvoScore without per-file test attribution
2. TDG is deterministic and cached for O(1) pre-commit gates; EvoScore needs disk I/O across N commit files
3. Combining them loses both signals — a project can have excellent TDG but stagnant EvoScore (or vice versa)
### Future Cross-Metric Work
These extensions bridge the gap without merging the metrics:
**Churn-weighted TDG** — `tdg_priority = tdg_score * git_churn_factor`. High-TDG files
that change often are worse than high-TDG files that never change. Stays per-file and
deterministic for a given commit. Data source: `pmat query --churn`.
**Per-function EvoScore** — Track individual function test coverage trajectories across
commits. Requires mapping test failures to specific functions (via coverage data). Would
enable "this function's tests are regressing" alerts at the function level.
**Dashboard correlation** — Show TDG grade distribution trends alongside EvoScore in
`pmat rust-project-score` output. Let humans observe the relationship without
algorithmically coupling the metrics. E.g., "TDG avg improved 2.3 points while EvoScore
held at 0.7 — quality improving without regression."
## Planned: `pmat test --record`
> Tracked in PMAT-540 (SWE-CI evolution — pmat test --record implementation).
### Design
New subcommand that wraps `cargo test`, parses output, and writes test data:
```bash
pmat test --record # Run cargo test, write results
pmat test --record --dry-run # Show what would be recorded
pmat test --record --from-stdin # Parse piped cargo test output
```
### Implementation
1. Run `cargo test --no-fail-fast 2>&1` and capture output
2. Parse summary line: `test result: ok. N passed; M failed; K ignored`
3. Get current commit: `git rev-parse --short HEAD`
4. Write `.pmat-metrics/commit-<sha>-tests.json`:
   ```json
   {"commit":"abc1234","pass":19795,"total":19795,"failed":0,"ignored":167,
   "timestamp":"2026-03-09T14:00:00Z"}
   ```
5. Print summary: "Recorded: 19795/19795 pass (commit abc1234)"
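
Steps 2 and 4 can be sketched with the standard library alone. This is a hypothetical illustration, not the contents of `test_record.rs`: `Summary`, `parse_summary`, and `to_record` are assumed names, counts are summed across every `test result:` line (cargo prints one per test binary), `total` is taken as passed plus failed, and the `timestamp` field is omitted to keep the sketch dependency-free.

```rust
/// Aggregated counts from cargo test summary lines (hypothetical type).
#[derive(Debug, Default, PartialEq)]
struct Summary {
    passed: u64,
    failed: u64,
    ignored: u64,
}

/// Sums the counts from every "test result:" line; None if no such line exists.
fn parse_summary(output: &str) -> Option<Summary> {
    let mut total = Summary::default();
    let mut seen = false;
    for line in output.lines() {
        let line = line.trim_start();
        if !line.starts_with("test result:") {
            continue;
        }
        seen = true;
        // Segments look like " 21841 passed" once split on ';' and '.'.
        for part in line.split(|c: char| c == ';' || c == '.') {
            let mut words = part.split_whitespace();
            if let (Some(count), Some(label)) = (words.next(), words.next()) {
                if let Ok(n) = count.parse::<u64>() {
                    match label {
                        "passed" => total.passed += n,
                        "failed" => total.failed += n,
                        "ignored" => total.ignored += n,
                        _ => {} // "measured", "filtered out", etc. are not recorded
                    }
                }
            }
        }
    }
    if seen { Some(total) } else { None }
}

/// Renders the record as a single-line JSON object.
fn to_record(sha: &str, s: &Summary) -> String {
    format!(
        "{{\"commit\":\"{}\",\"pass\":{},\"total\":{},\"failed\":{},\"ignored\":{}}}",
        sha, s.passed, s.passed + s.failed, s.failed, s.ignored
    )
}

fn main() {
    let output = "running 2 tests\n\
                  test result: ok. 21841 passed; 0 failed; 203 ignored; 0 measured; 0 filtered out; finished in 1.23s\n";
    if let Some(summary) = parse_summary(output) {
        println!("{}", to_record("abc1234", &summary));
    }
}
```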
### Makefile Integration
```makefile
.PHONY: test-record
test-record:
	cargo test --no-fail-fast 2>&1 | pmat test --record --from-stdin
```
### Per-Function EvoScore (Phase 2)
Track coverage trajectory per function across commits:
1. After `pmat test --record`, also run `cargo llvm-cov --json`
2. For each function, record `{function, covered_lines, total_lines}`
3. Store in `.pmat-metrics/commit-<sha>-coverage.json`
4. CB-142 computes per-function EvoScore: which functions are losing coverage?
5. Surface via `pmat query --coverage --evoscore` — shows functions trending down
This bridges the granularity gap: project EvoScore says "regressing",
per-function EvoScore says "parser::tokenize lost 12 lines of coverage in 3 commits".
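
One way the "which functions are losing coverage?" question might be answered is by comparing the first and last snapshots in a window. The sketch below is purely hypothetical since this phase is not designed in detail: `Snapshot`, `snap`, and `coverage_losses` are invented names, and only `covered_lines` per function is modeled.

```rust
use std::collections::BTreeMap;

/// Covered-line count per function for one commit (hypothetical shape).
type Snapshot = BTreeMap<String, u32>;

/// Builds a snapshot from (function, covered_lines) pairs.
fn snap(entries: &[(&str, u32)]) -> Snapshot {
    entries.iter().map(|(name, n)| (name.to_string(), *n)).collect()
}

/// Returns (function, lines_lost) for every function whose covered-line
/// count is lower in the last snapshot than in the first.
/// Functions missing from the last snapshot are skipped.
fn coverage_losses(snapshots: &[Snapshot]) -> Vec<(String, u32)> {
    let (first, last) = match (snapshots.first(), snapshots.last()) {
        (Some(f), Some(l)) => (f, l),
        _ => return Vec::new(),
    };
    first
        .iter()
        .filter_map(|(name, &before)| {
            let after = *last.get(name)?;
            (after < before).then(|| (name.clone(), before - after))
        })
        .collect()
}

fn main() {
    let history = vec![
        snap(&[("parser::tokenize", 40), ("lexer::next", 10)]),
        snap(&[("parser::tokenize", 34), ("lexer::next", 10)]),
        snap(&[("parser::tokenize", 28), ("lexer::next", 12)]),
    ];
    for (function, lost) in coverage_losses(&history) {
        println!("{} lost {} lines of coverage in {} commits", function, lost, history.len());
    }
}
```

A real implementation would likely want a trend over every commit rather than an endpoint comparison, but the endpoint form is enough to produce the kind of alert quoted above.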
### Configurable Gamma (Phase 2)
Read from `.pmat.yaml` options map instead of hardcoding:
```rust
let gamma = comply_config
.and_then(|c| c.checks.get("cb-142"))
.and_then(|c| c.options.get("gamma"))
.and_then(|v| v.parse::<f64>().ok())
.unwrap_or(1.5);
```
Requires threading `comply_config` into `check_swe_ci_evoscore()`.
## Implementation Status
| Component | Status | Location / Notes |
|-----------|--------|------------------|
| EvoScore math | Implemented | `check_mono_spec.rs:296-332` |
| `-tests.json` loading | Implemented | `check_mono_spec.rs:224-251` |
| `-meta.json` fallback | Implemented | `check_mono_spec.rs:254-281` |
| CB-142 comply check | Implemented | `check_mono_spec.rs:217-359` |
| Unit tests (10 cases) | Implemented | `check_mono_spec.rs:366-500` |
| `pmat test --record` | Implemented | `src/cli/command_dispatcher/test_record.rs` |
| Configurable gamma | Not implemented | Hardcoded to 1.5 |
| Configurable window | Not implemented | No time-based filtering |
| CI source: github | Not implemented | Local files only |
| Architect-Programmer | Planned | Dual-agent protocol (future work) |
## Key Files
| File | Purpose |
|------|---------|
| `src/cli/handlers/comply_handlers/check_handlers/check_mono_spec.rs` | CB-142 implementation |
| `src/models/comply_config_defaults.rs` | CB-142 registration (severity: Info, threshold: 0.5) |
| `.pmat-metrics/commit-*-tests.json` | Primary test data (not yet generated) |
| `.pmat-metrics/commit-*-meta.json` | Fallback data (exists but lacks `tests` key) |
## References
- SWE-CI: arxiv:2603.03823 (Sun Yat-sen University, Alibaba Group, March 2026)
- SWE-EVO: arxiv:2512.18470 (Long-horizon software evolution scenarios)
- SWE-Bench: https://www.swebench.com/ (Original SWE benchmark)