# Evaluation Methodology
P112 expands the paper artifact beyond deterministic smoke workloads. The goal
is still conservative: produce reproducible reviewer-facing evidence without
pretending the CPU reference backend is a production performance engine.
## Workloads
The benchmark harness is `examples/paper_benchmarks.rs`.
Current workload groups:
- `ordinary_tensor_add`: small dense integer add reference workload.
- `ordinary_tensor_add_large`: larger dense integer add workload with 64
elements to show non-smoke tensor scale.
- `padic_matmul_valuation_skip`: fixed-precision p-adic matmul workload with
valuation-based skipped terms.
- `padic_matmul_valuation_skip_large`: larger 4x4x4 p-adic matmul workload
with reference and optimized plans.
- `rocm_padic_stratified_benchmarks`: feature-gated certified p-adic GEMM
benchmark matrix with dense CPU, certified sparse CPU, HIP stratified, and
fallback rows across deterministic valuation distributions.
- `sheaf_cover_glue`: finite-cover sheaf compatibility and glue check.
- `sheaf_cover_glue_large`: larger three-open finite-cover sheaf compatibility
and glue check for controlled cover scale evidence.
## Controlled Baseline And Scale Comparisons
P128 adds `baseline-scale-comparison.csv` as the reviewer-facing baseline and
scale table. The table is intentionally controlled rather than a production
performance claim: every row names a baseline family, the candidate path, the
scale point, proof overhead, and a bounded interpretation.
Baseline families:
- `ordinary_tensor`: dense CPU reference plan versus registry-aware CPU
execution for `ordinary_tensor_add` and `ordinary_tensor_add_large`.
- `padic_valuation`: dense p-adic matmul baseline versus the valuation-skip
plan for both small and `shape=4x4x4` p-adic workloads.
- `sheaf_glue`: finite-cover compatibility baseline versus cover-local glue
checking for two-open and three-open finite sheaf workloads.
- `proof_trace`: digest-only manifest checking versus per-theorem checker-index
validation for the native trace path, including a scaled
`theorem_count=3;artifact_count=11` proxy row.
- `selected_witness_metadata`: no selected witness metadata versus the shape
and p-adic selected Lean witness metadata path, including a scaled
`witnesses=6` proxy row.
Scale points are explicit strings such as `elements=64`, `shape=4x4x4`,
`cover_opens=2`, `cover_opens=3`, `theorem_count=1;artifact_count=8`,
`theorem_count=3;artifact_count=11`, and
`families=2;metadata_fields=5;witnesses=6`. The `ratio_milli` column reports
`candidate_ns / baseline_ns * 1000`, keeping the output integer and
deterministic for artifact review.
The harness also reports:
- `reference_duration_ns`: deterministic reference-plan proxy cost.
- `optimized_duration_ns`: deterministic optimized-plan proxy cost.
- `proof_trace_validation_ns`: deterministic proof-trace validation proxy cost.
- `proof_trace_validation_scale`: deterministic proxy measurement for scaled
theorem-count and artifact-count validation.
- `measurement-summary.csv`: repeated summaries that distinguish deterministic
proxy rows from repeated wall-clock validation rows.
The p-adic rows compare a baseline matmul plan with the valuation-skip plan and
report `skipped_terms`, `evaluated_terms`, and shape metadata.
The witness rows measure the selected Lean witness metadata path added for
shape equality and p-adic valuation-cutoff evidence. This is reported as a
separate `selected_witness_metadata_overhead` deterministic proxy row so the
paper can discuss external-checker witness visibility without folding it into
ordinary operator runtime cost.
P149 adds controlled scale rows for sheaf cover size, proof-trace theorem and
artifact count, and selected witness metadata count. These rows are still
deterministic artifact-review evidence. They strengthen the experimental
narrative without claiming Criterion-grade or production benchmark results.
## P150 Claim-To-Result Interpretation Matrix
P150 hardens how generated rows should be interpreted in the manuscript. Each
generated table category has a scoped claim, a reader-facing interpretation,
and an explicit limitation.
| `ordinary_tensor_add`, `ordinary_tensor_add_large` | controlled ordinary tensor baseline | Registry-aware CPU execution can be compared with a reference CPU plan under a stable deterministic proxy. | Not a claim of superiority over XLA, TVM, Triton, JAX, PyTorch, TensorFlow, or production tensor runtimes. |
| `padic_matmul_valuation_skip`, `padic_matmul_valuation_skip_large` | nonstandard-domain planning evidence | p-adic valuation metadata changes evaluated terms while preserving precision-bounded equality checks. | Fixed-precision p-adic arithmetic is demonstrated, not a full computer algebra system. |
| `padic-stratified-benchmarks.csv` rows | certified p-adic GEMM evidence | Dense CPU, certified sparse CPU, HIP stratified, and fallback rows expose skipped/evaluated products, precision margins, oracle status, certificate coverage, transfer time, kernel time, wall time, and speed-claim blockers. | Q_5[3] 2x3x2 HIP evidence is local and smoke-scale; unsupported prime, precision, and shape rows are fallback evidence, not portability or speedup claims. |
| `sheaf_cover_glue`, `sheaf_cover_glue_large` | finite sheaf locality evidence | Cover-local compatibility and gluing checks are visible at two-open and three-open scale points. | Finite-site examples do not claim broad sheaf-theoretic automation. |
| `proof_trace_validation_scale` and proof-trace baseline rows | trace validation overhead evidence | Digest-only manifest checking and per-theorem checker-index validation are separated by theorem and artifact count. | Deterministic proxy costs are not Criterion-grade validation benchmarks. |
| `selected_witness_metadata_overhead` and selected witness metadata scale rows | reviewer-visible witness metadata evidence | Shape and p-adic selected witness metadata have explicit generated rows and tamper-detection evidence. | Metadata overhead is not an external Lean runtime benchmark and does not prove all Tokitai obligations in Lean. |
| `tamper-matrix.csv` rows | semantic tamper detection evidence | Manifest, native-count, theorem-result, checker-index, admission, and witness metadata mutations show which validation layer detects drift. | The table is a validation ablation, not a cryptographic novelty claim. |
| `trace-architecture.csv` rows | artifact binding architecture evidence | Plan fingerprints, artifact digests, checker-output digests, adapter capability fingerprints, and canonical JSON normalization are reviewer-visible. | Architecture rows summarize bindings; the formal assumptions remain in `docs/paper/formal_model.md`. |
| `claim-to-test-matrix.csv` rows | reproducibility evidence | Every manuscript claim points to code, tests, and generated experiment artifacts. | The matrix supports traceability, not exhaustive proof of all future backend implementations. |
The evaluation section should cite this matrix when interpreting generated
tables. It should use phrases such as controlled baseline, deterministic proxy,
artifact-review evidence, trace validation overhead, and reviewer-visible
witness metadata. It should avoid phrases that imply production throughput,
full proof-assistant verification, verified compiler status, or mature tensor
compiler replacement.
## Metrics
Paper tables use:
- workload name;
- reference, optimized, and proof validation nanosecond columns;
- plan step count;
- proof object count;
- checked proof count;
- domain-specific metric string;
- measurement kind, run count, min, median, and max for repeated summaries.
- selected witness metadata overhead for reviewer-visible Lean witness family
fields.
- witness metadata tamper detection in `tamper-matrix.csv`.
- certified p-adic GEMM skipped/evaluated products, precision margin,
certificate coverage, dense/sparse CPU oracle status, HIP transfer time, HIP
kernel time, and fallback reason in `padic-stratified-benchmarks.csv`.
## Reproducibility Commands
Run:
```bash
cargo run --quiet --example paper_benchmarks
```
Expected outputs:
- `target/paper-results/paper-results.json`
- `target/paper-results/paper-results.csv`
- `target/paper-results/measurement-summary.csv`
- `target/paper-results/baseline-scale-comparison.csv`
- `target/paper-results/tamper-matrix.csv`
- `target/paper-results/padic-stratified-benchmarks/padic-stratified-benchmarks.csv`
- `target/paper-results/claim-to-test-matrix.csv`
- `target/paper-results/trace-architecture.csv`
Run the full validation gate:
```bash
python -m json.tool todo.json >/dev/null
cargo fmt --check
cargo test
```
## Deterministic Versus Measured Results
`paper-results.csv` uses deterministic proxy timings. These are stable across
review machines and are intended to compare relative workload structure,
optimized terms, and proof validation scale.
`measurement-summary.csv` separates two measurement classes:
- `deterministic_proxy`: derived from the same deterministic model used by the
paper results table.
- `deterministic_proxy` for `selected_witness_metadata_overhead`: derived from
selected witness family count, metadata field count, and witness count.
- `repeated_wall_clock`: repeated wall-clock smoke measurement for the
validation path.
The wall-clock rows are intentionally labeled as smoke measurements. They should
not be read as Criterion-grade microbenchmarks.
## Witness Metadata Ablation
`tamper-matrix.csv` includes `witness_metadata`, a digest-retargeted mutation of
the Lean checkable artifact header. The manifest digest is recomputed, so this
row is not a simple stale-file digest check. Detection comes from Lean
checkable artifact validation, which requires selected witness family metadata
and matching witness counts. This gives reviewers a direct ablation for the
additional external-checker evidence surface introduced by P123-P125.
## P158 Manuscript Evaluation Prose
This section is intended as manuscript-ready evaluation prose. It connects the
planned tables and figures in `docs/paper/table_figure_plan.md` to research
questions, result interpretation, and threats to validity. The prose should be
read with the same scope as the generated artifacts: controlled artifact
evaluation, not production benchmarking.
Research questions:
- RQ1 Controlled baseline and scale: do generated Tables 1, 2, and 5 show that
Tokitai can expose ordinary tensor, p-adic valuation, finite sheaf,
proof-trace, and selected witness metadata costs under reproducible
artifact-review conditions?
- RQ2 Semantic tamper detection: does Table 3 show that per-theorem
checker-index validation detects semantic drift classes that digest-only or
count-only checks miss?
- RQ3 Trace-bound evidence architecture: does Figure 1 show how execution plans,
proof replay reports, selected witness artifacts, checker outputs, native
admissions, adapter registries, checker result indexes, canonical JSON imports,
and signed manifests are bound into one reviewable evidence graph?
- RQ4 Nonstandard-domain evidence: does Figure 2 show that fixed-precision
p-adic valuation cutoff evidence and finite-site sheaf glue evidence are
planner-visible and reviewer-visible rather than hidden dtype conventions?
- RQ5 Reproducibility and claim traceability: do Figure 3 and Table 4 show that
generated results, source files, tests, validation commands, and manuscript
claims can be regenerated and audited from the repository?
Result interpretation:
- Tables 1 and 5 support RQ1 by showing controlled baseline and scale rows for
ordinary tensor execution, p-adic valuation-aware execution, finite-site sheaf
cover glue, proof-trace validation, and selected witness metadata. These rows
demonstrate artifact structure, relative proof-aware cost accounting, and
nonstandard-domain planner visibility; they do not demonstrate production
throughput.
- Table 2 supports RQ1 by separating deterministic proxy rows from repeated
wall-clock smoke checks. This makes paper-table values stable across reviewer
machines while still preserving a small runtime sanity check.
- Table 3 supports RQ2 by showing that `manifest_digest`, `native_count`,
`theorem_result_fingerprint`, `checker_index_adapter`,
`checker_index_admission`, and `witness_metadata` mutations have distinct
detection surfaces. The strongest evidence is the per-theorem checker-index
path, not digest-only or count-only validation.
- Figure 1 supports RQ3 by turning trace-bound semantic evidence into an
inspectable graph. The figure should be read together with
`docs/paper/formal_model.md` because the graph summarizes bindings but does
not replace the trusted computing base assumptions.
- Figure 2 supports RQ4 by connecting `shape_equality_dimension_witness_v1` and
`padic_precision_cutoff_witness_v1` to generated p-adic and sheaf evidence.
The result is evidence that Tokitai can carry selected nonstandard-domain
obligations, not evidence that it formalizes complete p-adic algebra or sheaf
theory.
- Figure 3 and Table 4 support RQ5 by connecting artifacts, tests, validation
commands, and claims C1-C9. This is reviewer traceability evidence, not an
exhaustive proof that future backends or arbitrary kernels preserve semantics.
Threats to validity:
- Execution backend validity: current rows use CPU reference backend evidence.
They do not claim SIMD, GPU, distributed execution, or production accelerator
coverage.
- Measurement validity: deterministic proxy values are intentionally stable for
artifact review. Repeated wall-clock rows are smoke checks with limited run
counts. Neither category is a Criterion-grade benchmark or production
performance study.
- Proof-assistant validity: External Lean execution is optional in the default
validation gate. Selected witness metadata improves reviewer-visible evidence,
but the artifact does not generate full Lean/Coq/Isabelle proofs for every
obligation.
- Nonstandard-domain validity: p-adic evidence is fixed precision over
u128-backed residues, and sheaf evidence is a finite-site prototype. The
evaluation does not claim complete p-adic algebra, sheaf cohomology, or
general categorical formalization.
- Security and authenticity validity: signed trace checks, canonical JSON
imports, SHA-256 digests, and checker indexes provide reproducible artifact
binding. They are not cryptographic novelty claims and do not replace external
key management policy.
- Compiler-verification validity: Tokitai validates trace and artifact evidence
through tested Rust code, but the Rust validator is not mechanically verified
and the project is not a verified compiler.
Manuscript wording should use "results indicate", "controlled artifact
evidence", "reviewer-visible traceability", and "scoped nonstandard-domain
evidence". It should not use "outperforms mature tensor compilers",
"production-grade benchmark", "full theorem-prover replacement",
"cryptographically novel", or "complete p-adic/sheaf formalization".
Preferred compact wording: controlled artifact evidence, reviewer-visible
traceability, scoped nonstandard-domain evidence.
## Limitations
- CPU reference execution remains the semantic oracle.
- ROCm/HIP measurements are feature-gated, local-pilot smoke evidence for the
scoped p-adic GEMM path only.
- No SIMD measurements yet.
- Repeated wall-clock rows are smoke measurements with limited run count.
- The deterministic model is useful for artifact stability but not a substitute
for production benchmarking.
- Lean full proof generation is still outside the current automated benchmark.
- Witness metadata overhead is a deterministic proxy for artifact generation
and validation, not an external Lean runtime benchmark.
- P149 scale rows are controlled proxy evidence for SCI-Q2 evaluation, not
throughput claims against mature tensor compiler baselines.
Post-submission benchmark expansion is planned in
`docs/paper/post_submission_benchmarks.md`. That plan is optional revision or
camera-ready hardening; it does not change the current deterministic proxy and
controlled baseline claims.