tokitai-operator 0.1.0

# Evaluation Methodology

P112 expands the paper artifact beyond deterministic smoke workloads. The goal
is still conservative: produce reproducible reviewer-facing evidence without
pretending the CPU reference backend is a production performance engine.

## Workloads

The benchmark harness is `examples/paper_benchmarks.rs`.

Current workload groups:

- `ordinary_tensor_add`: small dense integer add reference workload.
- `ordinary_tensor_add_large`: larger dense integer add workload with 64
  elements to show non-smoke tensor scale.
- `padic_matmul_valuation_skip`: fixed-precision p-adic matmul workload with
  valuation-based skipped terms.
- `padic_matmul_valuation_skip_large`: larger 4x4x4 p-adic matmul workload
  with reference and optimized plans.
- `rocm_padic_stratified_benchmarks`: feature-gated certified p-adic GEMM
  benchmark matrix with dense CPU, certified sparse CPU, HIP stratified, and
  fallback rows across deterministic valuation distributions.
- `sheaf_cover_glue`: finite-cover sheaf compatibility and glue check.
- `sheaf_cover_glue_large`: larger three-open finite-cover sheaf compatibility
  and glue check for controlled cover scale evidence.

## Controlled Baseline And Scale Comparisons

P128 adds `baseline-scale-comparison.csv` as the reviewer-facing baseline and
scale table. The table is intentionally controlled rather than a production
performance claim: every row names a baseline family, the candidate path, the
scale point, proof overhead, and a bounded interpretation.

Baseline families:

- `ordinary_tensor`: dense CPU reference plan versus registry-aware CPU
  execution for `ordinary_tensor_add` and `ordinary_tensor_add_large`.
- `padic_valuation`: dense p-adic matmul baseline versus the valuation-skip
  plan for both small and `shape=4x4x4` p-adic workloads.
- `sheaf_glue`: finite-cover compatibility baseline versus cover-local glue
  checking for two-open and three-open finite sheaf workloads.
- `proof_trace`: digest-only manifest checking versus per-theorem checker-index
  validation for the native trace path, including a scaled
  `theorem_count=3;artifact_count=11` proxy row.
- `selected_witness_metadata`: no selected witness metadata versus the shape
  and p-adic selected Lean witness metadata path, including a scaled
  `witnesses=6` proxy row.

Scale points are explicit strings such as `elements=64`, `shape=4x4x4`,
`cover_opens=2`, `cover_opens=3`, `theorem_count=1;artifact_count=8`,
`theorem_count=3;artifact_count=11`, and
`families=2;metadata_fields=5;witnesses=6`. The `ratio_milli` column reports
`candidate_ns / baseline_ns * 1000`, keeping the output integer and
deterministic for artifact review.

The harness also reports:

- `reference_duration_ns`: deterministic reference-plan proxy cost.
- `optimized_duration_ns`: deterministic optimized-plan proxy cost.
- `proof_trace_validation_ns`: deterministic proof-trace validation proxy cost.
- `proof_trace_validation_scale`: deterministic proxy measurement for scaled
  theorem-count and artifact-count validation.
- `measurement-summary.csv`: repeated summaries that distinguish deterministic
  proxy rows from repeated wall-clock validation rows.

The p-adic rows compare a baseline matmul plan with the valuation-skip plan and
report `skipped_terms`, `evaluated_terms`, and shape metadata.

The witness rows measure the selected Lean witness metadata path added for
shape equality and p-adic valuation-cutoff evidence. This is reported as a
separate `selected_witness_metadata_overhead` deterministic proxy row so the
paper can discuss external-checker witness visibility without folding it into
ordinary operator runtime cost.

P149 adds controlled scale rows for sheaf cover size, proof-trace theorem and
artifact count, and selected witness metadata count. These rows are still
deterministic artifact-review evidence. They strengthen the experimental
narrative without claiming Criterion-grade or production benchmark results.

## P150 Claim-To-Result Interpretation Matrix

P150 hardens how generated rows should be interpreted in the manuscript. Each
generated table category has a scoped claim, a reader-facing interpretation,
and an explicit limitation.

| Generated rows | Manuscript claim | Interpretation | Limitation |
| --- | --- | --- | --- |
| `ordinary_tensor_add`, `ordinary_tensor_add_large` | controlled ordinary tensor baseline | Registry-aware CPU execution can be compared with a reference CPU plan under a stable deterministic proxy. | Not a claim of superiority over XLA, TVM, Triton, JAX, PyTorch, TensorFlow, or production tensor runtimes. |
| `padic_matmul_valuation_skip`, `padic_matmul_valuation_skip_large` | nonstandard-domain planning evidence | p-adic valuation metadata changes evaluated terms while preserving precision-bounded equality checks. | Fixed-precision p-adic arithmetic is demonstrated, not a full computer algebra system. |
| `padic-stratified-benchmarks.csv` rows | certified p-adic GEMM evidence | Dense CPU, certified sparse CPU, HIP stratified, and fallback rows expose skipped/evaluated products, precision margins, oracle status, certificate coverage, transfer time, kernel time, wall time, and speed-claim blockers. | Q_5[3] 2x3x2 HIP evidence is local and smoke-scale; unsupported prime, precision, and shape rows are fallback evidence, not portability or speedup claims. |
| `sheaf_cover_glue`, `sheaf_cover_glue_large` | finite sheaf locality evidence | Cover-local compatibility and gluing checks are visible at two-open and three-open scale points. | Finite-site examples do not claim broad sheaf-theoretic automation. |
| `proof_trace_validation_scale` and proof-trace baseline rows | trace validation overhead evidence | Digest-only manifest checking and per-theorem checker-index validation are separated by theorem and artifact count. | Deterministic proxy costs are not Criterion-grade validation benchmarks. |
| `selected_witness_metadata_overhead` and selected witness metadata scale rows | reviewer-visible witness metadata evidence | Shape and p-adic selected witness metadata have explicit generated rows and tamper-detection evidence. | Metadata overhead is not an external Lean runtime benchmark and does not prove all Tokitai obligations in Lean. |
| `tamper-matrix.csv` rows | semantic tamper detection evidence | Manifest, native-count, theorem-result, checker-index, admission, and witness metadata mutations show which validation layer detects drift. | The table is a validation ablation, not a cryptographic novelty claim. |
| `trace-architecture.csv` rows | artifact binding architecture evidence | Plan fingerprints, artifact digests, checker-output digests, adapter capability fingerprints, and canonical JSON normalization are reviewer-visible. | Architecture rows summarize bindings; the formal assumptions remain in `docs/paper/formal_model.md`. |
| `claim-to-test-matrix.csv` rows | reproducibility evidence | Every manuscript claim points to code, tests, and generated experiment artifacts. | The matrix supports traceability, not exhaustive proof of all future backend implementations. |

The evaluation section should cite this matrix when interpreting generated
tables. It should use phrases such as controlled baseline, deterministic proxy,
artifact-review evidence, trace validation overhead, and reviewer-visible
witness metadata. It should avoid phrases that imply production throughput,
full proof-assistant verification, verified compiler status, or mature tensor
compiler replacement.

## Metrics

Paper tables use:

- workload name;
- reference, optimized, and proof validation nanosecond columns;
- plan step count;
- proof object count;
- checked proof count;
- domain-specific metric string;
- measurement kind, run count, min, median, and max for repeated summaries.
- selected witness metadata overhead for reviewer-visible Lean witness family
  fields.
- witness metadata tamper detection in `tamper-matrix.csv`.
- certified p-adic GEMM skipped/evaluated products, precision margin,
  certificate coverage, dense/sparse CPU oracle status, HIP transfer time, HIP
  kernel time, and fallback reason in `padic-stratified-benchmarks.csv`.

## Reproducibility Commands

Run:

```bash
cargo run --quiet --example paper_benchmarks
```

Expected outputs:

- `target/paper-results/paper-results.json`
- `target/paper-results/paper-results.csv`
- `target/paper-results/measurement-summary.csv`
- `target/paper-results/baseline-scale-comparison.csv`
- `target/paper-results/tamper-matrix.csv`
- `target/paper-results/padic-stratified-benchmarks/padic-stratified-benchmarks.csv`
- `target/paper-results/claim-to-test-matrix.csv`
- `target/paper-results/trace-architecture.csv`

Run the full validation gate:

```bash
python -m json.tool todo.json >/dev/null
cargo fmt --check
cargo test
```

## Deterministic Versus Measured Results

`paper-results.csv` uses deterministic proxy timings. These are stable across
review machines and are intended to compare relative workload structure,
optimized terms, and proof validation scale.

`measurement-summary.csv` separates two measurement classes:

- `deterministic_proxy`: derived from the same deterministic model used by the
  paper results table.
- `deterministic_proxy` for `selected_witness_metadata_overhead`: derived from
  selected witness family count, metadata field count, and witness count.
- `repeated_wall_clock`: repeated wall-clock smoke measurement for the
  validation path.

The wall-clock rows are intentionally labeled as smoke measurements. They should
not be read as Criterion-grade microbenchmarks.

## Witness Metadata Ablation

`tamper-matrix.csv` includes `witness_metadata`, a digest-retargeted mutation of
the Lean checkable artifact header. The manifest digest is recomputed, so this
row is not a simple stale-file digest check. Detection comes from Lean
checkable artifact validation, which requires selected witness family metadata
and matching witness counts. This gives reviewers a direct ablation for the
additional external-checker evidence surface introduced by P123-P125.

## P158 Manuscript Evaluation Prose

This section is intended as manuscript-ready evaluation prose. It connects the
planned tables and figures in `docs/paper/table_figure_plan.md` to research
questions, result interpretation, and threats to validity. The prose should be
read with the same scope as the generated artifacts: controlled artifact
evaluation, not production benchmarking.

Research questions:

- RQ1 Controlled baseline and scale: do generated Tables 1, 2, and 5 show that
  Tokitai can expose ordinary tensor, p-adic valuation, finite sheaf,
  proof-trace, and selected witness metadata costs under reproducible
  artifact-review conditions?
- RQ2 Semantic tamper detection: does Table 3 show that per-theorem
  checker-index validation detects semantic drift classes that digest-only or
  count-only checks miss?
- RQ3 Trace-bound evidence architecture: does Figure 1 show how execution plans,
  proof replay reports, selected witness artifacts, checker outputs, native
  admissions, adapter registries, checker result indexes, canonical JSON imports,
  and signed manifests are bound into one reviewable evidence graph?
- RQ4 Nonstandard-domain evidence: does Figure 2 show that fixed-precision
  p-adic valuation cutoff evidence and finite-site sheaf glue evidence are
  planner-visible and reviewer-visible rather than hidden dtype conventions?
- RQ5 Reproducibility and claim traceability: do Figure 3 and Table 4 show that
  generated results, source files, tests, validation commands, and manuscript
  claims can be regenerated and audited from the repository?

Result interpretation:

- Tables 1 and 5 support RQ1 by showing controlled baseline and scale rows for
  ordinary tensor execution, p-adic valuation-aware execution, finite-site sheaf
  cover glue, proof-trace validation, and selected witness metadata. These rows
  demonstrate artifact structure, relative proof-aware cost accounting, and
  nonstandard-domain planner visibility; they do not demonstrate production
  throughput.
- Table 2 supports RQ1 by separating deterministic proxy rows from repeated
  wall-clock smoke checks. This makes paper-table values stable across reviewer
  machines while still preserving a small runtime sanity check.
- Table 3 supports RQ2 by showing that `manifest_digest`, `native_count`,
  `theorem_result_fingerprint`, `checker_index_adapter`,
  `checker_index_admission`, and `witness_metadata` mutations have distinct
  detection surfaces. The strongest evidence is the per-theorem checker-index
  path, not digest-only or count-only validation.
- Figure 1 supports RQ3 by turning trace-bound semantic evidence into an
  inspectable graph. The figure should be read together with
  `docs/paper/formal_model.md` because the graph summarizes bindings but does
  not replace the trusted computing base assumptions.
- Figure 2 supports RQ4 by connecting `shape_equality_dimension_witness_v1` and
  `padic_precision_cutoff_witness_v1` to generated p-adic and sheaf evidence.
  The result is evidence that Tokitai can carry selected nonstandard-domain
  obligations, not evidence that it formalizes complete p-adic algebra or sheaf
  theory.
- Figure 3 and Table 4 support RQ5 by connecting artifacts, tests, validation
  commands, and claims C1-C9. This is reviewer traceability evidence, not an
  exhaustive proof that future backends or arbitrary kernels preserve semantics.

Threats to validity:

- Execution backend validity: current rows use CPU reference backend evidence.
  They do not claim SIMD, GPU, distributed execution, or production accelerator
  coverage.
- Measurement validity: deterministic proxy values are intentionally stable for
  artifact review. Repeated wall-clock rows are smoke checks with limited run
  counts. Neither category is a Criterion-grade benchmark or production
  performance study.
- Proof-assistant validity: External Lean execution is optional in the default
  validation gate. Selected witness metadata improves reviewer-visible evidence,
  but the artifact does not generate full Lean/Coq/Isabelle proofs for every
  obligation.
- Nonstandard-domain validity: p-adic evidence is fixed precision over
  u128-backed residues, and sheaf evidence is a finite-site prototype. The
  evaluation does not claim complete p-adic algebra, sheaf cohomology, or
  general categorical formalization.
- Security and authenticity validity: signed trace checks, canonical JSON
  imports, SHA-256 digests, and checker indexes provide reproducible artifact
  binding. They are not cryptographic novelty claims and do not replace external
  key management policy.
- Compiler-verification validity: Tokitai validates trace and artifact evidence
  through tested Rust code, but the Rust validator is not mechanically verified
  and the project is not a verified compiler.

Manuscript wording should use "results indicate", "controlled artifact
evidence", "reviewer-visible traceability", and "scoped nonstandard-domain
evidence". It should not use "outperforms mature tensor compilers",
"production-grade benchmark", "full theorem-prover replacement",
"cryptographically novel", or "complete p-adic/sheaf formalization".

Preferred compact wording: controlled artifact evidence, reviewer-visible
traceability, scoped nonstandard-domain evidence.

## Limitations

- CPU reference execution remains the semantic oracle.
- ROCm/HIP measurements are feature-gated, local-pilot smoke evidence for the
  scoped p-adic GEMM path only.
- No SIMD measurements yet.
- Repeated wall-clock rows are smoke measurements with limited run count.
- The deterministic model is useful for artifact stability but not a substitute
  for production benchmarking.
- Lean full proof generation is still outside the current automated benchmark.
- Witness metadata overhead is a deterministic proxy for artifact generation
  and validation, not an external Lean runtime benchmark.
- P149 scale rows are controlled proxy evidence for SCI-Q2 evaluation, not
  throughput claims against mature tensor compiler baselines.

Post-submission benchmark expansion is planned in
`docs/paper/post_submission_benchmarks.md`. That plan is optional revision or
camera-ready hardening; it does not change the current deterministic proxy and
controlled baseline claims.