# Reproducibility Instructions

This document provides step-by-step instructions to reproduce all benchmark results from the certeza project.

## Overview

The certeza project emphasizes scientific reproducibility through:

1. **Toolchain Pinning**: Exact Rust version (1.82.0) via `rust-toolchain.toml`
2. **Dependency Locking**: Committed `Cargo.lock` with exact versions
3. **Hermetic Builds**: Docker-based isolated environments
4. **Metadata Capture**: Complete hardware/software environment documentation
5. **Statistical Validation**: Automated reproduction verification scripts
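
As a concrete illustration of pillar 1, a pinned toolchain file is just a small TOML table. The sketch below writes an illustrative copy to `/tmp` so it does not touch the repository; the project's real `rust-toolchain.toml` pins 1.82.0 but its exact `components` list may differ:

```shell
# Write and inspect a minimal rust-toolchain.toml (illustrative copy in /tmp;
# the repository's real file pins 1.82.0 but may list different components)
cat > /tmp/rust-toolchain.toml <<'EOF'
[toolchain]
channel = "1.82.0"
components = ["rustfmt", "clippy"]
EOF
grep 'channel' /tmp/rust-toolchain.toml
```

When this file is present at the repository root, `rustup` selects the pinned channel automatically for every `cargo` and `rustc` invocation in that directory.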

## Prerequisites

### Native Build Requirements

- **Rust**: 1.82.0 (automatically installed via rust-toolchain.toml)
- **Deno**: 2.x or later (for reporting scripts)
- **hyperfine**: 1.19.0 (for benchmarking)
- **Git**: For source checkout

### Docker Build Requirements (Recommended)

- **Docker**: 20.10+ or compatible container runtime
- **Docker Compose** (optional): For orchestrated workflows
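
Before choosing a method, a quick PATH check for the tools listed above can save a run that would otherwise fail partway through. This is only a presence check, not a version check (the tool names come from the two lists above):

```shell
# Report which prerequisite tools are on PATH (presence only, not versions)
for tool in rustc deno hyperfine git docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```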

## Method 1: Docker-Based Reproduction (Hermetic)

### Step 1: Build Docker Image

```bash
# Clone repository
git clone https://github.com/paiml/certeza.git
cd certeza

# Build hermetic Docker image
docker build -t certeza:reproducible .

# Verify build
docker images | grep certeza
```

### Step 2: Run Benchmarks in Container

```bash
# Run comprehensive benchmark suite
docker run --rm \
    -v $(pwd)/benchmarks/results:/app/benchmarks/results \
    certeza:reproducible

# Results will be written to benchmarks/results/latest.json
```

### Step 3: Generate Reports

```bash
# Run container with report generation
docker run --rm \
    -v $(pwd)/benchmarks:/app/benchmarks \
    certeza:reproducible \
    /bin/bash -c "
        /app/scripts/run_benchmarks.sh --benchmarks all --warmup 5 --iterations 20 && \
        deno run --allow-read --allow-write /app/scripts/generate_markdown_report.ts \
            /app/benchmarks/results/latest.json \
            /app/benchmarks/results/report.md && \
        deno run --allow-read --allow-write /app/scripts/generate_csv_report.ts \
            /app/benchmarks/results/latest.json \
            /app/benchmarks/results/report.csv && \
        deno run --allow-read --allow-write /app/scripts/generate_dashboard.ts \
            /app/benchmarks/results/latest.json \
            /app/benchmarks/results/dashboard.html
    "
```

### Step 4: Validate Byte-Identical Builds

```bash
# Build verifier stage
docker build --target verifier -t certeza:verify .

# Extract build hash
docker run --rm certeza:verify cat /tmp/rebuild_hash.txt

# Compare with original build
docker run --rm certeza:reproducible sha256sum /app/certeza
```

Expected: Hashes should match within the same Docker environment.
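
The comparison can also be scripted rather than eyeballed. A minimal sketch with stand-in values (substitute the two sha256 outputs from the commands above):

```shell
# Compare two build hashes (stand-in values; replace with the sha256
# outputs from the verifier and reproducible images above)
hash_a="3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b785"
hash_b="3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b785"
if [ "$hash_a" = "$hash_b" ]; then
  echo "builds match"
else
  echo "HASH MISMATCH"
fi
```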

## Method 2: Native Build Reproduction

### Step 1: Install Dependencies

```bash
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install Deno
curl -fsSL https://deno.land/install.sh | sh
export PATH="$HOME/.deno/bin:$PATH"

# Install hyperfine
cargo install hyperfine --version 1.19.0
```

### Step 2: Clone and Build

```bash
# Clone repository
git clone https://github.com/paiml/certeza.git
cd certeza

# Verify toolchain (should automatically use 1.82.0)
rustc --version  # Should show: rustc 1.82.0

# Build release binaries
cargo build --release --all-targets

# Generate reproducibility manifest
./scripts/generate_reproducibility_manifest.sh
```

### Step 3: Run Benchmarks

```bash
# Run critical benchmarks (Tier 2, ~5 minutes)
./scripts/run_benchmarks.sh \
    --benchmarks critical \
    --warmup 3 \
    --iterations 10 \
    --output benchmarks/results/tier2.json

# Run comprehensive suite (Tier 3, ~30 minutes)
./scripts/run_benchmarks.sh \
    --benchmarks all \
    --warmup 5 \
    --iterations 20 \
    --output benchmarks/results/tier3.json
```

### Step 4: Generate All Report Formats

```bash
# JSON (already generated by run_benchmarks.sh)
ls -lh benchmarks/results/tier3.json

# CSV export
deno run --allow-read --allow-write \
    scripts/generate_csv_report.ts \
    benchmarks/results/tier3.json \
    benchmarks/results/report.csv

# Markdown report
deno run --allow-read --allow-write \
    scripts/generate_markdown_report.ts \
    benchmarks/results/tier3.json \
    benchmarks/results/report.md

# HTML dashboard
deno run --allow-read --allow-write \
    scripts/generate_dashboard.ts \
    benchmarks/results/tier3.json \
    benchmarks/results/dashboard.html

# Multi-file CSV (summary + metadata + raw timings)
deno run --allow-read --allow-write \
    scripts/generate_csv_report.ts \
    benchmarks/results/tier3.json \
    benchmarks/results/csv/ \
    --multi
```

### Step 5: Validate Reproduction

```bash
# Compare against published baseline
deno run --allow-read --allow-write \
    scripts/check_regression.ts \
    --baseline benchmarks/baselines/published.json \
    --current benchmarks/results/tier3.json

# Statistical reproduction validation
./scripts/validate_reproduction.sh \
    benchmarks/baselines/published.json \
    benchmarks/results/tier3.json
```

Expected: Should pass with <5% mean difference for all benchmarks.
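
The <5% criterion is a simple relative difference of means. A one-liner sketch with hypothetical mean timings in milliseconds (the validation script applies this per benchmark):

```shell
# Relative mean difference, |current - baseline| / baseline * 100
# (hypothetical means in ms; the real values come from the JSON results)
baseline=21.6; current=22.1
awk -v b="$baseline" -v c="$current" 'BEGIN {
  d = (c - b) / b * 100
  if (d < 0) d = -d
  printf "mean diff: %.2f%% -> %s\n", d, (d < 5 ? "PASS" : "FAIL")
}'
```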

## Method 3: Using Makefile Targets

The project includes comprehensive Makefile targets for common workflows:

```bash
# Install all required tools
make install-tools

# Run critical benchmarks (Tier 2)
make benchmark

# Run comprehensive suite (Tier 3)
make benchmark-all

# Generate all reports
make benchmark-report

# Compare against baseline
make benchmark-compare

# Save current run as new baseline (make takes variables as NAME=value)
make benchmark-baseline-save NAME=reproduction-$(date +%Y%m%d)
```

## Validation Checklist

After reproducing results, verify the following:

- [ ] **Toolchain**: `rustc --version` shows 1.82.0
- [ ] **Build Success**: `cargo build --release` completes without errors
- [ ] **Tests Pass**: `cargo test --all` shows 261 tests passing
- [ ] **Benchmarks Run**: `latest.json` generated with valid data
- [ ] **Reports Generated**: CSV, Markdown, HTML files created
- [ ] **Statistical Validity**: CV < 10% for all benchmarks
- [ ] **Reproduction Validation**: `validate_reproduction.sh` passes
- [ ] **Metadata Complete**: Toolchain manifest includes hardware/software specs
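
The CV criterion above is the coefficient of variation, stddev/mean. A sketch of the computation over a handful of made-up timings (the real inputs are the raw timings in the results JSON):

```shell
# Coefficient of variation (CV = stddev / mean) over sample timings in ms
# (values are made up; population stddev is used for the sketch)
printf '%s\n' 21.4 21.6 21.8 21.5 21.7 | awk '
  { n++; sum += $1; sumsq += $1 * $1 }
  END {
    mean = sum / n
    sd = sqrt(sumsq / n - mean * mean)
    printf "CV = %.2f%%\n", sd / mean * 100
  }'
```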

## Expected Results

### Performance Characteristics

Based on reference hardware (Intel Core i7-9750H, 16GB RAM):

| Benchmark | Mean (ms) | CV | Expected Range |
|-----------|-----------|-----|----------------|
| trueno_vec_push | ~21.6 | <5% | 20.5 - 22.7 |
| trueno_vec_pop | ~18.4 | <2% | 18.0 - 18.8 |
| trueno_vec_get | ~12.4 | <1% | 12.2 - 12.6 |

**Note**: Absolute timings will vary by hardware. Focus on:
1. **Relative performance** (speedup ratios)
2. **Coefficient of Variation** (CV < 10%)
3. **Statistical equivalence** (reproduction validation)
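
To illustrate point 1, ratios between benchmarks tend to hold across machines even when absolute timings do not. A sketch with hypothetical means for the reference box and a slower local one:

```shell
# Relative performance: the push/get ratio should be similar across
# machines even when absolute timings differ (all numbers hypothetical)
awk 'BEGIN {
  printf "ref ratio:   %.2f\n", 21.6 / 12.4   # reference hardware
  printf "local ratio: %.2f\n", 32.4 / 18.6   # hypothetical slower box
}'
```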

### Report Outputs

Expected file structure after complete reproduction:

```
benchmarks/
├── results/
│   ├── latest.json          # Most recent run (JSON schema v1.0)
│   ├── report.md            # GitHub-flavored markdown
│   ├── report.csv           # Single CSV file
│   ├── dashboard.html       # Interactive Chart.js dashboard
│   └── csv/                 # Multi-file CSV export
│       ├── benchmarks_summary.csv
│       ├── metadata.csv
│       └── raw_timings.csv
├── baselines/
│   └── published.json       # Published reference baseline
├── metadata/
│   └── toolchain_manifest_<commit>.txt
└── history/
    └── weekly-*.json        # Historical snapshots
```
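
A quick way to confirm the main artifacts from this tree after a full run (paths taken from the layout above; run from the repository root, and expect `missing:` lines if a step was skipped):

```shell
# Check that the main report outputs from the tree above were produced
for f in benchmarks/results/latest.json \
         benchmarks/results/report.md \
         benchmarks/results/report.csv \
         benchmarks/results/dashboard.html; do
  if [ -f "$f" ]; then echo "ok: $f"; else echo "missing: $f"; fi
done
```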

## Troubleshooting

### Issue: Different Rust Version

**Symptom**: `rustc --version` shows version != 1.82.0

**Solution**:
```bash
# rust-toolchain.toml should enforce correct version
# If not, manually override:
rustup override set 1.82.0
```

### Issue: Benchmark Variance Too High

**Symptom**: CV > 10% or reproduction validation fails

**Possible Causes**:
1. CPU governor not set to "performance"
2. Background processes consuming resources
3. Thermal throttling

**Solutions**:
```bash
# Set CPU governor (Linux)
sudo cpupower frequency-set --governor performance

# Check CPU frequency
grep MHz /proc/cpuinfo

# Close unnecessary background processes
# Disable turbo boost for consistency (if high variance persists)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```

### Issue: Docker Build Fails

**Symptom**: Docker build errors during dependency installation

**Solution**:
```bash
# Clear Docker cache and rebuild
docker builder prune
docker build --no-cache -t certeza:reproducible .

# If persistent, check Docker version
docker --version  # Should be 20.10+
```

### Issue: Report Generation Fails

**Symptom**: Deno scripts exit with errors

**Solution**:
```bash
# Verify Deno installation
deno --version  # Should be 2.x+

# Add permissions explicitly
deno run --allow-read --allow-write --allow-run <script>

# Check JSON validity
deno eval "JSON.parse(await Deno.readTextFile('benchmarks/results/latest.json'))"
```

## Citation

If you successfully reproduce these results, please cite:

```bibtex
@software{certeza2025,
  author       = {Pragmatic AI Labs},
  title        = {certeza: Scientific Framework for Asymptotic Test Effectiveness in Rust},
  year         = 2025,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.XXXXXXX},
  url          = {https://doi.org/10.5281/zenodo.XXXXXXX}
}
```

## Contact

For reproduction issues or questions:

- **GitHub Issues**: https://github.com/paiml/certeza/issues
- **Discussions**: https://github.com/paiml/certeza/discussions

## Changelog

- **2025-11-18**: Initial reproduction documentation (Phase 3.5)