# Reproducibility Instructions
This document provides step-by-step instructions to reproduce all benchmark results from the certeza project.
## Overview
The certeza project emphasizes scientific reproducibility through:
1. **Toolchain Pinning**: Exact Rust version (1.82.0) via `rust-toolchain.toml`
2. **Dependency Locking**: Committed `Cargo.lock` with exact versions
3. **Hermetic Builds**: Docker-based isolated environments
4. **Metadata Capture**: Complete hardware/software environment documentation
5. **Statistical Validation**: Automated reproduction verification scripts
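The toolchain pin listed above is a small TOML file committed at the repository root; a minimal sketch of what it might contain (the exact contents in certeza may differ):

```toml
# rust-toolchain.toml — rustup reads this file and automatically
# selects the pinned compiler for every cargo/rustc invocation.
[toolchain]
channel = "1.82.0"
```

With this file present, `rustup` installs and uses 1.82.0 transparently, so contributors cannot accidentally build with a different compiler.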
## Prerequisites
### Native Build Requirements
- **Rust**: 1.82.0 (automatically installed via rust-toolchain.toml)
- **Deno**: 2.x or later (for reporting scripts)
- **hyperfine**: 1.19.0 (for benchmarking)
- **Git**: For source checkout
### Docker Build Requirements (Recommended)
- **Docker**: 20.10+ or compatible container runtime
- **Docker Compose** (optional): For orchestrated workflows
## Method 1: Docker-Based Reproduction (Hermetic)
### Step 1: Build Docker Image
```bash
# Clone repository
git clone https://github.com/paiml/certeza.git
cd certeza
# Build hermetic Docker image
docker build -t certeza:reproducible .
# Verify the image was built
docker images certeza:reproducible
```
### Step 2: Run Benchmarks in Container
```bash
# Run comprehensive benchmark suite
docker run --rm \
-v $(pwd)/benchmarks/results:/app/benchmarks/results \
certeza:reproducible
# Results will be written to benchmarks/results/latest.json
```
### Step 3: Generate Reports
```bash
# Run container with report generation
docker run --rm \
-v $(pwd)/benchmarks:/app/benchmarks \
certeza:reproducible \
/bin/bash -c "
/app/scripts/run_benchmarks.sh --benchmarks all --warmup 5 --iterations 20 && \
deno run --allow-read --allow-write /app/scripts/generate_markdown_report.ts \
/app/benchmarks/results/latest.json \
/app/benchmarks/results/report.md && \
deno run --allow-read --allow-write /app/scripts/generate_csv_report.ts \
/app/benchmarks/results/latest.json \
/app/benchmarks/results/report.csv && \
deno run --allow-read --allow-write /app/scripts/generate_dashboard.ts \
/app/benchmarks/results/latest.json \
/app/benchmarks/results/dashboard.html
"
```
### Step 4: Validate Byte-Identical Builds
```bash
# Build verifier stage
docker build --target verifier -t certeza:verify .
# Extract build hash
docker run --rm certeza:verify cat /tmp/rebuild_hash.txt
# Compare with original build
docker run --rm certeza:reproducible sha256sum /app/certeza
```
Expected: both commands print the same hash when the images are built in the same Docker environment.
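Comparing the two hashes by eye is error-prone; a small shell check can do it instead. The snippet below is illustrative only: it derives both hashes from the same input, so they necessarily match, but the same pattern applies to the two `docker run` outputs above.

```bash
# Illustrative hash comparison. In practice, hash_a and hash_b would be
# captured from the two docker run commands shown above.
hash_a="$(printf 'demo' | sha256sum | cut -d' ' -f1)"
hash_b="$(printf 'demo' | sha256sum | cut -d' ' -f1)"
if [ "$hash_a" = "$hash_b" ]; then
  echo "MATCH: binaries are byte-identical"
else
  echo "MISMATCH: binaries differ"
fi
```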
## Method 2: Native Build Reproduction
### Step 1: Install Dependencies
```bash
# Install Rust via rustup (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install Deno
curl -fsSL https://deno.land/install.sh | sh
# Install hyperfine
cargo install hyperfine --version 1.19.0
```
### Step 2: Clone and Build
```bash
# Clone repository
git clone https://github.com/paiml/certeza.git
cd certeza
# Verify toolchain (should automatically use 1.82.0)
rustc --version # Should show: rustc 1.82.0
# Build release binaries
cargo build --release --all-targets
# Generate reproducibility manifest
./scripts/generate_reproducibility_manifest.sh
```
### Step 3: Run Benchmarks
```bash
# Run critical benchmarks (Tier 2, ~5 minutes)
./scripts/run_benchmarks.sh \
--benchmarks critical \
--warmup 3 \
--iterations 10 \
--output benchmarks/results/tier2.json
# Run comprehensive suite (Tier 3, ~30 minutes)
./scripts/run_benchmarks.sh \
--benchmarks all \
--warmup 5 \
--iterations 20 \
--output benchmarks/results/tier3.json
```
### Step 4: Generate All Report Formats
```bash
# JSON (already generated by run_benchmarks.sh)
ls -lh benchmarks/results/tier3.json
# CSV export
deno run --allow-read --allow-write \
scripts/generate_csv_report.ts \
benchmarks/results/tier3.json \
benchmarks/results/report.csv
# Markdown report
deno run --allow-read --allow-write \
scripts/generate_markdown_report.ts \
benchmarks/results/tier3.json \
benchmarks/results/report.md
# HTML dashboard
deno run --allow-read --allow-write \
scripts/generate_dashboard.ts \
benchmarks/results/tier3.json \
benchmarks/results/dashboard.html
# Multi-file CSV (summary + metadata + raw timings)
deno run --allow-read --allow-write \
scripts/generate_csv_report.ts \
benchmarks/results/tier3.json \
benchmarks/results/csv/ \
--multi
```
### Step 5: Validate Reproduction
```bash
# Compare against published baseline
deno run --allow-read --allow-write \
scripts/check_regression.ts \
--baseline benchmarks/baselines/published.json \
--current benchmarks/results/tier3.json
# Statistical reproduction validation
./scripts/validate_reproduction.sh \
benchmarks/baselines/published.json \
benchmarks/results/tier3.json
```
Expected: Should pass with <5% mean difference for all benchmarks.
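The <5% criterion can also be sanity-checked by hand for a single benchmark. The sketch below uses made-up mean values (the real ones come from the baseline and current JSON files) and computes the absolute percentage difference:

```bash
# Illustrative check of the <5% mean-difference criterion.
# The two means below are made-up example numbers, not real results.
baseline_mean=21.6
current_mean=22.1
diff_pct="$(awk -v b="$baseline_mean" -v c="$current_mean" '
  BEGIN {
    d = (c - b) / b * 100   # percentage difference vs. baseline
    if (d < 0) d = -d       # absolute value
    printf "%.1f", d
  }')"
echo "mean difference: ${diff_pct}%"
```

For these example values the difference is about 2.3%, which would pass the 5% threshold.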
## Method 3: Using Makefile Targets
The project includes comprehensive Makefile targets for common workflows:
```bash
# Install all required tools
make install-tools
# Run critical benchmarks (Tier 2)
make benchmark
# Run comprehensive suite (Tier 3)
make benchmark-all
# Generate all reports
make benchmark-report
# Compare against baseline
make benchmark-compare
# Save current run as new baseline
# Save current run as new baseline (make takes variables as VAR=value, not flags)
make benchmark-baseline-save NAME=reproduction-$(date +%Y%m%d)
```
## Validation Checklist
After reproducing results, verify the following:
- [ ] **Toolchain**: `rustc --version` shows 1.82.0
- [ ] **Build Success**: `cargo build --release` completes without errors
- [ ] **Tests Pass**: `cargo test --all` shows 261 tests passing
- [ ] **Benchmarks Run**: `latest.json` generated with valid data
- [ ] **Reports Generated**: CSV, Markdown, HTML files created
- [ ] **Statistical Validity**: CV < 10% for all benchmarks
- [ ] **Reproduction Validation**: `validate_reproduction.sh` passes
- [ ] **Metadata Complete**: Toolchain manifest includes hardware/software specs
## Expected Results
### Performance Characteristics
Based on reference hardware (Intel Core i7-9750H, 16GB RAM):
| Benchmark | Mean (ns) | CV | Range (ns) |
|-----------|-----------|----|------------|
| trueno_vec_push | ~21.6 | <5% | 20.5 - 22.7 |
| trueno_vec_pop | ~18.4 | <2% | 18.0 - 18.8 |
| trueno_vec_get | ~12.4 | <1% | 12.2 - 12.6 |
**Note**: Absolute timings will vary by hardware. Focus on:
1. **Relative performance** (speedup ratios)
2. **Coefficient of Variation** (CV < 10%)
3. **Statistical equivalence** (reproduction validation)
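The coefficient of variation (standard deviation divided by the mean) can be computed from raw timings with a short awk pipeline. The timings below are made-up illustrative values; real runs would read them from the raw timing output:

```bash
# Illustrative CV computation from raw timings (one value per line).
# The three timings are made-up example numbers.
cv="$(printf '21.4\n21.6\n21.8\n' | awk '
  { sum += $1; sumsq += $1 * $1; n++ }
  END {
    mean = sum / n
    sd = sqrt(sumsq / n - mean * mean)  # population standard deviation
    printf "%.2f", 100 * sd / mean      # CV as a percentage
  }')"
echo "CV = ${cv}%"
```

For these example values the CV is about 0.76%, comfortably under the 10% threshold.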
### Report Outputs
Expected file structure after complete reproduction:
```
benchmarks/
├── results/
│ ├── latest.json # Most recent run (JSON schema v1.0)
│ ├── report.md # GitHub-flavored markdown
│ ├── report.csv # Single CSV file
│ ├── dashboard.html # Interactive Chart.js dashboard
│ └── csv/ # Multi-file CSV export
│ ├── benchmarks_summary.csv
│ ├── metadata.csv
│ └── raw_timings.csv
├── baselines/
│ └── published.json # Published reference baseline
├── metadata/
│ └── toolchain_manifest_<commit>.txt
└── history/
└── weekly-*.json # Historical snapshots
```
## Troubleshooting
### Issue: Different Rust Version
**Symptom**: `rustc --version` shows version != 1.82.0
**Solution**:
```bash
# rust-toolchain.toml should enforce correct version
# If not, manually override:
rustup override set 1.82.0
```
### Issue: Benchmark Variance Too High
**Symptom**: CV > 10% or reproduction validation fails
**Possible Causes**:
1. CPU governor not set to "performance"
2. Background processes consuming resources
3. Thermal throttling
**Solutions**:
```bash
# Set CPU governor (Linux)
sudo cpupower frequency-set --governor performance
# Check current CPU frequency and governor
cpupower frequency-info
# Close unnecessary background processes
# Disable turbo boost for consistency (Intel CPUs; if high variance persists)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```
### Issue: Docker Build Fails
**Symptom**: Docker build errors during dependency installation
**Solution**:
```bash
# Clear Docker cache and rebuild
docker builder prune
docker build --no-cache -t certeza:reproducible .
# If persistent, check Docker version
docker --version # Should be 20.10+
```
### Issue: Report Generation Fails
**Symptom**: Deno scripts exit with errors
**Solution**:
```bash
# Verify Deno installation
deno --version # Should be 2.x+
# Add permissions explicitly
deno run --allow-read --allow-write --allow-run <script>
# Check JSON validity
deno eval "JSON.parse(await Deno.readTextFile('benchmarks/results/latest.json'))"
```
## Citation
If you successfully reproduce these results, please cite:
```bibtex
@software{certeza2025,
author = {Pragmatic AI Labs},
title = {certeza: Scientific Framework for Asymptotic Test Effectiveness in Rust},
year = 2025,
publisher = {Zenodo},
doi = {10.5281/zenodo.XXXXXXX},
url = {https://doi.org/10.5281/zenodo.XXXXXXX}
}
```
## Contact
For reproduction issues or questions:
- **GitHub Issues**: https://github.com/paiml/certeza/issues
- **Discussions**: https://github.com/paiml/certeza/discussions
## Changelog
- **2025-11-18**: Initial reproduction documentation (Phase 3.5)