# Certeza Benchmark Reporting Tools
Phase 3.3 implementation: Statistical analysis and reporting tools for scientific benchmarking.
## Overview
This directory contains Deno TypeScript tools for comprehensive benchmark analysis and reporting:
- **Statistical Analysis**: Descriptive statistics, outlier detection, comparative analysis
- **Report Generation**: CSV and Markdown export formats
- **Regression Detection**: Automated performance regression detection with configurable thresholds
- **Baseline Management**: Save, compare, and track benchmark baselines over time
## Prerequisites
- [Deno](https://deno.land/) runtime (v1.38+)
- Benchmark results in JSON format (generated by Rust benchmarking framework)
## Installation
Install Deno (official installer):
```bash
curl -fsSL https://deno.land/install.sh | sh
```
All scripts are standalone and require no additional dependencies.
## Tools
### 1. Statistical Analysis (`statistical_analysis.ts`)
Core statistical utilities library providing:
- **Descriptive Statistics**: mean, median, std dev, coefficient of variation
- **Confidence Intervals**: Bootstrap method (1000 iterations)
- **Outlier Detection**: IQR-based method
- **Comparative Analysis**: Cohen's d effect size, Welch's t-test
- **Speedup Calculation**: Ratio with confidence intervals
**Usage as Library:**
```typescript
import {
calculateStatistics,
detectOutliers,
calculateCohenD,
calculateSpeedup,
} from "./statistical_analysis.ts";
const data = [21.4, 21.3, 20.1, 21.9, 21.4];
const stats = calculateStatistics(data);
console.log(stats.mean, stats.confidenceInterval95);
```
**CLI Demo:**
```bash
deno run --allow-read scripts/statistical_analysis.ts
```
### 2. CSV Report Generator (`generate_csv_report.ts`)
Export benchmark results to CSV format with multiple output modes.
**Single-file Mode** (summary only):
```bash
deno run --allow-read --allow-write \
scripts/generate_csv_report.ts \
benchmarks/results/latest.json \
report.csv
```
**Multi-file Mode** (summary + metadata + raw timings):
```bash
deno run --allow-read --allow-write \
scripts/generate_csv_report.ts \
benchmarks/results/latest.json \
reports/ \
--multi
```
Generates:
- `benchmarks_summary.csv` - Main results table
- `metadata.csv` - Hardware/software environment
- `raw_timings.csv` - All measurement iterations
### 3. Markdown Report Generator (`generate_markdown_report.ts`)
Generate comprehensive GitHub-flavored Markdown reports.
**Usage:**
```bash
deno run --allow-read --allow-write \
scripts/generate_markdown_report.ts \
benchmarks/results/latest.json \
report.md
```
**Report Sections:**
- Executive summary with key findings
- Benchmark results table with statistics
- Performance comparisons (if baseline exists)
- Regression/improvement warnings
- Hardware and software environment details
- Statistical methodology documentation
- Reproducibility instructions
### 4. Regression Detection (`check_regression.ts`)
Automated performance regression detection with statistical significance testing.
**Basic Usage:**
```bash
deno run --allow-read --allow-write \
scripts/check_regression.ts \
--baseline benchmarks/baselines/v1.0.0.json \
--current benchmarks/results/latest.json
```
**Custom Thresholds:**
```bash
deno run --allow-read --allow-write \
scripts/check_regression.ts \
--baseline benchmarks/baselines/v1.0.0.json \
--current benchmarks/results/latest.json \
--warning-threshold 0.15 \
--critical-threshold 0.25
```
**CI/CD Integration:**
```bash
# Exit codes:
# 0 = No regressions
# 1 = Warnings detected
# 2 = Critical regressions detected
# 3 = Error
deno run --allow-read --allow-write \
scripts/check_regression.ts \
--baseline benchmarks/baselines/main.json \
--current benchmarks/results/latest.json \
--output-json regression_report.json \
--quiet
EXIT_CODE=$?
if [ $EXIT_CODE -eq 2 ]; then
echo "CRITICAL REGRESSION DETECTED!"
exit 1
fi
```
**Configuration:**

| Flag | Default | Description |
|------|---------|-------------|
| `--warning-threshold` | 0.05 | Warning at 5% slowdown |
| `--critical-threshold` | 0.10 | Critical at 10% slowdown |
| `--significance-level` | 0.05 | P-value threshold (α) |
| `--min-effect-size` | 0.2 | Minimum Cohen's d |
### 5. Baseline Manager (`baseline_manager.ts`)
Manage benchmark baselines for long-term performance tracking.
**Save a Baseline:**
```bash
deno run --allow-read --allow-write \
scripts/baseline_manager.ts save \
--input benchmarks/results/latest.json \
--name v1.0.0 \
--description "Release 1.0.0 baseline"
```
**List All Baselines:**
```bash
deno run --allow-read --allow-write \
scripts/baseline_manager.ts list
```
**Show Baseline Details:**
```bash
deno run --allow-read --allow-write \
scripts/baseline_manager.ts info \
--name v1.0.0
```
**Compare Against Baseline:**
```bash
# Console output
deno run --allow-read --allow-write \
scripts/baseline_manager.ts compare \
--baseline v1.0.0 \
--current benchmarks/results/latest.json
# Markdown report
deno run --allow-read --allow-write \
scripts/baseline_manager.ts compare \
--baseline v1.0.0 \
--current benchmarks/results/latest.json \
--format markdown \
--output comparison.md
# JSON export
deno run --allow-read --allow-write \
scripts/baseline_manager.ts compare \
--baseline v1.0.0 \
--current benchmarks/results/latest.json \
--format json \
--output comparison.json
```
**Delete Baseline:**
```bash
deno run --allow-read --allow-write \
scripts/baseline_manager.ts delete \
--name old-baseline \
--force
```
### 6. Bash Scripts
#### `run_benchmarks.sh`
Wrapper for running benchmarks with bashrs/hyperfine.
```bash
./scripts/run_benchmarks.sh \
--benchmarks critical \
--output benchmarks/results/latest.json \
--warmup 3 \
--iterations 10
```
#### `generate_reproducibility_manifest.sh`
Generate a complete reproducibility manifest.
```bash
./scripts/generate_reproducibility_manifest.sh \
benchmarks/metadata/toolchain_manifest.txt
```
## Typical Workflows
### Initial Baseline Creation
```bash
# 1. Run benchmarks
./scripts/run_benchmarks.sh --benchmarks all --output results.json
# 2. Save as baseline
deno run --allow-read --allow-write scripts/baseline_manager.ts save \
--input results.json \
--name main-baseline \
--description "Main branch baseline"
# 3. Generate reports
deno run --allow-read --allow-write scripts/generate_markdown_report.ts \
results.json report.md
```
### CI/CD Regression Check
```bash
# 1. Run current benchmarks
./scripts/run_benchmarks.sh --benchmarks critical --output current.json
# 2. Check for regressions
deno run --allow-read --allow-write scripts/check_regression.ts \
--baseline benchmarks/baselines/main.json \
--current current.json \
--output-json regression.json
# 3. If no critical regressions (exit 0 or 1), generate comparison report
if [ $? -ne 2 ]; then
deno run --allow-read --allow-write scripts/baseline_manager.ts compare \
--baseline main \
--current current.json \
--format markdown \
--output comparison.md
fi
```
### Release Performance Report
```bash
# 1. Run comprehensive benchmarks
./scripts/run_benchmarks.sh --benchmarks all --profiles all --output release.json
# 2. Compare against previous release
deno run --allow-read --allow-write scripts/baseline_manager.ts compare \
--baseline v1.0.0 \
--current release.json \
--format markdown \
--output release-comparison.md
# 3. Generate all export formats
deno run --allow-read --allow-write scripts/generate_csv_report.ts \
release.json reports/ --multi
deno run --allow-read --allow-write scripts/generate_markdown_report.ts \
release.json reports/full-report.md
# 4. Save as new baseline
deno run --allow-read --allow-write scripts/baseline_manager.ts save \
--input release.json \
--name v1.1.0 \
--description "Release 1.1.0 baseline"
```
## Statistical Methodology
### Descriptive Statistics
- **Mean**: Arithmetic average of measurements
- **Median**: 50th percentile (robust to outliers)
- **Standard Deviation**: Measurement variability
- **Coefficient of Variation**: Std dev / mean (stability indicator)
- **Confidence Interval (95%)**: Bootstrap method with 1000 samples
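The percentile bootstrap behind the 95% interval can be sketched in a few lines. This is an illustrative standalone helper, not the exported `calculateStatistics` implementation; the 1000-iteration default mirrors the description above.
```typescript
// Illustrative percentile-bootstrap 95% CI for the mean (not the library's exact code).
function bootstrapCI95(data: number[], iterations = 1000): [number, number] {
  const means: number[] = [];
  for (let i = 0; i < iterations; i++) {
    // Resample with replacement and record the resample mean.
    let sum = 0;
    for (let j = 0; j < data.length; j++) {
      sum += data[Math.floor(Math.random() * data.length)];
    }
    means.push(sum / data.length);
  }
  means.sort((a, b) => a - b);
  // The 2.5th and 97.5th percentiles of the resample means bound the interval.
  const lower = means[Math.floor(0.025 * iterations)];
  const upper = means[Math.ceil(0.975 * iterations) - 1];
  return [lower, upper];
}
```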
### Comparative Analysis
- **Speedup Ratio**: `baseline_mean / current_mean`
- \> 1.0 = improvement (faster)
- < 1.0 = regression (slower)
- **Effect Size (Cohen's d)**: Standardized mean difference
- < 0.2 = negligible
- 0.2 - 0.5 = small
- 0.5 - 0.8 = medium
- \> 0.8 = large
- **Statistical Significance**: Welch's t-test
- P-value < 0.05 = statistically significant
- Accounts for unequal variances and sample sizes
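For reference, the formulas behind these comparisons look roughly like the following sketch. The helper names are hypothetical; the library exports its own versions (e.g. `calculateCohenD`).
```typescript
// Illustrative formulas only; the library exports its own implementations.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}
// Cohen's d: standardized mean difference using the pooled standard deviation.
function cohensD(a: number[], b: number[]): number {
  const pooledSd = Math.sqrt(
    ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) /
      (a.length + b.length - 2),
  );
  return (mean(a) - mean(b)) / pooledSd;
}
// Welch's t statistic: makes no equal-variance or equal-sample-size assumption.
function welchT(a: number[], b: number[]): number {
  return (mean(a) - mean(b)) /
    Math.sqrt(sampleVariance(a) / a.length + sampleVariance(b) / b.length);
}
```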
### Outlier Detection
- **Method**: Interquartile Range (IQR)
- **Threshold**: Values outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR]
- **Conservative approach**: Does not automatically remove outliers
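A minimal sketch of the IQR rule, flagging rather than removing values (hypothetical helper, not the exported `detectOutliers`):
```typescript
// Illustrative IQR outlier flagging; flagged values are reported, never dropped.
function flagOutliersIQR(data: number[]): number[] {
  const sorted = [...data].sort((a, b) => a - b);
  // Simple index-based percentile; the library may interpolate differently.
  const q = (p: number) => sorted[Math.floor(p * (sorted.length - 1))];
  const q1 = q(0.25);
  const q3 = q(0.75);
  const iqr = q3 - q1;
  const lower = q1 - 1.5 * iqr;
  const upper = q3 + 1.5 * iqr;
  return data.filter((x) => x < lower || x > upper);
}
```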
## Data Format
All tools expect benchmark results in this JSON schema:
```json
{
"schema_version": "1.0",
"metadata": {
"benchmark_suite": "certeza-benchmarks",
"timestamp": "2025-11-18T10:30:00Z",
"git_commit": "abc123",
"git_branch": "main",
"operator": "ci-runner",
"hardware": { ... },
"software": { ... },
"environment": { ... }
},
"benchmarks": [
{
"benchmark_name": "trueno_vec_push",
"benchmark_type": "microbenchmark",
"optimization_level": "release",
"warmup_iterations": 3,
"measured_iterations": 10,
"raw_timings_ms": [21.4, 21.3, ...],
"statistics": { ... },
"comparison": { ... }
}
],
"summary": { ... }
}
```
See `src/benchmark/mod.rs` for full Rust type definitions.
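If you consume these files from your own Deno scripts, a deliberately partial TypeScript shape for the fields shown above might look like the following; the authoritative definitions remain in `src/benchmark/mod.rs`.
```typescript
// Partial, illustrative types for the fields shown above; see src/benchmark/mod.rs
// for the authoritative schema.
interface BenchmarkEntry {
  benchmark_name: string;
  benchmark_type: string;
  optimization_level: string;
  warmup_iterations: number;
  measured_iterations: number;
  raw_timings_ms: number[];
  statistics?: Record<string, unknown>;
  comparison?: Record<string, unknown>;
}

interface BenchmarkResults {
  schema_version: string;
  metadata: Record<string, unknown>;
  benchmarks: BenchmarkEntry[];
  summary?: Record<string, unknown>;
}

// Example: load and type a results file.
const results: BenchmarkResults = JSON.parse(
  await Deno.readTextFile("benchmarks/results/latest.json"),
);
console.log(results.benchmarks.map((b) => b.benchmark_name));
```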
## Directory Structure
```
benchmarks/
├── baselines/ # Saved baselines (managed by baseline_manager)
│ ├── index.json # Baseline registry
│ ├── v1.0.0.json # Baseline files
│ └── main.json
├── metadata/ # Reproducibility manifests
│ └── toolchain_manifest.txt
├── results/ # Benchmark results
│ └── latest.json
└── test_results/ # Test outputs
```
## Testing
Run the comprehensive test suite:
```bash
deno run --allow-read --allow-write scripts/test_reporting.ts
```
This validates:
- Statistical utilities
- CSV export (single and multi-file)
- Markdown report generation
- Baseline management
## Performance Considerations
- Bootstrap confidence intervals use 1000 iterations (configurable)
- Welch's t-test uses normal approximation for large samples (df > 30)
- All statistical calculations are vectorized where possible
- File I/O uses streaming for large datasets
## Contributing
When adding new statistical methods:
1. Add core function to `statistical_analysis.ts`
2. Export for library use
3. Add tests to `test_reporting.ts`
4. Update this README with methodology
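As an illustration of steps 1–3, a hypothetical new method could follow the pattern of the existing exports (the function name and test below are examples only, not part of the current library):
```typescript
// Hypothetical addition to statistical_analysis.ts: export the new method.
export function medianAbsoluteDeviation(data: number[]): number {
  const sorted = [...data].sort((a, b) => a - b);
  const mid = sorted[Math.floor(sorted.length / 2)]; // middle element (simplified median)
  const deviations = data
    .map((x) => Math.abs(x - mid))
    .sort((a, b) => a - b);
  return deviations[Math.floor(deviations.length / 2)];
}

// Hypothetical addition to test_reporting.ts: cover it with a Deno test.
import { assertEquals } from "https://deno.land/std@0.224.0/assert/mod.ts";
Deno.test("medianAbsoluteDeviation of constant data is 0", () => {
  assertEquals(medianAbsoluteDeviation([5, 5, 5, 5]), 0);
});
```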
## References
- **Bootstrap Methods**: Efron & Tibshirani (1993)
- **Cohen's d**: Cohen (1988)
- **Welch's t-test**: Welch (1947)
- **IQR Outlier Detection**: Tukey (1977)
## License
Part of the certeza project. See repository root for license information.