# Certeza Benchmark Reporting Tools

Phase 3.3 implementation: Statistical analysis and reporting tools for scientific benchmarking.

## Overview

This directory contains Deno TypeScript tools for comprehensive benchmark analysis and reporting:

- **Statistical Analysis**: Descriptive statistics, outlier detection, comparative analysis
- **Report Generation**: CSV and Markdown export formats
- **Regression Detection**: Automated performance regression detection with configurable thresholds
- **Baseline Management**: Save, compare, and track benchmark baselines over time

## Prerequisites

- [Deno](https://deno.land/) runtime (v1.38+)
- Benchmark results in JSON format (generated by Rust benchmarking framework)

## Installation

Install Deno:

```bash
curl -fsSL https://deno.land/install.sh | sh
```

All scripts are standalone and require no additional dependencies.

## Tools

### 1. Statistical Analysis (`statistical_analysis.ts`)

Core statistical utilities library providing:

- **Descriptive Statistics**: mean, median, std dev, coefficient of variation
- **Confidence Intervals**: Bootstrap method (1000 iterations)
- **Outlier Detection**: IQR-based method
- **Comparative Analysis**: Cohen's d effect size, Welch's t-test
- **Speedup Calculation**: Ratio with confidence intervals

**Usage as Library:**

```typescript
import {
  calculateStatistics,
  detectOutliers,
  calculateCohenD,
  calculateSpeedup,
} from "./statistical_analysis.ts";

const data = [21.4, 21.3, 20.1, 21.9, 21.4];
const stats = calculateStatistics(data);
console.log(stats.mean, stats.confidenceInterval95);
```

**CLI Demo:**

```bash
deno run --allow-read scripts/statistical_analysis.ts
```

### 2. CSV Report Generator (`generate_csv_report.ts`)

Export benchmark results to CSV format with multiple output modes.

**Single-file Mode** (summary only):

```bash
deno run --allow-read --allow-write \
  scripts/generate_csv_report.ts \
  benchmarks/results/latest.json \
  report.csv
```

**Multi-file Mode** (summary + metadata + raw timings):

```bash
deno run --allow-read --allow-write \
  scripts/generate_csv_report.ts \
  benchmarks/results/latest.json \
  reports/ \
  --multi
```

Generates:
- `benchmarks_summary.csv` - Main results table
- `metadata.csv` - Hardware/software environment
- `raw_timings.csv` - All measurement iterations
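
As a rough illustration of the `raw_timings.csv` output, the sketch below flattens each benchmark's `raw_timings_ms` array into one row per iteration. The column names and the `csvField` and `toRawTimingsCsv` helpers are assumptions made for this example; the layout actually written by `generate_csv_report.ts` may differ.

```typescript
// Hypothetical sketch: not the actual generate_csv_report.ts implementation.
interface TimedBenchmark {
  benchmark_name: string;
  raw_timings_ms: number[];
}

// Quote a field when it contains a comma, double quote, or line break,
// doubling any embedded double quotes (RFC 4180 style).
function csvField(value: string | number): string {
  const text = String(value);
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// One row per measured iteration; the column names are assumed.
function toRawTimingsCsv(benchmarks: TimedBenchmark[]): string {
  const rows = ["benchmark_name,iteration,timing_ms"];
  for (const b of benchmarks) {
    b.raw_timings_ms.forEach((t, i) => {
      rows.push([csvField(b.benchmark_name), i + 1, t].join(","));
    });
  }
  return rows.join("\n") + "\n";
}

console.log(toRawTimingsCsv([
  { benchmark_name: "trueno_vec_push", raw_timings_ms: [21.4, 21.3] },
]));
```

A long format like this (one row per iteration) tends to be easier to load into spreadsheet or plotting tools than one wide row per benchmark.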

### 3. Markdown Report Generator (`generate_markdown_report.ts`)

Generate comprehensive GitHub-flavored Markdown reports.

**Usage:**

```bash
deno run --allow-read --allow-write \
  scripts/generate_markdown_report.ts \
  benchmarks/results/latest.json \
  report.md
```

**Report Sections:**
- Executive summary with key findings
- Benchmark results table with statistics
- Performance comparisons (if baseline exists)
- Regression/improvement warnings
- Hardware and software environment details
- Statistical methodology documentation
- Reproducibility instructions

### 4. Regression Detection (`check_regression.ts`)

Automated performance regression detection with statistical significance testing.

**Basic Usage:**

```bash
deno run --allow-read --allow-write \
  scripts/check_regression.ts \
  --baseline benchmarks/baselines/v1.0.0.json \
  --current benchmarks/results/latest.json
```

**Custom Thresholds:**

```bash
deno run --allow-read --allow-write \
  scripts/check_regression.ts \
  --baseline benchmarks/baselines/v1.0.0.json \
  --current benchmarks/results/latest.json \
  --warning-threshold 0.15 \
  --critical-threshold 0.25
```

**CI/CD Integration:**

```bash
# Exit codes:
#   0 = No regressions
#   1 = Warnings detected
#   2 = Critical regressions detected
#   3 = Error

deno run --allow-read --allow-write \
  scripts/check_regression.ts \
  --baseline benchmarks/baselines/main.json \
  --current benchmarks/results/latest.json \
  --output-json regression_report.json \
  --quiet

EXIT_CODE=$?
if [ "$EXIT_CODE" -eq 2 ]; then
  echo "CRITICAL REGRESSION DETECTED!"
  exit 1
elif [ "$EXIT_CODE" -eq 3 ]; then
  echo "Regression check failed to run"
  exit 1
fi
```

**Configuration:**

| Option | Default | Description |
|--------|---------|-------------|
| `--warning-threshold` | 0.05 | Warning at 5% slowdown |
| `--critical-threshold` | 0.10 | Critical at 10% slowdown |
| `--significance-level` | 0.05 | P-value threshold (α) |
| `--min-effect-size` | 0.2 | Minimum Cohen's d |
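
The sketch below shows one plausible way the four options could combine into a severity and an exit code. It assumes a slowdown only counts when it is both statistically significant (p-value below `--significance-level`) and at least `--min-effect-size` in magnitude; that combination rule, along with the `classify` and `exitCodeFor` names, is an assumption for illustration and may not match `check_regression.ts`.

```typescript
// Hypothetical sketch: the real check_regression.ts logic may differ.
interface RegressionConfig {
  warningThreshold: number;   // --warning-threshold
  criticalThreshold: number;  // --critical-threshold
  significanceLevel: number;  // --significance-level
  minEffectSize: number;      // --min-effect-size
}

// Defaults taken from the configuration table above.
const DEFAULTS: RegressionConfig = {
  warningThreshold: 0.05,
  criticalThreshold: 0.10,
  significanceLevel: 0.05,
  minEffectSize: 0.2,
};

type Severity = "ok" | "warning" | "critical";

function classify(
  baselineMean: number,
  currentMean: number,
  pValue: number,
  cohenD: number,
  cfg: RegressionConfig = DEFAULTS,
): Severity {
  // Relative slowdown: 0.12 means the current run is 12% slower.
  const slowdown = (currentMean - baselineMean) / baselineMean;
  // Assumed rule: ignore differences that are not significant or too small.
  const meaningful = pValue < cfg.significanceLevel &&
    Math.abs(cohenD) >= cfg.minEffectSize;
  if (!meaningful) return "ok";
  if (slowdown >= cfg.criticalThreshold) return "critical";
  if (slowdown >= cfg.warningThreshold) return "warning";
  return "ok";
}

// Map the worst severity to the documented exit codes (0, 1, 2).
function exitCodeFor(severities: Severity[]): number {
  if (severities.some((s) => s === "critical")) return 2;
  if (severities.some((s) => s === "warning")) return 1;
  return 0;
}

console.log(classify(100, 112, 0.01, 1.0)); // 12% slower, significant
console.log(exitCodeFor(["ok", "warning"]));
```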

### 5. Baseline Manager (`baseline_manager.ts`)

Manage benchmark baselines for long-term performance tracking.

**Save a Baseline:**

```bash
deno run --allow-read --allow-write \
  scripts/baseline_manager.ts save \
  --input benchmarks/results/latest.json \
  --name v1.0.0 \
  --description "Release 1.0.0 baseline"
```

**List All Baselines:**

```bash
deno run --allow-read --allow-write \
  scripts/baseline_manager.ts list
```

**Show Baseline Details:**

```bash
deno run --allow-read --allow-write \
  scripts/baseline_manager.ts info \
  --name v1.0.0
```

**Compare Against Baseline:**

```bash
# Console output
deno run --allow-read --allow-write \
  scripts/baseline_manager.ts compare \
  --baseline v1.0.0 \
  --current benchmarks/results/latest.json

# Markdown report
deno run --allow-read --allow-write \
  scripts/baseline_manager.ts compare \
  --baseline v1.0.0 \
  --current benchmarks/results/latest.json \
  --format markdown \
  --output comparison.md

# JSON export
deno run --allow-read --allow-write \
  scripts/baseline_manager.ts compare \
  --baseline v1.0.0 \
  --current benchmarks/results/latest.json \
  --format json \
  --output comparison.json
```

**Delete Baseline:**

```bash
deno run --allow-read --allow-write \
  scripts/baseline_manager.ts delete \
  --name old-baseline \
  --force
```

### 6. Bash Scripts

#### `run_benchmarks.sh`

Wrapper for running benchmarks with bashrs/hyperfine.

```bash
./scripts/run_benchmarks.sh \
  --benchmarks critical \
  --output benchmarks/results/latest.json \
  --warmup 3 \
  --iterations 10
```

#### `generate_reproducibility_manifest.sh`

Generate a complete reproducibility manifest.

```bash
./scripts/generate_reproducibility_manifest.sh \
  benchmarks/metadata/toolchain_manifest.txt
```

## Typical Workflows

### Initial Baseline Creation

```bash
# 1. Run benchmarks
./scripts/run_benchmarks.sh --benchmarks all --output results.json

# 2. Save as baseline
deno run --allow-read --allow-write scripts/baseline_manager.ts save \
  --input results.json \
  --name main \
  --description "Main branch baseline"

# 3. Generate reports
deno run --allow-read --allow-write scripts/generate_markdown_report.ts \
  results.json report.md
```

### CI/CD Regression Check

```bash
# 1. Run current benchmarks
./scripts/run_benchmarks.sh --benchmarks critical --output current.json

# 2. Check for regressions
deno run --allow-read --allow-write scripts/check_regression.ts \
  --baseline benchmarks/baselines/main.json \
  --current current.json \
  --output-json regression.json

# 3. If no critical regressions (exit 0 or 1), generate comparison report
if [ $? -le 1 ]; then
  deno run --allow-read --allow-write scripts/baseline_manager.ts compare \
    --baseline main \
    --current current.json \
    --format markdown \
    --output comparison.md
fi
```

### Release Performance Report

```bash
# 1. Run comprehensive benchmarks
./scripts/run_benchmarks.sh --benchmarks all --profiles all --output release.json

# 2. Compare against previous release
deno run --allow-read --allow-write scripts/baseline_manager.ts compare \
  --baseline v1.0.0 \
  --current release.json \
  --format markdown \
  --output release-comparison.md

# 3. Generate all export formats
deno run --allow-read --allow-write scripts/generate_csv_report.ts \
  release.json reports/ --multi

deno run --allow-read --allow-write scripts/generate_markdown_report.ts \
  release.json reports/full-report.md

# 4. Save as new baseline
deno run --allow-read --allow-write scripts/baseline_manager.ts save \
  --input release.json \
  --name v1.1.0 \
  --description "Release 1.1.0 baseline"
```

## Statistical Methodology

### Descriptive Statistics

- **Mean**: Arithmetic average of measurements
- **Median**: 50th percentile (robust to outliers)
- **Standard Deviation**: Measurement variability
- **Coefficient of Variation**: Std dev / mean (stability indicator)
- **Confidence Interval (95%)**: Bootstrap method with 1000 resamples
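
A minimal sketch of these quantities is shown below. It assumes the sample standard deviation (n - 1 denominator) and a percentile bootstrap of the mean; the function names are illustrative, and `statistical_analysis.ts` may use different names or a different bootstrap variant.

```typescript
// Hypothetical sketch: names do not necessarily match statistical_analysis.ts.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Sample standard deviation (n - 1 denominator, an assumption).
function stdDev(xs: number[]): number {
  const m = mean(xs);
  const sumSq = xs.reduce((acc, x) => acc + (x - m) ** 2, 0);
  return Math.sqrt(sumSq / (xs.length - 1));
}

// Percentile bootstrap: resample with replacement, take the mean of each
// resample, and read off the 2.5th and 97.5th percentiles.
function bootstrapCI95(
  xs: number[],
  resamples = 1000,
  rand: () => number = Math.random,
): [number, number] {
  const means: number[] = [];
  for (let i = 0; i < resamples; i++) {
    let sum = 0;
    for (let j = 0; j < xs.length; j++) {
      sum += xs[Math.floor(rand() * xs.length)];
    }
    means.push(sum / xs.length);
  }
  means.sort((a, b) => a - b);
  const lower = means[Math.floor(0.025 * resamples)];
  const upper = means[Math.min(resamples - 1, Math.floor(0.975 * resamples))];
  return [lower, upper];
}

const timings = [21.4, 21.3, 20.1, 21.9, 21.4];
const avg = mean(timings);
console.log({
  mean: avg,
  median: median(timings),
  coefficientOfVariation: stdDev(timings) / avg,
  ci95: bootstrapCI95(timings),
});
```

Because the bootstrap draws random resamples, the interval varies slightly from run to run unless a seeded generator is passed as `rand`.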

### Comparative Analysis

- **Speedup Ratio**: `baseline_mean / current_mean`
  - \> 1.0 = improvement (faster)
  - < 1.0 = regression (slower)

- **Effect Size (Cohen's d)**: Standardized mean difference, interpreted by its absolute value
  - < 0.2 = negligible
  - 0.2 - 0.5 = small
  - 0.5 - 0.8 = medium
  - \> 0.8 = large

- **Statistical Significance**: Welch's t-test
  - P-value < 0.05 = statistically significant
  - Accounts for unequal variances and sample sizes
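
The three comparisons can be sketched as follows. This version assumes Cohen's d uses the pooled standard deviation and that Welch's degrees of freedom come from the Welch-Satterthwaite formula; the function names and the sign convention (baseline minus current) are illustrative and not taken from `statistical_analysis.ts`.

```typescript
// Hypothetical sketch: names and sign conventions are assumptions.
function sampleMean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs: number[]): number {
  const m = sampleMean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
}

// Speedup ratio: > 1.0 means the current run is faster than the baseline.
function speedupRatio(baseline: number[], current: number[]): number {
  return sampleMean(baseline) / sampleMean(current);
}

// Cohen's d with a pooled standard deviation.
function cohenD(baseline: number[], current: number[]): number {
  const nb = baseline.length;
  const nc = current.length;
  const pooledSd = Math.sqrt(
    ((nb - 1) * sampleVariance(baseline) + (nc - 1) * sampleVariance(current)) /
      (nb + nc - 2),
  );
  return (sampleMean(baseline) - sampleMean(current)) / pooledSd;
}

// Welch's t statistic and Welch-Satterthwaite degrees of freedom.
function welchT(baseline: number[], current: number[]): { t: number; df: number } {
  const vb = sampleVariance(baseline) / baseline.length;
  const vc = sampleVariance(current) / current.length;
  const t = (sampleMean(baseline) - sampleMean(current)) / Math.sqrt(vb + vc);
  const df = (vb + vc) ** 2 /
    (vb ** 2 / (baseline.length - 1) + vc ** 2 / (current.length - 1));
  return { t, df };
}

const baselineMs = [20, 22, 21, 23];
const currentMs = [10, 11, 10.5, 11.5];
console.log(speedupRatio(baselineMs, currentMs));
console.log(cohenD(baselineMs, currentMs));
console.log(welchT(baselineMs, currentMs));
```

Turning `t` and `df` into a p-value additionally requires the t-distribution CDF (or an approximation of it), which is omitted here to keep the sketch short.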

### Outlier Detection

- **Method**: Interquartile Range (IQR)
- **Threshold**: Values outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR]
- **Conservative approach**: Does not automatically remove outliers
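
A minimal sketch of the IQR rule is shown below. Quartiles are computed by linear interpolation between order statistics, which is one common convention; the convention used by `statistical_analysis.ts` is not stated here, so treat that choice and the `quantile` and `findOutliers` names as assumptions.

```typescript
// Hypothetical sketch: quartile convention and names are assumptions.
// Linear-interpolation quantile over an already sorted array.
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (pos - lower);
}

interface OutlierReport {
  lowerFence: number;
  upperFence: number;
  outliers: number[];
}

// Flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] without removing them.
function findOutliers(xs: number[]): OutlierReport {
  const sorted = [...xs].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const lowerFence = q1 - 1.5 * iqr;
  const upperFence = q3 + 1.5 * iqr;
  const outliers = xs.filter((x) => x < lowerFence || x > upperFence);
  return { lowerFence, upperFence, outliers };
}

// The input array is left untouched; outliers are only reported.
const measurements = [21.4, 21.3, 20.1, 21.9, 21.4, 35.0];
console.log(findOutliers(measurements));
```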

## Data Format

All tools expect benchmark results in this JSON schema:

```json
{
  "schema_version": "1.0",
  "metadata": {
    "benchmark_suite": "certeza-benchmarks",
    "timestamp": "2025-11-18T10:30:00Z",
    "git_commit": "abc123",
    "git_branch": "main",
    "operator": "ci-runner",
    "hardware": { ... },
    "software": { ... },
    "environment": { ... }
  },
  "benchmarks": [
    {
      "benchmark_name": "trueno_vec_push",
      "benchmark_type": "microbenchmark",
      "optimization_level": "release",
      "warmup_iterations": 3,
      "measured_iterations": 10,
      "raw_timings_ms": [21.4, 21.3, ...],
      "statistics": { ... },
      "comparison": { ... }
    }
  ],
  "summary": { ... }
}
```

See `src/benchmark/mod.rs` for full Rust type definitions.
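
For TypeScript consumers, the schema could be mirrored roughly as follows. The interface names and the `parseResults` helper are hypothetical, and the fields elided with `...` above are typed as `unknown` rather than guessed; the Rust definitions remain the source of truth.

```typescript
// Hypothetical TypeScript mirror of the JSON schema above.
interface BenchmarkEntry {
  benchmark_name: string;
  benchmark_type: string;
  optimization_level: string;
  warmup_iterations: number;
  measured_iterations: number;
  raw_timings_ms: number[];
  statistics?: unknown; // elided in the schema above
  comparison?: unknown; // elided in the schema above
}

interface BenchmarkResults {
  schema_version: string;
  metadata: Record<string, unknown>;
  benchmarks: BenchmarkEntry[];
  summary?: unknown;
}

// Parse a results document and check only the fields the examples rely on.
function parseResults(text: string): BenchmarkResults {
  const data = JSON.parse(text);
  if (typeof data !== "object" || data === null) {
    throw new Error("results must be a JSON object");
  }
  if (typeof data.schema_version !== "string") {
    throw new Error("missing schema_version");
  }
  if (!Array.isArray(data.benchmarks)) {
    throw new Error("benchmarks must be an array");
  }
  for (const b of data.benchmarks) {
    const timingsOk = Array.isArray(b.raw_timings_ms) &&
      b.raw_timings_ms.every((t: unknown) => typeof t === "number");
    if (typeof b.benchmark_name !== "string" || !timingsOk) {
      throw new Error("each benchmark needs benchmark_name and numeric raw_timings_ms");
    }
  }
  return data as BenchmarkResults;
}

const sampleJson = `{
  "schema_version": "1.0",
  "metadata": { "benchmark_suite": "certeza-benchmarks" },
  "benchmarks": [
    { "benchmark_name": "trueno_vec_push", "raw_timings_ms": [21.4, 21.3] }
  ]
}`;
const parsed = parseResults(sampleJson);
console.log(parsed.benchmarks.map((b) => `${b.benchmark_name}: ${b.raw_timings_ms.length} timings`));
```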

## Directory Structure

```
benchmarks/
├── baselines/           # Saved baselines (managed by baseline_manager)
│   ├── index.json       # Baseline registry
│   ├── v1.0.0.json      # Baseline files
│   └── main.json
├── metadata/            # Reproducibility manifests
│   └── toolchain_manifest.txt
├── results/             # Benchmark results
│   └── latest.json
└── test_results/        # Test outputs
```

## Testing

Run the comprehensive test suite:

```bash
deno run --allow-read --allow-write scripts/test_reporting.ts
```

This validates:
- Statistical utilities
- CSV export (single and multi-file)
- Markdown report generation
- Baseline management

## Performance Considerations

- Bootstrap confidence intervals use 1000 iterations (configurable)
- Welch's t-test uses normal approximation for large samples (df > 30)
- All statistical calculations are vectorized where possible
- File I/O uses streaming for large datasets
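
To illustrate the normal approximation mentioned above, the sketch below converts a t statistic into a two-sided p-value by treating it as standard normal, using the Abramowitz and Stegun 7.1.26 approximation of the error function. The function names are hypothetical, and `statistical_analysis.ts` may compute the p-value differently.

```typescript
// Hypothetical sketch of a normal-approximation p-value for large df.
// Abramowitz & Stegun 7.1.26: absolute error of roughly 1.5e-7.
function erfApprox(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  // Horner evaluation of a1*t + a2*t^2 + a3*t^3 + a4*t^4 + a5*t^5.
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) *
        t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Two-sided p-value: P(|Z| > |t|) = 1 - erf(|t| / sqrt(2)).
function twoSidedPValueNormal(t: number): number {
  return 1 - erfApprox(Math.abs(t) / Math.SQRT2);
}

console.log(twoSidedPValueNormal(1.96)); // close to 0.05
console.log(twoSidedPValueNormal(3.0));
```

For small samples the t distribution has noticeably heavier tails than the normal, so this shortcut understates the p-value there; that is why the approximation is restricted to large degrees of freedom.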

## Contributing

When adding new statistical methods:

1. Add core function to `statistical_analysis.ts`
2. Export for library use
3. Add tests to `test_reporting.ts`
4. Update this README with methodology

## References

- **Bootstrap Methods**: Efron & Tibshirani (1993)
- **Cohen's d**: Cohen (1988)
- **Welch's t-test**: Welch (1947)
- **IQR Outlier Detection**: Tukey (1977)

## License

Part of the certeza project. See repository root for license information.