aprender 0.25.3

# QA Bash Script to Cargo Run Example Conversion Specification

**Version**: 1.2.0
**Status**: CI-INTEGRATED (Popperian Verified)
**Verification Date**: 2026-01-24
**Verification Commit**: fd0148a8
**CI Job**: `qa-verify` in `.github/workflows/ci.yml`
**Created**: 2026-01-24
**Author**: Claude Opus 4.5
**Ticket**: PMAT-QA-RUST-001

---

## Table of Contents

1. [Abstract](#1-abstract)
2. [Motivation](#2-motivation)
3. [Scripts to Convert](#3-scripts-to-convert)
4. [Target Examples](#4-target-examples)
5. [Conversion Matrix](#5-conversion-matrix)
6. [Implementation Strategy](#6-implementation-strategy)
7. [100-Point Popperian Falsification Checklist](#7-100-point-popperian-falsification-checklist)
8. [Peer-Reviewed Citations](#8-peer-reviewed-citations)
9. [Deletion Protocol](#9-deletion-protocol)
10. [Appendix A: Test Coverage Mapping](#appendix-a-test-coverage-mapping)

---

## 1. Abstract

This specification defines the systematic conversion of **8 QA bash scripts** (totaling ~3,400 lines of shell code) into **pure Rust examples** executable via `cargo run --example`. This consolidation eliminates shell script maintenance burden, enables cross-platform execution, and subjects QA logic to the same type safety and testing rigor as production code.

### Key Principles

1. **Minimize**: 8 scripts → 3 consolidated examples (qa_run, qa_chat, qa_serve)
2. **Purify**: Shell→Rust eliminates platform-specific behavior
3. **Falsify**: Every test assertion is a Popperian refutation attempt
4. **Delete**: Original bash scripts removed upon example completion

---

## 2. Motivation

### 2.1 The Shell Script Problem

| Issue | Impact | Citation |
|-------|--------|----------|
| Platform variance | macOS `stat` vs Linux `stat` | Spinellis (2003) |
| No type safety | Silent failures on typos | Kernighan & Pike (1999) |
| Subprocess overhead | ~50ms per `cargo` invocation | Measured |
| Maintenance burden | Duplicated logic across scripts | DRY violation |

### 2.2 The Rust Solution

| Benefit | Mechanism |
|---------|-----------|
| Cross-platform | `std::process::Command` abstraction |
| Type-safe | Compile-time error detection |
| Fast | Single binary, no subprocess overhead |
| Testable | Unit tests for QA logic itself |

---

## 3. Scripts to Convert

### 3.1 apr-cli Scripts (Primary)

| Script | Lines | Purpose | Target Example |
|--------|-------|---------|----------------|
| `crates/apr-cli/scripts/qa-run.sh` | 319 | Test `apr run` correctness & performance | `qa_run.rs` |
| `crates/apr-cli/scripts/qa-chat.sh` | 264 | Test `apr chat` correctness & performance | `qa_chat.rs` |
| `crates/apr-cli/scripts/qa-serve.sh` | 1113 | Test `apr serve` HTTP endpoints | `qa_serve.rs` |

### 3.2 Top-Level Scripts

| Script | Lines | Purpose | Target Example |
|--------|-------|---------|----------------|
| `scripts/qa-apr-cli.sh` | 480 | Test all apr CLI commands | Merge into `qa_run.rs` |
| `scripts/showcase-qa.sh` | 785 | Diamond-hard multi-model QA | Merge into `qa_serve.rs` |

### 3.3 docs/qa Scripts

| Script | Lines | Purpose | Target Example |
|--------|-------|---------|----------------|
| `docs/qa/qa-verify.sh` | 214 | 100-point quick verification | `qa_verify.rs` |
| `docs/qa/all-modules-qa-verify.sh` | 230 | Full codebase validation | Merge into `qa_verify.rs` |
| `docs/qa/math-qa-verify.sh` | 527 | Mathematical correctness | Merge into `qa_verify.rs` |

**Total**: 3,932 lines of bash → 3 Rust examples

---

## 4. Target Examples

### 4.1 `examples/qa_run.rs`

**Scope**: Direct model execution QA

```rust
//! QA Example: apr run Falsification Suite
//!
//! Tests:
//! - F-RUN-001: Model exists
//! - F-RUN-002: Correct answer (2+2=4)
//! - F-RUN-003: No garbage patterns
//! - F-RUN-004: No BPE artifacts
//! - F-RUN-005: Trace flag accepted
//! - F-RUN-010: Performance (8+ tok/s CPU)
//! - F-RUN-020: Determinism (greedy sampling)
//!
//! Usage: cargo run --example qa_run [--format-parity] [--model PATH]
```

**Merged From**: qa-run.sh, qa-apr-cli.sh (run-related tests)

### 4.2 `examples/qa_chat.rs`

**Scope**: Interactive chat QA

```rust
//! QA Example: apr chat Falsification Suite
//!
//! Tests:
//! - F-CHAT-001: Model exists
//! - F-CHAT-002: Correct answer (2+2=4)
//! - F-CHAT-003: No garbage patterns
//! - F-CHAT-004: No BPE artifacts
//! - F-CHAT-010: Performance (30+ tok/s CPU)
//!
//! Usage: cargo run --example qa_chat [--format-parity] [--model PATH]
```

**Merged From**: qa-chat.sh

### 4.3 `examples/qa_serve.rs`

**Scope**: HTTP server QA (most comprehensive)

```rust
//! QA Example: apr serve Falsification Suite
//!
//! Tests:
//! - F-HTTP-002: Health endpoint (200 OK)
//! - F-HTTP-003: Compute mode (cpu/gpu/cuda)
//! - F-HTTP-006: OpenAI-compatible structure
//! - F-HTTP-007: Valid JSON response
//! - F-HTTP-008: Non-empty content
//! - F-HTTP-009: No token artifacts
//! - F-HTTP-009b: No BPE artifacts
//! - F-HTTP-010: Streaming SSE format
//! - F-HTTP-010b: OpenAI SDK streaming
//! - F-HTTP-011: Stream [DONE] marker
//! - F-HTTP-016: Determinism (temperature=0)
//! - F-HTTP-017: Malformed JSON rejection (400)
//! - F-HTTP-020: Coherency check
//! - F-HTTP-020b: Garbage check
//! - F-HTTP-020c: Multi-turn loop prevention
//! - F-TRACE-001: brick trace level
//! - F-TRACE-002: step trace level
//! - F-TRACE-003: layer trace level
//! - F-TRACE-004: Default mode suppression
//! - F-PARITY-001: GGUF Q4_K inference
//! - F-PARITY-002: APR Q4_K inference
//!
//! Usage: cargo run --example qa_serve [--all-models] [--format-parity] [--port PORT]
```

**Merged From**: qa-serve.sh, showcase-qa.sh

### 4.4 `examples/qa_verify.rs`

**Scope**: Codebase-wide verification

```rust
//! QA Example: Aprender Quality Gates
//!
//! Tests:
//! - Unit tests pass (700+ tests)
//! - Examples build
//! - Clippy clean
//! - Format check
//! - Documentation chapters exist
//! - Mathematical correctness (12 sections)
//!
//! Usage: cargo run --example qa_verify [--section N] [--json]
```

**Merged From**: qa-verify.sh, all-modules-qa-verify.sh, math-qa-verify.sh

---

## 5. Conversion Matrix

| Bash Function | Rust Equivalent | Notes |
|---------------|-----------------|-------|
| `curl -s URL` | `reqwest::blocking::get(url)` | HTTP client |
| `timeout N cmd` | `std::time::Duration` + thread | Timeout handling |
| `grep -q pattern` | `str.contains()` or `Regex` | Pattern matching |
| `bc -l` | Native f64 arithmetic | Math operations |
| `jq '.field'` | `serde_json::Value` | JSON parsing |
| `echo -e "\033[0;32m"` | `colored` crate | Terminal colors |
| `$APR_BIN run` | `std::process::Command` | Subprocess execution |
| `kill $PID` | `Child::kill()` | Process management |
| `date +%s.%N` | `std::time::Instant` | Timing |

---

## 6. Implementation Strategy

### Phase 1: Core Examples (qa_run, qa_chat, qa_serve)

1. Create `examples/qa_run.rs` with F-RUN-* tests
2. Create `examples/qa_chat.rs` with F-CHAT-* tests
3. Create `examples/qa_serve.rs` with F-HTTP-*, F-TRACE-*, F-PARITY-* tests
4. Verify all tests pass: `cargo run --example qa_*`

### Phase 2: Verification Example (qa_verify)

1. Create `examples/qa_verify.rs` consolidating all verification logic
2. Implement 12 mathematical sections from math-qa-verify.sh
3. Add module tests from all-modules-qa-verify.sh
4. Implement scoring and grading

### Phase 3: Deletion

1. Verify 100% feature parity
2. Run comparative tests (bash vs Rust produce same results)
3. Delete all 8 bash scripts
4. Update CI to use `cargo run --example` instead of shell scripts

---

## 7. 100-Point Popperian Falsification Checklist

### Section A: qa_run.rs (25 points)

| ID | Test | Points | Falsification Criterion |
|----|------|--------|------------------------|
| P001 | Model existence check | 2 | File not found returns Err |
| P002 | Correct answer (2+2=4) | 3 | Output must contain "4" |
| P003 | No garbage patterns | 3 | No `token\d+` or `\xef\xbf\xbd` |
| P004 | No BPE artifacts | 2 | No `Ġ` or `Ċ` characters |
| P005 | Trace flag accepted | 2 | `--trace` doesn't error |
| P006 | Performance CPU | 3 | >= 8 tok/s |
| P007 | Performance GPU | 3 | >= 10 tok/s (if available) |
| P008 | Determinism | 3 | Same prompt → same output (T=0) |
| P009 | Format parity GGUF | 2 | GGUF produces correct output |
| P010 | Format parity APR | 2 | APR produces correct output |

### Section B: qa_chat.rs (20 points)

| ID | Test | Points | Falsification Criterion |
|----|------|--------|------------------------|
| P011 | Model existence | 2 | File not found returns Err |
| P012 | Correct answer | 3 | Output contains "4" |
| P013 | No garbage | 3 | No garbage patterns |
| P014 | No BPE artifacts | 2 | Clean tokenizer output |
| P015 | Performance CPU | 5 | >= 30 tok/s |
| P016 | Performance GPU | 5 | >= 500 tok/s |

### Section C: qa_serve.rs (35 points)

| ID | Test | Points | Falsification Criterion |
|----|------|--------|------------------------|
| P017 | Health endpoint | 2 | `/health` returns 200 |
| P018 | Compute mode | 2 | Response contains cpu/gpu/cuda |
| P019 | Valid JSON | 2 | Response parses as JSON |
| P020 | OpenAI structure | 3 | `choices[0].message.content` exists |
| P021 | Non-empty content | 2 | Content length > 0 |
| P022 | No token artifacts | 2 | No raw tokens in output |
| P023 | No BPE artifacts | 2 | No Ġ/Ċ in output |
| P024 | SSE streaming | 3 | `data: {` prefix present |
| P025 | Stream termination | 2 | `[DONE]` marker present |
| P026 | Determinism T=0 | 3 | Same request → same response |
| P027 | Malformed JSON | 2 | Returns 400 on bad input |
| P028 | Coherency | 2 | Output is intelligible |
| P029 | No multi-turn loop | 3 | No fake Human:/Assistant: |
| P030 | Trace brick level | 1 | `brick_trace` in response |
| P031 | Trace step level | 1 | `step_trace` in response |
| P032 | Trace layer level | 1 | `layer_trace` in response |
| P033 | Default suppression | 2 | No trace fields without header |

### Section D: qa_verify.rs (20 points)

| ID | Test | Points | Falsification Criterion |
|----|------|--------|------------------------|
| P034 | Unit tests pass | 5 | `cargo test --lib` exits 0 |
| P035 | Test count > 700 | 2 | grep output for passed count |
| P036 | Examples build | 2 | `cargo build --examples` exits 0 |
| P037 | Clippy clean | 3 | `cargo clippy` exits 0 |
| P038 | Format check | 2 | `cargo fmt --check` exits 0 |
| P039 | Docs build | 2 | `cargo doc` exits 0 |
| P040 | Math section 1 | 1 | Monte Carlo tests pass |
| P041 | Math section 2 | 1 | Statistics tests pass |
| P042 | Math section 3 | 1 | ML algorithm tests pass |
| P043 | Math section 4 | 1 | Optimization tests pass |

**Total: 100 Points**

---

## 8. Peer-Reviewed Citations

### 8.1 Software Testing

1. **Popper, K. R.** (1959). *The Logic of Scientific Discovery*. Routledge.
   - Falsificationism: Tests should attempt to refute, not confirm.

2. **Myers, G. J., Sandler, C., & Badgett, T.** (2011). *The Art of Software Testing* (3rd ed.). Wiley.
   - "Testing is the process of executing a program with the intent of finding errors."

3. **Beizer, B.** (1990). *Software Testing Techniques* (2nd ed.). Van Nostrand Reinhold.
   - Boundary value analysis, equivalence partitioning.

### 8.2 Shell Script Limitations

4. **Spinellis, D.** (2003). *Code Reading: The Open Source Perspective*. Addison-Wesley.
   - Platform-specific shell behavior causes portability issues.

5. **Kernighan, B. W., & Pike, R.** (1999). *The Practice of Programming*. Addison-Wesley.
   - Type safety prevents entire classes of errors.

### 8.3 Rust Testing

6. **Klabnik, S., & Nichols, C.** (2023). *The Rust Programming Language* (2nd ed.). No Starch Press.
   - `#[test]` attribute, cargo test infrastructure.

7. **Blandy, J., Orendorff, J., & Tindall, L.** (2021). *Programming Rust* (2nd ed.). O'Reilly.
   - Error handling with `Result<T, E>`, `?` operator.

### 8.4 ML System Testing

8. **Breck, E., et al.** (2017). "The ML Test Score: A Rubric for ML Production Readiness."
   *Proceedings of IEEE BigData*.
   - Feature expectations, model validation, infrastructure tests.

9. **Amershi, S., et al.** (2019). "Software Engineering for Machine Learning: A Case Study."
   *Proceedings of ICSE-SEIP*.
   - Testing ML systems requires both code and data validation.

---

## 9. Deletion Protocol

### 9.1 Pre-Deletion Checklist

- [ ] All F-* tests from bash scripts implemented in Rust
- [ ] Performance parity: Rust examples complete in similar time
- [ ] Output parity: Rust examples produce identical results
- [ ] CI updated to use `cargo run --example`
- [ ] Documentation updated

### 9.2 Scripts to Delete

Upon successful verification, delete the following:

```bash
# apr-cli scripts
rm crates/apr-cli/scripts/qa-run.sh
rm crates/apr-cli/scripts/qa-chat.sh
rm crates/apr-cli/scripts/qa-serve.sh

# Top-level scripts
rm scripts/qa-apr-cli.sh
rm scripts/showcase-qa.sh

# docs/qa scripts
rm docs/qa/qa-verify.sh
rm docs/qa/all-modules-qa-verify.sh
rm docs/qa/math-qa-verify.sh
```

### 9.3 Post-Deletion Verification

```bash
# Verify no QA bash scripts remain
find . -name "qa*.sh" -type f | grep -v node_modules

# Verify examples work
cargo run --example qa_run
cargo run --example qa_chat
cargo run --example qa_serve -- --port 8080
cargo run --example qa_verify
```

---

## Appendix A: Test Coverage Mapping

### qa-run.sh → qa_run.rs

| Bash Function | Rust Function | Status |
|---------------|---------------|--------|
| `test_run_basic()` | P001-P004 | ✅ DONE |
| `test_run_performance()` | P006-P007 | ✅ DONE |
| `test_run_determinism()` | P008 | ✅ DONE |
| `test_format_parity()` | P009-P010 | ✅ DONE |

### qa-chat.sh → qa_chat.rs

| Bash Function | Rust Function | Status |
|---------------|---------------|--------|
| `test_chat_basic()` | P011-P014 | ✅ DONE |
| `test_chat_performance()` | P015-P016 | ✅ DONE |
| `test_format_parity()` | N/A | ✅ DONE |

### qa-serve.sh → qa_serve.rs

| Bash Function | Rust Function | Status |
|---------------|---------------|--------|
| `test_health()` | P017 | ✅ DONE |
| `test_basic_inference()` | P019-P021 | ✅ DONE |
| `test_streaming()` | P024-P025 | ✅ DONE |
| `test_streaming_openai_sdk()` | P024 | ✅ DONE |
| `test_determinism()` | P026 | ✅ DONE |
| `test_malformed_json()` | P027 | ✅ DONE |
| `test_coherency()` | P028 | ✅ DONE |
| `test_no_multi_turn_loop()` | P029 | ✅ DONE |
| `test_tracing()` | P030-P032 | ✅ DONE |
| `test_default_mode_suppression()` | P033 | ✅ DONE |
| `test_format_parity()` | --all-models | ✅ DONE |
| `run_all_models()` | --all-models | ✅ DONE |

---

**Awaiting Dr. Karl Popper's Review**

*"A theory that is not refutable by any conceivable event is non-scientific."*
*— K. Popper, The Logic of Scientific Discovery (1959)*