# Performance Tasks

Near-term performance work should close measured gaps from the latest benchmark
overview, not chase theoretical wins.

Source of truth:

- Benchmark overview: [`../../benchmark_results/OVERVIEW.md`](../../benchmark_results/OVERVIEW.md)
- Linux benchmark commit: `26845c85530d90725d3ad90c7871d7a474d80d27`
- Apple Silicon benchmark commit: `b06b946d217248d634c43688211c9ebe5c2692e8`
- Primary CI scorecard: 2026-05-27 Linux CI, nine runners
- Apple Silicon scorecard: 2026-06-01 local MBP M1 macOS/aarch64 full run
- Coverage boundary: the Linux CI run omits Argon2, scrypt, and Ascon-AEAD; the Apple Silicon run includes them. Do not combine the totals into a single aggregate, because the commits and scopes differ.

## Rules

- Measure first, then change code.
- Preserve the portable path as the byte-for-byte authority.
- Add differential tests before trusting a new kernel or dispatch path.
- Do not claim platform wins without platform-specific benchmark rows.
- Do not optimize a diagnostic row unless it maps to a public API path.
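
The differential-test rule can be sketched as follows. This is a minimal illustration, not rscrypto code: `fnv1a_portable`, `fnv1a_unrolled`, and `first_mismatch` are hypothetical names, and FNV-1a stands in for a real kernel only because it is short. The shape is what matters: the portable function is the authority, and the candidate must match it byte for byte at every length, including lengths that straddle the candidate's block size.

```rust
const OFFSET: u32 = 0x811c_9dc5; // FNV-1a 32-bit offset basis
const PRIME: u32 = 0x0100_0193; // FNV-1a 32-bit prime

/// Portable reference (hypothetical stand-in for the authoritative path).
fn fnv1a_portable(data: &[u8]) -> u32 {
    let mut acc = OFFSET;
    for &b in data {
        acc = (acc ^ u32::from(b)).wrapping_mul(PRIME);
    }
    acc
}

/// Candidate "kernel" (hypothetical): same function, unrolled four bytes at a time.
fn fnv1a_unrolled(data: &[u8]) -> u32 {
    let mut acc = OFFSET;
    let mut chunks = data.chunks_exact(4);
    for c in &mut chunks {
        acc = (acc ^ u32::from(c[0])).wrapping_mul(PRIME);
        acc = (acc ^ u32::from(c[1])).wrapping_mul(PRIME);
        acc = (acc ^ u32::from(c[2])).wrapping_mul(PRIME);
        acc = (acc ^ u32::from(c[3])).wrapping_mul(PRIME);
    }
    // Tail bytes that did not fill a whole block.
    for &b in chunks.remainder() {
        acc = (acc ^ u32::from(b)).wrapping_mul(PRIME);
    }
    acc
}

/// Returns the first input length in `0..=max_len` at which the candidate
/// disagrees with the reference, or `None` if they agree everywhere.
fn first_mismatch(
    reference: fn(&[u8]) -> u32,
    candidate: fn(&[u8]) -> u32,
    max_len: usize,
) -> Option<usize> {
    // Deterministic xorshift32 bytes so any failure reproduces exactly.
    let mut state: u32 = 0x9e37_79b9;
    let mut buf = Vec::with_capacity(max_len);
    for _ in 0..max_len {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        buf.push(state as u8);
    }
    (0..=max_len).find(|&len| reference(&buf[..len]) != candidate(&buf[..len]))
}
```

Calling `first_mismatch(fnv1a_portable, fnv1a_unrolled, 257)` in a test and asserting `None` covers every length up to and just past several block boundaries. A real suite would also vary keys, nonces, and alignment.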

## Top Five Gaps

| Priority | Gap | Current evidence | First move | Close criteria |
| --- | --- | --- | --- | --- |
| P0 | PBKDF2-SHA256 low-iteration setup path | `pbkdf2-sha256 / iters=1`: 18 rows, `0.81x` geomean; two Intel Sapphire Rapids rows against `aws-lc-rs` are `0.03x..0.04x` | Inspect HMAC setup reuse and inner-loop overhead in `src/auth/pbkdf2.rs` and `src/auth/hmac.rs`; isolate fixed setup from per-iteration work | `iters=1` `>=0.95x` geomean and overall `pbkdf2-sha256` `>=1.00x`, with `iters=100` and `iters=1000` unchanged or better |
| P0 | Arm/RISC-V public-key paths | X25519 DH is `0.92x`; Graviton3/4 RSA 3072/4096/8192 verification rows are `0.36x..0.43x` vs `aws-lc-rs`; RISE RISC-V is `0.56x..0.81x` | Profile Curve25519 field arithmetic and RSA public exponentiation on Graviton and RISC-V; separate arithmetic from key/import overhead | X25519 DH `>=1.00x`; RSA verification `>=0.95x` on Graviton and RISC-V without regressing x86_64 or IBM Z |
| P1 | Small-message AEAD frontend overhead | ChaCha20-Poly1305 is `0.99x` overall with 73 losses; AES-GCM and GCM-SIV also lose many 1-byte and 32-byte rows | Profile nonce/key schedule setup, tag path setup, and wrapper overhead before touching block kernels | 1-byte and 32-byte AEAD encrypt/decrypt rows reach `>=0.95x` fastest-external geomean while sustained rows stay at or above current levels |
| P1 | HMAC-SHA256 one-shot path | `hmac-sha256 / hash`: 99 rows, `1.14x` geomean but `0.99x` median and 33 losses | Profile key normalization, pad setup, state cloning, and SHA-256 dispatch in `src/auth/hmac.rs` and `src/hashes/crypto/sha256/*` | HMAC-SHA256 median `>=1.05x`, with PBKDF2 regressions explicitly ruled out |
| P2 | Ed25519 verification variance | `ed25519 / verify`: 36 rows, `1.00x` geomean with 13 losses; pressure shifts between `dalek`, `aws-lc-rs`, `ring`, and `dryoc` | Profile scalar multiplication, decompression, and verification equation in `src/auth/ed25519/*`; keep signing separate because it is not the same bottleneck | `ed25519 / verify` `>=1.05x` Linux CI geomean and fewer than 10 fastest-external losses |
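
The close criteria above mix geomean, median, and loss counts, and these can disagree on the same rows (the HMAC-SHA256 row reports a `1.14x` geomean with a `0.99x` median). A minimal sketch of the three statistics follows. It assumes each ratio is rscrypto throughput divided by the fastest external throughput and that a "loss" is a ratio below `1.0`; neither definition is taken from the benchmark tooling, and `RatioSummary` and `summarize` are hypothetical names.

```rust
/// Summary of per-row speed ratios (assumed: ours / fastest external).
#[derive(Debug, PartialEq)]
struct RatioSummary {
    geomean: f64,
    median: f64,
    losses: usize,
}

/// Returns `None` for an empty slice or any non-positive / non-finite ratio,
/// since the geometric mean is undefined for those inputs.
fn summarize(ratios: &[f64]) -> Option<RatioSummary> {
    if ratios.is_empty() || ratios.iter().any(|r| !(r.is_finite() && *r > 0.0)) {
        return None;
    }

    // Geometric mean computed in log space: exp(mean(ln r)).
    let n = ratios.len() as f64;
    let geomean = (ratios.iter().map(|r| r.ln()).sum::<f64>() / n).exp();

    // Median of the sorted ratios (mean of the two middle values if even).
    let mut sorted = ratios.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    let median = if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    };

    // Assumption: a row counts as a loss when the ratio is strictly below 1.0.
    let losses = ratios.iter().filter(|r| **r < 1.0).count();

    Some(RatioSummary { geomean, median, losses })
}
```

For example, `summarize(&[0.98, 0.99, 0.99, 1.0, 4.0])` yields a geomean above `1.0` while the median is `0.99` with three losses: one large win can lift the geomean over a majority of narrow losses. This is why several criteria name the median or the loss count explicitly.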

## Watch List

These gaps are real, but they should not displace the top five until those gaps move.

| Area | Evidence | Action |
| --- | --- | --- |
| RSA verification overall | Verification-only RSA is `0.98x` despite import+verify being `1.32x` | Do not let the faster parse/import rows hide the verifier gap |
| X25519 | Overall Linux CI `0.95x`; `diffie-hellman` small-sample `0.92x` | Recheck after Curve25519 arithmetic changes, but do not overfit nine-row groups |
| Argon2 and scrypt | Missing from the 2026-05-27 Linux CI plan. The Apple Silicon full run no longer shows a catastrophic Argon2 small-memory loss: the worst matched Argon2 row is `0.96x` and scrypt rows are above parity | Add `password_hashing` back to CI before publishing cross-platform password-hashing claims |
| KMAC/cSHAKE | Missing from the 2026-05-27 Linux CI plan | Add CI coverage before making README or release claims about these rows |
| Apple Silicon SHA-3 | macOS/aarch64 SHA-3 is `0.94x` fastest-external geomean, below parity and much weaker than Linux CI | Treat SHA-3 as platform-sensitive; profile before changing the portable path |
| Apple Silicon BLAKE3 64 KiB | `>=64 KiB` geomean is `1.80x`, but the keyed and derive-key 64 KiB rows lose before larger rows recover | Do not summarize this lane as universal BLAKE3 dominance; inspect chunking and threshold behavior if optimizing |

## Do Not Spend Time Here Yet

- Checksums: current Linux CI fastest-external checksum geomean is `5.03x`.
- SHA-3/SHAKE: Linux CI geomeans are `2.15x` and `1.86x`.
- BLAKE3 sustained rows: `>=64 KiB` Linux CI geomean is `2.31x`.
- IBM Z AEAD: already far ahead of the fastest external rows.
- Apple Silicon AEAD and RSA: current fastest-external geomeans are `1.47x` and `1.45x`.

## Verification Loop

1. Reproduce the gap locally or on the matching CI runner class.
2. Add or tighten a focused benchmark if the current row is too broad.
3. Profile the public API path, not only internal diagnostic helpers.
4. Patch the smallest hot path that explains the measured delta.
5. Run correctness tests, differential backend tests, and the affected bench.
6. Update `benchmark_results/OVERVIEW.md`, `README.md`, and `assets/readme/perf.svg` only after a clean benchmark run.
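
For a setup-bound gap such as the PBKDF2 `iters=1` row, steps 2 and 3 can be sketched with a two-point fit that separates fixed setup from per-iteration cost. This is an illustration under stated assumptions, not the project's bench harness: `median_nanos`, `split_cost`, and `toy_kdf` are hypothetical, `toy_kdf` is a stand-in workload rather than a KDF, and the linear model `time(n) = setup + n * per_iter` is assumed to hold.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Median wall-clock nanoseconds of one call to `f`, over `samples` timed calls.
fn median_nanos<F: FnMut()>(mut f: F, samples: usize) -> f64 {
    let mut times: Vec<f64> = (0..samples.max(1))
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed().as_nanos() as f64
        })
        .collect();
    times.sort_by(|a, b| a.total_cmp(b));
    times[times.len() / 2]
}

/// Two-point fit of `time(n) = setup + n * per_iter`.
/// Returns `(setup, per_iter)`, or `None` if the two points cannot define a slope.
fn split_cost(n_lo: u32, t_lo: f64, n_hi: u32, t_hi: f64) -> Option<(f64, f64)> {
    if n_hi <= n_lo {
        return None;
    }
    let per_iter = (t_hi - t_lo) / f64::from(n_hi - n_lo);
    let setup = t_lo - per_iter * f64::from(n_lo);
    Some((setup, per_iter))
}

/// Hypothetical workload: a fixed setup phase followed by `iters` rounds.
fn toy_kdf(iters: u32) -> u64 {
    let mut state: u64 = 0;
    // Fixed setup, independent of `iters`.
    for i in 0..64u64 {
        state = state.rotate_left(5) ^ i.wrapping_mul(0x9e37_79b9_7f4a_7c15);
    }
    // Per-iteration work.
    for _ in 0..iters {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
    }
    state
}

/// Measures the toy workload at two iteration counts and splits the cost.
fn measure_toy() -> Option<(f64, f64)> {
    let t_lo = median_nanos(|| { black_box(toy_kdf(black_box(1))); }, 101);
    let t_hi = median_nanos(|| { black_box(toy_kdf(black_box(1000))); }, 101);
    split_cost(1, t_lo, 1000, t_hi)
}
```

If the fitted setup term dominates the low-iteration time, the gap sits in the fixed path and the inner loop is not the first thing to patch. Timer resolution makes single-call timings noisy at this scale, so a real harness would batch calls per sample.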