# Performance theses (living document)
This file is a running log of **hypotheses (“theses”)** and the **measurement protocol** we’ll use to validate them one by one.
Principles:
- **One change per experiment** (or one tightly-coupled set), with before/after measurements.
- Prefer **end-to-end CLI throughput** on a fixed input (`samples/MFT`) as the primary KPI.
- Keep a **saved profile** around for every “checkpoint” so we can explain wins / regressions.
- When results are noisy, prefer **median** and **min** over mean, and record variance.
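For reference, the statistics we record can be sketched with a small Rust helper (hypothetical; hyperfine already exports all of these in its JSON):

```rust
/// Hypothetical helper illustrating the median/min/variance convention.
/// Input: raw benchmark times in seconds; output: (median, min, sample variance).
fn summarize(mut times: Vec<f64>) -> (f64, f64, f64) {
    times.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = times.len();
    let min = times[0];
    // Median: middle element, or mean of the two middle elements for even n.
    let median = if n % 2 == 1 {
        times[n / 2]
    } else {
        (times[n / 2 - 1] + times[n / 2]) / 2.0
    };
    // Sample variance (n - 1 denominator), recorded alongside median/min.
    let mean = times.iter().sum::<f64>() / n as f64;
    let variance = times.iter().map(|t| (t - mean).powi(2)).sum::<f64>() / (n - 1) as f64;
    (median, min, variance)
}

fn main() {
    let (median, min, var) = summarize(vec![0.103, 0.095, 0.169, 0.096]);
    println!("median={median:.4} min={min:.3} variance={var:.6}");
}
```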
## Agent playbook (reproducible workflow)
This section is the **exact workflow** used to land each hypothesis as a PR-quality change.
If you hand this file to another agent, they should be able to reproduce the same process and artifacts.
### Naming & artifacts (do this consistently)
Pick the next hypothesis ID: `H{N}` (monotonic, don’t reuse IDs).
- **Branch**: `perf/h{N}-{short-slug}` (example: `perf/h6-resident-slices`)
- **Saved binaries** (so benchmarks are stable and diffable):
- `target/release/mft_dump.h{N}_before`
- `target/release/mft_dump.h{N}_after`
- **Hyperfine JSON**:
- `target/h{N}-before-vs-after.hyperfine.json`
- **Samply profiles** (merged by running many iterations):
- `target/samply/h{N}_before.profile.json.gz`
- `target/samply/h{N}_after.profile.json.gz`
### Canonical benchmark command lines (copy/paste)
These are the commands we benchmark and profile. Keep them unchanged unless the thesis *requires* changing them.
W1 (JSONL, end-to-end, write suppressed):
```bash
./target/release/mft_dump samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite
```
W2 (CSV, end-to-end, write suppressed):
```bash
./target/release/mft_dump samples/MFT -o csv -f /dev/null --no-confirm-overwrite
```
### Step-by-step: run an experiment end-to-end
#### 0) Start a new thesis
```bash
cd /Users/omerba/Workspace/mft
git checkout -b perf/h{N}-{short-slug}
```
#### 1) Build + snapshot the **before** binary
```bash
cd /Users/omerba/Workspace/mft
cargo build --release --bin mft_dump
cp -f target/release/mft_dump target/release/mft_dump.h{N}_before
```
#### 2) Record a stable **before** profile (Samply)
We merge many iterations so leaf frames are stable.
```bash
cd /Users/omerba/Workspace/mft
mkdir -p target/samply
samply record --save-only --unstable-presymbolicate --reuse-threads --main-thread-only \
-o target/samply/h{N}_before.profile.json.gz \
--iteration-count 200 -- \
./target/release/mft_dump.h{N}_before samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite
```
To view (serve locally and open the printed Firefox Profiler URL):
```bash
cd /Users/omerba/Workspace/mft
samply load --no-open -P 4033 target/samply/h{N}_before.profile.json.gz
```
What to record from the UI:
- Use **Call Tree** + **Invert call stack** for top **leaf/self** frames.
- Use normal Call Tree for “big buckets” (inclusive time).
- Filter stack for `mft::` / `mft_dump::` when looking for in-crate work.
#### 3) Implement the change (keep it tight)
- Make the smallest change that tests the hypothesis.
- If you find yourself changing 5+ unrelated things, split into multiple theses.
#### 4) Build + snapshot the **after** binary
```bash
cd /Users/omerba/Workspace/mft
cargo build --release --bin mft_dump
cp -f target/release/mft_dump target/release/mft_dump.h{N}_after
```
#### 5) Benchmark **before vs after in the same hyperfine command**
We always run both saved binaries in a single `hyperfine` invocation and export JSON.
```bash
cd /Users/omerba/Workspace/mft
hyperfine --warmup 5 --runs 40 \
--export-json target/h{N}-before-vs-after.hyperfine.json \
'./target/release/mft_dump.h{N}_before samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite' \
'./target/release/mft_dump.h{N}_after samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite'
```
Extract medians quickly (no jq required):
```bash
python3 - <<'PY'
import json
path = "target/h{N}-before-vs-after.hyperfine.json"
d = json.load(open(path))
for r in d["results"]:
    print(r["command"])
    print("  median:", r["median"])
    print("  mean  :", r["mean"], "stddev:", r["stddev"])
PY
```
If variance is high, amortize noise by running multiple iterations inside each hyperfine run:
```bash
cd /Users/omerba/Workspace/mft
hyperfine --warmup 2 --runs 15 \
--export-json target/h{N}-before-vs-after.hyperfine.json \
--command-name 'before (20x)' "bash -lc 'for i in {1..20}; do ./target/release/mft_dump.h{N}_before samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite; done'" \
--command-name 'after (20x)' "bash -lc 'for i in {1..20}; do ./target/release/mft_dump.h{N}_after samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite; done'"
```
#### 6) Record an **after** profile (Samply)
```bash
cd /Users/omerba/Workspace/mft
samply record --save-only --unstable-presymbolicate --reuse-threads --main-thread-only \
-o target/samply/h{N}_after.profile.json.gz \
--iteration-count 200 -- \
./target/release/mft_dump.h{N}_after samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite
```
View:
```bash
cd /Users/omerba/Workspace/mft
samply load --no-open -P 4034 target/samply/h{N}_after.profile.json.gz
```
#### 7) Correctness checks (pick the strictness that matches the thesis)
**Semantic JSONL equality** (preferred; formatting differences allowed):
```bash
cd /Users/omerba/Workspace/mft
rm -f /tmp/mft_before.jsonl /tmp/mft_after.jsonl
./target/release/mft_dump.h{N}_before samples/MFT --ranges 0-200 -o jsonl -f /tmp/mft_before.jsonl --no-confirm-overwrite
./target/release/mft_dump.h{N}_after samples/MFT --ranges 0-200 -o jsonl -f /tmp/mft_after.jsonl --no-confirm-overwrite
python3 - <<'PY'
import json
b = [json.loads(l) for l in open("/tmp/mft_before.jsonl")]
a = [json.loads(l) for l in open("/tmp/mft_after.jsonl")]
assert b == a, "semantic JSONL mismatch"
print("OK: semantic JSONL identical (ranges 0-200)")
PY
```
**Byte-for-byte equality** (use when the thesis claims exact output identity):
```bash
diff -u /tmp/mft_before.jsonl /tmp/mft_after.jsonl >/dev/null && echo "OK: byte-identical" || echo "MISMATCH: outputs differ"
```
#### 8) Update this file (PERF.md) with a write-up
Add a section under “Completed optimizations” (or “Rejected”) with:
- **What changed**
- **Benchmarks** (paste the exact hyperfine command)
- **Extracted medians** (from exported JSON)
- **Speedup** (ratio and %)
- **Profile delta** (top leaf frame(s) before/after, mention if top leaf changed)
- **Correctness check** (command + result)
- **Artifacts**: profile paths + hyperfine JSON path
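The speedup figures in the write-ups below follow one convention; a minimal sketch of that arithmetic (hypothetical helper, inputs are median times):

```rust
/// Speedup as reported in the write-ups: ratio = before/after,
/// percent = how much faster the "after" binary is relative to "before".
/// Hypothetical helper; inputs are median times in milliseconds.
fn speedup(before_ms: f64, after_ms: f64) -> (f64, f64) {
    let ratio = before_ms / after_ms;
    let percent_faster = (1.0 - after_ms / before_ms) * 100.0;
    (ratio, percent_faster)
}

fn main() {
    // H1 medians from the "Completed optimizations" section.
    let (ratio, pct) = speedup(95.94, 73.65);
    println!("{ratio:.2}x, {pct:.0}% faster"); // ~1.30x, ~23% faster
}
```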
#### 9) PR-quality finish
Run the usual checks before committing:
```bash
cd /Users/omerba/Workspace/mft
cargo test --all-features
cargo fmt
cargo clippy --all-targets --all-features
```
Then commit with a message that matches the thesis and the observable change (example):
```bash
git commit -am "perf: H{N} avoid resident attribute copies in JSONL"
```
### How to handle negative results (rejected theses)
If the benchmark is within noise or regresses:
- **Revert** the change (keep the branch clean), or leave it but clearly mark as rejected.
- Add a “Rejected” subsection documenting:
- the hypothesis
- the benchmark numbers (showing it’s noise/regression)
- the profile evidence (what got worse / what new leaf appeared)
- the conclusion (“not worth it”) and what to try next
## Canonical workloads
All commands assume:
```bash
cargo build --release --bin mft_dump
```
- **W1 (JSONL, end-to-end)**:
```bash
./target/release/mft_dump samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite
```
- **W2 (CSV, end-to-end)**:
```bash
./target/release/mft_dump samples/MFT -o csv -f /dev/null --no-confirm-overwrite
```
## Baseline environment (2025-12-23)
- **OS**: macOS 26.2 (25C56), Darwin 25.2.0, arm64
- **HW**: `Mac15,6`, 11 cores, 36GB RAM
- **Toolchain**: rustc 1.92.0, cargo 1.92.0
If you’re re-running baselines on a different machine/OS, append a new baseline section rather than overwriting this one.
## Baseline numbers (2025-12-23)
Measured with `hyperfine` (30 runs, 3 warmup), output to `/dev/null`:
- **W1 JSONL**: ~**103 ms mean** (σ ~14 ms), range ~94–169 ms
- **W2 CSV**: observed **high variance** on this machine/session (outliers up to ~468 ms). Re-run on a quiet system before treating CSV as a stable KPI.
Raw captures (not committed, under `target/`):
- `target/perf-baseline.json`
- `target/perf-baseline.csv.json`
## Profiling (baseline)
### Samply (hot functions / leafs)
End-to-end JSONL profile (merge many iterations for stability):
```bash
mkdir -p target/samply
samply record --save-only --unstable-presymbolicate --reuse-threads --main-thread-only \
-o target/samply/mft_dump_jsonl_merged.profile.json.gz \
--iteration-count 200 -- \
./target/release/mft_dump samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite
samply load target/samply/mft_dump_jsonl_merged.profile.json.gz
```
What to look at:
- **Call Tree + “Invert call stack”** for top leaf frames (true hot spots).
- **Call Tree (non-inverted)** for inclusive costs (big buckets like “serialization”).
- Filter stack: `mft::` / `mft_dump::` to focus on crate code.
#### Baseline profile notes (from `mft_dump_jsonl_merged`)
Top inclusive buckets:
- `MftEntry::serialize` dominates (serialization is the main cost center).
- `MftParser::get_entry` is non-trivial but secondary in the end-to-end JSONL path.
Top leaf frames include:
- `serde_json::ser::format_escaped_str_contents` (string escaping)
- `_platform_memmove` (buffer copying)
- `write` / `read` / `__lseek` (I/O syscalls)
### macOS hardware counters (optional)
On macOS, `xctrace` can record CPU counter templates. This isn’t as clean as Linux `perf stat`, but it can still provide useful sanity checks (e.g. cycle counts / bottleneck breakdown).
Record:
```bash
mkdir -p target/xctrace
xcrun xctrace record --no-prompt --template 'CPU Counters' \
--output target/xctrace/mft_dump_jsonl_cpu_counters.trace \
--launch -- ./target/release/mft_dump samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite
```
Explore/export:
```bash
xcrun xctrace export --input target/xctrace/mft_dump_jsonl_cpu_counters.trace --toc
```
We’ve confirmed these schemas exist in the trace:
- `MetricTable`
- `MetricAggregationForProcess`
- `CounterMetricAggregatedForProcess`
Note: the default templates we tried expose cycles + “bottleneck” style metrics; raw retired-instruction counts may require different counter configuration (or use Linux `perf stat`).
## Theses / hypotheses backlog
Each item includes:
- **Claim**: what we think is true
- **Change**: the minimal code change to test it
- **Success metric**: what improvement we require on W1
- **Guardrails**: correctness + “don’t regress too much” constraints
### H1 — Remove per-entry allocation/copy in JSON serialization
- **Claim**: end-to-end JSONL is dominated by `serde_json` work; we can shave a large chunk by removing avoidable allocations/copies.
- **Evidence**: `MftEntry::serialize` is ~3/4 of inclusive time in samply; leaf frames show `memmove` and string escaping.
- **Change**:
- Stop building a `Vec<MftAttribute>` inside `MftEntry::serialize` (stream attributes as a `SerializeSeq`).
- Stop serializing into a fresh `Vec<u8>` per entry in `mft_dump::print_json_entry` (reuse a buffer).
- Use a faster serde-compatible JSON serializer for JSONL (`sonic-rs`).
- **Success metric**: W1 improves by **≥ 15%** on median time.
- **Guardrails**:
- Output must remain **semantically identical** for JSONL (same JSON values per line; formatting/escaping differences are allowed).
- `cargo test --all-features` stays green.
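The buffer-reuse half of this change can be illustrated without serde. A simplified sketch (not the actual crate code; `Entry` is a stand-in for `MftEntry`, and no string escaping is done) of serializing each entry into one reusable `Vec<u8>` instead of a fresh allocation per entry:

```rust
use std::io::Write;

/// Simplified stand-in for one parsed entry (the real type is MftEntry).
struct Entry {
    id: u64,
    name: String,
}

/// Write one JSONL line per entry, reusing `buf` across iterations.
/// `buf.clear()` keeps the allocation, so steady state has no per-entry
/// heap traffic; the real change does the same with the serde output buffer.
fn write_jsonl<W: Write>(entries: &[Entry], out: &mut W) -> std::io::Result<()> {
    let mut buf = Vec::with_capacity(4096);
    for e in entries {
        buf.clear(); // reuse capacity instead of allocating a new Vec
        write!(buf, "{{\"id\":{},\"name\":\"{}\"}}", e.id, e.name)?;
        buf.push(b'\n');
        out.write_all(&buf)?;
    }
    Ok(())
}

fn main() {
    let entries = vec![
        Entry { id: 0, name: "MFT".into() },
        Entry { id: 1, name: "MFTMirr".into() },
    ];
    let mut out = Vec::new();
    write_jsonl(&entries, &mut out).unwrap();
    print!("{}", String::from_utf8(out).unwrap());
}
```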
### H2 — Reduce syscall overhead in sequential reads
- **Claim**: sequential iteration still pays a lot of `lseek` overhead; removing it will meaningfully reduce CPU time once serialization is cheaper.
- **Evidence**: parser-only profiles show `__lseek` as a major leaf; end-to-end still has visible syscall leaf time.
- **Change**:
- Teach `get_entry` to skip `seek` when already positioned for sequential reads (track `next_read_offset`).
- Update CLI loop to use the sequential path when ranges are not random.
- **Success metric**: W1 improves by **≥ 5%** after H1 lands (or measure on W2 if JSONL still hides it).
- **Guardrails**: no functional changes; still supports `--ranges`.
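A minimal sketch of the offset-tracking idea (the name `next_read_offset` mirrors the description above; this is not the actual `MftParser::get_entry` code, and `ENTRY_SIZE` is assumed fixed):

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};

const ENTRY_SIZE: u64 = 1024; // typical MFT entry size; assumed here

/// Reader that skips `seek()` when the requested entry is the next
/// sequential one. Sketch only; the real change lives in MftParser::get_entry.
struct SeqReader<R> {
    inner: R,
    next_read_offset: u64, // where the stream cursor currently sits
    seeks: u64,            // how many seek() calls we actually issued
}

impl<R: Read + Seek> SeqReader<R> {
    fn read_entry(&mut self, entry_id: u64, buf: &mut [u8]) -> std::io::Result<()> {
        let offset = entry_id * ENTRY_SIZE;
        // Only pay the seek syscall when not already positioned there.
        if offset != self.next_read_offset {
            self.inner.seek(SeekFrom::Start(offset))?;
            self.seeks += 1;
        }
        self.inner.read_exact(buf)?;
        self.next_read_offset = offset + buf.len() as u64;
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let data = vec![0u8; 4 * ENTRY_SIZE as usize];
    let mut r = SeqReader { inner: Cursor::new(data), next_read_offset: 0, seeks: 0 };
    let mut buf = vec![0u8; ENTRY_SIZE as usize];
    for id in 0..4 {
        r.read_entry(id, &mut buf)?; // sequential scan: already positioned
    }
    println!("seeks issued: {}", r.seeks); // 0 for a pure sequential scan
    Ok(())
}
```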
### H4 — Reduce hex formatting overhead (`to_hex_string`)
- **Claim**: hex encoding of raw attribute blobs is still a meaningful formatting cost.
- **Evidence**: leaf frames show `core::fmt::num::<impl UpperHex for u8>::fmt` at ~2% self, and `mft::utils::to_hex_string` in the top leaf list.
- **Change**: replace `to_hex_string`’s `write!("{byte:02X}")` loop with a table-based encoder (no `fmt`).
- **Success metric**: W1 improves by **≥ 2%** on median time (post-H1/H2/H3).
- **Guardrails**: output must be byte-for-byte identical for hex strings (uppercase, no separators).
## Completed optimizations
### H1 (2025-12-23) — Faster JSONL serialization
**What changed**
- Stream `attributes` in `MftEntry` serialization (avoid allocating `Vec<MftAttribute>`).
- Reuse a `Vec<u8>` JSON buffer in `mft_dump` (avoid per-entry allocation).
- Switch JSONL output from `serde_json` to **`sonic-rs`** (serde-compatible, SIMD-focused).
- Pretty JSON (`-o json`) still uses `serde_json` for formatting.
**Benchmarks**
Single `hyperfine` run comparing the saved binaries:
```bash
hyperfine --warmup 3 --runs 30 \
'./target/release/mft_dump.h1_before samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite' \
'./target/release/mft_dump.h1_after3_sonic samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite'
```
Extracted medians (from `target/h1-before-vs-after.hyperfine.json`):
- **Before median**: **95.94 ms**
- **After median**: **73.65 ms**
- **Speedup**: ~**1.30×** (≈ **23%** faster)
**Profile delta (top leaf)**
- **Before**: `serde_json::ser::format_escaped_str_contents` (~18% self)
- **After**: `sonic_rs::format::Formatter::write_string_fast` (~18% self)
Profiles:
- `target/samply/h1_before.profile.json.gz`
- `target/samply/h1_after3_sonic.profile.json.gz`
**Correctness check**
We verified **semantic equality** of JSONL output on a small range:
- Command: both binaries with `--ranges 0-200` and `-o jsonl`
- Method: parse each line as JSON and compare Python objects
- Result: OK (193 lines; some entries are skipped due to zero headers)
### H2 (2025-12-23) — Skip per-entry seek for sequential scans
**What changed**
- `MftParser::get_entry` now tracks the **next expected stream offset** and only calls `seek()` when the requested entry is not the sequential next entry.
**Benchmarks**
Single `hyperfine` run comparing the saved binaries:
```bash
hyperfine --warmup 3 --runs 30 \
'./target/release/mft_dump.h2_before samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite' \
'./target/release/mft_dump.h2_after samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite'
```
Extracted medians (from `target/h2-before-vs-after.hyperfine.json`):
- **Before median**: **74.72 ms**
- **After median**: **63.06 ms**
- **Speedup**: ~**1.18×** (≈ **16%** faster)
**Profile delta (leaf reduction)**
Before (`target/samply/h2_before.profile.json.gz`, inverted call tree):
- `read` ~11% self
- `__lseek` ~5.5% self
After (`target/samply/h2_after.profile.json.gz`, inverted call tree):
- `read` ~4.8% self
- `__lseek` no longer appears in top leaf list (effectively eliminated for W1)
### H3 (2025-12-23) — Migrate timestamps to `jiff` (preserve chrono-compatible output)
**What changed**
- Replace `chrono::DateTime<Utc>` fields in:
- `StandardInfoAttr` (`0x10`)
- `FileNameAttr` (`0x30`)
- `FlatMftEntryWithName` (CSV)
with `jiff::Timestamp` (re-exported as `mft::Timestamp`).
- Convert Windows FILETIME directly in `mft::utils::windows_filetime_to_timestamp` (truncate to microseconds to match historical behavior).
- Preserve the exact JSON/CSV timestamp string format by forcing chrono-compatible RFC3339 precision using `jiff::fmt::temporal::DateTimePrinter` (via `#[serde(serialize_with = ...)]`).
- Enable `jiff`’s `perf-inline` feature (important when using `default-features = false`; it’s enabled by default otherwise).
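The FILETIME conversion itself is plain arithmetic; a stdlib-only sketch of what `windows_filetime_to_timestamp` computes (the jiff type is replaced by raw Unix microseconds here, and the epoch constant is the standard 1601→1970 offset of 11,644,473,600 seconds in 100-ns ticks):

```rust
/// 100-ns ticks between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
const EPOCH_DIFF_TICKS: i64 = 116_444_736_000_000_000;

/// Convert a Windows FILETIME (100-ns ticks since 1601) to Unix microseconds,
/// truncating to microsecond precision to match historical behavior.
/// Sketch only: the real function returns a jiff::Timestamp, not a raw i64.
fn filetime_to_unix_micros(filetime: u64) -> i64 {
    let ticks_since_unix = filetime as i64 - EPOCH_DIFF_TICKS;
    // 10 ticks per microsecond; integer division drops sub-microsecond bits.
    ticks_since_unix / 10
}

fn main() {
    // The FILETIME value of the Unix epoch itself maps to 0 microseconds.
    println!("{}", filetime_to_unix_micros(116_444_736_000_000_000)); // 0
}
```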
**Benchmarks**
Single `hyperfine` run comparing saved binaries:
```bash
hyperfine --warmup 5 --runs 40 \
'./target/release/mft_dump.h3_before samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite' \
'./target/release/mft_dump.h3_after_final samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite'
```
Extracted medians (from `target/h3-before-vs-after-final.hyperfine.json`):
- **Before median**: **69.21 ms**
- **After median**: **64.65 ms**
- **Speedup**: ~**1.07×** (≈ **6.6%** faster)
**Profile delta (leaf shift)**
Before (`target/samply/h3_before.profile.json.gz`, inverted call tree) included:
- `chrono::...FormatIso8601...::fmt` (~0.8% self)
- `chrono::naive::date::NaiveDate::add_days` (~1.2% self)
After (`target/samply/h3_after_final.profile.json.gz`, inverted call tree):
- No `chrono::` frames in the top leaf list
- Timestamp formatting is now primarily `jiff::fmt::temporal::printer::DateTimePrinter::print_datetime` (~3.9% self)
Profiles:
- `target/samply/h3_before.profile.json.gz`
- `target/samply/h3_after_final.profile.json.gz`
**Correctness check**
We verified **semantic equality** of JSONL output (including timestamp strings) on a small range:
- Command: both binaries with `--ranges 0-200` and `-o jsonl`
- Method: parse each line as JSON and compare Python objects
- Result: OK (193 lines)
### H4 (2025-12-23) — Faster hex encoding (remove `fmt`-based per-byte formatting)
**What changed**
- Replaced `mft::utils::to_hex_string`’s per-byte `write!("{byte:02X}")` loop with a nibble lookup table and `String::push` (no `core::fmt::UpperHex` formatting path).
- Added a small unit test to lock in uppercase, separator-free output.
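A minimal sketch of the table-based encoder described above (simplified; the real `to_hex_string` lives in `mft::utils` and may differ in detail):

```rust
const HEX_UPPER: &[u8; 16] = b"0123456789ABCDEF";

/// Uppercase, separator-free hex encoding via a nibble lookup table,
/// bypassing the core::fmt UpperHex machinery entirely.
/// Simplified sketch of the change; not the crate's exact implementation.
fn to_hex_string(bytes: &[u8]) -> String {
    let mut s = String::with_capacity(bytes.len() * 2);
    for &b in bytes {
        s.push(HEX_UPPER[(b >> 4) as usize] as char);   // high nibble
        s.push(HEX_UPPER[(b & 0x0F) as usize] as char); // low nibble
    }
    s
}

fn main() {
    assert_eq!(to_hex_string(&[0x00, 0xAB, 0xFF]), "00ABFF");
    println!("{}", to_hex_string(b"MFT")); // 4D4654
}
```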
**Benchmarks**
Single `hyperfine` run comparing the saved binaries:
```bash
hyperfine --warmup 5 --runs 40 \
'./target/release/mft_dump.h4_before samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite' \
'./target/release/mft_dump.h4_after2 samples/MFT -o jsonl -f /dev/null --no-confirm-overwrite'
```
Extracted medians (from `target/h4-before-vs-after2.hyperfine.json`):
- **Before median**: **63.31 ms**
- **After median**: **57.81 ms**
- **Speedup**: ~**1.10×** (≈ **8.7%** faster)
**Profile delta (leaf reduction)**
Before (`target/samply/h4_before.profile.json.gz`, inverted call tree):
- `core::fmt::num::<impl UpperHex for u8>::fmt` ~2.3% self
After (`target/samply/h4_after2.profile.json.gz`, inverted call tree):
- `UpperHex` no longer appears in the top leaf list
- `mft::utils::to_hex_string` is still visible (~1.3% self), but the heavy `fmt` machinery is gone
Profiles:
- `target/samply/h4_before.profile.json.gz`
- `target/samply/h4_after2.profile.json.gz`