audio_samples_io 0.1.8

# I/O Performance Plan: Eliminating Redundant Copies

## Status

| Operation | Before | After | vs scipy | vs soundfile |
|-----------|--------|-------|----------|--------------|
| Load (44.1 kHz i16 60 s stereo) | 5.5ms | **0.38ms** | ✅ 1.3× faster | ✅ 46× faster |
| Save — F-order / from file  | 6.8ms | **0.56ms** | ✅ 12× faster  | ✅ 11× faster |
| Save — C-order / programmatic | 7.7ms | **2.16ms** | ✅ 3.1× faster | ✅ 2.8× faster |

**All paths solved.**  Load is now faster than scipy, matching `np.fromfile` speed.

### Load fix round 1 (`python.rs:136`)
`create_pyarray_fortran` was cloning the owned `NonEmptyVec` with `.to_vec()` right before
handing it to numpy.  Changed to `.into_vec()` — ownership transfer, zero extra copy.
Result: 5.5ms → 0.50ms.

### Load fix round 2 (`python.rs::read_wav_direct`, `wav_file.rs::parse_wav_header_streaming`)
The mmap path still had double TLB pressure: it mapped the file pages into one virtual
address range, then `to_vec()` copied them into a second allocation (2× TLB entries for
the same data).  `np.fromfile` at 0.38ms was the baseline — one allocation, one kernel
`read()` syscall.

Fixed with a streaming header parser + direct `read_exact` path:
1. `parse_wav_header_streaming` reads only the WAV header via `BufReader` (~80–100 bytes,
   one OS read for the 8 KiB buffer) and returns `(BaseAudioInfo, data_byte_offset)`.
2. `read_wav_direct` allocates `Vec<T>` of the right size, seeks to `data_byte_offset`,
   then calls `read_exact` — the kernel copies tmpfs pages directly into our Vec with
   no intermediate mmap mapping.
3. `Vec` is transferred zero-copy to numpy via `into_vec()` (existing mechanism).

This path is used only when `T` matches the file's native sample type and the platform is
little-endian; it falls back to the mmap path for type conversions, I24, or big-endian targets.
Result: 0.50ms → **0.38ms**, matching `np.fromfile` exactly.
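
As a rough illustration of step 2, the sketch below reads PCM samples straight into a typed `Vec` using only the standard library. The helper name `read_i16_direct` and its parameters are assumptions made for this example, not the actual `read_wav_direct` signature, and it covers only the i16, little-endian case.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Reads `n_samples` little-endian i16 samples starting at `data_offset`
/// straight into a freshly allocated `Vec<i16>` (one allocation, no staging buffer).
/// Hypothetical helper; valid as written only on little-endian targets.
fn read_i16_direct<R: Read + Seek>(
    src: &mut R,
    data_offset: u64,
    n_samples: usize,
) -> io::Result<Vec<i16>> {
    if cfg!(target_endian = "big") {
        return Err(io::Error::new(io::ErrorKind::Unsupported, "needs a byte swap"));
    }
    let mut samples = vec![0i16; n_samples];
    src.seek(SeekFrom::Start(data_offset))?;
    // Safety: the byte view covers exactly the initialised i16 storage, u8 has
    // alignment 1, and every bit pattern is a valid i16.
    let bytes = unsafe {
        std::slice::from_raw_parts_mut(
            samples.as_mut_ptr().cast::<u8>(),
            n_samples * std::mem::size_of::<i16>(),
        )
    };
    src.read_exact(bytes)?;
    Ok(samples)
}
```

The returned `Vec<i16>` is the only sample-sized allocation, so it can be handed on by move. A short file surfaces as an `UnexpectedEof` error from `read_exact` rather than a partially filled buffer.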

### Save fix — F-order (`wav_file.rs`, `write_audio_data_interleaved`)
The F-order `(channels, frames)` array loaded by `read_pyarray` already has WAV-interleaved
memory layout `[L0, R0, L1, R1, …]`.  The write path was calling `as_interleaved_vec()` which:
1. Saw F-order → `as_slice()` returned `None` → `iter().copied().collect()` (C-order copy)
2. Called `interleave_multi_vec` on that (second copy back to interleaved)
3. Then serialised byte-by-byte in 256 KiB chunks (third copy)

Fixed by adding a fast path before the interleave logic: if the array is non-standard-layout
(F-order) and memory-contiguous (`as_slice_memory_order()` returns `Some`), cast the slice
directly to `&[u8]` and `write_all` in one shot.  Zero intermediate copies, zero allocations.
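
The core of that fast path, viewing the sample slice as bytes and writing it in one call, might look roughly like the following. `write_interleaved_raw` is a hypothetical name, and the sketch takes a plain `&[i16]` where the real code would pass the slice returned by `as_slice_memory_order()`.

```rust
use std::io::{self, Write};

/// Writes already-interleaved i16 samples as raw bytes with a single `write_all`.
/// Hypothetical helper; assumes a little-endian target so the in-memory bytes
/// already match WAV's little-endian PCM encoding.
fn write_interleaved_raw<W: Write>(out: &mut W, interleaved: &[i16]) -> io::Result<()> {
    if cfg!(target_endian = "big") {
        return Err(io::Error::new(io::ErrorKind::Unsupported, "needs a byte swap"));
    }
    // Safety: the byte view spans exactly the memory of `interleaved`, u8 has
    // alignment 1, and the borrow keeps the samples alive for the whole call.
    let bytes = unsafe {
        std::slice::from_raw_parts(
            interleaved.as_ptr().cast::<u8>(),
            std::mem::size_of_val(interleaved),
        )
    };
    out.write_all(bytes)
}
```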

### Save fix — C-order (`wav_file.rs`, `write_audio_data_interleaved`)
Programmatic audio (`AudioSamples.stack`, `sine_wave`, etc.) produces C-order planar arrays
`[L0,L1,…,R0,R1,…]` which must be interleaved before writing WAV.  The naive approach
(`view.t().as_standard_layout()`) allocated a 10 MB intermediate buffer, incurring ~2500 OS
page faults (~2.5 ms) before any data moved.

Fixed with a streaming tiled interleave:
1. Pre-allocate one fixed 512 KB tile buffer (128 page faults × 1 μs = ~0.1 ms, amortised once).
2. For each 512 KB window of frames, iterate channels-outer / frames-inner so each channel's
   window is read sequentially (HW prefetcher works optimally); write interleaved into the tile.
3. `write_all` the tile to `BufWriter` (~41 syscalls for 10 MB).

Result: 7.7 ms → **2.16 ms** (3.1× faster than scipy's 6.8 ms).
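
The tiled interleave can be sketched with the standard library alone. This is a simplified illustration, not the crate's implementation: `write_planar_tiled` and `tile_frames` are hypothetical names, channels are passed as separate slices instead of an `Array2`, and samples are fixed to i16.

```rust
use std::io::{self, Write};

/// Interleaves planar channels into `out` through one reusable tile buffer.
/// `tile_frames` is the number of frames per tile (hypothetical tuning knob).
fn write_planar_tiled<W: Write>(
    out: &mut W,
    channels: &[&[i16]],
    tile_frames: usize,
) -> io::Result<()> {
    let n_ch = channels.len();
    let frames = channels.first().map_or(0, |c| c.len());
    if tile_frames == 0 || channels.iter().any(|c| c.len() != frames) {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "zero tile size or channels of unequal length",
        ));
    }
    // One allocation, reused for every tile.
    let mut tile = vec![0u8; tile_frames * n_ch * 2];
    let mut start = 0;
    while start < frames {
        let len = tile_frames.min(frames - start);
        // Channels outer, frames inner: each channel window is read sequentially.
        for (c, chan) in channels.iter().enumerate() {
            for (f, sample) in chan[start..start + len].iter().enumerate() {
                let at = (f * n_ch + c) * 2;
                tile[at..at + 2].copy_from_slice(&sample.to_le_bytes());
            }
        }
        out.write_all(&tile[..len * n_ch * 2])?;
        start += len;
    }
    Ok(())
}
```

Because the tile is bounded, peak extra memory stays constant regardless of file length, and the final partial tile is written with its true length, not padded.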

---

## Background

Benchmarking (`bench_io.csv`) shows `audio_samples` is **4–11× slower** than scipy/soundfile on WAV
load and save for larger files (44100 Hz, stereo, ≥30 s).  Flamegraph profiling (`py-spy --native`)
confirms the time is spent inside `s_io::wav::data::DataChunk::to_sample_vec`, which is called from
the WAV read path.

## Root Cause: The Copy Chain

For a 10 MB stereo i16 WAV file the current path produces **four full copies** of the audio data:

```
File on disk
    │  mmap (zero copy — AudioDataSource::MemoryMapped)
    ▼
&[u8] view via DataChunk
    │  slice.to_vec()  inside to_sample_vec               ← COPY #1  (~10 MB)
    ▼
Vec<i16>  (interleaved, owned)
    │  .collect_non_empty()  inside read_samples, S == T  ← COPY #2  (~10 MB)
    ▼
NonEmptyVec<i16>  (still interleaved)
    │  deinterleave_multi_vec  (alloc + scatter write)    ← COPY #3  (~10 MB)
    ▼
NonEmptyVec<i16>  (planar L…L R…R)
    │  planar_data.to_vec() → Array2::from_shape_vec      ← COPY #4  (~10 MB)
    ▼
Array2<i16>  inside AudioSamples
```

**~40 MB of memcpy for a 10 MB file.**

### How scipy avoids this

scipy reads directly into a numpy array — no intermediate representation:

```python
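import numpy as np

# `fid`, `n_samples`, `offset`, and `n_channels` come from the parsed WAV header.
# One copy: kernel → numpy buffer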
data = np.fromfile(fid, dtype=np.int16, count=n_samples)
# Or zero copies with mmap:
data = np.memmap(fid, dtype=np.int16, mode='c', offset=offset, shape=(n_samples,))

# Zero-copy reshape — just changes strides, no data movement
if n_channels > 1:
    data = data.reshape(-1, n_channels)   # (frames, channels), same memory
```

scipy returns interleaved `(frames, channels)` layout and uses stride tricks instead of physically
rearranging memory.  The tradeoff is that `AudioSamples` uses planar `(channels, frames)` layout,
which normally requires a physical copy — except that F-order (column-major) arrays can represent
the same interleaved memory layout without touching the data (see Fix C below).

---

## Fixes (ordered by impact / ease)

### Fix A — Eliminate copy #2: skip the identity collect in `read_samples`
**File:** `src/wav/data.rs`
**Effort:** trivial
**Gain:** saves one full memcpy (~10 MB for the target file)

`read_samples::<S, T>` unconditionally routes through
`.into_non_empty_iter().map(T::convert_from).collect_non_empty()`.  When `S == T` this is an
identity transform that allocates a new `Vec` and copies every element.

Fix: add a `TypeId` early-return before the iterator chain (same pattern already used inside
`to_sample_vec` for the f32 case):

```rust
// At the top of the else-branch in read_samples, after the 24-bit / 64-bit early returns:
if TypeId::of::<S>() == TypeId::of::<T>() {
    let samples = self.to_sample_vec::<S>()?;
    // Safety: S and T are the same type — TypeId equality guarantees identical layout.
    return Ok(unsafe { mem::transmute(samples) });
}
```
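
If avoiding `unsafe` is preferred, the same identity shortcut can be expressed with `Any` downcasting. The sketch below is an alternative under stated assumptions, not the crate's code: it uses a plain `Vec` where the crate uses `NonEmptyVec`, and `convert_samples` plus its `convert` closure are hypothetical stand-ins for `read_samples` and `T::convert_from`. It costs one small `Box` allocation for the `Vec` header, while the sample buffer itself is moved, not copied.

```rust
use std::any::Any;

/// Converts `Vec<S>` to `Vec<T>`, moving the existing allocation when S and T
/// are the same type and converting element by element otherwise.
fn convert_samples<S, T>(samples: Vec<S>, convert: impl Fn(S) -> T) -> Vec<T>
where
    S: 'static,
    T: 'static,
{
    // Boxing moves only the Vec header (pointer, length, capacity).
    let boxed: Box<dyn Any> = Box::new(samples);
    match boxed.downcast::<Vec<T>>() {
        // Same type: unwrap the box and return the original buffer untouched.
        Ok(same) => *same,
        // Different type: recover the Vec<S> and run the normal conversion.
        Err(other) => {
            let samples = *other
                .downcast::<Vec<S>>()
                .unwrap_or_else(|_| unreachable!("boxed value was created as Vec<S>"));
            samples.into_iter().map(convert).collect()
        }
    }
}
```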

---

### Fix B — Eliminate copy #4: move instead of clone into Array2
**File:** `src/wav/wav_file.rs`, `build_samples_from_interleaved_vec`
**Effort:** trivial
**Gain:** saves one full memcpy (~10 MB for the target file)

Both the mono and stereo paths call `.to_vec()` on a `NonEmptyVec` that is already owned and about
to be discarded.  `NonEmptyVec` has `into_vec(self) -> Vec<T>` which moves the allocation without
copying.

```rust
// Mono — line ~524
// Before:
AudioSamples::new_mono(Array1::from_vec(interleaved_data.to_vec()), sample_rate)
// After:
AudioSamples::new_mono(Array1::from_vec(interleaved_data.into_vec()), sample_rate)

// Stereo — line ~547
// Before:
Array2::from_shape_vec((num_channels.get() as usize, frames), planar_data.to_vec())
// After:
Array2::from_shape_vec((num_channels.get() as usize, frames), planar_data.into_vec())
```
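
The difference between the two accessors can be seen by comparing buffer addresses. `OwnedSamples` below is a hypothetical stand-in for `NonEmptyVec`, written only to make the clone-versus-move behaviour observable.

```rust
/// Hypothetical stand-in for `NonEmptyVec`, just to show clone vs. move.
struct OwnedSamples<T>(Vec<T>);

impl<T: Clone> OwnedSamples<T> {
    /// Borrowing accessor: allocates a new Vec and copies every element.
    fn to_vec(&self) -> Vec<T> {
        self.0.clone()
    }

    /// Consuming accessor: hands over the existing allocation, no copy.
    fn into_vec(self) -> Vec<T> {
        self.0
    }
}
```

Calling `to_vec()` yields a buffer at a new address while the original stays alive, whereas `into_vec()` returns a `Vec` whose `as_ptr()` equals the original's.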

---

### Fix C — Eliminate copy #3 (deinterleave): use F-order arrays
**File:** `src/wav/wav_file.rs`, `build_samples_from_interleaved_vec`; potentially `audio_samples` repr
**Effort:** moderate
**Gain:** eliminates the deinterleave allocation entirely for 2-channel files; generalises to N channels

**Key insight:** An F-order (column-major) `Array2` with logical shape `(channels, frames)` stores
data in memory as `[s[0,0], s[1,0], s[0,1], s[1,1], …]` — which is exactly WAV's interleaved
layout.  So the interleaved `Vec<i16>` from `to_sample_vec` can be wrapped directly into an
`Array2` without any data movement:

```rust
// Instead of allocating a planar Vec and scatter-writing into it:
let arr = Array2::from_shape_vec(
    (num_channels.get() as usize, frames).f(),   // <-- .f() = Fortran/column-major
    interleaved_data.into_vec(),                  // move, no copy
)
.map_err(|e| ...)?;
AudioSamples::new_multi_channel(arr, sample_rate).map_err(Into::into)
```

The ndarray shape `(C, N).f()` gives strides `(1, C)` — element `[c, n]` is at offset `n*C + c`,
matching interleaved layout.  Downstream code that iterates over `arr.row(c)` will still work
correctly; the only difference is that rows are no longer contiguous in memory (stride = C, not 1).
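
The index arithmetic can be checked without ndarray. In this standard-library sketch, `f_order_offset` and `channel_row` are hypothetical helpers that mimic what an F-order `(C, N)` array does internally: they compute the offset of `[c, n]` and walk one channel with stride `C`.

```rust
/// Offset of logical element `[c, n]` in a column-major `(channels, frames)`
/// array, i.e. one with strides `(1, channels)`.
fn f_order_offset(c: usize, n: usize, channels: usize) -> usize {
    n * channels + c
}

/// Strided view of channel `c` over an interleaved buffer; no samples are moved.
/// Panics if `channels` is zero, because `step_by(0)` is not allowed.
fn channel_row<'a>(
    interleaved: &'a [i16],
    c: usize,
    channels: usize,
) -> impl Iterator<Item = i16> + 'a {
    interleaved.iter().skip(c).step_by(channels).copied()
}
```

For a stereo buffer `[L0, R0, L1, R1, L2, R2]`, `f_order_offset(1, 2, 2)` is 5, the position of `R2`, and `channel_row(buf, 0, 2)` yields `L0, L1, L2`.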

**Considerations:**
- Operations on non-contiguous rows may be slower due to cache behaviour.  Profile the operations
  that matter (STFT, resampling, etc.) before committing to this layout for the internal repr.
- An alternative is to use F-order only at the I/O boundary and convert to C-order (which forces a
  copy; the transpose itself only swaps strides) only when an operation requires contiguous rows.
  This keeps the hot paths fast at the cost of one copy on first use.
- `AudioSamples::new_multi_channel` currently accepts any `Array2`; check whether it asserts or
  relies on C-order anywhere before switching.

---

### Fix D — Eliminate copy #1: reuse the owned buffer in the non-mmap fallback
**File:** `src/wav/data.rs`, `to_sample_vec`; `src/wav/wav_file.rs`, `open_with_options`
**Effort:** larger refactor
**Gain:** eliminates the last remaining copy for files loaded via `AudioDataSource::Owned`

When the file is too large to mmap (or mmap is disabled), `open_with_options` reads the whole file
into `AudioDataSource::Owned(Vec<u8>)`.  Currently `DataChunk` borrows `&[u8]` from that `Vec`,
and `to_sample_vec` copies the bytes into a `Vec<S>`.

The allocation can be reused by changing `to_sample_vec` to consume `self` (`into_sample_vec`) and
transmuting the `Vec<u8>` into `Vec<S>` in-place for the aligned fast-path:

```rust
fn into_sample_vec<S>(self) -> AudioIOResult<NonEmptyVec<S>>
where
    S: StandardSample,
{
    // ... alignment + size checks ...
    if aligned && (S::BITS == 16 || S::BITS == 32) {
        // Reuse the Vec<u8> allocation.
        // Caller must pass AudioDataSource::Owned; mmap case still needs a copy.
        let mut bytes: Vec<u8> = self.bytes.into_owned(); // requires bytes field change
        let num_samples = bytes.len() / sample_size;
        let capacity    = bytes.capacity() / sample_size;
        let ptr = bytes.as_mut_ptr() as *mut S;
        mem::forget(bytes);
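        // Safety: pointer alignment and length multiple are checked above, and S is
        // plain-old-data. Caveat: `Vec::from_raw_parts` also requires the allocation to
        // have been made with S's alignment and exactly `capacity * size_of::<S>()` bytes.
        // A `Vec<u8>` allocation (align 1, arbitrary capacity) guarantees neither, so for
        // this reuse to be sound the buffer must be allocated as `Vec<S>` up front.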
        let vec = unsafe { Vec::from_raw_parts(ptr, num_samples, capacity) };
        return Ok(unsafe { NonEmptyVec::new_unchecked(vec) });
    }
    // fallback unchanged
}
```

This requires `DataChunk.bytes` to change from `&'a [u8]` to an owned or `Cow<'a, [u8]>` variant,
or the caller to pass ownership through.  That is a non-trivial signature change — do this after
Fixes A–C are in and measured.
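
The alignment and size checks elided in `into_sample_vec` could look roughly like this for a borrowed byte slice. `view_as_i16` is a hypothetical helper fixed to i16; it returns `None` where the caller would have to fall back to a copying path.

```rust
/// Views `bytes` as `&[i16]` without copying, or returns `None` when the slice
/// is misaligned or not a whole number of samples. Hypothetical helper; the
/// values are only meaningful as WAV PCM on little-endian targets.
fn view_as_i16(bytes: &[u8]) -> Option<&[i16]> {
    let size = std::mem::size_of::<i16>();
    let aligned = bytes.as_ptr() as usize % std::mem::align_of::<i16>() == 0;
    if !aligned || bytes.len() % size != 0 {
        return None;
    }
    // Safety: alignment and length were checked above, every bit pattern is a
    // valid i16, and the returned slice borrows from `bytes`.
    Some(unsafe { std::slice::from_raw_parts(bytes.as_ptr().cast::<i16>(), bytes.len() / size) })
}
```

A borrowed view like this sidesteps the allocator-layout requirements of `Vec::from_raw_parts`, at the price of tying the samples' lifetime to the source buffer.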

For the **mmap path**, the data is not owned so a copy into `Vec<S>` remains unavoidable as long as
`AudioSamples` requires an owned `Array`.  If zero-copy mmap reads are ever needed, `AudioSamples`
would need to support a borrowed/mmap-backed `ArrayView2` variant — a large architectural change.

---

## Expected Impact

| Fix | Copies removed | Estimated speedup (stereo 44.1 kHz 60 s) |
|-----|---------------|----------------------------------------|
| A   | #2 (collect)  | ~1.3–1.5×                              |
| B   | #4 (to_vec)   | ~1.3–1.5×                              |
| A+B | #2 + #4       | ~1.8–2×                                |
| A+B+C | #2+#3+#4   | ~3–4×  (matches scipy on mmap path)    |
| A+B+C+D | all    | approaches zero-copy limit             |

Fixes A and B are safe, mechanical, and should be done immediately.  Fix C needs a profile pass on
downstream ops to confirm F-order rows don't regress anything.  Fix D is a longer-term refactor.

---

## Save Path

The save path has a similar problem: `AudioSamples::to_interleaved_vec()` materialises an owned
interleaved `Vec<T>` even though the write path immediately serialises it to bytes and discards it.
A zero-copy save would iterate over the `Array2` directly (by index or with a custom interleaving
iterator) and write bytes on the fly, avoiding the intermediate Vec entirely.  This is a separate
workstream; profile the save flamegraph before deciding on priority.