# I/O Performance Plan: Eliminating Redundant Copies
## Status
| Operation | Before | After | vs scipy | vs soundfile |
|---|---|---|---|---|
| Load (44.1 kHz i16 60s stereo) | 5.5ms | **0.38ms** | ✅ 1.3× faster | ✅ 46× faster |
| Save — F-order / from file | 6.8ms | **0.56ms** | ✅ 12× faster | ✅ 11× faster |
| Save — C-order / programmatic | 7.7ms | **2.16ms** | ✅ 3.1× faster | ✅ 2.8× faster |
**All paths solved.** Load is now faster than scipy, matching `np.fromfile` speed.
### Load fix round 1 (`python.rs:136`)
`create_pyarray_fortran` was cloning the owned `NonEmptyVec` with `.to_vec()` right before
handing it to numpy. Changed to `.into_vec()` — ownership transfer, zero extra copy.
Result: 5.5ms → 0.50ms.
### Load fix round 2 (`python.rs::read_wav_direct`, `wav_file.rs::parse_wav_header_streaming`)
The mmap path still had double TLB pressure: it mapped the file pages into one virtual
address range, then `to_vec()` copied them into a second allocation (2× TLB entries for
the same data). `np.fromfile` at 0.38ms was the baseline — one allocation, one kernel
`read()` syscall.
Fixed with a streaming header parser + direct `read_exact` path:
1. `parse_wav_header_streaming` reads only the WAV header via `BufReader` (~80–100 bytes,
one OS read for the 8 KiB buffer) and returns `(BaseAudioInfo, data_byte_offset)`.
2. `read_wav_direct` allocates `Vec<T>` of the right size, seeks to `data_byte_offset`,
then calls `read_exact` — the kernel copies tmpfs pages directly into our Vec with
no intermediate mmap mapping.
3. `Vec` is transferred zero-copy to numpy via `into_vec()` (existing mechanism).
This path is used only when `T` matches the file's native sample type and the platform is
little-endian; type conversions, I24, and big-endian platforms fall back to the mmap path.
Result: 0.50ms → **0.38ms**, matching `np.fromfile` exactly.
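
The two steps above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `find_data_chunk` and `read_i16_direct` are hypothetical names, the sample type is fixed to `i16`, and the header walk only looks for the `data` chunk instead of building a `BaseAudioInfo`.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical helper: walks the RIFF chunk headers and returns
/// (byte offset of the audio data, length of the audio data in bytes).
/// Chunk bodies are skipped with seeks, so only headers are read.
fn find_data_chunk<R: Read + Seek>(r: &mut R) -> io::Result<(u64, usize)> {
    let mut riff = [0u8; 12];
    r.read_exact(&mut riff)?;
    if &riff[0..4] != b"RIFF" || &riff[8..12] != b"WAVE" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not a RIFF/WAVE file"));
    }
    loop {
        let mut hdr = [0u8; 8];
        r.read_exact(&mut hdr)?; // fails with UnexpectedEof if no data chunk exists
        let size = u32::from_le_bytes([hdr[4], hdr[5], hdr[6], hdr[7]]) as u64;
        if &hdr[0..4] == b"data" {
            return Ok((r.stream_position()?, size as usize));
        }
        // Chunk bodies are padded to an even number of bytes.
        r.seek(SeekFrom::Current((size + (size & 1)) as i64))?;
    }
}

/// Reads `len_bytes` of 16-bit PCM straight into a `Vec<i16>`:
/// one allocation, one `read_exact`, no intermediate byte buffer.
fn read_i16_direct<R: Read + Seek>(
    r: &mut R,
    offset: u64,
    len_bytes: usize,
) -> io::Result<Vec<i16>> {
    let n = len_bytes / 2;
    let mut samples = vec![0i16; n];
    r.seek(SeekFrom::Start(offset))?;
    // Safety: the Vec owns n initialised i16 values (2 * n bytes), u8 has
    // alignment 1, and every bit pattern is a valid i16.
    let bytes =
        unsafe { std::slice::from_raw_parts_mut(samples.as_mut_ptr() as *mut u8, n * 2) };
    r.read_exact(bytes)?;
    // No-op on little-endian hosts; keeps the values correct elsewhere.
    for s in &mut samples {
        *s = i16::from_le(*s);
    }
    Ok(samples)
}
```

The resulting `Vec<i16>` can then be handed on by move, which is the property the `into_vec()` transfer relies on.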
### Save fix — F-order (`wav_file.rs`, `write_audio_data_interleaved`)
The F-order `(channels, frames)` array loaded by `read_pyarray` already has WAV-interleaved
memory layout `[L0, R0, L1, R1, …]`. The write path was calling `as_interleaved_vec()` which:
1. Saw F-order → `as_slice()` returned `None` → `iter().copied().collect()` (C-order copy)
2. Called `interleave_multi_vec` on that (second copy back to interleaved)
3. Then serialised byte-by-byte in 256 KiB chunks (third copy)
Fixed by adding a fast path before the interleave logic: if the array is non-standard-layout
(F-order) and memory-contiguous (`as_slice_memory_order()` returns `Some`), cast the slice
directly to `&[u8]` and `write_all` in one shot. Zero intermediate copies, zero allocations.
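
As a rough sketch of that fast path, assuming `i16` samples and a little-endian host (so the in-memory bytes already match WAV byte order), the cast and single write could look like the following. `samples_as_bytes` and `write_interleaved_raw` are illustrative names, not functions from `wav_file.rs`.

```rust
use std::io::{self, Write};

/// Views a slice of i16 samples as raw bytes without copying.
fn samples_as_bytes(samples: &[i16]) -> &[u8] {
    // Safety: u8 has alignment 1, the byte length covers exactly the memory
    // of `samples`, and initialised integers are always valid as bytes.
    unsafe {
        std::slice::from_raw_parts(
            samples.as_ptr() as *const u8,
            std::mem::size_of_val(samples),
        )
    }
}

/// Writes already-interleaved samples with a single `write_all` call.
/// Assumes a little-endian host; other hosts would need a byte swap.
fn write_interleaved_raw<W: Write>(w: &mut W, interleaved: &[i16]) -> io::Result<()> {
    w.write_all(samples_as_bytes(interleaved))
}
```

In the real code the slice would come from `as_slice_memory_order()` on the F-order array; here a plain `&[i16]` stands in for it.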
### Save fix — C-order (`wav_file.rs`, `write_audio_data_interleaved`)
Programmatic audio (`AudioSamples.stack`, `sine_wave`, etc.) produces C-order planar arrays
`[L0,L1,…,R0,R1,…]` which must be interleaved before writing WAV. The naive approach
(`view.t().as_standard_layout()`) allocated a 10 MB intermediate buffer, incurring ~2500 OS
page faults (~2.5 ms) before any data moved.
Fixed with a streaming tiled interleave:
1. Pre-allocate one fixed 512 KB tile buffer (128 page faults × 1 μs = ~0.1 ms, amortised once).
2. For each 512 KB window of frames, iterate channels-outer / frames-inner so each channel's
window is read sequentially (HW prefetcher works optimally); write interleaved into the tile.
3. `write_all` the tile to `BufWriter` (~41 syscalls for 10 MB).
Result: 7.7 ms → **2.16 ms** (3.1× faster than scipy's 6.8 ms).
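
A simplified version of the tiled interleave might look like this. It is a sketch under assumptions: samples are `i16`, the planar input is a flat slice indexed as `planar[c * frames + n]`, and `write_planar_tiled` and its `tile_frames` parameter are hypothetical names chosen for illustration.

```rust
use std::io::{self, Write};

/// Interleaves planar samples into `w` through one fixed-size tile, so a
/// full-size interleaved buffer is never allocated.
fn write_planar_tiled<W: Write>(
    w: &mut W,
    planar: &[i16],
    channels: usize,
    tile_frames: usize,
) -> io::Result<()> {
    assert!(channels > 0 && tile_frames > 0 && planar.len() % channels == 0);
    let frames = planar.len() / channels;
    // The only allocation: reused for every window of frames.
    let mut tile = vec![0u8; tile_frames * channels * 2];

    let mut start = 0;
    while start < frames {
        let n = tile_frames.min(frames - start);
        // Channels outer, frames inner: each channel window is read sequentially.
        for c in 0..channels {
            let src = &planar[c * frames + start..c * frames + start + n];
            for (i, s) in src.iter().enumerate() {
                let at = (i * channels + c) * 2;
                tile[at..at + 2].copy_from_slice(&s.to_le_bytes());
            }
        }
        // One write per tile; the last tile may be shorter.
        w.write_all(&tile[..n * channels * 2])?;
        start += n;
    }
    Ok(())
}
```

The reads are sequential per channel while the writes are strided within the tile; because the tile is small it stays cache-resident, which is the point of bounding its size.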
---
## Background
Benchmarking (`bench_io.csv`) shows `audio_samples` is **4–11× slower** than scipy/soundfile on WAV
load and save for larger files (44100 Hz, stereo, ≥30 s). Flamegraph profiling (`py-spy --native`)
confirms the time is spent inside `s_io::wav::data::DataChunk::to_sample_vec`, which is called from
the WAV read path.
## Root Cause: The Copy Chain
For a 10 MB stereo i16 WAV file the current path produces **four full copies** of the audio data:
```
File on disk
│ mmap (zero copy — AudioDataSource::MemoryMapped)
▼
&[u8] view via DataChunk
│ slice.to_vec() inside to_sample_vec ← COPY #1 (~10 MB)
▼
Vec<i16> (interleaved, owned)
│ .collect_non_empty() inside read_samples, S == T ← COPY #2 (~10 MB)
▼
NonEmptyVec<i16> (still interleaved)
│ deinterleave_multi_vec (alloc + scatter write) ← COPY #3 (~10 MB)
▼
NonEmptyVec<i16> (planar L…L R…R)
│ planar_data.to_vec() → Array2::from_shape_vec ← COPY #4 (~10 MB)
▼
Array2<i16> inside AudioSamples
```
**~40 MB of memcpy for a 10 MB file.**
### How scipy avoids this
scipy reads directly into a numpy array — no intermediate representation:
```python
# One copy: kernel → numpy buffer
data = np.fromfile(fid, dtype=np.int16, count=n_samples)
# Or zero copies with mmap:
data = np.memmap(fid, dtype=np.int16, mode='c', offset=offset, shape=(n_samples,))
# Zero-copy reshape: just changes strides, no data movement
if n_channels > 1:
    data = data.reshape(-1, n_channels)  # (frames, channels), same memory
```
scipy returns interleaved `(frames, channels)` layout and uses stride tricks instead of physically
rearranging memory. The tradeoff is that `AudioSamples` uses planar `(channels, frames)` layout,
which normally requires a physical copy — except that F-order (column-major) arrays can represent
the same interleaved memory layout without touching the data (see Fix C below).
---
## Fixes (ordered by impact / ease)
### Fix A — Eliminate copy #2: skip the identity collect in `read_samples`
**File:** `src/wav/data.rs`
**Effort:** trivial
**Gain:** saves one full memcpy (~10 MB for the target file)
`read_samples::<S, T>` unconditionally routes through
`.into_non_empty_iter().map(T::convert_from).collect_non_empty()`. When `S == T` this is an
identity transform that allocates a new `Vec` and copies every element.
Fix: add a `TypeId` early-return before the iterator chain (same pattern already used inside
`to_sample_vec` for the f32 case):
```rust
// At the top of the else-branch in read_samples, after the 24-bit / 64-bit early returns.
// Needs `use std::{any::TypeId, mem};`. `TypeId::of` requires `S: 'static` and `T: 'static`.
if TypeId::of::<S>() == TypeId::of::<T>() {
    let samples = self.to_sample_vec::<S>()?;
    // Safety: TypeId equality means S and T are the same type, so the source and
    // target of the transmute are the same type with identical layout.
    return Ok(unsafe { mem::transmute(samples) });
}
```
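
For comparison, the same identity fast path can be written without `unsafe` by going through `std::any::Any`. The sketch below uses plain `Vec` in place of `NonEmptyVec`, and `try_identity` is a hypothetical helper name; it illustrates the technique and is not a drop-in replacement for the snippet above.

```rust
use std::any::Any;

/// Returns the input unchanged as `Vec<T>` when `S` and `T` are the same type.
/// Otherwise hands the original Vec back so the caller can run the
/// converting path. No element is copied in either case.
fn try_identity<S: 'static, T: 'static>(samples: Vec<S>) -> Result<Vec<T>, Vec<S>> {
    let mut slot = Some(samples);
    // The downcast succeeds only when Option<Vec<S>> is exactly Option<Vec<T>>.
    if let Some(same) = (&mut slot as &mut dyn Any).downcast_mut::<Option<Vec<T>>>() {
        if let Some(v) = same.take() {
            return Ok(v);
        }
    }
    Err(slot.expect("slot is only emptied on the success path"))
}
```

A caller would try `try_identity::<S, T>(samples)` first and fall through to the `convert_from` iterator chain on `Err`.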
---
### Fix B — Eliminate copy #4: move instead of clone into Array2
**File:** `src/wav/wav_file.rs`, `build_samples_from_interleaved_vec`
**Effort:** trivial
**Gain:** saves one full memcpy (~10 MB for the target file)
Both the mono and stereo paths call `.to_vec()` on a `NonEmptyVec` that is already owned and about
to be discarded. `NonEmptyVec` has `into_vec(self) -> Vec<T>` which moves the allocation without
copying.
```rust
// Mono — line ~524
// Before:
AudioSamples::new_mono(Array1::from_vec(interleaved_data.to_vec()), sample_rate)
// After:
AudioSamples::new_mono(Array1::from_vec(interleaved_data.into_vec()), sample_rate)
// Stereo — line ~547
// Before:
Array2::from_shape_vec((num_channels.get() as usize, frames), planar_data.to_vec())
// After:
Array2::from_shape_vec((num_channels.get() as usize, frames), planar_data.into_vec())
```
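
The difference between the two calls can be observed with the standard library's `Box<[T]>`, which offers both `to_vec()` (copy) and `into_vec()` (move). This sketch assumes `NonEmptyVec::into_vec` behaves like the std method, as described above; `compare_to_vec_and_into_vec` is an illustrative name.

```rust
/// Returns the buffer address of (the source, the `to_vec` result,
/// the `into_vec` result) so the copy and the move can be told apart.
fn compare_to_vec_and_into_vec(samples: Vec<i16>) -> (usize, usize, usize) {
    let boxed: Box<[i16]> = samples.into_boxed_slice();
    let original = boxed.as_ptr() as usize;

    // Allocates a second buffer and copies every element.
    let cloned: Vec<i16> = boxed.to_vec();
    let cloned_ptr = cloned.as_ptr() as usize;

    // Consumes the box and reuses its allocation.
    let moved: Vec<i16> = boxed.into_vec();

    (original, cloned_ptr, moved.as_ptr() as usize)
}
```

For a non-empty input, the first and third addresses are equal and the second differs, which is the one full memcpy that Fix B removes.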
---
### Fix C — Eliminate copy #3 (deinterleave): use F-order arrays
**File:** `src/wav/wav_file.rs`, `build_samples_from_interleaved_vec`; potentially `audio_samples` repr
**Effort:** moderate
**Gain:** eliminates the deinterleave allocation entirely for 2-channel files; generalises to N channels
**Key insight:** An F-order (column-major) `Array2` with logical shape `(channels, frames)` stores
data in memory as `[s[0,0], s[1,0], s[0,1], s[1,1], …]` — which is exactly WAV's interleaved
layout. So the interleaved `Vec<i16>` from `to_sample_vec` can be wrapped directly into an
`Array2` without any data movement:
```rust
// Needs `use ndarray::ShapeBuilder;` for `.f()`.
// Instead of allocating a planar Vec and scatter-writing into it:
let arr = Array2::from_shape_vec(
    (num_channels.get() as usize, frames).f(), // <-- .f() = Fortran/column-major
    interleaved_data.into_vec(),               // move, no copy
)
.map_err(|e| ...)?;
AudioSamples::new_multi_channel(arr, sample_rate).map_err(Into::into)
```
The ndarray shape `(C, N).f()` gives strides `(1, C)` — element `[c, n]` is at offset `n*C + c`,
matching interleaved layout. Downstream code that iterates over `arr.row(c)` will still work
correctly; the only difference is that rows are no longer contiguous in memory (stride = C, not 1).
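
The stride arithmetic can be checked with a small stand-alone model. `InterleavedView` below is a hypothetical stand-in for an F-order `Array2`, written against std only; it is not an ndarray type.

```rust
/// Minimal column-major view over interleaved samples with logical
/// shape (channels, frames) and strides (1, channels).
struct InterleavedView<'a> {
    data: &'a [i16],
    channels: usize,
}

impl<'a> InterleavedView<'a> {
    /// Element `[c, n]` lives at offset `n * C + c`.
    fn get(&self, c: usize, n: usize) -> i16 {
        self.data[n * self.channels + c]
    }

    /// Iterates one channel ("row") with stride `C`, without copying.
    fn row(&self, c: usize) -> impl Iterator<Item = i16> + 'a {
        self.data.iter().skip(c).step_by(self.channels).copied()
    }
}
```

For the interleaved stereo buffer `[L0, R0, L1, R1, L2, R2]`, `row(0)` yields `L0, L1, L2` and `row(1)` yields `R0, R1, R2`, touching every second element in memory.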
**Considerations:**
- Operations on non-contiguous rows may be slower due to cache behaviour. Profile the operations
that matter (STFT, resampling, etc.) before committing to this layout for the internal repr.
- An alternative is to use F-order only at the I/O boundary and convert to C-order (for example
with `as_standard_layout()`, which copies) only when an operation requires contiguous rows.
Transposing by itself only swaps strides and moves no data. This keeps the hot paths fast at
the cost of one copy on first use.
- `AudioSamples::new_multi_channel` currently accepts any `Array2`; check whether it asserts or
relies on C-order anywhere before switching.
---
### Fix D — Eliminate copy #1: zero-copy from mmap for owned-source fallback
**File:** `src/wav/data.rs`, `to_sample_vec`; `src/wav/wav_file.rs`, `open_with_options`
**Effort:** larger refactor
**Gain:** eliminates the last remaining copy for files loaded via `AudioDataSource::Owned`
When the file is too large to mmap (or mmap is disabled), `open_with_options` reads the whole file
into `AudioDataSource::Owned(Vec<u8>)`. Currently `DataChunk` borrows `&[u8]` from that `Vec`,
and `to_sample_vec` copies the bytes into a `Vec<S>`.
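The allocation can be reused by changing `to_sample_vec` to consume `self` (`into_sample_vec`) and
reinterpreting the `Vec<u8>` as a `Vec<S>` in place on the aligned fast path:
```rust
fn into_sample_vec<S>(self) -> AudioIOResult<NonEmptyVec<S>>
where
    S: StandardSample,
{
    // ... alignment + size checks ...
    if aligned && (S::BITS == 16 || S::BITS == 32) {
        // Reuse the Vec<u8> allocation.
        // Caller must pass AudioDataSource::Owned; mmap case still needs a copy.
        // ManuallyDrop keeps the Vec<u8> from freeing the buffer we are about to adopt.
        let mut bytes = mem::ManuallyDrop::new(self.bytes.into_owned()); // requires bytes field change
        let num_samples = bytes.len() / sample_size;
        let capacity = bytes.capacity() / sample_size;
        let ptr = bytes.as_mut_ptr() as *mut S;
        // Safety: alignment checked, size multiple checked, S is plain-old-data.
        // Vec::from_raw_parts additionally requires that the allocation was made with
        // align_of::<S>() and that its byte capacity equals capacity * size_of::<S>().
        // A buffer allocated as Vec<u8> guarantees neither, so the owned source has to
        // be allocated as Vec<S> up front for this to be sound.
        let vec = unsafe { Vec::from_raw_parts(ptr, num_samples, capacity) };
        return Ok(unsafe { NonEmptyVec::new_unchecked(vec) });
    }
    // fallback unchanged
}
```
This requires `DataChunk.bytes` to change from `&'a [u8]` to an owned or `Cow<'a, [u8]>` variant,
or the caller to pass ownership through. It also requires the owned buffer to be allocated with
the alignment of `S`, because checking the pointer's alignment at runtime does not satisfy the
allocator's requirement that memory is freed with the layout it was allocated with. That is a
non-trivial signature change, so do this after Fixes A–C are in and measured.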
For the **mmap path**, the data is not owned so a copy into `Vec<S>` remains unavoidable as long as
`AudioSamples` requires an owned `Array`. If zero-copy mmap reads are ever needed, `AudioSamples`
would need to support a borrowed/mmap-backed `ArrayView2` variant — a large architectural change.
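
To illustrate what a borrowed view over mapped bytes involves, the sketch below reinterprets a `&[u8]` as `&[i16]` with `slice::align_to`. It assumes 16-bit samples in host byte order, and `view_as_i16` is a hypothetical name. The returned slice borrows from the input, which is why `AudioSamples` would need a borrowed variant to hold it.

```rust
/// Reinterprets raw PCM bytes as `&[i16]` without copying.
/// Returns `None` if the buffer cannot be viewed as whole, aligned samples.
fn view_as_i16(bytes: &[u8]) -> Option<&[i16]> {
    // Safety: every bit pattern is a valid i16, so reinterpreting initialised
    // bytes is sound; align_to splits off any misaligned prefix or suffix.
    let (prefix, samples, suffix) = unsafe { bytes.align_to::<i16>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(samples)
    } else {
        None
    }
}
```

A `None` result would route the caller to the copying path, much like the alignment check in the `into_sample_vec` sketch.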
---
## Expected Impact
| Fix | Copies eliminated | Expected speedup |
|---|---|---|
| A | #2 (collect) | ~1.3–1.5× |
| B | #4 (to_vec) | ~1.3–1.5× |
| A+B | #2 + #4 | ~1.8–2× |
| A+B+C | #2+#3+#4 | ~3–4× (matches scipy on mmap path) |
| A+B+C+D | all | approaches zero-copy limit |
Fixes A and B are safe, mechanical, and should be done immediately. Fix C needs a profile pass on
downstream ops to confirm F-order rows don't regress anything. Fix D is a longer-term refactor.
---
## Save Path
The save path has a similar problem: `AudioSamples::to_interleaved_vec()` materialises an owned
interleaved `Vec<T>` even though the write path immediately serialises it to bytes and discards it.
A zero-copy save would iterate over the `Array2` directly (by index or with a custom interleaving
iterator) and write bytes on the fly, avoiding the intermediate Vec entirely. This is a separate
workstream; profile the save flamegraph before deciding on priority.