Translate bytes from an mmap’d byte slice, so the input is consumed without any read() syscalls.
Uses AVX2 SIMD for range-delta patterns, where every byte in a contiguous range shifts by the same fixed offset (e.g., a-z → A-Z).
Chunked approach: a 1MB buffer fits in L2 cache and avoids large allocations.
Translation is memory-bandwidth-bound (not compute-bound), so parallelizing it
offers minimal gain while costing a 100MB+ allocation plus zero-init overhead.
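
A minimal sketch of the chunked approach described above, assuming a 256-entry lookup table and any `Write` sink. The names `CHUNK`, `range_delta_table`, and `translate_chunked` are hypothetical and not taken from the original code, and a scalar table lookup stands in here for the AVX2 path:

```rust
use std::io::{self, Write};

/// Hypothetical chunk size: 1 MiB, matching the buffer size named above.
const CHUNK: usize = 1024 * 1024;

/// Build a lookup table for a range-delta pattern: bytes in `lo..=hi`
/// are shifted by `delta` (wrapping), all other bytes map to themselves.
fn range_delta_table(lo: u8, hi: u8, delta: u8) -> [u8; 256] {
    let mut table = [0u8; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        let b = i as u8;
        *slot = if (lo..=hi).contains(&b) { b.wrapping_add(delta) } else { b };
    }
    table
}

/// Translate `data` (e.g. an mmap'd slice) through `table`, one chunk at a
/// time, reusing a single output buffer of at most `CHUNK` bytes.
fn translate_chunked<W: Write>(
    data: &[u8],
    table: &[u8; 256],
    out: &mut W,
) -> io::Result<()> {
    // One allocation, sized to the smaller of the input and the chunk size.
    let mut buf = vec![0u8; CHUNK.min(data.len())];
    for chunk in data.chunks(CHUNK) {
        let dst = &mut buf[..chunk.len()];
        for (d, &s) in dst.iter_mut().zip(chunk) {
            *d = table[s as usize];
        }
        out.write_all(dst)?;
    }
    Ok(())
}
```

For a-z → A-Z the delta would be `b'A'.wrapping_sub(b'a')`. The output buffer is reused for every chunk, so peak memory stays near 1 MiB regardless of input size.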
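Translate + squeeze from an mmap’d byte slice.
Single buffer: translate into the buffer, then squeeze in place. The write position wp never passes the read position i (wp <= i always holds), so no unread byte is overwritten.
This eliminates the second buffer allocation and reduces memory traffic.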
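
The single-buffer idea can be sketched as follows. This is an illustration under assumptions, not the original implementation: the function name `translate_squeeze`, the `squeeze` membership table, and the `last` parameter (which carries the previously emitted byte across chunks) are all hypothetical.

```rust
/// Translate `src` into `buf` through `table`, then collapse runs of any
/// byte flagged in `squeeze` to a single byte, compacting `buf` in place.
/// Returns the number of valid bytes at the front of `buf`.
/// Panics if `buf` is shorter than `src`.
fn translate_squeeze(
    src: &[u8],
    table: &[u8; 256],
    squeeze: &[bool; 256],
    buf: &mut [u8],
    last: &mut Option<u8>,
) -> usize {
    let buf = &mut buf[..src.len()];

    // Pass 1: translate into the buffer.
    for (d, &s) in buf.iter_mut().zip(src) {
        *d = table[s as usize];
    }

    // Pass 2: squeeze in place. `wp` advances at most once per `i`,
    // so `wp <= i` holds and no unread byte is ever overwritten.
    let mut wp = 0;
    for i in 0..buf.len() {
        let b = buf[i];
        if squeeze[b as usize] && *last == Some(b) {
            continue; // repeated byte from the squeeze set: drop it
        }
        *last = Some(b);
        buf[wp] = b;
        wp += 1;
    }
    wp
}
```

A caller would write `&buf[..n]` after each chunk and pass the same `last` to the next call, so a run that straddles a chunk boundary is still squeezed.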