Translate bytes from an mmap’d byte slice.
Detects single-range translations (e.g., a-z to A-Z), where every byte in one
contiguous range is shifted by the same offset, and handles them with SIMD
arithmetic (AVX2: 32 bytes/iter, SSE2: 16 bytes/iter).
Falls back to a scalar 256-byte table lookup for general translations.
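
The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the names `detect_single_range`, `translate_range`, `translate_table`, and `translate` are hypothetical, and the range path is a plain scalar loop that the compiler may auto-vectorize rather than hand-written AVX2/SSE2 intrinsics.

```rust
/// Hypothetical helper: returns `(lo, hi, delta)` if `table` differs from the
/// identity mapping only on one contiguous range `lo..=hi`, and every byte in
/// that range is shifted by the same wrapping offset `delta`.
fn detect_single_range(table: &[u8; 256]) -> Option<(u8, u8, u8)> {
    let mut lo: Option<usize> = None;
    let mut hi = 0usize;
    for (i, &t) in table.iter().enumerate() {
        if t != i as u8 {
            if lo.is_none() {
                lo = Some(i);
            }
            hi = i;
        }
    }
    let lo = lo?; // identity table: nothing to detect
    let delta = table[lo].wrapping_sub(lo as u8);
    for i in lo..=hi {
        if table[i] != (i as u8).wrapping_add(delta) {
            return None; // gap or non-constant offset: not a single range
        }
    }
    Some((lo as u8, hi as u8, delta))
}

/// Range path: one unsigned compare and one add per byte, no table access.
fn translate_range(src: &[u8], dst: &mut [u8], lo: u8, hi: u8, delta: u8) {
    let width = hi - lo;
    for (d, &s) in dst.iter_mut().zip(src) {
        // `s - lo <= hi - lo` (wrapping) is true exactly when lo <= s <= hi.
        let in_range = s.wrapping_sub(lo) <= width;
        *d = if in_range { s.wrapping_add(delta) } else { s };
    }
}

/// General path: scalar lookup in the 256-byte table.
fn translate_table(src: &[u8], dst: &mut [u8], table: &[u8; 256]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = table[s as usize];
    }
}

/// Picks the range path when possible, otherwise falls back to the table.
fn translate(src: &[u8], table: &[u8; 256]) -> Vec<u8> {
    let mut dst = vec![0u8; src.len()];
    match detect_single_range(table) {
        Some((lo, hi, delta)) => translate_range(src, &mut dst, lo, hi, delta),
        None => translate_table(src, &mut dst, table),
    }
    dst
}
```

The range check uses a single wrapping subtraction and compare because that form maps directly onto byte-wise SIMD operations, which is presumably why single-range translations get their own fast path.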

Translate bytes in place on an owned buffer, then write the result.
For piped stdin, where we own the data, this avoids the separate output buffer
that translate_mmap must allocate. Large inputs are processed in parallel with
in-place SIMD.
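
A rough sketch of the in-place variant, under stated assumptions: the function names, the `PARALLEL_THRESHOLD` value, and the fixed thread count are all hypothetical, the per-chunk loop is a scalar table lookup standing in for the SIMD kernel, and scoped threads from the standard library stand in for whatever parallel runtime the real code uses.

```rust
use std::io::{self, Write};
use std::thread;

/// Assumed cutoff below which spawning threads is not worth the overhead.
const PARALLEL_THRESHOLD: usize = 1 << 20;

/// Overwrites each byte of `chunk` with its translation; no second buffer.
fn translate_chunk_in_place(chunk: &mut [u8], table: &[u8; 256]) {
    for b in chunk.iter_mut() {
        *b = table[*b as usize];
    }
}

/// Translates `buf` in place, splitting large inputs into disjoint chunks
/// that are processed on separate scoped threads.
fn translate_in_place(buf: &mut [u8], table: &[u8; 256], threads: usize) {
    if buf.len() < PARALLEL_THRESHOLD || threads <= 1 {
        translate_chunk_in_place(buf, table);
        return;
    }
    // Round up so that at most `threads` chunks are produced.
    let chunk_len = (buf.len() + threads - 1) / threads;
    thread::scope(|s| {
        for chunk in buf.chunks_mut(chunk_len) {
            s.spawn(move || translate_chunk_in_place(chunk, table));
        }
    });
}

/// Takes ownership of the input, translates it in place, and writes it out.
fn translate_owned_and_write<W: Write>(
    mut buf: Vec<u8>,
    table: &[u8; 256],
    out: &mut W,
) -> io::Result<()> {
    translate_in_place(&mut buf, table, 4); // thread count is an assumption
    out.write_all(&buf)
}
```

Because `chunks_mut` hands out non-overlapping mutable slices, each thread can overwrite its own region without synchronization, and the same buffer that was read from stdin is the one passed to `write_all`.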