Streaming paste for the parallel (normal) mode.
Scans each file line by line with memchr on the fly, with no pre-split offset arrays.
Uses a single 2 MB output buffer that is flushed with raw file-descriptor writes.
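
The streaming approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `next_line` and `paste_streaming` are hypothetical names, a plain byte search from the standard library stands in for memchr, a generic `Write` stands in for the raw fd, and the buffer size is passed in as `buf_cap` rather than fixed at 2 MB.

```rust
use std::io::{self, Write};

/// Hypothetical helper: returns the next line (newline excluded) starting at
/// `*pos` and advances `*pos` past it, or returns `None` at end of input.
/// The `position` search is a std-only stand-in for a memchr call.
fn next_line<'a>(data: &'a [u8], pos: &mut usize) -> Option<&'a [u8]> {
    if *pos >= data.len() {
        return None;
    }
    let rest = &data[*pos..];
    match rest.iter().position(|&b| b == b'\n') {
        Some(i) => {
            *pos += i + 1;
            Some(&rest[..i])
        }
        None => {
            // Final line without a trailing newline.
            *pos = data.len();
            Some(rest)
        }
    }
}

/// Hypothetical sketch of parallel-mode paste: one cursor per file, one shared
/// output buffer that is flushed whenever it reaches `buf_cap` bytes.
fn paste_streaming<W: Write>(
    files: &[&[u8]],
    delim: u8,
    buf_cap: usize,
    out: &mut W,
) -> io::Result<()> {
    let mut cursors = vec![0usize; files.len()];
    let mut buf: Vec<u8> = Vec::with_capacity(buf_cap);
    loop {
        let row_start = buf.len();
        let mut any_line = false;
        for (i, data) in files.iter().enumerate() {
            if i > 0 {
                buf.push(delim);
            }
            if let Some(line) = next_line(data, &mut cursors[i]) {
                any_line = true;
                buf.extend_from_slice(line);
            }
        }
        if !any_line {
            // Every file is exhausted: drop the delimiters written for this row.
            buf.truncate(row_start);
            break;
        }
        buf.push(b'\n');
        if buf.len() >= buf_cap {
            out.write_all(&buf)?;
            buf.clear();
        }
    }
    out.write_all(&buf)?;
    out.flush()
}
```

Because each file keeps only a cursor, memory use stays bounded by the output buffer regardless of how many lines the inputs contain.
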
Paste files in normal (parallel) mode and return the output buffer.
Pre-splits each file into line offsets (one SIMD pass per file), so the main
loop uses O(1) array indexing instead of per-line memchr calls.
Uses unsafe raw pointer writes to eliminate bounds-check overhead.
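
A rough sketch of the pre-split variant, under stated assumptions: `line_ranges` and `paste_presplit` are hypothetical names, the offset pass here is a scalar loop rather than a SIMD scan, and the capacity bound is one way to make the unchecked writes sound, not necessarily the one the real code uses.

```rust
use std::ptr;

/// Hypothetical helper: a single pass over `data` that records the
/// (start, end) byte range of every line, newline excluded.
/// A scalar loop stands in for the SIMD scan.
fn line_ranges(data: &[u8]) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for (i, &b) in data.iter().enumerate() {
        if b == b'\n' {
            ranges.push((start, i));
            start = i + 1;
        }
    }
    if start < data.len() {
        // Final line without a trailing newline.
        ranges.push((start, data.len()));
    }
    ranges
}

/// Hypothetical sketch: paste in parallel mode and return the output buffer.
fn paste_presplit(files: &[&[u8]], delim: u8) -> Vec<u8> {
    let ranges: Vec<Vec<(usize, usize)>> =
        files.iter().map(|d| line_ranges(d)).collect();
    let rows = ranges.iter().map(|r| r.len()).max().unwrap_or(0);

    // Upper bound on output size: every input byte, plus one delimiter or
    // newline per file per row. All raw writes below stay within this bound.
    let total: usize = files.iter().map(|d| d.len()).sum();
    let cap = total + rows * files.len();
    let mut out: Vec<u8> = Vec::with_capacity(cap);
    let base = out.as_mut_ptr();
    let mut len = 0usize;

    for row in 0..rows {
        for (i, data) in files.iter().enumerate() {
            if i > 0 {
                // SAFETY: len < cap by the bound computed above.
                unsafe { base.add(len).write(delim) };
                len += 1;
            }
            // O(1) lookup of this file's line for the current row, if any.
            if let Some(&(s, e)) = ranges[i].get(row) {
                // SAFETY: s..e lies inside `data`, the destination lies inside
                // the reserved capacity, and the two allocations are distinct.
                unsafe {
                    ptr::copy_nonoverlapping(data.as_ptr().add(s), base.add(len), e - s);
                }
                len += e - s;
            }
        }
        // SAFETY: len < cap by the bound computed above.
        unsafe { base.add(len).write(b'\n') };
        len += 1;
    }
    debug_assert!(len <= cap);
    // SAFETY: the first `len` bytes were initialized by the writes above.
    unsafe { out.set_len(len) };
    out
}
```

The trade-off relative to the streaming version is memory: the offset arrays and the full output buffer are held at once, in exchange for a main loop with no per-line searching.
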