Streaming paste for the parallel (normal) mode.
Scans for each line end with memchr and collects output in a 1MB buffer
that is written directly to the raw fd.
The common two-file case is dispatched to an optimized fast path.
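
The streaming loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this file: `paste_stream_sketch` and `FLUSH_AT` are hypothetical names, `BufRead::read_until` stands in for the memchr scan, and a generic `Write` stands in for the raw fd. Only a single delimiter byte is handled.

```rust
use std::io::{self, BufRead, Write};

// Hypothetical flush threshold, mirroring the 1MB output buffer.
const FLUSH_AT: usize = 1 << 20;

// Sketch only: read one line from every input per row, join the pieces
// with `delim`, and flush the output buffer whenever it fills up.
fn paste_stream_sketch<R: BufRead, W: Write>(
    inputs: &mut [R],
    delim: u8,
    out: &mut W,
) -> io::Result<()> {
    let mut buf: Vec<u8> = Vec::with_capacity(FLUSH_AT);
    let mut line: Vec<u8> = Vec::new();
    let mut done = vec![false; inputs.len()];
    loop {
        let row_start = buf.len();
        let mut any = false;
        for (col, input) in inputs.iter_mut().enumerate() {
            if col > 0 {
                buf.push(delim);
            }
            if done[col] {
                continue; // exhausted inputs contribute an empty field
            }
            line.clear();
            if input.read_until(b'\n', &mut line)? == 0 {
                done[col] = true;
                continue;
            }
            any = true;
            if line.last() == Some(&b'\n') {
                line.pop(); // drop the line terminator
            }
            buf.extend_from_slice(&line);
        }
        if !any {
            // Every input hit EOF: discard the delimiter-only row.
            buf.truncate(row_start);
            break;
        }
        buf.push(b'\n');
        if buf.len() >= FLUSH_AT {
            out.write_all(&buf)?;
            buf.clear();
        }
    }
    out.write_all(&buf)?;
    out.flush()
}
```

Byte slices implement `BufRead` and `Vec<u8>` implements `Write`, so the sketch can be exercised without touching any files.
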
Paste files in normal (parallel) mode and return the output buffer.
Pre-splits each file into line offsets (one SIMD pass per file); the main
loop then finds each line by O(1) array indexing instead of calling memchr per line.
Uses unsafe raw pointer writes to avoid bounds-check overhead.
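
A minimal sketch of the pre-split approach, assuming hypothetical helpers `line_spans` and `paste_parallel_sketch`. A plain byte scan stands in for the SIMD memchr pass, and safe `Vec` appends stand in for the unsafe raw pointer writes, so this shows the indexing scheme rather than the optimized code path.

```rust
// Sketch only: record (start, end) for every line, newline excluded.
// A plain byte scan stands in for the SIMD memchr pass.
fn line_spans(data: &[u8]) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start = 0;
    for (i, &b) in data.iter().enumerate() {
        if b == b'\n' {
            spans.push((start, i));
            start = i + 1;
        }
    }
    if start < data.len() {
        // Final line without a trailing newline.
        spans.push((start, data.len()));
    }
    spans
}

// Sketch only: row `r` of the output joins line `r` of every file.
// Files that have run out of lines contribute an empty field.
fn paste_parallel_sketch(files: &[&[u8]], delims: &[u8]) -> Vec<u8> {
    let spans: Vec<Vec<(usize, usize)>> =
        files.iter().map(|f| line_spans(f)).collect();
    let rows = spans.iter().map(|s| s.len()).max().unwrap_or(0);
    let mut out = Vec::new();
    for row in 0..rows {
        for (col, file) in files.iter().enumerate() {
            if col > 0 && !delims.is_empty() {
                // The delimiter list restarts on every output row.
                out.push(delims[(col - 1) % delims.len()]);
            }
            // O(1) lookup of the line's span; no rescanning of the file.
            if let Some(&(s, e)) = spans[col].get(row) {
                out.extend_from_slice(&file[s..e]);
            }
        }
        out.push(b'\n');
    }
    out
}
```

With the inputs `a\nb\n` and `1\n2\n3\n` and a tab delimiter, the sketch produces the rows `a<TAB>1`, `b<TAB>2`, and `<TAB>3`.
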
Paste files in serial mode and return the output buffer.
For each file, joins all of its lines into one output line, cycling through the delimiter list.
Pre-splits lines using SIMD memchr, then iterates offset pairs.
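
Serial mode can be sketched in the same spirit. The names `line_bounds` and `paste_serial_sketch` are hypothetical, a plain byte scan again stands in for SIMD memchr, and restarting the delimiter list for each file is an assumption about the intended behavior.

```rust
// Sketch only: bounds[k]..bounds[k + 1] covers line k, including its
// newline when present. A plain byte scan stands in for SIMD memchr.
fn line_bounds(data: &[u8]) -> Vec<usize> {
    let mut bounds = vec![0];
    for (i, &b) in data.iter().enumerate() {
        if b == b'\n' {
            bounds.push(i + 1);
        }
    }
    if bounds[bounds.len() - 1] != data.len() {
        // Final line without a trailing newline.
        bounds.push(data.len());
    }
    bounds
}

// Sketch only: each file becomes one output line; its lines are joined
// with delimiters taken cyclically from `delims`.
fn paste_serial_sketch(files: &[&[u8]], delims: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for &file in files {
        let bounds = line_bounds(file);
        // Walk adjacent offset pairs: (start, end) of each line.
        for (k, pair) in bounds.windows(2).enumerate() {
            if k > 0 && !delims.is_empty() {
                out.push(delims[(k - 1) % delims.len()]);
            }
            let line = &file[pair[0]..pair[1]];
            // Drop the line terminator, if any.
            let line = match line.last() {
                Some(&b'\n') => &line[..line.len() - 1],
                _ => line,
            };
            out.extend_from_slice(line);
        }
        out.push(b'\n');
    }
    out
}
```

With the delimiter list `,;`, the input `a\nb\nc\n` becomes the single line `a,b;c`.
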