bytes-handoff
bytes-handoff is a small Rust crate for moving owned byte buffers across
async I/O boundaries.
It does not replace AsyncRead or AsyncWrite. It layers on top of them so
protocol code can:
- read bytes as soon as they arrive
- keep nonblocking reads in safe, owned mutable buffer state
- peek at incomplete input without committing
- preserve unconsumed tails across parser or mode boundaries
- split complete prefixes into Bytes for cheap cross-task handoff
- submit owned Bytes writes to an async writer without borrowing memory until the socket finishes
- bound queued writes by item count and byte count
The zero-copy claim here is intentionally scoped: BytesMut/Bytes avoid
extra application-level copies after socket ingress. They do not make TCP
kernel-to-userspace reads zero-copy.
The main use case is content-routed streaming I/O: many independent connections, each with bytes arriving in arbitrary fragments, where routing decisions depend on the stream content. In that setting the read buffer is protocol state, not temporary scratch memory.
Use plain read(&mut [u8]) when bytes are consumed immediately in the same
function and then discarded. Use bytes-handoff when the read buffer itself is
part of the protocol state: partial frames must survive, complete prefixes need
owned handoff, or a parser may switch into a raw tunnel without losing already
read bytes.
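The distinction above can be sketched with the standard library alone. The following is a minimal illustration, not the crate's API: LineState and its methods are hypothetical names, and a plain Vec<u8> stands in for BytesMut. It shows a partial line surviving between reads, complete lines leaving as owned buffers, and the unconsumed tail being handed over at a mode switch.

```rust
/// Hypothetical std-only stand-in for a persistent read buffer.
/// This is NOT the crate's HandoffBuffer; it only illustrates the lifecycle.
#[derive(Default)]
struct LineState {
    buf: Vec<u8>,
}

impl LineState {
    /// Appends one fragment exactly as a nonblocking read delivered it.
    fn feed(&mut self, fragment: &[u8]) {
        self.buf.extend_from_slice(fragment);
    }

    /// Looks at buffered input without consuming it.
    fn peek(&self) -> &[u8] {
        &self.buf
    }

    /// Splits off one complete line (including the '\n') as an owned buffer.
    /// Returns None and keeps the partial tail when no full line is buffered.
    fn next_line(&mut self) -> Option<Vec<u8>> {
        let end = self.buf.iter().position(|&b| b == b'\n')? + 1;
        // `tail` holds everything after the line; `self.buf` keeps the line.
        let tail = self.buf.split_off(end);
        // Swap them: the line is returned, the tail becomes the new state.
        Some(std::mem::replace(&mut self.buf, tail))
    }

    /// Hands over every unconsumed byte, for example at a switch to a raw tunnel.
    fn take_tail(&mut self) -> Vec<u8> {
        std::mem::take(&mut self.buf)
    }
}
```

Feeding `GET /a` and then `\nCONNECT x\nraw` yields two owned lines and leaves `raw` for take_tail. Vec::split_off copies the tail on every split; avoiding that copy is one reason to hold this state in BytesMut, whose split_to is O(1).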
Examples
The repository includes small runnable examples:
- line_protocol: incremental line parsing with a partial tail.
- length_prefix: length-prefixed frames where the header arrives before the full payload.
- content_routing: inspect buffered bytes, route complete safe prefixes, then switch to a raw tunnel while preserving already-read tail bytes.
- write_handoff: submit owned bytes to an async writer and await completion.
Run one with:
Read Handoff
The read example imports Bytes and HandoffBuffer. Its async function reads available bytes, splits one complete line, and hands it off. A second function sends the owned byte slice to sync or async protocol work.
Write Handoff
The write example imports Bytes and submits owned bytes to an async writer without blocking the producer.
Benchmarks
The repository includes Criterion benchmarks for the main operations this crate adds around async I/O:
- incremental reads into persistent state
- complete-frame splitting into owned Bytes
- preserving an already-read tail when parser mode changes
- bounded owned write submission and completion tracking
- byte-budget backpressure
Run them with:
For a defensible local baseline:
The baseline script uses --sample-size 100 --measurement-time 5 by default.
On Linux, set TASKSET_CORES=2-5 to pin the run with taskset and
PERF_STAT=1 to collect cycles, instructions, cache misses, context switches,
and CPU migrations alongside Criterion. For release-grade numbers, run on the
same kernel and CPU family as deployment, with the CPU governor and turbo policy
fixed outside the script.
For an end-to-end stream harness rather than a Criterion microbenchmark:
The harness drives complete client/proxy/sink streams through fragmented input,
content-routed prefixes, tunnel handoff, and WriteHandoff output. It writes
machine-readable run artifacts under bench/results/, including throughput,
latency percentiles, context switches, CPU cost, and peak RSS. Use
--transport tcp for a localhost TCP service/sink harness; see
bench/README.md.
Latest TCP Harness Results
These are end-to-end localhost TCP harness results from an Ubuntu 24.04 server
(adam), not Criterion microbenchmarks. The driver/client and sink processes
were pinned to a different CPU set than the proxy service so the service-side
CPU cost can be read separately from load generation.
Run shape:
- transport: localhost TCP
- worker threads: 16
- concurrent connections: 128
- route frames per connection: 64
- route frame payload: 63 bytes
- tunnel payload per connection: 1 MiB
- read reserve: 16 KiB
- handoff write pending budget: 32 KiB default (2 * read_reserve)
- completion mode: fire-and-forget
- runs: 2 per point, 10 second target duration
The implementations are:
- handoff: HandoffBuffer plus WriteHandoff; this is the crate path.
- manual_vec: a hand-written persistent Vec<u8> parser with direct writes; this is the practical read(&mut [u8]) comparison.
- raw_copy: unparsed async copy; this is a lower bound and does less protocol work than handoff.
Coalesced input, where client writes arrive in 16 KiB chunks:
| implementation | service throughput | service CPU | service cost | driver p99 latency |
|---|---|---|---|---|
| handoff | 2860 MiB/s | 6.27 cores | 2.09 ns/B | 59.8 ms |
| manual_vec | 3098 MiB/s | 5.09 cores | 1.57 ns/B | 48.9 ms |
| raw_copy | 3072 MiB/s | 4.57 cores | 1.42 ns/B | 56.2 ms |
Fragmented input, where client writes arrive in 64 byte chunks:
| implementation | service throughput | service CPU | service cost | driver p99 latency |
|---|---|---|---|---|
| handoff | 103.1 MiB/s | 7.58 cores | 70.14 ns/B | 1195 ms |
| manual_vec | 103.7 MiB/s | 6.69 cores | 61.64 ns/B | 1138 ms |
| raw_copy | 104.5 MiB/s | 6.64 cores | 60.72 ns/B | 1138 ms |
Interpretation:
- In the coalesced TCP workload, handoff is about 8% behind the practical manual Vec<u8> baseline on throughput and uses about 23% more proxy-service CPU. That is the current cost of the safer buffer lifecycle, owned prefix handoff, bounded write queue, and mode-switch tail preservation.
- In the fragmented 64 byte workload, throughput is dominated by tiny socket operations. All implementations cluster around 104 MiB/s; the difference is mostly service CPU cost.
- raw_copy is useful as a lower bound, but it is not a semantic replacement: it does not parse route frames, preserve parser state, or hand off owned prefixes.
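The percentages and the service cost column can be cross-checked with a little arithmetic. The sketch below assumes that service cost means CPU nanoseconds per forwarded byte, that is, busy cores times 10^9 divided by bytes per second. This definition is inferred from the coalesced rows rather than taken from the harness, and both function names are hypothetical.

```rust
const BYTES_PER_MIB: f64 = 1024.0 * 1024.0;

/// CPU nanoseconds spent per byte, given average busy cores and throughput
/// in MiB/s. Assumed definition of the "service cost" column.
fn cost_ns_per_byte(cores: f64, mib_per_s: f64) -> f64 {
    cores * 1e9 / (mib_per_s * BYTES_PER_MIB)
}

/// Relative difference of `value` against `baseline`, in percent.
/// Negative means `value` is below the baseline.
fn percent_vs(value: f64, baseline: f64) -> f64 {
    (value / baseline - 1.0) * 100.0
}
```

With the coalesced rows, cost_ns_per_byte(6.27, 2860.0) comes out near 2.09 and percent_vs(2860.0, 3098.0) near -7.7, matching the table and the "about 8%" figure. The fragmented rows deviate from this formula by up to roughly 0.2%, which may come from per-run averaging or unrounded inputs.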
What The Read Benchmarks Measure
The read benchmarks are split by workload so the output does not imply one single headline throughput number.
| benchmark family | what it measures | when it should win |
|---|---|---|
| read_raw_discard_lower_bound | Raw read(&mut [u8]) into temporary scratch, then count and discard bytes. No parsing, persistent state, tail preservation, or owned handoff. | Immediate local consumption where bytes do not outlive the read call. Treat this as a lower bound, not a peer comparison. |
| read_owned_lines/manual_vec_copy | Manual read(&mut [u8]) loop that appends to persistent state, preserves partial lines, finds complete frames, and copies each frame into a new Vec<u8>. | Code that needs owned frames but does not use BytesMut/Bytes splitting. |
| read_owned_lines/bytesmut_split | Direct BytesMut implementation: read into persistent mutable state and split complete frames into owned Bytes. | The closest baseline for wrapper overhead. |
| read_owned_lines/handoff_buffer | HandoffBuffer: same owned-frame workload, with max-length enforcement and the crate API around the buffer lifecycle. | Content-routed streams where buffering rules should live behind a small API. |
| read_handoff_fragmentation_sweep | The same HandoffBuffer line workload at different read reserve sizes. | Understanding sensitivity to tiny, fragmented reads versus coalesced reads. |
| split_freeze_prefixes | Cost of repeatedly splitting owned Bytes prefixes out of buffered state. | Many complete frames already buffered. |
| split_prefix_mut | Cost of splitting BytesMut prefixes without freezing them into Bytes. | Hot paths that keep mutable owned frames. |
| take_tail_mode_switch | Cost of preserving already-read bytes when switching parser modes, such as parsed routing to raw tunnel. | Protocols that inspect first, then tunnel or hand off the remaining stream. |
The raw discard lower bound should be much faster than the framed workloads.
That does not mean HandoffBuffer is slow at the job it is meant to do; it
means the raw benchmark does less work. For wrapper overhead, compare
read_owned_lines/bytesmut_split with read_owned_lines/handoff_buffer.
What The Write Benchmarks Measure
| benchmark family | what it measures |
|---|---|
| write_large_chunks/direct_write_all | Direct AsyncWriteExt::write_all of large chunks into an in-memory duplex stream. |
| write_large_chunks/handoff_ticket_single_task | Submit the same chunks as owned Bytes through WriteHandoff, then await completion tickets. |
| write_large_chunks/handoff_fire_and_forget_single_task | Submit owned Bytes without allocating per-write completion tickets. |
| write_many_tasks/ticket | Many Tokio tasks submit owned Bytes to one handoff and await completion tickets. This is task fan-in, not a cross-thread producer benchmark. |
| write_many_tasks/fire_and_forget | The same task fan-in workload without per-write completion tickets. |
| write_byte_budget_backpressure | Fast rejection when a write exceeds the configured pending-byte budget. |
The write benchmarks measure owned Bytes submission into one async writer,
batched drain behavior, optional completion notification, and backpressure. They
are not raw socket throughput benchmarks.
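As a rough picture of what bounding by item count and byte count means, here is a std-only sketch. It is not WriteHandoff: BoundedQueue, Rejected, and every method name are hypothetical, Vec<u8> stands in for Bytes, and there is no async writer, only the admission check and the bookkeeping on completion.

```rust
use std::collections::VecDeque;

/// Why a submission was refused (hypothetical names, for illustration only).
#[derive(Debug, PartialEq)]
enum Rejected {
    TooManyItems,
    OverByteBudget,
}

/// Write queue bounded by both pending item count and pending byte count.
struct BoundedQueue {
    items: VecDeque<Vec<u8>>,
    pending_bytes: usize,
    max_items: usize,
    max_bytes: usize,
}

impl BoundedQueue {
    fn new(max_items: usize, max_bytes: usize) -> Self {
        Self { items: VecDeque::new(), pending_bytes: 0, max_items, max_bytes }
    }

    /// Accepts the buffer only if both budgets still hold. On rejection the
    /// buffer goes straight back to the caller, so the producer never blocks.
    fn try_submit(&mut self, buf: Vec<u8>) -> Result<(), (Rejected, Vec<u8>)> {
        if self.items.len() >= self.max_items {
            return Err((Rejected::TooManyItems, buf));
        }
        if self.pending_bytes + buf.len() > self.max_bytes {
            return Err((Rejected::OverByteBudget, buf));
        }
        self.pending_bytes += buf.len();
        self.items.push_back(buf);
        Ok(())
    }

    /// Called by the writer side once the oldest buffer is fully written;
    /// releases its share of the byte budget.
    fn complete_one(&mut self) -> Option<Vec<u8>> {
        let buf = self.items.pop_front()?;
        self.pending_bytes -= buf.len();
        Some(buf)
    }
}
```

With a 32 KiB byte budget, a 16 KiB write is accepted and a following write of 16 KiB plus one byte is rejected immediately with OverByteBudget. That rejection path is the kind of operation write_byte_budget_backpressure is described as timing.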
Treat all benchmark numbers as directional, not universal. They use in-memory
readers/writers; real sockets add scheduler, kernel, and network effects. The
benchmark exists to make the tradeoff explicit: raw slice reads are best for
immediate consumption, while bytes-handoff targets safe mutable buffering,
prefix ownership, tail preservation, and bounded async write handoff.