bytes-handoff 1.0.0

Incremental async byte ingestion and bounded owned write handoff.

bytes-handoff

bytes-handoff is a small Rust crate for moving owned byte buffers across async I/O boundaries.

It does not replace AsyncRead or AsyncWrite. It layers on top of them so protocol code can:

  • read bytes as soon as they arrive
  • keep nonblocking reads in safe, owned mutable buffer state
  • peek at incomplete input without committing
  • preserve unconsumed tails across parser or mode boundaries
  • split complete prefixes into Bytes for cheap cross-task handoff
  • submit owned Bytes writes to an async writer without keeping caller memory borrowed until the socket write completes
  • bound queued writes by item count and byte count

The zero-copy claim here is intentionally scoped: BytesMut/Bytes avoid extra application-level copies after socket ingress. They do not make TCP kernel-to-userspace reads zero-copy.

The main use case is content-routed streaming I/O: many independent connections, each with bytes arriving in arbitrary fragments, where routing decisions depend on the stream content. In that setting the read buffer is protocol state, not temporary scratch memory.

Use plain read(&mut [u8]) when bytes are consumed immediately in the same function and then discarded. Use bytes-handoff when the read buffer itself is part of the protocol state: partial frames must survive across reads, complete prefixes need owned handoff, or a parser may switch into a raw tunnel without losing already-read bytes.
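To make "the read buffer is protocol state" concrete, here is a minimal std-only sketch of a parser that reads one routing line and then switches to a raw tunnel without dropping bytes that arrived past the newline. It does not use the crate's API: RoutedStream, Step, and feed are hypothetical names chosen for illustration, and a plain Vec<u8> stands in for BytesMut.

```rust
/// Parser state that owns every byte read so far.
/// Hypothetical type for illustration; not part of bytes-handoff.
#[derive(Debug, Default)]
struct RoutedStream {
    buffered: Vec<u8>,
}

/// What the caller should do after feeding one fragment.
#[derive(Debug, PartialEq)]
enum Step {
    /// The routing line is still incomplete; keep reading.
    NeedMore,
    /// The routing line is complete; `tail` holds bytes already read past it.
    Tunnel { route: String, tail: Vec<u8> },
}

impl RoutedStream {
    /// Appends a fragment and checks whether the routing line is complete.
    fn feed(&mut self, fragment: &[u8]) -> Step {
        self.buffered.extend_from_slice(fragment);
        match self.buffered.iter().position(|b| *b == b'\n') {
            None => Step::NeedMore,
            Some(newline) => {
                // Everything after the newline belongs to the raw tunnel.
                let tail = self.buffered.split_off(newline + 1);
                let route = String::from_utf8_lossy(&self.buffered[..newline]).into_owned();
                self.buffered.clear();
                Step::Tunnel { route, tail }
            }
        }
    }
}
```

With a scratch buffer that is discarded after each read, the bytes returned in tail would have to be re-read or would be lost. Keeping them in owned state is the property the crate is built around.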

Examples

The repository includes small runnable examples:

  • line_protocol: incremental line parsing with a partial tail.
  • length_prefix: length-prefixed frames where the header arrives before the full payload.
  • content_routing: inspect buffered bytes, route complete safe prefixes, then switch to a raw tunnel while preserving already-read tail bytes.
  • write_handoff: submit owned bytes to an async writer and await completion.

Run one with:

cargo run --example content_routing
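As a rough illustration of the problem the length_prefix example addresses, the sketch below splits one frame out of a persistent buffer only once both header and payload have arrived. It is std-only and is not the code from the repository: the 4-byte big-endian header and the split_frame helper are assumptions made for this sketch.

```rust
/// Tries to split one frame with a 4-byte big-endian length header out of `buffer`.
/// Returns `None` and leaves `buffer` untouched while the frame is incomplete.
/// The header layout is an assumption for this sketch, not the crate's wire format.
fn split_frame(buffer: &mut Vec<u8>) -> Option<Vec<u8>> {
    const HEADER_LEN: usize = 4;
    if buffer.len() < HEADER_LEN {
        return None; // The header itself is still arriving.
    }
    let header = [buffer[0], buffer[1], buffer[2], buffer[3]];
    let payload_len = u32::from_be_bytes(header) as usize;
    let frame_len = HEADER_LEN.checked_add(payload_len)?;
    if buffer.len() < frame_len {
        return None; // The header is present, the payload is still arriving.
    }
    // Keep the unconsumed tail in `buffer` and take ownership of the frame.
    let rest = buffer.split_off(frame_len);
    let mut frame = std::mem::replace(buffer, rest);
    let payload = frame.split_off(HEADER_LEN);
    Some(payload)
}
```

A caller would append each read to the same buffer and call split_frame in a loop until it returns None. A production parser would also cap payload_len before buffering.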

Read Handoff

use bytes::Bytes;
use bytes_handoff::HandoffBuffer;

/// Reads available bytes, splits one complete line, and hands it off.
async fn read_one_line<R>(reader: &mut R) -> Result<(), Box<dyn std::error::Error>>
where
    R: tokio::io::AsyncRead + Unpin,
{
    let mut buffer = HandoffBuffer::new(64 * 1024);

    buffer.read_available(reader).await?;

    if let Some(newline) = buffer.peek().iter().position(|b| *b == b'\n') {
        let line = buffer.split_prefix(newline + 1)?;
        send_to_worker(line);
    }

    Ok(())
}

/// Sends an owned byte slice to sync or async protocol work.
fn send_to_worker(_: Bytes) {}

Write Handoff

use bytes::Bytes;
use bytes_handoff::{WriteHandoff, WriteHandoffConfig};

/// Submits owned bytes to an async writer without blocking the producer.
fn submit_owned_write<W>(writer: W) -> Result<(), Box<dyn std::error::Error>>
where
    W: tokio::io::AsyncWrite + Unpin + Send + 'static,
{
    let handoff = WriteHandoff::spawn(writer, WriteHandoffConfig::new(1024, 8 * 1024 * 1024));

    handoff.try_write_fire_and_forget(Bytes::from_static(b"owned bytes"))?;

    // Use `try_write` or `write` instead when the producer needs a completion
    // ticket for this chunk.
    Ok(())
}

Benchmarks

The repository includes Criterion benchmarks for the main operations this crate adds around async I/O:

  • incremental reads into persistent state
  • complete-frame splitting into owned Bytes
  • preserving an already-read tail when parser mode changes
  • bounded owned write submission and completion tracking
  • byte-budget backpressure

Run them with:

cargo bench

For a defensible local baseline:

./bench/run-criterion-baseline.sh

The baseline script uses --sample-size 100 --measurement-time 5 by default. On Linux, set TASKSET_CORES=2-5 to pin the run with taskset and PERF_STAT=1 to collect cycles, instructions, cache misses, context switches, and CPU migrations alongside Criterion. For release-grade numbers, run on the same kernel and CPU family as deployment, with the CPU governor and turbo policy fixed outside the script.

For an end-to-end stream harness rather than a Criterion microbenchmark:

./bench/run-stream-harness.sh --scenario fragmented --runs 5
./bench/run-stream-harness.sh --scenario coalesced --runs 5
./bench/run-stream-harness.sh --transport tcp --scenario fragmented --runs 5
./bench/run-stream-matrix.sh
./bench/run-stream-harness.sh --scenario fragmented --completion fire_and_forget --runs 5

The harness drives complete client/proxy/sink streams through fragmented input, content-routed prefixes, tunnel handoff, and WriteHandoff output. It writes machine-readable run artifacts under bench/results/, including throughput, latency percentiles, context switches, CPU cost, and peak RSS. Use --transport tcp for a localhost TCP service/sink harness; see bench/README.md.

Latest TCP Harness Results

These are end-to-end localhost TCP harness results from an Ubuntu 24.04 server (adam), not Criterion microbenchmarks. The driver/client and sink processes were pinned to a different CPU set than the proxy service so the service-side CPU cost can be read separately from load generation.

Run shape:

  • transport: localhost TCP
  • worker threads: 16
  • concurrent connections: 128
  • route frames per connection: 64
  • route frame payload: 63 bytes
  • tunnel payload per connection: 1 MiB
  • read reserve: 16 KiB
  • handoff write pending budget: 32 KiB default (2 * read_reserve)
  • completion mode: fire-and-forget
  • runs: 2 per point, 10 second target duration

The implementations are:

  • handoff: HandoffBuffer plus WriteHandoff; this is the crate path.
  • manual_vec: a hand-written persistent Vec<u8> parser with direct writes; this is the practical read(&mut [u8]) comparison.
  • raw_copy: unparsed async copy; this is a lower bound on service cost because it does less protocol work than handoff.

Coalesced input, where client writes arrive in 16 KiB chunks:

| implementation | service throughput | service CPU | service cost | driver p99 latency |
|---|---|---|---|---|
| handoff | 2860 MiB/s | 6.27 cores | 2.09 ns/B | 59.8 ms |
| manual_vec | 3098 MiB/s | 5.09 cores | 1.57 ns/B | 48.9 ms |
| raw_copy | 3072 MiB/s | 4.57 cores | 1.42 ns/B | 56.2 ms |

Fragmented input, where client writes arrive in 64-byte chunks:

| implementation | service throughput | service CPU | service cost | driver p99 latency |
|---|---|---|---|---|
| handoff | 103.1 MiB/s | 7.58 cores | 70.14 ns/B | 1195 ms |
| manual_vec | 103.7 MiB/s | 6.69 cores | 61.64 ns/B | 1138 ms |
| raw_copy | 104.5 MiB/s | 6.64 cores | 60.72 ns/B | 1138 ms |

Interpretation:

  • In the coalesced TCP workload, handoff is about 8% behind the practical manual Vec<u8> baseline on throughput and uses about 23% more proxy-service CPU. That is the current cost of the safer buffer lifecycle, owned prefix handoff, bounded write queue, and mode-switch tail preservation.
  • In the fragmented 64-byte workload, throughput is dominated by tiny socket operations. All implementations land between 103 and 105 MiB/s; what differs is service CPU cost (70.14 ns/B for handoff versus 61.64 ns/B for manual_vec).
  • raw_copy is useful as a lower bound, but it is not a semantic replacement: it does not parse route frames, preserve parser state, or hand off owned prefixes.
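The percentages and the service cost column can be cross-checked with a little arithmetic. The README does not state how service cost is derived; the sketch below assumes it is CPU cores divided by throughput (core-nanoseconds per byte), which reproduces the coalesced rows to the printed precision. The function names are hypothetical.

```rust
const BYTES_PER_MIB: f64 = 1024.0 * 1024.0;

/// CPU cost per byte in nanoseconds, assuming cost = cores / throughput.
/// One core running for one second contributes 1e9 core-nanoseconds.
fn service_cost_ns_per_byte(cpu_cores: f64, throughput_mib_per_s: f64) -> f64 {
    cpu_cores * 1e9 / (throughput_mib_per_s * BYTES_PER_MIB)
}

/// Relative difference of `value` against `baseline`, as a percentage.
/// Negative means `value` is below the baseline.
fn percent_vs_baseline(value: f64, baseline: f64) -> f64 {
    (value / baseline - 1.0) * 100.0
}
```

For the coalesced rows, service_cost_ns_per_byte(6.27, 2860.0) comes out near 2.09 ns/B, percent_vs_baseline(2860.0, 3098.0) near -7.7 (the "about 8% behind"), and percent_vs_baseline(6.27, 5.09) near 23.2 (the "about 23% more" CPU).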

What The Read Benchmarks Measure

The read benchmarks are split by workload so the output does not imply one single headline throughput number.

| benchmark family | what it measures | when it should win |
|---|---|---|
| read_raw_discard_lower_bound | Raw read(&mut [u8]) into temporary scratch, then count and discard bytes. No parsing, persistent state, tail preservation, or owned handoff. | Immediate local consumption where bytes do not outlive the read call. Treat this as a lower bound, not a peer comparison. |
| read_owned_lines/manual_vec_copy | Manual read(&mut [u8]) loop that appends to persistent state, preserves partial lines, finds complete frames, and copies each frame into a new Vec<u8>. | Code that needs owned frames but does not use BytesMut/Bytes splitting. |
| read_owned_lines/bytesmut_split | Direct BytesMut implementation: read into persistent mutable state and split complete frames into owned Bytes. | The closest baseline for wrapper overhead. |
| read_owned_lines/handoff_buffer | HandoffBuffer: same owned-frame workload, with max-length enforcement and the crate API around the buffer lifecycle. | Content-routed streams where buffering rules should live behind a small API. |
| read_handoff_fragmentation_sweep | The same HandoffBuffer line workload at different read reserve sizes. | Understanding sensitivity to tiny, fragmented reads versus coalesced reads. |
| split_freeze_prefixes | Cost of repeatedly splitting owned Bytes prefixes out of buffered state. | Many complete frames already buffered. |
| split_prefix_mut | Cost of splitting BytesMut prefixes without freezing them into Bytes. | Hot paths that keep mutable owned frames. |
| take_tail_mode_switch | Cost of preserving already-read bytes when switching parser modes, such as parsed routing to raw tunnel. | Protocols that inspect first, then tunnel or hand off the remaining stream. |

The raw discard lower bound should be much faster than the framed workloads. That does not mean HandoffBuffer is slow at the job it is meant to do; it means the raw benchmark does less work. For wrapper overhead, compare read_owned_lines/bytesmut_split with read_owned_lines/handoff_buffer.
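For orientation, the kind of work the manual_vec_copy family describes can be sketched as follows: append each read to persistent state, copy every complete line into its own Vec<u8>, and keep the partial tail for the next read. This is an illustrative std-only reconstruction, not the benchmark's actual source; LineState and feed are hypothetical names.

```rust
/// Persistent line-parser state.
/// Illustrative only; this is neither HandoffBuffer nor the benchmark code.
struct LineState {
    pending: Vec<u8>,
}

impl LineState {
    fn new() -> Self {
        LineState { pending: Vec::new() }
    }

    /// Appends one read's worth of bytes and returns every complete line,
    /// newline included, as a freshly copied owned buffer.
    fn feed(&mut self, fragment: &[u8]) -> Vec<Vec<u8>> {
        self.pending.extend_from_slice(fragment);
        let mut lines = Vec::new();
        let mut start = 0;
        while let Some(offset) = self.pending[start..].iter().position(|b| *b == b'\n') {
            let end = start + offset + 1;
            // This per-frame copy is what splitting into Bytes is meant to avoid.
            lines.push(self.pending[start..end].to_vec());
            start = end;
        }
        // Keep only the unconsumed partial line for the next read.
        let tail = self.pending.split_off(start);
        self.pending = tail;
        lines
    }
}
```

The raw discard benchmark does none of this bookkeeping, which is why it is not a peer comparison.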

What The Write Benchmarks Measure

| benchmark family | what it measures |
|---|---|
| write_large_chunks/direct_write_all | Direct AsyncWriteExt::write_all of large chunks into an in-memory duplex stream. |
| write_large_chunks/handoff_ticket_single_task | Submit the same chunks as owned Bytes through WriteHandoff, then await completion tickets. |
| write_large_chunks/handoff_fire_and_forget_single_task | Submit owned Bytes without allocating per-write completion tickets. |
| write_many_tasks/ticket | Many Tokio tasks submit owned Bytes to one handoff and await completion tickets. This is task fan-in, not a cross-thread producer benchmark. |
| write_many_tasks/fire_and_forget | The same task fan-in workload without per-write completion tickets. |
| write_byte_budget_backpressure | Fast rejection when a write exceeds the configured pending-byte budget. |

The write benchmarks measure owned Bytes submission into one async writer, batched drain behavior, optional completion notification, and backpressure. They are not raw socket throughput benchmarks.
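The backpressure rule (bounding queued writes by both item count and byte count) can be sketched with a plain synchronous queue. This is a minimal illustration under stated assumptions, not WriteHandoff itself: BoundedQueue, Rejected, try_push, and pop_written are hypothetical names, and the real crate additionally drives an async writer and optional completion tickets.

```rust
use std::collections::VecDeque;

/// Why a submission was refused. Hypothetical type for this sketch.
#[derive(Debug, PartialEq)]
enum Rejected {
    TooManyItems,
    TooManyBytes,
}

/// Queue of owned chunks bounded by item count and total pending bytes.
struct BoundedQueue {
    items: VecDeque<Vec<u8>>,
    pending_bytes: usize,
    max_items: usize,
    max_bytes: usize,
}

impl BoundedQueue {
    fn new(max_items: usize, max_bytes: usize) -> Self {
        BoundedQueue { items: VecDeque::new(), pending_bytes: 0, max_items, max_bytes }
    }

    /// Accepts the chunk only if both budgets still hold; never blocks.
    fn try_push(&mut self, chunk: Vec<u8>) -> Result<(), Rejected> {
        if self.items.len() >= self.max_items {
            return Err(Rejected::TooManyItems);
        }
        if self.pending_bytes.saturating_add(chunk.len()) > self.max_bytes {
            return Err(Rejected::TooManyBytes);
        }
        self.pending_bytes += chunk.len();
        self.items.push_back(chunk);
        Ok(())
    }

    /// Called by the writer side after a chunk has been written; frees budget.
    fn pop_written(&mut self) -> Option<Vec<u8>> {
        let chunk = self.items.pop_front()?;
        self.pending_bytes -= chunk.len();
        Some(chunk)
    }
}
```

Rejecting at submission time keeps the producer nonblocking and makes the memory ceiling explicit: at most max_bytes of payload can be queued regardless of how slowly the writer drains.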

Treat all benchmark numbers as directional, not universal. The Criterion benchmarks use in-memory readers and writers; real sockets add scheduler, kernel, and network effects, and the localhost TCP harness captures only some of them. The benchmarks exist to make the tradeoff explicit: raw slice reads are best for immediate consumption, while bytes-handoff targets safe mutable buffering, prefix ownership, tail preservation, and bounded async write handoff.