# Performance
Benchmarks are loopback (localhost), Apple M-series CPU, `cargo bench --release`.
Run them yourself: `cargo bench --bench http1_compare` / `http2_compare` / `grpc_compare` / `protocol_compare` / `h3_compare --features h3`.
Numbers are the median of three consecutive runs. HTTP/1.1 and HTTP/2 runs show
±15–25 % run-to-run variance on a lightly loaded laptop, hence the approximate figures below.
## HTTP/1.1 Keep-Alive
Sequential 2,000 × GET, keep-alive, small response body.
Benchmark: `benches/http1_compare.rs`
| Client | Throughput (req/s) | Latency / request |
|---|---|---|
| ugi | ~40,000 | ~25 µs |
| reqwest | ~34,000 | ~29 µs |
**ugi is ~18 % faster than reqwest** on sequential HTTP/1.1 keep-alive.
ugi's HTTP/1.1 path uses a hand-rolled parser that avoids heap allocation for
common small responses. reqwest delegates to `hyper` which allocates per-request
header storage. The gap is consistent across runs (ugi-to-reqwest latency ratio 0.67–0.87).
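The allocation-free approach can be illustrated with a small std-only sketch — a hypothetical parser (not ugi's actual code) that returns borrowed subslices of the input buffer instead of allocating owned strings:

```rust
/// Borrowed view of an HTTP/1.1 status line: every field is a subslice
/// of the input buffer, so parsing allocates nothing on the heap.
/// Hypothetical sketch, not ugi's actual parser.
#[derive(Debug)]
struct StatusLine<'a> {
    version: &'a str,
    status: u16,
    reason: &'a str,
}

/// Parse e.g. `b"HTTP/1.1 200 OK\r\n..."` without heap allocation.
fn parse_status_line(buf: &[u8]) -> Option<StatusLine<'_>> {
    // Take everything up to the first CR.
    let line = buf.split(|&b| b == b'\r').next()?;
    let line = std::str::from_utf8(line).ok()?;
    let mut parts = line.splitn(3, ' ');
    let version = parts.next()?;
    let status: u16 = parts.next()?.parse().ok()?;
    let reason = parts.next().unwrap_or("");
    Some(StatusLine { version, status, reason })
}
```

Header lines can be handled the same way, as `(&str, &[u8])` pairs borrowing from the read buffer; owned allocation is only needed when a value must outlive the buffer.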
## HTTP/2 Keep-Alive
Sequential 50 × GET, h2 over TLS (self-signed), small response body.
Benchmark: `benches/http2_compare.rs`
| Client | Throughput (req/s) | Latency / request |
|---|---|---|
| ugi | ~12,000 | ~85 µs |
| reqwest | ~10,000 | ~103 µs |
**ugi is ~20 % faster than reqwest** on sequential HTTP/2 keep-alive.
Both clients speak the same HTTP/2 wire protocol (HPACK + flow control). The
difference comes from runtime scheduling: ugi runs on `async-io`'s reactor in a
`block_on` context; reqwest/hyper use tokio's multi-threaded runtime, which carries
thread-pool coordination overhead for sequential workloads.
## gRPC Unary (JSON)
200 × unary `POST /bench.Svc/Echo`, JSON codec (`application/grpc+json`),
h2c prior-knowledge (cleartext). Competitor: raw `h2` crate client with no
middleware, builder, or JSON encode/decode — measures the bare transport floor.
Benchmark: `benches/grpc_compare.rs`
| Client | Throughput (req/s) | Latency / request |
|---|---|---|
| ugi | ~17,000 | ~60 µs |
| raw h2 | ~21,000 | ~47 µs |
**ugi adds ~1.3× overhead over the raw h2 transport layer.**
The gap is attributable to:
- JSON encode/decode via `serde_json` per call (~5 µs round-trip on typical payloads)
- Builder pattern allocation + header construction per request
- `block_on` → `block_on` scheduling across the async executor boundary
For production gRPC workloads, switching from the JSON codec to protobuf removes
most of the encode/decode cost, since protobuf serialization is considerably faster
than JSON. The builder and scheduling overhead (~12 µs) remains regardless of codec.
## HTTP/3 (QUIC) Unary GET
100 × GET over QUIC/HTTP3, loopback, self-signed certificate
(`danger_accept_invalid_certs`). Competitor: raw `quiche` client with no
middleware, builder, or response parsing — measures the bare QUIC+H3 transport
floor.
Benchmark: `benches/h3_compare.rs` (requires `--features h3`)
| Client | Throughput (req/s) | Latency / request |
|---|---|---|
| ugi | ~13,000 | ~77 µs |
| raw quiche | ~17,000 | ~60 µs |
**ugi adds ~1.3× overhead over the raw quiche transport layer.**
The gap is attributable to:
- Builder pattern allocation + header construction per request
- Response parsing (status, headers, body buffering)
- The `async-io` reactor handoff between the h3 task thread and the calling executor
Run with: `cargo bench --bench h3_compare --features h3`
Note: first build compiles quiche with vendored BoringSSL (~60 s); subsequent builds are cached.
## WebSocket Echo Round-Trip
Benchmark: `benches/protocol_compare.rs`
| Client | Throughput (req/s) | Latency / round trip |
|---|---|---|
| ugi | ~21,000 | ~47 µs |
| tokio-tungstenite | ~55,000 | ~18 µs |
**ugi is ~2.6× slower than tokio-tungstenite** for WebSocket echo.
### Root cause
tokio-tungstenite is tightly integrated with the tokio I/O reactor — it uses tokio's
`AsyncRead`/`AsyncWrite` adapters and the tokio thread pool. ugi's WebSocket
implementation uses `async-io` (smol ecosystem) which runs a separate OS-thread-based
poller and hands ownership of file descriptors to its own reactor. When measuring on
the same tokio Runtime, the cross-reactor handoff adds wakeup latency per round trip.
Optimizations already applied:
- **TCP_NODELAY** on all connections (eliminates Nagle's algorithm for small frames)
- **Masking loop**: XOR 4 bytes at a time instead of byte-by-byte
- **Frame flush**: every frame is flushed to the kernel immediately after `write_all`
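The word-at-a-time masking optimization above can be sketched in std-only Rust (a hypothetical illustration, not ugi's actual implementation):

```rust
/// XOR the payload with the RFC 6455 4-byte masking key, one 32-bit word
/// at a time, with a scalar loop for the trailing 0-3 bytes. Byte-wise
/// XOR is endianness-agnostic, so native-endian loads/stores are safe.
fn mask_payload(payload: &mut [u8], key: [u8; 4]) {
    let k = u32::from_ne_bytes(key);
    let mut chunks = payload.chunks_exact_mut(4);
    for chunk in &mut chunks {
        let word = u32::from_ne_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) ^ k;
        chunk.copy_from_slice(&word.to_ne_bytes());
    }
    // The remainder starts at a 4-byte boundary, so the key index is
    // simply the position within the tail.
    for (i, b) in chunks.into_remainder().iter_mut().enumerate() {
        *b ^= key[i];
    }
}
```

Because masking is an involution, applying the same key twice restores the original payload — a handy property for testing.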
Known remaining opportunities:
- Move WebSocket I/O to tokio's native socket types to eliminate reactor cross-talk
- Replace the post-parse `drain` with a ring buffer to avoid the O(n) buffer shift after each frame parse
- Use `writev`/`sendmsg` to send header + payload in a single syscall (currently two)
The gap is not a correctness issue — all frames are valid RFC 6455 frames. It is an
architectural concern: production WebSocket workloads that care deeply about latency
should benchmark against their specific server before selecting a client library.
## Methodology
- All benchmarks include warmup iterations (30–200 depending on scenario).
- Loopback eliminates network jitter; numbers represent protocol + runtime overhead only.
- Both ugi and the competitor connect to the same local test server.
- CPU: Apple M-series (ARM64). Results on x86_64 Linux may differ.
- Run each benchmark 3 × and take the median if reproducibility matters.
- "raw h2" in the gRPC scenario is the `h2 = "0.4"` crate used directly — the same
library that `hyper` and (transitively) reqwest and tonic sit on top of.
## Implementation Notes
### TCP_NODELAY
Every connection sets `TCP_NODELAY = true` immediately after the TCP handshake,
disabling Nagle's algorithm. This is critical for request/response protocols where
the client sends small messages and waits for a reply — without it, the kernel may
buffer the outgoing packet for up to 40 ms.
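A minimal std-only sketch of this pattern (hypothetical helper name, not ugi's API):

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};

/// Connect and enable TCP_NODELAY before the first byte is written, so a
/// small request is pushed to the wire immediately instead of sitting in
/// the kernel send buffer waiting on delayed-ACK/Nagle interaction.
fn connect_nodelay(addr: SocketAddr) -> io::Result<TcpStream> {
    let stream = TcpStream::connect(addr)?;
    stream.set_nodelay(true)?;
    Ok(stream)
}
```

Setting the option before any write matters: once the first small packet is queued behind Nagle, the latency penalty has already been paid.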
### Happy Eyeballs (RFC 6555)
When a hostname resolves to both IPv6 and IPv4 addresses, ugi races them with a 250 ms
IPv6 head-start. The first *successful* connection wins; if IPv6 fails instantly
(ECONNREFUSED, unreachable), the IPv4 fallback is used without waiting. This avoids
the multi-second delay that naive "try IPv6, if it fails use IPv4" causes on networks
where IPv6 is broken.
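A much-simplified sketch of the racing idea using std threads (hypothetical code; note that full implementations also cancel the head-start delay when the primary attempt fails instantly, which this sketch does not):

```rust
use std::net::{SocketAddr, TcpStream};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Race two connection attempts, giving `primary` (IPv6) a head start
/// over `fallback` (IPv4). Returns the first attempt that succeeds, or
/// None if both fail.
fn happy_eyeballs(
    primary: SocketAddr,
    fallback: SocketAddr,
    head_start: Duration,
) -> Option<TcpStream> {
    let (tx, rx) = mpsc::channel();
    let tx2 = tx.clone();
    thread::spawn(move || {
        let _ = tx.send(TcpStream::connect(primary).ok());
    });
    thread::spawn(move || {
        thread::sleep(head_start);
        let _ = tx2.send(TcpStream::connect(fallback).ok());
    });
    // Accept the first successful result; tolerate one failed attempt.
    for _ in 0..2 {
        match rx.recv() {
            Ok(Some(stream)) => return Some(stream),
            Ok(None) => continue,
            Err(_) => break,
        }
    }
    None
}
```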
### TLS Session Resumption
rustls' default `ClientConfig` includes an in-memory session cache, so TLS 1.3
session-ticket resumption is active without any explicit configuration (0-RTT
early data additionally requires opting in on the client config).
### Connection Pooling
ugi maintains per-host connection pools for HTTP/1.1 and HTTP/2 separately. Idle
connections are reclaimed after a configurable timeout. HTTP/2 connections are shared
across concurrent requests up to the server's `MAX_CONCURRENT_STREAMS` limit.
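The idle-reclamation logic can be sketched as a toy per-host pool (hypothetical types; ugi's actual pool additionally keeps HTTP/1.1 and HTTP/2 connections separate):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Toy per-host idle pool, generic over the connection type.
struct Pool<C> {
    idle: HashMap<String, Vec<(C, Instant)>>,
    idle_timeout: Duration,
}

impl<C> Pool<C> {
    fn new(idle_timeout: Duration) -> Self {
        Pool { idle: HashMap::new(), idle_timeout }
    }

    /// Return a connection to the pool, stamping its idle-since time.
    fn put(&mut self, host: &str, conn: C) {
        self.idle
            .entry(host.to_string())
            .or_default()
            .push((conn, Instant::now()));
    }

    /// Check out a connection for `host`, first dropping any that have
    /// been idle longer than the timeout.
    fn get(&mut self, host: &str) -> Option<C> {
        let conns = self.idle.get_mut(host)?;
        conns.retain(|(_, since)| since.elapsed() < self.idle_timeout);
        conns.pop().map(|(c, _)| c)
    }
}
```

Popping the most recently parked connection (LIFO) keeps hot connections in use and lets cold ones age out toward the timeout.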