# Performance
Benchmarks are loopback (localhost), Apple M-series CPU, `cargo bench --release`.
Run them yourself: `cargo bench --bench http1_compare` / `http2_compare` / `grpc_compare` / `protocol_compare` / `h3_compare --features h3`.
Numbers are the median of three consecutive runs. HTTP/1.1 and HTTP/2 runs show
±15–25 % run-to-run variance on a lightly loaded laptop, hence the approximate figures below.
## HTTP/1.1 Keep-Alive
Sequential 2,000 × GET, keep-alive, small response body.
Benchmark: `benches/http1_compare.rs`
| Client | Throughput (req/s) | Latency / request |
|---|---|---|
| ugi | ~40,000 | ~25 µs |
| reqwest | ~34,000 | ~29 µs |
**ugi is ~18 % faster than reqwest** on sequential HTTP/1.1 keep-alive.
ugi's HTTP/1.1 path uses a hand-rolled parser that avoids heap allocation for
common small responses. reqwest delegates to `hyper` which allocates per-request
header storage. The gap is consistent across runs (ugi-to-reqwest latency ratio 0.67–0.87).
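The allocation-free approach can be illustrated with a small std-only sketch — a hypothetical parser (not ugi's actual code) that returns borrowed subslices of the input buffer instead of allocating owned strings:

```rust
/// Borrowed view of an HTTP/1.1 status line: every field is a subslice
/// of the input buffer, so parsing allocates nothing on the heap.
/// Hypothetical sketch, not ugi's actual parser.
#[derive(Debug)]
struct StatusLine<'a> {
    version: &'a str,
    status: u16,
    reason: &'a str,
}

/// Parse e.g. `b"HTTP/1.1 200 OK\r\n..."` without heap allocation.
fn parse_status_line(buf: &[u8]) -> Option<StatusLine<'_>> {
    // Take everything up to the first CR.
    let line = buf.split(|&b| b == b'\r').next()?;
    let line = std::str::from_utf8(line).ok()?;
    let mut parts = line.splitn(3, ' ');
    let version = parts.next()?;
    let status: u16 = parts.next()?.parse().ok()?;
    let reason = parts.next().unwrap_or("");
    Some(StatusLine { version, status, reason })
}
```

Header lines can be handled the same way, as `(&str, &[u8])` pairs borrowing from the read buffer; owned allocation is only needed when a value must outlive the buffer.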
## HTTP/2 Keep-Alive
Sequential 50 × GET, h2 over TLS (self-signed), small response body.
Benchmark: `benches/http2_compare.rs`
| Client | Throughput (req/s) | Latency / request |
|---|---|---|
| ugi | ~12,000 | ~85 µs |
| reqwest | ~10,000 | ~103 µs |
**ugi is ~20 % faster than reqwest** on sequential HTTP/2 keep-alive.
Both clients speak the same HTTP/2 wire protocol (HPACK + flow control). The
difference comes from runtime scheduling: ugi runs on `async-io`'s reactor in a
`block_on` context; reqwest/hyper use tokio's multi-threaded runtime, which carries
thread-pool coordination overhead for sequential workloads.
## gRPC Unary (JSON)
200 × unary `POST /bench.Svc/Echo`, JSON codec (`application/grpc+json`),
h2c prior-knowledge (cleartext). Competitor: raw `h2` crate client with no
middleware, builder, or JSON encode/decode — measures the bare transport floor.
Benchmark: `benches/grpc_compare.rs`
| Client | Throughput (req/s) | Latency / request |
|---|---|---|
| ugi | ~17,000 | ~60 µs |
| raw h2 | ~21,000 | ~47 µs |
**ugi adds ~1.3× overhead over the raw h2 transport layer.**
The gap is attributable to:
- JSON encode/decode via `serde_json` per call (~5 µs round-trip on typical payloads)
- Builder pattern allocation + header construction per request
- `block_on` → `block_on` scheduling across the async executor boundary
For production gRPC workloads, switching from the JSON codec to protobuf removes
most of the encode/decode cost, since protobuf serialization is considerably faster
than JSON. The builder and scheduling overhead (~12 µs) remains regardless of codec.
## HTTP/3 (QUIC) Unary GET
100 × GET over QUIC/HTTP3, loopback, self-signed certificate
(`danger_accept_invalid_certs`). Competitor: raw `quiche` client with no
middleware, builder, or response parsing — measures the bare QUIC+H3 transport
floor.
Benchmark: `benches/h3_compare.rs` (requires `--features h3`)
| Client | Throughput (req/s) | Latency / request |
|---|---|---|
| ugi | ~13,000 | ~77 µs |
| raw quiche | ~17,000 | ~60 µs |
**ugi adds ~1.3× overhead over the raw quiche transport layer.**
The gap is attributable to:
- Builder pattern allocation + header construction per request
- Response parsing (status, headers, body buffering)
- The `async-io` reactor handoff between the h3 task thread and the calling executor
Run with: `cargo bench --bench h3_compare --features h3`
Note: first build compiles quiche with vendored BoringSSL (~60 s); subsequent builds are cached.
## WebSocket Echo Round-Trip
Benchmark: `benches/protocol_compare.rs`
| Client | Throughput (req/s) | Latency / round trip |
|---|---|---|
| ugi | ~21,000 | ~47 µs |
| tokio-tungstenite | ~55,000 | ~18 µs |
**ugi is ~2.6× slower than tokio-tungstenite** for WebSocket echo.
### Root cause
tokio-tungstenite is tightly integrated with the tokio I/O reactor — it uses tokio's
`AsyncRead`/`AsyncWrite` adapters and the tokio thread pool. ugi's WebSocket
implementation uses `async-io` (smol ecosystem) which runs a separate OS-thread-based
poller and hands ownership of file descriptors to its own reactor. When measuring on
the same tokio Runtime, the cross-reactor handoff adds wakeup latency per round trip.
Optimizations already applied:
- **TCP_NODELAY** on all connections (eliminates Nagle's algorithm for small frames)
- **Masking loop**: XOR 4 bytes at a time instead of byte-by-byte
- **Frame flush**: every frame is flushed to the kernel immediately after `write_all`
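The word-at-a-time masking optimization above can be sketched in std-only Rust (a hypothetical illustration, not ugi's actual implementation):

```rust
/// XOR the payload with the RFC 6455 4-byte masking key, one 32-bit word
/// at a time, with a scalar loop for the trailing 0-3 bytes. Byte-wise
/// XOR is endianness-agnostic, so native-endian loads/stores are safe.
fn mask_payload(payload: &mut [u8], key: [u8; 4]) {
    let k = u32::from_ne_bytes(key);
    let mut chunks = payload.chunks_exact_mut(4);
    for chunk in &mut chunks {
        let word = u32::from_ne_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) ^ k;
        chunk.copy_from_slice(&word.to_ne_bytes());
    }
    // The remainder starts at a 4-byte boundary, so the key index is
    // simply the position within the tail.
    for (i, b) in chunks.into_remainder().iter_mut().enumerate() {
        *b ^= key[i];
    }
}
```

Because masking is an involution, applying the same key twice restores the original payload — a handy property for testing.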
Known remaining opportunities:
- Move WebSocket I/O to tokio's native socket types to eliminate reactor cross-talk
- Replace the post-parse `drain` with a ring buffer to avoid the O(n) buffer shift after each frame parse
- Use `writev`/`sendmsg` to send header + payload in a single syscall (currently two)
The gap is not a correctness issue — all frames are valid RFC 6455 frames. It is an
architectural concern: production WebSocket workloads that care deeply about latency
should benchmark against their specific server before selecting a client library.
## Methodology
- All benchmarks include warmup iterations (30–200 depending on scenario).
- Loopback eliminates network jitter; numbers represent protocol + runtime overhead only.
- Both ugi and the competitor connect to the same local test server.
- CPU: Apple M-series (ARM64). Results on x86_64 Linux may differ.
- Run each benchmark 3 × and take the median if reproducibility matters.
- "raw h2" in the gRPC scenario is the `h2 = "0.4"` crate used directly — the same
library that `hyper` and (transitively) reqwest and tonic sit on top of.
## Implementation Notes
### TCP_NODELAY
Every connection sets `TCP_NODELAY = true` immediately after the TCP handshake,
disabling Nagle's algorithm. This is critical for request/response protocols where
the client sends small messages and waits for a reply — without it, the kernel may
buffer the outgoing packet for up to 40 ms.
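A minimal std-only sketch of this pattern (hypothetical helper name, not ugi's API):

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};

/// Connect and enable TCP_NODELAY before the first byte is written, so a
/// small request is pushed to the wire immediately instead of sitting in
/// the kernel send buffer waiting on delayed-ACK/Nagle interaction.
fn connect_nodelay(addr: SocketAddr) -> io::Result<TcpStream> {
    let stream = TcpStream::connect(addr)?;
    stream.set_nodelay(true)?;
    Ok(stream)
}
```

Setting the option before any write matters: once the first small packet is queued behind Nagle, the latency penalty has already been paid.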
### Happy Eyeballs (RFC 6555)
When a hostname resolves to both IPv6 and IPv4 addresses, ugi races them with a 250 ms
IPv6 head-start. The first *successful* connection wins; if IPv6 fails instantly
(ECONNREFUSED, unreachable), the IPv4 fallback is used without waiting. This avoids
the multi-second delay that naive "try IPv6, if it fails use IPv4" causes on networks
where IPv6 is broken.
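A much-simplified sketch of the racing idea using std threads (hypothetical code; note that full implementations also cancel the head-start delay when the primary attempt fails instantly, which this sketch does not):

```rust
use std::net::{SocketAddr, TcpStream};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Race two connection attempts, giving `primary` (IPv6) a head start
/// over `fallback` (IPv4). Returns the first attempt that succeeds, or
/// None if both fail.
fn happy_eyeballs(
    primary: SocketAddr,
    fallback: SocketAddr,
    head_start: Duration,
) -> Option<TcpStream> {
    let (tx, rx) = mpsc::channel();
    let tx2 = tx.clone();
    thread::spawn(move || {
        let _ = tx.send(TcpStream::connect(primary).ok());
    });
    thread::spawn(move || {
        thread::sleep(head_start);
        let _ = tx2.send(TcpStream::connect(fallback).ok());
    });
    // Accept the first successful result; tolerate one failed attempt.
    for _ in 0..2 {
        match rx.recv() {
            Ok(Some(stream)) => return Some(stream),
            Ok(None) => continue,
            Err(_) => break,
        }
    }
    None
}
```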
### TLS Session Resumption
rustls' default `ClientConfig` includes an in-memory session cache, so TLS 1.3
session-ticket resumption is active without any explicit configuration (0-RTT
early data additionally requires opting in on the client config).
### Connection Pooling
ugi maintains per-host connection pools for HTTP/1.1 and HTTP/2 separately. Idle
connections are reclaimed after a configurable timeout. HTTP/2 connections are shared
across concurrent requests up to the server's `MAX_CONCURRENT_STREAMS` limit.
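The idle-reclamation logic can be sketched as a toy per-host pool (hypothetical types; ugi's actual pool additionally keeps HTTP/1.1 and HTTP/2 connections separate):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Toy per-host idle pool, generic over the connection type.
struct Pool<C> {
    idle: HashMap<String, Vec<(C, Instant)>>,
    idle_timeout: Duration,
}

impl<C> Pool<C> {
    fn new(idle_timeout: Duration) -> Self {
        Pool { idle: HashMap::new(), idle_timeout }
    }

    /// Return a connection to the pool, stamping its idle-since time.
    fn put(&mut self, host: &str, conn: C) {
        self.idle
            .entry(host.to_string())
            .or_default()
            .push((conn, Instant::now()));
    }

    /// Check out a connection for `host`, first dropping any that have
    /// been idle longer than the timeout.
    fn get(&mut self, host: &str) -> Option<C> {
        let conns = self.idle.get_mut(host)?;
        conns.retain(|(_, since)| since.elapsed() < self.idle_timeout);
        conns.pop().map(|(c, _)| c)
    }
}
```

Popping the most recently parked connection (LIFO) keeps hot connections in use and lets cold ones age out toward the timeout.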