# Tonbo Benchmark Program
This document defines the broader benchmark program around Tonbo's current
compaction-focused harness. The goal is to evaluate Tonbo as a live analytical
engine on object storage, not only as a quiesced embedded analytical engine.
## Program Shape
Tonbo should take inspiration from the benchmark structure used by other open
source databases, not just their workload names.
- From object-store-native systems such as SlateDB: split microbenchmarks from
a long-running engine bencher, add mixed workloads, contention tests, system
resource monitoring, nightly trend runs, and explicit topology disclosure.
- From LSM-oriented systems such as RocksDB and LevelDB: use a named workload
taxonomy, force deterministic phase ordering, emit richer metrics, and
clearly explain regime changes such as fixed overhead, large values, sync
cost, and compaction debt.
- From server-oriented systems such as libSQL: add endpoint-facing and
freshness-facing benchmarks, not only storage-engine internals.
Tonbo should therefore have three benchmark layers:
- `micro`: narrow internal costs such as scan planning, stream init, merge
init, package init, manifest open, and WAL sync.
- `engine`: object-store-backed workload scenarios with mixed traffic,
compaction, GC, freshness, and durability.
- `surface`: query, open, freshness, or API benchmarks that look like a user
workload rather than only an engine test.
Every benchmark artifact should include:
- `topology`: runner region, bucket region, same VPC/AZ or not, public/private
path, cold/warm run, median RTT
- `live state`: logical bytes, physical bytes, visible SST count, obsolete SST
count, WAL bytes, manifest bytes
- `request economics`: GET/HEAD/range-GET/PUT counts, bytes/request, estimated
request cost
- `latency and tail`: mean, p50, p95, p99, p99.9
- `engine pressure`: compaction backlog, GC backlog, WAL queue depth,
freshness lag, durable-ack lag, CPU, RSS, network MB/s
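
The five groups above could be carried in one JSON record per run. The sketch below is an assumption about what such a record might look like, not an existing Tonbo schema: `BenchArtifact`, `REQUIRED_GROUPS`, and every field name are hypothetical and simply mirror the list above.

```python
import json
from dataclasses import asdict, dataclass, field

# Hypothetical group names mirroring the list above; not an existing Tonbo schema.
REQUIRED_GROUPS = (
    "topology",
    "live_state",
    "request_economics",
    "latency",
    "engine_pressure",
)


@dataclass
class BenchArtifact:
    scenario: str
    topology: dict = field(default_factory=dict)
    live_state: dict = field(default_factory=dict)
    request_economics: dict = field(default_factory=dict)
    latency: dict = field(default_factory=dict)
    engine_pressure: dict = field(default_factory=dict)

    def missing_groups(self) -> list:
        """Return the required groups that were left empty for this run."""
        return [g for g in REQUIRED_GROUPS if not getattr(self, g)]

    def to_json(self) -> str:
        """Serialize with sorted keys so artifacts diff cleanly across runs."""
        return json.dumps(asdict(self), sort_keys=True)
```

A harness could refuse to publish a run when `missing_groups()` is non-empty, which keeps partially instrumented results out of trend charts.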
## Instrumentation Track
The benchmark plan only works if regressions are diagnosable. Tonbo should add
instrumentation in parallel with benchmark work so a slow or unstable result
can be explained without ad hoc debugging.
User why:
- If a benchmark fails to explain where latency or cost comes from, it does not
build trust.
- If a mixed-load result regresses, we need to say whether the problem is WAL
durability, scan setup, object-store requests, compaction pressure, or GC
lag.
Tonbo why:
- The current benchmark already shows an object-store latency floor, but not
its source.
- Mixed workloads will be hard to reason about unless the engine and the
harness expose phase-level timing and queue pressure.
- Instrumentation lowers the cost of both benchmark development and future
product debugging.
### Instrumentation Principles
- Always record end-to-end latency, and add phase-level timers so the top-line
number is explainable.
End-to-end latency is the primary user-facing number and should lead the
report. Phase-level timers are what make that number actionable by showing
whether time is going to setup, object-store requests, merge work,
packaging, WAL sync, or manifest work.
- Record both logical work and physical work.
Logical work describes what the user asked Tonbo to do, such as rows returned
or live bytes scanned. Physical work describes what Tonbo and the storage
backend actually paid for, such as bytes read, requests issued, objects
created, or bytes rewritten by compaction.
- Keep metric names stable across scenarios so results are comparable.
If one benchmark reports `prepare_ms`, another reports `scan_setup_ms`, and a
third folds setup into total latency, the benchmark suite becomes hard to
compare and easy to misread. Stable names create one shared vocabulary across
all runs.
- Emit machine-readable artifacts first; derive charts and summaries from them
later.
Structured outputs such as JSON or TSV should be the source of truth.
Charts, markdown summaries, and dashboards should be generated from that raw
data so the same run can be reinterpreted later without rerunning the
benchmark.
- Every benchmark run should be traceable back to the manifest state, WAL mode,
compaction settings, and topology.
This is broader than just recording a random seed. Tonbo should capture
enough engine, storage, workload, and deployment context to explain whether a
result came from the code path, the storage layout, or the network path.
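
As an illustration of the first and third principles, the sketch below accumulates phase-level time under stable names and reports whatever part of the end-to-end number no phase explains. It is a minimal harness-side sketch, not Tonbo's instrumentation: `PhaseTimer` and the phase names used in the test (`prepare_ms`, `consume_ms`) are assumptions.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock milliseconds per named phase (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the timer can be tested deterministically.
        self._clock = clock
        self.phases_ms = {}

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            # Repeated phases add up, so per-iteration timers can share a name.
            self.phases_ms[name] = self.phases_ms.get(name, 0.0) + elapsed_ms

    def report(self, total_ms):
        """Return all phases plus the time that no phase accounted for."""
        out = dict(self.phases_ms)
        out["total_ms"] = total_ms
        out["unaccounted_ms"] = total_ms - sum(self.phases_ms.values())
        return out
```

A large `unaccounted_ms` is itself a signal: it means the top-line number is not yet explainable and another phase timer is missing.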
### Engine Modules To Instrument
#### Read path
Why:
- Users experience read latency as one number, but Tonbo needs to know whether
time is spent in snapshot resolution, planning, stream construction, remote
open, merge, or packaging.
Add:
- snapshot resolution time
- plan and prune time
- stream-open time per source
- merge init time
- package init time
- rows scanned vs rows returned
- bytes read per source
- object requests per scan
Use:
- explain fixed overhead vs data-size effects
- explain why object-store latency stays flat across runs
- isolate whether setup or consume dominates
#### WAL and commit path
Why:
- Users care about durable-ack latency and fresh-read visibility; Tonbo needs
to know where that path stalls.
Add:
- enqueue-to-durable latency
- queue depth over time
- bytes per WAL frame and per segment
- sync duration
- commit wait reason counters
- ack-to-visible lag
Use:
- explain stream-ingest results
- compare `strict` vs `fast`
- identify whether durability cost is batching, sync, or publication
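
Ack-to-visible lag can be derived offline from two timestamp maps, which keeps the measurement out of the hot path. The sketch below assumes the harness records, per write, the durable-ack time and the time of the first read that observed it; the function name and both input shapes are hypothetical.

```python
def ack_to_visible_lags(ack_time_s, first_seen_s):
    """Compute per-write lag between durable ack and first visible read.

    ack_time_s:   {write_id: time the durable ack was received}
    first_seen_s: {write_id: time a reader first observed the write}
    Returns (lags, never_visible). A negative lag means the write became
    visible before its ack arrived, which may be worth reporting separately.
    """
    lags = {}
    never_visible = []
    for write_id, acked_at in ack_time_s.items():
        seen_at = first_seen_s.get(write_id)
        if seen_at is None:
            never_visible.append(write_id)
        else:
            lags[write_id] = seen_at - acked_at
    return lags, sorted(never_visible)
```

Writes that were acknowledged but never observed are returned separately rather than dropped, since they would otherwise make the lag distribution look better than it is.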
#### Flush and minor compaction
Why:
- Fresh data benchmarks will be shaped heavily by sealing and minor compaction,
not just by reads.
Add:
- seal trigger reason
- time from seal to flush start
- flush duration
- output SST count and bytes
- rows in, rows out
- overlap or amplification indicators
Use:
- tune seal thresholds
- explain read freshness and SST explosion
- detect small-object amplification
#### Major compaction
Why:
- Users only care when compaction affects read tails or storage cost; Tonbo
needs to expose that mechanism directly.
Add:
- compaction job wait time
- execution time
- bytes read and bytes written
- input SST count and output SST count
- obsolete SST count produced
- WAL floor movement
- backlog depth over time
Use:
- explain read-during-compaction results
- calculate write amplification and cleanup lag
- identify whether planner or executor is the bottleneck
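
Write amplification follows directly from the byte counters listed above. The sketch below uses one common definition, storage bytes written by flush and compaction per byte of user data ingested; whether WAL bytes should also be counted is a choice this document does not make, so they are left out here as an assumption.

```python
GIB = 1024 ** 3


def write_amplification(ingested_bytes, flush_bytes_written, compaction_bytes_written):
    """Storage bytes written per byte of user data ingested.

    Assumption: WAL bytes are excluded. Add them to the numerator if the
    benchmark report defines write amplification to include the log.
    """
    if ingested_bytes <= 0:
        raise ValueError("ingested_bytes must be positive")
    return (flush_bytes_written + compaction_bytes_written) / ingested_bytes


# Example: 1 GiB ingested, flushed once, then 3 GiB rewritten by compaction.
example = write_amplification(1 * GIB, 1 * GIB, 3 * GIB)
```

Whichever definition is chosen should be fixed in the artifact schema so the number stays comparable across scenarios.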
#### GC and snapshot pinning
Why:
- Cleanup cost is part of the object-store product story, not an internal
footnote.
Add:
- obsolete bytes pending delete
- obsolete object count pending delete
- time from obsolete to reclaimed
- delete request counts and latency
- protected bytes due to active snapshot pins
Use:
- explain physical amplification windows
- set GC cadence and snapshot-lifetime guidance
- distinguish delayed reclaim from ineffective compaction
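
The amplification window can be reconstructed from per-object obsolete and reclaim timestamps. The sketch below is a simplified model under stated assumptions: physical bytes are approximated as logical live bytes plus obsolete bytes not yet deleted, and `ObsoleteObject` is a hypothetical record, not a Tonbo type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObsoleteObject:
    size_bytes: int
    obsolete_at_s: float
    reclaimed_at_s: Optional[float] = None  # None means still pending delete


def pending_obsolete_bytes(objects, now_s):
    """Bytes that are obsolete at now_s but not yet reclaimed."""
    return sum(
        o.size_bytes
        for o in objects
        if o.obsolete_at_s <= now_s
        and (o.reclaimed_at_s is None or o.reclaimed_at_s > now_s)
    )


def space_amplification(logical_live_bytes, objects, now_s):
    """Approximate physical/logical ratio; ignores compression and overlap."""
    physical = logical_live_bytes + pending_obsolete_bytes(objects, now_s)
    return physical / logical_live_bytes


def reclaim_delays_s(objects):
    """Time from obsolete to reclaimed for every object already deleted."""
    return [
        o.reclaimed_at_s - o.obsolete_at_s
        for o in objects
        if o.reclaimed_at_s is not None
    ]
```

Sampling `space_amplification` over time gives the window shape, while `reclaim_delays_s` separates slow reclaim from compaction that produces little obsolete data in the first place.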
#### Manifest and metadata path
Why:
- On object storage, metadata and version movement can be a large fixed cost.
Add:
- HEAD fetch latency
- manifest decode latency
- CAS publish latency
- CAS retry count
- visible version size
Use:
- explain constant setup floors
- identify metadata bottlenecks under mixed load
### Benchmark Engine To Instrument
#### Harness phase timing
Add:
- setup time
- warmup time
- steady-state window time
- teardown time
- per-iteration phase timers
Use:
- separate engine cost from harness overhead
- keep small-workload results honest
#### Topology capture
Add:
- runner region
- bucket region
- same VPC or AZ flag
- endpoint type
- cold or warm run flag
- median RTT probe
Use:
- stop over-attributing latency to Tonbo when it comes from path placement
#### System resource monitoring
Add:
- process CPU
- RSS
- disk throughput
- network throughput
- runtime queue depth if available
Use:
- explain whether a regression is CPU-bound, memory-bound, or network-bound
#### Request accounting
Add:
- GET, HEAD, range-GET, PUT, DELETE counts
- bytes per request type
- request failures and retries
Use:
- explain object-store economics
- connect page size and flush size to request amplification
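
Request counts become an economics number once they are multiplied by a price table. The sketch below takes the prices as an input rather than hard-coding them, because actual per-request prices depend on the provider, region, and storage class; the function names are hypothetical.

```python
def estimated_request_cost(counts, price_per_1000):
    """Estimate request cost from per-type counts and a caller-supplied price table.

    counts:         {request_type: number of requests}
    price_per_1000: {request_type: price per 1,000 requests}, supplied by the
                    caller from the provider's current price list.
    """
    missing = sorted(set(counts) - set(price_per_1000))
    if missing:
        raise KeyError(f"no price for request types: {missing}")
    return sum(counts[kind] * price_per_1000[kind] / 1000.0 for kind in counts)


def bytes_per_request(total_bytes, request_count):
    """Average payload per request; small values suggest request amplification."""
    if request_count == 0:
        return 0.0
    return total_bytes / request_count
```

Raising on a missing price, rather than treating it as free, avoids silently under-reporting cost when a new request type such as DELETE starts being counted.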
#### Run configuration snapshot
Add:
- scenario name
- commit mode
- sync policy
- seal thresholds
- compaction settings
- page size
- batch size
- WAL retention settings
- snapshot / historical-read workload shape
- git revision
Use:
- make every run reproducible
- make cross-run comparisons defensible
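
A configuration snapshot is easiest to compare when it has a canonical form. The sketch below fingerprints a snapshot and lists the keys on which two runs differ; the helper names and the example keys in the test are assumptions, chosen to match the list above.

```python
import hashlib
import json


def config_fingerprint(config):
    """Short, order-independent hash of a run configuration snapshot."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def differing_keys(config_a, config_b):
    """Keys whose values differ, including keys present in only one snapshot."""
    keys = set(config_a) | set(config_b)
    return sorted(k for k in keys if config_a.get(k) != config_b.get(k))
```

If two runs differ only in `git_revision`, a latency change between them can be attributed to the code; if they also differ in page size or sync policy, the comparison needs a caveat.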
### Rollout Priority
1. Read-path phase timers and request accounting
2. WAL and commit-path timers
3. Topology capture and harness/system metrics
4. Minor and major compaction accounting
5. GC and snapshot-pinning metrics
6. Richer manifest and CAS instrumentation
### Minimum Instrumentation Required Before Each Scenario
- `interleaved_freshness_read_write`
  - read-path phase timers
  - WAL queue depth
  - ack-to-visible lag
  - request counts
- `durable_parallel_stream_ingest`
  - durable-ack timing
  - WAL segment and sync metrics
  - recovery timing
- `deployment_topology_request_amplification`
  - topology capture
  - request accounting
  - manifest-open timing
- `read_after_compaction_byte_sweep`
  - prepare vs consume timing
  - logical vs physical bytes
  - visible SST count
- `parallel_readers_during_background_compaction`
  - compaction backlog
  - compaction bytes read and written
  - reader tail latency by phase
- `gc_lag_storage_amplification_window`
  - obsolete bytes
  - reclaim delay
  - delete request counts and latency
## Priority Logic
The order should follow the user journey, not Tonbo internals:
1. Can I write fresh data and query it immediately?
2. What does durability cost me?
3. Is the latency floor coming from Tonbo or the object-store path?
4. At what data volume do Tonbo's optimizations start to matter?
5. Does background maintenance hurt live traffic?
6. What does cleanup do to my bill and latency?
## Integrated Scenario Roadmap
### 1. `interleaved_freshness_read_write`
Why first: users care first about whether fresh writes become queryable quickly
under real mixed load, because that is the live-analytics promise.
Tonbo why: this is the core positioning test that separates Tonbo from a
quiesced embedded analytical engine.
Reference pattern:
- concurrent mixed-workload benchmarkers
- freshness-facing query workloads
Workload:
- Run 4 writer tasks and 16 reader tasks for 15 minutes on object storage.
- Writers commit 2,000-row Arrow batches every 100 ms with 20% key overlap and
5% deletes.
- Readers issue "last 5 minutes for tenant" scans every 250 ms with about 1%
selectivity.
- Compare `strict` and `fast` commit-ack modes.
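
The offered load implied by these parameters is worth computing up front, since it sets the scale for WAL, flush, and request counters. The sketch below assumes an open-loop schedule in which every task issues its operation on time with no backpressure; `offered_load` is a hypothetical helper.

```python
def offered_load(tasks, interval_ms, duration_s, rows_per_op=1):
    """Offered load for `tasks` tasks that each issue one op every interval_ms."""
    ops_per_task = (duration_s * 1000) // interval_ms
    total_ops = tasks * ops_per_task
    ops_per_second = tasks * 1000 / interval_ms
    return {
        "ops_per_second": ops_per_second,
        "rows_per_second": ops_per_second * rows_per_op,
        "total_ops": total_ops,
        "total_rows": total_ops * rows_per_op,
    }


# 4 writers, one 2,000-row batch every 100 ms, for 15 minutes.
writers = offered_load(tasks=4, interval_ms=100, duration_s=15 * 60, rows_per_op=2000)
# 16 readers, one scan every 250 ms, for 15 minutes.
readers = offered_load(tasks=16, interval_ms=250, duration_s=15 * 60)
```

Under that assumption the writers offer 80,000 rows per second (72 million rows in total) and the readers offer 64 scans per second (57,600 scans in total). Achieved rates below these figures indicate backpressure and should be reported alongside latency.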
Metrics:
- commit latency
- freshness lag from durable ack to first visible read
- read mean, p95, p99
- WAL queue depth
- compaction backlog
- visible SST count
- request counts
- bytes read and written
- logical live bytes
Decision:
- default commit-ack mode
- seal thresholds
- minor-compaction cadence
- whether Tonbo can credibly claim live freshness on object storage
Timing:
- Short-term
### 2. `durable_parallel_stream_ingest`
Why second: users next ask what durability costs in throughput and latency when
they treat the database like a stream sink, not a batch loader.
Tonbo why: Tonbo's WAL and object-store durability path are central to the
product, so this is a primary proof, not an edge case.
Reference pattern:
- durable vs non-durable write comparisons
- sync-vs-async framing
Workload:
- Run 8 parallel writer streams for 20 minutes.
- Compare batch sizes of 1, 100, and 1,000 rows.
- Compare WAL sync policies such as `Always`, `Interval(10 ms)`, and
`Interval(50 ms)`.
- Inject crash-and-recover every 60 seconds after acknowledged writes.
Metrics:
- ack p50, p95, p99
- durable lag
- ingest throughput
- WAL segment size
- sync latency
- queue depth
- recovery time
- lost acknowledged rows
- cost per MiB ingested
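
The "lost acknowledged rows" metric reduces to a set difference after each injected crash. The sketch below assumes the harness logs the ID of every acknowledged row and can enumerate row IDs after recovery; both inputs and the function name are hypothetical.

```python
def recovery_check(acked_ids, recovered_ids):
    """Compare rows acknowledged before a crash with rows present after recovery.

    lost_acknowledged: acknowledged rows missing after recovery. Any entry
                       here is a durability violation for that sync policy.
    unacked_survivors: rows present after recovery that were never
                       acknowledged, reported separately for visibility.
    """
    acked = set(acked_ids)
    recovered = set(recovered_ids)
    return {
        "lost_acknowledged": sorted(acked - recovered),
        "unacked_survivors": sorted(recovered - acked),
    }
```

Running this check after every crash-and-recover cycle, rather than once at the end, ties any loss to the specific sync policy and batch size in effect at that moment.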
Decision:
- default WAL sync policy
- batching window
- whether Tonbo needs stronger coalescing or local staging to make durable
ingest viable
Timing:
- Short-term
### 3. `deployment_topology_request_amplification`
Why third: users need to know whether slow object-store results are caused by
Tonbo, by many small requests, or by the deployment path between compute and
storage.
Tonbo why: this is the main credibility gap in current results because the
object-store latency floor is visible but not explained.
Reference pattern:
- explicit topology reporting
- parameterized harness discipline
Workload:
- Keep the logical live set fixed at 1 GiB.
- Compare three physical layouts such as micro pages or segments, medium
pages, and object-store-optimized large pages.
- Run the same selective range scan from same-AZ in-VPC, same-region public
path, and cross-region when possible.
Metrics:
- GET, HEAD, and range-GET counts
- bytes per request
- mean and p99 latency
- manifest-open time
- scan-plan time
- estimated request cost per GiB scanned
- median RTT
- network throughput
Decision:
- default Parquet page size
- minimum flush size
- benchmark deployment requirements
- whether public-path and colocated results must be reported separately
Timing:
- Short-term if infrastructure is available; otherwise Later
### 4. `read_after_compaction_byte_sweep`
Why fourth: users need to know when Tonbo's structural optimizations start to
pay off, because small datasets often look dominated by constant overhead.
Tonbo why: this explains the current benchmark correctly and prevents weak
conclusions from tiny workloads.
Reference pattern:
- byte-oriented reporting
- regime-change explanation style
Workload:
- Run `baseline` and `quiesced` read scenarios at fixed logical live-set
targets such as 8 MiB, 32 MiB, 128 MiB, 512 MiB, 2 GiB, and 8 GiB on both
`local` and `object_store`.
- Keep query shape fixed with a narrow recent-range filter and projection.
Metrics:
- logical live bytes before and after compaction
- read bytes
- request counts
- visible SST count
- prepare vs consume latency
- mean and p99 latency
- CPU time
Decision:
- where the fixed-cost floor stops dominating
- what public benchmark scale should be used
- how Tonbo should explain compaction value in public benchmark material
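
One way to locate where the fixed-cost floor stops dominating is to fit the sweep results to a simple linear model, latency = floor + slope × size, and find the size at which the two terms are equal. The sketch below is an illustrative analysis under that linear assumption, which may not hold across the whole sweep; the function names are hypothetical.

```python
def fit_floor(sizes_mib, latencies_ms):
    """Ordinary least-squares fit of latency_ms = floor_ms + slope * size_mib."""
    n = len(sizes_mib)
    if n < 2 or n != len(latencies_ms):
        raise ValueError("need at least two matching (size, latency) points")
    mean_x = sum(sizes_mib) / n
    mean_y = sum(latencies_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_mib)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_mib, latencies_ms))
    slope = sxy / sxx
    floor_ms = mean_y - slope * mean_x
    return floor_ms, slope


def crossover_mib(floor_ms, slope_ms_per_mib):
    """Size at which size-dependent cost equals the fixed floor, or None."""
    if slope_ms_per_mib <= 0:
        return None  # latency does not grow with size in this range
    return floor_ms / slope_ms_per_mib
```

Below the crossover size, results mostly measure fixed overhead; a public benchmark scale chosen well above it is more likely to show the effect of compaction and layout.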
Timing:
- Short-term
### 5. `parallel_readers_during_background_compaction`
Why fifth: users care about whether background maintenance causes latency spikes
while they are reading live data.
Tonbo why: compaction only matters strategically once it is measured as a
user-visible tail-latency risk or benefit.
Reference pattern:
- read-during-write style workloads
- deterministic phase sequencing
Workload:
- Seed an overlap-heavy L0 state with roughly 1 GiB logical data and dozens of
visible SSTs.
- Start major compaction.
- Run 8, 32, and 64 concurrent readers for 10 minutes, each issuing a scan
every 200 ms with a small recent-range filter and narrow projection.
Metrics:
- reader p50, p95, p99
- compaction throughput
- bytes rewritten
- visible SST count over time
- backlog depth
- latency spikes around manifest changes
- request counts
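
Latency spikes around manifest changes can be isolated by splitting read samples into those that fall near a manifest publish and those that do not, then comparing tail percentiles. The sketch below assumes the harness has timestamped read latencies and manifest-change times; the one-second window and both function names are assumptions.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[max(rank, 1) - 1]


def split_by_manifest_change(samples, change_times_s, window_s=1.0):
    """Split (timestamp_s, latency_ms) samples into near/far latency lists.

    A sample is "near" if it falls within window_s of any manifest change.
    """
    near, far = [], []
    for ts, latency_ms in samples:
        if any(abs(ts - c) <= window_s for c in change_times_s):
            near.append(latency_ms)
        else:
            far.append(latency_ms)
    return near, far
```

If the p99 of the near group is much higher than that of the far group, the tail is likely tied to version movement rather than steady-state compaction I/O, which points toward different mitigations.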
Decision:
- whether compaction needs throttling
- whether compaction needs isolation
- whether compaction needs separate service-class treatment to preserve
live-read tails
Timing:
- Later
### 6. `gc_lag_storage_amplification_window`
Why sixth: users eventually ask what cleanup and active snapshot pins do to storage cost,
object count, and read stability after compaction publishes new data.
Tonbo why: it turns the current physical-byte caveat into an operational
benchmark and exposes whether Tonbo's cleanup model is good enough for
object-storage economics.
Reference pattern:
- write amplification and compaction accounting
- background system monitoring
Workload:
- Sustain overwrite and delete traffic through repeated compaction cycles while
varying the lifetime of concurrent read snapshots that pin historical
manifest versions.
Metrics:
- logical vs physical bytes
- obsolete object count
- time-to-reclaim
- delete request count
- read latency during GC
- tombstone density
Decision:
- GC cadence
- snapshot-lifetime guidance
- whether in-process snapshot pinning is sufficient, or whether stronger,
durable reader coordination is required before making broader claims
Timing:
- Later