Dataset adapters.
Each adapter exposes a load(path) function that reads a real subset of
the corresponding public dataset from disk and returns a typed
ResidualStream. The adapter is responsible for:
- format-specific parsing (CSV / Parquet / pickle / SQL)
- dropping samples whose required fields are missing or non-finite
- sorting by time
- embedding the dataset name + version + subset id in
stream.source
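The responsibilities above can be sketched as a small pipeline. This is a minimal illustration, not the crate's implementation: `Sample`, the field layout of `ResidualStream`, the two-column `t,residual` input, the function name `load_from_csv_text`, and the `{dataset}-{version}-{subset}` layout of `source` are all assumptions made for the example. It takes the file contents as a string rather than a path so that it stays self-contained.

```rust
/// Hypothetical sample type; the real crate's fields are not shown on this page.
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub t: f64,        // timestamp
    pub residual: f64, // residual value at that time
}

/// Hypothetical stand-in for the crate's `ResidualStream`.
#[derive(Debug, Clone, PartialEq)]
pub struct ResidualStream {
    pub source: String,
    pub samples: Vec<Sample>,
}

/// Parse `t,residual` rows, drop unusable rows, sort by time, tag the source.
pub fn load_from_csv_text(
    text: &str,
    dataset: &str,
    version: &str,
    subset: &str,
) -> ResidualStream {
    let mut samples: Vec<Sample> = text
        .lines()
        .filter_map(|line| {
            let mut cols = line.split(',');
            // A missing or unparsable field (including a header row) drops the row.
            let t = cols.next()?.trim().parse::<f64>().ok()?;
            let residual = cols.next()?.trim().parse::<f64>().ok()?;
            // "NaN" and "inf" parse successfully as f64, so check explicitly.
            if t.is_finite() && residual.is_finite() {
                Some(Sample { t, residual })
            } else {
                None
            }
        })
        .collect();

    // total_cmp gives a total order on f64, so the sort cannot panic.
    samples.sort_by(|a, b| a.t.total_cmp(&b.t));

    ResidualStream {
        // Assumed layout: dataset name, version, and subset id joined by '-'.
        source: format!("{dataset}-{version}-{subset}"),
        samples,
    }
}
```

Filtering before sorting keeps non-finite timestamps out of the comparison entirely, which is why the finiteness check sits inside the parsing step.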
Where a dataset cannot be redistributed inside the build (Snowset is
~10 GB; SQLShare is permission-gated; the IMDB JOB dump is third-party
licensed), the adapter additionally provides a synthetic exemplar
function that produces a deterministic, seedable residual stream with the
same statistical shape as the real corpus. The paper labels every figure
that uses an exemplar with [exemplar], and the corresponding fetch
script lets the operator regenerate the figure on the real data.
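To illustrate what "deterministic, seedable" means in practice, here is a minimal sketch of a generator whose output depends only on the seed. It uses a hand-written SplitMix64 generator as a dependency-free stand-in; the crate may well use a different generator. The names `SplitMix64` and `exemplar_residuals` are hypothetical, and the uniform noise produced here does not model the statistical shape of any real corpus.

```rust
/// SplitMix64: a small, well-known 64-bit generator, used here only as a
/// dependency-free stand-in for whatever RNG the real exemplars use.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value in [0, 1), built from the top 53 bits.
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical exemplar generator: `n` (time, residual) pairs with
/// residuals drawn uniformly from [-0.5, 0.5). Same seed, same output.
pub fn exemplar_residuals(seed: u64, n: usize) -> Vec<(f64, f64)> {
    let mut rng = SplitMix64::new(seed);
    (0..n).map(|i| (i as f64, rng.next_f64() - 0.5)).collect()
}
```

Because the generator state is derived from the seed alone, calling `exemplar_residuals(7, 100)` twice yields identical vectors, which is the property that lets a figure be regenerated exactly.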
Design rule (panel-imposed): synthetic exemplars never carry the bare
dataset name in stream.source — they always read
"{dataset}-exemplar-seed{N}", so a downstream report cannot
accidentally label exemplar results as if they were real-data results.
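The labelling rule can be sketched with two small helpers. Only the `"{dataset}-exemplar-seed{N}"` pattern comes from the text above; the function names `exemplar_source` and `is_exemplar_source` are hypothetical and are not taken from the crate.

```rust
/// Build the source label for a synthetic exemplar stream.
pub fn exemplar_source(dataset: &str, seed: u64) -> String {
    format!("{dataset}-exemplar-seed{seed}")
}

/// Return true if `source` follows the exemplar pattern:
/// a non-empty dataset name, the marker, then a decimal seed.
pub fn is_exemplar_source(source: &str) -> bool {
    match source.rsplit_once("-exemplar-seed") {
        Some((dataset, seed)) => {
            !dataset.is_empty()
                && !seed.is_empty()
                && seed.bytes().all(|b| b.is_ascii_digit())
        }
        None => false,
    }
}
```

A downstream report could call a check like `is_exemplar_source(&stream.source)` before captioning a figure, so a bare name such as `"snowset"` is never attached to exemplar output.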
Modules§
- ceb
- CEB adapter (Cardinality Estimation Benchmark, Negi et al.).
- generic_csv
- Generic CSV adapter: a single-domain worked example of applying dsfb-database’s motif grammar to a residual stream that was not captured from a SQL engine.
- job
- JOB adapter (Join Order Benchmark, Leis et al., VLDB 2015).
- postgres
- PostgreSQL pg_stat_statements adapter: real engine bridge.
- snowset
- Snowset adapter (Vuppalapati et al., NSDI 2020).
- sqlshare
- SQLShare adapter (Jain et al., SIGMOD 2016).
- sqlshare_text
- SQLShare text-only adapter.
- tpcds
- TPC-DS adapter.
Traits§
- DatasetAdapter
- Trait for the five dataset adapters.