{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DSFB-Database \u2014 Reproduction Notebook\n",
"\n",
"This notebook runs **dsfb-database** from scratch on **real public datasets**, end-to-end, in Colab (or any Jupyter kernel). No synthetic substitutes are used on the real-data path. The only deterministic-synthetic section is the TPC-DS-shaped perturbation harness, whose entire purpose is ground-truth-validated F1/TTD measurement (\u00a78 of the paper).\n",
"\n",
"## Non-claims (read first)\n",
"\n",
"1. DSFB-Database does not optimise queries, replace the query optimiser, or modify execution plans.\n",
"2. DSFB-Database does not claim causal correctness; motifs represent structural consistency given observed signals, not root causes.\n",
"3. DSFB-Database does not provide a forecasting or predictive guarantee; precursor structure is structural, not probabilistic.\n",
"4. DSFB-Database does not provide ground-truth-validated detection on real workloads; we evaluate via injected perturbations, plan-hash concordance, and replay determinism.\n",
"5. DSFB-Database does not claim a universal SQL grammar; motifs are engine-aware, telemetry-aware, and workload-aware.\n",
"6. DSFB-Database does not validate that an operator-supplied grammar is appropriate for a non-SQL residual stream; the generic CSV adapter is a worked example, not a universality claim.\n",
"\n",
"## What this notebook shows\n",
"\n",
"* **Real data, every run.** Four real public datasets are shipped as representative samples inside the crate at `examples/data/`:\n",
" * `ceb_sample.csv` \u2014 5,000 true/estimated PostgreSQL cardinality rows, sliced from the first 400 CEB-IMDb pickles (https://github.com/learnedsystems/CEB, MIT).\n",
" * `job_trace_sample.csv` \u2014 EXPLAIN ANALYZE trace over all 113 JOB queries \u00d7 3 replays, produced by running `scripts/fetch_job.sh` against IMDb in DuckDB (https://github.com/gregrahn/join-order-benchmark + https://event.cwi.nl/da/job/imdb.tgz, MIT).\n",
" * `queries.txt` \u2014 full 11,136-query SQLShare text release (https://uwescience.github.io/sqlshare/data_release.html, CC-BY 4.0). Analysed in `sqlshare-text` mode (workload-phase motif only, over ordinal-position buckets \u2014 the public release has no timing/user metadata).\n",
" * `snowset_sample.csv` \u2014 first 5,000 real rows from the Snowset Cornell mirror (http://www.cs.cornell.edu/~midhul/snowset/snowset-main.csv.gz, CC-BY 4.0).\n",
"* **Controlled-perturbation pipeline** (TPC-DS-shaped, deterministic in-process) provides the only ground-truth-validated metrics (F1/TTD/FAR) in the paper.\n",
"* **Heavy-fetch path**, for users who want the full corpora (not required for this notebook), is documented in the final cell.\n",
"* Replay-determinism is pinned by fingerprint: the test suite verifies SHA-256 of the canonical stream + episode outputs.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Clean working dir, install Rust, clone repo"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os, sys, shutil, subprocess\n",
"\n",
"# Wipe any previous repo checkout so every run starts from scratch.\n",
"if os.path.isdir('dsfb'):\n",
" shutil.rmtree('dsfb')\n",
"\n",
"if shutil.which('cargo') is None:\n",
" subprocess.check_call(['bash', '-lc',\n",
" \"curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | \"\n",
" \"sh -s -- -y --default-toolchain stable\"])\n",
" os.environ['PATH'] = os.path.expanduser('~/.cargo/bin') + ':' + os.environ['PATH']\n",
"subprocess.check_call(['cargo', '--version'])\n",
"subprocess.check_call(['git', 'clone', '--depth', '1',\n",
" 'https://github.com/infinityabundance/dsfb.git'])\n",
"REPO = os.path.abspath('dsfb')\n",
"CRATE = os.path.join(REPO, 'crates', 'dsfb-database')\n",
"OUT = os.path.join(CRATE, 'out')\n",
"os.makedirs(OUT, exist_ok=True)\n",
"print('crate path:', CRATE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Build + run the test suite\n",
"\n",
"Includes the pinned-fingerprint replay-determinism tests and the non-claim lock."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "subprocess.check_call(['cargo', 'build', '--release', '--features', 'report'], cwd=CRATE)\nsubprocess.check_call(['cargo', 'test', '--release', '--features', 'report'], cwd=CRATE)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. One-command reproduction\n\n",
"`reproduce-all` runs the TPC-DS controlled-perturbation pipeline, every bundled dataset exemplar, the DSFB-vs-PELT/BOCPD comparison figure, the null-trace refusal figure, and the cross-signal / stability metrics, then packs every file into `out/dsfb_database_artifacts.zip`. The zip is byte-stable across runs (same seed \u21d2 same SHA-256) \u2014 `tests/reproduce_all_zip_is_deterministic.rs` pins that invariant.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"subprocess.check_call(['./target/release/dsfb-database', 'reproduce-all',\n",
" '--seed', '42', '--out', OUT], cwd=CRATE)\n",
"subprocess.check_call(['./target/release/dsfb-database', 'replay-check',\n",
" '--seed', '42'], cwd=CRATE)\n",
"\n",
"import pandas as pd\n",
"metrics = pd.read_csv(os.path.join(OUT, 'tpcds.metrics.csv'))\n",
"metrics\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Display the canonical figures + comparison + refusal\n\n",
"Every figure below is regenerated from the run above \u2014 no hidden state.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from IPython.display import Image, display\n",
"figures = ['tpcds.plan_regression_onset.png',\n",
" 'tpcds.cardinality_mismatch_regime.png',\n",
" 'tpcds.contention_ramp.png',\n",
" 'tpcds.cache_collapse.png',\n",
" 'tpcds.workload_phase_transition.png',\n",
" 'tpcds.phase_portrait.png',\n",
" 'tpcds.funnel.png',\n",
" 'comparison.png',\n",
" 'refusal.png']\n",
"for name in figures:\n",
" path = os.path.join(OUT, name)\n",
" if os.path.exists(path):\n",
" print(name)\n",
" display(Image(path))\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Download everything as one zip\n\n",
"The `reproduce-all` command already produced `out/dsfb_database_artifacts.zip`. In Colab, the file is handed to `google.colab.files.download()` so you can grab it with one click; elsewhere the zip path is printed for manual fetching. `out/MANIFEST.md` enumerates every file inside.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"zip_path = os.path.join(OUT, 'dsfb_database_artifacts.zip')\n",
"print('artifact zip ready:', zip_path, f'({os.path.getsize(zip_path)/1e6:.2f} MB)')\n",
"try:\n",
" from google.colab import files as _files # type: ignore\n",
" _files.download(zip_path)\n",
"except Exception as e:\n",
" print('(not in Colab: fetch from', zip_path, ')')\n",
" print(' reason:', e)\n",
"\n",
"print()\n",
"with open(os.path.join(OUT, 'MANIFEST.md')) as f:\n",
" print(f.read())\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. (Optional) Full-corpus fetch paths\n",
"\n",
"The bundled samples above are **real data** (sliced directly from the\n",
"authoritative public dumps), large enough to exercise every motif the\n",
"dataset can support. If you want to process the **full** public corpus\n",
"of each dataset \u2014 e.g. for an independent re-analysis \u2014 the fetch\n",
"scripts under `scripts/` pull the authoritative artefacts and write the\n",
"same CSV schema. They are heavy (Snowset is ~7.5 GB gzipped; JOB\n",
"requires downloading the 1.17 GB IMDb dump and loading it into DuckDB;\n",
"CEB downloads a ~1 GB pickle tarball from Dropbox) and are\n",
"**not required** to reproduce the paper's figures or any of the\n",
"fingerprints pinned by the test suite.\n",
"\n",
"```bash\n",
"cd crates/dsfb-database\n",
"bash scripts/fetch_ceb.sh # CEB pickles -> data/ceb_subset.csv (~50 MB)\n",
"bash scripts/fetch_job.sh # IMDb 1.17 GB + DuckDB load + 113 x N EXPLAIN ANALYZE\n",
"bash scripts/fetch_snowset_subset.sh # Snowset 7.5 GB dump -> data/snowset_shard.csv\n",
"bash scripts/fetch_sqlshare.sh # SQLShare queries.txt (already bundled in examples/data/)\n",
"bash scripts/build_tpcds.sh # in-process; no download required\n",
"\n",
"./target/release/dsfb-database run --dataset ceb --path data/ceb_subset.csv\n",
"./target/release/dsfb-database run --dataset job --path data/job_trace.csv\n",
"./target/release/dsfb-database run --dataset snowset --path data/snowset_shard.csv\n",
"./target/release/dsfb-database run --dataset sqlshare-text --path examples/data/queries.txt\n",
"```\n",
"\n",
"Every output stream is fingerprinted; re-running any of these commands\n",
"on the same input is bytewise-deterministic."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}