dsfb-database 0.1.1

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DSFB-Database \u2014 Reproduction Notebook\n",
    "\n",
    "This notebook runs **dsfb-database** from scratch on **real public datasets**, end-to-end, in Colab (or any Jupyter kernel). No synthetic substitutes are used on the real-data path. The only deterministic-synthetic section is the TPC-DS-shaped perturbation harness, whose entire purpose is ground-truth-validated F1/TTD measurement (\u00a78 of the paper).\n",
    "\n",
    "## Non-claims (read first)\n",
    "\n",
    "1. DSFB-Database does not optimise queries, replace the query optimiser, or modify execution plans.\n",
    "2. DSFB-Database does not claim causal correctness; motifs represent structural consistency given observed signals, not root causes.\n",
    "3. DSFB-Database does not provide a forecasting or predictive guarantee; precursor structure is structural, not probabilistic.\n",
    "4. DSFB-Database does not provide ground-truth-validated detection on real workloads; we evaluate via injected perturbations, plan-hash concordance, and replay determinism.\n",
    "5. DSFB-Database does not claim a universal SQL grammar; motifs are engine-aware, telemetry-aware, and workload-aware.\n",
    "6. DSFB-Database does not validate that an operator-supplied grammar is appropriate for a non-SQL residual stream; the generic CSV adapter is a worked example, not a universality claim.\n",
    "\n",
    "## What this notebook shows\n",
    "\n",
    "* **Real data, every run.** Four real public datasets are shipped as representative samples inside the crate at `examples/data/`:\n",
    "    * `ceb_sample.csv` \u2014 5,000 true/estimated PostgreSQL cardinality rows, sliced from the first 400 CEB-IMDb pickles (https://github.com/learnedsystems/CEB, MIT).\n",
    "    * `job_trace_sample.csv` \u2014 EXPLAIN ANALYZE trace over all 113 JOB queries \u00d7 3 replays, produced by running `scripts/fetch_job.sh` against IMDb in DuckDB (https://github.com/gregrahn/join-order-benchmark + https://event.cwi.nl/da/job/imdb.tgz, MIT).\n",
    "    * `queries.txt` \u2014 full 11,136-query SQLShare text release (https://uwescience.github.io/sqlshare/data_release.html, CC-BY 4.0). Analysed in `sqlshare-text` mode (workload-phase motif only, over ordinal-position buckets \u2014 the public release has no timing/user metadata).\n",
    "    * `snowset_sample.csv` \u2014 first 5,000 real rows from the Snowset Cornell mirror (http://www.cs.cornell.edu/~midhul/snowset/snowset-main.csv.gz, CC-BY 4.0).\n",
    "* **Controlled-perturbation pipeline** (TPC-DS-shaped, deterministic in-process) provides the only ground-truth-validated metrics (F1/TTD/FAR) in the paper.\n",
    "* **Heavy-fetch path**, for users who want the full corpora (not required for this notebook), is documented in the final cell.\n",
    "* Replay-determinism is pinned by fingerprint: the test suite verifies SHA-256 of the canonical stream + episode outputs.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Clean working dir, install Rust, clone repo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, sys, shutil, subprocess\n",
    "\n",
    "# Wipe any previous repo checkout so every run starts from scratch.\n",
    "if os.path.isdir('dsfb'):\n",
    "    shutil.rmtree('dsfb')\n",
    "\n",
    "if shutil.which('cargo') is None:\n",
    "    subprocess.check_call(['bash', '-lc',\n",
    "        \"curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | \"\n",
    "        \"sh -s -- -y --default-toolchain stable\"])\n",
    "    os.environ['PATH'] = os.path.expanduser('~/.cargo/bin') + ':' + os.environ['PATH']\n",
    "subprocess.check_call(['cargo', '--version'])\n",
    "subprocess.check_call(['git', 'clone', '--depth', '1',\n",
    "                       'https://github.com/infinityabundance/dsfb.git'])\n",
    "REPO  = os.path.abspath('dsfb')\n",
    "CRATE = os.path.join(REPO, 'crates', 'dsfb-database')\n",
    "OUT   = os.path.join(CRATE, 'out')\n",
    "os.makedirs(OUT, exist_ok=True)\n",
    "print('crate path:', CRATE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Build + run the test suite\n",
    "\n",
    "Includes the pinned-fingerprint replay-determinism tests and the non-claim lock."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "subprocess.check_call(['cargo', 'build', '--release', '--features', 'report'], cwd=CRATE)\nsubprocess.check_call(['cargo', 'test',  '--release', '--features', 'report'], cwd=CRATE)"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. One-command reproduction\n\n",
    "`reproduce-all` runs the TPC-DS controlled-perturbation pipeline, every bundled dataset exemplar, the DSFB-vs-PELT/BOCPD comparison figure, the null-trace refusal figure, and the cross-signal / stability metrics, then packs every file into `out/dsfb_database_artifacts.zip`. The zip is byte-stable across runs (same seed \u21d2 same SHA-256) \u2014 `tests/reproduce_all_zip_is_deterministic.rs` pins that invariant.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "subprocess.check_call(['./target/release/dsfb-database', 'reproduce-all',\n",
    "                       '--seed', '42', '--out', OUT], cwd=CRATE)\n",
    "subprocess.check_call(['./target/release/dsfb-database', 'replay-check',\n",
    "                       '--seed', '42'], cwd=CRATE)\n",
    "\n",
    "import pandas as pd\n",
    "metrics = pd.read_csv(os.path.join(OUT, 'tpcds.metrics.csv'))\n",
    "metrics\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Display the canonical figures + comparison + refusal\n\n",
    "Every figure below is regenerated from the run above \u2014 no hidden state.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "from IPython.display import Image, display\n",
    "figures = ['tpcds.plan_regression_onset.png',\n",
    "           'tpcds.cardinality_mismatch_regime.png',\n",
    "           'tpcds.contention_ramp.png',\n",
    "           'tpcds.cache_collapse.png',\n",
    "           'tpcds.workload_phase_transition.png',\n",
    "           'tpcds.phase_portrait.png',\n",
    "           'tpcds.funnel.png',\n",
    "           'comparison.png',\n",
    "           'refusal.png']\n",
    "for name in figures:\n",
    "    path = os.path.join(OUT, name)\n",
    "    if os.path.exists(path):\n",
    "        print(name)\n",
    "        display(Image(path))\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Download everything as one zip\n\n",
    "The `reproduce-all` command already produced `out/dsfb_database_artifacts.zip`. In Colab, the file is handed to `google.colab.files.download()` so you can grab it with one click; elsewhere the zip path is printed for manual fetching. `out/MANIFEST.md` enumerates every file inside.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "zip_path = os.path.join(OUT, 'dsfb_database_artifacts.zip')\n",
    "print('artifact zip ready:', zip_path, f'({os.path.getsize(zip_path)/1e6:.2f} MB)')\n",
    "try:\n",
    "    from google.colab import files as _files  # type: ignore\n",
    "    _files.download(zip_path)\n",
    "except Exception as e:\n",
    "    print('(not in Colab: fetch from', zip_path, ')')\n",
    "    print('  reason:', e)\n",
    "\n",
    "print()\n",
    "with open(os.path.join(OUT, 'MANIFEST.md')) as f:\n",
    "    print(f.read())\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. (Optional) Full-corpus fetch paths\n",
    "\n",
    "The bundled samples above are **real data** (sliced directly from the\n",
    "authoritative public dumps), large enough to exercise every motif the\n",
    "dataset can support. If you want to process the **full** public corpus\n",
    "of each dataset \u2014 e.g. for an independent re-analysis \u2014 the fetch\n",
    "scripts under `scripts/` pull the authoritative artefacts and write the\n",
    "same CSV schema. They are heavy (Snowset is ~7.5 GB gzipped; JOB\n",
    "requires downloading the 1.17 GB IMDb dump and loading it into DuckDB;\n",
    "CEB downloads a ~1 GB pickle tarball from Dropbox) and are\n",
    "**not required** to reproduce the paper's figures or any of the\n",
    "fingerprints pinned by the test suite.\n",
    "\n",
    "```bash\n",
    "cd crates/dsfb-database\n",
    "bash scripts/fetch_ceb.sh             # CEB pickles -> data/ceb_subset.csv      (~50 MB)\n",
    "bash scripts/fetch_job.sh             # IMDb 1.17 GB + DuckDB load + 113 x N EXPLAIN ANALYZE\n",
    "bash scripts/fetch_snowset_subset.sh  # Snowset 7.5 GB dump -> data/snowset_shard.csv\n",
    "bash scripts/fetch_sqlshare.sh        # SQLShare queries.txt (already bundled in examples/data/)\n",
    "bash scripts/build_tpcds.sh           # in-process; no download required\n",
    "\n",
    "./target/release/dsfb-database run --dataset ceb           --path data/ceb_subset.csv\n",
    "./target/release/dsfb-database run --dataset job           --path data/job_trace.csv\n",
    "./target/release/dsfb-database run --dataset snowset       --path data/snowset_shard.csv\n",
    "./target/release/dsfb-database run --dataset sqlshare-text --path examples/data/queries.txt\n",
    "```\n",
    "\n",
    "Every output stream is fingerprinted; re-running any of these commands\n",
    "on the same input is bytewise-deterministic."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}