midstream 0.2.0

Real-time LLM streaming with inflight analysis
# 0022 — Persistence layer: delete `ruvector.db`, adopt `redb` when persistence is needed

- **Status:** Proposed
- **Date:** 2026-05-13
- **Deciders:** @ruv
- **Tags:** persistence, storage, hygiene

## Context and Problem Statement

There is a **1.5 MB binary file `ruvector.db`** committed at the repo
root (`/home/ruvultra/projects/midstream/ruvector.db`,
`file(1)` reports it as generic `data`). Nothing in the workspace
references it:

- `grep -rn "ruvector" Cargo.toml crates/*/Cargo.toml hyprstream-main/Cargo.toml` → no hits.
- `grep -rn "ruvector.db" src/ crates/*/src/ examples/` → no hits in
  first-party code; only stray references inside `.claude*/` metadata.
- It is **not** an embedded asset (a search for `include_bytes!` and
  `include_str!` uses that reference it returns nothing).

So it is an **orphan binary blob** in the repo. Most likely it's the
state file of an unrelated tool (ruflo / ruvector) that was run inside
this directory and accidentally committed; the `.gitignore` should
have excluded it.

Separately, the only persistence dependency in the dependency graph is
**`duckdb = "1.1.1"` with `features = ["bundled"]`** in
`hyprstream-main/Cargo.toml:23` — a 200+MB compile cost driven by a
crate ([ADR-0002](0002-unvendor-hyprstream.md)) that nobody calls
(grep for `hyprstream_core`/`use hyprstream` returns zero hits in
first-party code).

So midstream today has:

- one orphan committed `.db` file (1.5 MB),
- one massive transitively-included database engine (`duckdb
  bundled`) that nothing imports,
- no chosen first-party persistence layer.

Several upcoming ADRs *do* need persistence:

- [ADR-0008](0008-lock-free-scheduler-cache.md): the `moka` cache is
  in-memory; a warm-restart cache file would be nice to have.
- [ADR-0012](0012-streaming-input-bounds.md): `MetricRing` could
  optionally flush to disk on rotation.
- ReasoningBank-style pattern persistence (not in this repo yet but
  referenced from `CLAUDE.md`).

## Decision Drivers

- **No orphaned blobs.** Committed binary files must have a documented
  purpose, or they leave.
- **Don't pay for what you don't use.** `duckdb bundled` adds
  significant compile time for code that isn't invoked.
- **Pick once, use everywhere.** When persistence is needed, the
  pick should be uniform — not a free-for-all of per-crate choices.
- **Embedded, single-file, async-friendly, ACID.** The streaming use
  case is small, local, write-mostly. We don't want a SQL engine; we
  want a fast embedded KV/typed-tree.

## Considered Options

1. **Status quo.** Keep `ruvector.db` and the unused `duckdb`
   dependency.
2. **Delete `ruvector.db`, drop `duckdb bundled`, declare "no
   persistence" until a use-case forces a decision.** Cheapest;
   correct for today.
3. **Adopt `sled`** when persistence is needed. Battle-tested in
   production, but it carries maintenance risk that the project's lead
   has acknowledged (no 1.0 release; activity is intermittent).
4. **Adopt `redb`** when persistence is needed. Pure-Rust ACID
   embedded KV with typed tables, single-file, MIT, actively
   maintained, MSRV stable, MMAP-friendly.
5. **Adopt `fjall`** when persistence is needed. LSM-tree, newer
   pure-Rust, async-aware, designed for write-heavy workloads.
6. **Adopt `duckdb`** (already in deps via hyprstream-main).
   Heavy-weight SQL engine; mismatched to a KV use-case.

## Decision Outcome

**Chosen option: Option 2 immediately, plus a forward-binding
preference for Option 4 (`redb`) when persistence is needed.**

Concretely now:

- Delete `ruvector.db` from the working tree. Add `ruvector.db` and
  `*.db` to `.gitignore` so similar accidents are caught.
- Drop the `duckdb` dependency by un-vendoring `hyprstream-main`
  (already covered by [ADR-0002](0002-unvendor-hyprstream.md)). If
  that ADR cannot land immediately, gate `duckdb` behind a feature
  flag so default builds skip it.
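
The feature-flag fallback can be sketched as follows. On the Cargo side this normally means marking the dependency `optional = true` and leaving its feature out of `default`. On the Rust side, code that touches the engine is compiled only when the feature is on. The feature name `duckdb-backend` and the helper `compiled_storage_engine` are hypothetical, chosen for illustration; nothing in the repo defines them.

```rust
/// Reports which storage engine was compiled into this build.
///
/// Assumption: `duckdb-backend` is a hypothetical, non-default Cargo feature
/// that would pull in the optional `duckdb` dependency.
pub fn compiled_storage_engine() -> Option<&'static str> {
    if cfg!(feature = "duckdb-backend") {
        // Only reachable when the build explicitly opts in.
        Some("duckdb")
    } else {
        // Default builds carry no storage engine at all.
        None
    }
}
```

A default build (no features enabled) takes the `None` branch, so the bundled engine is never compiled unless a caller opts in.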

Forward direction:

- When persistence is needed (cache warm-restart, metric ring spill,
  pattern store), the implementation crate adds `redb = "2"` and
  exposes a typed table via the workspace's persistence facade
  trait (TBD in a follow-up ADR when the first concrete use-case
  lands).
- `sled` and `fjall` will be re-evaluated when the *first* real
  write-heavy use-case arrives; `redb` is the default until evidence
  overturns it.
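
As a rough illustration of what a persistence facade could look like, the sketch below defines a small key-value trait and an in-memory stand-in. The real trait is still TBD, so every name here (`KvStore`, `MemoryStore`, `record_pattern`) is hypothetical. It uses only the standard library; a `redb`-backed type would presumably implement the same trait once a concrete use-case lands.

```rust
use std::collections::BTreeMap;

/// Hypothetical persistence facade; the actual trait is deferred to a
/// follow-up ADR.
pub trait KvStore {
    type Error: std::fmt::Debug;

    /// Stores `value` under `key`, overwriting any previous value.
    fn put(&mut self, key: &str, value: &[u8]) -> Result<(), Self::Error>;
    /// Returns a copy of the value stored under `key`, if any.
    fn get(&self, key: &str) -> Result<Option<Vec<u8>>, Self::Error>;
    /// Removes `key`; returns whether it was present.
    fn delete(&mut self, key: &str) -> Result<bool, Self::Error>;
}

/// In-memory stand-in matching the "no persistence today" position.
#[derive(Default)]
pub struct MemoryStore {
    entries: BTreeMap<String, Vec<u8>>,
}

impl KvStore for MemoryStore {
    // An in-memory map cannot fail, so the error type is uninhabited.
    type Error = std::convert::Infallible;

    fn put(&mut self, key: &str, value: &[u8]) -> Result<(), Self::Error> {
        self.entries.insert(key.to_owned(), value.to_vec());
        Ok(())
    }

    fn get(&self, key: &str) -> Result<Option<Vec<u8>>, Self::Error> {
        Ok(self.entries.get(key).cloned())
    }

    fn delete(&mut self, key: &str) -> Result<bool, Self::Error> {
        Ok(self.entries.remove(key).is_some())
    }
}

/// Example caller written against the trait rather than a concrete engine.
pub fn record_pattern<S: KvStore>(
    store: &mut S,
    id: &str,
    payload: &[u8],
) -> Result<(), S::Error> {
    store.put(id, payload)
}
```

Because callers depend only on the trait, swapping the in-memory store for a disk-backed one would not change call sites, which fits the "pick once, use everywhere" driver.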

### Positive consequences

- Stops shipping an orphan 1.5 MB binary file in the repo.
- Drops a massive transitively-included database engine that nothing
  imports.
- Establishes the persistence story without prematurely committing to
  a heavy choice.

### Negative consequences

- Deleting `ruvector.db` may surface a missing dependency for any
  external tool that *was* using it from this directory. Mitigated by
  documenting the deletion in the same commit and asking the user
  before deleting (this ADR proposes; the actual `git rm` is a
  follow-up PR).
- The "no persistence today" position means features that would
  benefit from a warm cache (e.g. a warm-restart file for the `moka`
  cache) stay in-memory-only for now.

## Implementation notes

- Verify the orphan claim once more in a follow-up PR:
  `rg -uu --no-ignore --binary 'ruvector\.db' .` from the repo root.
- Add `*.db` and `ruvector.db` to root `.gitignore`. Remove the file
  via `git rm ruvector.db` in the same PR; commit message explains
  why and links to this ADR.
- Confirm the source of the `duckdb` cost: `cargo tree --workspace -i duckdb`
  should show `duckdb` reachable only via `hyprstream-main`. Once
  [ADR-0002](0002-unvendor-hyprstream.md) lands, this cost is gone.
- Open a follow-up ADR for the *first* concrete persistence use-case
  (e.g. "ADR-NNNN — Pattern store on disk via `redb`").

## Links

- Related: [ADR-0002](0002-unvendor-hyprstream.md),
  [ADR-0008](0008-lock-free-scheduler-cache.md),
  [ADR-0012](0012-streaming-input-bounds.md).
- `redb`: https://docs.rs/redb/
- `sled`: https://docs.rs/sled/
- `fjall`: https://docs.rs/fjall/