# 03 — Query
The query engine answers `SELECT pgrdf.sparql($1)` where `$1` is a
SPARQL 1.1 string. User-facing documentation for the full surface
lives in [`guide/03-querying.md`](../guide/03-querying.md); this
page is the engineering-side view of how it's wired.
## Pipeline (v0.3 — full Phase 3 SPARQL surface + steps 1-2 storage perf)
```
SPARQL string
│
▼ spargebra::SparqlParser::new().parse_query
algebra AST (Distinct ▸ Project ▸ Slice ▸ OrderBy ▸ Filter ▸ LeftJoin ▸ Union ▸ Minus ▸ Bgp)
│
▼ src/query/executor.rs::parse_select
ParsedSelect { projected, bgp, filters, optionals, minuses,
union_branches, distinct, order_by, limit, offset }
│
▼ build_bgp_sql (or build_union_sql for UNION queries)
dynamic SQL string with constant dict IDs already inlined
│
▼ Spi::connect_mut(|c| c.update(sql, None, &[]))
result rows ─► SETOF JSONB
```
## Modules
- `src/query/parser.rs` — `pgrdf.sparql_parse(q TEXT) → JSONB`,
returns the spargebra algebra shape (form, projected vars, BGP
triples, `unsupported_algebra` tags). Used for introspection +
for the executor's "is this expression translatable" check.
- `src/query/executor.rs` — `pgrdf.sparql(q TEXT) → SETOF JSONB`.
Walks the wrapper algebra into `ParsedSelect`, then dispatches:
- **Single-branch path** (`build_single_branch_outer`): one
SELECT statement with FROM + WHERE + the hidden-column trick
for ORDER-BY-on-unprojected-var.
- **UNION path** (`build_union_sql`): each branch becomes its
own SELECT via `build_branch_sql`, combined with `UNION ALL`,
wrapped by an outer SELECT for DISTINCT / ORDER BY / LIMIT /
OFFSET.
- Shared subroutine `build_from_and_where` emits explicit
`INNER JOIN` syntax for mandatory BGP patterns 2..N, then
`LEFT JOIN` per OPTIONAL block, then `WHERE NOT EXISTS (…)`
per MINUS block.
## Translation strategy
| Single BGP `?s :p ?o` | `FROM _pgrdf_quads q1 WHERE …` | First-occurrence anchors record `(alias, col)` per variable |
| Multi-pattern BGP | `q1 INNER JOIN q2 ON (q2.col = q1.col …)` | Shared vars become equality predicates that fold into the join |
| Constant in any position | `qN.col = <resolved dict id>` | Unknown IRIs/literals resolve to `-1` → zero rows (spec-correct "no solutions") |
| FILTER identity (`=`, `!=`, `sameTerm`) | dict-id equality | Sound because `_pgrdf_dictionary` dedups by (type, lex, datatype, lang) |
| FILTER numeric ordering | `CASE WHEN datatype_iri_id IN (…XSD numeric…) THEN lex::numeric ELSE NULL END` | Type-safe; non-numeric drops the row via NULL comparison |
| FILTER REGEX | Postgres `~` / `~*` against `lexical_value` | `i` flag → case-insensitive |
| FILTER BOUND | `qN.col IS NOT NULL` | Correct for OPTIONAL vars (nullable); trivially TRUE for mandatory |
| OPTIONAL { triple } | `LEFT JOIN _pgrdf_quads qOPT_K ON (…)` | Per-block FILTER lands in the ON clause |
| UNION { A } { B } | `(SELECT … FROM A) UNION ALL (SELECT … FROM B)` | Each branch SELECTs `NULL::TEXT` for vars it doesn't bind |
| MINUS { triple } | `WHERE NOT EXISTS (SELECT 1 FROM _pgrdf_quads qMIN_K WHERE …)` | Elided at translation time when there are no shared variables (SPARQL no-op) |
| DISTINCT / REDUCED | `SELECT DISTINCT …` | REDUCED → DISTINCT (safe over-approximation per spec) |
| ORDER BY ?v | `ORDER BY (SELECT lex …) ASC/DESC NULLS LAST` or by ordinal | Unprojected ?v → hidden trailing SELECT column |
| LIMIT N / OFFSET N | `LIMIT N` / `OFFSET N` | Postgres-native |
## Prepared-plan cache (LLD §4.2, **shipped — Phase 3 step 2**)
Lives in [`src/query/plan_cache.rs`](../src/query/plan_cache.rs).
The flow:
```
parse → translate → ExecPlan { sql: "...$1...$2...", params: [..] }
│
▼
Spi::connect_mut
│
├── plan_cache.contains(sql) ?
│ │
│ └── miss → client.prepare(sql, &[INT8OID; n])
│ │
│ └── .keep() → OwnedPreparedStatement
│ │
│ plan_cache.insert(sql, ↑)
│ │
│ └── hit → record_hit()
│
▼
client.update(&owned, None, &datums)
```
Concrete shape:
- **Parameterisation.** Every dict-id constant in the dynamic SQL
(subject / predicate / object literals in BGP triples; constants
in FILTER `=` `!=` `IN(…)`; the xsd:numeric dict-id list inside
numeric-comparison sub-SELECTs) becomes a `$N` positional
placeholder. A `thread_local!` `PARAM_BUF` collects the resolved
i64s in declaration order. `translate()` snapshots the buffer
into `ExecPlan { sql, params }`. The SQL string itself is the
canonical cache key — same algebra shape ⇒ same SQL byte-for-byte
⇒ same key, no extra hashing layer.
- **Cache.** Per-backend `thread_local!`
`RefCell<HashMap<String, OwnedPreparedStatement>>`. Lifetime-
promoted via `PreparedStatement::keep()` (`SPI_keepplan`).
Capacity is unbounded for v1; typical backends touch a few
dozen distinct shapes per session. Eviction (bounded LRU) is a
v0.4 polish.
- **Counters.** `plan_cache_hits / misses / inserts` live in shmem
(`PgAtomic<AtomicU64>`) so a multi-backend benchmark reads a
single fleet-wide view through `pgrdf.stats()`. Per-backend
`plan_cache_local_size` is also exposed in stats — useful for
catching unbounded growth in a misbehaving session.
- **Invalidation.** Plans are parameterised, so dict-id reshuffles
from `DROP EXTENSION; CREATE EXTENSION` don't invalidate the SQL
itself — only the parameter VALUES change next call. Postgres's
own SPI cached-plan invalidation handles relation drops. For
paranoia, `pgrdf.plan_cache_clear() -> bigint` empties THIS
backend's cache and returns the count.
- **Acceptance criterion** (LLD §4.2): repeated structural queries
with varying constants reuse the cached plan. `tests/regression/sql/51-plan-cache.sql`
verifies: 5 identical queries → 1 miss + 4 hits; 2 queries with
same shape but different IRI constants → 1 miss + 1 hit; a
structurally distinct query → 1 miss + 0 hits.
## Surface today (v0.3 SPARQL surface complete)
- ✅ Basic Graph Patterns (1..N triples)
- ✅ `SELECT` (explicit projection or `SELECT *`); `ASK`
- ✅ FILTER — identity, boolean, term-type, BOUND, numeric
ordering, REGEX, IN, STR, LANG, DATATYPE, UCASE, LCASE,
STRLEN, CONTAINS, STRSTARTS, STRENDS, arithmetic
- ✅ Solution modifiers — DISTINCT, REDUCED, LIMIT, OFFSET, ORDER BY
- ✅ OPTIONAL (single triple per block, chained)
- ✅ UNION (n-way; per-branch FILTERs / OPTIONALs / MINUSes)
- ✅ MINUS — single AND multi-triple sub-pattern, shared-var keyed
- ✅ Aggregates — `COUNT(*)`, `COUNT(?v)`, `COUNT(DISTINCT ?v)`,
`SUM`, `AVG`, type-aware `MIN` / `MAX` (numeric path on
`xsd:numeric`, lex fallback), `GROUP_CONCAT`, `SAMPLE` with
`GROUP BY`
- ✅ `HAVING` — both by aggregate alias (`HAVING(?total > c)`)
AND inline (`HAVING(SUM(?v) > c)`)
- ✅ `BIND(expr AS ?v)` for projection — Literal / NamedNode /
Variable, STR / LANG / DATATYPE / UCASE / LCASE / STRLEN,
arithmetic, CONCAT
- ⏳ `CONSTRUCT`, `DESCRIBE` — different output shape; v0.4
- ⏳ Property paths beyond simple sequence (`*`, `+`, `?`, `^`, `\|`) — v0.4
- ⏳ Named-graph `GRAPH { … }` — needs graph-IRI→graph_id mapping; v0.4
- ⏳ `VALUES` inline data — needs derived-table refactor; v0.4
- ⏳ Aggregates over UNION; multi-triple OPTIONAL; BIND-in-FILTER — v0.4
- ❌ Federated `SERVICE` — out of scope for v0.x
## Postgres custom scan hooks
Aspirational — out of scope for v0.3. With the prepared-plan cache
now in place (Phase 3 step 2), the next performance lever after
COPY BINARY (step 3) is bypassing the standard executor for specific
quad-shape access patterns via the Postgres custom scan API.
Earliest v0.4 target.
## See also
- User-facing surface: [`guide/03-querying.md`](../guide/03-querying.md)
- Implementation: [`src/query/executor.rs`](../src/query/executor.rs)
+ [`src/query/parser.rs`](../src/query/parser.rs)
- Tests: `tests/regression/sql/3[0-8]-sparql-*.sql`