# sqlite-graph
An embeddable graph database built entirely on SQLite. Recursive CTEs for traversal, bi-temporal edges, FTS5 full-text search, and vector fusion — in a single file.
No Docker. No JVM. No network hop. No docker-compose.yml you'll fight with for an hour.
```rust
use sqlite_graph::{Edge, Entity, Graph};

let graph = Graph::open_or_create("graph.db")?;

// Entity names and relation labels are example values
let alice = Entity::new("Alice", "person");
let bob = Entity::new("Bob", "person");
graph.add_entity(&alice)?;
graph.add_entity(&bob)?;

graph.add_edge(&Edge::new(&alice.id, &bob.id, "KNOWS"))?;

// Multi-hop traversal via recursive CTE
let subgraph = graph.traverse(&alice.id, 2)?;
```
## Why this exists
If you want a knowledge graph today, the typical stack looks like this:
| Layer | Traditional | sqlite-graph |
|---|---|---|
| Graph storage | Neo4j (Docker) | SQLite file |
| Entity extraction | OpenAI API | Your choice (or ctxgraph for local ONNX) |
| Embedding storage | Pinecone / Qdrant | BLOB columns |
| Semantic search | External vector DB | Cosine similarity in-process |
| Full-text search | Elasticsearch | FTS5 (built into SQLite) |
Four services, two network dependencies — or one library and one file.
The core insight: a graph database is really two things — a storage format for nodes and edges, and a query engine that can walk those edges. SQLite handles the first trivially. For the second, recursive CTEs give you everything you need for multi-hop traversal.
## Install

```toml
[dependencies]
sqlite-graph = "0.1"
```
## Features
| Feature | How it works |
|---|---|
| Graph traversal | Recursive CTEs — multi-hop walks in a single SQL statement |
| Bi-temporal edges | valid_from/valid_until (real-world) + recorded_at (system time) |
| Full-text search | FTS5 across episodes, entities, and edges with trigger-based sync |
| Hybrid search | FTS5 + vector cosine similarity fused with Reciprocal Rank Fusion |
| Entity deduplication | Jaro-Winkler fuzzy matching with alias resolution |
| Embeddings | Store and query f32 vectors as BLOB columns |
| Single file | Standard SQLite — `cp graph.db backup.db` and you're done |
## Usage

### Episodes (events)
Episodes are the fundamental unit of information — a decision, a message, a commit, an incident.
```rust
use sqlite_graph::{Episode, Graph};

let graph = Graph::in_memory()?;

// Builder arguments are example values
let ep = Episode::builder("Decided to migrate billing to PostgreSQL")
    .source("slack")
    .tag("decision")
    .meta("channel", "#platform")
    .build();

graph.add_episode(&ep)?;
```
### Entities and edges
```rust
use sqlite_graph::{Edge, Entity, Graph};

let graph = Graph::in_memory()?;

let pg = Entity::new("PostgreSQL", "technology");
let billing = Entity::new("billing-service", "service");
graph.add_entity(&pg)?;
graph.add_entity(&billing)?;

let mut edge = Edge::new(&billing.id, &pg.id, "USES");
edge.fact = Some("billing-service stores its data in PostgreSQL".into());
graph.add_edge(&edge)?;
```
### Graph traversal

Traverse the graph from any entity. Uses a recursive CTE under the hood — bidirectional, depth-bounded, with cycle detection via `UNION`.
```rust
// Find everything within 2 hops of PostgreSQL
let subgraph = graph.traverse(&pg.id, 2)?;

// Get an entity's immediate neighborhood
let ctx = graph.get_entity_context(&pg.id)?;
println!("{ctx:#?}");
```
### Full-text search
Three FTS5 indexes (episodes, entities, edges) are automatically maintained via 9 triggers — no manual reindexing.
```rust
// Search episodes (query strings are example values)
let results = graph.search("postgres migration")?;
for hit in &results {
    println!("{}", hit.content);
}

// Search entities
let entities = graph.search_entities("billing")?;
```
### Hybrid search (FTS5 + vectors + RRF)
Keyword search misses semantic similarity. Vector search misses exact matches. We run both and fuse the results using Reciprocal Rank Fusion.
```rust
// You provide the embedding (from any model — OpenAI, sentence-transformers, etc.)
let query_embedding: Vec<f32> = your_embed_fn("how did we migrate billing?");

let results = graph.search_fused("how did we migrate billing?", &query_embedding)?;
for result in &results {
    println!("{result:#?}");
}
```
RRF formula: score(d) = sum(1 / (k + rank_i(d))) for each mode where d appears (k=60). Rank-based fusion avoids the need to normalize incompatible score distributions (BM25 scores vs cosine similarities).
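The fusion step is small enough to sketch in full. A minimal stand-alone version of the formula above (not the library's internal code; the document IDs and rankings are illustrative):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: merge ranked lists using only ranks,
/// so BM25 scores and cosine similarities never need normalizing.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (i, doc) in ranking.iter().enumerate() {
            // rank_i(d) is 1-based
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let fts_ranking = vec!["ep-12", "ep-7", "ep-3"]; // keyword results
    let vec_ranking = vec!["ep-12", "ep-3", "ep-9"]; // vector results
    for (doc, score) in rrf_fuse(&[fts_ranking, vec_ranking], 60.0) {
        println!("{doc}: {score:.5}");
    }
}
```

Here `ep-12` tops both lists, so it wins; `ep-3` outranks `ep-7` because appearing in both modes beats a single second-place rank.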
### Bi-temporal edges
Every edge tracks two time dimensions. Facts expire but never disappear.
- `valid_from` / `valid_until` — when the fact was true in the real world
- `recorded_at` — when the system learned about the fact
This distinction matters for debugging. "What did we know about the system on March 10th?" is a different question from "What was actually true on March 10th?" Bi-temporal modeling answers both.
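In predicate form, the two questions reduce to two different comparisons. A schematic sketch — the struct mirrors the edge columns, but the helper functions are ours, not the library's API:

```rust
struct TemporalEdge {
    valid_from: i64,          // when the fact became true (unix seconds)
    valid_until: Option<i64>, // None = still true
    recorded_at: i64,         // when the system learned the fact
}

/// "What was actually true at time t?" — real-world validity
fn valid_at(e: &TemporalEdge, t: i64) -> bool {
    e.valid_from <= t && e.valid_until.map_or(true, |until| t < until)
}

/// "What did we know at time t?" — system knowledge
fn known_at(e: &TemporalEdge, t: i64) -> bool {
    e.recorded_at <= t
}

fn main() {
    // Fact became true at t=100, but the system only recorded it at t=500
    let e = TemporalEdge { valid_from: 100, valid_until: None, recorded_at: 500 };
    println!("true at t=300?  {}", valid_at(&e, 300)); // the fact held
    println!("known at t=300? {}", known_at(&e, 300)); // but we hadn't learned it yet
}
```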
```rust
use sqlite_graph::Edge;

// Assumes alice, google, and meta entities were already added to the graph.
// Alice worked at Google, then moved to Meta
let mut edge = Edge::new(&alice.id, &google.id, "WORKS_AT");
edge.fact = Some("Alice works at Google".into());
graph.add_edge(&edge)?;

// Later: invalidate the old edge (sets valid_until = now)
graph.invalidate_edge(&edge.id)?;

// Add the new one
let new_edge = Edge::new(&alice.id, &meta.id, "WORKS_AT");
graph.add_edge(&new_edge)?;

// Traverse with history to see both
let subgraph = graph.traverse_with_history(&alice.id, 2)?;
```
Invalidation never deletes — it sets valid_until, preserving the full audit trail:
```sql
UPDATE edges SET valid_until = ?1 WHERE id = ?2 AND valid_until IS NULL
```
### Entity deduplication
"PostgreSQL", "Postgres", and "PG" resolve to the same node via Jaro-Winkler similarity + an alias table.
```rust
let pg = Entity::new("PostgreSQL", "technology");
let (id, merged) = graph.add_entity_deduped(&pg)?;
// merged = false (new entity)

let pg2 = Entity::new("Postgres", "technology");
let (id2, merged) = graph.add_entity_deduped(&pg2)?;
// merged = true, id2 == id (matched "PostgreSQL" at 0.92 similarity)
```
Dedup is two-level: exact alias lookup first (SQL, O(1)), then Jaro-Winkler against all entities of the same type (Rust, computed in-process). The 0.85 threshold is tuned for software terminology — "PostgreSQL" vs "Postgres" scores 0.92, "React" vs "Redux" scores 0.73 (correctly rejected).
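For reference, Jaro-Winkler fits in a few dozen lines. This is a self-contained sketch of the metric with the standard parameters (prefix scaling p = 0.1, prefix capped at 4 characters), not the library's exact implementation — exact scores vary slightly between variants:

```rust
/// Jaro similarity: fraction of matched characters, penalized by transpositions.
fn jaro(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() || b.is_empty() {
        return if a.is_empty() && b.is_empty() { 1.0 } else { 0.0 };
    }
    // Characters match if equal and within this distance of each other
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);
    let mut a_matched = vec![false; a.len()];
    let mut b_matched = vec![false; b.len()];
    let mut matches = 0usize;
    for i in 0..a.len() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_matched[j] && a[i] == b[j] {
                a_matched[i] = true;
                b_matched[j] = true;
                matches += 1;
                break;
            }
        }
    }
    if matches == 0 {
        return 0.0;
    }
    // Transpositions: matched characters that appear in a different order
    let (mut mismatches, mut j) = (0usize, 0usize);
    for i in 0..a.len() {
        if !a_matched[i] {
            continue;
        }
        while !b_matched[j] {
            j += 1;
        }
        if a[i] != b[j] {
            mismatches += 1;
        }
        j += 1;
    }
    let (m, t) = (matches as f64, mismatches as f64 / 2.0);
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}

/// Winkler boost: reward a shared prefix (up to 4 chars, scaling factor 0.1).
fn jaro_winkler(a: &str, b: &str) -> f64 {
    let j = jaro(a, b);
    let prefix = a.chars().zip(b.chars()).take(4).take_while(|(x, y)| x == y).count();
    j + prefix as f64 * 0.1 * (1.0 - j)
}

fn main() {
    println!("{:.2}", jaro_winkler("PostgreSQL", "Postgres")); // above the 0.85 threshold
    println!("{:.2}", jaro_winkler("React", "Redux"));         // below — correctly rejected
}
```

The Winkler prefix boost is why this metric suits entity names: "PostgreSQL"/"Postgres" share a long prefix and clear the threshold, while "React"/"Redux" diverge after two characters and fall below it.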
## Architecture

### Schema
7 tables, 3 FTS5 virtual tables, 8 indexes, 9 triggers. The full schema is applied via migration on first open.
```sql
-- Core tables
episodes          -- raw events (decisions, messages, incidents)
entities          -- graph nodes (people, services, technologies)
edges             -- relationships (bi-temporal, with confidence scores)
episode_entities  -- junction: which entities appear in which episodes
aliases           -- deduplication: alias_name → canonical_id
communities       -- clustering (reserved for community detection)
_migrations       -- schema versioning

-- FTS5 indexes (external content, trigger-synced)
episodes_fts      -- content, source, metadata
entities_fts      -- name, entity_type, summary
edges_fts         -- fact, relation
```
### How traversal works

The recursive CTE that replaces Cypher's `MATCH (a)-[*1..3]-(b)`:
```sql
WITH RECURSIVE traversal(entity_id, depth) AS (
  -- Base case: start at the given entity
  SELECT ?1, 0

  UNION

  -- Recursive step: walk edges in both directions
  SELECT
    CASE WHEN e.source_id = t.entity_id THEN e.target_id
         ELSE e.source_id END,
    t.depth + 1
  FROM traversal t
  JOIN edges e ON (e.source_id = t.entity_id OR e.target_id = t.entity_id)
  WHERE t.depth < ?2
    AND e.valid_until IS NULL  -- only current edges
)
SELECT ent.id, ent.name, ent.entity_type, ent.summary,
       ent.created_at, ent.metadata
FROM entities ent
WHERE ent.id IN (SELECT DISTINCT entity_id FROM traversal)
```
How it works:
- Base case — seed with the starting entity at depth 0
- Recursive step — for each discovered entity, find all edges where it's source or target (bidirectional traversal)
- `CASE` expression — picks the other end of the edge
- `UNION` (not `UNION ALL`) — deduplicates, preventing cycles
- Depth limit — `WHERE t.depth < ?2` caps traversal hops
- Temporal filter — `valid_until IS NULL` restricts to current edges
After collecting entities, a second query grabs all edges between them to return the complete subgraph.
The equivalent Cypher would be:
```cypher
MATCH path = (start:Entity {id: $id})-[*1..3]-(neighbor)
WHERE ALL(r IN relationships(path) WHERE r.valid_until IS NULL)
RETURN DISTINCT neighbor, length(path) AS depth
ORDER BY depth
```
Cypher is more concise. But the SQL version is self-contained — no external database process, no Bolt protocol, no connection pooling.
### Index strategy
8 indexes optimized for the workload:
| Index | Optimizes |
|---|---|
| `idx_edges_source` / `idx_edges_target` | Graph traversal JOIN — without these, every CTE step is a full table scan |
| `idx_edges_relation` | Filtering edges by relation type |
| `idx_edges_valid` | Temporal queries — composite (valid_from, valid_until) for current vs historical |
| `idx_entities_type` | Entity listing/filtering, used during fuzzy dedup |
| `idx_episode_entities` | Reverse lookup: which episodes mention this entity |
| `idx_episodes_source` / `idx_episodes_recorded` | Source filtering and time-range queries |
Plus 3 FTS5 virtual tables maintaining their own inverted indexes, synced via 9 triggers (INSERT/UPDATE/DELETE for each of episodes, entities, edges).
### FTS5 sync triggers
FTS5 external content tables don't auto-sync. Each base table has three triggers:
```sql
-- Triggers for the episodes table (trigger names here are illustrative)

-- On INSERT: add to FTS index
CREATE TRIGGER episodes_ai AFTER INSERT ON episodes BEGIN
  INSERT INTO episodes_fts(rowid, content, source, metadata)
  VALUES (new.rowid, new.content, new.source, new.metadata);
END;

-- On DELETE: remove from FTS index
CREATE TRIGGER episodes_ad AFTER DELETE ON episodes BEGIN
  INSERT INTO episodes_fts(episodes_fts, rowid, content, source, metadata)
  VALUES ('delete', old.rowid, old.content, old.source, old.metadata);
END;

-- On UPDATE: delete-then-insert
CREATE TRIGGER episodes_au AFTER UPDATE ON episodes BEGIN
  INSERT INTO episodes_fts(episodes_fts, rowid, content, source, metadata)
  VALUES ('delete', old.rowid, old.content, old.source, old.metadata);
  INSERT INTO episodes_fts(rowid, content, source, metadata)
  VALUES (new.rowid, new.content, new.source, new.metadata);
END;
```
The `VALUES ('delete', ...)` syntax is FTS5's way of removing an entry. 9 triggers total (3 tables × 3 operations).
### Performance pragmas
```sql
PRAGMA journal_mode = WAL;   -- concurrent readers during writes
PRAGMA synchronous = NORMAL; -- balance durability vs speed
PRAGMA foreign_keys = ON;    -- enforce referential integrity
```
WAL allows concurrent reads while writing — the typical workload is one writer ingesting data, multiple readers searching and traversing. NORMAL sync is fine for non-financial data. Foreign keys are off by default in SQLite — we turn them on because edges must reference valid entities.
## Performance characteristics
Designed for graphs up to ~100K nodes. Single-digit milliseconds for traversal at depth 2-3.
| Concern | Status | Mitigation |
|---|---|---|
| Traversal | O(branching^depth), bounded by max_depth | Fine for <100K entities with proper indexes |
| Vector search | Brute-force cosine O(n) — ~50ms at 100K embeddings | Drop in sqlite-vec for HNSW |
| FTS5 search | BM25 ranking, millisecond latency | Trigger-synced, zero manual maintenance |
| Concurrent readers | Unlimited with WAL mode | Built-in |
| Concurrent writers | Single writer at a time | Fine for embedded/single-process use |
| Database size | ~15MB per 100K episodes with 384-dim embeddings | Standard SQLite, no special storage |
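The BLOB embedding storage behind the vector-search row amounts to very little code. A sketch assuming little-endian f32 encoding (the library's actual byte layout may differ):

```rust
/// Encode an f32 vector as a little-endian BLOB for a SQLite column.
fn f32_to_blob(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|x| x.to_le_bytes()).collect()
}

/// Decode a BLOB column back into an f32 vector.
fn blob_to_f32(blob: &[u8]) -> Vec<f32> {
    blob.chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Cosine similarity — the O(n) brute-force scan applies this to every row.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    let v = vec![0.1_f32, 0.2, 0.3];
    let roundtrip = blob_to_f32(&f32_to_blob(&v));
    assert_eq!(v, roundtrip);
    println!("self-similarity: {}", cosine(&v, &v)); // ~1.0
}
```

At ~100K rows this linear scan stays in the tens of milliseconds; past that, an ANN index (the sqlite-vec suggestion above) is the escape hatch.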
## When to graduate
- >100K entities with deep traversals — consider Neo4j or Memgraph
- Multi-user concurrent writes — consider PostgreSQL with a graph schema
- Complex pattern matching — if you need Cypher regularly, use a graph database
- Distributed systems — SQLite doesn't replicate or shard
The schema maps directly to labeled property graphs — export to JSON and import into Neo4j, Memgraph, or DGraph when you outgrow this.
## Extracted from
This library powers the storage layer of ctxgraph, a knowledge graph engine that achieves 0.800 combined F1 on extraction benchmarks with zero API calls. ctxgraph adds local ONNX-based NER (GLiNER) and relation extraction on top of sqlite-graph.
Read the full technical deep-dive: We Replaced Neo4j with 45 SQL Statements
## License
MIT