sqlite-graph 0.1.0


sqlite-graph

An embeddable graph database built entirely on SQLite. Recursive CTEs for traversal, bi-temporal edges, FTS5 full-text search, and vector fusion — in a single file.

No Docker. No JVM. No network hop. No docker-compose.yml you'll fight with for an hour.

use sqlite_graph::{Graph, Entity, Edge};

let graph = Graph::open_or_create("my.db".as_ref())?;

let alice = Entity::new("Alice", "person");
let bob = Entity::new("Bob", "person");
graph.add_entity(alice.clone())?;
graph.add_entity(bob.clone())?;

graph.add_edge(Edge::new(&alice.id, &bob.id, "knows"))?;

// Multi-hop traversal via recursive CTE
let (entities, edges) = graph.traverse(&alice.id, 3)?;

Why this exists

If you want a knowledge graph today, the typical stack looks like this:

For each layer, traditional stack → sqlite-graph:

  • Graph storage: Neo4j (Docker) → SQLite file
  • Entity extraction: OpenAI API → your choice (or ctxgraph for local ONNX)
  • Embedding storage: Pinecone / Qdrant → BLOB columns
  • Semantic search: external vector DB → cosine similarity in-process
  • Full-text search: Elasticsearch → FTS5 (built into SQLite)

Four services, two network dependencies — or one library and one file.

The core insight: a graph database is really two things — a storage format for nodes and edges, and a query engine that can walk those edges. SQLite handles the first trivially. For the second, recursive CTEs give you everything you need for multi-hop traversal.
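The walk half can be sketched in a few lines of plain Rust (a hypothetical in-memory edge list, not this crate's API), doing the same depth-bounded, bidirectional, deduplicating expansion the recursive CTE performs:

```rust
use std::collections::{HashSet, VecDeque};

// Breadth-first walk over an undirected edge list: the in-memory analogue
// of the recursive CTE. `seen` plays the role of UNION's deduplication.
fn traverse(edges: &[(&str, &str)], start: &str, max_depth: usize) -> HashSet<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(start.to_string());
    queue.push_back((start.to_string(), 0));
    while let Some((node, depth)) = queue.pop_front() {
        if depth >= max_depth {
            continue; // the depth bound, like WHERE t.depth < ?2
        }
        for (s, t) in edges {
            // Bidirectional: follow the edge whichever end we are on.
            let next = if *s == node { *t } else if *t == node { *s } else { continue };
            if seen.insert(next.to_string()) {
                queue.push_back((next.to_string(), depth + 1));
            }
        }
    }
    seen
}

fn main() {
    let edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")];
    // alice, bob, carol are within 2 hops; dave is 3 hops away.
    println!("{:?}", traverse(&edges, "alice", 2));
}
```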

Install

[dependencies]
sqlite-graph = "0.1"

Features

  • Graph traversal: recursive CTEs, multi-hop walks in a single SQL statement
  • Bi-temporal edges: valid_from/valid_until (real-world time) + recorded_at (system time)
  • Full-text search: FTS5 across episodes, entities, and edges with trigger-based sync
  • Hybrid search: FTS5 + vector cosine similarity fused with Reciprocal Rank Fusion
  • Entity deduplication: Jaro-Winkler fuzzy matching with alias resolution
  • Embeddings: store and query f32 vectors as BLOB columns
  • Single file: standard SQLite, so cp graph.db backup.db and you're done

Usage

Episodes (events)

Episodes are the fundamental unit of information — a decision, a message, a commit, an incident.

use sqlite_graph::{Graph, Episode};

let graph = Graph::in_memory()?;

let ep = Episode::builder("Team chose Postgres for the billing service")
    .source("standup")
    .tag("architecture")
    .meta("author", "alice")
    .build();

graph.add_episode(ep)?;

Entities and edges

use sqlite_graph::{Graph, Entity, Edge};

let graph = Graph::in_memory()?;

let pg = Entity::new("PostgreSQL", "technology");
let billing = Entity::new("Billing Service", "component");
graph.add_entity(pg.clone())?;
graph.add_entity(billing.clone())?;

let mut edge = Edge::new(&pg.id, &billing.id, "used_by");
edge.fact = Some("Billing service uses PostgreSQL for persistence".into());
graph.add_edge(edge)?;

Graph traversal

Traverse the graph from any entity. Uses a recursive CTE under the hood — bidirectional, depth-bounded, with cycle detection via UNION.

// Find everything within 2 hops of PostgreSQL
let (entities, edges) = graph.traverse(&pg.id, 2)?;

// Get an entity's immediate neighborhood
let ctx = graph.get_entity_context(&pg.id)?;
println!("{} has {} neighbors", ctx.entity.name, ctx.neighbors.len());

Full-text search

Three FTS5 indexes (episodes, entities, edges) are automatically maintained via 9 triggers — no manual reindexing.

// Search episodes
let results = graph.search("postgres billing", 10)?;
for (episode, score) in &results {
    println!("[{:.2}] {}", score, episode.content);
}

// Search entities
let entities = graph.search_entities("PostgreSQL", 5)?;

Hybrid search (FTS5 + vectors + RRF)

Keyword search misses semantic similarity. Vector search misses exact matches. We run both and fuse the results using Reciprocal Rank Fusion.

// You provide the embedding (from any model — OpenAI, sentence-transformers, etc.)
let query_embedding: Vec<f32> = your_embed_fn("postgres billing");

let results = graph.search_fused("postgres billing", &query_embedding, 10)?;
for result in &results {
    println!("[{:.4}] {}", result.score, result.episode.content);
}

RRF formula: score(d) = sum(1 / (k + rank_i(d))) for each mode where d appears (k=60). Rank-based fusion avoids the need to normalize incompatible score distributions (BM25 scores vs cosine similarities).
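As a sanity check on that formula, here is a small self-contained RRF sketch (illustrative, not the crate's internals) that fuses two ranked ID lists with k = 60:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = Σ 1 / (k + rank_i(d)),
/// summed over each result list where document d appears (ranks are 1-based).
fn rrf(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    // Highest fused score first. Only ranks matter, never the raw
    // BM25 or cosine scores, so no normalization is needed.
    let mut ranked: Vec<_> = scores.into_iter().collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    ranked
}

fn main() {
    let keyword_hits = vec!["ep3", "ep1", "ep7"]; // BM25 ranking
    let vector_hits = vec!["ep1", "ep9", "ep3"];  // cosine ranking
    // ep1 and ep3 appear in both lists, so they outrank single-list hits.
    for (id, score) in rrf(&[keyword_hits, vector_hits], 60.0) {
        println!("{id}: {score:.4}");
    }
}
```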

Bi-temporal edges

Every edge tracks two time dimensions. Facts expire but never disappear.

  • valid_from / valid_until — when the fact was true in the real world
  • recorded_at — when the system learned about the fact

This distinction matters for debugging. "What did we know about the system on March 10th?" is a different question from "What was actually true on March 10th?" Bi-temporal modeling answers both.
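A toy illustration of the two axes (hypothetical struct and integer day-stamps, not this crate's types):

```rust
// Bi-temporal edge: one axis for real-world validity, one for system knowledge.
struct BiTemporalEdge {
    fact: &'static str,
    valid_from: i64,
    valid_until: Option<i64>, // None = still valid
    recorded_at: i64,
}

// "What was actually true at time t?"
fn true_at(edges: &[BiTemporalEdge], t: i64) -> Vec<&str> {
    edges.iter()
        .filter(|e| e.valid_from <= t && e.valid_until.map_or(true, |u| t < u))
        .map(|e| e.fact)
        .collect()
}

// "What did the system know at time t?" (ignores facts recorded later)
fn known_at(edges: &[BiTemporalEdge], t: i64) -> Vec<&str> {
    edges.iter().filter(|e| e.recorded_at <= t).map(|e| e.fact).collect()
}

fn main() {
    let edges = [
        // Alice joined Google on day 0; the system only learned this on day 5.
        BiTemporalEdge { fact: "works_at Google", valid_from: 0, valid_until: Some(100), recorded_at: 5 },
        // She moved to Meta on day 100, recorded the same day.
        BiTemporalEdge { fact: "works_at Meta", valid_from: 100, valid_until: None, recorded_at: 100 },
    ];
    println!("true at day 3:  {:?}", true_at(&edges, 3));  // Google was true
    println!("known at day 3: {:?}", known_at(&edges, 3)); // but nothing was recorded yet
}
```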

use sqlite_graph::Edge;

// Alice worked at Google, then moved to Meta
let mut edge = Edge::new(&alice_id, &google_id, "works_at");
edge.fact = Some("Alice works at Google".into());
graph.add_edge(edge.clone())?;

// Later: invalidate the old edge (sets valid_until = now)
graph.invalidate_edge(&edge.id)?;

// Add the new one
let new_edge = Edge::new(&alice_id, &meta_id, "works_at");
graph.add_edge(new_edge)?;

// Traverse with history to see both
let (entities, edges) = graph.traverse_with_history(&alice_id, 2)?;

Invalidation never deletes — it sets valid_until, preserving the full audit trail:

UPDATE edges SET valid_until = ?1 WHERE id = ?2 AND valid_until IS NULL

Entity deduplication

"PostgreSQL", "Postgres", and "PG" resolve to the same node via Jaro-Winkler similarity + an alias table.

let pg = Entity::new("PostgreSQL", "technology");
let (id, merged) = graph.add_entity_deduped(pg, 0.85)?;
// merged = false (new entity)

let pg2 = Entity::new("Postgres", "technology");
let (id2, merged) = graph.add_entity_deduped(pg2, 0.85)?;
// merged = true, id2 == id (matched "PostgreSQL" at 0.92 similarity)

Dedup is two-level: exact alias lookup first (SQL, O(1)), then Jaro-Winkler against all entities of the same type (Rust, computed in-process). The 0.85 threshold is tuned for software terminology — "PostgreSQL" vs "Postgres" scores 0.92, "React" vs "Redux" scores 0.73 (correctly rejected).
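For intuition, here is a from-scratch Jaro-Winkler sketch. It is illustrative only, and exact scores vary slightly between implementations, so don't expect it to reproduce the 0.92 / 0.73 figures above to the last digit:

```rust
// Jaro similarity: fraction of matching characters, penalized by transpositions.
fn jaro(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }
    // Characters match if equal and within this sliding window of each other.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);
    let mut b_used = vec![false; b.len()];
    let mut pairs: Vec<(usize, usize)> = Vec::new();
    for (i, &ca) in a.iter().enumerate() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_used[j] && b[j] == ca {
                b_used[j] = true;
                pairs.push((i, j));
                break;
            }
        }
    }
    let m = pairs.len() as f64;
    if m == 0.0 {
        return 0.0;
    }
    // Transpositions: matched characters that appear out of order, counted in halves.
    let a_seq: Vec<char> = pairs.iter().map(|&(i, _)| a[i]).collect();
    let mut b_idx: Vec<usize> = pairs.iter().map(|&(_, j)| j).collect();
    b_idx.sort();
    let b_seq: Vec<char> = b_idx.iter().map(|&j| b[j]).collect();
    let t = a_seq.iter().zip(&b_seq).filter(|(x, y)| x != y).count() as f64 / 2.0;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}

// Winkler boost: reward a shared prefix of up to 4 characters (p = 0.1).
fn jaro_winkler(a: &str, b: &str) -> f64 {
    let j = jaro(a, b);
    let prefix = a.chars().zip(b.chars()).take(4).take_while(|(x, y)| x == y).count() as f64;
    j + prefix * 0.1 * (1.0 - j)
}

fn main() {
    println!("postgresql vs postgres: {:.2}", jaro_winkler("postgresql", "postgres"));
    println!("react vs redux:         {:.2}", jaro_winkler("react", "redux"));
}
```

The Winkler prefix boost is what makes the metric a good fit for software names, which tend to diverge at the tail ("PostgreSQL" vs "Postgres") rather than the head.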

Architecture

Schema

7 tables, 3 FTS5 virtual tables, 8 indexes, 9 triggers. The full schema is applied via migration on first open.

-- Core tables
episodes          -- raw events (decisions, messages, incidents)
entities          -- graph nodes (people, services, technologies)
edges             -- relationships (bi-temporal, with confidence scores)
episode_entities  -- junction: which entities appear in which episodes
aliases           -- deduplication: alias_name → canonical_id
communities       -- clustering (reserved for community detection)
_migrations       -- schema versioning

-- FTS5 indexes (external content, trigger-synced)
episodes_fts      -- content, source, metadata
entities_fts      -- name, entity_type, summary
edges_fts         -- fact, relation

How traversal works

The recursive CTE that replaces Cypher's MATCH (a)-[*1..3]-(b):

WITH RECURSIVE traversal(entity_id, depth) AS (
    -- Base case: start at the given entity
    SELECT ?1, 0

    UNION

    -- Recursive step: walk edges in both directions
    SELECT
        CASE WHEN e.source_id = t.entity_id THEN e.target_id
             ELSE e.source_id END,
        t.depth + 1
    FROM traversal t
    JOIN edges e ON (e.source_id = t.entity_id OR e.target_id = t.entity_id)
    WHERE t.depth < ?2
      AND e.valid_until IS NULL  -- only current edges
)
SELECT ent.id, ent.name, ent.entity_type, ent.summary,
       ent.created_at, ent.metadata
FROM entities ent
WHERE ent.id IN (SELECT DISTINCT entity_id FROM traversal)

How it works:

  1. Base case — seed with the starting entity at depth 0
  2. Recursive step — for each discovered entity, find all edges where it's source or target (bidirectional traversal)
  3. CASE expression — picks the other end of the edge
  4. UNION (not UNION ALL) — deduplicates, preventing cycles
  5. Depth limit — WHERE t.depth < ?2 caps traversal hops
  6. Temporal filter — valid_until IS NULL restricts to current edges

After collecting entities, a second query grabs all edges between them to return the complete subgraph.
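That second step is just a membership filter, sketched here with hypothetical in-memory types rather than the crate's SQL:

```rust
use std::collections::HashSet;

// Keep only edges whose endpoints were both discovered by the traversal,
// yielding the induced subgraph.
fn subgraph_edges<'a>(
    edges: &[(&'a str, &'a str)],
    nodes: &HashSet<&str>,
) -> Vec<(&'a str, &'a str)> {
    edges.iter()
        .filter(|(s, t)| nodes.contains(s) && nodes.contains(t))
        .copied()
        .collect()
}

fn main() {
    let nodes: HashSet<&str> = ["alice", "bob"].into_iter().collect();
    let edges = [("alice", "bob"), ("bob", "carol")];
    // Only alice–bob survives; carol was not reached by the traversal.
    println!("{:?}", subgraph_edges(&edges, &nodes));
}
```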

The equivalent Cypher would be:

MATCH path = (start:Entity {id: $id})-[*1..3]-(neighbor)
WHERE ALL(r IN relationships(path) WHERE r.valid_until IS NULL)
RETURN DISTINCT neighbor, length(path) AS depth
ORDER BY depth

Cypher is more concise. But the SQL version is self-contained — no external database process, no Bolt protocol, no connection pooling.

Index strategy

8 indexes optimized for the workload:

  • idx_edges_source / idx_edges_target: the traversal JOIN — without these, every CTE step is a full table scan
  • idx_edges_relation: filtering edges by relation type
  • idx_edges_valid: temporal queries — composite (valid_from, valid_until) for current vs historical
  • idx_entities_type: entity listing/filtering, used during fuzzy dedup
  • idx_episode_entities: reverse lookup — which episodes mention this entity
  • idx_episodes_source / idx_episodes_recorded: source filtering and time-range queries

Plus 3 FTS5 virtual tables maintaining their own inverted indexes, synced via 9 triggers (INSERT/UPDATE/DELETE for each of episodes, entities, edges).

FTS5 sync triggers

FTS5 external content tables don't auto-sync. Each base table has three triggers:

-- On INSERT: add to FTS index
CREATE TRIGGER episodes_ai AFTER INSERT ON episodes BEGIN
    INSERT INTO episodes_fts(rowid, content, source, metadata)
    VALUES (new.rowid, new.content, new.source, new.metadata);
END;

-- On DELETE: remove from FTS index
CREATE TRIGGER episodes_ad AFTER DELETE ON episodes BEGIN
    INSERT INTO episodes_fts(episodes_fts, rowid, content, source, metadata)
    VALUES ('delete', old.rowid, old.content, old.source, old.metadata);
END;

-- On UPDATE: delete-then-insert
CREATE TRIGGER episodes_au AFTER UPDATE ON episodes BEGIN
    INSERT INTO episodes_fts(episodes_fts, rowid, content, source, metadata)
    VALUES ('delete', old.rowid, old.content, old.source, old.metadata);
    INSERT INTO episodes_fts(rowid, content, source, metadata)
    VALUES (new.rowid, new.content, new.source, new.metadata);
END;

The VALUES ('delete', ...) syntax is FTS5's way of removing an entry. 9 triggers total (3 tables x 3 operations).

Performance pragmas

PRAGMA journal_mode = WAL;      -- concurrent readers during writes
PRAGMA synchronous = NORMAL;    -- balance durability vs speed
PRAGMA foreign_keys = ON;       -- enforce referential integrity

WAL allows concurrent reads while writing — the typical workload is one writer ingesting data, multiple readers searching and traversing. NORMAL sync is fine for non-financial data. Foreign keys are off by default in SQLite — we turn them on because edges must reference valid entities.

Performance characteristics

Designed for graphs up to ~100K nodes. Single-digit milliseconds for traversal at depth 2-3.

  • Traversal: O(branching^depth), bounded by max_depth — fine for <100K nodes with proper indexes
  • Vector search: brute-force cosine, O(n), ~50ms at 100K embeddings — drop in sqlite-vec for HNSW
  • FTS5 search: BM25 ranking, millisecond latency — trigger-synced, zero manual maintenance
  • Concurrent readers: unlimited with WAL mode — built in
  • Concurrent writers: single writer at a time — fine for embedded/single-process use
  • Database size: ~15MB per 100K episodes with 384-dim embeddings — standard SQLite, no special storage
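The brute-force vector path is simple enough to sketch in full (illustrative, not the crate's code): cosine similarity over f32 slices plus a linear top-k scan:

```rust
// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

// O(n) scan: score every stored embedding against the query, keep the top k.
fn top_k<'a>(query: &[f32], rows: &'a [(&'a str, Vec<f32>)], k: usize) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<_> = rows.iter().map(|(id, v)| (*id, cosine(query, v))).collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}

fn main() {
    let rows = vec![
        ("ep1", vec![1.0, 0.0]),
        ("ep2", vec![0.7, 0.7]),
        ("ep3", vec![0.0, 1.0]),
    ];
    for (id, score) in top_k(&[1.0, 0.1], &rows, 2) {
        println!("{id}: {score:.3}");
    }
}
```

This is the O(n) behavior noted above: fast enough in-process at this scale, and swappable for an ANN index when it isn't.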

When to graduate

  • >100K entities with deep traversals — consider Neo4j or Memgraph
  • Multi-user concurrent writes — consider PostgreSQL with a graph schema
  • Complex pattern matching — if you need Cypher regularly, use a graph database
  • Distributed systems — SQLite doesn't replicate or shard

The schema maps directly to labeled property graphs — export to JSON and import into Neo4j, Memgraph, or DGraph when you outgrow this.

Extracted from

This library powers the storage layer of ctxgraph, a knowledge graph engine that achieves 0.800 combined F1 on extraction benchmarks with zero API calls. ctxgraph adds local ONNX-based NER (GLiNER) and relation extraction on top of sqlite-graph.

Read the full technical deep-dive: We Replaced Neo4j with 45 SQL Statements

License

MIT