silk-graph 0.1.1

Merkle-CRDT graph engine for distributed, conflict-free knowledge graphs
Documentation
# Why Silk Exists

## The Problem

Knowledge graphs today force a choice:

**Centralized engines** (Neo4j, TigerGraph, Amazon Neptune) give you schema, queries, and algorithms — but require a server. Every write goes through a coordinator. Offline operation is not a concept. Multi-region means replication lag, conflict resolution is your problem, and the server is a single point of failure.

**Distributed CRDTs** (Automerge, Yjs, cr-sqlite) give you offline-first, conflict-free replication — but they are document-oriented or row-oriented. No schema enforcement at write time. No graph traversal primitives. You bolt on your own validation, your own BFS, your own impact analysis. The CRDT handles merge; everything else is your responsibility.

**No tool combines all four:**
1. Distributed (no server required)
2. Schema-enforced (validated at write time)
3. Graph-native (traversal, algorithms, pattern matching built in)
4. Conflict-free (mathematical convergence guarantee)

Silk is that tool.

## What Silk Does Differently

### Schema-enforced

Silk validates every write against an ontology defined at store creation. Node types, edge types, source/target constraints, required properties, property types — all checked before the entry hits the DAG. Invalid writes are rejected, not silently stored. The ontology defines the minimum (required properties, type constraints); unknown properties are accepted (open-world, D-026). Your schema evolves without migrations.

### Conflict-free

Silk uses a Merkle-CRDT: every mutation is a content-addressed entry in a DAG, with Lamport clocks for causal ordering. Concurrent writes to different properties on the same node both survive — per-property last-writer-wins. Concurrent add + remove → add wins. Two stores that exchange sync messages in both directions are mathematically guaranteed to converge to the same graph state. No coordinator, no consensus protocol, no leader election.

### Offline-first

Every Silk instance is a self-contained graph database. Reads and writes work with zero network connectivity. When connectivity returns, a single sync round-trip (offer → payload → merge) brings both sides to convergence. There is no "primary" — any instance can sync with any other. The sync protocol uses Bloom filters to minimize data transfer, sending only entries the peer lacks.

### Graph-native

BFS, shortest path, impact analysis, subgraph extraction, pattern matching, topological sort, cycle detection — built into the engine, operating on the materialized graph. Not a layer on top. Not a query language that compiles to table scans. Graph structure is a first-class citizen of the storage and query model.

## Proof by Example

Four example scripts demonstrate these properties with real code, real measurements, and assertions that verify correctness.

### `examples/offline_first.py` — Two-peer offline sync

Two devices each write 500 nodes independently (simulating offline operation). A single bidirectional sync merges them to 1,000 nodes on each side. Verifies identical node sets. Measures sync latency.

**What it proves:** Offline writes accumulate without conflict. Sync is a single round-trip. No server involved.

### `examples/partition_heal.py` — Three-peer partition healing

Three peers start synced, then diverge (200 nodes each, all different). After healing via mesh sync (A-B, B-C, A-C), all three converge to the same 600-node graph.

**What it proves:** Network partitions are a non-event. Diverged state merges cleanly. Convergence is independent of partition duration.

### `examples/concurrent_writes.py` — Per-property LWW

Two stores modify the same node concurrently. Store A updates `status`. Store B updates `status` AND adds `location`. After sync, both stores have the LWW winner for `status` and the non-conflicting `location` property. No data loss.

**What it proves:** Conflicts are resolved per-property, not per-node. Non-conflicting concurrent writes are never discarded.

### `examples/ring_topology.py` — Zero-coordination scale

Ten peers in a ring topology. Each writes 100 nodes. Ring sync propagates data around the ring until all 10 peers have all 1,000 nodes. Reports how many rounds were needed.

**What it proves:** Convergence works with arbitrary topologies. No peer is special. No election, no leader, no coordinator.

## Benchmarks

Measured on Apple M4 Max (16 cores, 128 GB RAM), macOS 15.7, Rust 1.94.0. Run `cargo bench --no-default-features` for your hardware.

| What | 100 nodes | 1,000 nodes | 10,000 nodes |
|------|-----------|-------------|--------------|
| Write + materialize | 129 µs | 1.5 ms | 16.8 ms |
| Sync offer generation | 24 µs | 282 µs | 3.3 ms |
| Full sync (zero overlap) | 111 µs | 1.3 ms ||
| Incremental sync (10% delta) || 611 µs ||
| Partition heal (500/side) || 833 µs ||
| BFS traversal || 564 ns | 580 ns |
| Shortest path || 706 ns | 717 ns |
| Impact analysis || 108 ns | 105 ns |

Key takeaways:
- **Sub-millisecond sync** for typical workloads (< 1,000 divergent entries)
- **Graph algorithms don't scale with graph size** for targeted queries (BFS/shortest path access specific subgraphs)
- **Partition healing is cheap**: 500 divergent writes per side merge in 833 µs
- **2.2M entries/sec** creation throughput (449 ns/entry)

## When Silk Is the Right Tool

**Good fit:**
- **Local-first applications** — offline-capable apps that sync when connected (note-taking, task management, field data collection)
- **Edge computing** — devices that operate independently and sync periodically (IoT gateways, drones, retail POS)
- **Multi-device sync** — phone, laptop, server — all converge automatically
- **Peer-to-peer systems** — no central server, any node can sync with any other
- **Knowledge graphs with schema** — when you need both structure enforcement and graph traversal
- **Audit trails** — every change is an immutable, content-addressed entry in a Merkle-DAG

**Not the right tool:**
- **High-throughput analytics** — DuckDB, ClickHouse
- **SQL queries** — SQLite, Postgres
- **Document storage** — MongoDB, CouchDB
- **Blob storage** — S3
- **Streaming data** — Kafka, Redpanda

## Architecture

```
Write (add_node, add_edge, update_property, remove_node, remove_edge)
  |
  v
Ontology Validation
  |  reject invalid writes here, before they enter the DAG
  v
Entry { hash: BLAKE3(content), op, clock: Lamport, author, parents: [head_hashes] }
  |
  v
OpLog (append-only Merkle-DAG, content-addressed, immutable)
  |
  |--- MaterializedGraph (live view)
  |      |
  |      |-- Nodes: id -> { type, label, subtype, properties }
  |      |-- Edges: id -> { type, source, target, properties }
  |      |-- Indexes: by_type, by_subtype, by_property
  |      |
  |      +-- Query / Algorithms
  |            bfs, shortest_path, impact_analysis, pattern_match,
  |            topological_sort, has_cycle, subgraph
  |
  +--- Sync Protocol
         |
         |-- generate_sync_offer()   -> Bloom filter of known hashes
         |-- receive_sync_offer()    -> entries the peer is missing
         +-- merge_sync_payload()    -> apply remote entries, re-materialize
                                        (per-property LWW, add-wins semantics)
```

**Convergence invariant:** For any two stores S1 and S2, if S1 and S2 have exchanged sync messages in both directions, then `S1.all_nodes() == S2.all_nodes()` and `S1.all_edges() == S2.all_edges()`. This holds regardless of write order, network topology, or partition duration.