prollytree 0.4.0

A prolly (probabilistic) tree for efficient storage, retrieval, and modification of ordered data.
ProllyTree

A probabilistic B-tree with Merkle properties — a content-addressed, Git-versioned key-value store with branching, three-way merge, cryptographic proofs, optional SQL, and an optional vector / text-search index. Written in Rust with first-class Python bindings.

A prolly tree's shape is a deterministic function of its contents, so two replicas holding the same key-value set converge to the same root hash regardless of insertion order. That property is what makes the rest — Git-style versioning, efficient diff/sync between replicas, and verifiable subtree sharing across history — fall out for free.
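
To make the order-independence claim concrete, here is a minimal Python sketch. It is not ProllyTree's actual algorithm: it rebuilds a toy Merkle tree from scratch and closes a chunk whenever a key's hash matches a pattern. The names `is_boundary`, `close`, and `root_hash`, the 2-bit boundary rule, and the hashing layout are all assumptions made for illustration.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def is_boundary(key: bytes, level: int, bits: int = 2) -> bool:
    # Hypothetical rule: a key closes its chunk when the low `bits` bits of a
    # level-salted hash are zero, so boundaries depend only on content.
    return h(bytes([level]) + key)[-1] % (1 << bits) == 0


def close(chunk):
    # A parent node is identified by its last key and the hash of its children.
    return chunk[-1][0], h(b"".join(digest for _, digest in chunk))


def root_hash(entries: dict) -> bytes:
    # Leaf level: one (key, digest) pair per entry, in key order.
    nodes = [(k, h(len(k).to_bytes(4, "big") + k + v)) for k, v in sorted(entries.items())]
    level = 0
    while len(nodes) > 1:
        parents, chunk = [], []
        for node in nodes:
            chunk.append(node)
            if is_boundary(node[0], level):
                parents.append(close(chunk))
                chunk = []
        if chunk:
            parents.append(close(chunk))
        if len(parents) == len(nodes):
            # No reduction at this level: collapse into one node so the loop ends.
            parents = [close(nodes)]
        nodes = parents
        level += 1
    return nodes[0][1] if nodes else h(b"")
```

Because chunk boundaries are derived from the keys alone, two dicts filled in different orders with the same pairs produce the same `root_hash`, while changing a single value changes it.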

Features

| Capability | What it gives you |
| --- | --- |
| Versioned KV store | Git-backed branch / commit / diff / three-way merge on raw key-value state |
| Namespaced KV store | Many isolated prolly trees in one Git repo, atomic across namespaces |
| Text / vector search | Optional versioned ANN index inside any namespace; bundled MiniLM, hash, and callable embedders |
| Multi-chunk indexing | Split docs into chunks at index time, dedup on search by document |
| Cascade mode | One primary write auto-mirrors into every registered text index |
| Large-value externalization | Values above a threshold land in content-addressed blobs; gc_blobs() reclaims them |
| Cryptographic proofs | Merkle inclusion / absence proofs on every value |
| Multiple storage backends | In-memory, File, RocksDB, Git-backed |
| SQL interface | Query the tree as relational tables via GlueSQL |
| Python bindings | Full surface via PyO3: versioning, namespaces, text search, SQL |
| git-prolly CLI | Git-style command surface over the versioning + SQL layers |
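
The large-value externalization row can be pictured with a small toy model. This is a sketch under stated assumptions, not the library's implementation: `ToyBlobStore`, its `threshold` default, and its `gc` method are hypothetical names, and the collector only looks at the current state, whereas a versioned store would also have to keep blobs referenced by older commits.

```python
import hashlib


class ToyBlobStore:
    """Hypothetical model: small values stay inline, large ones become blobs."""

    def __init__(self, threshold: int = 64):
        self.threshold = threshold
        self.blobs = {}  # hex digest -> value bytes (content-addressed)
        self.kv = {}     # key -> ("inline", value) or ("blob", digest)

    def put(self, key: bytes, value: bytes) -> None:
        if len(value) > self.threshold:
            digest = hashlib.sha256(value).hexdigest()
            self.blobs[digest] = value  # identical values share one blob
            self.kv[key] = ("blob", digest)
        else:
            self.kv[key] = ("inline", value)

    def get(self, key: bytes) -> bytes:
        kind, payload = self.kv[key]
        return self.blobs[payload] if kind == "blob" else payload

    def gc(self) -> int:
        # Drop every blob that no current key points at; return how many.
        live = {payload for kind, payload in self.kv.values() if kind == "blob"}
        dead = [digest for digest in self.blobs if digest not in live]
        for digest in dead:
            del self.blobs[digest]
        return len(dead)
```

Overwriting a large value leaves the old blob orphaned until `gc()` runs, which is the situation a reclaim pass exists to clean up.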

Quick start

Rust

[dependencies]
prollytree = { version = "0.4.0", features = ["git", "sql"] }
# Add `proximity` for the text-search surface, `proximity_text` for bundled MiniLM.

Python

pip install prollytree   # ships with git, sql, proximity, proximity_text by default

Examples

Verifiable key-value store

The raw ProllyTree can generate a Merkle inclusion proof for any key, which is useful when data crosses trust boundaries.

use prollytree::storage::InMemoryNodeStorage;
use prollytree::tree::{ProllyTree, Tree};

// In-memory node storage with the default tree configuration.
let mut tree = ProllyTree::new(InMemoryNodeStorage::<32>::new(), Default::default());
tree.insert(b"user:alice".to_vec(), b"Alice".to_vec());

// Generate a proof for the key, then verify it against the expected value.
let proof = tree.generate_proof(b"user:alice");
assert!(tree.verify(proof, b"user:alice", Some(b"Alice")));
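
For readers new to Merkle proofs, the following Python sketch shows the general idea behind an inclusion proof on a plain binary Merkle tree. It does not reproduce ProllyTree's proof format; `leaf_hash`, `node_hash`, `build_levels`, `prove`, and `verify` are illustrative names, and duplicating the last node on odd-sized levels is just one common convention.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(key: bytes, value: bytes) -> bytes:
    # Domain-separate leaves from inner nodes with a one-byte prefix.
    return h(b"\x00" + len(key).to_bytes(4, "big") + key + value)


def node_hash(left: bytes, right: bytes) -> bytes:
    return h(b"\x01" + left + right)


def build_levels(leaves: list) -> list:
    # levels[0] holds the leaf hashes, levels[-1] holds the single root hash.
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        nxt = []
        for i in range(0, len(cur), 2):
            right = cur[i + 1] if i + 1 < len(cur) else cur[i]  # duplicate last
            nxt.append(node_hash(cur[i], right))
        levels.append(nxt)
    return levels


def prove(levels: list, index: int) -> list:
    # Collect the sibling hash at every level on the path from leaf to root.
    proof = []
    for level in levels[:-1]:
        sib = index ^ 1
        sibling = level[sib] if sib < len(level) else level[index]
        proof.append((sibling, sib < index))  # (hash, sibling_is_left)
        index //= 2
    return proof


def verify(root: bytes, key: bytes, value: bytes, proof: list) -> bool:
    acc = leaf_hash(key, value)
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root
```

A verifier needs only the root hash, the claimed key and value, and the sibling hashes along one path, so the proof grows with the tree's height rather than with the number of entries.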

Git-backed versioning

The git feature stores tree nodes as Git objects, so commits, branches, and merges work natively on key-value state.

use prollytree::git::versioned_store::StoreFactory;

let mut store = StoreFactory::git::<32, _>("data")?;
store.insert(b"config/api_key".to_vec(), b"v1".to_vec())?;
store.commit("Initial config")?;

store.create_branch("experimental")?;
store.insert(b"config/api_key".to_vec(), b"v2".to_vec())?;
store.commit("Try new key")?;
// → diff, merge, history available; see the user guide
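
The three-way merge mentioned above follows the same rule Git applies to files, only per key. The sketch below works on plain Python dicts and is not the library's merge code; `three_way_merge` is a hypothetical helper, it assumes values are never `None` (which stands for "key absent"), and it simply reports conflicting keys instead of resolving them.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two descendants of `base`; return (merged, conflicting_keys)."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            result = o          # both sides agree (including both deleted)
        elif o == b:
            result = t          # only their side changed the key
        elif t == b:
            result = o          # only our side changed the key
        else:
            conflicts.append(key)  # both sides changed it differently
            continue
        if result is not None:
            merged[key] = result
    return merged, conflicts
```

Keys changed on only one side are taken from that side, deletions propagate the same way, and only keys edited differently on both branches end up in the conflict list.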

Multiple namespaces in one store

NamespacedKvStore holds many independent prolly trees in one Git repo. Each namespace owns its own key space and (optionally) its own search indexes; one commit covers them all.

from prollytree import NamespacedKvStore

store = NamespacedKvStore("./data")
store.ns_insert("users",    b"u:alice", b"Alice")
store.ns_insert("settings", b"theme",   b"dark")
store.commit("seed users + settings")

store.branch("experiment")
store.ns_insert("settings", b"theme", b"light")
store.commit("flip theme on experiment")

store.checkout("main")
store.ns_get("settings", b"theme")   # b"dark" — main is unchanged

Optional text / vector search

A namespace can host one or more text indexes that ride on the same storage as the primary tree. Every search hit is just an id; resolve back to the original bytes via the primary tree.

from prollytree import NamespacedKvStore, MiniLmEmbedder

store = NamespacedKvStore("./data")
store.text_index_open("docs", "by_body", MiniLmEmbedder())
store.set_cascade("docs", ["by_body"])      # primary writes auto-index

store.ns_insert("docs", b"doc:1", b"the quick brown fox")
store.ns_insert("docs", b"doc:2", b"a lazy dog asleep on the mat")
store.commit("seed corpus")

for doc_id, distance in store.text_index_search("docs", "by_body", "vulpine animal", k=3):
    print(doc_id, distance, store.ns_get("docs", doc_id))
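
As a rough mental model of what a text index does with chunks, here is a self-contained Python sketch. Everything in it is an assumption for illustration: `embed` is a toy hashing bag-of-words embedder (not the library's HashEmbedder or MiniLM), the search is brute force rather than ANN, and `index_document` and `search` are hypothetical helpers. It splits each document into chunks at index time and keeps only the best-scoring chunk per document at search time.

```python
import hashlib
import math

DIM = 1024  # toy embedding width, chosen arbitrarily


def embed(text: str) -> list:
    # Hash each word into a bucket, then L2-normalise the counts.
    vec = [0.0] * DIM
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode()).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_distance(a: list, b: list) -> float:
    return 1.0 - sum(x * y for x, y in zip(a, b))


def index_document(index: dict, doc_id: bytes, text: str, chunk_words: int = 4) -> None:
    # One vector per chunk, keyed by (document id, chunk number).
    words = text.split()
    for n, start in enumerate(range(0, len(words), chunk_words)):
        index[(doc_id, n)] = embed(" ".join(words[start:start + chunk_words]))


def search(index: dict, query: str, k: int = 3) -> list:
    q = embed(query)
    best = {}  # doc_id -> smallest distance over that document's chunks
    for (doc_id, _), vec in index.items():
        d = cosine_distance(q, vec)
        if doc_id not in best or d < best[doc_id]:
            best[doc_id] = d
    return sorted(best.items(), key=lambda item: item[1])[:k]
```

Each result is a `(doc_id, distance)` pair with one entry per document, so the caller can look the id up in the primary key-value store to get the original bytes.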

See examples/ (Rust) and python/examples/ for the full set: namespaces, text search, cascade, merge resolvers, SQL, blob GC.

Good fits

The combination of content-addressed Merkle structure + Git-style versioning + optional semantic search makes ProllyTree a natural fit for a few non-trivial use cases:

  • Auditable application state. Config systems, feature-flag rollout state, policy rules, and anything else you would otherwise model as "an event log + a current-state snapshot" get a real Git history with diff, blame, rollback, and proofs.
  • Distributed / multi-replica data. Two peers that hold the same keys and values converge to the same root hash. Subtree sharing makes the cost of diff and sync proportional to the number of changes, not the size of the corpus.
  • AI agent memory. Per-agent namespaces give isolated key spaces in one store; commits make every memory mutation auditable; branches isolate speculative reasoning; the optional text index gives semantic recall without a separate vector database. The text-search guide walks through this pattern in detail.
  • Versioned analytical datasets. SQL over a Git-tracked KV store — git checkout a historical commit and run the same query against the data as it existed then. See the SQL guide.
  • Content-addressed indexes. Any place a Merkle tree already makes sense (verifiable logs, proof systems, gossip-friendly indexes) — ProllyTree gives you the data-structure ergonomics of a B-tree on top.
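
The claim that diff and sync scale with the number of changes can be illustrated with a one-level toy in Python. This is a simplified sketch, not the library's diff: `chunks_by_hash` and `diff` are hypothetical names, the 3-bit boundary rule is arbitrary, and the sketch re-chunks both datasets on every call, whereas a real tree keeps node hashes stored so unchanged subtrees are never read.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def chunks_by_hash(entries: dict, bits: int = 3) -> dict:
    # Cut the sorted entries into chunks at content-defined boundaries and
    # index each chunk by the hash of its contents.
    chunks, current = [], []
    for key, value in sorted(entries.items()):
        current.append((key, value))
        if h(key)[-1] % (1 << bits) == 0:
            chunks.append(tuple(current))
            current = []
    if current:
        chunks.append(tuple(current))
    return {h(repr(c).encode()): c for c in chunks}


def diff(old: dict, new: dict):
    """Return (changes, entries_scanned); chunks with equal hashes are skipped."""
    old_chunks, new_chunks = chunks_by_hash(old), chunks_by_hash(new)
    before = {k: v for d, c in old_chunks.items() if d not in new_chunks for k, v in c}
    after = {k: v for d, c in new_chunks.items() if d not in old_chunks for k, v in c}
    changes = {
        k: (before.get(k), after.get(k))
        for k in before.keys() | after.keys()
        if before.get(k) != after.get(k)
    }
    return changes, len(before) + len(after)
```

Only entries inside chunks whose hash differs between the two sides are compared, so a handful of edits in a large dataset touches a handful of chunks.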

Embedders (when you use the text-search feature)

| Embedder | Pulls in | Use it for |
| --- | --- | --- |
| HashEmbedder | nothing extra | Tests, demos, exact-match recall |
| MiniLmEmbedder | Candle (pure Rust) + ~90 MB weights | Real semantic search, offline-friendly |
| CallableEmbedder | your callable | OpenAI, Cohere, sentence-transformers, your own model |

Embedder identity (id + version) is persisted with the index. Reopening with a mismatched embedder surfaces a clear error — no silent mixing of vectors from different models.
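
The identity check can be sketched in a few lines of Python. This is an illustration of the idea only, under the assumption that the identity is stored as an id and version pair in index metadata; `EmbedderIdentity`, `EmbedderMismatch`, and `open_index` are hypothetical names, not the library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbedderIdentity:
    id: str
    version: str


class EmbedderMismatch(Exception):
    """Raised when an index is reopened with a different embedder."""


def open_index(metadata: dict, embedder: EmbedderIdentity) -> dict:
    stored = metadata.get("embedder")
    if stored is None:
        # First open: record which embedder produced the vectors.
        metadata["embedder"] = {"id": embedder.id, "version": embedder.version}
    elif (stored["id"], stored["version"]) != (embedder.id, embedder.version):
        raise EmbedderMismatch(
            f"index was built with {stored['id']}@{stored['version']}, "
            f"got {embedder.id}@{embedder.version}"
        )
    return metadata
```

Failing at open time keeps vectors from two different models out of the same index, where their distances would not be comparable.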

Feature flags

| Feature | Description | Default |
| --- | --- | --- |
| git | Git-backed versioned storage with branching, merging, history | Yes |
| sql | SQL query interface via GlueSQL | Yes |
| proximity | Vector index + text-search infrastructure (ML-free) | No |
| proximity_text | Bundled Candle + all-MiniLM-L6-v2 embedder | No |
| rocksdb_storage | RocksDB persistent storage backend | No |
| python | Python bindings via PyO3 | No |
| tracing | Observability via the tracing crate | No |

The PyPI wheels ship with git, sql, rocksdb_storage, proximity, and proximity_text enabled. Rust users opt in to the non-default features explicitly:

[dependencies.prollytree]
version = "0.4.0"
features = ["git", "sql", "proximity", "proximity_text"]

Documentation

CLI

cargo install prollytree --features git
git-prolly --help

See the user guide for the full CLI walkthrough.

Contributing

Contributions welcome — see CONTRIBUTING.md.

License

Licensed under the Apache License 2.0. See LICENSE.