prollytree 0.4.0

# Text Indexing & Vector Search

ProllyTree includes a **version-controlled approximate-nearest-neighbour (ANN) index** that sits inside any namespace of a `NamespacedKvStore`. You can do semantic similarity search on the same data that the rest of the store versions, branches, and merges — without standing up a separate vector database.

!!! tip "Try it in your browser"
    Open the [interactive demo](text_search_demo.html) for a self-contained walkthrough — namespaced store + text indexes + cascade + live search, no install required.

For the conceptual model see [Architecture → Proximity / text-search layer](architecture.md#7-proximity-text-search-layer). For runnable code see [Examples → Text Search](examples/text_search.md).

## When to use it

- You're already using ProllyTree (or want to) for versioned storage.
- You need top-k nearest-neighbour over short-ish text (notes, docs, logs, chunks).
- You want index + primary data to commit, branch, and merge **atomically** — no separate sync job between a vector DB and your source of truth.

It is not a replacement for a high-end ANN library on billion-scale corpora. The current implementation targets per-namespace corpora in the thousands-to-millions range with deterministic, history-independent index shape.

## How it works in 60 seconds

```
NamespacedKvStore (one git repo)
├── namespace "docs"
│   ├── primary tree           ← source of truth: doc_id → body bytes
│   ├── text sub-index "by_body"      ← (id, vector) pairs, ANN-searchable
│   └── text sub-index "by_summary"   ← multiple indexes per namespace
└── ...
```

A text index turns each document into one or more vectors via a configurable **embedder** and stores them inside a Dolt-style proximity tree (Merkle ANN structure — see [the design discussion](https://www.dolthub.com/blog/2025-06-23-vector-index-deep-dive/)). The proximity tree's *shape* is a pure function of the current `(id, vector)` set, so two replicas with the same data converge to the same root hash regardless of insertion order — the same content-defined-shape property the prolly tree itself has.

The proximity index stores only `(id, vector)` pairs — never the source text. **The primary KV tree is the source of truth.** Search results give you ids; you resolve back to the original text via the primary tree.

## Feature flags

ProllyTree's text-search surface is gated behind two Cargo features:

| Feature | Pulls in | Purpose |
|---|---|---|
| `proximity` | nothing extra | Raw vector index + text-index infrastructure. ML-free; ships [`HashEmbedder`](#hashembedder) for tests and the [`CallableEmbedder`](#callableembedder) shim for "bring your own embedder". |
| `proximity_text` | Candle (pure-Rust ML), tokenizers, ureq | Adds the bundled [`MiniLmEmbedder`](#minilmembedder). First call downloads ~90 MB of weights into `$PROLLYTREE_EMBEDDER_CACHE` (default `~/.cache/prollytree/embedders`). |

PyPI wheels and `cargo install`'d builds ship both features by default.

## Supported backends

v1 of the proximity index is fully supported on the **File** and **RocksDB** storage backends. `InMemoryNodeStorage` works for testing. `GitNodeStorage` is mechanically functional but its hash-mapping is only flushed by the higher-level commit path, so production use of git-backed proximity is exercised through `NamespacedKvStore` rather than direct `ProximityIndex<_, GitNodeStorage<_>>` constructions.

## API tour (Python)

A complete runnable walkthrough lives in [Examples → Text Search](examples/text_search.md). The minimum surface you need:

### Open or re-open an index

```python
from prollytree import NamespacedKvStore, MiniLmEmbedder

store = NamespacedKvStore("./data")
emb = MiniLmEmbedder()                       # or HashEmbedder / CallableEmbedder

# Creates the index on first call; on subsequent calls validates that the
# supplied embedder's id + version match what's persisted on disk.
store.text_index_open("docs", "by_body", emb)
```

The embedder's `id` and `version` are persisted on first open and re-checked on every reopen. Mismatch raises a clear error so you don't silently mix vectors produced by different models.

### Dual-write (the canonical pattern)

```python
docs = {
    b"doc:1": "the quick brown fox",
    b"doc:2": "lazy dog asleep on the mat",
}
for doc_id, text in docs.items():
    store.ns_insert("docs", doc_id, text.encode())            # primary tree (truth)
    store.text_index_insert("docs", "by_body", doc_id, text)  # index (pointer)
store.commit("seed corpus")
```

Both writes land in the **same git commit** atomically. The primary tree carries the source bytes; the index carries the vectors. If you ever change embedders, you can re-embed every doc from the primary tree.

### Cascade — replace the dual write with one call

```python
store.text_index_open("docs", "by_body", emb)
store.set_cascade("docs", ["by_body"])                        # opt-in once

store.ns_insert("docs", b"doc:3", b"branching is first-class")
store.commit("cascade-driven indexing")                       # also updated the index
```

Cascade is per-namespace and runtime-only (not persisted in the store registry). `ns_delete` cascades too. A namespace can cascade into multiple indexes — e.g. `["by_body", "by_title"]` — and each index can run its own value-transformer for non-UTF-8 primary values.

### Search and resolve back to text

```python
for doc_id, score in store.text_index_search("docs", "by_body", "vulpine animal", k=5):
    body = store.ns_get("docs", doc_id).decode()
    print(f"{doc_id} (distance={score:.3f}): {body}")
```

Returns `(id_bytes, distance)` tuples ordered by ascending distance (closer first). Multi-chunk indexes (see below) automatically dedup so each document appears once at its best chunk's score.

### Multi-chunk indexing

Pass a chunker by name when opening the index. The chunk-id encoding is `[doc_id_len:4][doc_id_bytes][chunk_idx:4]`, so every chunk for a doc shares a prefix and `text_index_delete` can prefix-scan-remove them all.

```python
store.text_index_open("logs", "by_line", emb, chunker="line")
store.text_index_insert("logs", "by_line", b"log:2026-05-20",
                        "alpha\nbeta\ngamma")     # 3 chunks under one doc id
store.text_index_len("logs", "by_line")           # 1 (distinct documents)
store.text_index_chunk_count("logs", "by_line")   # 3 (raw chunks)
```

Built-in chunkers: `"identity"` (default — one chunk per doc) and `"line"` (one chunk per non-empty line).

## Embedders

### `HashEmbedder`

Deterministic SHA-256-based embedder. Pure Rust, no deps, useful for tests and exact-match lookup. **Not semantic** — `"a cat sat"` and `"a feline rested"` will land in unrelated parts of the vector space.

```python
from prollytree import HashEmbedder
emb = HashEmbedder(dim=384, seed=0)
```

### `MiniLmEmbedder`

Bundled Candle + `sentence-transformers/all-MiniLM-L6-v2` (384-d). Real semantic search. First call downloads weights (~90 MB) to `~/.cache/prollytree/embedders/`.

```python
from prollytree import MiniLmEmbedder
emb = MiniLmEmbedder()                                       # default model + revision
emb = MiniLmEmbedder(model_id="...", revision="main")        # override either field
```

Requires the `proximity_text` feature. Set `PROLLYTREE_EMBEDDER_CACHE` to relocate the cache directory.

### `CallableEmbedder`

Wrap any Python embedding function as an Embedder — use this to plug in OpenAI, Cohere, sentence-transformers, your own pipeline, etc.

```python
from prollytree import CallableEmbedder
from openai import OpenAI

client = OpenAI()
def openai_embed(text):
    return client.embeddings.create(
        input=text, model="text-embedding-3-small"
    ).data[0].embedding

emb = CallableEmbedder(
    id="openai:text-embedding-3-small",
    version="2024-01",
    dim=1536,
    embed_fn=openai_embed,
)
```

The `id` and `version` are what's persisted. **Change `version` whenever the embedding distribution changes** (model upgrade, new tokenizer) so reopens correctly surface the mismatch.

## Drift management

If you write to the primary tree without cascade and forget to mirror into the index — or you change the embedder mid-history — the index can drift from the primary. Detection and repair are first-class:

```python
report = store.audit_text_index("docs", "by_body")
# {"orphans_in_index": [...], "missing_from_index": [...], "is_in_sync": False}

store.purge_text_index_orphans("docs", "by_body")    # remove index entries
                                                     # that have no primary row
```

Filling `missing_from_index` is your call — typically a loop over the listed ids that re-inserts each from the primary tree.

## Branching and merging

Every text index is owned by its namespace and versioned alongside the namespace's primary tree. The store-wide `branch`, `checkout`, `merge` operations move every namespace's primary tree **and** every sub-index together. Switching branches gives you that branch's view of both data and search results.

Three-way merge for text indexes runs the same nine-case logic as the primary KV merge, routed through a `ProximityConflictResolver`. Built-in resolvers (in the Rust crate; not yet surfaced in Python):

- `TakeSourceProximityResolver` / `TakeDestinationProximityResolver`
- `LatestVectorResolver<F>` — timestamp-extractor function picks the newer vector
- `MeanVectorResolver` — averages conflicting vectors (L2 / Cosine only)

## Externalisation + blob GC

Large documents (set via `store.set_externalize_threshold(bytes)`) are stored as content-addressed blobs alongside the prolly tree, with only a 44-byte envelope inline in the leaf. `store.gc_blobs()` walks the current store and reclaims unreferenced blobs. Useful when you're indexing real document bodies rather than short snippets.

```python
store.set_externalize_threshold(64 * 1024)        # 64 KiB threshold
report = store.gc_blobs()
# {"total": 12, "referenced": 12, "removed": 0, "errors": []}
```

Externalisation is currently supported on the **File** and **RocksDB** backends.

## Where to go next

- [Examples → Text Search](examples/text_search.md) — runnable end-to-end demos.
- [Python API](api/python.md#namespacedkvstore) — full method-level reference.
- [Architecture → Proximity / text-search layer](architecture.md#7-proximity-text-search-layer) — how the proximity tree integrates with the rest of the stack.