triblespace-search 0.41.2

Content-addressed BM25 + HNSW indexes on top of triblespace piles.
Documentation

triblespace-search

Content-addressed BM25 + HNSW indexes on top of triblespace piles.

Two blob types, loaded zero-copy via anybytes and jerky:

  • SuccinctBM25Index (SB25 blob, schema id 68C03764D04D05DF65E49589FBBA1441) — lexical / associative retrieval. Terms are 32-byte triblespace Inlines, so the index handles text search, entity co-occurrence, and tag weighting with the same schema. Postings bit-packed via jerky CompactVector.
  • SuccinctHNSWIndex (SH25 blob, schema id A96890DE5F85A4F2285C365549B21BC2) — approximate cosine similarity over caller-supplied embedding handles. Graph stored as per-(layer, node) CSR in two jerky CompactVectors. Nodes are Handle<Embedding> values; the caller's doc-to-embedding mapping is a trible they own, not a shadow datamodel inside the index.

Both indexes are rebuilt-and-replaced (no mutation). The resulting blob handle is persisted wherever the caller likes — branch metadata, commit metadata, a plain trible, or an in-memory cache.

See docs/DESIGN.md for the full design.

Status

Pre-alpha. Tracks the workspace version (0.36.0); the API shapes are settling but not yet stable for downstream pinning. Both the naive and succinct paths are shipped end-to-end (see docs/DESIGN.md for the full picture and CHANGELOG.md for the recent shape changes). The remaining open items are perf/encoding refinements, not architecture.

What works today

  • BM25Index (naive in-memory): build + multi-term query, content-addressed byte serialization, plus a single triblespace Constraint: matches(doc, &terms, score_floor) — binds doc only; score is a fixed parameter. Pair with idx.score(&doc, terms) to recompute precise scores after the engine filters, same pattern as HNSW's similar/recompute split.
  • SuccinctBM25Index: jerky-backed zero-copy view — doc keys via CompressedUniverse, terms as a typed View<[[u8; 32]]> row table, doc-lengths + postings via CompactVector. The index is its blob: every section lives in one shared anybytes::ByteArea, so ToBlob/TryFromBlob are O(1) refcounted handovers.
  • FlatIndex: brute-force exact cosine baseline. Same similar(a, b, score_floor) constraint as HNSW — useful for ground truth and small corpora.
  • HNSWIndex (naive Malkov & Yashunin 2018) with deterministic level sampling, ef-search, byte serialization. Validated at 1 000 handles / 32-dim against FlatIndex at ≥ 70 % above-threshold recall.
  • SuccinctHNSWIndex: jerky-backed zero-copy view — a View<[[u8; 32]]> row table of embedding handles plus a CSR graph encoded as two CompactVectors, all in one canonical Bytes. Nodes IS the handle; the caller's doc → embedding mapping lives in their tribles, not here.
  • Binary-relation similarity constraint similar(a, b, score_floor) produced by the similar() method on any attached view. a and b are Variable<Handle< Embedding>>; score_floor is a fixed cosine threshold. Callers who need the exact score fetch both embeddings and compute cosine directly — no u16 quantization.
  • Shared constraint trait SimilaritySearch (HNSW, Flat, SuccinctHNSW) + BM25Queryable (naive + succinct BM25) — the same constraints work against either backend.
  • matches_text(doc, text, floor) + score_text(doc, text): word-hash-keyed sugar over matches and score — tokenises the query string with hash_tokens internally, available on indexes whose term schema is WordHash (the default).
  • tokens::hash_tokens: opt-in whitespace + lowercase + Blake3 tokenizer producing 32-byte term values.
  • tokens::ngram_tokens: character n-gram tokenizer (n namespaced into the hash) for prefix / typo matching. Compose with hash_tokens to get both exact and fuzzy matching through a single BM25 index.
  • tokens::code_tokens: identifier tokenizer — splits on camelCase, snake_case, digit boundaries, and acronyms (HTMLParserhtml, parser). Lowercased output hashes the same as hash_tokens, so code and prose can share one index.
  • tokens::bigram_tokens: word-level bigram tokenizer namespaced into "2w:" so bigrams and single-word hashes coexist in one index. Compose with hash_tokens to answer both single-word and phrase queries — bigram_tokens("quick brown") hashes only the ordered pair, so a doc matches iff the two words appear adjacently.
  • schemas::F32LE: InlineEncoding for packing f32 scores into 32-byte Inline<F32LE>s. Used by the scored BM25 constraint.
  • Eight runnable examples:
    • query_demo — text search, multi-term ranking via filter+rescore, value-as-term citation search.
    • compose_bm25_and_pattern — BM25 + pattern! over a TribleSet in one find!.
    • multi_term_bm25_search — multi-term matches filter joined with a pattern! author filter, ranked by post-collect idx.score.
    • compose_hnsw_and_pattern — similarity + pattern! composition via the binary Similar relation.
    • hybrid_search — BM25 + similarity + pattern! in one find!; both filters active simultaneously.
    • blob_sizes_at_scale — naive vs. SB25 blob size + parallel build speedup at 1k / 5k / 10k / 50k docs.
    • query_latency — p50/p99 latency for BM25 queries and HNSW threshold walks.
    • phrase_searchhash_tokens + bigram_tokens in two typed indexes; same corpus answers single-word and phrase queries.
  • 154 tests across unit, scale (1k-doc equivalence + naive-vs-SB25 size guard), engine-integration (IntersectionConstraint joins + find! / pattern! composition + find! over both succinct paths), and doctests.

What's next

  • Wavelet-matrix BM25 term table (would shrink the term column at large vocabularies; correctness-first is winning today).
  • Direct SuccinctBM25Index builder that skips the naive intermediate (memory win at large build-time scale).

See docs/DESIGN.md, docs/QUERY_ENGINE_INTEGRATION.md, docs/HNSW_GRAPH_ENCODING.md, and docs/FACULTY_INTEGRATION.md for the rust-script faculty consumption pattern.

License

Dual-licensed under either MIT or Apache-2.0, at your option.