Module embed

Embedding substrate for semantic memory (Pillar A).

An Embedder turns text into a fixed-dimension vector so events can be retrieved by meaning, not just by keyword (FTS5). The real semantic backend is a pure-Rust static model (model2vec, behind the embed feature). When that backend is absent, every caller falls back to FTS5, so the journal’s zero-cost, offline-by-default behaviour is preserved.

This module is dependency-free on purpose: the trait, the cosine/recency math, the SQLite blob codec, and a deterministic HashEmbedder all build and test without pulling a model. The model2vec backend is added as an isolated, feature-gated step on top.

Structs

HashEmbedder
A deterministic, dependency-free embedder using the feature-hashing trick: each token is hashed into one of dim buckets and the resulting bag-of-words vector is L2-normalised. It is lexical, not semantic — its job is to make the trait, storage, ingest and ranking code testable without a model, and to serve as a crude offline fallback. The real semantic quality comes from the model2vec backend.
Model2VecEmbedder
True semantic embedder backed by a model2vec static model (pure-Rust, no onnxruntime). The model is downloaded once via the HuggingFace hub and cached locally; later loads read the cache. Behind the embed feature.
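
The feature-hashing trick described for HashEmbedder can be illustrated with a minimal sketch. This is not the crate's implementation: the function name `hash_embed`, the tokenisation rule (lowercase, split on non-alphanumeric characters), and the use of the standard library's `DefaultHasher` are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper (not the crate's API): feature-hash `text` into
/// `dim` buckets and L2-normalise the resulting bag-of-words vector.
fn hash_embed(text: &str, dim: usize) -> Vec<f32> {
    let mut v = vec![0.0f32; dim];
    if dim == 0 {
        return v;
    }
    // Assumed tokenisation: lowercase, split on anything non-alphanumeric.
    let lower = text.to_lowercase();
    for token in lower
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
    {
        // DefaultHasher::new() uses fixed keys, so the output is repeatable
        // within one Rust release. Its algorithm is not guaranteed stable
        // across releases, so persisted vectors would need a pinned hash.
        let mut h = DefaultHasher::new();
        token.hash(&mut h);
        let bucket = (h.finish() % dim as u64) as usize;
        v[bucket] += 1.0;
    }
    // L2-normalise; an all-zero vector (no tokens) is returned unchanged.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}
```

Because the buckets depend only on token identity, two texts score as similar only when they share words, which is why such an embedder is lexical rather than semantic.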

Constants

DEFAULT_EMBED_MODEL
Default model2vec repo. It is multilingual, so both Russian and English prose embed well. Overridable via TJ_EMBED_MODEL.

Traits

Embedder
A text embedder. Implementations return exactly one vector per input, all of the same dim, produced by the model named by model_id.
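
The contract stated for Embedder (one vector per input, all of length dim, tagged by model_id) might look roughly like the sketch below. The method names and signatures are guesses based on that one-line description; the real trait may differ, for example by returning a `Result`. `LenEmbedder` and `check_contract` are hypothetical and exist only to exercise the contract.

```rust
/// Hypothetical shape of the trait; the real signatures may differ.
trait Embedder {
    /// Identifier of the model that produced the vectors.
    fn model_id(&self) -> &str;
    /// Length of every vector this embedder returns.
    fn dim(&self) -> usize;
    /// One vector per input text, in input order.
    fn embed(&self, texts: &[&str]) -> Vec<Vec<f32>>;
}

/// Toy implementor: stores the byte length of the text in component 0.
struct LenEmbedder {
    dim: usize,
}

impl Embedder for LenEmbedder {
    fn model_id(&self) -> &str {
        "toy-len"
    }

    fn dim(&self) -> usize {
        self.dim
    }

    fn embed(&self, texts: &[&str]) -> Vec<Vec<f32>> {
        texts
            .iter()
            .map(|t| {
                let mut v = vec![0.0; self.dim];
                if let Some(first) = v.first_mut() {
                    *first = t.len() as f32;
                }
                v
            })
            .collect()
    }
}

/// Checks the documented contract: one vector per input, each of length `dim`.
fn check_contract(e: &dyn Embedder, texts: &[&str]) -> bool {
    let out = e.embed(texts);
    out.len() == texts.len() && out.iter().all(|v| v.len() == e.dim())
}
```

A check like `check_contract` is the kind of test that can run against any implementor without a model being present.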

Functions

cosine
Cosine similarity of two vectors. Returns 0.0 on a length mismatch or a zero-norm input — callers rank with this, they don’t assert on it, so it must never panic.
default_embedder
The embedder the journal uses unless overridden. With the embed feature (on by default) it loads the model2vec static model for true semantic recall. If the model can’t load (offline first run or download failure), or if TJ_EMBED=hash is set, it falls back to the dependency-free lexical HashEmbedder so the journal never breaks.
from_blob
Decode a little-endian byte blob back into an f32 vector. Trailing bytes that don’t form a full f32 are ignored (defensive; should never happen for blobs produced by to_blob).
is_embeddable
Whether an event’s text is worth embedding. Skips empties and very short boilerplate (e.g. the [open] marker) that carry no retrievable meaning.
to_blob
Encode an f32 vector as a little-endian byte blob for SQLite BLOB storage. Round-trips with from_blob.
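
The behaviour documented for cosine, to_blob and from_blob can be sketched as follows. The function names match the items above, but the signatures and bodies are assumptions reconstructed from the summaries, not the crate's source.

```rust
/// Cosine similarity that never panics: returns 0.0 on a length
/// mismatch or when either input has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Encode each f32 as 4 little-endian bytes, concatenated in order.
fn to_blob(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|x| x.to_le_bytes()).collect()
}

/// Decode 4-byte little-endian chunks back into f32 values.
/// `chunks_exact` silently drops a trailing partial chunk.
fn from_blob(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```

Returning 0.0 instead of panicking lets a ranking loop treat a malformed or empty vector as simply the least similar candidate. Fixing the byte order to little-endian keeps stored blobs readable regardless of the host's native endianness.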