Embedding substrate for semantic memory (Pillar A).
An Embedder turns text into a fixed-dimension vector so events can be
retrieved by meaning, not just keyword (FTS5). The real semantic backend is
a pure-Rust static model (model2vec, behind the embed feature); when it is
absent every caller falls back to FTS5, so the journal’s zero-cost,
offline-by-default behaviour is preserved.
This module is dependency-free on purpose: the trait, the cosine/recency
math, the SQLite blob codec, and a deterministic HashEmbedder all build
and test without pulling a model. The model2vec backend is added as an
isolated, feature-gated step on top.
Structs§
- HashEmbedder: A deterministic, dependency-free embedder using the feature-hashing trick: each token is hashed into one of dim buckets and the resulting bag-of-words vector is L2-normalised. It is lexical, not semantic: its job is to make the trait, storage, ingest and ranking code testable without a model, and to serve as a crude offline fallback. The real semantic quality comes from the model2vec backend.
- Model2VecEmbedder: True semantic embedder backed by a model2vec static model (pure-Rust, no onnxruntime). The model is downloaded once via the HuggingFace hub and cached locally; later loads read the cache. Behind the embed feature.
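The feature-hashing trick described for HashEmbedder can be sketched as follows. This is a minimal illustration, not the crate's code: the function name hash_embed, the choice of FNV-1a as the hash, and the tokenisation rule (lowercase, split on non-alphanumeric characters) are all assumptions made for the example.

```rust
/// FNV-1a 64-bit hash. Assumed here only because it is tiny and gives the
/// same value on every run; the real HashEmbedder may use a different hash.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    h
}

/// Hypothetical helper: hash each token into one of `dim` buckets, count
/// occurrences (bag of words), then L2-normalise the resulting vector.
fn hash_embed(text: &str, dim: usize) -> Vec<f32> {
    let mut v = vec![0.0f32; dim];
    if dim == 0 {
        return v;
    }
    // Assumed tokenisation: split on anything that is not alphanumeric.
    for tok in text.split(|c: char| !c.is_alphanumeric()).filter(|t| !t.is_empty()) {
        let bucket = (fnv1a(tok.to_lowercase().as_bytes()) % dim as u64) as usize;
        v[bucket] += 1.0;
    }
    // L2-normalise; an all-zero vector (no tokens) is left untouched.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}
```

Because buckets depend only on the token text, two inputs that share words share non-zero coordinates, which is why such an embedder is lexical rather than semantic.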
Constants§
- DEFAULT_EMBED_MODEL: Default model2vec repo; multilingual, so RU/EN prose both embed well. Overridable via TJ_EMBED_MODEL.
Traits§
- Embedder: A text embedder. Implementations return exactly one vector per input, all of the same dim, produced by the model named by model_id.
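The contract stated for Embedder (one vector per input, all of length dim) can be illustrated with a small sketch. Only the names dim and model_id come from the description above; the method signatures, the embed method name, the LengthEmbedder toy type and the upholds_contract helper are hypothetical and may not match the real trait.

```rust
/// Assumed shape of the trait. Only `dim` and `model_id` are named in the
/// documentation; the signatures and `embed` are guesses for illustration.
trait Embedder {
    fn model_id(&self) -> &str;
    fn dim(&self) -> usize;
    /// Hypothetical: returns one vector per input, each of length `dim()`.
    fn embed(&self, texts: &[&str]) -> Vec<Vec<f32>>;
}

/// Toy implementation that exists only to exercise the contract:
/// it "embeds" a text as (character count, word count).
struct LengthEmbedder;

impl Embedder for LengthEmbedder {
    fn model_id(&self) -> &str {
        "toy-length-v0"
    }
    fn dim(&self) -> usize {
        2
    }
    fn embed(&self, texts: &[&str]) -> Vec<Vec<f32>> {
        texts
            .iter()
            .map(|t| vec![t.chars().count() as f32, t.split_whitespace().count() as f32])
            .collect()
    }
}

/// Checks the documented contract for any implementation:
/// exactly one vector per input, every vector of length `dim()`.
fn upholds_contract(e: &dyn Embedder, texts: &[&str]) -> bool {
    let out = e.embed(texts);
    out.len() == texts.len() && out.iter().all(|v| v.len() == e.dim())
}
```

A check like upholds_contract is the kind of test that can run against a deterministic embedder without pulling a model.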
Functions§
- cosine: Cosine similarity of two vectors. Returns 0.0 on a length mismatch or a zero-norm input; callers rank with this, they don’t assert on it, so it must never panic.
- default_embedder: The embedder the journal uses unless overridden. With the embed feature (on by default) it loads the model2vec static model for true semantic recall; if that can’t load (offline first run, download failure, or TJ_EMBED=hash) it falls back to the dependency-free lexical HashEmbedder so the journal never breaks.
- from_blob: Decode a little-endian byte blob back into an f32 vector. Trailing bytes that don’t form a full f32 are ignored (defensive; should never happen for blobs produced by to_blob).
- is_embeddable: Whether an event’s text is worth embedding. Skips empties and very short boilerplate (e.g. the [open] marker) that carry no retrievable meaning.
- to_blob: Encode an f32 vector as a little-endian byte blob for SQLite BLOB storage. Round-trips with from_blob.
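The documented behaviour of cosine, to_blob and from_blob can be reproduced with the standard library alone. The sketch below reuses those names for readability, but the signatures are assumptions and the bodies are an independent re-implementation of the described behaviour, not the crate's source.

```rust
/// Cosine similarity that never panics: a length mismatch or a zero-norm
/// input yields 0.0, as the description above requires.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Encode each f32 as 4 little-endian bytes, concatenated in order.
fn to_blob(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|x| x.to_le_bytes()).collect()
}

/// Decode 4 bytes at a time; `chunks_exact` silently drops a trailing
/// partial chunk, which matches the "trailing bytes are ignored" rule.
fn from_blob(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```

Fixing the byte order explicitly, rather than relying on the host's native order, keeps stored blobs readable across machines with different endianness.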