lattice-embed 0.1.2

# Design: lattice-embed

## Current Design

**Multi-model embedding service.** 8 model variants in `EmbeddingModel` enum (`model.rs:102-135`), `#[non_exhaustive]` for future extension:

| Variant                        | Architecture  | Native Dims | Max Tokens | MRL | Notes                                   |
| ------------------------------ | ------------- | ----------- | ---------- | --- | --------------------------------------- |
| `BgeSmallEnV15` (enum default) | BERT encoder  | 384         | 512        | no  | English-only                            |
| `BgeBaseEnV15`                 | BERT encoder  | 768         | 512        | no  | English-only                            |
| `BgeLargeEnV15`                | BERT encoder  | 1024        | 512        | no  | English-only                            |
| `MultilingualE5Small`          | BERT encoder  | 384         | 512        | no  | Recommended production default          |
| `MultilingualE5Base`           | BERT encoder  | 768         | 512        | no  | Multilingual                            |
| `Qwen3Embedding0_6B`           | Qwen3 decoder | 1024        | 8192       | yes | Last-token pooling, instruction prefix  |
| `Qwen3Embedding4B`             | Qwen3 decoder | 2560        | 8192       | yes | Sharded safetensors, instruction prefix |
| `TextEmbedding3Small`          | OpenAI API    | 1536        | 8191       | no  | Remote only, no local weights           |

`ModelConfig` (`model.rs:381`) pairs a model variant with optional `output_dim` for MRL truncation. `validate()` enforces bounds. `to_embedding_key()` produces a unique key per (model, dimension) pair -- different dimensions = different embedding spaces = separate HNSW indexes and cache files. Production default is `multilingual-e5-small` via application config, not the enum default (`BgeSmallEnV15`). The enum default serves code paths without application config (tests, standalone usage). Qwen3 models require instruction prefix for queries (`model.rs:227-234`): `"Instruct: Given a web search query, retrieve relevant passages...\nQuery: "`. Documents get no prefix (`document_instruction()` returns `None`).

**MRL truncation.** Matryoshka Representation Learning allows Qwen3-Embedding models to produce native-dimension vectors (1024d for 0.6B, 2560d for 4B) that are truncated to smaller dimensions while preserving quality. Only Qwen3 models support this (`supports_output_dim()` at `model.rs:263-268`); non-MRL models reject any `output_dim` value (`model.rs:420-425`). Truncation to 1024d or 1536d retains >95% quality (per Qwen3 paper) at 40-60% storage reduction. Implementation: `NativeEmbeddingService` passes `output_dim` to `QwenModel::set_output_dim()` at `native.rs:230`. Truncation happens inside the inference crate after forward pass + pooling, before L2 normalization -- the embed crate only passes the config. Min dimension: `MIN_MRL_OUTPUT_DIM = 32` (`model.rs:375`). Max: model's `native_dimensions()`. `LATTICE_EMBED_DIM` env var overrides `output_dim` for Qwen models (`native.rs:74-93`). Different `output_dim` values produce different `EmbeddingKey` (`model.rs:447-456`), ensuring separate HNSW indexes per dimension -- mixing dimensionalities in one index would produce garbage similarity scores.

**Two-layer cache.** Layer 1 (embed crate): in-memory sharded LRU. `EmbeddingCache` at `cache.rs:131` -- 16 shards, each `RwLock<LruCache<CacheKey, Arc<[f32]>>>` with per-shard `AtomicU64` hit/miss counters. Shard selection: `key[0] & 0xF`. Default capacity 4000 entries (250/shard). Blake3-based cache key: `text_bytes + model_config.to_embedding_key().canonical_bytes()` -- different MRL dims produce different keys. Eviction: LRU per-shard. Layer 2 (inference crate): `QwenModel` persistent disk cache at `~/.lattice/cache/embed_{model}_{dim}d.bin` (`native.rs:219-248`). Loaded at model init, saved explicitly. Uses `HashMap<u64, Vec<f32>>` with flush-all eviction at 10K cap (see `crates/inference/DESIGN.md` KC-3). BERT models have no persistent cache. The two layers use different key formats: Layer 1 uses blake3 of text + model config (embed crate doesn't have token IDs), Layer 2 uses hash of token ID sequence (inference crate operates post-tokenization). This avoids coupling between crates. `CachedEmbeddingService` (`cached.rs:42`) wraps any `EmbeddingService`: compute blake3 keys -> batch `get_many()` -> embed misses only -> `put_many()` -> merge preserving order. FP-035 fix at `cached.rs:153-159` validates embedding count matches input count.

**Batch processing.** Entry: `NativeEmbeddingService::embed()` at `native.rs:326-357`. Input validation: model mismatch -> error, empty input -> error, batch <= 1000 (`DEFAULT_MAX_BATCH_SIZE`), per-text <= 32768 chars (`MAX_TEXT_CHARS`). Model loads exactly once via `OnceLock` on a blocking thread (`native.rs:161-189`) -- cancellation-safe, survives async drops. BERT models: native batch processing via `BertModel::encode_batch()`. Qwen models: sequential per-item `QwenModel::encode()` (checks internal cache per item) -- sequential because `QwenModel` holds a `Mutex<ForwardBuffers>` preventing parallel forward passes. Batch parallelism would require per-item buffer allocation, adding memory pressure for marginal gain (GPU forward is already fast). Output: `Vec<Vec<f32>>` with order preserved. Core trait: `EmbeddingService` (`service/mod.rs:55-86`) with `embed()`, `embed_one()`, `model_config()`, `supports_model()`, `name()`.

## Alternatives Rejected

| Alternative                                             | Rejected Because                                                                                                                                                                                                        | Date |
| ------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---- |
| Single global LRU cache (no sharding)                   | Write contention under concurrent requests. 16-shard design gives ~16x write throughput at cost of slightly less optimal global LRU ordering. Blake3 key distribution ensures balanced sharding.                        | --   |
| Runtime dimension selection (no ModelConfig validation) | Would allow invalid dimensions (0, negative, above native) to reach inference. Validation at config construction catches errors early with clear messages.                                                              | --   |
| Unified cache key format across both layers             | Layer 1 needs text-level keys (embed crate doesn't have token IDs). Layer 2 needs token-level keys (inference crate operates post-tokenization). Forcing one format adds unnecessary coupling between crates.           | --   |
| Async model loading (non-blocking)                      | Model load involves mmap + GPU init -- CPU-bound, not I/O-bound. `spawn_blocking` + `OnceLock` is simpler and correct. Async model load would require `async fn init()` on the trait, complicating all implementations. | --   |

## Invariants

**INV-1.** Different `output_dim` values produce different `EmbeddingKey` (`model.rs:447-456`). If broken: embeddings of different dimensionality mixed in same HNSW index, producing garbage similarity scores.
**INV-2.** MRL only for models where `supports_output_dim()` returns true (Qwen3 only, `model.rs:263-268`). If broken: truncation of non-MRL model produces semantically invalid embeddings (BERT embeddings are not trained for truncation).
**INV-3.** MRL dimension in `[MIN_MRL_OUTPUT_DIM(32), native_dimensions()]` (`model.rs:426-437`). If broken: dimension below training minimum degrades quality severely; above native is padding with zeros.
**INV-4.** Model loads exactly once per `NativeEmbeddingService` lifetime (`OnceLock` at `native.rs:63-64`). If broken: double model load wastes memory, or concurrent load attempts race.
**INV-5.** Cache key includes `model_config.to_embedding_key().canonical_bytes()` (`cache.rs:194-198`). If broken: same text with different model/dimension shares cache entry, returning wrong-dimension embeddings.
**INV-6.** `CachedEmbeddingService` validates `embeddings.len() == texts_to_embed.len()` (FP-035, `cached.rs:153-159`). If broken: cache merge produces misaligned results (embedding N assigned to text M).
**INV-7.** Batch size <= 1000, text length <= 32768 chars. If broken: OOM on large batches or excessive tokenization time. Validated at both `native.rs` and `cached.rs` entry points.
**INV-8.** Requested model must equal loaded model (`native.rs:327-332`). If broken: model produces embeddings in wrong dimension/space.

## Known Concerns Acknowledged

**KC-1.** Enum default (`BgeSmallEnV15`) differs from production default (`multilingual-e5-small`). Intentional: enum default serves code paths without application config (tests, standalone usage). Production routing goes through application config. Documenting both prevents confusion.

**KC-2.** Qwen models process batch items sequentially, not in parallel (`native.rs:29-32`). Intentional: `QwenModel`'s internal cache check is per-item, and the model holds a `Mutex<ForwardBuffers>` preventing parallel forward passes. Batch parallelism would require per-item buffer allocation, adding memory pressure for marginal gain (GPU forward is already fast). (ref: `native.rs:29-32`)

**KC-3.** `EmbeddingCache` marked Unstable in public API (`lib.rs:87-94`). Intentional: cache internals (shard count, key format, LRU parameters) may change. The stable surface is `EmbeddingService` trait + `ModelConfig`.

**KC-4.** 36 unsafe blocks in the `simd/` module (`distance.rs`, `normalize.rs`, `cosine.rs`, `dot_product.rs`, `quantized.rs`, `binary.rs`, `int4.rs`). Intentional: NEON/AVX2 SIMD intrinsics for vectorized distance computation and normalization in performance-critical embedding operations. TCB-justified -- safe abstractions would impose measurable overhead on hot-path distance calculations.

## Baseline Metrics

| Dimension | Metric                                       | Value                   | Measured   | Threshold | Command                                                           |
| --------- | -------------------------------------------- | ----------------------- | ---------- | --------- | ----------------------------------------------------------------- |
| perf      | Embed latency (mE5-small, 100 texts, cached) | pending                 | --         | +20%      | `cargo bench -p lattice-embed --bench embed`                      |
| perf      | Embed latency (Qwen3-0.6B, 100 texts, cold)  | pending                 | --         | +20%      | `cargo bench -p lattice-embed --bench embed`                      |
| perf      | Cache hit rate (steady state)                | pending                 | --         | -10%      | (logged in cache stats)                                           |
| quality   | MRL 1024d vs native 2560d cosine similarity  | >0.95 (per Qwen3 paper) | 2026-04-29 | <0.93     | pending eval harness                                              |
| security  | unsafe block count                           | 36                      | 2026-04-29 | +5        | `grep -rw "unsafe {" crates/embed/src/ --include="*.rs" \| wc -l` |
| quality   | Model variant count                          | 8                       | 2026-04-29 | -1        | count `EmbeddingModel` enum variants                              |

## Change Protocol

1. Read this DESIGN.md before modifying any file in `crates/embed/`.
2. Check Baseline Metrics -- run the measurement commands and compare against thresholds.
3. Check Known Concerns -- ensure your change doesn't re-introduce a concern already acknowledged.
4. Re-measure baselines after your change and update the table if values shift.