embeddenator 0.20.0-alpha.1

# Crate structure + concurrency strategy (future inference + balanced ternary compute)

This document proposes a workspace layout and concurrency strategy that keeps “sister” subcrates isolated so they can be extracted later with minimal dependency entanglement.

Goals
- Keep core math/representation crates dependency-light and extraction-ready.
- Keep IO/CLI/FUSE concerns out of core compute.
- Make CPU-bound compute scale via data parallelism without forcing an async runtime.
- Preserve today’s public API surface via a facade crate to reduce churn.

Non-goals
- This does not redesign algorithms; it only restructures code boundaries and runtime patterns.
- This does not require adopting nightly or pervasive `unsafe`.

---

## (1) Recommended workspace / crates layout

Proposed top-level layout:

```
embeddenator/
  Cargo.toml                # workspace + facade crate (optional)
  crates/
    embeddenator-ternary/
    embeddenator-vsa/
    embeddenator-codebook/
    embeddenator-correction/
    embeddenator-embrfs/
    embeddenator-fuse/      # optional, Linux/macOS only
    embeddenator-inference/ # future
    embeddenator-runtime/   # optional: shared concurrency config
  src/                      # facade crate (keeps current `embeddenator` API stable)
  tests/
  benches/
  docs/
```

Members and dependency direction (acyclic):

- `embeddenator-ternary` (leaf)
- `embeddenator-vsa` (leaf-ish; may depend on `embeddenator-ternary` if you unify “trit” types)
- `embeddenator-codebook` → (`ternary`, `vsa`)
- `embeddenator-correction` → (`ternary`, `vsa`, `codebook`)
- `embeddenator-embrfs` → (`ternary`, `vsa`, `codebook`, `correction`)
- `embeddenator-fuse` → (`embrfs`) + `fuser`/`libc` (platform-specific)
- `embeddenator-inference` → (`vsa`, `codebook`, maybe `ternary`) + (optional async/network deps later)
- `embeddenator-runtime` → (no functional deps; only config/traits/utilities)
- Facade crate `embeddenator` → re-exports from the above, plus current `src/cli.rs` and the `embeddenator` bin

This structure keeps extractable “math crates” independent of filesystem/CLI and keeps platform-specific `fuse` isolated.

---

## (2) What belongs in each crate

### `crates/embeddenator-ternary`
Purpose: balanced ternary primitives and fast, deterministic arithmetic.

Belongs here
- `Trit`, `Tryte*`, `Word*` types, conversion, formatting.
- Saturating / wrapping ternary arithmetic rules.
- Bit/tryte packing, compact encodings (if any).
- SIMD-optional kernels (feature-gated) for ternary ops.

Does not belong here
- VSA vector operations, hashing, random generation.
- File formats, IO, serde formats (unless the type’s canonical encoding is intrinsic).

Public API guidance
- Prefer small, `#[repr(transparent)]` newtypes around integers when possible.
- Make invariants explicit (e.g., `Tryte` always in -13..=13). Enforce on construction.
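
A minimal sketch of the newtype guidance above, assuming a 3-trit tryte stored in an `i8`. The `Tryte` layout, constants, and method names are illustrative assumptions, not the crate's actual API.

```rust
/// Hypothetical 3-trit tryte; the wrapped value is always in -13..=13.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Tryte(i8);

impl Tryte {
    pub const MIN: i8 = -13;
    pub const MAX: i8 = 13;

    /// Checked constructor: rejects values outside the valid range.
    pub fn new(value: i8) -> Option<Self> {
        if (Self::MIN..=Self::MAX).contains(&value) {
            Some(Self(value))
        } else {
            None
        }
    }

    /// Saturating constructor: clamps out-of-range input into the valid range.
    pub fn saturating_from(value: i32) -> Self {
        Self(value.clamp(Self::MIN as i32, Self::MAX as i32) as i8)
    }

    /// Returns the wrapped integer.
    pub fn get(self) -> i8 {
        self.0
    }

    /// Saturating addition: the result stays inside -13..=13.
    pub fn saturating_add(self, rhs: Self) -> Self {
        Self::saturating_from(i32::from(self.0) + i32::from(rhs.0))
    }
}
```

Because the field is private, every `Tryte` outside the defining module goes through one of the two constructors, so the range invariant cannot be bypassed.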

### `crates/embeddenator-vsa`
Purpose: VSA representations (sparse ternary vectors) + algebraic operations.

Belongs here
- `SparseVec`, `HyperVec` (if it’s a core representation) and operations: bundle/bind/permutation/similarity.
- Deterministic PRNG strategy for vector generation (seeded / reproducible).
- “Kernel” layer traits if you want multiple backends later (dense, sparse, GPU).

Does not belong here
- Codebook semantics, manifests, correction stores.
- Filesystem chunking policy (except generic block encode/decode primitives).

API guidance
- Keep pure functions where feasible: `fn bundle(a, b) -> c` style or `impl` methods returning new values.
- If you need in-place ops for performance, expose them as explicit `*_in_place` methods.
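
The pure versus in-place split might look like the sketch below. `DenseTernary` is a hypothetical stand-in type, and the pairwise clamped sum is only one possible bundle rule, chosen here for brevity.

```rust
/// Hypothetical dense ternary vector; every element is -1, 0, or +1.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct DenseTernary(pub Vec<i8>);

impl DenseTernary {
    /// Pure variant: leaves both inputs untouched and returns a new vector.
    pub fn bundle(&self, other: &Self) -> Self {
        let mut out = self.clone();
        out.bundle_in_place(other);
        out
    }

    /// In-place variant: overwrites `self`, avoiding an allocation.
    /// Each element becomes the element-wise sum clamped to -1..=1.
    pub fn bundle_in_place(&mut self, other: &Self) {
        assert_eq!(self.0.len(), other.0.len(), "dimension mismatch");
        for (a, b) in self.0.iter_mut().zip(&other.0) {
            *a = (*a + *b).clamp(-1, 1);
        }
    }
}
```

Implementing the pure method on top of the in-place one keeps a single code path to test and optimize.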

### `crates/embeddenator-codebook`
Purpose: mapping symbols/chunks/words to vectors and metadata.

Belongs here
- `Codebook`, `BalancedTernaryWord`, `ProjectionResult`, `WordMetadata`, scoring utilities.
- Projection/search utilities that don’t require filesystem concepts.

Does not belong here
- On-disk manifest layout of a filesystem.
- Correction layer persistence.

### `crates/embeddenator-correction`
Purpose: correctness and reconstruction guarantees.

Belongs here
- `CorrectionStore`, `ReconstructionVerifier`, `ChunkCorrection` and algorithms that restore 100% fidelity.
- Statistics + validation utilities.

Does not belong here
- CLI rendering.
- FUSE plumbing.

### `crates/embeddenator-embrfs`
Purpose: “holographic filesystem” domain model and ingest/export logic.

Belongs here
- `EmbrFS`, `Engram`, `Manifest`, `FileEntry`, chunking policy.
- Directory ingestion (sync by default; async optionally via features).
- Serialization/deserialization glue for engrams/manifests.

Does not belong here
- Platform-specific `fuser` integration.

### `crates/embeddenator-fuse` (feature / platform gated)
Purpose: isolate all OS/FFI boundary code.

Belongs here
- `EngramFS` and `fuser`-related types.
- Any `unsafe` required for FUSE/FFI should be concentrated here.

### `crates/embeddenator-inference` (future)
Purpose: inference/retrieval logic that consumes VSA + codebook to answer queries.

Belongs here
- Query model: “retrieve top-k”, “explain match”, “compose hypothesis”, etc.
- Index structures (sharded similarity search, caches).
- Pipeline orchestration (batch queries, streaming results) without binding to CLI.

Design constraint
- Avoid taking a dependency on filesystem/EmbrFS unless truly necessary; prefer passing in traits like “vector store” and “document source”.
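
One way to express this constraint is a small trait that inference code consumes instead of `EmbrFS`. The `VectorStore` trait, `InMemoryStore`, and `best_match` below are hypothetical names used only for illustration.

```rust
use std::cmp::Reverse;

/// Hypothetical read-only view over stored dense ternary vectors.
pub trait VectorStore {
    fn len(&self) -> usize;
    fn vector(&self, id: usize) -> &[i8];
}

/// Simple in-memory implementation, e.g. for tests.
pub struct InMemoryStore {
    pub vectors: Vec<Vec<i8>>,
}

impl VectorStore for InMemoryStore {
    fn len(&self) -> usize {
        self.vectors.len()
    }
    fn vector(&self, id: usize) -> &[i8] {
        &self.vectors[id]
    }
}

/// Dot product of two ternary vectors.
fn dot(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(x, y)| i32::from(*x) * i32::from(*y)).sum()
}

/// Returns `(id, score)` of the best match; ties go to the lowest id.
/// Depends only on the trait, not on any filesystem type.
pub fn best_match<S: VectorStore + ?Sized>(store: &S, query: &[i8]) -> Option<(usize, i32)> {
    (0..store.len())
        .map(|id| (id, dot(store.vector(id), query)))
        .max_by_key(|&(id, score)| (score, Reverse(id)))
}
```

A filesystem-backed store could implement the same trait in `embeddenator-embrfs` without `embeddenator-inference` ever depending on it.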

### `crates/embeddenator-runtime` (optional)
Purpose: shared concurrency configuration and execution traits.

Belongs here
- A small abstraction layer so core crates don’t hard-depend on `rayon`/`tokio`.
- Example: `trait Executor { fn scope(...); fn parallel_for(...); }`
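
A possible shape for such a trait, sketched with the standard library only. The exact signature is an assumption; only `parallel_for` is shown, and a Rayon-backed implementation would sit behind the same trait in a higher-level crate.

```rust
use std::thread;

/// Hypothetical execution abstraction for core crates.
pub trait Executor {
    /// Calls `f(i)` once for every `i` in `0..n`; calls may run concurrently.
    fn parallel_for<F>(&self, n: usize, f: F)
    where
        F: Fn(usize) + Sync;
}

/// Runs everything on the calling thread.
pub struct Sequential;

impl Executor for Sequential {
    fn parallel_for<F>(&self, n: usize, f: F)
    where
        F: Fn(usize) + Sync,
    {
        (0..n).for_each(f);
    }
}

/// Splits `0..n` into contiguous ranges, one scoped thread per range.
pub struct ScopedThreads {
    pub threads: usize,
}

impl Executor for ScopedThreads {
    fn parallel_for<F>(&self, n: usize, f: F)
    where
        F: Fn(usize) + Sync,
    {
        let threads = self.threads.max(1);
        let per_thread = (n + threads - 1) / threads; // ceiling division
        let f = &f;
        thread::scope(|s| {
            for t in 0..threads {
                let start = t * per_thread;
                let end = (start + per_thread).min(n);
                if start >= end {
                    break;
                }
                s.spawn(move || (start..end).for_each(f));
            }
        });
    }
}
```

Core crates would take `&impl Executor` and never name a concrete thread pool.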

Pragmatic note
- If you want maximum simplicity, skip this crate and let `embrfs`/`inference` choose concurrency directly.

### Facade crate `embeddenator` (root)
Purpose: keep the existing public API and binary stable while you refactor.

Belongs here
- `pub use ...` re-exports to preserve `use embeddenator::SparseVec` etc.
- CLI entrypoint and current `src/cli.rs`.

---

## (3) Concurrency primitives and patterns

### Rule of thumb: CPU compute != async runtime
- Balanced ternary kernels and VSA operations are CPU-bound → prefer **Rayon** or explicit threads.
- File IO (ingest directories, read/write engrams) can be sync or async, but should not force an async dependency on core compute crates.

### Recommended: Rayon for data-parallel compute
Use Rayon where work is “embarrassingly parallel”:
- Encoding blocks/chunks in parallel (per-file or per-block).
- Bundling many vectors via map-reduce.
- Similarity scoring across a candidate set.

Patterns
- Map-reduce: `par_iter().map(f).reduce(|| id, combine)`
- Chunked parallelism: avoid tiny tasks; operate on blocks (e.g., 64–1024 items per job).
- Deterministic results: ensure your reduction is associative/commutative or define a deterministic reduction order (e.g., stable chunk boundaries).
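
The chunked map-reduce pattern with a fixed reduction order can be sketched as follows. This uses `std::thread::scope` as a dependency-free stand-in (one thread per chunk, which a real pool would avoid); with Rayon the same shape would presumably be `par_chunks(..).map(..).reduce(..)`. The function name `bundle_counts` is hypothetical.

```rust
use std::thread;

/// Sums per-dimension votes over `vectors`.
/// Map: each fixed-size chunk is accumulated on its own scoped thread.
/// Reduce: partial sums are combined in chunk order on the calling thread.
pub fn bundle_counts(vectors: &[Vec<i8>], dim: usize, chunk_size: usize) -> Vec<i32> {
    let partials: Vec<Vec<i32>> = thread::scope(|s| {
        let handles: Vec<_> = vectors
            .chunks(chunk_size.max(1))
            .map(|chunk| {
                s.spawn(move || {
                    let mut acc = vec![0i32; dim];
                    for v in chunk {
                        for (a, x) in acc.iter_mut().zip(v) {
                            *a += i32::from(*x);
                        }
                    }
                    acc
                })
            })
            .collect();
        // Joining in spawn order keeps the partials in chunk order.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });

    partials.into_iter().fold(vec![0i32; dim], |mut total, p| {
        for (t, x) in total.iter_mut().zip(p) {
            *t += x;
        }
        total
    })
}
```

Integer addition gives the same answer in any order; the fixed chunk boundaries and ordered fold matter once the combine step is not associative, for example floating-point sums or clamped bundling.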

Thread pool strategy
- Build a dedicated pool for heavy workloads (especially in a library context) rather than relying on the global pool, so embedding applications can control threads.
- Expose a “configure threads” entrypoint at the `embrfs`/`inference` boundary.
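
A "configure threads" entrypoint could be as small as the value type below. `ConcurrencyConfig` and its methods are hypothetical, and the default batch size of 256 is an arbitrary pick; the higher-level crate would hand the thread count to whichever pool it builds.

```rust
use std::thread;

/// Hypothetical concurrency settings owned by `embrfs`/`inference`,
/// never by the core math crates.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ConcurrencyConfig {
    threads: usize,
    batch_size: usize,
}

impl ConcurrencyConfig {
    /// Defaults: one thread per available core (1 if unknown), batches of 256.
    pub fn detect() -> Self {
        let threads = thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        Self { threads, batch_size: 256 }
    }

    /// Overrides the thread count; zero is treated as one.
    pub fn with_threads(mut self, threads: usize) -> Self {
        self.threads = threads.max(1);
        self
    }

    /// Overrides the batch size; zero is treated as one.
    pub fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = batch_size.max(1);
        self
    }

    pub fn threads(&self) -> usize {
        self.threads
    }

    pub fn batch_size(&self) -> usize {
        self.batch_size
    }

    /// Number of batches needed to cover `items` work items.
    pub fn batches_for(&self, items: usize) -> usize {
        (items + self.batch_size - 1) / self.batch_size
    }
}
```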

### Inference-specific execution model (recommended)
For “future inference logic” (retrieval + scoring), the most reliable scaling pattern is *read-mostly sharded state + per-request parallel scoring*.

State layout
- Store codebooks/indices as immutable snapshots: `Arc<IndexState>`.
- For hot reload / incremental updates, swap the snapshot atomically (copy-on-write): readers never lock; writers build a new snapshot then publish.
  - If you want a crate for this later, consider `arc-swap` (optional dependency) at the `embeddenator-inference` boundary.
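
The snapshot-swap idea can be sketched with the standard library alone. Note the difference from the description above: this stand-in holds a brief read lock while cloning the `Arc`, so readers are not strictly lock-free as they would be with `arc-swap`. `IndexState` fields and `SharedIndex` are hypothetical.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical immutable index snapshot.
pub struct IndexState {
    pub version: u64,
    pub ids: Vec<u32>,
}

/// Holder for the currently published snapshot.
pub struct SharedIndex {
    current: RwLock<Arc<IndexState>>,
}

impl SharedIndex {
    pub fn new(initial: IndexState) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Readers clone the `Arc`; the read lock is held only for the clone,
    /// and the returned snapshot stays valid after later publishes.
    pub fn snapshot(&self) -> Arc<IndexState> {
        let guard = self.current.read().expect("lock poisoned");
        Arc::clone(&*guard)
    }

    /// Writers build `next` beforehand, then swap it in under a short write lock.
    pub fn publish(&self, next: IndexState) {
        let next = Arc::new(next);
        *self.current.write().expect("lock poisoned") = next;
    }
}
```

A query takes one `snapshot()` at the start and uses it throughout, so it sees a consistent index even if a reload happens mid-query.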

Query parallelism
- **Across queries**: bound concurrency with a work queue (Crossbeam bounded channel) or a Tokio semaphore (service mode).
- **Within a query**: use Rayon for scoring candidates in parallel.

Top-k pattern (deterministic + low contention)
- Each worker computes a thread-local top-k (small binary heap or fixed-size array).
- Reduce thread-local top-k sets into a single top-k at the end (single-thread or tree-reduction).
- Avoid pushing into a shared heap from multiple threads.
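
A sketch of the per-worker top-k followed by a single merge, using scoped threads from the standard library. Scores are passed in precomputed to keep the example short (a real worker would compute similarity per candidate), ids are slice positions, and ties are broken toward the lower id. All names are hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::thread;

/// Top-k of one shard, kept in a private min-heap of `(score, Reverse(id))`.
fn shard_top_k(scores: &[i32], base_id: usize, k: usize) -> Vec<(i32, Reverse<usize>)> {
    let mut heap = BinaryHeap::with_capacity(k + 1);
    for (offset, &score) in scores.iter().enumerate() {
        heap.push(Reverse((score, Reverse(base_id + offset))));
        if heap.len() > k {
            heap.pop(); // evicts the current worst entry
        }
    }
    heap.into_iter().map(|Reverse(entry)| entry).collect()
}

/// Global top-k as `(id, score)`, best first; no heap is shared between threads.
pub fn top_k(scores: &[i32], k: usize, shard_len: usize) -> Vec<(usize, i32)> {
    let shard_len = shard_len.max(1);
    let mut merged: Vec<(i32, Reverse<usize>)> = thread::scope(|s| {
        let handles: Vec<_> = scores
            .chunks(shard_len)
            .enumerate()
            .map(|(i, shard)| s.spawn(move || shard_top_k(shard, i * shard_len, k)))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    });
    // Single-threaded final reduction: sort descending, keep k.
    merged.sort_unstable_by(|a, b| b.cmp(a));
    merged.truncate(k);
    merged.into_iter().map(|(score, Reverse(id))| (id, score)).collect()
}
```

Since every `(score, id)` pair is unique, the result does not depend on shard size or thread scheduling.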

Sharding pattern
- Partition candidates into shards (by id range or hash) so each worker touches mostly local memory.
- Keep shard data contiguous (arrays/slices) where possible to improve cache locality.

### Crossbeam for pipelines and scoped threads
Use Crossbeam where you need explicit backpressure or a producer/consumer pipeline:
- Ingest pipeline: filesystem walker → reader → encoder → bundler.
- Streaming inference: query stream → candidate generation → scoring → top-k aggregation.

Primitives
- `crossbeam::channel` for bounded queues (backpressure).
- `crossbeam::scope` for spawning threads that borrow stack references (avoids `Arc` cloning when safe). `std::thread::scope` (stable since Rust 1.63) covers the same need without an extra dependency.
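
A two-stage pipeline with backpressure, sketched with `std::sync::mpsc::sync_channel` as a dependency-free stand-in for a bounded Crossbeam channel. The stage names and the placeholder "encoding" (counting non-zero bytes) are assumptions made for illustration.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// reader -> encoder -> collector, with at most `capacity` items queued per stage.
pub fn run_pipeline(inputs: Vec<Vec<u8>>, capacity: usize) -> Vec<usize> {
    let (raw_tx, raw_rx) = sync_channel::<Vec<u8>>(capacity);
    let (enc_tx, enc_rx) = sync_channel::<usize>(capacity);

    // Stage 1: produces owned chunks; `send` blocks while the queue is full.
    let reader = thread::spawn(move || {
        for chunk in inputs {
            if raw_tx.send(chunk).is_err() {
                break; // downstream hung up
            }
        }
        // `raw_tx` is dropped here, which closes the channel.
    });

    // Stage 2: placeholder encoder, counts non-zero bytes per chunk.
    let encoder = thread::spawn(move || {
        for chunk in raw_rx {
            let encoded = chunk.iter().filter(|&&b| b != 0).count();
            if enc_tx.send(encoded).is_err() {
                break;
            }
        }
    });

    // Stage 3: collect on the calling thread until the encoder closes its channel.
    let results: Vec<usize> = enc_rx.into_iter().collect();
    reader.join().expect("reader panicked");
    encoder.join().expect("encoder panicked");
    results
}
```

With one producer and one consumer per stage, output order matches input order; a fan-out encoder stage would need sequence numbers to restore it.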

### Tokio only when you truly need async
Tokio makes sense if inference becomes network-facing or you need high-concurrency IO:
- Serving requests (HTTP/gRPC) in a future service.
- Async file IO where it actually helps. Gains are often limited, since `tokio::fs` dispatches file operations to a blocking thread pool, but it can still help when multiplexing many operations.

Recommended boundary
- Keep Tokio confined to `embeddenator-inference` (service mode) or CLI.
- Offload CPU-heavy work from async tasks with `tokio::task::spawn_blocking`, and run the Rayon or thread-based compute inside that blocking closure.

### When to use plain threads
Plain `std::thread` is fine for:
- A small, fixed number of long-lived worker threads.
- Pinning specific responsibilities (e.g., one writer thread).

If you go this route, prefer channels + owned messages; avoid sharing mutable state.

---

## (4) Memory safety considerations (and performance-friendly safety)

### Minimize shared mutable state
- Prefer immutable data + functional transforms.
- For shared read-mostly structures (codebooks, indices): store as `Arc<T>` and update by swapping an `Arc` (copy-on-write model).

### Use lock types intentionally
- `RwLock` (or `parking_lot::RwLock` if you choose) for read-heavy maps.
- `Mutex` for infrequent mutation.
- Atomics for counters/metrics only.

Avoid
- Fine-grained locking inside tight inner loops (similarity scoring, ternary kernels).

### Ensure thread-safe RNG
Vector generation often uses randomness:
- Don’t share a single RNG across threads.
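- Use a separate RNG per task (or per thread), seeded deterministically from a master seed plus a task or chunk id. Seeding by thread index is reproducible only when work is assigned to threads statically; under a work-stealing scheduler such as Rayon, which thread runs which task can change between runs.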
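
A sketch of deterministic seeding keyed by task id. SplitMix64 is used here only because it fits in a few lines; it is not necessarily the generator the project uses, and the seed-mixing scheme in `task_rng` is an illustrative assumption rather than a vetted stream-splitting construction.

```rust
/// SplitMix64: a small, fast, non-cryptographic PRNG.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Derives a distinct generator for each `(master_seed, task_id)` pair.
/// The result depends only on the inputs, never on which thread runs the task.
pub fn task_rng(master_seed: u64, task_id: u64) -> SplitMix64 {
    let mut mixer = SplitMix64::new(master_seed ^ task_id.wrapping_mul(0x9E37_79B9_7F4A_7C15));
    SplitMix64::new(mixer.next_u64())
}

/// Generates `len` trits in {-1, 0, +1} for one task.
/// (`% 3` has a negligible bias for a 64-bit source.)
pub fn random_trits(master_seed: u64, task_id: u64, len: usize) -> Vec<i8> {
    let mut rng = task_rng(master_seed, task_id);
    (0..len).map(|_| (rng.next_u64() % 3) as i8 - 1).collect()
}
```

Each task owns its generator, so no RNG is ever shared or locked, and rerunning with the same master seed reproduces every vector.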

### Avoid aliasing in parallel reductions
For parallel bundling/scoring:
- Compute into thread-local accumulators, then merge.
- Do not write into a shared `Vec` from multiple threads without chunk partitioning.
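
Chunk partitioning of a shared output buffer can be sketched with `chunks_mut`, which hands each worker a disjoint `&mut` slice so the borrow checker rules out aliasing. The function name and the dense `i8` representation are assumptions for the example.

```rust
use std::thread;

/// Scores every candidate against `query` (dot product).
/// Each scoped thread writes only to its own slice of `scores`.
pub fn score_all(candidates: &[Vec<i8>], query: &[i8], chunk_size: usize) -> Vec<i32> {
    let chunk_size = chunk_size.max(1);
    let mut scores = vec![0i32; candidates.len()];
    thread::scope(|s| {
        // Output and input are chunked identically, so slot i matches candidate i.
        for (out, input) in scores
            .chunks_mut(chunk_size)
            .zip(candidates.chunks(chunk_size))
        {
            s.spawn(move || {
                for (slot, cand) in out.iter_mut().zip(input) {
                    *slot = cand
                        .iter()
                        .zip(query)
                        .map(|(a, b)| i32::from(*a) * i32::from(*b))
                        .sum();
                }
            });
        }
    });
    scores
}
```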

### Keep `unsafe` isolated and justified
- If you introduce `unsafe` for SIMD or unchecked indexing, keep it behind a small module boundary (ideally inside `embeddenator-ternary` or `embeddenator-vsa` kernel modules) with extensive tests.
- Keep all FFI `unsafe` inside `embeddenator-fuse`.

### Data representation invariants
- For sparse vectors: maintain sorted/unique `pos`/`neg` indices; enforce on construction or normalize after ops.
- Consider smaller index types (`u32`) if `DIM` fits, to reduce memory bandwidth.
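
Enforcing the sparse-vector invariants at construction might look like this. The field layout, the `u32` index type, and the choice to reject (rather than cancel) an index that appears in both lists are assumptions, not the crate's actual `SparseVec`.

```rust
/// Hypothetical sparse ternary vector: indices of +1 and -1 entries.
/// Invariants: both lists sorted, duplicate-free, mutually disjoint, and < dim.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SparseVec {
    pos: Vec<u32>,
    neg: Vec<u32>,
}

impl SparseVec {
    /// Normalizes ordering and duplicates; returns `None` if an index is out of
    /// range or appears in both `pos` and `neg`.
    pub fn new(dim: u32, mut pos: Vec<u32>, mut neg: Vec<u32>) -> Option<Self> {
        pos.sort_unstable();
        pos.dedup();
        neg.sort_unstable();
        neg.dedup();

        // After sorting, the last element is the maximum.
        let in_range = pos.last().map_or(true, |&i| i < dim)
            && neg.last().map_or(true, |&i| i < dim);
        let disjoint = pos.iter().all(|i| neg.binary_search(i).is_err());

        if in_range && disjoint {
            Some(Self { pos, neg })
        } else {
            None
        }
    }

    pub fn pos(&self) -> &[u32] {
        &self.pos
    }

    pub fn neg(&self) -> &[u32] {
        &self.neg
    }
}
```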

### Avoid accidental quadratic behavior in hot paths
- Prefer set-like operations using two-pointer merge on sorted `pos`/`neg` lists over `HashSet` in inner loops.
- If an operation needs a scratch buffer, prefer reusing it per-thread (thread-local) rather than allocating per vector.
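
The two-pointer merge mentioned above, applied to a sparse ternary dot product. It assumes the inputs are sorted and duplicate-free; the function names are hypothetical.

```rust
use std::cmp::Ordering;

/// Counts elements common to two sorted, duplicate-free slices in O(|a| + |b|),
/// with no hashing and no allocation.
pub fn intersection_count(a: &[u32], b: &[u32]) -> usize {
    let (mut i, mut j, mut count) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                count += 1;
                i += 1;
                j += 1;
            }
        }
    }
    count
}

/// Dot product of two sparse ternary vectors given as (pos, neg) index lists:
/// matching signs contribute +1, opposite signs contribute -1.
pub fn sparse_dot(a_pos: &[u32], a_neg: &[u32], b_pos: &[u32], b_neg: &[u32]) -> i64 {
    let agree = intersection_count(a_pos, b_pos) + intersection_count(a_neg, b_neg);
    let disagree = intersection_count(a_pos, b_neg) + intersection_count(a_neg, b_pos);
    agree as i64 - disagree as i64
}
```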

### Prefer “owned-message passing” across threads
- When crossing thread boundaries, prefer `Arc<[T]>`, `Arc<Vec<T>>`, or `Vec<T>` moved through a channel.
- Avoid sharing `&mut` across threads; let each worker own its chunk of output.

---

## (5) Migration steps (incremental, low-risk)

1) Convert to a workspace without breaking users
- Update root `Cargo.toml` to include `[workspace]` and add members under `crates/`.
- Keep the existing `embeddenator` crate as a facade (same name) to preserve public API and binary entrypoints.

2) Extract leaf crates first
- Move `src/ternary.rs` → `crates/embeddenator-ternary/src/lib.rs`.
- Move `src/vsa.rs` → `crates/embeddenator-vsa/src/lib.rs`.
- Add minimal `Cargo.toml`s with only needed deps.

3) Re-export through the facade
- In the root crate, replace `pub mod ternary; pub mod vsa;` with `pub use embeddenator_ternary::*; pub use embeddenator_vsa::*;` (or re-export specific items to preserve names).

4) Extract dependent crates in order
- `codebook` → `correction` → `embrfs` → `fuse`.
- Keep tests compiling by pointing them at the facade crate until the dust settles.

5) Introduce feature flags for concurrency
- Add `parallel` feature (enables Rayon) in `vsa`/`inference` (or only at higher layers).
- Add `async` feature (Tokio) only in `inference`/CLI if needed.

6) Add a single concurrency “entrypoint”
- Provide one place to configure thread pools and batch sizing (ideally `embrfs` and `inference`).
- Avoid having core crates create global pools implicitly.

7) Validate correctness + performance
- Keep existing tests; add targeted tests for parallel determinism (same input → same output).
- Add Criterion benches at the crate boundary (e.g., bundle N vectors, score M candidates).

---

## Suggested defaults (practical)
- Use Rayon for CPU-heavy work: bundling, scoring, encoding blocks.
- Use Crossbeam channels for multi-stage ingest/inference pipelines.
- Keep Tokio out of core unless you’re building a service.
- Keep `unsafe` minimal and confined to kernel/FFI crates.