Module ingest 

Handler for the ingest CLI subcommand.

Bulk-ingests every file under a directory that matches a glob pattern. Each matched file is persisted as a separate memory using the same validation, chunking, embedding, and persistence pipeline as remember, but executed in-process so the ONNX model is loaded only once per invocation. This is the v1.0.32 Onda 4B (finding A2) refactor: it replaced the fork/spawn-per-file pipeline, in which every file paid the ~17s ONNX cold-start cost, with an in-process loop that reuses a warm embedder (the daemon when available, in-process Embedder::new otherwise).
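
The backend selection can be pictured with the following minimal sketch. DaemonClient, Embedder, Backend, and acquire_backend are hypothetical stand-ins for illustration only, not this crate's actual API.

```rust
struct DaemonClient;
struct Embedder;

impl DaemonClient {
    // Stand-in for probing an already-running daemon that holds a warm model.
    fn connect() -> Option<Self> { None }
}

impl Embedder {
    // Stand-in for the in-process embedder; the real Embedder::new pays the
    // ~17s ONNX cold start, which is why it must run at most once.
    fn new() -> Self { Embedder }
}

enum Backend {
    Daemon(DaemonClient), // warm embedder already resident in the daemon
    Local(Embedder),      // model loaded in this process, then reused per file
}

fn acquire_backend() -> Backend {
    match DaemonClient::connect() {
        Some(client) => Backend::Daemon(client),
        None => Backend::Local(Embedder::new()),
    }
}

fn main() {
    // Acquired once per ingest invocation; every matched file reuses it.
    let _backend = acquire_backend();
}
```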

Memory names are derived from file basenames (kebab-case, lowercase, ASCII alphanumerics + hyphens). Output is line-delimited JSON: one object per processed file (success or error), followed by a final summary object. Designed for streaming consumption by agents.
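
The following sketch illustrates the documented naming rule and the shape of the NDJSON stream; memory_name and the JSON field names are hypothetical illustrations, not the crate's implementation.

```rust
// Derive a memory name from the file stem: lowercase, ASCII alphanumerics,
// runs of other characters collapsed to a single hyphen.
use std::path::Path;

fn memory_name(path: &Path) -> String {
    let stem = path.file_stem().and_then(|s| s.to_str()).unwrap_or("memory");
    let mut name = String::with_capacity(stem.len());
    let mut last_hyphen = true; // suppress a leading hyphen
    for ch in stem.chars() {
        if ch.is_ascii_alphanumeric() {
            name.push(ch.to_ascii_lowercase());
            last_hyphen = false;
        } else if !last_hyphen {
            name.push('-');
            last_hyphen = true;
        }
    }
    name.trim_end_matches('-').to_string()
}

fn main() {
    assert_eq!(memory_name(Path::new("notes/Meeting Notes_2024.md")), "meeting-notes-2024");
    // Each processed file emits one NDJSON object, then a final summary object,
    // e.g. (field names illustrative):
    //   {"file":"notes/Meeting Notes_2024.md","memory":"meeting-notes-2024","status":"ok"}
    //   {"summary":{"processed":1,"errors":0}}
}
```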

Incremental pipeline (v1.0.43)

Phase A runs on a rayon thread pool (size = --ingest-parallelism): read + chunk + embed + NER per file. Results are sent immediately via a bounded mpsc::sync_channel to Phase B so persistence starts as soon as the first file completes — no waiting for all files to finish Phase A.

Phase B runs on the main thread: it receives staged files from the channel, writes each file to SQLite (WAL absorbs the per-file commits), and emits an NDJSON progress event to stderr as each file is persisted. Connection is not Sync, so it never crosses a thread boundary.

This fixes B1: with the old 2-phase design, a 50-file corpus with 27s/file NER would spend ~22min in Phase A alone, exceeding the user’s 900s timeout before Phase B (and any DB writes) could begin. With this pipeline, the first file is committed within seconds of starting.
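
The pipeline shape described above can be sketched as follows, assuming rayon and std::sync::mpsc::sync_channel as named in this section; Staged, stage_file, persist_file, run, and the channel bound are illustrative placeholders for the crate's real per-file work, not its actual API.

```rust
use std::path::PathBuf;
use std::sync::mpsc::sync_channel;
use rayon::prelude::*;

struct Staged {
    path: PathBuf,
    // chunks, embeddings, entities, ... elided
}

fn stage_file(path: PathBuf) -> Staged {
    // Stand-in for the Phase A per-file work: read + chunk + embed + NER.
    Staged { path }
}

fn persist_file(staged: &Staged) {
    // Stand-in for the Phase B write: the SQLite commit is elided; emit an
    // NDJSON progress event to stderr (field names illustrative).
    eprintln!(r#"{{"file":{:?},"status":"ok"}}"#, staged.path);
}

fn run(files: Vec<PathBuf>, parallelism: usize) -> Result<(), rayon::ThreadPoolBuildError> {
    let pool = rayon::ThreadPoolBuilder::new().num_threads(parallelism).build()?;
    // Bounded channel: Phase A blocks if it runs too far ahead of persistence.
    let (tx, rx) = sync_channel::<Staged>(parallelism);

    // Phase A: fan files out across the pool; each result is sent as soon as
    // that file finishes, so Phase B does not wait for the whole corpus.
    pool.spawn(move || {
        files.into_par_iter().for_each_with(tx, |tx, path| {
            let staged = stage_file(path);
            let _ = tx.send(staged); // receiver gone means ingest aborted; just stop
        });
        // all senders are dropped here, which ends the Phase B loop below
    });

    // Phase B: main thread only; the SQLite connection is not Sync, so it
    // never leaves this thread.
    for staged in rx {
        persist_file(&staged);
    }
    Ok(())
}

fn main() {
    run(vec![PathBuf::from("example.md")], 4).unwrap();
}
```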

Structs

IngestArgs

Functions

run