lantern 0.2.4

Local-first, provenance-aware semantic search for agent activity
# Lantern TODO

Tracking issues and improvements. The daily agent should pick one per session.

## P0 — Highest Leverage

### 1. MCP Server (`lantern mcp`)
**Status: DONE** — `src/mcp.rs` exposes 6 tools via rmcp. `lantern mcp --store <path>` (stdio) or `--port` (TCP).

### 2. Agent-Aware Ingestion
**Why:** Nothing in the ingest path is agent-aware beyond JSONL. No sessions, no turn linkage, no tool-call vs message distinction, no auto-capture hook.
- [x] JSONL extractor should tag: role (user/assistant/tool), turn_id, tool_name, timestamp
- [x] Add session/turn metadata to ingested chunks
- [x] Streaming ingest mode or filesystem hook (watch for new transcripts) — `lantern ingest <path> --follow [--follow-interval-secs N]` polls on an interval; unchanged files remain no-ops via the existing content-hash check
- [x] Append-only stdin ingest — `lantern ingest --stdin --uri <base> [--append]` now threads through `ingest::StdinIngestOptions { append: true }` so repeated stdin batches under the same base label accumulate as distinct sources (unique `{base}#{suffix}` URI per call) instead of overwriting; JSONL role/session/turn/tool/timestamp metadata is preserved per batch. Default behavior unchanged.
- [x] Named-pipe / FIFO stream ingest — `lantern ingest <fifo>` auto-detects Unix FIFOs (via `FileTypeExt::is_fifo`) and routes them through `ingest::ingest_fifo`, which reads until the writer closes and delegates to the stdin-append path. Each reader session lands as its own source under a `fifo://{abs_path}#{suffix}` URI, and `.jsonl`-named FIFOs still flow through the transcript extractor. Regular files, directories, and stdin are unchanged.
- Progress: follow mode now supports an optional idle timeout, so long-running transcript watches can exit cleanly after a quiet period.
- Progress: JSONL extractor now folds `tool_name` into the chunk prefix when `role="tool"` (OpenAI-style tool messages), so a line like `{"role":"tool","tool_name":"search",...}` becomes `[tool:search] ...` instead of the name-stripped `[tool] ...`. Previously the tool name survived as chunk metadata but was invisible to keyword search.
- Progress: directory `--follow` now installs a notify-backed filesystem watcher when possible, waking early on new/changed transcript files and falling back to polling when the watcher can't be created.
- Progress: JSONL extractor now also recognizes camelCase aliases (`sessionId`, `conversationId`, `threadId`, `turnId`, `messageId`, `toolName`, `createdAt`) alongside the existing snake_case keys, so transcripts emitted by JS/TS-style agent runtimes carry session/turn/tool/timestamp metadata without a manual rewrite. Snake_case keys still win when both are present, keeping prior behavior unchanged; focused tests cover each new alias plus the precedence guarantee.
- Covered by tests: FIFO follow re-opens on each writer close and multiplexes consecutive batches under `--follow`; follow loops can now terminate on idle timeout; directory follow wakes before the polling interval when new files appear; `role="tool"` + `tool_name` renders as `[tool:{name}]`; compact defaults now start decaying after two weeks, so moderately stale chunks decay sooner while the 30-day half-life remains the same.
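
The alias precedence and the `[tool:{name}]` prefix described above can be sketched roughly as follows. This is an illustration only, not the code in Lantern's extractor: the `Line` map stands in for a parsed JSONL object, `pick` and `chunk_prefix` are hypothetical helper names, and the `unknown` fallback role is an assumption.

```rust
use std::collections::HashMap;

/// Stand-in for one parsed JSONL object (hypothetical; a real extractor
/// would work on a JSON value rather than a flat string map).
type Line = HashMap<String, String>;

/// Return the snake_case value if present, otherwise the first camelCase
/// alias that exists. Snake_case wins when both are set.
fn pick<'a>(line: &'a Line, snake: &str, aliases: &[&str]) -> Option<&'a str> {
    line.get(snake)
        .or_else(|| aliases.iter().find_map(|a| line.get(*a)))
        .map(String::as_str)
}

/// Build the chunk prefix: `[tool:{name}]` for tool messages that carry a
/// name, otherwise `[{role}]`. The "unknown" fallback is an assumption.
fn chunk_prefix(line: &Line) -> String {
    let role = pick(line, "role", &[]).unwrap_or("unknown");
    match (role, pick(line, "tool_name", &["toolName"])) {
        ("tool", Some(name)) => format!("[tool:{name}]"),
        _ => format!("[{role}]"),
    }
}
```

With this shape, a line carrying only `toolName` still renders its tool name in the prefix, and adding `tool_name` to the same line overrides it.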

### 3. Reciprocal Rank Fusion (RRF) for Hybrid Search
**Status: DONE** — Replaced normalize-by-max BM25 with RRF (k=60). Dropped `--weight` parameter. `blend_hits` now ranks each result list and sums `1/(k+rank)` contributions per chunk_id.
- Replace `normalize_bm25` with RRF: `score = 1 / (k + rank)` where k=60
- Three lines of code, handles the "only one side has this hit" case naturally
- Drop the `--weight` parameter (RRF doesn't need it)
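
For reference, the fusion rule above can be sketched as below. This is a minimal illustration rather than the actual `blend_hits` code: `rrf_fuse` is a hypothetical name, chunk ids are plain strings, and ranks are assumed to start at 1.

```rust
use std::collections::HashMap;

/// RRF constant from the text above.
const K: f64 = 60.0;

/// Fuse ranked lists of chunk ids (best first) by summing 1/(k + rank).
/// Ranks start at 1 (an assumption). A chunk that is missing from a list
/// simply contributes nothing for that list.
fn rrf_fuse(lists: &[Vec<&str>]) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (K + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, s)| (id.to_string(), s))
        .collect();
    // Highest score first; ties broken by id so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Because only ranks enter the score, the BM25 and cosine scales never need to be normalized against each other, which is why no weight parameter is required.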

## P1 — Important

### 4. sqlite-vec Vector Index
**Status: DONE** — `src/store.rs` loads sqlite-vec, default-model embeddings dual-write into `chunks_vec_nomic_768`, semantic search auto-routes when eligible, and existing stores backfill the mirror on upgrade (schema v5).

### 5. Decay Weighting / Confidence Scoring
**Why:** Lantern treats all chunks as equally true. agent-memory-mcp has explicit usefulness ranking.
- Track access count, recency, and user feedback per chunk
- Decay older chunks that are never retrieved
- Surface confidence score in search results
- Consider: was this chunk ever used to answer a query successfully?

**Progress (slice 1):** `chunks.access_count` / `chunks.last_accessed_at` (schema v7) are now read by keyword, semantic, vec, and hybrid search paths and surfaced on `SearchHit`. `compute_confidence(now, last_accessed_at, timestamp_unix, access_count)` replaces the old timestamp-only helper with a deterministic blend of freshness decay (30-day time constant, 0.25 floor, `last_accessed_at` preferred over `timestamp_unix`) and access-count saturation (`1 - exp(-n/5)`). Ranking now uses confidence as a deterministic secondary tie-breaker when the primary score is equal. Retrieval still bumps `access_count` and `last_accessed_at` for returned hits across keyword, semantic, vec, and hybrid search paths.
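
The two confidence inputs named above can be sketched as follows. This is an illustration under assumptions, not Lantern's `compute_confidence`: the function names are hypothetical, the floor is assumed to be applied as a simple `max`, and the sketch stops at the components because the exact blend is not spelled out here.

```rust
/// 30-day time constant, in seconds.
const TAU_SECS: f64 = 30.0 * 86_400.0;
/// Lower bound on the freshness component.
const FLOOR: f64 = 0.25;

/// Freshness in [FLOOR, 1.0], or None when the chunk has no timestamp.
fn freshness(
    now: i64,
    last_accessed_at: Option<i64>,
    timestamp_unix: Option<i64>,
) -> Option<f64> {
    // `last_accessed_at` shadows `timestamp_unix` even when it is the
    // older of the two; this is `or`, not `max`.
    let reference = last_accessed_at.or(timestamp_unix)?;
    let age = (now - reference).max(0) as f64;
    Some((-age / TAU_SECS).exp().max(FLOOR))
}

/// Access-count saturation: 0 for never-retrieved chunks, approaching 1
/// as the count grows.
fn access_boost(access_count: u32) -> f64 {
    1.0 - (-(access_count as f64) / 5.0).exp()
}
```

Under these assumptions a chunk retrieved five times has an access boost of about 0.63, and a chunk last touched more than roughly six weeks ago sits at the 0.25 freshness floor.
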
- Progress: added `chunks.feedback_score` (schema v8) plus `lantern::feedback::{record_feedback,get_feedback_score}`; `SearchHit` now surfaces `feedback_score`, and confidence scoring folds it in with a neutral `0` default so existing stores are unchanged. Focused tests cover the new round-trip and confidence behavior, and `lantern feedback <chunk_id> up|down` now provides a direct CLI write path for the signal.
- Progress: added `lantern compact` as a background maintenance pass for stale access metadata. It decays `access_count` using a separate `access_decay_at` checkpoint so repeated runs stay idempotent, and search touches now refresh that checkpoint alongside `last_accessed_at`. Focused tests cover fresh vs stale rows and the search/compact round trip.
- Progress: compact defaults now start decaying after one week instead of waiting a full month, while keeping the 30-day half-life and CLI override knobs intact. A regression test now pins the default minimum-age gate. This makes the background maintenance pass slightly more aggressive; the broader automation tuning remains open.
- Progress: MCP now exposes chunk feedback through `lantern_feedback`, mirroring the CLI feedback write path and covered by a focused sync-path test.
- Progress: end-to-end test now asserts that negative feedback strictly lowers a chunk's `confidence` below a neutral peer after a second search, closing the symmetric gap next to the existing positive-feedback test.
- Progress: `lantern compact` now supports `--dry-run` previews so operators can inspect the hypothetical decay impact without mutating the store. The report and text output surface the preview mode, and focused tests cover the no-write path alongside the existing mutation tests.
- Progress: `lantern compact` now reports `skipped_recent_chunks` — the count of scanned chunks held back because their age was below `minimum_age_secs`. Surfaced in both text output (`skipped_recent=N`) and the JSON report so operators can see how much of a pass was no-op due to recency. Focused tests cover the recent-only path and a mixed recent/stale store where one chunk decays and one is skipped.
- Progress: `lantern compact` now also reports `decay_fraction` in the maintenance summary, so automation can see how much of a pass actually decayed chunks at a glance. The JSON report carries the field directly and focused tests cover the empty-store and mixed recent/stale cases.
- Progress: pinned the asymmetric `last_accessed_at.or(timestamp_unix)` precedence rule with a focused regression test — a stale `last_accessed_at` now provably shadows a fresh `timestamp_unix` (the decayed floor wins), guarding against an accidental "max" rewrite of the reference-picking logic.
- Progress: MCP now exposes compact decay through `lantern_compact` / `LanternServer::compact_sync`, and a dry-run regression test covers the transport-free code path without mutating the store.
- Progress: added `--min-confidence <f64>` to `lantern search` / `lantern query`. `SearchOptions::min_confidence` / `SemanticOptions::min_confidence` enforce the floor across keyword, semantic, vec-semantic, and (post-blend) hybrid paths, before `bump_access_metadata` — filtered chunks do not count as retrievals. Default `None` preserves existing behavior; MCP now exposes the same floor too, so the decay-aware filter is consistent across CLI and tool calls.
- Progress: search summary/text metadata now surfaces `access_count` and `feedback_score` alongside role/session/turn/tool/timestamp so confidence explanations show the access/feedback inputs directly.
- Progress: search summary/text metadata now also surfaces `last_accessed_at` when present, so the freshness component behind confidence is visible in the human-readable path too.
- Progress: search results now also surface `access_decay_at` in the detailed metadata line and JSON result payload, so the decay checkpoint used by compact is inspectable alongside confidence inputs.
- Progress: search results now expose a structured `confidence_breakdown` in JSON (`freshness`, `access_boost`, `base`, `feedback_factor`) so callers can explain why a hit's confidence landed where it did without re-deriving the formula.
- Progress: human-readable search output now includes a compact `breakdown=...` metadata token for the same confidence components, so `search` / `query` output stays inspectable without switching to JSON.
- Progress: the confidence-breakdown contract is now pinned by regression tests, including the JSON field set and the reconstruction invariant that the breakdown recomposes the public confidence score exactly.
- Progress: search confidence now also surfaces `freshness_source` (`last_accessed_at`, `timestamp_unix`, or `none`) in both JSON and text output, so the freshness component explains which timestamp actually drove the score. Focused tests pin the new serialized contract.
- Progress: added a focused regression test that pins the neutral-feedback no-op explicitly: when `feedback_score == 0`, `compute_confidence_breakdown` returns `feedback_factor == 0.0` and `confidence == base`, guarding the migration-default behavior from accidental drift.
- Progress: `lantern compact` now has regression coverage for stale decays, checkpoint idempotency, dry-run non-mutation, and decay-fraction reporting, so the access-metadata maintenance slice is pinned end-to-end.
- Progress: added `query_success_count` (schema v11), plus `lantern::query_success::{record_query_success,get_query_success_count}`; `SearchHit` now surfaces `query_success_count`, confidence scoring folds in a positive-only query-success factor, and focused tests cover the neutral default, positive lift, formatter breakdowns, and export/inspect/reindex schema-version bumps.
- Progress: tightened the `min_confidence` enforcement path so the floor and `bump_access_metadata` run exactly once per call across keyword, semantic, vec, and hybrid search. Each path now splits into a public function that applies the floor and bumps survivors and a private `*_candidates` helper that does neither, so hybrid can fuse raw candidates and only bump the blended survivors. A new regression test pins the hybrid invariant: when the blended floor drops a chunk, its `access_count` / `last_accessed_at` stay untouched even though the inner keyword and semantic passes saw it.
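
The checkpoint-based decay that the `lantern compact` entries above describe can be sketched as below. This is a simplified in-memory model, not the real maintenance pass: `Row`, `Report`, and `compact` are hypothetical names, the minimum age and half-life are passed in explicitly, and flooring the decayed count is an assumption.

```rust
/// Hypothetical in-memory row; the real columns live in the `chunks` table.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    access_count: u32,
    access_decay_at: i64, // unix seconds of the last decay checkpoint
}

#[derive(Debug, Default, PartialEq)]
struct Report {
    scanned: usize,
    decayed: usize,
    skipped_recent: usize,
}

impl Report {
    /// Share of scanned rows that were (or, in a dry run, would be) decayed.
    fn decay_fraction(&self) -> f64 {
        if self.scanned == 0 {
            0.0
        } else {
            self.decayed as f64 / self.scanned as f64
        }
    }
}

fn compact(
    rows: &mut [Row],
    now: i64,
    minimum_age_secs: i64,
    half_life_secs: f64,
    dry_run: bool,
) -> Report {
    let mut report = Report::default();
    for row in rows.iter_mut() {
        report.scanned += 1;
        let age = now - row.access_decay_at;
        if age < minimum_age_secs {
            report.skipped_recent += 1;
            continue;
        }
        report.decayed += 1;
        if !dry_run {
            // Halve the count once per elapsed half-life, then move the
            // checkpoint so an immediate re-run sees age 0 and skips the row.
            let factor = 0.5_f64.powf(age as f64 / half_life_secs);
            row.access_count = (row.access_count as f64 * factor).floor() as u32;
            row.access_decay_at = now;
        }
    }
    report
}
```

Measuring age from the checkpoint rather than from `last_accessed_at` is what keeps repeated runs from decaying the same interval twice.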

### 6. Knowledge Graph / Entity Extraction
**Why:** Lantern is flat, whereas mcp-memory-service and Agentwong's server build typed entities with relationships.
- Extract entities (people, projects, concepts, files) from ingested content
- Build typed relationships between entities
- Queryable graph layer on top of the flat chunk store
- Use LLM extraction or NER — can be a post-ingest step
- Progress: URL and email entity extraction now run during ingest. `src/entities.rs` extracts `http(s)://` URLs plus simple ASCII email addresses from chunk text, persists them into the schema-v10 `entities` / `chunk_entities` tables, and is covered by focused tests that verify deduplication and linking.
- Progress: `src/entities.rs` now also extracts conservative backtick-wrapped file-path literals (for example `src/main.rs` and `Cargo.toml`) into a new `filepath` entity kind, keeping the regex-free, deduplicated provenance layer moving toward richer graph edges.
- Progress: `src/entities.rs` now extracts `@mention` handles into a new `mention` entity kind. Conservative rules: the `@` must not be preceded by an email-local character (so emails stay emails), the body must be at least two characters from `[A-Za-z0-9._-]`, and at least one ASCII letter is required (so `@2024-01-15` and `@1.2.3` do not become mentions). Schema v10's `(kind, value)` shape absorbs the new kind without a migration, and focused tests cover basic mentions, mention/email coexistence, dedup, trailing-separator trimming, short-handle rejection, and the digit/date guard.
- Progress: entity data now has a small read API: `entity_kind_from_str` parses CLI/MCP kind filters and `list_entities` returns ordered, filterable entity listings with chunk-reference counts, plus regression tests for ordering, filtering, literal substring matching, and limit handling.
- Progress: `lantern entities` CLI surface now exposes the listing API, with `--kind {url|email|filepath|mention}`, `--value-contains <substr>`, `--limit` (default 50), and `--format text|json`. Output reuses `entities::print_text` / `print_json`, so the human-readable and JSON shapes stay in lockstep with the library. Focused parse tests pin the default-only invocation, the full filter set, and rejection of unknown kinds.
- Progress: MCP now exposes the same entity listing surface through `lantern_entities` / `LanternServer::entities_sync`, so agents can inspect URL/email/filepath/mention entities without shelling out. A focused MCP test now covers the transport-free code path alongside the existing CLI and library coverage.
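
The conservative `@mention` rules listed above can be sketched without a regex roughly as follows. This is not the code in `src/entities.rs`: the function names are hypothetical, the "email-local character" check is approximated with the handle character set, and the trailing separators trimmed (`.` and `-`) are an assumption.

```rust
/// Characters allowed in a handle body. As a simplification, the same set
/// is used to decide whether the '@' is preceded by an email local part.
fn is_handle_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || matches!(c, '.' | '_' | '-')
}

/// Extract deduplicated mention handles in first-seen order.
fn extract_mentions(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        // Skip anything that is not a free-standing '@'.
        if chars[i] != '@' || (i > 0 && is_handle_char(chars[i - 1])) {
            i += 1;
            continue;
        }
        // Take the longest run of handle characters after the '@'.
        let start = i + 1;
        let mut end = start;
        while end < chars.len() && is_handle_char(chars[end]) {
            end += 1;
        }
        i = end; // always at least one past the '@'
        // Trim trailing separators such as the '.' that ends a sentence.
        let body: String = chars[start..end].iter().collect();
        let handle = body.trim_end_matches(|c: char| matches!(c, '.' | '-'));
        let long_enough = handle.chars().count() >= 2;
        let has_letter = handle.chars().any(|c| c.is_ascii_alphabetic());
        if long_enough && has_letter && !out.iter().any(|h| h.as_str() == handle) {
            out.push(handle.to_string());
        }
    }
    out
}
```

In this sketch, `carol@example.com` is left for the email extractor because the `@` follows a handle character, while `@2024-01-15` and `@1.2.3` fail the at-least-one-letter check.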

### 7. Autonomous Consolidation
**Why:** No background process summarizes old content into denser representations. mcp-memory-service does this.
- Periodic job that merges related chunks into summaries
- Replace 50 near-duplicate chunks with 1 dense summary
- Triggered by threshold (chunk count, store size, time since last consolidation)
- Keep provenance chain — summaries link back to source chunks

### 8. Multi-Session Memory Linkage
**Why:** If you ingest 50 support sessions, the store knows they're 50 JSONL sources but has no notion that they're related to the same student or topic.
- Add session grouping metadata (topic, user, project, thread)
- Auto-detect session relationships (shared entities, temporal proximity, overlapping terms)
- Query across linked sessions: "what do we know about student X?"
- Session bundles — group sources into logical collections

## P2 — Polish

### 9. Pluggable Embedding Backend
**Why:** Ollama-only is a philosophical choice but blocks CI/server use (no GPU). An `EmbeddingBackend` trait with Ollama as default costs nothing.
- Define `EmbeddingBackend` trait: `fn embed(&self, texts: &[String]) -> Result<Vec<Vec<f32>>>`
- Implement for Ollama (default), OpenAI, Voyage
- `--embed-backend ollama|openai|voyage` flag
- Config file for API keys
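
One possible shape for the trait is sketched below. Everything beyond the `embed` signature is an assumption made for illustration: the `EmbedResult` alias, the `HashBackend` stand-in (a deterministic offline backend that CI could use without a GPU), and the `backend_from_flag` helper with its `hash` value are hypothetical and not part of Lantern.

```rust
use std::error::Error;

/// Assumed error type for the sketch; the TODO only writes `Result<...>`.
type EmbedResult<T> = std::result::Result<T, Box<dyn Error + Send + Sync>>;

trait EmbeddingBackend {
    fn embed(&self, texts: &[String]) -> EmbedResult<Vec<Vec<f32>>>;
}

/// Hypothetical offline backend: deterministic, no GPU or network needed.
/// It is not a useful embedding, only a placeholder with the right shape.
struct HashBackend {
    dims: usize,
}

impl EmbeddingBackend for HashBackend {
    fn embed(&self, texts: &[String]) -> EmbedResult<Vec<Vec<f32>>> {
        if self.dims == 0 {
            return Err("dims must be non-zero".into());
        }
        Ok(texts
            .iter()
            .map(|t| {
                // Fold the bytes of the text into a fixed-size vector.
                let mut v = vec![0.0f32; self.dims];
                for (i, b) in t.bytes().enumerate() {
                    v[i % self.dims] += b as f32;
                }
                v
            })
            .collect())
    }
}

/// Map a flag value to a boxed backend. Only the hypothetical `hash`
/// backend exists in this sketch; real backends would be added here.
fn backend_from_flag(name: &str) -> EmbedResult<Box<dyn EmbeddingBackend>> {
    match name {
        "hash" => Ok(Box::new(HashBackend { dims: 8 })),
        other => Err(format!("unknown embed backend: {other}").into()),
    }
}
```

Returning a boxed trait object keeps the ingest and search code independent of which backend the flag selected.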

### 10. Record Embedding Model in Query Envelope
**Status: DONE** — `search --format json` now includes the query embedding model for semantic/hybrid searches; keyword searches omit it.

### 11. `--include-ext` Flag for Ingest
**Why:** `is_supported_file` is a hardcoded allowlist, so new extensions such as `.proto`, `.sql`, `.tf`, `.nix`, `.el`, and `.vim` have to be added one at a time, indefinitely.
- Add `--include-ext proto,sql,tf` CLI flag
- Consider MIME sniffing + UTF-8 validation as fallback (extension list as fast path)
- Or `--include-all-text` to accept any text file
- Progress: `lantern ingest` now accepts `--include-ext` as a comma-separated list, normalizes away leading dots and case, and threads the extra extensions into the ingest allowlist. Focused tests cover CLI normalization, rejection of empty values, and end-to-end ingest of a custom `.proto` file.
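
The normalization described in the progress note can be sketched like this. It is an illustration, not the CLI's actual parser: `parse_include_ext` is a hypothetical name, and trimming whitespace and dropping duplicate entries are assumptions beyond what the note states.

```rust
/// Parse a comma-separated `--include-ext` value into normalized extensions:
/// leading dots stripped, lowercased, empty entries rejected.
fn parse_include_ext(raw: &str) -> Result<Vec<String>, String> {
    let mut exts: Vec<String> = Vec::new();
    for part in raw.split(',') {
        let ext = part.trim().trim_start_matches('.').to_ascii_lowercase();
        if ext.is_empty() {
            return Err(format!("empty extension in --include-ext value {raw:?}"));
        }
        // Keep first-seen order and drop repeats (an assumption).
        if !exts.contains(&ext) {
            exts.push(ext);
        }
    }
    Ok(exts)
}
```

Under these rules `.Proto,sql, TF` normalizes to `proto`, `sql`, `tf`, while `proto,,sql` and an empty value are rejected.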

### 12. Code-Aware Chunking (Treesitter)
**Why:** Paragraph-break chunking slices function bodies mid-block. For code recall, function/class boundary chunking is a meaningful quality jump.
- Add `tree-sitter` dependency
- Chunk at function/class/module boundaries for supported languages
- Fall back to paragraph chunking for non-code files
- Languages: Rust, Python, Go, TypeScript first

### 13. Surface MAX_INGEST_BYTES in JSON Report
**Status: DONE** — skipped ingest entries now carry a structured `skipped_reason` code in JSON (`too_large`, `unchanged`, `error`) alongside the human-readable message.

### 14. Library API Surface
**Why:** Crate exposes modules but they're CLI-shaped (print to stdout, CLI-ish defaults). An agent framework wrapping this discovers `print_summary` is hardcoded to `println!`.
- Refactor modules to return data, not print
- `print_summary` / `print_json` become display layer only
- Core functions return `Result<Vec<SearchHit>>`, display functions format them
- Document the library API for embedding in other Rust projects

### 15. `.lantern-allow` for Ingest Allowlist Overrides
**Why:** The default file allowlist is still hardcoded. A project-local allow file would let users add or narrow what Lantern treats as ingestible without editing code.
- Add `.lantern-allow` alongside `.lantern-ignore`
- Allow overriding the preset allowed file extensions / patterns
- Keep it explicit so users can opt into extra file types like `.proto`, `.sql`, `.tf`, `.nix`, `.el`, `.vim`

### 16. PDF Ingestion
**Why:** A lot of useful knowledge lives in PDFs, and Lantern can't ingest them yet.
- Add a PDF extractor path for text-based PDFs
- Decide whether to use a Rust-native parser, PDF-to-text tooling, or a fallback pipeline
- Preserve provenance and page/byte metadata where possible
- Skip or report scanned/image-only PDFs cleanly if extraction fails

### 17. Embedding Progress Bar
**Why:** Long embedding runs are hard to judge from a distance, especially on big repos over Termux/remote Ollama. A progress bar would make ingest feel much less opaque.
- Show chunks processed vs total during embedding
- Surface current file / source being embedded
- Keep it quiet in JSON mode; only render when interactive/TTY

## Done
- [x] Clippy warnings fixed (clamp, redundant .into_iter(), collapsible if)
- [x] Source/chunk IDs widened from 8 to 16 bytes (SHA-256[:16])
- [x] Forget defaults to dry-run, requires --apply to delete
- [x] MCP server (rmcp, 6 tools, stdio + TCP)
- [x] Core CLI (init, ingest, search, query, show, inspect, export, diff, forget, reindex, stash, version)
- [x] FTS5 keyword search (BM25)
- [x] Semantic search (Ollama embeddings, cosine similarity)
- [x] Hybrid search (BM25 + cosine)
- [x] Batch embedding (32 chunks/request)
- [x] JSONL transcript ingestion
- [x] Stash/archive system
- [x] .lantern-ignore support (gitignore-style patterns + defaults)
- [x] Static builds (musl Linux x86_64 + aarch64, macOS aarch64)
- [x] Tagged release pipeline with checksums
- [x] File size limit (50MB)
- [x] Pre-v0.1.0 code review fixes
- [x] sqlite-vec rollout (extension load, vec mirror, auto-routing, and upgrade backfill)