# Lantern TODO
Tracking issues and improvements. The daily agent should pick one per session.
## P0 — Highest Leverage
### 1. MCP Server (`lantern mcp`)
**Status: DONE** — `src/mcp.rs` exposes 6 tools via rmcp. `lantern mcp --store <path>` (stdio) or `--port` (TCP).
### 2. Agent-Aware Ingestion
**Why:** Nothing in the ingest path is agent-aware beyond JSONL. No sessions, no turn linkage, no tool-call vs message distinction, no auto-capture hook.
- [x] JSONL extractor should tag: role (user/assistant/tool), turn_id, tool_name, timestamp
- [x] Add session/turn metadata to ingested chunks
- [x] Streaming ingest mode or filesystem hook (watch for new transcripts) — `lantern ingest <path> --follow [--follow-interval-secs N]` polls on an interval; unchanged files remain no-ops via the existing content-hash check
- [x] Append-only stdin ingest — `lantern ingest --stdin --uri <base> [--append]` now threads through `ingest::StdinIngestOptions { append: true }` so repeated stdin batches under the same base label accumulate as distinct sources (unique `{base}#{suffix}` URI per call) instead of overwriting; JSONL role/session/turn/tool/timestamp metadata is preserved per batch. Default behavior unchanged.
- [x] Named-pipe / FIFO stream ingest — `lantern ingest <fifo>` auto-detects Unix FIFOs (via `FileTypeExt::is_fifo`) and routes them through `ingest::ingest_fifo`, which reads until the writer closes and delegates to the stdin-append path. Each reader session lands as its own source under a `fifo://{abs_path}#{suffix}` URI, and `.jsonl`-named FIFOs still flow through the transcript extractor. Regular files, directories, and stdin are unchanged.
- Progress: follow mode now supports an optional idle timeout, so long-running transcript watches can exit cleanly after a quiet period.
- Progress: JSONL extractor now folds `tool_name` into the chunk prefix when `role="tool"` (OpenAI-style tool messages), so a line like `{"role":"tool","tool_name":"search",...}` becomes `[tool:search] ...` instead of the name-stripped `[tool] ...`. Previously the tool name survived as chunk metadata but was invisible to keyword search.
- Progress: directory `--follow` now installs a notify-backed filesystem watcher when possible, waking early on new/changed transcript files and falling back to polling when the watcher can't be created.
- Covered by tests: FIFO follow re-opens on each writer close and multiplexes consecutive batches under `--follow`; follow loops can now terminate on idle timeout; directory follow wakes before the polling interval when new files appear; `role="tool"` + `tool_name` renders as `[tool:{name}]`; compact defaults now start decaying after two weeks, so moderately stale chunks decay sooner while the 30-day half-life remains the same.
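A minimal sketch of the tool-prefix rendering described above, assuming `role` and `tool_name` have already been parsed out of the JSONL line (`render_prefix` is an illustrative helper name, not the extractor's actual function):

```rust
/// Build the chunk prefix for a transcript line from its parsed metadata.
/// `role` is "user" | "assistant" | "tool"; `tool_name` only matters for
/// OpenAI-style tool messages.
fn render_prefix(role: &str, tool_name: Option<&str>) -> String {
    match (role, tool_name) {
        // Tool messages keep the tool name visible to keyword search.
        ("tool", Some(name)) => format!("[tool:{}]", name),
        // Everything else (including unnamed tool messages) uses the bare role.
        (r, _) => format!("[{}]", r),
    }
}
```

This is the behavior change in the progress note: the name used to survive only as chunk metadata, invisible to FTS; folding it into the prefix makes it searchable.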
### 3. Reciprocal Rank Fusion (RRF) for Hybrid Search
**Status: DONE** — Replaced normalize-by-max BM25 with RRF (k=60). Dropped `--weight` parameter. `blend_hits` now ranks each result list and sums `1/(k+rank)` contributions per chunk_id.
- Replace `normalize_bm25` with RRF: `score = 1 / (k + rank)` where k=60
- Three lines of code; handles the "only one side has this hit" case naturally
- Drop the `--weight` parameter (RRF doesn't need it)
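The fusion step can be sketched as follows, assuming each input list is already sorted best-first and identified by `chunk_id` (the function name and types are illustrative, not Lantern's actual `blend_hits` signature):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: each result list contributes 1/(k + rank) per
/// chunk, with rank starting at 1 for its top hit. k = 60 is the
/// conventional default.
fn rrf(lists: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in lists {
        for (i, &chunk_id) in list.iter().enumerate() {
            *scores.entry(chunk_id).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest fused score first.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

A chunk that appears in both the BM25 and vector lists accumulates two contributions, so it naturally outranks a chunk that only one side found, with no weighting knob required.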
## P1 — Important
### 4. sqlite-vec Vector Index
**Status: DONE** — `src/store.rs` loads sqlite-vec, default-model embeddings dual-write into `chunks_vec_nomic_768`, semantic search auto-routes when eligible, and existing stores backfill the mirror on upgrade (schema v5).
### 5. Decay Weighting / Confidence Scoring
**Why:** Lantern treats all chunks as equally true. agent-memory-mcp has explicit usefulness ranking.
- Track access count, recency, and user feedback per chunk
- Decay older chunks that are never retrieved
- Surface confidence score in search results
- Consider: was this chunk ever used to answer a query successfully?
**Progress (slice 1):** `chunks.access_count` / `chunks.last_accessed_at` (schema v7) are now read by keyword, semantic, vec, and hybrid search paths and surfaced on `SearchHit`. `compute_confidence(now, last_accessed_at, timestamp_unix, access_count)` replaces the old timestamp-only helper with a deterministic blend of freshness decay (30-day time constant, 0.25 floor, `last_accessed_at` preferred over `timestamp_unix`) and access-count saturation (`1 - exp(-n/5)`). Ranking is unchanged. Retrieval now bumps `access_count` and `last_accessed_at` for returned hits across keyword, semantic, vec, and hybrid search paths.
- Progress: added `chunks.feedback_score` (schema v8) plus `lantern::feedback::{record_feedback,get_feedback_score}`; `SearchHit` now surfaces `feedback_score`, and confidence scoring folds it in with a neutral `0` default so existing stores are unchanged. Focused tests cover the new round-trip and confidence behavior, and `lantern feedback <chunk_id> up|down` now provides a direct CLI write path for the signal.
- Progress: added `lantern compact` as a background maintenance pass for stale access metadata. It decays `access_count` using a separate `access_decay_at` checkpoint so repeated runs stay idempotent, and search touches now refresh that checkpoint alongside `last_accessed_at`. Focused tests cover fresh vs stale rows and the search/compact round trip.
- Progress: compact defaults now start decaying after two weeks instead of waiting a full month, while keeping the 30-day half-life and CLI override knobs intact. Broader aggressiveness/automation tuning remains open.
- Progress: MCP now exposes chunk feedback through `lantern_feedback`, mirroring the CLI feedback write path and covered by a focused sync-path test.
- Progress: end-to-end test now asserts that negative feedback strictly lowers a chunk's `confidence` below a neutral peer after a second search, closing the symmetric gap next to the existing positive-feedback test.
- Progress: `lantern compact` now reports `skipped_recent_chunks` — the count of scanned chunks held back because their age was below `minimum_age_secs`. Surfaced in both text output (`skipped_recent=N`) and the JSON report so operators can see how much of a pass was no-op due to recency. Focused tests cover the recent-only path and a mixed recent/stale store where one chunk decays and one is skipped.
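The slice-1 blend can be sketched from its stated parts. Two details are assumptions here, not specified above: the 30-day constant is treated as e-folding decay, and the two factors are averaged 50/50 (the real blend weights may differ):

```rust
const DAY: f64 = 86_400.0;

/// Confidence = blend of freshness decay and access-count saturation.
/// Freshness: exp(-age / 30 days), floored at 0.25, with last_accessed_at
/// preferred over timestamp_unix as the age reference.
/// Usage: 1 - exp(-access_count / 5).
/// The equal-weight average is an illustrative choice.
fn compute_confidence(
    now: i64,
    last_accessed_at: Option<i64>,
    timestamp_unix: Option<i64>,
    access_count: u32,
) -> f64 {
    let freshness = match last_accessed_at.or(timestamp_unix) {
        Some(t) => {
            let age = (now - t).max(0) as f64;
            (-age / (30.0 * DAY)).exp().max(0.25)
        }
        // No timestamps at all: fall back to the floor.
        None => 0.25,
    };
    let usage = 1.0 - (-(access_count as f64) / 5.0).exp();
    0.5 * freshness + 0.5 * usage
}
```

Under these assumptions a fresh, never-accessed chunk scores 0.5, a fresh and heavily accessed chunk approaches 1.0, and a year-old untouched chunk bottoms out at 0.125 via the floor.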
### 6. Knowledge Graph / Entity Extraction
**Why:** Lantern is flat — mcp-memory-service and Agentwong's server build typed entities with relationships.
- Extract entities (people, projects, concepts, files) from ingested content
- Build typed relationships between entities
- Queryable graph layer on top of the flat chunk store
- Use LLM extraction or NER — can be a post-ingest step
### 7. Autonomous Consolidation
**Why:** No background process summarizes old content into denser representations. mcp-memory-service does this.
- Periodic job that merges related chunks into summaries
- Replace 50 near-duplicate chunks with 1 dense summary
- Triggered by threshold (chunk count, store size, time since last consolidation)
- Keep provenance chain — summaries link back to source chunks
### 8. Multi-Session Memory Linkage
**Why:** If you ingest 50 support sessions, the store knows they're 50 JSONL sources but has no notion that they're related to the same student or topic.
- Add session grouping metadata (topic, user, project, thread)
- Auto-detect session relationships (shared entities, temporal proximity, overlapping terms)
- Query across linked sessions: "what do we know about student X?"
- Session bundles — group sources into logical collections
## P2 — Polish
### 9. Pluggable Embedding Backend
**Why:** Ollama-only is a philosophical choice but blocks CI/server use (no GPU). An `EmbeddingBackend` trait with Ollama as default costs nothing.
- Define `EmbeddingBackend` trait: `fn embed(&self, texts: &[String]) -> Result<Vec<Vec<f32>>>`
- Implement for Ollama (default), OpenAI, Voyage
- `--embed-backend ollama|openai|voyage` flag
- Config file for API keys
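A sketch of the proposed trait, with a deterministic in-memory backend standing in for a real provider. `FakeBackend` and its byte-hashing scheme are purely illustrative (useful for CI without a GPU); an Ollama implementation would make an HTTP call to the embeddings endpoint instead, and the error type is simplified to `String`:

```rust
/// Proposed abstraction: any embedding provider implements this.
trait EmbeddingBackend {
    fn embed(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, String>;
}

/// Deterministic stand-in for tests/CI: folds input bytes into a
/// fixed-dimension vector. A real Ollama/OpenAI/Voyage impl would
/// POST to the provider's API here.
struct FakeBackend {
    dims: usize,
}

impl EmbeddingBackend for FakeBackend {
    fn embed(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, String> {
        Ok(texts
            .iter()
            .map(|t| {
                let mut v = vec![0.0f32; self.dims];
                for (i, b) in t.bytes().enumerate() {
                    v[i % self.dims] += b as f32 / 255.0;
                }
                v
            })
            .collect())
    }
}
```

Ingest and search code would hold a `Box<dyn EmbeddingBackend>` chosen by the `--embed-backend` flag, keeping Ollama as the default.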
### 10. Record Embedding Model in Query Envelope
**Status: DONE** — `search --format json` now includes the query embedding model for semantic/hybrid searches; keyword searches omit it.
### 11. `--include-ext` Flag for Ingest
**Why:** `is_supported_file` is a hardcoded allowlist. `.proto`, `.sql`, `.tf`, `.nix`, `.el`, `.vim` — forever adding one at a time.
- Add `--include-ext proto,sql,tf` CLI flag
- Consider MIME sniffing + UTF-8 validation as fallback (extension list as fast path)
- Or `--include-all-text` to accept any text file
### 12. Code-Aware Chunking (Treesitter)
**Why:** Paragraph-break chunking slices function bodies mid-block. For code recall, function/class boundary chunking is a meaningful quality jump.
- Add `tree-sitter` dependency
- Chunk at function/class/module boundaries for supported languages
- Fall back to paragraph chunking for non-code files
- Languages: Rust, Python, Go, TypeScript first
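The tree-sitter half needs the real grammar machinery, but the fallback path is small enough to sketch. A minimal paragraph-boundary chunker, splitting on blank lines (the existing chunker's exact rules may differ):

```rust
/// Paragraph-boundary chunking: the fallback for non-code files.
/// Splits on blank lines and drops empty fragments. A code-aware
/// chunker would instead emit one chunk per function/class node
/// from a tree-sitter parse, falling back to this for other files.
fn chunk_paragraphs(text: &str) -> Vec<String> {
    text.split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .map(str::to_string)
        .collect()
}
```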
### 13. Surface MAX_INGEST_BYTES in JSON Report
**Status: DONE** — skipped ingest entries now carry a structured `skipped_reason` code in JSON (`too_large`, `unchanged`, `error`) alongside the human-readable message.
### 14. Library API Surface
**Why:** Crate exposes modules but they're CLI-shaped (print to stdout, CLI-ish defaults). An agent framework wrapping this discovers `print_summary` is hardcoded to `println!`.
- Refactor modules to return data, not print
- `print_summary` / `print_json` become display layer only
- Core functions return `Result<Vec<SearchHit>>`, display functions format them
- Document the library API for embedding in other Rust projects
### 15. `.lantern-allow` for Ingest Allowlist Overrides
**Why:** The default file allowlist is still hardcoded. A project-local allow file would let users add or narrow what Lantern treats as ingestible without editing code.
- Add `.lantern-allow` alongside `.lantern-ignore`
- Allow overriding the preset allowed file extensions / patterns
- Keep it explicit so users can opt into extra file types like `.proto`, `.sql`, `.tf`, `.nix`, `.el`, `.vim`
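Since the file format is still open, here is one hypothetical shape — one extension per line, blank lines and `#` comments ignored — with a parser sketch for it (nothing here reflects a decided format):

```rust
use std::collections::HashSet;

/// Parse a hypothetical `.lantern-allow` file: one extension per line,
/// `#` starts a comment, leading dots optional, case-insensitive.
fn parse_allow_file(contents: &str) -> HashSet<String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .map(|l| l.trim_start_matches('.').to_ascii_lowercase())
        .collect()
}
```

The resulting set would be unioned with (or, in a stricter mode, substituted for) the built-in allowlist before `is_supported_file` runs.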
### 16. PDF Ingestion
**Why:** A lot of useful knowledge lives in PDFs, and Lantern can't ingest them yet.
- Add a PDF extractor path for text-based PDFs
- Decide whether to use a Rust-native parser, PDF-to-text tooling, or a fallback pipeline
- Preserve provenance and page/byte metadata where possible
- Skip or report scanned/image-only PDFs cleanly if extraction fails
### 17. Embedding Progress Bar
**Why:** Long embedding runs are hard to judge from a distance, especially on big repos over Termux/remote Ollama. A progress bar would make ingest feel much less opaque.
- Show chunks processed vs total during embedding
- Surface current file / source being embedded
- Keep it quiet in JSON mode; only render when interactive/TTY
## Done
- [x] Clippy warnings fixed (clamp, redundant .into_iter(), collapsible if)
- [x] Source/chunk IDs widened from 8 to 16 bytes (SHA-256[:16])
- [x] Forget defaults to dry-run, requires --apply to delete
- [x] MCP server (rmcp, 6 tools, stdio + TCP)
- [x] Core CLI (init, ingest, search, query, show, inspect, export, diff, forget, reindex, stash, version)
- [x] FTS5 keyword search (BM25)
- [x] Semantic search (Ollama embeddings, cosine similarity)
- [x] Hybrid search (BM25 + cosine)
- [x] Batch embedding (32 chunks/request)
- [x] JSONL transcript ingestion
- [x] Stash/archive system
- [x] .lantern-ignore support (gitignore-style patterns + defaults)
- [x] Static builds (musl Linux x86_64 + aarch64, macOS aarch64)
- [x] Tagged release pipeline with checksums
- [x] File size limit (50MB)
- [x] Pre-v0.1.0 code review fixes
- [x] sqlite-vec rollout (extension load, vec mirror, auto-routing, and upgrade backfill)