lantern 0.3.0 - Docs.rs

# Lantern TODO

Tracking issues and improvements. The daily agent should pick one per session.

## P0 — Highest Leverage

### 1. MCP Server (`lantern mcp`)
**Status: DONE** — `src/mcp.rs` exposes 11 tools via rmcp. `lantern mcp --store <path>` (stdio) or `--port` (TCP).

### 2. Agent-Aware Ingestion
**Why:** The core transcript path is now agent-aware, but the remaining gap is automation and richer capture workflows around it rather than basic metadata plumbing.
- [x] JSONL extractor should tag: role (user/assistant/tool), turn_id, tool_name, timestamp
- [x] Add session/turn metadata to ingested chunks
- [x] Streaming ingest mode or filesystem hook (watch for new transcripts) — `lantern ingest <path> --follow [--follow-interval-secs N]` polls on an interval; unchanged files remain no-ops via the existing content-hash check
- [x] Append-only stdin ingest — `lantern ingest --stdin --uri <base> [--append]` now threads through `ingest::StdinIngestOptions { append: true }` so repeated stdin batches under the same base label accumulate as distinct sources (unique `{base}#{suffix}` URI per call) instead of overwriting; JSONL role/session/turn/tool/timestamp metadata is preserved per batch. Default behavior unchanged.
- [x] Named-pipe / FIFO stream ingest — `lantern ingest <fifo>` auto-detects Unix FIFOs (via `FileTypeExt::is_fifo`) and routes them through `ingest::ingest_fifo`, which reads until the writer closes and delegates to the stdin-append path. Each reader session lands as its own source under a `fifo://{abs_path}#{suffix}` URI, and `.jsonl`-named FIFOs still flow through the transcript extractor. Regular files, directories, and stdin are unchanged.
- Progress: follow mode now supports an optional idle timeout, so long-running transcript watches can exit cleanly after a quiet period.
- Progress: JSONL extractor now folds `tool_name` into the chunk prefix when `role="tool"` (OpenAI-style tool messages), so a line like `{"role":"tool","tool_name":"search",...}` becomes `[tool:search] ...` instead of the name-stripped `[tool] ...`. Previously the tool name survived as chunk metadata but was invisible to keyword search.
- Progress: directory `--follow` now installs a notify-backed filesystem watcher when possible, waking early on new/changed transcript files and falling back to polling when the watcher can't be created.
- Progress: JSONL extractor now also recognizes camelCase aliases (`sessionId`, `conversationId`, `threadId`, `turnId`, `messageId`, `toolName`, `createdAt`) alongside the existing snake_case keys, so transcripts emitted by JS/TS-style agent runtimes carry session/turn/tool/timestamp metadata without a manual rewrite. Snake_case keys still win when both are present, keeping prior behavior unchanged; focused tests cover each new alias plus the precedence guarantee.
- Covered by tests: FIFO follow re-opens on each writer close and multiplexes consecutive batches under `--follow`; follow loops can now terminate on idle timeout; directory follow wakes before the polling interval when new files appear; `role="tool"` + `tool_name` renders as `[tool:{name}]`; compact defaults now start decaying after two weeks, so moderately stale chunks decay sooner while the 30-day half-life remains the same.
- Progress: JSONL extractor now also captures `tool_call_id` (snake_case + `toolCallId` / `call_id` / `callId` aliases), persisted into a new `chunks.tool_call_id` column (schema v12) and threaded through keyword/vec/semantic/hybrid search, export, and `show`. Both human-readable metadata lines and JSON payloads now surface `tool_call_id=...`, so tool-result provenance stays distinct from the turn id. Focused tests cover the JSONL alias set, the chunks-table round-trip, the search/show/export render paths, and a JSONL ingest → export integration assertion that the value survives end-to-end.
- Progress: parent-turn lineage now flows from JSONL aliases (`parent_turn_id`, `parentTurnId`, `parent_message_id`, `parentMessageId`, `parent_id`, `parentId`, `reply_to`, `replyTo`) into a new `chunks.parent_turn_id` column (schema v13) and surfaces in ingest, search, export, and `show` metadata. Focused tests cover alias extraction, chunk-table persistence, and render paths.
- Progress: JSONL extractor now recognizes the nested `message` envelope used by Claude Code and other Anthropic-style wrappers (`{"type":"assistant","message":{"role":"assistant","content":[...]} ,"sessionId":"...","timestamp":"..."}`). When `message` is an object it becomes a secondary scope: content/role/per-message metadata fall through to it after the outer object, so flat shapes and the legacy string-typed `message` fallback are unchanged. Pre-fix these lines were dropped because the outer object had no extractable content. Focused unit tests cover string content, Anthropic block arrays, inner-only metadata pickup, the outer-wins precedence rule for shared keys, the string-`message` regression, the outer-content + inner-role mixed case, and an empty-envelope drop; an end-to-end integration test in `tests/jsonl_ingest.rs` ingests a Claude Code-style envelope transcript and asserts both turns are searchable with their session/timestamp metadata intact.
- Progress: JSONL extractor now ingests structured tool-call payloads instead of silently dropping them. `arguments` and `input` accept JSON objects and arrays in addition to strings/numbers/bools, and are emitted as compact JSON (e.g. `{"role":"tool","tool_name":"search","arguments":{"q":"hello"}}` becomes `[tool:search] {"q":"hello"}` in the chunk text). serde_json's default `BTreeMap` keeps the serialization deterministic and searchable across runs. Empty containers (`{}` / `[]`) still return `None` so they fall through to later fallbacks instead of polluting chunks. Inner-`message` envelopes get the same `arguments` / `input` lookup after the outer scope, preserving the outer-wins precedence rule used elsewhere. Pre-fix these lines were dropped because `scalar_text` only accepted string/number/bool values. Focused tests cover outer object/array payloads, deterministic key ordering, string-typed regression for both keys, the empty-container drop, fallthrough from empty `arguments` to a usable `input`, content-wins-over-arguments precedence, inner-`message` envelope pickup, and the outer-wins-over-inner precedence rule.
- Progress: JSONL extractor now also recognizes `tool_use_id` / `toolUseId` as aliases for `tool_call_id`, so Anthropic-style `tool_result` content blocks (and Claude Code transcripts that surface the field at the top level or under a `message` envelope) survive ingest with their tool-result provenance intact instead of dropping the identifier. `tool_call_id` / `toolCallId` / `call_id` / `callId` continue to win when present, preserving existing transcripts. Focused unit tests in `src/jsonl.rs` cover snake/camelCase extraction, the canonical-key-wins precedence rule, and inner-`message`-envelope pickup; a new `tests/jsonl_ingest.rs` integration test ingests a transcript with both alias forms and asserts the values land in `chunks.tool_call_id` end-to-end.
- Progress: `lantern ingest --follow --format json` is now automation-friendly instead of bailing. The follow loop emits a newline-delimited JSON stream of three event kinds — `follow_start` (with `path`, `interval_secs`, optional `idle_timeout_secs`), `follow_pass` (with the 0-based `pass` index and the full `IngestReport`), and `follow_exit` (with `reason` = `max_iterations` or `idle_timeout`, plus `idle_timeout_secs` only on the idle branch). Each event is a single compact JSON object on its own line, parseable one `json.loads`-per-line. Text mode behavior is preserved unchanged. Three new public helpers in `src/ingest.rs` (`write_follow_start_json` / `write_follow_pass_json` / `write_follow_exit_json` plus stdout-bound `print_*` wrappers) make the writers testable against a `Vec<u8>` buffer. Focused unit tests in `src/ingest.rs` cover each event's shape, optional-field omission, the one-line-per-event invariant, and an end-to-end three-event stitch; a new integration test in `tests/follow_ingest.rs` drives `follow_path_with` with the JSON callback through a real polling loop and verifies the full 4-line stream (start + 2 passes + exit) parses cleanly.
- Progress: JSONL extraction now recurses through Anthropic-style content-block objects whose `content` field is itself a nested string/array payload, so `tool_result` blocks embedded inside `content` arrays are searchable instead of being silently dropped. The recursive fallback stays narrow and deterministic: direct block `text` still wins, empty nested containers are still omitted, and focused `src/jsonl.rs` tests pin nested text-block arrays, direct-string nested content, the empty-content drop path, and the outer-text precedence guard.
- Progress: JSONL extraction now also captures Anthropic-style `tool_use` content blocks (`{"type":"tool_use","id":"toolu_...","name":"search","input":{...}}`) inside `content` arrays, the symmetric companion to the existing `tool_result` recursive handling. Pre-fix these blocks had no `text` and no nested `content`, so the array walk silently dropped them and the entire tool call disappeared from the chunk. After the fix the call renders as `[tool_use:NAME] {compact_input_json}` and mixes with surrounding text blocks in order: a turn with `[text, tool_use, text]` becomes `let me search\n[tool_use:search] {"q":"hello"}\ndone`. Structured `input` objects/arrays serialize via serde_json's BTreeMap so the output is deterministic and keyword-searchable; string/number/bool input values pass through directly; empty `{}`/`[]` and missing `input` collapse to just the prefix; missing or blank `name` drops the block (no `[tool_use:]` pollution), and a `text` field on the block still wins so existing text-block transcripts are unaffected. The fallback only fires for `type=tool_use`, so unknown typed blocks without text/content keep dropping. Focused `src/jsonl.rs` tests pin all of these cases (structured object/array/string inputs, sorted-key determinism, missing/blank/empty inputs, name fallback, text-wins precedence, mixed block ordering, and the type-narrowing guard), and a new `tests/jsonl_ingest.rs` integration test ingests a real assistant turn with `[text, tool_use]` content and asserts both the tool name and the structured input survive end-to-end keyword search.
- Progress: JSONL extraction now also recognizes the Anthropic `is_error: true` flag on nested `tool_result` content blocks and prefixes the rendered text with `[tool_error]` so failed tool calls are keyword-searchable. Pre-fix the recursive content walk surfaced the inner stderr/error payload without preserving the failure signal, so "find every failed tool call" required scanning the original transcript file. The new branch in `content_to_text`'s array walk only fires when the block carries `type=tool_result` AND `is_error=true` — non-error tool_result blocks (missing flag or `false`) fall through to the existing recursive `content` path with their previous unprefixed rendering, so already-ingested transcripts are unchanged. Inner `content` may be a direct string or another content-block array; both shapes flow through `content_to_text` so multi-line stderr lands as `[tool_error] line one\nline two`. A direct `text` field on the block still wins (mirrors the existing block-text precedence guard), an empty `content` still drops the block, and the flag is type-narrowed so `is_error: true` on a non-`tool_result` block doesn't accidentally trigger the prefix. Focused `src/jsonl.rs` tests pin all of these cases (basic prefix, nested-text-block array payload, missing/false flag regression, empty-content drop, text-field-wins precedence, mixed-block interleave order, and the type-narrowing guard); a new `tests/jsonl_ingest.rs` integration test ingests a transcript with one error and one non-error tool_result block and asserts the `tool_error` keyword search hits the failed call only.
- Progress: JSONL extraction now also recognizes OpenAI-style `tool_calls` arrays (snake + `toolCalls` camelCase) when a line carries no regular `content` / `text`. Previously assistant turns shaped like `{"role":"assistant","tool_calls":[{"id":"call_abc","type":"function","function":{"name":"search","arguments":"{...}"}}]}` were silently dropped because `lookup_content` had no first-class tool-call fallback. After the fix each entry renders as `[tool_use:NAME] {compact_arguments}` (multiple parallel calls join with `\n`, mirroring the existing tool_use content-block path), and `tool_name` / `tool_call_id` recover from the first usable array entry — `tool_name` from `function.name` (or a flat top-level `name`) and `tool_call_id` from the standard OpenAI `id` plus the existing `tool_call_id` / `toolCallId` / `tool_use_id` / `toolUseId` / `call_id` / `callId` alias set. Both common entry shapes are supported (`{type:"function", function:{name, arguments}}` and the flat `{id, name, arguments}` variant); `arguments` may be the JSON-encoded *string* OpenAI emits (passes through unparsed so its tokens stay searchable) or a structured object/array (re-serialized via serde_json's BTreeMap so output is deterministic). Precedence preserved end-to-end: outer `content` / `text` and the inner-`message` envelope still win for chunk text; direct top-level / inner-message `tool_name` / `tool_call_id` and the existing nested content-block recovery still win over the new array fallback; empty arrays / missing-name entries / empty `arguments` containers fall through cleanly without polluting chunks. Inner-`message` envelopes get the same `tool_calls` lookup after the outer scope. Focused `src/jsonl.rs` tests cover basic recovery, the camelCase alias, structured-object arguments with sorted-key determinism, the flat entry shape, multi-entry join order, the nameless-entry skip, an empty-array drop + fallthrough to a usable later fallback, content-wins-over-tool_calls precedence, outer-tool_name / outer-tool_call_id precedence over the array, inner-message envelope pickup (both content recovery and the inner-direct-wins-over-outer-array rule), the content-block-wins-over-tool_calls rule, the `id`-as-tool_call_id alias, OpenAI's JSON-encoded-string `arguments` passthrough, and the empty-input-keeps-prefix case. Two new integration tests in `tests/jsonl_ingest.rs` ingest an OpenAI-style transcript end-to-end and assert (a) chunk text + `chunks.tool_name` / `chunks.tool_call_id` recover from the array entry when the outer message provides neither, and (b) explicit outer `tool_name` / `tool_call_id` still win the chunks-table columns even when an inner `tool_calls` entry would have offered different values.
- Progress: JSONL extraction now also recognizes Anthropic-style `thinking` content blocks (`{"type":"thinking","thinking":"...","signature":"..."}`) inside `content` arrays, the reasoning-trace companion to the existing `tool_use` / `tool_result` block handling. Pre-fix these blocks had no `text` and no nested `content`, so the array walk silently dropped them and an assistant turn whose only content was extended thinking disappeared from search. After the fix the trace renders as `[thinking] {text}` and mixes with surrounding text blocks in order: a turn with `[thinking, text]` becomes `[thinking] reasoning trace\nfinal answer`, so reasoning is keyword-searchable behind a stable prefix. Missing / blank `thinking` fields drop the block (no `[thinking]` placeholder pollution), a direct `text` field on the block still wins (mirrors the existing block-text precedence guard), and the companion `redacted_thinking` block — which carries encrypted bytes rather than searchable text — stays dropped. The fallback only fires for `type=thinking`, so unrelated blocks that happen to carry a `thinking` field aren't accidentally prefixed. Focused `src/jsonl.rs` tests pin all of these cases (basic prefix, blank/missing field drops, mixed-block interleave order, text-wins precedence, the type-narrowing guard, and the redacted_thinking opt-out); a new `tests/jsonl_ingest.rs` integration test ingests a real assistant turn with `[thinking, text]` content and asserts both the reasoning text and the surrounding text survive end-to-end keyword search in a single chunk joined in order.
- Progress: JSONL extraction now also recognizes Anthropic-style `server_tool_use` content blocks (`{"type":"server_tool_use","id":"srvtoolu_...","name":"web_search","input":{...}}`), the server-managed companion to the existing client-side `tool_use` block handling. Pre-fix these blocks had no `text` and no nested `content`, so the array walk silently dropped them and Anthropic-issued server-side calls (web_search, computer_use, code_execution, etc.) disappeared from the chunk entirely. After the fix each call renders behind a distinct `[server_tool_use:NAME] {compact_input_json}` prefix — distinct on purpose from `[tool_use:NAME]` so keyword search can tell client-side from server-managed calls apart — and mixes with surrounding text blocks in order: a turn with `[text, server_tool_use, text]` becomes `prefix\n[server_tool_use:web_search] {compact_input}\ntrailing`. Structured `input` objects/arrays serialize via serde_json's BTreeMap so the output is deterministic and keyword-searchable; string/number/bool input values pass through directly; empty `{}`/`[]` and missing `input` collapse to just the prefix; missing or blank `name` drops the block; a `text` field on the block still wins. The fallback is type-narrowed to `server_tool_use`, so other unknown typed blocks without text/content keep dropping. `tool_call_id_from_block` and `tool_name_from_block` were extended in lockstep: the server-side `id` field (Anthropic's `srvtoolu_...` shape) recovers as `tool_call_id` for `server_tool_use` blocks only — pre-existing `tool_use` / `tool_result` alias behavior is unchanged so the 6-alias contract stays pinned — and `tool_name` now also recovers from the first `server_tool_use` block's `name`. Outer-scope and inner-message-envelope precedence rules still hold. Focused `src/jsonl.rs` tests pin all of these cases (structured-object input + deterministic sorted keys, missing/blank/empty inputs keeping prefix, name fallback drop, text-wins precedence, mixed-block interleave order, nested id and name recovery, and outer-wins precedence for both fields); a new `tests/jsonl_ingest.rs` integration test ingests a real assistant turn with `[text, server_tool_use]` content and asserts the server-side tool name, structured input, and `srvtoolu_...` identifier all survive end-to-end into keyword search and `chunks.tool_name` / `chunks.tool_call_id`.
- Progress: JSONL extraction now also recovers `tool_call_id` and `tool_name` from nested Anthropic `tool_use` / `tool_result` content blocks when the outer message (and the inner `message` envelope) don't already provide them. `tool_call_id` falls back to the first block carrying `tool_use_id` / `toolUseId` / `tool_call_id` / `toolCallId` / `call_id` / `callId`; `tool_name` falls back to the first `tool_use` block's `name`. Outer scope and inner-message-envelope scope still win, preserving the documented "direct top-level / message-envelope fields must continue to win" contract — only blocks with `type=tool_use` (or `tool_result` for the id) are inspected, so plain text blocks carrying lookalike fields can't poison the fallback. Focused `src/jsonl.rs` tests pin the tool_use / tool_result extraction paths, each of the six id aliases, outer-vs-nested and inner-message-vs-nested precedence, the type-narrowing guards on both fields, and a deep envelope fallback case; a new `tests/jsonl_ingest.rs` integration test ingests a transcript whose tool-call provenance lives only inside content blocks (flat outer assistant turn with a `tool_use` block, flat outer user turn with a `tool_result` block, and a message-envelope-wrapped `tool_result` block) and asserts `chunks.tool_call_id` / `chunks.tool_name` survive ingest end-to-end.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style `function_call_output` content blocks (`{"type":"function_call_output","call_id":"call_123","output":...}`) inside `content` arrays, the tool-result companion to the existing `function_call` block handling. Pre-fix the array walk silently dropped them — these blocks carry no `text`, no nested `content`, no `name`, and no `input`, so a Responses-style tool-result turn whose only payload was the `output` disappeared from the chunk while the surrounding prose stayed searchable. After the fix the reply renders behind a distinct `[tool_result] {output}` prefix — distinct on purpose from the call-initiation `[tool_use:NAME]` shape, so keyword search can scope to results vs initiations — and mixes with surrounding text/`function_call` blocks in array order: a turn with `[text, function_call, function_call_output, text]` becomes `prefix\n[tool_use:NAME] {args}\n[tool_result] {output}\ntrailing`. `output` reuses `arguments_value_to_text`, so OpenAI's JSON-encoded-string form passes through unparsed (keyword tokens stay searchable), structured object/array forms re-serialize via serde_json's BTreeMap-sorted path for deterministic output, scalar numbers/bools pass through directly, and empty `{}`/`[]` / missing / blank-string `output` *drop the block entirely* — a bare `[tool_result]` carries no useful signal (no name, no payload), so dropping avoids polluting the chunk. Missing/wrong-type guards: only `type=function_call_output` feeds the extractor, so plain text blocks carrying a stray `output` field don't accidentally produce a `[tool_result]` prefix, and a direct `text` field on the block still wins. `tool_call_id_from_block` was widened in lockstep: `function_call_output` uses the standard 6-alias id list (`tool_call_id` / `toolCallId` / `tool_use_id` / `toolUseId` / `call_id` / `callId`) with no `id` widening — Responses surfaces the call id as `call_id`, already in the set, so a defensive test guards against accidentally matching a top-level `id`. `tool_name_from_block` is unchanged because these blocks have no `name` by spec — a stray `name` field on a `function_call_output` block must not poison tool_name recovery. Outer-scope and inner-`message`-envelope precedence rules still hold end-to-end. Focused `src/jsonl.rs` tests pin all of these cases (string / number / bool / structured-object / structured-array outputs with sorted-key determinism, missing/empty-object/empty-array/blank-string drops, text-field-wins precedence, mixed-block interleave order with both `function_call` and `function_call_output` present, the standalone tool-result turn shape, the 6-alias coverage, the `id`-not-recognized guard, outer-tool_call_id precedence, inner-`message` envelope recovery, the tool_name-no-contribution rule, and the type-narrowing guard against a non-`function_call_output` block carrying `output`); a new `tests/jsonl_ingest.rs` integration test ingests a real tool turn with a `function_call_output` block carrying a structured `output` payload and asserts the `[tool_result]` prefix, the compact-JSON inner text, and the `call_responses_456` identifier all survive end-to-end into keyword search and `chunks.tool_call_id` while `chunks.tool_name` correctly stays empty.

- Progress: JSONL extraction now also recognizes OpenAI Responses-style `reasoning_text` content blocks (`{"type":"reasoning_text","text":"reasoning trace..."}`) inside `content` arrays, the Responses-API companion to Anthropic's `thinking` block handling. Pre-fix the generic text-field precedence surfaced the trace as raw prose, so reasoning was searchable but indistinguishable from ordinary assistant text. After the fix the trace renders behind the shared provider-agnostic `[thinking] {text}` prefix and mixes with surrounding text blocks in array order: a turn with `[reasoning_text, text]` becomes `[thinking] reasoning trace\nfinal answer`, so reasoning is keyword-searchable behind a stable prefix regardless of provider. The new handler runs ahead of the generic text-field precedence in `content_to_text`'s array walk and is type-narrowed to `reasoning_text`, so ordinary `type=text` blocks (and every other type) fall through to the existing text-field path unchanged. Missing/blank `text` drops the block (no `[thinking]` placeholder pollution), mirroring the thinking-block drop rule. Focused `src/jsonl.rs` tests pin all of these cases (basic prefix, blank-field drop, missing-field drop, mixed-block interleave order, and the type-narrowing guard against ordinary `type=text` blocks); a new `tests/jsonl_ingest.rs` integration test ingests a real assistant turn with `[reasoning_text, text]` content and asserts both the reasoning text and the surrounding text survive end-to-end keyword search in a single chunk joined in order.
- Progress: JSONL extraction now also recognizes Anthropic-style `mcp_tool_use` content blocks (`{"type":"mcp_tool_use","id":"mcptoolu_...","server_name":"linear","name":"create_issue","input":{...}}`) inside `content` arrays — the MCP-connector companion to the existing client-side `tool_use` and server-managed `server_tool_use` block handling, particularly relevant because Lantern itself ships an MCP server (`lantern mcp`). Pre-fix the array walk silently dropped these blocks because the type isn't `tool_use` / `server_tool_use` and they don't carry `text` or nested `content`, so the call name + structured input + `mcptoolu_...` identifier all disappeared. After the fix the call renders behind a distinct `[mcp_tool_use:SERVER/NAME] {compact_input_json}` prefix — distinct on purpose from `[tool_use:NAME]` and `[server_tool_use:NAME]` so keyword search can scope to MCP-routed calls — and the `server_name` is folded into the prefix so an MCP call to `linear/create_issue` is searchable as a single token without colliding with a client-side `create_issue`. Missing/blank `server_name` falls back to a name-only `[mcp_tool_use:NAME]` prefix that's still distinct from the other two shapes. Structured `input` objects/arrays re-serialize via serde_json's BTreeMap-sorted path for deterministic output; string/number/bool input values pass through; empty `{}`/`[]` and missing `input` collapse to just the prefix; missing or blank `name` drops the block; a direct `text` field on the block still wins; the fallback is type-narrowed to `mcp_tool_use` so unrelated blocks aren't accidentally prefixed. `tool_call_id_from_block` and `tool_name_from_block` were widened in lockstep: `mcp_tool_use` mirrors `server_tool_use`'s `id` widening (since the MCP shape surfaces the call id as `id` like `mcptoolu_...`), and `tool_name` recovers from the first `mcp_tool_use` block's `name` when neither outer scope nor the inner-`message` envelope already provides one. Outer-scope and inner-message-envelope precedence rules still hold. Focused `src/jsonl.rs` tests pin server/name prefix rendering, deterministic key ordering, missing/blank-name drops, the name-only fallback when `server_name` is absent, mixed-block join order, nested id/name fallback, the outer-wins precedence rule for both metadata fields, and the type-narrowing guard, while a new `tests/jsonl_ingest.rs` integration test ingests a real assistant turn with `[text, mcp_tool_use]` content and asserts the `mcptoolu_...` id plus structured input survive end-to-end keyword search and chunk metadata.

- Progress: JSONL extraction now also recognizes Anthropic-style `mcp_tool_result` content blocks (`{"type":"mcp_tool_result","tool_use_id":"mcptoolu_...","is_error":false,"content":...}`) inside `content` arrays, the reply-side companion to `mcp_tool_use`. Pre-fix these blocks fell through to generic `content` recursion, so the inner text was searchable but the MCP provenance signal disappeared, and `is_error: true` failures were indistinguishable from ordinary tool results. After the fix the block renders as `[mcp_tool_result] ...` for non-error replies and `[mcp_tool_error] ...` when `is_error=true`, with empty/missing nested content still dropped to avoid bare-prefix noise. The existing `tool_use_id` alias path now also recovers `chunks.tool_call_id` from nested `mcp_tool_result` blocks without widening to a stray top-level `id`, mirroring the regular Anthropic `tool_result` contract. Focused `src/jsonl.rs` tests cover string vs nested-array payloads, success/error prefix selection, empty-content drops, text-field precedence, join ordering, the MCP-only type guard, tool-call-id recovery, and the no-top-level-`id` widening guard; a new `tests/jsonl_ingest.rs` integration test ingests a small `mcp_tool_use` + success reply + error reply transcript and asserts both prefixes plus the recovered `tool_call_id` values survive end-to-end.

- Progress: JSONL extraction now also recognizes OpenAI Responses-style `refusal` content blocks (`{"type":"refusal","refusal":"I can't help with that"}`) inside `content` arrays, the refusal-mode companion to the existing `output_text` (covered by the generic text-field precedence) and `reasoning_text` block handling. Pre-fix the array walk silently dropped these blocks because they carry no `text` field and no nested `content`, so a refusal-only assistant turn disappeared from the chunk entirely and refusals surrounded by prose lost the refusal-specific provenance signal. After the fix the block renders as `[refusal] {text}` so refusals survive ingest behind a stable prefix distinct from `[thinking]` / `[tool_use:NAME]` / `[tool_error]`, making it easy to scope keyword search to refusal turns specifically. The new handler runs ahead of the generic text-field precedence in `content_to_text`'s array walk (alongside `reasoning_text_block_to_text`) so a refusal block carrying a stray `text` field still gets the `[refusal]` prefix — the `refusal` field is the canonical content for this block type, so a forward-compat / stray `text` field must not silently relabel the refusal as plain text. Missing/blank `refusal` drops the block (no `[refusal]` placeholder pollution), and the fallback is type-narrowed to `refusal` so a `type=text` block carrying a stray `refusal` field falls through to the existing text-field precedence unchanged. Focused `src/jsonl.rs` tests pin all of these cases (basic prefix, blank-field drop, missing-field drop, mixed-block interleave order with a preceding `text` block, refusal-wins-over-stray-text precedence, and the type-narrowing guard against ordinary `type=text` blocks); a new `tests/jsonl_ingest.rs` integration test ingests a real assistant turn with `[text, refusal]` content and asserts both the surrounding text and the refusal payload survive end-to-end keyword search in a single chunk joined in array order.

- Progress: JSONL extraction now also recognizes Anthropic-style `image` content blocks (`{"type":"image","source":{"type":"url","url":"..."}}` and `{"type":"image","source":{"type":"base64","media_type":"image/png","data":"..."}}`) inside `content` arrays. Pre-fix the array walk silently dropped these because image blocks have no `text` and no nested `content` — and the base64 `data` payload is binary noise that must never be serialized into a chunk (would balloon the store with opaque blobs and isn't keyword-searchable). After the fix the block renders behind a stable `[image]` prefix that preserves the "this turn had an attached image" provenance signal, and folds in the most useful searchable anchor from the `source`: a string `url` becomes `[image] {url}` (filename / hash stays keyword-searchable), and otherwise a string `media_type` becomes `[image:{media_type}]` (scopes search to PNG / JPEG / etc.). URL wins over media_type when both are present — URL carries more recall value. The base64 `data` field is never serialized into the chunk under any circumstance; the integration test pins this by searching for the data payload and asserting zero hits. Missing source / missing both anchors drops the block, matching the existing "bare prefix is pollution" rule used by `refusal` / `thinking` when the canonical payload field is missing. A direct `text` field on the block still wins via the higher-priority text-field path in `content_to_text`, so transcripts that pair an image with an inline caption stay unchanged. Type-narrowed to `image` so `type=text` blocks carrying a stray `source` field aren't accidentally relabelled. Focused `src/jsonl.rs` tests pin all of these cases (URL-source rendering, base64-source rendering with media_type, URL-wins-over-media_type precedence, both-missing drop, missing-source drop, text-field-wins precedence, mixed-block interleave order, the type-narrowing guard, and blank-url falls through to media_type); a new `tests/jsonl_ingest.rs` integration test ingests a transcript with both URL and base64 image blocks and asserts the URL filename is keyword-searchable, the media_type prefix is searchable, the base64 payload never appears in any chunk, and the surrounding text block lands in the same chunk in array order.

- Progress: JSONL extraction now also recognizes Anthropic-style `document` content blocks (`{"type":"document","source":{"type":"url","url":"..."}}`, `{"type":"document","source":{"type":"base64","media_type":"application/pdf","data":"..."}}`, `{"type":"document","source":{"type":"text","media_type":"text/plain","data":"Doc text..."}}`, and `{"type":"document","source":{"type":"content","content":[{"type":"text",...}]}}`) inside `content` arrays — the message-level attachment companion to the existing `image` block handling. Pre-fix the array walk silently dropped these because document blocks have no top-level `text` and no top-level nested `content` (content-source documents nest content under `source.content`, not `content`), so a turn whose only payload was an attached document (PDF, inline plain text, or nested content blocks) disappeared from the chunk entirely. After the fix the block renders behind a stable `[document]` prefix that preserves the "this turn had an attached document" provenance signal and folds in the most useful searchable anchor from the `source`: a `text`-source document's `data` field carries plain text — not binary noise — so it inlines as `[document] {data}` and joins keyword search on the document's literal body; a `content`-source document recurses into `source.content` via `content_to_text` so nested text blocks survive ingest; a `url` source surfaces the URL as `[document] {url}` (filename / hash stays keyword-searchable) and wins over `media_type` because URLs carry more recall value; a base64 source with only `media_type` falls back to `[document:{media_type}]` so the attachment is searchable by PDF / etc. Critically, the base64 `data` field is never serialized into the chunk under any circumstance (binary noise — would balloon the store with opaque blobs and isn't keyword-searchable) — only the deliberate `source.type=="text"` path surfaces `data`, and only because that path holds plain text rather than a binary blob; the integration test pins this contract by searching for a base64 prefix and asserting zero hits. Missing source / missing all anchors drops the block, matching the "bare prefix is pollution" rule used by `image` / `refusal` / `thinking` when the canonical payload field is missing; a direct `text` field on the block still wins via the higher-priority text-field path in `content_to_text` so transcripts that pair a document with an inline caption stay unchanged; type-narrowed to `document` so `type=text` blocks carrying a stray `source` field aren't accidentally relabelled. Focused `src/jsonl.rs` tests pin all of these cases (URL-source rendering, base64-source rendering with media_type, text-source inlining its data, content-source recursing through a nested content-block array, URL-wins-over-media_type precedence, both-missing drop, missing-source drop, text-field-wins precedence, mixed-block interleave order with surrounding text, the type-narrowing guard against `type=text` blocks carrying a stray `source`, blank-url falls through to media_type, blank text-source data falls through to dropping when no fallback anchor exists, and blank text-source data falls through to a present media_type anchor). A new `tests/jsonl_ingest.rs` integration test ingests a real transcript with all three primary shapes (URL-source PDF, base64 PDF with `JVBERi…` magic bytes, text-source quarterly report) and asserts the URL filename is keyword-searchable, the base64 surfaces only by its media_type label, the binary `JVBERi0xLjQK` payload never appears in any chunk, the text-source document's body is keyword-searchable on a token from the inlined data, and the surrounding text block in the URL turn lands in the same chunk in array order.

- Progress: JSONL extraction now also recognizes OpenAI Responses-style `function_call` content blocks (`{"type":"function_call","call_id":"call_123","name":"docs_search","arguments":"{...}"}`) inside `content` arrays, the Responses-API companion to the existing Anthropic `tool_use` / `server_tool_use` block handling. Pre-fix these blocks had no `text` and no nested `content` (Responses keeps the call payload at the block level rather than nesting it under `input`), so the array walk silently dropped them — the call name, JSON-encoded `arguments`, and `call_id` identifier all disappeared. After the fix the call renders behind the shared `[tool_use:NAME] {arguments}` prefix (parallel to Anthropic `tool_use`, as called out in the slice spec) and mixes with surrounding text blocks in array order: a turn with `[text, function_call, text]` becomes `prefix\n[tool_use:NAME] {args}\ntrailing`. `arguments` reuses `arguments_value_to_text` so OpenAI's JSON-encoded-string form passes through unparsed (keyword tokens stay searchable), structured object/array forms re-serialize via serde_json's BTreeMap-sorted path for deterministic output, and empty `{}`/`[]` structured arguments collapse to the bare prefix (string `"{}"` passes through unchanged because it's a non-blank scalar — same as the OpenAI `tool_calls`-array behaviour). Missing or blank `name` drops the block (no `[tool_use:]` pollution), a direct `text` field on the block still wins, and the fallback is type-narrowed to `function_call` so unrelated blocks aren't accidentally prefixed. `tool_call_id_from_block` and `tool_name_from_block` were widened in lockstep: `function_call` uses the standard 6-alias id list (no extra `id` widening — Responses surfaces the call id as `call_id`, already in the alias set, so accidentally matching a top-level `id` is excluded by a defensive test), and `tool_name` recovers from the first `function_call` block's `name` when neither outer scope nor the inner-`message` envelope already provides one. Outer-scope and inner-message-envelope precedence rules still hold end-to-end. Focused `src/jsonl.rs` tests pin all of these cases (JSON-string passthrough with tool_name + tool_call_id recovery, structured-object input + sorted-key determinism, missing/empty-string/empty-object arguments behaviour, missing/blank name drops, text-field-wins precedence, mixed-block interleave order, outer-tool_name / outer-tool_call_id precedence over the block, inner-message envelope recovery, the standard 6-alias coverage on `function_call`, the `id`-not-recognized guard, the first-usable-block recovery rule across multiple function_call entries, and a `type=text` block carrying `call_id` not poisoning the fallback); a new `tests/jsonl_ingest.rs` integration test ingests a real assistant turn with `[text, function_call]` content and asserts the call name, JSON-encoded arguments, and `call_responses_123` identifier all survive end-to-end into keyword search and `chunks.tool_name` / `chunks.tool_call_id`.

- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `web_search_call` output items as their own JSONL lines (`{"type":"web_search_call","id":"ws_...","status":"completed","action":{"type":"search","query":"..."}}`). Pre-fix these lines were silently dropped because the object carries no `content` / `text` / `message` / `body` / `arguments` / `input` / `tool_calls`, so every existing fallback returned `None`. After the fix the line renders behind a stable `[web_search_call:ACTION] {filtered_json}` prefix where ACTION comes from `action.type` and the body is a compact filtered subset of the action object, with a strict whitelist of small string anchors (`query`, `url`) so huge optional arrays the provider may attach (e.g. `sources` / `results` / page snippets) never leak into the chunk text. Missing/blank `action.type` drops the line; missing `action` drops the line; no whitelisted anchor falls back to a bare `[web_search_call:ACTION]` so the action type still survives ingest. The top-level `id` is promoted to `chunks.tool_call_id` only for `type=web_search_call` lines (narrow widening — the standard message-id contract used by every other line shape is unchanged, guarded by a focused test). Both the content fallback and the id-as-tool_call_id promotion also recurse through the `message` envelope so Claude Code-style wrappers carrying a `web_search_call` under `message` are covered. The new fallback is the last branch of `lookup_content` so existing higher-priority fallbacks (content / text / message / body / arguments / input / tool_calls arrays) still win for lines that happen to carry both. Focused `src/jsonl.rs` tests pin the search-action render, the open_page url anchor, sorted-key determinism in the filtered body, the strict-whitelist drop of `sources` / `results`, the bare-prefix fallback when no anchor is present, three missing-action drop cases, content-still-wins precedence, type-narrowed id widening, the no-match guard against... [truncated]

- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `file_search_call` output items as their own JSONL lines (`{"type":"file_search_call","id":"fs_...","status":"completed","queries":["..."],"results":[...]}`). Pre-fix these lines were silently dropped because the object carries no `content` / `text` / `message` / `body` / `arguments` / `input` / `tool_calls`, so every existing fallback returned `None`. After the fix the line renders behind a stable `[file_search_call] {"queries":[...]}` prefix that keeps the issued file-search queries keyword-searchable while strictly omitting the potentially huge `results` array payload. Missing / empty / all-blank `queries` still drops the line (no bare prefix pollution), the new fallback is type-narrowed to `file_search_call`, and the top-level `id` is promoted to `chunks.tool_call_id` only for this line shape. Focused unit tests cover ordered multi-query rendering, results-array omission, drop cases, content-wins precedence, type-narrowed id widening, message-envelope recursion, and outer `tool_call_id` precedence; a new end-to-end ingest/search test pins that the `[file_search_call]` prefix plus `fs_...` id survive while large inline result blobs do not.
- Progress: top-level OpenAI Responses `file_search_call` lines now also surface per-result file anchors from `results[]` as secondary `[file_search_result] ...` lines when `file_id` and/or `filename` are present, without leaking bulky snippet text. This keeps the queried file identities keyword-searchable while still omitting result `text`, `score`, and `attributes`; focused `src/jsonl.rs` tests cover both-anchor and single-anchor rendering, anchorless-entry drops, deterministic array-order joins, and the empty-results back-compat path.

- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `computer_call` output items as their own JSONL lines (`{"type":"computer_call","id":"cu_...","call_id":"call_...","status":"completed","action":{"type":"type","text":"hello world"}}` and the other action shapes — `click`/`double_click`/`drag`/`keypress`/`move`/`screenshot`/`scroll`/`type`/`wait`), the computer-use companion to the recent `local_shell_call` / `web_search_call` / `file_search_call` / `code_interpreter_call` line extractors. Pre-fix these lines were silently dropped because the object carries no `content` / `text` / `message` / `body` / `arguments` / `input` / `tool_calls`, so every existing fallback in `lookup_content` returned `None` and the issued computer action disappeared from search. After the fix the line renders behind a stable `[computer_call:ACTION] {filtered_json}` prefix (parallel to `[local_shell_call:ACTION]` / `[web_search_call:ACTION]`), where ACTION comes from `action.type` and the body is a compact, deterministic BTreeMap-sorted filtered subset of the action object. The whitelist is strict — `text` (scalar, the typed text for a `type` action) and `keys` (array of non-blank string entries, for `keypress`) — so coordinates (`x`, `y`, `scroll_x`, `scroll_y`), `button`, `pending_safety_checks`, and `path` arrays (for `drag`) never leak into the chunk text. These are the only action fields with meaningful keyword-recall value; everything else stays strictly out of the chunk. No-anchor actions (e.g. `screenshot` / `click` / `scroll`) fall through to a bare `[computer_call:ACTION]` so the action type still survives ingest. Missing `action` or missing/blank `action.type` drops the line (mirrors the missing-action drop rule on `web_search_call` / `local_shell_call`), and the fallback is type-narrowed to `computer_call` so unrelated line shapes carrying a stray `action` object are unaffected. Like `local_shell_call`, `computer_call` carries both a top-level `id` (response item id) and a distinct `call_id` (tool call id), so the standard `call_id` alias chain already picks up the correct identifier and the top-level `id` is deliberately NOT promoted — pinning the item-id-vs-call-id distinction Responses preserves for this shape, guarded by a focused regression test. The new fallback sits at the end of `lookup_content` so existing higher-priority paths still win, and it also recurses through the `message` envelope so Claude Code-style wrappers are covered. Focused `src/jsonl.rs` tests pin the `text`-anchor render, the `keys`-anchor render, sorted-key determinism when both anchors are present, the coordinate/button/safety-check non-leakage, missing-action / missing-action.type drops, the screenshot bare-prefix fallback, the empty/all-blank `keys` bare-prefix fallback, non-string `keys`-entry filtering, content-wins precedence, the no-prefix guard for unrelated line shapes carrying a stray `action` object, message-envelope recursion with `sessionId` plumbing, outer `tool_call_id` precedence, and the explicit no-id-widening regression.

- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `local_shell_call` output items as their own JSONL lines (`{"type":"local_shell_call","id":"lsh_...","call_id":"call_...","status":"completed","action":{"type":"exec","command":["bash","-c","ls /tmp"],"working_directory":"/home/agent","env":{...},"timeout_ms":30000,"user":"root"}}`), the Responses-API companion to the recent `web_search_call` / `file_search_call` / `code_interpreter_call` line extractors and the shell-side analogue to `code_interpreter_call`. Pre-fix these lines were silently dropped because the object carries no `content` / `text` / `message` / `body` / `arguments` / `input` / `tool_calls`, so every existing fallback in `lookup_content` returned `None` and the issued shell command disappeared from search. After the fix the line renders behind a stable `[local_shell_call:ACTION] {filtered_json}` prefix (parallel to `[web_search_call:ACTION]`), where ACTION comes from `action.type` (`"exec"` in current Responses spec) and the body is a compact, deterministic BTreeMap-sorted filtered subset of the action object. The whitelist is strict — `command` (array of non-blank string entries; blank/non-string entries are filtered out) and `working_directory` (scalar) — so `env` key/value maps (often large and sometimes secret-bearing), `timeout_ms`, and `user` never leak into the chunk text. Empty/all-blank `command` falls through to a `working_directory`-only anchor or a bare `[local_shell_call:ACTION]` when neither is present, so the action type still survives ingest. Missing `action` or missing/blank `action.type` drops the line (mirrors the missing-action drop rule on `web_search_call`), and the fallback is type-narrowed to `local_shell_call` so unrelated line shapes carrying a stray `action.command` are unaffected. Critical divergence from the other three top-level call extractors: `local_shell_call` carries both a top-level `id` (response item id) and a distinct `call_id` (tool call id), so the standard `call_id` alias chain already picks up the correct identifier and the top-level `id` is deliberately NOT promoted — pinning the item-id-vs-call-id distinction Responses preserves for this shape, guarded by a focused regression test that ingests a `local_shell_call` line with `id` but no `call_id` and asserts `tool_call_id` stays `None`. The new fallback sits at the end of `lookup_content` so existing higher-priority paths still win, and it also recurses through the `message` envelope so Claude Code-style wrappers are covered. Focused `src/jsonl.rs` tests pin the basic command-anchor render with deterministic JSON, the working_directory anchor (plus combined deterministic key order), the env/timeout_ms/user non-leakage, missing-action / missing-action.type drops, the empty-command bare-prefix fallback, non-string command-entry filtering, content-wins precedence, the no-prefix guard for unrelated line shapes carrying a stray `action.command`, message-envelope recursion, outer `tool_call_id` precedence, and the explicit no-id-widening regression; a new `tests/jsonl_ingest.rs` integration test ingests a real Responses-style transcript with a user prompt, a `local_shell_call` line carrying a multi-token shell command plus `working_directory` and `env`/`timeout_ms`/`user` noise, and a trailing assistant message, then asserts the `[local_shell_call:exec]` prefix plus the issued command and working_directory are keyword-searchable, the `call_id` round-trips into `chunks.tool_call_id`, the env key/value plus the response item id stay unsearchable as tool_call_id, and the surrounding prose survives unchanged.

- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `code_interpreter_call` output items as their own JSONL lines (`{"type":"code_interpreter_call","id":"ci_...","status":"completed","container_id":"cntr_...","code":"..."}`), the Responses-API companion to the recent `web_search_call` / `file_search_call` line extractors and the call-side analogue to Anthropic's `code_execution_tool_result` block. Pre-fix these lines were silently dropped because the object carries no `content` / `text` / `message` / `body` / `arguments` / `input` / `tool_calls`, so every existing fallback returned `None` and the issued code disappeared from search. After the fix the line renders behind a stable `[code_interpreter_call] {code}` prefix that keeps the executed code keyword-searchable behind a distinct prefix from `[web_search_call:ACTION]` / `[file_search_call]` / `[tool_use:NAME]` (and the reply-side `[code_execution_result]` / `[code_execution_error]` prefixes used for Anthropic's code_execution tool result). `code` is the only anchor that ever leaks into the chunk text — the optional `container_id` is a sandbox identifier whose tokens (`cntr_...`) aren't useful for keyword search and the optional `status` collapses to a small finite vocabulary that adds little recall value, so both stay strictly out of the chunk. Missing / empty / whitespace `code` drops the line (no bare-prefix pollution), the new fallback is type-narrowed to `code_interpreter_call`, and the top-level `id` is promoted to `chunks.tool_call_id` only for this line shape (mirroring the `web_search_call` / `file_search_call` id-widening contract — top-level `id` continues to be the message id on every other line). The new fallback sits at the end of `lookup_content` so existing higher-priority paths still win for lines that happen to carry both, and the id promotion sits at the end of the `tool_call_id` chain after the existing alias / content-block / tool_calls-array / web_search_call / file_search_call fallbacks. Both also recurse through the `message` envelope so Claude Code-style wrappers carrying a `code_interpreter_call` under `message` are covered. Focused `src/jsonl.rs` tests pin the basic code-anchor render, the `container_id` / `status` non-leakage, missing / empty / whitespace drops, content-wins precedence, type-narrowed id widening, the no-prefix guard for unrelated line shapes carrying a stray `code` field, message-envelope recursion, and outer `tool_call_id` precedence; a new `tests/jsonl_ingest.rs` integration test ingests a transcript with a user prompt, a `code_interpreter_call` line carrying multi-line pandas code plus a `NOISE_BLOB_SANDBOX_ID` container_id, and a trailing assistant message, then asserts the `[code_interpreter_call]` prefix plus the issued code is keyword-searchable, the `ci_...` id round-trips into `chunks.tool_call_id`, the sandbox container_id stays unsearchable anywhere in the store, and the surrounding assistant prose survives unchanged.

- Progress: JSONL extraction now also recognizes Anthropic-style `code_execution_tool_result` content blocks — the reply-side companion to a `server_tool_use` code_execution call (e.g. `name="code_execution"`). Pre-fix these blocks disappeared entirely from chunks: the success-shape `content` is a single `code_execution_result` object (stdout/stderr/return_code, not an array of text blocks) and the error-shape `content` is also a single object (`code_execution_tool_result_error` with `error_code`), so the generic `content_to_text` array walk returned `None` on both. After the fix the success path renders behind a distinct `[code_execution_result] {filtered_json}` prefix carrying a compact BTreeMap-sorted body with `return_code` (always included when numeric — `0` is a valid signal and is *not* dropped as falsy), `stdout`, and `stderr` (only when non-blank); the error path renders behind a distinct `[code_execution_error] error_code={code}` prefix so failure mode is keyword-searchable and doesn't collide with `[tool_error]` / `[mcp_tool_error]` / `[web_search_tool_error]`. The new handler runs ahead of the generic `content` recursion in `content_to_text`'s array walk so the prefix wins, and is type-narrowed on *both* the outer (`type=code_execution_tool_result`) and inner (`type=code_execution_result` or `code_execution_tool_result_error`) types so plain text blocks carrying a stray `content` object can't poison the prefix. Drop rules mirror every other typed extractor: no useful success fields → drop, a non-numeric `return_code` with nothing else → drop, error object with no `error_code` → drop, missing `content` → drop, unknown inner `type` → drop. A direct `text` field on the block still wins via the higher-priority text-field path. `tool_call_id_from_block` was widened in lockstep so the originating `server_tool_use` call id (`srvtoolu_...`) carried in the block's `tool_use_id` flows into `chunks.tool_call_id` — using the standard 6-alias list (no `id` widening, ... [truncated]
- Progress: JSONL extraction now also recognizes Anthropic-style `bash_code_execution_tool_result` content blocks — the bash-tool companion to `code_execution_tool_result`. Success replies now render behind a distinct `[bash_code_execution_result] {filtered_json}` prefix carrying non-blank `stdout` / `stderr` plus numeric `return_code` (including `0`), error replies render as `[bash_code_execution_error] error_code=...`, and `tool_use_id` now flows into `chunks.tool_call_id`. Focused unit/integration tests pin success/error rendering, drop cases, join ordering, type narrowing, and the no-top-level-id-widening guard.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `local_shell_call_output` output items as their own JSONL lines (`{"type":"local_shell_call_output","id":"lsh_out_...","call_id":"call_...","status":"completed","output":"<stdout/stderr text>"}`), the reply-side companion to the existing top-level `local_shell_call` extractor. Pre-fix these lines were silently dropped because the object carries no `content` / `text` / `message` / `body` / `arguments` / `input` / `tool_calls`, so every existing fallback in `lookup_content` returned `None` and the shell command output disappeared from search. After the fix the line renders behind a stable `[local_shell_call_output] {output_text}` prefix — deliberately distinct from `[local_shell_call:ACTION]` so keyword search can scope to outputs vs command initiations. `output` is read via `arguments_value_to_text` so the current string-typed form (per current Responses spec) passes through directly with all keyword tokens intact, while any future structured object/array form re-serializes deterministically via serde_json's BTreeMap-sorted path. Empty `{}` / `[]`, missing, or whitespace-only `output` drops the line — a bare `[local_shell_call_output]` carries no useful signal. The new fallback is type-narrowed to `local_shell_call_output` and sits at the end of `lookup_content` so existing higher-priority paths still win for lines that happen to carry both. Like `local_shell_call` (and `computer_call`), this shape carries a distinct top-level `call_id` field for the call identifier — the standard alias chain already picks that up, and `id` is deliberately NOT promoted to `tool_call_id` because it is the response item id (pinning the item-id-vs-call-id distinction Responses preserves for this shape, mirroring `local_shell_call`'s no-id-widening guard). Both the content fallback and the standard `call_id` alias also recurse through the `message` envelope so Claude Code-style wrappers carrying a `local_shell_call_output` under `message` are covered. Focused `src/jsonl.rs` tests pin the string-output render, the forward-compat structured-output deterministic key ordering, the no-id-widening regression, missing/blank-output drops (covering empty string, whitespace, empty object, and empty array), content-wins precedence, the type-narrowing guard against unrelated line shapes carrying a stray `output` field, message-envelope recursion with `sessionId` plumbing, and outer-`tool_call_id`-wins precedence; a new `tests/jsonl_ingest.rs` integration test ingests a transcript with a user prompt, a `local_shell_call` line, a `local_shell_call_output` line carrying multi-line stdout, and a trailing assistant message, then asserts the `[local_shell_call_output]` prefix plus the issued output is keyword-searchable, the `call_id` round-trips into `chunks.tool_call_id`, the response item id stays unpromoted, the new prefix doesn't collide with `[local_shell_call:`, and surrounding prose survives unchanged.

- Progress: JSONL extraction now also recognizes Anthropic-style `web_search_tool_result` content blocks inside `content` arrays — the reply-side companion to `server_tool_use` web_search calls. Pre-fix these blocks disappeared entirely from chunks: the success shape carries an array of inner `web_search_result` entries that have no `text` or nested `content` (only `url` / `title` / `encrypted_content` / `page_age`), so the generic array walk dropped them; the error shape (`content` is a single `web_search_tool_result_error` object, not an array) also returned `None` from `content_to_text`. After the fix each successful entry renders behind a distinct `[web_search_tool_result] {title} {url}` prefix per entry, with multiple entries joined by `\n` in array order so the result list survives as readable text; either title or url alone is enough to keep the entry (both-missing entries drop) and the opaque `encrypted_content` binary blob is never serialized into the chunk under any circumstance (parallel to the rule already enforced for base64 image / document data). Error blocks render behind a distinct `[web_search_tool_error] error_code={code}` prefix so failure modes are keyword-searchable and don't collide with `[tool_error]` / `[mcp_tool_error]`; missing/blank `error_code` drops the block to avoid bare-prefix pollution. The new handler runs ahead of the generic `content` recursion in `content_to_text`'s array walk so its rendering wins for `web_search_tool_result` specifically, and is type-narrowed so other blocks (including plain `type=text` blocks carrying a stray `content` field) fall through to their existing handlers unchanged. `tool_call_id_from_block` was widened in lockstep so the originating `server_tool_use` call id (`srvtoolu_...`) carried in the block's `tool_use_id` flows into `chunks.tool_call_id` — using the standard 6-alias list (no `id` widening, mirroring regular `tool_result` / `mcp_tool_result`). A direct `text` field on the block still wins and ... [truncated]
- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `reasoning` output items (`{"type":"reasoning","id":"rs_...","summary":[{"type":"summary_text","text":"..."}],"encrypted_content":"..."}`) as their own JSONL lines, the reasoning-model companion to the existing `reasoning_text` content-block handling. Pre-fix these lines were silently dropped — the object carries no `content` / `text` / `message` / `body` / `arguments` / `input` / `tool_calls`, so every higher-priority fallback in `lookup_content` returned `None` and the model's summarized reasoning trace disappeared from search. After the fix the line renders behind a stable `[reasoning] {joined_text}` prefix — deliberately distinct from the content-block `[thinking]` prefix so keyword search can scope to top-level reasoning summaries vs inline reasoning traces — and walks the `summary` array, extracts the `text` field from each `summary_text` block (type-narrowed inner walk so unknown future entry types are filtered out cleanly), and joins usable entries with `\n` in array order so multi-step reasoning summaries land in a single chunk preserving sequence. Blank / missing `text` fields inside `summary_text` blocks are filtered out; a missing `summary` array, an empty array, or all-blank entries drop the line — no bare-prefix pollution. The optional `encrypted_content` base64 blob is never serialized into the chunk under any circumstance — that would pollute the store with opaque non-searchable bytes, paralleling the rule already enforced for base64 image / document payloads (integration test pins the contract by searching for the blob token and asserting zero hits). Type-narrowed to `reasoning` so unrelated line shapes carrying a stray `summary` array are unaffected, and recurses through the `message` envelope so Claude Code-style wrappers carrying a `reasoning` item under `message` are covered. Unlike `web_search_call` / `file_search_call` / `code_interpreter_call`, the `reasoning` shape is NOT a tool invocation — the top-level `id` (e.g. `rs_...`) is the response item id and is deliberately NOT promoted to `tool_call_id`, mirroring the no-id-widening rule used for `local_shell_call` / `computer_call` / `*_output` shapes; an explicit regression test pins this. Focused `src/jsonl.rs` tests cover basic single-block rendering, multi-block join order, non-`summary_text` entry filtering, blank-entry skipping, missing/empty-summary drops, all-blank drops, `encrypted_content` non-leakage, content-wins precedence, `message`-envelope recursion with `sessionId` plumbing, the no-id-widening guard, and a type-narrowing guard against `type=message` lines carrying a stray `summary` array; a new `tests/jsonl_ingest.rs` integration test ingests a user prompt + reasoning line with two `summary_text` blocks plus an `encrypted_content` blob + trailing assistant message, then asserts the `[reasoning]` prefix plus both summary entries are keyword-searchable, the encrypted blob is invisible, the response item id stays unpromoted in `chunks.tool_call_id`, and the surrounding prose survives unchanged.

- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `computer_call_output` items (`{"type":"computer_call_output","id":"cuo_...","call_id":"call_...","output":{"type":"input_image","image_url":"https://..."}}`) as their own JSONL lines. The new late fallback in `src/jsonl.rs` renders them behind a stable `[computer_call_output] ...` prefix, preferring `output.image_url` and falling back to `[computer_call_output:file] {file_id}` when only a file handle is present; missing anchors still drop the line, and the top-level response-item `id` is deliberately not widened into `tool_call_id` because `call_id` remains the real tool-call identifier. Focused unit tests cover URL/file-id precedence, envelope recursion, content-wins behavior, and the no-id-widening guard; a new `tests/jsonl_ingest.rs` regression pins end-to-end ingest/search behavior for the top-level line shape.
- Progress: top-level OpenAI Responses `computer_call_output` lines now special-case screenshot-style `data:image/...;base64,...` payloads so Lantern stores only a compact `[computer_call_output:MEDIA_TYPE]` anchor instead of dumping the opaque base64 blob into chunk text. Ordinary non-`data:` image URLs still win, malformed/blank-media-type data URLs fall back to `file_id` when present, and focused unit/integration tests pin both the media-type render and the no-base64-leak guarantee.

- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `function_call_output` items (`{"type":"function_call_output","id":"fco_...","call_id":"call_...","output":"<tool reply text>"}`) as their own JSONL lines — the reply-side companion to a top-level `function_call`. Pre-fix the line had no `content` / `text` / `message` / `body` / `arguments` / `input` / `tool_calls`, so every higher-priority fallback in `lookup_content` returned `None` and the tool reply disappeared from search (only the content-block companion was covered, which uses a different prefix and a different envelope shape). After the fix the new late fallback in `src/jsonl.rs` renders the line behind a stable `[function_call_output] {output_text}` prefix — deliberately distinct from the content-block companion's `[tool_result]` prefix so keyword search can tell a top-level Responses-style reply apart from a `function_call_output` block embedded inside a message's `content` array. `output` is read via `arguments_value_to_text` so OpenAI's current string-typed form passes through unchanged with all keyword tokens intact, while any structured object/array shape re-serializes deterministically via serde_json's BTreeMap-sorted path. Empty `{}` / `[]`, missing, or whitespace-only `output` drops the line — a bare `[function_call_output]` carries no useful signal. Type-narrowed to `function_call_output` so unrelated line shapes carrying a stray `output` field are unaffected. Like `local_shell_call_output` / `computer_call_output`, the shape carries a distinct top-level `call_id` for the call identifier — the standard alias chain already picks that up, and `id` (response item id) is deliberately NOT promoted to `tool_call_id`. Focused `src/jsonl.rs` tests pin the string-output render with `call_id` recovery, forward-compat structured-output deterministic key ordering, the no-id-widening regression, four missing/blank-output drop cases (empty string, whitespace, empty object, empty array), content-wins precedence, the type-narrowing guard against unrelated line shapes carrying a stray `output` field, message-envelope recursion with `sessionId` plumbing, outer-`tool_call_id`-wins precedence, and a side-by-side line-vs-block prefix-distinctness test; a new `tests/jsonl_ingest.rs` integration test ingests a transcript with a user prompt, a `function_call` line, a `function_call_output` line carrying the tool reply, and a trailing assistant message, then asserts the `[function_call_output]` prefix plus the issued reply text is keyword-searchable, the `call_id` round-trips into `chunks.tool_call_id`, the response item id stays unpromoted, the new prefix doesn't collide with the call-line `[tool:NAME]` rendering, and surrounding prose survives unchanged.

- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `code_interpreter_call_output` items (`{"type":"code_interpreter_call_output","id":"cio_...","call_id":"call_...","output":"<tool reply text>"}`) as their own JSONL lines — the reply-side companion to the existing top-level `code_interpreter_call` extractor. The new late fallback in `src/jsonl.rs` renders the line behind a stable `[code_interpreter_call_output] {output_text}` prefix distinct from `[code_interpreter_call]`, reads `output` via `arguments_value_to_text` so string output passes through unchanged and any future structured object/array form re-serializes deterministically, drops the line on missing / whitespace-only / empty-`{}`/`[]` `output`, type-narrows to `code_interpreter_call_output` so unrelated lines carrying a stray `output` field are unaffected, recurses through the `message` envelope, and preserves the existing tool-call-id contract — the standard alias chain picks up `call_id` while the top-level response-item `id` is deliberately NOT promoted into `tool_call_id`. Focused `src/jsonl.rs` tests cover basic rendering, structured-output deterministic serialization, empty/missing-output drops, content-wins precedence, message-envelope recursion, outer-`tool_call_id`-wins, the type-narrowing guard, and the explicit no-top-level-id-widening guard; a new `tests/jsonl_ingest.rs` integration test ingests a transcript with a `code_interpreter_call` + reply + surrounding prose and asserts the new prefix plus `call_id` survive end-to-end search while the response-item `id` does not become `tool_call_id`.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `image_generation_call` items (`{"type":"image_generation_call","id":"ig_...","status":"completed","revised_prompt":"...","result":"<base64 png>"}`) as their own JSONL lines. The new late fallback renders them behind a stable `[image_generation_call] {revised_prompt}` prefix, promotes the top-level `ig_...` id into `chunks.tool_call_id` only for this line shape, recurses through a nested `message` envelope, and deliberately omits the opaque base64 `result` blob plus low-recall config/status fields from chunk text. Focused `src/jsonl.rs` tests cover prompt rendering, missing/blank prompt drops, content-wins precedence, message-envelope recursion, type-narrowed id widening, and non-leakage of `result`/config fields; a new `tests/jsonl_ingest.rs` end-to-end regression pins ingest/search behavior and the `tool_call_id` round-trip.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `mcp_call` items (`{"type":"mcp_call","id":"mcp_...","server_label":"linear","name":"create_issue","arguments":"{...}"}`) as their own JSONL lines. Instead of falling through to the generic `arguments` payload path, they now render behind a distinct `[mcp_call:SERVER/NAME] ...` prefix, promote the top-level `mcp_...` id into `chunks.tool_call_id` only for this line shape, recurse through a nested `message` envelope, and keep the prefix distinct from Anthropic content-block `[mcp_tool_use:SERVER/NAME]` so keyword search can scope the transport/provider. Focused `src/jsonl.rs` tests cover prefix rendering, missing-name drops, name-only fallback when `server_label` is absent/blank, empty/missing-arguments prefix collapse, message-envelope recursion, content-wins precedence, and type-narrowed id widening; a new `tests/jsonl_ingest.rs` end-to-end regression pins ingest/search behavior for the `server_label`, argument payload, and `tool_call_id` round-trip.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `mcp_list_tools` items (`{"type":"mcp_list_tools","id":"mcpl_...","server_label":"linear","tools":[{"name":"create_issue"},...]}`) as their own JSONL lines. The new late fallback renders them behind a distinct `[mcp_list_tools:SERVER] name1, name2, ...` prefix so MCP server tool listings stay searchable without collapsing into call-shaped `[mcp_call:SERVER/NAME]` or content-block `[mcp_tool_use:SERVER/NAME]` prefixes. It preserves array order, drops blank/non-string tool names, recurses through nested `message` envelopes, omits verbose tool descriptions from chunk text, and deliberately does not widen the top-level response-item `id` into `tool_call_id` because listings are not calls. Focused `src/jsonl.rs` tests cover server/no-server rendering, empty-list bare-prefix behavior, content-wins precedence, type narrowing, envelope recursion, and the no-id-widening guard; a new `tests/jsonl_ingest.rs` regression pins end-to-end searchability for the server label and listed tool names.
- Progress: top-level OpenAI Responses `mcp_list_tools` lines now also surface inline listing failures carried in the same JSONL item via `error`. Success listings are unchanged, but when the server returns an inline error Lantern appends a distinct `[mcp_list_tools_error:SERVER] ...` line (or server-less `[mcp_list_tools_error] ...`) using the same deterministic payload rendering as `mcp_call.error`. This keeps MCP capability-listing failures keyword-searchable without widening `mcpl_...` response-item ids into `tool_call_id`, and focused unit/integration tests pin partial-listing warnings, server-scoped failures, structured-error determinism, and the no-id-widening guard.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `function_call` items (`{"type":"function_call","id":"fc_...","call_id":"call_...","name":"docs_search","arguments":"{...}"}`) as their own JSONL lines. Instead of falling through to the generic `arguments` payload path, they now render behind a distinct `[function_call:NAME] ...` prefix so top-level call lines stay distinguishable from content-block `[tool_use:NAME]` calls and OpenAI `role="tool"` replies. `arguments` still pass through the shared deterministic payload helper, `call_id` remains the true `tool_call_id` while the top-level response-item `id` is deliberately not widened, and the extractor recurses through a nested `message` envelope. Focused `src/jsonl.rs` tests pin payload-fallback preemption, prefix non-duplication, missing-name drops, content-wins precedence, envelope recursion, and the no-id-widening guard; a new `tests/jsonl_ingest.rs` end-to-end regression pins ingest/search behavior for the distinct prefix, argument-token searchability, and `tool_call_id` round-trip.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `mcp_approval_request` items (`{"type":"mcp_approval_request","id":"mcpr_...","server_label":"deepwiki","name":"ask_question","arguments":"{...}"}`) as their own JSONL lines — the application-mediated approval gate that precedes a paired `mcp_call`. Pre-fix the generic `payload_text("arguments")` fallback would still surface the raw arguments JSON but lose the server/name provenance prefix, so an approval gate became indistinguishable from the call it gated. The new gated extractor (mirrors the existing `mcp_call` gating block) renders the line behind a distinct `[mcp_approval_request:SERVER/NAME] {compact_arguments_json}` prefix so keyword search can scope to approval gates vs invocations independently, with `server_label` folded into the prefix and the top-level `mcpr_...` id type-narrowed-promoted into `chunks.tool_call_id` so the paired `mcp_approval_response`'s `approval_request_id` joins the same chunk. `arguments` reuses `arguments_value_to_text` (BTreeMap-sorted, JSON-encoded-string passthrough); empty `{}`/`[]` and missing/blank `arguments` collapse to the bare prefix; missing/blank `name` drops the line; missing/blank `server_label` falls back to `[mcp_approval_request:NAME]`. Focused tests pin renderings, deterministic structured-arguments serialization, server-less fallback, name-drop, empty/missing arguments collapse, payload-fallback preemption, content-wins precedence, type-narrowed id widening, message-envelope recursion, outer `tool_call_id` precedence, and prefix-distinct-from-`[mcp_call:]` invariants.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `custom_tool_call` items (`{"type":"custom_tool_call","id":"ctc_...","call_id":"call_...","name":"search_docs","input":"..."}`) as their own JSONL lines — the application-defined custom-tool companion to the standard `function_call` line shape. The new gated extractor in `src/jsonl.rs` renders them behind a distinct `[custom_tool_call:NAME] ...` prefix (distinct from `[function_call:NAME]` / `[tool_use:NAME]`), preempts the generic `payload_text("input")` fallback so the provenance prefix survives, reuses `arguments_value_to_text` so string inputs pass through and structured object/array inputs serialize deterministically via the BTreeMap-sorted path, drops the line on missing/blank `name`, collapses empty/missing `input` to the bare prefix, recurses through a nested `message` envelope, preserves higher-priority `content`/`text`/etc. precedence, suppresses the synthetic `[tool:NAME]` role-fold via the `content_has_typed_tool_prefix` guard, and deliberately does NOT widen the top-level response-item `id` into `tool_call_id` (the standard `call_id` alias remains the real tool_call_id). Focused unit tests cover string passthrough, deterministic structured-input serialization, missing/blank-name drops, empty/missing-input prefix collapse, content-wins precedence, message-envelope recursion, the no-top-level-id-widening guard, and the role-fold non-duplication invariant; a new `tests/jsonl_ingest.rs` regression pins end-to-end search behavior for the distinct prefix, input-token searchability, and `call_id` round-trip while the response-item `id` stays unpromoted.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `mcp_approval_response` items (`{"type":"mcp_approval_response","approval_request_id":"mcpr_...","approve":true,"reason":"..."}`) as their own JSONL lines — the application-mediated approval reply that pairs with a preceding `mcp_approval_request` gate. Pre-fix the response line was silently dropped: the object carries no `content` / `text` / `message` / `body` / `arguments` / `input` / `tool_calls`, so every higher-priority fallback in `lookup_content` returned `None` and the approval decision disappeared from search. After the fix the line renders behind a stable `[mcp_approval_response:approved]` or `[mcp_approval_response:denied]` prefix — distinct on purpose from `[mcp_approval_request:SERVER/NAME]` (the gate this responds to) and from `[mcp_call:SERVER/NAME]` (the call that follows on approval) so audit queries can scope to response decisions independently. The status (approved/denied) is folded into the prefix because the boolean is the canonical content; an optional non-blank `reason` is appended verbatim after the prefix. The `approval_request_id` field is type-narrowed-promoted into `chunks.tool_call_id` so the request and the response join by the same key (the request's `mcpr_...` top-level id already gets the same promotion via `mcp_approval_request_tool_call_id`). Drop rules: missing/blank `approval_request_id` drops the line (no link back to the gate = no useful provenance); missing or non-boolean `approve` drops the line (no canonical content); blank `reason` collapses to the bare prefix; the top-level response item `id` is NOT widened (canonical join key is `approval_request_id`). Type-narrowed to `mcp_approval_response` so unrelated lines carrying stray `approve` / `approval_request_id` fields are unaffected. Focused `src/jsonl.rs` tests pin approved/denied renderings, reason inlining, blank-reason collapse, missing-approval_request_id / missing-or-non-bool-approve drops, content-wins precedence, type narrowing, `approval_request_id`→`tool_call_id` promotion, message-envelope recursion, and outer `tool_call_id` precedence; a new `tests/jsonl_ingest.rs` regression pins end-to-end searchability for the approval status, reason, and request/response join key.
- Progress: OpenAI Responses-style top-level `mcp_call` lines that also carry `approval_request_id` now preserve that gate id as searchable call-side provenance without stealing the call's canonical `tool_call_id`. The line renderer appends a distinct `[mcp_call_approval_request:SERVER/NAME] mcpr_...` secondary line after the existing `[mcp_call:SERVER/NAME] ...` content so audit queries can find the gated invocation itself by approval id while `chunks.tool_call_id` stays bound to the call's own `mcp_...` id. Focused `src/jsonl.rs` tests pin render order, omission when `approval_request_id` is blank/missing, nested-message recovery, and the outer-tool-call-id-wins contract; `tests/jsonl_ingest.rs` adds an end-to-end regression that proves one approval id search now finds the gate request, the approval response, and the gated call chunk itself.

- Progress: JSONL extraction now also recognizes OpenAI Responses-style top-level `custom_tool_call_output` items (`{"type":"custom_tool_call_output","id":"ctco_...","call_id":"call_...","output":"..."}`) as their own JSONL lines. The new late fallback in `src/jsonl.rs` renders the line behind a stable `[custom_tool_call_output] {output_text}` prefix distinct from `[function_call_output]`, `[local_shell_call_output]`, and content-block `[tool_result]`, reads `output` through the shared deterministic payload helper so string output passes through unchanged while future structured object/array output serializes predictably, drops missing / whitespace-only / empty-`{}` / empty-`[]` output, recurses through nested `message` envelopes, preserves content-wins precedence, and keeps the Responses item-id vs call-id contract intact (`call_id` remains the real `tool_call_id`, while the top-level response-item `id` is deliberately not widened). Focused `src/jsonl.rs` tests cover rendering, structured-output determinism, drop cases, envelope recursion, precedence, type narrowing, and the no-id-widening guard; a new `tests/jsonl_ingest.rs` regression pins end-to-end ingest/search behavior plus the `tool_call_id` round-trip.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style `input_image` content blocks (`{"type":"input_image","image_url":"https://..."}`, `{"type":"input_image","image_url":"data:image/png;base64,..."}`, and `{"type":"input_image","file_id":"file_..."}`) inside `content` arrays — the user-side image-attachment companion to Anthropic `image` blocks. The new typed handler in `src/jsonl.rs` renders them behind a distinct `[input_image]` prefix, prefers a non-`data:` `image_url`, falls back to `[input_image:MEDIA_TYPE]` for `data:` URLs without leaking the base64 payload, then to `[input_image:file] FILE_ID` when only a file handle is present, and drops anchorless blocks to avoid bare-prefix noise. Focused unit tests cover URL/file/media-type rendering, malformed `data:` fallback, text/type precedence, mixed-block ordering, and Anthropic/OpenAI coexistence; a new `tests/jsonl_ingest.rs` regression pins end-to-end ingest/search behavior including the no-base64-leak guarantee.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style `input_file` content blocks (`{"type":"input_file","file_url":"https://..."}`, `{"type":"input_file","filename":"report.pdf","file_data":"data:application/pdf;base64,..."}`, and `{"type":"input_file","file_id":"file_..."}`) inside `content` arrays — the user-side document/file-attachment companion to `input_image`. The new typed handler in `src/jsonl.rs` renders them behind a distinct `[input_file]` prefix, prefers non-`data:` `file_url`, then `filename`, then data-URL media type, then `[input_file:file] FILE_ID`, and never serializes base64 payloads from `file_data` / `data:` URLs into chunk text. Focused unit tests cover URL/filename/media-type/file-id rendering, malformed data-URL fallback, text/type precedence, mixed-block ordering, and Anthropic `document` coexistence; a new `tests/jsonl_ingest.rs` regression pins end-to-end searchability while asserting the base64 payload never lands in search.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style `input_audio` content blocks (`{"type":"input_audio","input_audio":{"data":"<base64>","format":"wav"}}`, `{"type":"input_audio","input_audio":{...},"transcript":"hello"}`, and `{"type":"input_audio","file_id":"file_..."}`) inside `content` arrays — the user-side audio-attachment companion to `input_image` / `input_file`. The new typed handler in `src/jsonl.rs` renders them behind a distinct `[input_audio]` prefix, prefers a non-blank top-level `transcript` (the only human-readable, full-text-searchable anchor), then `[input_audio:file] FILE_ID`, then a codec-scoped `[input_audio:{format}]` from nested `input_audio.format`, and drops anchorless blocks to avoid bare-prefix noise. The nested `input_audio.data` base64 payload is never touched and never serialized into chunk text under any path. Focused unit tests cover transcript/file_id/format rendering, the transcript-over-file_id-over-format precedence chain, missing-anchor drops, direct-text precedence, mixed-block ordering with input_image/input_file companions, the type-narrowing guard against ordinary text blocks carrying stray `transcript` / `file_id` fields, and a belt-and-braces leak check across every anchor path; a new `tests/jsonl_ingest.rs` regression ingests a transcript with a transcript-anchored and a file_id-anchored input_audio block and asserts both anchors are keyword-searchable while the raw base64 audio bytes never appear in any chunk.
- Progress: JSONL extraction now also recognizes OpenAI Responses-style `output_audio` content blocks (`{"type":"output_audio","transcript":"hello","data":"<base64>","format":"wav"}`, `{"type":"output_audio","file_id":"file_..."}`, and the defensive nested `{"type":"output_audio","output_audio":{"data":"...","format":"wav"}}` shape) inside assistant `content` arrays — the assistant-side audio-output companion to user-side `input_audio`. The new `output_audio_block_to_text` handler in `src/jsonl.rs` renders them behind a distinct `[output_audio]` prefix so keyword search can scope assistant audio output vs user audio input, with anchor precedence transcript → file_id → flat `format` → nested `output_audio.format` → drop. The top-level `data` field and nested `output_audio.data` field are never serialized into chunk text under any path (raw base64 audio bytes — would balloon the store with opaque blobs and aren't keyword-searchable). A direct `text` field still wins via the higher-priority text path; type-narrowed to `output_audio` so `type=text` blocks carrying stray `transcript`/`file_id`/`format` fields aren't accidentally relabelled. 12 focused unit tests pin transcript/file_id/flat-format/nested-format rendering, the full transcript → file_id → flat → nested precedence chain, file_id-over-format and flat-over-nested precedence, missing-anchor drops, direct-text precedence, mixed-block interleave order, a belt-and-braces leak check across every anchor path (both flat and nested `data`), the type-narrowing guard, and an `input_audio` vs `output_audio` distinctness guard ensuring same-shape blocks round-trip with the right direction prefix. A new `tests/jsonl_ingest.rs` integration test ingests an assistant turn with text + transcript-anchored `output_audio` plus a standalone format-only `output_audio` and asserts both anchors are keyword-searchable, the surrounding text caption joins in array order, and the raw base64 payload never appears in any chunk.

- Progress: JSONL extraction now also surfaces the citation `annotations` array on OpenAI Responses-style `output_text` content blocks (`{"type":"output_text","text":"...","annotations":[{"type":"url_citation","url":"...","title":"..."},{"type":"file_citation","file_id":"file_...","filename":"..."}]}`). Pre-fix the generic text-field precedence in `content_to_text`'s array walk picked up the `text` body cleanly but never inspected `annotations`, so the cited URL / title / file_id / filename — often the single most useful provenance signal on a turn (e.g. for retrieved web-search results) — were invisible to keyword search. After the fix each usable annotation renders on its own line behind a stable provenance prefix appended after the text body: `[url_citation] {url}` (or `[url_citation] {url} {title}` when title is non-blank) per `url_citation` entry, and `[file_citation] {file_id} {filename}` (with single-field `[file_citation] {file_id}` or `[file_citation] {filename}` fallbacks when only one anchor is present) per `file_citation` entry. Annotations join in array order so callers can correlate citation N back to the same position in the upstream transcript. The new handler runs ahead of the generic text-field precedence so the augmented rendering wins for `output_text` blocks with annotations specifically; type-narrowed to `output_text` so ordinary `type=text` (and every other) block falls through to the existing text-field path unchanged (Anthropic text blocks carry citations under a different `citations` shape and are deliberately left for a future slice). Backwards-compat falls through to the plain text path in three cases: missing `annotations` field, empty `annotations` array, and an array containing only unknown entry types or entries dropped for missing required fields — no trailing newline pollution. Optional `start_index` / `end_index` byte offsets are deliberately not surfaced (positional metadata, not keyword-searchable). Focused `src/jsonl.rs` tests pin url_citation rendering with and without title, multi-citation array-order join, file_citation rendering with both anchors and single-anchor fallbacks, the no-anchors file_citation drop, plain-text fallthrough for missing/empty/unknown-type annotations, the missing-url url_citation drop, mixed url+file rendering, mixed-block array-order interleave with surrounding text blocks, and the type-narrowing guard against `type=text` blocks carrying a stray `annotations` array. A new `tests/jsonl_ingest.rs` integration test ingests a user prompt + assistant `output_text` block carrying both a `url_citation` and a `file_citation` annotation, then asserts the assistant body, the cited URL, the cited title, the file_id, and the filename are each independently keyword-searchable end-to-end and all live in the same chunk under `[assistant]` with the citation lines joined in array order; surrounding user prose still survives unchanged.

- Progress: JSONL extraction now also recognizes Anthropic-style `citations` arrays on `type=text` content blocks — the Anthropic-side companion to OpenAI Responses `output_text` + `annotations`, deliberately left for a future slice in the original `output_text` annotation note above. Pre-fix the generic text-field precedence picked up the `text` body cleanly but the array walk never inspected `citations`, so document-grounded responses lost the cited `document_title` / `cited_text` anchors entirely — the single most useful provenance signal for citations-mode turns. After the fix, when a `type=text` block carries at least one usable citation entry, the block renders as `{text}\n[citation:CITATION_TYPE] {document_title} {cited_text}` per recognized entry, with single-field fallbacks (`{document_title}` or `{cited_text}` alone) when only one anchor is non-blank. Three document-coordinate citation types are recognized in this slice: `char_location`, `page_location`, and `content_block_location` — they all reference a document Anthropic was given and share the same anchor set (just pin the span with different coordinates: chars / pages / structured blocks). Each gets its own type-bearing prefix (`[citation:char_location]` etc.) so keyword search can scope to citation coordinate kind independently. Multiple citations join in array order so callers can correlate citation N back to its position in the upstream transcript. The new handler runs ahead of the generic text-field precedence in `content_to_text`'s array walk so the augmented rendering wins for `type=text` blocks with citations specifically; type-narrowed to `text` so OpenAI `type=output_text` blocks fall through to their existing `output_text_with_annotations_to_text` handler unchanged. Backwards-compat falls through to the plain text path in three cases: missing `citations` field (virtually every Anthropic assistant turn before citations were introduced), empty `citations` array, and an array containing only unknown citation types (or entries dropped for missing anchors) — no trailing newline pollution. `document_index` / `start_char_index` / `end_char_index` / page / block coordinates are deliberately not surfaced (positional metadata, not keyword-searchable). `web_search_result_location` / `search_result_location` citation types are deliberately left for a follow-up slice — they carry `url` / `source` rather than `document_title` and are best handled alongside their respective server-tool surfaces. Focused `src/jsonl.rs` tests pin char_location rendering with both anchors, document_title-only and cited_text-only single-anchor fallbacks, page_location and content_block_location rendering with their distinct prefixes, multi-citation array-order join, the no-anchors drop with plain-text fallthrough, plain-text fallthrough for missing/empty/unknown-type citations (including a forward-compat `web_search_result_location` entry that drops cleanly today), mixed-block array-order interleave with surrounding text blocks, the type-narrowing guard against `type=output_text` blocks carrying a stray `citations` array, and a side-by-side Anthropic-vs-OpenAI rendering test that proves the two handlers stay strictly non-overlapping. A new `tests/jsonl_ingest.rs` integration test ingests a user prompt + assistant `text` block with both a `char_location` and a `page_location` citation, then asserts the assistant body, both `document_title` values, and both `cited_text` values are each independently keyword-searchable end-to-end and all live in the same chunk under `[assistant]` with the citation lines joined in array order; surrounding user prose still survives unchanged.

- Progress: JSONL extraction now also recognizes Anthropic-style `web_search_result_location` and `search_result_location` citation entries on `type=text` content blocks — the server-tool-result companions to the existing document-coordinate `char_location` / `page_location` / `content_block_location` citation types covered in the original Anthropic citations slice. Pre-fix the citation-array walk dropped these entries via the catch-all unknown-type arm in `anthropic_citation_to_text`, so `web_search` and Search API tool citations were invisible to keyword search even though the assistant body itself rendered cleanly. After the fix `web_search_result_location` renders behind `[citation:web_search_result_location] {url} {title} {cited_text}` (joining whichever of those three anchors are non-blank in precedence order — URL first because it's the strongest keyword-search anchor) and `search_result_location` renders behind `[citation:search_result_location] {source} {title} {cited_text}` (`source` is the canonical identifier — a URL, doc id, or arbitrary caller-supplied string). Both keep the same per-entry, line-per-citation, type-narrowed rendering already used by the document-coordinate arms so callers can scope keyword search to web vs search-tool vs document-coordinate citations independently. All-anchors-missing drops the entry (no `[citation:web_search_result_location]` / `[citation:search_result_location]` pollution); when every entry in the array drops, the block falls through to the plain text path with no trailing newline. Positional / opaque metadata stays strictly out of the chunk — `encrypted_index` on web citations and `start_block_index` / `end_block_index` on search citations are never surfaced (an integration assert pins the no-leak rule for `encrypted_index`). Focused `src/jsonl.rs` tests pin the all-anchors render, single-anchor fallbacks (url-only, title-only, cited_text-only for web; source-only for search), the all-missing drop with plain-text fallthrough for both citation types, and a mixed-citation array-order render that interleaves `char_location` + `web_search_result_location` + `search_result_location` on consecutive lines in a single block. The pre-existing `anthropic_text_with_only_unknown_citation_types_renders_plain_text` test was updated to drop its `web_search_result_location` placeholder (now a known type) and continues to pin the unknown-type plain-text fallthrough. A new `tests/jsonl_ingest.rs` integration test ingests a user prompt + assistant `text` block carrying both a `web_search_result_location` and a `search_result_location` citation, then asserts the assistant body, both URLs / sources, both titles, and both cited_text excerpts are each independently keyword-searchable end-to-end and all live in the same chunk under `[assistant]` joined in array order; the `encrypted_index` opaque blob is asserted to never appear in any chunk; surrounding user prose still survives unchanged.

- Progress: JSONL extraction now also recognizes OpenAI Responses-style `container_file_citation` annotations on `output_text` blocks — the code-interpreter companion to `file_citation` for assistant turns that reference files produced inside (or uploaded into) a code interpreter container. Pre-fix the annotation walk dropped these entries entirely via the catch-all unknown-type arm in `annotation_to_text`, so the cited `file_id` / `filename` — the strongest provenance anchors on a code-interpreter citation — were invisible to keyword search. After the fix each entry renders behind a distinct `[container_file_citation] {file_id} {filename}` prefix with single-anchor fallbacks (`{file_id}`-only or `{filename}`-only when only one is non-blank). The opaque `container_id` sandbox handle is deliberately omitted from chunk text — mirrors the `code_interpreter_call` no-leak rule that `container_id` tokens like `cntr_...` never end up in chunks — and `start_index` / `end_index` byte offsets are also dropped (positional, not keyword-searchable). Drops cleanly when both `file_id` and `filename` are missing (no bare `[container_file_citation]` pollution); type-narrowing to `container_file_citation` keeps `file_citation` / `url_citation` arms unchanged. Focused `src/jsonl.rs` tests pin both-anchors rendering with `container_id` non-leakage, single-anchor file_id / filename fallbacks, the no-anchor drop with plain-text fallthrough, mixed-annotation array-order interleave with `url_citation` + `file_citation`, the prefix-distinct-from-`[file_citation]` invariant, and a type-narrowing guard against `type=text` blocks carrying a stray `container_file_citation` annotation; a new `tests/jsonl_ingest.rs` integration test ingests a user prompt + assistant `output_text` block with a `container_file_citation` annotation end-to-end and asserts the assistant body, the `cfile_...` file_id, and the filename are each independently keyword-searchable behind the new prefix in the same chunk, the `cntr_...` container_id never appears in any chunk (belt-and-braces store-wide search), and the prefix doesn't collide with the bare `[file_citation]` token.

- Progress: JSONL extraction now also surfaces the MCP server reply and failure payloads bundled on OpenAI Responses-style top-level `mcp_call` lines. Unlike `function_call` / `local_shell_call` / `computer_call` (whose outputs ship on separate `*_output` JSONL lines), Responses keeps both the request and its result on a single `mcp_call` line via `output` (server reply text) and `error` (failure payload). Pre-fix both fields were silently dropped: an agent's keyword search for the actual MCP server response or its failure mode returned nothing even though the call itself was searchable. After the fix, when present, `output` is appended on a `\n[mcp_call_output:LABEL] {text}` line and `error` on a `\n[mcp_call_error:LABEL] {text}` line, with `LABEL` mirroring the existing `SERVER/NAME` (or `NAME`-only when `server_label` is missing) used by the original `[mcp_call:LABEL]` call line. The new prefixes are deliberately distinct from the call prefix and from the Anthropic content-block companion's `[mcp_tool_result]` / `[mcp_tool_error]` shapes so keyword search can scope to call vs reply vs failure independently across transports. Both reuse `arguments_value_to_text` so string outputs pass through unchanged and structured object/array outputs serialize deterministically via the BTreeMap-sorted path; `null`, blank strings, and empty `{}` / `[]` collapse to no extra line rather than emitting a bare `[mcp_call_output:LABEL]` / `[mcp_call_error:LABEL]`. Backwards-compat: when neither `output` nor `error` is present the rendered chunk is exactly the original single `[mcp_call:LABEL] {args}` line. Focused `src/jsonl.rs` tests pin string-output rendering with call-side metadata preservation, deterministic structured-output / structured-error key ordering, string-error rendering, the missing-output-and-error backwards-compat case, four `null`/blank/empty drop cases, the both-`output`-and-`error`-present render order (call → output → error), and the server-less label parity so `[mcp_call:NAME]` followed by `[mcp_call_output:NAME]` stays consistent. A new `tests/jsonl_ingest.rs` integration test ingests a transcript with a successful `mcp_call` + structured-error `mcp_call` end-to-end, then asserts the output reply is keyword-searchable behind `[mcp_call_output:linear/create_issue]`, the error payload is keyword-searchable behind `[mcp_call_error:linear/create_issue]` with deterministic BTreeMap-sorted JSON, both chunks preserve `tool_call_id` / call-line provenance, the new prefixes don't collide with the call or content-block companions, and surrounding user/assistant prose survives unchanged.

- Progress: JSONL extraction now also surfaces the inline `outputs[]` array on OpenAI Responses-style top-level `code_interpreter_call` lines, so execution logs and generated images stop disappearing when the runtime bundles the result onto the call item itself instead of emitting a separate `code_interpreter_call_output` line. Pre-fix `code_interpreter_call_line_to_text` only whitelisted `code`, so a line shaped like `{"type":"code_interpreter_call","code":"plot()","outputs":[{"type":"logs","logs":"computed_revenue_total=4242"},{"type":"image","url":"data:image/png;base64,..."},{"type":"image","file_id":"file_chart"}]}` rendered as just `[code_interpreter_call] plot()` and every log line plus the generated chart provenance silently disappeared. After the fix each `outputs[]` entry is appended on its own `\n[code_interpreter_*] ...` line behind a distinct prefix (so keyword search can scope to logs vs the call body vs the dedicated reply-side `[code_interpreter_call_output]` line independently): `{"type":"logs","logs":"..."}` → `[code_interpreter_logs] {logs}`; `{"type":"image","url":"https://..."}` → `[code_interpreter_image] {url}` (filename / hash stays keyword-searchable); `{"type":"image","url":"data:image/png;base64,..."}` → `[code_interpreter_image:image/png]` (only the media type lands in chunk text — the base64 payload is never serialized, mirroring the no-leak rule already pinned for `image_block_to_text` / `computer_call_output_line_to_text`); `{"type":"image","file_id":"file_..."}` → `[code_interpreter_image:file] {file_id}`. Malformed `data:` URLs (no `;` separator, or a whitespace-only media type) fall back to the entry's `file_id` rather than leaking the rest of the URL. Drop rules mirror the rest of the family: anchorless `logs` / `image` entries (missing `logs`, missing both `url` and `file_id`) drop without polluting the chunk; unknown entry types fall through silently; an empty / missing `outputs` array preserves the exact pre-existing `[code_interpreter_call] {code}` shape; missing or blank `code` still drops the entire line so `outputs[]` can never rescue an orphaned call body. Inline outputs join in array order so the rendered chunk preserves the original execution sequence (log → image → log). Focused `src/jsonl.rs` tests pin single-logs rendering with `tool_call_id` round-trip, multi-logs array-order joining, image URL anchoring, the data-URL media-type scrub with belt-and-braces base64 leak check, file-id-only image rendering, malformed data-URL fallback to `file_id`, the anchorless-entries-and-unknown-types drop case, the empty-array no-op, the missing-code-still-drops invariant, and a mixed logs+image interleave; a new `tests/jsonl_ingest.rs` integration test ingests a transcript with all three primary inline-output shapes end-to-end and asserts the logs payload is keyword-searchable behind `[code_interpreter_logs]`, the file-id anchor is keyword-searchable behind `[code_interpreter_image:file]`, the data-URL media-type lands behind `[code_interpreter_image:image/png]`, the base64 payload (`NOISE_BASE64_PAYLOAD` token) never appears in any chunk, the top-level `id` still round-trips into `chunks.tool_call_id`, and surrounding user/assistant prose survives unchanged.
- Progress: standalone OpenAI Responses top-level `output_text` lines now reuse the existing citation augmentation path, so lines shaped like `{"type":"output_text","text":"answer","annotations":[...]}` preserve `[url_citation]`, `[file_citation]`, and `[container_file_citation]` anchors instead of dropping them through the generic top-level `text` fallback. Ordinary top-level `text` on unrelated line types is unchanged, and focused unit plus end-to-end ingest tests pin the new line shape and precedence.
- Progress: JSONL extraction now also surfaces the `pattern` anchor on OpenAI Responses-style top-level `web_search_call` lines, so `find_in_page` actions stop dropping the substring the model asked the browser tool to locate. Pre-fix the whitelist on `web_search_call_line_to_text` only accepted `query` and `url`, so a `find_in_page` action shaped like `{"type":"find_in_page","pattern":"async runtime","url":"https://docs.example/tokio"}` rendered as `[web_search_call:find_in_page] {"url":"..."}` and the pattern disappeared entirely — keyword search on the substring the agent was actually looking for returned nothing. After the fix the whitelist becomes `["pattern","query","url"]`, so `find_in_page` actions render as `[web_search_call:find_in_page] {"pattern":"async runtime","url":"https://docs.example/tokio"}` (BTreeMap-sorted so `pattern` lands before `url` and chunk text stays deterministic across runs), pattern-only `find_in_page` actions render as `[web_search_call:find_in_page] {"pattern":"..."}` instead of falling back to the bare action prefix, and the existing `search` / `open_page` rendering plus all drop and precedence rules are preserved unchanged (the whitelist is deliberately action-type-agnostic — same as `query` / `url` today — so a stray `pattern` field on a `search` action also gets surfaced, mirroring the existing handling for stray `url` on `search` actions). The strict-whitelist guard against huge optional arrays (sources / results / page snippets) is unchanged, the `[web_search_call:ACTION]` prefix family stays the same, and the top-level `id`-as-`tool_call_id` promotion continues to fire under the same `type=web_search_call` narrowing. Focused `src/jsonl.rs` unit tests cover `find_in_page` with pattern+url (sorted-key determinism plus `tool_call_id` round-trip), pattern-only `find_in_page`, and an updated three-anchor sorted-key test that pins `pattern` before `query` before `url` when all three live in the same action; a new `tests/jsonl_ingest.rs` integration test ingests a `find_in_page` line end-to-end and asserts both the `pattern` token and the `ws_...` id survive search while optional result blobs remain omitted.
- Progress: JSONL extraction now also recognizes Anthropic-style `search_result` content blocks (`{"type":"search_result","source":"...","title":"...","content":[...]}`) inside user `content` arrays. Lantern now renders them behind a distinct `[search_result]` prefix that preserves the result's `source` / `title` anchors ahead of the matched excerpt, while anchorless blocks fall back to the old inner-content-only behavior and never emit a bare provenance prefix. Focused `src/jsonl.rs` tests cover source/title/header-only/string-content/precedence cases, and a new `tests/jsonl_ingest.rs` regression pins end-to-end searchability for the source, title, and excerpt while ensuring the block does not populate `tool_call_id`.
- Progress: JSONL extraction now also recognizes Anthropic-style `web_fetch_tool_result` content blocks (`{"type":"web_fetch_tool_result","tool_use_id":"srvtoolu_...","content":{"type":"web_fetch_result","url":"https://...","content":...}}`) inside `content` arrays — the reply-side companion to `server_tool_use` web_fetch calls. Success replies now render behind a distinct `[web_fetch_tool_result] URL` prefix and append nested document content when present, while error replies render as `[web_fetch_tool_error] error_code=...`; `tool_use_id` now flows into `chunks.tool_call_id`, and nested base64 document payloads remain omitted in favor of the existing document provenance anchors. Focused unit tests cover success/error rendering, nested-document recursion, id recovery, precedence, and no-id-widening; a new `tests/jsonl_ingest.rs` regression pins end-to-end searchability for fetched URLs, nested document text/media-type anchors, and the no-base64-leak guarantee.
- Progress: top-level OpenAI Responses `computer_call_output` lines now also surface the sibling `output.current_url` field — the URL the browser was on at screenshot time, and typically the only keyword-recoverable anchor since the screenshot itself ships as a `data:image/png;base64,...` blob. The URL is appended after the existing primary anchor (so `[computer_call_output:image/png] https://site/page`, `[computer_call_output] {image_url} {current_url}`, or `[computer_call_output:file] {file_id} {current_url}`); a `current_url`-only payload that previously dropped now renders as `[computer_call_output] {current_url}`; blank/missing `current_url` preserves prior output exactly. The base64 payload still never lands in chunk text. Focused `src/jsonl.rs` tests cover data-URL/HTTPS/file_id + current_url combinations, the lonely current_url, blank-current_url no-op, and malformed-data-URL fallback, and a new `tests/jsonl_ingest.rs` regression pins end-to-end FTS searchability for the URL while reasserting the no-base64-leak guarantee.
- Progress: role-scoped ChatCompletions participant `name` values now also fold into the rendered role prefix for participant roles (`[user:alice]`, `[assistant:scribe]`, etc.) instead of staying metadata-only in `chunks.user`. This keeps multi-speaker transcripts keyword-searchable by participant identifier while preserving the earlier guard that `role="tool"`, `role="function"`, and role-less records still route `name` through tool-name behavior. Focused unit and end-to-end ingest tests pin the participant-role prefixes plus the tool-role non-regression.
- Progress: JSONL extraction now also recognizes legacy OpenAI ChatCompletions assistant `function_call` fields (`{"role":"assistant","content":null,"function_call":{"name":"get_weather","arguments":"{...}"}}`), rendering them behind the shared `[tool_use:NAME] ...` prefix so pre-`tool_calls` transcripts stop disappearing. `function_call.name` now recovers into `chunks.tool_name`, `content` and modern `tool_calls` still win when both are present, and `tool_call_id` correctly remains unset because the legacy shape carries no id. Focused `src/jsonl.rs` tests cover string/structured/missing arguments, missing/blank names, content/tool_calls precedence, and nested `message` envelopes; a new `tests/jsonl_ingest.rs` regression pins end-to-end ingest/search behavior for an archived ChatCompletions-style call + `role="function"` reply pair.
- Progress: legacy OpenAI ChatCompletions tool replies (`{"role":"function","name":"get_weather","content":"..."}`) now fold the tool name into a parallel `[function:NAME] {content}` role prefix — the legacy companion to the modern `role="tool"` → `[tool:NAME]` rendering. Pre-fix the prefix was just `[function]` so the tool name lived only in `chunks.tool_name` and was invisible to keyword search on the rendered chunk text. The two prefixes stay visually distinct so search can scope to the API era when needed; a missing/blank tool name falls back to the bare `[function]` prefix (parity with `role="tool"` without a name); and the existing `content_has_typed_tool_prefix` suppression still wins when a typed extractor (e.g. `mcp_call`) has already encoded its own provenance prefix into content. Focused `src/jsonl.rs` tests pin the basic name-folded render, the name-less fallback, and the no-double-prefix guard against a typed `mcp_call` payload; the existing `tests/jsonl_ingest.rs` end-to-end regression for the legacy `function_call` + `role="function"` reply pair now asserts the new `[function:get_weather]` prefix on the reply chunk.
- Progress: JSONL extraction now also recognizes legacy OpenAI ChatCompletions-style `image_url` content blocks (`{"type":"image_url","image_url":{"url":"https://...","detail":"high"}}` and the `data:` URL variant) inside `content` arrays — the multimodal companion to the legacy `function_call` slice above, and the predecessor of Responses' flat `input_image` shape. Unlike Responses where `image_url` is a flat string field, the ChatCompletions shape nests the URL under `image_url.url` alongside an optional `detail` field. Pre-fix the array walk silently dropped these because every existing image extractor was type-narrowed to `"image"` (Anthropic) or `"input_image"` (Responses), so archived ChatCompletions multimodal transcripts lost the attached image provenance entirely. After the fix the block renders behind a stable `[image_url]` prefix — deliberately distinct from `[image]` and `[input_image]` so keyword search can scope to the API era when needed — and folds in the most useful searchable anchor: `image_url.url` as a non-`data:` URL becomes `[image_url] {url}` (filename / hash stays searchable); a `data:` URL surfaces only the media type as `[image_url:{media_type}]` and the base64 payload is never serialized into the chunk. A missing/blank URL drops (bare prefix is pollution). Unlike `input_image`, there is no `file_id` fallback on the ChatCompletions shape, so a malformed `data:` URL with no clean media_type also drops rather than leaking the rest of the URL. The optional `detail` field stays strictly out of the chunk. A direct `text` field on the block still wins via the higher-priority text-field path; type-narrowed to `image_url` so `type=text` blocks carrying a stray nested `image_url` object aren't accidentally relabelled, and the `type=image_url` + flat-string variant (Responses shape with the wrong type) also drops to keep the strict-shape contract. Focused `src/jsonl.rs` tests pin URL rendering with the `detail` non-leak guarantee, data URL media-type rendering with the base64 non-leak guarantee, the no-URL / no-nested-object / malformed-data-URL / blank-media-type drops, text-field-wins precedence, mixed-block interleave order, the `type=text`-with-stray-image_url type-narrowing guard, side-by-side coexistence with `input_image` in the same turn, and the flat-string-on-image_url-type drop; a new `tests/jsonl_ingest.rs` integration test ingests a ChatCompletions multimodal user turn with both URL and data-URL `image_url` blocks plus a trailing assistant reply and asserts the URL is keyword-searchable behind `[image_url]`, the data-URL surfaces only by media type, the base64 payload never appears in any chunk, the prefix doesn't collide with `[input_image]`, and surrounding prose survives unchanged.
- Progress: Anthropic-style `code_execution_tool_result` blocks now also surface Files-API `code_execution_output` references attached to the inner `code_execution_result` via its nested `content` array (`{"type":"code_execution_output","file_id":"file_..."}` entries pointing at generated charts/exports). Pre-fix only stdout/stderr/return_code surfaced, so the strongest provenance anchor on a chart-producing run silently disappeared. After the fix each entry renders as a secondary `[code_execution_output] {file_id}` line appended after the main `[code_execution_result] ...` line so the file reference is keyword-searchable end-to-end; entries with no `file_id` or an unknown `type` are skipped (forward-compat), the existing no-useful-fields drop still fires when stdout/stderr/return_code AND outputs are all empty, and when only outputs are present the block renders behind a bare `[code_execution_result]` prefix followed by the output lines. The `tool_use_id`→`chunks.tool_call_id` path is unchanged, and the inner-content-as-string defensive case still drops cleanly. Focused `src/jsonl.rs` tests pin multi-output rendering with stdout, the outputs-only bare-prefix fallback, the anchorless/unknown-type entry drops, and the inner-content-string fallthrough; a new `tests/jsonl_ingest.rs` integration test ingests a real plot-generation transcript end-to-end and asserts each `file_id` is independently keyword-searchable behind the new prefix, the main result body still rides in the same chunk, the originating `srvtoolu_` call id round-trips into `chunks.tool_call_id`, and surrounding user/assistant prose survives unchanged.
- Progress: JSONL extraction now also recognizes legacy OpenAI ChatCompletions-style top-level `audio` assistant fields (`{"role":"assistant","content":null,"audio":{"id":"audio_...","data":"<base64>","transcript":"...","expires_at":...}}`) — the audio-output companion to the legacy `function_call` / `image_url` slices, shipped by ChatCompletions when `modalities=["text","audio"]`. Pre-fix the line dropped silently because `content` is null and no extractor knew about the top-level `audio` key — none of the content-block audio handlers fire on a flat field, and `payload_text` / `tool_calls_array_text` / `legacy_function_call_text` all return `None`. After the fix a new `legacy_assistant_audio_text` fallback in `src/jsonl.rs` (slotted in `lookup_content` right after `legacy_function_call_text`, so `content` / `tool_calls` / `function_call` still win when present) renders the field behind a stable `[audio] {transcript}` prefix — deliberately distinct from Responses-style `[input_audio]` / `[output_audio]` content-block prefixes so keyword search can scope to the ChatCompletions-era flat shape independently. The opaque base64 `data` payload is never serialized into the chunk under any path (mirrors the no-leak rule already pinned for `input_audio` / `output_audio` / image / document blocks), and the optional `id` / `expires_at` metadata stays strictly out of the chunk because neither is keyword-searchable. Missing or non-object `audio`, or a missing / blank `transcript`, drops the line so a bare `[audio]` placeholder never lands in chunk text. The fallback also recurses through the `message` envelope so Claude Code-style wrappers carrying a legacy audio assistant turn under `message` survive ingest. No `tool_call_id` widening — the legacy audio field is a content recovery path, not a tool call. Focused `src/jsonl.rs` tests pin the basic transcript render with the no-base64-leak + no-id-leak + no-expires_at-leak guarantees, the missing-transcript drop, the blank-transcript drop, a non-object `audio` field drop, content-wins-over-`audio` precedence, the `message`-envelope extraction path, and a prefix-distinct-from-`[input_audio]`/`[output_audio]` invariant; a new `tests/jsonl_ingest.rs` integration test ingests a ChatCompletions transcript with a user prompt, an assistant audio reply carrying a transcript plus base64 `data` and `id` / `expires_at` metadata, and a trailing user thank-you, then asserts the transcript token is keyword-searchable behind `[audio]`, the base64 payload + `audio_session_001` id never appear in any chunk, `chunks.tool_call_id` stays NULL, and surrounding user prose survives unchanged.

- Progress: legacy OpenAI ChatCompletions assistant refusals (`{"role":"assistant","content":null,"refusal":"..."}`) now survive ingest behind the same `[refusal] ...` prefix used for nested refusal content blocks, so structured-output / safety-mediated flat refusal turns stop disappearing. The new `legacy_assistant_refusal_text` fallback sits after `legacy_assistant_audio_text` in `lookup_content`, keeps content-wins precedence, recurses through `message` envelopes, drops blank/missing/non-string refusals, and deliberately does not widen any `tool_call_id`. Focused `src/jsonl.rs` tests pin flat rendering, drop cases, content precedence, shared-prefix parity with block refusals, and envelope recursion; `tests/jsonl_ingest.rs` adds an end-to-end ingest/search regression proving the refusal text is searchable while `tool_call_id` stays NULL.
- Progress: legacy ChatCompletions-style assistant `reasoning_content` / `reasoningContent` fields now survive ingest behind the shared `[thinking] ...` prefix used by Anthropic `thinking` and Responses `reasoning_text` blocks, so reasoning-only assistant turns from ChatCompletions-style providers stop disappearing when `content` is null. The new fallback keeps content-wins precedence, recurses through `message` envelopes, drops blank/non-string payloads, and keeps `tool_call_id` unset; focused unit and end-to-end ingest tests pin alias precedence, envelope recovery, and the no-tool-call-id contract.
- Progress: JSONL extraction now also surfaces Google Gemini / Vertex AI flat-message `parts` arrays (`{"role":"user","parts":[{"text":"..."}]}` and `{"role":"model","parts":[{"text":"..."},{"text":"..."}]}`), so Gemini-style transcripts stop dropping silently. Pre-fix the line dropped because none of the content/text/message/body/arguments/input/tool_calls fallbacks fire on `parts`, and `content_to_text` only walks `content`. After the fix `lookup_content`'s `from` helper adds a `parts` fallback that reuses `content_to_text` (which already handles `Value::Array` of `{text}` objects via its generic block-text precedence), so a Gemini user turn renders as `[user] {text}` and a model turn renders as `[model] {text1}\n{text2}` joined in array order — `role="model"` is preserved as-is rather than normalized to `assistant`, keeping provenance intact. The new fallback sits after `content` / `output_text` / `text` so existing flat-content lines stay unchanged ("content wins" / "text wins" precedence verified), and is type-narrowed to arrays so a defensive non-array `parts` field (string / scalar / missing) drops cleanly without polluting the chunk with a bare `[role]` prefix. Empty-array `parts` drops the line entirely. Mixed parts (text + Gemini `functionCall` / `functionResponse`) extract just the text — structured tool-call parts have no `text` and no recognized block `type`, so `content_to_text` skips them. That's a deliberate scope-limit: this slice ships text-only Gemini coverage, and a follow-up can extend the array walk to recognize `functionCall` / `functionResponse` parts behind `[tool_use:NAME]` / `[tool_result]`-style prefixes. The `parts` recovery also recurses through the `message` envelope so Claude Code-style wrappers around Gemini inner messages survive ingest. Focused `src/jsonl.rs` tests (9 new) pin basic user-text rendering with `[user]` prefix, multi-text-part array-order joining with the Gemini-specific `[model]` prefix, empty-array drop, non-array drop, content-wins-over-parts precedence, text-wins-over-parts precedence, inner-`message`-envelope recovery, outer-wins-over-inner-message precedence, and the text-extracted / functionCall-dropped contract for mixed-parts turns. A new `tests/jsonl_ingest.rs` integration test ingests a real Gemini transcript with a `role="user"` text-parts prompt and a `role="model"` two-text-parts reply and asserts both turns survive end-to-end keyword search, the model parts join into a single chunk, and `chunks.role` round-trips the Gemini-specific `model` value.
- Progress: JSONL extraction now also surfaces legacy OpenAI ChatCompletions `web_search_options` message-level `annotations` arrays on assistant turns (`{"role":"assistant","content":"...","annotations":[{"type":"url_citation","url_citation":{"url":"...","title":"..."}}]}`), the flat-message companion to the Responses content-block `output_text` + `annotations` handler already in place. Pre-fix the body surfaced cleanly via the existing content path but the cited URL and title were silently dropped — `output_text_with_annotations_to_text` is type-narrowed to `output_text` blocks and the ChatCompletions shape lives at the message level with the citation fields NESTED under a `url_citation` sub-object instead of flat on the entry. After the fix `append_chat_completions_annotations` augments any content sourced from `lookup_content`'s primary path (outer scope or inner `message` envelope) with `[url_citation] {url}` / `[url_citation] {url} {title}` lines in array order, sharing the prefix with the Responses-style handler so keyword search matches both shapes uniformly. Scope-local: annotations on `outer` only attach to content sourced from `outer`, and inner likewise — so a Claude Code-style `message` envelope still routes its own citations to its own body without leaking across. `message` / `body` scalar fallbacks don't get the augmentation (the shape doesn't ship annotations in practice). `chat_completions_annotation_to_text` is strictly type-narrowed on the nested `url_citation` sub-object so a Responses-style flat shape accidentally surfacing at the message level drops cleanly without relabeling. Empty/missing arrays, unknown types, entries dropped for missing inner objects or missing-URL all collapse to the bare body with no trailing newline pollution. No `tool_call_id` widening — annotations are content provenance, not tool calls. Focused `src/jsonl.rs` tests (11 new) pin basic URL+title rendering, title-less rendering, multi-citation array-order joining, empty/... [truncated]
- Progress: Gemini / Vertex AI `parts` arrays now also surface structured `functionCall` and `functionResponse` parts instead of dropping them. `functionCall` renders behind the shared `[tool_use:NAME]` prefix with deterministic `args` serialization, `functionResponse` renders behind `[tool_result]` with deterministic `response` serialization, text + tool parts still join in array order, and focused unit plus end-to-end ingest/search tests pin message-envelope recovery and keyword searchability for both the call and the reply.


**Status: DONE** — Replaced normalize-by-max BM25 with RRF (k=60). Dropped `--weight` parameter. `blend_hits` now ranks each result list and sums `1/(k+rank)` contributions per chunk_id.
- Replace `normalize_bm25` with RRF: `score = 1 / (k + rank)` where k=60
- Three lines of code, handles the "only one side has this hit" case naturally
- Drop the `--weight` parameter (RRF doesn't need it)

## P1 — Important

### 4. sqlite-vec Vector Index
**Status: DONE** — `src/store.rs` loads sqlite-vec, default-model embeddings dual-write into `chunks_vec_nomic_768`, semantic search auto-routes when eligible, and existing stores backfill the mirror on upgrade (schema v5).

### 5. Decay Weighting / Confidence Scoring
**Status: DONE** — access/recency tracking, decay maintenance, confidence output, user feedback, and query-success signals are all implemented across CLI, MCP, search, export, and inspection surfaces.
**Why:** Lantern treats all chunks as equally true. agent-memory-mcp has explicit usefulness ranking.
- Track access count, recency, and user feedback per chunk
- Decay older chunks that are never retrieved
- Surface confidence score in search results
- Consider: was this chunk ever used to answer a query successfully?

**Progress (slice 1):** `chunks.access_count` / `chunks.last_accessed_at` (schema v7)
are now read by keyword, semantic, vec, and hybrid search paths and surfaced on
`SearchHit`. `compute_confidence(now, last_accessed_at, timestamp_unix,
access_count)` replaces the old timestamp-only helper with a deterministic blend
of freshness decay (30-day time constant, 0.25 floor, `last_accessed_at`
preferred over `timestamp_unix`) and access-count saturation (`1 - exp(-n/5)`).
Ranking now uses confidence as a deterministic secondary tie-breaker when the primary score is equal. Retrieval still bumps access_count and last_accessed_at
for returned hits across keyword, semantic, vec, and hybrid search paths.
- Progress: added `chunks.feedback_score` (schema v8) plus `lantern::feedback::{record_feedback,get_feedback_score}`; `SearchHit` now surfaces `feedback_score`, and confidence scoring folds it in with a neutral `0` default so existing stores are unchanged. Focused tests cover the new round-trip and confidence behavior, and `lantern feedback <chunk_id> up|down` now provides a direct CLI write path for the signal.
- Progress: added `lantern compact` as a background maintenance pass for stale access metadata. It decays `access_count` using a separate `access_decay_at` checkpoint so repeated runs stay idempotent, and search touches now refresh that checkpoint alongside `last_accessed_at`. Focused tests cover fresh vs stale rows and the search/compact round trip.
- Progress: compact defaults now start decaying after one week instead of waiting a full month, while keeping the 30-day half-life and CLI override knobs intact. A regression test now pins the default minimum-age gate, nudging the background maintenance pass a bit more aggressive while the broader automation tuning remains open.
- Progress: MCP now exposes chunk feedback through `lantern_feedback`, mirroring the CLI feedback write path and covered by a focused sync-path test.
- Progress: end-to-end test now asserts that negative feedback strictly lowers a chunk's `confidence` below a neutral peer after a second search, closing the symmetric gap next to the existing positive-feedback test.
- Progress: `lantern compact` now supports `--dry-run` previews so operators can inspect the hypothetical decay impact without mutating the store. The report and text output surface the preview mode, and focused tests cover the no-write path alongside the existing mutation tests.
- Progress: `lantern compact` now reports `skipped_recent_chunks` — the count of scanned chunks held back because their age was below `minimum_age_secs`. Surfaced in both text output (`skipped_recent=N`) and the JSON report so operators can see how much of a pass was no-op due to recency. Focused tests cover the recent-only path and a mixed recent/stale store where one chunk decays and one is skipped.
- Progress: `lantern compact` now also reports `decay_fraction` in the maintenance summary, so automation can see how much of a pass actually decayed chunks at a glance. The JSON report carries the field directly and focused tests cover the empty-store and mixed recent/stale cases.
- Progress: pinned the asymmetric `last_accessed_at.or(timestamp_unix)` precedence rule with a focused regression test — a stale `last_accessed_at` now provably shadows a fresh `timestamp_unix` (the decayed floor wins), guarding against an accidental "max" rewrite of the reference-picking logic.
- Progress: MCP now exposes compact decay through `lantern_compact` / `LanternServer::compact_sync`, and a dry-run regression test covers the transport-free code path without mutating the store.
- Progress: added `--min-confidence <f64>` to `lantern search` / `lantern query`. `SearchOptions::min_confidence` / `SemanticOptions::min_confidence` enforce the floor across keyword, semantic, vec-semantic, and (post-blend) hybrid paths, before `bump_access_metadata` — filtered chunks do not count as retrievals. Default `None` preserves existing behavior; MCP now exposes the same floor too, so the decay-aware filter is consistent across CLI and tool calls.
- Progress: search summary/text metadata now surfaces `access_count` and `feedback_score` alongside role/session/turn/tool/timestamp so confidence explanations show the access/feedback inputs directly.
- Progress: search summary/text metadata now also surfaces `last_accessed_at` when present, so the freshness component behind confidence is visible in the human-readable path too.
- Progress: search results now also surface `access_decay_at` in the detailed metadata line and JSON result payload, so the decay checkpoint used by compact is inspectable alongside confidence inputs.
- Progress: search results now expose a structured `confidence_breakdown` in JSON (`freshness`, `access_boost`, `base`, `feedback_factor`) so callers can explain why a hit's confidence landed where it did without re-deriving the formula.
- Progress: human-readable search output now includes a compact `breakdown=...` metadata token for the same confidence components, so `search` / `query` output stays inspectable without switching to JSON.
- Progress: the confidence-breakdown contract is now pinned by regression tests, including the JSON field set and the reconstruction invariant that the breakdown recomposes the public confidence score exactly.
- Progress: search confidence now also surfaces `freshness_source` (`last_accessed_at`, `timestamp_unix`, or `none`) in both JSON and text output, so the freshness component explains which timestamp actually drove the score. Focused tests pin the new serialized contract.
- Progress: added a focused regression test that pins the neutral-feedback no-op explicitly: when `feedback_score == 0`, `compute_confidence_breakdown` returns `feedback_factor == 0.0` and `confidence == base`, guarding the migration-default behavior from accidental drift.
- Progress: `lantern compact` now has regression coverage for stale decays, checkpoint idempotency, dry-run non-mutation, and decay-fraction reporting, so the access-metadata maintenance slice is pinned end-to-end.
- Progress: added query_success_count (schema v11), plus `lantern::query_success::{record_query_success,get_query_success_count}`; `SearchHit` now surfaces `query_success_count`, confidence scoring folds in a positive-only query-success factor, and focused tests cover the neutral default, positive lift, formatter breakdowns, and export/inspect/reindex schema-version bumps.
- Progress: `compute_confidence_breakdown` now clamps negative `query_success_count` values back to the neutral baseline, and a regression test pins the behavior so corrupted stores cannot turn observed success into a penalty.
- Progress: tightened the `min_confidence` enforcement path so the floor and `bump_access_metadata` run exactly once per call across keyword, semantic, vec, and hybrid search. Each path now splits into a public function that applies the floor and bumps survivors and a private `*_candidates` helper that does neither, so hybrid can fuse raw candidates and only bump the blended survivors. A new regression test pins the hybrid invariant: when the blended floor drops a chunk, its `access_count` / `last_accessed_at` stay untouched even though the inner keyword and semantic passes saw it.
- Progress: export/show chunk dumps now include `query_success_count`, so offline snapshots and single-source inspections preserve the full confidence input set.
- Progress: `lantern query-success <chunk_id>` now increments the observed-success counter and reports the new score, giving a direct CLI path for the positive-only signal used by confidence scoring.
- Progress: MCP now exposes query-success through `lantern_query_success`, mirroring the CLI write path and covered by a focused sync-path test.
- Progress: single-source `export` now snapshots `confidence` plus `confidence_breakdown` per chunk at render time, and `show` prints a compact `confidence=` metadata token alongside the other provenance fields.
- Progress: added integration-test parity for query-success via `tests/query_success.rs`, covering the neutral default, increment visibility, strict confidence lift over a neutral peer, and missing-chunk error handling; `src/search.rs` now also pins the query-success zero-default as a confidence-breakdown no-op.
- Progress: pinned the `query_success_count.max(0)` defensive clamp inside `compute_confidence_breakdown` with a focused regression test, mirroring the existing `confidence_negative_access_count_treated_as_zero` coverage. Guards against a future refactor that would drop the clamp and turn a negative (corrupted) count into a confidence *pull* instead of the documented neutral no-op.
- Progress: single-source `show` text now emits a compact `breakdown=...` token with freshness source / access / feedback / query-success components, keeping the provenance dump inspectable without switching to JSON. Focused tests cover the render path.
- Progress: `lantern feedback` / `lantern query-success` now return and print the updated confidence snapshot alongside the raw score/count, so the write paths surface the same decay-aware signal that search uses. Focused tests pin the new report fields.
- Progress: confidence breakdown token formatting is now centralized in `search::format_confidence_breakdown_token`, and `search`, `show`, `feedback`, and `query-success` all reuse it so the compact human-readable explanation stays identical across surfaces.
- Progress: `lantern show` chunk metadata line now surfaces an explicit `freshness_source=<last_accessed_at|timestamp_unix|none>` token alongside the existing `breakdown=...` payload, so the precedence rule is visible at a glance without parsing the compact breakdown. Focused tests in `src/show.rs` cover all three variants.
- Progress: `lantern feedback` and `lantern query-success` now mirror `show` by printing the explicit `freshness_source=<...>` token in their human-readable output, and focused tests pin the new formatting helpers.
- Progress: `lantern feedback` and `lantern query-success` now also surface `access_decay_at=<...>` when the chunk has been compacted, so the write-path confidence snapshot exposes the same decay checkpoint used by search/show. Focused tests cover the new optional token.
- Progress: `lantern inspect` now reports a `confidence_signals` block — counts of chunks with non-zero `access_count`, `feedback_score`, `query_success_count`, and `access_decay_at` checkpoints — so operators can see at a glance how much retrieval / vote / observed-success / decay-maintenance signal the store has accumulated. Surfaced in both text (`signals: access=N feedback=N query_success=N decay_checkpoint=N`) and JSON. Focused integration tests pin the aggregate query, including the new decay-checkpoint count.

### 6. Knowledge Graph / Entity Extraction
**Why:** Lantern now has a lightweight entity layer, but it is still mostly co-occurrence-based rather than a richer typed knowledge graph.
- Extract entities (people, projects, concepts, files) from ingested content
- Build typed relationships between entities
- Queryable graph layer on top of the flat chunk store
- Use LLM extraction or NER — can be a post-ingest step
- Progress: URL and email entity extraction now run during ingest. `src/entities.rs` extracts `http(s)://` URLs plus simple ASCII email addresses from chunk text, persists them into the schema-v10 `entities` / `chunk_entities` tables, and is covered by focused tests that verify deduplication and linking.
- Progress: `src/entities.rs` now also extracts conservative backtick-wrapped file-path literals (for example `src/main.rs` and `Cargo.toml`) into a new `filepath` entity kind, and now also picks up plain relative/absolute path tokens like `./src/main.rs`, `/tmp/example.log`, and `../notes/todo.md` when they have a clear path-like shape. This keeps the regex-free, deduplicated provenance layer moving toward richer graph edges.
- Progress: `src/entities.rs` now extracts `@mention` handles into a new `mention` entity kind. Conservative rules: the `@` must not be preceded by an email-local character (so emails stay emails), the body must be at least two characters from `[A-Za-z0-9._-]`, and at least one ASCII letter is required (so `@2024-01-15` and `@1.2.3` do not become mentions). Schema v10's `(kind, value)` shape absorbs the new kind without a migration, and focused tests cover basic mentions, mention/email coexistence, dedup, trailing-separator trimming, short-handle rejection, and the digit/date guard.
- Progress: entity data now has a small read API: `entity_kind_from_str` parses CLI/MCP kind filters and `list_entities` returns ordered, filterable entity listings with chunk-reference counts, plus regression tests for ordering, filtering, literal substring matching, and limit handling.
- Progress: `lantern entities` CLI surface now exposes the listing API, with `--kind {url|email|filepath|mention}`, `--value-contains <substr>`, `--limit` (default 50), and `--format text|json`. Output reuses `entities::print_text` / `print_json`, so the human-readable and JSON shapes stay in lockstep with the library. Focused parse tests pin the default-only invocation, the full filter set, and rejection of unknown kinds.
- Progress: MCP now exposes the same entity listing surface through `lantern_entities` / `LanternServer::entities_sync`, so agents can inspect URL/email/filepath/mention entities without shelling out. A focused MCP test now covers the transport-free code path alongside the existing CLI and library coverage.
- Progress: `lantern entities --show-chunks N` now optionally loads up to N linked chunk refs per entity (chunk id, source URI, snippet) through `EntityListOptions::with_chunks`, giving a lightweight entity→chunk edge view without changing the default flat listing. Focused tests cover CLI parsing and chunk-ref ordering/sampling.
- Progress: `src/entities.rs` now also extracts `#hashtag` tags with conservative URL-fragment suppression, and the new kind is threaded through CLI/MCP filters and entity-kind parsing. Focused tests cover extraction, URL fragment suppression, and kind round-tripping.
- Progress: added `lantern entity-neighbors` plus `entities::entity_neighbors` to list co-occurring entities for a source entity, ranked by shared-chunk count with deterministic tie-breakers. The command supports kind/limit filters and text/JSON output, now tags each neighbor with a stable `edge_kind=co_occurs_with` graph label, and focused tests cover ranking, truncation, empty neighbors, missing-entity errors, and the new edge marker.

- Progress: `lantern entity-neighbors` now has an MCP twin, `lantern_entity_neighbors`, so agents can traverse the lightweight co-occurrence graph without shelling out. The MCP surface mirrors the CLI's entity-id/kind/limit options, reuses `LanternServer::entity_neighbors_sync`, and is pinned by a focused transport-free regression test that verifies ranked mention neighbors for a hashtag source.
- Progress: `lantern entity-neighbors` now also supports `--show-chunks` / `show_chunks` across the core entity layer, CLI, and MCP, returning shared chunk refs (chunk id, source uri, snippet) as concrete evidence for each co-occurrence edge. Focused tests pin both the core entity report and the MCP sync path, and the text formatter now renders the sampled supporting chunks under each neighbor.
- Progress: the entity-neighbor relationship label is now a typed `EdgeKind` enum in Rust instead of a bare string, while keeping the JSON/text value stable as `co_occurs_with`. That gives the graph edge surface a real type boundary for future relationship kinds without changing external behavior, and focused tests pin the enum round-trip plus the MCP neighbor path.
- Progress: entity extraction now also derives conservative `repo` entities from repository-root URLs like `https://github.com/diogenes/lantern` and `https://git.skylantix.com/diogenes/lantern/`, so project-like sources can be surfaced alongside URLs. CLI/MCP kind filters and parser help now include `repo`, and focused tests cover URL-derived repo extraction plus kind parsing/round-tripping.
- Progress: entity extraction now also derives conservative `domain` entities from URL hosts and email domains, normalizing case, stripping URL userinfo/ports, deduplicating shared hosts across URL/email mentions, and skipping ambiguous single-label / IPv6-literal hosts. CLI/MCP kind filters now include `domain`, and focused tests cover URL/email extraction, deduplication, normalization, ingest persistence, and parser round-tripping.
- Progress: `lantern entities` and MCP `lantern_entities` now accept an exact `session_id` filter, turning the flat entity list into a session-scoped graph view. When set, `chunk_count` and optional `--show-chunks` evidence are restricted to that session's chunks instead of global counts, and the report echoes the active `session_id` so callers can confirm the narrowed scope.
- Progress: `lantern entity-neighbors` and MCP `lantern_entity_neighbors` now also accept an exact `session_id` filter, scoping neighbor discovery, per-neighbor `shared_chunks` / `chunk_count`, and optional `--show-chunks` evidence to one session's chunks instead of the whole store. The text/JSON report now echoes the active session scope, and focused tests pin the core query plus CLI/MCP wiring.
- Progress: `lantern entities` now also reports `session_count` per entity — distinct sessions whose chunks reference it (chunks with `session_id IS NULL` excluded) — telling "spammed in one session" apart from "spread across many sessions". Surfaced in both text (`sessions=N` token) and JSON, and collapses to `1` under the existing `session_id` filter by construction. Focused tests cover the all-NULL → `0` case, the multi-session distinct count with mixed NULLs, and the session-scoped collapse to `1`.
- Progress: added `lantern entity-session-neighbors <entity_id>`, a session-level graph read surface that finds other entities appearing in at least one of the same sessions even when they never share a chunk. Results are ranked by distinct shared-session count, carry a typed `edge_kind=same_session_as` label plus each neighbor's overall `session_count`, exclude `NULL`-session chunks, support `--kind`, `--limit`, and text/JSON output, and are pinned by focused unit tests plus a CLI smoke test.
- Progress: `lantern entity-session-neighbors` now also accepts an exact `--session-id` filter, and MCP exposes the same scope through `lantern_entity_session_neighbors`. When set, source-session counting and neighbor discovery collapse to that single session, the text/JSON report echoes the active `session_id`, and focused tests cover the core query plus CLI/MCP wiring.
- Progress: chunk-level `lantern entity-neighbors` now also reports each neighbor's distinct `session_count` alongside `shared_chunks` / `chunk_count`, letting callers tell "co-occurs in this session" apart from "spread across many sessions". Surfaced in text (`sessions=N` token) and JSON, scoped to `--session-id` when set (collapses to 0/1 by construction), and transparently exposed through MCP `lantern_entity_neighbors`. Focused tests cover the all-NULL → `0` case, distinct-session counting with mixed NULLs, and the session-scoped collapse to `1`.
- Progress: `lantern show` now accepts `--show-entities N`, surfacing up to N extracted chunk-level entities inline in text (`entities=[kind:value, ...]`) and JSON (`chunks[].entities`) without changing the default output shape. The loader reuses deterministic `(kind, value, id)` ordering from the entity tables so chunk snapshots can jump straight into the graph surfaces, and focused CLI/unit/integration tests cover flag parsing, opt-in omission by default, limit handling, and JSON output.
- Progress: MCP `lantern_show` now mirrors the CLI `--show-entities` knob through a new `show_entities: Option<usize>` arg, routed through `LanternServer::show_sync` so the entity opt-in is exposed to agent callers without shelling out. `Some(0)` collapses to the cheap default path identically to `None`, preserving the prior wire shape (no `entities` field) when the option is omitted; `Some(n > 0)` populates `chunks[].entities` with up to N kind/value pairs in the same deterministic order as `show --show-entities`. A focused regression test in `src/mcp.rs` pins all three cases (omitted / explicit 0 / opt-in) end-to-end against an entity-rich ingested chunk.
- Progress: `lantern entity-session-neighbors` now supports opt-in shared-session evidence through `--show-sessions`, and the MCP `lantern_entity_session_neighbors` surface mirrors it with `show_sessions`. When requested, each neighbor returns a deterministic sample of the actual shared `session_id` values linking it to the source entity, while the default count-only path stays unchanged. Focused tests cover omission by default, explicit zero, deterministic ordering/truncation, CLI parsing, the transport-free MCP path, and the single-session collapse under `--session-id`.
- Progress: `lantern entities` now supports opt-in session evidence through `--show-sessions`, and MCP `lantern_entities` mirrors it with `show_sessions`. When requested, each entity returns a deterministic sample of distinct `session_id` values that reference it, while the default flat listing stays unchanged and explicit `0` collapses back to omission. Under `--session-id`, the evidence collapses to that single session. Focused tests cover CLI parsing, core omission/zero behavior, deterministic ordering/truncation, scoped collapse, and MCP parity.
- Progress: chunk-level `lantern entity-neighbors` now also supports opt-in shared-session evidence through `--show-sessions`, and MCP `lantern_entity_neighbors` mirrors it with `show_sessions`. When requested, each neighbor returns a deterministic sample of the actual shared `session_id` values linking it to the source entity, while the default count-only path stays unchanged and explicit `0` collapses back to omission. Under `--session-id`, the evidence collapses to that single session. Focused tests cover CLI parsing, core omission/zero behavior, deterministic ordering/truncation, scoped collapse, and MCP parity.
- Progress: chunk-level `lantern entity-neighbors` now also supports opt-in shared-project evidence through `--show-projects`, and MCP `lantern_entity_neighbors` mirrors it with `show_projects`. When requested, each neighbor returns a deterministic sample of distinct non-NULL shared `project` values linking it to the source entity, while the default count-only path stays unchanged, explicit `0` collapses back to omission, and a `--session-id` filter scopes the sample to that session's shared chunks without forcing it down to one project.
- Progress: `lantern entities` now also reports a `project_count` per entity — distinct upstream `chunks.project` values (NULL excluded) whose chunks reference it — bridging the entity graph layer with the project metadata added in schema v14. Mirrors the existing `session_count` shape: surfaced in text (`projects=N` token alongside `sessions=N`) and JSON (auto via serde), and `project_count == 0` cleanly distinguishes "no project metadata" from `project_count == 1` ("tied to one project") and `project_count > 1` ("spread across projects"). Independent of the `session_id` filter — a session-scoped listing can still span multiple projects, so the count does NOT collapse to 1. Focused tests cover the all-NULL → 0 case, distinct-project counting with a mixed-NULL row, and the project/session orthogonality under a `session_id` filter.
- Progress: `lantern entities` now supports opt-in project evidence through `--show-projects`, and MCP `lantern_entities` mirrors it with `show_projects`. When requested, each entity returns a deterministic sample of distinct non-NULL `project` values that reference it, while the default flat listing stays unchanged, explicit `0` collapses back to omission, and a `--session-id` filter scopes the sample to that session's chunks without forcing it down to one project.
- Progress: chunk-level `lantern entity-neighbors` now also reports each neighbor's distinct `project_count` alongside `shared_chunks` / `chunk_count` / `session_count`, mirroring the project-grouping spread metric already surfaced on `lantern entities`. Distinct non-NULL `chunks.project` values are counted per neighbor and surfaced in both text (`projects=N` token) and JSON (auto via serde) for `lantern entity-neighbors`; MCP `lantern_entity_neighbors` inherits the new field automatically via the shared `EntityNeighborsReport` shape. Follows the same chunk-scope rule as `chunk_count` rather than `session_count`: under a `session_id` filter the count reflects distinct projects among the neighbor's chunks *inside* that session and does NOT collapse to 1, keeping project grouping orthogonal to session grouping. Focused tests cover the all-NULL → 0 case, the distinct-project count with a mixed-NULL row, the session-scope orthogonality (`session_count` collapses to 1 while `project_count` stays at 2 within s1), and an MCP transport-free regression that pins the field on both global and session-scoped wire shapes.
- Progress: session-level `lantern entity-session-neighbors` now also reports each neighbor's distinct `project_count`, mirroring the spread metric already present on `lantern entities` and chunk-level `entity-neighbors`. The count uses distinct non-NULL `chunks.project` values and is surfaced in both text (`projects=N`) and JSON for CLI/MCP callers. Like the chunk-level surface, `project_count` stays orthogonal to `--session-id`: `session_count` collapses to 1 under a single-session scope, but `project_count` still reflects the distinct projects represented by that neighbor's chunks *inside* the scoped session. Focused tests cover the all-NULL → 0 case, mixed-NULL distinct counting, the session-scope orthogonality rule, and an MCP transport-free parity regression.
- Progress: session-level `lantern entity-session-neighbors` now also supports opt-in shared-project evidence through `--show-projects`, and MCP `lantern_entity_session_neighbors` mirrors it with `show_projects`. When requested, each neighbor returns a deterministic sample of distinct non-NULL shared `project` values linking it to the source entity, while the default count-only path stays unchanged, explicit `0` collapses back to omission, and a `--session-id` filter scopes the sample to that session's chunks without forcing it down to one project.
- Progress: session-level `lantern entity-session-neighbors` now also supports opt-in chunk evidence through `--show-chunks`, and MCP `lantern_entity_session_neighbors` mirrors it with `show_chunks`. When requested, each neighbor returns a deterministic sample of shared-session chunk refs (`chunk_id`, `source_uri`, `snippet`) drawn from chunks that reference either side inside a session both entities share; the default shape still omits `chunks`, explicit `0` collapses back to omission, and `--session-id` scopes the sample to that session's chunks.
- Progress: `lantern entities` now also reports a `user_count` per entity — distinct upstream `chunks.user` values (NULL excluded) whose chunks reference it — extending the project-grouping spread metric to the user grouping field added in schema v15. Mirrors the existing `project_count` / `session_count` shape exactly: surfaced in text (`users=N` token alongside `sessions=N` / `projects=N`) and JSON (auto via serde on `EntityListEntry`), and `user_count == 0` cleanly distinguishes "no user metadata" from `user_count == 1` ("tied to one user") and `user_count > 1` ("spread across users"). Independent of the `session_id` filter — one session can legitimately span multiple users (e.g. a shared agent thread), so the count does NOT collapse to 1 under a session-scoped listing. MCP `lantern_entities` inherits the new field automatically via the shared `EntityListReport` shape. Focused tests cover the all-NULL → 0 case, distinct-user counting with a mixed-NULL row (alice/alice/bob/NULL → 2), and the user/session orthogonality under a `session_id` filter.
- Progress: `lantern entities` now also supports opt-in user evidence through `--show-users`, and MCP `lantern_entities` mirrors it with `show_users`. When requested, each entity returns a deterministic sample of distinct non-NULL `user` values that reference it, while the default flat listing stays unchanged, explicit `0` collapses back to omission, and a `--session-id` filter scopes the sample to that session's chunks without forcing it down to one user.
- Progress: chunk-level `lantern entity-neighbors` now also reports each neighbor's distinct `user_count` alongside `shared_chunks` / `chunk_count` / `session_count` / `project_count`, extending the user-grouping spread metric already surfaced on `lantern entities` to the neighbor graph. Distinct non-NULL `chunks.user` values are counted per neighbor and surfaced in both text (`users=N` token) and JSON (auto via serde) for `lantern entity-neighbors`; MCP `lantern_entity_neighbors` inherits the new field automatically via the shared `EntityNeighborsReport` shape. Follows the same chunk-scope rule as `chunk_count` rather than `session_count`: under a `session_id` filter the count reflects distinct users among the neighbor's chunks *inside* that session and does NOT collapse to 1, since a single session may legitimately span multiple users (e.g. a shared agent thread). Focused tests cover the all-NULL → 0 case, the distinct-user count with a mixed-NULL row, the session-scope orthogonality (`session_count` collapses to 1 while `user_count` stays at 2 within s1), and an MCP transport-free regression that pins the field on both global and session-scoped wire shapes.
- Progress: session-level `lantern entity-session-neighbors` now also reports each neighbor's distinct `user_count`, mirroring the existing spread metric on `lantern entities` and chunk-level `entity-neighbors`. Surfaced in text (`users=N`) and JSON, inherits through MCP `lantern_entity_session_neighbors`, excludes NULL users, and stays orthogonal to `--session-id` (session scope can still span multiple users). Focused tests cover the all-NULL → 0 case, mixed-NULL distinct counting, the session-scope orthogonality rule, and MCP parity.
- Progress: chunk-level `lantern entity-neighbors` and session-level `lantern entity-session-neighbors` now support opt-in shared-user evidence through `--show-users`, and the MCP `lantern_entity_neighbors` / `lantern_entity_session_neighbors` surfaces mirror it with `show_users`. When requested, each neighbor returns a deterministic sample of distinct non-NULL shared users linking it to the source entity; the default shape still omits `shared_users`, explicit `0` collapses back to omission, and `--session-id` scopes the sample to that session's chunks without forcing it down to one user.
- Progress: `lantern entities` now also reports a `topic_count` per entity — distinct upstream `chunks.topic` values (NULL excluded) whose chunks reference it — extending the spread metric to the topic grouping field added in schema v16. Mirrors the existing `project_count` / `user_count` shape: surfaced in text (`topics=N` token alongside `sessions=N` / `projects=N` / `users=N`) and JSON (auto via serde on `EntityListEntry`), with `topic_count == 0` cleanly distinguishing "no topic metadata" from `topic_count == 1` ("tied to one topic") and `topic_count > 1` ("spread across topics"). Independent of the `session_id` filter — one session can legitimately span multiple topics, so the count does NOT collapse to 1 under a session-scoped listing. MCP `lantern_entities` inherits the new field automatically via the shared `EntityListReport` shape. Focused tests cover the all-NULL → 0 case, distinct-topic counting with a mixed-NULL row (alpha/alpha/beta/NULL → 2), and the topic/session orthogonality under a `session_id` filter.
- Progress: `lantern entities` now also supports opt-in topic evidence through `--show-topics`, and MCP `lantern_entities` mirrors it with `show_topics`. When requested, each entity returns a deterministic sample of distinct non-NULL `topic` values that reference it; the default flat listing still omits the field, explicit `0` collapses back to omission, and a `--session-id` filter scopes the sample to that session's chunks without forcing it down to one topic. Focused tests now pin CLI parsing/defaults, core omission/zero behavior, deterministic ordering/truncation, scoped non-collapse, MCP parity, and the transport-free MCP callers in `tests/semantic_search.rs` that had to adopt the new arg.

### 7. Autonomous Consolidation

- Periodic job that merges related chunks into summaries
- Replace 50 near-duplicate chunks with 1 dense summary
- Triggered by threshold (chunk count, store size, time since last consolidation)
- Keep provenance chain — summaries link back to source chunks

### 8. Multi-Session Memory Linkage
**Why:** If you ingest 50 support sessions, the store knows they're 50 JSONL sources but has no notion that they're related to the same student or topic.
- Add session grouping metadata (topic, user, project, thread)
- Auto-detect session relationships (shared entities, temporal proximity, overlapping terms)
- Query across linked sessions: "what do we know about student X?"
- Session bundles — group sources into logical collections
- Progress: added `lantern sessions` plus `sessions::list_sessions`, a small read-only listing surface that groups chunks by the existing `chunks.session_id` column and reports per-session chunk count, distinct source count, and first/last `timestamp_unix`. Ordered by chunk count desc then session id asc, with `--limit` (default 50) truncating entries while `total_sessions` stays full. Chunks without a `session_id` are excluded — this is an inspectable surface, not a session-inference layer. Tests in `tests/sessions.rs` cover empty stores, exclusion of null sessions, multi-source grouping, timestamp aggregation, ordering, and limit/total separation.
- Progress: `lantern search` and `lantern query` now accept `--session-id`, and the MCP `lantern_search` surface now accepts `session_id`, all as exact filters on `chunks.session_id` across keyword, semantic, and hybrid retrieval. Focused tests in `tests/session_filter.rs` cover direct search filtering plus CLI search/query behavior, while existing semantic/MCP test coverage was updated to keep the new option wired through every path.
- Progress: added `lantern related-sessions <session_id>` plus `sessions::related_sessions`, the first inferred-linkage surface on top of the flat session listing. Walks `chunk_entities` from the source session's chunks out to other sessions, ranking results by distinct shared-entity count (descending) with `session_id` ascending as a deterministic tiebreaker. The source session is excluded from its own list, chunks where `session_id IS NULL` cannot become phantom neighbors, and the report surfaces both `source_chunk_count` / `source_entity_count` and per-related-session `shared_entity_count`, `shared_chunk_count`, total `chunk_count`, and first/last `timestamp_unix`. CLI supports `--limit` (default 50) with truncation visible via `total_related`, plus text/JSON output. MCP now exposes the same read path through `lantern_related_sessions` / `LanternServer::related_sessions_sync`, so agent callers do not have to shell out to inspect inferred related-session links. Focused tests in `tests/sessions.rs` cover the unknown-source error path, the no-shared-entities case, ranking + tiebreakers + source-self-exclusion, exclusion of null-session_id chunks, chunk_count / first-and-last timestamp aggregation across the full related session, and limit truncation with full `total_related`; a focused `src/mcp.rs` regression test now pins the transport-free MCP twin too. As a drive-by, fixed three pre-existing `SearchArgs` initializer call sites in `tests/semantic_search.rs` that were broken by the `session_id` field added in the previous slice.
- Progress: `lantern related-sessions` now also surfaces each related session's representative `project` when every chunk in that session agrees on one non-null value, mirroring the all-or-nothing rule already used by `lantern sessions`. The field is emitted in both text (`project=...`) and JSON (omitted when absent), and focused tests cover the consistent, mixed, partial-null, and JSON-shape cases.
- Progress: added `lantern temporal-sessions` plus `sessions::temporally_related_sessions`, a read-only temporal proximity surface that ranks sessions by timestamp-range gap (with optional `--window-secs`) and surfaces overlap / gap seconds, source timing bounds, and truncation metadata. The CLI and MCP twins are wired, and focused tests cover gap ordering, window filtering, and the transport-free MCP path.
- Progress: `lantern related-sessions` now supports optional shared-entity evidence through `--show-entities`, and the MCP `lantern_related_sessions` surface mirrors it with `show_entities`. When requested, each related session returns a deterministic sample of the actual shared entity kind/value pairs linking it to the source session, while the default count-only path stays unchanged. Focused tests cover omission by default, shared-only filtering, deterministic ordering/truncation, CLI parsing, and the transport-free MCP path.
- Progress: first higher-level grouping field is in. `Chunk` now carries an optional `project`, the JSONL extractor extracts it from `project`, `project_id`, `projectId`, `repo`, and `repository` (outer envelope wins over a nested `message` object), and ingest persists it into a new schema-v14 `chunks.project` column. `lantern sessions` now reports a representative `project` per session — surfaced only when every chunk in the session agrees on the same non-null project, otherwise `None` — in both text (`project=...` token) and JSON (`Option::is_none`-skipped). Focused tests cover JSONL alias extraction and outer/inner precedence, the ingest round-trip into the chunks table, the consistent / mixed / partial-null / all-null cases on the read surface, and JSON-shape parity. Schema-version assertions in `tests/{export,inspect,reindex}.rs` are bumped to 14.
- Progress: JSONL extraction now treats ChatCompletions-style `name` as role-conditional metadata instead of a blanket `tool_name` alias. For participant roles (`user`, `assistant`, `system`, `developer`), `name` now falls into `chunks.user`; for tool-like roles (`tool`, legacy `function`) and role-less ambiguous records it continues to feed `tool_name`. This prevents transcripts like `{"role":"user","name":"alice","content":"hi"}` from polluting tool provenance while preserving legacy tool-call rendering. Focused tests pin participant-role routing, explicit `user`-alias precedence over `name`, and the unchanged tool/function/no-role behavior.
- Progress: companion `user` slice now mirrors the `project` plumbing end-to-end. `Chunk.user` is extracted from `user`, `user_id`, `userId`, `author`, and `participant` aliases (outer envelope wins over a nested `message`), persisted into a new schema-v15 `chunks.user` column, and `lantern sessions` reports a representative `user` per session under the same all-or-nothing rule (every chunk must agree on a single non-null user; mixed / partial-null / untagged sessions surface `None`). Text output emits a `user=...` token and JSON skips the field via `Option::is_none`. Focused tests cover JSONL alias extraction + outer/inner precedence, the ingest round-trip, the four session-listing cases (consistent / mixed / partial-null / all-null), and JSON-shape parity. Schema-version assertions in `tests/{export,inspect,reindex}.rs` are bumped to 15.
- Progress: companion `topic` slice now mirrors `project` / `user` end-to-end. `Chunk.topic` is extracted from `topic`, `topic_id`, `topicId`, `subject`, and `category` aliases (outer envelope wins over a nested `message`), persisted into a new schema-v16 `chunks.topic` column, and `lantern sessions` reports a representative `topic` per session under the same all-or-nothing rule. Text output emits a `topic=...` token and JSON skips the field via `Option::is_none`. Focused tests cover JSONL alias extraction (each alias + primary-key precedence), outer/inner-message precedence, the ingest round-trip into `chunks.topic`, the four session-listing cases (consistent / mixed / partial-null / all-null), and JSON-shape parity. Schema-version assertions in `tests/{export,inspect,reindex}.rs` are bumped to 16.
- Progress: companion `thread` slice now mirrors `project` / `user` / `topic` end-to-end. `Chunk.thread` is extracted from `thread`, `thread_id`, and `threadId` aliases (outer envelope wins over a nested `message`), persisted into a new schema-v17 `chunks.thread` column, and `lantern sessions` reports a representative `thread` per session under the same all-or-nothing rule. Text output emits a `thread=...` token and JSON skips the field via `Option::is_none`. Focused tests cover JSONL alias extraction (including the interaction with the existing `session_id <- thread_id` fallback), outer/inner-message precedence, the ingest round-trip into `chunks.thread`, the four session-listing cases (consistent / mixed / partial-null / all-null), and JSON-shape parity. Schema-version assertions in `tests/{export,inspect,reindex}.rs` are bumped to 17.
- Remaining focus: inferred linkage still needs to be built on top of those primitives — for example related-session discovery from shared entities, temporal proximity, or overlapping terms; and bundle/query surfaces that operate on linked sessions rather than a single exact `session_id`.

## P2 — Polish

### 9. Pluggable Embedding Backend
**Why:** Ollama-only is a philosophical choice but blocks CI/server use (no GPU). An `EmbeddingBackend` trait with Ollama as default costs nothing.
- Define `EmbeddingBackend` trait: `fn embed(&self, texts: &[String]) -> Result<Vec<Vec<f32>>>`
- Implement for Ollama (default), OpenAI, Voyage
- `--embed-backend ollama|openai|voyage` flag
- Config file for API keys

### 10. Record Embedding Model in Query Envelope
**Status: DONE** — `search --format json` now includes the query embedding model for semantic/hybrid searches; keyword searches omit it.

### 11. `--include-ext` Flag for Ingest
**Why:** `is_supported_file` is a hardcoded allowlist. `.proto`, `.sql`, `.tf`, `.nix`, `.el`, `.vim` — forever adding one at a time.
- Add `--include-ext proto,sql,tf` CLI flag
- Consider MIME sniffing + UTF-8 validation as fallback (extension list as fast path)
- Or `--include-all-text` to accept any text file
- Progress: `lantern ingest` now accepts `--include-ext` as a comma-separated list, normalizes away leading dots and case, and threads the extra extensions into the ingest allowlist. Focused tests cover CLI normalization, rejection of empty values, and end-to-end ingest of a custom `.proto` file.

### 12. Code-Aware Chunking (Treesitter)
**Why:** Paragraph-break chunking slices function bodies mid-block. For code recall, function/class boundary chunking is a meaningful quality jump.
- Add `tree-sitter` dependency
- Chunk at function/class/module boundaries for supported languages
- Fall back to paragraph chunking for non-code files
- Languages: Rust, Python, Go, TypeScript first

### 13. Surface MAX_INGEST_BYTES in JSON Report
**Status: DONE** — skipped ingest entries now carry a structured `skipped_reason` code in JSON (`too_large`, `unchanged`, `error`) alongside the human-readable message.

### 14. Library API Surface
**Why:** Crate exposes modules but they're CLI-shaped (print to stdout, CLI-ish defaults). An agent framework wrapping this discovers `print_summary` is hardcoded to `println!`.
- Refactor modules to return data, not print
- `print_summary` / `print_json` become display layer only
- Core functions return `Result<Vec<SearchHit>>`, display functions format them
- Document the library API for embedding in other Rust projects

### 15. `.lantern-allow` for Ingest Allowlist Overrides
**Why:** The default file allowlist is still hardcoded. A project-local allow file would let users add or narrow what Lantern treats as ingestible without editing code.
- Add `.lantern-allow` alongside `.lantern-ignore`
- Allow overriding the preset allowed file extensions / patterns
- Keep it explicit so users can opt into extra file types like `.proto`, `.sql`, `.tf`, `.nix`, `.el`, `.vim`

### 16. PDF Ingestion
**Why:** A lot of useful knowledge lives in PDFs, and Lantern can't ingest them yet.
- Add a PDF extractor path for text-based PDFs
- Decide whether to use a Rust-native parser, PDF-to-text tooling, or a fallback pipeline
- Preserve provenance and page/byte metadata where possible
- Skip or report scanned/image-only PDFs cleanly if extraction fails

### 17. Embedding Progress Bar
**Why:** Long embedding runs are hard to judge from a distance, especially on big repos over Termux/remote Ollama. A progress bar would make ingest feel much less opaque.
- Show chunks processed vs total during embedding
- Surface current file / source being embedded
- Keep it quiet in JSON mode; only render when interactive/TTY

## Done
- [x] Clippy warnings fixed (clamp, redundant .into_iter(), collapsible if)
- [x] Source/chunk IDs widened from 8 to 16 bytes (SHA-256[:16])
- [x] Forget defaults to dry-run, requires --apply to delete
- [x] MCP server (rmcp, 11 tools, stdio + TCP)
- [x] Core CLI (init, ingest, search, query, show, inspect, export, diff, forget, reindex, stash, version)
- [x] FTS5 keyword search (BM25)
- [x] Semantic search (Ollama embeddings, cosine similarity)
- [x] Hybrid search (BM25 + cosine)
- [x] Batch embedding (32 chunks/request)
- [x] JSONL transcript ingestion
- [x] Stash/archive system
- [x] .lantern-ignore support (gitignore-style patterns + defaults)
- [x] Static builds (musl Linux x86_64 + aarch64, macOS aarch64)
- [x] Tagged release pipeline with checksums
- [x] File size limit (50MB)
- [x] Pre-v0.1.0 code review fixes
- [x] sqlite-vec rollout (extension load, vec mirror, auto-routing, and upgrade backfill)