Progress JSONL sink for quality semantic backfill.
When CASS_SEMANTIC_PROGRESS_JSONL=/abs/path/to/file.jsonl is set,
the semantic backfill code path appends one JSON object per transition
event to that file. Each event carries a timestamp, a phase + sub-phase,
a row/batch counter where applicable, the wall-time delta since the
sink was started, and a cheap RSS estimate.
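As a rough illustration of the record shape described above, the sketch below renders one event as a single JSON line using only the standard library. The field names (`ts_ms`, `phase`, `sub_phase`, `rows`, `batch`, `elapsed_ms`, `rss_kb`), the `Counters` struct, and the `render_event` function are assumptions made for this example; the module's actual schema may differ.

```rust
/// Hypothetical counters; the real `SemanticProgressFields` may carry others.
#[derive(Default)]
pub struct Counters {
    pub rows: Option<u64>,
    pub batch: Option<u64>,
}

/// Render one JSONL record. Field names are illustrative assumptions.
/// `phase` and `sub_phase` are assumed to be plain identifiers that need
/// no JSON escaping.
pub fn render_event(
    ts_ms: u128,
    phase: &str,
    sub_phase: &str,
    counters: &Counters,
    elapsed_ms: u128,
    rss_kb: u64,
) -> String {
    let mut line = format!(
        "{{\"ts_ms\":{},\"phase\":\"{}\",\"sub_phase\":\"{}\"",
        ts_ms, phase, sub_phase
    );
    // Counters that do not apply are omitted instead of written as null.
    if let Some(rows) = counters.rows {
        line.push_str(&format!(",\"rows\":{}", rows));
    }
    if let Some(batch) = counters.batch {
        line.push_str(&format!(",\"batch\":{}", batch));
    }
    line.push_str(&format!(
        ",\"elapsed_ms\":{},\"rss_kb\":{}}}",
        elapsed_ms, rss_kb
    ));
    line
}
```

Keeping each record on one line means the file stays valid JSONL even when a run is interrupted between events.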
Goal — give operators enough proof, during long-running quality semantic backfill runs, to tell whether time is going to selection, packet replay, embedding, staging, checkpoint, or publish; and to distinguish storage-side stalls from model-inference stalls. See cass#257.
Env-var family: matches the existing CASS_SEMANTIC_* namespace (see
src/search/policy.rs and src/indexer/semantic.rs). The sink itself
is silent when the env var is unset, so it has zero cost for normal
operation. Writes are best-effort: a failed write is logged at debug
and never propagated upward — we never want telemetry to crash a
backfill that would otherwise succeed.
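The activation and best-effort behaviour described above could look roughly like the following. This is a minimal sketch, not the module's code: `configured_path` and `try_append` are hypothetical helpers, and `eprintln!` stands in for whatever debug-level logging the crate actually uses.

```rust
use std::fs::OpenOptions;
use std::io::Write;

/// Env var name as given in the module docs.
pub const ENV_PROGRESS_JSONL: &str = "CASS_SEMANTIC_PROGRESS_JSONL";

/// Returns the configured path, or `None` when the env var is unset.
pub fn configured_path() -> Option<String> {
    std::env::var(ENV_PROGRESS_JSONL).ok()
}

/// Append one line to `path`. Returns `true` if the line was written.
/// A failure is reported on stderr and swallowed, never propagated.
pub fn try_append(path: Option<&str>, line: &str) -> bool {
    let path = match path {
        Some(p) => p,
        None => return false, // sink inactive: do nothing at all
    };
    let result = OpenOptions::new()
        .create(true)
        .append(true)
        .open(path)
        .and_then(|mut f| writeln!(f, "{}", line));
    if let Err(err) = &result {
        // Stand-in for a debug-level log call.
        eprintln!("progress sink write failed: {}", err);
    }
    result.is_ok()
}
```

A caller would pass `configured_path().as_deref()` and ignore the return value, so a missing directory or a full disk costs one log line and nothing else.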
Structs§
- SemanticProgressFields - Optional counters carried by an event. Every field is None when not applicable; JSON serializers should skip nulls so the row stays readable.
- SemanticProgressSink - Opens the sink file (append, create) on the first event and caches the handle in a Mutex. We deliberately accept the cost of taking a Mutex on every event because the JSONL stream carries several events per batch, not per row; even a 50ms batch wall-time dwarfs the lock cost.
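The open-on-first-event, cache-in-a-Mutex pattern described for the sink can be sketched as follows. `LazySink` and its methods are hypothetical names chosen for this example and are not the real `SemanticProgressSink` API; the sketch returns an `io::Result` so the behaviour is easy to inspect, whereas a best-effort caller would discard it.

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::PathBuf;
use std::sync::Mutex;

/// Hypothetical stand-in for the sink: the file is opened lazily and the
/// handle is shared behind a Mutex.
pub struct LazySink {
    path: PathBuf,
    handle: Mutex<Option<File>>,
}

impl LazySink {
    pub fn new(path: PathBuf) -> Self {
        LazySink { path, handle: Mutex::new(None) }
    }

    /// Append one line, opening the file (append, create) on first use.
    pub fn emit(&self, line: &str) -> std::io::Result<()> {
        // Keep going even if another thread panicked while holding the lock.
        let mut guard = self.handle.lock().unwrap_or_else(|e| e.into_inner());
        if guard.is_none() {
            let file = OpenOptions::new()
                .create(true)
                .append(true)
                .open(&self.path)?;
            *guard = Some(file);
        }
        match guard.as_mut() {
            // The lock is held for the whole write, so lines never interleave.
            Some(file) => writeln!(file, "{}", line),
            None => Ok(()),
        }
    }
}
```

Because the lock is taken once per event, and events arrive a few times per batch, the contention cost stays small next to the batch wall-time.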
Enums§
- SemanticProgressEvent - The 16 named transition events. Strings deliberately mirror the phase + sub_phase columns in each emitted record so a jq user can filter on event name OR phase as they prefer.
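One way an event enum can keep its name in step with the phase and sub-phase columns is to derive the name from them. The sketch below is an assumption about how such mirroring might work: the three variants, the sub-phase strings, and the `phase_subphase` naming scheme are invented for illustration and are not the module's actual 16 events.

```rust
/// Hypothetical subset of transition events; the real enum has 16 variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ProgressEvent {
    SelectionStart,
    EmbeddingBatchDone,
    PublishDone,
}

impl ProgressEvent {
    /// Coarse phase, written to the record's phase column.
    pub fn phase(self) -> &'static str {
        match self {
            ProgressEvent::SelectionStart => "selection",
            ProgressEvent::EmbeddingBatchDone => "embedding",
            ProgressEvent::PublishDone => "publish",
        }
    }

    /// Step within the phase, written to the sub_phase column.
    pub fn sub_phase(self) -> &'static str {
        match self {
            ProgressEvent::SelectionStart => "start",
            ProgressEvent::EmbeddingBatchDone => "batch_done",
            ProgressEvent::PublishDone => "done",
        }
    }

    /// Event name built from the two columns, so it cannot drift from them.
    pub fn name(self) -> String {
        format!("{}_{}", self.phase(), self.sub_phase())
    }
}
```

With names built this way, filtering a stream on the event name or on the phase column selects the same records.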
Constants§
- ENV_PROGRESS_JSONL - Env var that activates the sink and names the output file.
- PROGRESS_JSONL_SCHEMA - Schema version for the JSONL event stream. Bump on any breaking change to event names or fields.