Module semantic_progress

Progress JSONL sink for quality semantic backfill.

When CASS_SEMANTIC_PROGRESS_JSONL=/abs/path/to/file.jsonl is set, the semantic backfill code path appends one JSON object per transition event to that file. Each event carries a timestamp, a phase + sub-phase, a row/batch counter where applicable, the wall-time delta since the sink was started, and a cheap RSS estimate.
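
As an illustration of the record shape described above, here is a minimal sketch that renders one event as a single JSON line and omits counters that do not apply. The `Fields` struct, the `render_line` function, and every key other than `phase` and `sub_phase` are hypothetical names chosen for this example; the module's actual field set and serializer may differ.

```rust
/// Hypothetical optional counters; the real `SemanticProgressFields` may differ.
#[derive(Default)]
pub struct Fields {
    pub rows: Option<u64>,
    pub batch: Option<u64>,
    pub rss_bytes: Option<u64>,
}

/// Render one JSONL record as a single line (no trailing newline).
/// Assumes `phase` and `sub_phase` are plain identifiers that need no JSON escaping.
pub fn render_line(
    ts_ms: u64,
    phase: &str,
    sub_phase: &str,
    elapsed_ms: u64,
    fields: &Fields,
) -> String {
    let mut line = format!(
        "{{\"ts_ms\":{},\"phase\":\"{}\",\"sub_phase\":\"{}\",\"elapsed_ms\":{}",
        ts_ms, phase, sub_phase, elapsed_ms
    );
    // Emit a counter only when it is present, so rows carry no null-valued keys.
    for (key, value) in [
        ("rows", fields.rows),
        ("batch", fields.batch),
        ("rss_bytes", fields.rss_bytes),
    ] {
        if let Some(v) = value {
            line.push_str(&format!(",\"{}\":{}", key, v));
        }
    }
    line.push('}');
    line
}
```

Leaving absent counters out, rather than writing them as `null`, keeps each row short and lets a `jq` filter test for a key's presence directly.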

Goal: give operators enough evidence, during long-running quality semantic backfill runs, to tell whether time is going to selection, packet replay, embedding, staging, checkpoint, or publish, and to distinguish storage-side stalls from model-inference stalls. See cass#257.

Env-var family: matches the existing CASS_SEMANTIC_* namespace (see src/search/policy.rs and src/indexer/semantic.rs). The sink does nothing when the env var is unset, so it adds essentially no cost to normal operation. Writes are best-effort: a failed write is logged at debug level and never propagated upward, because telemetry must never crash a backfill that would otherwise succeed.
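
The gating and best-effort behaviour described above can be sketched roughly as follows. This is not the module's code: `target_from_env` and `emit_best_effort` are hypothetical helpers, treating an empty value as unset is an assumption, and `eprintln!` stands in for the debug-level log call.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::{Path, PathBuf};

/// Read the sink target from the environment; `None` means the sink is disabled.
/// Treating an empty value as "unset" is an assumption of this sketch.
pub fn target_from_env() -> Option<PathBuf> {
    std::env::var_os("CASS_SEMANTIC_PROGRESS_JSONL")
        .filter(|value| !value.is_empty())
        .map(PathBuf::from)
}

/// Append `line` plus a newline to `path`, creating the file if needed.
fn append_line(path: &Path, line: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new().append(true).create(true).open(path)?;
    // Build the full record first so it goes out in a single write call.
    let mut record = String::with_capacity(line.len() + 1);
    record.push_str(line);
    record.push('\n');
    file.write_all(record.as_bytes())
}

/// Best-effort emit: reports whether a line was written and never returns an error.
pub fn emit_best_effort(target: Option<&Path>, line: &str) -> bool {
    match target {
        // Sink disabled: return before doing any I/O.
        None => false,
        Some(path) => match append_line(path, line) {
            Ok(()) => true,
            Err(err) => {
                // Stand-in for a debug-level log; the failure is swallowed here.
                eprintln!("semantic progress write failed: {}", err);
                false
            }
        },
    }
}
```

A caller would pass `target_from_env().as_deref()` as the first argument. Because the function returns a plain `bool` rather than a `Result`, the caller has no error to propagate with `?`, so a telemetry failure cannot abort the backfill by accident.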

Structs

SemanticProgressFields
Optional counters carried by an event. Every field is None when not applicable — JSON serializers should skip nulls so the row stays readable.
SemanticProgressSink
Opens the sink file (append, create) on the first event and caches the handle in a Mutex. Taking a Mutex on every event is a deliberate trade-off: the JSONL stream carries several events per batch, not per row, so even a 50 ms batch wall-time dwarfs the lock cost.
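
The lazy-open, Mutex-cached handle described for SemanticProgressSink might look roughly like the sketch below. `LazySink` and its methods are hypothetical names for this example, and recovering from a poisoned lock is an assumption rather than documented behaviour.

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::PathBuf;
use std::sync::Mutex;

/// Hypothetical sink: the file is opened on the first `append`, then reused.
pub struct LazySink {
    path: PathBuf,
    handle: Mutex<Option<File>>,
}

impl LazySink {
    /// Creating the sink does no I/O; the file is not touched until the first event.
    pub fn new(path: PathBuf) -> Self {
        Self { path, handle: Mutex::new(None) }
    }

    /// Append one line. The lock is held for the whole write, so lines from
    /// different threads cannot interleave.
    pub fn append(&self, line: &str) -> std::io::Result<()> {
        // Assumption: keep going after a poisoned lock, since a panic in another
        // thread should not disable telemetry.
        let mut guard = self
            .handle
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        if guard.is_none() {
            let file = OpenOptions::new()
                .append(true)
                .create(true)
                .open(&self.path)?;
            *guard = Some(file);
        }
        let file = guard.as_mut().expect("handle was initialised above");
        let mut record = String::with_capacity(line.len() + 1);
        record.push_str(line);
        record.push('\n');
        file.write_all(record.as_bytes())
    }

    /// True once the first successful open has happened.
    pub fn is_open(&self) -> bool {
        self.handle
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner())
            .is_some()
    }
}
```

Shared behind an `Arc`, one `LazySink` can serve several worker threads. An uncontended lock is far cheaper than a batch that takes tens of milliseconds, which is the trade-off the description above accepts.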

Enums

SemanticProgressEvent
The 16 named transition events. Strings deliberately mirror the phase + sub_phase columns in each emitted record so a jq user can filter on event name OR phase as they prefer.
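
To illustrate how an event name can mirror the phase and sub_phase columns, here is a small sketch with three made-up variants. The variant names, the sub-phase strings, and the `phase_sub_phase` joining rule are all assumptions; the real enum has 16 events, and this page does not list their names.

```rust
/// Hypothetical subset of events; not the module's actual variant list.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Event {
    SelectionStart,
    EmbeddingBatchDone,
    PublishDone,
}

impl Event {
    /// Value for the `phase` column.
    pub fn phase(self) -> &'static str {
        match self {
            Event::SelectionStart => "selection",
            Event::EmbeddingBatchDone => "embedding",
            Event::PublishDone => "publish",
        }
    }

    /// Value for the `sub_phase` column.
    pub fn sub_phase(self) -> &'static str {
        match self {
            Event::SelectionStart => "start",
            Event::EmbeddingBatchDone => "batch_done",
            Event::PublishDone => "done",
        }
    }

    /// Event name derived from the two columns (assumed `phase_sub_phase` form),
    /// so filtering by name or by phase selects the same rows.
    pub fn name(self) -> String {
        format!("{}_{}", self.phase(), self.sub_phase())
    }
}
```

Deriving the name from the two columns, rather than storing a third independent string, means the name cannot drift out of step with the phase values.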

Constants

ENV_PROGRESS_JSONL
Env var that activates the sink and names the output file.
PROGRESS_JSONL_SCHEMA
Schema version for the JSONL event stream. Bump on any breaking change to event names or fields.