sof 0.8.1

Solana Observer Framework for low-latency shred ingestion and plugin-driven transaction observation
Documentation
# SOF Observer Runtime: Code-Accurate Reverse Engineering

This document describes how `sof-observer` currently works in code (not a progress log).
Primary implementation lives in:

- `crates/sof-observer/src/runtime.rs`
- `crates/sof-observer/src/app/runtime/*`
- `crates/sof-observer/src/ingest/core.rs`
- `crates/sof-observer/src/shred/wire/*`
- `crates/sof-observer/src/reassembly/*`
- `crates/sof-observer/src/verify/core/*`
- `crates/sof-observer/src/relay/*`
- `crates/sof-observer/src/repair/core/*`
- `crates/sof-observer/src/framework/*`

## Runtime entrypoints and composition

- `sof_observer::runtime::run()` creates a Tokio multi-thread runtime and runs `run_async()`.
- `sof_observer::runtime::run_async()` orchestrates all ingest, parse, verify, reassembly, repair, telemetry, and tx classification.
- Ingest sources can run concurrently:
  - direct UDP listener (`SOF_BIND`, default `0.0.0.0:8001`)
  - gossip bootstrap sockets (`SOF_GOSSIP_ENTRYPOINT`, requires `gossip-bootstrap` feature)
    including TVU sockets and repair ingress sockets

All sources feed a shared `mpsc::Sender<RawPacketBatch>`.

## End-to-end pipeline

For each packet in the runtime loop:

1. Emit `on_raw_packet` plugin hook.
2. If gossip repair is enabled, detect and route signed repair requests/pings to repair handler.
3. Parse shred header (`parse_shred_header`).
4. Emit `on_shred` plugin hook.
5. Apply duplicate suppression (signature+slot+index+fec+version+variant key).
6. Optionally run cryptographic verification (`ShredVerifier`).
7. Insert valid shreds into the bounded relay cache.
8. In gossip mode, optionally forward bounded UDP relay traffic to selected TVU peers.
9. Feed packet into FEC recovery (`FecRecoverer`) to attempt recovered data shreds.
10. Update missing-shred tracker (repair), fork/slot state, coverage metrics, and counters.
11. Feed data shreds into dataset reassembler; enqueue completed datasets to worker queues.
12. Process recovered data shreds through the same dataset path (with separate counters).

Dataset workers then:

1. De-duplicate recent dataset decode attempts.
2. Decode deshredded payload into `Vec<Entry>`.
3. Classify tx kind (`VoteOnly`, `Mixed`, `NonVote`).
4. Emit `on_transaction` and `on_dataset` plugin hooks.
5. Publish `TxObservedEvent` back to main runtime for aggregate metrics/logging.

## Wire format mirrored in code

Constants are in `crates/sof-observer/src/protocol/shred_wire.rs`.

Common header offsets:

- signature: `[0..64)`
- shred variant: byte `64`
- slot: `[65..73)` LE
- index: `[73..77)` LE
- version: `[77..79)` LE
- fec_set_index: `[79..83)` LE

Data header:

- parent_offset: `[83..85)` LE
- flags: byte `85`
- size: `[86..88)` LE

Coding header:

- num_data_shreds: `[83..85)` LE
- num_coding_shreds: `[85..87)` LE
- position: `[87..89)` LE

Variant high nibble mapping:

- `0x60`: merkle code
- `0x70`: merkle code resigned
- `0x90`: merkle data
- `0xB0`: merkle data resigned

Parser behavior (`shred/wire/parser.rs`):

- Data packets must be at least `SIZE_OF_DATA_SHRED_PAYLOAD` (`1203`) bytes.
- Coding packets must be at least `SIZE_OF_CODING_SHRED_PAYLOAD` (`1228`) bytes.
- Data `size` must be `>= 88` and `<=` computed variant-specific max (accounts for merkle proof + optional resigned sig trailer).
- Coding header must satisfy:
  - `num_data_shreds > 0`
  - `num_coding_shreds > 0`
  - `position < num_coding_shreds`

## Verification model

Verification is optional (`SOF_VERIFY_SHREDS`, default `false`).

`ShredVerifier` does:

1. Parse variant/signature/slot/index/fec_set from packet.
2. Compute merkle root from shred bytes + proof path.
3. Verify signature against:
  - slot leader map (if present),
  - cached signature->pubkey hits,
  - known pubkeys list.

Statuses:

- `Verified`
- `UnknownLeader`
- `InvalidMerkle`
- `InvalidSignature`
- `Malformed`

`SOF_VERIFY_STRICT=false` accepts `UnknownLeader`; strict mode drops it.

Slot-leader bootstrap:

- No RPC pre-load path is used.
- In gossip mode, known pubkeys are continuously refreshed from repair peer snapshot.

## FEC recovery model

`FecRecoverer` groups packets by `(slot, fec_set_index)`.

- Tracks raw data shreds by `index`.
- Tracks coding shreds by `position` and enforces one consistent `(num_data, num_coding)` config.
- Rejects mixed variant class in same set (proof size/resigned mismatch).
- Runs Reed-Solomon reconstruction once total present shards is at least `num_data`.
- Recovered data shreds are synthesized and re-validated by `parse_shred`.

Recovered shreds can be independently verified when `SOF_VERIFY_RECOVERED_SHREDS=true`.

## Dataset reassembly and tx extraction

`DataSetReassembler` tracks per-slot fragment streams:

- Uses `DATA_COMPLETE`/`LAST_IN_SLOT` to mark dataset boundaries.
- Emits contiguous ranges `[start_index..end_index]` when all shreds in that interval exist.
- Supports "tail emission without slot-0 anchor" when contiguous tail length reaches `SOF_DATASET_TAIL_MIN_SHREDS` (default `2`).

`process_completed_dataset()` then:

- Tries deshred/decode with progressive prefix skipping to recover from missing early shreds.
- Deserializes entry payload as `WincodeVec<Entry>`.
- Iterates transactions and classifies kinds:
  - vote program only (plus optional compute budget): `VoteOnly`
  - vote + non-vote programs: `Mixed`
  - no vote program: `NonVote`

## Repair subsystem

Repair requires gossip mode (`gossip-bootstrap` feature + `SOF_REPAIR_ENABLED=true`).

Core pieces:

- `MissingShredTracker`: slot/FEC gap detection and request planning.
- `OutstandingRepairRequests`: in-flight dedupe/timeout for sent requests.
- `GossipRepairClient`: peer discovery, scoring, and UDP request send.

Request types:

- `WindowIndex {slot, index}`
- `HighestWindowIndex {slot, index}`

Planner behavior:

- defers requests by settle delay (`SOF_REPAIR_SETTLE_MS`, default `250`)
- enforces cooldown (`SOF_REPAIR_COOLDOWN_MS`, default `150`)
- respects slot-lag policy (`SOF_REPAIR_MIN_SLOT_LAG`, default `4`)
- applies per-slot cap (`SOF_REPAIR_PER_SLOT_CAP`, default `16`)
- supports forward probing

Wire send path:

- Requests are serialized with fixed-int bincode `RepairProtocol::*`.
- Signature field is patched after signing the signable bytes.
- Repair response pings are recognized and answered with signed pongs.

Peer selection:

- Candidate peers from `cluster_info.repair_peers(slot)` (fallback to compatible gossip peers).
- Scores include send success/error, observed ping RTT, and shred source hit hints.
- Top-ranked active peers are retained; pick is weighted by score-derived weight.

## Gossip bootstrap and runtime switching

When `SOF_GOSSIP_ENTRYPOINT` is set:

- Entrypoints are expanded/resolved; startup falls back across candidates until one stabilizes.
- Node sockets include TVU receive sockets and repair socket; both feed ingest.
- Shred version is discovered from entrypoint unless overridden (`SOF_SHRED_VERSION`).

Runtime switch (only when repair is enabled and switch is enabled):

- Detect stall from shred age and dataset age thresholds.
- Optionally switch to alternate entrypoint.
- Supports overlap switching if a secondary non-overlapping port range is configured or auto-split is possible.
- Uses stabilization checks (sustain, packet minimum, max wait) before committing new runtime.

## Plugin framework (current runtime wiring)

Hook API:

- `on_raw_packet`
- `on_shred`
- `on_dataset`
- `on_transaction`
- `on_slot_status`
- `on_reorg`
- `on_recent_blockhash`
- `on_cluster_topology` (gossip-bootstrap)
- `on_leader_schedule` (gossip-bootstrap)

Current packaged runtime constructs `PluginHostBuilder::new().build()` (empty host).
So hooks exist and are called, but no plugins are registered unless you wire a non-empty `PluginHost` in your embedding runtime.

## Telemetry cadence

Every 5 seconds, runtime logs a consolidated `ingest telemetry` event with:

- packet/source-port mix
- parse/verify stats
- dedupe stats
- FEC/dataset worker queue stats
- tx class counts
- repair request and peer stats
- gossip switch stats
- slot coverage window stats

For full environment knobs and defaults, see `docs/operations/advanced-env.md`.