net-mesh 0.25.0

High-performance, schema-agnostic, backend-agnostic event bus
# Performance Analysis: Compute (Scheduler, Groups, Load Balance, Daemon Host)

Supplemental to the unified report. Focuses on the compute runtime — the daemon-host hot path that processes events on behalf of user workloads, the per-event group-routing logic, the load balancer, and the scheduler. Items continue from #147.

The compute path is hotter than it looks. Every event delivered to a daemon hits the host's `deliver`, which runs causal-link bookkeeping per event. Every event routed through a fork/replica/standby group runs the load balancer's `select`. Measured per event, the load balancer's hot path turns out to be one of the most allocation-heavy paths in the entire codebase.

---

## ✅ Fixed

| # | Item | Notes |
|---|------|-------|
| 149 | `EndpointState::metrics()` `RwLock<LoadMetrics>` read + clone per call → `ArcSwap<LoadMetrics>` + `load_score()` helper | Pre-fix every per-event select strategy (`select_least_load`, `select_power_of_two`, `select_adaptive`, etc.) called `state.metrics().load_score()` which acquired a parking_lot read lock, deep-cloned the 9-field `LoadMetrics` struct, then computed `load_score()` and dropped the clone. For 100 endpoints + LeastLatency = 100 RwLock acquires + 100 clones per event. Switched `metrics: RwLock<LoadMetrics>` → `ArcSwap<LoadMetrics>`; reads become one lock-free Acquire load. Added `EndpointState::load_score()` that runs `self.metrics.load().load_score()` — no clone, the ArcSwap guard holds a borrowed reference into the current Arc. The 13 internal call sites (every `state.metrics().load_score()`) switched to `state.load_score()`. The legacy `metrics()` accessor stays for `LoadBalancer::endpoints()` which materializes full `Endpoint` structs for operator inventory consumers — it does `(**self.metrics.load()).clone()` (one Arc deref + one struct clone, no lock). Updates (`update_metrics`, operator-cadence) call `metrics.store(Arc::new(metrics))`. Pinned by `endpoint_state_metrics_arc_swap_visibility_and_no_clone_on_read`: asserts `Arc::ptr_eq` across two consecutive reads with no intervening write (would fail under `RwLock<T>` or a swap-via-clone alternative), and that post-update `load_score()` reflects the new value. |
| 151 | `Scheduler::pick_best_candidate` full sort → `max_by` over the finite-score iterator | Pre-fix the finite-scoring candidates were `Vec::sort_by`'d (O(N log N)) and `first()`'d to extract the single winner — for 1000 candidates that's a full O(N log N) sort just to take one element. Post-fix is `max_by` over the same filter chain (O(N)) with the tie-break direction inverted relative to the legacy sort because `max_by`'s "greater wins" is the opposite of `sort_by`'s "less goes first": score becomes `sa.partial_cmp(sb)` (was `sb.partial_cmp(sa)`) and tie-break becomes `tie_break_compare(b, a, tie_break)` (was `tie_break_compare(a, b, tie_break)`). NaN is filtered upstream so `partial_cmp.unwrap_or(Equal)` is a safety belt only. Existing `pick_best_candidate_drops_nan_scores` re-pins the tie-break direction across the swap. |
| 152 | `GroupCoordinator::origin_hash_for_entity_id` linear scan → `HashMap<NodeId, u64>` reverse index | Pre-fix the per-routed-event hop "LB returned `entity_id_bytes` → which member's `origin_hash` does that belong to?" ran a linear scan over `members`, comparing 32-byte `NodeId`s. For a 100-member group at 100K events/sec that's 10M × 32-byte equality checks per second. Post-fix `GroupCoordinator` carries `origin_hash_by_entity_id: HashMap<NodeId, u64>` maintained on `add_member` / `remove_last` / `update_member_placement`, and the lookup is a single O(1) HashMap probe. Pinned by `origin_hash_lookup_uses_reverse_index_across_mutations` (covers all three mutation paths plus the unknown-key None case). |
| 153 | `DaemonHost::deliver` skips the `horizon.encode()` xxh3 walk on the no-output path | Pre-fix `horizon.encode()` fired on every `deliver()` even when `daemon.process()` produced zero outputs — walking every entry in the horizon hashmap and computing `xxh3_64` per entry, then dropping the encoded value because there's nothing to attach it to. For a horizon tracking 16 origins that's 16 xxh3 computes per event for observing / filtering / state-update daemons that take the no-output branch most of the time. Post-fix `deliver` early-returns `Ok(Vec::new())` when `outputs.is_empty()`, *after* `horizon.observe` and the `events_processed` bump (so observation accounting is unchanged) but before the encode + `chain.append` loop. Pinned by `deliver_skips_horizon_encode_when_daemon_returns_no_outputs` — asserts `events_processed == 1`, `events_emitted == 0`, AND `chain.head()` is unchanged (which proves `chain.append` never ran). |
| 154 | `compute_parent_hash` allocating concatenate-then-hash → streaming `Xxh3::update` | Pre-fix this allocated a `Vec::with_capacity(CAUSAL_LINK_SIZE + prev_payload.len())` per output event, extended the 32-byte link bytes + the payload into it, ran one-shot `xxh3_64`, and dropped the Vec. For daemons emitting 100K events/sec with 1 KB payloads: 100K allocs/sec plus 100 MB/sec of memcpy bandwidth just for parent-hash computation. Post-fix drives an `xxhash_rust::xxh3::Xxh3` streaming hasher with `update(&link_bytes)` + `update(prev_payload)` — zero intermediate allocation. The streaming digest matches the legacy concatenated one-shot bit-for-bit (load-bearing for forward / backward compat since `parent_hash` is part of the chain-validation invariant); pinned by `compute_parent_hash_streaming_matches_concatenated_oneshot` across empty, exact-block, just-over-block, and multi-block payloads. |
| 155 | `EndpointState::last_selected` `Mutex<Instant>` → `AtomicU64` of nanos since a process-wide baseline | Pre-fix every successful `try_record_request` reservation locked a parking_lot `Mutex<Instant>` and stamped `Instant::now()`. The field is purely observational (never read inside this module today — the Mutex was being used as a cell, not for mutual exclusion). For 100K successful selections/sec that's 100K lock+unlock pairs of pure overhead. Post-fix `last_selected: AtomicU64` stores nanos elapsed since a static `LB_INSTANT_BASELINE: OnceLock<Instant>` (lazy-initialized on first endpoint construction); the write becomes one Relaxed store and the read becomes one Relaxed load. `Instant::now()` is still consulted (same syscall as the legacy stamp) — the win is the eliminated Mutex acquisition per non-replay selection. |
| 158 | `mark_healthy` / `mark_unhealthy` / `update_member_placement` linear `iter_mut().find(\|m\| m.index == index)` → direct `members.get_mut(index as usize)` | Pre-fix every health update scanned the members vec by `m.index == index` even though `index` matches the Vec position (the `for index in 0..n` construction loops in `replica_group.rs`, `fork_group.rs`, and `standby_group.rs` push in dense 0..n order). Post-fix is O(1) `members.get_mut(index as usize)` with a defensive `member.index == index` re-check inside the slot — if a future change ever breaks the dense-index invariant, the slow path of "do nothing" is strictly safer than acting on the wrong member. Pinned by `mark_health_resolves_via_direct_index` (covers the happy path + a 99-index out-of-range no-op assertion). |

---

## 🔴 High-impact

### 148. `LoadBalancer::get_available_endpoints` walks every endpoint and clones per match — per event

**Location:** `behavior/loadbalance.rs:885-948`:

```rust
fn get_available_endpoints(&self, ctx: &RequestContext) -> Result<Vec<Arc<EndpointState>>, ...> {
    let mut available = Vec::new();
    let mut zone_matches = Vec::new();

    for entry in self.endpoints.iter() {        // <-- full DashMap walk per event
        let state = entry.value();
        if !state.is_available() { continue }       // atomic load
        if state.is_circuit_open(recovery_time) { continue }   // atomic load + clock
        if state.connections.load(Relaxed) >= max { continue }   // atomic load
        if !ctx.required_tags.is_empty() && !ctx.required_tags.iter().all(|t| state.tags.contains(t)) { continue }
        // ...
        available.push(Arc::clone(state));     // atomic refcount bump per match
    }
    // ...
}
```

This runs on **every routed event**. For a load balancer with 100 endpoints and 100K events/sec:

- 10M DashMap iterations/sec
- 30-40M atomic loads/sec (3-4 per endpoint × 100 endpoints × 100K events)
- 5-10M Arc refcount bumps/sec for survivors
- 2 fresh Vec allocations per event = 200K allocs/sec

This is the dominant cost on the compute-group routing path.

**Fix:**
Maintain a snapshot of available endpoints, updated only on health/circuit/enabled state changes. Routing reads it via `ArcSwap<Vec<Arc<EndpointState>>>`:

```rust
struct LoadBalancer {
    // existing fields ...
    available_snapshot: ArcSwap<Vec<Arc<EndpointState>>>,
}
```

Per `select()`: one atomic load (the ArcSwap), zero iteration, zero allocation. Health/circuit state changes update the snapshot off the hot path.

For zone-aware routing, stratify into `HashMap<Zone, ArcSwap<Vec<...>>>` so zone lookup is a single ArcSwap load.

**This is the single biggest per-event win on the compute path.**
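
A minimal std-only sketch of the snapshot idea, under stated assumptions: `RwLock<Arc<Vec<_>>>` stands in for `ArcSwap` so the example has no dependencies (readers here still take a cheap read lock, which `ArcSwap` would avoid), and the trimmed `EndpointState`, `rebuild_snapshot`, `set_available`, and `available` are hypothetical names, not the crate's API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, RwLock};

// Hypothetical, trimmed-down endpoint: only the availability flag matters here.
struct EndpointState {
    id: u32,
    available: AtomicBool,
}

struct LoadBalancer {
    endpoints: Vec<Arc<EndpointState>>,
    // Stand-in for `ArcSwap<Vec<Arc<EndpointState>>>`.
    available_snapshot: RwLock<Arc<Vec<Arc<EndpointState>>>>,
}

impl LoadBalancer {
    fn new(endpoints: Vec<Arc<EndpointState>>) -> Self {
        let lb = LoadBalancer {
            endpoints,
            available_snapshot: RwLock::new(Arc::new(Vec::new())),
        };
        lb.rebuild_snapshot();
        lb
    }

    // Cold path: runs only when health/circuit/enabled state changes.
    fn rebuild_snapshot(&self) {
        let fresh: Vec<Arc<EndpointState>> = self
            .endpoints
            .iter()
            .filter(|e| e.available.load(Ordering::Acquire))
            .cloned()
            .collect();
        *self.available_snapshot.write().unwrap() = Arc::new(fresh);
    }

    fn set_available(&self, id: u32, up: bool) {
        if let Some(e) = self.endpoints.iter().find(|e| e.id == id) {
            e.available.store(up, Ordering::Release);
            self.rebuild_snapshot();
        }
    }

    // Hot path: one Arc clone of the whole snapshot, no per-endpoint work.
    fn available(&self) -> Arc<Vec<Arc<EndpointState>>> {
        Arc::clone(&self.available_snapshot.read().unwrap())
    }
}
```

Two consecutive reads with no intervening state change return the same `Arc`. A reader that loaded the snapshot before a change keeps its old, internally consistent view until it loads again.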

### 149. `EndpointState::metrics()` does a RwLock read + full clone per call — called per endpoint per select for several strategies

**Location:** `behavior/loadbalance.rs:280-282`:
```rust
fn metrics(&self) -> LoadMetrics {
    self.metrics.read().clone()
}
```

Called from `select_least_latency`, `select_least_load`, `select_power_of_two`, `select_adaptive`. AND from every `Selection { ..., load_score: state.metrics().load_score() }` site — meaning **even RoundRobin pays the full metrics clone** just to populate an output field.

Per select with 100 endpoints + LeastLatency: **100 RwLock reads + 100 LoadMetrics clones per event**.

**Fix:**
- `LoadMetrics` is probably a small struct of `AtomicU64`s wrapped in a holder. Replace `metrics: RwLock<LoadMetrics>` with `metrics: LoadMetricsAtomic` where each field is an atomic. Reads become atomic loads with no clone.
- For the `Selection { load_score: ... }` field: only populate it when the strategy actually uses it. RoundRobin shouldn't pay this cost.

For 100K events/sec on a 100-endpoint LB, that's 10M RwLock-read + clone ops/sec eliminated.

### 150. `select_consistent_hash` rebuilds the full hash ring per event

**Location:** `behavior/loadbalance.rs:1131-1170`:
```rust
let mut ring: Vec<(u64, NodeId)> = self.hash_ring
    .iter()
    .map(|entry| (*entry.key(), *entry.value()))
    .collect();
ring.sort_unstable_by_key(|&(k, _)| k);

let idx = ring.partition_point(|&(k, _)| k < hash);

for i in 0..ring.len() {
    let (_, node_id) = ring[(idx + i) % ring.len()];
    if let Some(state) = endpoints.iter().find(|e| e.node_id == node_id) {  // O(N) per ring entry
        return Selection { ... };
    }
}
```

Per ConsistentHash select:
- Collect all ring entries (1600+ for typical 16-virtual-nodes-per-endpoint × 100 endpoints)
- Sort them
- Linear-scan endpoints to find a match per ring lookup

**This is O(N log N) per request when it should be O(log N).**

**Fix:** Maintain the ring as a pre-sorted `ArcSwap<Vec<(u64, NodeId)>>`, updated incrementally on endpoint add/remove. Routing is binary search on the loaded snapshot. Endpoint resolution: build a `HashMap<NodeId, usize>` index alongside the available endpoints so the find is O(1) instead of O(N).

For a hot consistent-hash workload (commonly used for cache affinity, session pinning), this is the difference between "scales to 100 endpoints" and "doesn't."
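
A rough sketch of the pre-sorted ring plus `NodeId` index, assuming the snapshot is rebuilt on membership change and shared read-only afterwards. `RingSnapshot`, `build`, and `lookup` are illustrative names; the real ring would sit behind the `ArcSwap` described above.

```rust
use std::collections::HashMap;

type NodeId = [u8; 32];

// Hypothetical snapshot: built once per membership change.
struct RingSnapshot {
    // Ring points sorted by hash, so lookup is a binary search.
    ring: Vec<(u64, NodeId)>,
    // NodeId -> slot in the available-endpoint list (replaces the O(N) find).
    index_by_node: HashMap<NodeId, usize>,
}

impl RingSnapshot {
    fn build(mut points: Vec<(u64, NodeId)>, available: &[NodeId]) -> Self {
        points.sort_unstable_by_key(|&(k, _)| k);
        let index_by_node: HashMap<NodeId, usize> = available
            .iter()
            .enumerate()
            .map(|(i, n)| (*n, i))
            .collect();
        RingSnapshot { ring: points, index_by_node }
    }

    // Slot of the first available owner at or after `hash`, wrapping around.
    fn lookup(&self, hash: u64) -> Option<usize> {
        let start = self.ring.partition_point(|&(k, _)| k < hash);
        (0..self.ring.len())
            .map(|i| self.ring[(start + i) % self.ring.len()].1)
            .find_map(|node| self.index_by_node.get(&node).copied())
    }
}
```

The sort moves to `build`, so the per-request cost is the binary search plus one hash probe per ring entry visited. In the common case the first entry visited is available.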

### 151. `Scheduler::pick_best_candidate` sorts when it should take the max

**Location:** `compute/scheduler.rs:474-494`:
```rust
let mut scored: Vec<(u64, f32)> = candidates.into_iter()
    .filter_map(|n| placement.placement_score(&n, artifact).map(|s| (n, s)))
    .filter(|(_, s)| s.is_finite())
    .collect();

scored.sort_by(|(a, sa), (b, sb)| {
    sb.partial_cmp(sa).unwrap_or(Ordering::Equal)
        .then_with(|| tie_break_compare(*a, *b, tie_break))
});

scored.first().map(|(n, _)| *n)
```

Full sort to take only the first element. For 1000 candidates that's O(N log N) when O(N) `max_by` would suffice.

This is the same pattern as #89 in the memories query layer. Same fix:
```rust
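// Single O(N) pass. Both comparisons are flipped because `max_by` keeps the
// greatest element, while the legacy `sort_by` put the winner first.
scored.into_iter().max_by(|(a, sa), (b, sb)| {
    sa.partial_cmp(sb).unwrap_or(Ordering::Equal)
        .then_with(|| tie_break_compare(*b, *a, tie_break))
}).map(|(n, _)| n) // keep the node id, matching the legacy return value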
```

Combined with #114 (per-candidate `with_caps` shard-lock acquisition during scoring), the scheduler's placement decision is currently doing significant work that could compress dramatically.
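
The direction flip is easy to get wrong, so here is a small self-contained check that the two shapes pick the same winner. The two-argument `tie_break_compare` (lower node id wins) is a hypothetical stand-in for the real one, which also takes a `tie_break` policy.

```rust
use std::cmp::Ordering;

// Hypothetical tie-break: the lower node id sorts first, i.e. wins ties.
fn tie_break_compare(a: u64, b: u64) -> Ordering {
    a.cmp(&b)
}

// Legacy shape: sort descending by score, take the first element.
fn pick_by_sort(mut scored: Vec<(u64, f32)>) -> Option<u64> {
    scored.sort_by(|(a, sa), (b, sb)| {
        sb.partial_cmp(sa)
            .unwrap_or(Ordering::Equal)
            .then_with(|| tie_break_compare(*a, *b))
    });
    scored.first().map(|(n, _)| *n)
}

// Single pass: score and tie-break are both inverted so that
// "greatest under max_by" means the same as "first under sort_by".
fn pick_by_max(scored: Vec<(u64, f32)>) -> Option<u64> {
    scored
        .into_iter()
        .max_by(|(a, sa), (b, sb)| {
            sa.partial_cmp(sb)
                .unwrap_or(Ordering::Equal)
                .then_with(|| tie_break_compare(*b, *a))
        })
        .map(|(n, _)| n)
}
```

Both functions assume NaN scores were filtered out upstream, as in the original filter chain.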

### 152. `GroupCoordinator::origin_hash_for_entity_id` is a linear scan called per routed event

**Location:** `compute/group_coord.rs:250-255`:
```rust
fn origin_hash_for_entity_id(&self, entity_id: &NodeId) -> Option<u64> {
    self.members
        .iter()
        .find(|m| m.entity_id_bytes == *entity_id)
        .map(|m| m.origin_hash)
}
```

Called by `route_event` per event routed through any compute group. `NodeId` is `[u8; 32]`, so each comparison is a 32-byte equality check.

For a group with 100 members at 100K events/sec: 10M × 32-byte comparisons/sec just for the origin_hash lookup after the load balancer picked an endpoint.

**Fix:** Maintain `entity_id_to_origin_hash: HashMap<NodeId, u64>` alongside `members`. Update on `add_member`, `remove_member`, `update_member_placement`. Lookup becomes O(1).

For groups with many members (replica groups, fork groups at scale), this is per-event waste compounding with #148's load balancer cost.
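
A minimal sketch of the reverse index, assuming a trimmed `MemberInfo` and simplified `add_member` / `remove_member` signatures (the real mutation paths carry more state). The map must be updated on every path that touches `members`.

```rust
use std::collections::HashMap;

type NodeId = [u8; 32];

// Trimmed to the two fields the lookup needs.
struct MemberInfo {
    entity_id_bytes: NodeId,
    origin_hash: u64,
}

#[derive(Default)]
struct GroupCoordinator {
    members: Vec<MemberInfo>,
    // Reverse index kept in lockstep with `members`.
    entity_id_to_origin_hash: HashMap<NodeId, u64>,
}

impl GroupCoordinator {
    fn add_member(&mut self, entity_id_bytes: NodeId, origin_hash: u64) {
        self.entity_id_to_origin_hash.insert(entity_id_bytes, origin_hash);
        self.members.push(MemberInfo { entity_id_bytes, origin_hash });
    }

    fn remove_member(&mut self, entity_id: &NodeId) {
        self.members.retain(|m| m.entity_id_bytes != *entity_id);
        self.entity_id_to_origin_hash.remove(entity_id);
    }

    // One hash probe instead of a scan of 32-byte comparisons.
    fn origin_hash_for_entity_id(&self, entity_id: &NodeId) -> Option<u64> {
        self.entity_id_to_origin_hash.get(entity_id).copied()
    }
}
```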

### 153. `DaemonHost::deliver` calls `horizon.encode()` per event even when there are no outputs

**Location:** `compute/host.rs:231`:
```rust
pub fn deliver(&mut self, event: &CausalEvent) -> Result<Vec<CausalEvent>, DaemonError> {
    self.horizon.observe(event.link.origin_hash, event.link.sequence);
    let outputs = self.daemon.process(event)?;
    self.stats.events_processed += 1;

    let horizon_encoded = self.horizon.encode();    // <-- always
    let mut causal_outputs = Vec::with_capacity(outputs.len());
    for payload in outputs {
        // ... uses horizon_encoded
    }
    // ...
}
```

`horizon.encode()` walks every entry in the horizon hashmap and computes `xxh3_64` per entry. For a horizon tracking 16 origins: 16 hash computations per `deliver()` call.

Most daemon events don't produce outputs (think: state updates, observations, filtering). When `outputs.is_empty()`, the horizon encode is pure waste — there's nothing to attach it to.

**Fix:** Skip the encode when outputs are empty:
```rust
let outputs = self.daemon.process(event)?;
self.stats.events_processed += 1;
if outputs.is_empty() {
    return Ok(Vec::new());
}
let horizon_encoded = self.horizon.encode();
// ... continue
```

For daemons with low output ratio (1 output per N inputs), saves (N-1)/N of the encode cost.

**Further fix:** Cache the encoded horizon, invalidate on `observe()`. Most observes target an origin that's already in the bloom (no bit change). Only re-encode on observed-new-origin.
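
A sketch of the cache-and-invalidate variant. It assumes the encoded form depends only on the set of observed origins (as the "no bit change" remark suggests); if the real encoding also covers sequences, invalidation would need to fire on sequence advances too. The `Horizon` layout, the `encodes_run` counter, and the XOR fold used in place of the per-entry xxh3 walk are all illustrative.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

struct Horizon {
    latest: HashMap<u64, u64>, // origin_hash -> highest observed sequence
    cached: Option<u64>,       // last encoded value, if still valid
    encodes_run: u64,          // instrumentation for the sketch only
}

impl Horizon {
    fn new() -> Self {
        Horizon { latest: HashMap::new(), cached: None, encodes_run: 0 }
    }

    fn observe(&mut self, origin_hash: u64, sequence: u64) {
        match self.latest.entry(origin_hash) {
            // Known origin: the encoding is assumed unchanged, keep the cache.
            Entry::Occupied(mut e) => {
                let seq = e.get_mut();
                *seq = (*seq).max(sequence);
            }
            // New origin: the encoded form changes, drop the cache.
            Entry::Vacant(e) => {
                e.insert(sequence);
                self.cached = None;
            }
        }
    }

    fn encode(&mut self) -> u64 {
        if let Some(v) = self.cached {
            return v;
        }
        self.encodes_run += 1;
        // Placeholder for the real walk: an order-independent fold over origins.
        let v = self
            .latest
            .keys()
            .fold(0u64, |acc, k| acc ^ k.wrapping_mul(0x9E37_79B9_7F4A_7C15));
        self.cached = Some(v);
        v
    }
}
```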

### 154. `compute_parent_hash` allocates a fresh `Vec` per output event purely to concatenate before hashing

**Location:** `state/causal.rs:127-135`:
```rust
pub fn compute_parent_hash(prev_link: &CausalLink, prev_payload: &[u8]) -> u64 {
    let link_bytes = prev_link.to_bytes();
    let mut combined = Vec::with_capacity(CAUSAL_LINK_SIZE + prev_payload.len());
    combined.extend_from_slice(&link_bytes);
    combined.extend_from_slice(prev_payload);
    xxh3_64(&combined)
}
```

Per output event from any daemon: allocate a Vec, memcpy 32 bytes + the payload into it, hash, drop the Vec. For a daemon emitting 100K events/sec with 1KB payloads: 100K allocs/sec + 100MB/sec of memcpy bandwidth just for parent-hash computation.

The comment even acknowledges this: "For large payloads, use xxh3's incremental API if needed (future optimization)." But there's no size threshold — the alloc fires for every payload regardless of size.

**Fix:** Use streaming xxh3:
```rust
use xxhash_rust::xxh3::Xxh3;
let mut h = Xxh3::new();
h.update(&prev_link.to_bytes());
h.update(prev_payload);
h.digest()
```

Zero allocation, no concatenation copy. The streaming API carries slightly more overhead than one-shot hashing for tiny inputs, but dropping the per-event allocation outweighs it, so this is a net win.

Same pattern as #92 in the CortEX checksum code; same fix.
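
Because `parent_hash` feeds chain validation, the streaming digest has to equal the concatenated one-shot digest. The property can be pinned with a test shaped like the sketch below. To stay dependency-free it uses a hand-rolled FNV-1a as a stand-in hasher, not xxh3; only the streaming-equals-concatenated check carries over, and the helper names are hypothetical.

```rust
// Stand-in streaming hasher (64-bit FNV-1a), used here instead of xxh3.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
    fn update(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }
    fn digest(&self) -> u64 {
        self.0
    }
}

// Legacy shape: allocate, concatenate, hash once.
fn hash_concat(link_bytes: &[u8; 32], payload: &[u8]) -> u64 {
    let mut combined = Vec::with_capacity(link_bytes.len() + payload.len());
    combined.extend_from_slice(link_bytes);
    combined.extend_from_slice(payload);
    let mut h = Fnv1a::new();
    h.update(&combined);
    h.digest()
}

// Streaming shape: feed both slices to the hasher, no intermediate Vec.
fn hash_streaming(link_bytes: &[u8; 32], payload: &[u8]) -> u64 {
    let mut h = Fnv1a::new();
    h.update(link_bytes);
    h.update(payload);
    h.digest()
}
```

For the real hasher, the check should cover empty, exact-block, just-over-block, and multi-block payloads, since block boundaries are where streaming implementations tend to diverge.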

### 155. `EndpointState::try_record_request` takes a Mutex + clock syscall per successful reservation

**Location:** `behavior/loadbalance.rs:305-320`:
```rust
fn try_record_request(&self, max_connections: u32) -> bool {
    let reserved = self.connections.fetch_update(...);
    if reserved {
        self.total_requests.fetch_add(1, Ordering::Relaxed);
        *self.last_selected.lock() = Instant::now();    // <-- Mutex + clock
    }
    reserved
}
```

Per successful routing decision: a parking_lot Mutex lock + `Instant::now()` syscall.

**Fix:** `last_selected: AtomicU64` storing nanos since epoch (or a coarse-clock tick). Atomic store, no lock. The Mutex is being used as cell storage, not for mutual exclusion.

For 100K successful selections/sec: 100K lock+unlock pairs eliminated. The 100K clock reads go away only with the coarse-clock variant; an exact timestamp still needs one `Instant::now()` per stamp.
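
A minimal sketch of the atomic timestamp cell, assuming nanos are measured from a lazily initialized process-wide `Instant` baseline (`Instant` has no epoch of its own). The helper `nanos_since_baseline` and the accessor names are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;
use std::time::{Duration, Instant};

// Process-wide baseline, set on first use.
static LB_INSTANT_BASELINE: OnceLock<Instant> = OnceLock::new();

fn nanos_since_baseline() -> u64 {
    let base = *LB_INSTANT_BASELINE.get_or_init(Instant::now);
    // u64 nanos cover centuries of uptime; saturate rather than wrap.
    u64::try_from(base.elapsed().as_nanos()).unwrap_or(u64::MAX)
}

struct EndpointState {
    last_selected: AtomicU64,
}

impl EndpointState {
    fn new() -> Self {
        EndpointState { last_selected: AtomicU64::new(nanos_since_baseline()) }
    }

    // Write side: one clock read plus one Relaxed store, no lock.
    fn stamp_selected(&self) {
        self.last_selected.store(nanos_since_baseline(), Ordering::Relaxed);
    }

    // Read side: rebuild an Instant from the stored offset.
    fn last_selected(&self) -> Instant {
        let base = *LB_INSTANT_BASELINE.get_or_init(Instant::now);
        base + Duration::from_nanos(self.last_selected.load(Ordering::Relaxed))
    }
}
```

Relaxed ordering is enough here only because the field is observational; nothing else is published through it.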

## 🟡 Medium-impact

### 156. `Scheduler::find_migration_targets` clones the daemon filter + allocates a String per call

**Location:** `compute/scheduler.rs:208`:
```rust
const MIGRATION_TAG: &str = "subprotocol:0x0500";
let combined = daemon_filter.clone().require_tag(MIGRATION_TAG.to_string());
```

`daemon_filter.clone()` is a deep clone of `CapabilityFilter` (Vec of tags/models/tools, HashSet, etc). `MIGRATION_TAG.to_string()` heap-allocates from a static.

Migration placement is control-plane (not per-event), so frequency is low. But the pattern is wrong:
- `CapabilityFilter` could have `require_tag(impl Into<Cow<'static, str>>)` accepting static strs without alloc.
- Or `find_migration_targets` could take a pre-built filter from the caller and reuse across migration attempts.
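
The first option could look roughly like this. The single-field `CapabilityFilter` is a stand-in for the real struct; the point is only that a `&'static str` converts into `Cow::Borrowed` without touching the heap.

```rust
use std::borrow::Cow;

const MIGRATION_TAG: &str = "subprotocol:0x0500";

// Hypothetical, trimmed filter: just the required-tag list.
#[derive(Clone, Default)]
struct CapabilityFilter {
    required_tags: Vec<Cow<'static, str>>,
}

impl CapabilityFilter {
    // Accepts `&'static str` (borrowed, no allocation) or `String` (owned).
    fn require_tag(mut self, tag: impl Into<Cow<'static, str>>) -> Self {
        self.required_tags.push(tag.into());
        self
    }
}
```

Callers with dynamic tags keep passing a `String`; only the static-tag call sites stop allocating.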

### 157. `Scheduler::place_with_locality` double-allocates on the drained path

**Location:** `compute/scheduler.rs:138-146`:
```rust
let candidates: Vec<u64> = if local_drained {
    self.capability_index.query(filter)
        .into_iter()
        .filter(|&id| id != self.local_node_id)
        .collect()
} else {
    self.capability_index.query(filter)
};
```

The drained path: `query` allocates a Vec, then `into_iter().filter().collect()` allocates another Vec just to drop one item (the local node).

**Fix:** Add `CapabilityIndex::query_excluding(filter, &exclude_set)` that filters during the inner walk. Saves one Vec allocation.

Same pattern in `place_with_spread` (`compute/group_coord.rs:268`): if primary placement is excluded, calls `query_candidates` again — second full index query when the first one already produced the candidate set.
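
A sketch of filtering during the walk, with a deliberately simplified index (a flat list of node ids and tag sets) and a single optional exclusion instead of a set. The real `CapabilityIndex` layout and filter type are not shown in this report, so everything except the method name is an assumption.

```rust
use std::collections::HashSet;

// Hypothetical flat index: (node id, tags advertised by that node).
struct CapabilityIndex {
    nodes: Vec<(u64, HashSet<String>)>,
}

impl CapabilityIndex {
    // One pass, one output Vec: the exclusion is applied inside the walk
    // instead of by a second filter-and-collect over the query result.
    fn query_excluding(&self, required_tag: &str, exclude: Option<u64>) -> Vec<u64> {
        self.nodes
            .iter()
            .filter(|(id, tags)| Some(*id) != exclude && tags.contains(required_tag))
            .map(|(id, _)| *id)
            .collect()
    }
}
```

Passing `None` gives the plain query, so the drained and non-drained branches can share one call.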

### 158. `GroupCoordinator::mark_healthy` and `mark_unhealthy` linear-scan members

**Location:** `compute/group_coord.rs:148-163`:
```rust
pub fn mark_unhealthy(&mut self, index: u8) {
    if let Some(member) = self.members.iter_mut().find(|m| m.index == index) {
        // ...
    }
}
```

Linear scan by index. Members are stored in `Vec<MemberInfo>` indexed by `u8`. If `index` is sequential and dense (which it likely is — `for index in current..n` at the scale_to site), this could be direct array indexing: `members.get_mut(index as usize)`. O(1) instead of O(N).

Cold-ish path (health changes are infrequent), but trivial fix.
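
A small sketch of the direct-index version, assuming `MemberInfo` carries an `index` and a boolean health flag (the real struct has more fields). The re-check inside the slot makes a broken dense-index invariant degrade to a no-op rather than an update to the wrong member.

```rust
struct MemberInfo {
    index: u8,
    healthy: bool,
}

struct GroupCoordinator {
    members: Vec<MemberInfo>,
}

impl GroupCoordinator {
    // Returns true if a member was updated. The return value is added for the
    // sketch; the original method returns nothing.
    fn mark_unhealthy(&mut self, index: u8) -> bool {
        match self.members.get_mut(index as usize) {
            // O(1) slot access, guarded by the stored index.
            Some(member) if member.index == index => {
                member.healthy = false;
                true
            }
            // Out of range, or the dense-index invariant no longer holds.
            _ => false,
        }
    }
}
```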

### 159. `RecoveryRegistry::try_run_all` allocates per tick

**Location:** `compute/mod.rs:154-193`. Per recovery tick (~1Hz):
- `mem::take` to swap out the handler vec, leaving an empty `Vec` behind (`Vec::new()` by itself does not heap-allocate)
- `Vec::with_capacity(handlers_to_run.len())` for survivors
- `Vec::new()` for the recovered slot list (allocates on first push)
- Each handler is called via `catch_unwind` (overhead)
- Merge survivors back

For a low-frequency tick this is fine. If recovery becomes per-event for some reason, it'd matter.

### 160. `LoadBalancer::select_weighted_round_robin_at` recomputes `total_weight` per call

**Location:** `behavior/loadbalance.rs:982`:
```rust
let total_weight: f64 = endpoints.iter().map(|e| e.effective_weight()).sum();
```

Per WeightedRoundRobin select, iterates all endpoints, calls `effective_weight()` per endpoint (which reads `is_enabled()` atomic + `health()` RwLock read). Sum re-computed per event.

**Fix:** Cache `total_weight` in an atomic cell, updated incrementally on weight/health changes. Std has no `AtomicF64`, so in practice this is an `AtomicU64` holding `f64::to_bits()`. Per WRR select: one atomic load instead of N RwLock reads + sum.
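
A minimal sketch of the bit-cast cell, with a hypothetical `TotalWeight` wrapper; `fetch_update` runs the compare-and-swap loop so concurrent deltas are not lost.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// f64 stored as its bit pattern inside an AtomicU64.
struct TotalWeight(AtomicU64);

impl TotalWeight {
    fn new(initial: f64) -> Self {
        TotalWeight(AtomicU64::new(initial.to_bits()))
    }

    // Hot path: a single atomic load.
    fn load(&self) -> f64 {
        f64::from_bits(self.0.load(Ordering::Acquire))
    }

    // Cold path: apply a weight delta when an endpoint's weight or health changes.
    fn add(&self, delta: f64) {
        // The closure always returns Some, so this cannot fail.
        let _ = self.0.fetch_update(Ordering::AcqRel, Ordering::Acquire, |bits| {
            Some((f64::from_bits(bits) + delta).to_bits())
        });
    }
}
```

Incremental floating-point updates can drift over many changes, so an occasional full recompute on the cold path is a reasonable safety net.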

### 161. `LoadBalancer::select_random` and `select_weighted_random` likely use thread_rng per call

Didn't look at the body but the pattern across the codebase suggests `rand::thread_rng()` per select. `thread_rng` is thread-local but still hits a TLS slot per call.

For high-rate random LB selection, instantiate a per-LB `SmallRng` seeded once. Per-call: pure userspace RNG step. Worth checking the actual code if Random is a configured strategy in production.

### 162. `DaemonHost::deliver` runs `current_timestamp()` per output event via `CausalEvent::received_at`

**Location:** `state/causal.rs:185`:
```rust
let event = CausalEvent {
    link: next_link,
    payload: payload.clone(),
    received_at: current_timestamp(),    // <-- per output
};
```

Per output event from a daemon. Same coarse-clock pattern (#33, #66, #115, #135, #137, #155).

### 163. `LoadBalancer::select` retry loop pays the full filter cost per attempt

**Location:** `behavior/loadbalance.rs:764`. The retry loop runs `get_available_endpoints(ctx)` up to 4 times if reservation races. Each retry pays the full DashMap walk + filter cost (the #148 cost).

If #148 is fixed (snapshot-based filtering), retries become cheap. If not, contended scenarios pay 4× the per-event cost.

## 🟢 Low-impact / cleanup

### 164. `CapabilityFilter::clone` in `find_migration_targets` could be a borrowed builder

Already covered in #156. Listed for completeness.

### 165. `select_least_latency` and `select_least_load` walk all endpoints to find min

Standard linear min over N. Endpoints don't expose a sorted view so this is unavoidable without a separate heap. For small N (typical), it's fine. For large N with these strategies, a maintained min-heap would help, but probably not worth the complexity.

### 166. `place_with_spread` returns `PlacementDecision { reason: FirstMatch }` even when it ran the full exclusion-filter search

`compute/group_coord.rs:267-275`. The reason field doesn't distinguish "first match" from "filter-narrowed first match." Cosmetic — affects observability not perf.

### 167. `GroupCoordinator::healthy_count` does a linear scan + saturation cast

`compute/group_coord.rs:232-238`. Cold accessor (control plane). Could cache via incremental update on health-change. Probably not worth it.

### 168. `LoadBalancer::endpoints` is a DashMap; iteration order is unspecified

Every LB strategy that iterates `endpoints` (which is all of them via `get_available_endpoints`) gets non-deterministic ordering. RoundRobin's deterministic step is computed via a separate counter, so this works — but if anyone added "iterate in insertion order" logic by accident, it'd be subtly broken. Worth a comment-level audit, not a perf item.

### 169. `select_consistent_hash` `endpoints.iter().find(|e| e.node_id == node_id)` is O(N) per ring entry

Already covered in #150. The fix there subsumes this.

### 170. `CausalEvent::clone` is implicit in many paths

`state/causal.rs:138` — `#[derive(Clone)]`. Per-event clones likely happen at delivery boundaries. The `payload: Bytes` is cheap; the `link: CausalLink` is 32 bytes Copy; `received_at: u64` is Copy. So a CausalEvent clone is ~40 bytes of memcpy + one Bytes refcount bump. Not bad, but if it's per-event in a hot loop, worth checking call sites.

---

## What I'd actually do

The compute-path findings cluster into a clear hierarchy:

**Top 3 (transformative on per-event compute routing):**

1. **#148 — snapshot-based available endpoints in LoadBalancer.** Removes a full DashMap walk + 4× atomic ops per endpoint + 2 Vec allocs per event. Probably 5-15× speedup on the LB hot path for high-endpoint deployments.

2. **#149 — atomic LoadMetrics instead of `RwLock<LoadMetrics>`.** Removes a RwLock read + LoadMetrics clone per endpoint per select. Compounds with #148 — once you have the snapshot, the per-endpoint metrics read is the next bottleneck.

3. **#150 — pre-sorted hash ring with O(log N) lookup.** Only matters if ConsistentHash is the configured strategy, but when it is, this is the difference between "scales" and "doesn't."

**Next tier (per-event daemon-host cost):**

4. **#153 — skip horizon encode when there are no outputs.** Per-event win for daemons with low output ratio.

5. **#154 — streaming xxh3 in compute_parent_hash.** Per-output-event allocation eliminated.

6. **#152 — entity_id → origin_hash map in GroupCoordinator.** Per-routed-event lookup fix.

**Wins that depend on whether compute is hot for your users:**

If users run compute workloads through your scheduler at high event rates (the architectural pitch beyond just RPC), these items matter a lot. If compute is a niche feature, they're nice-to-have.

**Items I'd skip:**

The migration / placement items (#156, #157, #158, #166, #167) are all cold-path. They're correctness-grade or observability-grade, not perf-grade.

---

## Compounding with prior findings

The compute path doesn't exist in isolation. Several items here interact with previously-flagged findings:

- **#148 (LB snapshot)** removes work that gets compounded by **#107 (session NodeId resolution)** and **#106 (routing-table lookup)**. Per-event compute routing currently pays all three.
- **#149 (atomic metrics)** is the same pattern as **#11 (RedexIndex `Arc<HashSet>`)** and **#96 (`Arc<Memory>`)** — "clone the inner value to read" is a recurring anti-pattern that snapshot/Arc fixes.
- **#151 (max_by instead of sort)** is the same pattern as **#89 (memories query top-K)** — top-K via sort is a recurring sub-optimization.
- **#154 (streaming xxh3)** is the same pattern as **#92 (cortex checksum)** — allocating to concatenate-then-hash is a recurring anti-pattern.

**Cross-cutting fix:** An "Arc-wrap-and-snapshot" pattern applied uniformly across LB endpoints, capability index entries, memories/tasks state, and replication metadata would eliminate the per-read-clone cost in every subsystem at once. The diffs are mechanical and similar across all sites.

---

## Honest expectation

The compute path is where I'd expect the biggest unrealized wins for users running heavy workload orchestration. Specifically:

- **High-event-rate compute groups** (many events/sec routed through fork/replica groups): #148, #149, #152 compound. Likely 3-10× on the per-event routing cost.
- **ConsistentHash users**: #150 alone is potentially 100× on selection latency at 100+ endpoints.
- **Daemons with low output ratio** (filtering, observing, state-update workloads): #153 cuts deliver() cost meaningfully.
- **Heavy chain producers** (daemons that emit GB/s of chained events): #154 + #162 cut per-output allocation + clock cost.

For users who DON'T run compute workloads — pure pub/sub or RPC users — none of this matters. The compute subsystem only fires when daemons + groups are used.

If compute is part of your product pitch (workload orchestration, state-replicated services, daemon scheduling), this section probably contains the highest-leverage items in the entire audit. If it's a legacy or niche subsystem, skip.