net-mesh 0.21.0

# Configuring RedEX replication

Operator-facing companion to [`STORAGE_AND_CORTEX.md`](STORAGE_AND_CORTEX.md).
This document covers how to turn on cross-node replication for a
RedEX channel, what each knob does, and what failure modes you'll
see in production.

## When to enable

Replication is opt-in per channel via `RedexFileConfig::replication`.
The default (`None`) keeps every existing channel single-node — no
observable behavior change, no wire traffic on the mesh's
`SUBPROTOCOL_REDEX`.

Turn it on when:

- The channel carries data you can't lose if one node's disk is wiped.
- Multiple consumers in different fault domains read the channel
  and you want each to read from the nearest replica rather than
  hairpinning to the publisher.
- The channel's publish rate is bounded (heartbeat traffic + sync
  bandwidth grow linearly with replica count).

Don't turn it on when:

- The channel is local-only telemetry (e.g. per-process metrics).
- Loss of recent events on node-down is acceptable (the heartbeat
  cycle takes ~3 × `heartbeat_ms` to detect leader failure).
- You're not sure how many replicas you want — start single-node,
  measure, then add `ReplicationConfig` later.

## Quick start

```rust
use std::sync::Arc;

use net::adapter::net::redex::{
    Redex, RedexFileConfig, ReplicationConfig, PlacementStrategy,
};

let mesh: Arc<MeshNode> = build_mesh()?;
let redex = Arc::new(Redex::new());

// Install the replication wiring on every Redex that participates.
// Idempotent — safe to call from multiple call sites.
redex.enable_replication(mesh.clone());

// Open a replicated channel. The same RedexFileConfig (with
// matching ReplicationConfig) should be used on every node that
// hosts a replica.
let cfg = RedexFileConfig::default()
    .with_replication(Some(
        ReplicationConfig::new()
            .with_factor(3)
            .with_heartbeat_ms(500),
    ));
let file = redex.open_file(&channel_name, cfg)?;
```

`enable_replication` installs a per-`Redex` router on the mesh's
`SUBPROTOCOL_REDEX` inbound dispatch; subsequent `open_file` calls
with `replication: Some(_)` spawn one tokio task per channel. The
router auto-registers + unregisters at `open_file` / `close_file`
time.

### Binding-language equivalents

The same surface ships in every language binding. The replication
opt-in is a nested `replication` field on the channel config; the
operator surface (`enable_replication`, `replication_prometheus_text`)
is exposed as methods on the binding's `Redex` handle.

- **Node** (`@net-mesh/core`):
  ```ts
  redex.enableReplication(mesh);
  await redex.openFile("my/channel", {
    replication: { factor: 3, heartbeatMs: 500n, placement: "standard" },
  });
  ```
- **Python** (`net`):
  ```python
  redex.enable_replication(mesh)
  redex.open_file("my/channel",
                  replication=True, replication_factor=3,
                  replication_heartbeat_ms=500)
  ```
- **Go** (cgo wrapper at `bindings/go/net/redex.go`):
  ```go
  redex.EnableReplication(meshArcPtr)
  redex.OpenFile("my/channel", &net.RedexFileConfig{
      Replication: &net.ReplicationConfig{
          Factor: 3, HeartbeatMs: 500, Placement: net.PlacementStandard,
      },
  })
  ```
- **C/FFI**: the `libnet` cdylib exports `net_redex_*` symbols
  directly. See `bindings/go/net/redex.go`'s cgo header block for
  the canonical extern signatures; non-Go consumers wire to the
  same symbols. Config rides as a JSON string through
  `net_redex_open_file` to keep the C surface narrow.

## `ReplicationConfig` fields

### `factor: u8`

Number of replicas (including the leader) the channel maintains.
Range: `[1, 16]` (default `3`). `1` collapses to single-node-with-
coordinator — useful for testing the daemon lifecycle without
spinning up peers. The ceiling is conservative (replication overhead
goes superlinear above ~8 replicas due to heartbeat fanout); plumb
your own ceiling if you genuinely need more than 16 replicas.

When `placement = Pinned(nodes)`, the effective factor is
`nodes.len()` — the operator's explicit list wins over the numeric
hint.

### `placement: PlacementStrategy`

Where replicas live and how they're chosen. Three options:

- **`Standard`** (default) — let `PlacementFilter` decide based on
  `metadata.intent`, `metadata.colocate-with`, `scope:` tags,
  proximity, and resource availability. Production default.
- **`Pinned(Vec<NodeId>)`** — manual placement on a fixed `NodeId`
  set. Used for special-case topologies, integration tests, and
  recovery scenarios. The vector's length pins the effective
  replication factor regardless of `factor`.
- **`ColocationStrict`** — every replica must live on a node
  already holding the chain referenced by
  `metadata.colocate-with-strict`. Refuses placement on nodes
  with insufficient coverage.

**Phase F gap**: `Standard` and `ColocationStrict` currently
bootstrap with an empty replica set; the placement filter's
re-selection on roster change lands with Phase F. Until then, use
`Pinned` for production channels where you need deterministic
membership.

### `heartbeat_ms: u64`

Cadence between leader → replica heartbeats. Range:
`[100, u64::MAX]` (default `500`). Lower for faster failure
detection at the cost of more wire traffic; higher for less
overhead at the cost of slower failover.

Failure-detection window = `3 × heartbeat_ms` (three-missed
hysteresis). With the default `500 ms`, a silent leader is
declared dead after ~1.5 s — well under the activation-gate's "5 s
RTO" target.

The `100 ms` floor exists because heartbeat traffic below that cadence
dominates the channel's effective throughput.
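
The window arithmetic above can be sketched as a small helper. This is an illustration only, not the crate's implementation: `MISSED_HEARTBEATS`, `MIN_HEARTBEAT_MS`, and `failure_detection_window` are hypothetical names introduced here.

```rust
use std::time::Duration;

/// Hypothetical constant: heartbeats a replica may miss before it
/// treats the leader as dead (the three-missed hysteresis).
const MISSED_HEARTBEATS: u64 = 3;

/// Hypothetical constant mirroring the documented lower bound.
const MIN_HEARTBEAT_MS: u64 = 100;

/// Returns the failure-detection window for a heartbeat cadence, or
/// `None` if the cadence is below the documented 100 ms floor.
fn failure_detection_window(heartbeat_ms: u64) -> Option<Duration> {
    if heartbeat_ms < MIN_HEARTBEAT_MS {
        return None;
    }
    // Saturate rather than overflow: the documented upper bound is u64::MAX.
    Some(Duration::from_millis(
        heartbeat_ms.saturating_mul(MISSED_HEARTBEATS),
    ))
}
```

With the default cadence, `failure_detection_window(500)` yields a 1.5 s window, matching the figure quoted above.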

### `leader_pinned: Option<NodeId>`

Pin the leader to a specific node. `None` (default) lets the
deterministic election pick the lowest-RTT healthy replica. When
`Some(node)`, the election picks `node` whenever it's healthy.

Common reasons to pin:
- A specific node has the lowest write-latency to the publisher.
- An operator is running a blue/green deployment and wants to
  force traffic to a known canary.
- Compliance: the channel's writes must originate from a node in
  a specific data center.

If `placement = Pinned(set)` and `leader_pinned = Some(node)`,
`node` must be in `set` — otherwise `validate()` rejects.
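
The range and membership rules described so far can be summarized in a sketch. It uses simplified stand-in types rather than the crate's own: `ConfigSketch`, `Placement`, `ConfigError`, and the `u64` alias for `NodeId` are assumptions for illustration, and the order of the checks is arbitrary. The real `validate()` may check more or report errors differently.

```rust
/// Stand-in for the crate's node identifier (assumed to be comparable).
type NodeId = u64;

#[derive(Debug, Clone, PartialEq)]
enum Placement {
    Standard,
    Pinned(Vec<NodeId>),
    ColocationStrict,
}

#[derive(Debug, Clone, PartialEq)]
struct ConfigSketch {
    factor: u8,
    placement: Placement,
    heartbeat_ms: u64,
    leader_pinned: Option<NodeId>,
}

#[derive(Debug, PartialEq)]
enum ConfigError {
    FactorOutOfRange(u8),
    HeartbeatTooLow(u64),
    LeaderNotInPinnedSet(NodeId),
}

impl ConfigSketch {
    /// Effective replica count: a pinned list overrides `factor`.
    fn effective_factor(&self) -> usize {
        match &self.placement {
            Placement::Pinned(nodes) => nodes.len(),
            _ => usize::from(self.factor),
        }
    }

    /// Checks the documented ranges and the pinned-leader rule.
    fn validate(&self) -> Result<(), ConfigError> {
        if !(1..=16).contains(&self.factor) {
            return Err(ConfigError::FactorOutOfRange(self.factor));
        }
        if self.heartbeat_ms < 100 {
            return Err(ConfigError::HeartbeatTooLow(self.heartbeat_ms));
        }
        if let (Placement::Pinned(set), Some(leader)) =
            (&self.placement, self.leader_pinned)
        {
            // A pinned leader outside the pinned replica set is rejected.
            if !set.contains(&leader) {
                return Err(ConfigError::LeaderNotInPinnedSet(leader));
            }
        }
        Ok(())
    }
}
```

A config with `factor: 3` and `Placement::Pinned(vec![1, 2])` passes this check and reports an effective factor of 2, since the pinned list wins over the numeric hint.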

### `on_under_capacity: UnderCapacity`

Behavior when a replica's local file rejects an append because of
disk pressure (heap segment at the 3 GB hard cap, or a
persistent-tier write failure).

- **`Withdraw`** (default) — drop the replica role; the
  coordinator transitions to `Idle`, the `causal:<hex>` capability
  tag is withdrawn, and peers re-resolve to a healthy replica via
  `find_chain_holders`. Reads re-route as a natural consequence.
- **`EvictOldest`** — call `RedexFile::sweep_retention()` to free
  space, keep the replica role, retry the apply on the next
  chunk. **Requires `retention_max_*` to be configured on the
  same `RedexFileConfig`** — without retention caps the sweep is
  a no-op and the next apply will fail again.

`under_capacity_total` bumps on both branches regardless of
policy, so the operator-facing counter reflects every disk-pressure
event.
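
The two policies can be pictured as a single decision function. The sketch below is hypothetical: `Action`, `on_disk_pressure`, and the closure standing in for `RedexFile::sweep_retention()` (assumed here to report bytes freed) are not taken from the crate.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Debug, Clone, Copy, PartialEq)]
enum UnderCapacity {
    Withdraw,
    EvictOldest,
}

#[derive(Debug, PartialEq)]
enum Action {
    /// Drop the replica role and go Idle.
    WithdrawReplica,
    /// Space was freed; retry the apply on the next chunk.
    RetryNextChunk,
    /// The sweep freed nothing (no retention caps); the next apply
    /// is expected to fail again.
    StillFull,
}

/// Decides what to do after a rejected append. `sweep_retention` is a
/// stand-in closure assumed to return the number of bytes freed.
fn on_disk_pressure(
    policy: UnderCapacity,
    under_capacity_total: &AtomicU64,
    sweep_retention: impl FnOnce() -> u64,
) -> Action {
    // The counter bumps on every disk-pressure event, whatever the policy.
    under_capacity_total.fetch_add(1, Ordering::Relaxed);
    match policy {
        UnderCapacity::Withdraw => Action::WithdrawReplica,
        UnderCapacity::EvictOldest => {
            if sweep_retention() > 0 {
                Action::RetryNextChunk
            } else {
                Action::StillFull
            }
        }
    }
}
```

The `StillFull` branch is the case the text warns about: `EvictOldest` without retention caps keeps the replica role but makes no progress.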

### `replication_budget_fraction: f32`

Fraction of measured NIC peak that replication-sync I/O may
consume. Range: `(0.0, 1.0]` (default `0.5`). The bandwidth
budget is a token bucket; leaders reject `SyncRequest`s with
`SyncNackError::Backpressure` when the bucket is empty.

The NIC peak is currently a 1 Gbps placeholder; a follow-up wires
in the measured peak from the proximity-graph throughput probe.
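
As a rough picture of how such a budget behaves, here is a minimal token-bucket sketch. `SyncBudget` and its methods are hypothetical names, and two details are assumptions not stated above: the burst capacity is set to one second of budget, and a request is refused when the bucket holds fewer tokens than the request needs.

```rust
/// Placeholder NIC peak from the text: 1 Gbps, in bytes per second.
const NIC_PEAK_BYTES_PER_SEC: f64 = 1_000_000_000.0 / 8.0;

struct SyncBudget {
    /// Refill rate in bytes per second.
    rate: f64,
    /// Burst ceiling in bytes (assumed: one second of budget).
    capacity: f64,
    /// Bytes currently available.
    tokens: f64,
}

impl SyncBudget {
    /// Builds a bucket for a fraction in `(0.0, 1.0]`; `None` otherwise.
    fn new(budget_fraction: f32) -> Option<Self> {
        if !(budget_fraction > 0.0 && budget_fraction <= 1.0) {
            return None;
        }
        let rate = NIC_PEAK_BYTES_PER_SEC * f64::from(budget_fraction);
        Some(Self { rate, capacity: rate, tokens: rate })
    }

    /// Adds tokens for the elapsed time, capped at the burst ceiling.
    fn refill(&mut self, elapsed_secs: f64) {
        self.tokens =
            (self.tokens + self.rate * elapsed_secs).min(self.capacity);
    }

    /// Returns `true` and debits the bucket if `bytes` fit; `false`
    /// corresponds to the leader answering with a backpressure NACK.
    fn try_consume(&mut self, bytes: u64) -> bool {
        let cost = bytes as f64;
        if cost <= self.tokens {
            self.tokens -= cost;
            true
        } else {
            false
        }
    }
}
```

With the default fraction of `0.5` and the 1 Gbps placeholder, the bucket refills at 62.5 MB/s, so a replica that drains it waits for the next refill before its `SyncRequest` is served.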

## Lifecycle

```text
open_file(channel, cfg with replication=Some(_))
    │
    ▼
spawn ReplicationRuntime (tokio task per channel)
    │  ── Idle  (initial)
    ▼
placement filter / pinned set selects this node
    │
    ▼
coordinator.transition_to(Replica, CapabilitySelected)
    │  ── Replica  (advertises causal:<hex> capability tag)
    ▼
heartbeat loop:
  - Leader emits heartbeats every heartbeat_ms
  - Replica observes leader's tail_seq in each heartbeat
  - If replica's local tail < leader's tail, replica emits SyncRequest
  - Leader's handle_sync_request reads from local file, returns SyncResponse
  - Replica's apply_sync_response advances local tail
    │
    ▼  (leader silent for 3 × heartbeat_ms)
coordinator.transition_to(Candidate, MissedHeartbeats)
    │  ── Candidate  (microseconds-scale; deterministic election)
    ▼
elect(replica_set, self, rtt_lookup, healthy_peers) →
    SelfWins → transition_to(Leader, ElectionWon)
    PeerWins(_) → transition_to(Replica, ElectionLost)
    NoEligibleReplica → stay Candidate, next round
    │
    ▼
close_file(channel)
    │
    ▼
coordinator.transition_to(Idle, ChannelClose) + router unregisters
```
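
The election step in the diagram can be sketched as a pure function. Everything here is an assumption layered on the description above: the `u64` `NodeId`, the RTT map shared identically by all participants, treating an unknown RTT as worst-case, and breaking RTT ties on the lowest `NodeId`. The crate's `elect` may differ on any of these points.

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in for the crate's node identifier.
type NodeId = u64;

#[derive(Debug, PartialEq)]
enum Outcome {
    SelfWins,
    PeerWins(NodeId),
    NoEligibleReplica,
}

/// Picks the pinned leader if it is a healthy replica, otherwise the
/// healthy replica with the lowest RTT.
fn elect_sketch(
    replica_set: &[NodeId],
    this_node: NodeId,
    rtt_ms: &HashMap<NodeId, u32>,
    healthy: &HashSet<NodeId>,
    leader_pinned: Option<NodeId>,
) -> Outcome {
    // A healthy pinned leader short-circuits the RTT comparison.
    let pinned = leader_pinned
        .filter(|n| replica_set.contains(n) && healthy.contains(n));
    let winner = pinned.or_else(|| {
        replica_set
            .iter()
            .copied()
            .filter(|n| healthy.contains(n))
            // Assumption: unknown RTT sorts last; ties go to the lowest id.
            .min_by_key(|n| (rtt_ms.get(n).copied().unwrap_or(u32::MAX), *n))
    });
    match winner {
        Some(n) if n == this_node => Outcome::SelfWins,
        Some(n) => Outcome::PeerWins(n),
        None => Outcome::NoEligibleReplica,
    }
}
```

Because the function depends only on its inputs, every replica that sees the same replica set, health view, and RTT table reaches the same outcome, which is what keeps the `Candidate` phase short.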

## Observability

Per-channel atomic counters (`ChannelMetricsAtomic`) exposed via
the `ReplicationMetricsRegistry`. Prometheus shapes:

| Metric | Type | Meaning |
|--------|------|---------|
| `dataforts_replication_lag_seconds{channel,role}` | gauge | Leader: max-across-replicas of `now - last_heartbeat`. Replica: `now - believed_leader.last_heartbeat`. |
| `dataforts_replication_sync_bytes_total{channel}` | counter | Cumulative bytes shipped via `SyncResponse`. |
| `dataforts_leader_changes_total{channel}` | counter | Transitions into Leader role. Spikes indicate election thrash. |
| `dataforts_replication_under_capacity_total{channel}` | counter | Disk-pressure events (bumps regardless of policy). |
| `dataforts_replication_skip_ahead_total{channel}` | counter | `BadRange` NACKs received (gap exceeded `skip_threshold`). |
| `dataforts_replication_election_thrash_total{channel}` | counter | `MissedHeartbeats` transitions; > 1/30s indicates instability. |
| `dataforts_replication_witness_withdrawals_total{channel}` | counter | Reserved for Phase E witness coordination. |

Render via `ReplicationMetricsRegistry::snapshot().prometheus_text()`.

For per-channel introspection (current role, manual transition for
recovery), `Redex::replication_coordinator_for(channel_name) ->
Option<Arc<ReplicationCoordinator>>` returns the coordinator handle.
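
To make the leader-side lag gauge concrete, the sketch below computes it from a list of last-seen heartbeat timestamps and renders one sample in the Prometheus text exposition format. The function names, the millisecond timestamps, the `0.0` result for an empty replica list, and the `leader` role label value are assumptions for illustration. Label values are not escaped, so the sketch assumes channel names without quotes or backslashes.

```rust
/// Leader-side lag: max over replicas of `now - last_heartbeat`,
/// converted from milliseconds to seconds.
fn leader_lag_seconds(now_ms: u64, replica_last_heartbeat_ms: &[u64]) -> f64 {
    replica_last_heartbeat_ms
        .iter()
        // Saturate so a timestamp slightly ahead of `now` reads as zero lag.
        .map(|&seen| now_ms.saturating_sub(seen))
        .max()
        .map_or(0.0, |ms| ms as f64 / 1000.0)
}

/// Renders one gauge sample as `name{labels} value`.
fn render_lag(channel: &str, role: &str, lag: f64) -> String {
    format!(
        "dataforts_replication_lag_seconds{{channel=\"{}\",role=\"{}\"}} {}",
        channel, role, lag
    )
}
```

The gauge tracks the slowest replica, so a single stalled peer is enough to raise the leader's reported lag.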

## Failure modes

| Symptom | Likely cause | Resolution |
|---------|--------------|------------|
| Replica's `lag_seconds` keeps growing | Leader's bandwidth budget exhausted, or replica's mesh path saturated | Increase `replication_budget_fraction`, or check the proximity-graph throughput probe for path-level loss |
| Frequent `leader_changes_total` bumps | `heartbeat_ms` too aggressive for the link's typical RTT variance | Bump `heartbeat_ms`, or pin leader with `leader_pinned` |
| `under_capacity_total` > 0 + replica disappeared | `UnderCapacity::Withdraw` fired | Free disk on the replica or switch policy to `EvictOldest` (requires retention caps) |
| `skip_ahead_total` > 0 | Replica fell more than `skip_threshold` behind; leader's retained range trimmed past replica's tail | Either accept the data loss or bump leader's retention caps |
| `election_thrash_total` rising | Two replicas oscillating leadership under flaky connectivity | Investigate the proximity graph; partition-detector should fire if pathology is partition-shaped |

## Limits + non-goals

- **One writer per channel** — the leader is the single writer.
  RedEX is append-only and monotonic on `seq`; multi-writer
  topologies are out of scope.
- **Replication is best-effort under pressure** — the leader's
  replication factor is a hard guarantee, but individual replicas
  fall back to `UnderCapacity` policy when local storage saturates.
- **Skip-ahead is heap-only** — when the leader's `SyncResponse`
  carries `first_seq` above the replica's local tail (the leader
  trimmed past the replica's retained range), the replica calls
  `RedexFile::skip_to(first_seq)` and retries the apply. Persistent
  files (`redex-disk`) reject `skip_to` with a typed error; affected
  replicas fall back to the `BadRange` NACK path and heartbeat-cycle
  recovery while the persistent-tier truncate+rebuild path waits for v2.
- **DST coverage is partial** — pure-logic pieces (state machine,
  election, catch-up helpers, runtime tick) have unit tests; the
  full deterministic-simulation harness for partition + retention-
  drift scenarios is Phase F work.
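
The skip-ahead rule in the list above reduces to a small decision on the replica side. The sketch uses hypothetical names (`Tier`, `CatchUp`, `plan_catch_up`) and assumes `local_tail` means the next sequence number the replica expects; the crate may define the tail differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    Heap,
    Persistent,
}

#[derive(Debug, PartialEq)]
enum CatchUp {
    /// The response starts at or before the local tail: apply as-is.
    Apply,
    /// Heap file: move the local tail forward to this seq, then apply.
    SkipTo(u64),
    /// Persistent file: `skip_to` is rejected, so fall back to the
    /// `BadRange` NACK path and heartbeat-cycle recovery.
    BadRangeFallback,
}

/// Chooses how a replica handles a `SyncResponse` whose first entry
/// is `first_seq`, given its own `local_tail`.
fn plan_catch_up(tier: Tier, local_tail: u64, first_seq: u64) -> CatchUp {
    if first_seq <= local_tail {
        return CatchUp::Apply;
    }
    // A gap means the leader trimmed past what this replica holds.
    match tier {
        Tier::Heap => CatchUp::SkipTo(first_seq),
        Tier::Persistent => CatchUp::BadRangeFallback,
    }
}
```

Only the heap tier can close the gap immediately; a persistent replica in the same situation stays behind until the fallback path recovers it.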