net-mesh 0.24.0

MeshOS — Features Overview
*Atomic Playboys Release*

MeshOS is the cluster‑behavior engine of Net.  
It converts the substrate (RedEX + Capabilities + Dataforts) into a living operating environment:

- reconciles desired vs actual state  
- supervises daemons  
- drives replica placement  
- manages blob movement  
- responds to admin events  
- incorporates locality  
- produces behavior snapshots for Deck  

MeshOS is the brain that makes the cluster move.

---

1. Unified Event Loop
MeshOS runs a single event loop that processes multiple event types:

- Replica updates  
- Daemon lifecycle signals  
- RTT samples  
- Node health  
- Admin actions  
- Blob announcements  
- Placement intent from Dataforts  

One canonical ordering → deterministic cluster behavior.

MeshOS does not run multiple reactors.  
One stream → one reconcile → consistent actions.

---

2. Reconciliation Engine
The core MeshOS responsibility:

desired state (from Dataforts)  
vs  
actual state (from RedEX folds)

Difference becomes actions:

- start_daemon  
- stop_daemon  
- migrate_blob  
- pull_replica  
- reduce_heat  
- mark_avoid  
- apply backoff  

Every iteration produces a minimal action list to bring the mesh into alignment.

---

3. Daemon Supervision
MeshOS handles all node‑resident workers:

- start / stop  
- restart on crash  
- exponential backoff  
- health checks  
- saturation signals  
- graceful shutdown  

Daemons register via the MeshOS SDK (Rust for now; other languages follow binding pattern).

---

4. Replica Enforcement
MeshOS continuously enforces the placement plan:

- replica count compliance  
- drift correction  
- greedy pulls  
- anti-entropy  
- repair under churn  
- de-duplication of stale replicas  
- safety when nodes rejoin  

Blob movement becomes a first-class system action.

---

5. Locality Awareness
MeshOS incorporates RTT measurements to influence and override placement heuristics:

- avoid slow nodes  
- reroute blob pulls  
- prioritize neighbors  
- correct “bad” placements  
- identify partitions early  

Locality signals flow back into Dataforts in the next placement round.

---

6. Admin Event Handling
MeshOS applies operator intent:

- maintenance enter / exit  
- drain node  
- cordon / uncordon  
- restart all daemons  
- clear avoid lists  
- drop replicas  
- invalidate placement for recalculation  

Admin events flow through RedEX, so all nodes converge on identical behavior.

---

7. Safety & Backpressure
MeshOS emits systemic safety signals:

- global backpressure  
- temporary throttle windows  
- blob pull cooldown  
- crash-loop gating  
- replica stabilization periods  

This prevents thrashing, storming, and runaway repairs.

---

8. Behavior Snapshot for Deck
MeshOS emits a fold of cluster behavior state:

- current actions  
- pending actions  
- recent failures  
- drift  
- heat levels  
- locality map  
- health  
- daemon status  
- placement stability  

Deck uses this to paint the live cluster jungle:

- replica movement  
- blob migration  
- node saturation  
- failure envelopes  
- recovery attempts  

---

9. Rust SDK Surface
MeshOS exposes a Rust-only SDK for daemon integration:

- register_daemon  
- report_health  
- report_saturation  
- publish_capabilities  
- receive MeshOS control events  
- graceful_shutdown  

Other language SDKs follow after the substrate settles.

MeshOS itself remains Rust-native and not a cross‑lang surface.

---

10. Interaction Surfaces
MeshOS interacts with four systems:

- RedEX for event streams and state commitments   (1/2)
- Capability System for node attributes and daemon metadata  
- Dataforts for placement and replica intent  
- MeshDB for folded state and Deck queries  

MeshOS does not duplicate their logic.  
It composes them.

---

11. Non-Goals
MeshOS is not:

- a scheduler for user jobs  
- a remote execution system  
- a workflow orchestrator  
- a data warehouse  
- a compute framework  

It is the behavior layer of the cluster — the logic that keeps everything coherent and alive.

---

Summary
MeshOS transforms Net into a living distributed system:

- one event loop  
- many event types  
- unified reconcile  
- deterministic actions  
- daemon supervision  
- replica enforcement  
- locality correction  
- operator integration  
- Deck visibility  

MeshOS is the atomic heart of the mesh.

12. Maintenance Nodes
Maintenance Nodes are first‑class cluster roles supervised directly by MeshOS.  
They represent nodes that are logically present but operationally isolated so that:

- replicas can drain safely  
- daemons can shut down gracefully  
- admin changes can apply  
- operator tasks can run without disrupting placement  
- Deck can show safe‑state transitions  

MeshOS implements maintenance nodes as a state machine with guaranteed-safe transitions.

---

Maintenance Node Lifecycle
MeshOS manages the following states:

- Active → normal participation  
- EnteringMaintenance → prepare for isolation  
- Maintenance → fully isolated, can run operator commands  
- ExitingMaintenance → rejoin, resync, re-evaluate placement  
- DrainFailed → operator warning state  
- Recovery → post‑maintenance repairs

All transitions are idempotent, chain-driven, and visible in Deck.

---

Maintenance Mode Guarantees
When a node enters Maintenance:

1. Replica Freeze  
   - No new replicas placed on the node.  
   - Existing replicas are scheduled for pull/migrate.

2. Daemon Drain  
   - All non-essential daemons receive a shutdown event.  
   - Health is no longer considered in placement.

3. Blob Safety  
   - Blob cleanup runs after all replicas migrate.  
   - No pulls target a maintenance node.

4. Admin Surface Unlocked  
   - Node can run operator commands:  
     - key rotation  
     - identity changes  
     - indexing fixes  
     - storage repairs  
     - config reloads  

5. Cluster Stability  
   - Dataforts treats maintenance nodes as absent in scoring.  
   - Other nodes adopt replicas to satisfy desired count.

---

Exiting Maintenance
MeshOS coordinates a controlled exit:

1. Health Revalidation  
   Daemons must restart and report health.

2. Capability Refresh  
   Fresh capability set emitted via fork-of.

3. Replica Warmup  
   Node is eligible for replica re-placement after RTT stabilizes.

4. Avoid List Timeout  
   No immediate “hot placement” — ramp-up window prevents thrash.

---

Operator APIs
MeshOS surfaces simple admin commands (Rust-only SDK):

enter_maintenance(node_id)
exit_maintenance(node_id)
drain(node_id)
uncordon(node_id)


These are chain-driven admin events — not RPCs.  
All nodes converge on the same interpretation.

---

Deck Integration
Deck displays:

- maintenance state  
- drain progress  
- hung daemons  
- replicas in transit  
- merge/diff of admin events  
- recovery envelope  
- expected time-to-safe  

Operators see, live:

- what is draining  
- what is blocked  
- what is safe to remove  
- what is still migrating  

---

Why Maintenance Nodes Matter
They enable:

- safe cluster upgrades  
- safe daemon upgrades  
- node replacement  
- machine swaps  
- storage migration  
- key rotation  
- operator inspection  
- stress-free debugging  

Without unsafe side effects.

MeshOS becomes a true distributed OS with real operator semantics, not just a placement engine.