# Simulacra
A deterministic discrete-event simulation engine for message flow across large computer networks, with pluggable latency, jitter, and failure models.
## Status
v0.1 implementation complete: kernel, async task façade, network layer with topology / routing / bandwidth / buffers / drop policies, trace recording with JSON export, and a comprehensive failure-injection surface (partition, link failure with reroute, node failure, opt-in in-flight drop) usable from both the raw `Network<P, L>` API and the `TaskSim<M>` async façade. Determinism is enforced end-to-end via `tests/determinism.rs`. See `CHANGELOG.md` for the full surface and `Phase 7 follow-ups` below for what's open.
## Vision
Simulacra is a Rust-first simulation platform for modeling large networks of computers and the movement of messages through those networks over simulated time.
The project starts from a few deliberately simple premises:
- time advances by events, not wall clock time
- nodes are passive state, not OS threads
- messages move through a topology according to routing and delay models
- randomness is explicit and reproducible
- repeated runs with the same seed should produce the same result
The long-term goal is not just “a simulator,” but a modern, ergonomic, inspectable engine for systems simulation.
## Non-goals for v1
To keep the project honest, the first version should not try to be:
- a packet-level Internet simulator
- a full cloud/datacenter simulator
- a general-purpose async runtime replacement
- a parallel discrete-event simulator
- a GUI-heavy academic framework
Those may become future directions, but they should not define the initial architecture.
## Core idea
At its heart, Simulacra is a deterministic scheduler over timestamped events.
A minimal mental model:
```rust
while let Some(event) = queue.pop() {
    sim.now = event.time();
    sim.handle(event);
}
```
The first concrete domain is network-style message delivery:
1. a node sends a message
2. a route is selected
3. latency and jitter are applied
4. a delivery event is scheduled at a future simulated time
5. the target node receives the message when that event is processed
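As a sketch of how those five steps could translate into code (placeholder types, names, and values throughout, not the crate's actual API):

```rust
type Time = u64;
type NodeId = u32;
type MessageId = u64;

enum NetEvent {
    DeliverMessage { src: NodeId, dst: NodeId, message: MessageId },
}

struct Simulation {
    now: Time,
    // plus the event queue, topology, and seeded RNG
}

impl Simulation {
    fn schedule(&mut self, at: Time, event: NetEvent) {
        // In a real kernel this would push onto the priority queue.
        let _ = (at, event);
    }

    // Steps 1-4: the send itself is just arithmetic plus a queue push.
    fn send(&mut self, src: NodeId, dst: NodeId, message: MessageId) {
        let route_latency: Time = 5; // steps 2-3: would come from the topology
        let jitter: Time = 1;        // step 3: would come from the seeded RNG
        let deliver_at = self.now + route_latency + jitter; // step 4
        self.schedule(deliver_at, NetEvent::DeliverMessage { src, dst, message });
        // Step 5 happens later, when the main loop pops this event.
    }
}
```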
## Design principles
### Determinism first
Given the same seed, same topology, and same inputs, a simulation should produce the same result.
This implies:
- deterministic event ordering
- explicit tie-breaking rules
- seeded randomness
- no dependence on wall clock time
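In practice the property is pinned down by a test of roughly this shape (`run_scenario` here is a hypothetical helper; the real checks live in `tests/determinism.rs`):

```rust
#[test]
fn same_seed_same_trace() {
    // Two runs with identical inputs: any divergence is a determinism bug.
    let trace_a = run_scenario(42);
    let trace_b = run_scenario(42);
    assert_eq!(trace_a, trace_b);
}
```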
### Simple kernel, rich layers
The core engine should stay small:
- time
- event queue
- scheduler
- task or node registry
- deterministic RNG
Higher-level conveniences should layer on top of that core.
### Data-oriented where it matters
The system should avoid a needlessly object-heavy model. Nodes, links, messages, and events should be represented compactly where practical.
### Observable by default
A simulator is much more useful when its behavior can be inspected. Instrumentation should be treated as a first-class concern, not an afterthought.
### Ergonomic without hiding the model
The API should be pleasant, but it should not obscure the fact that this is a discrete-event simulator with explicit causality and simulated time.
## Architecture overview
The initial architecture is expected to have at least these conceptual pieces.
### 1. Simulation kernel
Responsible for:
- current simulated time
- event queue management
- deterministic event ordering
- running the main loop
Possible core shape:
```rust
pub struct Simulation {
    now: Time,
    queue: EventQueue,
    rng: SimRng,
    // domain-specific registries layered on top
}
```
### 2. Time model
A dedicated `Time` type should represent simulated time explicitly.
Open questions:
- integer ticks vs nanoseconds vs generic duration units
- whether `Time` and `Duration` should be distinct types
- overflow behavior
Initial recommendation:
- use integer-based simulated time
- keep `Time` and `Duration` distinct
- avoid floats in the core clock model
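One way to realize that recommendation, as a sketch: integer ticks, with `Time` and `Duration` as distinct newtypes so mixed-unit arithmetic fails to compile.

```rust
use std::ops::Add;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Time(pub u64); // an absolute simulated instant, in ticks

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Duration(pub u64); // a span between instants, in ticks

impl Add<Duration> for Time {
    type Output = Time;
    fn add(self, d: Duration) -> Time {
        // Checked add turns overflow into a loud failure instead of silent wraparound.
        Time(self.0.checked_add(d.0).expect("simulated time overflow"))
    }
}
```

`Time + Duration` compiles; `Time + Time` does not, which catches a whole class of unit bugs at compile time.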
### 3. Event model
Events are the atomic units of causality.
Initial requirements:
- each event has a scheduled time
- event ordering must be deterministic
- tie-breaking should be explicit
A likely shape:
```rust
pub struct Scheduled<E> {
    pub at: Time,
    pub order: u64,
    pub event: E,
}
```
Here `order` is a monotonically increasing sequence number used to break ties between events scheduled at the same timestamp.
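A sketch of how that tie-break might be encoded, assuming `Time` is `Ord` (manual impls, so the payload `E` needs no ordering of its own):

```rust
use std::cmp::Ordering;

impl<E> PartialEq for Scheduled<E> {
    fn eq(&self, other: &Self) -> bool {
        self.at == other.at && self.order == other.order
    }
}

impl<E> Eq for Scheduled<E> {}

impl<E> PartialOrd for Scheduled<E> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl<E> Ord for Scheduled<E> {
    fn cmp(&self, other: &Self) -> Ordering {
        // Earlier time first; at equal times, the lower sequence number wins,
        // so same-timestamp events pop in the order they were scheduled.
        self.at.cmp(&other.at).then(self.order.cmp(&other.order))
    }
}
```

With `std::collections::BinaryHeap` (a max-heap), entries would be wrapped in `std::cmp::Reverse` so the earliest event pops first.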
### 4. Topology model
The initial domain centers on message flow through a graph of nodes and links.
Topology responsibilities:
- node identifiers
- edges / links
- route lookup or route computation
- latency base values
- optional capacity/failure metadata later
Initial recommendation:
- start with static topology
- start with precomputed routes or simple routing logic
- keep the topology layer separate from the scheduler
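A sketch of what that separation could look like, with precomputed next-hop routes (illustrative names, not the crate's actual types):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct NodeId(pub u32);

pub struct Link {
    pub latency: u64, // base latency in ticks
}

pub struct Topology {
    links: HashMap<(NodeId, NodeId), Link>,
    // next_hop[(src, dst)] = neighbor to forward through, precomputed once
    // (e.g. by Dijkstra) while the topology is static.
    next_hop: HashMap<(NodeId, NodeId), NodeId>,
}

impl Topology {
    // Walk the precomputed route, summing base latencies; None means no route.
    pub fn route_latency(&self, mut at: NodeId, dst: NodeId) -> Option<u64> {
        let mut total = 0;
        while at != dst {
            let hop = *self.next_hop.get(&(at, dst))?;
            total += self.links.get(&(at, hop))?.latency;
            at = hop;
        }
        Some(total)
    }
}
```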
### 5. Network/message model
The first domain-specific event set can remain extremely small.
Example:
```rust
pub enum NetEvent {
    DeliverMessage {
        src: NodeId,
        dst: NodeId,
        message: MessageId,
    },
}
```
This is enough for an initial simulator that models delayed delivery over a graph.
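The handling side is equally small. A sketch, reusing the placeholder `Simulation` from earlier (`record` is a hypothetical trace hook):

```rust
fn handle(sim: &mut Simulation, event: NetEvent) {
    match event {
        NetEvent::DeliverMessage { src, dst, message } => {
            // This is where receiver logic runs: update node state, write a
            // trace entry, and possibly schedule follow-up sends.
            sim.record(src, dst, message);
        }
    }
}
```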
### 6. Randomness model
Randomness should be deterministic and scoped.
Requirements:
- seeded runs
- repeatable jitter/failure behavior
- ability to replay exactly
Possible future refinement:
- separate RNG streams for different concerns such as routing, jitter, failures, workload generation
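A sketch of how scoped streams could be derived from one master seed, using a hand-rolled SplitMix64 so the example needs no external crates (the real `SimRng` may differ, and a production design might mix stream IDs more carefully):

```rust
pub struct SimRng {
    state: u64,
}

impl SimRng {
    // Mix the stream id into the seed so streams are decorrelated.
    pub fn for_stream(master_seed: u64, stream: u64) -> Self {
        SimRng {
            state: master_seed ^ stream.wrapping_mul(0x9E3779B97F4A7C15),
        }
    }

    // SplitMix64 step: tiny, fast, and fully deterministic.
    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

// let mut jitter_rng = SimRng::for_stream(seed, 0);
// let mut failure_rng = SimRng::for_stream(seed, 1);
```

With separate streams, adding an extra jitter draw cannot shift the sequence of failure draws, which keeps scenarios comparable across code changes.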
### 7. Observability
The engine should make it easy to answer questions like:
- what event fired at this time?
- why was this message delayed?
- what was the queue depth over time?
- what state transitions happened for this node?
Potential outputs:
- event trace logs
- counters / metrics
- queue depth histories
- timeline exports
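A sketch of a trace record that could answer those questions (field names illustrative):

```rust
#[derive(Debug, Clone)]
pub struct TraceEntry {
    pub at: u64,            // simulated time at which the event fired
    pub kind: &'static str, // e.g. "deliver", "drop", "link_fail"
    pub node: Option<u32>,  // node involved, if any
    pub detail: String,     // free-form context: route taken, queue depth, ...
}
```

Appending one entry per processed event yields a replayable timeline that can be filtered per node, per message, or per time window.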
## Execution model
### Baseline execution model
The first execution model should be single-process and single-threaded.
Rationale:
- simplest correct implementation
- deterministic by default
- easy to debug and reason about
- avoids premature complexity around causality and partition coordination
### Parallelism stance
Parallelism is not rejected; it is deferred.
Near-term parallelism should focus on:
- many independent simulation runs in parallel
Not on:
- parallelizing a single run
Longer-term, partitioned simulation may be explored if the architecture justifies it.
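Because each run is single-threaded and deterministic, a seed sweep parallelizes trivially. A sketch with scoped threads (`run_one` is a hypothetical entry point that builds and runs one simulation):

```rust
use std::thread;

fn run_one(seed: u64) -> u64 {
    // Stand-in for constructing and running a full simulation with this seed.
    seed
}

fn sweep(seeds: &[u64]) -> Vec<u64> {
    thread::scope(|s| {
        // One independent, deterministic run per seed; no shared state.
        let handles: Vec<_> = seeds
            .iter()
            .map(|&seed| s.spawn(move || run_one(seed)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```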
## Async/task model
A major design opportunity is to provide an async-like API on top of the discrete-event engine.
Example user-facing shape:
```rust
async fn node_main(ctx: NodeContext) {
    loop {
        let msg = ctx.recv().await;
        ctx.sleep(Duration::from_millis(10)).await;
        ctx.send(msg.reply_to(), reply(msg)).await;
    }
}
```
Important distinction:
- this would be inspired by Tokio-like ergonomics
- but it would not be driven by wall clock time or OS I/O
- the simulator would poll suspended tasks according to simulated events
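For intuition, here is a minimal, self-contained sketch of how a simulated `sleep` future could suspend until the event queue wakes it; the real `simulacra-task` internals may look quite different:

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll, Waker};

type Time = u64;

// Shared simulation clock plus pending wakeups, standing in for the kernel's
// event queue in this sketch.
#[derive(Default)]
struct Clock {
    now: Time,
    wakeups: Vec<(Time, Waker)>, // a real kernel would push these as events
}

struct Sleep {
    clock: Rc<RefCell<Clock>>,
    until: Time,
    armed: bool,
}

impl Future for Sleep {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.clock.borrow().now >= self.until {
            return Poll::Ready(());
        }
        if !self.armed {
            // Register a wake event at the target simulated time; the main
            // loop advances `now` to that time and then calls the waker.
            let waker = cx.waker().clone();
            self.clock.borrow_mut().wakeups.push((self.until, waker));
            self.armed = true;
        }
        Poll::Pending
    }
}
```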
Recommendation:
- do not make this the first implementation milestone
- first build the explicit event kernel
- then layer an async/task façade on top if the core remains clean
## Initial crate shape
A likely long-term workspace structure:
- `simulacra-core` — time, event queue, scheduler
- `simulacra-net` — topology, routing, message delivery, latency/jitter models
- `simulacra-task` — async/task façade over the simulation kernel
- `simulacra-vis` — visualization/export helpers
- `simulacra` — top-level convenience crate or prelude
For now, starting as a single crate is the right move.
## Proposed v0 scope
The first meaningful version should be intentionally narrow.
### v0 goals
- deterministic simulated clock
- priority queue of scheduled events
- static topology of nodes and links
- message send from one node to another
- route latency plus optional jitter
- seeded reproducibility
- basic event trace output
### v0 non-goals
- packet fragmentation
- bandwidth/congestion modeling
- dynamic routing protocols
- node CPU/memory execution modeling
- partitioned simulation
- GUI
- real async runtime integration
## Example first scenario
A very small end-to-end milestone:
- create 10 nodes in a graph
- define link latencies
- send a message from node A to node B
- compute route delay plus jitter
- schedule delivery
- run simulation to completion
- emit trace of all delivery events
If that works deterministically, the nucleus of the project is sound.
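In code, the milestone might read something like this (a hypothetical builder shape, not the actual API):

```rust
fn main() {
    let mut topo = Topology::new();
    let nodes: Vec<NodeId> = (0..10).map(|_| topo.add_node()).collect();
    topo.add_link(nodes[0], nodes[1], /* latency */ 5);
    // ... remaining links ...

    let mut sim = Simulation::with_seed(42);
    sim.send(nodes[0], nodes[9], 0 /* message id */);
    sim.run_to_completion();

    for entry in sim.trace() {
        println!("{entry:?}"); // identical output for a fixed seed + topology
    }
}
```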
## Open design questions
### Time
- What should the canonical unit of simulated time be?
- Should the core use unitless ticks and let higher layers interpret them?
### Event queue
- Is `BinaryHeap` enough initially?
- Do we want a more specialized calendar queue or timing wheel later?
### Topology/routing
- Precompute shortest paths, or compute dynamically?
- Should route selection be part of the topology layer or a pluggable strategy?
### Payload storage
- Should events contain payloads directly, or refer to message storage by ID?
- What data layout minimizes allocations without making the API miserable?
### Deterministic ordering
- What exact tie-break rules should govern events at identical timestamps?
### Instrumentation
- What should be built into the core versus layered externally?
### Async façade
- Should the task model be a first-party layer or a separate experimental crate?
## Roadmap
### Phase 1: minimal kernel
- `Time`
- `Scheduled<E>`
- event queue
- simulation loop
- deterministic ordering
### Phase 2: network domain
- node IDs
- topology
- routing
- message delivery
- jitter model
### Phase 3: reproducibility and traces
- seeded RNG
- trace recording
- replay validation
### Phase 4: ergonomics
- better scenario construction APIs
- helper builders
- docs and examples
### Phase 5: async/task experiment
- simulated `sleep().await`
- task wakeups scheduled by the event queue
- node task contexts
### Phase 6: scale exploration
- profiling
- allocation reduction
- compact storage
- multi-run parallel execution
### Phase 7: advanced models
- loss/failure injection
- `SpikyLatency` landed in 2026-04
- pair-level partition/heal (`Network::partition` / `heal`) — initial commit
- link failure with reroute (`Topology::fail_link` / `heal_link`,
Dijkstra-aware) landed in 2026-05; in-flight messages survive
- node failure (`Topology::fail_node` / `heal_node`) landed in 2026-05;
excludes the node from routing as src, dst, or intermediate hop
- opt-in in-flight drop landed in 2026-05 via
`NetConfig::drop_in_flight_on_failure`; failure mutators sweep the
event queue and rewrite unroutable `Deliver` events into `Drop`s
(uses new `Simulation::rewrite_queue` API)
- failure injection in async task facade landed in 2026-05:
`NodeContext` and `TaskSim` expose `partition` / `heal` / `fail_link` /
`heal_link` (+ `_directed`) / `fail_node` / `heal_node`; sends across
failed/partitioned routes drop with `messages_dropped` counter on
`TaskSimStats`. Replaces the previous broken "no route → silent
deliver-now" behavior in `SendFut`/`inject` with a clean drop.
- minimal end-to-end bandwidth cap with per-`(src, dst)` serialization
queueing landed in 2026-04 via `Network::set_bandwidth` + `send_sized`
#### Phase 7 follow-ups (open)
Concrete next moves, ordered by rough effort, smallest first.
1. **Failure-exercising bench.** All current benches have empty failure
sets, so the per-edge `HashSet::contains` in Dijkstra and the
partition check in `SendFut::poll` are invisible. Add a bench that
actually populates `failed_links` / `failed_nodes` / `partitions`
(e.g., 10% of edges failed) so future regressions on the failure
hot path become visible. Add a column to `docs/perf-baseline.md`.
2. **Task-layer trace export.** `TaskSim` has its own `SimState` and
`EventQueue<TaskEvent<M>>`, separate from `Network`'s
`TracedNetwork`. Determinism tests today only cover the `Network`
path. Add a `TracedTaskSim<M>` (or `TaskSimBuilder::with_trace`)
that records `Delivered` / `Dropped` events with timestamps, then
add a task-layer scenario to `tests/determinism.rs`.
3. **Time-bounded failure scheduler helper.** A common pattern is
"fail at T1, heal at T2." Today users implement it inline by
checking `ctx.now()` on each handler tick (see
`examples/failure_injection.rs`). A small helper —
`Scenario::fail_at(time, action)` or similar — would dedupe that
pattern.
4. **In-flight drop in the async task layer.** `Network` has the opt-in
`NetConfig::drop_in_flight_on_failure`; `TaskSim` does not. Symmetry
would mean adding the same flag to `TaskSimBuilder` / `TaskSim` and
sweeping `events: EventQueue<TaskEvent<M>>` on failure mutators.
Mostly mechanical given the existing `EventQueue::rewrite` primitive.
5. **Queueing disciplines beyond FIFO.** Per-link bandwidth + buffer +
tail/RED drop is in. Missing: priority queues, weighted fair
queueing (WFQ), traffic classes. Each is its own design exercise;
start by clarifying the user-visible API on `Topology` (e.g.,
`add_link_with_discipline(...)`).
6. **Partitioning experiments.** Vague until a concrete protocol drives
the requirements. A small Raft-flavored or gossip-with-Byzantine
example would surface what's missing — likely related to (5) above.
## README draft
## Simulacra
Simulacra is a deterministic discrete-event simulation engine for modeling message flow across large computer networks.
It is designed around a few simple ideas:
- simulated time instead of wall clock time
- explicit event-driven causality
- deterministic replay from a seed
- ergonomic APIs layered over a small core
### Current focus
The first milestone is a minimal simulator that can:
- represent a network topology
- send messages across routes
- apply latency and jitter
- process delivery events in deterministic time order
### Why?
Most existing simulation tools in this space are either highly academic, domain-heavy, or not very ergonomic. Simulacra aims to explore a different point in the design space: modern Rust APIs, deterministic behavior, and a strong foundation for observability and tooling.
### Status
Early, but past the first milestone: v0.1 (deterministic kernel, network layer, async task façade, trace export, failure injection) is implemented. APIs are still evolving; see `CHANGELOG.md`.
## Immediate next steps
1. Define `Time`, `Duration`, and `Scheduled<E>`.
2. Implement the first event queue.
3. Implement `Simulation::run()`.
4. Model a minimal topology and `DeliverMessage` event.
5. Write one deterministic end-to-end scenario test.
## Notes for future contributors
Keep the kernel small. Prefer deterministic behavior over cleverness. Resist adding realism faster than the core can absorb it.