# VMM
ktstr includes a purpose-built VMM (virtual machine monitor) that boots
Linux kernels in KVM for testing.
## KtstrVm builder
```rust,ignore
let result = vmm::KtstrVm::builder()
.kernel(&kernel_path)
.init_binary(&ktstr_binary)
.topology(Topology::new(numa_nodes, llcs, cores_per_llc, threads_per_core))
.memory_mib(4096)
.run_args(&["run".into(), "--ktstr-test-fn".into(), "my_test".into()])
.build()?
.run()?;
```
## Topology
The VM topology is specified as `(numa_nodes, llcs, cores_per_llc,
threads_per_core)`. On x86_64, the VMM creates ACPI tables (MADT,
SRAT, SLIT, and HMAT when `numa_nodes > 1`) and MP tables. On
aarch64, topology is expressed via FDT cpu nodes with MPIDR-derived
`reg` properties.
```rust,ignore
pub struct Topology {
pub llcs: u32,
pub cores_per_llc: u32,
pub threads_per_core: u32,
pub numa_nodes: u32,
pub nodes: Option<&'static [NumaNode]>,
pub distances: Option<&'static NumaDistance>,
}
```
`total_cpus()` = llcs * cores_per_llc * threads_per_core.
`num_llcs()` = llcs.
When `nodes` is `None` (the default), memory and LLCs are distributed
uniformly across NUMA nodes with default 10/20 distances. When
`Some`, each `NumaNode` specifies its LLC count, memory size, and
optional HMAT attributes (`latency_ns`, `bandwidth_mbs`,
`mem_side_cache`). A `NumaNode` with `llcs = 0` models a CXL
memory-only node.
`NumaDistance` is an NxN inter-node distance matrix. Diagonal entries
must be 10, off-diagonal > 10, and the matrix must be symmetric (ACPI
SLIT requirements).
Use `Topology::new(numa_nodes, llcs, cores, threads)` for uniform
topologies, or `Topology::with_nodes(cores, threads, &nodes)` for
explicit per-node configuration.
## initramfs
The VMM builds a cpio initramfs containing:
- The test binary (as `/init`)
- Optional scheduler binary (as `/scheduler`)
- Shared library dependencies (resolved via ELF DT_NEEDED parsing)
The initramfs is cached based on a cache key derived from the binary
contents. A compressed SHM segment enables COW overlay into guest
memory, sharing physical pages across concurrent VMs.
## Guest-host communication
**Serial console** -- COM2 carries guest stdout/stderr, the
canonical crash diagnostic transport. The guest panic hook writes
`PANIC: <info>\n<bt>\n` to COM2; the host parses it via
`extract_panic_message` and surfaces the backtrace in test failure
output. The legacy COM2 result / exit-code fallback (delimited
`===KTSTR_TEST_RESULT_START===` / `_END===` sentinels and
`KTSTR_EXIT=N` lines) was removed pre-1.0 — virtio-console port-1
(below) is the only result transport.
**Virtio-console port 1 TLV stream** -- the primary guest-to-host
data channel. Carries scenario markers (`MSG_TYPE_SCENARIO_START`,
`MSG_TYPE_SCENARIO_END`), test results (`MSG_TYPE_TEST_RESULT`),
exit codes (`MSG_TYPE_EXIT`), stimulus events (`MSG_TYPE_STIMULUS`),
scheduler exit notifications (`MSG_TYPE_SCHED_EXIT`), profraw
coverage data (`MSG_TYPE_PROFRAW`), per-payload-invocation metrics
(`MSG_TYPE_PAYLOAD_METRICS`), and raw LlmExtract output
(`MSG_TYPE_RAW_PAYLOAD_OUTPUT`). Each TLV frame has a CRC32 for
integrity checking.
## Virtio devices
The VMM implements three virtio-MMIO devices in addition to the
serial console above. All three speak the virtio 1.x MMIO transport
(virtio-v1.2 §4.2.2) with `VIRTIO_F_VERSION_1` and use irqfd
(eventfd → KVM GSI) for interrupt delivery.
- **virtio-blk** (`vmm::virtio_blk`) -- file-backed block device
with a single request virtqueue and a token-bucket throttle.
Used to give workloads real on-disk filesystems (per-test images
cloned from a btrfs template). Advertises
`VIRTIO_BLK_F_BLK_SIZE`, `VIRTIO_BLK_F_SEG_MAX`,
`VIRTIO_BLK_F_SIZE_MAX`, `VIRTIO_BLK_F_FLUSH`, and
`VIRTIO_RING_F_EVENT_IDX`, plus `VIRTIO_BLK_F_RO` when configured
read-only.
- **virtio-net** (`vmm::virtio_net`) -- two-virtqueue (RX, TX) NIC
with an in-VMM L2 loopback backend. Used by network-shaped
workloads (TCP/UDP throughput, latency) without depending on the
host's network stack. Advertises `VIRTIO_NET_F_MAC` so the guest
binds a deterministic MAC.
- **virtio-console** (`vmm::virtio_console`) -- three-port multiport
console with eight virtqueues (per virtio-v1.2 §5.3.5: two control
queues plus an in/out pair per port, three ports → 2 + 2·3 = 8).
Port 0 carries the interactive `/dev/hvc0` console alongside the
COM1/COM2 16550 serial ports; port 1 carries the guest-to-host TLV
stream that delivers exit code, test result, per-payload metrics,
raw payload outputs, profraw, and scheduler exit notifications;
port 2 is a transparent byte-pipe relay carrying scx_stats request
bytes from the host to the in-guest relay thread and the
scheduler's responses back. Advertises
`VIRTIO_CONSOLE_F_MULTIPORT` with `max_nr_ports = 3`.
## Performance mode
When `performance_mode` is enabled, the VMM applies host-side
isolation (vCPU pinning, hugepages, NUMA mbind, RT scheduling),
guest-visible hints (KVM_HINTS_REALTIME CPUID), and KVM exit
suppression. Non-performance-mode VMs set `KVM_CAP_HALT_POLL` to
200us; overcommitted topologies set it to 0.
See [Performance Mode](../concepts/performance-mode.md) for the
full optimization list, prerequisites, and validation.
## Dual-role architecture
The same test binary serves two roles:
**Host side** -- manages the VM lifecycle: builds the initramfs, boots
the kernel, runs the monitor, and evaluates results.
**Guest side** -- runs inside the VM as `/init` (PID 1). The Rust init
code (`vmm::rust_init`) mounts filesystems, starts the scheduler,
dispatches the test function, then reboots.
The role is determined at runtime:
- **PID 1 detection**: when running as PID 1, the `#[ctor]` function
`ktstr_test_early_dispatch()` runs the guest init path, which handles
the full guest lifecycle.
- **`#[ktstr_test]` host dispatch**: a `#[ctor::ctor]` function
(`ktstr_test_early_dispatch`) runs before `main()` in any binary
that links against ktstr. When both `--ktstr-test-fn` and `--ktstr-topo`
are present, it boots a VM and runs the test inside it.
- **`#[ktstr_test]` guest dispatch**: when only `--ktstr-test-fn` is
present (no `--ktstr-topo`), the ctor runs the test function
directly -- the binary is already inside a VM.
This design means one `cargo build` produces everything needed for
both host and guest execution. The initramfs embeds the same binary
that built it.
## Boot process
1. Load kernel (bzImage on x86_64, Image on aarch64) via `linux-loader`.
2. Set up KVM vCPUs with the specified topology. vCPU creation
takes `kvm->lock` twice in the kernel (`kvm_vm_ioctl_create_vcpu`
at `virt/kvm/kvm_main.c:4158`): once for the
`created_vcpus` counter + per-arch precreate hook, then released
before the per-vCPU allocations, and reacquired for vcpu-list
insertion. The largest cost — the per-vCPU FPU `vzalloc` on
x86_64 (`fpu_alloc_guest_fpstate` at
`arch/x86/kernel/fpu/core.c:242`) — runs between the two lock
acquisitions and can parallelize across vCPUs (aarch64 has its
own per-vCPU init path with analogous costs). High vCPU counts
still add measurable boot latency even with the concurrency,
because each vCPU pays the alloc + TSC-sync cost serially within
its own thread. See [Performance Mode](../concepts/performance-mode.md).
3. Build and load initramfs.
4. Set up serial devices (COM1 for console, COM2 for results).
5. Boot the kernel.
6. Kernel starts `/init` (the test binary).
7. PID 1 detected: the guest init path mounts filesystems, starts the
scheduler, dispatches the test function, and reboots.