# Host-State Profiler
The host-state profiler captures a host-wide per-thread snapshot
of scheduling counters, memory / I/O accounting, CPU affinity,
cgroup state, and thread identity, then compares two snapshots to
surface what changed. It is a manually-invoked CLI companion to
the automated scheduler tests — useful when a run passes on one
machine and fails on another, or for A/B comparing host behaviour
across kernel / sysctl / workload changes.
This is a **different tool** from `ktstr show-host` /
`cargo ktstr show-host`, which captures the host *context*
(kernel, CPU model, sched_\* tunables, NUMA layout, kernel
cmdline) — aggregate state that does not change between
scenarios. The profiler captures *per-thread* cumulative counters
that do change, and its comparison surface is designed for the
thread-level diff.
## When to use it
- **Workload investigation** — you observe a regression and want
to know which process / thread pool moved in run time,
context-switch rate, or migration count.
- **Kernel / sysctl A/B** — capture before and after flipping a
sched_\* tunable on an otherwise-identical workload; the
compare output surfaces every counter that responded.
- **Host baselining** — capture on a known-good host, capture on
a failing host, compare to isolate what differs at the
thread-behaviour level.

The profiler is **not** invoked automatically by scenarios or the
gauntlet. It is opt-in and operator-driven via the
`ktstr host-state` subcommand.
## Capture
```sh
ktstr host-state capture --output baseline.hst.zst
# ... run workload, change a tunable, reboot a kernel, etc. ...
ktstr host-state capture --output after.hst.zst
```
`capture` walks `/proc` for every live thread group, enumerates
each thread, and reads a handful of procfs sources for each one.
The output is a zstd-compressed JSON snapshot (conventional
extension: `.hst.zst`).
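A minimal sketch of the walk described above, assuming the standard procfs layout `/proc/<tgid>/task/<tid>`. The function name `list_threads` and the `proc_root` parameter are illustrative and not taken from the ktstr source; passing the root explicitly lets the walk run against a fixture directory.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Collect `(tgid, tid)` pairs by walking `<proc_root>/<tgid>/task/<tid>`.
/// `proc_root` is normally `/proc`; it is a parameter only for testability.
pub fn list_threads(proc_root: &Path) -> io::Result<Vec<(u32, u32)>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(proc_root)? {
        let entry = entry?;
        // Only numeric directory names are thread groups; skip `self`, `sys`, etc.
        let Some(tgid) = entry.file_name().to_str().and_then(|s| s.parse::<u32>().ok()) else {
            continue;
        };
        // A process can exit mid-walk; treat that as "no threads", not an error.
        let Ok(tasks) = fs::read_dir(entry.path().join("task")) else {
            continue;
        };
        for task in tasks.flatten() {
            if let Some(tid) = task.file_name().to_str().and_then(|s| s.parse::<u32>().ok()) {
                out.push((tgid, tid));
            }
        }
    }
    out.sort_unstable();
    Ok(out)
}
```

Calling `list_threads(Path::new("/proc"))` on a Linux host returns every live `(tgid, tid)` pair visible to the caller at the moment of the walk.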
### What is captured per thread
- **Identity** — tid, tgid, `pcomm` (process name from
`/proc/<tgid>/comm`), `comm` (thread name from
`/proc/<tid>/comm`), cgroup v2 path, `start_time` in USER_HZ
clock ticks, scheduling policy name, nice, CPU affinity mask.
- **Scheduling counters** (cumulative, from `/proc/<tid>/sched`;
requires `CONFIG_SCHED_DEBUG`) — `run_time_ns`, `wait_time_ns`,
`timeslices`, `voluntary_csw`, `nonvoluntary_csw`, `nr_wakeups`
(plus `_local` / `_remote` / `_sync` / `_migrate` / `_idle`
splits), `nr_migrations`, `wait_sum` / `wait_count`,
`sleep_sum`, `block_sum` / `block_count`, `iowait_sum` /
`iowait_count`.
- **Memory** — `minflt` / `majflt` from `/proc/<tid>/stat`.
`allocated_bytes` / `deallocated_bytes` from the jemalloc
per-thread-destructor TSD cache — populated only for processes
linked against jemalloc; glibc arena counters are opaque and
read as zero rather than failing capture.
- **I/O** — `rchar`, `wchar`, `syscr`, `syscw`, `read_bytes`,
`write_bytes` from `/proc/<tid>/io` (requires
`CONFIG_TASK_IO_ACCOUNTING`).

Every counter is **cumulative-from-birth**, so the moment of
capture does not determine what is counted: a snapshot reports
everything the thread has done since it was created, and the
difference between two snapshots of the same thread is the work
done in the interval between them, provided no counter has
wrapped. Metrics that reset on attachment (perf_event_open
counters, BPF tracing samples, etc.) are intentionally absent:
they require long-lived instrumentation that the capture layer
cannot install without disturbing the system it is measuring.
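To illustrate one of the procfs sources listed above, here is a sketch of parsing the `key: value` lines of `/proc/<tid>/io`. The struct and function names (`IoCounters`, `parse_io`) are hypothetical; the actual reader in ktstr may be shaped differently.

```rust
/// I/O counters as they appear in `/proc/<tid>/io`.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct IoCounters {
    pub rchar: u64,
    pub wchar: u64,
    pub syscr: u64,
    pub syscw: u64,
    pub read_bytes: u64,
    pub write_bytes: u64,
}

/// Parse the text of an io file. Unknown keys (for example
/// `cancelled_write_bytes`) and malformed lines are skipped.
pub fn parse_io(text: &str) -> IoCounters {
    let mut io = IoCounters::default();
    for line in text.lines() {
        let Some((key, value)) = line.split_once(':') else { continue };
        let Ok(value) = value.trim().parse::<u64>() else { continue };
        match key.trim() {
            "rchar" => io.rchar = value,
            "wchar" => io.wchar = value,
            "syscr" => io.syscr = value,
            "syscw" => io.syscw = value,
            "read_bytes" => io.read_bytes = value,
            "write_bytes" => io.write_bytes = value,
            _ => {}
        }
    }
    io
}
```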
### Capture is best-effort
Each internal reader returns `Option`; a kernel without
`CONFIG_SCHED_DEBUG` yields `None` from the schedstat reader
without failing the rest of the thread. Counters collapse to `0`,
identity strings collapse to empty, affinity collapses to an
empty vec. **A missing reading is indistinguishable from a
genuine zero in the output** — the contract is "never fail the
snapshot." Tests that need stronger guarantees inspect the
underlying readers directly (they remain `Option`-shaped and are
unit-tested in the module).
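The collapse described above can be sketched as follows. `ThreadRow` and `collapse` are hypothetical names used only for illustration; the point is that once `Option` values are flattened into defaults, the output can no longer tell "unavailable" apart from "zero".

```rust
/// Illustrative output row: plain values, no `Option`.
#[derive(Debug, Default, PartialEq)]
pub struct ThreadRow {
    pub run_time_ns: u64,
    pub comm: String,
    pub cpu_affinity: Vec<u32>,
}

/// Flatten `Option`-shaped reader results into a row that never fails:
/// counters become 0, strings become empty, affinity becomes an empty vec.
pub fn collapse(
    run_time_ns: Option<u64>,
    comm: Option<String>,
    cpu_affinity: Option<Vec<u32>>,
) -> ThreadRow {
    ThreadRow {
        run_time_ns: run_time_ns.unwrap_or(0),
        comm: comm.unwrap_or_default(),
        cpu_affinity: cpu_affinity.unwrap_or_default(),
    }
}
```

A reader that returned `None` and a reader that returned `Some(0)` produce identical rows, which is why tests needing the distinction have to look at the readers themselves.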
### Per-cgroup enrichment
Every cgroup that contains at least one sampled thread gets a
`CgroupStats` entry with `cpu_usage_usec`, `nr_throttled`,
`throttled_usec`, and `memory_current`. These are read directly
from the cgroup v2 files (`cpu.stat`, `memory.current`) rather
than derived from per-thread data, because they are aggregate
values over the whole cgroup.
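A sketch of reading those two files, assuming the flat `key value` format that cgroup v2 uses for `cpu.stat` and the single integer in `memory.current`. The type name `CgroupStatsSketch` is deliberately distinct from the real `CgroupStats`, and mapping the kernel's `usage_usec` key to `cpu_usage_usec` is an assumption made for this example.

```rust
/// Illustrative stand-in for the per-cgroup entry described above.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct CgroupStatsSketch {
    pub cpu_usage_usec: u64,
    pub nr_throttled: u64,
    pub throttled_usec: u64,
    pub memory_current: u64,
}

/// Build the entry from the text of `cpu.stat` and `memory.current`.
/// Missing or malformed values stay at zero, mirroring best-effort capture.
pub fn parse_cgroup_stats(cpu_stat: &str, memory_current: &str) -> CgroupStatsSketch {
    let mut stats = CgroupStatsSketch::default();
    for line in cpu_stat.lines() {
        let mut fields = line.split_whitespace();
        let (Some(key), Some(value)) = (fields.next(), fields.next()) else { continue };
        let Ok(value) = value.parse::<u64>() else { continue };
        match key {
            "usage_usec" => stats.cpu_usage_usec = value,
            "nr_throttled" => stats.nr_throttled = value,
            "throttled_usec" => stats.throttled_usec = value,
            _ => {} // user_usec, system_usec, nr_periods, ... are ignored here
        }
    }
    stats.memory_current = memory_current.trim().parse().unwrap_or(0);
    stats
}
```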
### Snapshot identity
The top-level `HostStateSnapshot` also embeds a `HostContext`
(the same structure `show-host` prints: kernel, CPU, memory,
sched_\* tunables, cmdline). Snapshots written by older tools, or
synthetic fixtures that omit the context, render as
`(host context unavailable)` rather than failing the compare.
### Cgroup namespace caveat
The per-thread `cgroup` path is read verbatim from
`/proc/<tid>/cgroup` — it is therefore relative to the **cgroup
namespace root the capturing process sees**, NOT the
system-global v2 mount root. A process inside a nested cgroup
namespace sees a truncated path; a process outside sees a longer
one. Cross-namespace comparison requires external
canonicalization (the capture layer deliberately does not attempt
it because the right resolution depends on capture-site privilege
and namespace visibility).
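For reference, a sketch of extracting the v2 path from the text of `/proc/<tid>/cgroup`, assuming the documented `hierarchy-ID:controller-list:path` line format in which the unified hierarchy appears as `0::<path>`. The helper name `cgroup_v2_path` is illustrative. Like the capture layer, it returns the path exactly as the kernel reports it for the reader's namespace.

```rust
/// Return the cgroup v2 path from the contents of a `/proc/<tid>/cgroup`
/// file, or `None` when no unified-hierarchy (`0::`) line is present.
/// The path is returned verbatim; no namespace canonicalization is attempted.
pub fn cgroup_v2_path(text: &str) -> Option<&str> {
    text.lines().find_map(|line| line.strip_prefix("0::"))
}
```

On a hybrid host that still mounts v1 controllers, the v1 lines (nonzero hierarchy IDs) are skipped and only the `0::` entry is returned.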
## Compare
```sh
ktstr host-state compare before.hst.zst after.hst.zst
```
`compare` joins the two snapshots on the group key (`pcomm` by
default) and emits one row per `(group, metric)` pair. Groups
present on only one side surface as **unmatched**: the row is
missing because the group did not exist in that snapshot, not
because it did zero work.
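The join can be pictured as a full outer join over group keys. The sketch below works on one already-aggregated metric per group; `Joined` and `join` are hypothetical names, not the types in `src/host_state_compare.rs`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Result of joining one group key across two snapshots.
#[derive(Debug, PartialEq)]
pub enum Joined {
    Both { before: u64, after: u64 },
    OnlyBefore(u64), // unmatched: group absent from the after snapshot
    OnlyAfter(u64),  // unmatched: group absent from the before snapshot
}

/// Full outer join of two `group -> value` maps.
pub fn join(
    before: &BTreeMap<String, u64>,
    after: &BTreeMap<String, u64>,
) -> BTreeMap<String, Joined> {
    let keys: BTreeSet<&String> = before.keys().chain(after.keys()).collect();
    keys.into_iter()
        .map(|key| {
            let row = match (before.get(key), after.get(key)) {
                (Some(&b), Some(&a)) => Joined::Both { before: b, after: a },
                (Some(&b), None) => Joined::OnlyBefore(b),
                (None, Some(&a)) => Joined::OnlyAfter(a),
                (None, None) => unreachable!("key came from one of the maps"),
            };
            (key.clone(), row)
        })
        .collect()
}
```

Keeping unmatched groups as their own variant, rather than filling the missing side with zero, preserves the distinction between "did not exist" and "did no work".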
### Grouping
- `--group-by pcomm` (default) — aggregate every thread of the
same process together.
- `--group-by cgroup` — aggregate by cgroup path. Useful for
container-per-workload deployments where the process name is
ambiguous across cgroups.
- `--group-by comm` — aggregate by thread name across every
process. Useful when a thread-pool name like `tokio-worker`
spans many binaries and you want one row per pool, not per
binary.
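The three grouping modes differ only in which identity field serves as the key. A sketch for a single summed metric, with hypothetical `Thread`, `GroupBy`, and `sum_run_time` names:

```rust
use std::collections::BTreeMap;

/// Minimal per-thread record for the example.
pub struct Thread {
    pub pcomm: String,
    pub comm: String,
    pub cgroup: String,
    pub run_time_ns: u64,
}

/// Mirrors the `--group-by` choices listed above.
#[derive(Clone, Copy)]
pub enum GroupBy {
    Pcomm,
    Cgroup,
    Comm,
}

/// Sum `run_time_ns` over every thread that shares the selected key.
pub fn sum_run_time(threads: &[Thread], by: GroupBy) -> BTreeMap<String, u64> {
    let mut groups = BTreeMap::new();
    for t in threads {
        let key = match by {
            GroupBy::Pcomm => &t.pcomm,
            GroupBy::Cgroup => &t.cgroup,
            GroupBy::Comm => &t.comm,
        };
        *groups.entry(key.clone()).or_insert(0u64) += t.run_time_ns;
    }
    groups
}
```

With two binaries that each run a `tokio-worker` thread, grouping by `Pcomm` yields one row per binary while grouping by `Comm` merges the workers into a single row.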
### Cgroup-path flattening
```sh
ktstr host-state compare before.hst.zst after.hst.zst \
--group-by cgroup \
--cgroup-flatten '/kubepods/*/pod-*/container' \
--cgroup-flatten '/system.slice/*.scope'
```
`--cgroup-flatten` accepts glob patterns that collapse dynamic
segments (pod UUIDs, session scopes, transient unit IDs) to a
canonical form before grouping, so the same logical workload
across two runs lands on the same row even if the kernel
assigned different UUIDs.
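One way to picture flattening is shown below. This is only a sketch under stated assumptions: it supports `*` within a single path segment and nothing else, it requires the pattern and path to have the same number of segments, and it uses the pattern text itself as the canonical group key. The real matcher and its canonical form may differ; `segment_matches` and `flatten` are hypothetical names.

```rust
/// Match one path segment against a pattern segment in which `*` matches
/// any run of characters (possibly empty). No other glob syntax is handled.
fn segment_matches(pattern: &str, segment: &str) -> bool {
    let parts: Vec<&str> = pattern.split('*').collect();
    if parts.len() == 1 {
        return pattern == segment; // no wildcard: exact match
    }
    let (first, last) = (parts[0], parts[parts.len() - 1]);
    if segment.len() < first.len() + last.len()
        || !segment.starts_with(first)
        || !segment.ends_with(last)
    {
        return false;
    }
    // Literals between wildcards must appear in order inside the remainder.
    let mut rest = &segment[first.len()..segment.len() - last.len()];
    for &mid in &parts[1..parts.len() - 1] {
        match rest.find(mid) {
            Some(i) => rest = &rest[i + mid.len()..],
            None => return false,
        }
    }
    true
}

/// Return the first matching pattern as the group key, or the path
/// unchanged when no pattern matches.
pub fn flatten(path: &str, patterns: &[&str]) -> String {
    let segments: Vec<&str> = path.split('/').collect();
    for &pattern in patterns {
        let pat_segments: Vec<&str> = pattern.split('/').collect();
        if pat_segments.len() == segments.len()
            && pat_segments.iter().zip(&segments).all(|(p, s)| segment_matches(p, s))
        {
            return pattern.to_string();
        }
    }
    path.to_string()
}
```

Under these assumptions, `/system.slice/session-12.scope` in one snapshot and `/system.slice/session-98.scope` in the other both map to the key `/system.slice/*.scope` and therefore join on the same row.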
### Aggregation rules
Each metric declares its own aggregation rule
([`HOST_STATE_METRICS`] in `src/host_state_compare.rs`):
- **Sum** (most counters) — cumulative values add. Delta is the
signed difference; percent delta is relative to the before side.
- **OrdinalRange** (`nice`) — min / max across the group; the
renderer shows `[min, max]` and the delta uses the midpoint
so a shift on either end is visible.
- **Mode** (`policy`) — the most-common policy name and its
share of the group. No scalar, so rows sort to the bottom of
the default sort.
- **Affinity** (`cpu_affinity`) — aggregates into an
`AffinitySummary` with `min_cpus` / `max_cpus` and a `uniform`
flag. Heterogeneous groups render as `"N-M cpus (mixed)"`.
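The first three rules above can be sketched as small pure functions. The names are illustrative, and returning `None` for the percent delta when the before side is zero is an assumption of this example; the source does not say how the real renderer handles that case.

```rust
use std::collections::BTreeMap;

/// Sum rule: signed delta plus percent change relative to the before side.
/// Percent is `None` when `before` is zero (assumed handling).
pub fn sum_delta(before: u64, after: u64) -> (i128, Option<f64>) {
    let delta = after as i128 - before as i128;
    let percent = (before != 0).then(|| delta as f64 * 100.0 / before as f64);
    (delta, percent)
}

/// OrdinalRange rule: midpoint of `[min, max]`, used as the scalar for the
/// delta so that a shift at either end moves it.
pub fn midpoint(min: i32, max: i32) -> f64 {
    (min as f64 + max as f64) / 2.0
}

/// Mode rule: the most common value and its share of the group.
/// Returns `None` for an empty group.
pub fn mode<'a>(values: &[&'a str]) -> Option<(&'a str, f64)> {
    let mut counts: BTreeMap<&'a str, usize> = BTreeMap::new();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by_key(|&(_, n)| n)
        .map(|(v, n)| (v, n as f64 / values.len() as f64))
}
```

For example, a `nice` range that moves from `[0, 10]` to `[0, 19]` has midpoints 5.0 and 9.5, so the delta is 4.5 even though the minimum did not change.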
### Output and interpretation
The comparison prints **raw numbers and percent delta**. There
are no judgment labels (regression vs. improvement) — the
meaning of "run_time went up 15%" depends on whether you were
measuring a CPU-bound workload (more work done) or a spin-wait
pathology (more time wasted). The interpretation is
scheduler-specific and left to the operator.

Sort order: by default, rows are sorted by absolute delta
(largest movers first) so the most-changed metrics surface at
the top. Rows with no numeric scalar (`policy`, heterogeneous
affinity) fall to the bottom.
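The default ordering can be sketched as a comparator over an optional scalar. `Row` and `sort_rows` are hypothetical; the real row type carries more fields.

```rust
use std::cmp::Ordering;

/// Illustrative compare row: `delta` is `None` for metrics without a scalar.
pub struct Row {
    pub metric: &'static str,
    pub delta: Option<f64>,
}

/// Sort by absolute delta, largest first; rows without a scalar go last.
pub fn sort_rows(rows: &mut [Row]) {
    rows.sort_by(|a, b| match (a.delta, b.delta) {
        (Some(x), Some(y)) => y.abs().total_cmp(&x.abs()),
        (Some(_), None) => Ordering::Less,
        (None, Some(_)) => Ordering::Greater,
        (None, None) => Ordering::Equal,
    });
}
```

Because the comparison uses the absolute value, a large decrease ranks alongside a large increase, which matches the "no judgment labels" stance described above.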
## File format
`.hst.zst` is zstd-compressed JSON of `HostStateSnapshot`. The
type is marked `#[non_exhaustive]`, so adding fields is not a
breaking change for Rust code that consumes it:
```
HostStateSnapshot
├── captured_at_unix_ns: u64
├── host: Option<HostContext>
├── threads: Vec<ThreadState>
└── cgroup_stats: BTreeMap<String, CgroupStats>
```
`ThreadState::start_time_clock_ticks` is in USER_HZ (100 on
x86_64 and aarch64), NOT the kernel-internal CONFIG_HZ — so
cross-host comparison between differently-configured kernels on
those architectures is meaningful. Other in-tree architectures
(alpha, for instance, with USER_HZ=1024) would require normalization
at capture time; the capture layer currently targets x86_64 and
aarch64 only.
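Converting the tick value to wall-clock units is a single division once USER_HZ is fixed. The sketch below hard-codes 100, which holds for the two targeted architectures only; the constant and function names are illustrative.

```rust
use std::time::Duration;

/// USER_HZ on x86_64 and aarch64. Other architectures may differ.
pub const USER_HZ: u64 = 100;

/// Convert a `start_time` value in USER_HZ clock ticks (time since boot)
/// into a `Duration`. One tick is 10 ms when USER_HZ is 100.
pub fn ticks_to_duration(ticks: u64) -> Duration {
    Duration::from_millis(ticks * 1000 / USER_HZ)
}
```

A thread whose `start_time_clock_ticks` is 250 therefore started 2.5 seconds after boot.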

Snapshots are written at zstd compression level `3` (matching the
ktstr remote-cache convention). This gives an adequate ratio at
fast speed, and host-state captures are small enough that higher
levels yield diminishing returns on I/O.
## Related
- [`cargo ktstr show-host`](../running-tests/cargo-ktstr.md) —
captures the host *context* (kernel, CPU, tunables) that the
profiler embeds as the `host` field. Use `show-host` when you
want to inspect host configuration only, without the
per-thread walk.
- [Capture and Compare Host State](../recipes/host-state.md) —
recipe covering the `show-host` / `stats compare` flow for
comparing host *context* across sidecars (not the per-thread
profiler).
- [Environment Variables](environment-variables.md) — every
ktstr-controlled env var.