//! `cargo hyperlight perf` — Profile Hyperlight guest execution with `perf kvm`.
//!
//! This subcommand automates the workflow of generating guest symbol information
//! and running `perf kvm` to profile code executing inside Hyperlight micro-VMs.
//!
//! # How it works
//!
//! Hyperlight loads guest PIE ELF binaries at a configurable base address (default
//! `0x1000` with init-paging). `perf kvm` resolves guest samples using a
//! kallsyms-format text file (`--guestkallsyms`) containing symbol addresses
//! shifted to match the runtime guest layout (ELF VA + base address).
//!
//! This command:
//! 1. Reads the guest ELF binary using the `object` crate
//! 2. Generates a kallsyms file with addresses shifted by the base address
//! 3. Runs `perf kvm record` with the appropriate flags
//! 4. Displays a `perf kvm report` with demangled symbols
//!
//! To mitigate sample misattribution on pre-Ice Lake CPUs (see below), the
//! generated kallsyms includes synthetic `__gap__` symbols between functions
//! wherever there are inter-function regions (alignment padding, unused code).
//! This prevents perf's `symbols__fixup_end()` from stretching function ranges
//! across gaps, which would cause skidded NMI samples to be misattributed to
//! the preceding function.
//!
//! **Important:** The gap marker name must NOT use bracket characters (e.g.
//! `[gap]`), because perf's kallsyms parser interprets `[name]` as a kernel
//! module annotation (like `/proc/kallsyms` lines ending in `[module_name]`).
//! Using brackets corrupts the symbol table and causes addresses inside
//! nearby functions to become unresolvable (shown as raw hex in reports).
//!
//! # Why gap markers are needed (and when they matter)
//!
//! ## The kallsyms format has no size information
//!
//! `perf kvm --guestkallsyms` accepts a kallsyms-format file: lines of
//! `address type name` — nothing else. This format was designed for Linux
//! kernel profiling, where `/proc/kallsyms` lists kernel symbols that are
//! typically laid out contiguously with no gaps. Crucially, **kallsyms does
//! not carry `st_size`** — there is no way to express a symbol's extent.
//!
//! ## `symbols__fixup_end()` assumes contiguous layout
//!
//! Since kallsyms has no size field, perf's `symbols__fixup_end()`
//! (tools/perf/util/symbol.c) sets each symbol's end address to the start
//! of the next symbol. For contiguous kernel text this is correct, but for
//! a general ELF binary it's wrong: functions may have alignment padding
//! or gaps from linker section placement between them. Without gap markers,
//! `symbols__fixup_end()` would stretch each function's range to the next
//! function.
//!
//! ## Why `perf kvm` can't just read the ELF
//!
//! Normal userspace profiling (`perf record ./binary`) doesn't have this
//! problem — perf reads the ELF directly via `/proc/<pid>/maps` + the
//! binary's `.symtab`/`.dynsym`, which include `st_size`. But KVM guest
//! profiling goes through a completely different code path: the guest RIP
//! in samples is a guest virtual address with no associated host process
//! or `/proc` mapping. `perf kvm` resolves these addresses using the
//! kallsyms mechanism (designed for kernel symbol resolution), which has
//! no concept of ELF symbol sizes. There is no `--guest-elf` option.
//!
//! ## When gap markers matter
//!
//! **Pre-Ice Lake (no guest PEBS):** NMI skid causes the sampled guest RIP
//! to be tens to hundreds of instructions away from the true overflow point.
//! Skidded samples can land in gap regions (alignment padding, unused code).
//! Without gap markers, `symbols__fixup_end()` stretches the preceding
//! function's range to cover the gap, and these skidded samples are
//! misattributed to that function. Gap markers absorb these samples instead.
//!
//! **Ice Lake+ (guest PEBS, `precise_ip=3`):** PEBS records the exact
//! instruction that retired at counter overflow. Nothing in a gap region
//! is ever executed (gaps hold only alignment padding such as `nop`/`int3`
//! bytes, or unused code), so no instruction retires there and no sample
//! has an IP in a gap. Whether or not `symbols__fixup_end()` stretches
//! ranges across gaps therefore has no effect on `perf report` output:
//! the sample counts are identical either way. Gap markers are harmless
//! here but have no practical impact.
//!
//! # Subcommands
//!
//! - `cargo hyperlight perf record` — Record samples (like `perf record`).
//! - `cargo hyperlight perf report` — Display a report from recorded data
//!   (like `perf report`).
//!
//! # Modes
//!
//! - **Guest-only** (default): `perf kvm record` captures only guest samples.
//! - **Combined** (`--host`): `perf kvm --host --guest` captures host and
//!   guest samples scoped to the workload process tree.
//!
//! # Requirements
//!
//! The guest ELF binary must contain a `.symtab` section with function symbols.
//! Debug info (`.debug_*` sections) is **not** needed — only the symbol table
//! matters. Rust release builds (which omit debug info by default) work fine
//! since `.symtab` is preserved. For Rust, only `strip = "symbols"` or
//! `strip = true` in the Cargo profile will remove `.symtab` and break
//! profiling. For C/C++, `strip -s` / `--strip-all` has the same effect;
//! `strip --strip-debug` is safe.
//!
//! # Limitations
//!
//! Flat profiles only (no guest call stacks). `perf kvm` cannot unwind the
//! guest stack because guest virtual addresses are not resolvable through host
//! page tables.
//!
//! # Known issue: guest IP imprecision on pre-Ice Lake CPUs
//!
//! On pre-Ice Lake CPUs, `perf kvm` guest profiles are **unreliable for
//! function-level attribution**. Samples may appear in never-called functions.
//! This is a hardware limitation, not a software bug.
//!
//! ## Root cause
//!
//! Guest PEBS is only available on Ice Lake+. On older CPUs, `perf kvm`
//! falls back to NMI-based sampling (`precise_ip=0`). The PMU counter
//! overflows at instruction X, but the NMI is recognized many instructions
//! later (skid). The NMI triggers a VMEXIT, and KVM reads `GUEST_RIP` from
//! the VMCS, which reflects the skidded position, not the overflow point.
//!
//! The KVM path: `vmx_vcpu_enter_exit()` → NMI exit → `vmx_do_nmi_irqoff()`
//! → host NMI handler → `perf_instruction_pointer()` → `kvm_rip_read(vcpu)`
//! → `vmcs_readl(GUEST_RIP)`.
//!
//! ## Consequences
//!
//! On Broadwell, empirical analysis showed the captured IPs are **byte-level
//! random** within hot code regions:
//!
//! - Most guest IPs land at non-instruction-boundary addresses, at a rate
//!   matching random chance given the average x86 instruction length.
//! - IPs do cluster in genuinely hot ~KB-scale code regions (cold code
//!   gets zero samples), but within those regions the byte position is
//!   random.
//! - Function attribution is proportional to byte size, not execution
//!   frequency. Large functions in hot regions attract disproportionate
//!   samples even if never called.
//!
//! ## Workarounds
//!
//! - **Native profiling**: Build guest code as a native binary and profile
//!   with `perf record -e cycles:pp` for PEBS-quality results.
//! - **Upgrade to Ice Lake+**: Enables guest PEBS with `precise_ip=3`.
//! - **Treat profiles as region-level heatmaps**: ~KB-scale region hotness
//!   is valid; per-function percentages are not.
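// Illustrative sketch (not perf's actual code): the module docs above explain
// that a size-less kallsyms table forces each symbol's range to run up to the
// start of the next symbol. The hypothetical helpers below (`Sym`,
// `parse_kallsyms`, `resolve`) mimic only that effect, so that the difference
// a `__gap__` line makes can be checked on a small example. Treating the last
// symbol as open-ended is a simplification assumed here.

```rust
/// One parsed kallsyms entry: start address and symbol name.
#[derive(Debug, Clone, PartialEq)]
struct Sym {
    addr: u64,
    name: String,
}

/// Parse kallsyms-format text (`address type name` per line) and sort it
/// by address. Lines that do not match the format are skipped.
fn parse_kallsyms(text: &str) -> Vec<Sym> {
    let mut syms: Vec<Sym> = text
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let addr = u64::from_str_radix(parts.next()?, 16).ok()?;
            let _kind = parts.next()?; // symbol type, e.g. `T`
            let name = parts.next()?.to_string();
            Some(Sym { addr, name })
        })
        .collect();
    syms.sort_by_key(|s| s.addr);
    syms
}

/// Resolve `ip` as a size-less table must: the owning symbol is the last
/// one whose start address is less than or equal to `ip`.
fn resolve(syms: &[Sym], ip: u64) -> Option<&str> {
    // Index of the first symbol that starts strictly after `ip`.
    let idx = syms.partition_point(|s| s.addr <= ip);
    idx.checked_sub(1).map(|i| syms[i].name.as_str())
}
```

// With only `foo` at 0x2000 and `bar` at 0x2100 in the table, an IP of 0x2050
// resolves to `foo`; adding a `__gap__` line at 0x2040 makes it resolve to
// `__gap__` instead, while 0x2010 still resolves to `foo`.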
use OsString;
use Write as _;
use fs;
use Path;
use PathBuf;
use ;
use ;
use ElfFile64;
use ;
/// Default base address where Hyperlight loads guest binaries (init-paging).
const DEFAULT_BASE_ADDRESS: u64 = 0x1000;
/// Profile Hyperlight guest execution with perf kvm (Linux/KVM only).
/// Subcommands for `cargo hyperlight perf`.
/// Main entry point for `cargo hyperlight perf`.
///
/// The iterator should start with the subcommand name ("perf"), which
/// clap consumes as the binary name (argv\[0\]).
/// Generate a kallsyms-format string from the guest ELF binary.
///
/// For each defined symbol with a nonzero address, the output line is:
/// `{address + base_address:016x} T {name}`
///
/// Symbols are sorted by address ascending (as required by kallsyms format).
/// We also inject `_text` and `_stext` symbols at the `.text` section address
/// so that `perf kvm` can set up the guest kernel map.
///
/// ## Why we can't just emit raw symbols
///
/// The kallsyms format (`address type name`) carries no size information.
/// `perf kvm` processes these symbols through `symbols__fixup_end()`
/// (tools/perf/util/symbol.c), which extends each symbol's range to the
/// start of the next symbol — a heuristic designed for contiguous kernel
/// text. For general ELF binaries with gaps between functions (alignment
/// padding, dead code, linker-placed sections), this causes misattribution:
/// samples in gaps are credited to the preceding function.
///
/// Unlike userspace profiling where perf reads the ELF's `.symtab` with
/// `st_size` via `/proc/<pid>/maps`, KVM guest samples are guest virtual
/// addresses with no host-side process or memory mapping. `perf kvm` has
/// no `--guest-elf` option and cannot read symbol sizes from the binary.
///
/// ## Gap markers
///
/// To compensate, we read `st_size` from the ELF ourselves and inject
/// synthetic `__gap__` markers at each function's true end whenever a
/// gap exists before the next function. `symbols__fixup_end()` then clips
/// each real symbol at its true boundary. On pre-Ice Lake CPUs (no guest
/// PEBS), NMI skid causes samples to land in gap regions — the `__gap__`
/// markers absorb these instead of letting them inflate a neighboring
/// function. On Ice Lake+ with PEBS, no sample lands in gaps (no code
/// executes there), so the markers have no practical effect but are
/// harmless.
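// Illustrative sketch of the generation scheme described above, not the
// crate's actual implementation. It assumes the function symbols have already
// been read from `.symtab` into a hypothetical `FuncSym` struct (the real code
// uses the `object` crate for that step), and the name `kallsyms_with_gaps` is
// likewise made up for this example.

```rust
use std::fmt::Write as _;

/// Hypothetical, simplified view of one ELF function symbol.
struct FuncSym {
    addr: u64, // st_value: ELF virtual address
    size: u64, // st_size: extent in bytes (0 if unknown)
    name: String,
}

/// Build kallsyms text with every address shifted by `base`, `_text` and
/// `_stext` placed at `text_addr`, and a `__gap__` marker at the true end
/// of each function that is followed by a gap.
fn kallsyms_with_gaps(mut syms: Vec<FuncSym>, text_addr: u64, base: u64) -> String {
    // Keep only symbols with a nonzero address, sorted ascending.
    syms.retain(|s| s.addr != 0);
    syms.sort_by_key(|s| s.addr);

    let mut entries: Vec<(u64, String)> = vec![
        (text_addr, "_text".to_string()),
        (text_addr, "_stext".to_string()),
    ];
    for (i, s) in syms.iter().enumerate() {
        entries.push((s.addr, s.name.clone()));
        let end = s.addr + s.size;
        if let Some(next) = syms.get(i + 1) {
            // Emit a marker only when the size is known and the function
            // really ends before the next symbol starts.
            if s.size > 0 && end < next.addr {
                entries.push((end, "__gap__".to_string()));
            }
        }
    }
    // Stable sort: entries sharing an address keep their insertion order.
    entries.sort_by_key(|e| e.0);

    let mut out = String::new();
    for (addr, name) in &entries {
        // Writing into a String cannot fail, so the result is ignored.
        let _ = writeln!(out, "{:016x} T {}", addr + base, name);
    }
    out
}
```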
/// Write kallsyms content to a temp file and return it.
/// Build the common `perf kvm` argument prefix used by both record and report.
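// Illustrative sketch only: one way the shared argument prefix could be
// assembled. The function name `perf_kvm_prefix` and its parameters are
// hypothetical. `--host`, `--guest`, and `--guestkallsyms=<path>` are
// documented `perf kvm` options, but passing `--guest` explicitly in
// guest-only mode is an assumption made here, not something the source states.

```rust
use std::ffi::OsString;
use std::path::Path;

/// Build the arguments that precede `record` or `report`.
fn perf_kvm_prefix(kallsyms: &Path, host: bool) -> Vec<OsString> {
    let mut args: Vec<OsString> = vec!["kvm".into()];
    if host {
        // Combined mode: capture host samples as well.
        args.push("--host".into());
    }
    args.push("--guest".into());

    // Build `--guestkallsyms=<path>` without forcing the path through UTF-8.
    let mut opt = OsString::from("--guestkallsyms=");
    opt.push(kallsyms);
    args.push(opt);
    args
}
```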
/// Parse a number as hex (0x prefix) or decimal.
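// Illustrative sketch of the parsing rule stated above (hex with a `0x`
// prefix, otherwise decimal). The name `parse_number`, the `u64` result type,
// and the acceptance of an uppercase `0X` prefix are assumptions for this
// example, not taken from the original implementation.

```rust
use std::num::ParseIntError;

/// Parse `s` as hexadecimal if it starts with `0x`/`0X`, else as decimal.
fn parse_number(s: &str) -> Result<u64, ParseIntError> {
    let s = s.trim();
    match s.strip_prefix("0x").or_else(|| s.strip_prefix("0X")) {
        Some(hex) => u64::from_str_radix(hex, 16),
        None => s.parse::<u64>(),
    }
}
```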
pub