
# monitr Audit: Activity/Resource Monitor Improvements

Created: 2026-06-10

## Executive summary
`monitr` is a solid baseline CLI/TUI monitor. The original high-impact gaps around per-process network attribution, async TUI detail loading, and history-pruning performance are marked complete. The 2026-06-12 audit refresh found follow-up issues in preference persistence, one-shot output parity, documentation accuracy, and command-backed sampling robustness (findings #8 through #11), and those are now marked complete as well. This version keeps the completed findings for historical context and appends the implementation-ready findings from that refresh.

## Findings (prioritized)

### 1) [x] [High] Missing per-process network attribution (core observability gap)
Why this matters: Without process-level RX/TX, operators cannot answer “which process is using bandwidth?” from the main monitor view and must switch tools.
Current behavior: Network metrics are aggregated at interface level in snapshot/table logic and are not consistently attributed to individual PIDs in the primary process view. `inspect`/`ports` can provide connection context, but not integrated per-process throughput in the normal rows/charts.
Why this is incomplete today: No canonical process-net metric path exists in the refresh model used by the main table and trend data.
Implementation details: inspect snapshot flow, then thread a process-level metric through state, table schema, and renderers. Add a best-effort backend path that attempts attribution and degrades gracefully when unsupported or permission-restricted.
Files to inspect first: `src/snapshot.rs`, `src/app.rs`, `src/ui.rs`, `src/inspect.rs`, `src/ports.rs`.
Trade-offs: attribution is platform-specific and can add per-refresh CPU cost, especially on systems with large socket tables.
Recommended rollout: add per-process deltas (bytes in/out + rates), include a `Net` metric in column config, and surface clearly when unavailable.
Implementation status: **complete**. Implemented per-process network totals/rates on macOS via `nettop` in `src/sampler.rs` (`platform::process_network_totals` + parser), propagated into `ProcessRow` as `network_in_rate`, `network_out_rate`, `total_network_in`, and `total_network_out`, surfaced in `src/ui.rs` (`Network` tab, compact/full schemas + sortable columns), and exposed in `src/output.rs` and `src/inspect.rs` via JSON/CLI fields.
Acceptance status: aggregate network totals remain in the UI and snapshot totals, while unsupported/errored process attribution now degrades gracefully with explicit support/error flags in totals (`process_network_supported`, `process_network_error`).

### 2) [x] [High] Synchronous `lsof`/inspect hydration blocks TUI responsiveness
Why this matters: opening process detail can freeze the interface for noticeable time under large FD counts or slow command execution.
Original behavior: `toggle_handles` triggered `inspect::collect_handles(pid)` on the hot UI path; `collect_handles` shells out to `lsof` and may produce large output.
Current behavior: `collect_handles` is now scheduled on a background thread when handles are requested. A loading placeholder is shown while the worker channel is polled, and results are cached to avoid repeated expensive lookups.
Implementation details: a request/result channel and one-shot thread spawn were added in `src/app.rs` (`toggle_handles`, `poll_handle_results`, cancellation helpers, and a 30s `handles_cache` TTL). `src/ui.rs` now renders a loading overlay state while async work is pending.
Files to inspect first: `src/app.rs` (selection state + event loop), `src/inspect.rs` (handle collection/render), `src/ui.rs` (detail panel timing).
Trade-offs: cache invalidation can show stale handles briefly; that is acceptable if clearly labeled and prioritized over UI lock-ups.
Recommended rollout: introduce a `HandleRequest` job queue with a result channel; show “loading handles…” placeholder on first open.
Implementation status: **complete**. Handle hydration runs off the UI thread with loading and cached results, while stale requests are discarded on selection change.
Suggested acceptance checks: repeatedly switch rows in under 1s and confirm no visible frame stalls; simulate slow `lsof` and confirm spinner/loading remains interactive.
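
The off-thread hydration described above can be sketched as follows. This is a minimal illustration, not the code in `src/app.rs`: `HandleLoader`, `HandleView`, and the `collect` closure are hypothetical names, the closure stands in for the slow `lsof`-backed lookup, and the 30s cache is omitted. Replacing the pending receiver is what discards a stale request when the selection changes.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::thread;

/// What the detail panel should render right now (hypothetical shape).
#[derive(Debug, PartialEq)]
pub enum HandleView {
    Idle,
    Loading(u32),
    Ready(u32, Vec<String>),
}

/// Owns at most one in-flight lookup; a newer request replaces the older one.
pub struct HandleLoader {
    pending: Option<(u32, Receiver<Vec<String>>)>,
    view: HandleView,
}

impl HandleLoader {
    pub fn new() -> Self {
        Self { pending: None, view: HandleView::Idle }
    }

    /// Start a lookup for `pid` on a worker thread and show a placeholder.
    pub fn request<F>(&mut self, pid: u32, collect: F)
    where
        F: FnOnce(u32) -> Vec<String> + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            // If the receiver was dropped (selection changed), the send
            // fails and the stale result is simply thrown away.
            let _ = tx.send(collect(pid));
        });
        // Dropping the previous receiver cancels interest in the old result.
        self.pending = Some((pid, rx));
        self.view = HandleView::Loading(pid);
    }

    /// Non-blocking poll, intended to be called once per UI frame.
    pub fn poll(&mut self) -> &HandleView {
        let outcome = self.pending.as_ref().map(|(pid, rx)| (*pid, rx.try_recv()));
        match outcome {
            Some((pid, Ok(handles))) => {
                self.view = HandleView::Ready(pid, handles);
                self.pending = None;
            }
            // The worker ended without sending (for example, it panicked).
            Some((_, Err(TryRecvError::Disconnected))) => {
                self.view = HandleView::Idle;
                self.pending = None;
            }
            // Still loading, or nothing was requested.
            Some((_, Err(TryRecvError::Empty))) | None => {}
        }
        &self.view
    }
}
```

Because `poll` uses `try_recv`, the UI thread never waits on the worker; it keeps rendering `Loading` until a result arrives.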

### 3) [x] [High] O(n²) behavior in dead-process pruning can degrade at scale
Why this matters: the cleanup loop grows expensive as process count rises, making low intervals feel laggy during churn.
Current behavior: process history retention uses linear search over a vector of live PIDs for each stored PID while pruning dead entries.
Why this is incomplete today: as process count rises, pruning cost tends toward O(n²) per refresh, and that cost recurs on every tick, so it compounds at short refresh intervals.
Implementation details: convert live PID list to a `HashSet` once per snapshot and use set membership for prunes. Keep history map type unchanged to avoid broad refactors.
Files to inspect first: `src/history.rs` in the record/update path.
Trade-offs: tiny extra allocation for the temporary set; usually far below runtime noise compared with reduced scan cost.
Recommended rollout: replace vector-membership checks with set-membership checks and keep behavior unchanged for callers.
Suggested acceptance checks: benchmark refresh duration at high process counts before/after and confirm no regression in memory behavior.
Implementation status: **complete**. Replaced the `Vec<u32>`-based live PID membership check in `History::record` (`src/history.rs`) with a `HashSet<u32>`, turning the dead-PID prune from O(n·m) into O(n + m) where `n` is the size of `process_cpu` and `m` is the live process count. The set is allocated with the snapshot's process count as its capacity hint so the extra allocation stays bounded. Public `History` API, call sites, and existing tests are unchanged; all 43 unit tests still pass.
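
As a rough illustration of the set-based prune, the sketch below builds the live-PID set once and then filters the history map. The function name `prune_dead` and the `HashMap<u32, Vec<f32>>` history type are assumptions made for the example; the real `History::record` may store its data differently.

```rust
use std::collections::{HashMap, HashSet};

/// Drop history entries whose PID is not in the current snapshot.
/// `history` is a stand-in for the per-process CPU history map.
pub fn prune_dead(history: &mut HashMap<u32, Vec<f32>>, live_pids: &[u32]) {
    // Build the set once per snapshot: O(m) for m live processes.
    // The capacity hint keeps the temporary allocation bounded.
    let mut live: HashSet<u32> = HashSet::with_capacity(live_pids.len());
    live.extend(live_pids.iter().copied());

    // Each stored PID is checked with an expected O(1) lookup: O(n) overall.
    history.retain(|pid, _| live.contains(pid));
}
```

With a `Vec<u32>` in place of the set, each `contains` call would scan up to m entries, which is where the O(n·m) cost came from.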

### 4) [x] [Medium] `inspect` one-shot output is less useful than the data model
Why this matters: CLI one-shot mode is often used for scripting and audits; missing fields weakens its practical value despite already collecting them.
Current behavior: `InspectProcess` contains richer fields (session/priority/open-files fields), but rendering omits several of them depending on mode and presentation path.
Why this is incomplete today: inconsistent feature parity creates a trust gap between interactive and non-interactive modes.
Implementation status: **complete**. `InspectProcess` is now a type alias for `ProcessRecord` in `src/process_record.rs`, and snapshot/inspect paths both serialize the shared schema, including process metadata and network totals/rates.
Files to inspect first: `src/inspect.rs` (process model + rendering), any output formatter/flag handling in `src/main.rs`.
Trade-offs: output verbosity can increase significantly; provide optional compact mode or stable flags to limit fields.
Recommended rollout: add explicit `inspect` fields list in output docs, then keep defaults stable and sorted across modes.
Suggested acceptance checks: compare outputs for a known process with and without optional verbose mode; field set should be explainable and predictable.

### 5) [x] [Medium] `lsof` dependency failures are fragile and non-actionable
Why this matters: users on minimal/locked systems may assume `monitr` is broken when command plumbing is missing or blocked.
Current behavior: `lsof` is preflight-checked via `ensure_lsof_available`, and missing-command, permission, and command-failure messages include remediation text.
Why this was incomplete: historically errors were generic and made troubleshooting difficult.
Implementation details: both `src/inspect.rs` and `src/ports.rs` route through `ensure_lsof_available`/`lsof_failure_message` to provide actionable diagnostics and platform-safe unsupported-path behavior.
Files to inspect first: `src/inspect.rs`, `src/ports.rs`, `src/main.rs` argument handling.
Trade-offs: adding checks is extra startup/feature logic but improves operability significantly.
Recommended rollout: on first `lsof` use, attempt `which`/`command -v` resolution once; cache the result and provide actionable diagnostics in both CLI and TUI paths.
Suggested acceptance checks: run with `lsof` unavailable and verify readable error with remediation steps.
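
One way to turn a failed spawn into an actionable message is to branch on the `io::ErrorKind`. The sketch below is illustrative only: `spawn_failure_hint` and `ensure_tool_available` are hypothetical names (not the real `lsof_failure_message` or `ensure_lsof_available` signatures), the message wording is invented, and probing with `-v` is an assumption that suits `lsof` but not every tool.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Map a spawn failure to a message that tells the user what to do next.
pub fn spawn_failure_hint(tool: &str, kind: io::ErrorKind) -> String {
    match kind {
        io::ErrorKind::NotFound => {
            format!("`{tool}` was not found on PATH; install it or fix PATH, then retry")
        }
        io::ErrorKind::PermissionDenied => format!(
            "`{tool}` exists but could not be executed; check file permissions or retry with elevated privileges"
        ),
        other => {
            format!("`{tool}` failed to start ({other:?}); run it manually to see the underlying error")
        }
    }
}

/// Preflight check: try to start the tool once with all output discarded.
/// Only spawn errors are reported; the exit status is ignored here.
pub fn ensure_tool_available(tool: &str) -> Result<(), String> {
    Command::new(tool)
        .arg("-v") // assumption: a cheap flag such as `lsof -v`
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|_| ())
        .map_err(|err| spawn_failure_hint(tool, err.kind()))
}
```

A caller could store the `Result` in a `std::sync::OnceLock` so the probe runs only on first use, in line with the rollout note above.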

### 6) [x] [Medium] Limited table customization and persisted preferences
Why this matters: operators currently carry repetitive mental overhead to hide/show columns or lock preferred sorting on each run.
Original behavior: sort order/columns were mostly runtime defaults, with no durable per-user schema/preferences.
Current behavior: preferences for sort mode, visibility, compact mode, interval, filters, and layout toggles are loaded from `Preferences`, persisted on each change, and restored on startup.
Why this was incomplete: these settings were previously transient runtime state only.
Implementation details: `src/config.rs` now stores normalized user schema at `~/Library/Application Support/monitr/config.json`; `App::new` loads defaults, and `save_preferences` persists state on preference-affecting actions.
Files to inspect first: `src/app.rs` state model, any existing config/env parsing in `src/main.rs` and startup code.
Trade-offs: config migration and corruption handling must be robust; include fallback defaults.
Recommended rollout: define a small preference schema with version field, load-save on change, and provide a “reset defaults” command.
Suggested acceptance checks: start monitr twice and confirm persisted preferences are reapplied.

### 7) [x] [Low] Data-model inconsistency across command modes
Why this matters: inconsistent field sets and timing across modes make debugging and alerting hard to reason about.
Current behavior: CLI one-shot, inspect pane, and table mode can diverge in both which fields are visible and refresh semantics.
Why this is incomplete today: no canonical source-of-truth schema for metric fields.
Implementation details: establish a shared metric/schema layer, then add render adapters for each mode instead of independent ad-hoc mappings.
Implementation status: **complete**.
How:
- Added `src/process_record.rs` containing the canonical `ProcessRecord` schema for shared process metrics and metadata.
- Switched inspect output model to alias `InspectProcess` to `ProcessRecord`, then convert `ProcessRow` via `ProcessRecord::from`.
- Updated snapshot JSON serialization (`src/output.rs`) to flatten `ProcessRecord` and append snapshot-specific trend/delta fields.
- Kept `ProcessRow` as the internal live sampler model and introduced a strict adapter boundary so command modes read from a single canonical process schema.
Files to inspect first: `src/main.rs`, `src/app.rs`, `src/ui.rs`, `src/inspect.rs`.
Trade-offs: initial refactor touches multiple modules; keep incremental (start with a shared enum/struct and conversion methods).
Recommended rollout: map each public field to one source definition, then unit test expected mode output for a fixed fixture.
Suggested acceptance checks: document and test per-mode support matrix (required/optional/unsupported).

### 8) [x] [High] Persisted sort preferences do not round-trip for several sort keys
Why this matters: finding #6 is only partially realized if saved preferences silently revert on restart. Operators who sort by disk, network, PID, or mover changes will lose that preference and fall back to CPU sorting.
Current behavior: `Preferences::from_app` serializes `sort_key` from display labels (`SortKey::title()`), with only `% CPU` and `Impact` normalized. `Preferences::apply_sort_key` expects config tokens such as `DiskRead`, `DiskWrite`, `NetworkIn`, `NetworkOut`, `Trend`, and `Pid`. Saved values like `Read/s`, `Write/s`, `In/s`, `Out/s`, `Change`, and `PID` do not match and fall through to `SortKey::Cpu` on the next startup.
Evidence: inspect `src/config.rs` (`apply_sort_key` vs. `from_app`). This also affects sort keys reachable through header clicks and shortcuts: `D`, network headers, `p`, and Movers sorting.
Files to inspect first: `src/config.rs`, `src/app.rs` (`save_preferences`, `set_sort`, `cycle_sort`, `set_tab`), and tests around preference serialization.
Recommended rollout: replace display-label serialization with an explicit stable config token, for example `SortKey::config_name()` and `SortKey::from_config_name()`, or serialize a serde enum. Keep migration compatibility for already-written display labels so users do not get stuck with stale configs.
Suggested acceptance checks: unit-test every `SortKey` round trip through `Preferences::from_app` and `Preferences::apply_sort_key`; add regression coverage for `DiskRead`, `DiskWrite`, `NetworkIn`, `NetworkOut`, `Trend`, and `Pid`; manually verify that changing sort, quitting, and restarting restores the same sort key and direction.
Implementation status: **complete**. `SortKey` now exposes stable config tokens via `config_name()` and parses both those tokens and legacy display labels via `from_config_name()`. `Preferences::from_app` saves stable tokens instead of UI labels, and `Preferences::apply_sort_key` uses the shared parser.
Changelog: Added regression tests covering every `SortKey` round trip and legacy display-label migration for disk, network, trend, PID, CPU, and energy sort preferences.
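
The stable-token approach can be sketched like this. The method names `title()`, `config_name()`, and `from_config_name()` and the tokens and labels for disk, network, trend, and PID come from the finding above; the `Cpu` token, the exact label-to-variant pairing, and the reduced variant list (no energy/`Impact` key) are assumptions made to keep the example short.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SortKey {
    Cpu,
    DiskRead,
    DiskWrite,
    NetworkIn,
    NetworkOut,
    Trend,
    Pid,
}

impl SortKey {
    pub const ALL: [SortKey; 7] = [
        SortKey::Cpu,
        SortKey::DiskRead,
        SortKey::DiskWrite,
        SortKey::NetworkIn,
        SortKey::NetworkOut,
        SortKey::Trend,
        SortKey::Pid,
    ];

    /// Display label for table headers; free to change with UI wording.
    pub fn title(self) -> &'static str {
        match self {
            SortKey::Cpu => "% CPU",
            SortKey::DiskRead => "Read/s",
            SortKey::DiskWrite => "Write/s",
            SortKey::NetworkIn => "In/s",
            SortKey::NetworkOut => "Out/s",
            SortKey::Trend => "Change",
            SortKey::Pid => "PID",
        }
    }

    /// Stable token written to the config file; must never change.
    pub fn config_name(self) -> &'static str {
        match self {
            SortKey::Cpu => "Cpu",
            SortKey::DiskRead => "DiskRead",
            SortKey::DiskWrite => "DiskWrite",
            SortKey::NetworkIn => "NetworkIn",
            SortKey::NetworkOut => "NetworkOut",
            SortKey::Trend => "Trend",
            SortKey::Pid => "Pid",
        }
    }

    /// Accept the stable token first, then fall back to legacy display
    /// labels so configs written by older versions still load.
    pub fn from_config_name(raw: &str) -> Option<SortKey> {
        SortKey::ALL
            .iter()
            .copied()
            .find(|key| key.config_name() == raw || key.title() == raw)
    }
}
```

Returning `Option` leaves the fallback decision to the caller, which can default to `SortKey::Cpu` for unknown values without hiding the mismatch inside the parser.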

### 9) [x] [Medium] `snapshot --full` text output does not match its help or the shared schema
Why this matters: one-shot output is the scriptable surface area. If `--full` claims richer metadata but omits it, future agents and users may assume parity that does not exist.
Current behavior: `monitr snapshot --help` says `--full` shows `PPID, threads, session, start time`, but `render_snapshot_text` only adds `PPID`, `THREADS`, and `USER`. It does not print session or start time. Threads are also usually `-` in snapshot mode because `run_snapshot` samples with `detail_pid: None`, so `selected_details` is not hydrated for rows. Text snapshots also omit per-process network in/out columns even though JSON exposes `network_in_bytes_per_sec`, `network_out_bytes_per_sec`, and totals.
Evidence: `src/main.rs` help text and `src/output.rs` full-table renderer diverge. A local run of `cargo run --locked -- snapshot --full --limit 2` produced the header `PID %CPU MEMORY DISK/S PPID THREADS USER NAME`, while `cargo run --locked -- snapshot --json --limit 1` included per-process network fields.
Files to inspect first: `src/main.rs` (`print_snapshot_help`), `src/output.rs` (`render_snapshot_text`, `ProcessDocument`), `src/sampler.rs` (`sample(Some(pid))` vs. `sample(None)` detail hydration).
Recommended rollout: decide whether `--full` text should be compact-human output or true schema parity. If true parity, add explicit session/start/network columns and hydrate details for the rows actually rendered after filtering and limiting. If compact-human output is preferred, correct the help text and add a documented field-support matrix so JSON remains the canonical full schema.
Suggested acceptance checks: add golden tests for default text, `--full` text, and JSON snapshot output; verify that the help text names only fields that appear; verify that at least one process with available details can show non-`-` thread/session data in full mode or that the docs explicitly describe why those fields are JSON/inspect-only.
Implementation status: **complete**. `snapshot --full` now hydrates detail metadata for the filtered/limited rows it is about to render, so thread and session columns can be populated without adding detail work to compact snapshots. Full text output now includes per-process network in/out columns, PPID, threads, session, and start-time columns, matching the command help and the shared process schema more closely.
Changelog: Added a shared rendered-PID selection helper, targeted snapshot detail hydration, updated snapshot help text, and regression coverage for the expanded full snapshot text table.
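
The targeted hydration depends on knowing which rows will be printed before any detail work is done. A minimal sketch of such a selection helper follows; `Row`, `rendered_pids`, the substring filter, and the CPU-descending order are all assumptions for illustration and may not match the helper in the codebase.

```rust
/// Minimal stand-in for a sampled process row.
pub struct Row {
    pub pid: u32,
    pub name: String,
    pub cpu: f32,
}

/// PIDs of the rows that will actually be rendered: apply the name filter,
/// order by CPU descending, then cut to `limit`. Detail hydration can then
/// run for these PIDs only, instead of for every sampled process.
pub fn rendered_pids(rows: &[Row], filter: Option<&str>, limit: usize) -> Vec<u32> {
    let mut kept: Vec<&Row> = rows
        .iter()
        .filter(|row| filter.map_or(true, |needle| row.name.contains(needle)))
        .collect();
    // `total_cmp` gives a total order even if a sample contains NaN.
    kept.sort_by(|a, b| b.cpu.total_cmp(&a.cpu));
    kept.into_iter().take(limit).map(|row| row.pid).collect()
}
```

With `--limit 2`, for example, only two processes would need detail lookups, which is what keeps compact snapshots free of the extra work.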

### 10) [x] [Medium] README and in-app help are stale after network and preference changes
Why this matters: stale docs make completed work look absent and hide active controls. That increases support load and can cause future agents to re-open or re-implement already-shipped behavior.
Current behavior: README Scope still says network throughput is interface-level and not per-process, while the audit and code now show per-process network rates in the Network tab and JSON output. README Roadmap still lists per-process network attribution as deferred to an on-demand path even though `Sampler::sample` now calls `nettop` during normal sampling. README Controls omit active keys for overview (`o`), compact mode (`x`), and reset defaults (`R`). In-app help lists `o` but does not list `x` or `R`, and still describes Disk/Network inspector panels only as system-level volumes/interfaces without acknowledging the per-process Network table columns.
Evidence: compare `README.md` Scope/Roadmap/Controls with `src/ui.rs` key handling and help rendering, plus `src/sampler.rs` process-network fields.
Files to inspect first: `README.md`, `src/ui.rs` (`render_help`, footer text), `src/app.rs` key handling, and `src/main.rs` command help.
Recommended rollout: update the README feature/scope/roadmap language to match the current product surface; add missing controls for `o`, `x`, and `R`; clarify that interface totals remain in the inspector while per-process network rates are available where attribution succeeds.
Suggested acceptance checks: run `cargo run --locked -- --help`, `cargo run --locked -- snapshot --help`, and manually compare TUI help against README Controls; confirm no completed feature remains listed as future roadmap unless intentionally still partial.
Implementation status: **complete**. README feature, scope, controls, and roadmap language now matches the shipped per-process network attribution and persisted preference controls. TUI help now lists `x` compact mode and `R` reset defaults, and describes Disk/Network as per-process rate tables with system-level inspector totals.
Changelog: Refreshed README and in-app help for per-process network rates, `o` overview, `x` compact mode, `R` reset defaults, and best-effort `nettop` attribution behavior; bumped the crate to `0.3.46` for release.

### 11) [x] [Medium] Per-process network sampling runs `nettop` synchronously on the hot refresh path
Why this matters: finding #1 added useful attribution, but the current implementation can still hurt the primary TUI experience if the external command stalls, prompts, or degrades under process/socket churn.
Current behavior: every `Sampler::sample` calls `collect_process_network_samples`, which calls `platform::process_network_totals()`. On macOS that directly runs `Command::new("nettop").args(["-L1", "-P", "-n", "-x"]).output()` with no timeout, no backoff after repeated failures, and no async isolation from the UI refresh path. The current machine returned quickly in a local `nettop -L1 -P -n -x` probe, but the code has no guardrail if it does not.
Why this is not already covered: finding #1 called out attribution cost as a trade-off, but it is marked complete. This is a follow-up reliability issue in the completed implementation.
Files to inspect first: `src/sampler.rs` (`Sampler::sample`, `collect_process_network_samples`, `platform::process_network_totals`), `src/app.rs` refresh loop, and output totals support/error handling.
Recommended rollout: introduce a bounded process-network collector with a timeout and a session-level failure/backoff policy. Consider moving process-network attribution off the frame-critical refresh path, caching the last successful totals, and surfacing stale/unavailable state without blocking the rest of the sample.
Suggested acceptance checks: add an injectable command runner or platform trait so tests can simulate a slow/hung `nettop`; verify TUI refresh and `snapshot` return within the configured bound; verify repeated failure does not spawn unbounded commands and still reports `process_network_supported: false` with an actionable error.
Implementation status: **complete**. `nettop` is now executed through a timeout-bound runner, and process-network attribution failures clear stale deltas and trigger session-level exponential retry backoff instead of running the command on every refresh. Snapshot totals continue to report `process_network_supported: false` with the failure/backoff reason while CPU, memory, disk, and aggregate network sampling continue.
Changelog: Added timeout coverage for slow command execution and regression coverage for bounded exponential backoff.
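
A session-level exponential backoff of the kind described here can be sketched as below. The `Backoff` type, its method names, and the base/cap values used by callers are hypothetical; the actual policy in `src/sampler.rs` may differ.

```rust
use std::time::{Duration, Instant};

/// Tracks consecutive failures and decides when the next attempt is allowed.
pub struct Backoff {
    base: Duration,
    max: Duration,
    failures: u32,
    retry_at: Option<Instant>,
}

impl Backoff {
    pub fn new(base: Duration, max: Duration) -> Self {
        Self { base, max, failures: 0, retry_at: None }
    }

    /// True when the collector may run the external command at `now`.
    pub fn should_attempt(&self, now: Instant) -> bool {
        self.retry_at.map_or(true, |at| now >= at)
    }

    /// A success clears the failure streak and any pending wait.
    pub fn record_success(&mut self) {
        self.failures = 0;
        self.retry_at = None;
    }

    /// Doubles the wait after each consecutive failure, capped at `max`.
    /// Returns the delay that was scheduled.
    pub fn record_failure(&mut self, now: Instant) -> Duration {
        // 2^failures, saturating instead of overflowing on long streaks.
        let factor = 1u32.checked_shl(self.failures).unwrap_or(u32::MAX);
        let delay = self.base.checked_mul(factor).unwrap_or(self.max).min(self.max);
        self.failures = self.failures.saturating_add(1);
        self.retry_at = Some(now + delay);
        delay
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside, lets tests advance time without sleeping.
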
Review follow-up (2026-06-12): the original runner polled `try_wait` without draining pipes, so a child emitting more than the OS pipe buffer (~64KB) would stall and be killed as a false timeout, permanently backing off attribution on process-heavy systems. The runner now drains stdout/stderr on background threads while polling, with a regression test covering >64KB of output.
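
The pipe-draining fix can be illustrated with a standard-library-only runner. This is a sketch under assumptions: `run_with_timeout` and `drain` are hypothetical names, the 10ms poll interval is arbitrary, and the real runner may report errors and stderr differently.

```rust
use std::io::{self, Read};
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Read a pipe to EOF on its own thread so the child never blocks on a
/// full pipe buffer while the caller is polling `try_wait`.
fn drain<R: Read + Send + 'static>(pipe: Option<R>) -> thread::JoinHandle<Vec<u8>> {
    thread::spawn(move || {
        let mut buf = Vec::new();
        if let Some(mut pipe) = pipe {
            let _ = pipe.read_to_end(&mut buf);
        }
        buf
    })
}

/// Run `command`; return `Ok(Some(stdout))` if it exits before `timeout`,
/// or `Ok(None)` if it had to be killed.
pub fn run_with_timeout(mut command: Command, timeout: Duration) -> io::Result<Option<Vec<u8>>> {
    let mut child = command
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    // Start draining before polling; this is the step the original
    // runner was missing.
    let stdout = drain(child.stdout.take());
    let stderr = drain(child.stderr.take());

    let deadline = Instant::now() + timeout;
    let finished = loop {
        if child.try_wait()?.is_some() {
            break true;
        }
        if Instant::now() >= deadline {
            let _ = child.kill();
            let _ = child.wait(); // reap the process so it is not left a zombie
            break false;
        }
        thread::sleep(Duration::from_millis(10));
    };

    // The readers hit EOF once the child's ends of the pipes are closed.
    let out = stdout.join().unwrap_or_default();
    let _ = stderr.join();
    Ok(finished.then_some(out))
}
```

One caveat of this simple form: if the killed child left descendants that still hold the pipes open, the final `join` waits for them too, so a production runner may want to bound that step as well.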

## Potentially low-value / candidates for removal or de-emphasis
Any command mode that only exposes sparse metadata should be expanded or consolidated. If `ports` stays informational-only, explicitly mark it in help text and consider adding a dedicated flag for heavy system scans.

## Nice-to-have improvements
Add smoke tests for expensive system-command paths (e.g., missing/blocked `lsof`, slow/blocked `nettop`). Keep a short behavior rationale in docs for changes around parsing, selection, refresh, and persisted preferences so UX deltas stay understandable.

## Completion tracking
Legend:
- [ ] not started
- [~] in progress
- [x] complete

1. [x] Missing per-process network attribution
2. [x] Async inspect handle hydration in TUI
3. [x] O(n²) history pruning optimization
4. [x] One-shot inspect output parity with TUI fields
5. [x] Better `lsof` error handling and diagnostics
6. [x] Persisted UI/table preferences
7. [x] Canonical metric schema across command modes
8. [x] Persisted sort preferences round-trip
9. [x] Snapshot `--full` text/schema parity
10. [x] README and in-app help refresh
11. [x] Bounded per-process network sampling

## Changelog
Use this section to record what was implemented and why.

- 2026-06-10 — Initial findings expanded and formatted; completion checkboxes and changelog section added.
- 2026-06-10 — Implemented finding #1: process-level network rates are collected when supported, exposed in process rows (full/compact), and included in snapshot/inspect outputs with trend deltas and totals.
- 2026-06-10 — Implemented finding #2: added an async handle cache with 30s TTL in the TUI to prevent frame stalls when repeatedly opening process details.
- 2026-06-10 — Implemented finding #3: replaced the per-snapshot `Vec<u32>` live-PID membership check in `History::record` with a `HashSet<u32>`, making dead-PID pruning linear in the number of live processes rather than quadratic. Public API, call sites, and tests are unchanged; all 43 unit tests pass.
- 2026-06-10 — Implemented finding #4: added a `--full` flag to `inspect` and `snapshot` commands for field parity with TUI, and improved `inspect` accuracy with dual-sampling.
- 2026-06-10 — Implemented finding #5: added `lsof` preflight checks and descriptive error messages for permission issues.
- 2026-06-10 — Implemented finding #6: added a persisted 'Compact Mode' toggle (key 'x') and ensured all UI preferences are correctly saved and restored.
- 2026-06-10 — Implemented finding #7: introduced a shared `ProcessRecord` process schema and used it across inspect/snapshot output paths to remove ad-hoc field divergence.
- 2026-06-11 — Audit sync pass: corrected findings #2–#6 status and implementation notes in `audit.md` to match implemented behavior (async handle hydration, one-shot parity, actionable `lsof` diagnostics, and persisted preferences).
- 2026-06-12 — Current-state audit refresh: left original completed findings intact, then added new open findings for sort preference round-tripping, `snapshot --full` text/schema drift, stale README/help content, and synchronous `nettop` sampling on the refresh path.
- 2026-06-12 — Implemented finding #8: persisted sort preferences now save stable config tokens and load both new tokens and old display labels, with regression tests for every sort key.
- 2026-06-12 — Implemented finding #11: bounded the `nettop` process-network collector with a timeout and retry backoff so sampling degrades instead of repeatedly blocking refreshes.
- 2026-06-12 — Audit review pass: verified findings #1–#8 and #11 against the code (round-trip sort tests, lsof preflight/diagnostics in both inspect and ports paths, async handle hydration, persisted preferences). Hardened the #11 command runner to drain child pipes on background threads, fixing a false-timeout stall when `nettop` output exceeds the OS pipe buffer; added a regression test. Findings #9 and #10 confirmed still open and unchanged.
- 2026-06-12 — Implemented finding #9: `snapshot --full` now hydrates details for rendered rows and prints network in/out, PPID, threads, session, and start-time columns; snapshot help text and regression coverage were updated to match.
- 2026-06-12 — Implemented finding #10: refreshed README and in-app help to describe shipped per-process network rates, persisted preference controls (`o`, `x`, `R`), and the remaining system-level Disk/Network inspector totals; bumped the release version to 0.3.46.