monitord 0.18.0

monitord ... know how happy your systemd is! 😊
# monitord

monitord ... know how happy your systemd is! 😊

We offer the following run modes:

- systemd-timer (legacy cron would work too)
  - Refer to monitord.timer and monitord.service unit files
  - Ensure daemon mode is not enabled in `monitord.conf` (`daemon = false`)
- daemon mode
  - Enable daemon mode in configuration file
  - Stats will be written to stdout every `daemon_stats_refresh_secs` seconds
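
The difference between the two run modes can be pictured with a minimal sketch. This is not monitord's actual code: the `run` function, the `collect` closure and the `max_runs` cap are all hypothetical names introduced for illustration, and the cap exists only so the loop can be stopped.

```rust
use std::io::{self, Write};
use std::thread;
use std::time::Duration;

/// Hypothetical run loop. `collect` stands in for whatever gathers the stats
/// and `refresh` plays the role of `daemon_stats_refresh_secs`.
/// `max_runs` is an assumption of this sketch; a real daemon would loop forever.
pub fn run<W: Write>(
    daemon: bool,
    refresh: Duration,
    max_runs: usize,
    mut collect: impl FnMut() -> String,
    out: &mut W,
) -> io::Result<usize> {
    let mut runs = 0;
    loop {
        // Every collection is written as one line to the output (stdout in practice).
        writeln!(out, "{}", collect())?;
        runs += 1;
        // Timer / cron mode exits after a single collection.
        if !daemon || runs >= max_runs {
            return Ok(runs);
        }
        thread::sleep(refresh);
    }
}
```

With `daemon` set to `false` the function returns after one collection, which is the behaviour a systemd timer or cron job would rely on.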

We're open to more output formats and run methods; open an issue to discuss. Whether one is feasible mostly depends on the dependencies it would pull in.

`monitord` is a config driven binary. We plan to keep CLI arguments to a minimum.

**INFO** level logging is enabled to stderr by default. Use `-l LEVEL` to increase or decrease logging.

## Install

Install via cargo or use as a dependency in your `Cargo.toml`.

- `cargo install monitord`
- Create (copy from repo) a `monitord.conf`
  - Defaults to looking for it at /etc/monitord.conf
- `monitord --help`

```console
crl-linux:monitord cooper$ monitord --help
monitord: Know how happy your systemd is! 😊

Usage: monitord [OPTIONS]

Options:
  -c, --config <CONFIG>
          Location of your monitord config

          [default: /etc/monitord.conf]

  -l, --log-level <LOG_LEVEL>
          Adjust the console log-level

          [default: Info]
          [possible values: error, warn, info, debug, trace]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
```

### Config

monitord can monitor several different components. To enable or disable each one, set the
following in your `monitord.conf`. The file is in [ini format](https://en.wikipedia.org/wiki/INI_file)
to match systemd unit files.

```ini
# Pure ini - no yes/no for bools

[monitord]
# Set a custom dbus address to connect to
# OPTIONAL: If not set, we default to the Unix socket below
dbus_address = unix:path=/run/dbus/system_bus_socket
# Timeout in seconds for dbus connection/collections
# OPTIONAL: default is 30 seconds
dbus_timeout = 30
# Run as a daemon or 1 time
daemon = false
# Time to refresh systemd stats in seconds
# Daemon mode only
daemon_stats_refresh_secs = 60
# Prefix flat-json key with this value
# The value automatically gets a '.' appended (so don't add one here)
key_prefix = monitord
# cron/systemd timer output format
# Supported: json, json-flat, json-pretty
output_format = json

# Grab as many stats as we can from the running dbus daemon
# via the DBus GetStats call
# Better tested against the dbus-broker daemon
[dbus]
# Summary counters - both dbus-broker + dbus-daemon
enabled = false
# dbus.user.* metrics: user stats as reported by dbus-broker
user_stats = false
# dbus.peer.* metrics: peer stats as reported by dbus-broker
peer_stats = false
# dbus.cgroup.* stats is an aggregation of peer_stats by cgroup
# by dbus-broker
cgroup_stats = false

# Grab networkd stats from files + networkctl
[networkd]
enabled = true
link_state_dir = /run/systemd/netif/links

# Enable grabbing PID 1 stats via procfs
[pid1]
enabled = true

# Services to grab extra stats for
# .service is important as that's what DBus returns from `list_units`
[services]
foo.service

[timers]
enabled = true

[timers.allowlist]
foo.timer

[timers.blocklist]
bar.timer

# Grab unit status counts via dbus
[units]
enabled = true
state_stats = true

# Filter which services you want to collect state stats for
# If both lists are configured, the blocklist is preferred
# If neither exists, all units' states will generate counters
[units.state_stats.allowlist]
foo.service

[units.state_stats.blocklist]
bar.service

# machines config
[machines]
enabled = true

# Same rules apply as state_stats lists above
[machines.allowlist]
foo

[machines.blocklist]
bar

# Boot blame metrics - shows the N slowest units at boot
# Similar to `systemd-analyze blame`
# Disabled by default
[boot]
enabled = false
# Number of slowest units to report
num_slowest_units = 5

# Optional: only include specific units in boot blame (if empty, all units are checked)
# Same rules apply as state_stats lists above
[boot.allowlist]
# slow-startup.service

# Optional: exclude specific units from boot blame
[boot.blocklist]
# noisy-but-expected.service

# Unit verification using systemd-analyze verify
# Disabled by default as it can be slow on large systems
[verify]
enabled = false

# Optional: only verify specific units (if empty, all units are checked)
[verify.allowlist]
# example.service
# example.timer

# Optional: skip verification for specific units
[verify.blocklist]
# noisy.service
# broken.timer
```
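
The allowlist / blocklist comments in the config can be read as the following rule. This is a minimal sketch of one plausible interpretation, not monitord's actual implementation: it assumes "blocklist is preferred" means a blocklisted unit is always skipped, and that an empty allowlist allows everything. The function name `unit_wanted` is hypothetical.

```rust
use std::collections::HashSet;

/// Decide whether a unit should be reported, given the two lists.
/// Assumed semantics (an interpretation of the config comments):
/// - a unit on the blocklist is never reported, even if it is also allowlisted
/// - an empty allowlist means every non-blocklisted unit is reported
pub fn unit_wanted(
    unit: &str,
    allowlist: &HashSet<String>,
    blocklist: &HashSet<String>,
) -> bool {
    if blocklist.contains(unit) {
        return false;
    }
    allowlist.is_empty() || allowlist.contains(unit)
}
```

Under this reading, a config with only `[units.state_stats.blocklist]` reports everything except the listed units, while a config with only an allowlist reports just the listed units.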

## Machines support

From version `>=0.11` monitord supports obtaining the same set of keys from
systemd 'machines' (i.e. those shown by `machinectl list`).

The keys use the same format as the `json-flat` output below but are prefixed with
the `machine` keyword and the machine name. For example:

```json
# $KEY_PREFIX.machine.$MACHINE_NAME
{
  ...
  "monitord.machine.foo.pid1.fd_count": 69,
  ...
}
```
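
The `$KEY_PREFIX.machine.$MACHINE_NAME` scheme can be sketched as a small helper. The function name `machine_key` is hypothetical, and the handling of an empty prefix (no leading `.`) is an assumption of this sketch rather than documented behaviour.

```rust
/// Build a json-flat key for a stat collected from a machine.
/// `key_prefix` corresponds to `key_prefix` in monitord.conf.
/// Assumption: an empty prefix yields a key starting with `machine.`.
pub fn machine_key(key_prefix: &str, machine: &str, key: &str) -> String {
    if key_prefix.is_empty() {
        format!("machine.{machine}.{key}")
    } else {
        format!("{key_prefix}.machine.{machine}.{key}")
    }
}
```

Calling `machine_key("monitord", "foo", "pid1.fd_count")` produces the key shown in the example above.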

## Output Formats

### json

Normal `serde_json` non pretty JSON. All on one line. Most compact format.

### json-flat

Moves all key/value pairs to the top level and uses `.` notation for components and sub-values.
The output is custom and semi-pretty, and is covered by unit tests.

```json
{
  "boot.blame.dnf5-automatic.service": 204.159,
  "boot.blame.cpe_chef.service": 103.05,
  "boot.blame.sys-module-fuse.device": 16.21,
  "boot.blame.dev-ttyS0.device": 15.809,
  "boot.blame.systemd-networkd-wait-online.service": 1.674,
  "dbus.active_connections": 10,
  "dbus.bus_names": 16,
  "dbus.incomplete_connections": 0,
  "dbus.match_rules": 26,
  "dbus.peak_bus_names": 33,
  "dbus.peak_bus_names_per_connection": 2,
  "dbus.peak_match_rules": 33,
  "dbus.peak_match_rules_per_connection": 13,
  "dbus.cgroup.system.slice-systemd-logind.service.activation_request_bytes": 0,
  "dbus.cgroup.system.slice-systemd-logind.service.activation_request_fds": 0,
  "dbus.cgroup.system.slice-systemd-logind.service.incoming_bytes": 16,
  "dbus.cgroup.system.slice-systemd-logind.service.incoming_fds": 0,
  "dbus.cgroup.system.slice-systemd-logind.service.match_bytes": 6942,
  "dbus.cgroup.system.slice-systemd-logind.service.matches": 5,
  "dbus.cgroup.system.slice-systemd-logind.service.name_objects": 1,
  "dbus.cgroup.system.slice-systemd-logind.service.outgoing_bytes": 0,
  "dbus.cgroup.system.slice-systemd-logind.service.outgoing_fds": 0,
  "dbus.cgroup.system.slice-systemd-logind.service.reply_objects": 0,
  "dbus.peer.org.freedesktop.systemd1.activation_request_bytes": 0,
  "dbus.peer.org.freedesktop.systemd1.activation_request_fds": 0,
  "dbus.peer.org.freedesktop.systemd1.incoming_bytes": 16,
  "dbus.peer.org.freedesktop.systemd1.incoming_fds": 0,
  "dbus.peer.org.freedesktop.systemd1.match_bytes": 46533,
  "dbus.peer.org.freedesktop.systemd1.matches": 33,
  "dbus.peer.org.freedesktop.systemd1.name_objects": 1,
  "dbus.peer.org.freedesktop.systemd1.outgoing_bytes": 0,
  "dbus.peer.org.freedesktop.systemd1.outgoing_fds": 0,
  "dbus.peer.org.freedesktop.systemd1.reply_objects": 0,
  "dbus.user.cooper.bytes": 919236,
  "dbus.user.cooper.fds": 78,
  "dbus.user.cooper.matches": 510,
  "dbus.user.cooper.objects": 80,
  "networkd.eno4.address_state": 3,
  "networkd.eno4.admin_state": 4,
  "networkd.eno4.carrier_state": 5,
  "networkd.eno4.ipv4_address_state": 3,
  "networkd.eno4.ipv6_address_state": 2,
  "networkd.eno4.oper_state": 9,
  "networkd.eno4.required_for_online": 1,
  "networkd.managed_interfaces": 2,
  "networkd.wg0.address_state": 3,
  "networkd.wg0.admin_state": 4,
  "networkd.wg0.carrier_state": 5,
  "networkd.wg0.ipv4_address_state": 3,
  "networkd.wg0.ipv6_address_state": 3,
  "networkd.wg0.oper_state": 9,
  "networkd.wg0.required_for_online": 1,
  "pid1.cpu_time_kernel": 48,
  "pid1.cpu_time_user": 41,
  "pid1.fd_count": 245,
  "pid1.memory_usage_bytes": 19165184,
  "pid1.tasks": 1,
  "services.chronyd.service.active_enter_timestamp": 1683556542382710,
  "services.chronyd.service.active_exit_timestamp": 0,
  "services.chronyd.service.cpuusage_nsec": 328951000,
  "services.chronyd.service.inactive_exit_timestamp": 1683556541360626,
  "services.chronyd.service.ioread_bytes": 18446744073709551615,
  "services.chronyd.service.ioread_operations": 18446744073709551615,
  "services.chronyd.service.memory_available": 18446744073709551615,
  "services.chronyd.service.memory_current": 5214208,
  "services.chronyd.service.nrestarts": 0,
  "services.chronyd.service.restart_usec": 100000,
  "services.chronyd.service.state_change_timestamp": 1683556542382710,
  "services.chronyd.service.status_errno": 0,
  "services.chronyd.service.tasks_current": 1,
  "services.chronyd.service.timeout_clean_usec": 18446744073709551615,
  "services.chronyd.service.watchdog_usec": 0,
  "system-state": 3,
  "timers.fstrim.timer.accuracy_usec": 3600000000,
  "timers.fstrim.timer.fixed_random_delay": 0,
  "timers.fstrim.timer.last_trigger_usec": 1743397269608978,
  "timers.fstrim.timer.last_trigger_usec_monotonic": 0,
  "timers.fstrim.timer.next_elapse_usec_monotonic": 0,
  "timers.fstrim.timer.next_elapse_usec_realtime": 1744007133996149,
  "timers.fstrim.timer.persistent": 1,
  "timers.fstrim.timer.randomized_delay_usec": 6000000000,
  "timers.fstrim.timer.remain_after_elapse": 1,
  "timers.fstrim.timer.service_unit_last_state_change_usec": 1743517244700135,
  "timers.fstrim.timer.service_unit_last_state_change_usec_monotonic": 639312703,
  "unit_states.chronyd.service.active_state": 1,
  "unit_states.chronyd.service.loaded_state": 1,
  "unit_states.chronyd.service.unhealthy": 0,
  "units.activating_units": 0,
  "units.active_units": 403,
  "units.automount_units": 1,
  "units.device_units": 150,
  "units.failed_units": 0,
  "units.inactive_units": 159,
  "units.jobs_queued": 0,
  "units.loaded_units": 497,
  "units.masked_units": 25,
  "units.mount_units": 52,
  "units.not_found_units": 38,
  "units.path_units": 4,
  "units.scope_units": 17,
  "units.service_units": 199,
  "units.slice_units": 7,
  "units.socket_units": 28,
  "units.target_units": 54,
  "units.timer_units": 20,
  "units.total_units": 562,
  "verify.failing.device": 43,
  "verify.failing.mount": 15,
  "verify.failing.service": 31,
  "verify.failing.slice": 1,
  "verify.failing.total": 97,
  "version": "255.7-1.fc40"
}
```
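
To illustrate the idea behind `json-flat`, here is a minimal sketch of flattening a nested structure into dot-notated keys. It is not monitord's implementation: the `Value` enum and `flatten` function are hypothetical stand-ins that use only the standard library, whereas the real code works with `serde_json`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a JSON value (numbers, strings and nested objects only).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Number(u64),
    Text(String),
    Map(BTreeMap<String, Value>),
}

/// Recursively move every leaf to the top level, joining the path with '.'.
pub fn flatten(prefix: &str, value: &Value, out: &mut BTreeMap<String, Value>) {
    match value {
        Value::Map(children) => {
            for (name, child) in children {
                // Avoid a leading '.' when there is no prefix yet.
                let key = if prefix.is_empty() {
                    name.clone()
                } else {
                    format!("{prefix}.{name}")
                };
                flatten(&key, child, out);
            }
        }
        // Anything that is not a map is a leaf and is stored under its full path.
        leaf => {
            out.insert(prefix.to_string(), leaf.clone());
        }
    }
}
```

Flattening a map `{"pid1": {"fd_count": 245}}` with an empty prefix yields the single key `pid1.fd_count`. Using a `BTreeMap` keeps the output keys sorted, as in the sample above.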

### json-pretty

Normal `serde_json` pretty representation of each component's structs.

## dbus stats

You need to be root, or be granted permission, to pull dbus stats.
For dbus-broker, here is an example config that allows a user `monitord` to query
`GetStats`:

```xml
[cooper@l33t ~]# cat /etc/dbus-1/system.d/allow_monitord_stats.conf
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE busconfig PUBLIC
 "-//freedesktop//DTD D-BUS Bus Configuration 1.0//EN"
 "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
<busconfig>
  <policy user="monitord">
    <allow send_destination="org.freedesktop.DBus"
           send_interface="org.freedesktop.DBus.Debug.Stats"
           send_member="GetStats"
           send_path="/org/freedesktop/DBus"
           send_type="method_call"/>
  </policy>
</busconfig>
```

## Development

To do test runs (requires `systemd` and `systemd-networkd` to be _installed_,
depending on what you have enabled in your config):

- `cargo run -- -c monitord.conf -l debug`

Ensure the following pass before submitting a PR (CI checks):

- `cargo test`
- `cargo clippy`
- `cargo fmt`

### Generate codegen APIs

- `cargo install zbus_xmlgen`
- `zbus-xmlgen system org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/chronyd_2eservice`

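Then add the following attributes to the generated file to tell clippy to go away:

```rust
// Silence all compiler warnings in the generated code
#![allow(warnings)]
// Silence clippy's default lint groups as well
#![allow(clippy::all)]
```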

### Non Linux development

Sometimes I develop on my macOS laptop, so I've documented how I build a
Fedora Rawhide container and mount the local repo at /repo
in the container to run and test monitord.

- Build the image (w/git, rust tools and systemd)
  - `docker build -t monitord-dev .`
- Start via systemd and mount the monitord repo to /repo
  - `docker run --rm --name monitord-dev -it --privileged --tmpfs /run --tmpfs /tmp -v $(pwd):/repo monitord-dev /sbin/init`
    - `--rm` is optional but will remove the container when stopped

You can now log into the container to build, run the tests, and run the binary against systemd.

- `docker exec -it monitord-dev bash`
- `cd /repo ; cargo run -- -c monitord.conf`
  - networkd etc. are not running by default but can be started ...
  - `systemctl start systemd-networkd`
    - No interfaces will be managed by default in the container though ...

## API Usage

## DBus

All of monitord's DBus communication is done via the async (tokio) [zbus](https://crates.io/crates/zbus) crate.

systemd DBus APIs are in use in the following modules:

- machines
  - `ManagerProxy::list_machines()`
  - Most other calls can then be made against the machine's own systemd/dbus
- networkd
  - `ManagerProxy::list_links()`
  - Would love to stop parsing `/run/systemd/netif/links` and replace via varlink API
    - https://github.com/systemd/systemd/issues/36877
- system
  - `ManagerProxy::get_version()`
  - `ManagerProxy::system_state()`
- timer
  - `TimerProxy::unit()` - Find service unit of timer
  - `ManagerProxy::get_unit()`
  - `UnitProxy::state_change_timestamp()`
  - `UnitProxy::state_change_timestamp_monotonic()`
- units
  - `ManagerProxy::list_units()` - Main counting of unit stats
  - `ServiceProxy::cpuusage_nsec()`
  - `ServiceProxy::ioread_bytes()`
  - `ServiceProxy::ioread_operations()`
  - `ServiceProxy::memory_current()`
  - `ServiceProxy::memory_available()`
  - `ServiceProxy::nrestarts()`
  - `ServiceProxy::get_processes()`
  - `ServiceProxy::restart_usec()`
  - `ServiceProxy::status_errno()`
  - `ServiceProxy::tasks_current()`
  - `ServiceProxy::timeout_clean_usec()`
  - `ServiceProxy::watchdog_usec()`
  - `UnitProxy::active_enter_timestamp()`
  - `UnitProxy::active_exit_timestamp()`
  - `UnitProxy::inactive_exit_timestamp()`
  - `UnitProxy::state_change_timestamp()` - Used for raw stat + time_in_state

Some of these modules can be disabled via configuration, so monitord might not
make all of these DBus calls on every run.
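
Several of the service properties listed above show up as `18446744073709551615` in the `json-flat` sample. That number is `u64::MAX`, which systemd appears to use as a "not set" or "infinity" marker for counters such as `ioread_bytes` and `memory_available`. A consumer of the stats may want to treat it as missing; the helper below is a hypothetical sketch and not part of monitord's API.

```rust
/// Map the `u64::MAX` sentinel to `None`, keeping every other value.
/// Assumption: `u64::MAX` means "not set" for the property being read.
pub fn set_or_none(raw: u64) -> Option<u64> {
    (raw != u64::MAX).then_some(raw)
}
```

For example, `set_or_none(5214208)` keeps the `memory_current` reading from the sample, while the `ioread_bytes` value from the same sample becomes `None`.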

## Varlink

None yet :(.

### varlink 101

varlink might one day replace our DBus usage. Here are some notes on how to work with systemd varlink,
as there isn't really any documentation outside the `man` pages.

#### Checking interfaces

- varlinkctl is your friend - https://man7.org/linux/man-pages/man1/varlinkctl.1.html

Here is an example with networkd's interfaces:

```console
varlinkctl info unix:/run/systemd/netif/io.systemd.Network
varlinkctl introspect unix:/run/systemd/netif/io.systemd.Network io.systemd.Network

cooper@au:~$ varlinkctl call unix:/run/systemd/netif/io.systemd.Network io.systemd.Network.GetStates '{}' -j | jq
{
  "AddressState": "routable",
  "IPv4AddressState": "routable",
  "IPv6AddressState": "routable",
  "CarrierState": "carrier",
  "OnlineState": "online",
  "OperationalState": "routable"
}
```
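
Until there is a native varlink client in monitord, one way to experiment from Rust is to shell out to `varlinkctl`. The sketch below only builds the same command line as the `call` example above. The helper name `varlinkctl_call` is hypothetical, and it assumes `varlinkctl` is on `PATH` when the command is eventually run.

```rust
use std::process::Command;

/// Build a `varlinkctl call <socket> <method> <params> -j` command.
/// Nothing is executed here; the caller decides when to run it.
pub fn varlinkctl_call(socket: &str, method: &str, params: &str) -> Command {
    let mut cmd = Command::new("varlinkctl");
    // Same argument order as the shell example: call, address, method, JSON params, -j.
    cmd.args(["call", socket, method, params, "-j"]);
    cmd
}
```

Running it is then a matter of calling `.output()` on the returned `Command` and parsing the JSON on stdout, for example with `serde_json`.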