# Zeus daemon (`zeusd`)
`zeusd` is a system daemon that runs with admin privileges and exposes HTTP API endpoints for GPU management and GPU/CPU power streaming.
## Problem
Energy optimizers in Zeus need to change GPU configuration, such as the power limit or clock frequency, which requires the Linux security capability `SYS_ADMIN` (which is pretty much `sudo`).
However, it's not a good idea to grant the entire application such strong privileges just to be able to change GPU configurations.
Additionally, monitoring GPU and CPU power across multiple nodes in a cluster requires a lightweight, always-on service that can stream readings to remote clients.
## Solution
`zeusd` runs as a privileged daemon process on the node and provides:
- **GPU management endpoints** that wrap privileged NVML methods, so `zeusd` can change GPU configuration on behalf of unprivileged applications.
- **GPU power streaming** via SSE (Server-Sent Events) using NVML instant power readings.
- **CPU power streaming** via SSE using RAPL energy counters (Intel and modern AMD CPUs).
Power polling is demand-driven: `zeusd` only reads from hardware while at least one client is connected, so idle endpoints consume no resources.
To keep latency as low as possible, `zeusd` is written in Rust.
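The demand-driven polling idea can be pictured with a small Python sketch. This is only an illustration of the concept, not how `zeusd` (which is written in Rust) implements it; the class `DemandDrivenPoller` and its methods are hypothetical names.

```python
from typing import Callable, Optional


class DemandDrivenPoller:
    """Reads from a sensor only while at least one subscriber is attached.

    Hypothetical illustration; not part of zeusd or the Zeus library.
    """

    def __init__(self, read_sensor: Callable[[], float]) -> None:
        self._read_sensor = read_sensor
        self._subscribers = 0
        self.reads = 0  # number of hardware reads performed so far

    def subscribe(self) -> None:
        self._subscribers += 1

    def unsubscribe(self) -> None:
        self._subscribers = max(0, self._subscribers - 1)

    def tick(self) -> Optional[float]:
        """Called once per polling interval; returns None when idle."""
        if self._subscribers == 0:
            return None  # nobody is listening, so skip the hardware read
        self.reads += 1
        return self._read_sensor()
```

With no subscribers, `tick()` never touches the sensor callback, which is the property that keeps idle endpoints cheap.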
## How to use `zeusd`
First, install `zeusd`:
```sh
cargo install zeusd
```
Both modes (UDS and TCP) serve the same HTTP API. The only difference is the transport layer.
### UDS mode
UDS (Unix domain socket) mode is the default. It's intended for local communication between processes on the same node.
```sh
sudo zeusd serve --socket-path /var/run/zeusd.sock --socket-permissions 666
```
To allow the Zeus Python library to recognize that `zeusd` is available, set:
```sh
export ZEUSD_SOCK_PATH=/var/run/zeusd.sock
```
When Zeus detects `ZEUSD_SOCK_PATH`, it'll automatically instantiate the right GPU backend and relay privileged GPU management method calls to `zeusd`.
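If you want to talk to the socket without the Zeus library, the following is a minimal standard-library sketch of HTTP over a Unix domain socket. The class `UnixHTTPConnection` and the helpers `discover` and `default_socket_path` are hypothetical names for illustration, and this is not necessarily how the Zeus Python library connects.

```python
import http.client
import json
import os
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 5.0) -> None:
        # The host is only used for the Host header; "localhost" is a placeholder.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def default_socket_path() -> str:
    # Fall back to the documented default when ZEUSD_SOCK_PATH is unset.
    return os.environ.get("ZEUSD_SOCK_PATH", "/var/run/zeusd.sock")


def discover(socket_path: str) -> dict:
    """Call GET /discover over the socket and decode the JSON body."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/discover")
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```

Calling `discover(default_socket_path())` on a node where `zeusd` is running in UDS mode should return the discovery document described in the API reference.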
### TCP mode
TCP mode exposes the same API over a TCP socket, making it accessible from remote hosts. This is useful for cluster-wide power monitoring.
```sh
sudo zeusd serve --mode tcp --tcp-bind-address 0.0.0.0:4938
```
Example queries via curl:
```sh
# Discovery
curl http://localhost:4938/discover
# One-shot GPU power reading
curl http://localhost:4938/gpu/get_power
# One-shot CPU power reading (RAPL)
curl http://localhost:4938/cpu/get_power
# SSE stream of GPU power (Ctrl-C to stop)
curl -N http://localhost:4938/gpu/stream_power
# SSE stream of CPU power
curl -N http://localhost:4938/cpu/stream_power
# Filter to specific devices
curl "http://localhost:4938/gpu/get_power?gpu_ids=0,1"
curl "http://localhost:4938/cpu/get_power?cpu_ids=0"
# GPU management via query params
curl -X POST 'http://localhost:4938/gpu/set_power_limit?gpu_ids=0,1&power_limit_mw=200000&block=true'
```
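The same query strings can be assembled programmatically. The helper below is a small sketch using only the standard library; `build_url` is a hypothetical name, and the encoding rules (comma-separated lists, lowercase booleans) are inferred from the curl examples above.

```python
from urllib.parse import urlencode


def build_url(base: str, path: str, **params) -> str:
    """Build a zeusd request URL; lists become comma-separated, bools lowercase."""
    encoded = {}
    for key, value in params.items():
        if isinstance(value, bool):
            encoded[key] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            encoded[key] = ",".join(str(v) for v in value)
        else:
            encoded[key] = str(value)
    # Keep commas unescaped so the result matches the curl examples.
    query = urlencode(encoded, safe=",")
    return f"{base}{path}?{query}" if query else f"{base}{path}"
```

The resulting URL can be passed to `urllib.request.Request(url, method="POST")` for the management endpoints.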
On the Python side, use `PowerStreamingClient` to connect to one or more `zeusd` instances:
```python
from zeus.utils.zeusd import ZeusdConfig
from zeus.monitor.power_streaming import PowerStreamingClient
client = PowerStreamingClient(
    servers=[
        ZeusdConfig.tcp(host="node1", port=4938, gpu_indices=[0, 1, 2, 3]),
        ZeusdConfig.uds(socket_path="/var/run/zeusd.sock", gpu_indices=[0, 1, 2, 3]),
    ],
)
# Get current power reading once
readings = client.get_power()
# Continuously stream power readings
for power_reading in client:
    print(power_reading)
```
See the [Distributed Power Measurement and Aggregation](https://ml.energy/zeus/measure/#distributed-power-measurement-and-aggregation) section in our documentation for more details.
### API groups
`zeusd` organizes its endpoints into API groups that can be selectively enabled with the `--enable` flag. By default, all groups are enabled.
| Group | Endpoints | Requires root |
|---|---|---|
| `gpu-control` | `POST /gpu/set_persistence_mode` | Yes |
| | `POST /gpu/set_power_limit` | |
| | `POST /gpu/set_gpu_locked_clocks` | |
| | `POST /gpu/reset_gpu_locked_clocks` | |
| | `POST /gpu/set_mem_locked_clocks` | |
| | `POST /gpu/reset_mem_locked_clocks` | |
| `gpu-read` | `GET /gpu/get_power` | No |
| | `GET /gpu/stream_power` | |
| | `GET /gpu/get_cumulative_energy` | |
| `cpu-read` | `GET /cpu/get_cumulative_energy` | Yes |
| | `GET /cpu/get_power` | |
| | `GET /cpu/stream_power` | |
The following endpoints are always available regardless of which groups are enabled:
- `GET /discover`
- `GET /time`
If a group that requires root is enabled but the daemon is not running as root, it will exit immediately with an error.
Examples:
```sh
# As root: all groups enabled (default)
sudo zeusd serve --mode tcp --tcp-bind-address 0.0.0.0:4938
# As non-root: GPU monitoring only (no root required)
zeusd serve --mode tcp --tcp-bind-address 0.0.0.0:4938 --enable gpu-read
# As root: monitoring only (GPU + CPU reads, no GPU control)
sudo zeusd serve --mode tcp --tcp-bind-address 0.0.0.0:4938 --enable gpu-read,cpu-read
```
Only the devices needed by the enabled groups are initialized. For example, `--enable gpu-read` skips RAPL initialization entirely, and `--enable cpu-read` skips NVML initialization.
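The startup rule for root-only groups can be summarized in a few lines of Python. This is a sketch of the documented behavior, not the daemon's actual code; `ROOT_REQUIRED` and `check_groups` are hypothetical names.

```python
# Which API groups need root, per the documentation above.
ROOT_REQUIRED = {"gpu-control": True, "gpu-read": False, "cpu-read": True}


def check_groups(enabled: list, is_root: bool) -> None:
    """Raise if any enabled group needs root but the process is not root."""
    unknown = [g for g in enabled if g not in ROOT_REQUIRED]
    if unknown:
        raise ValueError(f"unknown API groups: {unknown}")
    needs_root = [g for g in enabled if ROOT_REQUIRED[g]]
    if needs_root and not is_root:
        raise PermissionError(f"groups {needs_root} require root")
```

For instance, `check_groups(["gpu-read"], is_root=False)` passes, while adding `"cpu-read"` to the list raises `PermissionError`, mirroring the daemon exiting at startup.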
### Authentication
`zeusd` supports optional per-user JWT authentication. When `--signing-key-path` is provided, all endpoints except `/discover` and `/time` require a valid `Authorization: Bearer <token>` header.
**Setting up a signing key:**
```sh
# Generate a 32-byte signing key (shared across all daemons in a cluster)
openssl rand -base64 32 > /etc/zeusd/signing.key
chmod 600 /etc/zeusd/signing.key
```
**Starting the daemon with auth:**
```sh
sudo zeusd serve --mode tcp --tcp-bind-address 0.0.0.0:4938 --signing-key-path /etc/zeusd/signing.key
```
**Issuing tokens:**
```sh
# Token with 7-day expiry and GPU read scope
zeusd token issue \
    --signing-key-path /etc/zeusd/signing.key \
    --user alice \
    --scope gpu-read \
    --expires 7d

# Token with multiple scopes and no expiry
zeusd token issue \
    --signing-key-path /etc/zeusd/signing.key \
    --user alice \
    --scope gpu-read,gpu-control,cpu-read \
    --expires never
```
`--expires` accepts human-readable durations (`1h`, `7d`, `30d`) or `never`/`0` for tokens that never expire.
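To make the duration format concrete, here is a rough Python sketch of a parser for the values shown above. It only covers the `h` and `d` suffixes that appear in the examples; the real parser in `zeusd` may accept more units, and `parse_expiry` is a hypothetical name.

```python
import re
from typing import Optional

# Only the suffixes that appear in the documented examples.
_UNIT_SECONDS = {"h": 3600, "d": 86400}


def parse_expiry(text: str) -> Optional[int]:
    """Return the lifetime in seconds, or None for a token that never expires."""
    if text in ("never", "0"):
        return None
    match = re.fullmatch(r"(\d+)([hd])", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
```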
**Using tokens with Python clients:**
Set the `ZEUSD_TOKEN` environment variable, or pass the token directly:
```sh
export ZEUSD_TOKEN="eyJ..."
```
Python clients (`ZeusdNVIDIAGPU`, `ZeusdRAPLCPU`, `PowerStreamingClient`) automatically check `/discover` to determine whether auth is required. If auth is required and no token is available, an error is raised with a clear message.
**Using tokens with curl:**
```sh
curl -H "Authorization: Bearer $ZEUSD_TOKEN" http://localhost:4938/gpu/get_power
```
When no `--signing-key-path` is provided, the daemon runs without authentication and all endpoints are freely accessible. The `/discover` endpoint always reports `auth_required: true` or `false` so clients can adapt.
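A custom client can follow the same pattern as the Python clients: read `auth_required` from `/discover`, then attach a token only when needed. The function below is a sketch under that assumption; `auth_headers` is a hypothetical helper, not part of the Zeus library.

```python
import os
from typing import Optional


def auth_headers(discover_response: dict, token: Optional[str] = None) -> dict:
    """Return the headers to attach to requests, given a /discover response."""
    if not discover_response.get("auth_required", False):
        return {}  # daemon runs without authentication
    # Prefer an explicitly passed token, then fall back to ZEUSD_TOKEN.
    token = token or os.environ.get("ZEUSD_TOKEN")
    if not token:
        raise RuntimeError(
            "zeusd requires authentication; set ZEUSD_TOKEN or pass a token"
        )
    return {"Authorization": f"Bearer {token}"}
```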
## API Reference
### Discovery
#### `GET /discover`
Returns available devices, capabilities, and enabled API groups.
Response:
| Field | Type | Description |
|---|---|---|
| `gpu_ids` | `int[]` | Available GPU indices |
| `cpu_ids` | `int[]` | Available CPU indices |
| `dram_available` | `bool[]` | Per-CPU DRAM energy support (indexed by position in `cpu_ids`) |
| `enabled_api_groups` | `string[]` | API groups enabled on this instance |
| `auth_required` | `bool` | Whether JWT authentication is required |
Example response:
```json
{
  "gpu_ids": [0, 1, 2, 3],
  "cpu_ids": [0, 1],
  "dram_available": [true, false],
  "enabled_api_groups": ["gpu-control", "gpu-read", "cpu-read"],
  "auth_required": false
}
```
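Because `dram_available` is positional, it is easy to misread. The sketch below parses a discovery response and pairs each DRAM flag with its CPU index; the `Discovery` dataclass and `parse_discovery` are hypothetical names used only for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Discovery:
    gpu_ids: list
    cpu_ids: list
    dram_by_cpu: dict  # CPU index -> whether DRAM energy is available
    enabled_api_groups: list
    auth_required: bool


def parse_discovery(body: str) -> Discovery:
    """Decode a /discover JSON body into a Discovery record."""
    data = json.loads(body)
    # dram_available is indexed by position in cpu_ids, so zip them together.
    dram_by_cpu = dict(zip(data["cpu_ids"], data["dram_available"]))
    return Discovery(
        gpu_ids=data["gpu_ids"],
        cpu_ids=data["cpu_ids"],
        dram_by_cpu=dram_by_cpu,
        enabled_api_groups=data["enabled_api_groups"],
        auth_required=data["auth_required"],
    )
```

For the example response above, `dram_by_cpu` would map CPU 0 to `True` and CPU 1 to `False`.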
### GPU
All GPU endpoints are under the `/gpu` scope.
#### `POST /gpu/set_power_limit`
Set GPU power management limit.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `gpu_ids` | `string` | yes | Comma-separated GPU indices |
| `power_limit_mw` | `int` | yes | Power limit in milliwatts |
| `block` | `bool` | yes | Wait for completion |
#### `POST /gpu/set_persistence_mode`
Set GPU persistence mode.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `gpu_ids` | `string` | yes | Comma-separated GPU indices |
| `enabled` | `bool` | yes | Enable or disable persistence mode |
| `block` | `bool` | yes | Wait for completion |
#### `POST /gpu/set_gpu_locked_clocks`
Lock GPU core clocks to a range.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `gpu_ids` | `string` | yes | Comma-separated GPU indices |
| `min_clock_mhz` | `int` | yes | Minimum clock in MHz |
| `max_clock_mhz` | `int` | yes | Maximum clock in MHz |
| `block` | `bool` | yes | Wait for completion |
#### `POST /gpu/reset_gpu_locked_clocks`
Reset GPU core locked clocks.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `gpu_ids` | `string` | yes | Comma-separated GPU indices |
| `block` | `bool` | yes | Wait for completion |
#### `POST /gpu/set_mem_locked_clocks`
Lock GPU memory clocks to a range.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `gpu_ids` | `string` | yes | Comma-separated GPU indices |
| `min_clock_mhz` | `int` | yes | Minimum clock in MHz |
| `max_clock_mhz` | `int` | yes | Maximum clock in MHz |
| `block` | `bool` | yes | Wait for completion |
#### `POST /gpu/reset_mem_locked_clocks`
Reset GPU memory locked clocks.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `gpu_ids` | `string` | yes | Comma-separated GPU indices |
| `block` | `bool` | yes | Wait for completion |
#### `GET /gpu/get_cumulative_energy[?gpu_ids=0,1]`
Total energy consumption since driver load (NVML). `gpu_ids` is optional (omit = all GPUs).
Response (map keyed by GPU ID):
```json
{"0": {"energy_mj": 123456}, "1": {"energy_mj": 789012}}
```
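Two cumulative readings taken some time apart give the average power over that interval. The following sketch assumes `energy_mj` is in millijoules, as the field name suggests, so that millijoules divided by seconds yields milliwatts; `average_power_mw` is a hypothetical helper.

```python
def average_power_mw(before: dict, after: dict, elapsed_s: float) -> dict:
    """Average power per GPU (mW) between two cumulative-energy responses."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return {
        gpu_id: (after[gpu_id]["energy_mj"] - before[gpu_id]["energy_mj"]) / elapsed_s
        for gpu_id in before
        if gpu_id in after  # skip GPUs missing from the second reading
    }
```

The elapsed time should come from the client's own monotonic clock (for example `time.monotonic()`) measured around the two requests.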
#### `GET /gpu/get_power[?gpu_ids=0,1]`
One-shot GPU power reading. `gpu_ids` is optional (omit = all GPUs).
#### `GET /gpu/stream_power[?gpu_ids=0,1]`
SSE stream of GPU power readings. `gpu_ids` is optional (omit = all GPUs).
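For clients that do not use `PowerStreamingClient`, a stream can be consumed with a small Server-Sent Events parser. The sketch below handles only `data:` fields and comment lines from the generic SSE format; the payload schema of each event is not specified in this document, so the function yields raw strings that the caller can decode (for example with `json.loads`). `iter_sse_data` is a hypothetical name.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in a Server-Sent Events stream."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional leading space.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event, id, retry) are ignored in this sketch.
```

Any iterable of decoded text lines works as input, such as the lines read from an open HTTP response.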
### CPU
All CPU endpoints are under the `/cpu` scope.
#### `GET /cpu/get_cumulative_energy`
Get cumulative RAPL energy counters.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `cpu_ids` | `string` | yes | Comma-separated CPU indices |
| `cpu` | `bool` | yes | Include CPU package energy |
| `dram` | `bool` | yes | Include DRAM energy |
Response is a map keyed by CPU ID:
```json
{
  "0": {"cpu_energy_uj": 123456, "dram_energy_uj": 78901},
  "1": {"cpu_energy_uj": 234567, "dram_energy_uj": null}
}
```
#### `GET /cpu/get_power[?cpu_ids=0,1]`
One-shot CPU power reading (computed from RAPL energy deltas). `cpu_ids` is optional (omit = all CPUs).
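Computing power from energy deltas amounts to dividing the counter difference by the elapsed time. The sketch below shows that arithmetic for microjoule counters, including a simple correction for counter wraparound. It is an assumption that a wrapped counter needs handling on the client side at all, and `rapl_power_w` and `max_range_uj` are hypothetical names; how `zeusd` computes the value internally is not described here.

```python
def rapl_power_w(prev_uj: int, curr_uj: int, elapsed_s: float, max_range_uj: int) -> float:
    """Average power in watts between two RAPL energy counter samples."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_uj = curr_uj - prev_uj
    if delta_uj < 0:
        # Assume the counter wrapped around exactly once between samples.
        delta_uj += max_range_uj
    # Microjoules -> joules, then joules per second = watts.
    return delta_uj / 1e6 / elapsed_s
```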
#### `GET /cpu/stream_power[?cpu_ids=0,1]`
SSE stream of CPU power readings. `cpu_ids` is optional (omit = all CPUs).
### Full help message
```console
$ zeusd --help
The Zeus daemon manages and monitors compute devices on the node

Usage: zeusd <COMMAND>

Commands:
  serve  Start the Zeus daemon
  token  Token management
  help   Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version
```
```console
$ zeusd serve --help
Start the Zeus daemon

Usage: zeusd serve [OPTIONS]

Options:
      --mode <MODE>
          Operating mode: UDS or TCP

          [default: uds]

          Possible values:
          - uds: Unix domain socket
          - tcp: TCP

      --socket-path <SOCKET_PATH>
          [UDS mode] Path to the socket Zeusd will listen on

          [default: /var/run/zeusd.sock]

      --socket-permissions <SOCKET_PERMISSIONS>
          [UDS mode] Permissions for the socket file to be created

          [default: 666]

      --socket-uid <SOCKET_UID>
          [UDS mode] UID to chown the socket file to

      --socket-gid <SOCKET_GID>
          [UDS mode] GID to chown the socket file to

      --tcp-bind-address <TCP_BIND_ADDRESS>
          [TCP mode] Address to bind to

          [default: 127.0.0.1:4938]

      --num-workers <NUM_WORKERS>
          Number of worker threads to use. Default is the number of logical CPUs

      --gpu-power-poll-hz <GPU_POWER_POLL_HZ>
          GPU power polling frequency in Hz for the streaming endpoint

          [default: 20]

      --cpu-power-poll-hz <CPU_POWER_POLL_HZ>
          CPU RAPL power polling frequency in Hz for the streaming endpoint

          [default: 10]

      --enable <ENABLE>
          API groups to enable. Each group exposes a set of HTTP endpoints. Groups that require root will cause the daemon to exit at startup if it is not running as root

          [default: gpu-control gpu-read cpu-read]

          Possible values:
          - gpu-control: GPU control operations (set power limit, clocks, persistence mode). Requires root
          - gpu-read:    GPU read operations (power reading, energy consumption)
          - cpu-read:    CPU RAPL read operations (energy, power). Requires root

      --signing-key-path <SIGNING_KEY_PATH>
          Path to the HMAC-SHA256 signing key file for JWT authentication. If not provided, authentication is disabled

  -h, --help
          Print help (see a summary with '-h')
```
```console
$ zeusd token issue --help
Issue a new JWT token for a user

Usage: zeusd token issue [OPTIONS] --signing-key-path <SIGNING_KEY_PATH> --user <USER> --expires <EXPIRES>

Options:
      --signing-key-path <SIGNING_KEY_PATH>
          Path to the HMAC-SHA256 signing key file

      --user <USER>
          User identity to embed in the token (the `sub` claim)

      --scope <SCOPE>
          API group scopes to grant. Comma-separated

          Possible values:
          - gpu-control: GPU control operations (set power limit, clocks, persistence mode). Requires root
          - gpu-read:    GPU read operations (power reading, energy consumption)
          - cpu-read:    CPU RAPL read operations (energy, power). Requires root

      --expires <EXPIRES>
          Token lifetime. Human-readable duration (e.g., "1h", "7d", "30d"). Use "never" for tokens that do not expire

  -h, --help
          Print help (see a summary with '-h')
```