Zeus daemon (zeusd)
zeusd is a system daemon that runs with admin privileges and exposes HTTP API endpoints for GPU management and GPU/CPU power streaming.
Problem
Energy optimizers in Zeus need to change the GPU's configurations including its power limit or frequency, which requires the Linux security capability SYS_ADMIN (which is pretty much sudo).
However, it's not a good idea to grant the entire application such strong privileges just to be able to change GPU configurations.
Additionally, monitoring GPU and CPU power across multiple nodes in a cluster requires a lightweight, always-on service that can stream readings to remote clients.
Solution
zeusd runs as a privileged daemon process on the node and provides:
- GPU management endpoints that wrap privileged NVML methods, so unprivileged applications can change GPU configuration on their behalf.
- GPU power streaming via SSE (Server-Sent Events) using NVML instant power readings.
- CPU power streaming via SSE using RAPL energy counters (Intel and modern AMD CPUs).
Power polling is demand-driven: zeusd only reads from hardware while at least one client is connected, so idle endpoints consume no resources.
To make this as low latency as possible, zeusd was written in Rust.
How to use zeusd
First, install zeusd:
Both modes (UDS and TCP) serve the same HTTP API. The only difference is the transport layer.
UDS mode
UDS (Unix domain socket) mode is the default. It's intended for local communication between processes on the same node.
To allow the Zeus Python library to recognize that zeusd is available, set:
When Zeus detects ZEUSD_SOCK_PATH, it'll automatically instantiate the right GPU backend and relay privileged GPU management method calls to zeusd.
TCP mode
TCP mode exposes the same API over a TCP socket, making it accessible from remote hosts. This is useful for cluster-wide power monitoring.
Example queries via curl:
# Discovery
# One-shot GPU power reading
# One-shot CPU power reading (RAPL)
# SSE stream of GPU power (Ctrl-C to stop)
# SSE stream of CPU power
# Filter to specific devices
# GPU management via query params
On the Python side, use PowerStreamingClient to connect to one or more zeusd instances:
=
# Get current power reading once
=
# Continuously stream power readings
See the Distributed Power Measurement and Aggregation section in our documentation for more details.
API groups
zeusd organizes its endpoints into API groups that can be selectively enabled with the --enable flag. By default, all groups are enabled.
| Group | Endpoints | Requires root |
|---|---|---|
gpu-control |
POST /gpu/set_persistence_mode |
Yes |
POST /gpu/set_power_limit |
||
POST /gpu/set_gpu_locked_clocks |
||
POST /gpu/reset_gpu_locked_clocks |
||
POST /gpu/set_mem_locked_clocks |
||
POST /gpu/reset_mem_locked_clocks |
||
gpu-read |
GET /gpu/get_power |
No |
GET /gpu/stream_power |
||
GET /gpu/get_cumulative_energy |
||
cpu-read |
GET /cpu/get_cumulative_energy |
Yes |
GET /cpu/get_power |
||
GET /cpu/stream_power |
The following endpoints are always available regardless of which groups are enabled:
GET /discoverGET /time
If a group that requires root is enabled but the daemon is not running as root, it will exit immediately with an error.
Examples:
# As root: all groups enabled (default)
# As non-root: GPU monitoring only (no root required)
# As root: monitoring only (GPU + CPU reads, no GPU control)
Only the devices needed by the enabled groups are initialized. For example, --enable gpu-read skips RAPL initialization entirely, and --enable cpu-read skips NVML initialization.
Authentication
zeusd supports optional per-user JWT authentication. When --signing-key-path is provided, all endpoints except /discover and /time require a valid Authorization: Bearer <token> header.
Setting up a signing key:
# Generate a 32-byte signing key (shared across all daemons in a cluster)
Starting the daemon with auth:
Issuing tokens:
# Token with 7-day expiry and GPU read scope
# Token with multiple scopes and no expiry
--expires accepts human-readable durations (1h, 7d, 30d) or never/0 for tokens that never expire.
Using tokens with Python clients:
Set the ZEUSD_TOKEN environment variable, or pass the token directly:
Python clients (ZeusdNVIDIAGPU, ZeusdRAPLCPU, PowerStreamingClient) automatically check /discover to determine whether auth is required. If auth is required and no token is available, an error is raised with a clear message.
Using tokens with curl:
When no --signing-key-path is provided, the daemon runs without authentication and all endpoints are freely accessible. The /discover endpoint always reports auth_required: true or false so clients can adapt.
API Reference
Discovery
GET /discover
Returns available devices, capabilities, and enabled API groups.
Response:
| Field | Type | Description |
|---|---|---|
gpu_ids |
int[] |
Available GPU indices |
cpu_ids |
int[] |
Available CPU indices |
dram_available |
bool[] |
Per-CPU DRAM energy support (indexed by position in cpu_ids) |
enabled_api_groups |
string[] |
API groups enabled on this instance |
auth_required |
bool |
Whether JWT authentication is required |
Example response:
GPU
All GPU endpoints are under the /gpu scope.
POST /gpu/set_power_limit
Set GPU power management limit.
| Query param | Type | Required | Description |
|---|---|---|---|
gpu_ids |
string |
yes | Comma-separated GPU indices |
power_limit_mw |
int |
yes | Power limit in milliwatts |
block |
bool |
yes | Wait for completion |
POST /gpu/set_persistence_mode
Set GPU persistence mode.
| Query param | Type | Required | Description |
|---|---|---|---|
gpu_ids |
string |
yes | Comma-separated GPU indices |
enabled |
bool |
yes | Enable or disable persistence mode |
block |
bool |
yes | Wait for completion |
POST /gpu/set_gpu_locked_clocks
Lock GPU core clocks to a range.
| Query param | Type | Required | Description |
|---|---|---|---|
gpu_ids |
string |
yes | Comma-separated GPU indices |
min_clock_mhz |
int |
yes | Minimum clock in MHz |
max_clock_mhz |
int |
yes | Maximum clock in MHz |
block |
bool |
yes | Wait for completion |
POST /gpu/reset_gpu_locked_clocks
Reset GPU core locked clocks.
| Query param | Type | Required | Description |
|---|---|---|---|
gpu_ids |
string |
yes | Comma-separated GPU indices |
block |
bool |
yes | Wait for completion |
POST /gpu/set_mem_locked_clocks
Lock GPU memory clocks to a range.
| Query param | Type | Required | Description |
|---|---|---|---|
gpu_ids |
string |
yes | Comma-separated GPU indices |
min_clock_mhz |
int |
yes | Minimum clock in MHz |
max_clock_mhz |
int |
yes | Maximum clock in MHz |
block |
bool |
yes | Wait for completion |
POST /gpu/reset_mem_locked_clocks
Reset GPU memory locked clocks.
| Query param | Type | Required | Description |
|---|---|---|---|
gpu_ids |
string |
yes | Comma-separated GPU indices |
block |
bool |
yes | Wait for completion |
GET /gpu/get_cumulative_energy[?gpu_ids=0,1]
Total energy consumption since driver load (NVML). gpu_ids is optional (omit = all GPUs).
Response (map keyed by GPU ID):
GET /gpu/get_power[?gpu_ids=0,1]
One-shot GPU power reading. gpu_ids is optional (omit = all GPUs).
GET /gpu/stream_power[?gpu_ids=0,1]
SSE stream of GPU power readings. gpu_ids is optional (omit = all GPUs).
CPU
All CPU endpoints are under the /cpu scope.
GET /cpu/get_cumulative_energy
Get cumulative RAPL energy counters.
| Query param | Type | Required | Description |
|---|---|---|---|
cpu_ids |
string |
yes | Comma-separated CPU indices |
cpu |
bool |
yes | Include CPU package energy |
dram |
bool |
yes | Include DRAM energy |
Response is a map keyed by CPU ID:
GET /cpu/get_power[?cpu_ids=0,1]
One-shot CPU power reading (computed from RAPL energy deltas). cpu_ids is optional (omit = all CPUs).
GET /cpu/stream_power[?cpu_ids=0,1]
SSE stream of CPU power readings. cpu_ids is optional (omit = all CPUs).
Full help message
$ zeusd --help
The Zeus daemon manages and monitors compute devices on the node
Usage: zeusd <COMMAND>
Commands:
serve Start the Zeus daemon
token Token management
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
$ zeusd serve --help
Start the Zeus daemon
Usage: zeusd serve [OPTIONS]
Options:
--mode <MODE>
Operating mode: UDS or TCP
[default: uds]
Possible values:
- uds: Unix domain socket
- tcp: TCP
--socket-path <SOCKET_PATH>
[UDS mode] Path to the socket Zeusd will listen on
[default: /var/run/zeusd.sock]
--socket-permissions <SOCKET_PERMISSIONS>
[UDS mode] Permissions for the socket file to be created
[default: 666]
--socket-uid <SOCKET_UID>
[UDS mode] UID to chown the socket file to
--socket-gid <SOCKET_GID>
[UDS mode] GID to chown the socket file to
--tcp-bind-address <TCP_BIND_ADDRESS>
[TCP mode] Address to bind to
[default: 127.0.0.1:4938]
--num-workers <NUM_WORKERS>
Number of worker threads to use. Default is the number of logical CPUs
--gpu-power-poll-hz <GPU_POWER_POLL_HZ>
GPU power polling frequency in Hz for the streaming endpoint
[default: 20]
--cpu-power-poll-hz <CPU_POWER_POLL_HZ>
CPU RAPL power polling frequency in Hz for the streaming endpoint
[default: 10]
--enable <ENABLE>
API groups to enable. Each group exposes a set of HTTP endpoints. Groups that require root will cause the daemon to exit at startup if it is not running as root
[default: gpu-control gpu-read cpu-read]
Possible values:
- gpu-control: GPU control operations (set power limit, clocks, persistence mode). Requires root
- gpu-read: GPU read operations (power reading, energy consumption)
- cpu-read: CPU RAPL read operations (energy, power). Requires root
--signing-key-path <SIGNING_KEY_PATH>
Path to the HMAC-SHA256 signing key file for JWT authentication. If not provided, authentication is disabled
-h, --help
Print help (see a summary with '-h')
$ zeusd token issue --help
Issue a new JWT token for a user
Usage: zeusd token issue [OPTIONS] --signing-key-path <SIGNING_KEY_PATH> --user <USER> --expires <EXPIRES>
Options:
--signing-key-path <SIGNING_KEY_PATH>
Path to the HMAC-SHA256 signing key file
--user <USER>
User identity to embed in the token (the `sub` claim)
--scope <SCOPE>
API group scopes to grant. Comma-separated
Possible values:
- gpu-control: GPU control operations (set power limit, clocks, persistence mode). Requires root
- gpu-read: GPU read operations (power reading, energy consumption)
- cpu-read: CPU RAPL read operations (energy, power). Requires root
--expires <EXPIRES>
Token lifetime. Human-readable duration (e.g., "1h", "7d", "30d"). Use "never" for tokens that do not expire
-h, --help
Print help (see a summary with '-h')