hpc-node 2026.1.28


Shared contracts for node-level resource management between HPC systems. This crate defines traits and types for cgroup v2 management, Linux namespace handoff, mount lifecycle, and boot readiness signaling.

This crate enables multiple applications (like Pact and Lattice) to share common node management conventions while implementing their own backends independently.

Features

  • cgroup v2 Conventions: Shared slice naming (pact.slice/, workload.slice/), ownership model, scope management trait, resource limits
  • Namespace Handoff: Protocol for passing Linux namespace FDs between processes over a Unix socket (SCM_RIGHTS)
  • Mount Management: Refcounted mount trait with lazy unmount and crash-recovery reconstruction
  • Readiness Signaling: Boot readiness gate trait for coordinating init and workload systems

Installation

Add to your Cargo.toml:

[dependencies]
hpc-node = "2026.1"

Or to use the latest development version from git:

[dependencies]
hpc-node = { git = "https://github.com/witlox/hpc-core" }

Usage

1. Implement the CgroupManager Trait

Each application provides its own cgroup management backend:

use hpc_node::{CgroupManager, CgroupHandle, CgroupError, CgroupMetrics, ResourceLimits};

struct MyCgroupManager;

impl CgroupManager for MyCgroupManager {
    fn create_hierarchy(&self) -> Result<(), CgroupError> {
        // Create pact.slice/ and workload.slice/ in cgroup v2 filesystem
        todo!()
    }

    fn create_scope(
        &self,
        parent_slice: &str,
        name: &str,
        limits: &ResourceLimits,
    ) -> Result<CgroupHandle, CgroupError> {
        // Create a scoped cgroup under parent_slice with resource limits
        todo!()
    }

    fn destroy_scope(&self, handle: &CgroupHandle) -> Result<(), CgroupError> {
        // Kill all processes in scope via cgroup.kill, then remove
        todo!()
    }

    fn read_metrics(&self, path: &str) -> Result<CgroupMetrics, CgroupError> {
        // Read memory.current, cpu.stat, cgroup.procs from cgroup filesystem
        todo!()
    }

    fn is_scope_empty(&self, handle: &CgroupHandle) -> Result<bool, CgroupError> {
        // Check if cgroup.procs is empty
        todo!()
    }
}
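
A backend's job is largely translating limits into the strings written to cgroup v2 interface files. The sketch below illustrates that mapping for memory.max and cpu.max; note the ResourceLimits fields and helper names here are hypothetical stand-ins, not the crate's actual API:

```rust
/// Hypothetical mirror of the limits a backend might receive.
/// Field names are illustrative; consult the crate docs for the real struct.
struct ResourceLimits {
    memory_max_bytes: Option<u64>, // None => unlimited ("max")
    cpu_quota_usec: Option<u64>,   // None => no quota
    cpu_period_usec: u64,
}

/// Value written to the cgroup v2 `memory.max` file: bytes, or "max".
fn memory_max_value(l: &ResourceLimits) -> String {
    match l.memory_max_bytes {
        Some(bytes) => bytes.to_string(),
        None => "max".to_string(),
    }
}

/// Value written to the cgroup v2 `cpu.max` file: "<quota> <period>" or "max <period>".
fn cpu_max_value(l: &ResourceLimits) -> String {
    match l.cpu_quota_usec {
        Some(quota) => format!("{} {}", quota, l.cpu_period_usec),
        None => format!("max {}", l.cpu_period_usec),
    }
}

fn main() {
    let limits = ResourceLimits {
        memory_max_bytes: Some(2 * 1024 * 1024 * 1024),
        cpu_quota_usec: Some(200_000),
        cpu_period_usec: 100_000,
    };
    // A real backend would write these strings into
    // <cgroupfs>/<parent_slice>/<name>/{memory.max,cpu.max}.
    println!("{}", memory_max_value(&limits)); // 2147483648
    println!("{}", cpu_max_value(&limits));    // 200000 100000
}
```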

2. Query Slice Ownership

use hpc_node::{SliceOwner, cgroup::slice_owner, cgroup::slices};

// Determine who owns a cgroup path
assert_eq!(slice_owner(slices::PACT_GPU), Some(SliceOwner::Pact));
assert_eq!(slice_owner(slices::WORKLOAD_ROOT), Some(SliceOwner::Workload));
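
Conceptually, the ownership check is a prefix match against the two well-known slice roots. The self-contained sketch below re-implements that idea for illustration; the exact paths and matching rule are assumptions about the convention, not the crate's actual code:

```rust
#[derive(Debug, PartialEq)]
enum SliceOwner {
    Pact,     // system services
    Workload, // allocations
}

/// Hypothetical re-implementation of `slice_owner`: map a cgroup path to its
/// owning system by slice prefix. Paths under neither hierarchy return None.
fn slice_owner(path: &str) -> Option<SliceOwner> {
    if path == "pact.slice" || path.starts_with("pact.slice/") {
        Some(SliceOwner::Pact)
    } else if path == "workload.slice" || path.starts_with("workload.slice/") {
        Some(SliceOwner::Workload)
    } else {
        None
    }
}

fn main() {
    assert_eq!(slice_owner("pact.slice/gpu"), Some(SliceOwner::Pact));
    assert_eq!(slice_owner("workload.slice"), Some(SliceOwner::Workload));
    assert_eq!(slice_owner("user.slice/session-1"), None);
    println!("ownership checks passed");
}
```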

3. Namespace Handoff

use hpc_node::{NamespaceRequest, NamespaceType, namespace::HANDOFF_SOCKET_PATH};

// Lattice requests namespaces from Pact via a Unix socket
let request = NamespaceRequest {
    allocation_id: "alloc-42".to_string(),
    namespaces: vec![NamespaceType::Pid, NamespaceType::Net, NamespaceType::Mount],
    uenv_image: Some("pytorch-2.5.sqfs".to_string()),
};
// Send request over HANDOFF_SOCKET_PATH, receive FDs via SCM_RIGHTS
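
The crate defines the request type and socket path but not (in this README) the on-wire encoding, so the framing below (4-byte big-endian length prefix plus a UTF-8 payload) is purely an assumed sketch of how a request might travel over the stream:

```rust
use std::io::{self, Read, Write};

/// Write one length-prefixed frame (hypothetical framing: u32 BE length + payload).
fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    w.write_all(&(payload.len() as u32).to_be_bytes())?;
    w.write_all(payload)
}

/// Read one length-prefixed frame.
fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 4];
    r.read_exact(&mut len)?;
    let mut buf = vec![0u8; u32::from_be_bytes(len) as usize];
    r.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() -> io::Result<()> {
    // Round-trip through an in-memory buffer; over HANDOFF_SOCKET_PATH this
    // would be a UnixStream. The reply's namespace FDs arrive as SCM_RIGHTS
    // ancillary data, which std does not expose on stable Rust; implementations
    // typically reach for a crate such as nix for sendmsg/recvmsg.
    let mut wire = Vec::new();
    write_frame(&mut wire, br#"{"allocation_id":"alloc-42"}"#)?;
    let decoded = read_frame(&mut io::Cursor::new(wire))?;
    assert_eq!(decoded, br#"{"allocation_id":"alloc-42"}"#);
    println!("frame round-trip ok");
    Ok(())
}
```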

4. Mount Refcounting

use hpc_node::MountManager;

// Implementer provides refcounted mount management
// acquire_mount() → increments refcount (mounts if first)
// release_mount() → decrements refcount (starts hold timer at zero)
// force_unmount() → immediate unmount (emergency mode only)
// reconstruct_state() → rebuild refcounts from /proc/mounts after crash
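
The acquire/release contract above can be shown with a toy refcount table. This is only an illustration of the lifecycle (names invented here); a real MountManager impl would also perform mount(2)/umount2(2) with MNT_DETACH and rebuild this map from /proc/mounts after a crash:

```rust
use std::collections::HashMap;

/// Toy refcount table illustrating the acquire/release lifecycle.
struct MountTable {
    refs: HashMap<String, usize>,
}

impl MountTable {
    fn new() -> Self {
        Self { refs: HashMap::new() }
    }

    /// Increment the refcount; a 0 -> 1 transition is where the mount happens.
    /// Returns true when the caller should actually mount the image.
    fn acquire(&mut self, image: &str) -> bool {
        let count = self.refs.entry(image.to_string()).or_insert(0);
        *count += 1;
        *count == 1
    }

    /// Decrement the refcount; a 1 -> 0 transition starts the hold timer,
    /// after which the mount would be lazily unmounted (MNT_DETACH).
    /// Returns true when the hold timer should start.
    fn release(&mut self, image: &str) -> bool {
        match self.refs.get_mut(image) {
            Some(count) if *count > 0 => {
                *count -= 1;
                *count == 0
            }
            _ => false,
        }
    }
}

fn main() {
    let mut table = MountTable::new();
    assert!(table.acquire("pytorch-2.5.sqfs"));  // first user: mount now
    assert!(!table.acquire("pytorch-2.5.sqfs")); // second user: already mounted
    assert!(!table.release("pytorch-2.5.sqfs")); // one user remains
    assert!(table.release("pytorch-2.5.sqfs"));  // last user: start hold timer
    println!("refcount lifecycle ok");
}
```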

Architecture

What's Provided (Shared Contract)

| Component | Description |
| --- | --- |
| CgroupManager trait | Hierarchy creation, scope lifecycle, metrics reading |
| SliceOwner enum | Ownership model: Pact (system services) vs Workload (allocations) |
| slices constants | Well-known cgroup paths for a consistent hierarchy |
| NamespaceProvider / NamespaceConsumer traits | FD handoff protocol |
| MountManager trait | Refcounted mount lifecycle with lazy unmount |
| ReadinessGate trait | Boot readiness signaling |
| Well-known paths | Socket paths, mount base directories |

What You Provide (Application-Specific)

| Component | Description |
| --- | --- |
| CgroupManager impl | Your cgroup v2 filesystem operations |
| NamespaceProvider impl | Your unshare(2) + FD management |
| MountManager impl | Your mount(2) + refcount tracking |
| ReadinessGate impl | Your boot sequence completion signal |

Design Principles

  • Traits and types only — no Linux-specific code, no implementations
  • No runtime coupling — Pact and Lattice have no runtime dependency on each other
  • Convention over configuration — well-known paths prevent drift
  • Both systems work independently — Lattice creates its own hierarchy when Pact is absent