Crate infernum_arbiter

§Arbiter - Unified GPU Coordination

“The judge allocates resources justly”

Arbiter coordinates GPU resources between Infernum (LLM inference) and Dantalion (diffusion/image generation), enabling simultaneous multimodal workloads on a single GPU.

§Core Principles

  1. Quality-Aware Scheduling: Both systems can run at reduced quality while sharing a GPU, and quality improves as resources become available.

  2. Priority-Based Arbitration: User-facing workloads take priority, and background improvement work yields when needed.

  3. Unified Fragment Cache: HoloTensor fragments are cached across both systems, avoiding redundant loading.
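The first two principles can be illustrated with a small, self-contained sketch. It is not the crate's scheduling policy: `SketchPriority`, `SketchWorkload`, and `split_quality` are hypothetical names, and the "shared budget in percentage points" model is an assumption made only for illustration. Each workload receives its quality floor first, and the spare budget goes to the higher-priority workload before the other.

```rust
/// Hypothetical stand-in for the crate's `Priority`; only the two variants
/// that appear in the crate example are modeled. Declaration order gives
/// `Background < UserFacing` through the derived `Ord`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum SketchPriority {
    Background,
    UserFacing,
}

/// A workload competing for a shared quality budget (hypothetical type).
#[derive(Clone, Copy, Debug)]
struct SketchWorkload {
    priority: SketchPriority,
    min_quality: u32, // quality floor, in percent
    max_quality: u32, // quality ceiling, in percent
}

/// Split `budget` percentage points between two workloads.
/// Returns `None` when the budget cannot cover both floors.
fn split_quality(budget: u32, a: SketchWorkload, b: SketchWorkload) -> Option<(u32, u32)> {
    let floors = a.min_quality + b.min_quality;
    if budget < floors {
        return None;
    }
    let mut spare = budget - floors;
    let (mut qa, mut qb) = (a.min_quality, b.min_quality);

    // The higher-priority workload is topped up first; ties favor `a`.
    if a.priority >= b.priority {
        let give = spare.min(a.max_quality - a.min_quality);
        qa += give;
        spare -= give;
        qb += spare.min(b.max_quality - b.min_quality);
    } else {
        let give = spare.min(b.max_quality - b.min_quality);
        qb += give;
        spare -= give;
        qa += spare.min(a.max_quality - a.min_quality);
    }
    Some((qa, qb))
}
```

With a user-facing workload bounded to 40-100% and a background workload bounded to 30-100%, a budget of 110 points yields `Some((80, 30))` under this toy policy. The real arbiter may weigh memory pressure and other signals differently.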

§Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         ARBITER                                 │
│  Monitors GPU memory, coordinates quality targets, routes work  │
└──────────────────────┬──────────────────────────────────────────┘
                       │
         ┌─────────────┴─────────────┐
         │                           │
         ▼                           ▼
┌─────────────────────┐     ┌─────────────────────┐
│     INFERNUM        │     │     DANTALION       │
│  (LLM Inference)    │     │  (Diffusion)        │
│                     │     │                     │
│  Quality: 40-100%   │     │  Quality: 30-100%   │
│  via HoloTensor     │     │  via ProgressiveLoad│
└─────────────────────┘     └─────────────────────┘
         │                           │
         └─────────────┬─────────────┘
                       │
                       ▼
         ┌─────────────────────────────┐
         │    UNIFIED FRAGMENT CACHE   │
         │  VRAM ← RAM ← NVMe ← CDN    │
         └─────────────────────────────┘
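The tier chain at the bottom of the diagram (VRAM ← RAM ← NVMe ← CDN) suggests a lookup that walks from the fastest tier to the slowest and pulls fragments upward on a hit. The sketch below illustrates that idea only. `SketchTier` and `SketchCache` are hypothetical names rather than the crate's `CacheTier` or `FragmentCache`, every tier is simulated with an in-memory map, and the promote-to-all-faster-tiers rule is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical tiers, ordered fastest to slowest as in the diagram.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SketchTier {
    Vram,
    Ram,
    Nvme,
    Cdn,
}

const TIERS: [SketchTier; 4] = [
    SketchTier::Vram,
    SketchTier::Ram,
    SketchTier::Nvme,
    SketchTier::Cdn,
];

/// Toy fragment cache: one in-memory map per tier, index 0 = VRAM.
struct SketchCache {
    tiers: [HashMap<String, Vec<u8>>; 4],
}

impl SketchCache {
    fn new() -> Self {
        Self { tiers: Default::default() }
    }

    /// Place a fragment directly into one tier.
    fn insert(&mut self, tier: SketchTier, id: &str, bytes: Vec<u8>) {
        self.tiers[tier as usize].insert(id.to_string(), bytes);
    }

    /// Look up a fragment, fastest tier first. On a hit, copy the fragment
    /// into every faster tier and report the tier where it was found.
    fn fetch(&mut self, id: &str) -> Option<(SketchTier, Vec<u8>)> {
        let hit = (0..TIERS.len()).find(|&i| self.tiers[i].contains_key(id))?;
        let bytes = self.tiers[hit].get(id)?.clone();
        for faster in 0..hit {
            self.tiers[faster].insert(id.to_string(), bytes.clone());
        }
        Some((TIERS[hit], bytes))
    }
}
```

A fragment first found in the NVMe tier is served from VRAM on the next `fetch`, which is the sense in which a shared cache avoids redundant loading. A real cache would also need eviction and per-tier capacity limits, which this sketch omits.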

§Example

use infernum_arbiter::{Arbiter, ArbiterConfig, WorkloadType, Priority};

let arbiter = Arbiter::new(ArbiterConfig::auto_detect())?;

// Request LLM inference at high priority
let llm_allocation = arbiter.request_allocation(
    WorkloadType::LlmInference,
    Priority::UserFacing,
).await?;

// Request image generation in the background; with both workloads active,
// the LLM gets 70% quality and Dantalion drops to 40%
let diffusion_allocation = arbiter.request_allocation(
    WorkloadType::ImageGeneration,
    Priority::Background,
).await?;

Re-exports§

pub use allocation::Allocation;
pub use allocation::AllocationRequest;
pub use allocation::AllocationResult;
pub use cache::CacheConfig;
pub use cache::CacheStats;
pub use cache::CacheTier;
pub use cache::FragmentCache;
pub use coordinator::Coordinator;
pub use coordinator::CoordinatorConfig;
pub use gpu::DetectionMethod;
pub use gpu::GpuDetectionResult;
pub use gpu::GpuDetector;
pub use gpu::GpuInfo;
pub use gpu::GpuVendor;
pub use memory::GpuMemoryTracker;
pub use memory::MemoryPressure;
pub use memory::MemoryStats;
pub use priority::Priority;
pub use priority::WorkloadType;
pub use quality::QualityAllocation;
pub use quality::QualityBudget;
pub use quality::QualityPolicy;

Modules§

allocation
Allocation types and requests.
cache
Fragment cache for HoloTensor weights.
coordinator
Coordinator for quality targets between workloads.
gpu
GPU detection and information gathering.
memory
GPU memory tracking and pressure monitoring.
priority
Priority and workload type definitions.
quality
Quality budget and allocation for workloads.

Structs§

Arbiter
The main GPU arbiter coordinating Infernum and Dantalion.
ArbiterConfig
Configuration for the Arbiter.
ArbiterState
Current state of the Arbiter.
ArbiterStats
Statistics for the Arbiter.

Enums§

ArbiterError
Errors from Arbiter operations.

Type Aliases§

Result
Result type for Arbiter operations.