Arbiter - Unified GPU Coordination
"The judge allocates resources justly"
Arbiter coordinates GPU resources between Infernum (LLM inference) and Dantalion (diffusion/image generation), enabling simultaneous multimodal workloads on a single GPU.
Core Principles
- Quality-Aware Scheduling: Both systems can run at reduced quality when sharing the GPU, with quality improving as resources become available.
- Priority-Based Arbitration: User-facing workloads get priority; background improvement work yields when needed.
- Unified Fragment Cache: HoloTensor fragments are cached across both systems, avoiding redundant loading.
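The first two principles can be illustrated with a small standalone sketch. It is not Arbiter's actual scheduler: `Request`, `split_quality`, and the cost model (one unit of GPU budget buys one quality point) are assumptions made for illustration, and only the `Priority` variant names are taken from this document. Every workload first receives its minimum quality, then spare budget goes to the highest-priority workload first.

```rust
/// Hypothetical priority levels; variants are declared from lowest to
/// highest so the derived `Ord` ranks `UserFacing` above `Background`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Priority {
    Background,
    UserFacing,
}

/// A workload's request: its priority and the quality range (in percent)
/// it is able to run at.
#[derive(Debug, Clone, Copy)]
struct Request {
    priority: Priority,
    min_quality: u32,
    max_quality: u32,
}

/// Splits `budget` among `requests` and returns the granted quality for
/// each request, in input order. Returns `None` when the budget cannot
/// cover every request's minimum quality.
///
/// Assumed cost model: one budget unit buys one quality point.
fn split_quality(budget: u32, requests: &[Request]) -> Option<Vec<u32>> {
    // Every workload must at least reach its minimum quality.
    let floor: u32 = requests.iter().map(|r| r.min_quality).sum();
    if floor > budget {
        return None;
    }
    let mut spare = budget - floor;
    let mut granted: Vec<u32> = requests.iter().map(|r| r.min_quality).collect();

    // Visit requests from highest to lowest priority (stable sort, so
    // equal priorities keep their input order).
    let mut order: Vec<usize> = (0..requests.len()).collect();
    order.sort_by(|&a, &b| requests[b].priority.cmp(&requests[a].priority));

    // Hand out the spare budget, capped by each request's headroom.
    for i in order {
        let headroom = requests[i].max_quality.saturating_sub(requests[i].min_quality);
        let extra = headroom.min(spare);
        granted[i] += extra;
        spare -= extra;
    }
    Some(granted)
}
```

Under these assumptions, a user-facing request with a 40-100% range and a background request with a 30-100% range sharing a budget of 110 would be granted 80 and 30 respectively. Calling `split_quality` again with a larger budget raises both grants, which is the "quality improving as resources become available" behaviour in miniature.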
Architecture
  ┌─────────────────────────────────────────────────────────────────┐
  │                             ARBITER                             │
  │  Monitors GPU memory, coordinates quality targets, routes work  │
  └──────────────────────┬──────────────────────────────────────────┘
                         │
           ┌─────────────┴─────────────┐
           │                           │
           ▼                           ▼
┌─────────────────────┐     ┌─────────────────────┐
│      INFERNUM       │     │      DANTALION      │
│   (LLM Inference)   │     │     (Diffusion)     │
│                     │     │                     │
│  Quality: 40-100%   │     │  Quality: 30-100%   │
│  via HoloTensor     │     │  via ProgressiveLoad│
└─────────────────────┘     └─────────────────────┘
           │                           │
           └─────────────┬─────────────┘
                         │
                         ▼
          ┌─────────────────────────────┐
          │   UNIFIED FRAGMENT CACHE    │
          │   VRAM ← RAM ← NVMe ← CDN   │
          └─────────────────────────────┘
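The `VRAM ← RAM ← NVMe ← CDN` chain in the diagram reads as a tiered lookup: try the fastest tier first, fall back to slower ones, and pull the fragment forward on a hit. The sketch below shows that idea only. `Tier`, `FragmentCache`, `insert`, and `fetch` are hypothetical names, not Arbiter's API; each tier is modelled as an in-memory map (a real CDN tier would be a network fetch), and eviction is left out.

```rust
use std::collections::HashMap;

/// Hypothetical storage tiers, ordered from fastest to slowest.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Vram,
    Ram,
    Nvme,
    Cdn,
}

const TIERS: [Tier; 4] = [Tier::Vram, Tier::Ram, Tier::Nvme, Tier::Cdn];

/// A single cache shared by all workloads, keyed by fragment id.
struct FragmentCache {
    // One map per tier, indexed in the same order as `TIERS`.
    tiers: [HashMap<String, Vec<u8>>; 4],
}

impl FragmentCache {
    fn new() -> Self {
        Self { tiers: Default::default() }
    }

    /// Stores a fragment in the given tier.
    fn insert(&mut self, tier: Tier, id: &str, bytes: Vec<u8>) {
        self.tiers[tier as usize].insert(id.to_string(), bytes);
    }

    /// Looks a fragment up from the fastest tier to the slowest. On a hit,
    /// the fragment is copied into every faster tier, and the tier that
    /// served the request is returned along with the bytes.
    fn fetch(&mut self, id: &str) -> Option<(Tier, Vec<u8>)> {
        let hit = (0..TIERS.len()).find(|&i| self.tiers[i].contains_key(id))?;
        let bytes = self.tiers[hit][id].clone();
        // Promote into the faster tiers so the next lookup is served there.
        for faster in 0..hit {
            self.tiers[faster].insert(id.to_string(), bytes.clone());
        }
        Some((TIERS[hit], bytes))
    }
}
```

Because both systems would go through the same cache instance, a fragment that one of them has already pulled into VRAM is served from VRAM when the other asks for it, which is the redundant loading the unified cache is meant to avoid.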
Example
use arbiter::{Arbiter, ArbiterConfig, WorkloadType, Priority};

let arbiter = Arbiter::new(ArbiterConfig::auto_detect())?;

// Request LLM inference at high priority
let llm_allocation = arbiter.request_allocation(
    WorkloadType::LlmInference,
    Priority::UserFacing,
).await?;

// Request image generation at background priority.
// With both workloads sharing the GPU, the LLM gets 70% quality and Dantalion drops to 40%.
let diffusion_allocation = arbiter.request_allocation(
    WorkloadType::ImageGeneration,
    Priority::Background,
).await?;