# Halldyll Deploy Pods
A declarative, idempotent, and reconcilable deployment system for RunPod GPU pods.
Think of it as Terraform/Kubernetes for RunPod — define your GPU infrastructure as code, and let Halldyll handle the rest.
## Features
- Declarative — Define your infrastructure in a simple YAML file
- Idempotent — Run `apply` multiple times, get the same result
- Drift Detection — Automatically detect and fix configuration drift
- Reconciliation Loop — Continuously converge to desired state
- State Management — Track deployments locally or on S3
- Multi-environment — Support for dev, staging, prod environments
- Guardrails — Cost limits, GPU limits, TTL auto-stop
- Auto Model Download — Automatically download HuggingFace models on pod startup
- Inference Engines — Auto-start vLLM, TGI, or Ollama with your models
## Installation
### From Crates.io
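Assuming the crate is published under the name `halldyll`:

```bash
cargo install halldyll
```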
### From Source
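A sketch, assuming the repository lives under the author's GitHub handle (substitute the actual URL):

```bash
# Hypothetical repository URL; substitute the real one.
git clone https://github.com/Mr-soloDev/halldyll.git
cd halldyll
cargo install --path .
```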
## Quick Start
### 1. Initialize a new project
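For example, using the optional path argument from the Commands table below:

```bash
halldyll init my-ml-stack
```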
### 2. Configure your deployment

Edit `halldyll.deploy.yaml`:
```yaml
project:
  name: "my-ml-stack"
  environment: "prod"
  cloud_type: SECURE

state:
  backend: local

pods:
  - name: "inference"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "vllm/vllm-openai:latest"
      env:
        MODEL_NAME: "meta-llama/Llama-3-8B"
    ports:
      - "8000/http"
    volumes:
      - name: "hf-cache"
        mount: "/root/.cache/huggingface"
        persistent: true
```
### 3. Set your RunPod API key
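The key is read from the `RUNPOD_API_KEY` environment variable (see Environment Variables below):

```bash
export RUNPOD_API_KEY="your-runpod-api-key"
# Needed only for gated HuggingFace models such as Llama:
export HF_TOKEN="your-hf-token"
```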
### 4. Deploy!
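Preview the changes, then apply:

```bash
halldyll plan     # dry-run: show what would change
halldyll apply    # create or update pods to match the config
halldyll status   # confirm the deployment converged
```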
## Commands
| Command | Description |
|---|---|
| `halldyll init [path]` | Initialize a new project |
| `halldyll validate` | Validate configuration file |
| `halldyll plan` | Show deployment plan (dry-run) |
| `halldyll apply` | Apply the deployment plan |
| `halldyll status` | Show current deployment status |
| `halldyll reconcile` | Auto-fix drift from desired state |
| `halldyll drift` | Detect configuration drift |
| `halldyll destroy` | Destroy all deployed resources |
| `halldyll logs <pod>` | View pod logs |
| `halldyll state` | Manage deployment state |
## Configuration Reference

### Project Configuration
```yaml
project:
  name: "my-project"       # Required: unique project name
  environment: "dev"       # Optional: dev, staging, prod (default: dev)
  region: "EU"             # Optional: EU, US, etc.
  cloud_type: SECURE       # Optional: SECURE or COMMUNITY
  compute_type: GPU        # Optional: GPU or CPU
```
### State Backend
```yaml
state:
  backend: local           # local or s3
  # For S3:
  bucket: "my-state-bucket"
  prefix: "halldyll/my-project"
  region: "us-east-1"
```
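With the `s3` backend, AWS credentials are read from the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables (see the Environment Variables table below), so no secrets need to appear in the config file.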
### Pod Configuration
```yaml
pods:
  - name: "my-pod"
    gpu:
      type: "NVIDIA A40"   # GPU type
      count: 1             # Number of GPUs
      min_vram_gb: 40      # Optional: minimum VRAM
      fallback:            # Optional: fallback GPU types
        - "NVIDIA L40S"
        - "NVIDIA RTX A6000"
    ports:
      - "22/tcp"           # SSH
      - "8000/http"        # HTTP endpoint
    volumes:
      - name: "data"
        mount: "/data"
        persistent: true
        size_gb: 100
    runtime:
      image: "runpod/pytorch:2.1.0-py3.10-cuda11.8.0"
      env:
        MY_VAR: "value"
    health_check:
      endpoint: "/health"
      port: 8000
      interval_secs: 30
      timeout_secs: 5
```
### Model Configuration (Auto-download and Start)
```yaml
pods:
  - name: "llm-server"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "vllm/vllm-openai:latest"
    ports:
      - "8000/http"
    # Models are automatically downloaded and engines started
    models:
      - id: "llama-3-8b"
        provider: huggingface    # huggingface, bundle, or custom
        repo: "meta-llama/Meta-Llama-3-8B-Instruct"
        load:
          engine: vllm           # vllm, tgi, ollama, or transformers
          quant: awq             # Optional: awq, gptq, fp8
          max_seq_len: 8192      # Optional: max sequence length
          options:               # Optional: engine-specific options
            tensor-parallel-size: 1
```
### Supported Inference Engines
| Engine | Description | Auto-Start | Use Case |
|---|---|---|---|
| `vllm` | High-performance LLM serving | Yes | Production LLM APIs, OpenAI-compatible |
| `tgi` | HuggingFace Text Generation Inference | Yes | HuggingFace models, streaming |
| `ollama` | Easy-to-use LLM runner | Yes | Local development, quick testing |
| `transformers` | HuggingFace Transformers library | No | Custom scripts, fine-tuning |
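Since vLLM's auto-start mode serves an OpenAI-compatible API (per the table above), a deployed pod can be queried over plain HTTP. A sketch, assuming `$POD_URL` points at the pod's exposed `8000/http` endpoint and that the engine registers the model under the config's `id` (it may instead use the full repo name):

```bash
curl "$POD_URL/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3-8b", "prompt": "Hello", "max_tokens": 32}'
```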
### Multi-Model Deployment Example
Deploy different models on different pods:
```yaml
pods:
  # LLM API Server
  - name: "llm-api"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "vllm/vllm-openai:latest"
    ports:
      - "8000/http"
    models:
      - id: "llama-3-8b"
        provider: huggingface
        repo: "meta-llama/Meta-Llama-3-8B-Instruct"
        load:
          engine: vllm
          max_seq_len: 8192

  # Embedding Server
  - name: "embeddings"
    gpu:
      type: "NVIDIA RTX 4090"
      count: 1
    runtime:
      image: "ghcr.io/huggingface/text-embeddings-inference:latest"
    ports:
      - "8080/http"
    models:
      - id: "bge-large"
        provider: huggingface
        repo: "BAAI/bge-large-en-v1.5"
        load:
          engine: tgi

  # Vision Model
  - name: "vision-api"
    gpu:
      type: "NVIDIA A40"
      count: 1
    runtime:
      image: "ghcr.io/huggingface/text-generation-inference:latest"
    ports:
      - "8000/http"
    models:
      - id: "llava"
        provider: huggingface
        repo: "llava-hf/llava-v1.6-mistral-7b-hf"
        load:
          engine: tgi
```
### Quantization Options
Reduce memory usage with quantization:
```yaml
models:
  - id: "llama-70b-awq"
    provider: huggingface
    repo: "TheBloke/Llama-2-70B-Chat-AWQ"
    load:
      engine: vllm
      quant: awq          # 4-bit AWQ quantization
      max_seq_len: 4096
```
| Quant Method | Memory Reduction | Quality | Speed |
|---|---|---|---|
| `awq` | ~75% | High | Fast |
| `gptq` | ~75% | High | Medium |
| `fp8` | ~50% | Very High | Fast |
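As a rough sizing check: a 70B-parameter model at fp16 needs about 140 GB for weights alone (2 bytes per parameter), so the ~75% reduction from 4-bit AWQ or GPTQ brings that to roughly 35 GB, which fits on a single 48 GB A40 with headroom for the KV cache.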
## Guardrails (Optional)
```yaml
guardrails:
  max_hourly_cost: 10.0      # Maximum hourly cost in USD
  max_gpus: 4                # Maximum total GPUs
  ttl_hours: 24              # Auto-stop after N hours
  allow_gpu_fallback: false  # Allow fallback to other GPU types
```
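Taken together, the example above bounds worst-case spend at 10.0 USD/hour × 24 hours = 240 USD before the TTL auto-stop kicks in.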
## Architecture
```text
┌─────────────────────────────────────────────────────────┐
│                  halldyll.deploy.yaml                   │
│                     (Desired State)                     │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│                ConfigParser + Validator                 │
└───────────────────────┬─────────────────────────────────┘
                        │
         ┌──────────────┴──────────────┐
         ▼                             ▼
┌─────────────────┐           ┌─────────────────┐
│   StateStore    │           │   PodObserver   │
│  (Local or S3)  │           │  (RunPod API)   │
└────────┬────────┘           └────────┬────────┘
         │                             │
         └──────────────┬──────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│                       DiffEngine                        │
│              (Compare Desired vs Observed)              │
└───────────────────────┬─────────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│                       Reconciler                        │
│             (Execute Plan → Converge State)             │
└─────────────────────────────────────────────────────────┘
```
## Library Usage
You can also use Halldyll as a library in your Rust projects. A minimal sketch, assuming an API shaped like the components in the Architecture diagram (`Config`, `Reconciler`); the actual type and method names may differ, so check the crate documentation:

```rust
// Hypothetical API sketch; the real crate's types and methods may differ.
use halldyll::{Config, Reconciler};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load and validate the desired state from the YAML config.
    let config = Config::from_file("halldyll.deploy.yaml")?;

    // Diff desired vs. observed state, then converge.
    let reconciler = Reconciler::new(config);
    reconciler.reconcile().await?;

    Ok(())
}
```
## Environment Variables
| Variable | Description | Required |
|---|---|---|
| `RUNPOD_API_KEY` | Your RunPod API key | Yes |
| `HF_TOKEN` | HuggingFace API token (for gated models like Llama) | For gated models |
| `HALLDYLL_CONFIG` | Path to config file | No |
| `AWS_ACCESS_KEY_ID` | AWS credentials (for S3 state) | No |
| `AWS_SECRET_ACCESS_KEY` | AWS credentials (for S3 state) | No |
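For example, a shell setup for an S3-backed state store might look like this (all values are placeholders):

```bash
export RUNPOD_API_KEY="your-runpod-api-key"
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
export HALLDYLL_CONFIG="./halldyll.deploy.yaml"
```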
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Author
Geryan Roy (@Mr-soloDev)
- Email: geryan.roy@icloud.com
## Acknowledgments
- RunPod for the amazing GPU cloud platform
- Inspired by Terraform, Kubernetes, and other declarative infrastructure tools