# magikrun
**OCI-Compliant Container Runtime with Pod Runtime Interface (PRI)**
[](LICENSE)
[]()
[](https://crates.io/crates/magikrun)
[](https://docs.rs/magikrun)
`magikrun` is a container runtime abstraction layer that provides both OCI Runtime Spec compliance and a novel **Pod Runtime Interface (PRI)** for atomic pod operations across heterogeneous isolation backends.
## From CRI to PRI: Evolution of Container Interfaces
### The CRI Problem
Kubernetes introduced the **Container Runtime Interface (CRI)** to decouple from Docker. CRI treats pods as composite structures built step-by-step:
```
CRI Workflow:
─────────────────────────────────────────────────────────────────────
RunPodSandbox() → Creates pause container + namespaces
CreateContainer(A) → Prepares container A (not running)
StartContainer(A) → Runs container A
CreateContainer(B) → Prepares container B
StartContainer(B) → Runs container B
...wait for user/health checks...
StopContainer(B) → Graceful stop
StopContainer(A) → Graceful stop
RemoveContainer(B) → Cleanup container B
RemoveContainer(A) → Cleanup container A
StopPodSandbox() → Stop pause container
RemovePodSandbox() → Delete namespace holder
─────────────────────────────────────────────────────────────────────
11+ API calls, intermediate failure states possible
```
**Problems with CRI:**
1. **Partial failure states**: What if `StartContainer(B)` fails after `A` is running?
2. **Complex rollback logic**: Kubelet must track and undo partial progress
3. **Race conditions**: Namespace sharing during step-by-step creation
4. **Overhead**: Multiple RPC round-trips per pod
5. **Impedance mismatch**: Pods are the scheduling unit, but CRI operates on containers
### The PRI Solution
**Pod Runtime Interface (PRI)** treats pods as the atomic unit of deployment:
```
PRI Workflow:
─────────────────────────────────────────────────────────────────────
run_pod(spec) → RUNNING (all containers) or ERROR (nothing)
...pod lifecycle...
stop_pod(id, grace) → Graceful shutdown
delete_pod(id) → Complete cleanup
─────────────────────────────────────────────────────────────────────
3 operations, no intermediate states
```
**PRI Properties:**
- **Atomic**: Either all containers start or none do (automatic rollback)
- **Stateless**: No orphaned resources on failure
- **Immutable**: Replace pods, don't repair them
- **Self-healing friendly**: Failed pod = delete + reschedule (simple!)
### Why Now?
Several factors make PRI the right approach today:
1. **MicroVMs**: Hardware VM isolation (KVM, Hypervisor.framework) provides natural atomicity—VM boot is all-or-nothing
2. **WASM**: WebAssembly runtimes create isolated stores atomically
3. **GitOps/Immutable Infrastructure**: Pods are cattle, not pets—replace don't repair
4. **Edge/IoT**: Simpler semantics reduce resource overhead
5. **Security**: Fewer intermediate states = smaller attack surface
## Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ magikrun │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Pod Runtime Interface (PRI) │ │
│ │ run_pod(spec) → stop_pod(id) → delete_pod(id) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ NativePod │ │ MicroVmPod │ │ WasmPod │ │ │
│ │ │ Runtime │ │ Runtime │ │ Runtime │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ │
│ │ │ │ │ │ │
│ └──────────┼────────────────┼────────────────┼──────────────────┘ │
│ │ uses │ uses │ uses │
│ ▼ ▼ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ OCI Runtime (Single Container) │ │
│ │ create(id, bundle) → start → kill → delete │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌───────────────────────┐ │ │
│ │ │NativeRuntime │ │ KrunRuntime │ │ WasmtimeRuntime │ │ │
│ │ │ (youki) │ │ (libkrun) │ │ (wasmtime) │ │ │
│ │ └──────────────┘ └──────────────┘ └───────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────┼───────────────────────────────────┐ │
│ │ Image & Bundle Building │ │
│ │ OCI Registry → Layers → Rootfs + config.json │ │
│ │ Path traversal protection │ Size limits │ Whiteout handling │ │
│ └───────────────────────────┼───────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────┼───────────────────────────────────┐ │
│ │ Content-Addressed Storage │ │
│ │ SHA-256 verification │ Deduplication │ Atomic writes │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Infra Framework │ │
│ │ Infra + InfraExtension (workplane, mesh, service discovery) │ │
│ │ Runs inside infra-container (same code: native & MicroVM) │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
```
### Layer Relationships
| **PRI (pod module)** | OCI Runtime | Atomic pod lifecycle with rollback |
| **OCI Runtime (runtime module)** | Bundle Builder, Storage | Single container lifecycle |
| **Image & Bundle (image module)** | Storage, Registry | Pull images, build OCI bundles |
| **Storage** | Filesystem | Content-addressed blob storage |
| **Infra (infra module)** | — | Framework for infra-container binaries |
## Module Structure
| `image` | CRI ImageService pattern: pull images, build bundles |
| `runtime` | OCI RuntimeService: single-container lifecycle (create/start/kill/delete) |
| `pod` | **PRI**: Atomic pod lifecycle with `PodRuntime` trait |
| `infra` | Infra-container framework for extensions (workplane, mesh) |
## OCI Runtimes (Single Container)
| `NativeRuntime` | Linux | youki (libcontainer) | Standard Linux containers |
| `KrunRuntime` | Linux/macOS | libkrun | MicroVM-isolated containers |
| `WasmtimeRuntime` | All | wasmtime | WASM/WASI workloads |
| `WindowsRuntime` | Windows | (stub) | Windows container support |
These are the **building blocks** that PRI Pod Runtimes compose for atomic pod operations.
## Pod Runtimes (PRI)
| `NativePodRuntime` | Linux | youki + pause | Namespaces + cgroups v2 | Emulated (rollback) |
| `MicroVmPodRuntime` | Linux/macOS | libkrun | Hardware VM (KVM/HVF) | Natural (VM boot) |
| `WasmPodRuntime` | All | wasmtime | WASM sandbox | Natural (store) |
### Atomicity Explained
**Native Pods** (emulated atomicity):
```
1. Create pause container (holds namespaces)
2. Start infra-container (joins namespaces)
3. Start app containers (join namespaces)
4. If any step fails → rollback all previous steps
5. Return success only when ALL running
```
**MicroVM Pods** (natural atomicity):
```
1. Build composite rootfs with all container images
2. Bake pod spec into VM filesystem
3. Boot VM (vminit spawns containers)
4. VM either boots completely or fails → nothing to clean up
```
**WASM Pods** (natural atomicity):
```
1. Create wasmtime Store (isolated memory)
2. Load all modules
3. Link and instantiate
4. If any fails → store is dropped, no leaks
```
## The Infra-Container
Every pod has an **infra-container** that:
- Holds Linux namespaces (network, IPC, UTS) for other containers to join
- Runs workplane extensions (service discovery, Raft consensus, mesh)
- Provides the "pause" functionality but with active capabilities
### Symmetric Design
The same infra-container code runs in both native and MicroVM modes:
```
Native Mode: MicroVM Mode:
───────────────────────── ─────────────────────────────────────
┌─────────────────────────┐ ┌─────────────────────────────────────┐
│ Pod │ │ VM │
│ ┌───────────────────┐ │ │ vminit (PID 1) │
│ │ infra-container │ │ │ │ spawns containers │
│ │ (workplane) │ │ │ │ reaps zombies │
│ └───────────────────┘ │ │ │ forwards signals │
│ ┌───────────────────┐ │ │ │ │
│ │ app-container │ │ │ ├─► infra-container (SAME CODE) │
│ └───────────────────┘ │ │ └─► app-container │
└─────────────────────────┘ └─────────────────────────────────────┘
```
## Security Model
### Isolation Hierarchy (Defense-in-Depth)
```
┌───────────────────────────────────────────────────┐
│ MicroVM │ ← Hardware VM boundary
│ ┌─────────────────────────────────────────────┐ │
│ │ Native Container │ │ ← Kernel namespace boundary
│ │ ┌───────────────────────────────────────┐ │ │
│ │ │ WASM │ │ │ ← Language sandbox boundary
│ │ └───────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────┘
```
| `MicroVmPodRuntime` | Hardware VM | Minimal (VMM only) | Untrusted workloads |
| `NativePodRuntime` | Kernel namespaces | Kernel syscalls | Multi-tenant pods |
| `WasmPodRuntime` | WASM sandbox | WASI only | Portable plugins |
### Security Constants
| `MAX_LAYER_SIZE` | 512 MiB | Per-layer extraction limit |
| `MAX_ROOTFS_SIZE` | 4 GiB | Total rootfs size limit |
| `MAX_LAYERS` | 128 | Maximum layers per image |
| `MAX_CONTAINERS_PER_POD` | 16 | Pod container limit |
| `MAX_WASM_MODULE_SIZE` | 256 MiB | WASM module limit |
| `DEFAULT_WASM_FUEL` | 1B ops | WASM instruction limit |
| `MAX_VM_MEMORY_MIB` | 4,096 | MicroVM memory limit |
| `MAX_VCPUS` | 8 | MicroVM vCPU limit |
| `IMAGE_PULL_TIMEOUT` | 300s | Registry timeout |
## Usage
### PRI (Recommended)
```rust
use magikrun::pod::{PodRuntime, PodSpec, NativePodRuntime};
use std::time::Duration;
#[tokio::main]
async fn main() -> magikrun::Result<()> {
let runtime = NativePodRuntime::new()?;
// Parse K8s-compatible pod manifest
let spec = PodSpec::from_yaml(include_bytes!("pod.yaml"))?;
// Atomic: either fully running or error (nothing created)
let handle = runtime.run_pod(&spec).await?;
println!("Pod {} running", handle.id);
// Check status
let status = runtime.pod_status(&handle.id).await?;
println!("Phase: {:?}", status.phase);
// Graceful stop with 30s grace period
runtime.stop_pod(&handle.id, Duration::from_secs(30)).await?;
runtime.delete_pod(&handle.id, false).await?;
Ok(())
}
```
### OCI Runtime (Low-Level)
```rust
use magikrun::image::{ImageService, BundleBuilder, OciContainerConfig};
use magikrun::runtime::{NativeRuntime, OciRuntime};
#[tokio::main]
async fn main() -> magikrun::Result<()> {
// CRI ImageService pattern
let image_service = ImageService::new()?;
let bundle_builder = BundleBuilder::with_storage(image_service.storage().clone())?;
// Pull image
let image = image_service.pull("alpine:3.18").await?;
// Build OCI bundle
let bundle = bundle_builder.build_oci_bundle(&image, &OciContainerConfig {
name: "my-container".to_string(),
command: Some(vec!["/bin/sh".to_string()]),
..Default::default()
})?;
// OCI lifecycle
let runtime = NativeRuntime::new();
runtime.create("my-container", bundle.path()).await?;
runtime.start("my-container").await?;
// ... later ...
runtime.kill("my-container", magikrun::runtime::Signal::Term).await?;
runtime.delete("my-container").await?;
Ok(())
}
```
## Binaries
| `magikrun` | Host | OCI runtime CLI, image management |
| `vminit` | VM `/init` | Minimal init for MicroVMs (PID 1) |
### vminit
The `vminit` binary runs as PID 1 inside MicroVMs. It:
- Reads baked pod spec from `/pod/spec.json`
- Spawns containers using libcontainer
- Reaps zombie processes
- Forwards signals (SIGTERM, SIGINT) to containers
- Handles TSI (exec/logs) requests via vsock
```bash
# Build static binary for VMs
cargo build --release --bin vminit --target x86_64-unknown-linux-musl
```
## Platform Support
| Native containers | ✅ | ❌ | ❌ | `/proc/self/ns/*` |
| MicroVMs | ✅ (KVM) | ✅ (HVF) | ❌ | `/dev/kvm` or `kern.hv_support` |
| WASM | ✅ | ✅ | ✅ | Always available |
| cgroups v2 | ✅ | ❌ | ❌ | `/sys/fs/cgroup/cgroup.controllers` |
## Dependencies
### Core
- `tokio` (1.x) - Async runtime
- `async-trait` - Async trait definitions
- `serde` / `serde_json` - Serialization
### OCI
- `oci-spec` (0.8) - OCI image manifest types
- `oci-distribution` (0.11) - Registry client
### Runtimes
- `libcontainer` / `libcgroups` (0.4) - youki container runtime
- `wasmtime` / `wasmtime-wasi` (27) - WASM execution
- `krun-sys` (1.10) - libkrun FFI bindings
### Security
- `sha2` / `hex` - Content-addressed hashing
- `flate2` / `tar` - Bounded layer extraction
## Testing
```bash
# Library tests only (fast)
cargo test --lib
# All tests including integration
cargo test
# With verbose output
cargo test -- --nocapture
```
## License
Apache-2.0. See [LICENSE](LICENSE) for details.