embeddenator 0.20.0-alpha.1

# Self-Hosted CI Infrastructure Project Specification

**Project:** Embeddenator Self-Hosted CI/CD System  
**Version:** 1.0  
**Last Updated:** 2025-12-22  
**Status:** Active Development  
**Document Owner:** DevOps Engineering Team

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Project Overview](#project-overview)
3. [Current State](#current-state)
4. [System Architecture](#system-architecture)
5. [Requirements](#requirements)
6. [Runner Types and Configurations](#runner-types-and-configurations)
7. [Testing and Validation](#testing-and-validation)
8. [Future Roadmap](#future-roadmap)
9. [Appendix](#appendix)

---

## Executive Summary

The Embeddenator project requires self-hosted GitHub Actions runners for two capabilities that GitHub's standard hosted runner fleet does not provide: native ARM64 builds (alongside the existing x86_64 builds) and specialized hardware (GPU). This document specifies the design, implementation, and operational requirements for a robust, automated self-hosted CI infrastructure.

### Key Objectives

1. **Multi-Architecture Support**: Native ARM64 builds without emulation overhead
2. **GPU Acceleration**: Support for GPU-accelerated workloads in future development
3. **Cost Optimization**: Auto-scaling runners with idle timeout management
4. **Automation**: Fully automated runner lifecycle management
5. **Reliability**: High availability with automatic recovery and health monitoring

### Success Criteria

- ✅ ARM64 CI workflow completes successfully on self-hosted runners
- ✅ Runner automation system manages lifecycle without manual intervention
- ✅ Cost savings >50% compared to cloud-based alternatives
- ✅ Build time <15 minutes for full test suite on ARM64
- ✅ Zero manual runner management required during normal operation

---

## Project Overview

### Background

Embeddenator is a Rust-based holographic computing substrate that requires testing and building on multiple architectures:

- **AMD64 (x86_64)**: Primary architecture, fully supported by GitHub-hosted runners
- **ARM64 (aarch64)**: Required for multi-arch Docker images, not available in standard GitHub runners
- **GPU Support**: Future requirement for VSA acceleration research

The project previously attempted to use ARM64 GitHub-hosted runners, but these are not part of the standard offering. QEMU emulation on x86_64 runners proved too slow (5-10x slower than native execution) and too unreliable for CI/CD purposes.

### Solution Approach

Implement self-hosted runner infrastructure with:

1. **Automated Runner Management**: Python-based lifecycle automation (`runner_manager.py`)
2. **Multi-Runner Support**: Deploy multiple runners for parallel builds
3. **Cost Optimization**: Auto-deregistration after idle timeout
4. **Cross-Architecture Support**: Native execution on target architectures
5. **GPU Capability**: Infrastructure ready for GPU-accelerated workloads

---

## Current State

### Completed ✅

- **v0.1.0 Release**: Core VSA implementation with AMD64 CI
- **v0.2.0 Release**: Comprehensive test suite, clippy fixes
- **Runner Automation Framework**: Complete Python-based automation system
  - Auto-registration with short-lived tokens
  - Lifecycle management (register → run → monitor → deregister)
  - Multi-runner deployment support
  - Multi-architecture support (x64, ARM64, RISC-V)
  - QEMU emulation support for cross-architecture testing
  - GPU runner support with hardware detection
  - Auto-scaling based on workload (idle-timeout deregistration; queue-depth scaling is planned for Phase 3)
- **CI/CD Workflow Separation**: Three workflows (pre-checks, amd64, arm64)
- **Documentation**: Comprehensive README, workflow docs, runner automation guide

### In Progress 🚧

- **ARM64 Infrastructure Deployment**: Hardware/VM provisioning
- **Self-Hosted Runner Deployment**: Physical deployment pending
- **ARM64 CI Testing**: Validation workflow ready but untested

### Pending ⏳

- **ARM64 Auto-Trigger**: Enable automatic runs on main branch
- **GPU Runner Configuration**: Hardware acquisition and setup
- **Production Monitoring**: Observability and alerting setup

---

## System Architecture

### High-Level Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     GitHub Actions                          │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Pre-Checks   │  │ AMD64 Build  │  │ ARM64 Build  │       │
│  │ (hosted)     │  │ (hosted)     │  │ (self-hosted)│       │
│  └──────────────┘  └──────────────┘  └──────┬───────┘       │
│                                             │               │
└─────────────────────────────────────────────┼───────────────┘
                                              │
                                              │ HTTPS
                                              │
┌─────────────────────────────────────────────▼───────────────┐
│              Self-Hosted Infrastructure                     │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │           Runner Manager (runner_manager.py)          │  │
│  │  - Registration automation                            │  │
│  │  - Lifecycle management                               │  │
│  │  - Health monitoring                                  │  │
│  │  - Auto-scaling                                       │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ ARM64 Runner │  │ ARM64 Runner │  │ GPU Runner   │       │
│  │ #1 (4 core)  │  │ #2 (4 core)  │  │ (8 core+GPU) │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │           Shared Storage & Cache                      │  │
│  │  - Build artifacts                                    │  │
│  │  - Cargo registry cache                               │  │
│  │  - Docker layer cache                                 │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │           Monitoring & Logging                        │  │
│  │  - Runner health metrics                              │  │
│  │  - Build performance tracking                         │  │
│  │  - Resource utilization                               │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

### Component Architecture

#### 1. Runner Manager (`runner_manager.py`)

**Purpose**: Centralized automation for runner lifecycle management

**Modules**:
- `runner_automation/config.py`: Configuration management
- `runner_automation/github_api.py`: GitHub API client
- `runner_automation/installer.py`: Runner installation
- `runner_automation/runner.py`: Individual runner lifecycle
- `runner_automation/manager.py`: Multi-runner orchestration
- `runner_automation/emulation.py`: QEMU cross-architecture support
- `runner_automation/cli.py`: Command-line interface

**Key Features**:
- Token management (registration tokens, PAT validation)
- Multi-runner coordination
- Health checks and monitoring
- Automatic cleanup and recovery
- Cross-architecture emulation support
- GPU hardware detection and management
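
The per-runner lifecycle handled by `runner_automation/runner.py` (register → run → monitor → deregister) can be pictured as a small state machine. The following is a minimal sketch, not the project's actual code: the `Phase` enum, the `TRANSITIONS` table, and the `RunnerLifecycle` class are hypothetical names, and the extra transitions for restart and re-registration are assumptions drawn from the auto-recovery and re-registration behaviour described in this specification.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical lifecycle phases, named after the spec's lifecycle arrow."""
    REGISTER = "register"
    RUN = "run"
    MONITOR = "monitor"
    DEREGISTER = "deregister"


# Assumed transition table: the linear path from the spec, plus a restart edge
# (monitor -> run) and a re-registration edge (deregister -> register).
TRANSITIONS = {
    Phase.REGISTER: {Phase.RUN},
    Phase.RUN: {Phase.MONITOR},
    Phase.MONITOR: {Phase.RUN, Phase.DEREGISTER},
    Phase.DEREGISTER: {Phase.REGISTER},
}


class RunnerLifecycle:
    """Tracks one runner's phase and rejects transitions not in the table."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.phase = Phase.REGISTER
        self.history = [Phase.REGISTER]

    def advance(self, target: Phase) -> None:
        if target not in TRANSITIONS[self.phase]:
            raise ValueError(
                f"{self.name}: illegal transition "
                f"{self.phase.value} -> {target.value}"
            )
        self.phase = target
        self.history.append(target)


if __name__ == "__main__":
    runner = RunnerLifecycle("arm64-runner-1")  # illustrative runner name
    for step in (Phase.RUN, Phase.MONITOR, Phase.DEREGISTER):
        runner.advance(step)
    print(" -> ".join(p.value for p in runner.history))
```

Rejecting illegal transitions early makes it harder for a runner to end up, for example, running without ever having registered.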

---

## Requirements

### Functional Requirements

#### FR-1: Multi-Architecture Build Support
- **Priority**: P0 (Critical)
- **Description**: Support native builds on ARM64 architecture
- **Acceptance**: ARM64 CI workflow completes successfully with all tests passing

#### FR-2: Automated Runner Lifecycle
- **Priority**: P0 (Critical)
- **Description**: Automated registration, startup, monitoring, and deregistration
- **Acceptance**: Zero manual intervention required for normal operations

#### FR-3: Multi-Runner Deployment
- **Priority**: P1 (High)
- **Description**: Deploy and manage multiple runners simultaneously
- **Acceptance**: Support 2+ concurrent runners with independent lifecycles

#### FR-4: Auto-Scaling
- **Priority**: P1 (High)
- **Description**: Scale runners based on workload and idle timeout
- **Acceptance**: Runners auto-deregister after configurable idle period
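
One way to read FR-4 is as a pure function from observed workload to a target runner count. The sketch below is an illustration under stated assumptions: `desired_runner_count` and its parameters are hypothetical, and the default ceiling of two runners simply mirrors the two ARM64 runners shown in the architecture diagram.

```python
def desired_runner_count(
    queued_jobs: int,
    busy_runners: int,
    min_runners: int = 0,   # assumption: scale to zero when idle
    max_runners: int = 2,   # assumption: two ARM64 runners, as in the diagram
) -> int:
    """Return how many runners should be registered for the current workload."""
    if queued_jobs < 0 or busy_runners < 0:
        raise ValueError("job and runner counts must be non-negative")
    if min_runners > max_runners:
        raise ValueError("min_runners must not exceed max_runners")
    # One runner per queued job, plus the runners already executing jobs,
    # clamped to the configured floor and ceiling.
    wanted = queued_jobs + busy_runners
    return max(min_runners, min(max_runners, wanted))


if __name__ == "__main__":
    print(desired_runner_count(queued_jobs=3, busy_runners=1))
    print(desired_runner_count(queued_jobs=0, busy_runners=0))
```

Keeping the decision free of side effects means it can be unit-tested without touching GitHub or real hardware.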

#### FR-5: GPU Support
- **Priority**: P2 (Medium)
- **Description**: Support GPU-accelerated workloads
- **Acceptance**: GPU runner can be deployed and detected by workflows

### Non-Functional Requirements

#### NFR-1: Performance
- **Build Time**: Full test suite <15 minutes on ARM64
- **Startup Time**: Runner registration and startup <2 minutes
- **Network**: Round-trip latency to GitHub Actions over HTTPS <100ms

#### NFR-2: Reliability
- **Uptime**: 99.5% runner availability during business hours
- **Auto-Recovery**: Automatic restart on runner failure
- **Monitoring**: Health checks every 30 seconds
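
The monitoring and auto-recovery items in NFR-2 could be combined in a polling loop along these lines. This is a hedged sketch only: `monitor`, `is_healthy`, and `restart` are hypothetical names, and how a real health probe or restart would be implemented is left open.

```python
import time
from typing import Callable, Optional


def monitor(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    interval_s: float = 30.0,            # health-check period from NFR-2
    max_checks: Optional[int] = None,    # None means run until interrupted
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll the runner and restart it on failure; return the restart count."""
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if checks:
            sleep(interval_s)  # wait between checks, not before the first one
        if not is_healthy():
            restart()
            restarts += 1
        checks += 1
    return restarts


if __name__ == "__main__":
    # Simulated probe results: healthy, failed, healthy.
    probe_results = iter([True, False, True])
    count = monitor(
        is_healthy=lambda: next(probe_results),
        restart=lambda: print("restarting runner"),
        max_checks=3,
        sleep=lambda seconds: None,  # skip real waiting in the demo
    )
    print(f"restarts: {count}")
```

Injecting `sleep` and bounding the loop with `max_checks` keeps the sketch testable; a long-running service would leave `max_checks` as `None`.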

#### NFR-3: Security
- **Token Management**: Short-lived registration tokens (<1 hour)
- **Access Control**: Runners in private network with minimal exposure
- **Secrets**: No secrets stored in runner filesystem
- **Updates**: Automatic security updates for runner software
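
Because registration tokens are short-lived, the automation needs to know when a cached token is too close to expiry to reuse. The sketch below assumes the token response carries an ISO 8601 `expires_at` timestamp; the helper names `parse_expiry` and `needs_refresh` and the five-minute safety margin are illustrative assumptions, not part of the existing code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_expiry(expires_at: str) -> datetime:
    """Parse an ISO 8601 timestamp into a timezone-aware datetime."""
    # Python versions before 3.11 reject a trailing "Z", so normalise it.
    if expires_at.endswith("Z"):
        expires_at = expires_at[:-1] + "+00:00"
    parsed = datetime.fromisoformat(expires_at)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def needs_refresh(
    expires_at: str,
    now: Optional[datetime] = None,
    safety_margin: timedelta = timedelta(minutes=5),  # assumed margin
) -> bool:
    """Return True when the token expires within the safety margin."""
    now = now or datetime.now(timezone.utc)
    return parse_expiry(expires_at) - now <= safety_margin


if __name__ == "__main__":
    reference = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(needs_refresh("2025-01-01T12:04:00Z", now=reference))
    print(needs_refresh("2025-01-01T13:00:00Z", now=reference))
```

Requesting a fresh token slightly before expiry avoids a registration attempt failing halfway through runner setup.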

---

## Runner Types and Configurations

### Standard ARM64 Runner

**Specifications**:
- **Architecture**: ARM64 (aarch64)
- **CPU**: 4 cores (minimum), 8 cores (recommended)
- **Memory**: 8GB (minimum), 16GB (recommended)
- **Storage**: 100GB SSD (minimum), 200GB (recommended)
- **Network**: 100Mbps (minimum)
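
The minimum and recommended figures above lend themselves to an automated pre-deployment check. The following sketch encodes them as data; `HostSpec` and `classify` are hypothetical names, and because the specification gives no recommended network speed, the recommended tier reuses the 100Mbps minimum as an assumption.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class HostSpec:
    cpu_cores: int
    memory_gb: float
    storage_gb: float
    network_mbps: float


# Values taken from the specification list above.
MINIMUM = HostSpec(cpu_cores=4, memory_gb=8, storage_gb=100, network_mbps=100)
# Assumption: no recommended network figure is given, so the minimum is reused.
RECOMMENDED = HostSpec(cpu_cores=8, memory_gb=16, storage_gb=200, network_mbps=100)


def meets(host: HostSpec, target: HostSpec) -> bool:
    """True when every resource of the host is at least the target value."""
    return all(have >= need for have, need in zip(astuple(host), astuple(target)))


def classify(host: HostSpec) -> str:
    """Classify a host as 'recommended', 'minimum', or 'below-minimum'."""
    if meets(host, RECOMMENDED):
        return "recommended"
    if meets(host, MINIMUM):
        return "minimum"
    return "below-minimum"


if __name__ == "__main__":
    candidate = HostSpec(cpu_cores=4, memory_gb=16, storage_gb=120, network_mbps=1000)
    print(classify(candidate))
```

A host only reaches a tier when every resource meets it, so a machine with ample memory but four cores still counts as a minimum-tier runner.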

**Labels**: `["self-hosted", "linux", "ARM64"]`
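
A job is routed to a runner only when the runner carries every label the job requests. The sketch below illustrates that subset rule; `runner_matches` is a hypothetical helper, and the case-insensitive comparison is an assumption about how labels are matched rather than something this specification states.

```python
from typing import Iterable


def runner_matches(runs_on: Iterable[str], runner_labels: Iterable[str]) -> bool:
    """Return True when the runner has every label the job asks for."""
    # Assumption: label comparison ignores case.
    wanted = {label.lower() for label in runs_on}
    available = {label.lower() for label in runner_labels}
    return wanted <= available


if __name__ == "__main__":
    arm64_runner = ["self-hosted", "linux", "ARM64"]
    print(runner_matches(["self-hosted", "ARM64"], arm64_runner))
    print(runner_matches(["self-hosted", "gpu"], arm64_runner))
```

Under this rule a workflow that requests only `self-hosted` could land on any runner, so ARM64 jobs should request the architecture label explicitly.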

**Use Cases**:
- ARM64 CI workflow builds
- ARM64 Docker image builds
- ARM64 integration tests

---

## Testing and Validation

### Deployment Validation

#### Phase 1: Manual Testing
1. Deploy single ARM64 runner
2. Manually trigger `ci-arm64.yml` workflow
3. Verify architecture detection (`uname -m` → `aarch64`)
4. Confirm all tests pass
5. Validate artifact generation

**Success Criteria**:
- ✅ Workflow completes in <15 minutes
- ✅ All tests pass
- ✅ Artifacts uploaded correctly
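
The architecture check in step 3 can also be scripted so that a misconfigured host fails fast. This is a minimal sketch, assuming Python is available on the runner: `ARCH_LABELS` and `runner_arch_label` are hypothetical names, and the mapping covers only the two architectures this specification targets.

```python
import platform
from typing import Optional

# Assumed mapping from machine names (as reported by `uname -m` or
# platform.machine()) to runner architecture labels.
ARCH_LABELS = {
    "aarch64": "ARM64",
    "arm64": "ARM64",
    "x86_64": "X64",
    "amd64": "X64",
}


def runner_arch_label(machine: Optional[str] = None) -> str:
    """Map a machine name to an architecture label, or raise if unknown."""
    name = (machine or platform.machine()).lower()
    try:
        return ARCH_LABELS[name]
    except KeyError:
        raise ValueError(f"unsupported architecture: {name!r}") from None


if __name__ == "__main__":
    print(runner_arch_label("aarch64"))
    # Without an argument the function inspects the current host and raises
    # ValueError for architectures outside the mapping.
    try:
        print(runner_arch_label())
    except ValueError as exc:
        print(exc)
```

A deployment script could compare the returned label with the labels configured for the runner and abort on a mismatch.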

#### Phase 2: Auto Mode Testing
1. Enable auto mode with 5-minute idle timeout
2. Trigger multiple workflows back-to-back
3. Let runner idle and verify auto-deregistration
4. Trigger new workflow and verify auto-registration

**Success Criteria**:
- ✅ Auto-deregistration after idle timeout
- ✅ Re-registration on new job
- ✅ No manual intervention required
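
The idle-timeout behaviour exercised in Phase 2 can be checked without waiting five real minutes by injecting a clock. The following sketch is an assumption about how such tracking might look, not the automation's actual code: `IdleTracker` and its methods are hypothetical names.

```python
import time
from typing import Callable


class IdleTracker:
    """Decides when an idle runner should deregister."""

    def __init__(
        self,
        idle_timeout_s: float = 300.0,  # 5-minute timeout from Phase 2
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._busy = False
        self._idle_since = clock()

    def job_started(self) -> None:
        self._busy = True

    def job_finished(self) -> None:
        # The idle period starts when the most recent job ends.
        self._busy = False
        self._idle_since = self._clock()

    def should_deregister(self) -> bool:
        if self._busy:
            return False
        return self._clock() - self._idle_since >= self.idle_timeout_s


if __name__ == "__main__":
    fake_now = [0.0]  # mutable cell standing in for a real clock
    tracker = IdleTracker(clock=lambda: fake_now[0])
    tracker.job_started()
    fake_now[0] = 600.0
    tracker.job_finished()
    fake_now[0] = 899.0
    print(tracker.should_deregister())
    fake_now[0] = 900.0
    print(tracker.should_deregister())
```

Back-to-back workflows reset the idle period each time a job finishes, so the runner deregisters only after a full uninterrupted timeout.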

---

## Future Roadmap

### Phase 1: Current (2025 Q1)
- ✅ Runner automation framework complete
- 🚧 ARM64 runner deployment
- ⏳ ARM64 CI enablement

### Phase 2: GPU Support (2025 Q2)
- GPU runner deployment
- CUDA/ROCm environment setup
- GPU-accelerated VSA prototypes

### Phase 3: Advanced Features (2025 Q3)
- Auto-scaling based on queue depth
- Multi-region runner deployment
- Advanced caching strategies

---

## Appendix

### A. Reference Documentation

- [GitHub Actions Runner Documentation](https://docs.github.com/en/actions/hosting-your-own-runners)
- [Runner Automation Guide](RUNNER_AUTOMATION.md)
- [ARM64 Setup Guide](../.github/workflows/ARM64_RUNNER_SETUP.md)
- [Workflow README](../.github/workflows/README.md)

---

**Document Status**: ACTIVE  
**Review Schedule**: Quarterly  
**Next Review**: 2026-03-22