embeddenator 0.20.0-alpha.1

Sparse ternary VSA holographic computing substrate
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
# GitHub Actions Workflows

This directory contains the CI/CD workflows for Embeddenator.

## Workflow Structure

The CI/CD pipeline is split into separate workflows to avoid duplication and provide clear separation of concerns:

### 1. **ci-pre-checks.yml** - Pre-build Validation
**Triggers:** Every push to `main`, every pull request

**Purpose:** Fast checks that must pass before platform-specific builds

**Jobs:**
- Code formatting (`cargo fmt`)
- Linting (`cargo clippy`)
- Unit tests (`cargo test --lib --bins`)
- Doc tests (`cargo test --doc`)
- Python syntax validation
- YAML validation

**Runtime:** ~3-5 minutes

**Note:** This workflow runs on every push/PR and must pass before other workflows run.

---

### 2. **ci-amd64.yml** - AMD64 Build and Test (Required Pre-Merge Check)
**Triggers:** Every pull request, every push to `main`, after pre-checks pass

**Purpose:** Full build and test on AMD64 (x86_64) architecture - **MUST PASS FOR MERGE**

**Jobs:**
- Build release binary
- Run full test suite
- Run integration tests via orchestrator
- Upload artifacts on failure

**Runner:** `ubuntu-latest` (GitHub-hosted, amd64)

**Runtime:** ~5-7 minutes

**Status:** ✅ **ACTIVE** (Required status check for PR merges)

**Note:** This workflow is configured as a **required pre-merge check**. Pull requests cannot be merged to `main` until this workflow completes successfully. This ensures the build is not broken before code reaches the main branch.

---

### 3. **ci-arm64.yml** - ARM64 Build and Test (Self-Hosted Configuration)
**Triggers:** Manual (`workflow_dispatch`) only (pending self-hosted runner setup)

**Purpose:** Full build and test on ARM64 (aarch64) architecture

**Jobs:**
- Verify architecture
- Build release binary
- Run full test suite
- Run integration tests via orchestrator
- Upload artifacts on failure

**Runner:** Self-hosted ARM64 with labels `["self-hosted", "linux", "ARM64"]`

**Runtime:** 
- Self-hosted native: ~8-12 minutes (estimated)

**Status:** ⚠️ **CONFIGURED BUT DISABLED** - Ready for self-hosted runners

**Deployment Plan:**
1. **Phase 1** (Current): Manual testing with self-hosted runners via `workflow_dispatch`
2. **Phase 2** (After validation): Enable automatic trigger on merge to `main` only
3. **Phase 3** (Future): Consider making it a required check if ARM64 becomes critical

**Why Self-Hosted?**
ARM64 support requires self-hosted runners because GitHub Actions does not provide standard hosted ARM64 runners. Self-hosted runners will be deployed and configured with the required labels before this workflow is fully enabled.

#### ARM64 Configuration and Deployment Roadmap

**Current Status:**
The ARM64 workflow is fully configured for self-hosted runners but currently disabled pending runner deployment.

**Root Cause of Previous Issues:**

The ARM64 workflow was hanging because:

1. **Invalid Runner Labels**: Previous attempts used `ubuntu-24.04-arm64-4core` and `ubuntu-24.04-arm64` which don't exist
2. **GitHub Actions Limitation**: Standard GitHub-hosted runners are AMD64 only - no native ARM64 support
3. **Hanging Behavior**: Jobs queued indefinitely waiting for non-existent runners

**Previous Attempts (Resolved):**
- Commit b968753: Used `ubuntu-24.04-arm64` ❌ (invalid label)
- Commit 9790fd3: Used `ubuntu-24.04-arm64-4core` ❌ (invalid label)
- Commit 7252015: Temporarily disabled (correct action)
- Commit 4502381: Diagnosed root cause and documented solutions
- **Current**: Configured for self-hosted with labels `["self-hosted", "linux", "ARM64"]`
**Deployment Roadmap:**

**Phase 1: Infrastructure Setup** (Pending)
- Deploy ARM64 hardware (physical server, VM, or cloud instance)
- Install GitHub Actions runner software on ARM64 hardware
- Register runner with labels: `self-hosted`, `linux`, `ARM64`
- Verify runner appears in: Settings → Actions → Runners

**Phase 2: Testing and Validation**
- Manual workflow trigger testing via `workflow_dispatch`
- Verify architecture detection (`uname -m` should show `aarch64`)
- Run full test suite and validate results
- Performance benchmarking and optimization

**Phase 3: Production Enablement**
- Enable automatic trigger on merge to `main` branch only
- Monitor performance and stability
- Document any ARM64-specific issues or workarounds

**Phase 4: Future Enhancements** (Optional)
- Consider making ARM64 a required check if it becomes critical
- Evaluate multi-runner setup for redundancy
- Optimize build caching for ARM64

**Why Self-Hosted?**

Self-hosted ARM64 runners are the **only practical solution** because:
- ✅ Fast native execution (no emulation overhead)
- ✅ Full control over hardware specs
- ✅ Cost-effective for frequent builds
- ✅ Can be deployed on existing infrastructure
- ❌ GitHub doesn't provide standard hosted ARM64 runners
- ❌ QEMU emulation is 5-10x slower and unreliable

---

## Automated Runner Management System

🎉 **NEW**: Embeddenator now includes a comprehensive Python-based automation system for managing self-hosted runners!

### Overview

The `runner_manager.py` script provides complete lifecycle automation:
- ✨ Automated registration with short-lived tokens
- 🔄 Complete lifecycle management (register → run → deregister)
- ⏱️ Auto-deregistration after configurable idle timeout
- 🎯 Manual mode for persistent runners
- 🚀 Multi-runner deployment support
- 📊 Health monitoring and status reporting

### Quick Start

```bash
# 1. Configure (required: GITHUB_REPOSITORY and GITHUB_TOKEN)
cp .env.example .env
# Edit .env with your repository and GitHub PAT

# 2. Run in auto mode (registers, starts, monitors, auto-deregisters when idle)
python3 runner_manager.py run

# 3. Or run in manual mode (keeps running until stopped)
RUNNER_MODE=manual python3 runner_manager.py run
```

### Key Features

**Auto Mode** (Cost Optimized):
- Registers runner automatically
- Monitors job queue
- Auto-deregisters after idle timeout (default: 5 minutes)
- Perfect for sporadic CI/CD builds

**Manual Mode** (Persistent):
- Keeps runner alive until explicitly stopped
- Ideal for development environments
- Full control over lifecycle

**Multi-Runner Support**:
```bash
# Deploy 4 runners for parallel builds
python3 runner_manager.py run --runner-count 4
```

### Documentation

For complete documentation, see: [`docs/RUNNER_AUTOMATION.md`](../../docs/RUNNER_AUTOMATION.md)

Topics covered:
- Configuration options (50+ environment variables)
- Deployment modes and strategies
- Advanced features (ephemeral runners, resource management)
- Troubleshooting and security best practices
- Integration examples (systemd, Docker, cron)

---

## Manual Self-Hosted Runner Setup

If you prefer manual setup instead of using the automation system:

**Self-Hosted Runner Setup Guide:**

```bash
# On ARM64 hardware (aarch64):

# 1. Download runner
mkdir actions-runner && cd actions-runner
curl -o actions-runner-linux-arm64-2.311.0.tar.gz -L \
  https://github.com/actions/runner/releases/download/v2.311.0/actions-runner-linux-arm64-2.311.0.tar.gz
tar xzf ./actions-runner-linux-arm64-2.311.0.tar.gz

# 2. Configure with labels
./config.sh --url https://github.com/tzervas/embeddenator \
  --token YOUR_TOKEN \
  --labels self-hosted,linux,ARM64 \
  --name arm64-runner-1

# 3. Install as service
sudo ./svc.sh install
sudo ./svc.sh start

# 4. Verify
sudo ./svc.sh status
```

**Monitoring:**
- Runner status: Settings → Actions → Runners
- Workflow runs: Actions tab → CI arm64 Build and Test
- Logs: Check runner machine logs at `_diag/` directory

---

### 4. **build-holographic-os.yml** - OS Container Builds
**Triggers:** Manual (`workflow_dispatch`)

**Purpose:** Build holographic OS containers for specific configurations

**Status:** ✅ Active (manual trigger)

---

### 5. **build-push-images.yml** - Multi-OS Image Pipeline
**Triggers:** Manual (`workflow_dispatch`)

**Purpose:** Build and push multiple OS configurations to GHCR

**Default Targets:** AMD64 only (debian-stable, debian-testing, ubuntu-stable)
- ARM64 builds available via manual input but not recommended in CI (emulation too slow)

**Features:**
- Matrix-based parallel builds
- Configurable via comma-separated input
- Optional test execution before build
- GHCR push with proper tagging

**Status:** ✅ Active (manual trigger)

---

### 6. **nightly-builds.yml** - Automated Nightly Builds
**Triggers:** Daily at 2 AM UTC (`cron: '0 2 * * *'`)

**Purpose:** Build bleeding-edge images with nightly Rust and latest OS packages

**Targets:** AMD64 only (ARM64 emulation too slow for CI)
- Debian Testing/Sid (amd64)
- Ubuntu Devel/Rolling (amd64)

**Note:** ARM64 builds should be done locally or on self-hosted ARM64 runners

**Status:** ✅ Active (scheduled)

---

### 7. **build-push-arm64.yml** - ARM64 Image Builds (Self-Hosted)
**Triggers:** Manual (`workflow_dispatch`)

**Purpose:** Build and push ARM64 images using self-hosted runners (local or QEMU-emulated)

**Runner Options:**
- **Large runner:** 10 cores, 16GB RAM - builds 4 configs in parallel
- **Multi runners:** 4x runners with 4 cores, 6GB RAM each - distributed builds
- **Native ARM64:** Physical ARM64 hardware

**Features:**
- Disk space management (automatic cleanup)
- Configurable parallelism based on runner type
- GHCR push with proper tagging
- Build metrics and monitoring

**Default Targets:** debian-stable-arm64, debian-testing-arm64, ubuntu-stable-arm64

**Setup:** See [ARM64_RUNNER_SETUP.md](./ARM64_RUNNER_SETUP.md) for detailed instructions

**Status:** ✅ Active (manual trigger, requires self-hosted runner)

---

## Workflow Dependencies

```
On Pull Request:
  ├─ ci-pre-checks.yml (runs first, fast validation)
  │   └─ If successful:
  │       └─ ci-amd64.yml (REQUIRED FOR MERGE - must pass)
  │
  └─ ci-arm64.yml (DISABLED - awaiting self-hosted runners)
                   (Future: will run on merge to main only)

On Push to Main:
  ├─ ci-pre-checks.yml
  ├─ ci-amd64.yml
  └─ ci-arm64.yml (will be enabled post-deployment)

Manual/Scheduled:
  ├─ build-holographic-os.yml (manual)
  ├─ build-push-images.yml (manual - AMD64 only)
  ├─ build-push-arm64.yml (manual - ARM64 only, self-hosted)
  └─ nightly-builds.yml (scheduled daily at 2 AM UTC - AMD64 only)
```

**Key Points:**
- **Pre-checks** run on every PR commit (fast feedback)
- **AMD64** is a **required check** - PR cannot merge if it fails
- **ARM64** will trigger only on merge to main (post-deployment) to reduce costs
- Manual workflows remain independent

## Key Improvements from Previous Version

### Before (Old ci.yml):
❌ Two separate jobs: `build-test` + `multi-arch`
❌ Duplicate runs: amd64 tests ran twice
❌ Sequential execution: `multi-arch` waited for `build-test`
❌ Invalid ARM64 runner causing hangs
❌ Total time: ~25-30 minutes (with duplicates and waits)

### After (New Structure):
✅ Three separate workflows: pre-checks, amd64, arm64
✅ No duplication: each test runs once
✅ Parallel execution: pre-checks and platform builds can overlap
✅ ARM64 properly diagnosed and documented
✅ Total time: ~5-7 minutes (amd64 only, no duplicates)

## Performance Metrics

| Workflow | Runtime | Status | Notes |
|----------|---------|--------|-------|
| ci-pre-checks.yml | ~3-5 min | ✅ Active | Every PR/push |
| ci-amd64.yml | ~5-7 min | ✅ Active | **Required for merge** |
| ci-arm64.yml | ~8-12 min (est.) | ⚠️ Configured, pending deployment | Self-hosted only |
| **Total (current)** | **~5-7 min** | **-50% vs before** | AMD64 only |
| **Total (future w/ ARM64)** | **~10-15 min** | **Merge to main only** | Parallel execution |

## Testing Locally

### Pre-checks:
```bash
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --lib --bins --verbose
cargo test --doc --verbose
```

### AMD64 Build:
```bash
cargo build --release --verbose
cargo test --verbose
python3 orchestrator.py --mode build --verbose
python3 orchestrator.py --mode test --verbose
```

### ARM64 (if self-hosted runner available):
```bash
# On ARM64 hardware:
./ci_build_monitor.sh linux/arm64 full 3600
```

## Troubleshooting

### "Workflow not triggering"
- Check: `.github/workflows/` files have correct `on:` triggers
- Check: YAML syntax is valid (`python3 -c "import yaml; yaml.safe_load(open('file.yml'))"`)
- Check: Workflow file is committed and pushed

### "Workflow hanging/queued forever"
- **Most likely:** Invalid runner label
- Check: Runner actually exists (`gh api /repos/OWNER/REPO/actions/runners`)
- Fix: Use valid runner label or setup self-hosted

### "ARM64 tests failing"
- Check: Architecture with `uname -m` (should show `aarch64` or `arm64`)
- Check: If emulation, ensure QEMU is properly setup
- Check: Timeout values are sufficient (emulation is slow)

## Contributing

When adding new workflows:
1. Choose appropriate triggers (avoid duplication)
2. Use descriptive job/step names
3. Add timeout values to prevent hangs
4. Upload artifacts on failure for debugging
5. Document runner requirements
6. Test manually before enabling automatic triggers

## Security

- All workflows use pinned action versions (`@v4`, not `@latest`)
- Permissions are explicitly declared (`contents: read`)
- Secrets are only used where necessary
- Self-hosted runners should be on private repos only

## Architecture Support Roadmap

| Architecture | Status | Runner Type | Trigger | Notes |
|--------------|--------|-------------|---------|-------|
| **amd64 (x86_64)** | ✅ Production | GitHub-hosted | Every PR (required) | Stable, fast |
| **arm64 (aarch64)** | 🚧 Configured | Self-hosted | Manual only (pending) | Ready for deployment |

**Future ARM64 Timeline:**
1. **Completed**: Root cause analysis and workflow configuration
2. 🚧 **In Progress**: Self-hosted runner infrastructure setup
3.**Next**: Manual testing and validation
4.**Future**: Auto-trigger on merge to main
5.**Future**: Evaluate as required check (optional)

---

**Last Updated:** 2025-12-16
**Maintained by:** @tzervas, @copilot

**Document Version:** 2.0 (Self-Hosted ARM64 Configuration)