# orb8 Development Roadmap
**Granular Technical Implementation Plan**
This roadmap breaks down orb8 development into **dependency-based phases** without timeline constraints. Each phase includes technical implementation details, testing strategies, and clear success criteria.
---
## Table of Contents
1. [Roadmap Philosophy](#roadmap-philosophy)
2. [Phase 0: Foundation & Monorepo Setup](#phase-0-foundation--monorepo-setup)
3. [Phase 1: eBPF Infrastructure](#phase-1-ebpf-infrastructure)
4. [Phase 2: Container Identification](#phase-2-container-identification)
5. [Phase 3: Network Tracing MVP](#phase-3-network-tracing-mvp)
6. [Phase 4: Cluster Mode Architecture](#phase-4-cluster-mode-architecture)
7. [Phase 5: Metrics & Observability](#phase-5-metrics--observability)
8. [Phase 6: Syscall Monitoring](#phase-6-syscall-monitoring)
9. [Phase 7: GPU Telemetry (Research & MVP)](#phase-7-gpu-telemetry-research--mvp)
10. [Phase 8: Advanced Features](#phase-8-advanced-features)
11. [Future Enhancements](#future-enhancements)
---
## Roadmap Philosophy
### Principles
1. **One Thing Well**: Each phase delivers a complete, working feature
2. **No Timelines**: Phases complete when done, not by deadlines
3. **Research Spikes**: Explicitly budget time for uncertainty
4. **User Validation**: Get real users after each major phase
5. **Technical Debt**: Document it, don't accumulate it
### Success Criteria Per Phase
Each phase must meet:
- ✅ Feature complete (not prototype)
- ✅ Integration tests passing
- ✅ Documentation written
- ✅ Deployed to test cluster
- ✅ Validated with real workloads
### Phase Dependencies
```
Phase 0 (Foundation)
↓
Phase 1 (eBPF Infrastructure) ←─────┐
↓ │
Phase 2 (Container ID) │
↓ │
Phase 3 (Network MVP) ←─ User Validation
↓ │
Phase 4 (Cluster Mode) │
↓ │
Phase 5 (Metrics) ←────── User Validation
↓ │
Phase 6 (Syscall) ─────────────────┘
↓
Phase 7 (GPU) ←──── Research Spike
↓
Phase 8 (Advanced)
```
---
## Phase 0: Foundation & Monorepo Setup
**Goal**: Establish Cargo workspace, development environment, and CI/CD
**Status**: ✅ COMPLETE
### Tasks
#### 0.1: Cargo Workspace Structure
- [x] Root `Cargo.toml` with workspace members
- [x] `orb8-probes/` crate skeleton
- [x] `orb8-common/` crate with shared types
- [x] `orb8-agent/` crate skeleton
- [x] `orb8-server/` crate skeleton
- [x] `orb8-cli/` crate skeleton
- [x] `orb8-proto/` crate skeleton
**Deliverable**:
```bash
cargo build --workspace # All crates compile
```
#### 0.2: Development Environment
- [x] Lima/QEMU VM configuration (`.lima/orb8-dev.yaml`)
- [x] Makefile with `make magic`, `make dev`, `make shell`
- [x] Linux kernel 5.15+ with BTF enabled
- [x] aya dependencies and build toolchain
**Deliverable**:
```bash
make dev # VM ready with all tools
make shell # Can enter VM and run cargo build
```
#### 0.3: CI/CD Pipeline
- [x] GitHub Actions workflow for `cargo test`
- [x] GitHub Actions workflow for `cargo clippy`
- [x] GitHub Actions workflow for `cargo fmt --check`
- [x] Crates.io publishing (orb8, orb8-common, orb8-cli, orb8-agent)
- [ ] Container image builds (agent, server, CLI)
- [ ] Multi-arch builds (amd64, arm64)
**Deliverable**: PR checks pass before merge
#### 0.4: Project Documentation
- [x] README.md with project overview
- [x] ARCHITECTURE.md (detailed technical design)
- [x] ROADMAP.md (this file)
- [x] DEVELOPMENT.md (dev setup guide)
- [ ] CONTRIBUTING.md (contribution guidelines)
- [x] LICENSE (Apache 2.0)
**Deliverable**: New contributors can onboard in <30 minutes
---
## Distribution Strategy
orb8 uses multiple distribution channels depending on the use case.
### crates.io (v0.0.1)
| Crate | Description | Install |
|-------|-------------|---------|
| `orb8` | Root library with re-exports | `cargo add orb8` |
| `orb8-common` | Shared types (eBPF/userspace) | `cargo add orb8-common` |
| `orb8-cli` | CLI command definitions | `cargo add orb8-cli` |
| `orb8-agent` | Node agent binary (Linux-only) | `cargo install orb8-agent` |
### Not Published to crates.io
| Crate | Reason |
|-------|--------|
| `orb8-probes` | eBPF bytecode (bpfel target, embedded in agent) |
| `orb8-server` | Stub implementation (Phase 4) |
| `orb8-proto` | Stub implementation (Phase 4) |
### Future Distribution (Phase 4+)
- [ ] Container images on ghcr.io (`orb8/agent`, `orb8/server`)
- [ ] Helm chart for Kubernetes deployment
- [ ] GitHub Releases with pre-built binaries
- [ ] Multi-arch builds (amd64, arm64)
- [ ] `orb8-server` and `orb8-proto` on crates.io (when implemented)
### Release Process
**Version Scheme**: [Semantic Versioning](https://semver.org/)
- `0.0.x` - Initial development (breaking changes expected)
- `0.2.0` - First feature-complete release (Phase 3: Network Tracing MVP)
- `1.0.0` - Production-ready (Phase 8 complete)
**Automated Release** (via GitHub Actions):
1. Update version in workspace `Cargo.toml` files
2. Update `CHANGELOG.md` with release notes
3. Create and push tag: `git tag -a v0.0.2 -m "Release v0.0.2" && git push origin v0.0.2`
4. CI validates, publishes to crates.io, creates GitHub Release
**Manual Fallback**:
```bash
# Publish in dependency order
cargo publish -p orb8-common
sleep 15
cargo publish -p orb8-cli
sleep 15
cargo publish -p orb8-agent
sleep 15
cargo publish -p orb8
```
**Required Secret**: `CARGO_REGISTRY_TOKEN` (add via GitHub repo settings)
---
## Phase 1: eBPF Infrastructure
**Goal**: Load, attach, and manage eBPF probes written in Rust using aya
**Dependencies**: Phase 0
**Status**: ✅ COMPLETE
### Tasks
#### 1.1: aya-bpf Setup
**Files**: `orb8-probes/`
- [x] Create `orb8-probes/Cargo.toml` with aya-bpf dependencies
- [x] Create `orb8-probes/build.rs` for eBPF compilation
- [x] Verify eBPF programs compile to `.bpf.o` format
- [x] Test on kernel 5.15+ with BTF enabled
**Implementation**:
```rust
// orb8-probes/build.rs
// Illustrative build script: the concrete step depends on the aya toolchain
// in use (e.g. an xtask crate or aya-build driving cargo with the
// bpfel-unknown-none target); `aya_bpf_compiler` is a stand-in for it
use aya_bpf_compiler::*;
fn main() {
build_ebpf([
"src/network_probe.rs",
])
.target("bpfel-unknown-none")
.compile()
.unwrap();
}
```
**Success Criteria**:
- ✅ `cargo build -p orb8-probes` produces `.bpf.o` files
- ✅ Files are valid eBPF ELF objects (verify with `llvm-objdump`)
#### 1.2: Minimal "Hello World" Probe
**Files**: `orb8-probes/src/network_probe.rs`
- [x] Create skeleton tc probe that logs "Hello from eBPF"
- [x] Use aya-log-ebpf for logging
- [x] Attach to `lo` (loopback) interface for testing
- [x] Verify logs appear via `/sys/kernel/debug/tracing/trace_pipe`
**Implementation**:
```rust
#![no_std]
#![no_main]
use aya_bpf::{macros::classifier, programs::TcContext, bindings::TC_ACT_OK};
use aya_log_ebpf::info;
#[classifier]
pub fn network_probe(ctx: TcContext) -> i32 {
info!(&ctx, "Hello from eBPF! packet_len={}", ctx.len());
TC_ACT_OK
}
#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
loop {}
}
```
**Success Criteria**:
- ✅ Probe loads without verifier errors
- ✅ Logs appear when pinging localhost
- ✅ No kernel panics or oops
#### 1.3: Probe Loader (User-Space)
**Files**: `orb8-agent/src/probe_loader.rs`
- [x] Create `ProbeManager` struct
- [x] Implement `load_probe()` using aya library
- [x] Implement `attach_tc()` for network interfaces
- [x] Implement `unload_all()` for cleanup
- [ ] Implement pre-flight validation (before loading any probes; sketched after this list):
- Check kernel version >= 5.8 (`uname -r`)
- Verify BTF availability (`/sys/kernel/btf/vmlinux` exists)
- Validate required capabilities (CAP_BPF, CAP_NET_ADMIN, CAP_SYS_ADMIN)
- Test compile a trivial probe to verify toolchain
- [x] Handle errors gracefully (verifier failures, permissions)
- [ ] Graceful degradation strategy:
- If network probe fails, continue with syscall probe only
- If all probes fail, exit with clear error and remediation steps
- [ ] Implement diagnostics command: `orb8 diagnose`
- Check kernel version, BTF, capabilities
- Attempt to load test probe
- Report status and suggest fixes
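A minimal sketch of the pre-flight checks listed above; the module path, error type, and version parsing are illustrative, not the final orb8 API:
```rust
// orb8-agent/src/preflight.rs (hypothetical module)
use std::path::Path;
pub fn preflight() -> Result<(), String> {
    // BTF is required for CO-RE relocations
    if !Path::new("/sys/kernel/btf/vmlinux").exists() {
        return Err("kernel BTF not found at /sys/kernel/btf/vmlinux \
                    (kernel must be built with CONFIG_DEBUG_INFO_BTF=y)".into());
    }
    // Kernel version gate: ring buffers and CAP_BPF need 5.8+
    let release = std::fs::read_to_string("/proc/sys/kernel/osrelease")
        .map_err(|e| e.to_string())?;
    let mut parts = release.trim().splitn(3, '.');
    let major: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    let minor: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    if (major, minor) < (5, 8) {
        return Err(format!("kernel {major}.{minor} is too old; orb8 needs 5.8+"));
    }
    Ok(())
}
```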
**Implementation**:
```rust
// orb8-agent/src/probe_loader.rs
use aya::{Bpf, programs::{Tc, TcAttachType}};
pub struct ProbeManager {
bpf: Bpf,
}
impl ProbeManager {
pub fn load_network_probe() -> Result<Self> {
let mut bpf = Bpf::load_file("network_probe.bpf.o")?;
let program: &mut Tc = bpf
.program_mut("network_probe")
.unwrap()
.try_into()?;
program.load()?;
program.attach("lo", TcAttachType::Ingress)?;
Ok(Self { bpf })
}
pub fn unload(self) {
// aya automatically detaches on drop
}
}
```
**Success Criteria**:
- ✅ Pre-flight checks pass on supported kernels (5.8+)
- ✅ Pre-flight checks fail gracefully on unsupported kernels with helpful errors
- ✅ `orb8-agent` can load probe
- ✅ Probe persists until agent exits
- ✅ Clean unload without leaking eBPF resources
- ✅ `orb8 diagnose` command provides actionable troubleshooting info
#### 1.4: eBPF Maps - Ring Buffer
**Files**: `orb8-probes/src/network_probe.rs`, `orb8-agent/src/collector.rs`
- [x] Define ring buffer in eBPF probe
- [x] Write test events from eBPF → ring buffer
- [x] Poll ring buffer from user-space
- [x] Deserialize events into Rust structs
**Implementation (eBPF side)**:
```rust
use aya_bpf::{helpers::bpf_ktime_get_ns, macros::map, maps::RingBuf};
#[repr(C)]
#[derive(Clone, Copy)]
struct TestEvent {
packet_len: u32,
timestamp: u64,
}
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(256 * 1024, 0);
#[classifier]
pub fn network_probe(ctx: TcContext) -> i32 {
let event = TestEvent {
packet_len: ctx.len(),
timestamp: unsafe { bpf_ktime_get_ns() },
};
    let _ = EVENTS.output(&event, 0); // output can fail when the buffer is full
TC_ACT_OK
}
```
**Implementation (User-space side)**:
```rust
// orb8-agent/src/collector.rs
use aya::maps::{MapData, RingBuf};
pub struct EventCollector {
    // aya's user-space RingBuf is generic over the map handle, not the event type
    ring_buf: RingBuf<MapData>,
}
impl EventCollector {
    pub fn poll(&mut self) -> Vec<TestEvent> {
        let mut events = Vec::new();
        // next() yields the raw bytes the probe wrote; copy them out unaligned
        while let Some(item) = self.ring_buf.next() {
            let event: TestEvent = unsafe {
                std::ptr::read_unaligned(item.as_ptr() as *const TestEvent)
            };
            events.push(event);
        }
        events
    }
}
```
**Success Criteria**:
- ✅ Events flow from kernel → user space
- ✅ No data corruption
- ✅ High-throughput stress test (1M+ events/sec)
#### 1.5: Testing Infrastructure
- [ ] Unit tests for probe loader
- [ ] Integration test: load probe, send traffic, collect events
- [ ] Benchmark: measure overhead of empty probe
**Test**:
```rust
#[test]
fn test_probe_lifecycle() {
let manager = ProbeManager::load_network_probe().unwrap();
// Send traffic via loopback
std::process::Command::new("ping")
.args(["-c", "10", "127.0.0.1"])
.output()
.unwrap();
// Should have events
let collector = EventCollector::new(&manager);
let events = collector.poll();
assert!(events.len() >= 10);
// Cleanup
drop(manager);
}
```
**Phase 1 Deliverables**
✅ eBPF probes compile with aya-bpf
✅ User-space agent loads and attaches probes
✅ Ring buffer communication working
✅ Integration test passes on Linux VM
✅ Documentation: "eBPF Probe Development Guide"
---
## Phase 2: Container Identification
**Goal**: Map eBPF events to Kubernetes pods via cgroup IDs
**Dependencies**: Phase 1
**Status**: ✅ COMPLETE (MVP implementation)
> **Note**: `bpf_get_current_cgroup_id()` is not available to TC classifiers on some kernels (e.g. 5.15).
> MVP uses K8s API-based pod enrichment with cgroup_id=0 fallback.
### Tasks
#### 2.1: cgroup ID Extraction in eBPF
**Files**: `orb8-probes/src/network_probe.rs`
- [ ] Call `bpf_get_current_cgroup_id()` in probe
- [ ] Include cgroup_id in event struct
- [ ] Verify cgroup ID is non-zero and stable
**Implementation**:
```rust
#[repr(C)]
struct NetworkEvent {
cgroup_id: u64, // NEW
packet_len: u32,
timestamp: u64,
}
#[classifier]
pub fn network_probe(ctx: TcContext) -> i32 {
let cgroup_id = unsafe {
aya_bpf::helpers::bpf_get_current_cgroup_id()
};
let event = NetworkEvent {
cgroup_id,
packet_len: ctx.len(),
timestamp: unsafe { bpf_ktime_get_ns() },
};
EVENTS.output(&event, 0);
TC_ACT_OK
}
```
**Success Criteria**:
- ✅ cgroup_id field populated
- ✅ Different containers have different cgroup IDs
- ✅ Same container has stable cgroup ID across events
#### 2.2: cgroup Filesystem Resolver
**Files**: `orb8-agent/src/k8s/cgroup.rs`
- [ ] Traverse `/sys/fs/cgroup/kubepods.slice/` hierarchy
- [ ] Map pod UID + container ID → cgroup inode
- [ ] Handle all QoS classes (Guaranteed, Burstable, BestEffort)
- [ ] Handle cgroup v2 vs v1 (prefer v2)
- [ ] Auto-detect container runtime from node (containerd, CRI-O, Docker; prefix sketch below)
- [ ] Support all runtime-specific cgroup path formats:
- containerd: `cri-containerd-{id}.scope`
- CRI-O: `crio-{id}.scope`
- Docker: `docker-{id}.scope`
- [ ] Handle missing cgroups with retry logic (pod may not be ready yet)
- [ ] Add integration tests with all three container runtimes
**Implementation**:
```rust
// orb8-agent/src/k8s/cgroup.rs
use std::fs;
use std::os::unix::fs::MetadataExt;
pub struct CgroupResolver {
cgroup_root: String,
}
impl CgroupResolver {
pub fn get_pod_cgroup_id(
&self,
pod_uid: &str,
container_id: &str,
) -> Result<u64> {
        let uid = pod_uid.replace('-', "_");
        // Guaranteed pods sit directly under kubepods.slice; Burstable and
        // BestEffort pods live under an intermediate QoS slice
        let candidates = [
            format!(
                "{}/kubepods.slice/kubepods-pod{}.slice/cri-containerd-{}.scope",
                self.cgroup_root, uid, container_id
            ),
            format!(
                "{}/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod{}.slice/cri-containerd-{}.scope",
                self.cgroup_root, uid, container_id
            ),
            format!(
                "{}/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod{}.slice/cri-containerd-{}.scope",
                self.cgroup_root, uid, container_id
            ),
        ];
        for path in candidates {
            if let Ok(metadata) = fs::metadata(&path) {
                // On cgroup v2 the directory inode is the ID reported by
                // bpf_get_current_cgroup_id()
                return Ok(metadata.ino());
            }
        }
Err(Error::CgroupNotFound)
}
}
```
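The runtime auto-detection called out in the checklist can key off the `containerID` string the kubelet reports, which is prefixed with the runtime name; a hedged sketch:
```rust
// Map the kubelet's containerID scheme ("containerd://…", "cri-o://…",
// "docker://…") to the matching systemd scope prefix
fn scope_prefix(container_id_field: &str) -> &'static str {
    match container_id_field.split("://").next() {
        Some("cri-o") => "crio-",
        Some("docker") => "docker-",
        _ => "cri-containerd-", // containerd is the common default
    }
}
```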
**Success Criteria**:
- ✅ Resolves cgroup ID for running pods
- ✅ Handles all QoS classes
- ✅ Supports containerd, CRI-O, and Docker runtimes
- ✅ Auto-detects runtime without manual configuration
- ✅ Returns error for non-existent pods
#### 2.3: Kubernetes Pod Watcher
**Files**: `orb8-agent/src/k8s/watcher.rs`
- [ ] Watch all pods using kube-rs
- [ ] Extract pod UID, namespace, name, container IDs
- [ ] Resolve cgroup ID for each container
- [ ] Maintain in-memory map: `cgroup_id → PodMetadata`
- [ ] Handle pod lifecycle (Added, Modified, Deleted)
**Implementation**:
```rust
// orb8-agent/src/k8s/watcher.rs
use futures::{StreamExt, TryStreamExt};
use kube::{Api, Client, runtime::{watcher, watcher::Event}};
use k8s_openapi::api::core::v1::Pod;
use std::collections::HashMap;
pub struct PodWatcher {
k8s_client: Client,
cgroup_resolver: CgroupResolver,
metadata_map: HashMap<u64, PodMetadata>,
}
impl PodWatcher {
pub async fn watch(&mut self) -> Result<()> {
let pods: Api<Pod> = Api::all(self.k8s_client.clone());
let mut stream = watcher(pods, Default::default()).boxed();
while let Some(event) = stream.try_next().await? {
match event {
Event::Applied(pod) => self.handle_pod_added(pod).await?,
Event::Deleted(pod) => self.handle_pod_deleted(pod).await?,
_ => {}
}
}
Ok(())
}
async fn handle_pod_added(&mut self, pod: Pod) -> Result<()> {
let pod_uid = pod.metadata.uid.unwrap();
let namespace = pod.metadata.namespace.unwrap();
let name = pod.metadata.name.unwrap();
if let Some(status) = pod.status {
for container in status.container_statuses.unwrap_or_default() {
if let Some(container_id) = container.container_id {
let id = container_id.split("://").nth(1).unwrap();
let cgroup_id = self.cgroup_resolver
.get_pod_cgroup_id(&pod_uid, id)?;
self.metadata_map.insert(cgroup_id, PodMetadata {
namespace: namespace.clone(),
pod_name: name.clone(),
container_name: container.name.clone(),
});
info!("Mapped cgroup {} → {}/{}", cgroup_id, namespace, name);
}
}
}
Ok(())
}
}
```
**Success Criteria**:
- ✅ Watches all pods in cluster
- ✅ Builds cgroup_id → pod mapping
- ✅ Updates map on pod lifecycle events
- ✅ Handles network failures gracefully
#### 2.3.1: Watch Reliability (Critical for Production)
**Purpose**: Ensure pod watcher recovers from disconnections without losing metadata
- [ ] Implement reconnection logic with exponential backoff
- Initial retry: 1 second
- Max backoff: 30 seconds
- Retry indefinitely (never give up)
- [ ] Full resync on reconnect
- List all pods via Kubernetes API
- Rebuild entire `cgroup_id → pod` map
- Log number of pods resynced
- [ ] Buffer events for unknown cgroups (sketched below)
- Queue events for unknown cgroup IDs (up to 10 seconds)
- Retry enrichment when metadata arrives
- Discard after timeout to prevent memory leak
- [ ] Expose watch health metrics
- `orb8_k8s_watch_connected` (gauge: 0 or 1)
- `orb8_k8s_watch_reconnections_total` (counter)
- `orb8_k8s_last_sync_timestamp` (gauge, Unix timestamp)
- `orb8_k8s_pods_tracked` (gauge)
**Implementation Pattern**:
```rust
// Spawn watch with automatic reconnection
tokio::spawn(async move {
let mut backoff = Duration::from_secs(1);
loop {
match pod_watcher.watch().await {
Ok(_) => {
warn!("Pod watch stream ended, reconnecting...");
backoff = Duration::from_secs(1); // Reset backoff on clean exit
}
Err(e) => {
error!("Pod watch failed: {}, reconnecting in {:?}", e, backoff);
tokio::time::sleep(backoff).await;
backoff = std::cmp::min(backoff * 2, Duration::from_secs(30));
}
}
// Full resync before reconnecting
if let Err(e) = pod_watcher.resync_all().await {
error!("Resync failed: {}", e);
}
}
});
```
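The unknown-cgroup buffering from the checklist might look like the sketch below; `PendingEvents` is hypothetical, and it assumes `NetworkEvent: Copy` and `PodMetadata: Clone`:
```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};
struct PendingEvents {
    queue: VecDeque<(Instant, NetworkEvent)>,
    max_age: Duration, // e.g. 10 seconds, then discard
}
impl PendingEvents {
    fn push(&mut self, event: NetworkEvent) {
        self.queue.push_back((Instant::now(), event));
    }
    // Retry enrichment for buffered events; drop anything older than max_age
    fn drain_ready(
        &mut self,
        metadata: &HashMap<u64, PodMetadata>,
    ) -> Vec<(NetworkEvent, PodMetadata)> {
        let now = Instant::now();
        let mut ready = Vec::new();
        self.queue.retain(|(seen, event)| {
            if let Some(meta) = metadata.get(&event.cgroup_id) {
                ready.push((*event, meta.clone()));
                false // enriched: remove from queue
            } else {
                now.duration_since(*seen) < self.max_age // keep until timeout
            }
        });
        ready
    }
}
```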
**Success Criteria**:
- ✅ Watch reconnects automatically after network failures
- ✅ All pods resynced within 5 seconds of reconnection
- ✅ No events permanently lost to "unknown" cgroup
- ✅ Metrics expose watch health status
#### 2.4: Event Enrichment
**Files**: `orb8-agent/src/enricher.rs`
- [ ] Look up cgroup_id in metadata map
- [ ] Enrich events with namespace, pod name, container name
- [ ] Handle unknown cgroup IDs gracefully
**Implementation**:
```rust
// orb8-agent/src/enricher.rs
pub struct EventEnricher {
metadata_map: Arc<RwLock<HashMap<u64, PodMetadata>>>,
}
impl EventEnricher {
pub fn enrich(&self, event: NetworkEvent) -> EnrichedEvent {
let metadata_map = self.metadata_map.read().unwrap();
if let Some(metadata) = metadata_map.get(&event.cgroup_id) {
EnrichedEvent {
namespace: metadata.namespace.clone(),
pod_name: metadata.pod_name.clone(),
container_name: metadata.container_name.clone(),
packet_len: event.packet_len,
timestamp: event.timestamp,
}
} else {
// Unknown cgroup - might be host process
EnrichedEvent {
namespace: "unknown".to_string(),
pod_name: format!("cgroup-{}", event.cgroup_id),
container_name: "unknown".to_string(),
packet_len: event.packet_len,
timestamp: event.timestamp,
}
}
}
}
```
**Success Criteria**:
- ✅ Events enriched with correct pod metadata
- ✅ Unknown cgroups handled without panic
- ✅ Thread-safe access to metadata map
#### 2.5: Integration Test
- [ ] Deploy test pod to Kubernetes
- [ ] Load probe, trigger traffic from test pod
- [ ] Verify events correctly attributed to test pod
- [ ] Verify namespace and pod name are correct
**Test Scenario**:
```bash
# Deploy nginx test pod
kubectl run test-nginx --image=nginx
# Run agent
orb8-agent &
# Generate traffic from pod
kubectl exec test-nginx -- curl localhost
# Verify events
# Expected: events with namespace=default, pod_name=test-nginx
```
**Phase 2 Deliverables**
✅ cgroup ID extraction in eBPF probes
✅ Kubernetes pod watcher
✅ cgroup → pod mapping
✅ Event enrichment with pod metadata
✅ Integration test with real Kubernetes pod
✅ Documentation: "Container Identification Design"
---
## Phase 3: Network Tracing MVP
**Goal**: Production-ready network flow tracing per pod
**Dependencies**: Phase 2
**Estimated Effort**: 2-3 weeks
### Tasks
#### 3.1: Full Network Event Structure
**Files**: `orb8-common/src/events.rs`, `orb8-probes/src/network_probe.rs`
- [ ] Define complete `NetworkFlowEvent` struct
- [ ] Extract src/dst IP, src/dst port, protocol
- [ ] Parse Ethernet, IP, TCP/UDP headers
- [ ] Handle malformed packets gracefully
**Implementation**:
```rust
// orb8-common/src/events.rs
#[repr(C)]
#[derive(Clone, Copy)]
pub struct NetworkFlowEvent {
pub cgroup_id: u64,
pub timestamp_ns: u64,
pub src_ip: u32, // IPv4 only for MVP
pub dst_ip: u32,
pub src_port: u16,
pub dst_port: u16,
pub protocol: u8, // IPPROTO_TCP, IPPROTO_UDP
pub bytes: u32,
pub direction: u8, // 0=ingress, 1=egress
}
```
**Implementation (eBPF)**:
```rust
// orb8-probes/src/network_probe.rs
// Header layouts come from generated kernel bindings (or a crate such as
// `network-types`); aya_bpf::bindings stands in for that here
use aya_bpf::bindings::{ethhdr, iphdr, tcphdr, udphdr};
fn try_network_probe(ctx: TcContext) -> Result<i32, ()> {
let cgroup_id = unsafe { bpf_get_current_cgroup_id() };
// Parse Ethernet header
let eth = unsafe { ptr_at::<ethhdr>(&ctx, 0)? };
if unsafe { (*eth).h_proto } != ETH_P_IP.to_be() {
return Ok(TC_ACT_OK); // Not IPv4, skip
}
// Parse IP header
let ip = unsafe { ptr_at::<iphdr>(&ctx, ETH_HLEN)? };
let protocol = unsafe { (*ip).protocol };
// Parse transport header
let (src_port, dst_port) = match protocol {
IPPROTO_TCP => {
let tcp = unsafe { ptr_at::<tcphdr>(&ctx, ETH_HLEN + IP_HLEN)? };
(unsafe { (*tcp).source }.to_be(), unsafe { (*tcp).dest }.to_be())
},
IPPROTO_UDP => {
let udp = unsafe { ptr_at::<udphdr>(&ctx, ETH_HLEN + IP_HLEN)? };
(unsafe { (*udp).source }.to_be(), unsafe { (*udp).dest }.to_be())
},
_ => (0, 0),
};
let event = NetworkFlowEvent {
cgroup_id,
timestamp_ns: unsafe { bpf_ktime_get_ns() },
src_ip: unsafe { (*ip).saddr },
dst_ip: unsafe { (*ip).daddr },
src_port,
dst_port,
protocol,
bytes: ctx.len(),
direction: 0, // ingress
};
FLOW_EVENTS.output(&event, 0).map_err(|_| ())?;
Ok(TC_ACT_OK)
}
```
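The probe above leans on a `ptr_at` helper (and on `ETH_P_IP`/`ETH_HLEN`/`IP_HLEN` constants assumed to be defined elsewhere) to bounds-check packet access before the verifier will accept the pointer arithmetic; a common aya-style sketch:
```rust
use core::mem;
// Bounds-checked pointer into packet data; Err lets the caller bail out
// early on truncated packets
#[inline(always)]
unsafe fn ptr_at<T>(ctx: &TcContext, offset: usize) -> Result<*const T, ()> {
    let start = ctx.data();
    let end = ctx.data_end();
    if start + offset + mem::size_of::<T>() > end {
        return Err(());
    }
    Ok((start + offset) as *const T)
}
```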
**Success Criteria**:
- ✅ Correctly parses TCP and UDP packets
- ✅ Skips non-IP packets without errors
- ✅ Handles fragmented packets (or documents limitation)
#### 3.2: Multi-Interface Attachment
**Files**: `orb8-agent/src/probe_loader.rs`
- [ ] Discover all network interfaces (except loopback)
- [ ] Attach probe to each interface (ingress + egress)
- [ ] Handle interface hotplug (containers starting/stopping)
**Implementation**:
```rust
// orb8-agent/src/probe_loader.rs
use nix::net::if_::if_nameindex;
impl ProbeManager {
pub fn attach_to_all_interfaces(&mut self) -> Result<()> {
let interfaces = if_nameindex()?;
for iface in interfaces {
let name = iface.name().to_str().unwrap();
// Skip loopback
if name == "lo" {
continue;
}
// Skip non-veth (only monitor container traffic)
if !name.starts_with("veth") && !name.starts_with("eth") {
continue;
}
self.attach_tc(name, TcAttachType::Ingress)?;
self.attach_tc(name, TcAttachType::Egress)?;
info!("Attached to interface {}", name);
}
Ok(())
}
}
```
**Success Criteria**:
- ✅ Attaches to all veth interfaces
- ✅ Captures both ingress and egress traffic
- ✅ Doesn't break on interface churn
#### 3.3: Flow Aggregation
**Files**: `orb8-agent/src/aggregator.rs`
- [ ] Aggregate raw packet events into flows
- [ ] Flow key: (src_ip, dst_ip, src_port, dst_port, protocol)
- [ ] Track bytes sent/received per flow
- [ ] Time-window aggregation (e.g., 10-second buckets)
**Implementation**:
```rust
// orb8-agent/src/aggregator.rs
use std::collections::HashMap;
use std::time::{Duration, Instant};
#[derive(Clone, Hash, Eq, PartialEq)]
struct FlowKey {
namespace: String,
pod_name: String,
src_ip: u32,
dst_ip: u32,
src_port: u16,
dst_port: u16,
protocol: u8,
}
#[derive(Clone)]
struct FlowStats {
bytes_sent: u64,
bytes_received: u64,
packets_sent: u64,
packets_received: u64,
first_seen: Instant,
last_seen: Instant,
}
pub struct FlowAggregator {
flows: HashMap<FlowKey, FlowStats>,
window_duration: Duration,
}
impl FlowAggregator {
pub fn add_event(&mut self, event: EnrichedNetworkFlow) {
let key = FlowKey {
namespace: event.namespace,
pod_name: event.pod_name,
src_ip: event.src_ip,
dst_ip: event.dst_ip,
src_port: event.src_port,
dst_port: event.dst_port,
protocol: event.protocol,
};
let stats = self.flows.entry(key).or_insert(FlowStats {
bytes_sent: 0,
bytes_received: 0,
packets_sent: 0,
packets_received: 0,
first_seen: Instant::now(),
last_seen: Instant::now(),
});
if event.direction == 1 { // egress
stats.bytes_sent += event.bytes as u64;
stats.packets_sent += 1;
} else { // ingress
stats.bytes_received += event.bytes as u64;
stats.packets_received += 1;
}
stats.last_seen = Instant::now();
}
pub fn flush_expired_flows(&mut self) -> Vec<(FlowKey, FlowStats)> {
let now = Instant::now();
let mut expired = Vec::new();
self.flows.retain(|key, stats| {
if now.duration_since(stats.last_seen) > self.window_duration {
expired.push((key.clone(), stats.clone()));
false
} else {
true
}
});
expired
}
}
```
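Wiring the aggregator into a periodic flush task might look like the following sketch; it assumes the aggregator sits behind `Arc<RwLock<…>>` and that `exporter` is the Phase 5 Prometheus exporter:
```rust
// Flush expired flows every window and hand them off downstream;
// `aggregator` (Arc<RwLock<FlowAggregator>>) and `exporter` are assumed
// to exist in the surrounding setup code
tokio::spawn(async move {
    let mut ticker = tokio::time::interval(std::time::Duration::from_secs(10));
    loop {
        ticker.tick().await;
        let expired = aggregator.write().unwrap().flush_expired_flows();
        if !expired.is_empty() {
            exporter.update_metrics(&expired); // Phase 5.1 signature
        }
    }
});
```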
**Success Criteria**:
- ✅ Aggregates packets into flows
- ✅ Correctly separates ingress/egress
- ✅ Expires old flows to prevent memory leak
#### 3.3.1: Network Event Sampling (Critical for High-Traffic Pods)
**Purpose**: Prevent ring buffer overflow at high event rates
**Problem**: At 1M events/sec with 64-byte events, a 1MB ring buffer fills in ~16ms, causing severe event loss.
- [ ] Implement sampling for high-volume flows
- Sample 1:10 for flows exceeding 10,000 packets/sec
- Always capture first 10 packets of new flows (for connection establishment)
- Always capture TCP SYN, FIN, RST packets (critical flow state)
- Add sampling metadata to events for accurate extrapolation
- [ ] Make ring buffer size configurable (sketched below)
- Environment variable: `ORB8_RING_BUFFER_SIZE` (default: 1MB, max: 32MB)
- Per-probe configuration (network vs syscall may need different sizes)
- Validate size is power of 2 (eBPF requirement)
- [ ] Expose ring buffer health metrics
- `orb8_ring_buffer_drops_total{probe="network"}` (counter)
- `orb8_ring_buffer_utilization{probe="network"}` (gauge, 0.0-1.0)
- `orb8_ring_buffer_size_bytes{probe="network"}` (gauge)
- `orb8_ring_buffer_events_total{probe="network"}` (counter)
- [ ] Implement backpressure signaling
- When ring buffer >90% full, signal eBPF probe to increase sampling
- Adaptive sampling rate based on buffer pressure
- Log warnings when sustained high pressure detected
**Implementation**:
```rust
// Sketch: adaptive drop under buffer pressure. eBPF cannot read ring
// buffer utilization directly; user space periodically publishes the
// observed level into a shared map that this check would consult.
if ring_buffer_utilization() > 0.9 {
    // Drop 9 out of 10 events during high pressure
    if bpf_get_prandom_u32() % 10 != 0 {
        return TC_ACT_OK; // Drop event
    }
}
```
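On the user-space side, the configurable buffer size from the checklist reduces to environment parsing plus validation; a sketch using the limits stated above:
```rust
// Parse ORB8_RING_BUFFER_SIZE, defaulting to 1 MiB; eBPF ring buffers
// must be a power of two, and the roadmap caps the size at 32 MiB
fn ring_buffer_size() -> Result<u32, String> {
    let size: u32 = match std::env::var("ORB8_RING_BUFFER_SIZE") {
        Ok(s) => s.parse().map_err(|e| format!("invalid size: {e}"))?,
        Err(_) => 1 << 20,
    };
    if !size.is_power_of_two() || size > (32 << 20) {
        return Err("size must be a power of two, at most 32 MiB".into());
    }
    Ok(size)
}
```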
**Success Criteria**:
- ✅ Ring buffer drops <0.1% under normal load (1M events/sec)
- ✅ Sampling preserves TCP state transitions (SYN, FIN, RST)
- ✅ Metrics expose ring buffer health
- ✅ Buffer size configurable without recompilation
#### 3.4: CLI Output Formatting
**Files**: `orb8-cli/src/commands/trace.rs`
- [ ] Display flows in human-readable table
- [ ] Format IP addresses (u32 → dotted decimal)
- [ ] Sort by bytes descending
- [ ] Support JSON output format (sketched below)
**Implementation**:
```rust
// orb8-cli/src/commands/trace.rs
use prettytable::{row, Table};
pub async fn handle_trace_network(
namespace: Option<String>,
pod: Option<String>,
duration: Duration,
) -> Result<()> {
// Collect flows
let flows = collect_flows(namespace, pod, duration).await?;
// Display as table
let mut table = Table::new();
table.add_row(row![
"NAMESPACE", "POD", "SRC", "DST", "PROTO", "BYTES", "PACKETS"
]);
for flow in flows {
table.add_row(row![
flow.namespace,
flow.pod_name,
format_ip(flow.src_ip),
format_ip(flow.dst_ip),
format_proto(flow.protocol),
flow.bytes_sent + flow.bytes_received,
flow.packets_sent + flow.packets_received,
]);
}
table.printstd();
Ok(())
}
fn format_ip(ip: u32) -> String {
    // saddr/daddr are captured in network byte order, so the in-memory
    // (native-endian) byte layout already runs first-octet-first
    let octets = ip.to_ne_bytes();
    format!("{}.{}.{}.{}", octets[0], octets[1], octets[2], octets[3])
}
```
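The JSON mode from the checklist is mostly a serde question; a sketch assuming the flow rows derive `Serialize` (field names are illustrative):
```rust
use serde::Serialize;
#[derive(Serialize)]
struct FlowRecord {
    namespace: String,
    pod: String,
    src: String,
    dst: String,
    proto: String,
    bytes: u64,
    packets: u64,
}
fn print_json(flows: &[FlowRecord]) -> anyhow::Result<()> {
    // A single JSON array keeps `jq`-style post-processing simple
    println!("{}", serde_json::to_string_pretty(flows)?);
    Ok(())
}
```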
**Success Criteria**:
- ✅ Human-readable output
- ✅ JSON format for scripting
- ✅ Correct IP address formatting
#### 3.5: End-to-End Test
- [ ] Deploy multi-pod test scenario (client → server)
- [ ] Run orb8 network tracing
- [ ] Verify flows captured in both directions
- [ ] Verify byte counts match actual traffic
**Test Scenario**:
```bash
# Deploy client and server pods
kubectl run server --image=nginx
kubectl run client --image=curlimages/curl -- sh -c "while true; do curl http://server; sleep 1; done"
# Run orb8 tracing
orb8 trace network --namespace default --duration 30s
# Expected output:
# NAMESPACE POD SRC DST PROTO BYTES PACKETS
# default client 10.0.1.5 10.0.1.6 TCP 15KB 50
# default server 10.0.1.6 10.0.1.5 TCP 150KB 50
```
**Phase 3 Deliverables**
✅ Full network packet parsing (IP, TCP, UDP)
✅ Multi-interface attachment
✅ Flow aggregation
✅ CLI with human-readable output
✅ End-to-end test with real pods
✅ Documentation: "Network Tracing User Guide"
✅ **Public Release**: v0.2.0 - Network Tracing MVP
**User Validation Checkpoint**: Get 10-50 users to try network tracing
---
## Phase 4: Cluster Mode Architecture
**Goal**: DaemonSet deployment with central API server
**Dependencies**: Phase 3
**Estimated Effort**: 2-3 weeks
### Tasks
#### 4.1: gRPC Service Definition
**Files**: `orb8-proto/proto/orb8.proto`
- [ ] Define `OrbitService` with query RPCs
- [ ] Define message types (FlowQuery, FlowResponse, etc.)
- [ ] Generate Rust code with tonic
**Implementation**:
```protobuf
// orb8-proto/proto/orb8.proto
syntax = "proto3";
package orb8;
service OrbitService {
rpc QueryFlows(FlowQuery) returns (FlowResponse);
rpc StreamFlows(StreamRequest) returns (stream FlowEvent);
rpc GetAgentStatus(StatusRequest) returns (StatusResponse);
}
message FlowQuery {
  optional string namespace = 1;
optional string pod_name = 2;
optional int64 start_time_ns = 3;
optional int64 end_time_ns = 4;
}
message FlowResponse {
repeated NetworkFlow flows = 1;
}
message NetworkFlow {
string namespace = 1;
string pod_name = 2;
string src_ip = 3;
string dst_ip = 4;
uint32 src_port = 5;
uint32 dst_port = 6;
string protocol = 7;
uint64 bytes = 8;
uint64 packets = 9;
int64 timestamp_ns = 10;
}
```
**Build Script**:
```rust
// orb8-proto/build.rs
fn main() {
tonic_build::configure()
.build_server(true)
.build_client(true)
.compile(&["proto/orb8.proto"], &["proto"])
.unwrap();
}
```
**Success Criteria**:
- ✅ Protobuf compiles without errors
- ✅ Rust code generated in `target/`
#### 4.2: Agent gRPC Server
**Files**: `orb8-agent/src/api_server.rs`
- [ ] Implement `OrbitService` trait
- [ ] Query local aggregator for flow data
- [ ] Filter by namespace/pod
- [ ] Handle time-range queries
**Implementation**:
```rust
// orb8-agent/src/api_server.rs
use orb8_proto::orbit_service_server::{OrbitService, OrbitServiceServer};
use orb8_proto::{FlowQuery, FlowResponse, NetworkFlow};
use std::sync::{Arc, RwLock};
use tonic::{transport::Server, Request, Response, Status};
pub struct AgentApiServer {
aggregator: Arc<RwLock<FlowAggregator>>,
}
#[tonic::async_trait]
impl OrbitService for AgentApiServer {
async fn query_flows(
&self,
request: Request<FlowQuery>,
) -> Result<Response<FlowResponse>, Status> {
let query = request.into_inner();
let aggregator = self.aggregator.read().unwrap();
let flows: Vec<NetworkFlow> = aggregator
.get_flows()
.filter(|f| {
if let Some(ref ns) = query.namespace {
if f.namespace != *ns {
return false;
}
}
if let Some(ref pod) = query.pod_name {
if f.pod_name != *pod {
return false;
}
}
true
})
.map(|f| f.to_proto())
.collect();
Ok(Response::new(FlowResponse { flows }))
}
}
pub async fn serve(aggregator: Arc<RwLock<FlowAggregator>>) -> Result<()> {
let addr = "0.0.0.0:9090".parse()?;
let server = AgentApiServer { aggregator };
Server::builder()
.add_service(OrbitServiceServer::new(server))
.serve(addr)
.await?;
Ok(())
}
```
**Success Criteria**:
- ✅ gRPC server listens on port 9090
- ✅ Responds to QueryFlows RPC
- ✅ Filters work correctly
#### 4.3: Central API Server
**Files**: `orb8-server/src/main.rs`, `orb8-server/src/api.rs`
- [ ] Discover all agent pods via Kubernetes API
- [ ] Route queries to appropriate node agents
- [ ] Aggregate results from multiple agents
- [ ] Expose external gRPC API on port 8080
**Implementation**:
```rust
// orb8-server/src/api.rs
use kube::{Api, Client};
use k8s_openapi::api::core::v1::Pod;
use orb8_proto::orbit_service_client::OrbitServiceClient;
use tonic::{Request, Response, Status};
pub struct CentralApiServer {
k8s_client: Client,
}
impl CentralApiServer {
async fn discover_agents(&self) -> Result<Vec<String>> {
let pods: Api<Pod> = Api::namespaced(self.k8s_client.clone(), "orb8-system");
let pod_list = pods.list(&Default::default()).await?;
let agent_addrs: Vec<String> = pod_list
.items
.iter()
.filter(|p| {
p.metadata.labels.as_ref()
.and_then(|l| l.get("app"))
.map(|v| v == "orb8-agent")
.unwrap_or(false)
})
.filter_map(|p| {
p.status.as_ref()
.and_then(|s| s.pod_ip.as_ref())
.map(|ip| format!("{}:9090", ip))
})
.collect();
Ok(agent_addrs)
}
}
#[tonic::async_trait]
impl OrbitService for CentralApiServer {
async fn query_flows(
&self,
request: Request<FlowQuery>,
) -> Result<Response<FlowResponse>, Status> {
let query = request.into_inner();
// Get all agent addresses
let agent_addrs = self.discover_agents().await
.map_err(|e| Status::internal(format!("Agent discovery failed: {}", e)))?;
// Query all agents in parallel
let mut handles = Vec::new();
for addr in agent_addrs {
let query_clone = query.clone();
            let handle = tokio::spawn(async move {
                // Unify error types: transport errors become a gRPC Status
                let mut client = OrbitServiceClient::connect(format!("http://{}", addr))
                    .await
                    .map_err(|e| Status::unavailable(e.to_string()))?;
                client.query_flows(query_clone).await
            });
handles.push(handle);
}
// Aggregate results
let mut all_flows = Vec::new();
for handle in handles {
if let Ok(Ok(response)) = handle.await {
all_flows.extend(response.into_inner().flows);
}
}
Ok(Response::new(FlowResponse { flows: all_flows }))
}
}
```
**Success Criteria**:
- ✅ Discovers all agent pods
- ✅ Queries agents in parallel
- ✅ Aggregates results correctly
#### 4.4: CLI Cluster Mode
**Files**: `orb8-cli/src/client.rs`
- [ ] Connect to central API server (auto-discover from kubeconfig)
- [ ] Send query via gRPC
- [ ] Display results
**Implementation**:
```rust
// orb8-cli/src/client.rs
pub struct ClusterClient {
server_addr: String,
}
impl ClusterClient {
pub fn from_kubeconfig() -> Result<Self> {
// Discover orb8-server service
let server_addr = "orb8-server.orb8-system.svc.cluster.local:8080".to_string();
Ok(Self { server_addr })
}
pub async fn query_flows(
&self,
namespace: Option<String>,
pod: Option<String>,
) -> Result<Vec<NetworkFlow>> {
let mut client = OrbitServiceClient::connect(
format!("http://{}", self.server_addr)
).await?;
let request = FlowQuery {
namespace,
pod_name: pod,
start_time_ns: None,
end_time_ns: None,
};
let response = client.query_flows(request).await?;
Ok(response.into_inner().flows)
}
}
```
**Success Criteria**:
- ✅ CLI connects to central server
- ✅ Queries work end-to-end
- ✅ Auto-discovery from kubeconfig
#### 4.5: Kubernetes Manifests
**Files**: `deploy/`
- [ ] Namespace (`orb8-system`)
- [ ] ServiceAccount and RBAC
- [ ] DaemonSet for agents
- [ ] Deployment for central server
- [ ] Service for central server
**DaemonSet**:
```yaml
# deploy/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: orb8-agent
namespace: orb8-system
spec:
selector:
matchLabels:
app: orb8-agent
  template:
    metadata:
      labels:
        app: orb8-agent
    spec:
hostNetwork: true
hostPID: true
serviceAccountName: orb8-agent
containers:
- name: agent
image: orb8/agent:latest
securityContext:
privileged: true
ports:
- containerPort: 9090
name: grpc
- containerPort: 9091
name: metrics
```
**Success Criteria**:
- ✅ `kubectl apply -f deploy/` succeeds
- ✅ Agents running on all nodes
- ✅ Central server accessible from CLI
**Phase 4 Deliverables**
✅ gRPC service definition
✅ Agent gRPC API server
✅ Central API server with agent discovery
✅ CLI cluster mode
✅ Kubernetes deployment manifests
✅ End-to-end cluster mode test
✅ Documentation: "Cluster Mode Deployment Guide"
---
## Phase 5: Metrics & Observability
**Goal**: Prometheus exporter, Grafana dashboards
**Dependencies**: Phase 4
**Estimated Effort**: 1-2 weeks
### Tasks
#### 5.1: Prometheus Exporter
**Files**: `orb8-agent/src/prom_exporter.rs`
- [ ] Expose `/metrics` endpoint on port 9091
- [ ] Export flow metrics as Prometheus gauges/counters
- [ ] Labels: namespace, pod, src_ip, dst_ip, protocol
- [ ] Export ring buffer health metrics:
- `orb8_ring_buffer_drops_total{probe="network|syscall"}` (counter)
- `orb8_ring_buffer_utilization{probe="network|syscall"}` (gauge, 0.0-1.0)
- `orb8_ring_buffer_size_bytes{probe="network|syscall"}` (gauge)
- `orb8_ring_buffer_events_total{probe="network|syscall"}` (counter)
- [ ] Export Kubernetes watch health metrics:
- `orb8_k8s_watch_connected` (gauge: 0 or 1)
- `orb8_k8s_watch_reconnections_total` (counter)
- `orb8_k8s_last_sync_timestamp` (gauge, Unix timestamp)
- `orb8_k8s_pods_tracked` (gauge)
**Implementation**:
```rust
// orb8-agent/src/prom_exporter.rs
use prometheus::{CounterVec, Encoder, GaugeVec, Opts, Registry, TextEncoder};
use warp::Filter;
pub struct PrometheusExporter {
registry: Registry,
flow_bytes: CounterVec,
flow_packets: CounterVec,
active_flows: GaugeVec,
}
impl PrometheusExporter {
pub fn new() -> Self {
let registry = Registry::new();
let flow_bytes = CounterVec::new(
Opts::new("orb8_flow_bytes_total", "Total bytes per flow"),
&["namespace", "pod", "direction", "protocol"],
).unwrap();
let flow_packets = CounterVec::new(
Opts::new("orb8_flow_packets_total", "Total packets per flow"),
&["namespace", "pod", "direction", "protocol"],
).unwrap();
let active_flows = GaugeVec::new(
Opts::new("orb8_active_flows", "Number of active flows"),
&["namespace", "pod"],
).unwrap();
registry.register(Box::new(flow_bytes.clone())).unwrap();
registry.register(Box::new(flow_packets.clone())).unwrap();
registry.register(Box::new(active_flows.clone())).unwrap();
Self {
registry,
flow_bytes,
flow_packets,
active_flows,
}
}
pub fn update_metrics(&self, flows: &[(FlowKey, FlowStats)]) {
for (key, stats) in flows {
self.flow_bytes
.with_label_values(&[
&key.namespace,
&key.pod_name,
"egress",
&format_proto(key.protocol),
])
.inc_by(stats.bytes_sent);
self.flow_packets
.with_label_values(&[
&key.namespace,
&key.pod_name,
"egress",
&format_proto(key.protocol),
])
.inc_by(stats.packets_sent);
}
}
pub async fn serve(&self) {
let registry = self.registry.clone();
let metrics_route = warp::path("metrics")
.map(move || {
let encoder = TextEncoder::new();
let metric_families = registry.gather();
let mut buffer = Vec::new();
encoder.encode(&metric_families, &mut buffer).unwrap();
String::from_utf8(buffer).unwrap()
});
warp::serve(metrics_route)
.run(([0, 0, 0, 0], 9091))
.await;
}
}
```
**Success Criteria**:
- ✅ `/metrics` endpoint returns Prometheus format
- ✅ Metrics update in real-time
- ✅ Labels correctly populated
#### 5.2: Prometheus ServiceMonitor
**Files**: `deploy/servicemonitor.yaml`
- [ ] ServiceMonitor CRD for Prometheus Operator
- [ ] Scrape agent metrics on port 9091
**Implementation**:
```yaml
# deploy/servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: orb8-agent
namespace: orb8-system
spec:
selector:
matchLabels:
app: orb8-agent
endpoints:
- port: metrics
interval: 30s
path: /metrics
```
**Success Criteria**:
- ✅ Prometheus scrapes agents successfully
- ✅ Metrics visible in Prometheus UI
#### 5.3: Grafana Dashboards
**Files**: `deploy/grafana-dashboard.json`
- [ ] Network flow dashboard
- [ ] Top talkers (by bytes)
- [ ] Pod-to-pod communication graph
- [ ] Time-series of bytes/packets
**Dashboard Panels**:
1. **Top Pods by Egress Traffic** (bar chart)
2. **Network Bytes Over Time** (time series)
3. **Active Flows** (gauge)
4. **Protocol Breakdown** (pie chart)
5. **Pod Communication Matrix** (heatmap)
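As a starting point, the top-talkers panel can be driven by a query along the lines of `topk(10, sum by (namespace, pod) (rate(orb8_flow_bytes_total{direction="egress"}[5m])))`, built from the counters defined in 5.1.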
**Success Criteria**:
- ✅ Dashboard imports without errors
- ✅ Real-time data visualization
- ✅ Useful for debugging network issues
**Phase 5 Deliverables**
✅ Prometheus exporter on agents
✅ ServiceMonitor for auto-discovery
✅ Grafana dashboard
✅ Documentation: "Metrics and Monitoring Guide"
✅ **Public Release**: v0.3.0 - Cluster Mode with Metrics
**User Validation Checkpoint**: Get feedback on Prometheus integration
---
## Phase 6: Syscall Monitoring
**Goal**: System call tracing for security anomaly detection
**Dependencies**: Phase 1 (eBPF infra), Phase 2 (container ID)
**Estimated Effort**: 1-2 weeks
### Tasks
#### 6.1: Syscall Probe
**Files**: `orb8-probes/src/syscall_probe.rs`
- [ ] Attach to `tracepoint/raw_syscalls/sys_enter`
- [ ] Capture cgroup_id, PID, syscall ID
- [ ] Sampling: 1:100 for hot syscalls (read/write)
**Implementation**:
```rust
// orb8-probes/src/syscall_probe.rs
#![no_std]
#![no_main]
use aya_bpf::{
    helpers::{
        bpf_get_current_cgroup_id, bpf_get_current_pid_tgid,
        bpf_get_prandom_u32, bpf_ktime_get_ns,
    },
    macros::{map, tracepoint},
    maps::RingBuf,
    programs::TracePointContext,
};
use orb8_common::SyscallEvent;
#[map]
static SYSCALL_EVENTS: RingBuf = RingBuf::with_byte_size(512 * 1024, 0);
#[map]
static SAMPLE_RATE: aya_bpf::maps::HashMap<u32, u32> =
aya_bpf::maps::HashMap::with_max_entries(1, 0);
#[tracepoint(name = "syscall_probe")]
pub fn syscall_probe(ctx: TracePointContext) -> u32 {
match try_syscall_probe(ctx) {
Ok(_) => 0,
Err(_) => 1,
}
}
fn try_syscall_probe(ctx: TracePointContext) -> Result<(), ()> {
let syscall_id: i64 = unsafe { ctx.read_at(8)? };
// Sample hot syscalls
if is_hot_syscall(syscall_id as u32) {
// Only trace 1 in 100
if unsafe { bpf_get_prandom_u32() } % 100 != 0 {
return Ok(());
}
}
let cgroup_id = unsafe { bpf_get_current_cgroup_id() };
let pid_tgid = unsafe { bpf_get_current_pid_tgid() };
let pid = (pid_tgid >> 32) as u32;
let event = SyscallEvent {
cgroup_id,
pid,
syscall_id: syscall_id as u32,
timestamp_ns: unsafe { bpf_ktime_get_ns() },
};
SYSCALL_EVENTS.output(&event, 0).map_err(|_| ())?;
Ok(())
}
fn is_hot_syscall(id: u32) -> bool {
matches!(id, 0 | 1 | 2 | 3) // read, write, open, close
}
```
**Success Criteria**:
- ✅ Captures syscalls without overwhelming ring buffer
- ✅ Sampling reduces overhead on hot paths
- ✅ Rare syscalls (execve, ptrace) always traced
#### 6.2: Anomaly Detection
**Files**: `orb8-agent/src/syscall_analyzer.rs`
- [ ] Baseline normal syscall patterns per pod
- [ ] Detect anomalies (unusual syscalls, frequency spikes; spike check sketched below)
- [ ] Alert on suspicious behavior
**Implementation**:
```rust
// orb8-agent/src/syscall_analyzer.rs
use std::collections::HashMap;
pub struct SyscallAnalyzer {
baselines: HashMap<String, SyscallBaseline>,
}
struct SyscallBaseline {
    syscall_histogram: HashMap<u32, u64>,
    alert_threshold: u64,
}
impl SyscallBaseline {
    fn new() -> Self {
        Self {
            syscall_histogram: HashMap::new(),
            alert_threshold: 10_000, // counts per window before a spike alert
        }
    }
}
impl SyscallAnalyzer {
pub fn analyze(&mut self, pod: &str, event: SyscallEvent) -> Option<Alert> {
let baseline = self.baselines.entry(pod.to_string())
.or_insert_with(SyscallBaseline::new);
baseline.syscall_histogram
.entry(event.syscall_id)
.and_modify(|count| *count += 1)
.or_insert(1);
// Detect anomalies
if is_dangerous_syscall(event.syscall_id) {
return Some(Alert::DangerousSyscall {
pod: pod.to_string(),
syscall: syscall_name(event.syscall_id),
});
}
None
}
}
fn is_dangerous_syscall(id: u32) -> bool {
    // x86_64 syscall numbers
    matches!(id,
        101 | // ptrace
        103 | // syslog
        165 | // mount
        304   // open_by_handle_at
    )
}
```
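The frequency-spike detection from the checklist could build on the histogram and threshold above; a sketch:
```rust
impl SyscallBaseline {
    // A syscall "spikes" when its count in the current window exceeds the
    // per-pod threshold; real logic would compare against a learned baseline
    fn is_spike(&self, syscall_id: u32) -> bool {
        self.syscall_histogram
            .get(&syscall_id)
            .map(|&count| count > self.alert_threshold)
            .unwrap_or(false)
    }
}
```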
**Success Criteria**:
- ✅ Baselines built for normal pods
- ✅ Alerts generated for anomalies
- ✅ Low false positive rate (<5%)
**Phase 6 Deliverables**
✅ Syscall tracing probe
✅ Sampling to reduce overhead
✅ Anomaly detection algorithm
✅ CLI command for syscall tracing
✅ Documentation: "Syscall Monitoring Guide"
---
## Phase 7: GPU Telemetry (Research & MVP)
**Goal**: Per-pod GPU utilization and memory tracking
**Dependencies**: Phase 2 (container ID)
**Estimated Effort**: 3-4 weeks (including research spike)
> **Design Reference**: See [GPU Telemetry Design](docs/ARCHITECTURE.md#gpu-telemetry-design) for industry context, per-pod attribution mechanisms, and approach comparison.
**Approach**: DCGM integration via dcgm-exporter sidecar. Per-pod attribution achieved by correlating GPU device IDs with pod allocations via the kubelet pod-resources API.
### Tasks
#### 7.1: Research Spike - Validate Per-Pod Attribution
- [ ] Deploy dcgm-exporter in test cluster with GPU Operator
- [ ] Query kubelet pod-resources API for GPU allocations
- [ ] Validate GPU UUID → pod mapping accuracy
- [ ] Test with MIG-partitioned GPUs if available
- [ ] Document any gaps vs Coroot/alternative approaches
**Expected Outcome**: Confirm dcgm-exporter + pod-resources API provides accurate per-pod GPU metrics
#### 7.2: DCGM Sidecar Deployment
**Files**: `deploy/daemonset.yaml` (update)
- [ ] Add DCGM exporter container to agent pod
- [ ] Expose DCGM metrics on localhost:9400
- [ ] Configure DCGM to scrape all GPUs
#### 7.3: GPU Metrics Collector
**Files**: `orb8-agent/src/gpu/dcgm_collector.rs`
- [ ] Scrape DCGM exporter via HTTP
- [ ] Parse Prometheus format
- [ ] Map GPU device ID → pod using device plugin API
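A rough sketch of that collector (assumes `reqwest` for HTTP and reuses `PodMetadata` from Phase 2; the endpoint is dcgm-exporter's default from 7.2, and `owners` stands in for the GPU-UUID → pod map built from the kubelet pod-resources API, not shown):
```rust
use std::collections::HashMap;
// Scrape dcgm-exporter and attribute each GPU sample to its owning pod
async fn collect_gpu_metrics(
    owners: &HashMap<String, PodMetadata>, // GPU UUID → pod
) -> anyhow::Result<()> {
    let body = reqwest::get("http://127.0.0.1:9400/metrics")
        .await?
        .text()
        .await?;
    for line in body.lines().filter(|l| l.starts_with("DCGM_FI_DEV_GPU_UTIL")) {
        // dcgm-exporter labels every sample with UUID="GPU-…"
        let uuid = line
            .split("UUID=\"")
            .nth(1)
            .and_then(|rest| rest.split('"').next());
        if let Some(pod) = uuid.and_then(|u| owners.get(u)) {
            // Re-export as orb8_gpu_utilization{namespace, pod, gpu_id} (7.4)
            info!("GPU sample attributed to {}/{}", pod.namespace, pod.pod_name);
        }
    }
    Ok(())
}
```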
#### 7.4: GPU Metrics in Prometheus
**Files**: `orb8-agent/src/prom_exporter.rs` (extend)
- [ ] Export `orb8_gpu_utilization` gauge
- [ ] Export `orb8_gpu_memory_used` gauge
- [ ] Labels: namespace, pod, gpu_id
#### 7.5: CLI GPU Commands
**Files**: `orb8-cli/src/commands/trace.rs`
- [ ] `orb8 trace gpu --namespace <ns>`
- [ ] Display GPU utilization per pod
- [ ] Display GPU memory usage
**Phase 7 Deliverables**
- Research spike validating per-pod GPU attribution
- DCGM sidecar deployment (dcgm-exporter)
- GPU UUID → pod mapping via pod-resources API
- Prometheus GPU metrics (utilization, memory, temperature)
- CLI GPU commands
- Grafana GPU dashboard
- Documentation: "GPU Telemetry Guide"
- **Public Release**: v0.4.0 - GPU Telemetry
**User Validation Checkpoint**: Get feedback from ML/AI teams
---
## Phase 8: Advanced Features
**Goal**: Production hardening and advanced capabilities
**Dependencies**: Phases 3-7
**Estimated Effort**: Ongoing
### Tasks
#### 8.1: Standalone Mode
**Files**: `orb8-cli/src/standalone.rs`
- [ ] Implement standalone mode (no DaemonSet required)
- [ ] CLI uses `kubectl exec` to access node
- [ ] Temporarily load probes, collect data, cleanup
**Implementation**:
```rust
// orb8-cli/src/standalone.rs
pub struct StandaloneTracer {
kube_client: Client,
}
impl StandaloneTracer {
pub async fn trace_network(
&self,
namespace: &str,
pod: &str,
duration: Duration,
) -> Result<Vec<NetworkFlow>> {
// 1. Find node running pod
let node = self.find_node_for_pod(namespace, pod).await?;
// 2. Create privileged debug pod on that node
let debug_pod = self.create_debug_pod(&node).await?;
// 3. Copy probe binary to debug pod
self.upload_probe(&debug_pod).await?;
// 4. Run agent in standalone mode
let output = self.exec_in_pod(
&debug_pod,
format!("orb8-agent --standalone --duration={}", duration.as_secs())
).await?;
// 5. Parse output
let flows = parse_flows(&output)?;
// 6. Cleanup
self.delete_debug_pod(&debug_pod).await?;
Ok(flows)
}
}
```
**Success Criteria**:
- ✅ Works without DaemonSet installation
- ✅ Cleans up temporary resources
- ✅ Useful for ad-hoc debugging
#### 8.2: TUI Dashboard
**Files**: `orb8-cli/src/commands/dashboard.rs`
- [ ] Real-time TUI using ratatui
- [ ] Display top flows, pods, protocols
- [ ] Interactive filtering and sorting
**Implementation**:
```rust
// orb8-cli/src/commands/dashboard.rs
use ratatui::{
backend::CrosstermBackend,
widgets::{Block, Borders, List, ListItem},
Terminal,
};
pub async fn run_dashboard() -> Result<()> {
let backend = CrosstermBackend::new(std::io::stdout());
let mut terminal = Terminal::new(backend)?;
loop {
// Fetch latest flows
let flows = fetch_flows().await?;
// Render
terminal.draw(|f| {
let size = f.size();
let items: Vec<ListItem> = flows
.iter()
.map(|flow| {
ListItem::new(format!(
"{}/{} → {} {} bytes",
flow.namespace, flow.pod_name, flow.dst_ip, flow.bytes
))
})
.collect();
let list = List::new(items)
.block(Block::default().borders(Borders::ALL).title("Network Flows"));
f.render_widget(list, size);
})?;
// Refresh every 2 seconds
tokio::time::sleep(Duration::from_secs(2)).await;
}
}
```
**Success Criteria**:
- ✅ Interactive TUI dashboard
- ✅ Real-time updates
- ✅ Keyboard navigation
#### 8.3: Historical Storage
**Files**: `orb8-server/src/storage.rs`
- [ ] Optional TimescaleDB backend
- [ ] Store flow history (configurable retention)
- [ ] Query historical data via CLI
**Success Criteria**:
- ✅ Long-term metric storage
- ✅ Efficient time-range queries
- ✅ Configurable retention policy
#### 8.4: Multi-Cluster Support
**Files**: `orb8-server/src/multi_cluster.rs`
- [ ] Federate multiple clusters
- [ ] Cross-cluster flow correlation
- [ ] Single pane of glass dashboard
**Success Criteria**:
- ✅ Monitor multiple clusters from one CLI
- ✅ Aggregate metrics across clusters
**Phase 8 Deliverables**
✅ Standalone mode for ad-hoc tracing
✅ Interactive TUI dashboard
✅ Historical data storage (optional)
✅ Multi-cluster federation (optional)
✅ **Public Release**: v1.0.0 - Production Ready
---
## Future Enhancements
**Not scheduled, but under consideration**
### DNS Tracing
- Parse DNS queries/responses in network probe
- Track DNS failures per pod
- Detect DNS exfiltration
### IPv6 Support
- Extend network probe to parse IPv6 headers
- Update flow aggregation logic
### eBPF GPU Probes (Research)
- Revisit eBPF hooks into NVIDIA driver
- Kernel-level GPU event tracing
- Requires driver stability analysis
### WebAssembly Plugin System
- Load custom probes as WASM plugins
- Community-contributed probe marketplace
### AI-Powered Insights
- ML model for anomaly detection
- Predictive alerts
- Auto-remediation suggestions
---
## Summary
This roadmap provides a **phase-based, dependency-driven** implementation plan for orb8. Each phase delivers tangible value and can be validated with real users before proceeding.
**Key Principles**:
- ✅ No artificial deadlines
- ✅ Focus on quality over speed
- ✅ User validation at each major milestone
- ✅ Technical debt explicitly managed
- ✅ Research spikes for high-uncertainty areas
**Current Status**: Phase 2 complete. Phase 3 in progress.
**Completed**:
- Phase 0: Foundation & Monorepo
- Phase 1: eBPF Infrastructure (probe loading, ring buffer, event polling)
- Phase 2: Container Identification (K8s watcher, pod cache, gRPC API)
**Next Step**: Phase 3 (Network Tracing MVP public release)
---
**Document Version**: 1.3
**Last Updated**: 2025-12-04
**Authors**: orb8 maintainers