procref 0.1.0

Cross-platform process reference counting for shared service lifecycle management.
# SurrealDB Lifecycle Management V2 Architecture

## Executive Summary

The current design's reliance on POSIX semaphores has fundamental flaws that cannot be reliably fixed. This document proposes a **complete architectural redesign**: a file-based process registry with kernel-guaranteed cleanup.

## Current Problems

| Problem | Impact | Root Cause |
|---------|--------|------------|
| Semaphore stuck after SIGKILL | System-wide hang | POSIX semaphore doesn't auto-release |
| Complex recovery logic | Unreliable cleanup | Trying to fix unfixable design |
| ref_count drift | Memory leak / orphan server | Multiple failure modes compound |
| Platform inconsistency | Windows behaves differently | Named semaphores aren't portable |

## Design Philosophy V2

### Principle 1: Kernel as Source of Truth

> "Don't track state that the kernel already tracks perfectly."

The kernel knows exactly which processes are alive and which file locks they hold. We should **derive** our coordination state from kernel state, not maintain a parallel shadow state.

### Principle 2: Crash-Safe by Construction

> "If it can fail silently, it will."

Every primitive we use must be automatically cleaned up by the kernel on crash. No atexit handlers, no signal handlers - just kernel guarantees.

### Principle 3: Simple State Model

> "One source of truth, not three."

Current design has: ref_count in lock file + semaphore count + actual processes. V2 has: **just file locks** (kernel-managed).

## V2 Architecture

### Component 1: Process Registry (File-Lock Based)

```
~/.gsc/surrealdb/
├── server.json          # Server metadata (port, pid, version)
├── server.lock          # Exclusive lock: only startup coordinator holds this
└── clients/
    ├── proc_12345.lock  # Client process 12345's registration
    ├── proc_12346.lock  # Client process 12346's registration
    └── proc_12347.lock  # Client process 12347's registration
```

**How it works:**

1. Each client process creates `proc_{pid}.lock` and holds flock on it
2. Process crash → kernel releases flock → file unlocked (but exists)
3. New client scans directory, tries flock on each file
   - If flock succeeds → stale file from dead process → delete it
   - If flock fails → live process → count it
4. Server shuts down when the `clients/` directory contains no locked files

### Component 2: Server Coordinator

```rust
use std::fs::{self, File};
use std::path::{Path, PathBuf};

use anyhow::{anyhow, Result};
use fs2::FileExt; // provides lock_exclusive / try_lock_exclusive on File
use surrealdb::{engine::remote::ws::Client, Surreal};

pub struct ServerCoordinatorV2 {
    config: Config,
}

impl ServerCoordinatorV2 {
    pub async fn connect(&self) -> Result<(Surreal<Client>, ClientRegistration)> {
        // Step 1: Register ourselves (creates proc_PID.lock with flock)
        let registration = self.register_client()?;

        // Step 2: Ensure server is running (uses server.lock for coordination)
        let port = self.ensure_server().await?;

        // Step 3: Connect (no ref_count, no semaphore)
        let client = self.connect_to_port(port).await?;

        // The caller must keep `registration` alive for the lifetime of the
        // connection: dropping the guard releases the flock and deregisters
        // this process. (A local binding would be dropped as soon as this
        // function returned, releasing the lock immediately.)
        Ok((client, registration))
    }

    fn register_client(&self) -> Result<ClientRegistration> {
        let clients_dir = self.config.base_dir.join("clients");
        fs::create_dir_all(&clients_dir)?;

        // Check capacity. count_live_clients also prunes stale
        // registrations left behind by crashed processes, so no
        // separate cleanup pass is needed.
        let live_count = self.count_live_clients(&clients_dir)?;
        if live_count >= self.config.max_clients {
            return Err(anyhow!("Capacity reached: {} of {} clients",
                live_count, self.config.max_clients));
        }

        // Create our registration
        let my_file = clients_dir.join(format!("proc_{}.lock", std::process::id()));
        let file = File::create(&my_file)?;
        file.lock_exclusive()?;  // fs2: the kernel holds this flock until the fd closes, even on SIGKILL

        Ok(ClientRegistration { _file: file, path: my_file })
    }

    fn count_live_clients(&self, clients_dir: &Path) -> Result<usize> {
        let mut count = 0;
        for entry in fs::read_dir(clients_dir)? {
            let path = entry?.path();
            if path.extension().map_or(false, |e| e == "lock") {
                // Try to lock - if we can, it's stale
                if let Ok(file) = File::open(&path) {
                    if file.try_lock_exclusive().is_ok() {
                        // Stale - we got the lock, meaning no one else has it
                        drop(file);
                        fs::remove_file(&path)?;
                    } else {
                        // Live - someone else holds the lock
                        count += 1;
                    }
                }
            }
        }
        Ok(count)
    }
}
```

### Component 3: Server Lifecycle

```rust
use std::fs::{self, File, OpenOptions};
use std::path::PathBuf;

use anyhow::Result;
use fs2::FileExt;

impl ServerCoordinatorV2 {
    async fn ensure_server(&self) -> Result<u16> {
        let server_lock = self.config.base_dir.join("server.lock");

        // Acquire exclusive lock for server startup coordination
        let lock_file = OpenOptions::new()
            .write(true).create(true).open(&server_lock)?;
        lock_file.lock_exclusive()?;  // Blocks until we have exclusive access

        // Check if server is already running
        if let Some(info) = self.read_server_info()? {
            if self.is_server_alive(&info).await {
                return Ok(info.port);
            }
            // Server is dead, clean up
            self.cleanup_dead_server()?;
        }

        // Start new server
        let port = self.start_server().await?;
        self.write_server_info(port)?;

        // Release startup lock (other clients can now connect)
        drop(lock_file);

        Ok(port)
    }

    fn should_shutdown_server(&self) -> Result<bool> {
        let clients_dir = self.config.base_dir.join("clients");
        Ok(self.count_live_clients(&clients_dir)? == 0)
    }
}

/// RAII guard for client registration. Hold it for as long as this process
/// should count as a live client; dropping it deregisters the process.
pub struct ClientRegistration {
    _file: File,  // Holds the flock
    path: PathBuf,
}

impl Drop for ClientRegistration {
    fn drop(&mut self) {
        // Best-effort tidy-up. The flock itself is released by the kernel
        // when `_file` closes, so if this removal never runs (SIGKILL,
        // power loss), the next client's stale scan prunes the file.
        let _ = fs::remove_file(&self.path);
    }
}
```

## Comparison: V1 vs V2

| Aspect | V1 (Current) | V2 (Proposed) |
|--------|--------------|---------------|
| **Crash recovery** | Complex (check ref_count, semaphore) | Automatic (kernel flock) |
| **State tracking** | 3 sources (ref_count, semaphore, reality) | 1 source (flock files) |
| **Cleanup code** | 50+ lines of emergency_cleanup | 0 lines (kernel does it) |
| **Debugging** | Hard (semaphore is opaque) | Easy (`ls`, `lsof`) |
| **Platform support** | POSIX semaphore quirks | Universal flock |
| **Failure modes** | Many (any can get out of sync) | Few (kernel is reliable) |

## Migration Strategy

### Phase 1: Parallel Implementation
- Add V2 code alongside V1
- Feature flag to choose implementation
- V1 remains default

### Phase 2: Testing
- Test V2 with crash scenarios (SIGKILL, OOM, etc.)
- Verify cross-platform behavior
- Performance benchmarking

### Phase 3: Cutover
- Make V2 default
- Deprecate V1
- Migration script for existing installations

### Phase 4: Cleanup
- Remove V1 code
- Clean up semaphore files from existing installations

## Edge Cases Handled

### Case 1: Process killed during registration
- File created but lock not acquired → next client's cleanup removes it
- File created and locked → kernel releases on crash → next client's cleanup removes it

### Case 2: Server killed externally
- `server.json` exists but server dead
- Next client detects via `is_server_alive()` → starts new server

### Case 3: Disk full
- Can't create lock file → graceful error
- Won't corrupt existing state

### Case 4: Power failure
- All flock released by kernel
- `clients/` may have stale files → cleaned on next startup

### Case 5: NFS/Network filesystem
- Don't use it for `~/.gsc/` (should be local)
- Detect and error if not local filesystem
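
Case 5's detection step could look like the following best-effort mount-table check. A Linux-only sketch using std alone: `is_network_fs` and its fstype list are illustrative assumptions, and a production check might instead call `statfs(2)` through the `nix` crate:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Linux-only sketch: report whether `path` sits on a known network
/// filesystem by finding its longest matching mount point in /proc/mounts.
/// flock on NFS/CIFS does not give the local-kernel guarantees V2 relies on.
fn is_network_fs(path: &Path) -> io::Result<bool> {
    const NETWORK_FS: &[&str] = &["nfs", "nfs4", "cifs", "smb3", "fuse.sshfs", "9p"];
    let canonical = fs::canonicalize(path)?;
    let mounts = fs::read_to_string("/proc/mounts")?;

    // Each line: "<device> <mount_point> <fstype> <options> ...".
    let mut best: Option<(&str, &str)> = None; // (mount_point, fstype)
    for line in mounts.lines() {
        let mut f = line.split_whitespace();
        let (Some(_dev), Some(mp), Some(fstype)) = (f.next(), f.next(), f.next()) else {
            continue;
        };
        // Keep the longest mount point that is a prefix of the path.
        if canonical.starts_with(mp) && best.map_or(true, |(b, _)| mp.len() > b.len()) {
            best = Some((mp, fstype));
        }
    }
    Ok(best.is_some_and(|(_, t)| NETWORK_FS.contains(&t)))
}
```

The coordinator would call this once on `~/.gsc/` at startup and refuse to proceed on a network mount rather than risk silently broken locking.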

## Implementation Checklist

- [ ] Implement `SlotManager` using flock
- [ ] Implement `ClientRegistration` RAII guard
- [ ] Implement server lifecycle without ref_count
- [ ] Add `--use-v2` flag for testing
- [ ] Write integration tests with crash simulation
- [ ] Add migration script
- [ ] Update documentation
- [ ] Performance benchmark
- [ ] Cross-platform testing (Linux, macOS, Windows)