# Robust Multi-Process SurrealDB Coordination Design
## Problem Statement
Multiple MCP services (gcode, gspec, etc.) need to share a single SurrealDB server instance on localhost. The current design uses POSIX named semaphores, but they have a critical flaw: **processes killed with SIGKILL don't release semaphores**, leaving the system in a stale state.
## Research Findings
### Comparison of Synchronization Primitives
| Primitive | Released on crash/SIGKILL? | Cross-platform? | Counting? |
|---|---|---|---|
| **flock/fcntl (file locks)** | ✅ Yes (kernel-managed) | ✅ Yes (Unix, Windows) | ❌ Binary only |
| **POSIX semaphores** | ❌ No | ✅ Yes | ✅ Yes |
| **SysV semaphores + SEM_UNDO** | ✅ Yes | ❌ Unix only | ✅ Yes |
| **Robust pthread mutex** | ✅ Yes | ⚠️ Partial | ❌ Binary only |
References:
- [flock(2) man page](https://man7.org/linux/man-pages/man2/flock.2.html): "Locks are automatically removed when the process exits or terminates"
- [POSIX semaphore deadlock](https://www.experts-exchange.com/questions/27821618/posix-semaphore-deadlock.html): "If the process is crashed... the semaphore will be in locked state"
- [File locking in Linux](https://gavv.net/articles/file-locks/): Comprehensive comparison
## Proposed Design: File-Lock-Based Slot System
### Core Idea
Replace POSIX semaphore with **N individual lock files**, one per slot. Each process acquires an exclusive lock on one slot file to "claim" that slot.
```
~/.gsc/slots/
├── slot_00.lock # Process A holds flock
├── slot_01.lock # Process B holds flock
├── slot_02.lock # Available (no flock)
├── ...
└── slot_99.lock # Available
```
### Why This Works
1. **Automatic cleanup**: flock is kernel-managed; crash → automatic release
2. **Cross-platform**: Works on Unix (flock/fcntl) and Windows (LockFileEx)
3. **No stale state**: Kernel guarantees lock status reflects reality
4. **Simple debugging**: `lsof` shows which processes hold which locks
### Implementation
```rust
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::PathBuf;

use anyhow::{anyhow, Result};
use fs2::FileExt; // provides try_lock_exclusive() / unlock() on File

pub struct SlotManager {
    slot_dir: PathBuf,
    max_slots: usize,
    held_slot: Option<(usize, File)>, // (slot_id, locked file handle)
}

impl SlotManager {
    /// Try to acquire any available slot.
    pub fn acquire_slot(&mut self) -> Result<usize> {
        for slot_id in 0..self.max_slots {
            let path = self.slot_dir.join(format!("slot_{:02}.lock", slot_id));
            let file = OpenOptions::new()
                .write(true)
                .create(true)
                .open(&path)?;
            // Try a non-blocking exclusive lock.
            if file.try_lock_exclusive().is_ok() {
                // Record our PID for debugging (truncate any stale PID first,
                // since the file may survive a previous holder).
                file.set_len(0)?;
                writeln!(&file, "{}", std::process::id())?;
                self.held_slot = Some((slot_id, file));
                return Ok(slot_id);
            }
        }
        Err(anyhow!("All {} slots are in use", self.max_slots))
    }

    /// Release the slot (also happens automatically on drop/crash).
    pub fn release_slot(&mut self) {
        if let Some((_, file)) = self.held_slot.take() {
            let _ = file.unlock();
            // File handle dropped here → lock released even if unlock() failed.
        }
    }
}

impl Drop for SlotManager {
    fn drop(&mut self) {
        self.release_slot();
    }
}
```
### Server Lifecycle Integration
```rust
pub async fn connect(&self) -> Result<Surreal<Client>> {
    // Step 1: Acquire a slot (replaces the semaphore)
    let slot = self.slot_manager.acquire_slot()
        .context("All connection slots are in use")?;
    // Step 2: Ensure the server is running (unchanged)
    let (port, pid) = self.ensure_server_running().await?;
    // Step 3: Increment ref_count in the lock file (unchanged)
    self.increment_ref_count()?;
    // Step 4: Connect
    self.connect_to_port(port).await
}
```
### Comparison with Current Design
| Aspect | Current (POSIX semaphore) | Proposed (file locks) |
|---|---|---|
| Crash recovery | ❌ Manual cleanup needed | ✅ Automatic |
| Detection of stale state | Complex (check ref_count, PID) | Not needed |
| Cross-platform | ✅ | ✅ |
| Performance | Fast (kernel semaphore) | Slightly slower (file I/O) |
| Debugging | Hard (opaque semaphore) | Easy (ls, lsof) |
| Max concurrent | Config-based | File-count based |
## Migration Path
1. **Phase 1**: Add slot-based locking alongside existing semaphore
2. **Phase 2**: Deprecate semaphore, use slots by default
3. **Phase 3**: Remove semaphore code
## Edge Cases
### Q: What if lock files are on NFS?
A: Don't do that. Use local filesystem only (`~/.gsc/` is always local).
### Q: What if user deletes slot files while running?
A: The holder keeps its flock on the now-unlinked inode, but the next process that opens the path creates a **new** file and can lock it — so two processes may believe they own the same slot.
Mitigation: before trusting a held lock, verify the handle still refers to the file on disk (e.g. compare inode numbers) and re-acquire if not.
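The mitigation can be implemented as an inode identity check. A Unix-only sketch (`lock_file_still_valid` is a hypothetical helper, not part of the current codebase): if the slot file was deleted or replaced, the device/inode pair of the held handle no longer matches the path on disk.

```rust
use std::fs::File;
use std::os::unix::fs::MetadataExt; // ino()/dev() accessors (Unix-only)
use std::path::Path;

/// Returns true if `held` still refers to the file currently at `path`.
fn lock_file_still_valid(held: &File, path: &Path) -> std::io::Result<bool> {
    let held_meta = held.metadata()?; // metadata of the inode we actually locked
    match std::fs::metadata(path) {
        // Same device + inode → nobody deleted/replaced the slot file.
        Ok(on_disk) => Ok(on_disk.dev() == held_meta.dev() && on_disk.ino() == held_meta.ino()),
        // File deleted → our lock no longer guards the path.
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("slot_inode_demo.lock");
    let held = File::create(&path)?;
    assert!(lock_file_still_valid(&held, &path)?); // same inode → valid
    std::fs::remove_file(&path)?;                  // user deletes the slot file
    assert!(!lock_file_still_valid(&held, &path)?); // mismatch detected
    Ok(())
}
```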
### Q: Performance impact of file I/O vs semaphore?
A: Negligible for server coordination (once per connect, not per query).
## Alternative: Hybrid Approach
Keep semaphore for speed, but add "watchdog" mechanism:
1. Each process writes its PID to a registry file on connect
2. Background thread periodically scans registry
3. If PID is dead but semaphore seems full → reset semaphore
This is more complex but preserves semaphore performance.
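The watchdog's liveness probe (step 3) can be as simple as a `/proc` lookup. A Linux-only sketch; on other Unixes, `kill(pid, 0)` via the `libc` crate is the usual substitute (assumption — neither helper exists in the current codebase):

```rust
use std::path::Path;

/// Linux-only: a PID is considered alive iff /proc/<pid> exists.
/// Note: PIDs can be recycled, so a live PID is necessary but not
/// sufficient proof that the original registrant still runs.
fn pid_alive(pid: u32) -> bool {
    Path::new("/proc").join(pid.to_string()).exists()
}

fn main() {
    // Our own PID must be reported as alive.
    assert!(pid_alive(std::process::id()));
}
```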
## Recommendation
**Use file-lock-based slot system** because:
1. Simplest to implement correctly
2. Kernel guarantees correct behavior
3. Easy to debug and operate
4. Cross-platform without platform-specific code