# Robust Multi-Process SurrealDB Coordination Design
## Problem Statement
Multiple MCP services (gcode, gspec, etc.) need to share a single SurrealDB server instance on localhost. The current design uses POSIX named semaphores, but they have a critical flaw: **processes killed with SIGKILL don't release semaphores**, leaving the system in a stale state.
## Research Findings
### Comparison of Synchronization Primitives
| Primitive | Released on crash/SIGKILL? | Cross-platform? | Counting? |
|---|---|---|---|
| **flock/fcntl (file locks)** | ✅ Yes (kernel-managed) | ✅ Yes (Unix, Windows) | ❌ Binary only |
| **POSIX semaphores** | ❌ No | ✅ Yes | ✅ Yes |
| **SysV semaphores + SEM_UNDO** | ✅ Yes | ❌ Unix only | ✅ Yes |
| **Robust pthread mutex** | ✅ Yes | ⚠️ Partial | ❌ Binary only |
References:
- [flock(2) man page](https://man7.org/linux/man-pages/man2/flock.2.html): "Locks are automatically removed when the process exits or terminates"
- [POSIX semaphore deadlock](https://www.experts-exchange.com/questions/27821618/posix-semaphore-deadlock.html): "If the process is crashed... the semaphore will be in locked state"
- [File locking in Linux](https://gavv.net/articles/file-locks/): Comprehensive comparison
## Proposed Design: File-Lock-Based Slot System
### Core Idea
Replace POSIX semaphore with **N individual lock files**, one per slot. Each process acquires an exclusive lock on one slot file to "claim" that slot.
```
~/.gsc/slots/
├── slot_00.lock # Process A holds flock
├── slot_01.lock # Process B holds flock
├── slot_02.lock # Available (no flock)
├── ...
└── slot_99.lock # Available
```
### Why This Works
1. **Automatic cleanup**: flock is kernel-managed; crash → automatic release
2. **Cross-platform**: Works on Unix (flock/fcntl) and Windows (LockFileEx)
3. **No stale state**: Kernel guarantees lock status reflects reality
4. **Simple debugging**: `lsof` shows which processes hold which locks
### Implementation
```rust
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::PathBuf;

use anyhow::{anyhow, Result};
use fs2::FileExt; // provides try_lock_exclusive() / unlock() on File

pub struct SlotManager {
    slot_dir: PathBuf,
    max_slots: usize,
    held_slot: Option<(usize, File)>, // (slot_id, locked file handle)
}

impl SlotManager {
    /// Try to acquire any available slot.
    pub fn acquire_slot(&mut self) -> Result<usize> {
        for slot_id in 0..self.max_slots {
            let path = self.slot_dir.join(format!("slot_{:02}.lock", slot_id));
            let file = OpenOptions::new()
                .write(true)
                .create(true)
                .open(&path)?;
            // Try a non-blocking exclusive lock.
            if file.try_lock_exclusive().is_ok() {
                // Record our PID for debugging (truncate any stale PID first,
                // since the file may survive a previous holder).
                file.set_len(0)?;
                writeln!(&file, "{}", std::process::id())?;
                self.held_slot = Some((slot_id, file));
                return Ok(slot_id);
            }
        }
        Err(anyhow!("All {} slots are in use", self.max_slots))
    }

    /// Release the slot (also happens automatically on drop/crash).
    pub fn release_slot(&mut self) {
        if let Some((_, file)) = self.held_slot.take() {
            let _ = file.unlock();
            // File handle dropped here → lock released even if unlock() failed.
        }
    }
}

impl Drop for SlotManager {
    fn drop(&mut self) {
        self.release_slot();
    }
}
```
### Server Lifecycle Integration
```rust
pub async fn connect(&self) -> Result<Surreal<Client>> {
    // Step 1: Acquire a slot (replaces the semaphore)
    let slot = self.slot_manager.acquire_slot()
        .context("All connection slots are in use")?;
    // Step 2: Ensure the server is running (unchanged)
    let (port, pid) = self.ensure_server_running().await?;
    // Step 3: Increment ref_count in the lock file (unchanged)
    self.increment_ref_count()?;
    // Step 4: Connect
    self.connect_to_port(port).await
}
```
### Comparison with Current Design
| Aspect | Current (POSIX semaphore) | Proposed (file locks) |
|---|---|---|
| Crash recovery | ❌ Manual cleanup needed | ✅ Automatic |
| Detection of stale state | Complex (check ref_count, PID) | Not needed |
| Cross-platform | ✅ | ✅ |
| Performance | Fast (kernel semaphore) | Slightly slower (file I/O) |
| Debugging | Hard (opaque semaphore) | Easy (ls, lsof) |
| Max concurrent | Config-based | File-count based |
## Migration Path
1. **Phase 1**: Add slot-based locking alongside existing semaphore
2. **Phase 2**: Deprecate semaphore, use slots by default
3. **Phase 3**: Remove semaphore code
## Edge Cases
### Q: What if lock files are on NFS?
A: Don't do that. Use local filesystem only (`~/.gsc/` is always local).
### Q: What if user deletes slot files while running?
A: The holder keeps its flock on the now-unlinked inode, but the next process that opens the path creates a **new** file and can lock it — so two processes may believe they own the same slot.
Mitigation: before trusting a held lock, verify the handle still refers to the file on disk (e.g. compare inode numbers) and re-acquire if not.
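The mitigation can be implemented as an inode identity check. A Unix-only sketch (`lock_file_still_valid` is a hypothetical helper, not part of the current codebase): if the slot file was deleted or replaced, the device/inode pair of the held handle no longer matches the path on disk.

```rust
use std::fs::File;
use std::os::unix::fs::MetadataExt; // ino()/dev() accessors (Unix-only)
use std::path::Path;

/// Returns true if `held` still refers to the file currently at `path`.
fn lock_file_still_valid(held: &File, path: &Path) -> std::io::Result<bool> {
    let held_meta = held.metadata()?; // metadata of the inode we actually locked
    match std::fs::metadata(path) {
        // Same device + inode → nobody deleted/replaced the slot file.
        Ok(on_disk) => Ok(on_disk.dev() == held_meta.dev() && on_disk.ino() == held_meta.ino()),
        // File deleted → our lock no longer guards the path.
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("slot_inode_demo.lock");
    let held = File::create(&path)?;
    assert!(lock_file_still_valid(&held, &path)?); // same inode → valid
    std::fs::remove_file(&path)?;                  // user deletes the slot file
    assert!(!lock_file_still_valid(&held, &path)?); // mismatch detected
    Ok(())
}
```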
### Q: Performance impact of file I/O vs semaphore?
A: Negligible for server coordination (once per connect, not per query).
## Alternative: Hybrid Approach
Keep semaphore for speed, but add "watchdog" mechanism:
1. Each process writes its PID to a registry file on connect
2. Background thread periodically scans registry
3. If PID is dead but semaphore seems full → reset semaphore
This is more complex but preserves semaphore performance.
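The watchdog's liveness probe (step 3) can be as simple as a `/proc` lookup. A Linux-only sketch; on other Unixes, `kill(pid, 0)` via the `libc` crate is the usual substitute (assumption — neither helper exists in the current codebase):

```rust
use std::path::Path;

/// Linux-only: a PID is considered alive iff /proc/<pid> exists.
/// Note: PIDs can be recycled, so a live PID is necessary but not
/// sufficient proof that the original registrant still runs.
fn pid_alive(pid: u32) -> bool {
    Path::new("/proc").join(pid.to_string()).exists()
}

fn main() {
    // Our own PID must be reported as alive.
    assert!(pid_alive(std::process::id()));
}
```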
## Recommendation
**Use file-lock-based slot system** because:
1. Simplest to implement correctly
2. Kernel guarantees correct behavior
3. Easy to debug and operate
4. Cross-platform without platform-specific code