# SurrealDB Lifecycle Management V2 Architecture
## Executive Summary
The current design has fundamental flaws rooted in POSIX semaphores that cannot be reliably fixed. This document proposes a **complete architectural redesign**: a file-based process registry with kernel-guaranteed cleanup.
## Current Problems
| Problem | Impact | Root Cause |
|---|---|---|
| Semaphore stuck after SIGKILL | System-wide hang | POSIX semaphores don't auto-release |
| Complex recovery logic | Unreliable cleanup | Trying to fix an unfixable design |
| ref_count drift | Memory leak / orphan server | Multiple failure modes compound |
| Platform inconsistency | Windows behaves differently | Named semaphores aren't portable |
## Design Philosophy V2
### Principle 1: Kernel as Source of Truth
> "Don't track state that the kernel already tracks perfectly."
The kernel knows exactly which processes are alive and which file locks they hold. We should **derive** our coordination state from kernel state, not maintain a parallel shadow state.
### Principle 2: Crash-Safe by Construction
> "If it can fail silently, it will."
Every primitive we use must be automatically cleaned up by the kernel on crash. No `atexit` handlers, no signal handlers: just kernel guarantees.
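This property is easy to check for `flock`. The Unix-only sketch below (assuming the `fs2` and `nix` crates) forks a child that takes an exclusive lock, SIGKILLs it with no chance to clean up, and then acquires the lock in the parent because the kernel released it when the child died:
```rust
use std::fs::File;
use std::thread;
use std::time::Duration;

use fs2::FileExt;
use nix::sys::signal::{kill, Signal};
use nix::sys::wait::waitpid;
use nix::unistd::{fork, ForkResult};

fn main() -> anyhow::Result<()> {
    let path = "/tmp/flock_demo.lock";
    match unsafe { fork() }? {
        ForkResult::Child => {
            // Child: take the lock, then hang with no cleanup handler of any kind.
            let f = File::create(path)?;
            f.lock_exclusive()?;
            loop {
                thread::sleep(Duration::from_secs(1));
            }
        }
        ForkResult::Parent { child } => {
            thread::sleep(Duration::from_millis(200)); // let the child lock first
            let f = File::create(path)?;
            assert!(f.try_lock_exclusive().is_err()); // child holds the lock
            kill(child, Signal::SIGKILL)?;            // no atexit, no signal handler
            waitpid(child, None)?;
            assert!(f.try_lock_exclusive().is_ok());  // kernel released it
            println!("flock released on SIGKILL, as promised");
            Ok(())
        }
    }
}
```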
### Principle 3: Simple State Model
> "One source of truth, not three."
The current design tracks three: a ref_count in the lock file, a semaphore count, and the actual processes. V2 tracks one: **file locks**, managed by the kernel.
## V2 Architecture
### Component 1: Process Registry (File-Lock Based)
```
~/.gsc/surrealdb/
├── server.json          # Server metadata (port, pid, version)
├── server.lock          # Exclusive lock: only the startup coordinator holds this
└── clients/
    ├── proc_12345.lock  # Client process 12345's registration
    ├── proc_12346.lock  # Client process 12346's registration
    └── proc_12347.lock  # Client process 12347's registration
```
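For concreteness, `server.json` could deserialize into a struct like the following (a sketch assuming `serde`/`serde_json`; the field names are illustrative, not a fixed schema):
```rust
use serde::{Deserialize, Serialize};

/// Metadata written by the startup coordinator once the server is up.
#[derive(Debug, Serialize, Deserialize)]
pub struct ServerInfo {
    pub port: u16,
    pub pid: u32,
    pub version: String,
}
```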
**How it works:**
1. Each client process creates `proc_{pid}.lock` and holds an exclusive `flock` on it
2. Process crash → kernel releases the flock → file is unlocked (but still exists)
3. A new client scans the directory and tries to flock each file:
   - If the flock succeeds → stale file from a dead process → delete it
   - If the flock fails → live process → count it
4. The server shuts down when the `clients/` directory contains no locked files
### Component 2: Server Coordinator
```rust
use std::fs::{self, File};
use std::path::{Path, PathBuf};

use anyhow::{anyhow, Result};
use fs2::FileExt; // provides lock_exclusive / try_lock_exclusive on std File
use surrealdb::engine::remote::ws::Client;
use surrealdb::Surreal;

pub struct ServerCoordinatorV2 {
    config: Config,
}

impl ServerCoordinatorV2 {
    pub async fn connect(&self) -> Result<(Surreal<Client>, ClientRegistration)> {
        // Step 1: Register ourselves (creates proc_{pid}.lock and takes the flock).
        let registration = self.register_client()?;
        // Step 2: Ensure the server is running (server.lock coordinates startup).
        let port = self.ensure_server().await?;
        // Step 3: Connect (no ref_count, no semaphore).
        let client = self.connect_to_port(port).await?;
        // The registration is returned to the caller: dropping it releases the
        // flock and deregisters this process, so hold it for the process lifetime.
        Ok((client, registration))
    }

    fn register_client(&self) -> Result<ClientRegistration> {
        let clients_dir = self.config.base_dir.join("clients");
        fs::create_dir_all(&clients_dir)?;

        // Clean stale registrations first (optional optimization).
        self.cleanup_stale_clients(&clients_dir)?;

        // Check capacity.
        let live_count = self.count_live_clients(&clients_dir)?;
        if live_count >= self.config.max_clients {
            return Err(anyhow!(
                "Capacity reached: {} of {} clients",
                live_count,
                self.config.max_clients
            ));
        }

        // Create our registration.
        let my_file = clients_dir.join(format!("proc_{}.lock", std::process::id()));
        let file = File::create(&my_file)?;
        file.lock_exclusive()?; // The kernel holds this until our process exits

        Ok(ClientRegistration { _file: file, path: my_file })
    }

    fn count_live_clients(&self, clients_dir: &Path) -> Result<usize> {
        let mut count = 0;
        for entry in fs::read_dir(clients_dir)? {
            let path = entry?.path();
            if path.extension().map_or(false, |e| e == "lock") {
                // Try to lock: if we succeed, the owner is dead.
                if let Ok(file) = File::open(&path) {
                    if file.try_lock_exclusive().is_ok() {
                        // Stale: we got the lock, so no one else held it.
                        drop(file);
                        fs::remove_file(&path)?;
                    } else {
                        // Live: another process holds the lock.
                        count += 1;
                    }
                }
            }
        }
        Ok(count)
    }
}
```
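Since `connect` hands the registration back, usage looks like this (a hypothetical sketch; `Config::load` stands in for however configuration is actually built):
```rust
// Hypothetical usage: hold `registration` for as long as this process should
// count as a live client.
let coordinator = ServerCoordinatorV2 { config: Config::load()? };
let (db, registration) = coordinator.connect().await?;
// ... use `db` normally ...
// When `registration` drops (or the process dies), the kernel releases the
// flock and the next client's scan reclaims the slot.
drop(registration);
```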
### Component 3: Server Lifecycle
```rust
use std::fs::OpenOptions;

impl ServerCoordinatorV2 {
    async fn ensure_server(&self) -> Result<u16> {
        let server_lock = self.config.base_dir.join("server.lock");

        // Acquire the exclusive lock that serializes server startup.
        let lock_file = OpenOptions::new()
            .write(true)
            .create(true)
            .open(&server_lock)?;
        lock_file.lock_exclusive()?; // Blocks until we have exclusive access

        // Check whether a server is already running.
        if let Some(info) = self.read_server_info()? {
            if self.is_server_alive(&info).await {
                return Ok(info.port); // lock_file drops here, releasing the lock
            }
            // The server is dead; clean up its leftovers.
            self.cleanup_dead_server()?;
        }

        // Start a new server.
        let port = self.start_server().await?;
        self.write_server_info(port)?;

        // Release the startup lock (other clients can now connect).
        drop(lock_file);
        Ok(port)
    }

    fn should_shutdown_server(&self) -> Result<bool> {
        let clients_dir = self.config.base_dir.join("clients");
        Ok(self.count_live_clients(&clients_dir)? == 0)
    }
}

/// RAII guard for a client registration; hold it for the process's lifetime.
pub struct ClientRegistration {
    _file: File,   // Holds the flock
    path: PathBuf,
}

impl Drop for ClientRegistration {
    fn drop(&mut self) {
        // The kernel releases the file lock when _file closes; deleting the
        // file is optional tidiness (the next client's scan would remove a
        // stale file anyway).
        let _ = fs::remove_file(&self.path);
    }
}
```
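`should_shutdown_server` still needs a caller. One way to wire it up, sketched under the same assumptions (`stop_server` is a hypothetical helper), is for a departing client, after dropping its own registration, or a periodic sweep to take `server.lock`, re-check liveness under that lock, and only then stop the server:
```rust
impl ServerCoordinatorV2 {
    /// Sketch: called by a departing client (after its registration has been
    /// dropped) or by a periodic sweep.
    async fn maybe_shutdown_server(&self) -> Result<()> {
        let lock_path = self.config.base_dir.join("server.lock");
        let lock_file = OpenOptions::new()
            .write(true)
            .create(true)
            .open(&lock_path)?;
        // Serialize against concurrent startups: no client can register a new
        // server while we hold this.
        lock_file.lock_exclusive()?;

        // Re-check under the lock; a new client may have registered meanwhile.
        if self.should_shutdown_server()? {
            if let Some(info) = self.read_server_info()? {
                self.stop_server(&info).await?; // hypothetical helper
            }
            let _ = fs::remove_file(self.config.base_dir.join("server.json"));
        }
        Ok(())
    }
}
```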
## Comparison: V1 vs V2
| Aspect | V1 (semaphores) | V2 (file locks) |
|---|---|---|
| **Crash recovery** | Complex (check ref_count, semaphore) | Automatic (kernel flock) |
| **State tracking** | 3 sources (ref_count, semaphore, reality) | 1 source (flock files) |
| **Cleanup code** | 50+ lines of emergency_cleanup | 0 lines (the kernel does it) |
| **Debugging** | Hard (semaphores are opaque) | Easy (`ls`, `lsof`) |
| **Platform support** | POSIX semaphore quirks | Portable file locking (flock / `LockFileEx`) |
| **Failure modes** | Many (any source can drift out of sync) | Few (the kernel is reliable) |
## Migration Strategy
### Phase 1: Parallel Implementation
- Add V2 code alongside V1
- Feature flag to choose implementation
- V1 remains default
### Phase 2: Testing
- Test V2 with crash scenarios (SIGKILL, OOM, etc.)
- Verify cross-platform behavior
- Performance benchmarking
### Phase 3: Cutover
- Make V2 default
- Deprecate V1
- Migration script for existing installations
### Phase 4: Cleanup
- Remove V1 code
- Clean up semaphore files from existing installations
## Edge Cases Handled
### Case 1: Process killed during registration
- File created but lock not acquired → next client's cleanup removes it
- File created and locked → kernel releases on crash → next client's cleanup removes it
### Case 2: Server killed externally
- `server.json` exists but the server is dead
- The next client detects this via `is_server_alive()` and starts a new server (see the sketch below)
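One plausible implementation of `is_server_alive` (a sketch assuming tokio and the `ServerInfo` shape from Component 1): attempt a short-timeout TCP connect to the recorded port. A stricter check could also verify the PID or hit SurrealDB's health endpoint.
```rust
use std::time::Duration;

use tokio::{net::TcpStream, time::timeout};

/// Sketch: "alive" here means only that something accepts connections on the
/// recorded port within the timeout.
async fn is_server_alive(info: &ServerInfo) -> bool {
    let addr = format!("127.0.0.1:{}", info.port);
    matches!(
        timeout(Duration::from_millis(250), TcpStream::connect(&addr)).await,
        Ok(Ok(_))
    )
}
```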
### Case 3: Disk full
- Can't create lock file → graceful error
- Won't corrupt existing state
### Case 4: Power failure
- All flocks are released by the kernel
- `clients/` may have stale files → cleaned on next startup
### Case 5: NFS/Network filesystem
- Don't put `~/.gsc/` on NFS: flock semantics are unreliable on network filesystems, so the registry must live on a local filesystem
- Detect this case and fail with a clear error (see the sketch below)
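A Linux-only sketch of that check (assuming the `nix` crate): `statfs` the base directory and refuse known network-filesystem magic numbers.
```rust
use std::path::Path;

use nix::sys::statfs::{statfs, NFS_SUPER_MAGIC, SMB_SUPER_MAGIC};

/// Sketch: reject registry directories on filesystems where flock is unreliable.
fn ensure_local_fs(dir: &Path) -> anyhow::Result<()> {
    let fs = statfs(dir)?;
    let fstype = fs.filesystem_type();
    if fstype == NFS_SUPER_MAGIC || fstype == SMB_SUPER_MAGIC {
        anyhow::bail!(
            "{} is on a network filesystem; flock semantics are unreliable there",
            dir.display()
        );
    }
    Ok(())
}
```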
## Implementation Checklist
- [ ] Implement the flock-based process registry (`SlotManager`)
- [ ] Implement `ClientRegistration` RAII guard
- [ ] Implement server lifecycle without ref_count
- [ ] Add `--use-v2` flag for testing
- [ ] Write integration tests with crash simulation
- [ ] Add migration script
- [ ] Update documentation
- [ ] Performance benchmark
- [ ] Cross-platform testing (Linux, macOS, Windows)