# Service Discovery Design
## Problem Statement
rapace-registry exists but is **half-baked** for actual service discovery. Current issues:
1. **No runtime registry** - `ServiceRegistry` is built at codegen time but not exposed globally
2. **No introspection** - Can't query "what services does this cell provide?"
3. **No announcement** - Cells don't advertise their capabilities when connecting
4. **Poor dispatching** - `DispatcherBuilder` just tries services linearly until one doesn't return `Unimplemented`
5. **No capability negotiation** - Host doesn't know what a cell can do until it tries to call it
6. **No versioning** - Can't handle version mismatches between host and cell
This makes debugging hard ("why isn't my RPC working?") and prevents useful features like:
- Runtime service browser/explorer
- Dynamic routing based on available services
- Health checks ("is service X available?")
- Hindsight integration (method ID → name mapping for traces)
---
## Current State
### What We Have
**`rapace-registry`**:
- `ServiceRegistry` with method_id → method name lookup
- `MethodEntry` with schemas (facet shapes), docs, arg info
- Sequential `MethodId` allocation (unique across services)
- Already captures all the metadata we need!
**`rapace-macros`**:
- `#[rapace::service]` generates client/server code
- Method IDs are computed at codegen time (FNV-1a hash)
- No automatic registration happens
**`rapace-cell`**:
- `DispatcherBuilder` for multi-service cells
- Linear search through services (inefficient + poor error messages)
### What's Missing
1. **Global registry access** - No way to get the registry at runtime
2. **Auto-registration** - Generated code doesn't register itself
3. **Introspection service** - Can't query available services via RPC
4. **Host-side discovery** - Host can't enumerate what services a cell provides
---
## MVP Proposal
### Goals
1. ✅ **Auto-registration** - Services register themselves when the server is created
2. ✅ **Global registry** - Process-level registry accessible at runtime from any thread
3. ✅ **Introspection RPC** - Standard service for querying available services
4. ✅ **Better dispatching** - Use registry for fast method_id → service routing
5. ✅ **Hindsight integration** - Method name lookup for distributed tracing
### Non-Goals (Post-MVP)
- ❌ Capability negotiation protocol
- ❌ Service versioning
- ❌ Dynamic service addition/removal
- ❌ Cross-cell service discovery (service mesh)
---
## Design
### 1. Global Registry
Add a **process-level registry** that all services register into:
```rust
// In rapace-registry/src/lib.rs
use std::sync::LazyLock;
use parking_lot::RwLock;

/// Global process-level service registry.
///
/// All services automatically register here when their server is created.
static GLOBAL_REGISTRY: LazyLock<RwLock<ServiceRegistry>> =
    LazyLock::new(|| RwLock::new(ServiceRegistry::new()));

impl ServiceRegistry {
    /// Get a reference to the global registry.
    pub fn global() -> &'static RwLock<ServiceRegistry> {
        &GLOBAL_REGISTRY
    }

    /// Get the global registry (convenience for read access).
    pub fn with_global<F, R>(f: F) -> R
    where
        F: FnOnce(&ServiceRegistry) -> R,
    {
        f(&GLOBAL_REGISTRY.read())
    }

    /// Modify the global registry (convenience for write access).
    pub fn with_global_mut<F, R>(f: F) -> R
    where
        F: FnOnce(&mut ServiceRegistry) -> R,
    {
        f(&mut GLOBAL_REGISTRY.write())
    }
}
```
**Why process-level?**
- Simple - no Arc/lifetime complexity
- Works for both host and cell processes
- Thread-safe via RwLock
- LazyLock ensures single initialization
**Alternative considered**: Thread-local storage
- Rejected: Doesn't work across threads (RPC handlers often run on tokio thread pool)
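The difference between the two options can be seen in a small, dependency-free sketch. It is not the `rapace-registry` code: `GLOBAL`, `LOCAL`, and the helper functions are hypothetical stand-ins, and `std::sync::RwLock` is used instead of `parking_lot` so the example compiles on its own. A name registered on the main thread is visible to a worker thread through the process-level map, but not through the thread-local one.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::{LazyLock, RwLock};
use std::thread;

// Hypothetical stand-in for `ServiceRegistry`: method id -> full method name.
static GLOBAL: LazyLock<RwLock<HashMap<u32, String>>> =
    LazyLock::new(|| RwLock::new(HashMap::new()));

thread_local! {
    // The rejected alternative: every thread gets its own, initially empty map.
    static LOCAL: RefCell<HashMap<u32, String>> = RefCell::new(HashMap::new());
}

fn register_global(id: u32, name: &str) {
    GLOBAL.write().unwrap().insert(id, name.to_string());
}

fn lookup_global(id: u32) -> Option<String> {
    GLOBAL.read().unwrap().get(&id).cloned()
}

fn register_local(id: u32, name: &str) {
    LOCAL.with(|map| map.borrow_mut().insert(id, name.to_string()));
}

fn lookup_local(id: u32) -> Option<String> {
    LOCAL.with(|map| map.borrow().get(&id).cloned())
}

fn main() {
    // Register on the main thread, as server construction would.
    register_global(1, "Calculator.add");
    register_local(1, "Calculator.add");

    // Look up from a worker thread, as an RPC handler on a thread pool would.
    let (global_hit, local_hit) =
        thread::spawn(|| (lookup_global(1), lookup_local(1))).join().unwrap();

    assert_eq!(global_hit.as_deref(), Some("Calculator.add"));
    assert_eq!(local_hit, None); // the worker's thread-local map is empty
}
```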
### 2. Auto-Registration
Modify the `#[rapace::service]` macro to generate registration code:
```rust
// Generated by #[rapace::service] for trait MyService
impl MyServiceServer {
    /// Auto-register this service in the global registry.
    ///
    /// Called automatically when the server is created.
    fn __register() {
        use rapace_registry::ServiceRegistry;
        use once_cell::sync::OnceCell;

        static REGISTERED: OnceCell<()> = OnceCell::new();
        REGISTERED.get_or_init(|| {
            ServiceRegistry::with_global_mut(|registry| {
                let mut builder = registry.register_service(
                    "MyService",
                    "Service documentation from /// comments",
                );
                builder.add_method(
                    "my_method",
                    "Method documentation",
                    vec![
                        ArgInfo { name: "arg1", type_name: "String" },
                        ArgInfo { name: "arg2", type_name: "i32" },
                    ],
                    <MyMethodRequest as Facet>::SHAPE,
                    <MyMethodResponse as Facet>::SHAPE,
                );
                builder.finish();
            });
        });
    }

    pub fn new(service: impl MyService + 'static) -> Self {
        Self::__register(); // Auto-register on construction
        Self {
            service: Arc::new(service),
        }
    }
}
```
**Key insight**: Use `OnceCell` to ensure registration happens exactly once per service type, even if multiple instances are created.
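The once-only behaviour can be checked in isolation. The sketch below is an illustration under assumptions, not the generated code: `DemoServer` and `REGISTRATIONS` are hypothetical, and it uses `std::sync::Once` rather than `once_cell::sync::OnceCell` so that it needs no external crate. The guard is a function-local `static`, so each generated server type would get its own.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;

// Counts how many times the registration body actually ran.
static REGISTRATIONS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for a generated `MyServiceServer`.
struct DemoServer;

impl DemoServer {
    fn register() {
        // One guard per service type; the closure body runs at most once,
        // even if several threads construct servers concurrently.
        static REGISTERED: Once = Once::new();
        REGISTERED.call_once(|| {
            REGISTRATIONS.fetch_add(1, Ordering::SeqCst);
        });
    }

    fn new() -> Self {
        Self::register(); // same shape as the generated constructor
        DemoServer
    }
}

fn main() {
    let _a = DemoServer::new();
    let _b = DemoServer::new();
    let _c = DemoServer::new();
    // Three instances, one registration.
    assert_eq!(REGISTRATIONS.load(Ordering::SeqCst), 1);
}
```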
### 3. Introspection Service
Define a **standard service** that all cells can optionally implement:
```rust
// In rapace-registry/src/introspection.rs
use facet::Facet;

/// Information about a registered service.
#[derive(Clone, Debug, Facet)]
pub struct ServiceInfo {
    /// Service name (e.g., "Calculator").
    pub name: String,
    /// Service documentation.
    pub doc: String,
    /// Methods provided by this service.
    pub methods: Vec<MethodInfo>,
}

/// Information about a method.
#[derive(Clone, Debug, Facet)]
pub struct MethodInfo {
    /// Method ID (for debugging/logging).
    pub id: u32,
    /// Method name (e.g., "add").
    pub name: String,
    /// Full qualified name (e.g., "Calculator.add").
    pub full_name: String,
    /// Method documentation.
    pub doc: String,
    /// Argument names and types.
    pub args: Vec<ArgInfo>,
    /// Whether this is a streaming method.
    pub is_streaming: bool,
}

/// Argument metadata.
#[derive(Clone, Debug, Facet)]
pub struct ArgInfo {
    pub name: String,
    pub type_name: String,
}

/// Standard introspection service.
///
/// Implement this service to expose runtime service information.
#[rapace::service]
pub trait ServiceIntrospection {
    /// List all services registered in this process.
    async fn list_services(&self) -> Vec<ServiceInfo>;

    /// Describe a specific service by name.
    async fn describe_service(&self, name: String) -> Option<ServiceInfo>;

    /// Check if a method ID is supported.
    async fn has_method(&self, method_id: u32) -> bool;
}
```
**Default implementation**:
```rust
// In rapace-registry/src/introspection.rs

/// Default implementation that reads from the global registry.
#[derive(Clone)]
pub struct DefaultServiceIntrospection;

impl ServiceIntrospection for DefaultServiceIntrospection {
    async fn list_services(&self) -> Vec<ServiceInfo> {
        ServiceRegistry::with_global(|registry| {
            registry
                .iter_services()
                .map(|service| ServiceInfo {
                    name: service.name.to_string(),
                    doc: service.doc.clone(),
                    methods: service
                        .iter_methods()
                        .map(|method| MethodInfo {
                            id: method.id.0,
                            name: method.name.to_string(),
                            full_name: method.full_name.clone(),
                            doc: method.doc.clone(),
                            args: method
                                .args
                                .iter()
                                .map(|arg| ArgInfo {
                                    name: arg.name.to_string(),
                                    type_name: arg.type_name.to_string(),
                                })
                                .collect(),
                            is_streaming: method.is_streaming,
                        })
                        .collect(),
                })
                .collect()
        })
    }

    async fn describe_service(&self, name: String) -> Option<ServiceInfo> {
        self.list_services()
            .await
            .into_iter()
            .find(|s| s.name == name)
    }

    async fn has_method(&self, method_id: u32) -> bool {
        ServiceRegistry::with_global(|registry| {
            registry.method_by_id(MethodId(method_id)).is_some()
        })
    }
}
```
### 4. Better Dispatching
Improve `DispatcherBuilder` to use the registry for routing:
```rust
// In rapace-cell/src/lib.rs
impl DispatcherBuilder {
    pub fn build(self) -> impl Fn(...) -> ... {
        let services = Arc::new(self.services);
        move |_channel_id, method_id, payload| {
            let services = services.clone();
            Box::pin(async move {
                // NEW: Look up the method name in the registry so the trace is readable
                if let Some(method) = ServiceRegistry::with_global(|reg| {
                    reg.method_by_id(MethodId(method_id)).map(|m| m.full_name.clone())
                }) {
                    tracing::debug!(
                        method_id,
                        method_name = %method,
                        "Dispatching to registered method"
                    );
                }

                // Try each service until one handles it
                for service in services.iter() {
                    let result = service.dispatch(method_id, &payload).await;
                    if !matches!(
                        &result,
                        Err(RpcError::Status {
                            code: ErrorCode::Unimplemented,
                            ..
                        })
                    ) {
                        return result;
                    }
                }

                // No service handled this method - use registry for better error
                let error_msg = ServiceRegistry::with_global(|reg| {
                    if let Some(method) = reg.method_by_id(MethodId(method_id)) {
                        format!(
                            "Method '{}' (id={}) exists but is not implemented by any registered service",
                            method.full_name, method_id
                        )
                    } else {
                        format!(
                            "Unknown method_id: {} (not registered in global registry)",
                            method_id
                        )
                    }
                });
                Err(RpcError::Status {
                    code: ErrorCode::Unimplemented,
                    message: error_msg,
                })
            })
        }
    }
}
```
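**Improvement**: Error messages now include method names instead of just IDs! Note that routing itself is still a linear scan over the services; in this version the registry is consulted only for tracing and for the final error message.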
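Goal 4 asks for fast `method_id` → service routing. One possible way to get there, sketched here under assumptions, is to build a lookup table once when the dispatcher is constructed and index into it on every call. `Service`, `Router`, and `route` are hypothetical names for illustration; the real `DispatcherBuilder` would need each service to report the method IDs it owns, which the current design does not specify.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a service that knows which method ids it owns.
struct Service {
    name: &'static str,
    method_ids: Vec<u32>,
}

// Maps each method id to the index of the service that handles it.
struct Router {
    services: Vec<Service>,
    routes: HashMap<u32, usize>,
}

impl Router {
    fn new(services: Vec<Service>) -> Self {
        let mut routes = HashMap::new();
        for (index, service) in services.iter().enumerate() {
            for &id in &service.method_ids {
                // First registration wins here; a real implementation
                // would probably report duplicates instead.
                routes.entry(id).or_insert(index);
            }
        }
        Router { services, routes }
    }

    // Returns the owning service's name, or an error message naming the id.
    fn route(&self, method_id: u32) -> Result<&'static str, String> {
        self.routes
            .get(&method_id)
            .map(|&index| self.services[index].name)
            .ok_or_else(|| format!("Unknown method_id: {method_id}"))
    }
}

fn main() {
    let router = Router::new(vec![
        Service { name: "Calculator", method_ids: vec![1, 2] },
        Service { name: "ServiceIntrospection", method_ids: vec![10, 11, 12] },
    ]);
    assert_eq!(router.route(2), Ok("Calculator"));
    assert_eq!(router.route(11), Ok("ServiceIntrospection"));
    assert!(router.route(99).is_err());
}
```

With a table like this, an unknown ID is rejected after a single hash lookup instead of after every service has returned `Unimplemented`.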
### 5. Cell Integration
Make it easy for cells to expose introspection:
```rust
// In rapace-cell/src/lib.rs
impl DispatcherBuilder {
    /// Add introspection service to this cell.
    ///
    /// This exposes the `ServiceIntrospection` service, allowing callers to
    /// query what services and methods this cell provides.
    pub fn with_introspection(self) -> Self {
        use rapace_registry::introspection::{
            DefaultServiceIntrospection, ServiceIntrospectionServer,
        };

        let introspection = DefaultServiceIntrospection;
        let server = ServiceIntrospectionServer::new(introspection);
        self.add_service(server)
    }
}
```
**Usage**:
```rust
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    rapace_cell::run_multi(|builder| {
        builder
            .add_service(MyServiceServer::new(my_impl))
            .add_service(AnotherServiceServer::new(another_impl))
            .with_introspection() // ← Add introspection!
    })
    .await?;
    Ok(())
}
```
---
## Implementation Plan
### Phase 1: Global Registry (1-2 hours)
**Files**:
- `rapace-registry/src/lib.rs`
**Changes**:
1. Add `GLOBAL_REGISTRY` static
2. Add `ServiceRegistry::global()`, `with_global()`, `with_global_mut()`
3. Add tests
**Test**:
```rust
#[test]
fn test_global_registry() {
    ServiceRegistry::with_global_mut(|reg| {
        let mut builder = reg.register_service("TestService", "Test");
        builder.add_method("test", "", vec![], &DUMMY_SHAPE, &DUMMY_SHAPE);
        builder.finish();
    });

    ServiceRegistry::with_global(|reg| {
        // The registry is process-wide and shared with other tests in the
        // same binary, so only assert a lower bound on the count.
        assert!(reg.service_count() >= 1);
        assert!(reg.service("TestService").is_some());
    });
}
```
### Phase 2: Auto-Registration (2-3 hours)
**Files**:
- `rapace-macros/src/lib.rs`
**Changes**:
1. Modify `#[rapace::service]` codegen to generate `__register()` method
2. Call `__register()` in generated `new()` constructors
3. Use `OnceCell` to prevent duplicate registration
4. Add dependency on `rapace-registry`
**Test**: Manually verify generated code
### Phase 3: Introspection Service (1-2 hours)
**Files**:
- `rapace-registry/src/introspection.rs` (new)
- `rapace-registry/src/lib.rs` (export module)
**Changes**:
1. Define `ServiceInfo`, `MethodInfo`, `ArgInfo` types
2. Define `ServiceIntrospection` trait (use `#[rapace::service]`)
3. Implement `DefaultServiceIntrospection`
4. Add tests
**Test**:
```rust
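#[tokio::test]
async fn test_introspection() {
    // Constructing a server triggers auto-registration (Phase 2), so the
    // global registry is guaranteed to be non-empty below.
    let _server = ServiceIntrospectionServer::new(DefaultServiceIntrospection);

    let intro = DefaultServiceIntrospection;
    let services = intro.list_services().await;
    assert!(!services.is_empty());
}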
```
### Phase 4: Better Dispatching (1 hour)
**Files**:
- `rapace-cell/src/lib.rs`
**Changes**:
1. Update `DispatcherBuilder::build()` to use registry for error messages
2. Add `with_introspection()` helper
3. Add optional tracing for method dispatch
### Phase 5: Integration & Testing (1-2 hours)
**Files**:
- `demos/*/` - Update demos to use introspection
- `rapace-explorer/` - Use introspection to list available services
**Tasks**:
1. Update at least one demo to expose introspection
2. Test end-to-end: connect to cell, call `list_services()`, verify results
3. Update docs with examples
---
## Example: Full Flow
### Cell Code
```rust
use rapace_cell::run_multi;

// Service implementation
struct CalculatorImpl;

#[rapace::async_trait]
impl Calculator for CalculatorImpl {
    async fn add(&self, a: i32, b: i32) -> i32 {
        a + b
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    run_multi(|builder| {
        builder
            .add_service(CalculatorServer::new(CalculatorImpl))
            .with_introspection() // ← Auto-exposes ServiceIntrospection
    })
    .await?;
    Ok(())
}
```
### Host Code (Debugging/Exploration)
```rust
use rapace_registry::introspection::ServiceIntrospectionClient;

// Connect to cell
let session = connect_to_cell("/tmp/cell.shm").await?;
let intro_client = ServiceIntrospectionClient::new(session.clone());

// Query available services
let services = intro_client.list_services().await?;
for service in services {
    println!("Service: {}", service.name);
    for method in service.methods {
        println!("  - {}: {}", method.name, method.doc);
        for arg in method.args {
            println!("      {}: {}", arg.name, arg.type_name);
        }
    }
}
```
**Output**:
```
Service: Calculator
  - add: Add two numbers together
      a: i32
      b: i32
Service: ServiceIntrospection
  - list_services: List all services registered in this process.
  - describe_service: Describe a specific service by name.
      name: String
  - has_method: Check if a method ID is supported.
      method_id: u32
```
### Hindsight Integration
```rust
// In hindsight-server, when receiving a span:
if let Some(method_name) = ServiceRegistry::with_global(|reg| {
    reg.method_by_id(MethodId(span.method_id))
        .map(|m| m.full_name.clone())
}) {
    span.set_attribute("rpc.method", method_name);
}
```
---
## Benefits
### For Developers
- **Better errors**: "Unknown method `Calculator.add`" instead of "Unknown method_id: 12345"
- **Runtime inspection**: Query what a cell can do without reading code
- **Easier debugging**: Hindsight traces show method names, not IDs
### For Hindsight
- Method ID → name mapping without manual configuration
- Service-level filtering ("only trace Calculator service")
- Human-readable trace views
### For Future Features
- Service mesh (discover services across multiple cells)
- Load balancing (route to cells that have service X)
- Health checks (is service X registered and healthy?)
- Version negotiation (ensure host and cell are compatible)
---
## Open Questions
1. **Should introspection be mandatory or optional?**
   - **Proposal**: Optional but enabled by default in `rapace-cell`
   - Cells can opt out if they want minimal overhead
2. **Should we support service removal?**
   - **Proposal**: Not in MVP (static registration only)
   - Future: Add `ServiceRegistry::unregister_service()`
3. **How to handle method ID collisions?**
   - **Current**: FNV-1a hash of full method name
   - **Risk**: Collisions are theoretically possible
   - **Mitigation**: Registry can detect and panic on collision
4. **Thread safety of global registry?**
   - **Proposal**: RwLock for read-heavy workload
   - Most operations are reads (method lookups during dispatch)
   - Writes only happen at server creation time (rare)
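
For question 3, the collision check could look roughly like the sketch below. It assumes the 32-bit FNV-1a variant (the document only says "FNV-1a", but `MethodInfo::id` is a `u32`) and assumes the hash input is the full method name such as `Calculator.add`. `MethodIds` is a hypothetical table, not part of `rapace-registry`, and it returns an error where the proposed mitigation would panic.

```rust
use std::collections::HashMap;

// 32-bit FNV-1a over the UTF-8 bytes of `s`.
fn fnv1a_32(s: &str) -> u32 {
    const OFFSET_BASIS: u32 = 0x811c_9dc5;
    const PRIME: u32 = 0x0100_0193;
    s.bytes().fold(OFFSET_BASIS, |hash, byte| {
        (hash ^ u32::from(byte)).wrapping_mul(PRIME)
    })
}

// Hypothetical id table that refuses to map two different names to one id.
#[derive(Default)]
struct MethodIds {
    by_id: HashMap<u32, String>,
}

impl MethodIds {
    fn insert(&mut self, full_name: &str) -> Result<u32, String> {
        let id = fnv1a_32(full_name);
        if let Some(existing) = self.by_id.get(&id) {
            if existing.as_str() != full_name {
                return Err(format!(
                    "method id collision: '{existing}' and '{full_name}' both hash to {id:#010x}"
                ));
            }
        }
        self.by_id.insert(id, full_name.to_string());
        Ok(id)
    }
}

fn main() {
    let mut ids = MethodIds::default();
    let add = ids.insert("Calculator.add").unwrap();

    // Re-registering the same name is idempotent.
    assert_eq!(ids.insert("Calculator.add"), Ok(add));

    // Simulate a collision by planting a different name under the same id.
    ids.by_id.insert(add, "Other.method".to_string());
    assert!(ids.insert("Calculator.add").is_err());
}
```

Detecting the collision at registration time turns a silent misroute into an immediate, named failure; whether to panic or return an error is a policy choice left open here.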
---
## Alternatives Considered
### 1. Thread-Local Registry
**Rejected**: Doesn't work for multi-threaded RPC handlers.
### 2. Explicit Registry Passing
```rust
let registry = ServiceRegistry::new();
MyServiceServer::new_with_registry(my_impl, &registry);
```
**Rejected**: Too much boilerplate, easy to forget.
### 3. Discovery via Separate RPC
Instead of a service, use a separate RPC endpoint for discovery.
**Rejected**: Less idiomatic - services are the primitive in rapace.
### 4. Code Generation Only (No Runtime Registry)
Just generate service metadata at build time, no runtime component.
**Rejected**: Can't support Hindsight integration or runtime introspection.
---
## Success Criteria
MVP is successful when:
1. ✅ All generated services auto-register in global registry
2. ✅ `ServiceIntrospection` works end-to-end (cell advertises, host queries)
3. ✅ Error messages show method names instead of IDs
4. ✅ Hindsight can map method_id → method name without configuration
5. ✅ At least one demo uses introspection
6. ✅ Documentation is updated with examples
---
**Estimated Total Time**: 6-10 hours (sum of the phase estimates above)
**Priority**: Medium (useful but not blocking Hindsight MVP)
**Dependencies**: None (purely additive changes)