# Implementation Plan: Embellama Server
This document outlines the phased implementation plan for the `embellama` server component. The server provides an OpenAI-compatible REST API for the core library functionality.
## Critical Architecture Constraint
**⚠️ IMPORTANT**: Due to `LlamaContext` being `!Send`:
- Models cannot be shared between threads using `Arc`
- Each worker thread must own its model instance
- All communication uses message passing via channels
- Worker pool pattern is mandatory for concurrent requests
## Phase 1: Server Foundation
### Objectives
Set up the basic server infrastructure with Axum and establish the project structure.
### Tasks
- [ ] Configure server dependencies in `Cargo.toml`
- [ ] Add server feature flag configuration
- [ ] Add `axum = { version = "0.7", optional = true }`
- [ ] Add `tokio = { version = "1.35", features = ["full"], optional = true }`
- [ ] Add `clap = { version = "4.4", features = ["derive"], optional = true }`
- [ ] Add `tower = { version = "0.4", optional = true }`
- [ ] Add `tower-http = { version = "0.5", features = ["cors", "trace"], optional = true }`
- [ ] Configure binary target with required features
- [ ] Create server module structure
- [ ] Create `src/server/mod.rs` with submodule declarations
- [ ] Create `src/server/worker.rs` for worker thread implementation
- [ ] Create `src/server/dispatcher.rs` for request routing
- [ ] Create `src/server/channel.rs` for message types
- [ ] Create `src/server/state.rs` for application state
- [ ] Create `src/bin/server.rs` for server binary
- [ ] Implement CLI argument parsing (`src/bin/server.rs`); see the `Args` sketch after this task list
- [ ] Define `Args` struct with `clap` derive
- [ ] Add arguments:
- [ ] `--model-path` - Path to GGUF model file
- [ ] `--model-name` - Model identifier for API
- [ ] `--host` - Bind address (default: 127.0.0.1)
- [ ] `--port` - Server port (default: 8080)
- [ ] `--workers` - Number of worker threads
- [ ] `--queue-size` - Max pending requests per worker
- [ ] Implement argument validation
- [ ] Set up basic Axum application (see the Axum setup sketch after this task list)
- [ ] Create `Router` with basic routes
- [ ] Add `/health` endpoint
- [ ] Configure middleware (CORS, tracing)
- [ ] Set up graceful shutdown
- [ ] Implement server binding and listening
- [ ] Configure logging for server
- [ ] Set up `tracing_subscriber` with env filter
- [ ] Add request/response logging middleware
- [ ] Configure structured logging output
- [ ] Add correlation IDs for requests
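
A minimal sketch of the CLI surface described above, assuming `clap` 4 with the `derive` feature; the flag names mirror the list, while the default worker count and the validation shown are illustrative only:
```rust
use std::path::PathBuf;

use clap::Parser;

/// Command-line arguments for the embellama server (sketch).
#[derive(Debug, Parser)]
#[command(name = "embellama-server", about = "OpenAI-compatible embedding server")]
struct Args {
    /// Path to the GGUF model file.
    #[arg(long)]
    model_path: PathBuf,

    /// Model identifier reported by the API.
    #[arg(long)]
    model_name: String,

    /// Bind address.
    #[arg(long, default_value = "127.0.0.1")]
    host: String,

    /// Server port.
    #[arg(long, default_value_t = 8080)]
    port: u16,

    /// Number of worker threads; each one owns its own model instance.
    #[arg(long, default_value_t = 4)]
    workers: usize,

    /// Maximum pending requests per worker.
    #[arg(long, default_value_t = 32)]
    queue_size: usize,
}

fn main() {
    let args = Args::parse();
    // Validation beyond what clap enforces, e.g. the model file must exist.
    if !args.model_path.exists() {
        eprintln!("model file not found: {}", args.model_path.display());
        std::process::exit(1);
    }
    println!(
        "starting {} workers on {}:{} (queue size {})",
        args.workers, args.host, args.port, args.queue_size
    );
}
```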
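A sketch of the basic Axum 0.7 application with the `/health` route, CORS and tracing middleware, env-filtered logging, and graceful shutdown; the hard-coded bind address stands in for the parsed `Args`, and `tracing-subscriber` is assumed to have its `env-filter` feature enabled:
```rust
use axum::{routing::get, Router};
use tower_http::{cors::CorsLayer, trace::TraceLayer};

async fn health() -> &'static str {
    "ok"
}

/// Resolves on Ctrl-C so in-flight requests can finish before exit.
async fn shutdown_signal() {
    tokio::signal::ctrl_c()
        .await
        .expect("failed to install Ctrl-C handler");
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Logging filtered by RUST_LOG; switch to `.json()` for structured JSON
    // output (requires the `json` feature of tracing-subscriber).
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();

    let app = Router::new()
        .route("/health", get(health))
        .layer(TraceLayer::new_for_http())
        .layer(CorsLayer::permissive());

    let listener = tokio::net::TcpListener::bind("127.0.0.1:8080").await?;
    axum::serve(listener, app)
        .with_graceful_shutdown(shutdown_signal())
        .await
}
```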
### Success Criteria
- [ ] Server starts with `cargo run --features server --bin embellama-server`
- [ ] `/health` endpoint returns 200 OK
- [ ] CLI arguments are parsed correctly
- [ ] Logging produces structured output
### Dependencies
- Core library phases 1-3 (available for import)
---
## Phase 2: Worker Pool Architecture
### Objectives
Implement the worker pool pattern to handle the `!Send` constraint with thread-local models.
### Tasks
- [ ] Define message types (`src/server/channel.rs`); see the message-type sketch after this task list
- [ ] Create `WorkerRequest` struct:
- [ ] `id: Uuid` - Request identifier
- [ ] `model: String` - Model name
- [ ] `input: TextInput` - Text(s) to process
- [ ] `response_tx: oneshot::Sender<WorkerResponse>`
- [ ] Create `WorkerResponse` struct:
- [ ] `embeddings: Vec<Vec<f32>>`
- [ ] `token_count: usize`
- [ ] `processing_time_ms: u64`
- [ ] Create `TextInput` enum for single/batch
- [ ] Implement worker thread (`src/server/worker.rs`); see the worker-loop sketch after this task list
- [ ] Create `Worker` struct with:
- [ ] Thread-local `EmbeddingModel`
- [ ] Request receiver channel
- [ ] Worker ID and metrics
- [ ] Implement worker main loop:
- [ ] Receive requests from channel
- [ ] Process with thread-local model
- [ ] Send response via oneshot channel
- [ ] Handle errors gracefully
- [ ] Add worker lifecycle management:
- [ ] Initialization with model loading
- [ ] Graceful shutdown handling
- [ ] Error recovery mechanisms
- [ ] Implement dispatcher (`src/server/dispatcher.rs`)
- [ ] Create `Dispatcher` struct with:
- [ ] Vector of worker sender channels
- [ ] Round-robin or load-based routing
- [ ] Request queue management
- [ ] Implement request routing:
- [ ] Select available worker
- [ ] Forward request to worker
- [ ] Handle backpressure
- [ ] Implement timeout handling
- [ ] Create worker pool management
- [ ] Spawn worker threads on startup
- [ ] Each worker loads its own model instance
- [ ] Monitor worker health
- [ ] Handle worker failures/restarts
- [ ] Implement graceful shutdown
- [ ] Implement `AppState` (`src/server/state.rs`)
- [ ] Store dispatcher sender channel
- [ ] Configuration parameters
- [ ] Metrics and statistics
- [ ] Health check status
- [ ] Add worker pool tests
- [ ] Test message passing
- [ ] Test concurrent requests
- [ ] Test worker failure recovery
- [ ] Verify thread isolation
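
A sketch of the message types for `src/server/channel.rs`, assuming `uuid` for request IDs and tokio's `oneshot` channel for the reply path; the field names follow the checklist above:
```rust
use tokio::sync::oneshot;
use uuid::Uuid;

/// Single text or a batch, mirroring the OpenAI `input` field.
pub enum TextInput {
    Single(String),
    Batch(Vec<String>),
}

/// Sent from the HTTP handler, via the dispatcher, to a worker thread.
pub struct WorkerRequest {
    pub id: Uuid,
    pub model: String,
    pub input: TextInput,
    /// The worker replies on this one-shot channel; dropping it without
    /// sending is how a worker signals failure to the waiting handler.
    pub response_tx: oneshot::Sender<WorkerResponse>,
}

/// Produced by a worker after running inference on its thread-local model.
pub struct WorkerResponse {
    pub embeddings: Vec<Vec<f32>>,
    pub token_count: usize,
    pub processing_time_ms: u64,
}
```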
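A sketch of the worker loop and pool startup, reusing the message types above; `EmbeddingModel::load` and `embed_batch` are stand-ins for the core-library API, not its actual signatures:
```rust
use std::time::Instant;

use tokio::sync::mpsc;

/// Stand-in for the core library's embedding model (assumed API).
struct EmbeddingModel;
impl EmbeddingModel {
    fn load(_path: &str) -> Self {
        EmbeddingModel
    }
    fn embed_batch(&self, texts: &[String]) -> (Vec<Vec<f32>>, usize) {
        (texts.iter().map(|_| vec![0.0; 4]).collect(), 0)
    }
}

/// Runs on a dedicated OS thread; the !Send model never leaves this thread.
fn worker_main(id: usize, model_path: String, mut rx: mpsc::Receiver<WorkerRequest>) {
    let model = EmbeddingModel::load(&model_path);
    tracing::info!(worker = id, "model loaded");

    // blocking_recv bridges the async sender side and this synchronous loop.
    while let Some(req) = rx.blocking_recv() {
        let started = Instant::now();
        let texts = match req.input {
            TextInput::Single(t) => vec![t],
            TextInput::Batch(ts) => ts,
        };
        let (embeddings, token_count) = model.embed_batch(&texts);
        let response = WorkerResponse {
            embeddings,
            token_count,
            processing_time_ms: started.elapsed().as_millis() as u64,
        };
        // If the handler has already given up (timeout/cancel), this fails; ignore it.
        let _ = req.response_tx.send(response);
    }
    // Channel closed: all senders dropped, so shut down cleanly.
}

/// Spawns the pool and returns one sender per worker for the dispatcher to route over.
fn spawn_workers(n: usize, queue_size: usize, model_path: &str) -> Vec<mpsc::Sender<WorkerRequest>> {
    (0..n)
        .map(|id| {
            let (tx, rx) = mpsc::channel(queue_size);
            let path = model_path.to_string();
            std::thread::spawn(move || worker_main(id, path, rx));
            tx
        })
        .collect()
}
```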
### Success Criteria
- [ ] Worker pool starts successfully
- [ ] Models load on each worker thread
- [ ] Messages route correctly to workers
- [ ] No data races or deadlocks
- [ ] Graceful shutdown works properly
### Dependencies
- Phase 1 (Server Foundation)
---
## Phase 3: OpenAI-Compatible API Endpoints
### Objectives
Implement the OpenAI-compatible `/v1/embeddings` endpoint with proper request/response handling.
### Tasks
- [ ] Define API types (create `src/server/api_types.rs`)
- [ ] Implement `EmbeddingsRequest`:
```rust
#[derive(Deserialize)]
struct EmbeddingsRequest {
    model: String,
    input: InputType,
    encoding_format: Option<String>,
    dimensions: Option<usize>,
    user: Option<String>,
}
```
- [ ] Implement `EmbeddingsResponse`:
```rust
#[derive(Serialize)]
struct EmbeddingsResponse {
    object: String,
    data: Vec<EmbeddingData>,
    model: String,
    usage: Usage,
}
```
- [ ] Add supporting types (`EmbeddingData`, `Usage`)
- [ ] Implement embeddings handler (see the handler sketch after this task list)
- [ ] Create async handler function
- [ ] Parse and validate request
- [ ] Create oneshot channel for response
- [ ] Send request to dispatcher
- [ ] Await response with timeout
- [ ] Format OpenAI-compatible response
- [ ] Add input validation
- [ ] Validate model name exists
- [ ] Check input text length limits
- [ ] Validate encoding format
- [ ] Handle empty inputs gracefully
- [ ] Implement error responses (see the error-format sketch after this task list)
- [ ] OpenAI-compatible error format
- [ ] Appropriate HTTP status codes
- [ ] Helpful error messages
- [ ] Request ID in errors
- [ ] Add routes to router
- [ ] Mount `/v1/embeddings` POST endpoint
- [ ] Add `/v1/models` GET endpoint
- [ ] Add OpenAPI/Swagger documentation
- [ ] Implement content negotiation
- [ ] Support JSON requests/responses
- [ ] Handle content-type headers
- [ ] Support gzip compression
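
A sketch of the embeddings handler, building on the `AppState` shape from the implementation notes and the Phase 2 message types; the `ServerConfig::model_name` field, the `InputType` conversion, and the supporting `EmbeddingData`/`Usage` shapes (`object`, `index`, `embedding`; `prompt_tokens`, `total_tokens`) are assumptions that follow the OpenAI response layout:
```rust
use std::time::Duration;

use axum::{extract::State, http::StatusCode, Json};
use tokio::sync::oneshot;
use uuid::Uuid;

/// POST /v1/embeddings (sketch).
async fn embeddings_handler(
    State(state): State<AppState>,
    Json(req): Json<EmbeddingsRequest>,
) -> Result<Json<EmbeddingsResponse>, StatusCode> {
    // Basic validation before any work is queued.
    if req.model != state.config.model_name {
        return Err(StatusCode::NOT_FOUND);
    }

    let (response_tx, response_rx) = oneshot::channel();
    let worker_req = WorkerRequest {
        id: Uuid::new_v4(),
        model: req.model.clone(),
        input: req.input.into(), // assumes a From<InputType> for TextInput impl
        response_tx,
    };

    // Hand the request to the dispatcher; this only fails if it has shut down.
    state
        .dispatcher_tx
        .send(worker_req)
        .await
        .map_err(|_| StatusCode::SERVICE_UNAVAILABLE)?;

    // Bound the wait so a stuck worker cannot hold the connection forever.
    let worker_resp = tokio::time::timeout(Duration::from_secs(30), response_rx)
        .await
        .map_err(|_| StatusCode::GATEWAY_TIMEOUT)? // timed out
        .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?; // worker dropped the sender

    Ok(Json(EmbeddingsResponse {
        object: "list".to_string(),
        data: worker_resp
            .embeddings
            .into_iter()
            .enumerate()
            .map(|(index, embedding)| EmbeddingData {
                object: "embedding".to_string(),
                index,
                embedding,
            })
            .collect(),
        model: req.model,
        usage: Usage {
            prompt_tokens: worker_resp.token_count,
            total_tokens: worker_resp.token_count,
        },
    }))
}
```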
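A sketch of an OpenAI-style error type; the `{"error": {"message", "type", "param", "code"}}` envelope mirrors the OpenAI API, and the request ID could be attached as an extra field or response header:
```rust
use axum::{
    http::StatusCode,
    response::{IntoResponse, Response},
    Json,
};
use serde::Serialize;

/// OpenAI-style error body.
#[derive(Serialize)]
struct ApiErrorBody {
    message: String,
    #[serde(rename = "type")]
    error_type: String,
    param: Option<String>,
    code: Option<String>,
}

#[derive(Serialize)]
struct ErrorEnvelope {
    error: ApiErrorBody,
}

struct ApiError {
    status: StatusCode,
    body: ApiErrorBody,
}

impl ApiError {
    /// 400 with `invalid_request_error`, the most common client-side case.
    fn invalid_request(message: impl Into<String>, param: Option<&str>) -> Self {
        ApiError {
            status: StatusCode::BAD_REQUEST,
            body: ApiErrorBody {
                message: message.into(),
                error_type: "invalid_request_error".to_string(),
                param: param.map(str::to_string),
                code: None,
            },
        }
    }
}

impl IntoResponse for ApiError {
    fn into_response(self) -> Response {
        (self.status, Json(ErrorEnvelope { error: self.body })).into_response()
    }
}
```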
### Success Criteria
- [ ] Endpoint accepts OpenAI-format requests
- [ ] Responses match OpenAI structure
- [ ] Error handling follows OpenAI patterns
- [ ] Works with OpenAI client libraries
### Dependencies
- Phase 2 (Worker Pool Architecture)
---
## Phase 4: Request/Response Pipeline
### Objectives
Implement the complete request processing pipeline with proper error handling and monitoring.
### Tasks
- [ ] Implement request preprocessing
- [ ] Add request ID generation
- [ ] Implement rate limiting
- [ ] Add request size limits
- [ ] Validate authentication (if configured)
- [ ] Enhance request flow
- [ ] Add request queuing with priorities
- [ ] Implement request cancellation
- [ ] Add request deduplication
- [ ] Handle batch request optimization
- [ ] Implement response post-processing (see the encoding-format sketch after this task list)
- [ ] Format embeddings (float vs base64)
- [ ] Calculate token usage statistics
- [ ] Add response caching headers
- [ ] Implement response compression
- [ ] Add timeout handling
- [ ] Configure per-request timeouts
- [ ] Implement timeout response
- [ ] Clean up timed-out requests
- [ ] Optionally return partial results
- [ ] Implement backpressure handling (see the backpressure sketch after this task list)
- [ ] Monitor queue depths
- [ ] Return 503 when overloaded
- [ ] Add retry-after headers
- [ ] Implement circuit breaker pattern
- [ ] Add observability
- [ ] Request/response metrics
- [ ] Latency histograms
- [ ] Queue depth monitoring
- [ ] Worker utilization metrics
- [ ] Error recovery mechanisms
- [ ] Retry failed requests
- [ ] Dead letter queue for failures
- [ ] Graceful degradation
- [ ] Health check integration
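
A sketch of the float vs base64 formatting step; like the OpenAI API, the base64 variant encodes the raw little-endian `f32` bytes, and the `base64` crate 0.21+ `Engine` API is assumed:
```rust
use base64::{engine::general_purpose::STANDARD, Engine as _};
use serde::Serialize;

/// Either a plain float array or a base64 string, selected by `encoding_format`.
#[derive(Serialize)]
#[serde(untagged)]
enum EmbeddingValue {
    Float(Vec<f32>),
    Base64(String),
}

fn encode_embedding(values: Vec<f32>, format: Option<&str>) -> EmbeddingValue {
    match format {
        Some("base64") => {
            // Pack the f32 values as little-endian bytes, then base64-encode.
            let mut bytes = Vec::with_capacity(values.len() * 4);
            for v in &values {
                bytes.extend_from_slice(&v.to_le_bytes());
            }
            EmbeddingValue::Base64(STANDARD.encode(bytes))
        }
        // "float" (or an omitted field) falls through to the plain array.
        _ => EmbeddingValue::Float(values),
    }
}
```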
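A sketch of backpressure at the dispatcher boundary, reusing the Phase 2 `WorkerRequest` type: `try_send` fails immediately when the queue is full, and the handler surfaces that as 503 with a `Retry-After` header rather than queueing without bound:
```rust
use axum::http::{header::RETRY_AFTER, HeaderMap, StatusCode};
use tokio::sync::mpsc::{error::TrySendError, Sender};

/// Enqueue a request or reject it immediately when the queue is full.
fn enqueue_or_reject(
    tx: &Sender<WorkerRequest>,
    req: WorkerRequest,
) -> Result<(), (StatusCode, HeaderMap)> {
    match tx.try_send(req) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(_)) => {
            // Overloaded: tell the client when to retry instead of queueing.
            let mut headers = HeaderMap::new();
            headers.insert(RETRY_AFTER, "1".parse().unwrap());
            Err((StatusCode::SERVICE_UNAVAILABLE, headers))
        }
        Err(TrySendError::Closed(_)) => {
            // Dispatcher is gone; this is a server-side failure.
            Err((StatusCode::INTERNAL_SERVER_ERROR, HeaderMap::new()))
        }
    }
}
```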
### Success Criteria
- [ ] Requests process end-to-end
- [ ] Timeouts work correctly
- [ ] Backpressure prevents overload
- [ ] Metrics are collected accurately
- [ ] Error recovery functions properly
### Dependencies
- Phase 3 (API Endpoints)
---
## Phase 5: Integration Testing & Validation
### Objectives
Ensure server functionality with comprehensive integration tests and OpenAI compatibility validation.
### Tasks
- [ ] Set up test infrastructure
- [ ] Create test server helper functions
- [ ] Download test models for CI
- [ ] Configure test environment
- [ ] Set up test fixtures
- [ ] Write API integration tests (see the test sketch after this task list)
- [ ] Test single embedding requests
- [ ] Test batch embedding requests
- [ ] Test error scenarios
- [ ] Test timeout handling
- [ ] Test concurrent requests
- [ ] OpenAI compatibility tests
- [ ] Test with OpenAI Python client
- [ ] Test with OpenAI JavaScript client
- [ ] Validate response format exactly
- [ ] Test streaming if applicable
- [ ] Load testing
- [ ] Use `oha` or `wrk` for load testing
- [ ] Test various concurrency levels
- [ ] Measure latency percentiles
- [ ] Find breaking points
- [ ] Test sustained load
- [ ] Worker pool tests
- [ ] Test worker failures
- [ ] Test model reloading
- [ ] Test graceful shutdown
- [ ] Test resource cleanup
- [ ] End-to-end scenarios
- [ ] RAG pipeline simulation
- [ ] Semantic search workflow
- [ ] High-volume batch processing
- [ ] Mixed workload patterns
- [ ] Performance benchmarks
- [ ] Single request latency
- [ ] Throughput at various batch sizes
- [ ] Memory usage under load
- [ ] CPU utilization patterns
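
A sketch of one API integration test, using `reqwest` and `serde_json` against a server whose base URL comes from an illustrative `EMBELLAMA_TEST_URL` environment variable; the real test infrastructure would instead spawn the server on an ephemeral port via a helper:
```rust
// tests/embeddings_api.rs (sketch)
use serde_json::{json, Value};

#[tokio::test]
async fn single_embedding_request_matches_openai_shape() {
    let base_url = std::env::var("EMBELLAMA_TEST_URL")
        .unwrap_or_else(|_| "http://127.0.0.1:8080".to_string());

    let client = reqwest::Client::new();
    let resp = client
        .post(format!("{base_url}/v1/embeddings"))
        .json(&json!({ "model": "test-model", "input": "hello world" }))
        .send()
        .await
        .expect("request failed");

    assert_eq!(resp.status(), 200);

    let body: Value = resp.json().await.expect("invalid JSON");
    // Spot-check the OpenAI response envelope.
    assert_eq!(body["object"], "list");
    assert_eq!(body["data"][0]["object"], "embedding");
    assert!(body["data"][0]["embedding"].is_array());
    assert!(body["usage"]["total_tokens"].is_number());
}
```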
### Success Criteria
- [ ] All integration tests pass
- [ ] OpenAI clients work seamlessly
- [ ] Performance meets targets
- [ ] No memory leaks under load
- [ ] Graceful degradation verified
### Dependencies
- Phase 4 (Request/Response Pipeline)
---
## Phase 6: Production Features
### Objectives
Add production-ready features for monitoring, operations, and deployment.
### Tasks
- [ ] Implement monitoring endpoints (see the `/metrics` sketch after this task list)
- [ ] Add `/metrics` Prometheus endpoint
- [ ] Export key metrics:
- [ ] Request rate and latency
- [ ] Model inference time
- [ ] Queue depths
- [ ] Worker utilization
- [ ] Error rates
- [ ] Add custom business metrics
- [ ] Add operational endpoints
- [ ] `/admin/reload` - Reload models
- [ ] `/admin/workers` - Worker status
- [ ] `/admin/config` - View configuration
- [ ] `/admin/drain` - Graceful drain
- [ ] Implement configuration management
- [ ] Support environment variables
- [ ] Add configuration file support (YAML/TOML)
- [ ] Hot-reload configuration
- [ ] Validate configuration changes
- [ ] Add deployment features
- [ ] Docker container support
- [ ] Kubernetes readiness/liveness probes
- [ ] Horizontal scaling support
- [ ] Rolling update compatibility
- [ ] Enhance security
- [ ] Add API key authentication
- [ ] Implement rate limiting per client
- [ ] Add request signing/verification
- [ ] Audit logging for requests
- [ ] Implement caching layer (see the cache sketch after this task list)
- [ ] Cache frequent embeddings
- [ ] LRU eviction policy
- [ ] Cache metrics
- [ ] Cache invalidation API
- [ ] Add multi-model support
- [ ] Load multiple models
- [ ] Route by model name
- [ ] Model versioning
- [ ] A/B testing support
- [ ] Production documentation
- [ ] Deployment guide
- [ ] Operations runbook
- [ ] Monitoring setup
- [ ] Troubleshooting guide
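
A minimal hand-rolled `/metrics` sketch that renders a few atomic counters in the Prometheus text exposition format; in practice a metrics crate would likely replace this, and the counter names are illustrative:
```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

use axum::{extract::State, routing::get, Router};

/// Counters shared with handlers and workers (e.g. through `AppState`).
#[derive(Default)]
pub struct Metrics {
    pub requests_total: AtomicU64,
    pub errors_total: AtomicU64,
    pub embeddings_total: AtomicU64,
}

/// GET /metrics in Prometheus text format (counters only, as a sketch).
async fn metrics_handler(State(metrics): State<Arc<Metrics>>) -> String {
    format!(
        "# TYPE embellama_requests_total counter\n\
         embellama_requests_total {}\n\
         # TYPE embellama_errors_total counter\n\
         embellama_errors_total {}\n\
         # TYPE embellama_embeddings_total counter\n\
         embellama_embeddings_total {}\n",
        metrics.requests_total.load(Ordering::Relaxed),
        metrics.errors_total.load(Ordering::Relaxed),
        metrics.embeddings_total.load(Ordering::Relaxed),
    )
}

pub fn metrics_router(metrics: Arc<Metrics>) -> Router {
    Router::new()
        .route("/metrics", get(metrics_handler))
        .with_state(metrics)
}
```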
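A sketch of the embedding cache, assuming the `lru` crate; entries are keyed by model name and input text, which is safe because an embedding is deterministic for a given model and input:
```rust
use std::num::NonZeroUsize;
use std::sync::Mutex;

use lru::LruCache;

/// LRU cache of embeddings keyed by (model name, input text).
/// A Mutex is acceptable here: lookups are cheap compared to inference.
pub struct EmbeddingCache {
    inner: Mutex<LruCache<(String, String), Vec<f32>>>,
}

impl EmbeddingCache {
    pub fn new(capacity: usize) -> Self {
        let cap = NonZeroUsize::new(capacity.max(1)).unwrap();
        EmbeddingCache {
            inner: Mutex::new(LruCache::new(cap)),
        }
    }

    pub fn get(&self, model: &str, text: &str) -> Option<Vec<f32>> {
        let mut cache = self.inner.lock().unwrap();
        cache.get(&(model.to_string(), text.to_string())).cloned()
    }

    pub fn put(&self, model: &str, text: &str, embedding: Vec<f32>) {
        let mut cache = self.inner.lock().unwrap();
        cache.put((model.to_string(), text.to_string()), embedding);
    }
}
```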
### Success Criteria
- [ ] Prometheus metrics exported
- [ ] Docker container runs successfully
- [ ] Configuration hot-reload works
- [ ] Multi-model routing functions
- [ ] Production documentation complete
### Dependencies
- Phase 5 (Integration Testing)
---
## Phase 7: Performance Optimization
### Objectives
Optimize server performance based on profiling and real-world usage patterns.
### Tasks
- [ ] Profile server performance
- [ ] CPU profiling with flamegraph
- [ ] Memory profiling with heaptrack
- [ ] Async runtime analysis
- [ ] Channel contention analysis
- [ ] Optimize request routing
- [ ] Implement smart load balancing
- [ ] Add work-stealing queue
- [ ] Optimize channel implementations
- [ ] Reduce context switching
- [ ] Improve batching efficiency (see the batching sketch after this task list)
- [ ] Dynamic batch aggregation
- [ ] Adaptive batch sizing
- [ ] Optimize batch timeout
- [ ] Coalesce small requests
- [ ] Optimize serialization
- [ ] Use faster JSON library (simd-json)
- [ ] Implement zero-copy where possible
- [ ] Optimize base64 encoding
- [ ] Buffer pool for responses
- [ ] Network optimizations
- [ ] TCP tuning parameters
- [ ] HTTP/2 support
- [ ] Keep-alive optimization
- [ ] Response streaming
- [ ] Worker pool tuning
- [ ] Optimal worker count discovery
- [ ] CPU affinity settings
- [ ] NUMA awareness
- [ ] Memory pooling
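
A sketch of dynamic batch aggregation: after the first request arrives, the dispatcher keeps draining its queue until the batch is full or a short deadline passes, so small requests coalesce without adding much latency; the function is generic over the request type and the limits would be tuned from profiling:
```rust
use std::time::{Duration, Instant};

use tokio::sync::mpsc::Receiver;

/// Collect up to `max_batch` items, waiting at most `max_wait` after the first one.
async fn next_batch<T>(
    rx: &mut Receiver<T>,
    max_batch: usize,
    max_wait: Duration,
) -> Option<Vec<T>> {
    // Block until at least one item is available (None means the channel closed).
    let first = rx.recv().await?;
    let mut batch = vec![first];
    let deadline = Instant::now() + max_wait;

    while batch.len() < max_batch {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match tokio::time::timeout(remaining, rx.recv()).await {
            Ok(Some(item)) => batch.push(item),
            // Deadline passed or channel closed: ship what we have.
            _ => break,
        }
    }
    Some(batch)
}
```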
### Success Criteria
- [ ] 30% latency improvement
- [ ] 50% throughput increase
- [ ] Reduced memory footprint
- [ ] Better resource utilization
### Dependencies
- Phase 6 (Production Features)
---
## Implementation Notes
### Critical Threading Architecture
```rust
// Server architecture to handle the !Send constraint:
// 1. Main thread: Axum server with tokio runtime
// 2. Worker threads: each owns a LlamaContext
// 3. Communication: channels only, no shared state

// AppState (Send + Sync + Clone)
struct AppState {
    dispatcher_tx: mpsc::Sender<WorkerRequest>,
    config: Arc<ServerConfig>,
    metrics: Arc<Metrics>,
}

// Worker (runs on a dedicated thread)
struct Worker {
    id: usize,
    model: EmbeddingModel, // !Send - stays on this thread
    receiver: mpsc::Receiver<WorkerRequest>,
}

// Request flow:
// HTTP handler -> Dispatcher -> Worker -> Response channel -> HTTP response
```
### Deployment Considerations
1. **Memory Requirements**
- Each worker needs full model in memory
- Total RAM = `num_workers × model_size + overhead`
- Example: 4 workers × 1.5GB model = 6GB minimum
2. **CPU Considerations**
- Workers should not exceed physical cores
- Leave cores for tokio runtime
- Recommended: `workers = physical_cores - 2`
3. **Scaling Strategy**
- Vertical: Add more workers (limited by RAM)
- Horizontal: Multiple server instances
- Use load balancer for horizontal scaling
### Monitoring Key Metrics
1. **Latency Metrics**
- Request processing time (P50, P95, P99)
- Model inference time
- Queue wait time
- Total end-to-end time
2. **Throughput Metrics**
- Requests per second
- Embeddings per second
- Batch size distribution
- Worker utilization
3. **Health Metrics**
- Worker status
- Queue depths
- Error rates
- Memory usage
### Testing Strategy
1. **Unit Tests**: Test individual components
2. **Integration Tests**: Test API endpoints
3. **Load Tests**: Verify performance under load
4. **Chaos Tests**: Test failure scenarios
5. **Compatibility Tests**: Verify OpenAI compatibility
## Success Metrics
### Performance Targets
- [ ] Single request: <100ms P99 latency
- [ ] Batch requests: >1000 embeddings/second
- [ ] Concurrent requests: Linear scaling with workers
- [ ] Memory: <2x model size per worker
- [ ] Startup time: <10 seconds
### Reliability Targets
- [ ] 99.9% uptime
- [ ] Graceful degradation under load
- [ ] Zero data loss
- [ ] Clean shutdown/restart
- [ ] Automatic recovery from failures
### Compatibility Targets
- [ ] 100% OpenAI API compatibility
- [ ] Works with all OpenAI client libraries
- [ ] Supports common embedding models
- [ ] Drop-in replacement capability