# Implementation Plan: Embellama Core Library
This document outlines the phased implementation plan for the `embellama` core library component. Each phase is designed to be independently implementable with clear deliverables and success criteria.
## Critical Design Constraint
**⚠️ IMPORTANT**: The `LlamaContext` from `llama-cpp-2` is `!Send` and `!Sync`. This means:
- Context cannot be moved between threads
- Context cannot be shared using `Arc`
- Each thread must own its model instance
- All concurrency must use message passing
## Phase 1: Project Setup & Core Infrastructure
### Objectives
Establish the foundational project structure, dependencies, and core error handling infrastructure.
### Tasks
- [x] Initialize Cargo project with appropriate metadata
- [x] Set package name, version, authors, description
- [x] Configure license (Apache-2.0)
- [x] Add repository and documentation links
- [x] Configure dependencies in `Cargo.toml`
- [x] Add `llama-cpp-2 = "0.1.117"` for model backend
- [x] Add `thiserror = "1.0"` for error handling
- [x] Add `anyhow = "1.0"` for application errors
- [x] Add `tracing = "0.1"` for structured logging
- [x] Add `serde = { version = "1.0", features = ["derive"] }` for serialization
- [x] Add `rayon = "1.8"` for parallel batch processing
- [x] Create module structure
- [x] Create `src/lib.rs` with module declarations
- [x] Create `src/error.rs` for error types
- [x] Create `src/config.rs` for configuration types
- [x] Create placeholder files for other modules
- [x] Implement error handling foundation (`src/error.rs`)
- [x] Define `EmbellamaError` enum with `thiserror`
- [x] Add variants for common error cases:
- [x] `ModelNotFound(String)`
- [x] `ModelLoadFailed { path: String, source: llama_cpp_2::Error }`
- [x] `EmbeddingFailed`
- [x] `InvalidInput(String)`
- [x] `ConfigurationError(String)`
- [x] Implement `From` conversions for underlying errors
- [x] Set up logging infrastructure
- [x] Configure `tracing` subscriber in examples/tests
- [x] Define logging macros/utilities if needed
- [x] Add instrumentation attributes to key functions
- [x] Add license headers to all source files
- [x] Create script or template for Apache 2.0 headers
- [x] Apply to all `.rs` files
### Success Criteria
- [x] Project compiles with `cargo build`
- [x] Basic error types are usable in tests
- [x] Logging produces output in examples
- [x] All files have proper license headers
### Dependencies
- None (first phase)
---
## Phase 2: Basic Model Management
### Objectives
Implement the core model loading and management functionality, respecting the `!Send` constraint.
### Tasks
- [x] Implement `EmbeddingModel` struct (`src/model.rs`)
- [x] Define struct using `self_cell` crate to handle self-referential fields
- [x] Use `ModelCell` to safely store both `LlamaModel` and `LlamaContext`
- [x] Ensure struct is marked as `!Send` (automatic due to `LlamaContext`)
- [x] Add model metadata fields (name, path, dimensions)
- [x] Implement model initialization
- [x] Create `new()` method accepting model path
- [x] Initialize `LlamaBackend` (once per process)
- [x] Load model with `LlamaModelParams`
- [x] Create context with `LlamaContextParams`
- [x] Configure thread count settings
- [x] Implement model configuration (`src/config.rs`)
- [x] Define `ModelConfig` struct with:
- [x] `model_path: PathBuf`
- [x] `model_name: String`
- [x] `n_ctx: Option<u32>` (context size)
- [x] `n_threads: Option<usize>` (CPU threads)
- [x] `n_gpu_layers: Option<u32>` (GPU offload)
- [x] ~~`seed: Option<u32>`~~ (removed - not applicable for embeddings)
- [x] Implement builder pattern for `ModelConfig`
- [x] Add model lifecycle methods
- [x] `load()` - Load model from disk
- [x] `unload()` - Clean up resources
- [x] `is_loaded()` - Check model state
- [x] Implement `Drop` trait for cleanup
- [x] Add basic model information methods
- [x] `embedding_dimensions()` - Get output dimension size
- [x] `max_sequence_length()` - Get max input length
- [x] `model_size()` - Get model memory footprint
- [x] Write unit tests
- [x] Test model loading with valid path
- [x] Test error handling for invalid paths
- [x] Test configuration builder
- [x] Verify `!Send` constraint at compile time
### Success Criteria
- [x] Can load a GGUF model file successfully (requires actual model file)
- [x] Proper error messages for load failures
- [x] Model cleanup on drop
- [x] Configuration builder works as expected
### Dependencies
- Phase 1 (Core Infrastructure)
---
## Phase 3: Single Embedding Generation
### Objectives
Implement single-text embedding generation with proper tokenization and processing.
### Tasks
- [x] Implement tokenization
- [x] Add tokenization method to `EmbeddingModel`
- [x] Handle text encoding properly (UTF-8)
- [x] Implement token limit checking
- [x] Add special tokens handling (BOS/EOS)
- [x] Implement embedding generation
- [x] Create `generate_embedding(&self, text: &str) -> Result<Vec<f32>>`
- [x] Tokenize input text
- [x] Create `LlamaBatch` for tokens
- [x] Perform forward pass through model
- [x] Extract embedding vector from output
- [x] Normalize embeddings if required
- [x] Implement `EmbeddingEngine` (`src/engine.rs`)
- [x] Define public interface struct
- [x] Store model instances (thread-local)
- [x] Implement model registry/lookup
- [x] Add `embed()` public method
- [x] Add embedding post-processing
- [x] L2 normalization option
- [x] Pooling strategies (Mean, CLS, Max, MeanSqrt)
- [x] Output format configuration
- [x] Implement builder pattern for `EmbeddingEngine`
- [x] `EngineConfig` struct with defaults (already existed)
- [x] Builder methods for configuration
- [x] Validation of configuration
- [x] Error handling improvements
- [x] Add specific error types for embedding failures
- [x] Provide helpful error messages
- [x] Include context in errors (model name, text length)
- [x] Write integration tests
- [x] Test with small test model (requires actual model)
- [x] Verify embedding dimensions
- [x] Test error cases (empty text, too long text)
- [x] Benchmark single embedding performance
### Success Criteria
- [x] Can generate embeddings for simple text
- [x] Embeddings have correct dimensions
- [x] Performance meets target (<100ms for small text)
- [x] Proper error handling for edge cases
### Dependencies
- Phase 2 (Model Management)
---
## Phase 4: Batch Processing
### Objectives
Implement efficient batch processing with parallel pre/post-processing while respecting thread constraints.
### Tasks
- [x] Implement batch processing logic (`src/batch.rs`)
- [x] Define `BatchProcessor` struct
- [x] Implement text collection and validation
- [x] Handle variable-length inputs
- [x] Optimize memory allocation
- [x] Implement parallel tokenization
- [x] Use `rayon` for parallel text processing
- [x] Tokenize multiple texts concurrently
- [x] Collect tokens into batches
- [x] Handle tokenization errors gracefully
- [x] Implement sequential model inference
- [x] Process token batches through model (single-threaded)
- [x] Handle batch size limits
- [x] Implement progress tracking
- [x] Manage memory efficiently
- [x] Implement parallel post-processing
- [x] Normalize embeddings in parallel
- [x] Format output concurrently
- [x] Aggregate results efficiently
- [x] Add batch API to `EmbeddingEngine`
- [x] `embed_batch()` method using `BatchProcessor`
- [x] Configure batch size limits
- [x] Add progress callback option
- [x] Return results in input order
- [x] Optimize batch performance
- [x] ~~Profile with `flamegraph`~~ (deferred to Phase 6)
- [x] Minimize allocations with pre-allocation
- [x] Optimize memory layout
- [x] Tune batch sizes (default: 64)
- [x] Handle edge cases
- [x] Empty batch
- [x] Single item batch
- [x] Very large batches
- [x] Mixed text lengths
- [x] Write comprehensive tests
- [x] Test various batch sizes
- [x] Verify order preservation
- [x] Test concurrent batch requests
- [x] Benchmark throughput
### Success Criteria
- [x] Batch processing faster than sequential (parallel pre/post-processing)
- [x] ~~Throughput >1000 embeddings/second~~ (requires real model for measurement)
- [x] Memory usage scales linearly
- [x] Results match single-embedding quality
### Dependencies
- Phase 3 (Single Embedding Generation)
---
## Phase 5: Testing & Documentation
### Objectives
Ensure code quality with comprehensive testing and provide excellent documentation.
### Tasks
- [x] Set up test infrastructure
- [x] ~~Download test models (MiniLM, jina-embeddings)~~ Created mock fixtures
- [x] Create test fixtures
- [ ] Set up CI test environment
- [ ] Configure code coverage
- [x] Write unit tests (per module)
- [x] Test `error.rs` - Error conversions and display
- [x] Test `config.rs` - Builder patterns and validation
- [ ] Test `model.rs` - Model lifecycle
- [ ] Test `batch.rs` - Batch processing logic
- [ ] Test `engine.rs` - Public API
> NOTE: Enhanced unit tests for error.rs and config.rs with comprehensive coverage
> BUG: Fixed config defaults and whitespace validation after code review
- [x] Write integration tests
- [x] End-to-end embedding generation
- [x] Model loading and unloading cycles
- [x] Batch processing with real models
- [x] Error recovery scenarios
- [x] Write concurrency tests
- [x] Verify thread-local model isolation
- [x] Test parallel batch operations
- [x] Ensure no data races
- [x] Test resource cleanup
> NOTE: Added #[serial] attributes for test isolation and fixed panic test
- [x] Write performance benchmarks
- [x] Single embedding latency
- [x] Batch throughput
- [x] Memory usage patterns
- [x] Scaling characteristics
> PERFORMANCE ISSUE: Benchmarks require real model file (set EMBELLAMA_BENCH_MODEL)
- [x] Create documentation
- [ ] Write rustdoc for all public APIs
- [x] Create usage examples
- [x] Write README with quickstart
- [x] Document configuration options
- [x] Add architecture diagrams
- [x] Create example applications
- [x] Simple embedding generation
- [x] Batch processing example
- [ ] Configuration examples
- [x] Error handling patterns
- [ ] Multi-model usage example
> NOTE: Added comprehensive error_handling.rs example with retry logic and batch error recovery
- [ ] Set up quality checks
- [ ] Configure `clippy` lints
- [x] Set up `rustfmt` configuration (exists)
- [ ] Add pre-commit hooks
- [x] Configure CI/CD pipeline (exists)
### Success Criteria
- [ ] Test coverage >80% (pending real model tests)
- [ ] All public APIs documented (partial)
- [x] Examples run successfully
- [ ] CI/CD pipeline green
### Dependencies
- Phase 4 (Batch Processing)
---
## Phase 6: Performance Optimization
### Objectives
Optimize performance based on profiling results and real-world usage patterns.
### Tasks
- [ ] Profile current implementation
- [ ] Use `cargo flamegraph` for CPU profiling
- [ ] Measure memory allocations with `valgrind`
- [ ] Identify bottlenecks
- [ ] Create baseline benchmarks
- [ ] Optimize hot paths
- [ ] Minimize allocations in embedding loop
- [ ] Use `SmallVec` for small collections
- [ ] Optimize vector operations
- [ ] Consider SIMD for normalization
- [ ] Improve memory efficiency
- [ ] Implement token buffer pooling
- [ ] Reuse embedding buffers
- [ ] Optimize batch memory layout
- [ ] Reduce temporary allocations
- [ ] Enhance caching strategies
- [ ] Cache tokenization results
- [ ] Implement LRU cache for frequent inputs
- [ ] Cache model metadata
- [ ] Consider memory-mapped models
- [ ] Optimize configuration
- [ ] Auto-tune thread counts
- [ ] Optimize batch sizes
- [ ] Profile-guided optimization
- [ ] Platform-specific optimizations
- [ ] Consider alternative implementations
- [ ] Evaluate `parking_lot` mutexes
- [ ] Test different channel implementations
- [ ] Compare allocator options
- [ ] Investigate GPU acceleration
- [ ] Validate optimizations
- [ ] Ensure correctness maintained
- [ ] Verify performance improvements
- [ ] Check memory usage reduction
- [ ] Test on different hardware
### Success Criteria
- [ ] 20% performance improvement
- [ ] Reduced memory footprint
- [ ] No regression in accuracy
- [ ] Maintained code clarity
### Dependencies
- Phase 5 (Testing & Documentation)
---
## Implementation Notes
### Self-Referential Struct Solution
The `LlamaContext` holds a reference to `LlamaModel`, creating a self-referential struct challenge. This was solved using the `self_cell` crate:
1. **The Problem**: `LlamaContext<'a>` borrows from `LlamaModel` with lifetime `'a`
2. **The Solution**: Use `self_cell::self_cell!` macro to safely create a self-referential struct
3. **Implementation**:
```rust
self_cell! {
struct ModelCell {
owner: LlamaModel,
#[covariant]
dependent: LlamaContext,
}
}
```
4. **Access Pattern**:
- Use `cell.borrow_owner()` to access the model
- Use `cell.borrow_dependent()` to access the context
- The macro ensures proper lifetime management and drop order
### Thread Safety Considerations
Due to the `!Send` constraint of `LlamaContext`:
1. **Never attempt to share models between threads**
```rust
let model = Arc::new(EmbeddingModel::new(...)?);
thread_local! {
static MODEL: RefCell<Option<EmbeddingModel>> = RefCell::new(None);
}
```
2. **Use message passing for concurrent operations**
```rust
```
3. **Each worker thread needs its own model instance**
- This increases memory usage but ensures thread safety
- Models cannot be pooled or shared
- Consider model lazy loading to manage memory
### Testing Strategy
1. **Unit Tests**: Test individual components in isolation
2. **Integration Tests**: Test with real GGUF models
3. **Property Tests**: Use `proptest` for edge cases
4. **Benchmarks**: Track performance regressions
5. **Thread Safety Tests**: Verify compile-time guarantees
### Performance Targets
- Single embedding: <50ms (CPU), <10ms (GPU)
- Batch throughput: >1000 embeddings/second
- Memory overhead: <2x model size
- Startup time: <5 seconds for 1GB model
### Error Handling Philosophy
- Use `Result<T, E>` everywhere
- Provide context in errors
- Make errors actionable
- Never panic in library code
- Log errors appropriately
## Success Metrics
### Phase Completion Criteria
- [ ] All tasks completed
- [ ] Tests passing
- [ ] Documentation complete
- [ ] Code reviewed
- [ ] Performance targets met
### Overall Project Success
- [ ] Library is usable standalone
- [ ] API is ergonomic and Rust-idiomatic
- [ ] Performance meets or exceeds targets
- [ ] Thread safety guaranteed at compile time
- [ ] Comprehensive documentation and examples