embellama 0.5.0

High-performance Rust library for generating text embeddings using llama.cpp via the `llama-cpp-2` bindings
# Implementation Plan: Embellama Core Library

This document outlines the phased implementation plan for the `embellama` core library component. Each phase is designed to be independently implementable with clear deliverables and success criteria.

## Critical Design Constraint

**⚠️ IMPORTANT**: The `LlamaContext` from `llama-cpp-2` is `!Send` and `!Sync`. This means:
- Context cannot be moved between threads
- Context cannot be shared using `Arc`
- Each thread must own its model instance
- All concurrency must use message passing

## Phase 1: Project Setup & Core Infrastructure
**Priority: CRITICAL** | **Estimated Time: 2-4 hours**

### Objectives
Establish the foundational project structure, dependencies, and core error handling infrastructure.

### Tasks
- [x] Initialize Cargo project with appropriate metadata
  - [x] Set package name, version, authors, description
  - [x] Configure license (Apache-2.0)
  - [x] Add repository and documentation links

- [x] Configure dependencies in `Cargo.toml`
  - [x] Add `llama-cpp-2 = "0.1.117"` for model backend
  - [x] Add `thiserror = "1.0"` for error handling
  - [x] Add `anyhow = "1.0"` for application errors
  - [x] Add `tracing = "0.1"` for structured logging
  - [x] Add `serde = { version = "1.0", features = ["derive"] }` for serialization
  - [x] Add `rayon = "1.8"` for parallel batch processing

- [x] Create module structure
  - [x] Create `src/lib.rs` with module declarations
  - [x] Create `src/error.rs` for error types
  - [x] Create `src/config.rs` for configuration types
  - [x] Create placeholder files for other modules

- [x] Implement error handling foundation (`src/error.rs`)
  - [x] Define `EmbellamaError` enum with `thiserror`
  - [x] Add variants for common error cases:
    - [x] `ModelNotFound(String)`
    - [x] `ModelLoadFailed { path: String, source: llama_cpp_2::Error }`
    - [x] `EmbeddingFailed`
    - [x] `InvalidInput(String)`
    - [x] `ConfigurationError(String)`
  - [x] Implement `From` conversions for underlying errors
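The variants above can be sketched as follows. This is a hand-rolled, std-only equivalent of what the `thiserror` derives would generate; the `reason: String` field stands in for the real `source: llama_cpp_2::Error`, which would be carried via `#[from]`/`#[source]` attributes in the actual crate.

```rust
use std::fmt;

// Sketch of the error enum described above. The real implementation uses
// `#[derive(thiserror::Error)]`; here Display and Error are written out by
// hand so the example is self-contained.
#[derive(Debug)]
pub enum EmbellamaError {
    ModelNotFound(String),
    // In the real crate this carries `source: llama_cpp_2::Error` instead
    // of a plain string (simplified here for illustration).
    ModelLoadFailed { path: String, reason: String },
    EmbeddingFailed,
    InvalidInput(String),
    ConfigurationError(String),
}

impl fmt::Display for EmbellamaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::ModelNotFound(name) => write!(f, "model not found: {name}"),
            Self::ModelLoadFailed { path, reason } => {
                write!(f, "failed to load model at {path}: {reason}")
            }
            Self::EmbeddingFailed => write!(f, "embedding generation failed"),
            Self::InvalidInput(msg) => write!(f, "invalid input: {msg}"),
            Self::ConfigurationError(msg) => write!(f, "configuration error: {msg}"),
        }
    }
}

impl std::error::Error for EmbellamaError {}
```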

- [x] Set up logging infrastructure
  - [x] Configure `tracing` subscriber in examples/tests
  - [x] Define logging macros/utilities if needed
  - [x] Add instrumentation attributes to key functions

- [x] Add license headers to all source files
  - [x] Create script or template for Apache 2.0 headers
  - [x] Apply to all `.rs` files

### Success Criteria
- [x] Project compiles with `cargo build`
- [x] Basic error types are usable in tests
- [x] Logging produces output in examples
- [x] All files have proper license headers

### Dependencies
- None (first phase)

---

## Phase 2: Basic Model Management
**Priority: HIGH** | **Estimated Time: 4-6 hours**

### Objectives
Implement the core model loading and management functionality, respecting the `!Send` constraint.

### Tasks
- [x] Implement `EmbeddingModel` struct (`src/model.rs`)
  - [x] Define struct using `self_cell` crate to handle self-referential fields
  - [x] Use `ModelCell` to safely store both `LlamaModel` and `LlamaContext`
  - [x] Ensure struct is marked as `!Send` (automatic due to `LlamaContext`)
  - [x] Add model metadata fields (name, path, dimensions)

- [x] Implement model initialization
  - [x] Create `new()` method accepting model path
  - [x] Initialize `LlamaBackend` (once per process)
  - [x] Load model with `LlamaModelParams`
  - [x] Create context with `LlamaContextParams`
  - [x] Configure thread count settings

- [x] Implement model configuration (`src/config.rs`)
  - [x] Define `ModelConfig` struct with:
    - [x] `model_path: PathBuf`
    - [x] `model_name: String`
    - [x] `n_ctx: Option<u32>` (context size)
    - [x] `n_threads: Option<usize>` (CPU threads)
    - [x] `n_gpu_layers: Option<u32>` (GPU offload)
    - [x] ~~`seed: Option<u32>`~~ (removed - not applicable for embeddings)
  - [x] Implement builder pattern for `ModelConfig`
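The builder pattern for `ModelConfig` might look like the sketch below. The field set is taken from this plan; method names and validation behavior are illustrative, not the crate's exact API.

```rust
use std::path::PathBuf;

// Config struct with the fields listed in this plan (sketch).
#[derive(Debug, Clone)]
pub struct ModelConfig {
    pub model_path: PathBuf,
    pub model_name: String,
    pub n_ctx: Option<u32>,
    pub n_threads: Option<usize>,
    pub n_gpu_layers: Option<u32>,
}

#[derive(Default)]
pub struct ModelConfigBuilder {
    model_path: Option<PathBuf>,
    model_name: Option<String>,
    n_ctx: Option<u32>,
    n_threads: Option<usize>,
    n_gpu_layers: Option<u32>,
}

impl ModelConfigBuilder {
    pub fn model_path(mut self, p: impl Into<PathBuf>) -> Self {
        self.model_path = Some(p.into());
        self
    }
    pub fn model_name(mut self, n: impl Into<String>) -> Self {
        self.model_name = Some(n.into());
        self
    }
    pub fn n_ctx(mut self, n: u32) -> Self {
        self.n_ctx = Some(n);
        self
    }
    pub fn n_threads(mut self, n: usize) -> Self {
        self.n_threads = Some(n);
        self
    }
    pub fn n_gpu_layers(mut self, n: u32) -> Self {
        self.n_gpu_layers = Some(n);
        self
    }
    // Build fails fast on missing required fields instead of panicking.
    pub fn build(self) -> Result<ModelConfig, String> {
        Ok(ModelConfig {
            model_path: self.model_path.ok_or("model_path is required")?,
            model_name: self.model_name.ok_or("model_name is required")?,
            n_ctx: self.n_ctx,
            n_threads: self.n_threads,
            n_gpu_layers: self.n_gpu_layers,
        })
    }
}
```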

- [x] Add model lifecycle methods
  - [x] `load()` - Load model from disk
  - [x] `unload()` - Clean up resources
  - [x] `is_loaded()` - Check model state
  - [x] Implement `Drop` trait for cleanup

- [x] Add basic model information methods
  - [x] `embedding_dimensions()` - Get output dimension size
  - [x] `max_sequence_length()` - Get max input length
  - [x] `model_size()` - Get model memory footprint

- [x] Write unit tests
  - [x] Test model loading with valid path
  - [x] Test error handling for invalid paths
  - [x] Test configuration builder
  - [x] Verify `!Send` constraint at compile time
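The compile-time `!Send` check mentioned above can be done with a generic function bound on `Send`; the `FakeContext` type here is a hypothetical stand-in that is `!Send` for the same reason `LlamaContext` is (it contains a raw pointer).

```rust
use std::marker::PhantomData;

// Hypothetical stand-in: a raw pointer field makes the type `!Send`,
// mirroring the constraint on `LlamaContext`.
#[allow(dead_code)]
struct FakeContext(PhantomData<*mut ()>);

// A function with a `Send` bound doubles as a compile-time assertion.
fn assert_send<T: Send>() {}

// assert_send::<Vec<f32>>();    // compiles: Vec<f32> is Send
// assert_send::<FakeContext>(); // ❌ rejected by the compiler: *mut () is !Send
```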

### Success Criteria
- [x] Can load a GGUF model file successfully (requires actual model file)
- [x] Proper error messages for load failures
- [x] Model cleanup on drop
- [x] Configuration builder works as expected

### Dependencies
- Phase 1 (Core Infrastructure)

---

## Phase 3: Single Embedding Generation
**Priority: HIGH** | **Estimated Time: 6-8 hours**

### Objectives
Implement single-text embedding generation with proper tokenization and processing.

### Tasks
- [x] Implement tokenization
  - [x] Add tokenization method to `EmbeddingModel`
  - [x] Handle text encoding properly (UTF-8)
  - [x] Implement token limit checking
  - [x] Add special tokens handling (BOS/EOS)
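The token-limit check can be sketched independently of the backend; the function name and error strings below are illustrative, not the crate's actual API.

```rust
// Validate a tokenized input against the context size before inference
// (sketch; the real crate tokenizes via llama-cpp-2 first).
fn check_token_limit(tokens: &[u32], n_ctx: usize) -> Result<(), String> {
    if tokens.is_empty() {
        return Err("input produced no tokens".into());
    }
    if tokens.len() > n_ctx {
        return Err(format!(
            "input has {} tokens, exceeding the context size of {}",
            tokens.len(),
            n_ctx
        ));
    }
    Ok(())
}
```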

- [x] Implement embedding generation
  - [x] Create `generate_embedding(&self, text: &str) -> Result<Vec<f32>>`
  - [x] Tokenize input text
  - [x] Create `LlamaBatch` for tokens
  - [x] Perform forward pass through model
  - [x] Extract embedding vector from output
  - [x] Normalize embeddings if required

- [x] Implement `EmbeddingEngine` (`src/engine.rs`)
  - [x] Define public interface struct
  - [x] Store model instances (thread-local)
  - [x] Implement model registry/lookup
  - [x] Add `embed()` public method

- [x] Add embedding post-processing
  - [x] L2 normalization option
  - [x] Pooling strategies (Mean, CLS, Max, MeanSqrt)
  - [x] Output format configuration
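The mean-pooling and L2-normalization steps above can be sketched in plain Rust. Function names are illustrative; the other pooling strategies (CLS, Max, MeanSqrt) follow the same shape with a different reduction.

```rust
// Mean pooling: average the per-token hidden states into one vector.
// Assumes at least one token embedding is present.
fn mean_pool(token_embeddings: &[Vec<f32>]) -> Vec<f32> {
    let dim = token_embeddings[0].len();
    let mut pooled = vec![0.0f32; dim];
    for tok in token_embeddings {
        for (p, x) in pooled.iter_mut().zip(tok) {
            *p += x;
        }
    }
    let n = token_embeddings.len() as f32;
    pooled.iter_mut().for_each(|p| *p /= n);
    pooled
}

// L2 normalization: scale the vector to unit length (no-op on zero vectors).
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}
```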

- [x] Implement builder pattern for `EmbeddingEngine`
  - [x] `EngineConfig` struct with defaults (already existed)
  - [x] Builder methods for configuration
  - [x] Validation of configuration

- [x] Error handling improvements
  - [x] Add specific error types for embedding failures
  - [x] Provide helpful error messages
  - [x] Include context in errors (model name, text length)

- [x] Write integration tests
  - [x] Test with small test model (requires actual model)
  - [x] Verify embedding dimensions
  - [x] Test error cases (empty text, too long text)
  - [x] Benchmark single embedding performance

### Success Criteria
- [x] Can generate embeddings for simple text
- [x] Embeddings have correct dimensions
- [x] Performance meets target (<100ms for small text)
- [x] Proper error handling for edge cases

### Dependencies
- Phase 2 (Model Management)

---

## Phase 4: Batch Processing
**Priority: MEDIUM** | **Estimated Time: 8-10 hours**

### Objectives
Implement efficient batch processing with parallel pre/post-processing while respecting thread constraints.

### Tasks
- [x] Implement batch processing logic (`src/batch.rs`)
  - [x] Define `BatchProcessor` struct
  - [x] Implement text collection and validation
  - [x] Handle variable-length inputs
  - [x] Optimize memory allocation

- [x] Implement parallel tokenization
  - [x] Use `rayon` for parallel text processing
  - [x] Tokenize multiple texts concurrently
  - [x] Collect tokens into batches
  - [x] Handle tokenization errors gracefully

- [x] Implement sequential model inference
  - [x] Process token batches through model (single-threaded)
  - [x] Handle batch size limits
  - [x] Implement progress tracking
  - [x] Manage memory efficiently

- [x] Implement parallel post-processing
  - [x] Normalize embeddings in parallel
  - [x] Format output concurrently
  - [x] Aggregate results efficiently

- [x] Add batch API to `EmbeddingEngine`
  - [x] `embed_batch()` method using `BatchProcessor`
  - [x] Configure batch size limits
  - [x] Add progress callback option
  - [x] Return results in input order
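The parallel–sequential–parallel pipeline described in this phase can be sketched as below. The tokenize/infer closures are stand-ins for the real model calls; in the actual crate the pre/post stages use rayon's `par_iter()` where this sketch uses plain `iter()`, while the inference loop must stay on one thread.

```rust
// Batch pipeline sketch: parallelizable pre-processing, sequential
// inference, parallelizable post-processing. Order is preserved because
// every stage maps over the inputs in order.
fn embed_batch(
    texts: &[String],
    tokenize: impl Fn(&str) -> Vec<u32> + Sync,
    mut infer: impl FnMut(&[u32]) -> Vec<f32>,
) -> Vec<Vec<f32>> {
    // Stage 1: tokenize (rayon `par_iter()` in the real implementation).
    let token_batches: Vec<Vec<u32>> = texts.iter().map(|t| tokenize(t)).collect();

    // Stage 2: inference stays single-threaded (LlamaContext is !Send).
    let mut raw: Vec<Vec<f32>> = Vec::with_capacity(token_batches.len());
    for tokens in &token_batches {
        raw.push(infer(tokens));
    }

    // Stage 3: post-process (L2-normalize) — also parallelizable.
    raw.iter_mut().for_each(|v| {
        let n = v.iter().map(|x| x * x).sum::<f32>().sqrt();
        if n > 0.0 {
            v.iter_mut().for_each(|x| *x /= n);
        }
    });
    raw // same order as `texts`
}
```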

- [x] Optimize batch performance
  - [x] ~~Profile with `flamegraph`~~ (deferred to Phase 6)
  - [x] Minimize allocations with pre-allocation
  - [x] Optimize memory layout
  - [x] Tune batch sizes (default: 64)

- [x] Handle edge cases
  - [x] Empty batch
  - [x] Single item batch
  - [x] Very large batches
  - [x] Mixed text lengths

- [x] Write comprehensive tests
  - [x] Test various batch sizes
  - [x] Verify order preservation
  - [x] Test concurrent batch requests
  - [x] Benchmark throughput

### Success Criteria
- [x] Batch processing faster than sequential (parallel pre/post-processing)
- [x] ~~Throughput >1000 embeddings/second~~ (requires real model for measurement)
- [x] Memory usage scales linearly
- [x] Results match single-embedding quality

### Dependencies
- Phase 3 (Single Embedding Generation)

---

## Phase 5: Testing & Documentation
**Priority: MEDIUM** | **Estimated Time: 6-8 hours**

### Objectives
Ensure code quality with comprehensive testing and provide excellent documentation.

### Tasks
- [x] Set up test infrastructure
  - [x] ~~Download test models (MiniLM, jina-embeddings)~~ Created mock fixtures
  - [x] Create test fixtures
  - [ ] Set up CI test environment
  - [ ] Configure code coverage

- [x] Write unit tests (per module)
  - [x] Test `error.rs` - Error conversions and display
  - [x] Test `config.rs` - Builder patterns and validation
  - [ ] Test `model.rs` - Model lifecycle
  - [ ] Test `batch.rs` - Batch processing logic
  - [ ] Test `engine.rs` - Public API
  > NOTE: Enhanced unit tests for error.rs and config.rs with comprehensive coverage
  > BUG: Fixed config defaults and whitespace validation after code review

- [x] Write integration tests
  - [x] End-to-end embedding generation
  - [x] Model loading and unloading cycles
  - [x] Batch processing with real models
  - [x] Error recovery scenarios

- [x] Write concurrency tests
  - [x] Verify thread-local model isolation
  - [x] Test parallel batch operations
  - [x] Ensure no data races
  - [x] Test resource cleanup
  > NOTE: Added #[serial] attributes for test isolation and fixed panic test

- [x] Write performance benchmarks
  - [x] Single embedding latency
  - [x] Batch throughput
  - [x] Memory usage patterns
  - [x] Scaling characteristics
  > PERFORMANCE ISSUE: Benchmarks require real model file (set EMBELLAMA_BENCH_MODEL)

- [x] Create documentation
  - [ ] Write rustdoc for all public APIs
  - [x] Create usage examples
  - [x] Write README with quickstart
  - [x] Document configuration options
  - [x] Add architecture diagrams

- [x] Create example applications
  - [x] Simple embedding generation
  - [x] Batch processing example
  - [ ] Configuration examples
  - [x] Error handling patterns
  - [ ] Multi-model usage example
  > NOTE: Added comprehensive error_handling.rs example with retry logic and batch error recovery

- [ ] Set up quality checks
  - [ ] Configure `clippy` lints
  - [x] Set up `rustfmt` configuration (exists)
  - [ ] Add pre-commit hooks
  - [x] Configure CI/CD pipeline (exists)

### Success Criteria
- [ ] Test coverage >80% (pending real model tests)
- [ ] All public APIs documented (partial)
- [x] Examples run successfully
- [ ] CI/CD pipeline green

### Dependencies
- Phase 4 (Batch Processing)

---

## Phase 6: Performance Optimization
**Priority: LOW** | **Estimated Time: 4-6 hours**

### Objectives
Optimize performance based on profiling results and real-world usage patterns.

### Tasks
- [ ] Profile current implementation
  - [ ] Use `cargo flamegraph` for CPU profiling
  - [ ] Measure memory allocations with `valgrind`
  - [ ] Identify bottlenecks
  - [ ] Create baseline benchmarks

- [ ] Optimize hot paths
  - [ ] Minimize allocations in embedding loop
  - [ ] Use `SmallVec` for small collections
  - [ ] Optimize vector operations
  - [ ] Consider SIMD for normalization

- [ ] Improve memory efficiency
  - [ ] Implement token buffer pooling
  - [ ] Reuse embedding buffers
  - [ ] Optimize batch memory layout
  - [ ] Reduce temporary allocations

- [ ] Enhance caching strategies
  - [ ] Cache tokenization results
  - [ ] Implement LRU cache for frequent inputs
  - [ ] Cache model metadata
  - [ ] Consider memory-mapped models
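A tokenization cache along these lines could be a minimal bounded LRU over `HashMap` plus a recency queue. This is a sketch; a production implementation would more likely reach for the `lru` crate, and the linear scan in `get` is O(n) in the cache size.

```rust
use std::collections::{HashMap, VecDeque};

// Minimal LRU cache for tokenization results (illustrative names).
struct TokenCache {
    capacity: usize,
    map: HashMap<String, Vec<u32>>,
    order: VecDeque<String>, // front = least recently used
}

impl TokenCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    fn get(&mut self, text: &str) -> Option<Vec<u32>> {
        let tokens = self.map.get(text)?.clone();
        // Refresh recency: move the key to the back of the queue.
        if let Some(pos) = self.order.iter().position(|k| k == text) {
            let k = self.order.remove(pos).unwrap();
            self.order.push_back(k);
        }
        Some(tokens)
    }

    fn put(&mut self, text: String, tokens: Vec<u32>) {
        if self.map.contains_key(&text) {
            self.map.insert(text, tokens); // recency refresh elided for brevity
            return;
        }
        if self.map.len() == self.capacity {
            if let Some(evicted) = self.order.pop_front() {
                self.map.remove(&evicted);
            }
        }
        self.order.push_back(text.clone());
        self.map.insert(text, tokens);
    }
}
```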

- [ ] Optimize configuration
  - [ ] Auto-tune thread counts
  - [ ] Optimize batch sizes
  - [ ] Profile-guided optimization
  - [ ] Platform-specific optimizations
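Auto-tuning the thread count could start as simply as deferring to the OS when the user has not configured one. This is a sketch; real tuning would also weigh batch size and hardware topology.

```rust
use std::thread;

// Pick a thread count: honor the configured value, else ask the OS,
// else fall back to 1.
fn default_threads(configured: Option<usize>) -> usize {
    configured.unwrap_or_else(|| {
        thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
    })
}
```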

- [ ] Consider alternative implementations
  - [ ] Evaluate `parking_lot` mutexes
  - [ ] Test different channel implementations
  - [ ] Compare allocator options
  - [ ] Investigate GPU acceleration

- [ ] Validate optimizations
  - [ ] Ensure correctness maintained
  - [ ] Verify performance improvements
  - [ ] Check memory usage reduction
  - [ ] Test on different hardware

### Success Criteria
- [ ] 20% performance improvement
- [ ] Reduced memory footprint
- [ ] No regression in accuracy
- [ ] Maintained code clarity

### Dependencies
- Phase 5 (Testing & Documentation)

---

## Implementation Notes

### Self-Referential Struct Solution

The `LlamaContext` holds a reference to `LlamaModel`, creating a self-referential struct challenge. This was solved using the `self_cell` crate:

1. **The Problem**: `LlamaContext<'a>` borrows from `LlamaModel` with lifetime `'a`
2. **The Solution**: Use `self_cell::self_cell!` macro to safely create a self-referential struct
3. **Implementation**:
   ```rust
   self_cell! {
       struct ModelCell {
           owner: LlamaModel,
           #[covariant]
           dependent: LlamaContext,
       }
   }
   ```
4. **Access Pattern**:
   - Use `cell.borrow_owner()` to access the model
   - Use `cell.borrow_dependent()` to access the context
   - The macro ensures proper lifetime management and drop order

### Thread Safety Considerations

Due to the `!Send` constraint of `LlamaContext`:

1. **Never attempt to share models between threads**
   ```rust
   // ❌ Won't compile
   let model = Arc::new(EmbeddingModel::new(...)?);

   // ✅ Correct approach
   thread_local! {
       static MODEL: RefCell<Option<EmbeddingModel>> = RefCell::new(None);
   }
   ```

2. **Use message passing for concurrent operations**
   ```rust
   // For the library, batch processing uses:
   // - Parallel pre-processing (rayon)
   // - Sequential model inference (single thread)
   // - Parallel post-processing (rayon)
   ```

3. **Each worker thread needs its own model instance**
   - This increases memory usage but ensures thread safety
   - Models cannot be pooled or shared
   - Consider model lazy loading to manage memory
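The per-thread ownership rule leads naturally to a message-passing worker, sketched below with `std::sync::mpsc`. The `!Send` model is created inside the worker thread and never crosses a thread boundary; only `String` requests and `Vec<f32>` results do. The closure standing in for the model is hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

// Requests carry a reply channel so callers get their result back.
enum Request {
    Embed(String, mpsc::Sender<Vec<f32>>),
    Shutdown,
}

fn spawn_worker() -> mpsc::Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        // The model is created *inside* the thread; a real EmbeddingModel
        // would be loaded here and never leave this thread.
        let embed = |text: &str| vec![text.len() as f32];
        for req in rx {
            match req {
                Request::Embed(text, reply) => {
                    let _ = reply.send(embed(&text));
                }
                Request::Shutdown => break,
            }
        }
    });
    tx
}
```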

### Testing Strategy

1. **Unit Tests**: Test individual components in isolation
2. **Integration Tests**: Test with real GGUF models
3. **Property Tests**: Use `proptest` for edge cases
4. **Benchmarks**: Track performance regressions
5. **Thread Safety Tests**: Verify compile-time guarantees

### Performance Targets

- Single embedding: <50ms (CPU), <10ms (GPU)
- Batch throughput: >1000 embeddings/second
- Memory overhead: <2x model size
- Startup time: <5 seconds for 1GB model

### Error Handling Philosophy

- Use `Result<T, E>` everywhere
- Provide context in errors
- Make errors actionable
- Never panic in library code
- Log errors appropriately

## Success Metrics

### Phase Completion Criteria
- [ ] All tasks completed
- [ ] Tests passing
- [ ] Documentation complete
- [ ] Code reviewed
- [ ] Performance targets met

### Overall Project Success
- [ ] Library is usable standalone
- [ ] API is ergonomic and Rust-idiomatic
- [ ] Performance meets or exceeds targets
- [ ] Thread safety guaranteed at compile time
- [ ] Comprehensive documentation and examples