embellama 0.2.0

High-performance Rust library for generating text embeddings using llama-cpp
# Architecture Proposal for Embellama

This document outlines the architectural design for `embellama`, a Rust crate and server for generating text embeddings using `llama-cpp-2`.

## 1. Overview

Embellama will consist of two main components:

1.  A **core library** that provides a robust and ergonomic Rust API for interacting with `llama.cpp` to generate embeddings.
2.  An **API server**, available as a feature flag, that exposes the library's functionality through an OpenAI-compatible REST API.

The primary goal is to create a high-performance, easy-to-use tool for developers who need to integrate local embedding models into their Rust applications.

## 2. Goals and Non-Goals

### Goals

*   Provide a simple and intuitive Rust API for embedding generation.
*   Support loading and unloading models, as well as both single-text and batch embedding generation.
*   Offer an optional, feature-flagged `axum`-based server with an OpenAI-compatible API (`/v1/embeddings`).
*   Prioritize both low-latency single requests and high-throughput batch processing.
*   Enable configuration of the library via a programmatic builder pattern and the server via CLI arguments.

### Non-Goals

*   The library will **not** handle the downloading of models. Users are responsible for providing their own GGUF-formatted model files.
*   The initial version will only support the `llama.cpp` backend via the `llama-cpp-2` crate. Other backends are out of scope for now.
*   The server will not handle authentication or authorization. It is expected to run in a trusted environment.

## 3. Core Concepts

### `EmbeddingModel`

A struct that represents a loaded `llama.cpp` model. It will encapsulate the `llama_cpp_2::LlamaModel` and `llama_cpp_2::LlamaContext` and handle the logic for generating embeddings.

### `EmbeddingEngine`

The main entry point for the library. It will manage the lifecycle of `EmbeddingModel` instances (loading, unloading) and provide the public-facing API for generating embeddings. It will be configurable using a builder pattern.

### `AppState` (for Server)

An `axum` state struct that holds communication channels to the worker pool. Since `EmbeddingEngine` contains `!Send` types, it cannot be directly shared. Instead, `AppState` contains:
- A sender channel (`tokio::sync::mpsc::Sender`) for dispatching requests to workers
- Configuration data and metrics that are `Send + Sync`
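
A minimal sketch of such a state struct, assuming a hypothetical `WorkerRequest` message and `ServerConfig` type (names and fields are illustrative, not settled API):

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, oneshot};

// Placeholder types for this sketch; the real crate defines richer versions.
pub struct ServerConfig { pub model_name: String }
pub type EmbeddingResult = Result<Vec<Vec<f32>>, String>;

// Hypothetical message sent from an axum handler to the worker pool.
pub struct WorkerRequest {
    pub model: String,
    pub input: Vec<String>,
    pub response_tx: oneshot::Sender<EmbeddingResult>,
}

// Everything held here is Send + Sync, so axum can clone the state freely.
#[derive(Clone)]
pub struct AppState {
    pub dispatcher_tx: mpsc::Sender<WorkerRequest>,
    pub config: Arc<ServerConfig>,
}
```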

## 4. Threading Model & Concurrency

### Critical Constraint: `!Send` LlamaContext

The `llama-cpp-2` library's `LlamaContext` contains `NonNull` pointers and other thread-local data, making it `!Send` and `!Sync`. This means:
- The context **cannot** be safely moved between threads
- The context **cannot** be shared between threads using `Arc`
- Each context instance must remain on the thread that created it

This fundamental constraint drives the entire concurrency architecture of both the library and server components.

### Library Threading Model

For the core library, embedding operations are inherently single-threaded per model instance:

```rust
// This will NOT work: the Arc itself can be constructed, but because
// EmbeddingModel is !Send (and !Sync), it cannot be moved to or shared with
// another thread, so the transfer is what fails to compile
let shared = Arc::new(EmbeddingModel::new(...)?);
// std::thread::spawn(move || shared.generate_embedding("...")); // ❌ compilation error

// Instead, each thread creates and owns its own instance
let model = EmbeddingModel::new(...)?; // ✅ Thread-local instance
```

The library provides thread-safe batch processing through careful orchestration:
- **Pre-processing** (parallel): Text tokenization, validation - uses `rayon` for CPU parallelism
- **Model inference** (sequential): Single model instance processes tokens sequentially
- **Post-processing** (parallel): Result normalization, formatting - uses `rayon`
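
A hedged sketch of that pipeline, with stand-in tokenization and normalization (the real crate would call into `llama-cpp-2` for both tokenization and inference):

```rust
use rayon::prelude::*;

// Placeholders so the sketch stands alone; the real crate defines its own versions.
type Token = i32;
struct EmbeddingModel;
impl EmbeddingModel {
    fn embed_tokens(&mut self, _tokens: &[Token]) -> Result<Vec<f32>, String> {
        Ok(vec![0.0; 384]) // stand-in for the real llama.cpp call
    }
}
fn tokenize(text: &str) -> Vec<Token> {
    text.bytes().map(|b| b as Token).collect() // stand-in tokenizer
}
fn normalize_l2(mut v: Vec<f32>) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
    v.iter_mut().for_each(|x| *x /= norm);
    v
}

// Parallel pre/post-processing around a strictly sequential model pass.
fn embed_batch(model: &mut EmbeddingModel, texts: &[String]) -> Result<Vec<Vec<f32>>, String> {
    // 1. Tokenize in parallel with rayon (pure CPU work, no model access).
    let token_batches: Vec<Vec<Token>> = texts.par_iter().map(|t| tokenize(t)).collect();

    // 2. Run the model sequentially: the !Send context never leaves this thread.
    let raw: Vec<Vec<f32>> = token_batches
        .iter()
        .map(|tokens| model.embed_tokens(tokens))
        .collect::<Result<_, _>>()?;

    // 3. Normalize the results in parallel again.
    Ok(raw.into_par_iter().map(normalize_l2).collect())
}
```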

### Server Threading Architecture

The server uses a **message-passing worker pool** architecture to handle the `!Send` constraint:

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Axum Handler  │────▶│    Dispatcher    │────▶│  Worker Pool    │
│   (async)       │◀────│  (channel-based) │◀────│  (thread-local  │
└─────────────────┘     └──────────────────┘     │   models)       │
                                                  └─────────────────┘
```

#### Key Components:

1. **Inference Workers**: Dedicated threads that each own a `LlamaContext` instance
2. **Message Channels**: `tokio::sync::mpsc` for request routing
3. **Response Channels**: One-shot channels for request/response pattern
4. **Dispatcher**: Routes requests to available workers

#### Request Flow:

1. Axum handler receives HTTP request
2. Creates a one-shot response channel
3. Sends request + response channel to dispatcher via mpsc
4. Dispatcher forwards to next available worker
5. Worker processes with its thread-local model
6. Worker sends result back through one-shot channel
7. Handler awaits response and returns HTTP response

This architecture ensures:
- `LlamaContext` never crosses thread boundaries
- True parallel inference with multiple workers
- Non-blocking async HTTP handling
- Predictable performance under load
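
As a concrete illustration of the dispatcher piece, here is a sketch that bridges the async request queue and per-worker blocking channels; the channel choices and the round-robin policy are assumptions, not settled design:

```rust
use tokio::sync::mpsc;

// Placeholder for the request type described above (fields omitted in this sketch).
pub struct WorkerRequest;

// Pulls requests from the shared async queue and forwards each one to a
// worker thread over that worker's own blocking channel.
pub async fn run_dispatcher(
    mut rx: mpsc::Receiver<WorkerRequest>,
    workers: Vec<std::sync::mpsc::Sender<WorkerRequest>>,
) {
    let mut next = 0usize;
    while let Some(req) = rx.recv().await {
        // Round-robin; a real dispatcher might track per-worker queue depth instead.
        if workers[next].send(req).is_err() {
            // The worker thread has exited; a real implementation would respawn it
            // or surface the failure to the caller.
        }
        next = (next + 1) % workers.len();
    }
}
```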

## 5. Library (`embellama`) Design

### Module Structure

```
src/
├── lib.rs         # Main library file, feature flags
├── engine.rs      # EmbeddingEngine implementation and builder
├── model.rs       # EmbeddingModel implementation
├── batch.rs       # Batch processing logic
├── config.rs      # Configuration structs for the engine
├── error.rs       # Custom error types
└── server/        # Server-specific modules (feature-gated)
    ├── mod.rs         # Module root and re-exports
    ├── worker.rs      # Inference worker thread implementation
    ├── dispatcher.rs  # Request routing to workers
    ├── channel.rs     # Channel types and message definitions
    └── state.rs       # AppState and server configuration
```

### Public API & Usage

The library will be configured through a builder on `EngineConfig`, which is then passed to `EmbeddingEngine::new`.

**Example Usage:**

```rust
use embellama::{EmbeddingEngine, EngineConfig};

// 1. Configure and build the engine
let config = EngineConfig::builder()
    .with_model_path("path/to/your/model.gguf")
    .with_model_name("my-embedding-model")
    .build()?;

let engine = EmbeddingEngine::new(config)?;

// 2. Generate a single embedding
let embedding = engine.embed("my-embedding-model", "Hello, world!")?;

// 3. Generate embeddings in a batch
let texts = vec!["First text", "Second text"];
let embeddings = engine.embed_batch("my-embedding-model", texts)?;

// 4. Unload the model when no longer needed
engine.unload_model("my-embedding-model")?;
```

### Error Handling

A custom `Error` enum will be defined in `src/error.rs` to handle all possible failures, from model loading to embedding generation. It will implement `std::error::Error` and provide conversions from underlying errors like those from `llama-cpp-2`.
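
A possible shape for that enum using `thiserror` (the variants below are illustrative assumptions, not the final set):

```rust
use thiserror::Error;

// Illustrative error type; the real crate may use different variants.
#[derive(Debug, Error)]
pub enum Error {
    #[error("failed to load model from {path}: {reason}")]
    ModelLoad { path: String, reason: String },

    #[error("model '{0}' is not loaded")]
    ModelNotFound(String),

    #[error("embedding generation failed: {0}")]
    Embedding(String),

    #[error("invalid configuration: {0}")]
    Config(String),
}

pub type Result<T> = std::result::Result<T, Error>;
```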

## 6. Server (`server` feature) Design

The server will be enabled with a `server` feature flag.

### Dependencies

*   `axum`: For the web framework.
*   `tokio`: For the multi-threaded async runtime.
*   `clap`: For parsing CLI arguments.
*   `serde`: For JSON serialization/deserialization.
*   `tracing`: For logging.
*   `crossbeam-channel`: For robust channel implementations (optional, can use tokio channels).

### CLI Arguments

The server will be configured via CLI arguments using `clap`.

```bash
embellama-server --model-path /path/to/model.gguf --model-name my-model --port 8080 --workers 4
```
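
A sketch of the matching `clap` derive struct; the flag names mirror the example above, and the defaults are assumptions:

```rust
use clap::Parser;

/// Illustrative CLI definition matching the flags shown above.
#[derive(Parser, Debug)]
#[command(name = "embellama-server")]
struct Args {
    /// Path to the GGUF model file
    #[arg(long)]
    model_path: std::path::PathBuf,

    /// Name used to address the model in API requests
    #[arg(long)]
    model_name: String,

    /// Port to listen on
    #[arg(long, default_value_t = 8080)]
    port: u16,

    /// Number of inference worker threads
    #[arg(long, default_value_t = 1)]
    workers: usize,
}
```

The server's entry point would call `Args::parse()` during startup.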

### `main.rs` (under `src/bin/server.rs` or similar)

The server's entry point will:
1.  Parse CLI arguments using `clap`.
2.  Spawn worker pool threads, each with its own `EmbeddingEngine` instance.
3.  Create channels for communication between axum handlers and workers.
4.  Create an `axum` `Router` with `AppState` containing the sender channel.
5.  Define the `/v1/embeddings` and `/health` endpoints.
6.  Start the server with proper graceful shutdown handling.
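
A condensed sketch of that wiring, with stubbed handlers, worker loop, and dispatcher so the overall shape is visible; every name here is a placeholder rather than the crate's final API:

```rust
use axum::{routing::{get, post}, Router};
use tokio::sync::mpsc;

// Stubs so this sketch stands alone; the real crate provides these pieces.
struct WorkerRequest;
#[derive(Clone)]
struct AppState { dispatcher_tx: mpsc::Sender<WorkerRequest> }
async fn embeddings_handler() -> &'static str { "stub" }
async fn health_handler() -> &'static str { "ok" }
async fn run_dispatcher(
    mut rx: mpsc::Receiver<WorkerRequest>,
    workers: Vec<std::sync::mpsc::Sender<WorkerRequest>>,
) {
    // See the dispatcher sketch in section 4; naive forwarding stands in here.
    while let Some(req) = rx.recv().await {
        let _ = workers[0].send(req);
    }
}

#[tokio::main]
async fn main() {
    // 1. Parse CLI arguments (see the clap sketch above); hard-coded here.
    let num_workers = 2usize;

    // 2. Spawn worker threads; each would build its own EmbeddingEngine on
    //    its own thread and loop over its receiving channel.
    let mut worker_txs = Vec::new();
    for _ in 0..num_workers {
        let (tx, rx) = std::sync::mpsc::channel::<WorkerRequest>();
        worker_txs.push(tx);
        std::thread::spawn(move || {
            while let Ok(_req) = rx.recv() { /* run inference, reply via oneshot */ }
        });
    }

    // 3. Channel between async handlers and the dispatcher task.
    let (dispatcher_tx, dispatcher_rx) = mpsc::channel::<WorkerRequest>(128);
    tokio::spawn(run_dispatcher(dispatcher_rx, worker_txs));

    // 4./5. Router with the Send + Sync AppState and the two endpoints.
    let app = Router::new()
        .route("/v1/embeddings", post(embeddings_handler))
        .route("/health", get(health_handler))
        .with_state(AppState { dispatcher_tx });

    // 6. Serve (graceful-shutdown signal handling omitted from this sketch).
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```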

### OpenAI-Compatible API

**Endpoint:** `POST /v1/embeddings`

**Request Body (`EmbeddingsRequest`):**

```json
{
  "model": "my-model",
  "input": "A single string"
}
```

or

```json
{
  "model": "my-model",
  "input": ["An array", "of strings"]
}
```

**Response Body (`EmbeddingsResponse`):**

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.1, 0.2, ...]
    }
  ],
  "model": "my-model",
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 12
  }
}
```
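
A sketch of the corresponding serde types; using an untagged enum to accept either a single string or an array is one possible approach, not a settled choice:

```rust
use serde::{Deserialize, Serialize};

// Accepts either `"input": "text"` or `"input": ["a", "b"]`.
#[derive(Deserialize)]
#[serde(untagged)]
pub enum EmbeddingInput {
    Single(String),
    Batch(Vec<String>),
}

#[derive(Deserialize)]
pub struct EmbeddingsRequest {
    pub model: String,
    pub input: EmbeddingInput,
}

#[derive(Serialize)]
pub struct EmbeddingObject {
    pub object: &'static str, // always "embedding"
    pub index: usize,
    pub embedding: Vec<f32>,
}

#[derive(Serialize)]
pub struct Usage {
    pub prompt_tokens: u32,
    pub total_tokens: u32,
}

#[derive(Serialize)]
pub struct EmbeddingsResponse {
    pub object: &'static str, // always "list"
    pub data: Vec<EmbeddingObject>,
    pub model: String,
    pub usage: Usage,
}
```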

### API Request Flow

The complete flow of an embedding request through the server:

```rust
// 1. Axum handler receives request
async fn embeddings_handler(
    State(app_state): State<AppState>,
    Json(request): Json<EmbeddingsRequest>,
) -> Result<Json<EmbeddingsResponse>, Error> {
    // 2. Create one-shot channel for response
    let (tx, rx) = tokio::sync::oneshot::channel();

    // 3. Send request to dispatcher
    app_state.dispatcher_tx
        .send(WorkerRequest {
            model: request.model,
            input: request.input,
            response_tx: tx,
        })
        .await?;

    // 4. Await response from worker
    let embeddings = rx.await??;

    // 5. Format and return response
    Ok(Json(format_response(embeddings)))
}

// Worker thread processing: the config is passed in by value, and the model
// is created on (and never leaves) this thread
fn worker_thread(config: EngineConfig, receiver: Receiver<WorkerRequest>) {
    // Thread-local model instance owned by this worker
    let model = EmbeddingModel::new(config).unwrap();

    while let Ok(request) = receiver.recv() {
        // Process with thread-local model
        let result = model.generate_embedding(&request.input);

        // Send back through one-shot channel
        let _ = request.response_tx.send(result);
    }
}
```

This architecture ensures:
- Models never cross thread boundaries
- Async handlers remain non-blocking
- Parallel processing with multiple workers
- Clean error propagation

## 7. Concurrency & Scaling

### Worker Pool Configuration

The server's concurrency model is based on a configurable worker pool:

```rust
struct WorkerPoolConfig {
    num_workers: usize,        // Number of parallel inference workers
    queue_size: usize,          // Max pending requests per worker
    timeout_ms: u64,            // Request timeout
    max_batch_size: usize,      // Max batch size per worker
}
```

### Performance Characteristics

#### Single Worker Performance
- **Latency**: ~50ms for single embedding (CPU)
- **Throughput**: Sequential processing of requests
- **Memory**: Model size + context buffer (~500MB-2GB per model)

#### Multi-Worker Scaling
- **Linear scaling** up to number of CPU cores for CPU inference
- **GPU considerations**: Limited by VRAM; typically 1-2 workers per GPU
- **Memory usage**: `num_workers × model_memory_footprint`

### Resource Management

#### Memory Management
- Each worker maintains its own model instance in memory
- VRAM allocation for GPU workers must fit within GPU memory limits
- Automatic cleanup on worker shutdown

#### Backpressure Handling
When all workers are busy:
1. Requests queue up to `queue_size` limit
2. Beyond queue limit, server returns 503 Service Unavailable
3. Clients should implement exponential backoff
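
One way to implement this at the enqueue boundary is a bounded channel plus `try_send`, sketched below; the status-code mapping follows the list above, and the helper itself is hypothetical:

```rust
use axum::http::StatusCode;
use tokio::sync::mpsc;

// Placeholder for the real request type.
struct WorkerRequest;

// Reject immediately with 503 when the bounded queue is full, rather than
// letting requests pile up indefinitely behind busy workers.
fn enqueue_or_reject(
    tx: &mpsc::Sender<WorkerRequest>,
    req: WorkerRequest,
) -> Result<(), StatusCode> {
    match tx.try_send(req) {
        Ok(()) => Ok(()),
        Err(mpsc::error::TrySendError::Full(_)) => Err(StatusCode::SERVICE_UNAVAILABLE),
        Err(mpsc::error::TrySendError::Closed(_)) => Err(StatusCode::INTERNAL_SERVER_ERROR),
    }
}
```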

### Scaling Strategies

#### Horizontal Scaling (Multiple Servers)
- Deploy multiple server instances behind a load balancer
- Each instance manages its own worker pool
- Stateless design enables easy scaling

#### Vertical Scaling (More Workers)
- Increase `--workers` parameter up to available cores
- Monitor memory usage to avoid OOM
- Profile to find optimal worker count

#### Batch Optimization
- Workers can process batches for better throughput
- Batch requests are processed atomically by single worker
- Trade-off between latency and throughput

### Monitoring & Metrics

Key metrics to track:
- **Worker utilization**: % time workers are busy
- **Queue depth**: Number of pending requests
- **Request latency**: P50, P95, P99 percentiles
- **Throughput**: Requests/second, embeddings/second
- **Memory usage**: Per worker and total

## 8. Project Structure

```
embellama/
├── .gitignore
├── Cargo.toml
├── ARCHITECTURE.md
└── src/
    ├── lib.rs
    ├── engine.rs
    ├── model.rs
    ├── batch.rs
    ├── config.rs
    ├── error.rs
    ├── server/              # Server modules (feature-gated)
    │   ├── mod.rs
    │   ├── worker.rs
    │   ├── dispatcher.rs
    │   ├── channel.rs
    │   └── state.rs
    └── bin/
        └── server.rs  # Compiled only when "server" feature is enabled
```

In `Cargo.toml`:

```toml
[package]
name = "embellama"
version = "0.1.0"
edition = "2021"

[dependencies]
# Core library dependencies
llama-cpp-2 = "0.1.117"
thiserror = "1.0"
anyhow = "1.0"
tracing = "0.1"
serde = { version = "1.0", features = ["derive"] }
rayon = "1.8"  # For library batch processing only

# Server-only dependencies
axum = { version = "0.8", optional = true }
tokio = { version = "1.35", features = ["full"], optional = true }
clap = { version = "4.4", features = ["derive"], optional = true }
tower = { version = "0.4", optional = true }
tower-http = { version = "0.6", features = ["cors", "trace"], optional = true }

[features]
default = []
server = ["dep:axum", "dep:tokio", "dep:clap", "dep:tower", "dep:tower-http"]

[[bin]]
name = "embellama-server"
required-features = ["server"]
path = "src/bin/server.rs"
```
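
Typical build and run invocations given this feature setup (flags as in the CLI example above):

```bash
# Library only (default features)
cargo build

# Build and run the server binary with the `server` feature enabled
cargo run --features server --bin embellama-server -- \
    --model-path /path/to/model.gguf --model-name my-model --port 8080 --workers 4
```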

## 9. Testing Strategy

*   **Unit Tests:** Each module in the library will have unit tests to verify its logic in isolation.
*   **Integration Tests:** An `integration` test module will be created. These tests will require embedding models to be present for testing the full flow. We will specifically test against GGUF-converted versions of `sentence-transformers/all-MiniLM-L6-v2` and `jinaai/jina-embeddings-v2-base-code`. A build script or a helper script can be provided to download these models for testing purposes.
*   **Server E2E Tests:** A separate test suite will make HTTP requests to a running instance of the server to verify API compliance and behavior, using the same test models.
*   **Concurrency Tests:** Specific tests for the worker pool to verify thread safety, proper message passing, and concurrent request handling.
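
As a rough illustration of what a concurrency test could look like, the sketch below exercises the channel plumbing with a stand-in worker instead of a real model (all names are placeholders):

```rust
#[tokio::test]
async fn worker_pool_handles_concurrent_requests() {
    use tokio::sync::{mpsc, oneshot};

    // Stand-in request carrying its own response channel, as in the server design.
    struct Req { n: u32, reply: oneshot::Sender<u32> }

    // One "worker" as a plain thread with a blocking receive loop, mirroring
    // how real workers own a thread-local model.
    let (tx, mut rx) = mpsc::channel::<Req>(64);
    let (worker_tx, worker_rx) = std::sync::mpsc::channel::<Req>();
    std::thread::spawn(move || {
        while let Ok(req) = worker_rx.recv() {
            let Req { n, reply } = req;
            let _ = reply.send(n * 2); // pretend inference
        }
    });
    // Dispatcher task: forward requests from the async queue to the worker.
    tokio::spawn(async move {
        while let Some(req) = rx.recv().await {
            let _ = worker_tx.send(req);
        }
    });

    // Fire many requests concurrently and check every response arrives.
    let mut handles = Vec::new();
    for n in 0..100u32 {
        let tx = tx.clone();
        handles.push(tokio::spawn(async move {
            let (reply, rx) = oneshot::channel();
            tx.send(Req { n, reply }).await.unwrap();
            assert_eq!(rx.await.unwrap(), n * 2);
        }));
    }
    for h in handles {
        h.await.unwrap();
    }
}
```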

## 10. Thread Safety Guarantees

### Why LlamaContext is !Send

The `LlamaContext` from `llama-cpp-2` is marked as `!Send` (cannot be sent between threads) for several reasons:

1. **Raw Pointers**: Contains `NonNull` pointers to C++ objects that are not thread-safe
2. **FFI Boundary**: Interfaces with C++ code that assumes single-threaded access
3. **Internal State**: Maintains thread-local state that would be corrupted if accessed from multiple threads
4. **CUDA/GPU Context**: GPU operations are often tied to specific thread contexts

### Safety Guarantees

Our architecture provides the following guarantees:

#### For Library Users
```rust
// ✅ Safe: each thread creates and owns its own model
std::thread::spawn(move || {
    let model = EmbeddingModel::new(config)?;
    model.generate_embedding("text")
});

// ❌ Won't compile when shared: the Arc can be built, but it cannot be sent to
// or shared with another thread, because EmbeddingModel is !Send and !Sync
let shared = Arc::new(EmbeddingModel::new(config)?);
// std::thread::spawn(move || shared.generate_embedding("text")); // Compilation error
```

#### For Server Operations
- **Guarantee 1**: Each worker thread owns exactly one model instance
- **Guarantee 2**: Models never move between threads (enforced at compile time)
- **Guarantee 3**: All communication uses message passing, not shared memory
- **Guarantee 4**: Request/response channels ensure proper synchronization

### Performance Trade-offs

The threading constraints lead to specific trade-offs:

| Approach | Pros | Cons |
|----------|------|------|
| **Worker Pool** (chosen) | True parallelism, predictable performance, safe | Higher memory usage (model per worker) |
| Single Shared Model | Lower memory usage | Sequential processing, lock contention |
| Model per Request | Maximum parallelism | High model loading overhead |

### Verification

The Rust compiler enforces these guarantees at compile time:
- Attempting to send an `EmbeddingModel` (or an `Arc` wrapping one) to another thread results in a compilation error
- Attempting to share a model across threads likewise fails to compile, because `EmbeddingModel` is neither `Send` nor `Sync`
- The type system ensures all threading constraints are satisfied