rlm-cli 1.3.0

Recursive Language Model (RLM) REPL for Claude Code - handles long-context tasks via chunking and recursive sub-LLM calls
Documentation
---
title: "ADR-012: Concurrency Model with Rayon"
description: "Decision to use Rayon for data-parallel concurrency."
sidebar:
  label: "ADR-012: Concurrency Rayon"
---

# ADR-012: Concurrency Model with Rayon

## Status

Accepted

## Context

### Background and Problem Statement

Processing large documents requires efficient parallel execution. When loading multi-megabyte files and generating embeddings for hundreds of chunks, single-threaded execution becomes a bottleneck. The system needs a concurrency model that:

1. Scales with available CPU cores
2. Maintains thread safety without complex synchronization
3. Integrates cleanly with existing sequential code
4. Handles errors gracefully in parallel contexts

### Performance Requirements

- Load and chunk 10MB+ files efficiently
- Generate embeddings for 100+ chunks concurrently
- Minimize latency for interactive CLI usage
- Scale from single-core to many-core systems

## Decision Drivers

### Primary Decision Drivers

1. **Work-stealing efficiency**: Automatic load balancing across threads
2. **Simple API**: Parallel iterators mirror sequential code
3. **Safety guarantees**: Compile-time verification of data race freedom
4. **Error propagation**: Clean handling of failures in parallel contexts

### Secondary Decision Drivers

1. **Zero-overhead abstraction**: No runtime cost for unused parallelism
2. **Rust ecosystem**: Well-maintained, widely-used crate
3. **Configurable**: Thread pool sizing and scheduling options

## Considered Options

### Option 1: Rayon (Chosen)

Data-parallelism library using work-stealing for automatic load balancing.

**Pros:**
- Drop-in parallel iterators (`.par_iter()`)
- Automatic work distribution
- Scoped threads prevent dangling references
- Excellent documentation and ecosystem support

**Cons:**
- Adds dependency (~100KB)
- Thread pool overhead for small workloads

### Option 2: std::thread

Rust standard library threading primitives.

**Pros:**
- No external dependencies
- Full control over thread lifecycle

**Cons:**
- Manual thread management
- No work-stealing
- Complex error handling across threads

### Option 3: tokio

Async runtime for I/O-bound workloads.

**Pros:**
- Excellent for network operations
- Mature ecosystem

**Cons:**
- Async/await complexity for CPU-bound work
- Overhead for non-I/O operations
- Viral async requirement

### Option 4: crossbeam

Low-level concurrency primitives.

**Pros:**
- Fine-grained control
- Scoped threads

**Cons:**
- More boilerplate than Rayon
- Manual work distribution

## Decision

We chose **Rayon** for parallel processing with the following patterns:

### Parallel Semantic Search

`semantic_search()` uses `par_iter()` to distribute the O(n) cosine-similarity scan across
available CPU cores. This provides near-linear speedup for collections of >1 000 embeddings
without changing the public API surface:

```rust
let mut similarities: Vec<(i64, f32)> = all_embeddings
    .par_iter()
    .map(|(chunk_id, embedding)| {
        let sim = cosine_similarity(&query_embedding, embedding);
        (*chunk_id, sim)
    })
    .filter(|(_, sim)| *sim >= config.similarity_threshold)
    .collect();

// Sort by similarity descending
similarities.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));

// Limit results
similarities.truncate(config.top_k * 2);
```

The `rayon` import is scoped to the function body (`use rayon::prelude::*;`) to keep the
module-level namespace clean.

### Parallel Chunking

The `ParallelChunker` uses Rayon for concurrent chunk processing:

```rust
pub struct ParallelChunker {
    inner: Box<dyn Chunker>,
    thread_count: Option<usize>,
}

impl ParallelChunker {
    pub fn chunk(&self, buffer_id: i64, content: &str, path: Option<&Path>) -> Result<Vec<Chunk>> {
        // Split content into segments
        let segments = self.split_into_segments(content);

        // Process segments in parallel
        let results: Result<Vec<Vec<Chunk>>> = segments
            .par_iter()
            .map(|segment| self.inner.chunk(buffer_id, segment, path))
            .collect();

        // Merge and reindex
        Ok(self.merge_chunks(results?))
    }
}
```

### Thread Safety for Storage

SQLite connections are not thread-safe. We use `Mutex` wrapping:

```rust
pub struct SqliteStorage {
    conn: Mutex<Connection>,
    db_path: PathBuf,
}
```

This ensures only one thread accesses the database at a time while allowing parallel chunk processing.

### Parallel Embedding Generation

Embedding batches can be processed in parallel when chunks don't share state:

```rust
let embeddings: Vec<Embedding> = chunks
    .par_chunks(BATCH_SIZE)
    .flat_map(|batch| embedder.embed_batch(batch))
    .collect();
```

## Consequences

### Positive Consequences

1. **Linear speedup**: Processing time scales with core count
2. **Simple code**: Parallel iterators look like sequential code
3. **Safe by default**: Compile-time data race prevention
4. **Automatic optimization**: Work-stealing balances load

### Negative Consequences

1. **Memory overhead**: Each thread needs stack space
2. **Debugging complexity**: Parallel execution harder to trace
3. **SQLite serialization**: Database becomes bottleneck for write-heavy workloads

### Neutral Consequences

1. **Thread pool startup**: One-time initialization cost
2. **Batch size tuning**: Optimal batch size depends on workload

## Implementation Notes

### Configuring Thread Count

```rust
// Use custom thread pool for specific operations
rayon::ThreadPoolBuilder::new()
    .num_threads(4)
    .build_global()
    .unwrap();
```

### Error Handling in Parallel Contexts

Rayon's `collect::<Result<Vec<_>>>()` short-circuits on first error:

```rust
let results: Result<Vec<Chunk>> = segments
    .par_iter()
    .map(|s| process(s))  // Returns Result<Chunk>
    .collect();           // Stops on first Err
```

### When NOT to Use Parallelism

- Small workloads (< 10 chunks): Overhead exceeds benefit
- Sequential dependencies: Operations that must be ordered
- I/O-bound work: Consider async instead

## Performance Characteristics

| Operation | Sequential | Parallel (8 cores) | Speedup |
|-----------|------------|-------------------|---------|
| Chunk 1MB | 50ms | 15ms | 3.3x |
| Embed 100 chunks | 2000ms | 300ms | 6.7x |
| Semantic search 1000 chunks | 100ms | 20ms | 5x |

*Measurements on Apple M1 Pro, representative workloads*

## References

- [Rayon documentation](https://docs.rs/rayon)
- [Rayon design](https://github.com/rayon-rs/rayon/blob/master/FAQ.md)
- [Rust Parallelism](https://doc.rust-lang.org/book/ch16-00-concurrency.html)