> **Under Construction:** This project is actively being developed and is not yet ready for production use. APIs and features may change without notice.
## Why Reflex?
Every time your coding agent asks "How do I center a div?" it costs you tokens. Every time it asks "What's the syntax for centering a div in CSS?" it costs you tokens again---even though you already have a perfectly good answer sitting in your logs.
The problem is threefold:
| Pain Point | What Happens | The Cost |
|---|---|---|
| Duplicate queries | Agents ask semantically identical questions | You pay for the same answer repeatedly |
| Context bloat | JSON responses eat precious context window | Agents hit token limits faster |
| Cold starts | Every session starts from scratch | No institutional memory |
Reflex solves this by sitting between your agent and the LLM provider:
```
Agent ---> Reflex ---> OpenAI/Anthropic/etc.
             |
             v
       [Cache Hit?]
             |
   Yes: Return instantly (~1ms)
   No:  Forward, cache response, return
```
The killer feature: Reflex doesn't just cache---it compresses. Responses are returned in Tauq format, a semantic encoding that cuts token overhead by ~40%. Your agent gets the same information in fewer tokens, extending its effective context window.
## Quick Start
```bash
# 1. Start Qdrant (vector database)
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant

# 2. Run Reflex (from a source checkout)
cargo run --release

# 3. Point your agent to localhost:8080
export OPENAI_BASE_URL=http://localhost:8080/v1
```
That's it. Your existing OpenAI-compatible code works unchanged.
## How It Works
Reflex uses a tiered cache architecture that balances speed with semantic understanding:
```
Request
    │
    ▼
┌──────────────────┐
│ L1: Exact        │ <-- Hash lookup (~0.1ms)
│ (In-Memory)      │     Identical requests return instantly
└────────┬─────────┘
         │ Miss
         ▼
┌──────────────────┐
│ L2: Semantic     │ <-- Vector search (~5ms)
│ (Qdrant)         │     "Center a div" ≈ "CSS centering"
└────────┬─────────┘
         │ Candidates found
         ▼
┌──────────────────┐
│ L3: Verification │ <-- Cross-encoder (~10ms)
│ (ModernBERT)     │     Ensures intent actually matches
└────────┬─────────┘
         │ No verified match
         ▼
┌──────────────────┐
│ Provider         │ <-- Forward to OpenAI/Anthropic
│ (Upstream)       │     Cache response for next time
└──────────────────┘
```
### Cache Tiers Explained
| Tier | What It Does | When It Hits | Latency |
|---|---|---|---|
| L1 Exact | Blake3 hash match on serialized request | Identical JSON payloads | <1ms |
| L2 Semantic | Embedding similarity via Qdrant | "How to sort?" ~ "Sorting algorithm?" | ~5ms |
| L3 Verification | Cross-encoder reranking | Confirms semantic match is valid | ~10ms |
The L3 verification step is critical: it prevents false positives from L2. Just because two questions are similar doesn't mean they have the same answer. The cross-encoder ensures the cached response actually answers the new question.
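In sketch form, the walk looks roughly like this. Types and names are illustrative, not Reflex's actual internals, though L1 really does key on a Blake3 hash of the serialized request per the table above:

```rust
use std::collections::HashMap;

/// Illustrative outcome type; Reflex's real internals differ.
enum Outcome {
    HitL1Exact(String),
    HitL3Verified(String),
    Miss, // forward upstream, cache the reply, then return it
}

fn lookup(
    l1: &HashMap<blake3::Hash, String>,
    request_body: &[u8],
    verified_candidate: Option<String>, // best L2 hit that survived L3
) -> Outcome {
    // L1: exact match on the Blake3 hash of the serialized request (~0.1ms).
    let key = blake3::hash(request_body);
    if let Some(resp) = l1.get(&key) {
        return Outcome::HitL1Exact(resp.clone());
    }
    // L2 found nearest neighbors in Qdrant (~5ms); L3's cross-encoder (~10ms)
    // kept at most one candidate whose intent actually matches.
    match verified_candidate {
        Some(resp) => Outcome::HitL3Verified(resp),
        None => Outcome::Miss,
    }
}
```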
## Features

- **OpenAI-Compatible API** --- Drop-in replacement. Change `base_url`, nothing else.
- **Tiered Caching** --- L1 exact + L2 semantic + L3 verification
- **Tauq Compression** --- ~40% token savings via semantic encoding
- **Multi-Tenant** --- Isolated caches per API key
- **GPU Acceleration** --- Metal (Apple Silicon) and CUDA support
- **Zero-Copy Storage** --- Memory-mapped files via `rkyv` for instant deserialization
- **Self-Hosted** --- Your data stays on your machine
- **Graceful Lifecycle** --- Automatic state hydration/dehydration for cloud deployments
## Installation
### From Source (Recommended)
```bash
# CPU only
cargo build --release

# Apple Silicon (Metal) -- feature name assumed
cargo build --release --features metal

# NVIDIA GPU (CUDA) -- feature name assumed
cargo build --release --features cuda
```
### Docker Compose
```bash
# CPU
docker compose up -d

# With GPU backend
GPU_BACKEND=metal docker compose up -d
```
### Prerequisites
- Rust 2024 Edition (nightly or 1.85+)
- Qdrant vector database (included in docker-compose)
- Embedding model (Qwen2-based, 1536-dim)
- Reranker model (ModernBERT cross-encoder) [optional but recommended]
## Configuration
All configuration is via environment variables:
| Variable | Description | Default |
|---|---|---|
| `REFLEX_PORT` | HTTP port to bind | `8080` |
| `REFLEX_BIND_ADDR` | IP address to bind | `127.0.0.1` |
| `REFLEX_QDRANT_URL` | Qdrant gRPC endpoint | `http://localhost:6334` |
| `REFLEX_STORAGE_PATH` | Path for mmap storage | `./.data` |
| `REFLEX_L1_CAPACITY` | Max entries in L1 cache | `10000` |
| `REFLEX_MODEL_PATH` | Path to embedding model | Required for production |
| `REFLEX_RERANKER_PATH` | Path to reranker model | Optional |
| `REFLEX_RERANKER_THRESHOLD` | L3 verification confidence (0.0-1.0) | `0.70` |
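For example, a minimal production-leaning setup might look like this (paths are placeholders):

```bash
export REFLEX_PORT=8080
export REFLEX_QDRANT_URL=http://localhost:6334
export REFLEX_MODEL_PATH=/models/qwen2-embedder    # required for production
export REFLEX_RERANKER_PATH=/models/modernbert-ce  # optional but recommended
export REFLEX_RERANKER_THRESHOLD=0.70
```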
### Cloud/Lifecycle Configuration
| Variable | Description | Default |
|---|---|---|
| `REFLEX_GCS_BUCKET` | GCS bucket for state persistence | Empty |
| `REFLEX_SNAPSHOT_PATH` | Local snapshot file path | `/mnt/nvme/reflex_data/snapshot.rkyv` |
| `REFLEX_IDLE_TIMEOUT_SECS` | Idle timeout before shutdown | `900` (15 min) |
| `REFLEX_CLOUD_PROVIDER` | Cloud provider (`gcp`, `local`) | `gcp` |
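A hypothetical sketch of the idle watchdog these settings drive (function and names invented for illustration):

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical: after REFLEX_IDLE_TIMEOUT_SECS without traffic, dehydrate
/// (snapshot to REFLEX_SNAPSHOT_PATH, upload to the GCS bucket) and exit.
async fn idle_watchdog(last_request: &Mutex<Instant>, timeout: Duration) {
    loop {
        tokio::time::sleep(Duration::from_secs(30)).await;
        if last_request.lock().unwrap().elapsed() >= timeout {
            // dehydrate_state().await;  // hypothetical snapshot + upload
            break;
        }
    }
}
```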
## API Reference
### POST /v1/chat/completions
Fully OpenAI-compatible. Your existing code works unchanged.
Request:
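A representative payload (model and message are illustrative):

```json
{
  "model": "gpt-4o",
  "messages": [
    { "role": "user", "content": "How do I center a div?" }
  ]
}
```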
Response (Cache Hit):
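The shape on a hit is a standard chat completion; the `content` value below is a stand-in for the actual Tauq-encoded payload:

```json
{
  "id": "chatcmpl-cached",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "<Tauq-encoded payload>" },
      "finish_reason": "stop"
    }
  ]
}
```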
The `content` field contains Tauq-encoded data. Your agent can parse this to extract:
- The original semantic request (for verification)
- The cached response content
Response Headers:
| Header | Values | Meaning |
|---|---|---|
| `X-Reflex-Status` | `hit-l1-exact` | L1 cache hit (identical request) |
| | `hit-l3-verified` | L2 candidate verified by L3 |
| | `miss` | Cache miss, forwarded to provider |
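To observe the header programmatically, a minimal sketch with `reqwest` (crate choice and model name are illustrative; `tokio`, `serde_json`, and reqwest's `json` feature assumed):

```rust
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let res = Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .bearer_auth("sk-...") // forwarded upstream on a miss
        .json(&serde_json::json!({
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": "How do I center a div?"}]
        }))
        .send()
        .await?;
    // One of: hit-l1-exact, hit-l3-verified, miss
    println!("{:?}", res.headers().get("X-Reflex-Status"));
    Ok(())
}
```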
### GET /healthz
Liveness probe. Returns 200 OK if the server is running.
### GET /ready
Readiness probe. Returns 200 OK when Qdrant is connected and storage is operational.
## Usage with Coding Agents
### Claude Code / Aider / Continue
```bash
# Set the base URL to Reflex (exact variable depends on the agent;
# OPENAI_BASE_URL is the openai-SDK convention)
export OPENAI_BASE_URL=http://localhost:8080/v1

# Still needed for cache misses
export OPENAI_API_KEY=sk-...

# Run your agent as normal
aider  # or your agent of choice
```
### Python (openai SDK)
```python
import openai

# Route requests through Reflex; the key is only used on cache misses.
openai.base_url = "http://localhost:8080/v1"
openai.api_key = "sk-..."
```
### async-openai (Rust)
```rust
use async_openai::{config::OpenAIConfig, Client};

let config = OpenAIConfig::new()
    .with_api_base("http://localhost:8080/v1");
let client = Client::with_config(config);
```
## Architecture Deep Dive
### Storage Layer
Reflex uses `rkyv` for zero-copy deserialization. Cached responses are memory-mapped directly from disk---no parsing, no allocation, no delay.

```
.data/
├── {tenant_id}/
│   ├── {context_hash}.rkyv   # Archived CacheEntry
│   └── ...
```
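A minimal sketch of what a zero-copy read looks like, assuming rkyv 0.7's `archived_root`, the `memmap2` crate, and a simplified `CacheEntry` (field names here are illustrative):

```rust
use memmap2::Mmap;
use rkyv::{Archive, Deserialize, Serialize};
use std::fs::File;

#[derive(Archive, Serialize, Deserialize)]
struct CacheEntry {
    response: String,
}

fn read_entry(path: &str) -> std::io::Result<()> {
    let file = File::open(path)?;
    // SAFETY: snapshot files are written once, then treated as immutable.
    let mmap = unsafe { Mmap::map(&file)? };
    // Reinterpret the mapped bytes in place: no parse step, no allocation.
    let entry = unsafe { rkyv::archived_root::<CacheEntry>(&mmap[..]) };
    println!("{}", entry.response.as_str());
    Ok(())
}
```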
### Embedding Pipeline
- Tokenization --- Qwen2 tokenizer (8192 max sequence length)
- Embedding --- Sinter embedder (1536-dim vectors)
- Indexing --- Binary quantization in Qdrant for efficient ANN search
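The indexing step's binary quantization happens inside Qdrant, but the idea is simple enough to sketch (illustrative code, not Reflex's): keep one bit per dimension, the sign of each component, shrinking a 1536-dim f32 vector from 6 KB to 192 bytes.

```rust
/// Sketch of binary quantization: one sign bit per dimension.
fn binary_quantize(embedding: &[f32]) -> Vec<u8> {
    let mut bits = vec![0u8; embedding.len().div_ceil(8)];
    for (i, &x) in embedding.iter().enumerate() {
        if x > 0.0 {
            bits[i / 8] |= 1 << (i % 8);
        }
    }
    bits
}
```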
### Verification Pipeline
The L3 cross-encoder takes (query, candidate) pairs and outputs a relevance score. Only candidates exceeding the threshold (default 0.70) are considered valid cache hits.
This prevents:
- "How to sort ascending?" returning cached "How to sort descending?" answer
- "Python quicksort" returning cached "Rust quicksort" implementation
- False positives from embedding similarity alone
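To make the gate concrete, here is a toy scoring pass with invented numbers (Reflex's real scores come from the ModernBERT cross-encoder):

```rust
// Hypothetical cross-encoder scores for the query "How to sort ascending?"
// against two L2 candidates; only the second clears the 0.70 threshold.
let scored = [
    (0.41_f32, "How to sort descending?"), // similar embedding, wrong intent
    (0.93_f32, "Sort a list in ascending order"),
];
let best = scored
    .iter()
    .filter(|(score, _)| *score >= 0.70)
    .max_by(|a, b| a.0.total_cmp(&b.0));
assert_eq!(best.map(|(_, q)| *q), Some("Sort a list in ascending order"));
```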
## Development
```bash
# Run tests
cargo test

# Run with debug logging
RUST_LOG=debug cargo run

# Run specific integration tests (requires Qdrant; test target name assumed)
cargo test --test integration

# Lint
cargo clippy

# Format
cargo fmt
```
### Test Modes
| Environment Variable | Effect |
|---|---|
| `REFLEX_MOCK_PROVIDER=1` | Bypass real LLM calls (for CI) |
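For example, to run the suite in CI without touching a real provider (invocation assumed):

```bash
REFLEX_MOCK_PROVIDER=1 cargo test
```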
## Roadmap
- Streaming response caching
- Cache invalidation API
- Prometheus metrics endpoint
- Multi-model embedding support
- Redis L1 backend option
## License
AGPL-3.0 --- Free as in freedom. If you modify Reflex and distribute it (including as a service), you must release your modifications under the same license.