reflex-cache 0.1.1

Episodic memory and high-speed semantic cache for LLM responses

Under Construction: This project is actively being developed and is not yet ready for production use. APIs and features may change without notice.


Why Reflex?

Every time your coding agent asks "How do I center a div?" it costs you tokens. Every time it asks "What's the syntax for centering a div in CSS?" it costs you tokens again---even though you already have a perfectly good answer sitting in your logs.

The problem is threefold:

Pain Point | What Happens | The Cost
Duplicate queries | Agents ask semantically identical questions | You pay for the same answer repeatedly
Context bloat | JSON responses eat precious context window | Agents hit token limits faster
Cold starts | Every session starts from scratch | No institutional memory

Reflex solves this by sitting between your agent and the LLM provider:

Agent  --->  Reflex  --->  OpenAI/Anthropic/etc.
               |
               v
         [Cache Hit?]
               |
      Yes: Return instantly (~1ms)
       No: Forward, cache response, return

The killer feature: Reflex doesn't just cache---it compresses. Responses are returned in Tauq format, a semantic encoding that cuts token overhead by ~40%. Your agent gets the same information in fewer tokens, extending its effective context window.


Quick Start

# 1. Start Qdrant (vector database)
docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant

# 2. Run Reflex
cargo run --release

# 3. Point your agent to localhost:8080
export OPENAI_BASE_URL=http://localhost:8080/v1

That's it. Your existing OpenAI-compatible code works unchanged.


How It Works

Reflex uses a tiered cache architecture that balances speed with semantic understanding:

                    Request
                       │
                       ▼
              ┌────────────────┐
              │   L1: Exact    │  <-- Hash lookup (~0.1ms)
              │   (In-Memory)  │      Identical requests return instantly
              └───────┬────────┘
                      │ Miss
                      ▼
              ┌────────────────┐
              │  L2: Semantic  │  <-- Vector search (~5ms)
              │   (Qdrant)     │      "Center a div" ≈ "CSS centering"
              └───────┬────────┘
                      │ Candidates found
                      ▼
               ┌─────────────────┐
               │ L3: Verification│ <-- Cross-encoder (~10ms)
               │  (ModernBERT)   │     Ensures intent actually matches
               └───────┬─────────┘
                      │ Verified / No match
                      ▼
              ┌────────────────┐
              │   Provider     │  <-- Forward to OpenAI/Anthropic
              │   (Upstream)   │      Cache response for next time
              └────────────────┘

Cache Tiers Explained

Tier | What It Does | When It Hits | Latency
L1 Exact | Blake3 hash match on serialized request | Identical JSON payloads | <1ms
L2 Semantic | Embedding similarity via Qdrant | "How to sort?" ~ "Sorting algorithm?" | ~5ms
L3 Verification | Cross-encoder reranking | Confirms semantic match is valid | ~10ms

The L3 verification step is critical: it prevents false positives from L2. Just because two questions are similar doesn't mean they have the same answer. The cross-encoder ensures the cached response actually answers the new question.
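
To make the tier ordering concrete, here is a minimal sketch of the lookup path. It is illustrative only, not Reflex's actual code: the semantic_candidates and rerank functions stand in for the Qdrant search and the ModernBERT cross-encoder, the blake3 crate call stands in for the hashing named in the tier table, and the data structures are assumptions.

use std::collections::HashMap;

struct TieredCache {
    l1: HashMap<String, String>, // request hash -> cached response
    rerank_threshold: f32,       // REFLEX_RERANKER_THRESHOLD, default 0.70
}

impl TieredCache {
    fn lookup(&self, request_json: &str, query: &str) -> Option<String> {
        // L1: exact hash of the serialized request (~0.1ms).
        let key = blake3::hash(request_json.as_bytes()).to_hex().to_string();
        if let Some(hit) = self.l1.get(&key) {
            return Some(hit.clone()); // X-Reflex-Status: hit-l1-exact
        }
        // L2: semantically similar candidates via vector search (~5ms).
        for (candidate_query, cached) in self.semantic_candidates(query) {
            // L3: cross-encoder verification (~10ms); accept only confident matches.
            if self.rerank(query, &candidate_query) >= self.rerank_threshold {
                return Some(cached); // X-Reflex-Status: hit-l3-verified
            }
        }
        None // miss: forward to the upstream provider and cache the answer
    }

    fn semantic_candidates(&self, _query: &str) -> Vec<(String, String)> {
        Vec::new() // placeholder for a Qdrant ANN query
    }

    fn rerank(&self, _query: &str, _candidate: &str) -> f32 {
        0.0 // placeholder for the ModernBERT cross-encoder score
    }
}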


Features

  • OpenAI-Compatible API --- Drop-in replacement. Change base_url, nothing else.
  • Tiered Caching --- L1 exact + L2 semantic + L3 verification
  • Tauq Compression --- ~40% token savings via semantic encoding
  • Multi-Tenant --- Isolated caches per API key
  • GPU Acceleration --- Metal (Apple Silicon) and CUDA support
  • Zero-Copy Storage --- Memory-mapped files via rkyv for instant deserialization
  • Self-Hosted --- Your data stays on your machine
  • Graceful Lifecycle --- Automatic state hydration/dehydration for cloud deployments

Installation

From Source (Recommended)

git clone https://github.com/ccheney/reflex.git
cd reflex

# CPU only
cargo build --release

# Apple Silicon (Metal)
cargo build --release --features metal

# NVIDIA GPU (CUDA)
cargo build --release --features cuda

Docker Compose

# CPU
docker compose up -d

# With GPU backend
GPU_BACKEND=metal docker compose up -d

Prerequisites

  • Rust 2024 Edition (nightly or 1.85+)
  • Qdrant vector database (included in docker-compose)
  • Embedding model (Qwen2-based, 1536-dim)
  • Reranker model (ModernBERT cross-encoder) [optional but recommended]

Configuration

All configuration is via environment variables:

Variable | Description | Default
REFLEX_PORT | HTTP port to bind | 8080
REFLEX_BIND_ADDR | IP address to bind | 127.0.0.1
REFLEX_QDRANT_URL | Qdrant gRPC endpoint | http://localhost:6334
REFLEX_STORAGE_PATH | Path for mmap storage | ./.data
REFLEX_L1_CAPACITY | Max entries in L1 cache | 10000
REFLEX_MODEL_PATH | Path to embedding model | Required for production
REFLEX_RERANKER_PATH | Path to reranker model | Optional
REFLEX_RERANKER_THRESHOLD | L3 verification confidence (0.0-1.0) | 0.70
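
For reference, the sketch below mirrors the variables and defaults above using plain std::env lookups. It is an illustration only, not Reflex's actual configuration loader.

use std::env;

// Read a variable, falling back to the documented default.
fn env_or(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

fn main() {
    let port = env_or("REFLEX_PORT", "8080");
    let bind_addr = env_or("REFLEX_BIND_ADDR", "127.0.0.1");
    let qdrant_url = env_or("REFLEX_QDRANT_URL", "http://localhost:6334");
    let storage_path = env_or("REFLEX_STORAGE_PATH", "./.data");
    let l1_capacity: usize = env_or("REFLEX_L1_CAPACITY", "10000").parse().unwrap_or(10_000);
    let threshold: f32 = env_or("REFLEX_RERANKER_THRESHOLD", "0.70").parse().unwrap_or(0.70);

    println!("listening on {bind_addr}:{port}, Qdrant at {qdrant_url}");
    println!("storage: {storage_path}, L1 capacity: {l1_capacity}, L3 threshold: {threshold}");
}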

Cloud/Lifecycle Configuration

Variable | Description | Default
REFLEX_GCS_BUCKET | GCS bucket for state persistence | Empty
REFLEX_SNAPSHOT_PATH | Local snapshot file path | /mnt/nvme/reflex_data/snapshot.rkyv
REFLEX_IDLE_TIMEOUT_SECS | Idle timeout before shutdown | 900 (15 min)
REFLEX_CLOUD_PROVIDER | Cloud provider (gcp, local) | gcp

API Reference

POST /v1/chat/completions

Fully OpenAI-compatible. Your existing code works unchanged.

Request:

{
  "model": "gpt-4o",
  "messages": [
    {"role": "user", "content": "How do I center a div in CSS?"}
  ]
}

Response (Cache Hit):

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-4o",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "! def\n: semantic_request \"How do I center a div in CSS?\"\n: response !\n  : content \"Use flexbox: display: flex; justify-content: center; align-items: center;\""
    },
    "finish_reason": "stop"
  }]
}

The content field contains Tauq-encoded data. Your agent can parse this to extract (see the sketch after this list):

  • The original semantic request (for verification)
  • The cached response content
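
A rough extraction sketch, based only on the example payload above; the actual Tauq grammar is richer than this, and the helper below is hypothetical rather than an official parser.

// Hypothetical helper: pull a quoted Tauq field out of the cached content string.
fn extract_field<'a>(tauq: &'a str, field: &str) -> Option<&'a str> {
    let prefix = format!(": {field} \"");
    tauq.lines()
        .find_map(|line| line.trim_start().strip_prefix(prefix.as_str())?.strip_suffix('"'))
}

fn main() {
    let content = "! def\n: semantic_request \"How do I center a div in CSS?\"\n: response !\n  : content \"Use flexbox: display: flex; justify-content: center; align-items: center;\"";
    println!("{:?}", extract_field(content, "semantic_request"));
    println!("{:?}", extract_field(content, "content"));
}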

Response Headers:

Header | Values | Meaning
X-Reflex-Status | hit-l1-exact | L1 cache hit (identical request)
X-Reflex-Status | hit-l3-verified | L2 candidate verified by L3
X-Reflex-Status | miss | Cache miss, forwarded to provider
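
Any HTTP client can read this header to see which tier served a response. Below is a minimal check using the reqwest, tokio, and serde_json crates; none of these are required by Reflex itself, and reqwest needs its json feature enabled.

use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let resp = reqwest::Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .bearer_auth("sk-your-key")
        .json(&json!({
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": "How do I center a div in CSS?"}]
        }))
        .send()
        .await?;

    // Expect hit-l1-exact, hit-l3-verified, or miss.
    println!("X-Reflex-Status: {:?}", resp.headers().get("x-reflex-status"));
    Ok(())
}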

GET /healthz

Liveness probe. Returns 200 OK if the server is running.

GET /ready

Readiness probe. Returns 200 OK when Qdrant is connected and storage is operational.


Usage with Coding Agents

Claude Code / Aider / Continue

# Set the base URL to Reflex
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=sk-your-key  # Still needed for cache misses

# Run your agent as normal
aider --model gpt-4o

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-your-key"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quicksort"}]
)

async-openai (Rust)

use async_openai::{Client, config::OpenAIConfig};

let config = OpenAIConfig::new()
    .with_api_base("http://localhost:8080/v1");
let client = Client::with_config(config);
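
A full round trip might look like the sketch below. Builder type names vary across async-openai versions, so treat it as a guide rather than exact code; the API key is still forwarded upstream on cache misses.

use async_openai::{
    config::OpenAIConfig,
    types::{ChatCompletionRequestUserMessageArgs, CreateChatCompletionRequestArgs},
    Client,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = OpenAIConfig::new()
        .with_api_base("http://localhost:8080/v1")
        .with_api_key("sk-your-key");
    let client = Client::with_config(config);

    let request = CreateChatCompletionRequestArgs::default()
        .model("gpt-4o")
        .messages([ChatCompletionRequestUserMessageArgs::default()
            .content("Explain quicksort")
            .build()?
            .into()])
        .build()?;

    // Repeated or semantically similar requests are answered from the cache.
    let response = client.chat().create(request).await?;
    println!("{:?}", response.choices[0].message.content);
    Ok(())
}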

Architecture Deep Dive

Storage Layer

Reflex uses rkyv for zero-copy deserialization. Cached responses are memory-mapped directly from disk---no parsing, no allocation, no delay.

.data/
├── {tenant_id}/
│   ├── {context_hash}.rkyv    # Archived CacheEntry
│   └── ...
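
The sketch below shows what zero-copy access to such a file can look like. The CacheEntry fields here are hypothetical, and the memmap2 crate plus an rkyv 0.7-style archived_root call are assumptions; Reflex's real storage code is internal.

use memmap2::Mmap;
use rkyv::{Archive, Deserialize, Serialize};

// Hypothetical entry layout; the real CacheEntry is internal to Reflex.
#[derive(Archive, Serialize, Deserialize)]
struct CacheEntry {
    semantic_request: String,
    response: String,
}

fn load(path: &std::path::Path) -> std::io::Result<()> {
    let file = std::fs::File::open(path)?;
    // Map the archive; its bytes are read in place with no parse step.
    let mmap = unsafe { Mmap::map(&file)? };
    // Unchecked access for brevity; rkyv also offers validated access.
    let entry = unsafe { rkyv::archived_root::<CacheEntry>(&mmap[..]) };
    println!("cached response: {}", entry.response);
    Ok(())
}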

Embedding Pipeline

  1. Tokenization --- Qwen2 tokenizer (8192 max sequence length)
  2. Embedding --- Sinter embedder (1536-dim vectors)
  3. Indexing --- Binary quantization in Qdrant for efficient ANN search (see the sketch below)
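
A sketch of what the indexing step can look like with the qdrant-client crate's builder API, running on the tokio runtime. The collection name and distance metric here are assumptions, and Reflex's actual collection setup may differ.

use qdrant_client::qdrant::{
    BinaryQuantizationBuilder, CreateCollectionBuilder, Distance, VectorParamsBuilder,
};
use qdrant_client::Qdrant;

#[tokio::main]
async fn main() -> Result<(), qdrant_client::QdrantError> {
    // Connect over gRPC (REFLEX_QDRANT_URL defaults to this address).
    let client = Qdrant::from_url("http://localhost:6334").build()?;

    // 1536-dim vectors with binary quantization for fast ANN search.
    client
        .create_collection(
            CreateCollectionBuilder::new("reflex_cache") // hypothetical collection name
                .vectors_config(VectorParamsBuilder::new(1536, Distance::Cosine))
                .quantization_config(BinaryQuantizationBuilder::new(true)),
        )
        .await?;
    Ok(())
}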

Verification Pipeline

The L3 cross-encoder takes (query, candidate) pairs and outputs a relevance score. Only candidates exceeding the threshold (default 0.70) are considered valid cache hits.

This prevents:

  • "How to sort ascending?" returning cached "How to sort descending?" answer
  • "Python quicksort" returning cached "Rust quicksort" implementation
  • False positives from embedding similarity alone

Development

# Run tests
cargo test

# Run with debug logging
RUST_LOG=debug cargo run

# Run specific integration tests (requires Qdrant)
docker compose up -d qdrant
cargo test --test integration_real

# Lint
cargo clippy --all-targets -- -D warnings

# Format
cargo fmt

Test Modes

Environment Variable | Effect
REFLEX_MOCK_PROVIDER=1 | Bypass real LLM calls (for CI)

Roadmap

  • Streaming response caching
  • Cache invalidation API
  • Prometheus metrics endpoint
  • Multi-model embedding support
  • Redis L1 backend option

License

AGPL-3.0 --- Free as in freedom. If you modify Reflex and distribute it (including as a service), you must release your modifications under the same license.