Ollama Proxy (Rust)
Description
A lightweight Rust proxy for Ollama that intelligently adjusts request parameters to match each model's training configuration.
Problem
Some AI clients (like Elephas) send the same context length parameter for all models. This causes issues when:
- Embedding models trained with 8K context receive requests for 128K context
- Ollama warns: "requested context size too large for model"
- Models may perform poorly with incorrect parameters
Solution
This proxy sits between your client and Ollama, automatically:
- Detects which model is being requested
- Fetches the model's training context length (`n_ctx_train`)
- Adjusts `num_ctx` if it exceeds the model's capabilities
- Provides detailed logging of all modifications
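
In essence, the correction is a clamp of the requested `num_ctx` against `n_ctx_train`. A minimal sketch of that idea (not the project's actual code; the helper name and the use of `serde_json` are illustrative):

```rust
use serde_json::{json, Value};

/// Hypothetical helper: lower `options.num_ctx` to the model's training
/// context length when the client asked for more than the model supports.
fn clamp_num_ctx(body: &mut Value, n_ctx_train: u64) {
    if let Some(requested) = body
        .get("options")
        .and_then(|o| o.get("num_ctx"))
        .and_then(Value::as_u64)
    {
        if requested > n_ctx_train {
            // e.g. 131072 requested -> clamped to 8192 for nomic-embed-text
            body["options"]["num_ctx"] = json!(n_ctx_train);
        }
    }
}
```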
Features
- ✅ Prevents infinite generation - Auto-injects `num_predict` to limit output
- ✅ Smart chunking - Automatically splits large embedding inputs to prevent crashes
- ✅ Context safety caps - Configurable hard limits to prevent Ollama stalls
- ✅ Request timeouts - Prevents indefinite hangs with configurable limits
- ✅ Automatic parameter correction based on model metadata
- ✅ Request/response logging for debugging
- ✅ Model metadata caching for performance
- ✅ Extensible modifier framework for future enhancements
- ✅ Zero configuration for basic usage
Installation
Usage
1. Start the Proxy
```bash
# Default: Listen on 127.0.0.1:11435, proxy to 127.0.0.1:11434
# Or with custom settings:
OLLAMA_HOST=http://127.0.0.1:11434 PROXY_PORT=11435 RUST_LOG=info
```
2. Configure Your Client
Point your AI client (Elephas, etc.) to the proxy instead of Ollama directly:
Before: http://127.0.0.1:11434
After: http://127.0.0.1:11435
3. Watch the Magic
The proxy will log all requests and modifications:
```text
📨 Incoming request: POST /v1/embeddings
📋 Request body: {
  "model": "nomic-embed-text",
  "input": "test",
  "options": {
    "num_ctx": 131072
  }
}
🔍 Detected model: nomic-embed-text
📊 Model metadata - n_ctx_train: 8192
⚠️ num_ctx (131072) exceeds model training context (8192)
✏️ Modified options.num_ctx: 131072 → 8192
🔧 ContextLimitModifier applied modifications
📬 Response status: 200 OK
```
Configuration
Environment variables:
- `OLLAMA_HOST` - Target Ollama server (default: `http://127.0.0.1:11434`)
- `PROXY_PORT` - Port to listen on (default: `11435`)
- `RUST_LOG` - Log level: `error`, `warn`, `info`, `debug`, `trace` (default: `info`)
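
As a minimal sketch of how these defaults can be resolved at startup (illustrative only, assuming plain `std::env` lookups rather than the project's actual configuration code):

```rust
use std::env;

// Illustrative only: resolve the upstream host and listen port from the
// environment, falling back to the documented defaults.
fn proxy_settings() -> (String, u16) {
    let ollama_host = env::var("OLLAMA_HOST")
        .unwrap_or_else(|_| "http://127.0.0.1:11434".to_string());
    let proxy_port: u16 = env::var("PROXY_PORT")
        .ok()
        .and_then(|p| p.parse().ok())
        .unwrap_or(11435);
    (ollama_host, proxy_port)
}
```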
Context Size Configuration
Prevent Ollama stalls with large contexts:
- `MAX_CONTEXT_OVERRIDE` - Hard cap for context size regardless of model support (default: `16384`)
- `REQUEST_TIMEOUT_SECONDS` - Timeout for requests to Ollama (default: `120`)
Why This Matters:
Models may claim to support very large contexts (e.g., 131K tokens), but Ollama can stall or hang when actually processing them, especially with flash attention enabled. The MAX_CONTEXT_OVERRIDE provides a safety limit.
Recommended Settings:
```bash
# Conservative (most reliable)
MAX_CONTEXT_OVERRIDE=16384 REQUEST_TIMEOUT_SECONDS=120

# Moderate (test with your hardware)
MAX_CONTEXT_OVERRIDE=32768 REQUEST_TIMEOUT_SECONDS=180

# Aggressive (may cause stalls on some systems)
MAX_CONTEXT_OVERRIDE=65536 REQUEST_TIMEOUT_SECONDS=300
```
Note: If requests time out, reduce MAX_CONTEXT_OVERRIDE first before increasing timeout.
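
Putting the two limits together, the context size the proxy forwards is conceptually the smallest of the requested value, the model's `n_ctx_train`, and `MAX_CONTEXT_OVERRIDE`. A rough sketch of that reasoning (hypothetical helper, not the actual source):

```rust
/// Hypothetical helper: the context size actually forwarded to Ollama is
/// bounded by both the model's training context and the hard safety cap.
fn effective_num_ctx(requested: u64, n_ctx_train: u64, max_context_override: u64) -> u64 {
    requested.min(n_ctx_train).min(max_context_override)
}

// effective_num_ctx(131_072, 8_192, 16_384)   == 8_192  (model limit wins)
// effective_num_ctx(131_072, 131_072, 16_384) == 16_384 (safety cap wins)
```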
Generation Limit (num_predict)
THE CRITICAL FIX FOR TIMEOUTS:
The proxy automatically injects num_predict into all chat requests to prevent infinite generation loops.
The Problem:
- Ollama's default `num_predict` is `-1` (infinite)
- Without this parameter, models generate until they fill the entire context
- This causes "stalls" even with small contexts (4K)
- The model isn't stuck - it's generating millions of unwanted tokens!
How the Proxy Fixes This:
- Detects chat requests (those with a `messages` array)
- Checks if `num_predict` is already set
- If not set, injects `num_predict`:
  - Uses `max_tokens` from the request if available (e.g., 4096 from Elephas)
  - Otherwise defaults to 4096 tokens
- Logs the injection for transparency
Example:
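A hedged sketch of the injection described above, using `serde_json` (the helper name is hypothetical; `messages`, `max_tokens`, and `options.num_predict` are the fields named in this section):

```rust
use serde_json::{json, Value};

/// Hypothetical helper: inject `options.num_predict` into a chat request
/// when the client did not set it, preferring the request's `max_tokens`.
fn inject_num_predict(body: &mut Value) {
    let is_chat = body.get("messages").is_some();
    let already_set = body
        .get("options")
        .and_then(|o| o.get("num_predict"))
        .is_some();
    if is_chat && !already_set {
        let limit = body
            .get("max_tokens")
            .and_then(Value::as_u64)
            .unwrap_or(4096);
        if body.get("options").is_none() {
            body["options"] = json!({});
        }
        body["options"]["num_predict"] = json!(limit);
    }
}
```

So a request that carries `max_tokens: 4096` ends up with `options.num_predict: 4096`, while a request without `max_tokens` gets the 4096-token default.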
Why This Matters:
Without num_predict, a simple "say hello" request can generate for 3+ minutes, filling the entire context buffer with elaborations, examples, and repetitions until it crashes or times out.
Override if Needed:
If you want different generation limits, set num_predict explicitly in your request - the proxy preserves existing values.
Chunking Configuration
For large embeddings inputs, the proxy can automatically chunk text to prevent Ollama memory errors:
- `MAX_EMBEDDING_INPUT_LENGTH` - Maximum characters per embedding input (default: `2000`)
- `ENABLE_AUTO_CHUNKING` - Enable automatic chunking for large inputs (default: `true`)
How Chunking Works:
When an embeddings request contains text longer than MAX_EMBEDDING_INPUT_LENGTH:
- The proxy splits the text into smaller chunks (with 10% overlap for context preservation)
- Each chunk is sent as a separate request to Ollama sequentially
- The proxy collects all embedding vectors
- Embeddings are averaged to create a single combined embedding
- The client receives one response, transparently
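
A rough sketch of the chunk-and-average flow (illustrative only; the real implementation's function names and exact overlap handling may differ):

```rust
/// Illustrative only: split text into character chunks with ~10% overlap.
/// Assumes max_len >= 1.
fn chunk_text(text: &str, max_len: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let overlap = max_len / 10;
    let step = (max_len - overlap).max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + max_len).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}

/// Illustrative only: average per-chunk embedding vectors element-wise
/// into a single combined embedding.
fn average_embeddings(vectors: &[Vec<f32>]) -> Vec<f32> {
    let dim = vectors.first().map_or(0, Vec::len);
    let mut avg = vec![0.0f32; dim];
    for v in vectors {
        for (a, x) in avg.iter_mut().zip(v) {
            *a += x;
        }
    }
    for a in &mut avg {
        *a /= vectors.len() as f32;
    }
    avg
}
```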
Configuration examples:
```bash
# Allow larger inputs before chunking (4000 characters)
MAX_EMBEDDING_INPUT_LENGTH=4000

# Disable chunking (return an error for large inputs)
ENABLE_AUTO_CHUNKING=false
```
Performance Considerations:
- Chunking processes sequentially to avoid memory pressure
- A 10,000 character input with 2000 char limit creates ~5 chunks
- Each chunk adds ~200-500ms latency (model dependent)
- For best performance, keep inputs under the limit when possible
Flash Attention
What is Flash Attention?
Flash Attention is an optimization technique that speeds up inference and reduces memory usage. Ollama can enable it automatically for supported models.
How to Control Flash Attention
Flash Attention is global only (environment variable), not per-request:
- Let Ollama decide (RECOMMENDED): leave `OLLAMA_FLASH_ATTENTION` unset
- Explicitly enable: may cause issues with large contexts
- Explicitly disable: may help with large-context stalls
When Flash Attention Causes Problems
Symptoms:
- Requests with large contexts (>60K tokens) stall indefinitely
- GPU shows "100% allocated" but 0% utilization in Activity Monitor
- Ollama process is running but not responding
- Client times out without receiving response
Why This Happens: Flash attention with very large contexts can trigger memory allocation deadlocks or exceed Metal's working set limits on macOS, especially with M-series chips.
Solutions:
- Unset flash attention (let Ollama decide per-model)
- Reduce context size using the proxy's safety cap: `MAX_CONTEXT_OVERRIDE=16384`
- Test systematically (e.g., with `test_context_limits.sh`) to find your hardware's limits
Best Practices
✅ DO:
- Keep `OLLAMA_FLASH_ATTENTION` unset (let Ollama auto-detect)
- Use `MAX_CONTEXT_OVERRIDE=16384` for reliability
- Test with `test_context_limits.sh` to find your system's sweet spot
- Monitor GPU utilization when testing large contexts
❌ DON'T:
- Set flash attention to `false` globally (this disables it for all models)
- Use contexts >60K tokens without testing first
- Assume a model's claimed context limit works reliably in practice
Troubleshooting
500 Internal Server Error from Ollama
Symptoms:
- Embeddings requests return HTTP 500
- Ollama logs show `SIGABRT: abort` or `output_reserve: reallocating output buffer`
- The error occurs with large text inputs (>5000 characters)
Cause: Ollama's embedding models crash when trying to allocate large buffers for very long inputs.
Solutions:
- Enable chunking (should be on by default): `ENABLE_AUTO_CHUNKING=true`
- Reduce the chunk size if you still see errors: `MAX_EMBEDDING_INPUT_LENGTH=1500`
- Check the Ollama server logs for details
Input Too Large Error
Symptoms:
- Request returns HTTP 400
- Error message: "Input too large (X characters). Maximum is Y characters."
Cause:
Input exceeds MAX_EMBEDDING_INPUT_LENGTH and chunking is disabled.
Solution: Enable chunking:
ENABLE_AUTO_CHUNKING=true
Slow Embeddings Requests
Symptoms:
- Embeddings take much longer than expected
- Logs show "Processing X chunks sequentially"
Cause: Large inputs are being chunked and processed sequentially.
This is expected behavior! Chunking prevents crashes but adds latency.
To improve speed:
- Reduce input size at the source
- Increase `MAX_EMBEDDING_INPUT_LENGTH` if your hardware can handle it
- Use a smaller/faster embedding model
How It Works
- Intercept: The proxy receives the request from the client
- Detect API Format: Determine whether the request uses the OpenAI or native Ollama API
- Translate (if needed): Convert OpenAI `/v1/embeddings` → Ollama `/api/embed`
- Fetch Metadata: Query the Ollama API for the model's training parameters
- Inject Parameters: Add `options.num_ctx` with the correct value for the model
- Forward: Send the request to Ollama's native API (which accepts options)
- Translate Response: Convert the Ollama response back to OpenAI format
- Return: Pass the OpenAI-compatible response back to the client
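
For illustration, the metadata fetch might look roughly like this, using `reqwest` (with its `json` feature) against Ollama's `/api/show` endpoint; the `model_info` / `*.context_length` fields are assumptions about where the training context surfaces, so verify them against your Ollama version:

```rust
use serde_json::{json, Value};

/// Illustrative only: ask Ollama for a model's metadata and pull out a
/// training context length. Where exactly `n_ctx_train` surfaces is an
/// assumption here, not taken from the proxy's source.
async fn fetch_n_ctx_train(ollama_host: &str, model: &str) -> Option<u64> {
    let client = reqwest::Client::new();
    let resp: Value = client
        .post(format!("{ollama_host}/api/show"))
        .json(&json!({ "model": model }))
        .send()
        .await
        .ok()?
        .json()
        .await
        .ok()?;
    // Look for a key like "llama.context_length" inside `model_info`.
    resp.get("model_info")?
        .as_object()?
        .iter()
        .find(|(k, _)| k.ends_with(".context_length"))
        .and_then(|(_, v)| v.as_u64())
}
```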
Architecture
```text
Client (Elephas)
  ↓ OpenAI API format (/v1/embeddings)
Proxy (Port 11435)
  ↓ Translates to native Ollama API (/api/embed)
  ↓ Injects options.num_ctx based on model
Ollama (Port 11434)
  ↓ Returns native response
Proxy
  ↓ Translates back to OpenAI format
Client receives OpenAI-compatible response
```
Key Innovation: The proxy acts as a translation layer, converting between OpenAI's API format (which doesn't support runtime options) and Ollama's native API (which does), enabling per-request parameter control without changing global settings.
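
A sketch of that translation for the embeddings path (illustrative only; the OpenAI-style `model`/`input` fields and the `/api/embed` body shape are assumed from the flow described above):

```rust
use serde_json::{json, Value};

/// Illustrative only: translate an OpenAI-style /v1/embeddings body into
/// an Ollama /api/embed body, injecting the per-request options that the
/// OpenAI format has no field for.
fn openai_to_ollama_embed(openai_body: &Value, num_ctx: u64) -> Value {
    json!({
        "model": openai_body["model"].clone(),
        "input": openai_body["input"].clone(),
        "options": { "num_ctx": num_ctx }
    })
}
```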
Extending
The modifier framework is designed for easy extension:
Add new modifiers in `src/modifier.rs` and register them in `apply_modifiers()`.
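
The real trait lives in `src/modifier.rs`; the sketch below only shows a plausible shape for a new modifier (the trait name, method signature, and `TemperatureCapModifier` example are assumptions, not the actual API):

```rust
use serde_json::Value;

/// Assumed shape of the modifier trait; check src/modifier.rs for the
/// real definition before copying this.
trait RequestModifier {
    fn name(&self) -> &'static str;
    /// Returns true if the request body was changed.
    fn apply(&self, body: &mut Value) -> bool;
}

/// Example: cap the temperature on every request that sets one too high.
struct TemperatureCapModifier {
    max_temperature: f64,
}

impl RequestModifier for TemperatureCapModifier {
    fn name(&self) -> &'static str {
        "TemperatureCapModifier"
    }

    fn apply(&self, body: &mut Value) -> bool {
        match body
            .get("options")
            .and_then(|o| o.get("temperature"))
            .and_then(Value::as_f64)
        {
            Some(t) if t > self.max_temperature => {
                body["options"]["temperature"] = Value::from(self.max_temperature);
                true
            }
            _ => false,
        }
    }
}
```

A modifier like this would then be added to whatever list `apply_modifiers()` iterates over.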
Testing
License
MIT