Ollama Proxy (Rust)
Description
A lightweight Rust proxy for Ollama that intelligently adjusts request parameters to match each model's training configuration.
Problem
Some AI clients (like Elephas) send the same context length parameter for all models. This causes issues when:
- Embedding models trained with 8K context receive requests for 128K context
- Ollama warns: "requested context size too large for model"
- Models may perform poorly with incorrect parameters
Solution
This proxy sits between your client and Ollama, automatically:
- Detects which model is being requested
- Fetches the model's training context length (`n_ctx_train`)
- Adjusts `num_ctx` if it exceeds the model's capabilities
- Provides detailed logging of all modifications
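
In essence, the correction is a clamp of the requested `num_ctx` against `n_ctx_train`. A minimal sketch of that idea (not the project's actual code; the helper name and the use of `serde_json` are illustrative):

```rust
use serde_json::{json, Value};

/// Hypothetical helper: lower `options.num_ctx` to the model's training
/// context length when the client asked for more than the model supports.
fn clamp_num_ctx(body: &mut Value, n_ctx_train: u64) {
    if let Some(requested) = body
        .get("options")
        .and_then(|o| o.get("num_ctx"))
        .and_then(Value::as_u64)
    {
        if requested > n_ctx_train {
            // e.g. 131072 requested -> clamped to 8192 for nomic-embed-text
            body["options"]["num_ctx"] = json!(n_ctx_train);
        }
    }
}
```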
Features
- ✅ Prevents infinite generation - Auto-injects `num_predict` to limit output
- ✅ Smart chunking - Automatically splits large embedding inputs to prevent crashes
- ✅ Context safety caps - Configurable hard limits to prevent Ollama stalls
- ✅ Request timeouts - Prevents indefinite hangs with configurable limits
- ✅ Automatic parameter correction based on model metadata
- ✅ Request/response logging for debugging
- ✅ Model metadata caching for performance
- ✅ Extensible modifier framework for future enhancements
- ✅ Zero configuration for basic usage
Installation
Usage
1. Start the Proxy
```bash
# Default: Listen on 127.0.0.1:11435, proxy to 127.0.0.1:11434
# Or with custom settings:
OLLAMA_HOST=http://127.0.0.1:11434 PROXY_PORT=11435 RUST_LOG=info
```
2. Configure Your Client
Point your AI client (Elephas, etc.) to the proxy instead of Ollama directly:
Before: http://127.0.0.1:11434
After: http://127.0.0.1:11435
3. Watch the Magic
The proxy will log all requests and modifications:
```text
📨 Incoming request: POST /v1/embeddings
📋 Request body: {
  "model": "nomic-embed-text",
  "input": "test",
  "options": {
    "num_ctx": 131072
  }
}
🔍 Detected model: nomic-embed-text
📊 Model metadata - n_ctx_train: 8192
⚠️ num_ctx (131072) exceeds model training context (8192)
✏️ Modified options.num_ctx: 131072 → 8192
🔧 ContextLimitModifier applied modifications
📬 Response status: 200 OK
```
Configuration
Environment variables:
- `OLLAMA_HOST` - Target Ollama server (default: `http://127.0.0.1:11434`)
- `PROXY_PORT` - Port to listen on (default: `11435`)
- `RUST_LOG` - Log level: `error`, `warn`, `info`, `debug`, `trace` (default: `info`)
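
As a minimal sketch of how these defaults can be resolved at startup (illustrative only, assuming plain `std::env` lookups rather than the project's actual configuration code):

```rust
use std::env;

// Illustrative only: resolve the upstream host and listen port from the
// environment, falling back to the documented defaults.
fn proxy_settings() -> (String, u16) {
    let ollama_host = env::var("OLLAMA_HOST")
        .unwrap_or_else(|_| "http://127.0.0.1:11434".to_string());
    let proxy_port: u16 = env::var("PROXY_PORT")
        .ok()
        .and_then(|p| p.parse().ok())
        .unwrap_or(11435);
    (ollama_host, proxy_port)
}
```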
Context Size Configuration
Prevent Ollama stalls with large contexts:
- `MAX_CONTEXT_OVERRIDE` - Hard cap for context size regardless of model support (default: `16384`)
- `REQUEST_TIMEOUT_SECONDS` - Timeout for requests to Ollama (default: `120`)
Why This Matters:
Models may claim to support very large contexts (e.g., 131K tokens), but Ollama can stall or hang when actually processing them, especially with flash attention enabled. The MAX_CONTEXT_OVERRIDE provides a safety limit.
Recommended Settings:
```bash
# Conservative (most reliable)
MAX_CONTEXT_OVERRIDE=16384 REQUEST_TIMEOUT_SECONDS=120

# Moderate (test with your hardware)
MAX_CONTEXT_OVERRIDE=32768 REQUEST_TIMEOUT_SECONDS=180

# Aggressive (may cause stalls on some systems)
MAX_CONTEXT_OVERRIDE=65536 REQUEST_TIMEOUT_SECONDS=300
```
Note: If requests time out, reduce MAX_CONTEXT_OVERRIDE first before increasing timeout.
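
Putting the two limits together, the context size the proxy forwards is conceptually the smallest of the requested value, the model's `n_ctx_train`, and `MAX_CONTEXT_OVERRIDE`. A rough sketch of that reasoning (hypothetical helper, not the actual source):

```rust
/// Hypothetical helper: the context size actually forwarded to Ollama is
/// bounded by both the model's training context and the hard safety cap.
fn effective_num_ctx(requested: u64, n_ctx_train: u64, max_context_override: u64) -> u64 {
    requested.min(n_ctx_train).min(max_context_override)
}

// effective_num_ctx(131_072, 8_192, 16_384)   == 8_192  (model limit wins)
// effective_num_ctx(131_072, 131_072, 16_384) == 16_384 (safety cap wins)
```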
Generation Limit (num_predict)
THE CRITICAL FIX FOR TIMEOUTS:
The proxy automatically injects num_predict into all chat requests to prevent infinite generation loops.
The Problem:
- Ollama's default `num_predict` is `-1` (infinite)
- Without this parameter, models generate until they fill the entire context
- This causes "stalls" even with small contexts (4K)
- The model isn't stuck - it's generating millions of unwanted tokens!
How the Proxy Fixes This:
- Detects chat requests (those with a `messages` array)
- Checks if `num_predict` is already set
- If not set, injects `num_predict`:
  - Uses `max_tokens` from the request if available (e.g., 4096 from Elephas)
  - Otherwise defaults to 4096 tokens
- Logs the injection for transparency
Example:
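A hedged sketch of the injection described above, using `serde_json` (the helper name is hypothetical; `messages`, `max_tokens`, and `options.num_predict` are the fields named in this section):

```rust
use serde_json::{json, Value};

/// Hypothetical helper: inject `options.num_predict` into a chat request
/// when the client did not set it, preferring the request's `max_tokens`.
fn inject_num_predict(body: &mut Value) {
    let is_chat = body.get("messages").is_some();
    let already_set = body
        .get("options")
        .and_then(|o| o.get("num_predict"))
        .is_some();
    if is_chat && !already_set {
        let limit = body
            .get("max_tokens")
            .and_then(Value::as_u64)
            .unwrap_or(4096);
        if body.get("options").is_none() {
            body["options"] = json!({});
        }
        body["options"]["num_predict"] = json!(limit);
    }
}
```

So a request that carries `max_tokens: 4096` ends up with `options.num_predict: 4096`, while a request without `max_tokens` gets the 4096-token default.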
Why This Matters:
Without num_predict, a simple "say hello" request can generate for 3+ minutes, filling the entire context buffer with elaborations, examples, and repetitions until it crashes or times out.
Override if Needed:
If you want different generation limits, set num_predict explicitly in your request - the proxy preserves existing values.
Chunking Configuration
For large embeddings inputs, the proxy can automatically chunk text to prevent Ollama memory errors:
- `MAX_EMBEDDING_INPUT_LENGTH` - Maximum characters per embedding input (default: `2000`)
- `ENABLE_AUTO_CHUNKING` - Enable automatic chunking for large inputs (default: `true`)
How Chunking Works:
When an embeddings request contains text longer than MAX_EMBEDDING_INPUT_LENGTH:
- The proxy splits the text into smaller chunks (with 10% overlap for context preservation)
- Each chunk is sent as a separate request to Ollama sequentially
- The proxy collects all embedding vectors
- Embeddings are averaged to create a single combined embedding
- The client receives one response, transparently
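
A rough sketch of the chunk-and-average flow (illustrative only; the real implementation's function names and exact overlap handling may differ):

```rust
/// Illustrative only: split text into character chunks with ~10% overlap.
/// Assumes max_len >= 1.
fn chunk_text(text: &str, max_len: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let overlap = max_len / 10;
    let step = (max_len - overlap).max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + max_len).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}

/// Illustrative only: average per-chunk embedding vectors element-wise
/// into a single combined embedding.
fn average_embeddings(vectors: &[Vec<f32>]) -> Vec<f32> {
    let dim = vectors.first().map_or(0, Vec::len);
    let mut avg = vec![0.0f32; dim];
    for v in vectors {
        for (a, x) in avg.iter_mut().zip(v) {
            *a += x;
        }
    }
    for a in &mut avg {
        *a /= vectors.len() as f32;
    }
    avg
}
```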
Configuration examples:
```bash
# Allow larger inputs before chunking (4000 characters)
MAX_EMBEDDING_INPUT_LENGTH=4000

# Disable chunking (return an error for large inputs)
ENABLE_AUTO_CHUNKING=false
```
Performance Considerations:
- Chunking processes sequentially to avoid memory pressure
- A 10,000 character input with 2000 char limit creates ~5 chunks
- Each chunk adds ~200-500ms latency (model dependent)
- For best performance, keep inputs under the limit when possible
Flash Attention
What is Flash Attention?
Flash Attention is an optimization technique that speeds up inference and reduces memory usage. Ollama can enable it automatically for supported models.
How to Control Flash Attention
Flash Attention is global only (environment variable), not per-request:
- Let Ollama decide (RECOMMENDED): leave `OLLAMA_FLASH_ATTENTION` unset
- Explicitly enable: may cause issues with large contexts
- Explicitly disable: may help with large-context stalls
When Flash Attention Causes Problems
Symptoms:
- Requests with large contexts (>60K tokens) stall indefinitely
- GPU shows "100% allocated" but 0% utilization in Activity Monitor
- Ollama process is running but not responding
- Client times out without receiving response
Why This Happens: Flash attention with very large contexts can trigger memory allocation deadlocks or exceed Metal's working set limits on macOS, especially with M-series chips.
Solutions:
- Unset flash attention (let Ollama decide per-model)
- Reduce context size using the proxy's safety cap: `MAX_CONTEXT_OVERRIDE=16384`
- Test systematically (e.g., with `test_context_limits.sh`) to find your hardware's limits
Best Practices
✅ DO:
- Keep `OLLAMA_FLASH_ATTENTION` unset (let Ollama auto-detect)
- Use `MAX_CONTEXT_OVERRIDE=16384` for reliability
- Test with `test_context_limits.sh` to find your system's sweet spot
- Monitor GPU utilization when testing large contexts
❌ DON'T:
- Set flash attention to `false` globally (this disables it for all models)
- Use contexts >60K tokens without testing first
- Assume a model's claimed context limit works reliably in practice
Troubleshooting
500 Internal Server Error from Ollama
Symptoms:
- Embeddings requests return HTTP 500
- Ollama logs show `SIGABRT: abort` or `output_reserve: reallocating output buffer`
- The error occurs with large text inputs (>5000 characters)
Cause: Ollama's embedding models crash when trying to allocate large buffers for very long inputs.
Solutions:
- Enable chunking (should be on by default): `ENABLE_AUTO_CHUNKING=true`
- Reduce the chunk size if you still see errors: `MAX_EMBEDDING_INPUT_LENGTH=1500`
- Check the Ollama server logs for details
Input Too Large Error
Symptoms:
- Request returns HTTP 400
- Error message: "Input too large (X characters). Maximum is Y characters."
Cause:
Input exceeds MAX_EMBEDDING_INPUT_LENGTH and chunking is disabled.
Solution: Enable chunking:
ENABLE_AUTO_CHUNKING=true
Slow Embeddings Requests
Symptoms:
- Embeddings take much longer than expected
- Logs show "Processing X chunks sequentially"
Cause: Large inputs are being chunked and processed sequentially.
This is expected behavior! Chunking prevents crashes but adds latency.
To improve speed:
- Reduce input size at the source
- Increase `MAX_EMBEDDING_INPUT_LENGTH` if your hardware can handle it
- Use a smaller/faster embedding model
How It Works
- Intercept: The proxy receives the request from the client
- Detect API Format: Determine whether the request uses the OpenAI or native Ollama API
- Translate (if needed): Convert OpenAI `/v1/embeddings` → Ollama `/api/embed`
- Fetch Metadata: Query the Ollama API for the model's training parameters
- Inject Parameters: Add `options.num_ctx` with the correct value for the model
- Forward: Send the request to Ollama's native API (which accepts options)
- Translate Response: Convert the Ollama response back to OpenAI format
- Return: Pass the OpenAI-compatible response back to the client
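
For illustration, the metadata fetch might look roughly like this, using `reqwest` (with its `json` feature) against Ollama's `/api/show` endpoint; the `model_info` / `*.context_length` fields are assumptions about where the training context surfaces, so verify them against your Ollama version:

```rust
use serde_json::{json, Value};

/// Illustrative only: ask Ollama for a model's metadata and pull out a
/// training context length. Where exactly `n_ctx_train` surfaces is an
/// assumption here, not taken from the proxy's source.
async fn fetch_n_ctx_train(ollama_host: &str, model: &str) -> Option<u64> {
    let client = reqwest::Client::new();
    let resp: Value = client
        .post(format!("{ollama_host}/api/show"))
        .json(&json!({ "model": model }))
        .send()
        .await
        .ok()?
        .json()
        .await
        .ok()?;
    // Look for a key like "llama.context_length" inside `model_info`.
    resp.get("model_info")?
        .as_object()?
        .iter()
        .find(|(k, _)| k.ends_with(".context_length"))
        .and_then(|(_, v)| v.as_u64())
}
```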
Architecture
```text
Client (Elephas)
  ↓ OpenAI API format (/v1/embeddings)
Proxy (Port 11435)
  ↓ Translates to native Ollama API (/api/embed)
  ↓ Injects options.num_ctx based on model
Ollama (Port 11434)
  ↓ Returns native response
Proxy
  ↓ Translates back to OpenAI format
Client receives OpenAI-compatible response
```
Key Innovation: The proxy acts as a translation layer, converting between OpenAI's API format (which doesn't support runtime options) and Ollama's native API (which does), enabling per-request parameter control without changing global settings.
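
A sketch of that translation for the embeddings path (illustrative only; the OpenAI-style `model`/`input` fields and the `/api/embed` body shape are assumed from the flow described above):

```rust
use serde_json::{json, Value};

/// Illustrative only: translate an OpenAI-style /v1/embeddings body into
/// an Ollama /api/embed body, injecting the per-request options that the
/// OpenAI format has no field for.
fn openai_to_ollama_embed(openai_body: &Value, num_ctx: u64) -> Value {
    json!({
        "model": openai_body["model"].clone(),
        "input": openai_body["input"].clone(),
        "options": { "num_ctx": num_ctx }
    })
}
```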
Extending
The modifier framework is designed for easy extension:
Add new modifiers in `src/modifier.rs` and register them in `apply_modifiers()`.
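
The real trait lives in `src/modifier.rs`; the sketch below only shows a plausible shape for a new modifier (the trait name, method signature, and `TemperatureCapModifier` example are assumptions, not the actual API):

```rust
use serde_json::Value;

/// Assumed shape of the modifier trait; check src/modifier.rs for the
/// real definition before copying this.
trait RequestModifier {
    fn name(&self) -> &'static str;
    /// Returns true if the request body was changed.
    fn apply(&self, body: &mut Value) -> bool;
}

/// Example: cap the temperature on every request that sets one too high.
struct TemperatureCapModifier {
    max_temperature: f64,
}

impl RequestModifier for TemperatureCapModifier {
    fn name(&self) -> &'static str {
        "TemperatureCapModifier"
    }

    fn apply(&self, body: &mut Value) -> bool {
        match body
            .get("options")
            .and_then(|o| o.get("temperature"))
            .and_then(Value::as_f64)
        {
            Some(t) if t > self.max_temperature => {
                body["options"]["temperature"] = Value::from(self.max_temperature);
                true
            }
            _ => false,
        }
    }
}
```

A modifier like this would then be added to whatever list `apply_modifiers()` iterates over.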
Testing
License
MIT