A3S Power
Overview
A3S Power is an Ollama-compatible CLI tool and HTTP server for local model management and inference. It provides both an Ollama-compatible native API and an OpenAI-compatible API, so existing tools, SDKs, and frontends work out of the box.
Basic Usage
```bash
# Pull a model by name (resolves from Ollama registry, built-in registry, or HuggingFace)
a3s-power pull llama3.2:3b

# Pull from a direct URL
a3s-power pull <direct-url-to-gguf>

# Interactive chat
a3s-power run llama3.2:3b

# Single prompt
a3s-power run llama3.2:3b --prompt "Why is the sky blue?"

# Push a model to a remote registry
a3s-power push llama3.2:3b --destination <registry-url>

# Start HTTP server
a3s-power serve
```
Features
- CLI Model Management: Pull, list, show, delete, and push models from the command line
- Ollama Registry Integration: Pull any model from `registry.ollama.ai` by name (`llama3.2:3b`) — primary resolution source with built-in registry and HuggingFace fallback
- Interactive Chat: Multi-turn conversation with streaming token output
- Vision/Multimodal Support: Accept image URLs in chat messages (OpenAI-compatible `content` array format)
- Tool/Function Calling: Structured tool definitions, tool choice, and tool call responses (OpenAI-compatible)
- JSON Schema Structured Output: Constrain model output to match a JSON Schema via GBNF grammar generation — supports `"json"`, `{"type":"json_object"}`, or full JSON Schema objects
- Chat Template Auto-Detection: Detects ChatML, Llama, Phi, and Generic templates from GGUF metadata
- Jinja2 Template Engine: Renders arbitrary Jinja2 chat templates via `minijinja` (Llama 3, Gemma, ChatML, Phi, custom) with hardcoded fallback
- KV Cache Reuse: Persists `LlamaContext` across requests with prefix matching — skips re-evaluating shared prompt tokens for multi-turn speedup
- Tool Call Parsing: Parses model output into structured `tool_calls` — supports `<tool_call>` XML, `[TOOL_CALLS]` prefix, and raw JSON formats
- Modelfile Support: Create custom models with `FROM`, `PARAMETER`, `SYSTEM`, `TEMPLATE`, `ADAPTER` (LoRA/QLoRA), `LICENSE`, and `MESSAGE` (pre-seeded conversations) directives
- Multiple Concurrent Models: Load multiple models with LRU eviction at configurable capacity
- Automatic Model Unloading: Background keep_alive reaper unloads idle models after a configurable timeout (default 5m)
- GPU Acceleration: Configurable GPU layer offloading via the `[gpu]` config section with automatic GPU detection (Metal/CUDA), multi-GPU support (`main_gpu`), and per-request `num_gpu` override
- GPU Auto-Detection: Automatically detects Apple Metal and NVIDIA CUDA GPUs at server startup and sets optimal `gpu_layers` when not explicitly configured
- Memory Estimation: Estimates VRAM requirements before loading a model (model weights + KV cache + compute overhead) and logs warnings
- Full Ollama Options: All Ollama generation options supported — `repeat_last_n`, `penalize_newline`, `num_batch`, `num_thread`, `num_thread_batch`, `use_mmap`, `use_mlock`, `numa`, `flash_attention`, `num_gpu`, `main_gpu` — in addition to standard sampling parameters
- Embedding Support: Real embedding generation with automatic model reload in embedding mode
- HTTP Server: Axum-based server with CORS, tracing, and metrics middleware
- Ollama-Compatible API: `/api/generate`, `/api/chat`, `/api/tags`, `/api/pull`, `/api/push`, `/api/show`, `/api/delete`, `/api/embeddings`, `/api/embed`, `/api/ps`, `/api/copy`, `/api/version`, `/api/blobs/:digest`
- OpenAI-Compatible API: `/v1/chat/completions`, `/v1/completions`, `/v1/models`, `/v1/embeddings`
- Blob Management API: Check, upload, and download content-addressed blobs via REST
- Push API: Upload models to remote registries with progress reporting
- NDJSON Streaming: Native API endpoints stream as `application/x-ndjson` (Ollama wire format); OpenAI endpoints use SSE
- Context Token Return: `/api/generate` returns token IDs in the `context` field for conversation continuity
- Prometheus Metrics: `GET /metrics` endpoint with request counts, durations, tokens, model gauges, inference duration, TTFT, cost, evictions, model memory, and GPU metrics
- Usage Dashboard: `GET /v1/usage` endpoint with date range and model filtering for cost tracking
- GGUF Metadata Reader: Lightweight binary parser for GGUF file headers — extracts architecture metadata and tensor descriptors without loading weights
- Verbose Show: `/api/show` with `verbose: true` returns full GGUF metadata and tensor information
- Per-Layer Pull Progress: Pull progress shows per-layer digest identifiers (`pulling sha256:abc...`) matching Ollama's output format
- Content-Addressed Storage: Model blobs stored by SHA-256 hash with automatic deduplication
- llama.cpp Backend: GGUF inference via `llama-cpp-2` Rust bindings (optional feature flag)
- Health Check: `GET /health` endpoint with uptime, version, and loaded model count
- Model Auto-Loading: Models are automatically loaded on first inference request with LRU eviction
- TOML Configuration: User-configurable host, port, GPU settings, keep_alive, and storage settings
- Ollama Environment Variables: `OLLAMA_HOST`, `OLLAMA_MODELS`, `OLLAMA_KEEP_ALIVE`, `OLLAMA_MAX_LOADED_MODELS`, `OLLAMA_NUM_GPU` for drop-in compatibility
- Download Resumption: Interrupted model downloads resume automatically via HTTP Range requests
- Async-First: Built on Tokio for high-performance async operations
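The KV cache reuse above hinges on prefix matching: each turn of a chat resends the whole conversation, so most of the new prompt is a prefix the model has already evaluated. A minimal sketch of the idea (illustrative only; the real implementation tracks the tokens already evaluated into a `LlamaContext`):

```rust
// Illustrative sketch of KV-cache prefix matching. Tokens shared with the
// previous request keep their cached KV entries and are not re-evaluated;
// only the suffix after the common prefix needs a fresh forward pass.
fn common_prefix_len(cached: &[u32], incoming: &[u32]) -> usize {
    cached
        .iter()
        .zip(incoming.iter())
        .take_while(|(a, b)| a == b)
        .count()
}

fn main() {
    // Turn 1 evaluated these prompt tokens (hypothetical token IDs):
    let cached = [1, 15, 27, 99, 4];
    // Turn 2 repeats the conversation so far and appends a new user message:
    let incoming = [1, 15, 27, 99, 4, 88, 12, 7];

    let reused = common_prefix_len(&cached, &incoming);
    assert_eq!(reused, 5);
    // Only incoming[reused..] (3 tokens here) needs evaluation.
    assert_eq!(incoming.len() - reused, 3);
    println!("reused {} tokens, evaluating {}", reused, incoming.len() - reused);
}
```

The longer the shared conversation history, the larger the skipped prefix, which is why the speedup compounds over multi-turn sessions.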
Quality Metrics
Test Coverage
861 unit tests with 90.11% region coverage across 59 source files:
| Module | Lines | Coverage | Functions | Coverage |
|---|---|---|---|---|
| api/health.rs | 62 | 100.00% | 10 | 100.00% |
| api/mod.rs | 27 | 100.00% | 5 | 100.00% |
| api/native/mod.rs | 22 | 100.00% | 1 | 100.00% |
| api/native/ps.rs | 149 | 100.00% | 17 | 100.00% |
| api/native/version.rs | 21 | 100.00% | 6 | 100.00% |
| api/openai/mod.rs | 30 | 100.00% | 4 | 100.00% |
| api/openai/usage.rs | 384 | 100.00% | 27 | 100.00% |
| backend/llamacpp.rs | 186 | 100.00% | 26 | 100.00% |
| backend/test_utils.rs | 130 | 100.00% | 18 | 100.00% |
| cli/delete.rs | 102 | 100.00% | 5 | 100.00% |
| cli/list.rs | 88 | 100.00% | 7 | 100.00% |
| error.rs | 93 | 100.00% | 19 | 100.00% |
| model/manifest.rs | 164 | 100.00% | 19 | 100.00% |
| server/router.rs | 209 | 100.00% | 33 | 100.00% |
| backend/json_schema.rs | 389 | 98.97% | 53 | 100.00% |
| backend/tool_parser.rs | 347 | 99.14% | 43 | 100.00% |
| model/modelfile.rs | 552 | 99.28% | 42 | 100.00% |
| server/state.rs | 266 | 99.25% | 37 | 97.30% |
| api/sse.rs | 95 | 98.95% | 16 | 93.75% |
| api/types.rs | 613 | 98.37% | 52 | 100.00% |
| server/metrics.rs | 607 | 98.35% | 54 | 96.30% |
| backend/chat_template.rs | 349 | 98.28% | 32 | 100.00% |
| backend/mod.rs | 65 | 98.46% | 15 | 100.00% |
| dirs.rs | 55 | 98.18% | 12 | 91.67% |
| backend/types.rs | 261 | 98.08% | 23 | 95.65% |
| api/native/chat.rs | 735 | 94.42% | 32 | 100.00% |
| api/native/generate.rs | 709 | 95.77% | 32 | 100.00% |
| api/native/models.rs | 457 | 96.06% | 32 | 100.00% |
| config.rs | 475 | 96.84% | 60 | 96.67% |
| api/openai/embeddings.rs | 187 | 95.72% | 9 | 100.00% |
| api/native/blobs.rs | 212 | 94.81% | 15 | 100.00% |
| api/autoload.rs | 220 | 94.09% | 24 | 100.00% |
| api/native/embed.rs | 158 | 93.04% | 9 | 100.00% |
| model/gguf.rs | 746 | 93.43% | 80 | 80.00% |
| api/openai/models.rs | 118 | 93.22% | 9 | 100.00% |
| api/native/embeddings.rs | 133 | 96.24% | 7 | 100.00% |
| api/native/copy.rs | 60 | 91.67% | 6 | 100.00% |
| cli/mod.rs | 340 | 91.18% | 34 | 100.00% |
| api/native/create.rs | 340 | 90.00% | 19 | 94.74% |
| api/openai/chat.rs | 531 | 88.14% | 23 | 78.26% |
| model/registry.rs | 308 | 87.99% | 42 | 83.33% |
| model/storage.rs | 331 | 87.31% | 31 | 83.87% |
| cli/show.rs | 234 | 84.19% | 15 | 100.00% |
| api/openai/completions.rs | 394 | 82.99% | 14 | 78.57% |
| backend/gpu.rs | 281 | 82.92% | 38 | 92.11% |
| model/resolve.rs | 341 | 75.66% | 54 | 79.63% |
| api/native/push.rs | 187 | 75.40% | 10 | 80.00% |
| cli/push.rs | 43 | 74.42% | 10 | 90.00% |
| model/ollama_registry.rs | 530 | 73.21% | 57 | 70.18% |
| cli/ps.rs | 152 | 70.39% | 22 | 81.82% |
| cli/serve.rs | 34 | 70.59% | 4 | 50.00% |
| cli/stop.rs | 102 | 70.59% | 12 | 75.00% |
| server/mod.rs | 84 | 65.48% | 12 | 66.67% |
| model/push.rs | 151 | 62.91% | 27 | 81.48% |
| cli/pull.rs | 72 | 62.50% | 6 | 83.33% |
| api/native/pull.rs | 269 | 50.19% | 16 | 81.25% |
| cli/run.rs | 845 | 48.88% | 57 | 85.96% |
| model/pull.rs | 384 | 48.70% | 36 | 63.89% |
| TOTAL | 15429 | 87.94% | 1430 | 91.47% |
Overall: 90.11% region coverage, 91.47% function coverage, 87.94% line coverage
Run coverage report:
```bash
LLVM_COV=/opt/homebrew/Cellar/llvm/21.1.8/bin/llvm-cov \
LLVM_PROFDATA=/opt/homebrew/Cellar/llvm/21.1.8/bin/llvm-profdata \
  cargo llvm-cov
```
Architecture
Components
```
┌─────────────────────────────────────────────────┐
│                   a3s-power                     │
│                                                 │
│  CLI Layer                                      │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐  │
│  │ run  │ │ pull │ │ list │ │ push │ │serve │  │
│  └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘  │
│     │        │        │        │        │       │
│  Model Layer                   │        │       │
│  ┌────────────────────┴────────┐        │       │
│  │        ModelRegistry        │        │       │
│  │  ┌──────────┐ ┌──────────┐  │        │       │
│  │  │ manifest │ │ storage  │  │        │       │
│  │  └──────────┘ └──────────┘  │        │       │
│  └─────────────────────────────┘        │       │
│                                         │       │
│  Backend Layer                          │       │
│  ┌─────────────────────────────┐        │       │
│  │       BackendRegistry       │        │       │
│  │  ┌──────────────────────┐   │        │       │
│  │  │   LlamaCppBackend    │   │        │       │
│  │  │  (feature: llamacpp) │   │        │       │
│  │  └──────────────────────┘   │        │       │
│  └─────────────────────────────┘        │       │
│                                         │       │
│  Server Layer ◄─────────────────────────┘       │
│  ┌─────────────────────────────────────┐        │
│  │            Axum Router              │        │
│  │  ┌────────────┐ ┌────────────────┐  │        │
│  │  │  /api/*    │ │     /v1/*      │  │        │
│  │  │  (Ollama)  │ │   (OpenAI)     │  │        │
│  │  └────────────┘ └────────────────┘  │        │
│  └─────────────────────────────────────┘        │
└─────────────────────────────────────────────────┘
```
Backend Trait
The Backend trait abstracts inference engines. The llama.cpp backend is feature-gated; without the llamacpp feature, Power can still manage models but returns "backend not available" for inference calls.
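A rough sketch of what such a feature-gated trait/registry arrangement can look like (the names, signatures, and error variant here are illustrative, not the actual `backend/mod.rs` API):

```rust
// Illustrative sketch only: shows the shape of a backend registry that
// returns a "backend not available" error when no inference engine is
// compiled in (i.e. built without the `llamacpp` feature).
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum PowerError {
    BackendNotAvailable,
}

trait Backend {
    fn complete(&self, prompt: &str) -> Result<String, PowerError>;
}

struct BackendRegistry {
    backends: HashMap<String, Box<dyn Backend>>,
}

impl BackendRegistry {
    fn new() -> Self {
        Self { backends: HashMap::new() }
    }

    fn complete(&self, backend: &str, prompt: &str) -> Result<String, PowerError> {
        match self.backends.get(backend) {
            Some(b) => b.complete(prompt),
            None => Err(PowerError::BackendNotAvailable),
        }
    }
}

fn main() {
    // Built without an inference feature: the registry is empty, so model
    // management still works but inference calls fail with a typed error.
    let registry = BackendRegistry::new();
    assert_eq!(
        registry.complete("llamacpp", "hello"),
        Err(PowerError::BackendNotAvailable)
    );
    println!("backend not available, as expected");
}
```

Keeping inference behind a trait object lets the rest of the server compile and run identically whether or not the heavy llama.cpp dependency is present.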
Quick Start
Build
```bash
# Build without inference backend (model management only)
cargo build --release

# Build with llama.cpp inference (requires C++ compiler + CMake)
cargo build --release --features llamacpp
```
Model Management
```bash
# Pull a model by name (Ollama registry → built-in registry → HuggingFace fallback)
a3s-power pull llama3.2:3b

# Pull from a direct URL
a3s-power pull <direct-url-to-gguf>

# List local models
a3s-power list

# Show model details
a3s-power show llama3.2:3b

# Delete a model
a3s-power delete llama3.2:3b

# Push a model to a remote registry
a3s-power push llama3.2:3b --destination <registry-url>
```
Interactive Chat
```bash
# Start interactive chat session
a3s-power run llama3.2:3b

# Send a single prompt
a3s-power run llama3.2:3b --prompt "Why is the sky blue?"
```
HTTP Server
```bash
# Start server on default port (127.0.0.1:11434)
a3s-power serve

# Custom host and port
a3s-power serve --host 0.0.0.0 --port 8080
```
API Reference
Server
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Health check (status, version, uptime, loaded models) |
| `GET` | `/metrics` | Prometheus metrics (requests, durations, tokens, inference, TTFT, cost, evictions, model memory, GPU) |
Native API (Ollama-Compatible)
| Method | Path | Description |
|---|---|---|
| `POST` | `/api/generate` | Text generation (streaming/non-streaming) |
| `POST` | `/api/chat` | Chat completion with vision & tool support (streaming/non-streaming) |
| `POST` | `/api/pull` | Download a model by name or URL (streaming progress) |
| `POST` | `/api/push` | Push a model to a remote registry |
| `GET` | `/api/tags` | List local models |
| `POST` | `/api/show` | Show model details |
| `DELETE` | `/api/delete` | Delete a model |
| `POST` | `/api/embeddings` | Generate embeddings |
| `POST` | `/api/embed` | Batch embedding generation |
| `GET` | `/api/ps` | List running/loaded models |
| `POST` | `/api/copy` | Copy/alias a model |
| `GET` | `/api/version` | Server version |
| `HEAD` | `/api/blobs/:digest` | Check if a blob exists |
| `POST` | `/api/blobs/:digest` | Upload a blob with SHA-256 verification |
| `GET` | `/api/blobs/:digest` | Download a blob |
| `DELETE` | `/api/blobs/:digest` | Delete a blob |
OpenAI-Compatible API
| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/chat/completions` | Chat completion (streaming/non-streaming) |
| `POST` | `/v1/completions` | Text completion (streaming/non-streaming) |
| `GET` | `/v1/models` | List available models |
| `POST` | `/v1/embeddings` | Generate embeddings |
| `GET` | `/v1/usage` | Usage and cost dashboard data (date range + model filter) |
Examples
List Models

```bash
# OpenAI-compatible
curl http://localhost:11434/v1/models

# Ollama-compatible
curl http://localhost:11434/api/tags
```

Chat Completion (OpenAI)

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Chat Completion with Streaming

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello!"}], "stream": true}'
```

Text Generation (Ollama)

```bash
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Why is the sky blue?"}'
```

Text Completion (OpenAI)

```bash
curl http://localhost:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "prompt": "Once upon a time"}'
```

Vision/Multimodal (OpenAI)

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<vision-model>",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}}
      ]
    }]
  }'
```

Tool/Function Calling (OpenAI)

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

Push Model

```bash
curl http://localhost:11434/api/push \
  -d '{"model": "llama3.2:3b"}'
```

Structured Output (JSON Schema)

```bash
# Constrain output to match a JSON Schema
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2:3b",
    "prompt": "Describe Paris as JSON",
    "format": {
      "type": "object",
      "properties": {"name": {"type": "string"}, "population": {"type": "number"}},
      "required": ["name"]
    }
  }'
```

Blob Management

```bash
# Check if blob exists
curl -I http://localhost:11434/api/blobs/sha256:<digest>

# Upload blob
curl -X POST --data-binary @model.gguf http://localhost:11434/api/blobs/sha256:<digest>

# Download blob
curl -o model.gguf http://localhost:11434/api/blobs/sha256:<digest>
```
CLI Commands
| Command | Description |
|---|---|
| `a3s-power run <model> [--prompt <text>]` | Load model and start interactive chat, or send a single prompt |
| `a3s-power pull <name_or_url>` | Download a model by name (`llama3.2:3b`) or direct URL |
| `a3s-power push <model> --destination <url>` | Push a model to a remote registry |
| `a3s-power list` | List all locally available models |
| `a3s-power show <model>` | Show model details (format, size, parameters) |
| `a3s-power delete <model>` | Delete a model from local storage |
| `a3s-power create <name> -f <modelfile>` | Create a custom model from a Modelfile |
| `a3s-power cp <source> <destination>` | Copy/alias a model to a new name |
| `a3s-power ps` | List running (loaded) models on the server |
| `a3s-power stop <model>` | Stop (unload) a running model from the server |
| `a3s-power serve [--host <addr>] [--port <port>]` | Start HTTP server (default: 127.0.0.1:11434) |
Model Storage
Models are stored in ~/.a3s/power/ (override with $A3S_POWER_HOME):
```
~/.a3s/power/
├── config.toml          # User configuration
└── models/
    ├── manifests/       # JSON manifest files
    │   ├── llama-2-7b.json
    │   └── qwen2.5-7b.json
    └── blobs/           # Content-addressed model files
        ├── sha256-abc123...
        └── sha256-def456...
```
Content-Addressed Storage
Model files are stored by their SHA-256 hash, enabling:
- Deduplication: Identical files share storage
- Integrity verification: Blobs can be verified against their hash
- Clean deletion: Remove manifest + blob independently
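A toy sketch of the content-addressed idea (illustrative only; the real store computes SHA-256 and writes `sha256-<hex>` files under `models/blobs/`, while this sketch uses std's `DefaultHasher` as a stand-in digest to stay dependency-free):

```rust
// Toy content-addressed blob store: content is keyed by its digest, so
// identical bytes always land under the same key (deduplication) and any
// blob can be re-hashed to verify integrity against its key.
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

struct BlobStore {
    blobs: HashMap<String, Vec<u8>>, // key = digest, value = blob bytes
}

fn digest_of(data: &[u8]) -> String {
    // Placeholder digest; the real store uses SHA-256 here.
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    format!("sha256-{:016x}", h.finish())
}

impl BlobStore {
    fn new() -> Self {
        Self { blobs: HashMap::new() }
    }

    // Store bytes under their digest; duplicates collapse to one entry.
    fn put(&mut self, data: &[u8]) -> String {
        let digest = digest_of(data);
        self.blobs.entry(digest.clone()).or_insert_with(|| data.to_vec());
        digest
    }

    // Integrity check: recompute the digest and compare with the key.
    fn verify(&self, digest: &str) -> bool {
        self.blobs.get(digest).map_or(false, |data| digest_of(data) == digest)
    }
}

fn main() {
    let mut store = BlobStore::new();
    let a = store.put(b"model weights");
    let b = store.put(b"model weights"); // same content
    assert_eq!(a, b);                    // deduplicated: one blob stored
    assert_eq!(store.blobs.len(), 1);
    assert!(store.verify(&a));
    println!("stored 1 blob for 2 identical puts");
}
```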
Configuration
Configuration is read from ~/.a3s/power/config.toml:
```toml
host = "127.0.0.1"
port = 11434
max_loaded_models = 1
keep_alive = "5m"   # auto-unload idle models ("0"=immediate, "-1"=never, "5m", "1h")

[gpu]
gpu_layers = -1     # offload all layers to GPU (-1=all, 0=CPU only)
main_gpu = 0        # primary GPU index
```
| Field | Default | Description |
|---|---|---|
| `host` | `127.0.0.1` | HTTP server bind address |
| `port` | `11434` | HTTP server port |
| `data_dir` | `~/.a3s/power` | Base directory for model storage |
| `max_loaded_models` | `1` | Maximum models loaded in memory concurrently |
| `keep_alive` | `"5m"` | Auto-unload idle models after this duration (`"0"`=immediate, `"-1"`=never, `"5m"`, `"1h"`, `"30s"`) |
| `gpu.gpu_layers` | `0` | Number of layers to offload to GPU (0=CPU, -1=all) |
| `gpu.main_gpu` | `0` | Index of the primary GPU to use |
All fields are optional and have sensible defaults.
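The `keep_alive` strings can be understood with a small parser sketch (illustrative; the actual parsing lives in the config module and may accept more forms):

```rust
// Illustrative keep_alive parser: returns seconds, with -1 meaning
// "never unload". Accepts "0" (unload immediately), negative numbers
// (never), and suffixed durations like "30s", "5m", "1h".
fn parse_keep_alive(s: &str) -> Option<i64> {
    if let Ok(n) = s.parse::<i64>() {
        // Bare integers: 0 = immediate, any negative = never.
        return Some(if n < 0 { -1 } else { n });
    }
    // Split "5m" into ("5", "m"); bail out on the empty string.
    let (num, unit) = s.split_at(s.len().checked_sub(1)?);
    let n: i64 = num.parse().ok()?;
    match unit {
        "s" => Some(n),
        "m" => Some(n * 60),
        "h" => Some(n * 3600),
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_keep_alive("5m"), Some(300));
    assert_eq!(parse_keep_alive("1h"), Some(3600));
    assert_eq!(parse_keep_alive("0"), Some(0));
    assert_eq!(parse_keep_alive("-1"), Some(-1));
    assert_eq!(parse_keep_alive("banana"), None);
    println!("ok");
}
```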
Environment Variables (Ollama-Compatible)
Environment variables override config file values for drop-in Ollama compatibility:
| Variable | Description | Example |
|---|---|---|
| `OLLAMA_HOST` | Server bind address (`host:port` or `host`) | `0.0.0.0:11434` |
| `OLLAMA_MODELS` | Model storage directory | `/data/models` |
| `OLLAMA_KEEP_ALIVE` | Default keep-alive duration | `10m`, `-1`, `0` |
| `OLLAMA_MAX_LOADED_MODELS` | Max concurrent loaded models | `3` |
| `OLLAMA_NUM_GPU` | GPU layers to offload (-1 = all) | `-1` |
| `A3S_POWER_HOME` | Base directory for all Power data | `~/.a3s/power` |

`OLLAMA_HOST` supports scheme prefixes (e.g. `http://0.0.0.0:8080`).
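One way to picture the `OLLAMA_HOST` parsing rules (an illustrative sketch, not the actual `config.rs` code):

```rust
// Illustrative OLLAMA_HOST parser: accepts "host", "host:port", and an
// optional scheme prefix like "http://", defaulting to port 11434.
fn parse_ollama_host(value: &str) -> (String, u16) {
    // Strip an optional scheme prefix and any trailing slash.
    let rest = value
        .strip_prefix("http://")
        .or_else(|| value.strip_prefix("https://"))
        .unwrap_or(value)
        .trim_end_matches('/');
    match rest.rsplit_once(':') {
        Some((host, port)) => match port.parse::<u16>() {
            Ok(p) => (host.to_string(), p),
            Err(_) => (rest.to_string(), 11434),
        },
        None => (rest.to_string(), 11434),
    }
}

fn main() {
    assert_eq!(parse_ollama_host("0.0.0.0:11434"), ("0.0.0.0".to_string(), 11434));
    assert_eq!(parse_ollama_host("http://0.0.0.0:8080"), ("0.0.0.0".to_string(), 8080));
    assert_eq!(parse_ollama_host("localhost"), ("localhost".to_string(), 11434));
    println!("ok");
}
```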
Feature Flags
| Flag | Description |
|---|---|
| `llamacpp` | Enable llama.cpp inference backend via `llama-cpp-2`. Requires a C++ compiler and CMake. |
Without any feature flags, Power can manage models (pull, list, delete) and serve API responses, but inference calls will return a "backend not available" error.
Development
Build Commands
```bash
# Build
cargo build

# Test
cargo test

# Lint
cargo clippy

# Run
cargo run
```
Project Structure
```
power/
├── Cargo.toml
├── README.md
├── LICENSE
├── .gitignore
└── src/
    ├── main.rs              # Binary entry point (CLI dispatch)
    ├── lib.rs               # Library root (module re-exports)
    ├── error.rs             # PowerError enum + Result<T> alias
    ├── config.rs            # TOML configuration (host, port, data_dir)
    ├── dirs.rs              # Platform-specific paths (~/.a3s/power/)
    ├── cli/
    │   ├── mod.rs           # Cli struct + Commands enum (clap)
    │   ├── run.rs           # Interactive chat + single prompt
    │   ├── pull.rs          # Download with progress bar
    │   ├── push.rs          # Push model to remote registry
    │   ├── list.rs          # Tabular model listing
    │   ├── show.rs          # Model detail display
    │   ├── delete.rs        # Model + blob deletion
    │   ├── ps.rs            # List running models (queries server)
    │   ├── stop.rs          # Stop/unload a running model
    │   └── serve.rs         # HTTP server startup
    ├── model/
    │   ├── manifest.rs      # ModelManifest, ModelFormat, ModelParameters
    │   ├── registry.rs      # In-memory index backed by disk manifests
    │   ├── storage.rs       # Content-addressed blob store (SHA-256)
    │   ├── pull.rs          # HTTP download with progress callback
    │   ├── push.rs          # Push model to remote registry
    │   ├── resolve.rs       # Name-based model resolution (Ollama registry → built-in → HuggingFace)
    │   ├── ollama_registry.rs # Ollama registry client (fetch manifests, metadata, blob URLs)
    │   ├── modelfile.rs     # Modelfile parser (FROM, PARAMETER, SYSTEM, TEMPLATE, etc.)
    │   └── known_models.json # Built-in registry of popular GGUF models (offline fallback)
    ├── backend/
    │   ├── mod.rs           # Backend trait + BackendRegistry
    │   ├── types.rs         # Inference types (vision, tools, chat, completion, embedding)
    │   ├── llamacpp.rs      # llama.cpp backend (feature-gated, multi-model, KV cache reuse)
    │   ├── chat_template.rs # Chat template detection, Jinja2 rendering (minijinja), and fallback formatting
    │   ├── json_schema.rs   # JSON Schema → GBNF grammar converter for structured output
    │   ├── tool_parser.rs   # Tool call output parser (XML, Mistral, JSON formats)
    │   └── test_utils.rs    # MockBackend for testing
    ├── server/
    │   ├── mod.rs           # Server startup (bind, listen)
    │   ├── state.rs         # Shared AppState with LRU model tracking
    │   ├── router.rs        # Axum router with CORS + tracing + metrics
    │   └── metrics.rs       # Prometheus metrics collection and /metrics handler
    └── api/
        ├── autoload.rs      # Model auto-loading on first inference
        ├── health.rs        # GET /health endpoint
        ├── types.rs         # OpenAI + Ollama request/response types
        ├── sse.rs           # Streaming utilities (NDJSON for native API, SSE for OpenAI API)
        ├── native/
        │   ├── mod.rs       # Ollama-compatible route group
        │   ├── generate.rs  # POST /api/generate
        │   ├── chat.rs      # POST /api/chat (vision + tools)
        │   ├── models.rs    # GET /api/tags, POST /api/show, DELETE /api/delete
        │   ├── pull.rs      # POST /api/pull (streaming progress)
        │   ├── push.rs      # POST /api/push (push to registry)
        │   ├── blobs.rs     # HEAD/POST/GET /api/blobs/:digest
        │   ├── embeddings.rs # POST /api/embeddings
        │   ├── embed.rs     # POST /api/embed (batch embeddings)
        │   ├── ps.rs        # GET /api/ps (running models)
        │   ├── copy.rs      # POST /api/copy (model aliasing)
        │   ├── create.rs    # POST /api/create (from Modelfile)
        │   └── version.rs   # GET /api/version
        └── openai/
            ├── mod.rs       # OpenAI-compatible route group + shared helpers
            ├── chat.rs      # POST /v1/chat/completions
            ├── completions.rs # POST /v1/completions
            ├── models.rs    # GET /v1/models
            └── embeddings.rs # POST /v1/embeddings
```
A3S Ecosystem
A3S Power is an infrastructure component of the A3S ecosystem — a standalone model server that enables local LLM inference for other A3S tools.
```
┌──────────────────────────────────────────────────────────┐
│                      A3S Ecosystem                       │
│                                                          │
│  Infrastructure:  a3s-box (MicroVM sandbox runtime)      │
│                   a3s-power (local model serving)        │
│                      │        ▲                          │
│  Application:     a3s-code ───┘ (AI coding agent)        │
│                    /    \                                │
│  Utilities:   a3s-lane   a3s-context                     │
│                          (memory/knowledge)              │
│                                                          │
│  a3s-power ◄── You are here                              │
└──────────────────────────────────────────────────────────┘
```
| Project | Package | Relationship |
|---|---|---|
| box | `a3s-box-*` | Can use Power for local model inference |
| code | `a3s-code` | Uses Power as a local model backend |
| lane | `a3s-lane` | Independent utility (no direct relationship) |
| context | `a3s-context` | Independent utility (no direct relationship) |
Standalone Usage: a3s-power works independently as a local model server for any application:
- Drop-in Ollama replacement with identical API and NDJSON wire format
- Pull any model from the Ollama registry by name (`llama3.2:3b`, `qwen2.5:7b`, etc.)
- OpenAI SDK compatible for seamless integration
- Local-first inference with no cloud dependency
Roadmap
Phase 1: Core ✅
- CLI model management (pull, list, show, delete)
- Content-addressed storage with SHA-256
- Model manifest system with JSON persistence
- TOML configuration
- Platform-specific directory resolution
- Comprehensive unit test foundation
Phase 2: Backend & Inference ✅
- Backend trait abstraction
- llama.cpp backend via `llama-cpp-2` (feature-gated)
- Streaming token generation via channels
- Interactive chat with conversation history
- Single prompt mode
Phase 3: HTTP Server ✅
- Axum-based HTTP server with CORS + tracing
- Ollama-compatible native API (12 endpoints + blob management)
- OpenAI-compatible API (4 endpoints)
- SSE streaming for all inference endpoints
- Non-streaming response collection
Phase 4: Polish & Production ✅
- Model registry resolution (name-based pulls with Ollama registry → built-in registry → HuggingFace fallback)
- Embedding generation support (automatic reload with embedding mode)
- Multiple concurrent model loading (HashMap storage with LRU eviction)
- Model auto-loading on first API request
- GPU acceleration configuration (`[gpu]` config with layer offloading)
- Chat template auto-detection from GGUF metadata (ChatML, Llama, Phi, Generic)
- Health check endpoint (`/health`)
- Prometheus metrics endpoint (`/metrics` with request/token/model counters)
Phase 5: Full Ollama Parity ✅
- Vision/Multimodal support (`MessageContent` enum with text + image URL parts)
- Tool/Function calling (tool definitions, tool choice, tool call responses)
- Push API + CLI with streaming progress (`POST /api/push`, `a3s-power push`)
- Blob management API (`HEAD/POST/GET/DELETE /api/blobs/:digest`)
- Generate API: `system`, `template`, `raw`, `suffix`, `context`, `images` fields
- Native chat `images` field (Ollama base64 format)
- CLI `cp` command for model aliasing
- New error variants (`UploadFailed`, `InvalidDigest`, `BlobNotFound`)
Phase 6: Observability & Cost Tracking ✅
End-to-end observability for LLM inference:
- OpenTelemetry-Ready Metrics: Instrument inference pipeline with Prometheus metrics
  - `power_inference_duration_seconds{model}` summary (count + sum)
  - `power_ttft_seconds{model}` summary (time to first token)
  - Per-model inference instrumentation across all 4 inference endpoints
- Token & Cost Metrics: Per-call recording via Prometheus
  - `power_inference_tokens_total{model, type=input|output}` counter
  - `power_cost_dollars{model}` counter
  - `power_inference_duration_seconds{model}` summary
  - `power_ttft_seconds{model}` summary (time to first token)
- Cost Dashboard Data: Aggregate cost by model / day
  - JSON export endpoint: `GET /v1/usage` with date range and model filter
- Model Lifecycle Metrics: Load time, memory usage, eviction count
  - `power_model_load_duration_seconds{model}` summary
  - `power_model_memory_bytes{model}` gauge
  - `power_model_evictions_total` counter
- GPU Utilization Metrics: GPU memory, compute utilization per device
  - `power_gpu_memory_bytes{device}` gauge
  - `power_gpu_utilization{device}` gauge
Phase 7: Ollama Drop-in Compatibility ✅
Wire-format and runtime compatibility for seamless Ollama replacement:
- Ollama Registry Integration: Pull any model from `registry.ollama.ai` by name — primary resolution source with template, system prompt, params, and license metadata
- NDJSON Streaming: Native API endpoints (`/api/generate`, `/api/chat`, `/api/pull`, `/api/push`) stream as `application/x-ndjson` (Ollama wire format); OpenAI endpoints keep SSE
- Automatic Model Unloading: Background keep_alive reaper checks every 5s and unloads idle models (configurable: `"5m"`, `"1h"`, `"0"`, `"-1"`)
- Context Token Return: `/api/generate` returns token IDs in the `context` field for conversation continuity
- 861 comprehensive unit tests
Phase 8: Advanced Compatibility ✅
- Jinja2/Go Template Engine: Render arbitrary Jinja2 chat templates via `minijinja` (Llama 3, Gemma, ChatML, Phi, custom) with hardcoded fallback; prefers Ollama registry `template_override` over GGUF metadata
- KV Cache Reuse: Persist `LlamaContext` across requests with prefix matching — skips re-evaluating shared prompt tokens for multi-turn conversation speedup
- Tool Call Parsing: Parse model output into structured `tool_calls` — supports `<tool_call>` XML (Hermes/Qwen), `[TOOL_CALLS]` prefix (Mistral), and raw JSON formats; zero overhead when no tools in request
- JSON Schema Structured Output: Support `format: {"type":"object","properties":{...}}` via JSON Schema → GBNF grammar conversion; accepts `"json"`, `{"type":"json_object"}`, or full JSON Schema objects
- Vision Inference: Multimodal vision pipeline — accepts base64 images in the Ollama `images` field and OpenAI `image_url` content parts; projector auto-downloaded from Ollama registry; uses the llama.cpp `mtmd` API for image encoding when a projector is available
- ADAPTER Support: LoRA/QLoRA adapter loading at inference time — Modelfile `ADAPTER` directive parsed, adapter file loaded via `llama_lora_adapter_init`, applied to context with `lora_adapter_set` at scale 1.0
- MESSAGE Directive: Pre-seeded conversation history via Modelfile `MESSAGE` directive; messages stored in manifest and automatically prepended to chat requests
- 861 comprehensive unit tests
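The three tool-call wire formats can be sketched as follows (illustrative only; the real parser in `backend/tool_parser.rs` also deserializes the extracted payload into structured `tool_calls`):

```rust
// Illustrative format detection for tool-call output. Returns the raw JSON
// payload found in one of the three recognized formats, or None when the
// model produced a plain text answer.
fn extract_tool_call_payload(output: &str) -> Option<&str> {
    let trimmed = output.trim();
    // Hermes/Qwen style: <tool_call>{...}</tool_call>
    if let Some(start) = trimmed.find("<tool_call>") {
        let rest = &trimmed[start + "<tool_call>".len()..];
        let end = rest.find("</tool_call>")?;
        return Some(rest[..end].trim());
    }
    // Mistral style: [TOOL_CALLS][{...}]
    if let Some(rest) = trimmed.strip_prefix("[TOOL_CALLS]") {
        return Some(rest.trim());
    }
    // Raw JSON object
    if trimmed.starts_with('{') && trimmed.ends_with('}') {
        return Some(trimmed);
    }
    None
}

fn main() {
    let hermes = "<tool_call>{\"name\":\"get_weather\"}</tool_call>";
    assert_eq!(extract_tool_call_payload(hermes), Some("{\"name\":\"get_weather\"}"));

    let mistral = "[TOOL_CALLS][{\"name\":\"get_weather\"}]";
    assert_eq!(extract_tool_call_payload(mistral), Some("[{\"name\":\"get_weather\"}]"));

    assert_eq!(extract_tool_call_payload("plain text answer"), None);
    println!("ok");
}
```

Detection only runs when the request declared tools, which is why the feature costs nothing for ordinary chat requests.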
Phase 9: Operational Parity ✅
Runtime and CLI parity for production Ollama replacement:
- Default Port 11434: Matches Ollama's default port for zero-config drop-in replacement
- `ps` CLI Command: List running (loaded) models via `a3s-power ps` (queries server `GET /api/ps`)
- `stop` CLI Command: Unload a running model via `a3s-power stop <model>` (sends `keep_alive: 0`)
- Ollama Environment Variables: `OLLAMA_HOST`, `OLLAMA_MODELS`, `OLLAMA_KEEP_ALIVE`, `OLLAMA_MAX_LOADED_MODELS`, `OLLAMA_NUM_GPU` — override config file for container/script compatibility
- Download Resumption: Interrupted model downloads resume automatically via HTTP Range requests with partial file tracking
- 861 comprehensive unit tests
Phase 10: Intelligence & Observability ✅
GPU auto-detection, memory estimation, verbose model inspection, and per-layer pull progress:
- GPU Auto-Detection: Detect Apple Metal (via `system_profiler`) and NVIDIA CUDA (via `nvidia-smi`) GPUs at server startup; auto-set `gpu_layers = -1` when a GPU is available and the user hasn't explicitly configured one
- Memory Estimation: Estimate VRAM requirements before loading (model weights + KV cache + compute overhead); log estimates to help users right-size their hardware
- GGUF Metadata Reader: Lightweight binary parser for GGUF v2/v3 file headers — extracts all key-value metadata and tensor descriptors without loading weights into memory
- Verbose Show: `/api/show` with `verbose: true` returns full GGUF metadata (architecture, context length, embedding dimensions, etc.) and tensor information (name, shape, type, element count)
- Per-Layer Pull Progress: Streaming pull progress shows per-layer digest identifiers (`pulling sha256:abc123...`) matching Ollama's output format; resolves the model before download to extract layer digests
- 861 comprehensive unit tests
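The fixed-size header that a GGUF reader starts from can be sketched like this. Per the GGUF layout, the magic bytes `GGUF` are followed by a little-endian u32 version and u64 tensor/metadata counts; everything else (per-key parsing, tensor descriptors) builds on top of this:

```rust
// Minimal GGUF header sketch (illustrative; the full reader in model/gguf.rs
// goes on to parse every metadata key-value pair and tensor descriptor
// without loading any weights).
fn parse_gguf_header(bytes: &[u8]) -> Option<(u32, u64, u64)> {
    // Magic "GGUF" + u32 version + u64 tensor count + u64 metadata KV count.
    if bytes.len() < 24 || &bytes[0..4] != b"GGUF" {
        return None;
    }
    let version = u32::from_le_bytes(bytes[4..8].try_into().ok()?);
    let tensor_count = u64::from_le_bytes(bytes[8..16].try_into().ok()?);
    let metadata_kv_count = u64::from_le_bytes(bytes[16..24].try_into().ok()?);
    Some((version, tensor_count, metadata_kv_count))
}

fn main() {
    // Synthetic header: version 3, 2 tensors, 5 metadata keys.
    let mut header = Vec::new();
    header.extend_from_slice(b"GGUF");
    header.extend_from_slice(&3u32.to_le_bytes());
    header.extend_from_slice(&2u64.to_le_bytes());
    header.extend_from_slice(&5u64.to_le_bytes());

    assert_eq!(parse_gguf_header(&header), Some((3, 2, 5)));
    assert_eq!(parse_gguf_header(b"notgguf-at-all-no-magic!"), None);
    println!("ok");
}
```

Because only the header and metadata section are read, a model's architecture, context length, and tensor list can be inspected without paying the cost of loading multi-gigabyte weights.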
Phase 11: Full Options Parity ✅
Complete Ollama generation options support and multi-GPU wiring:
- Missing Generation Options: Added `repeat_last_n`, `penalize_newline`, `num_batch`, `num_thread`, `num_thread_batch`, `use_mmap`, `use_mlock`, `numa`, `flash_attention`, `num_gpu`, `main_gpu` to `GenerateOptions`
- Backend Wiring: All new options flow through API → backend `CompletionRequest`/`ChatRequest` → llama.cpp context params and sampler
- Flash Attention: Wired to `LlamaContextParams::with_flash_attention_policy(Enabled)` when `flash_attention: true`
- Multi-GPU: `main_gpu` config wired to `LlamaModelParams::with_main_gpu()`; per-request `num_gpu`/`main_gpu` override supported
- Memory Lock: `use_mlock` config wired to `LlamaModelParams::with_use_mlock(true)` to prevent model swapping
- Thread Control: `num_thread` and `num_thread_batch` wired to `LlamaContextParams::with_n_threads()` and `with_n_threads_batch()`
- Batch Size: `num_batch` wired to `LlamaContextParams::with_n_batch()`
- Repeat Penalty Window: `repeat_last_n` wired to the first argument of `LlamaSampler::penalties()` (was hardcoded to 64)
- Config Extensions: Added `use_mlock`, `num_thread`, `flash_attention` to `PowerConfig` with TOML support
- 861 comprehensive unit tests
Phase 12: CLI Run Options Parity ✅
Complete Ollama CLI run command options — all 14/14 options now implemented:
- `--format`: JSON output format constraint (accepts `"json"` or a JSON schema object)
- `--system`: Override system prompt per session (prepended as system message)
- `--template`: Override chat template (reserved for template engine integration)
- `--keep-alive`: Model keep-alive duration (e.g. `"5m"`, `"1h"`, `"-1"` for never unload)
- `--verbose`: Show timing and token statistics after each generation (prompt eval count/rate, eval count, total duration, tokens/s)
- `--insecure`: Skip TLS verification flag for registry operations
- 861 comprehensive unit tests
Phase 13: Environment Variables & CLI Polish ✅
Complete Ollama environment variable parity and CLI enhancements:
- `OLLAMA_NUM_PARALLEL`: Number of parallel request slots (concurrent inference)
- `OLLAMA_DEBUG`: Enable debug logging (sets `RUST_LOG=debug` if not already set)
- `OLLAMA_ORIGINS`: Custom CORS origins (comma-separated); empty = permissive
- `OLLAMA_FLASH_ATTENTION`: Global flash attention override (`"1"` or `"true"`)
- `OLLAMA_TMPDIR`: Custom temporary directory for downloads and scratch files
- CLI `show --verbose`: Display full GGUF metadata (keys, values, tensor list) from the CLI
- CLI `pull --insecure`: Skip TLS verification for pull operations
- CLI `push --insecure`: Skip TLS verification for push operations
- Interactive `/help`: Show available slash commands in interactive chat
- Interactive `/clear`: Clear conversation history (preserves system prompt)
- Interactive `/show`: Display model name, message counts, and current settings
- Interactive `"""`: Multi-line input support with triple-quote delimiters
- CORS Configuration: Server respects `OLLAMA_ORIGINS` for restricted CORS; defaults to permissive
- 861 comprehensive unit tests
Phase 14: Final Ollama Parity ✅
Complete remaining Ollama feature gaps — help subcommand, blob pruning, GPU scheduling:
- `help` subcommand: `a3s-power help [command]` prints help for any subcommand (replaces clap's built-in)
- Blob pruning: `prune_unused_blobs()` removes orphaned blob files not referenced by any manifest; returns count and bytes freed
- `OLLAMA_NOPRUNE`: Disable automatic blob pruning (`"1"` or `"true"`)
- `OLLAMA_SCHED_SPREAD`: Spread model layers across all available GPUs (`"1"` or `"true"`)
- 861 comprehensive unit tests
License
MIT