A3S Power
Overview
A3S Power is an Ollama-compatible CLI tool and HTTP server for local model management and inference. It provides both an Ollama-compatible native API and an OpenAI-compatible API, so existing tools, SDKs, and frontends work out of the box.
Basic Usage
# Pull a model by name (resolves from Ollama registry, built-in registry, or HuggingFace)
# Pull from a direct URL
# Interactive chat
# Single prompt
# Push a model to a remote registry
# Start HTTP server
Features
- CLI Model Management: Pull, list, show, delete, and push models from the command line
- Ollama Registry Integration: Pull any model from registry.ollama.ai by name (llama3.2:3b); this is the primary resolution source, with built-in registry and HuggingFace fallback
- Interactive Chat: Multi-turn conversation with streaming token output
- Vision/Multimodal Support: Accept base64 images (Ollama images field) and image URLs (OpenAI content array format); projector auto-downloaded from Ollama registry; image processing requires a vision model with projector (e.g. llava)
- Tool/Function Calling: Structured tool definitions, tool choice, and tool call responses (OpenAI-compatible)
- JSON Schema Structured Output: Constrain model output to match JSON Schema via GBNF grammar generation; supports "json", {"type":"json_object"}, or full JSON Schema objects
- Chat Template Auto-Detection: Detects ChatML, Llama, Phi, and Generic templates from GGUF metadata
- Jinja2 Template Engine: Renders arbitrary Jinja2 chat templates via minijinja (Llama 3, Gemma, ChatML, Phi, custom) with hardcoded fallback
- KV Cache Reuse: Persists LlamaContext across requests with prefix matching; skips re-evaluating shared prompt tokens for multi-turn speedup
- Tool Call Parsing: Parses model output into structured tool_calls; supports <tool_call> XML, [TOOL_CALLS] prefix, and raw JSON formats
- Modelfile Support: Create custom models with FROM, PARAMETER, SYSTEM, TEMPLATE, ADAPTER (LoRA/QLoRA), LICENSE, and MESSAGE (pre-seeded conversations) directives
- Multiple Concurrent Models: Load multiple models with LRU eviction at configurable capacity
- Automatic Model Unloading: Background keep_alive reaper unloads idle models after configurable timeout (default 5m)
- GPU Acceleration: Configurable GPU layer offloading via [gpu] config section with automatic GPU detection (Metal/CUDA), multi-GPU support (main_gpu), and per-request num_gpu override
- GPU Auto-Detection: Automatically detects Apple Metal and NVIDIA CUDA GPUs at server startup and sets optimal gpu_layers when not explicitly configured
- Memory Estimation: Estimates VRAM requirements before loading a model (model weights + KV cache + compute overhead) and logs warnings
- Full Ollama Options: All Ollama generation options supported (repeat_last_n, penalize_newline, num_batch, num_thread, num_thread_batch, use_mmap, use_mlock, numa, flash_attention, num_gpu, main_gpu) in addition to standard sampling parameters
- Embedding Support: Real embedding generation with automatic model reload in embedding mode
- HTTP Server: Axum-based server with CORS, tracing, and metrics middleware
- Ollama-Compatible API: /api/generate, /api/chat, /api/tags, /api/pull, /api/push, /api/show, /api/delete, /api/embeddings, /api/embed, /api/ps, /api/copy, /api/version, /api/blobs/:digest
- OpenAI-Compatible API: /v1/chat/completions, /v1/completions, /v1/models, /v1/embeddings
- Blob Management API: Check, upload, and download content-addressed blobs via REST
- Push API: Upload models to remote registries with progress reporting
- NDJSON Streaming: Native API endpoints stream as application/x-ndjson (Ollama wire format); OpenAI endpoints use SSE
- Context Token Return: /api/generate returns token IDs in context field for conversation continuity
- Prometheus Metrics: GET /metrics endpoint with request counts, durations, tokens, model gauges, inference duration, TTFT, cost, evictions, model memory, and GPU metrics
- Usage Dashboard: GET /v1/usage endpoint with date range and model filtering for cost tracking
- GGUF Metadata Reader: Lightweight binary parser for GGUF file headers; extracts architecture metadata and tensor descriptors without loading weights
- Verbose Show: /api/show with verbose: true returns full GGUF metadata and tensor information
- Per-Layer Pull Progress: Pull progress shows per-layer digest identifiers (pulling sha256:abc...) matching Ollama's output format
- Content-Addressed Storage: Model blobs stored by SHA-256 hash with automatic deduplication
- llama.cpp Backend: GGUF inference via llama-cpp-2 Rust bindings (optional feature flag)
- Health Check: GET /health endpoint with uptime, version, and loaded model count
- Model Auto-Loading: Models are automatically loaded on first inference request with LRU eviction
- TOML Configuration: User-configurable host, port, GPU settings, keep_alive, and storage settings
- Ollama Environment Variables: OLLAMA_HOST, OLLAMA_MODELS, OLLAMA_KEEP_ALIVE, OLLAMA_MAX_LOADED_MODELS, OLLAMA_NUM_GPU, OLLAMA_NUM_PARALLEL, OLLAMA_DEBUG, OLLAMA_ORIGINS, OLLAMA_FLASH_ATTENTION, OLLAMA_TMPDIR, OLLAMA_NOPRUNE, OLLAMA_SCHED_SPREAD for drop-in compatibility
- Download Resumption: Interrupted model downloads resume automatically via HTTP Range requests
- Async-First: Built on Tokio for high-performance async operations
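The KV cache reuse item above depends on working out how many leading prompt tokens a new request shares with the tokens already evaluated in the persisted context. The following is a minimal Python sketch of that prefix match, assuming token IDs are plain integer lists. The function names are hypothetical, and the rule of always re-evaluating the last token is an assumption of this sketch, not a description of the code in backend/llamacpp.rs.

```python
from typing import List, Tuple


def shared_prefix_len(cached: List[int], prompt: List[int]) -> int:
    """Count the leading token IDs that both sequences have in common."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def plan_evaluation(cached: List[int], prompt: List[int]) -> Tuple[int, List[int]]:
    """Return (tokens_to_keep, tokens_to_evaluate) for a new request.

    Cache entries past the shared prefix would be discarded, and only the
    unmatched tail of the new prompt needs to be evaluated.
    """
    keep = shared_prefix_len(cached, prompt)
    # Assumption: re-evaluate at least the last prompt token so the model
    # produces fresh logits to sample from.
    if keep == len(prompt) and keep > 0:
        keep -= 1
    return keep, prompt[keep:]
```

In a multi-turn chat, each request repeats the whole earlier conversation, so the shared prefix typically covers everything except the newest turn.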
Ollama Compatibility Status
Compared against Ollama source at github.com/ollama/ollama (latest main).
✅ Fully Aligned
| Category | Status |
|---|---|
| Native API (14 endpoints) | /api/generate, /api/chat, /api/pull, /api/push, /api/tags, /api/show, /api/delete, /api/copy, /api/embed, /api/embeddings, /api/ps, /api/version, /api/create, /api/blobs/:digest |
| OpenAI API (4 endpoints) | /v1/chat/completions, /v1/completions, /v1/models, /v1/embeddings |
| CLI commands (12) | run, pull, list/ls, show, delete/rm, serve, create, push, cp, ps, stop, help |
| Streaming | NDJSON for native API, SSE for OpenAI API |
| Modelfile | FROM, PARAMETER, SYSTEM, TEMPLATE, ADAPTER, LICENSE, MESSAGE + heredoc |
| Sampling parameters | temperature, top_p, top_k, min_p, repeat_penalty, frequency/presence_penalty, seed, typical_p, num_keep, stop |
| Runner options | num_ctx, num_predict, num_batch, num_gpu, num_thread, use_mmap |
| Keep-alive | String + numeric, per-request + global config, "0" / "-1" special values |
| Tool/Function calling | Both native /api/chat and OpenAI /v1/chat/completions, XML/Mistral/JSON parsing |
| JSON structured output | "json", {"type":"json_object"}, full JSON Schema → GBNF grammar |
| Ollama registry | Pull from registry.ollama.ai with template/system/params/license extraction |
| KV cache reuse | Prefix matching across multi-turn requests |
| LoRA adapters | ADAPTER directive, loaded at inference |
| GPU auto-detection | Metal + CUDA, auto gpu_layers, multi-GPU |
| Blob management | HEAD/POST/GET/DELETE /api/blobs/:digest |
| Context return | /api/generate returns context token array |
| done_reason | Returned in generate/chat responses |
| raw mode | Skip template formatting in /api/generate |
| suffix field | Fill-in-the-middle in /api/generate |
| CORS | Configurable origins with OLLAMA_ORIGINS |
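The table lists three tool-call output formats (XML, Mistral, JSON). As a rough illustration of how such a parser might dispatch between them, here is a Python sketch. The payload shape (a JSON object with name and arguments keys) and the function names are assumptions made for this example; the actual parser in backend/tool_parser.rs may handle more cases.

```python
import json
import re
from typing import Any, Dict, List

TOOL_CALL_TAG = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
MISTRAL_PREFIX = "[TOOL_CALLS]"


def _as_calls(value: Any) -> List[Dict[str, Any]]:
    """Normalize decoded JSON into a list of {"name", "arguments"} dicts."""
    items = value if isinstance(value, list) else [value]
    return [
        {"name": item["name"], "arguments": item.get("arguments", {})}
        for item in items
        if isinstance(item, dict) and "name" in item
    ]


def parse_tool_calls(output: str) -> List[Dict[str, Any]]:
    """Return structured tool calls, or [] if the output is ordinary text."""
    text = output.strip()
    try:
        # 1. XML style: one JSON object per <tool_call>...</tool_call> block.
        blocks = TOOL_CALL_TAG.findall(text)
        if blocks:
            calls: List[Dict[str, Any]] = []
            for block in blocks:
                calls.extend(_as_calls(json.loads(block)))
            return calls
        # 2. Mistral style: JSON after the [TOOL_CALLS] prefix.
        if text.startswith(MISTRAL_PREFIX):
            return _as_calls(json.loads(text[len(MISTRAL_PREFIX):]))
        # 3. Raw JSON: the whole output is an object or an array.
        if text.startswith(("{", "[")):
            return _as_calls(json.loads(text))
    except json.JSONDecodeError:
        pass  # Malformed JSON is treated as ordinary text.
    return []
```

Checking the most specific marker first keeps a plain JSON answer from being mistaken for a tool call unless it actually carries a name key.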
🔴 Remaining Gaps (vs Ollama latest)
API Request/Response Fields
| Gap | Severity | Ollama Source | Description |
|---|---|---|---|
| think parameter | Critical | api/types.go:109,173 | ThinkValue (bool or "high"/"medium"/"low") in generate/chat requests; enables reasoning models (DeepSeek-R1, QwQ). Not implemented. |
| thinking response field | Critical | api/types.go:216,856 | Message.Thinking and GenerateResponse.Thinking return thinking content separately from the response. Not implemented. |
| Thinking parser | Critical | thinking/parser.go | Streaming parser that separates <think>...</think> blocks from content in real time. Infers tags from template. Not implemented. |
| logprobs / top_logprobs | Important | api/types.go:123-129,187-193 | Log probability support in generate/chat requests + Logprob/TokenLogprob response types. Not implemented. |
| truncate field (generate/chat) | Important | api/types.go:112,176 | Truncate prompt when exceeding context length instead of erroring. Not implemented. |
| shift field (generate/chat) | Important | api/types.go:117,180 | Shift context window when hitting limit instead of erroring. Not implemented. |
| _debug_render_only | Nice-to-have | api/types.go:121,185 | Debug mode that returns rendered template without calling model. Not implemented. |
| tool_calls in GenerateResponse | Moderate | api/types.go:870 | /api/generate can also return tool_calls (not just /api/chat). Not implemented. |
OpenAI API Gaps
| Gap | Severity | Ollama Source | Description |
|---|---|---|---|
| GET /v1/models/:model | Important | routes.go:1610 | Retrieve single model details. Not implemented (only the GET /v1/models list). |
| POST /v1/responses | Moderate | routes.go:1611 | OpenAI Responses API compatibility. Not implemented. |
| POST /v1/messages | Moderate | routes.go:1617 | Anthropic Messages API compatibility via middleware. Not implemented. |
| POST /v1/images/generations | Nice-to-have | routes.go:1613 | Image generation endpoint. Not implemented. |
| POST /v1/images/edits | Nice-to-have | routes.go:1614 | Image editing endpoint. Not implemented. |
| reasoning / reasoning_effort | Important | openai/openai.go:94-96,112-113 | OpenAI reasoning effort ("high"/"medium"/"low") mapped to think. Not implemented. |
| stream_options.include_usage | Moderate | openai/openai.go:90-92 | Return usage stats in final streaming chunk when requested. Not implemented. |
| encoding_format (embeddings) | Moderate | openai/openai.go:87 | "float" or "base64" encoding for embedding responses. Not implemented. |
| dimensions (embeddings) | Moderate | api/types.go:626 | Truncate output embeddings to specified dimension. Not implemented. |
ShowResponse Fields
| Gap | Severity | Ollama Source | Description |
|---|---|---|---|
| capabilities | Important | api/types.go:755 | List of model capabilities (completion, tools, vision, thinking, embedding, insert, image). Not implemented. |
| renderer / parser | Moderate | api/types.go:746-747 | Custom renderer/parser names for model. Not implemented. |
| projector_info | Moderate | api/types.go:753 | Projector metadata for vision models. Not implemented. |
| remote_model / remote_host | Moderate | api/types.go:750-751 | Remote model proxy info. Not implemented. |
| requires | Nice-to-have | api/types.go:757 | Minimum Ollama version required. Not implemented. |
| messages | Moderate | api/types.go:749 | Pre-seeded messages in show response. Not implemented. |
ProcessResponse (ps) Fields
| Gap | Severity | Ollama Source | Description |
|---|---|---|---|
| size_vram | Moderate | api/types.go:829 | VRAM usage per loaded model. Not implemented. |
| context_length | Moderate | api/types.go:830 | Active context length per loaded model. Not implemented. |
Create API
| Gap | Severity | Ollama Source | Description |
|---|---|---|---|
| New structured Create API | Important | api/types.go:663-709 | Ollama's new from, files, adapters, template, system, parameters, messages, license fields (replacing the Modelfile-only approach). a3s-power only supports Modelfile-based create. |
| Re-quantization | Important | server/create.go | In Ollama, create --quantize q4_K_M actually quantizes the model. a3s-power accepts the flag but treats it as a no-op. |
| SafeTensors conversion | Moderate | convert/ | Convert SafeTensors → GGUF during create. Not implemented. |
Environment Variables
| Gap | Severity | Ollama Source | Description |
|---|---|---|---|
| OLLAMA_KV_CACHE_TYPE | Important | envconfig/config.go:278 | KV cache quantization type (default: f16). Not implemented. |
| OLLAMA_GPU_OVERHEAD | Moderate | envconfig/config.go:279 | Reserve VRAM per GPU (bytes). Not implemented. |
| OLLAMA_LOAD_TIMEOUT | Moderate | envconfig/config.go:283 | Stall detection timeout for model loads (default 5m). Not implemented. |
| OLLAMA_MAX_QUEUE | Moderate | envconfig/config.go:285 | Maximum queued requests. Not implemented. |
| OLLAMA_NOHISTORY | Nice-to-have | envconfig/config.go:287 | Disable readline history. Not implemented. |
| OLLAMA_MULTIUSER_CACHE | Nice-to-have | envconfig/config.go:292 | Optimize prompt caching for multi-user. Not implemented. |
| OLLAMA_CONTEXT_LENGTH | Important | envconfig/config.go:293 | Global default context length override. Not implemented. |
| OLLAMA_REMOTES | Moderate | envconfig/config.go:295 | Allowed hosts for remote models. Not implemented. |
| OLLAMA_LLM_LIBRARY | Nice-to-have | envconfig/config.go:282 | Override LLM library autodetection. Not applicable (Rust bindings). |
Auth & Account
| Gap | Severity | Ollama Source | Description |
|---|---|---|---|
| signin / signout CLI | Moderate | cmd/cmd.go:666,697 | Sign in/out of ollama.com account. Not implemented. |
| POST /api/me | Moderate | routes.go:1583 | Whoami endpoint. Not implemented. |
| POST /api/signout | Moderate | routes.go:1585 | Signout endpoint. Not implemented. |
| Registry auth (push) | Important | auth/auth.go | Keypair-based auth for pushing to registry.ollama.ai. Not implemented. |
CLI Flags
| Gap | Severity | Ollama Source | Description |
|---|---|---|---|
| run --think | Critical | cmd/cmd.go:2069 | Enable thinking mode from CLI. Not implemented. |
| run --hidethinking | Important | cmd/cmd.go:2071 | Hide thinking output in CLI. Not implemented. |
| run --truncate | Moderate | cmd/cmd.go:2072 | Truncate embeddings input. Not implemented. |
| run --dimensions | Moderate | cmd/cmd.go:2073 | Truncate output embeddings dimension. Not implemented. |
| run --nowordwrap | Nice-to-have | cmd/cmd.go:2067 | Disable word wrapping in CLI. Not implemented. |
| show --license | Nice-to-have | cmd/cmd.go:2049 | Show only license. Not implemented (shows all). |
| show --modelfile | Nice-to-have | cmd/cmd.go:2050 | Show only modelfile. Not implemented. |
| show --parameters | Nice-to-have | cmd/cmd.go:2051 | Show only parameters. Not implemented. |
| show --template | Nice-to-have | cmd/cmd.go:2052 | Show only template. Not implemented. |
| show --system | Nice-to-have | cmd/cmd.go:2053 | Show only system message. Not implemented. |
| run --experimental | Nice-to-have | cmd/cmd.go:2074 | Experimental agent loop with tools. Not implemented. |
Server/Runtime
| Gap | Severity | Ollama Source | Description |
|---|---|---|---|
| GET / and HEAD / | Nice-to-have | routes.go:1570-1571 | Returns "Ollama is running" string. Not implemented (a3s-power has /health). |
| Experimental aliases API | Nice-to-have | routes.go:1594-1596 | GET/POST/DELETE /api/experimental/aliases. Not implemented. |
| Request queuing | Moderate | envconfig:OLLAMA_MAX_QUEUE | Queue requests when all model slots are busy. Not implemented. |
| num_parallel wiring | Moderate | n/a | Concurrent request slots per loaded model. The config option exists, but it is unclear whether it is wired to llama.cpp. |
Extra Options (a3s-power has but Ollama removed)
Note: a3s-power supports some options that Ollama has removed from their latest Options struct:
- mirostat, mirostat_tau, mirostat_eta: removed from Ollama
- tfs_z: removed from Ollama
- main_gpu: removed from Ollama Runner
- use_mlock: removed from Ollama Runner
- flash_attention: removed from Ollama Runner (now env-only via OLLAMA_FLASH_ATTENTION)
- num_thread_batch: removed from Ollama Runner
- penalize_newline: removed from Ollama
- numa: removed from Ollama
These are kept in a3s-power for backward compatibility but may diverge from Ollama's current behavior.
Quality Metrics
Test Coverage
888 unit tests with 90.11% region coverage across 59 source files:
| Module | Lines | Line Coverage | Functions | Function Coverage |
|---|---|---|---|---|
| api/health.rs | 62 | 100.00% | 10 | 100.00% |
| api/mod.rs | 27 | 100.00% | 5 | 100.00% |
| api/native/mod.rs | 22 | 100.00% | 1 | 100.00% |
| api/native/ps.rs | 149 | 100.00% | 17 | 100.00% |
| api/native/version.rs | 21 | 100.00% | 6 | 100.00% |
| api/openai/mod.rs | 30 | 100.00% | 4 | 100.00% |
| api/openai/usage.rs | 384 | 100.00% | 27 | 100.00% |
| backend/llamacpp.rs | 186 | 100.00% | 26 | 100.00% |
| backend/test_utils.rs | 130 | 100.00% | 18 | 100.00% |
| cli/delete.rs | 102 | 100.00% | 5 | 100.00% |
| cli/list.rs | 88 | 100.00% | 7 | 100.00% |
| error.rs | 93 | 100.00% | 19 | 100.00% |
| model/manifest.rs | 164 | 100.00% | 19 | 100.00% |
| server/router.rs | 209 | 100.00% | 33 | 100.00% |
| backend/json_schema.rs | 389 | 98.97% | 53 | 100.00% |
| backend/tool_parser.rs | 347 | 99.14% | 43 | 100.00% |
| model/modelfile.rs | 552 | 99.28% | 42 | 100.00% |
| server/state.rs | 266 | 99.25% | 37 | 97.30% |
| api/sse.rs | 95 | 98.95% | 16 | 93.75% |
| api/types.rs | 613 | 98.37% | 52 | 100.00% |
| server/metrics.rs | 607 | 98.35% | 54 | 96.30% |
| backend/chat_template.rs | 349 | 98.28% | 32 | 100.00% |
| backend/mod.rs | 65 | 98.46% | 15 | 100.00% |
| dirs.rs | 55 | 98.18% | 12 | 91.67% |
| backend/types.rs | 261 | 98.08% | 23 | 95.65% |
| api/native/chat.rs | 735 | 94.42% | 32 | 100.00% |
| api/native/generate.rs | 709 | 95.77% | 32 | 100.00% |
| api/native/models.rs | 457 | 96.06% | 32 | 100.00% |
| config.rs | 475 | 96.84% | 60 | 96.67% |
| api/openai/embeddings.rs | 187 | 95.72% | 9 | 100.00% |
| api/native/blobs.rs | 212 | 94.81% | 15 | 100.00% |
| api/autoload.rs | 220 | 94.09% | 24 | 100.00% |
| api/native/embed.rs | 158 | 93.04% | 9 | 100.00% |
| model/gguf.rs | 746 | 93.43% | 80 | 80.00% |
| api/openai/models.rs | 118 | 93.22% | 9 | 100.00% |
| api/native/embeddings.rs | 133 | 96.24% | 7 | 100.00% |
| api/native/copy.rs | 60 | 91.67% | 6 | 100.00% |
| cli/mod.rs | 340 | 91.18% | 34 | 100.00% |
| api/native/create.rs | 340 | 90.00% | 19 | 94.74% |
| api/openai/chat.rs | 531 | 88.14% | 23 | 78.26% |
| model/registry.rs | 308 | 87.99% | 42 | 83.33% |
| model/storage.rs | 331 | 87.31% | 31 | 83.87% |
| cli/show.rs | 234 | 84.19% | 15 | 100.00% |
| api/openai/completions.rs | 394 | 82.99% | 14 | 78.57% |
| backend/gpu.rs | 281 | 82.92% | 38 | 92.11% |
| model/resolve.rs | 341 | 75.66% | 54 | 79.63% |
| api/native/push.rs | 187 | 75.40% | 10 | 80.00% |
| cli/push.rs | 43 | 74.42% | 10 | 90.00% |
| model/ollama_registry.rs | 530 | 73.21% | 57 | 70.18% |
| cli/ps.rs | 152 | 70.39% | 22 | 81.82% |
| cli/serve.rs | 34 | 70.59% | 4 | 50.00% |
| cli/stop.rs | 102 | 70.59% | 12 | 75.00% |
| server/mod.rs | 84 | 65.48% | 12 | 66.67% |
| model/push.rs | 151 | 62.91% | 27 | 81.48% |
| cli/pull.rs | 72 | 62.50% | 6 | 83.33% |
| api/native/pull.rs | 269 | 50.19% | 16 | 81.25% |
| cli/run.rs | 845 | 48.88% | 57 | 85.96% |
| model/pull.rs | 384 | 48.70% | 36 | 63.89% |
| TOTAL | 15429 | 87.94% | 1430 | 91.47% |
Overall: 90.11% region coverage, 91.47% function coverage, 87.94% line coverage
Run coverage report:
LLVM_COV=/opt/homebrew/Cellar/llvm/21.1.8/bin/llvm-cov \
LLVM_PROFDATA=/opt/homebrew/Cellar/llvm/21.1.8/bin/llvm-profdata \
Architecture
Components
┌─────────────────────────────────────────────────┐
│ a3s-power │
│ │
│ CLI Layer │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ run │ │ pull │ │ list │ │ push │ │serve │ │
│ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │ │ │
│ Model Layer │ │ │
│ ┌────────────────────┴────────┐ │ │
│ │ ModelRegistry │ │ │
│ │ ┌──────────┐ ┌──────────┐ │ │ │
│ │ │ manifest │ │ storage │ │ │ │
│ │ └──────────┘ └──────────┘ │ │ │
│ └─────────────────────────────┘ │ │
│ │ │
│ Backend Layer │ │
│ ┌─────────────────────────────┐ │ │
│ │ BackendRegistry │ │ │
│ │ ┌──────────────────────┐ │ │ │
│ │ │ LlamaCppBackend │ │ │ │
│ │ │ (feature: llamacpp) │ │ │ │
│ │ └──────────────────────┘ │ │ │
│ └─────────────────────────────┘ │ │
│ │ │
│ Server Layer ◄──────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │ Axum Router │ │
│ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │ /api/* │ │ /v1/* │ │ │
│ │ │ (Ollama) │ │ (OpenAI) │ │ │
│ │ └────────────┘ └────────────────┘ │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
Backend Trait
The Backend trait abstracts inference engines. The llama.cpp backend is feature-gated; without the llamacpp feature, Power can still manage models but returns "backend not available" for inference calls.
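To illustrate the shape of this abstraction, the sketch below models a backend interface and a registry that raises a "backend not available" error when nothing can serve a model format. It is written in Python for brevity rather than Rust, and every class and method name here is hypothetical; the real trait lives in backend/mod.rs and may look quite different.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List


class BackendNotAvailable(RuntimeError):
    """Raised when no registered backend can serve the requested format."""


class Backend(ABC):
    @abstractmethod
    def supports(self, model_format: str) -> bool:
        """Report whether this backend can load the given model format."""

    @abstractmethod
    def generate(self, prompt: str) -> Iterator[str]:
        """Yield generated tokens one at a time."""


class BackendRegistry:
    def __init__(self) -> None:
        self._backends: List[Backend] = []

    def register(self, backend: Backend) -> None:
        self._backends.append(backend)

    def find(self, model_format: str) -> Backend:
        for backend in self._backends:
            if backend.supports(model_format):
                return backend
        raise BackendNotAvailable(
            f"backend not available for format {model_format!r}"
        )


class EchoBackend(Backend):
    """Stand-in backend used only to exercise the registry."""

    def supports(self, model_format: str) -> bool:
        return model_format == "gguf"

    def generate(self, prompt: str) -> Iterator[str]:
        yield from prompt.split()
```

An empty registry corresponds to a build without the llamacpp feature: model management code never touches the registry, while any inference path fails at find().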
Installation
Homebrew (macOS)
Cargo (cross-platform)
# Model management only
# With llama.cpp inference backend (requires C++ compiler + CMake)
Pre-built Binary (macOS Apple Silicon)
Build from Source
# Without inference backend (model management only)
# With llama.cpp inference (requires C++ compiler + CMake)
# Binary at target/release/a3s-power
Quick Start
Model Management
# Pull a model by name (Ollama registry → built-in registry → HuggingFace fallback)
# Pull from a direct URL
# List local models
# Show model details
# Delete a model
# Push a model to a remote registry
Interactive Chat
# Start interactive chat session
# Send a single prompt
HTTP Server
# Start server on default port (127.0.0.1:11434)
# Custom host and port
API Reference
Server
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check (status, version, uptime, loaded models) |
| GET | /metrics | Prometheus metrics (requests, durations, tokens, inference, TTFT, cost, evictions, model memory, GPU) |
Native API (Ollama-Compatible)
| Method | Path | Description |
|---|---|---|
| POST | /api/generate | Text generation (streaming/non-streaming) |
| POST | /api/chat | Chat completion with vision & tool support (streaming/non-streaming) |
| POST | /api/pull | Download a model by name or URL (streaming progress) |
| POST | /api/push | Push a model to a remote registry |
| GET | /api/tags | List local models |
| POST | /api/show | Show model details |
| DELETE | /api/delete | Delete a model |
| POST | /api/embeddings | Generate embeddings |
| POST | /api/embed | Batch embedding generation |
| GET | /api/ps | List running/loaded models |
| POST | /api/copy | Copy/alias a model |
| GET | /api/version | Server version |
| HEAD | /api/blobs/:digest | Check if a blob exists |
| POST | /api/blobs/:digest | Upload a blob with SHA-256 verification |
| GET | /api/blobs/:digest | Download a blob |
| DELETE | /api/blobs/:digest | Delete a blob |
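The streaming native endpoints emit NDJSON: one JSON object per line. A client consuming a streamed /api/generate response could be sketched as follows. The response, done, and context field names follow the Ollama wire format; the helper names are made up for this example, and the input is any iterable of text lines (for instance, lines read from an HTTP response body).

```python
import json
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Decode one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def collect_generate(lines: Iterable[str]) -> Tuple[str, Optional[List[int]]]:
    """Concatenate streamed `response` fragments until `done` is true.

    Returns the full text plus the `context` token IDs from the final
    chunk, or None if the server did not include them.
    """
    parts: List[str] = []
    context: Optional[List[int]] = None
    for chunk in iter_ndjson(lines):
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            context = chunk.get("context")
            break
    return "".join(parts), context
```

The returned context array is what a client would pass back in the next /api/generate request to continue the conversation.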
OpenAI-Compatible API
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completion (streaming/non-streaming) |
| POST | /v1/completions | Text completion (streaming/non-streaming) |
| GET | /v1/models | List available models |
| POST | /v1/embeddings | Generate embeddings |
| GET | /v1/usage | Usage and cost dashboard data (date range + model filter) |
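Unlike the native API, the streaming OpenAI-compatible endpoints use Server-Sent Events. A minimal sketch of reading a streamed /v1/chat/completions response is shown below, assuming the usual OpenAI chunk shape (choices[].delta.content) and the data: [DONE] terminator. The helper names are illustrative only.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line, stopping at [DONE]."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield payload


def collect_chat_text(lines: Iterable[str]) -> str:
    """Join the `delta.content` fragments of OpenAI-style streaming chunks."""
    parts: List[str] = []
    for payload in iter_sse_data(lines):
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```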
Examples
List Models
# OpenAI-compatible
# Ollama-compatible
Chat Completion (OpenAI)
Chat Completion with Streaming
Text Generation (Ollama)
Text Completion (OpenAI)
Vision/Multimodal (OpenAI)
Tool/Function Calling (OpenAI)
Push Model
Structured Output (JSON Schema)
# Constrain output to match a JSON Schema
Blob Management
# Check if blob exists
# Upload blob
# Download blob
CLI Commands
| Command | Description |
|---|---|
| a3s-power run <model> [--prompt <text>] | Load model and start interactive chat, or send a single prompt |
| a3s-power pull <name_or_url> | Download a model by name (llama3.2:3b) or direct URL |
| a3s-power push <model> --destination <url> | Push a model to a remote registry |
| a3s-power list | List all locally available models |
| a3s-power show <model> | Show model details (format, size, parameters) |
| a3s-power delete <model> | Delete a model from local storage |
| a3s-power create <name> -f <modelfile> | Create a custom model from a Modelfile |
| a3s-power cp <source> <destination> | Copy/alias a model to a new name |
| a3s-power ps | List running (loaded) models on the server |
| a3s-power stop <model> | Stop (unload) a running model from the server |
| a3s-power serve [--host <addr>] [--port <port>] | Start HTTP server (default: 127.0.0.1:11434) |
Model Storage
Models are stored in ~/.a3s/power/ (override with $A3S_POWER_HOME):
~/.a3s/power/
├── config.toml # User configuration
└── models/
├── manifests/ # JSON manifest files
│ ├── llama-2-7b.json
│ └── qwen2.5-7b.json
└── blobs/ # Content-addressed model files
├── sha256-abc123...
└── sha256-def456...
Content-Addressed Storage
Model files are stored by their SHA-256 hash, enabling:
- Deduplication: Identical files share storage
- Integrity verification: Blobs can be verified against their hash
- Clean deletion: Remove manifest + blob independently
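The properties above follow directly from naming each file after its digest. Here is a small Python sketch of such a store, assuming the sha256-<hex> file naming shown in the directory layout. The BlobStore class and its methods are hypothetical and are not the API of model/storage.rs.

```python
import hashlib
import tempfile
from pathlib import Path


def _hash_file(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large blobs need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


class BlobStore:
    def __init__(self, root: Path) -> None:
        self.blobs = root / "blobs"
        self.blobs.mkdir(parents=True, exist_ok=True)

    def path_for(self, digest: str) -> Path:
        # "sha256:<hex>" is stored on disk as "sha256-<hex>".
        return self.blobs / digest.replace(":", "-")

    def put(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        path = self.path_for(digest)
        if not path.exists():  # identical content is written only once
            path.write_bytes(data)
        return digest

    def verify(self, digest: str) -> bool:
        path = self.path_for(digest)
        return path.exists() and "sha256:" + _hash_file(path) == digest


if __name__ == "__main__":
    store = BlobStore(Path(tempfile.mkdtemp()))
    first = store.put(b"model weights")
    second = store.put(b"model weights")
    print(first == second, store.verify(first))
```

Because the file name is derived from the content, two manifests that reference the same weights point at the same blob, and a corrupted file is detectable by rehashing it.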
Configuration
Configuration is read from ~/.a3s/power/config.toml:
host = "127.0.0.1"
port = 11434
max_loaded_models = 1
keep_alive = "5m" # auto-unload idle models ("0"=immediate, "-1"=never, "5m", "1h")

[gpu]
gpu_layers = -1 # offload all layers to GPU (-1=all, 0=CPU only)
main_gpu = 0 # primary GPU index
| Field | Default | Description |
|---|---|---|
| host | 127.0.0.1 | HTTP server bind address |
| port | 11434 | HTTP server port |
| data_dir | ~/.a3s/power | Base directory for model storage |
| max_loaded_models | 1 | Maximum models loaded in memory concurrently |
| keep_alive | "5m" | Auto-unload idle models after this duration ("0"=immediate, "-1"=never, "5m", "1h", "30s") |
| gpu.gpu_layers | 0 | Number of layers to offload to GPU (0=CPU, -1=all) |
| gpu.main_gpu | 0 | Index of the primary GPU to use |
All fields are optional and have sensible defaults.
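The keep_alive values in the table mix special cases ("0", "-1") with unit-suffixed durations. One way to interpret them is sketched below in Python. Treating a bare number as seconds follows Ollama's convention and is an assumption here, as are the function names; the parsing in config.rs may accept more forms.

```python
import re
from typing import Optional, Union

_UNITS = {"s": 1, "m": 60, "h": 3600}
_DURATION = re.compile(r"^(\d+)([smh])$")


def parse_keep_alive(value: Union[str, int, float]) -> Optional[float]:
    """Return the idle timeout in seconds, or None for "never unload".

    A result of 0.0 means "unload as soon as the request completes".
    """
    if isinstance(value, (int, float)):
        return None if value < 0 else float(value)
    text = value.strip()
    match = _DURATION.match(text)
    if match:
        return float(int(match.group(1)) * _UNITS[match.group(2)])
    try:
        seconds = float(text)  # bare numbers are assumed to be seconds
    except ValueError:
        raise ValueError(f"invalid keep_alive value: {value!r}") from None
    return None if seconds < 0 else seconds


def should_unload(idle_seconds: float, keep_alive: Optional[float]) -> bool:
    """Decide whether a model idle for `idle_seconds` is due for unloading."""
    return keep_alive is not None and idle_seconds >= keep_alive
```

A background reaper would call should_unload for each loaded model on every tick, using the time since that model last served a request.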
Environment Variables (Ollama-Compatible)
Environment variables override config file values for drop-in Ollama compatibility:
| Variable | Description | Example |
|---|---|---|
| OLLAMA_HOST | Server bind address (host:port or host) | 0.0.0.0:11434 |
| OLLAMA_MODELS | Model storage directory | /data/models |
| OLLAMA_KEEP_ALIVE | Default keep-alive duration | 10m, -1, 0 |
| OLLAMA_MAX_LOADED_MODELS | Max concurrent loaded models | 3 |
| OLLAMA_NUM_GPU | GPU layers to offload (-1 = all) | -1 |
| A3S_POWER_HOME | Base directory for all Power data | ~/.a3s/power |
OLLAMA_HOST supports scheme prefixes (e.g. http://0.0.0.0:8080).
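The accepted OLLAMA_HOST forms (host, host:port, and an optional scheme prefix) can be normalized into a bind address with a few string operations. The sketch below is illustrative only: the function name is invented, the defaults are taken from the configuration table, and IPv6 literals are deliberately not handled.

```python
from typing import Tuple

DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 11434


def parse_ollama_host(value: str) -> Tuple[str, int]:
    """Split an OLLAMA_HOST-style value into (host, port)."""
    text = value.strip()
    # Drop an optional scheme prefix such as http:// or https://
    if "://" in text:
        text = text.split("://", 1)[1]
    text = text.rstrip("/")
    host, sep, port = text.rpartition(":")
    if sep and port.isdigit():
        return host or DEFAULT_HOST, int(port)
    # No port given: the whole value is the host.
    return text or DEFAULT_HOST, DEFAULT_PORT
```

For example, parse_ollama_host("http://0.0.0.0:8080") yields ("0.0.0.0", 8080), while a bare "localhost" keeps the default port.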
Feature Flags
| Flag | Description |
|---|---|
| llamacpp | Enable llama.cpp inference backend via llama-cpp-2. Requires a C++ compiler and CMake. |
Without any feature flags, Power can manage models (pull, list, delete) and serve API responses, but inference calls will return a "backend not available" error.
Development
Build Commands
# Build
# Test
# Lint
# Run
Project Structure
power/
├── Cargo.toml
├── README.md
├── LICENSE
├── .gitignore
└── src/
├── main.rs # Binary entry point (CLI dispatch)
├── lib.rs # Library root (module re-exports)
├── error.rs # PowerError enum + Result<T> alias
├── config.rs # TOML configuration (host, port, data_dir)
├── dirs.rs # Platform-specific paths (~/.a3s/power/)
├── cli/
│ ├── mod.rs # Cli struct + Commands enum (clap)
│ ├── run.rs # Interactive chat + single prompt
│ ├── pull.rs # Download with progress bar
│ ├── push.rs # Push model to remote registry
│ ├── list.rs # Tabular model listing
│ ├── show.rs # Model detail display
│ ├── delete.rs # Model + blob deletion
│ ├── ps.rs # List running models (queries server)
│ ├── stop.rs # Stop/unload a running model
│ └── serve.rs # HTTP server startup
├── model/
│ ├── manifest.rs # ModelManifest, ModelFormat, ModelParameters
│ ├── registry.rs # In-memory index backed by disk manifests
│ ├── storage.rs # Content-addressed blob store (SHA-256)
│ ├── pull.rs # HTTP download with progress callback
│ ├── push.rs # Push model to remote registry
│ ├── resolve.rs # Name-based model resolution (Ollama registry → built-in → HuggingFace)
│ ├── ollama_registry.rs # Ollama registry client (fetch manifests, metadata, blob URLs)
│ ├── modelfile.rs # Modelfile parser (FROM, PARAMETER, SYSTEM, TEMPLATE, etc.)
│ └── known_models.json# Built-in registry of popular GGUF models (offline fallback)
├── backend/
│ ├── mod.rs # Backend trait + BackendRegistry
│ ├── types.rs # Inference types (vision, tools, chat, completion, embedding)
│ ├── llamacpp.rs # llama.cpp backend (feature-gated, multi-model, KV cache reuse)
│ ├── chat_template.rs # Chat template detection, Jinja2 rendering (minijinja), and fallback formatting
│ ├── json_schema.rs # JSON Schema → GBNF grammar converter for structured output
│ ├── tool_parser.rs # Tool call output parser (XML, Mistral, JSON formats)
│ └── test_utils.rs # MockBackend for testing
├── server/
│ ├── mod.rs # Server startup (bind, listen)
│ ├── state.rs # Shared AppState with LRU model tracking
│ ├── router.rs # Axum router with CORS + tracing + metrics
│ └── metrics.rs # Prometheus metrics collection and /metrics handler
└── api/
├── autoload.rs # Model auto-loading on first inference
├── health.rs # GET /health endpoint
├── types.rs # OpenAI + Ollama request/response types
├── sse.rs # Streaming utilities (NDJSON for native API, SSE for OpenAI API)
├── native/
│ ├── mod.rs # Ollama-compatible route group
│ ├── generate.rs # POST /api/generate
│ ├── chat.rs # POST /api/chat (vision + tools)
│ ├── models.rs # GET /api/tags, POST /api/show, DELETE /api/delete
│ ├── pull.rs # POST /api/pull (streaming progress)
│ ├── push.rs # POST /api/push (push to registry)
│ ├── blobs.rs # HEAD/POST/GET /api/blobs/:digest
│ ├── embeddings.rs# POST /api/embeddings
│ ├── embed.rs # POST /api/embed (batch embeddings)
│ ├── ps.rs # GET /api/ps (running models)
│ ├── copy.rs # POST /api/copy (model aliasing)
│ ├── create.rs # POST /api/create (from Modelfile)
│ └── version.rs # GET /api/version
└── openai/
├── mod.rs # OpenAI-compatible route group + shared helpers
├── chat.rs # POST /v1/chat/completions
├── completions.rs # POST /v1/completions
├── models.rs # GET /v1/models
└── embeddings.rs# POST /v1/embeddings
A3S Ecosystem
A3S Power is an infrastructure component of the A3S ecosystem — a standalone model server that enables local LLM inference for other A3S tools.
┌──────────────────────────────────────────────────────────┐
│ A3S Ecosystem │
│ │
│ Infrastructure: a3s-box (MicroVM sandbox runtime) │
│ a3s-power (local model serving) │
│ │ ▲ │
│ Application: a3s-code ────┘ (AI coding agent) │
│ / \ │
│ Utilities: a3s-lane a3s-context │
│ (memory/knowledge) │
│ │
│ a3s-power ◄── You are here │
└──────────────────────────────────────────────────────────┘
| Project | Package | Relationship |
|---|---|---|
| box | a3s-box-* | Can use Power for local model inference |
| code | a3s-code | Uses Power as a local model backend |
| lane | a3s-lane | Independent utility (no direct relationship) |
| context | a3s-context | Independent utility (no direct relationship) |
Standalone Usage: a3s-power works independently as a local model server for any application:
- Drop-in Ollama replacement with identical API and NDJSON wire format
- Pull any model from Ollama registry by name (llama3.2:3b, qwen2.5:7b, etc.)
- OpenAI SDK compatible for seamless integration
- Local-first inference with no cloud dependency
Roadmap
Phase 1: Core ✅
- CLI model management (pull, list, show, delete)
- Content-addressed storage with SHA-256
- Model manifest system with JSON persistence
- TOML configuration
- Platform-specific directory resolution
- Comprehensive unit test foundation
Phase 2: Backend & Inference ✅
- Backend trait abstraction
- llama.cpp backend via llama-cpp-2 (feature-gated)
- Streaming token generation via channels
- Interactive chat with conversation history
- Single prompt mode
Phase 3: HTTP Server ✅
- Axum-based HTTP server with CORS + tracing
- Ollama-compatible native API (12 endpoints + blob management)
- OpenAI-compatible API (4 endpoints)
- SSE streaming for all inference endpoints
- Non-streaming response collection
Phase 4: Polish & Production ✅
- Model registry resolution (name-based pulls with Ollama registry → built-in registry → HuggingFace fallback)
- Embedding generation support (automatic reload with embedding mode)
- Multiple concurrent model loading (HashMap storage with LRU eviction)
- Model auto-loading on first API request
- GPU acceleration configuration ([gpu] config with layer offloading)
- Chat template auto-detection from GGUF metadata (ChatML, Llama, Phi, Generic)
- Health check endpoint (/health)
- Prometheus metrics endpoint (/metrics with request/token/model counters)
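The "multiple concurrent model loading" item in this phase pairs a map of loaded models with LRU eviction. A compact Python sketch of that bookkeeping follows, using an ordered dictionary in place of the HashMap. The class name and the placeholder model handle are assumptions of this example, not the structure in server/state.rs.

```python
from collections import OrderedDict
from typing import List


class LoadedModels:
    """Tracks loaded model names in least-recently-used order."""

    def __init__(self, max_loaded_models: int = 1) -> None:
        # Guard against a zero or negative capacity.
        self.capacity = max(1, max_loaded_models)
        self._models: "OrderedDict[str, object]" = OrderedDict()

    def touch(self, name: str) -> List[str]:
        """Mark `name` as used, loading it if needed; return evicted names."""
        evicted: List[str] = []
        if name in self._models:
            self._models.move_to_end(name)  # now the most recently used
            return evicted
        while len(self._models) >= self.capacity:
            oldest, _ = self._models.popitem(last=False)
            evicted.append(oldest)
        self._models[name] = object()  # placeholder for a real model handle
        return evicted

    def loaded(self) -> List[str]:
        return list(self._models)
```

With a capacity of 2, touching a, b, a, and then c evicts b, because a was used more recently.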
Phase 5: Full Ollama Parity ✅
- Vision/Multimodal support (MessageContent enum with text + image URL parts)
- Tool/Function calling (tool definitions, tool choice, tool call responses)
- Push API + CLI with streaming progress (POST /api/push, a3s-power push)
- Blob management API (HEAD/POST/GET/DELETE /api/blobs/:digest)
- Generate API: system, template, raw, suffix, context, images fields
- Native chat images field (Ollama base64 format)
- CLI cp command for model aliasing
- New error variants (UploadFailed, InvalidDigest, BlobNotFound)
Phase 6: Observability & Cost Tracking ✅
End-to-end observability for LLM inference:
- OpenTelemetry-Ready Metrics: Instrument inference pipeline with Prometheus metrics
  - power_inference_duration_seconds{model} summary (count + sum)
  - power_ttft_seconds{model} summary (time to first token)
  - Per-model inference instrumentation across all 4 inference endpoints
- Token & Cost Metrics: Per-call recording via Prometheus
  - power_inference_tokens_total{model, type=input|output} counter
  - power_cost_dollars{model} counter
  - power_inference_duration_seconds{model} summary
  - power_ttft_seconds{model} summary (time to first token)
- Cost Dashboard Data: Aggregate cost by model / day
  - JSON export endpoint: GET /v1/usage with date range and model filter
- Model Lifecycle Metrics: Load time, memory usage, eviction count
  - power_model_load_duration_seconds{model} summary
  - power_model_memory_bytes{model} gauge
  - power_model_evictions_total counter
- GPU Utilization Metrics: GPU memory, compute utilization per device
  - power_gpu_memory_bytes{device} gauge
  - power_gpu_utilization{device} gauge
Phase 7: Ollama Drop-in Compatibility ✅
Wire-format and runtime compatibility for seamless Ollama replacement:
- Ollama Registry Integration: Pull any model from registry.ollama.ai by name; primary resolution source with template, system prompt, params, and license metadata
- NDJSON Streaming: Native API endpoints (/api/generate, /api/chat, /api/pull, /api/push) stream as application/x-ndjson (Ollama wire format); OpenAI endpoints keep SSE
- Automatic Model Unloading: Background keep_alive reaper checks every 5s and unloads idle models (configurable: "5m", "1h", "0", "-1")
- Context Token Return: /api/generate returns token IDs in context field for conversation continuity
- 888 comprehensive unit tests
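To make the keep_alive values above concrete, here is a minimal sketch of how strings such as "5m", "1h", "0", and "-1" could be interpreted by an idle-model reaper. The function names are hypothetical. Treating a bare number as seconds and any negative value as "never unload" are assumptions based on Ollama's documented behavior, not a description of a3s-power's parser.

```python
import re

# Seconds per unit suffix; a bare number is assumed to mean seconds.
_UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600}


def parse_keep_alive(value):
    """Return the idle time in seconds before unloading, or None for 'never unload'."""
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)([smh]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognized keep_alive value: {value!r}")
    number = float(match.group(1))
    if number < 0:
        # Assumption: negative durations keep the model loaded indefinitely.
        return None
    return number * _UNIT_SECONDS[match.group(2)]


def should_unload(idle_seconds, keep_alive):
    """Decide whether a model idle for `idle_seconds` is due for unloading."""
    limit = parse_keep_alive(keep_alive)
    return limit is not None and idle_seconds >= limit
```

Under these assumptions "0" unloads a model as soon as the reaper next runs, while "-1" never triggers an unload.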
Phase 8: Advanced Compatibility ✅
- Jinja2/Go Template Engine: Render arbitrary Jinja2 chat templates via minijinja (Llama 3, Gemma, ChatML, Phi, custom) with hardcoded fallback; prefers Ollama registry template_override over GGUF metadata
- KV Cache Reuse: Persist LlamaContext across requests with prefix matching; skips re-evaluating shared prompt tokens for multi-turn conversation speedup
- Tool Call Parsing: Parse model output into structured tool_calls; supports <tool_call> XML (Hermes/Qwen), [TOOL_CALLS] prefix (Mistral), and raw JSON formats; zero overhead when no tools in request
- JSON Schema Structured Output: Support format: {"type":"object","properties":{...}} via JSON Schema → GBNF grammar conversion; accepts "json", {"type":"json_object"}, or full JSON Schema objects
- Vision Inference: Multimodal vision pipeline; accepts base64 images in Ollama images field and OpenAI image_url content parts; projector auto-downloaded from Ollama registry; uses llama.cpp mtmd API for image encoding when projector available
- ADAPTER Support: LoRA/QLoRA adapter loading at inference time; Modelfile ADAPTER directive parsed, adapter file loaded via llama_lora_adapter_init, applied to context with lora_adapter_set at scale 1.0
- MESSAGE Directive: Pre-seeded conversation history via Modelfile MESSAGE directive; messages stored in manifest and automatically prepended to chat requests
- 888 comprehensive unit tests
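The three tool-call formats named above (`<tool_call>` XML, the `[TOOL_CALLS]` prefix, and raw JSON) can be illustrated with a short sketch. It assumes each payload is a JSON object, or a list of objects, with `name` and optional `arguments` keys; that shape, and the function name `parse_tool_calls`, are illustrative assumptions rather than the project's actual parser.

```python
import json
import re

_TAGGED = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
_PREFIX = "[TOOL_CALLS]"


def _as_calls(payload):
    """Normalize one decoded JSON payload into a list of tool-call dicts."""
    items = payload if isinstance(payload, list) else [payload]
    calls = []
    for item in items:
        if isinstance(item, dict) and "name" in item:
            calls.append({"name": item["name"],
                          "arguments": item.get("arguments", {})})
    return calls


def parse_tool_calls(output):
    """Extract tool calls from model output; return [] when none are found."""
    text = output.strip()
    tagged = _TAGGED.findall(text)
    if tagged:
        candidates = tagged                      # <tool_call>...</tool_call> blocks
    elif text.startswith(_PREFIX):
        candidates = [text[len(_PREFIX):]]       # [TOOL_CALLS] followed by JSON
    else:
        candidates = [text]                      # possibly raw JSON
    calls = []
    for candidate in candidates:
        try:
            calls.extend(_as_calls(json.loads(candidate)))
        except json.JSONDecodeError:
            continue                             # plain text, not a tool call
    return calls
```

A caller that received no tool definitions in the request could skip this function entirely, which is one way to read the "zero overhead when no tools in request" note.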
Phase 9: Operational Parity ✅
Runtime and CLI parity for production Ollama replacement:
- Default Port 11434: Matches Ollama's default port for zero-config drop-in replacement
- ps CLI Command: List running (loaded) models via a3s-power ps (queries server GET /api/ps)
- stop CLI Command: Unload a running model via a3s-power stop <model> (sends keep_alive: 0)
- Ollama Environment Variables: OLLAMA_HOST, OLLAMA_MODELS, OLLAMA_KEEP_ALIVE, OLLAMA_MAX_LOADED_MODELS, OLLAMA_NUM_GPU override the config file for container/script compatibility
- Download Resumption: Interrupted model downloads resume automatically via HTTP Range requests with partial file tracking
- 888 comprehensive unit tests
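Download resumption with HTTP Range requests generally comes down to two decisions: which header to send given the bytes already on disk, and how to open the partial file once the server answers. The sketch below shows both under standard HTTP semantics (206 Partial Content versus 200 OK); the function names are hypothetical and this is not taken from the project's downloader.

```python
def range_header(bytes_on_disk):
    """Headers asking the server to continue after the bytes already saved."""
    if bytes_on_disk < 0:
        raise ValueError("bytes_on_disk cannot be negative")
    # Nothing saved yet: send a normal request without a Range header.
    return {"Range": f"bytes={bytes_on_disk}-"} if bytes_on_disk else {}


def write_mode(status_code):
    """Pick the file mode for the partial file from the response status."""
    if status_code == 206:
        return "ab"   # server honored the range: append to the partial file
    if status_code == 200:
        return "wb"   # server sent the full body: overwrite and start again
    raise ValueError(f"unexpected status for a download: {status_code}")
```

Falling back to `"wb"` on a 200 response matters because some servers ignore Range headers, and appending a full body to a partial file would corrupt it.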
Phase 10: Intelligence & Observability ✅
GPU auto-detection, memory estimation, verbose model inspection, and per-layer pull progress:
- GPU Auto-Detection: Detect Apple Metal (via system_profiler) and NVIDIA CUDA (via nvidia-smi) GPUs at server startup; auto-set gpu_layers = -1 when a GPU is available and the user hasn't explicitly configured it
- Memory Estimation: Estimate VRAM requirements before loading (model weights + KV cache + compute overhead); log estimates to help users right-size their hardware
- GGUF Metadata Reader: Lightweight binary parser for GGUF v2/v3 file headers; extracts all key-value metadata and tensor descriptors without loading weights into memory
- Verbose Show: /api/show with verbose: true returns full GGUF metadata (architecture, context length, embedding dimensions, etc.) and tensor information (name, shape, type, element count)
- Per-Layer Pull Progress: Streaming pull progress shows per-layer digest identifiers (pulling sha256:abc123...) matching Ollama's output format; resolves model before download to extract layer digests
- 888 comprehensive unit tests
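The memory estimate above is described as model weights + KV cache + compute overhead. A rough sketch of that sum follows. The KV cache term uses the common transformer approximation (keys and values for every layer, position, and KV head); the 10% overhead figure, the parameter names, and the function name are assumptions for illustration, not values taken from a3s-power.

```python
def estimate_vram_bytes(weights_bytes, n_layers, n_ctx, n_kv_heads, head_dim,
                        kv_bytes_per_element=2, overhead_percent=10):
    """Rough VRAM estimate: weights + KV cache + a percentage for compute buffers."""
    # Keys and values (factor 2) each store one head_dim vector per KV head,
    # per context position, per layer.
    kv_cache = 2 * n_layers * n_ctx * n_kv_heads * head_dim * kv_bytes_per_element
    # Assumed flat percentage for scratch and compute buffers.
    overhead = (weights_bytes + kv_cache) * overhead_percent // 100
    return {
        "weights": weights_bytes,
        "kv_cache": kv_cache,
        "overhead": overhead,
        "total": weights_bytes + kv_cache + overhead,
    }
```

For example, a hypothetical model with 32 layers, 8 KV heads of dimension 128, a 4096-token context, and an f16 cache would need 512 MiB for the KV cache alone under this approximation.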
Phase 11: Full Options Parity ✅
Complete Ollama generation options support and multi-GPU wiring:
- Missing Generation Options: Added repeat_last_n, penalize_newline, num_batch, num_thread, num_thread_batch, use_mmap, use_mlock, numa, flash_attention, num_gpu, main_gpu to GenerateOptions
- Backend Wiring: All new options flow through API → backend CompletionRequest/ChatRequest → llama.cpp context params and sampler
- Flash Attention: Wired to LlamaContextParams::with_flash_attention_policy(Enabled) when flash_attention: true
- Multi-GPU: main_gpu config wired to LlamaModelParams::with_main_gpu(); per-request num_gpu/main_gpu override supported
- Memory Lock: use_mlock config wired to LlamaModelParams::with_use_mlock(true) to prevent model swapping
- Thread Control: num_thread and num_thread_batch wired to LlamaContextParams::with_n_threads() and with_n_threads_batch()
- Batch Size: num_batch wired to LlamaContextParams::with_n_batch()
- Repeat Penalty Window: repeat_last_n wired to LlamaSampler::penalties() first argument (was hardcoded to 64)
- Config Extensions: Added use_mlock, num_thread, flash_attention to PowerConfig with TOML support
- 888 comprehensive unit tests
Phase 12: CLI Run Options Parity ✅
Complete Ollama CLI run command options — all 14/14 options now implemented:
- --format: JSON output format constraint (accepts "json" or JSON schema object)
- --system: Override system prompt per session (prepended as system message)
- --template: Override chat template (reserved for template engine integration)
- --keep-alive: Model keep-alive duration (e.g. "5m", "1h", "-1" for never unload)
- --verbose: Show timing and token statistics after each generation (prompt eval count/rate, eval count, total duration, tokens/s)
- --insecure: Skip TLS verification flag for registry operations
- 888 comprehensive unit tests
Phase 13: Environment Variables & CLI Polish ✅
Complete Ollama environment variable parity and CLI enhancements:
- OLLAMA_NUM_PARALLEL: Number of parallel request slots (concurrent inference)
- OLLAMA_DEBUG: Enable debug logging (sets RUST_LOG=debug if not already set)
- OLLAMA_ORIGINS: Custom CORS origins (comma-separated); empty = permissive
- OLLAMA_FLASH_ATTENTION: Global flash attention override ("1" or "true")
- OLLAMA_TMPDIR: Custom temporary directory for downloads and scratch files
- CLI show --verbose: Display full GGUF metadata (keys, values, tensor list) from CLI
- CLI pull --insecure: Skip TLS verification for pull operations
- CLI push --insecure: Skip TLS verification for push operations
- Interactive /help: Show available slash commands in interactive chat
- Interactive /clear: Clear conversation history (preserves system prompt)
- Interactive /show: Display model name, message counts, and current settings
- Interactive """: Multi-line input support with triple-quote delimiters
- CORS Configuration: Server respects OLLAMA_ORIGINS for restricted CORS; defaults to permissive
- 888 comprehensive unit tests
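The OLLAMA_ORIGINS rule above (comma-separated list, empty means permissive) can be sketched as follows. The helper names are hypothetical, and exact-match comparison of origins is an assumption; the real server may support wildcards or other matching rules not described here.

```python
def parse_origins(raw):
    """Split a comma-separated origins value; None means 'allow any origin'."""
    origins = [part.strip() for part in (raw or "").split(",") if part.strip()]
    return origins or None


def is_origin_allowed(origin, raw):
    """Check one request origin against the configured value."""
    allowed = parse_origins(raw)
    # Unset or empty configuration falls back to permissive CORS.
    return allowed is None or origin in allowed
```

In a deployment, `raw` would come from something like `os.environ.get("OLLAMA_ORIGINS")`.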
Phase 14: Final Ollama Parity ✅
Complete remaining Ollama feature gaps — help subcommand, blob pruning, GPU scheduling:
- help subcommand: a3s-power help [command] prints help for any subcommand (replaces clap's built-in)
- Blob pruning: prune_unused_blobs() removes orphaned blob files not referenced by any manifest; returns count and bytes freed
- OLLAMA_NOPRUNE: Disable automatic blob pruning ("1" or "true")
- OLLAMA_SCHED_SPREAD: Spread model layers across all available GPUs ("1" or "true")
- 888 comprehensive unit tests
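Blob pruning as described above (delete blobs that no manifest references, report count and bytes freed) can be sketched in a few lines. The on-disk layout here is an assumption made for the example: manifests are `*.json` files with a `layers` list of digests, and each blob file is named after its digest. The function is deliberately named `prune_unused_blobs_sketch` to distinguish it from the project's own `prune_unused_blobs()`.

```python
import json
import tempfile
from pathlib import Path


def prune_unused_blobs_sketch(manifests_dir, blobs_dir):
    """Delete blobs not referenced by any manifest; return (count, bytes_freed)."""
    referenced = set()
    for manifest in Path(manifests_dir).glob("*.json"):
        data = json.loads(manifest.read_text())
        referenced.update(data.get("layers", []))   # assumed manifest shape
    removed, freed = 0, 0
    for blob in Path(blobs_dir).iterdir():
        if blob.is_file() and blob.name not in referenced:
            freed += blob.stat().st_size
            blob.unlink()
            removed += 1
    return removed, freed


def demo():
    """Build a throwaway store with one referenced and one orphaned blob."""
    with tempfile.TemporaryDirectory() as root:
        manifests, blobs = Path(root, "manifests"), Path(root, "blobs")
        manifests.mkdir()
        blobs.mkdir()
        (blobs / "sha256-aaa").write_bytes(b"x" * 10)   # referenced
        (blobs / "sha256-bbb").write_bytes(b"y" * 7)    # orphaned
        (manifests / "model.json").write_text(json.dumps({"layers": ["sha256-aaa"]}))
        result = prune_unused_blobs_sketch(manifests, blobs)
        remaining = sorted(p.name for p in blobs.iterdir())
        return result, remaining
```

A switch such as OLLAMA_NOPRUNE would simply skip the call.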
Phase 15: Thinking & Reasoning 🚧
Critical for DeepSeek-R1, QwQ, and other reasoning models:
- think parameter: ThinkValue type (bool or "high"/"medium"/"low") in generate/chat requests
- thinking response field: Separate thinking content from response in Message.thinking and GenerateResponse.thinking
- Thinking parser: Streaming parser that separates <think>...</think> blocks from content; infer tags from template
- run --think CLI flag: Enable thinking mode from interactive chat
- run --hidethinking CLI flag: Hide thinking output in CLI display
- OpenAI reasoning/reasoning_effort: Map to think parameter in /v1/chat/completions
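The main difficulty in a streaming thinking parser is that a tag can be split across chunks (for example `<thi` followed by `nk>`). The sketch below is one possible approach, with fixed `<think>` and `</think>` tags and a hypothetical `ThinkingSplitter` class; it holds back any trailing text that could still turn out to be the start of a tag. Inferring tags from the template, as the roadmap item mentions, is not covered.

```python
class ThinkingSplitter:
    """Split a token stream into thinking text and visible content."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self._buffer = ""
        self._in_think = False

    def feed(self, chunk):
        """Consume one chunk; return (thinking, content) that is safe to emit now."""
        self._buffer += chunk
        thinking, content = [], []
        while self._buffer:
            tag = self.CLOSE if self._in_think else self.OPEN
            out = thinking if self._in_think else content
            index = self._buffer.find(tag)
            if index >= 0:
                # Full tag found: emit the text before it and switch mode.
                out.append(self._buffer[:index])
                self._buffer = self._buffer[index + len(tag):]
                self._in_think = not self._in_think
                continue
            # No full tag: hold back a suffix that might be a partial tag.
            keep = 0
            for size in range(min(len(tag) - 1, len(self._buffer)), 0, -1):
                if tag.startswith(self._buffer[-size:]):
                    keep = size
                    break
            cut = len(self._buffer) - keep
            out.append(self._buffer[:cut])
            self._buffer = self._buffer[cut:]
            break
        return "".join(thinking), "".join(content)

    def flush(self):
        """Emit whatever is still buffered once the stream has ended."""
        rest, self._buffer = self._buffer, ""
        return (rest, "") if self._in_think else ("", rest)
```

Each `feed` call returns two strings, so a server could route the first into a `thinking` field and the second into the regular response as chunks arrive.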
Phase 16: Logprobs & Context Control 🚧
Log probabilities and context window management:
- logprobs/top_logprobs: Return log probabilities in generate/chat responses with Logprob/TokenLogprob types
- truncate field: Truncate prompt when exceeding context length instead of erroring
- shift field: Shift context window when hitting limit instead of erroring
- OLLAMA_CONTEXT_LENGTH: Global default context length override env var
- OLLAMA_KV_CACHE_TYPE: KV cache quantization type (f16/q8_0/q4_0)
Phase 17: OpenAI API Parity 🚧
Additional OpenAI-compatible endpoints and fields:
- GET /v1/models/:model: Retrieve single model details
- POST /v1/responses: OpenAI Responses API compatibility
- POST /v1/messages: Anthropic Messages API compatibility via middleware
- stream_options.include_usage: Return usage stats in final streaming chunk
- encoding_format: "float" or "base64" for embedding responses
- dimensions: Truncate output embeddings to specified dimension
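The `encoding_format` and `dimensions` items above concern how an embedding vector is shaped and serialized. The sketch below shows one plausible reading: `dimensions` keeps the first N components, and `"base64"` encodes the vector as packed little-endian 32-bit floats. Both the byte layout and the function names are assumptions for illustration; whether truncated vectors should be re-normalized is left open here.

```python
import base64
import struct


def truncate_embedding(vector, dimensions):
    """Keep only the first `dimensions` components of an embedding."""
    if dimensions < 1:
        raise ValueError("dimensions must be positive")
    return list(vector[:dimensions])


def encode_embedding(vector, encoding_format="float"):
    """Serialize an embedding as a list of floats or a base64 string."""
    if encoding_format == "float":
        return list(vector)
    if encoding_format == "base64":
        # Assumed layout: little-endian float32 values packed back to back.
        packed = struct.pack(f"<{len(vector)}f", *vector)
        return base64.b64encode(packed).decode("ascii")
    raise ValueError(f"unsupported encoding_format: {encoding_format!r}")


def decode_embedding(encoded):
    """Inverse of the base64 branch above, as a client might implement it."""
    raw = base64.b64decode(encoded)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))
```

The base64 form is mainly a bandwidth optimization: four bytes per component instead of a decimal string.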
Phase 18: Create API & Model Management 🚧
Align with Ollama's new structured Create API:
- Structured Create API: Support from, files, adapters, template, system, parameters, messages, license fields (not just Modelfile)
- Re-quantization: Integrate llama.cpp quantization for create --quantize
- SafeTensors conversion: Convert SafeTensors → GGUF during create
- ShowResponse fields: Add capabilities, renderer, parser, projector_info, messages, remote_model, remote_host
- ProcessResponse fields: Add size_vram, context_length to /api/ps
- tool_calls in GenerateResponse: Return tool calls from /api/generate (not just /api/chat)
Phase 19: Auth & Registry Push 🚧
Account management and registry push:
- Registry push (OCI auth): Push to registry.ollama.ai with keypair-based auth
- signin/signout CLI: Sign in/out of ollama.com account
- POST /api/me: Whoami endpoint
- POST /api/signout: Signout endpoint
Phase 20: Environment Variables & CLI Polish 🚧
Remaining env vars and CLI flags:
- OLLAMA_GPU_OVERHEAD: Reserve VRAM per GPU (bytes)
- OLLAMA_LOAD_TIMEOUT: Stall detection timeout for model loads
- OLLAMA_MAX_QUEUE: Maximum queued requests
- OLLAMA_NOHISTORY: Disable readline history
- OLLAMA_MULTIUSER_CACHE: Optimize prompt caching for multi-user
- OLLAMA_REMOTES: Allowed hosts for remote models
- show --license/--modelfile/--parameters/--template/--system: Show individual sections
- run --nowordwrap: Disable word wrapping in CLI
- run --truncate/--dimensions: Embedding-specific CLI flags
- _debug_render_only: Debug mode returning rendered template
- GET / and HEAD /: Return "Ollama is running" for compatibility checks
- Request queuing: Queue requests when all model slots busy (OLLAMA_MAX_QUEUE)
- num_parallel wiring: Wire to llama.cpp n_parallel for concurrent request slots
License
MIT