A3S Power
Overview
A3S Power is an Ollama-compatible CLI tool and HTTP server for local model management and inference. It provides both an Ollama-compatible native API and an OpenAI-compatible API, so existing tools, SDKs, and frontends work out of the box.
Basic Usage
```bash
# Pull a model by name (resolves from Ollama registry, built-in registry, or HuggingFace)
a3s-power pull llama3.2:3b

# Pull from a direct URL
a3s-power pull <direct-url-to-gguf>

# Interactive chat
a3s-power run llama3.2:3b

# Single prompt
a3s-power run llama3.2:3b --prompt "Why is the sky blue?"

# Push a model to a remote registry
a3s-power push llama3.2:3b --destination <registry-url>

# Start HTTP server
a3s-power serve
```
Features
- CLI Model Management: Pull, list, show, delete, and push models from the command line
- Ollama Registry Integration: Pull any model from `registry.ollama.ai` by name (`llama3.2:3b`) — primary resolution source with built-in registry and HuggingFace fallback
- Interactive Chat: Multi-turn conversation with streaming token output
- Vision/Multimodal Support: Accept image URLs in chat messages (OpenAI-compatible `content` array format)
- Tool/Function Calling: Structured tool definitions, tool choice, and tool call responses (OpenAI-compatible)
- JSON Schema Structured Output: Constrain model output to match a JSON Schema via GBNF grammar generation — supports `"json"`, `{"type":"json_object"}`, or full JSON Schema objects
- Chat Template Auto-Detection: Detects ChatML, Llama, Phi, and Generic templates from GGUF metadata
- Jinja2 Template Engine: Renders arbitrary Jinja2 chat templates via `minijinja` (Llama 3, Gemma, ChatML, Phi, custom) with hardcoded fallback
- KV Cache Reuse: Persists `LlamaContext` across requests with prefix matching — skips re-evaluating shared prompt tokens for multi-turn speedup
- Tool Call Parsing: Parses model output into structured `tool_calls` — supports `<tool_call>` XML, `[TOOL_CALLS]` prefix, and raw JSON formats
- Modelfile Support: Create custom models with `FROM`, `PARAMETER`, `SYSTEM`, `TEMPLATE`, `ADAPTER` (LoRA/QLoRA), `LICENSE`, and `MESSAGE` (pre-seeded conversations) directives
- Multiple Concurrent Models: Load multiple models with LRU eviction at configurable capacity
- Automatic Model Unloading: Background keep_alive reaper unloads idle models after a configurable timeout (default 5m)
- GPU Acceleration: Configurable GPU layer offloading via the `[gpu]` config section with automatic GPU detection (Metal/CUDA), multi-GPU support (`main_gpu`), and per-request `num_gpu` override
- GPU Auto-Detection: Automatically detects Apple Metal and NVIDIA CUDA GPUs at server startup and sets optimal `gpu_layers` when not explicitly configured
- Memory Estimation: Estimates VRAM requirements before loading a model (model weights + KV cache + compute overhead) and logs warnings
- Full Ollama Options: All Ollama generation options supported — `repeat_last_n`, `penalize_newline`, `num_batch`, `num_thread`, `num_thread_batch`, `use_mmap`, `use_mlock`, `numa`, `flash_attention`, `num_gpu`, `main_gpu` — in addition to standard sampling parameters
- Embedding Support: Real embedding generation with automatic model reload in embedding mode
- HTTP Server: Axum-based server with CORS, tracing, and metrics middleware
- Ollama-Compatible API: `/api/generate`, `/api/chat`, `/api/tags`, `/api/pull`, `/api/push`, `/api/show`, `/api/delete`, `/api/embeddings`, `/api/embed`, `/api/ps`, `/api/copy`, `/api/version`, `/api/blobs/:digest`
- OpenAI-Compatible API: `/v1/chat/completions`, `/v1/completions`, `/v1/models`, `/v1/embeddings`
- Blob Management API: Check, upload, and download content-addressed blobs via REST
- Push API: Upload models to remote registries with progress reporting
- NDJSON Streaming: Native API endpoints stream as `application/x-ndjson` (Ollama wire format); OpenAI endpoints use SSE
- Context Token Return: `/api/generate` returns token IDs in the `context` field for conversation continuity
- Prometheus Metrics: `GET /metrics` endpoint with request counts, durations, tokens, model gauges, inference duration, TTFT, cost, evictions, model memory, and GPU metrics
- Usage Dashboard: `GET /v1/usage` endpoint with date range and model filtering for cost tracking
- GGUF Metadata Reader: Lightweight binary parser for GGUF file headers — extracts architecture metadata and tensor descriptors without loading weights
- Verbose Show: `/api/show` with `verbose: true` returns full GGUF metadata and tensor information
- Per-Layer Pull Progress: Pull progress shows per-layer digest identifiers (`pulling sha256:abc...`) matching Ollama's output format
- Content-Addressed Storage: Model blobs stored by SHA-256 hash with automatic deduplication
- llama.cpp Backend: GGUF inference via `llama-cpp-2` Rust bindings (optional feature flag)
- Health Check: `GET /health` endpoint with uptime, version, and loaded model count
- Model Auto-Loading: Models are automatically loaded on first inference request with LRU eviction
- TOML Configuration: User-configurable host, port, GPU settings, keep_alive, and storage settings
- Ollama Environment Variables: `OLLAMA_HOST`, `OLLAMA_MODELS`, `OLLAMA_KEEP_ALIVE`, `OLLAMA_MAX_LOADED_MODELS`, `OLLAMA_NUM_GPU` for drop-in compatibility
- Download Resumption: Interrupted model downloads resume automatically via HTTP Range requests
- Async-First: Built on Tokio for high-performance async operations
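The KV cache reuse above hinges on prefix matching: each turn of a chat resends the whole conversation, so most of the new prompt is a prefix the model has already evaluated. A minimal sketch of the idea (illustrative only; the real implementation tracks the tokens already evaluated into a `LlamaContext`):

```rust
// Illustrative sketch of KV-cache prefix matching. Tokens shared with the
// previous request keep their cached KV entries and are not re-evaluated;
// only the suffix after the common prefix needs a fresh forward pass.
fn common_prefix_len(cached: &[u32], incoming: &[u32]) -> usize {
    cached
        .iter()
        .zip(incoming.iter())
        .take_while(|(a, b)| a == b)
        .count()
}

fn main() {
    // Turn 1 evaluated these prompt tokens (hypothetical token IDs):
    let cached = [1, 15, 27, 99, 4];
    // Turn 2 repeats the conversation so far and appends a new user message:
    let incoming = [1, 15, 27, 99, 4, 88, 12, 7];

    let reused = common_prefix_len(&cached, &incoming);
    assert_eq!(reused, 5);
    // Only incoming[reused..] (3 tokens here) needs evaluation.
    assert_eq!(incoming.len() - reused, 3);
    println!("reused {} tokens, evaluating {}", reused, incoming.len() - reused);
}
```

The longer the shared conversation history, the larger the skipped prefix, which is why the speedup compounds over multi-turn sessions.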
Quality Metrics
Test Coverage
861 unit tests with 90.11% region coverage across 59 source files:
| Module | Lines | Coverage | Functions | Coverage |
|---|---|---|---|---|
| api/health.rs | 62 | 100.00% | 10 | 100.00% |
| api/mod.rs | 27 | 100.00% | 5 | 100.00% |
| api/native/mod.rs | 22 | 100.00% | 1 | 100.00% |
| api/native/ps.rs | 149 | 100.00% | 17 | 100.00% |
| api/native/version.rs | 21 | 100.00% | 6 | 100.00% |
| api/openai/mod.rs | 30 | 100.00% | 4 | 100.00% |
| api/openai/usage.rs | 384 | 100.00% | 27 | 100.00% |
| backend/llamacpp.rs | 186 | 100.00% | 26 | 100.00% |
| backend/test_utils.rs | 130 | 100.00% | 18 | 100.00% |
| cli/delete.rs | 102 | 100.00% | 5 | 100.00% |
| cli/list.rs | 88 | 100.00% | 7 | 100.00% |
| error.rs | 93 | 100.00% | 19 | 100.00% |
| model/manifest.rs | 164 | 100.00% | 19 | 100.00% |
| server/router.rs | 209 | 100.00% | 33 | 100.00% |
| backend/json_schema.rs | 389 | 98.97% | 53 | 100.00% |
| backend/tool_parser.rs | 347 | 99.14% | 43 | 100.00% |
| model/modelfile.rs | 552 | 99.28% | 42 | 100.00% |
| server/state.rs | 266 | 99.25% | 37 | 97.30% |
| api/sse.rs | 95 | 98.95% | 16 | 93.75% |
| api/types.rs | 613 | 98.37% | 52 | 100.00% |
| server/metrics.rs | 607 | 98.35% | 54 | 96.30% |
| backend/chat_template.rs | 349 | 98.28% | 32 | 100.00% |
| backend/mod.rs | 65 | 98.46% | 15 | 100.00% |
| dirs.rs | 55 | 98.18% | 12 | 91.67% |
| backend/types.rs | 261 | 98.08% | 23 | 95.65% |
| api/native/chat.rs | 735 | 94.42% | 32 | 100.00% |
| api/native/generate.rs | 709 | 95.77% | 32 | 100.00% |
| api/native/models.rs | 457 | 96.06% | 32 | 100.00% |
| config.rs | 475 | 96.84% | 60 | 96.67% |
| api/openai/embeddings.rs | 187 | 95.72% | 9 | 100.00% |
| api/native/blobs.rs | 212 | 94.81% | 15 | 100.00% |
| api/autoload.rs | 220 | 94.09% | 24 | 100.00% |
| api/native/embed.rs | 158 | 93.04% | 9 | 100.00% |
| model/gguf.rs | 746 | 93.43% | 80 | 80.00% |
| api/openai/models.rs | 118 | 93.22% | 9 | 100.00% |
| api/native/embeddings.rs | 133 | 96.24% | 7 | 100.00% |
| api/native/copy.rs | 60 | 91.67% | 6 | 100.00% |
| cli/mod.rs | 340 | 91.18% | 34 | 100.00% |
| api/native/create.rs | 340 | 90.00% | 19 | 94.74% |
| api/openai/chat.rs | 531 | 88.14% | 23 | 78.26% |
| model/registry.rs | 308 | 87.99% | 42 | 83.33% |
| model/storage.rs | 331 | 87.31% | 31 | 83.87% |
| cli/show.rs | 234 | 84.19% | 15 | 100.00% |
| api/openai/completions.rs | 394 | 82.99% | 14 | 78.57% |
| backend/gpu.rs | 281 | 82.92% | 38 | 92.11% |
| model/resolve.rs | 341 | 75.66% | 54 | 79.63% |
| api/native/push.rs | 187 | 75.40% | 10 | 80.00% |
| cli/push.rs | 43 | 74.42% | 10 | 90.00% |
| model/ollama_registry.rs | 530 | 73.21% | 57 | 70.18% |
| cli/ps.rs | 152 | 70.39% | 22 | 81.82% |
| cli/serve.rs | 34 | 70.59% | 4 | 50.00% |
| cli/stop.rs | 102 | 70.59% | 12 | 75.00% |
| server/mod.rs | 84 | 65.48% | 12 | 66.67% |
| model/push.rs | 151 | 62.91% | 27 | 81.48% |
| cli/pull.rs | 72 | 62.50% | 6 | 83.33% |
| api/native/pull.rs | 269 | 50.19% | 16 | 81.25% |
| cli/run.rs | 845 | 48.88% | 57 | 85.96% |
| model/pull.rs | 384 | 48.70% | 36 | 63.89% |
| TOTAL | 15429 | 87.94% | 1430 | 91.47% |
Overall: 90.11% region coverage, 91.47% function coverage, 87.94% line coverage
Run coverage report:
```bash
LLVM_COV=/opt/homebrew/Cellar/llvm/21.1.8/bin/llvm-cov \
LLVM_PROFDATA=/opt/homebrew/Cellar/llvm/21.1.8/bin/llvm-profdata \
  cargo llvm-cov
```
Architecture
Components
```
┌─────────────────────────────────────────────────┐
│                   a3s-power                     │
│                                                 │
│  CLI Layer                                      │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐  │
│  │ run  │ │ pull │ │ list │ │ push │ │serve │  │
│  └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘  │
│     │        │        │        │        │       │
│  Model Layer                   │        │       │
│  ┌────────────────────┴────────┐        │       │
│  │        ModelRegistry        │        │       │
│  │  ┌──────────┐ ┌──────────┐  │        │       │
│  │  │ manifest │ │ storage  │  │        │       │
│  │  └──────────┘ └──────────┘  │        │       │
│  └─────────────────────────────┘        │       │
│                                         │       │
│  Backend Layer                          │       │
│  ┌─────────────────────────────┐        │       │
│  │       BackendRegistry       │        │       │
│  │  ┌──────────────────────┐   │        │       │
│  │  │   LlamaCppBackend    │   │        │       │
│  │  │  (feature: llamacpp) │   │        │       │
│  │  └──────────────────────┘   │        │       │
│  └─────────────────────────────┘        │       │
│                                         │       │
│  Server Layer ◄─────────────────────────┘       │
│  ┌─────────────────────────────────────┐        │
│  │            Axum Router              │        │
│  │  ┌────────────┐ ┌────────────────┐  │        │
│  │  │  /api/*    │ │     /v1/*      │  │        │
│  │  │  (Ollama)  │ │   (OpenAI)     │  │        │
│  │  └────────────┘ └────────────────┘  │        │
│  └─────────────────────────────────────┘        │
└─────────────────────────────────────────────────┘
```
Backend Trait
The Backend trait abstracts inference engines. The llama.cpp backend is feature-gated; without the llamacpp feature, Power can still manage models but returns "backend not available" for inference calls.
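A rough sketch of what such a feature-gated trait/registry arrangement can look like (the names, signatures, and error variant here are illustrative, not the actual `backend/mod.rs` API):

```rust
// Illustrative sketch only: shows the shape of a backend registry that
// returns a "backend not available" error when no inference engine is
// compiled in (i.e. built without the `llamacpp` feature).
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum PowerError {
    BackendNotAvailable,
}

trait Backend {
    fn complete(&self, prompt: &str) -> Result<String, PowerError>;
}

struct BackendRegistry {
    backends: HashMap<String, Box<dyn Backend>>,
}

impl BackendRegistry {
    fn new() -> Self {
        Self { backends: HashMap::new() }
    }

    fn complete(&self, backend: &str, prompt: &str) -> Result<String, PowerError> {
        match self.backends.get(backend) {
            Some(b) => b.complete(prompt),
            None => Err(PowerError::BackendNotAvailable),
        }
    }
}

fn main() {
    // Built without an inference feature: the registry is empty, so model
    // management still works but inference calls fail with a typed error.
    let registry = BackendRegistry::new();
    assert_eq!(
        registry.complete("llamacpp", "hello"),
        Err(PowerError::BackendNotAvailable)
    );
    println!("backend not available, as expected");
}
```

Keeping inference behind a trait object lets the rest of the server compile and run identically whether or not the heavy llama.cpp dependency is present.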
Quick Start
Build
```bash
# Build without inference backend (model management only)
cargo build --release

# Build with llama.cpp inference (requires C++ compiler + CMake)
cargo build --release --features llamacpp
```
Model Management
```bash
# Pull a model by name (Ollama registry → built-in registry → HuggingFace fallback)
a3s-power pull llama3.2:3b

# Pull from a direct URL
a3s-power pull <direct-url-to-gguf>

# List local models
a3s-power list

# Show model details
a3s-power show llama3.2:3b

# Delete a model
a3s-power delete llama3.2:3b

# Push a model to a remote registry
a3s-power push llama3.2:3b --destination <registry-url>
```
Interactive Chat
```bash
# Start interactive chat session
a3s-power run llama3.2:3b

# Send a single prompt
a3s-power run llama3.2:3b --prompt "Why is the sky blue?"
```
HTTP Server
```bash
# Start server on default port (127.0.0.1:11434)
a3s-power serve

# Custom host and port
a3s-power serve --host 0.0.0.0 --port 8080
```
API Reference
Server
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Health check (status, version, uptime, loaded models) |
| `GET` | `/metrics` | Prometheus metrics (requests, durations, tokens, inference, TTFT, cost, evictions, model memory, GPU) |
Native API (Ollama-Compatible)
| Method | Path | Description |
|---|---|---|
| `POST` | `/api/generate` | Text generation (streaming/non-streaming) |
| `POST` | `/api/chat` | Chat completion with vision & tool support (streaming/non-streaming) |
| `POST` | `/api/pull` | Download a model by name or URL (streaming progress) |
| `POST` | `/api/push` | Push a model to a remote registry |
| `GET` | `/api/tags` | List local models |
| `POST` | `/api/show` | Show model details |
| `DELETE` | `/api/delete` | Delete a model |
| `POST` | `/api/embeddings` | Generate embeddings |
| `POST` | `/api/embed` | Batch embedding generation |
| `GET` | `/api/ps` | List running/loaded models |
| `POST` | `/api/copy` | Copy/alias a model |
| `GET` | `/api/version` | Server version |
| `HEAD` | `/api/blobs/:digest` | Check if a blob exists |
| `POST` | `/api/blobs/:digest` | Upload a blob with SHA-256 verification |
| `GET` | `/api/blobs/:digest` | Download a blob |
| `DELETE` | `/api/blobs/:digest` | Delete a blob |
OpenAI-Compatible API
| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/chat/completions` | Chat completion (streaming/non-streaming) |
| `POST` | `/v1/completions` | Text completion (streaming/non-streaming) |
| `GET` | `/v1/models` | List available models |
| `POST` | `/v1/embeddings` | Generate embeddings |
| `GET` | `/v1/usage` | Usage and cost dashboard data (date range + model filter) |
Examples
List Models

```bash
# OpenAI-compatible
curl http://localhost:11434/v1/models

# Ollama-compatible
curl http://localhost:11434/api/tags
```

Chat Completion (OpenAI)

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Chat Completion with Streaming

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello!"}], "stream": true}'
```

Text Generation (Ollama)

```bash
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Why is the sky blue?"}'
```

Text Completion (OpenAI)

```bash
curl http://localhost:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "prompt": "Once upon a time"}'
```

Vision/Multimodal (OpenAI)

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<vision-model>",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}}
      ]
    }]
  }'
```

Tool/Function Calling (OpenAI)

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

Push Model

```bash
curl http://localhost:11434/api/push \
  -d '{"model": "llama3.2:3b"}'
```

Structured Output (JSON Schema)

```bash
# Constrain output to match a JSON Schema
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2:3b",
    "prompt": "Describe Paris as JSON",
    "format": {
      "type": "object",
      "properties": {"name": {"type": "string"}, "population": {"type": "number"}},
      "required": ["name"]
    }
  }'
```

Blob Management

```bash
# Check if blob exists
curl -I http://localhost:11434/api/blobs/sha256:<digest>

# Upload blob
curl -X POST --data-binary @model.gguf http://localhost:11434/api/blobs/sha256:<digest>

# Download blob
curl -o model.gguf http://localhost:11434/api/blobs/sha256:<digest>
```
CLI Commands
| Command | Description |
|---|---|
| `a3s-power run <model> [--prompt <text>]` | Load model and start interactive chat, or send a single prompt |
| `a3s-power pull <name_or_url>` | Download a model by name (`llama3.2:3b`) or direct URL |
| `a3s-power push <model> --destination <url>` | Push a model to a remote registry |
| `a3s-power list` | List all locally available models |
| `a3s-power show <model>` | Show model details (format, size, parameters) |
| `a3s-power delete <model>` | Delete a model from local storage |
| `a3s-power create <name> -f <modelfile>` | Create a custom model from a Modelfile |
| `a3s-power cp <source> <destination>` | Copy/alias a model to a new name |
| `a3s-power ps` | List running (loaded) models on the server |
| `a3s-power stop <model>` | Stop (unload) a running model from the server |
| `a3s-power serve [--host <addr>] [--port <port>]` | Start HTTP server (default: 127.0.0.1:11434) |
Model Storage
Models are stored in ~/.a3s/power/ (override with $A3S_POWER_HOME):
```
~/.a3s/power/
├── config.toml          # User configuration
└── models/
    ├── manifests/       # JSON manifest files
    │   ├── llama-2-7b.json
    │   └── qwen2.5-7b.json
    └── blobs/           # Content-addressed model files
        ├── sha256-abc123...
        └── sha256-def456...
```
Content-Addressed Storage
Model files are stored by their SHA-256 hash, enabling:
- Deduplication: Identical files share storage
- Integrity verification: Blobs can be verified against their hash
- Clean deletion: Remove manifest + blob independently
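A toy sketch of the content-addressed idea (illustrative only; the real store computes SHA-256 and writes `sha256-<hex>` files under `models/blobs/`, while this sketch uses std's `DefaultHasher` as a stand-in digest to stay dependency-free):

```rust
// Toy content-addressed blob store: content is keyed by its digest, so
// identical bytes always land under the same key (deduplication) and any
// blob can be re-hashed to verify integrity against its key.
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

struct BlobStore {
    blobs: HashMap<String, Vec<u8>>, // key = digest, value = blob bytes
}

fn digest_of(data: &[u8]) -> String {
    // Placeholder digest; the real store uses SHA-256 here.
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    format!("sha256-{:016x}", h.finish())
}

impl BlobStore {
    fn new() -> Self {
        Self { blobs: HashMap::new() }
    }

    // Store bytes under their digest; duplicates collapse to one entry.
    fn put(&mut self, data: &[u8]) -> String {
        let digest = digest_of(data);
        self.blobs.entry(digest.clone()).or_insert_with(|| data.to_vec());
        digest
    }

    // Integrity check: recompute the digest and compare with the key.
    fn verify(&self, digest: &str) -> bool {
        self.blobs.get(digest).map_or(false, |data| digest_of(data) == digest)
    }
}

fn main() {
    let mut store = BlobStore::new();
    let a = store.put(b"model weights");
    let b = store.put(b"model weights"); // same content
    assert_eq!(a, b);                    // deduplicated: one blob stored
    assert_eq!(store.blobs.len(), 1);
    assert!(store.verify(&a));
    println!("stored 1 blob for 2 identical puts");
}
```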
Configuration
Configuration is read from ~/.a3s/power/config.toml:
```toml
host = "127.0.0.1"
port = 11434
max_loaded_models = 1
keep_alive = "5m"   # auto-unload idle models ("0"=immediate, "-1"=never, "5m", "1h")

[gpu]
gpu_layers = -1     # offload all layers to GPU (-1=all, 0=CPU only)
main_gpu = 0        # primary GPU index
```
| Field | Default | Description |
|---|---|---|
| `host` | `127.0.0.1` | HTTP server bind address |
| `port` | `11434` | HTTP server port |
| `data_dir` | `~/.a3s/power` | Base directory for model storage |
| `max_loaded_models` | `1` | Maximum models loaded in memory concurrently |
| `keep_alive` | `"5m"` | Auto-unload idle models after this duration (`"0"`=immediate, `"-1"`=never, `"5m"`, `"1h"`, `"30s"`) |
| `gpu.gpu_layers` | `0` | Number of layers to offload to GPU (0=CPU, -1=all) |
| `gpu.main_gpu` | `0` | Index of the primary GPU to use |
All fields are optional and have sensible defaults.
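The `keep_alive` strings can be understood with a small parser sketch (illustrative; the actual parsing lives in the config module and may accept more forms):

```rust
// Illustrative keep_alive parser: returns seconds, with -1 meaning
// "never unload". Accepts "0" (unload immediately), negative numbers
// (never), and suffixed durations like "30s", "5m", "1h".
fn parse_keep_alive(s: &str) -> Option<i64> {
    if let Ok(n) = s.parse::<i64>() {
        // Bare integers: 0 = immediate, any negative = never.
        return Some(if n < 0 { -1 } else { n });
    }
    // Split "5m" into ("5", "m"); bail out on the empty string.
    let (num, unit) = s.split_at(s.len().checked_sub(1)?);
    let n: i64 = num.parse().ok()?;
    match unit {
        "s" => Some(n),
        "m" => Some(n * 60),
        "h" => Some(n * 3600),
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_keep_alive("5m"), Some(300));
    assert_eq!(parse_keep_alive("1h"), Some(3600));
    assert_eq!(parse_keep_alive("0"), Some(0));
    assert_eq!(parse_keep_alive("-1"), Some(-1));
    assert_eq!(parse_keep_alive("banana"), None);
    println!("ok");
}
```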
Environment Variables (Ollama-Compatible)
Environment variables override config file values for drop-in Ollama compatibility:
| Variable | Description | Example |
|---|---|---|
| `OLLAMA_HOST` | Server bind address (`host:port` or `host`) | `0.0.0.0:11434` |
| `OLLAMA_MODELS` | Model storage directory | `/data/models` |
| `OLLAMA_KEEP_ALIVE` | Default keep-alive duration | `10m`, `-1`, `0` |
| `OLLAMA_MAX_LOADED_MODELS` | Max concurrent loaded models | `3` |
| `OLLAMA_NUM_GPU` | GPU layers to offload (-1 = all) | `-1` |
| `A3S_POWER_HOME` | Base directory for all Power data | `~/.a3s/power` |

`OLLAMA_HOST` supports scheme prefixes (e.g. `http://0.0.0.0:8080`).
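One way to picture the `OLLAMA_HOST` parsing rules (an illustrative sketch, not the actual `config.rs` code):

```rust
// Illustrative OLLAMA_HOST parser: accepts "host", "host:port", and an
// optional scheme prefix like "http://", defaulting to port 11434.
fn parse_ollama_host(value: &str) -> (String, u16) {
    // Strip an optional scheme prefix and any trailing slash.
    let rest = value
        .strip_prefix("http://")
        .or_else(|| value.strip_prefix("https://"))
        .unwrap_or(value)
        .trim_end_matches('/');
    match rest.rsplit_once(':') {
        Some((host, port)) => match port.parse::<u16>() {
            Ok(p) => (host.to_string(), p),
            Err(_) => (rest.to_string(), 11434),
        },
        None => (rest.to_string(), 11434),
    }
}

fn main() {
    assert_eq!(parse_ollama_host("0.0.0.0:11434"), ("0.0.0.0".to_string(), 11434));
    assert_eq!(parse_ollama_host("http://0.0.0.0:8080"), ("0.0.0.0".to_string(), 8080));
    assert_eq!(parse_ollama_host("localhost"), ("localhost".to_string(), 11434));
    println!("ok");
}
```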
Feature Flags
| Flag | Description |
|---|---|
| `llamacpp` | Enable llama.cpp inference backend via `llama-cpp-2`. Requires a C++ compiler and CMake. |
Without any feature flags, Power can manage models (pull, list, delete) and serve API responses, but inference calls will return a "backend not available" error.
Development
Build Commands
```bash
# Build
cargo build

# Test
cargo test

# Lint
cargo clippy

# Run
cargo run
```
Project Structure
```
power/
├── Cargo.toml
├── README.md
├── LICENSE
├── .gitignore
└── src/
    ├── main.rs              # Binary entry point (CLI dispatch)
    ├── lib.rs               # Library root (module re-exports)
    ├── error.rs             # PowerError enum + Result<T> alias
    ├── config.rs            # TOML configuration (host, port, data_dir)
    ├── dirs.rs              # Platform-specific paths (~/.a3s/power/)
    ├── cli/
    │   ├── mod.rs           # Cli struct + Commands enum (clap)
    │   ├── run.rs           # Interactive chat + single prompt
    │   ├── pull.rs          # Download with progress bar
    │   ├── push.rs          # Push model to remote registry
    │   ├── list.rs          # Tabular model listing
    │   ├── show.rs          # Model detail display
    │   ├── delete.rs        # Model + blob deletion
    │   ├── ps.rs            # List running models (queries server)
    │   ├── stop.rs          # Stop/unload a running model
    │   └── serve.rs         # HTTP server startup
    ├── model/
    │   ├── manifest.rs      # ModelManifest, ModelFormat, ModelParameters
    │   ├── registry.rs      # In-memory index backed by disk manifests
    │   ├── storage.rs       # Content-addressed blob store (SHA-256)
    │   ├── pull.rs          # HTTP download with progress callback
    │   ├── push.rs          # Push model to remote registry
    │   ├── resolve.rs       # Name-based model resolution (Ollama registry → built-in → HuggingFace)
    │   ├── ollama_registry.rs # Ollama registry client (fetch manifests, metadata, blob URLs)
    │   ├── modelfile.rs     # Modelfile parser (FROM, PARAMETER, SYSTEM, TEMPLATE, etc.)
    │   └── known_models.json # Built-in registry of popular GGUF models (offline fallback)
    ├── backend/
    │   ├── mod.rs           # Backend trait + BackendRegistry
    │   ├── types.rs         # Inference types (vision, tools, chat, completion, embedding)
    │   ├── llamacpp.rs      # llama.cpp backend (feature-gated, multi-model, KV cache reuse)
    │   ├── chat_template.rs # Chat template detection, Jinja2 rendering (minijinja), and fallback formatting
    │   ├── json_schema.rs   # JSON Schema → GBNF grammar converter for structured output
    │   ├── tool_parser.rs   # Tool call output parser (XML, Mistral, JSON formats)
    │   └── test_utils.rs    # MockBackend for testing
    ├── server/
    │   ├── mod.rs           # Server startup (bind, listen)
    │   ├── state.rs         # Shared AppState with LRU model tracking
    │   ├── router.rs        # Axum router with CORS + tracing + metrics
    │   └── metrics.rs       # Prometheus metrics collection and /metrics handler
    └── api/
        ├── autoload.rs      # Model auto-loading on first inference
        ├── health.rs        # GET /health endpoint
        ├── types.rs         # OpenAI + Ollama request/response types
        ├── sse.rs           # Streaming utilities (NDJSON for native API, SSE for OpenAI API)
        ├── native/
        │   ├── mod.rs       # Ollama-compatible route group
        │   ├── generate.rs  # POST /api/generate
        │   ├── chat.rs      # POST /api/chat (vision + tools)
        │   ├── models.rs    # GET /api/tags, POST /api/show, DELETE /api/delete
        │   ├── pull.rs      # POST /api/pull (streaming progress)
        │   ├── push.rs      # POST /api/push (push to registry)
        │   ├── blobs.rs     # HEAD/POST/GET /api/blobs/:digest
        │   ├── embeddings.rs # POST /api/embeddings
        │   ├── embed.rs     # POST /api/embed (batch embeddings)
        │   ├── ps.rs        # GET /api/ps (running models)
        │   ├── copy.rs      # POST /api/copy (model aliasing)
        │   ├── create.rs    # POST /api/create (from Modelfile)
        │   └── version.rs   # GET /api/version
        └── openai/
            ├── mod.rs       # OpenAI-compatible route group + shared helpers
            ├── chat.rs      # POST /v1/chat/completions
            ├── completions.rs # POST /v1/completions
            ├── models.rs    # GET /v1/models
            └── embeddings.rs # POST /v1/embeddings
```
A3S Ecosystem
A3S Power is an infrastructure component of the A3S ecosystem — a standalone model server that enables local LLM inference for other A3S tools.
```
┌──────────────────────────────────────────────────────────┐
│                      A3S Ecosystem                       │
│                                                          │
│  Infrastructure:  a3s-box (MicroVM sandbox runtime)      │
│                   a3s-power (local model serving)        │
│                      │        ▲                          │
│  Application:     a3s-code ───┘ (AI coding agent)        │
│                    /    \                                │
│  Utilities:   a3s-lane   a3s-context                     │
│                          (memory/knowledge)              │
│                                                          │
│  a3s-power ◄── You are here                              │
└──────────────────────────────────────────────────────────┘
```
| Project | Package | Relationship |
|---|---|---|
| box | `a3s-box-*` | Can use Power for local model inference |
| code | `a3s-code` | Uses Power as a local model backend |
| lane | `a3s-lane` | Independent utility (no direct relationship) |
| context | `a3s-context` | Independent utility (no direct relationship) |
Standalone Usage: a3s-power works independently as a local model server for any application:
- Drop-in Ollama replacement with identical API and NDJSON wire format
- Pull any model from the Ollama registry by name (`llama3.2:3b`, `qwen2.5:7b`, etc.)
- OpenAI SDK compatible for seamless integration
- Local-first inference with no cloud dependency
Roadmap
Phase 1: Core ✅
- CLI model management (pull, list, show, delete)
- Content-addressed storage with SHA-256
- Model manifest system with JSON persistence
- TOML configuration
- Platform-specific directory resolution
- Comprehensive unit test foundation
Phase 2: Backend & Inference ✅
- Backend trait abstraction
- llama.cpp backend via `llama-cpp-2` (feature-gated)
- Streaming token generation via channels
- Interactive chat with conversation history
- Single prompt mode
Phase 3: HTTP Server ✅
- Axum-based HTTP server with CORS + tracing
- Ollama-compatible native API (12 endpoints + blob management)
- OpenAI-compatible API (4 endpoints)
- SSE streaming for all inference endpoints
- Non-streaming response collection
Phase 4: Polish & Production ✅
- Model registry resolution (name-based pulls with Ollama registry → built-in registry → HuggingFace fallback)
- Embedding generation support (automatic reload with embedding mode)
- Multiple concurrent model loading (HashMap storage with LRU eviction)
- Model auto-loading on first API request
- GPU acceleration configuration (`[gpu]` config with layer offloading)
- Chat template auto-detection from GGUF metadata (ChatML, Llama, Phi, Generic)
- Health check endpoint (`/health`)
- Prometheus metrics endpoint (`/metrics` with request/token/model counters)
Phase 5: Full Ollama Parity ✅
- Vision/Multimodal support (`MessageContent` enum with text + image URL parts)
- Tool/Function calling (tool definitions, tool choice, tool call responses)
- Push API + CLI with streaming progress (`POST /api/push`, `a3s-power push`)
- Blob management API (`HEAD/POST/GET/DELETE /api/blobs/:digest`)
- Generate API: `system`, `template`, `raw`, `suffix`, `context`, `images` fields
- Native chat `images` field (Ollama base64 format)
- CLI `cp` command for model aliasing
- New error variants (`UploadFailed`, `InvalidDigest`, `BlobNotFound`)
Phase 6: Observability & Cost Tracking ✅
End-to-end observability for LLM inference:
- OpenTelemetry-Ready Metrics: Instrument inference pipeline with Prometheus metrics
  - `power_inference_duration_seconds{model}` summary (count + sum)
  - `power_ttft_seconds{model}` summary (time to first token)
  - Per-model inference instrumentation across all 4 inference endpoints
- Token & Cost Metrics: Per-call recording via Prometheus
  - `power_inference_tokens_total{model, type=input|output}` counter
  - `power_cost_dollars{model}` counter
  - `power_inference_duration_seconds{model}` summary
  - `power_ttft_seconds{model}` summary (time to first token)
- Cost Dashboard Data: Aggregate cost by model / day
  - JSON export endpoint: `GET /v1/usage` with date range and model filter
- Model Lifecycle Metrics: Load time, memory usage, eviction count
  - `power_model_load_duration_seconds{model}` summary
  - `power_model_memory_bytes{model}` gauge
  - `power_model_evictions_total` counter
- GPU Utilization Metrics: GPU memory, compute utilization per device
  - `power_gpu_memory_bytes{device}` gauge
  - `power_gpu_utilization{device}` gauge
Phase 7: Ollama Drop-in Compatibility ✅
Wire-format and runtime compatibility for seamless Ollama replacement:
- Ollama Registry Integration: Pull any model from `registry.ollama.ai` by name — primary resolution source with template, system prompt, params, and license metadata
- NDJSON Streaming: Native API endpoints (`/api/generate`, `/api/chat`, `/api/pull`, `/api/push`) stream as `application/x-ndjson` (Ollama wire format); OpenAI endpoints keep SSE
- Automatic Model Unloading: Background keep_alive reaper checks every 5s and unloads idle models (configurable: `"5m"`, `"1h"`, `"0"`, `"-1"`)
- Context Token Return: `/api/generate` returns token IDs in the `context` field for conversation continuity
- 861 comprehensive unit tests
Phase 8: Advanced Compatibility ✅
- Jinja2/Go Template Engine: Render arbitrary Jinja2 chat templates via `minijinja` (Llama 3, Gemma, ChatML, Phi, custom) with hardcoded fallback; prefers Ollama registry `template_override` over GGUF metadata
- KV Cache Reuse: Persist `LlamaContext` across requests with prefix matching — skips re-evaluating shared prompt tokens for multi-turn conversation speedup
- Tool Call Parsing: Parse model output into structured `tool_calls` — supports `<tool_call>` XML (Hermes/Qwen), `[TOOL_CALLS]` prefix (Mistral), and raw JSON formats; zero overhead when no tools in request
- JSON Schema Structured Output: Support `format: {"type":"object","properties":{...}}` via JSON Schema → GBNF grammar conversion; accepts `"json"`, `{"type":"json_object"}`, or full JSON Schema objects
- Vision Inference: Multimodal vision pipeline — accepts base64 images in the Ollama `images` field and OpenAI `image_url` content parts; projector auto-downloaded from Ollama registry; uses the llama.cpp `mtmd` API for image encoding when a projector is available
- ADAPTER Support: LoRA/QLoRA adapter loading at inference time — Modelfile `ADAPTER` directive parsed, adapter file loaded via `llama_lora_adapter_init`, applied to context with `lora_adapter_set` at scale 1.0
- MESSAGE Directive: Pre-seeded conversation history via Modelfile `MESSAGE` directive; messages stored in manifest and automatically prepended to chat requests
- 861 comprehensive unit tests
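The three tool-call wire formats can be sketched as follows (illustrative only; the real parser in `backend/tool_parser.rs` also deserializes the extracted payload into structured `tool_calls`):

```rust
// Illustrative format detection for tool-call output. Returns the raw JSON
// payload found in one of the three recognized formats, or None when the
// model produced a plain text answer.
fn extract_tool_call_payload(output: &str) -> Option<&str> {
    let trimmed = output.trim();
    // Hermes/Qwen style: <tool_call>{...}</tool_call>
    if let Some(start) = trimmed.find("<tool_call>") {
        let rest = &trimmed[start + "<tool_call>".len()..];
        let end = rest.find("</tool_call>")?;
        return Some(rest[..end].trim());
    }
    // Mistral style: [TOOL_CALLS][{...}]
    if let Some(rest) = trimmed.strip_prefix("[TOOL_CALLS]") {
        return Some(rest.trim());
    }
    // Raw JSON object
    if trimmed.starts_with('{') && trimmed.ends_with('}') {
        return Some(trimmed);
    }
    None
}

fn main() {
    let hermes = "<tool_call>{\"name\":\"get_weather\"}</tool_call>";
    assert_eq!(extract_tool_call_payload(hermes), Some("{\"name\":\"get_weather\"}"));

    let mistral = "[TOOL_CALLS][{\"name\":\"get_weather\"}]";
    assert_eq!(extract_tool_call_payload(mistral), Some("[{\"name\":\"get_weather\"}]"));

    assert_eq!(extract_tool_call_payload("plain text answer"), None);
    println!("ok");
}
```

Detection only runs when the request declared tools, which is why the feature costs nothing for ordinary chat requests.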
Phase 9: Operational Parity ✅
Runtime and CLI parity for production Ollama replacement:
- Default Port 11434: Matches Ollama's default port for zero-config drop-in replacement
- `ps` CLI Command: List running (loaded) models via `a3s-power ps` (queries server `GET /api/ps`)
- `stop` CLI Command: Unload a running model via `a3s-power stop <model>` (sends `keep_alive: 0`)
- Ollama Environment Variables: `OLLAMA_HOST`, `OLLAMA_MODELS`, `OLLAMA_KEEP_ALIVE`, `OLLAMA_MAX_LOADED_MODELS`, `OLLAMA_NUM_GPU` — override config file for container/script compatibility
- Download Resumption: Interrupted model downloads resume automatically via HTTP Range requests with partial file tracking
- 861 comprehensive unit tests
Phase 10: Intelligence & Observability ✅
GPU auto-detection, memory estimation, verbose model inspection, and per-layer pull progress:
- GPU Auto-Detection: Detect Apple Metal (via `system_profiler`) and NVIDIA CUDA (via `nvidia-smi`) GPUs at server startup; auto-set `gpu_layers = -1` when a GPU is available and the user hasn't explicitly configured one
- Memory Estimation: Estimate VRAM requirements before loading (model weights + KV cache + compute overhead); log estimates to help users right-size their hardware
- GGUF Metadata Reader: Lightweight binary parser for GGUF v2/v3 file headers — extracts all key-value metadata and tensor descriptors without loading weights into memory
- Verbose Show: `/api/show` with `verbose: true` returns full GGUF metadata (architecture, context length, embedding dimensions, etc.) and tensor information (name, shape, type, element count)
- Per-Layer Pull Progress: Streaming pull progress shows per-layer digest identifiers (`pulling sha256:abc123...`) matching Ollama's output format; resolves the model before download to extract layer digests
- 861 comprehensive unit tests
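The fixed-size header that a GGUF reader starts from can be sketched like this. Per the GGUF layout, the magic bytes `GGUF` are followed by a little-endian u32 version and u64 tensor/metadata counts; everything else (per-key parsing, tensor descriptors) builds on top of this:

```rust
// Minimal GGUF header sketch (illustrative; the full reader in model/gguf.rs
// goes on to parse every metadata key-value pair and tensor descriptor
// without loading any weights).
fn parse_gguf_header(bytes: &[u8]) -> Option<(u32, u64, u64)> {
    // Magic "GGUF" + u32 version + u64 tensor count + u64 metadata KV count.
    if bytes.len() < 24 || &bytes[0..4] != b"GGUF" {
        return None;
    }
    let version = u32::from_le_bytes(bytes[4..8].try_into().ok()?);
    let tensor_count = u64::from_le_bytes(bytes[8..16].try_into().ok()?);
    let metadata_kv_count = u64::from_le_bytes(bytes[16..24].try_into().ok()?);
    Some((version, tensor_count, metadata_kv_count))
}

fn main() {
    // Synthetic header: version 3, 2 tensors, 5 metadata keys.
    let mut header = Vec::new();
    header.extend_from_slice(b"GGUF");
    header.extend_from_slice(&3u32.to_le_bytes());
    header.extend_from_slice(&2u64.to_le_bytes());
    header.extend_from_slice(&5u64.to_le_bytes());

    assert_eq!(parse_gguf_header(&header), Some((3, 2, 5)));
    assert_eq!(parse_gguf_header(b"notgguf-at-all-no-magic!"), None);
    println!("ok");
}
```

Because only the header and metadata section are read, a model's architecture, context length, and tensor list can be inspected without paying the cost of loading multi-gigabyte weights.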
Phase 11: Full Options Parity ✅
Complete Ollama generation options support and multi-GPU wiring:
- Missing Generation Options: Added `repeat_last_n`, `penalize_newline`, `num_batch`, `num_thread`, `num_thread_batch`, `use_mmap`, `use_mlock`, `numa`, `flash_attention`, `num_gpu`, `main_gpu` to `GenerateOptions`
- Backend Wiring: All new options flow through API → backend `CompletionRequest`/`ChatRequest` → llama.cpp context params and sampler
- Flash Attention: Wired to `LlamaContextParams::with_flash_attention_policy(Enabled)` when `flash_attention: true`
- Multi-GPU: `main_gpu` config wired to `LlamaModelParams::with_main_gpu()`; per-request `num_gpu`/`main_gpu` override supported
- Memory Lock: `use_mlock` config wired to `LlamaModelParams::with_use_mlock(true)` to prevent model swapping
- Thread Control: `num_thread` and `num_thread_batch` wired to `LlamaContextParams::with_n_threads()` and `with_n_threads_batch()`
- Batch Size: `num_batch` wired to `LlamaContextParams::with_n_batch()`
- Repeat Penalty Window: `repeat_last_n` wired to the first argument of `LlamaSampler::penalties()` (was hardcoded to 64)
- Config Extensions: Added `use_mlock`, `num_thread`, `flash_attention` to `PowerConfig` with TOML support
- 861 comprehensive unit tests
Phase 12: CLI Run Options Parity ✅
Complete Ollama CLI run command options — all 14/14 options now implemented:
- `--format`: JSON output format constraint (accepts `"json"` or a JSON schema object)
- `--system`: Override system prompt per session (prepended as system message)
- `--template`: Override chat template (reserved for template engine integration)
- `--keep-alive`: Model keep-alive duration (e.g. `"5m"`, `"1h"`, `"-1"` for never unload)
- `--verbose`: Show timing and token statistics after each generation (prompt eval count/rate, eval count, total duration, tokens/s)
- `--insecure`: Skip TLS verification flag for registry operations
- 861 comprehensive unit tests
Phase 13: Environment Variables & CLI Polish ✅
Complete Ollama environment variable parity and CLI enhancements:
- `OLLAMA_NUM_PARALLEL`: Number of parallel request slots (concurrent inference)
- `OLLAMA_DEBUG`: Enable debug logging (sets `RUST_LOG=debug` if not already set)
- `OLLAMA_ORIGINS`: Custom CORS origins (comma-separated); empty = permissive
- `OLLAMA_FLASH_ATTENTION`: Global flash attention override (`"1"` or `"true"`)
- `OLLAMA_TMPDIR`: Custom temporary directory for downloads and scratch files
- CLI `show --verbose`: Display full GGUF metadata (keys, values, tensor list) from the CLI
- CLI `pull --insecure`: Skip TLS verification for pull operations
- CLI `push --insecure`: Skip TLS verification for push operations
- Interactive `/help`: Show available slash commands in interactive chat
- Interactive `/clear`: Clear conversation history (preserves system prompt)
- Interactive `/show`: Display model name, message counts, and current settings
- Interactive `"""`: Multi-line input support with triple-quote delimiters
- CORS Configuration: Server respects `OLLAMA_ORIGINS` for restricted CORS; defaults to permissive
- 861 comprehensive unit tests
Phase 14: Final Ollama Parity ✅
Complete remaining Ollama feature gaps — help subcommand, blob pruning, GPU scheduling:
- `help` subcommand: `a3s-power help [command]` prints help for any subcommand (replaces clap's built-in)
- Blob pruning: `prune_unused_blobs()` removes orphaned blob files not referenced by any manifest; returns count and bytes freed
- `OLLAMA_NOPRUNE`: Disable automatic blob pruning (`"1"` or `"true"`)
- `OLLAMA_SCHED_SPREAD`: Spread model layers across all available GPUs (`"1"` or `"true"`)
- 861 comprehensive unit tests
License
MIT