# A3S Power

## Overview
A3S Power is an Ollama-compatible CLI tool and HTTP server for local model management and inference. It provides both an Ollama-compatible native API and an OpenAI-compatible API, so existing tools, SDKs, and frontends work out of the box.
## Basic Usage

```bash
# Pull a model by name (resolves from built-in registry or HuggingFace)
a3s-power pull llama3.2:3b

# Pull from a direct URL
a3s-power pull <url>

# Interactive chat
a3s-power run llama3.2:3b

# Single prompt
a3s-power run llama3.2:3b --prompt "Why is the sky blue?"

# Start HTTP server
a3s-power serve
```
## Features

- CLI Model Management: Pull, list, show, and delete models from the command line
- Model Name Resolution: Pull models by name (`llama3.2:3b`) with built-in registry and HuggingFace fallback
- Interactive Chat: Multi-turn conversation with streaming token output
- Chat Template Auto-Detection: Detects ChatML, Llama, Phi, and Generic templates from GGUF metadata
- Multiple Concurrent Models: Load multiple models with LRU eviction at configurable capacity
- GPU Acceleration: Configurable GPU layer offloading via the `[gpu]` config section
- Embedding Support: Real embedding generation with automatic model reload in embedding mode
- HTTP Server: Axum-based server with CORS, tracing, and metrics middleware
- Ollama-Compatible API: `/api/generate`, `/api/chat`, `/api/tags`, `/api/pull`, `/api/show`, `/api/delete`, `/api/embeddings`, `/api/embed`, `/api/ps`, `/api/copy`, `/api/version`
- OpenAI-Compatible API: `/v1/chat/completions`, `/v1/completions`, `/v1/models`, `/v1/embeddings`
- SSE Streaming: All inference and pull endpoints support server-sent events
- Prometheus Metrics: `GET /metrics` endpoint with request counts, durations, token counters, and model gauges
- Content-Addressed Storage: Model blobs stored by SHA-256 hash with automatic deduplication
- llama.cpp Backend: GGUF inference via `llama-cpp-2` Rust bindings (optional feature flag)
- Health Check: `GET /health` endpoint with uptime, version, and loaded model count
- Model Auto-Loading: Models are automatically loaded on the first inference request, with LRU eviction
- TOML Configuration: User-configurable host, port, GPU settings, and storage settings
- Async-First: Built on Tokio for high-performance async operations
## Quality Metrics

### Test Coverage
197 unit tests with 54.2% line coverage / 70.8% function coverage (measured with `cargo llvm-cov`):

| File | Line Coverage | Function Coverage |
|---|---|---|
| `api/autoload.rs` | 96.4% | 100.0% |
| `api/health.rs` | 100.0% | 100.0% |
| `api/types.rs` | 100.0% | 100.0% |
| `api/openai/mod.rs` | 100.0% | 100.0% |
| `api/native/mod.rs` | 100.0% | 100.0% |
| `api/sse.rs` | 72.0% | 66.7% |
| `api/openai/models.rs` | 51.9% | 66.7% |
| `api/native/models.rs` | 14.9% | 18.2% |
| `backend/llamacpp.rs` | 100.0% | 100.0% |
| `backend/types.rs` | 100.0% | 100.0% |
| `backend/mod.rs` | 83.7% | 83.3% |
| `model/manifest.rs` | 100.0% | 100.0% |
| `model/registry.rs` | 77.5% | 70.0% |
| `model/storage.rs` | 78.4% | 75.0% |
| `model/pull.rs` | 51.4% | 64.7% |
| `config.rs` | 88.6% | 92.3% |
| `dirs.rs` | 88.5% | 81.8% |
| `error.rs` | 100.0% | 100.0% |
| `server/router.rs` | 100.0% | 100.0% |
| `server/state.rs` | 100.0% | 100.0% |
| **TOTAL** | **54.2%** | **70.8%** |

CLI handlers (`cli/*`), HTTP handlers (`api/native/{chat,generate,pull,embeddings}.rs`, `api/openai/{chat,completions,embeddings}.rs`), and `server/mod.rs` have 0% coverage; these require integration tests with live backends and are excluded from the unit-test library target.

Run tests:

```bash
cargo test
```

Run coverage:

```bash
cargo llvm-cov
```
## Architecture

### Components

```
┌─────────────────────────────────────────────────┐
│ a3s-power │
│ │
│ CLI Layer │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ run │ │ pull │ │ list │ │ show │ │serve │ │
│ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │ │ │
│ Model Layer │ │ │
│ ┌────────────────────┴────────┐ │ │
│ │ ModelRegistry │ │ │
│ │ ┌──────────┐ ┌──────────┐ │ │ │
│ │ │ manifest │ │ storage │ │ │ │
│ │ └──────────┘ └──────────┘ │ │ │
│ └─────────────────────────────┘ │ │
│ │ │
│ Backend Layer │ │
│ ┌─────────────────────────────┐ │ │
│ │ BackendRegistry │ │ │
│ │ ┌──────────────────────┐ │ │ │
│ │ │ LlamaCppBackend │ │ │ │
│ │ │ (feature: llamacpp) │ │ │ │
│ │ └──────────────────────┘ │ │ │
│ └─────────────────────────────┘ │ │
│ │ │
│ Server Layer ◄──────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │ Axum Router │ │
│ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │ /api/* │ │ /v1/* │ │ │
│ │ │ (Ollama) │ │ (OpenAI) │ │ │
│ │ └────────────┘ └────────────────┘ │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
```
### Backend Trait

The `Backend` trait abstracts inference engines. The llama.cpp backend is feature-gated; without the `llamacpp` feature, Power can still manage models but returns a "backend not available" error for inference calls.
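A sketch of the shape such a trait can take (names and signatures are illustrative, not the actual `backend/mod.rs` API):

```rust
use async_trait::async_trait;
use std::path::Path;

/// Illustrative inference-backend abstraction; the real trait may differ.
#[async_trait]
pub trait Backend: Send + Sync {
    /// Load a GGUF model from a local blob into memory.
    async fn load(&self, model_name: &str, path: &Path) -> anyhow::Result<()>;

    /// Generate a completion for `prompt` with a previously loaded model.
    async fn generate(&self, model_name: &str, prompt: &str) -> anyhow::Result<String>;

    /// Whether this backend was compiled in and can serve inference.
    /// A stub backend returns false, yielding "backend not available".
    fn is_available(&self) -> bool;
}
```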
## Quick Start

### Build

```bash
# Build without inference backend (model management only)
cargo build --release

# Build with llama.cpp inference (requires C++ compiler + CMake)
cargo build --release --features llamacpp
```
### Model Management

```bash
# Pull a model by name (built-in registry + HuggingFace fallback)
a3s-power pull llama3.2:3b

# Pull from a direct URL
a3s-power pull <url>

# List local models
a3s-power list

# Show model details
a3s-power show llama3.2:3b

# Delete a model
a3s-power delete llama3.2:3b
```
### Interactive Chat

```bash
# Start interactive chat session
a3s-power run llama3.2:3b

# Send a single prompt
a3s-power run llama3.2:3b --prompt "Why is the sky blue?"
```
### HTTP Server

```bash
# Start server on default port (127.0.0.1:11435)
a3s-power serve

# Custom host and port
a3s-power serve --host 0.0.0.0 --port 8080
```
## API Reference

### Server

| Method | Path | Description |
|---|---|---|
| GET | `/health` | Health check (status, version, uptime, loaded models) |
| GET | `/metrics` | Prometheus metrics (request counts, durations, tokens, model gauge) |
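Both are plain GET endpoints; for example, against a local server on the default port:

```bash
curl http://127.0.0.1:11435/health
curl http://127.0.0.1:11435/metrics
```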
### Native API (Ollama-Compatible)

| Method | Path | Description |
|---|---|---|
| POST | `/api/generate` | Text generation (streaming/non-streaming) |
| POST | `/api/chat` | Chat completion (streaming/non-streaming) |
| POST | `/api/pull` | Download a model by name or URL (streaming progress) |
| GET | `/api/tags` | List local models |
| POST | `/api/show` | Show model details |
| DELETE | `/api/delete` | Delete a model |
| POST | `/api/embeddings` | Generate embeddings |
| POST | `/api/embed` | Batch embedding generation |
| GET | `/api/ps` | List running/loaded models |
| POST | `/api/copy` | Copy/alias a model |
| GET | `/api/version` | Server version |
### OpenAI-Compatible API

| Method | Path | Description |
|---|---|---|
| POST | `/v1/chat/completions` | Chat completion (streaming/non-streaming) |
| POST | `/v1/completions` | Text completion (streaming/non-streaming) |
| GET | `/v1/models` | List available models |
| POST | `/v1/embeddings` | Generate embeddings |
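For example, an embeddings request might look like this (model name illustrative):

```bash
curl http://127.0.0.1:11435/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "input": "Hello, world"}'
```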
## Examples

### List Models

```bash
# OpenAI-compatible
curl http://127.0.0.1:11435/v1/models

# Ollama-compatible
curl http://127.0.0.1:11435/api/tags
```
### Chat Completion (OpenAI)
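A representative non-streaming request (model name illustrative):

```bash
curl http://127.0.0.1:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'
```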
### Chat Completion with Streaming
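Setting `"stream": true` switches the response to server-sent events:

```bash
curl http://127.0.0.1:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Tell me a short story"}],
    "stream": true
  }'
```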
### Text Generation (Ollama)
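A representative request against the native endpoint (set `"stream": true` for token-by-token output):

```bash
curl http://127.0.0.1:11435/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "prompt": "Why is the sky blue?", "stream": false}'
```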
### Text Completion (OpenAI)
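A representative request (parameters illustrative):

```bash
curl http://127.0.0.1:11435/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "prompt": "Once upon a time", "max_tokens": 64}'
```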
## CLI Commands

| Command | Description |
|---|---|
| `a3s-power run <model> [--prompt <text>]` | Load a model and start interactive chat, or send a single prompt |
| `a3s-power pull <name_or_url>` | Download a model by name (`llama3.2:3b`) or direct URL |
| `a3s-power list` | List all locally available models |
| `a3s-power show <model>` | Show model details (format, size, parameters) |
| `a3s-power delete <model>` | Delete a model from local storage |
| `a3s-power serve [--host <addr>] [--port <port>]` | Start the HTTP server (default: 127.0.0.1:11435) |
## Model Storage

Models are stored in `~/.a3s/power/` (override with `$A3S_POWER_HOME`):

```
~/.a3s/power/
├── config.toml          # User configuration
└── models/
    ├── manifests/       # JSON manifest files
    │   ├── llama-2-7b.json
    │   └── qwen2.5-7b.json
    └── blobs/           # Content-addressed model files
        ├── sha256-abc123...
        └── sha256-def456...
```
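For illustration, a manifest ties a model name to its content-addressed blob roughly like this (fields hypothetical; the actual schema lives in `model/manifest.rs`):

```json
{
  "name": "llama-2-7b",
  "format": "gguf",
  "digest": "sha256-abc123..."
}
```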
### Content-Addressed Storage

Model files are stored by their SHA-256 hash (see the sketch below), which enables:

- Deduplication: Identical files share storage
- Integrity verification: Blobs can be verified against their hash
- Clean deletion: Manifests and blobs can be removed independently
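A minimal sketch of the naming scheme, using the `sha2` crate (the function name is hypothetical; the real implementation lives in `model/storage.rs`):

```rust
use sha2::{Digest, Sha256};
use std::{fs, io, path::{Path, PathBuf}};

/// Store `data` in the blob directory under its SHA-256 hash and return
/// the blob path. Storing identical bytes twice is a no-op, which is what
/// yields automatic deduplication; re-hashing a blob verifies integrity.
fn store_blob(blobs_dir: &Path, data: &[u8]) -> io::Result<PathBuf> {
    let digest = Sha256::digest(data);
    let path = blobs_dir.join(format!("sha256-{digest:x}"));
    if !path.exists() {
        fs::write(&path, data)?;
    }
    Ok(path)
}
```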
## Configuration

Configuration is read from `~/.a3s/power/config.toml`:

```toml
host = "127.0.0.1"
port = 11435
max_loaded_models = 1

[gpu]
gpu_layers = -1  # offload all layers to GPU (-1 = all, 0 = CPU only)
main_gpu = 0     # primary GPU index
```
| Field | Default | Description |
|---|---|---|
| `host` | `127.0.0.1` | HTTP server bind address |
| `port` | `11435` | HTTP server port |
| `data_dir` | `~/.a3s/power` | Base directory for model storage |
| `max_loaded_models` | `1` | Maximum number of models loaded in memory concurrently |
| `gpu.gpu_layers` | `0` | Number of layers to offload to GPU (0 = CPU, -1 = all) |
| `gpu.main_gpu` | `0` | Index of the primary GPU to use |
All fields are optional and have sensible defaults.
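For illustration, the fields map naturally onto a serde-deserialized struct along these lines (hypothetical shape; the real definitions live in `config.rs`):

```rust
use serde::Deserialize;
use std::path::PathBuf;

/// Hypothetical mirror of config.toml. Every field is optional and
/// falls back to the defaults in the table above, e.g.:
///     let cfg: Config = toml::from_str(&contents)?;
#[derive(Debug, Default, Deserialize)]
pub struct Config {
    pub host: Option<String>,             // default: "127.0.0.1"
    pub port: Option<u16>,                // default: 11435
    pub data_dir: Option<PathBuf>,        // default: ~/.a3s/power
    pub max_loaded_models: Option<usize>, // default: 1
    #[serde(default)]
    pub gpu: GpuConfig,
}

#[derive(Debug, Default, Deserialize)]
pub struct GpuConfig {
    pub gpu_layers: Option<i32>, // 0 = CPU only, -1 = offload all layers
    pub main_gpu: Option<u32>,   // index of the primary GPU
}
```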
## Feature Flags

| Flag | Description |
|---|---|
| `llamacpp` | Enable the llama.cpp inference backend via `llama-cpp-2`. Requires a C++ compiler and CMake. |

Without any feature flags, Power can still manage models (pull, list, delete) and serve API responses, but inference calls return a "backend not available" error.
## Development

### Build Commands

```bash
# Build
cargo build

# Test
cargo test

# Lint
cargo clippy

# Run
cargo run -- serve
```
### Project Structure

```
power/
├── Cargo.toml
├── README.md
├── LICENSE
├── .gitignore
└── src/
    ├── main.rs              # Binary entry point (CLI dispatch)
    ├── lib.rs               # Library root (module re-exports)
    ├── error.rs             # PowerError enum + Result<T> alias
    ├── config.rs            # TOML configuration (host, port, data_dir)
    ├── dirs.rs              # Platform-specific paths (~/.a3s/power/)
    ├── cli/
    │   ├── mod.rs           # Cli struct + Commands enum (clap)
    │   ├── run.rs           # Interactive chat + single prompt
    │   ├── pull.rs          # Download with progress bar
    │   ├── list.rs          # Tabular model listing
    │   ├── show.rs          # Model detail display
    │   ├── delete.rs        # Model + blob deletion
    │   └── serve.rs         # HTTP server startup
    ├── model/
    │   ├── manifest.rs      # ModelManifest, ModelFormat, ModelParameters
    │   ├── registry.rs      # In-memory index backed by disk manifests
    │   ├── storage.rs       # Content-addressed blob store (SHA-256)
    │   ├── pull.rs          # HTTP download with progress callback
    │   ├── resolve.rs       # Name-based model resolution (built-in + HuggingFace)
    │   └── known_models.json # Built-in registry of popular GGUF models
    ├── backend/
    │   ├── mod.rs           # Backend trait + BackendRegistry
    │   ├── types.rs         # Inference request/response types
    │   ├── llamacpp.rs      # llama.cpp backend (feature-gated, multi-model)
    │   ├── chat_template.rs # Chat template detection and formatting
    │   └── test_utils.rs    # MockBackend for testing
    ├── server/
    │   ├── mod.rs           # Server startup (bind, listen)
    │   ├── state.rs         # Shared AppState with LRU model tracking
    │   ├── router.rs        # Axum router with CORS + tracing + metrics
    │   └── metrics.rs       # Prometheus metrics collection and /metrics handler
    └── api/
        ├── autoload.rs      # Model auto-loading on first inference
        ├── health.rs        # GET /health endpoint
        ├── types.rs         # OpenAI + Ollama request/response types
        ├── sse.rs           # SSE streaming utilities
        ├── native/
        │   ├── mod.rs       # Ollama-compatible route group
        │   ├── generate.rs  # POST /api/generate
        │   ├── chat.rs      # POST /api/chat
        │   ├── models.rs    # GET /api/tags, POST /api/show, DELETE /api/delete
        │   ├── pull.rs      # POST /api/pull (streaming progress)
        │   ├── embeddings.rs # POST /api/embeddings
        │   ├── embed.rs     # POST /api/embed (batch embeddings)
        │   ├── ps.rs        # GET /api/ps (running models)
        │   ├── copy.rs      # POST /api/copy (model aliasing)
        │   └── version.rs   # GET /api/version
        └── openai/
            ├── mod.rs       # OpenAI-compatible route group + shared helpers
            ├── chat.rs      # POST /v1/chat/completions
            ├── completions.rs # POST /v1/completions
            ├── models.rs    # GET /v1/models
            └── embeddings.rs # POST /v1/embeddings
```
## A3S Ecosystem

A3S Power is an infrastructure component of the A3S ecosystem: a standalone model server that enables local LLM inference for other A3S tools.

```
┌──────────────────────────────────────────────────────────┐
│ A3S Ecosystem │
│ │
│ Infrastructure: a3s-box (MicroVM sandbox runtime) │
│ a3s-power (local model serving) │
│ │ ▲ │
│ Application: a3s-code ────┘ (AI coding agent) │
│ / \ │
│ Utilities: a3s-lane a3s-context │
│ (memory/knowledge) │
│ │
│ a3s-power ◄── You are here │
└──────────────────────────────────────────────────────────┘
```
| Project | Package | Relationship |
|---|---|---|
| box | `a3s-box-*` | Can use Power for local model inference |
| code | `a3s-code` | Uses Power as a local model backend |
| lane | `a3s-lane` | Independent utility (no direct relationship) |
| context | `a3s-context` | Independent utility (no direct relationship) |
**Standalone Usage**: `a3s-power` works independently as a local model server for any application:
- Drop-in Ollama replacement with identical API
- OpenAI SDK compatible for seamless integration
- Local-first inference with no cloud dependency
## Roadmap

### Phase 1: Core ✅

- CLI model management (pull, list, show, delete)
- Content-addressed storage with SHA-256
- Model manifest system with JSON persistence
- TOML configuration
- Platform-specific directory resolution
- 108 comprehensive unit tests

### Phase 2: Backend & Inference ✅

- Backend trait abstraction
- llama.cpp backend via `llama-cpp-2` (feature-gated)
- Streaming token generation via channels
- Interactive chat with conversation history
- Single prompt mode
### Phase 3: HTTP Server ✅

- Axum-based HTTP server with CORS + tracing
- Ollama-compatible native API (7 endpoints)
- OpenAI-compatible API (4 endpoints)
- SSE streaming for all inference endpoints
- Non-streaming response collection
### Phase 4: Polish & Production ✅

- Model registry resolution (name-based pulls with built-in registry + HuggingFace fallback)
- Embedding generation support (automatic reload in embedding mode)
- Multiple concurrent model loading (HashMap storage with LRU eviction)
- Model auto-loading on first API request
- GPU acceleration configuration (`[gpu]` config with layer offloading)
- Chat template auto-detection from GGUF metadata (ChatML, Llama, Phi, Generic)
- Health check endpoint (`/health`)
- Prometheus metrics endpoint (`/metrics` with request/token/model counters)
- 197 comprehensive unit tests
## License
MIT