A3S Power

Local model management and serving with an OpenAI-compatible API


Overview

A3S Power is an Ollama-compatible CLI tool and HTTP server for local model management and inference. It provides both an Ollama-compatible native API and an OpenAI-compatible API, so existing tools, SDKs, and frontends work out of the box.
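
Any HTTP client can exercise the OpenAI-compatible endpoint directly. A minimal sketch in Rust, assuming the tokio, serde_json, and reqwest (with its json feature) crates; the model name is illustrative:

// Minimal sketch: POST a chat request to the OpenAI-compatible endpoint.
// Crate choices (tokio, reqwest, serde_json) are illustrative, not required.
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let body = json!({
        "model": "llama3.2:3b",
        "messages": [{ "role": "user", "content": "Hello" }]
    });
    let text = reqwest::Client::new()
        .post("http://127.0.0.1:11435/v1/chat/completions")
        .json(&body)
        .send()
        .await?
        .text()
        .await?;
    println!("{text}");
    Ok(())
}

Pointing an existing OpenAI SDK at http://127.0.0.1:11435/v1 works the same way.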

Basic Usage

# Pull a model by name (resolves from built-in registry or HuggingFace)
a3s-power pull llama3.2:3b

# Pull from a direct URL
a3s-power pull https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

# Interactive chat
a3s-power run llama3.2:3b

# Single prompt
a3s-power run llama3.2:3b --prompt "Explain quicksort in one paragraph"

# Start HTTP server
a3s-power serve

Features

  • CLI Model Management: Pull, list, show, and delete models from the command line
  • Model Name Resolution: Pull models by name (llama3.2:3b) with built-in registry and HuggingFace fallback
  • Interactive Chat: Multi-turn conversation with streaming token output
  • Chat Template Auto-Detection: Detects ChatML, Llama, Phi, and Generic templates from GGUF metadata
  • Multiple Concurrent Models: Load multiple models with LRU eviction at configurable capacity (see the sketch after this list)
  • GPU Acceleration: Configurable GPU layer offloading via [gpu] config section
  • Embedding Support: Real embedding generation with automatic model reload in embedding mode
  • HTTP Server: Axum-based server with CORS, tracing, and metrics middleware
  • Ollama-Compatible API: /api/generate, /api/chat, /api/tags, /api/pull, /api/show, /api/delete, /api/embeddings, /api/embed, /api/ps, /api/copy, /api/version
  • OpenAI-Compatible API: /v1/chat/completions, /v1/completions, /v1/models, /v1/embeddings
  • SSE Streaming: All inference and pull endpoints support server-sent events
  • Prometheus Metrics: GET /metrics endpoint with request counts, durations, token counters, and model gauges
  • Content-Addressed Storage: Model blobs stored by SHA-256 hash with automatic deduplication
  • llama.cpp Backend: GGUF inference via llama-cpp-2 Rust bindings (optional feature flag)
  • Health Check: GET /health endpoint with uptime, version, and loaded model count
  • Model Auto-Loading: Models are automatically loaded on first inference request with LRU eviction
  • TOML Configuration: User-configurable host, port, GPU settings, and storage settings
  • Async-First: Built on Tokio for high-performance async operations
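
The sketch below illustrates the LRU bookkeeping behind the concurrent-model and auto-loading features. It is a simplified stand-in rather than the actual AppState in server/state.rs; the VecDeque-based struct is an assumption:

// Hypothetical sketch of LRU tracking for loaded models; the real
// state type in server/state.rs may differ.
use std::collections::VecDeque;

struct LoadedModels {
    capacity: usize,
    order: VecDeque<String>, // front = least recently used
}

impl LoadedModels {
    /// Mark a model as just used; return the model to evict, if any.
    fn touch(&mut self, name: &str) -> Option<String> {
        self.order.retain(|n| n.as_str() != name);
        self.order.push_back(name.to_string());
        if self.order.len() > self.capacity {
            self.order.pop_front()
        } else {
            None
        }
    }
}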

Quality Metrics

Test Coverage

197 unit tests with 54.2% line coverage / 70.8% function coverage (via cargo llvm-cov):

File                   Line Coverage   Function Coverage
api/autoload.rs        96.4%           100.0%
api/health.rs          100.0%          100.0%
api/types.rs           100.0%          100.0%
api/openai/mod.rs      100.0%          100.0%
api/native/mod.rs      100.0%          100.0%
api/sse.rs             72.0%           66.7%
api/openai/models.rs   51.9%           66.7%
api/native/models.rs   14.9%           18.2%
backend/llamacpp.rs    100.0%          100.0%
backend/types.rs       100.0%          100.0%
backend/mod.rs         83.7%           83.3%
model/manifest.rs      100.0%          100.0%
model/registry.rs      77.5%           70.0%
model/storage.rs       78.4%           75.0%
model/pull.rs          51.4%           64.7%
config.rs              88.6%           92.3%
dirs.rs                88.5%           81.8%
error.rs               100.0%          100.0%
server/router.rs       100.0%          100.0%
server/state.rs        100.0%          100.0%
TOTAL                  54.2%           70.8%

CLI handlers (cli/*), HTTP handlers (api/native/{chat,generate,pull,embeddings}.rs, api/openai/{chat,completions,embeddings}.rs), and server/mod.rs have 0% coverage — these require integration tests with live backends and are excluded from the unit-test library target.

Run tests:

cargo test -p a3s-power --lib -- --test-threads=1

Run coverage:

cargo llvm-cov -p a3s-power --lib --summary-only -- --test-threads=1

Architecture

Components

┌─────────────────────────────────────────────────┐
│                  a3s-power                       │
│                                                  │
│  CLI Layer                                       │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│  │ run  │ │ pull │ │ list │ │ show │ │serve │ │
│  └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │
│     │        │        │        │        │      │
│  Model Layer          │                  │      │
│  ┌────────────────────┴────────┐         │      │
│  │      ModelRegistry          │         │      │
│  │  ┌──────────┐ ┌──────────┐ │         │      │
│  │  │ manifest │ │ storage  │ │         │      │
│  │  └──────────┘ └──────────┘ │         │      │
│  └─────────────────────────────┘         │      │
│                                          │      │
│  Backend Layer                           │      │
│  ┌─────────────────────────────┐         │      │
│  │    BackendRegistry          │         │      │
│  │  ┌──────────────────────┐  │         │      │
│  │  │ LlamaCppBackend      │  │         │      │
│  │  │ (feature: llamacpp)  │  │         │      │
│  │  └──────────────────────┘  │         │      │
│  └─────────────────────────────┘         │      │
│                                          │      │
│  Server Layer ◄──────────────────────────┘      │
│  ┌─────────────────────────────────────┐        │
│  │  Axum Router                        │        │
│  │  ┌────────────┐ ┌────────────────┐  │        │
│  │  │ /api/*     │ │ /v1/*          │  │        │
│  │  │ (Ollama)   │ │ (OpenAI)       │  │        │
│  │  └────────────┘ └────────────────┘  │        │
│  └─────────────────────────────────────┘        │
└─────────────────────────────────────────────────┘

Backend Trait

The Backend trait abstracts inference engines. The llama.cpp backend is feature-gated; without the llamacpp feature, Power can still manage models but returns "backend not available" for inference calls.

#[async_trait]
pub trait Backend: Send + Sync {
    fn name(&self) -> &str;
    fn supports(&self, format: &ModelFormat) -> bool;
    async fn load(&self, manifest: &ModelManifest) -> Result<()>;
    async fn unload(&self, model_name: &str) -> Result<()>;
    async fn chat(&self, model_name: &str, request: ChatRequest)
        -> Result<Pin<Box<dyn Stream<Item = Result<ChatResponseChunk>> + Send>>>;
    async fn complete(&self, model_name: &str, request: CompletionRequest)
        -> Result<Pin<Box<dyn Stream<Item = Result<CompletionResponseChunk>> + Send>>>;
    async fn embed(&self, model_name: &str, request: EmbeddingRequest)
        -> Result<EmbeddingResponse>;
}
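
Both streaming methods return a boxed Stream, so callers drain tokens with futures::StreamExt. A minimal call-site sketch; the backend reference, request value, and model name are assumed to come from elsewhere, and Result is the crate's alias from error.rs:

// Sketch of draining Backend::chat; chunk types come from backend/types.rs.
use futures::StreamExt;

async fn print_chat(backend: &dyn Backend, request: ChatRequest) -> Result<()> {
    let mut chunks = backend.chat("llama3.2:3b", request).await?;
    while let Some(chunk) = chunks.next().await {
        let _chunk = chunk?;
        // forward or print each streamed token chunk here
    }
    Ok(())
}

The MockBackend in backend/test_utils.rs exists for exercising call sites like this in tests.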

Quick Start

Build

# Build without inference backend (model management only)
cargo build -p a3s-power

# Build with llama.cpp inference (requires C++ compiler + CMake)
cargo build -p a3s-power --features llamacpp

Model Management

# Pull a model by name (built-in registry + HuggingFace fallback)
a3s-power pull llama3.2:3b

# Pull from a direct URL
a3s-power pull https://example.com/model.gguf

# List local models
a3s-power list

# Show model details
a3s-power show my-model

# Delete a model
a3s-power delete my-model

Interactive Chat

# Start interactive chat session
a3s-power run my-model

# Send a single prompt
a3s-power run my-model --prompt "What is Rust?"

HTTP Server

# Start server on the default address (127.0.0.1:11435)
a3s-power serve

# Custom host and port
a3s-power serve --host 0.0.0.0 --port 8080

API Reference

Server

Method  Path      Description
GET     /health   Health check (status, version, uptime, loaded models)
GET     /metrics  Prometheus metrics (request counts, durations, tokens, model gauge)

Native API (Ollama-Compatible)

Method  Path             Description
POST    /api/generate    Text generation (streaming/non-streaming)
POST    /api/chat        Chat completion (streaming/non-streaming)
POST    /api/pull        Download a model by name or URL (streaming progress)
GET     /api/tags        List local models
POST    /api/show        Show model details
DELETE  /api/delete      Delete a model
POST    /api/embeddings  Generate embeddings
POST    /api/embed       Batch embedding generation
GET     /api/ps          List running/loaded models
POST    /api/copy        Copy/alias a model
GET     /api/version     Server version

OpenAI-Compatible API

Method  Path                  Description
POST    /v1/chat/completions  Chat completion (streaming/non-streaming)
POST    /v1/completions       Text completion (streaming/non-streaming)
GET     /v1/models            List available models
POST    /v1/embeddings        Generate embeddings

Examples

List Models

# OpenAI-compatible
curl http://localhost:11435/v1/models

# Ollama-compatible
curl http://localhost:11435/api/tags

Chat Completion (OpenAI)

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Chat Completion with Streaming

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
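
With "stream": true the body arrives as SSE frames (data: <json> lines, per the OpenAI streaming convention). A sketch of reading the raw frames in Rust, assuming reqwest with its stream feature enabled; buffering frames into complete lines is left out:

// Sketch: print SSE frames from a streaming chat completion as they arrive.
// Assumes reqwest's `stream` feature; run inside a tokio runtime.
use futures::StreamExt;

async fn stream_chat(body: serde_json::Value) -> reqwest::Result<()> {
    let resp = reqwest::Client::new()
        .post("http://127.0.0.1:11435/v1/chat/completions")
        .json(&body)
        .send()
        .await?;
    let mut frames = resp.bytes_stream();
    while let Some(frame) = frames.next().await {
        print!("{}", String::from_utf8_lossy(&frame?));
    }
    Ok(())
}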

Text Generation (Ollama)

curl http://localhost:11435/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": "Why is the sky blue?"
  }'

Text Completion (OpenAI)

curl http://localhost:11435/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": "Once upon a time"
  }'

CLI Commands

Command                                          Description
a3s-power run <model> [--prompt <text>]          Load model and start interactive chat, or send a single prompt
a3s-power pull <name_or_url>                     Download a model by name (llama3.2:3b) or direct URL
a3s-power list                                   List all locally available models
a3s-power show <model>                           Show model details (format, size, parameters)
a3s-power delete <model>                         Delete a model from local storage
a3s-power serve [--host <addr>] [--port <port>]  Start HTTP server (default: 127.0.0.1:11435)

Model Storage

Models are stored in ~/.a3s/power/ (override with $A3S_POWER_HOME):

~/.a3s/power/
├── config.toml              # User configuration
└── models/
    ├── manifests/           # JSON manifest files
    │   ├── llama-2-7b.json
    │   └── qwen2.5-7b.json
    └── blobs/               # Content-addressed model files
        ├── sha256-abc123...
        └── sha256-def456...

Content-Addressed Storage

Model files are stored by their SHA-256 hash, enabling:

  • Deduplication: Identical files share storage
  • Integrity verification: Blobs can be verified against their hash
  • Clean deletion: Remove manifest + blob independently
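
A blob's location is a pure function of its bytes. A minimal sketch using the sha2 crate; the helper is illustrative and not the actual storage module:

// Illustrative helper: derive a content-addressed blob path from file bytes.
use sha2::{Digest, Sha256};
use std::path::{Path, PathBuf};

fn blob_path(models_dir: &Path, bytes: &[u8]) -> PathBuf {
    // Hash the contents and hex-encode the digest.
    let hex: String = Sha256::digest(bytes)
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect();
    models_dir.join("blobs").join(format!("sha256-{hex}"))
}

Two pulls of the same file resolve to the same blob path, which is what makes deduplication automatic.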

Configuration

Configuration is read from ~/.a3s/power/config.toml:

host = "127.0.0.1"
port = 11435
max_loaded_models = 1

[gpu]
gpu_layers = -1   # offload all layers to GPU (-1=all, 0=CPU only)
main_gpu = 0      # primary GPU index

Field              Default       Description
host               127.0.0.1     HTTP server bind address
port               11435         HTTP server port
data_dir           ~/.a3s/power  Base directory for model storage
max_loaded_models  1             Maximum models loaded in memory concurrently
gpu.gpu_layers     0             Number of layers to offload to GPU (0=CPU, -1=all)
gpu.main_gpu       0             Index of the primary GPU to use

All fields are optional and have sensible defaults.
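
The file maps directly onto serde-derived structs. A simplified sketch of loading it, assuming the serde and toml crates; the struct shape follows the table above (data_dir omitted for brevity) and is an assumption, not the actual config.rs:

// Simplified sketch of deserializing config.toml; defaults mirror the table.
use serde::Deserialize;

#[derive(Deserialize)]
#[serde(default)]
struct Config {
    host: String,
    port: u16,
    max_loaded_models: usize,
    gpu: GpuConfig,
}

#[derive(Deserialize, Default)]
#[serde(default)]
struct GpuConfig {
    gpu_layers: i32, // 0 = CPU only, -1 = offload all layers
    main_gpu: u32,
}

impl Default for Config {
    fn default() -> Self {
        Self {
            host: "127.0.0.1".into(),
            port: 11435,
            max_loaded_models: 1,
            gpu: GpuConfig::default(),
        }
    }
}

fn load_config(path: &std::path::Path) -> Result<Config, Box<dyn std::error::Error>> {
    Ok(toml::from_str(&std::fs::read_to_string(path)?)?)
}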

Feature Flags

Flag      Description
llamacpp  Enable llama.cpp inference backend via llama-cpp-2. Requires a C++ compiler and CMake.

Without any feature flags, Power can manage models (pull, list, delete) and serve API responses, but inference calls will return a "backend not available" error.

Development

Build Commands

# Build
cargo build -p a3s-power                           # Debug build
cargo build -p a3s-power --release                 # Release build
cargo build -p a3s-power --features llamacpp       # With llama.cpp

# Test
cargo test -p a3s-power --lib -- --test-threads=1  # All 197 tests

# Lint
cargo clippy -p a3s-power -- -D warnings           # Clippy
cargo fmt -p a3s-power -- --check                  # Format check

# Run
cargo run -p a3s-power -- list                     # CLI
cargo run -p a3s-power -- serve                    # Server

Project Structure

power/
├── Cargo.toml
├── README.md
├── LICENSE
├── .gitignore
└── src/
    ├── main.rs              # Binary entry point (CLI dispatch)
    ├── lib.rs               # Library root (module re-exports)
    ├── error.rs             # PowerError enum + Result<T> alias
    ├── config.rs            # TOML configuration (host, port, data_dir)
    ├── dirs.rs              # Platform-specific paths (~/.a3s/power/)
    ├── cli/
    │   ├── mod.rs           # Cli struct + Commands enum (clap)
    │   ├── run.rs           # Interactive chat + single prompt
    │   ├── pull.rs          # Download with progress bar
    │   ├── list.rs          # Tabular model listing
    │   ├── show.rs          # Model detail display
    │   ├── delete.rs        # Model + blob deletion
    │   └── serve.rs         # HTTP server startup
    ├── model/
    │   ├── manifest.rs      # ModelManifest, ModelFormat, ModelParameters
    │   ├── registry.rs      # In-memory index backed by disk manifests
    │   ├── storage.rs       # Content-addressed blob store (SHA-256)
    │   ├── pull.rs          # HTTP download with progress callback
    │   ├── resolve.rs       # Name-based model resolution (built-in + HuggingFace)
    │   └── known_models.json # Built-in registry of popular GGUF models
    ├── backend/
    │   ├── mod.rs           # Backend trait + BackendRegistry
    │   ├── types.rs         # Inference request/response types
    │   ├── llamacpp.rs      # llama.cpp backend (feature-gated, multi-model)
    │   ├── chat_template.rs # Chat template detection and formatting
    │   └── test_utils.rs    # MockBackend for testing
    ├── server/
    │   ├── mod.rs           # Server startup (bind, listen)
    │   ├── state.rs         # Shared AppState with LRU model tracking
    │   ├── router.rs        # Axum router with CORS + tracing + metrics
    │   └── metrics.rs       # Prometheus metrics collection and /metrics handler
    └── api/
        ├── autoload.rs      # Model auto-loading on first inference
        ├── health.rs        # GET /health endpoint
        ├── types.rs         # OpenAI + Ollama request/response types
        ├── sse.rs           # SSE streaming utilities
        ├── native/
        │   ├── mod.rs       # Ollama-compatible route group
        │   ├── generate.rs  # POST /api/generate
        │   ├── chat.rs      # POST /api/chat
        │   ├── models.rs    # GET /api/tags, POST /api/show, DELETE /api/delete
        │   ├── pull.rs      # POST /api/pull (streaming progress)
        │   ├── embeddings.rs # POST /api/embeddings
        │   ├── embed.rs     # POST /api/embed (batch embeddings)
        │   ├── ps.rs        # GET /api/ps (running models)
        │   ├── copy.rs      # POST /api/copy (model aliasing)
        │   └── version.rs   # GET /api/version
        └── openai/
            ├── mod.rs       # OpenAI-compatible route group + shared helpers
            ├── chat.rs      # POST /v1/chat/completions
            ├── completions.rs # POST /v1/completions
            ├── models.rs    # GET /v1/models
            └── embeddings.rs # POST /v1/embeddings

A3S Ecosystem

A3S Power is an infrastructure component of the A3S ecosystem — a standalone model server that enables local LLM inference for other A3S tools.

┌──────────────────────────────────────────────────────────┐
│                    A3S Ecosystem                          │
│                                                           │
│  Infrastructure:  a3s-box     (MicroVM sandbox runtime)   │
│                   a3s-power   (local model serving)       │
│                      │            ▲                       │
│  Application:     a3s-code    ────┘  (AI coding agent)    │
│                    /   \                                  │
│  Utilities:   a3s-lane  a3s-context                       │
│                         (memory/knowledge)                │
│                                                           │
│               a3s-power ◄── You are here                  │
└──────────────────────────────────────────────────────────┘

Project   Package       Relationship
box       a3s-box-*     Can use Power for local model inference
code      a3s-code      Uses Power as a local model backend
lane      a3s-lane      Independent utility (no direct relationship)
context   a3s-context   Independent utility (no direct relationship)

Standalone Usage: a3s-power works independently as a local model server for any application:

  • Drop-in Ollama replacement with identical API
  • OpenAI SDK compatible for seamless integration
  • Local-first inference with no cloud dependency

Roadmap

Phase 1: Core ✅

  • CLI model management (pull, list, show, delete)
  • Content-addressed storage with SHA-256
  • Model manifest system with JSON persistence
  • TOML configuration
  • Platform-specific directory resolution
  • 108 comprehensive unit tests

Phase 2: Backend & Inference ✅

  • Backend trait abstraction
  • llama.cpp backend via llama-cpp-2 (feature-gated)
  • Streaming token generation via channels
  • Interactive chat with conversation history
  • Single prompt mode

Phase 3: HTTP Server ✅

  • Axum-based HTTP server with CORS + tracing
  • Ollama-compatible native API (7 endpoints)
  • OpenAI-compatible API (4 endpoints)
  • SSE streaming for all inference endpoints
  • Non-streaming response collection

Phase 4: Polish & Production ✅

  • Model registry resolution (name-based pulls with built-in registry + HuggingFace fallback)
  • Embedding generation support (automatic reload with embedding mode)
  • Multiple concurrent model loading (HashMap storage with LRU eviction)
  • Model auto-loading on first API request
  • GPU acceleration configuration ([gpu] config with layer offloading)
  • Chat template auto-detection from GGUF metadata (ChatML, Llama, Phi, Generic)
  • Health check endpoint (/health)
  • Prometheus metrics endpoint (/metrics with request/token/model counters)
  • 197 comprehensive unit tests

License

MIT