A3S Power
Overview
A3S Power is a privacy-preserving LLM inference server designed to run inside Trusted Execution Environments (TEE). It provides an OpenAI-compatible API for chat completions, text completions, and embeddings — with hardware-enforced memory encryption, model integrity verification, and automatic log redaction.
Power is built to run inside a3s-box MicroVMs with AMD SEV-SNP or Intel TDX, ensuring that inference data (prompts, responses, model weights) never leaves the encrypted enclave.
How It Works
┌─────────────────────────────────────────────────────┐
│ a3s-box MicroVM (AMD SEV-SNP / Intel TDX) │
│ ┌───────────────────────────────────────────────┐ │
│ │ a3s-power │ │
│ │ │ │
│ │ 1. Verify model integrity (SHA-256) │ │
│ │ 2. Generate remote attestation report │ │
│ │ 3. Serve inference via OpenAI API │ │
│ │ 4. Redact all inference content from logs │ │
│ └───────────────────────────────────────────────┘ │
│ Hardware-encrypted memory — host cannot read │
└─────────────────────────────────────────────────────┘
Features
- TEE-Aware Runtime: Auto-detects AMD SEV-SNP (/dev/sev-guest) and Intel TDX (/dev/tdx_guest) at startup; simulated mode for development (A3S_TEE_SIMULATE=1)
- Remote Attestation: TeeProvider trait with AttestationReport generation, giving cryptographic proof that inference runs in a genuine TEE; AMD SEV-SNP uses the real /dev/sev-guest ioctl (SNP_GET_REPORT), Intel TDX uses the real /dev/tdx_guest ioctl (TDX_CMD_GET_REPORT0); full raw reports included for client verification
- Model Integrity Verification: SHA-256 hash verification of all model files at startup against configured expected hashes; fails fast on tampering
- Deep Log Redaction: PrivacyProvider trait strips inference content from all log output in TEE mode; covers 10 sensitive JSON keys ("content", "prompt", "text", "arguments", "input", "delta", "system", "message", "query", "instruction"); sanitize_error() strips prompt fragments from error messages; suppress_token_metrics rounds token counts to the nearest 10 to prevent side-channel inference
- Memory Zeroing: SensitiveString wrapper auto-zeroizes on drop; zeroize_string() / zeroize_bytes() utilities for clearing inference buffers via the zeroize crate
- Encrypted Model Loading: AES-256-GCM encryption/decryption of model files; DecryptedModel RAII wrapper securely wipes temp files on drop; MemoryDecryptedModel decrypts entirely in RAM with mlock (never touches disk); key from file or env var
- KeyProvider Trait: Abstract key loading for HSM integration and zero-downtime key rotation; StaticKeyProvider wraps the existing file/env key source; RotatingKeyProvider holds multiple keys and advances on rotate_key() (deploy new key, rotate, remove old)
- Rate Limiting: Token-bucket rate limiter middleware (rate_limit_rps) and concurrency cap (max_concurrent_requests) applied to all /v1/* endpoints; returns 429 Too Many Requests with an OpenAI-style error body
- Model-Attestation Binding: GET /v1/attestation?model=<name> embeds the model's SHA-256 hash into report_data alongside the nonce, using the layout [nonce(32)][model_sha256(32)], cryptographically tying the attestation to the specific model being served
- Health + TEE Status: GET /health reports TEE type, attestation status, and model verification state
- OpenAI-Compatible API: /v1/chat/completions, /v1/completions, /v1/models, /v1/embeddings; works with any OpenAI SDK
- Pure Rust Inference (default): GGUF model inference via mistralrs (built on candle); no C++ dependency, ideal for TEE auditing
- SafeTensors Inference: HuggingFace SafeTensors chat models loaded via TextModelBuilder with ISQ on-load quantization (default Q8_0); register with format=safetensors, configure ISQ via default_parameters.isq (e.g. Q4K, Q6K, Q8_0)
- Vision/Multimodal Inference: Vision models (e.g. LLaVA, Phi-3-Vision) loaded via VisionModelBuilder; register with format=vision; pass base64-encoded images via the images field or OpenAI-style content parts (image_url); ISQ quantization supported
- True Token-by-Token Streaming: Chat completions use stream_chat_request for per-token SSE delivery; each Response::Chunk is forwarded immediately as it is generated
- Embedding Models: HuggingFace-format embedding models (e.g. Qwen3-Embedding, GTE, NomicBert) loaded via EmbeddingModelBuilder; register with format=huggingface, call POST /v1/embeddings; empty-input fast path returns immediately
- llama.cpp Backend (optional): GGUF inference via llama-cpp-2 Rust bindings (feature-gated, requires C++ toolchain)
- GPU Acceleration: Auto-detection of Apple Metal and NVIDIA CUDA; configurable layer offloading, multi-GPU support
- Chat Template Engine: Jinja2-compatible template rendering via minijinja (Llama 3, ChatML, Phi, Gemma, custom)
- Tool/Function Calling: Structured tool definitions with XML, Mistral, and JSON output parsing
- JSON Schema Structured Output: Constrain model output via JSON Schema → GBNF grammar conversion
- Thinking & Reasoning: Streaming <think> block parser for DeepSeek-R1, QwQ reasoning models
- KV Cache Reuse: Prefix matching across multi-turn requests for conversation speedup
- Content-Addressed Storage: Model blobs stored by SHA-256 hash with automatic deduplication
- Automatic Model Lifecycle: LRU eviction, configurable keep-alive, background reaper for idle models
- TEE Metrics: Prometheus counters for attestation reports (power_tee_attestations_total), model decryptions (power_tee_model_decryptions_total), and log redactions (power_tee_redactions_total)
- RA-TLS Transport: Feature-gated (tls) TLS server with self-signed ECDSA P-256 certificate; when ra_tls = true, the TEE attestation report is embedded as a custom X.509 extension (OID 1.3.6.1.4.1.56560.1.1) so clients can cryptographically verify the server is running inside a genuine TEE before trusting inference
- Vsock Transport: Feature-gated (vsock, Linux only) AF_VSOCK server for a3s-box MicroVM guest-host communication; exposes the same API as the TCP listener without requiring any network configuration inside the VM
- Prometheus Metrics: Request counts, durations, tokens, inference timing, TTFT, model memory, GPU utilization
- HCL Configuration: HashiCorp Configuration Language for all settings
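To make the Rate Limiting feature above more concrete, here is a minimal token-bucket sketch in std-only Rust. It is not Power's middleware: the `TokenBucket` type, its fields, and the choice to use the rate as the burst capacity are all assumptions made for illustration. Time is passed in explicitly so the behaviour is deterministic.

```rust
/// Minimal token bucket (hypothetical type, not taken from the Power source).
/// Time is supplied by the caller in seconds; a real middleware would read
/// a monotonic clock instead.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_secs: f64,
}

impl TokenBucket {
    /// `rps` plays the role of `rate_limit_rps`. Using it as the burst
    /// capacity as well is an assumption of this sketch.
    fn new(rps: f64, now_secs: f64) -> Self {
        Self { capacity: rps, tokens: rps, refill_per_sec: rps, last_secs: now_secs }
    }

    /// Returns true if the request may proceed, false if the caller
    /// should answer with 429 Too Many Requests.
    fn try_acquire(&mut self, now_secs: f64) -> bool {
        // Refill in proportion to elapsed time, capped at the capacity.
        let elapsed = (now_secs - self.last_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_secs = now_secs;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Since `rate_limit_rps = 0` means unlimited, a caller would skip the bucket entirely in that case rather than construct one with zero capacity.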
Architecture
A3S Power is organized into 6 layers. Each layer has a clear responsibility and communicates only with adjacent layers through trait-based interfaces.
System Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ a3s-power │
│ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ API Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ ┌──────────────┐ │ │
│ │ │ /v1/chat/ │ │ /v1/models │ │ /v1/embed │ │ /v1/attest │ │ │
│ │ │ completions │ │ /v1/models/ │ │ dings │ │ ation │ │ │
│ │ │ │ │ pull │ │ │ │ │ │ │
│ │ │ /v1/ │ │ /v1/models/ │ │ │ │ /health │ │ │
│ │ │ completions │ │ :name │ │ │ │ /metrics │ │ │
│ │ └──────┬───────┘ └──────┬───────┘ └─────┬──────┘ └──────┬───────┘ │ │
│ │ │ │ │ │ │ │
│ │ ┌──────┴────────────────┴───────────────┴───────────────┘ │ │
│ │ │ autoload: LRU eviction → decrypt → integrity check → load │ │
│ │ └──────┬────────────────────────────────────────────────── │ │
│ └─────────┼─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────┼─────────────────────────────────────────────────────────┐ │
│ │ Server │Layer │ │
│ │ ┌──────┴───────────────────────────────────────────────────────┐ │ │
│ │ │ Middleware Stack (outermost → innermost) │ │ │
│ │ │ RateLimiter → RequestID → Metrics → Tracing → CORS → Auth │ │ │
│ │ └──────────────────────────┬───────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌──────────┐ ┌─────────┐ ┌┴─────────┐ ┌──────────┐ ┌─────────┐ │ │
│ │ │ AppState │ │ Auth │ │ Audit │ │ Metrics │ │Transport│ │ │
│ │ │ (model │ │ (Bearer │ │ (JSONL/ │ │(Promethe │ │TCP/TLS/ │ │ │
│ │ │lifecycle,│ │ SHA256 │ │ encrypt/ │ │ us, 16 │ │ Vsock) │ │ │
│ │ │ LRU, │ │ const- │ │ async/ │ │ metric │ │ │ │ │
│ │ │ privacy) │ │ time) │ │ noop) │ │ groups) │ │ │ │ │
│ │ └──────┬───┘ └─────────┘ └──────────┘ └──────────┘ └─────────┘ │ │
│ └─────────┼─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────┼─────────────────────────────────────────────────────────┐ │
│ │ Backend│Layer │ │
│ │ ┌──────┴───────────────────────────────────────────────────────┐ │ │
│ │ │ BackendRegistry (priority-based, TEE-aware routing) │ │ │
│ │ │ ┌─────────────────────┬─────────────────┬────────────────┐ │ │ │
│ │ │ │ MistralRsBackend ★ │ LlamaCppBackend │ PicolmBackend │ │ │ │
│ │ │ │ pure Rust (candle) │ C++ bindings │ pure Rust │ │ │ │
│ │ │ │ GGUF/SafeTensors/ │ GGUF only │ layer-stream │ │ │ │
│ │ │ │ HuggingFace/Vision │ KV cache, LoRA │ O(layer_size) │ │ │ │
│ │ │ │ ISQ quantization │ grammar, vision │ TEE-optimized │ │ │ │
│ │ │ └─────────────────────┴─────────────────┴────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ Shared: chat_template · gpu · json_schema · tool_parser │ │ │
│ │ │ think_parser · gguf_stream │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Model Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ ┌─────────────┐ │ │
│ │ │ ModelRegistry│ │ BlobStorage │ │ GgufMeta │ │ HfPull │ │ │
│ │ │ (RwLock<Map>)│ │ (SHA-256 │ │ (parser, │ │ (Range │ │ │
│ │ │ manifest │ │ content- │ │ memory │ │ resume, │ │ │
│ │ │ persistence) │ │ addressed) │ │ estim.) │ │ SSE prog.) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────┘ └─────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ TEE Layer (cross-cutting security) │ │
│ │ ┌────────────┐ ┌────────────┐ ┌──────────┐ ┌─────────────────┐ │ │
│ │ │Attestation │ │ Encrypted │ │ Privacy │ │ Model Seal │ │ │
│ │ │(TeeProvider│ │ Model │ │(Provider │ │ (SHA-256 + │ │ │
│ │ │ SEV-SNP, │ │ AES-256- │ │ redact, │ │ Ed25519 sig) │ │ │
│ │ │ TDX, ioctl)│ │ GCM, 3 │ │ zeroize, │ │ │ │ │
│ │ │ │ │ modes) │ │ suppress)│ │ │ │ │
│ │ └────────────┘ └────────────┘ └──────────┘ └─────────────────┘ │ │
│ │ ┌────────────┐ ┌────────────┐ ┌──────────┐ ┌─────────────────┐ │ │
│ │ │KeyProvider │ │ TeePolicy │ │ EPC │ │ RA-TLS Cert │ │ │
│ │ │(Static, │ │(allowlist, │ │(memory │ │ (X.509 + │ │ │
│ │ │ Rotating, │ │ measure- │ │ detect, │ │ attestation │ │ │
│ │ │ HSM ext.) │ │ ment pin) │ │ routing) │ │ extension) │ │ │
│ │ └────────────┘ └────────────┘ └──────────┘ └─────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Verify Layer (client-side SDK) │ │
│ │ ┌──────────────────────────────┐ ┌─────────────────────────────┐ │ │
│ │ │ verify_report() │ │ HardwareVerifier trait │ │ │
│ │ │ · nonce binding (const-time) │ │ · SevSnpVerifier (AMD KDS) │ │ │
│ │ │ · model hash binding │ │ · TdxVerifier (Intel PCS) │ │ │
│ │ │ · measurement check │ │ · ECDSA P-384 / P-256 │ │ │
│ │ └──────────────────────────────┘ └─────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Infrastructure: config.rs (HCL) · dirs.rs · error.rs (14 var.) │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Core vs Extension
Power follows the Minimal Core + External Extensions pattern. Core components are stable and non-replaceable; extensions are trait-based and swappable.
Core (7) Extensions (8 trait-based)
───────────────────────── ──────────────────────────────────────
AppState (model lifecycle) Backend: MistralRs / LlamaCpp / Picolm
BackendRegistry + Backend trait TeeProvider: SEV-SNP / TDX / Simulated
ModelRegistry + ModelManifest PrivacyProvider: redaction policy
PowerConfig (HCL) TeePolicy: allowlist + measurement pin
PowerError (14 variants → HTTP) KeyProvider: Static / Rotating / KMS
Router + middleware stack AuthProvider: API key (SHA-256)
RequestContext (per-request) AuditLogger: JSONL / Encrypted / Async / Noop
HardwareVerifier: AMD KDS / Intel PCS
Request Flow: Chat Completion
Client
│
│ POST /v1/chat/completions
▼
┌─────────────────────────────────────────────────────────────────┐
│ Middleware Stack │
│ RateLimiter ─► RequestID ─► Metrics ─► Tracing ─► CORS ─► Auth │
└────────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ chat::handler() │
│ │
│ 1. Build RequestContext (request_id, auth_id) │
│ 2. Privacy: sanitize_log() if redaction enabled │
│ 3. ModelRegistry.get(model) → ModelManifest │
│ 4. BackendRegistry.find_for_format(format) → Backend │
│ │
│ 5. autoload::ensure_loaded() │
│ ├─ LRU eviction if at max_loaded_models │
│ ├─ If .enc: KeyProvider.get_key() → AES-256-GCM decrypt │
│ │ ├─ MemoryDecryptedModel (mlock RAM, zeroize on drop) │
│ │ ├─ DecryptedModel (temp file, secure wipe on drop) │
│ │ └─ LayerStreamingDecryptedModel (chunk-by-chunk) │
│ ├─ model_seal: verify SHA-256 integrity │
│ ├─ model_seal: verify Ed25519 signature (if configured) │
│ └─ Backend.load(manifest) │
│ │
│ 6. Backend.chat(model, request) → Stream<ChatResponseChunk> │
│ 7. Streaming SSE: role → content chunks (TTFT) → usage → DONE │
│ 8. Privacy: zeroize buffers, round token counts │
│ 9. Timing padding (±20% jitter) if configured │
│ 10. Audit: log event, Metrics: record duration/tokens │
│ 11. If keep_alive=0: Backend.unload() → RAII secure wipe │
└─────────────────────────────────────────────────────────────────┘
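The LRU eviction in step 5 can be sketched as follows. This is an illustration only: `LoadedModels` and its method are hypothetical names, and the real autoload path also handles decryption, integrity checks, and backend loading, which are omitted here.

```rust
use std::collections::VecDeque;

/// Tracks loaded model names, least recently used first.
/// Hypothetical helper, not the actual AppState implementation.
struct LoadedModels {
    max_loaded: usize,
    order: VecDeque<String>,
}

impl LoadedModels {
    fn new(max_loaded: usize) -> Self {
        // Treat 0 as 1 so that at least one model can be loaded.
        Self { max_loaded: max_loaded.max(1), order: VecDeque::new() }
    }

    /// Marks `name` as most recently used and returns the models that
    /// had to be evicted to stay within `max_loaded`.
    fn ensure_loaded(&mut self, name: &str) -> Vec<String> {
        if let Some(pos) = self.order.iter().position(|m| m.as_str() == name) {
            // Already loaded: only refresh its recency.
            if let Some(m) = self.order.remove(pos) {
                self.order.push_back(m);
            }
            return Vec::new();
        }
        let mut evicted = Vec::new();
        while self.order.len() >= self.max_loaded {
            match self.order.pop_front() {
                Some(old) => evicted.push(old),
                None => break,
            }
        }
        self.order.push_back(name.to_string());
        evicted
    }
}
```

With `max_loaded = 2`, loading "a", then "b", touching "a" again, and then loading "c" evicts "b", because "a" was used more recently.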
TEE Security Integration
The TEE layer is cross-cutting — it integrates at every layer of the stack:
Layer TEE Integration
────────────── ──────────────────────────────────────────────────────
API Log redaction, buffer zeroization, token rounding,
timing padding, attestation endpoint (nonce + model bind)
Server Encrypted audit logs (AES-256-GCM), constant-time auth,
RAII decrypted model storage, RA-TLS cert with attestation
X.509 extension, TEE-specific Prometheus counters
Backend EPC-aware routing (auto picolm when model > 75% EPC),
KV cache isolation per request, mlock weight pinning
Model Content-addressed SHA-256 storage, GGUF memory estimation
for EPC budget planning
TEE Attestation (SEV-SNP/TDX ioctl), AES-256-GCM encryption
(3 modes: file/RAM/streaming), Ed25519 model signatures,
key rotation, policy enforcement, log redaction (10 keys),
SensitiveString (auto-zeroize), EPC memory detection
Verify Client-side: nonce binding, model hash binding,
measurement check (all constant-time), hardware signature
verification via AMD KDS / Intel PCS certificate chains
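The constant-time checks mentioned for the Verify layer can be illustrated with a small std-only sketch. The function names are hypothetical, and the assumption that the nonce occupies the first 32 bytes of report_data follows the [nonce(32)][model_sha256(32)] layout described earlier. Production code would normally rely on a vetted constant-time primitive rather than a hand-written loop, since compilers may optimise such loops.

```rust
/// Compare two byte slices without returning early on the first mismatch.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // lengths are not treated as secret in this sketch
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= *x ^ *y; // accumulate differences across the whole slice
    }
    diff == 0
}

/// Check that the first 32 bytes of report_data equal the nonce the
/// client sent (hypothetical helper).
fn nonce_matches(report_data: &[u8; 64], expected_nonce: &[u8; 32]) -> bool {
    constant_time_eq(&report_data[..32], &expected_nonce[..])
}
```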
Encrypted Model Decryption Modes
┌─────────────────────────────────────────┐
│ KeyProvider.get_key() │
│ Static ─── Rotating ─── (HSM ext.) │
└──────────────────┬──────────────────────┘
│ AES-256-GCM key
┌──────────────────┼──────────────────────┐
│ │ │
┌─────┴──────┐ ┌──────┴───────┐ ┌──────────┴──────────┐
│ DecryptedMo│ │ MemoryDecrypt│ │ LayerStreamingDecry │
│ del (file) │ │ edModel (RAM)│ │ ptedModel (chunks) │
│ │ │ │ │ │
│ Temp .dec │ │ mlock-pinned │ │ Chunk-by-chunk │
│ file on │ │ RAM buffer, │ │ Zeroizing<Vec<u8>> │
│ disk, zero │ │ never touches│ │ per layer, for │
│ overwrite │ │ disk, zeroize│ │ picolm streaming │
│ + delete │ │ on drop │ │ O(layer_size) peak │
│ on drop │ │ │ │ │
└────────────┘ └──────────────┘ └─────────────────────┘
Any Any picolm only
backend backend
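All three modes share the idea that plaintext is wiped when its owner is dropped. A minimal std-only sketch of that RAII pattern is shown below. `WipedBuffer` is a hypothetical type; the source states that Power uses the zeroize crate, which this sketch only approximates with volatile writes.

```rust
use std::sync::atomic::{compiler_fence, Ordering};

/// Plaintext buffer that is overwritten with zeros when dropped.
/// Hypothetical stand-in for a zeroize-backed wrapper.
struct WipedBuffer {
    bytes: Vec<u8>,
}

impl WipedBuffer {
    fn new(bytes: Vec<u8>) -> Self {
        Self { bytes }
    }

    fn as_slice(&self) -> &[u8] {
        &self.bytes
    }

    /// Overwrite every byte with zero. Volatile writes plus a compiler
    /// fence make it less likely that the stores are optimised away.
    fn wipe(&mut self) {
        for b in self.bytes.iter_mut() {
            unsafe { std::ptr::write_volatile(b, 0) };
        }
        compiler_fence(Ordering::SeqCst);
    }
}

impl Drop for WipedBuffer {
    fn drop(&mut self) {
        self.wipe();
    }
}
```

The sketch does not pin memory; the mlock behaviour of the RAM mode would need a platform-specific call on top of this.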
Backend Trait
Three backends are available, each feature-gated:
- mistralrs (default): Pure Rust inference via candle. GGUF, SafeTensors, HuggingFace, Vision formats. ISQ on-load quantization. No C++ toolchain. Ideal for TEE supply-chain auditing.
- llamacpp (optional): C++ llama.cpp via llama-cpp-2 bindings. GGUF only. Session KV cache with prefix matching, LoRA adapters, MTMD multimodal, grammar constraints, mirostat sampling.
- picolm (optional): Pure Rust layer-streaming. GGUF only. Peak RAM = O(layer_size) not O(model_size). Enables 7B+ models in 512MB TEE EPC. Zero C dependencies.
The BackendRegistry selects backends by priority and model format. In TEE environments, find_for_tee() auto-routes to picolm when the model exceeds 75% of available EPC memory.
Without any backend feature enabled, Power can manage models but returns "backend not available" for inference.
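The 75% EPC routing rule can be written as a small pure function. This is a sketch of the threshold check only; `BackendChoice` and `choose_backend_for_tee` are hypothetical names, and the real find_for_tee() also considers priority and model format.

```rust
#[derive(Debug, PartialEq)]
enum BackendChoice {
    /// Whatever the registry would pick by priority and format.
    Default,
    /// Layer-streaming backend for models that do not fit comfortably.
    Picolm,
}

/// Route to picolm when the model is larger than 75% of available EPC.
fn choose_backend_for_tee(model_bytes: u64, epc_bytes: u64) -> BackendChoice {
    // model > 0.75 * epc  is equivalent to  4 * model > 3 * epc.
    // Integer math in u128 avoids both rounding and overflow.
    if (model_bytes as u128) * 4 > (epc_bytes as u128) * 3 {
        BackendChoice::Picolm
    } else {
        BackendChoice::Default
    }
}
```

With a 512 MiB EPC the threshold is 384 MiB: a 300 MiB model stays on the default backend, while a 400 MiB model is routed to picolm.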
Extension Points
All extension points are trait-based with working default implementations — the system works out of the box:
/// Remote attestation provider (TEE hardware abstraction).
/// Privacy protection for inference logs.
/// Model decryption key management (extensible to HSM/KMS).
/// Authentication mechanism.
/// Audit trail persistence.
/// TEE policy enforcement.
/// Client-side hardware attestation signature verification.
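As an illustration of one of these extension points, a rotating key provider might look roughly like the sketch below. The trait shape, method signatures, and error type are assumptions; only the names KeyProvider, get_key(), and rotate_key() come from this document. Whether the real provider wraps around or stops at the newest key is not stated, so this sketch stops.

```rust
/// Hypothetical trait shape; the real signatures may differ.
trait KeyProvider {
    fn get_key(&self) -> Result<[u8; 32], String>;
    fn rotate_key(&mut self) -> Result<(), String>;
}

/// Holds several AES-256 keys in rotation order and serves the current one.
struct RotatingKeys {
    keys: Vec<[u8; 32]>,
    current: usize,
}

impl KeyProvider for RotatingKeys {
    fn get_key(&self) -> Result<[u8; 32], String> {
        self.keys
            .get(self.current)
            .copied()
            .ok_or_else(|| "no keys configured".to_string())
    }

    fn rotate_key(&mut self) -> Result<(), String> {
        // Advance to the next key; refuse if there is none (assumption).
        if self.current + 1 >= self.keys.len() {
            return Err("no newer key to rotate to".to_string());
        }
        self.current += 1;
        Ok(())
    }
}
```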
Installation
Cargo (cross-platform)
# Default: pure Rust inference via mistral.rs (no C++ toolchain needed)
# With llama.cpp inference backend (requires C++ compiler + CMake)
# Model management only (no inference)
Build from Source
# Default: pure Rust inference via mistral.rs
# With llama.cpp inference instead
# Binary at target/release/a3s-power
Homebrew (macOS)
Configuration
Configuration is read from ~/.a3s/power/config.hcl (HCL format):
host = "127.0.0.1"
port = 11434
max_loaded_models = 1
keep_alive = "5m"
# TEE privacy protection
tee_mode = true
redact_logs = true
# Model integrity verification (checked at startup when tee_mode = true)
model_hashes = {
"llama3.2:3b" = "sha256:abc123..."
"qwen2.5:7b" = "sha256:def456..."
}
# GPU acceleration
gpu {
gpu_layers = -1 # -1 = offload all layers, 0 = CPU only
main_gpu = 0
}
Configuration Reference
| Field | Default | Description |
|---|---|---|
| host | 127.0.0.1 | HTTP server bind address |
| port | 11434 | HTTP server port |
| data_dir | ~/.a3s/power | Base directory for model storage |
| max_loaded_models | 1 | Maximum models loaded concurrently |
| keep_alive | "5m" | Auto-unload idle models ("0" = immediate, "-1" = never) |
| use_mlock | false | Lock model weights in memory (prevent swapping) |
| num_thread | auto | Thread count for inference |
| flash_attention | false | Enable flash attention |
| num_parallel | 1 | Concurrent inference slots |
| tee_mode | false | Enable TEE: attestation, integrity checks, log redaction |
| redact_logs | false | Redact inference content from logs |
| model_hashes | {} | Expected SHA-256 hashes for model verification |
| model_signing_key | null | Ed25519 public key (hex) for verifying model .sig signatures |
| gpu.gpu_layers | 0 | GPU layer offloading (-1 = all) |
| gpu.main_gpu | 0 | Primary GPU index |
| model_key_source | null | Decryption key for .enc model files: { file = "/path/to/key.hex" } or { env = "MY_KEY_VAR" } |
| key_provider | "static" | Key provider type: "static" (uses model_key_source) or "rotating" (uses key_rotation_sources) |
| key_rotation_sources | [] | For rotating provider: list of key sources in rotation order |
| in_memory_decrypt | false | Decrypt .enc models entirely in RAM with mlock (never writes plaintext to disk) |
| suppress_token_metrics | false | Round token counts in responses to nearest 10 (prevents exact token-count side-channel) |
| rate_limit_rps | 0 | Max requests per second for /v1/* endpoints (0 = unlimited) |
| max_concurrent_requests | 0 | Max concurrent requests for /v1/* endpoints (0 = unlimited) |
| tls_port | null | TLS server port; when set, a TLS server starts in parallel (tls feature required) |
| ra_tls | false | Embed TEE attestation in TLS cert (RA-TLS); requires tls_port + tee_mode |
| vsock_port | null | Vsock port for guest-host communication (vsock feature, Linux only) |
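The keep_alive values in the table ("5m", "0", "-1") suggest a small duration grammar. The sketch below shows one way to parse it. The accepted units (s, m, h) are an assumption, since this document only shows "5m", and `KeepAlive` / `parse_keep_alive` are hypothetical names.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum KeepAlive {
    /// "0": unload immediately after the request.
    Immediate,
    /// "-1": never unload.
    Never,
    /// Unload after the model has been idle for this long.
    For(Duration),
}

fn parse_keep_alive(s: &str) -> Result<KeepAlive, String> {
    let s = s.trim();
    match s {
        "0" => return Ok(KeepAlive::Immediate),
        "-1" => return Ok(KeepAlive::Never),
        _ => {}
    }
    // The last character is the unit; everything before it is the number.
    let (idx, unit) = s.char_indices().last().ok_or("empty keep_alive")?;
    let n: u64 = s[..idx]
        .parse()
        .map_err(|_| format!("invalid keep_alive: {}", s))?;
    let secs = match unit {
        's' => n,
        'm' => n.saturating_mul(60),
        'h' => n.saturating_mul(3600),
        _ => return Err(format!("unknown unit in keep_alive: {}", s)),
    };
    Ok(KeepAlive::For(Duration::from_secs(secs)))
}
```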
Environment Variables
| Variable | Description |
|---|---|
| A3S_POWER_HOME | Base directory for all Power data (default: ~/.a3s/power) |
| A3S_POWER_HOST | Server bind address |
| A3S_POWER_PORT | Server port |
| A3S_POWER_DATA_DIR | Model storage directory |
| A3S_POWER_MAX_MODELS | Max concurrent loaded models |
| A3S_POWER_KEEP_ALIVE | Default keep-alive duration |
| A3S_POWER_GPU_LAYERS | GPU layer offloading |
| A3S_POWER_TEE_MODE | Enable TEE mode ("1" or "true") |
| A3S_POWER_REDACT_LOGS | Enable log redaction ("1" or "true") |
| A3S_POWER_TLS_PORT | TLS server port (tls feature required) |
| A3S_POWER_RA_TLS | Enable RA-TLS attestation embedding ("1" or "true") |
| A3S_POWER_VSOCK_PORT | Vsock port (vsock feature, Linux only) |
| A3S_TEE_SIMULATE | Simulate TEE environment for development ("1") |
TEE Privacy Protection
Model Integrity Verification
When tee_mode = true and model_hashes is configured, Power verifies every model file's SHA-256 hash at startup. If any model fails verification, the server refuses to start.
tee_mode = true
model_hashes = {
"llama3.2:3b" = "sha256:a1b2c3d4e5f6..."
}
INFO TEE mode enabled tee_type="sev-snp"
INFO Model integrity verified model="llama3.2:3b"
INFO All model integrity checks passed count=1
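The fail-fast comparison behind these log lines can be sketched as below. Computing SHA-256 requires a hashing library, so this std-only example assumes the actual digests have already been computed and passed in as hex strings. `verify_model_hashes` is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Compare configured hashes ("sha256:<hex>") against digests computed
/// elsewhere. Returns the number of verified models, or the first failure.
fn verify_model_hashes(
    expected: &BTreeMap<String, String>,
    actual_hex: &BTreeMap<String, String>,
) -> Result<usize, String> {
    for (model, want) in expected {
        let want_hex = want
            .strip_prefix("sha256:")
            .ok_or_else(|| format!("{}: expected hash must start with sha256:", model))?;
        let got = actual_hex
            .get(model)
            .ok_or_else(|| format!("{}: no digest computed", model))?;
        if !want_hex.eq_ignore_ascii_case(got) {
            // Fail fast: the server would refuse to start at this point.
            return Err(format!("{}: integrity check failed", model));
        }
    }
    Ok(expected.len())
}
```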
Remote Attestation
The TeeProvider detects the TEE environment and generates attestation reports:
| TEE Type | Detection | Description |
|---|---|---|
| AMD SEV-SNP | /dev/sev-guest | Hardware memory encryption + attestation |
| Intel TDX | /dev/tdx_guest | Trust Domain Extensions |
| Simulated | A3S_TEE_SIMULATE=1 | Development/testing mode |
| None | (default) | No TEE detected |
The /health endpoint exposes TEE status:
Log Redaction
When redact_logs = true, the PrivacyProvider automatically strips inference content from all log output:
// Before redaction:
{"content": "tell me a secret", "model": "llama3"}
// After redaction:
{"content": "[REDACTED]", "model": "llama3"}
Redacted JSON keys: "content", "prompt", "text", "arguments", "input", "delta", "system", "message", "query", "instruction" — covering chat messages, tool call arguments, streaming deltas, system prompts, and completion requests.
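Key-based redaction of this kind can be sketched with a tiny JSON value type. The `Json` enum and `redact` function are hypothetical, written with the standard library only; replacing the whole value under a sensitive key, even when it is an object or array, is an assumption of this sketch.

```rust
const SENSITIVE_KEYS: [&str; 10] = [
    "content", "prompt", "text", "arguments", "input",
    "delta", "system", "message", "query", "instruction",
];

/// Minimal JSON value for illustration (hypothetical, not a real parser).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Str(String),
    Num(f64),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

/// Replace the value of every sensitive key, at any nesting depth.
fn redact(value: &mut Json) {
    match value {
        Json::Obj(fields) => {
            for (key, v) in fields.iter_mut() {
                if SENSITIVE_KEYS.contains(&key.as_str()) {
                    *v = Json::Str("[REDACTED]".to_string());
                } else {
                    redact(v);
                }
            }
        }
        Json::Arr(items) => {
            for item in items.iter_mut() {
                redact(item);
            }
        }
        Json::Str(_) | Json::Num(_) => {}
    }
}
```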
Error messages that echo prompt content are also sanitized via sanitize_error(). When suppress_token_metrics = true, token counts in responses are rounded to the nearest 10 to prevent exact token-count side-channel inference.
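Rounding to the nearest 10 is a one-liner. The sketch below rounds ties upward, which is an assumption; the document does not say how a count ending in 5 is handled. `round_token_count` is a hypothetical name.

```rust
/// Round a token count to the nearest multiple of 10 (ties round up).
fn round_token_count(n: u32) -> u32 {
    // saturating_add keeps the function total near u32::MAX.
    n.saturating_add(5) / 10 * 10
}
```

For example, 14 becomes 10, 15 becomes 20, and 137 becomes 140, so a client sees only the coarse size of a response rather than its exact token length.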
API Reference
Server Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check with TEE status, version, uptime, loaded models |
| GET | /metrics | Prometheus metrics (requests, durations, tokens, inference, TTFT, model memory, GPU) |
OpenAI-Compatible API
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completion (streaming/non-streaming, vision, tools, thinking) |
| POST | /v1/completions | Text completion (streaming/non-streaming) |
| POST | /v1/embeddings | Generate embeddings |
| GET | /v1/models | List all registered models |
| GET | /v1/models/:name | Get a single model by name |
| POST | /v1/models | Register a local model file (name, path body fields) |
| DELETE | /v1/models/:name | Unload and deregister a model |
| POST | /v1/models/pull | Pull a GGUF model from HuggingFace Hub (name, force body fields); streams SSE progress events; requires hf feature; concurrent pulls of the same model are deduplicated |
| GET | /v1/models/pull/:name/status | Get persisted pull progress for a model (status, completed, total, error); requires hf feature |
| GET | /v1/attestation | TEE attestation report (returns 503 if TEE not enabled); optional ?nonce=<hex> binds client nonce; optional ?model=<name> binds model SHA-256 into report_data |
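The report_data binding used by /v1/attestation follows the layout [nonce(32)][model_sha256(32)] described under Features. A minimal sketch of assembling that 64-byte field is shown below. `build_report_data` is a hypothetical name, and zero-padding a nonce shorter than 32 bytes is an assumption; the model digest is taken as an input because hashing needs a library.

```rust
/// Size of each half of the 64-byte report_data field.
const HALF: usize = 32;

/// Build report_data as [nonce(32)][model_sha256(32)].
/// A short nonce is zero-padded on the right (assumption of this sketch).
fn build_report_data(nonce: &[u8], model_sha256: &[u8; HALF]) -> Result<[u8; 64], String> {
    if nonce.len() > HALF {
        return Err(format!("nonce is {} bytes, maximum is {}", nonce.len(), HALF));
    }
    let mut out = [0u8; 64];
    out[..nonce.len()].copy_from_slice(nonce);
    out[HALF..].copy_from_slice(model_sha256);
    Ok(out)
}
```

A client that knows the nonce it sent and the model hash it expects can rebuild the same 64 bytes and compare them with the value in the returned report.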
Examples
Chat Completion
Streaming Chat
Text Completion
Tool/Function Calling
Structured Output (JSON Schema)
List Models
Pull a Model from HuggingFace Hub
Requires the hf feature (cargo build --features hf). Streams SSE progress:
# By quantization tag (resolves filename via HF API)
# By exact filename
# Private/gated model with HF token
# Force re-download
SSE response stream:
data: {"status":"resuming","offset":104857600,"total":2147483648} ← if resuming
data: {"status":"downloading","completed":209715200,"total":2147483648}
data: {"status":"verifying"}
data: {"status":"success","id":"bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M","object":"model","created":1234567890}
Interrupted downloads resume automatically on retry — the partial file is identified by a SHA-256 of the download URL and picked up via HTTP Range requests. Set HF_TOKEN env var as an alternative to passing token in the request body.
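The resume step relies on the standard HTTP Range header. The sketch below shows how the header value and a progress percentage could be derived from the byte counts that appear in the SSE events. Both function names are hypothetical, and the sketch does not perform any network I/O.

```rust
/// Value for the `Range` request header when `partial_len` bytes are
/// already on disk. Returns None when there is nothing to resume.
fn range_header(partial_len: u64) -> Option<String> {
    if partial_len == 0 {
        None
    } else {
        // "bytes=N-" asks the server for everything from offset N onward.
        Some(format!("bytes={}-", partial_len))
    }
}

/// Whole-number progress for a {"completed": .., "total": ..} event.
fn percent_done(completed: u64, total: u64) -> Option<u8> {
    if total == 0 || completed > total {
        return None;
    }
    Some(((completed as u128) * 100 / (total as u128)) as u8)
}
```

For the "resuming" event shown above, the offset 104857600 would correspond to the header value `bytes=104857600-`.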
Health Check (with TEE status)
Model Storage
Models are stored in ~/.a3s/power/ (override with $A3S_POWER_HOME):
~/.a3s/power/
├── config.hcl # HCL configuration
└── models/
├── manifests/ # JSON manifest files
│ ├── llama3.2-3b.json
│ └── qwen2.5-7b.json
└── blobs/ # Content-addressed model files
├── sha256-abc123...
└── sha256-def456...
Model files are stored by SHA-256 hash, enabling deduplication and integrity verification.
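Content addressing means the blob's file name is derived from its digest, so two manifests that point at identical bytes share one file. A minimal sketch of the path mapping, following the directory layout above, is given below. `blob_path` is a hypothetical helper and the digest is assumed to be computed elsewhere.

```rust
use std::path::{Path, PathBuf};

/// Map a SHA-256 hex digest to its location under `<models_dir>/blobs/`.
fn blob_path(models_dir: &Path, sha256_hex: &str) -> Result<PathBuf, String> {
    let valid = sha256_hex.len() == 64 && sha256_hex.bytes().all(|b| b.is_ascii_hexdigit());
    if !valid {
        return Err(format!("not a SHA-256 hex digest: {}", sha256_hex));
    }
    // Lowercase the digest so the same content always maps to one name.
    let name = format!("sha256-{}", sha256_hex.to_ascii_lowercase());
    Ok(models_dir.join("blobs").join(name))
}
```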
Feature Flags
| Flag | Default | Description |
|---|---|---|
| mistralrs | ✅ enabled | Pure Rust inference backend via mistralrs (candle-based). No C++ toolchain required. Ideal for TEE auditing. |
| llamacpp | ❌ disabled | llama.cpp inference backend via llama-cpp-2. Requires C++ compiler + CMake. Full-featured (KV cache, LoRA, grammar, mirostat). |
| picolm | ❌ disabled | Experimental. Pure Rust layer-streaming GGUF inference. Peak RAM = O(layer_size) not O(model_size). Enables 7B+ models in 512MB TEE EPC. Zero C dependencies, fully auditable. ⚠️ Forward pass uses stub arithmetic (not real transformer ops); tokenizer uses byte-fallback (not BPE). Produces placeholder output and is not suitable for production inference yet. Infrastructure (mmap, layer iteration, sampling) is production-ready. |
| hf | ❌ disabled | HuggingFace Hub model pull (POST /v1/models/pull). Range resume, SSE progress, HF_TOKEN auth. |
| tls | ❌ disabled | RA-TLS transport: TLS server with self-signed cert + optional attestation X.509 extension. Adds axum-server, rcgen, time deps. |
| vsock | ❌ disabled | Vsock transport for a3s-box MicroVM guest-host HTTP. Linux only; requires AF_VSOCK kernel support. Adds tokio-vsock dep. |
| hw-verify | ❌ disabled | Hardware attestation signature verification. AMD KDS (ECDSA P-384) + Intel PCS (ECDSA P-256) certificate chain validation. |
| tee-minimal | ❌ disabled | Composite: picolm + tls + vsock. Smallest auditable TEE build: no mistralrs/candle, no C++. Recommended for production TEE deployments. |
Without a backend feature (mistralrs, llamacpp, or picolm), Power can manage models but inference calls return "backend not available".
TEE Deployment
For production TEE deployments (AMD SEV-SNP / Intel TDX), use the tee-minimal build profile:
Why tee-minimal?
Inside a TEE, every crate in the inference path is part of the trusted computing base.
The tee-minimal profile minimizes this surface:
| Profile | Inference backend | Dep tree lines | C dependencies |
|---|---|---|---|
| default | mistralrs (candle) | ~2,000 | None |
| tee-minimal | picolm (pure Rust) | ~1,220 | None |
| llamacpp | llama.cpp | ~1,800+ | Yes (C++) |
What tee-minimal includes
- picolm backend: Pure Rust layer-streaming GGUF inference (~860 lines, fully auditable)
- Full TEE stack: attestation, model integrity (SHA-256), log redaction, memory zeroing
- Encrypted model loading: AES-256-GCM with in_memory_decrypt or streaming_decrypt
- RA-TLS transport: attestation embedded in X.509 cert
- Vsock transport: for a3s-box MicroVM guest-host communication
Layer-streaming inference
The picolm backend streams transformer layers one at a time through a single working buffer. Peak RAM stays at O(layer_size) rather than O(model_size), enabling 7B+ models inside a 512MB EPC:
# config.hcl — TEE deployment with layer-streaming
tee_mode = true
redact_logs = true
# For encrypted models: decrypt one layer at a time (requires picolm feature)
streaming_decrypt = true
# Or: decrypt full model into mlock RAM (compatible with all backends)
# in_memory_decrypt = true
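The memory behaviour of layer streaming can be illustrated with a toy loop that reuses a single buffer. This is only a model of the idea, not the picolm implementation: `stream_layers` is a hypothetical function, and filling the buffer with zeros stands in for reading or decrypting a layer.

```rust
/// Push each "layer" through one reusable buffer and report the largest
/// buffer length that was ever needed.
fn stream_layers<F: FnMut(&[u8])>(layer_sizes: &[usize], mut run_layer: F) -> usize {
    let mut buf: Vec<u8> = Vec::new();
    let mut peak = 0usize;
    for &size in layer_sizes {
        // Reuse the same allocation; only one layer is resident at a time.
        buf.clear();
        buf.resize(size, 0); // stand-in for loading/decrypting the layer
        peak = peak.max(buf.len());
        run_layer(&buf);
    }
    peak
}
```

For layers of 100, 300, and 200 bytes the function processes 600 bytes in total while the peak buffer length is 300, the size of the largest layer rather than the sum.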
Supply-chain audit
See docs/supply-chain.md for:
- Full dependency listing per feature profile
- Audit status for each crate in the tee-minimal inference path
- Security properties of LayerStreamingDecryptedModel
- How to reproduce dependency counts and audit unsafe blocks
Building with RA-TLS
# Build with TLS support
# Test TLS cert generation
To enable RA-TLS, set tls_port and ra_tls = true alongside tee_mode = true:
tee_mode = true
tls_port = 11443
ra_tls = true
At startup, the TLS server binds on the configured port with a fresh self-signed ECDSA P-256 certificate. When ra_tls = true and a TEE provider is active, the certificate includes the attestation report as OID extension 1.3.6.1.4.1.56560.1.1. Clients can extract and verify this extension to confirm they are communicating with a genuine TEE before trusting inference output.
Development
Build & Test
# Build
# Test (787+ tests)
# Test with TLS feature
# Lint
# Run
Project Structure
power/
├── Cargo.toml
├── justfile # Build, test, coverage, lint, CI targets
├── README.md
└── src/
├── main.rs # Entry point: load HCL config → server::start()
├── lib.rs # Module declarations
├── config.rs # PowerConfig (HCL deserialization + env overrides)
├── dirs.rs # Platform paths (~/.a3s/power/{manifests,blobs,pulls})
├── error.rs # PowerError enum (14 variants) + HTTP status mapping
│
├── api/ # API layer — OpenAI-compatible HTTP handlers
│ ├── mod.rs # Shared utilities, timestamp helpers
│ ├── types.rs # OpenAI request/response types (chat, completion, embedding)
│ ├── health.rs # GET /health (TEE status, version, uptime, loaded models)
│ ├── autoload.rs # Model lifecycle: LRU eviction → decrypt → verify → load
│ └── openai/ # OpenAI-compatible endpoint handlers
│ ├── mod.rs # Route definitions, openai_error() helper
│ ├── chat.rs # POST /v1/chat/completions (streaming SSE + JSON)
│ ├── completions.rs # POST /v1/completions
│ ├── embeddings.rs # POST /v1/embeddings
│ ├── models.rs # GET/POST/DELETE /v1/models, POST /v1/models/pull
│ └── attestation.rs # GET /v1/attestation (nonce + model hash binding)
│
├── backend/ # Backend layer — inference engine abstraction
│ ├── mod.rs # Backend trait (8 methods) + BackendRegistry (priority, TEE routing)
│ ├── types.rs # ChatRequest, ChatResponseChunk, EmbeddingRequest, Tool, ToolCall
│ ├── mistralrs_backend.rs # Pure Rust: GGUF/SafeTensors/HF/Vision, ISQ (feature: mistralrs) ★
│ ├── llamacpp.rs # C++ bindings: KV cache, LoRA, MTMD vision, grammar (feature: llamacpp)
│ ├── picolm.rs # Pure Rust layer-streaming, O(layer_size) RAM (feature: picolm)
│ ├── chat_template.rs # Jinja2 chat template rendering (ChatML/Llama/Phi/Generic)
│ ├── gpu.rs # Metal + CUDA detection, auto gpu_layers config
│ ├── json_schema.rs # JSON Schema → GBNF grammar for constrained output
│ ├── tool_parser.rs # Tool call parsing (XML/Hermes, Mistral, raw JSON)
│ ├── think_parser.rs # Streaming <think> block extraction (DeepSeek-R1, QwQ)
│ ├── gguf_stream.rs # GGUF v2/v3 mmap reader for picolm layer-streaming
│ └── test_utils.rs # MockBackend for testing
│
├── model/ # Model layer — storage, registry, pull
│ ├── mod.rs # Module declarations
│ ├── manifest.rs # ModelManifest, ModelFormat (Gguf/SafeTensors/HuggingFace/Vision)
│ ├── registry.rs # ModelRegistry (RwLock<HashMap>, JSON manifest persistence)
│ ├── storage.rs # Content-addressed blob store (SHA-256 naming, prune)
│ ├── gguf.rs # GGUF metadata reader, memory estimation (KV cache + compute)
│ ├── pull.rs # HuggingFace Hub pull with Range resume, SSE progress (feature: hf)
│ └── pull_state.rs # Persistent pull state (Pulling/Done/Failed) as JSON
│
├── server/ # Server layer — transport, auth, metrics, audit
│ ├── mod.rs # Server startup orchestration (TCP/TLS/Vsock), graceful shutdown
│ ├── state.rs # AppState: model lifecycle, LRU, decrypted model RAII, privacy
│ ├── router.rs # Axum router + middleware: rate limit, request ID, metrics, auth
│ ├── auth.rs # AuthProvider trait, ApiKeyAuth (SHA-256, constant-time)
│ ├── audit.rs # AuditLogger trait: JSONL / Encrypted / Async / Noop
│ ├── metrics.rs # Prometheus metrics (16 groups: HTTP, inference, TTFT, GPU, TEE)
│ ├── request_context.rs # Per-request context (request_id, auth_id, created_at)
│ ├── lock.rs # Shared RwLock helpers
│ └── vsock.rs # AF_VSOCK transport (feature: vsock, Linux only)
│
├── tee/ # TEE layer — cross-cutting security
│ ├── mod.rs # Module entry
│ ├── attestation.rs # TeeProvider trait, SEV-SNP/TDX ioctl, report_data binding
│ ├── encrypted_model.rs # AES-256-GCM: DecryptedModel / MemoryDecrypted / LayerStreaming
│ ├── key_provider.rs # KeyProvider trait: StaticKeyProvider + RotatingKeyProvider
│ ├── model_seal.rs # SHA-256 integrity + Ed25519 signature verification
│ ├── policy.rs # TeePolicy trait: allowlist + measurement pinning
│ ├── privacy.rs # PrivacyProvider: log redaction (10 keys), SensitiveString, zeroize
│ ├── epc.rs # EPC memory detection (/proc/meminfo), 75% threshold routing
│ └── cert.rs # RA-TLS X.509 cert with attestation extension (feature: tls)
│
├── verify/ # Verify layer — client-side attestation SDK
│ ├── mod.rs # verify_report(), nonce/hash/measurement binding (constant-time)
│ └── hw_verify.rs # SevSnpVerifier (AMD KDS) + TdxVerifier (Intel PCS)
│
└── bin/
└── a3s-power-verify.rs # CLI for offline attestation report verification
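The `tee/epc.rs` entry in the tree describes reading `/proc/meminfo` and routing on a 75% threshold. A minimal sketch of that decision follows. It assumes the memory budget comes from the `MemAvailable` field, which the tree does not confirm, and the function names `parse_mem_available_bytes` and `should_use_layer_streaming` are hypothetical, not taken from the crate.

```rust
/// Parse the `MemAvailable` line from /proc/meminfo-formatted text, in bytes.
/// Assumption: `MemAvailable` is the relevant field; the real module may
/// read a different one.
fn parse_mem_available_bytes(meminfo: &str) -> Option<u64> {
    meminfo.lines().find_map(|line| {
        // Expected shape: "MemAvailable:    4096000 kB"
        let rest = line.strip_prefix("MemAvailable:")?;
        let kib: u64 = rest.split_whitespace().next()?.parse().ok()?;
        kib.checked_mul(1024)
    })
}

/// True when the model is larger than 75% of the available memory, i.e. when
/// a layer-streaming backend would be preferred over loading the whole model.
fn should_use_layer_streaming(model_bytes: u64, available_bytes: u64) -> bool {
    // model > 0.75 * available, written as model * 4 > available * 3 so the
    // comparison stays in integer arithmetic; u128 avoids overflow.
    (model_bytes as u128) * 4 > (available_bytes as u128) * 3
}
```

In practice the text would come from `std::fs::read_to_string("/proc/meminfo")`; taking a `&str` keeps the parsing testable on any platform.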
A3S Ecosystem
A3S Power is the inference engine of the A3S privacy-preserving AI platform. It runs inside a3s-box MicroVMs to provide hardware-isolated LLM inference.
┌──────────────────────────────────────────────────────────────────┐
│                          A3S Ecosystem                           │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  a3s-box MicroVM (AMD SEV-SNP / Intel TDX)                 │  │
│  │  ┌──────────────────────────────────────────────────────┐  │  │
│  │  │  a3s-power                                           │  │  │
│  │  │  OpenAI API ← Vsock/RA-TLS → host                    │  │  │
│  │  └──────────────────────────────────────────────────────┘  │  │
│  │  Hardware-encrypted memory: host cannot read               │  │
│  └────────────────────────────────────────────────────────────┘  │
│       ▲ Vsock                                                    │
│       │                                                          │
│  ┌────┴─────────┐  ┌──────────────┐  ┌────────────────────────┐  │
│  │ a3s-gateway  │  │ a3s-event    │  │ a3s-code               │  │
│  │ (API route)  │  │ (event bus)  │  │ (AI coding agent)      │  │
│  └──────────────┘  └──────────────┘  └────────────────────────┘  │
│                                                                  │
│  Client-side:                                                    │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  a3s-power verify SDK                                      │  │
│  │  Nonce binding · Model hash binding · HW signature check   │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
| Component | Relationship to Power |
|---|---|
| a3s-box | Hosts Power inside TEE-enabled MicroVMs (AMD SEV-SNP / Intel TDX) |
| a3s-code | Uses Power as a local inference backend |
| a3s-gateway | Routes inference requests to Power instances |
| a3s-event | Distributes inference events across the platform |
| verify SDK | Client-side attestation verification (nonce, model hash, HW signature) |
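The verify SDK row lists nonce and model hash binding. Assuming the 64-byte `report_data` layout `[nonce(32)][sha256(32)]` that the roadmap gives for `build_report_data`, the client-side check can be sketched as follows. The names `expected_report_data`, `ct_eq`, and `verify_binding` are hypothetical stand-ins rather than the SDK's actual signatures, and zero-padding of short nonces is an assumption.

```rust
/// Build the expected 64-byte report_data as [nonce(32)][model_sha256(32)].
/// Assumption: nonces shorter than 32 bytes are zero-padded on the right and
/// longer ones are rejected; the real implementation may behave differently.
fn expected_report_data(nonce: &[u8], model_hash: &[u8; 32]) -> Option<[u8; 64]> {
    if nonce.len() > 32 {
        return None;
    }
    let mut out = [0u8; 64];
    out[..nonce.len()].copy_from_slice(nonce);
    out[32..].copy_from_slice(model_hash);
    Some(out)
}

/// Best-effort constant-time equality: every byte is scanned instead of
/// returning at the first mismatch. Production code would normally rely on a
/// vetted crate such as `subtle` rather than a hand-rolled loop.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

/// Check that the report_data taken from an attestation report binds both the
/// nonce the client sent and the model hash the client expects.
fn verify_binding(report_data: &[u8; 64], nonce: &[u8], model_hash: &[u8; 32]) -> bool {
    match expected_report_data(nonce, model_hash) {
        Some(expected) => ct_eq(report_data, &expected),
        None => false,
    }
}
```

A client would generate a fresh random nonce per request, pass it as `?nonce=<hex>`, and reject any report whose `report_data` fails this check. The hardware signature over the report still has to be verified separately.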
Roadmap
Completed
- Core inference engine (llama.cpp, chat templates, tool calling, structured output, thinking)
- Pure Rust inference backend (`mistralrs` feature, default): GGUF inference via candle, no C++ dependency; ideal for TEE supply-chain auditing
- OpenAI-compatible API (`/v1/chat/completions`, `/v1/completions`, `/v1/models`, `/v1/embeddings`)
- Content-addressed model storage with SHA-256
- GPU auto-detection and acceleration (Metal, CUDA, multi-GPU)
- KV cache reuse with prefix matching
- Prometheus metrics and health endpoint
- TEE refactoring: removed Ollama compatibility layer (~6,900 lines deleted)
- HCL-only configuration (removed TOML)
- TEE awareness: `TeeProvider` trait, `DefaultTeeProvider` (SEV-SNP, TDX, Simulated)
- Model integrity verification: SHA-256 at startup
- Privacy protection: `PrivacyProvider` trait, log redaction
- TEE status in `/health` endpoint
- Attestation endpoint: `GET /v1/attestation` for clients to verify TEE
- Memory zeroing: `zeroize` crate, `SensitiveString` auto-zeroize wrapper
- Encrypted model loading: AES-256-GCM, `DecryptedModel` RAII secure wipe, key from file/env
- PrivacyProvider integrated into inference chain: prompt/response wrapped in `SensitiveString`, `sanitize_log` applied at every log site
- EncryptedModel integrated into autoload: `.enc` models auto-detected, decrypted, RAII cleanup on unload/eviction
- TEE metrics: Prometheus counters for attestation reports, model decryptions, and log redactions
- Attestation nonce: `?nonce=<hex>` binds client nonce into `report_data` to prevent replay attacks
- RA-TLS transport (`tls` feature): self-signed ECDSA P-256 cert; `ra_tls = true` embeds JSON attestation report as custom X.509 extension (OID 1.3.6.1.4.1.56560.1.1); TLS server spawned in parallel with plain HTTP
- Vsock transport (`vsock` feature, Linux only): AF_VSOCK server for a3s-box MicroVM guest-host HTTP communication; uses same axum router as TCP; no network config required inside the VM
- SEV-SNP ioctl: real `/dev/sev-guest` ioctl (`SNP_GET_REPORT`) for hardware attestation reports; extracts `report_data` (64 bytes) and `measurement` (48 bytes) from firmware response; full raw report included for client-side verification
- TDX ioctl: real `/dev/tdx-guest` ioctl (`TDX_CMD_GET_REPORT0`) for hardware attestation reports; extracts `reportdata` (64 bytes) and `mrtd` (48 bytes) from TDREPORT; supports both `/dev/tdx-guest` and `/dev/tdx_guest` device paths
- KeyProvider trait: `StaticKeyProvider` (wraps file/env key source) + `RotatingKeyProvider` (multiple keys, zero-downtime rotation via `rotate_key()`); initialized on server startup; `AppState.key_provider` field
- Deep log redaction: `PrivacyProvider` covers 10 sensitive JSON keys; `sanitize_error()` strips prompt fragments from error messages
- Token metric suppression: `suppress_token_metrics` config rounds token counts to nearest 10 to prevent side-channel inference
- In-memory decryption config: `in_memory_decrypt` field; `MemoryDecryptedModel` decrypts into `mlock`-pinned RAM, never writes plaintext to disk
- Rate limiting: token-bucket middleware (`rate_limit_rps`) + concurrency cap (`max_concurrent_requests`) on `/v1/*`; returns `429` with OpenAI-style error
- Model-attestation binding: `build_report_data(nonce, model_hash)` layout `[nonce(32)][sha256(32)]`; `TeeProvider::attestation_report_with_model()` default impl; `GET /v1/attestation?model=<name>` ties attestation to specific model
- Embedding model support: `ModelFormat::HuggingFace` variant; `MistralRsBackend` loads HF embedding models via `EmbeddingModelBuilder` with local path; `POST /v1/embeddings` fully functional; register with `format=huggingface`
- SafeTensors inference: `ModelFormat::SafeTensors` variant; `MistralRsBackend` loads local safetensors chat models via `TextModelBuilder` with ISQ on-load quantization; ISQ type configurable via `default_parameters.isq` (Q4_0, Q4K, Q6K, Q8_0, HQQ4, HQQ8, etc.); defaults to Q8_0; register with `format=safetensors`
- Client attestation verification SDK: `verify` module with `verify_report()`, `verify_nonce_binding()`, `verify_model_hash_binding()`, `verify_measurement()`; `HardwareVerifier` trait for pluggable hardware signature verification; `a3s-power-verify` CLI binary
- Graceful shutdown: SIGTERM + Ctrl-C handled via `shutdown_signal()`; unloads all models (triggers RAII zeroize of decrypted weights); flushes audit log via `AuditLogger::flush()` before exit; `AsyncJsonLinesAuditLogger` flush uses oneshot channel to wait for background writer to drain
- HuggingFace Hub model pull (`hf` feature): `POST /v1/models/pull` downloads GGUF models from HuggingFace Hub; supports `owner/repo:Q4_K_M` (resolves filename via HF API) and `owner/repo/file.gguf` (direct); streams SSE progress events (`resuming`, `downloading`, `verifying`, `success`); resume interrupted downloads via HTTP Range requests (deterministic partial filename = SHA-256 of URL); HF token auth for private/gated models via `token` request field or `HF_TOKEN` env var; stores in content-addressed blob store; SHA-256 verified; `force` flag for re-download
- Pull concurrency control: `Mutex<HashSet>` in `AppState` deduplicates concurrent pulls of the same model; returns `409 Conflict` if a pull is already in progress
- Pull progress persistence: JSON state files in `~/.a3s/power/pulls/`; `GET /v1/models/pull/:name/status` returns `{status, completed, total, error}`; survives server restarts; throttled writes (every 5%) to minimize disk I/O
- True token-by-token streaming: `stream_chat_request` replaces non-streaming path; each `Response::Chunk` forwarded immediately via mpsc channel; `Response::Done` sets `finish_reason`
- Vision/multimodal inference: `ModelFormat::Vision` variant; `MistralRsBackend` loads vision models via `VisionModelBuilder` with ISQ; base64 images accepted via `images` field or OpenAI `image_url` content parts; decoded with `image` + `base64` crates
- picolm backend: pure Rust layer-streaming GGUF inference (`picolm` feature); peak RAM = O(layer_size) not O(model_size); enables 7B+ models in 512MB TEE EPC; zero C dependencies; `GgufFile` mmap reader + top-p sampler in ~860 lines
- EPC memory detection: `tee::epc` module reads `/proc/meminfo`; `BackendRegistry::find_for_tee()` auto-routes to picolm when model exceeds 75% of available EPC
- `LayerStreamingDecryptedModel`: chunk-by-chunk access to AES-256-GCM encrypted models; each chunk returned as `Zeroizing<Vec<u8>>`, zeroized on drop; `streaming_decrypt` config field
- `tee-minimal` feature profile: `picolm` + `tls` + `vsock`; smallest auditable TEE build (~1,220 dep tree lines vs ~2,000 for default); no mistralrs/candle, no C++
- Supply-chain audit document: `docs/supply-chain.md`; per-profile dependency listing, audit status table, threat model
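The rate limiting entry above mentions a token-bucket middleware driven by `rate_limit_rps`. As a rough illustration of the technique only, here is a standalone token bucket using the standard library. The `TokenBucket` type, its `capacity` burst parameter, and passing the clock in as an argument are assumptions made for this sketch; the actual middleware is not shown in this document.

```rust
use std::time::Instant;

/// Token bucket: holds at most `capacity` tokens and refills continuously at
/// `rate_per_sec`. Each admitted request consumes one token.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Start with a full bucket so an initial burst up to `capacity` is allowed.
    fn new(rate_per_sec: f64, capacity: f64, now: Instant) -> Self {
        Self { capacity, tokens: capacity, rate_per_sec, last_refill: now }
    }

    /// Returns true if a request arriving at `now` may proceed. A false result
    /// is the case a server would translate into an HTTP 429 response.
    fn try_acquire(&mut self, now: Instant) -> bool {
        // Credit the tokens earned since the last call, capped at capacity.
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.rate_per_sec).min(self.capacity);
        // Never move the refill timestamp backwards.
        self.last_refill = self.last_refill.max(now);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Taking `now` as a parameter keeps the limiter deterministic under test. A real middleware would call `Instant::now()` at the call site and guard the bucket with a mutex or keep one bucket per client.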
Community
Join us on Discord for questions, discussions, and updates.
License
MIT