# apr-cli
CLI tool for APR model inspection, debugging, and operations.
## Installation
This installs the apr binary.
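A minimal install sketch, assuming the tool builds with Cargo from a local workspace checkout (the crate path is an assumption):

```sh
# Install the apr binary from a local checkout (crate path assumed)
cargo install --path crates/apr-cli
```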
## Features
- Model Inspection: View APR model structure, metadata, and weights
- Debugging: Hex dumps, tree visualization, flow analysis
- Operations: List, compare, and validate APR models
- Kernel Explainability: Explain kernel pipelines, equivalence classes (A-F), and proof status
- TUI Mode: Interactive terminal interface for model exploration
## Usage

```sh
# Show help
# Inspect a model
# List models in directory
# Interactive TUI mode
# Compare two models
```
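The comment outline above can be fleshed out as a sketch; every subcommand name here is an assumption inferred from the comments, so check `apr --help` for the real interface:

```sh
# Show help
apr --help

# Inspect a model (subcommand assumed)
apr inspect model.apr

# List models in a directory (assumed)
apr list ./models

# Interactive TUI mode (assumed)
apr tui model.apr

# Compare two models (assumed)
apr diff a.apr b.apr
```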
## Chat Interface
Interactive chat with language models (supports APR, GGUF, SafeTensors):
```sh
# Chat with a GGUF model (GPU acceleration by default)
# Force CPU inference
# Explicitly request GPU acceleration
# Adjust generation parameters
```
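A hedged sketch of these chat invocations; the `chat` subcommand and all flags are assumptions:

```sh
# Chat with a GGUF model (GPU acceleration by default)
apr chat qwen2.5-coder-1.5b-instruct-q4_k_m.gguf

# Force CPU inference (flag assumed)
apr chat model.gguf --cpu

# Explicitly request GPU acceleration (flag assumed)
apr chat model.gguf --gpu

# Adjust generation parameters (flag names assumed)
apr chat model.gguf --temperature 0.7 --max-tokens 256
```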
## Quantization

### Streaming SafeTensors to Q4K APR (ALB-093)
Quantize sharded HuggingFace SafeTensors models directly to Q4K APR format with bounded memory. No intermediate files required.
```sh
# Quantize sharded SafeTensors model to Q4K APR
# Plan mode — estimate sizes without executing
# Batch quantize to multiple schemes
# JSON output for CI integration
```
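One possible shape for these commands; the `quantize` subcommand and every flag are assumptions, with scheme names taken from the supported-schemes list in this section:

```sh
# Quantize sharded SafeTensors model to Q4K APR (subcommand/flags assumed)
apr quantize ./model-dir --scheme q4k -o model-q4k.apr

# Plan mode: estimate sizes without executing (flag assumed)
apr quantize ./model-dir --scheme q4k --plan

# Batch quantize to multiple schemes (flag assumed)
apr quantize ./model-dir --schemes int8,fp16,q4k

# JSON output for CI integration (flag assumed)
apr quantize ./model-dir --scheme q4k --json
```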
The streaming pipeline reads shards one at a time via mmap, quantizes each tensor individually, and streams the output. Peak memory is bounded by the largest single tensor (~2-4 GB) regardless of total model size.
Supported schemes: `int8` (`i8`, `q8_0`), `int4` (`i4`, `q4_0`), `fp16` (`f16`, `half`), `q4k` (`q4_k`, `q4_k_m`).
Tensor routing: Weight matrices (2D, >= 256 elements) are quantized to Q4K. Norm weights, embeddings, biases, and small tensors are kept at F32 for precision.
## Optional Features

### Pre-flight Capacity Planning
Check if a model fits your GPU before downloading:
```sh
# From HuggingFace (fetches only ~2KB config.json, no weights)
# With quantization override
# From local file
# JSON output
```
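A sketch assuming a `preflight` subcommand (the subcommand name and flags are assumptions; the HuggingFace repo id is only an example):

```sh
# From HuggingFace (fetches only ~2KB config.json, no weights)
apr preflight Qwen/Qwen2.5-Coder-1.5B-Instruct

# With quantization override (flag assumed)
apr preflight Qwen/Qwen2.5-Coder-1.5B-Instruct --quant q4k

# From local file
apr preflight ./model.apr

# JSON output (flag assumed)
apr preflight ./model.apr --json
```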
Reports VRAM budget, throughput estimates, roofline analysis, and contract checks (READY / WARNINGS / BLOCKED verdict).
### Inference Server
Enable the inference feature to serve models via HTTP:
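One way the build might look, assuming `inference` is a Cargo feature and the crate path shown here (both assumptions):

```sh
# Build with the HTTP inference server enabled (crate path assumed)
cargo install --path crates/apr-cli --features inference

# Serve a model (subcommand inferred from the "APR Serve" banner later in this README)
apr serve model.gguf
```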
The server provides an OpenAI-compatible API:
```sh
# Health check
# Chat completions
# Streaming
```
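Since the API is OpenAI-compatible, these requests can be sketched with curl (the host/port and the health-check path are assumptions):

```sh
# Health check (path assumed)
curl http://127.0.0.1:8080/health

# Chat completions
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'

# Streaming (server-sent events)
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
```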
### Debugging with Tracing
Use the X-Trace-Level header to enable inference tracing for debugging:
```sh
# Brick-level tracing (token operations)
# Step-level tracing (forward pass steps)
# Layer-level tracing (per-layer timing)
```
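A curl sketch using the `X-Trace-Level` header from the text (the host/port and request body are assumptions):

```sh
# Brick-level tracing (token operations)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Trace-Level: brick" \
  -d '{"messages": [{"role": "user", "content": "Hi"}]}'

# Step-level tracing: swap the header for -H "X-Trace-Level: step"
# Layer-level tracing: swap the header for -H "X-Trace-Level: layer"
```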
Trace levels:

- `brick`: Token-by-token operation timing
- `step`: Forward pass steps (embed, attention, mlp, lm_head)
- `layer`: Per-layer timing breakdown (24+ layers)
### CUDA GPU Acceleration
Enable CUDA support for NVIDIA GPUs:
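Assuming CUDA support is gated behind a Cargo feature named `cuda` (the feature name and crate path are assumptions):

```sh
# Build with NVIDIA GPU support (feature name and path assumed)
cargo install --path crates/apr-cli --features cuda
```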
#### GPU-Accelerated Server
Start the server with GPU acceleration for maximum throughput:
```sh
# Single-request GPU mode (~83 tok/s on RTX 4090)
# Batched GPU mode - 2.9x faster than Ollama (~850 tok/s)
```
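A sketch of the two modes; the `serve` subcommand is inferred from the server banner below, and the flags are assumptions:

```sh
# Single-request GPU mode (~83 tok/s on RTX 4090)
apr serve model.gguf --gpu

# Batched GPU mode (~850 tok/s; flag assumed)
apr serve model.gguf --gpu --batch
```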
#### Performance Comparison
| Mode | Throughput | vs Ollama | Memory |
|---|---|---|---|
| CPU (baseline) | ~15 tok/s | 0.05x | 1.1 GB |
| GPU (single) | ~83 tok/s | 0.25x | 1.5 GB |
| GPU (batched) | ~850 tok/s | 2.9x | 1.9 GB |
| Ollama | ~333 tok/s | 1.0x | - |
#### GPU Server Output
```text
=== APR Serve ===
Model: qwen2.5-coder-1.5b-instruct-q4_k_m.gguf
Binding: 127.0.0.1:8080
Detected format: GGUF
Loading GGUF model (mmap)...
GGUF loaded: 339 tensors, 26 metadata entries
Building quantized inference model...
Model ready: 28 layers, vocab_size=151936, hidden_dim=1536
Enabling optimized CUDA acceleration (PAR-111)...
Initializing GPU on device 0...
Pre-uploaded 934 MB weights to GPU
CUDA optimized model ready
Performance: 755+ tok/s (2.6x Ollama)
```
#### Example GPU Request

```sh
# Chat completion with GPU acceleration
```
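A sketch of such a request against the OpenAI-compatible endpoint (host/port assumed; the server must already be running in GPU mode):

```sh
# Chat completion with GPU acceleration
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write fizzbuzz in Rust"}]}'
```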
## Examples
```sh
# Run the tracing example
# Run the GPU chat inference example (requires CUDA)
```
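Assuming these are Cargo examples with the names below (both names are guesses):

```sh
# Run the tracing example (example name assumed)
cargo run --example tracing

# Run the GPU chat inference example (requires CUDA; name assumed)
cargo run --example gpu_chat --features cuda
```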
## Performance Testing
Test GPU inference performance:
```sh
# Start GPU server (subcommand and flags are assumptions)
apr serve qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --gpu

# Run benchmark (separate terminal): time repeated completions
for i in $(seq 1 10); do
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}]}' > /dev/null
done
```
## QA and Testing
The apr CLI includes comprehensive QA commands for model validation:
```sh
# Run falsifiable QA checklist
# With custom throughput threshold
# Compare against Ollama
# JSON output for CI integration
```
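A hedged sketch of these validation commands; the `qa` subcommand and every flag here are assumptions:

```sh
# Run falsifiable QA checklist (subcommand assumed)
apr qa model.gguf

# With custom throughput threshold (flag assumed)
apr qa model.gguf --min-throughput 100

# Compare against Ollama (flag assumed)
apr qa model.gguf --compare ollama

# JSON output for CI integration (flag assumed)
apr qa model.gguf --json
```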
For automated QA testing, use the example runners:
```sh
# Full 21-cell QA matrix
# Popperian falsification tests
```
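Assuming the runners are Cargo examples (both names are guesses):

```sh
# Full 21-cell QA matrix (example name assumed)
cargo run --example qa_matrix

# Popperian falsification tests (name assumed)
cargo run --example falsification
```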
## License
MIT