# apr-cli
CLI tool for APR model inspection, debugging, and operations.
## Installation

Installing the package puts the `apr` binary on your `PATH`.
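The install command itself is not shown here. A minimal sketch, assuming the project is a Rust crate built with Cargo (suggested by the feature flags mentioned below):

```bash
# Assumption: the CLI builds with Cargo; run from a checkout of the repository.
cargo install --path .

# Confirm the binary is available
apr --help
```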
## Features
- Model Inspection: View APR model structure, metadata, and weights
- Debugging: Hex dumps, tree visualization, flow analysis
- Operations: List, compare, and validate APR models
- TUI Mode: Interactive terminal interface for model exploration
## Usage
Typical tasks are showing help, inspecting a single model, listing the models in a directory, launching the interactive TUI, and comparing two models. The commands are sketched below.
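The subcommand names in this sketch are guesses based on the feature list above and are not confirmed by this README; check `apr --help` for the real names.

```bash
# Show help
apr --help

# Inspect a model (subcommand name assumed)
apr inspect path/to/model.apr

# List models in a directory (subcommand name assumed)
apr list ./models

# Interactive TUI mode (subcommand name assumed)
apr tui path/to/model.apr

# Compare two models (subcommand name assumed)
apr diff model_a.apr model_b.apr
```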
## Chat Interface
Interactive chat with language models (APR, GGUF, and SafeTensors are supported). Chat uses GPU acceleration by default when available; you can force CPU inference, request the GPU explicitly, or adjust generation parameters, as sketched below.
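A minimal sketch assuming a `chat` subcommand with `--device`, `--temperature`, and `--max-tokens` flags; none of these names are confirmed by this README.

```bash
# Chat with a GGUF model (GPU acceleration by default)
apr chat qwen2.5-coder-1.5b-instruct-q4_k_m.gguf

# Force CPU inference (flag name assumed)
apr chat qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --device cpu

# Explicitly request GPU acceleration (flag name assumed)
apr chat qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --device gpu

# Adjust generation parameters (flag names assumed)
apr chat qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --temperature 0.7 --max-tokens 256
```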
## Optional Features
### Inference Server

Enable the `inference` feature to serve models via HTTP.
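A sketch assuming `inference` is a Cargo feature and the server is started with a `serve` subcommand (the banner later in this README prints `=== APR Serve ===` and binds 127.0.0.1:8080); the subcommand and flag names are assumptions.

```bash
# Build with the HTTP server enabled (feature name from the text above)
cargo install --path . --features inference

# Start the server (subcommand and flag names assumed)
apr serve qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --bind 127.0.0.1:8080
```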
The server provides an OpenAI-compatible API. The examples below cover a health check, a standard chat completion, and a streaming completion.
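The chat-completions path follows the OpenAI convention; the health-check path and the model name in the payload are assumptions.

```bash
# Health check (endpoint path assumed)
curl http://127.0.0.1:8080/health

# Chat completion (OpenAI-compatible)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-coder-1.5b-instruct", "messages": [{"role": "user", "content": "Write a haiku about Rust"}]}'

# Streaming (OpenAI-compatible, server-sent events)
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-coder-1.5b-instruct", "messages": [{"role": "user", "content": "Write a haiku about Rust"}], "stream": true}'
```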
### Debugging with Tracing

Use the `X-Trace-Level` header to enable inference tracing for debugging.
Three trace levels are supported:

- `brick`: Token-by-token operation timing
- `step`: Forward pass steps (embed, attention, mlp, lm_head)
- `layer`: Per-layer timing breakdown (24+ layers)

An example request is sketched below.
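The header name and level values come from this README; the endpoint and payload are the same assumed OpenAI-compatible request as above.

```bash
# Brick-level tracing (token operations); use "step" or "layer" for the other levels
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Trace-Level: brick" \
  -d '{"model": "qwen2.5-coder-1.5b-instruct", "messages": [{"role": "user", "content": "hi"}]}'
```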
### CUDA GPU Acceleration

Enable CUDA support for NVIDIA GPUs.
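A sketch assuming CUDA support is gated behind a Cargo feature named `cuda`; the feature names are not confirmed by this README.

```bash
# Build the CLI with CUDA support (feature names assumed)
cargo install --path . --features "inference,cuda"
```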
#### GPU-Accelerated Server
Start the server with GPU acceleration for maximum throughput. Single-request GPU mode reaches roughly 83 tok/s on an RTX 4090; batched GPU mode reaches roughly 850 tok/s, about 2.9x Ollama. Example commands are sketched below.
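A sketch assuming `--gpu` and `--batch` flags on the `serve` subcommand; the flag names are illustrative, while the throughput figures come from the comparison table below.

```bash
# Single-request GPU mode (~83 tok/s on RTX 4090); flag name assumed
apr serve qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --gpu

# Batched GPU mode (~850 tok/s, 2.9x Ollama); flag names assumed
apr serve qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --gpu --batch
```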
#### Performance Comparison
| Mode | Throughput | vs Ollama | Memory |
|---|---|---|---|
| CPU (baseline) | ~15 tok/s | 0.05x | 1.1 GB |
| GPU (single) | ~83 tok/s | 0.25x | 1.5 GB |
| GPU (batched) | ~850 tok/s | 2.9x | 1.9 GB |
| Ollama | ~333 tok/s | 1.0x | - |
#### GPU Server Output
    === APR Serve ===
    Model: qwen2.5-coder-1.5b-instruct-q4_k_m.gguf
    Binding: 127.0.0.1:8080
    Detected format: GGUF
    Loading GGUF model (mmap)...
    GGUF loaded: 339 tensors, 26 metadata entries
    Building quantized inference model...
    Model ready: 28 layers, vocab_size=151936, hidden_dim=1536
    Enabling optimized CUDA acceleration (PAR-111)...
    Initializing GPU on device 0...
    Pre-uploaded 934 MB weights to GPU
    CUDA optimized model ready
    Performance: 755+ tok/s (2.6x Ollama)
#### Example GPU Request

The example below sends a chat completion to a GPU-accelerated server.
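GPU acceleration is a property of how the server was started, so the request is the same assumed OpenAI-compatible call; the payload fields are illustrative.

```bash
# Chat completion with GPU acceleration (payload fields assumed)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-coder-1.5b-instruct", "messages": [{"role": "user", "content": "Explain mmap in one paragraph"}], "max_tokens": 128}'
```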
## Examples
The repository ships runnable examples, including a tracing example and a GPU chat inference example (the latter requires CUDA). Invocations are sketched below.
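A sketch assuming the examples are Cargo examples whose names are guessed from their descriptions; neither name is confirmed by this README.

```bash
# Run the tracing example (example name assumed)
cargo run --example tracing

# Run the GPU chat inference example (requires CUDA; example and feature names assumed)
cargo run --example gpu_chat --features cuda
```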
## Performance Testing
To test GPU inference performance, start the GPU server, then run a benchmark from a separate terminal by issuing repeated requests in a loop, as sketched below.
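A minimal sketch: the server flags are assumed, and the loop simply times ten sequential completions rather than reporting tokens per second directly.

```bash
# Start GPU server (subcommand and flag names assumed)
apr serve qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --gpu

# Run benchmark (separate terminal): time ten sequential chat completions
for i in $(seq 1 10); do
  time curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen2.5-coder-1.5b-instruct", "messages": [{"role": "user", "content": "Count to ten"}]}' \
    > /dev/null
done
```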
## QA and Testing
The apr CLI includes comprehensive QA commands for model validation:
The checklist can be run with defaults, with a custom throughput threshold, against Ollama as a comparison baseline, or with JSON output for CI integration. Example invocations are sketched below.
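A sketch assuming a `qa` subcommand with `--min-throughput`, `--compare-ollama`, and `--json` flags; none of these names are confirmed by this README.

```bash
# Run falsifiable QA checklist (subcommand name assumed)
apr qa qwen2.5-coder-1.5b-instruct-q4_k_m.gguf

# With custom throughput threshold, in tok/s (flag name assumed)
apr qa qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --min-throughput 400

# Compare against Ollama (flag name assumed)
apr qa qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --compare-ollama

# JSON output for CI integration (flag name assumed)
apr qa qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --json
```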
For automated QA testing, use the example runners: one drives the full 21-cell QA matrix, the other runs the Popperian falsification tests. Invocations are sketched below.
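A sketch assuming the runners are Cargo examples with names guessed from their descriptions; neither name is confirmed by this README.

```bash
# Full 21-cell QA matrix (example name assumed)
cargo run --example qa_matrix

# Popperian falsification tests (example name assumed)
cargo run --example falsification
```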
## License
MIT