# apr-cli
CLI tool for APR model inspection, debugging, and operations.
## Installation
This installs the apr binary.
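A minimal install sketch, assuming the tool builds with Cargo from a local workspace checkout (the crate path is an assumption):

```sh
# Install the apr binary from a local checkout (crate path assumed)
cargo install --path crates/apr-cli
```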
## Features
- Model Inspection: View APR model structure, metadata, and weights
- Debugging: Hex dumps, tree visualization, flow analysis
- Operations: List, compare, and validate APR models
- Kernel Explainability: Explain kernel pipelines, equivalence classes (A-F), and proof status
- TUI Mode: Interactive terminal interface for model exploration
## Usage

```sh
# Show help
# Inspect a model
# List models in directory
# Interactive TUI mode
# Compare two models
```
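The comment outline above can be fleshed out as a sketch; every subcommand name here is an assumption inferred from the comments, so check `apr --help` for the real interface:

```sh
# Show help
apr --help

# Inspect a model (subcommand assumed)
apr inspect model.apr

# List models in a directory (assumed)
apr list ./models

# Interactive TUI mode (assumed)
apr tui model.apr

# Compare two models (assumed)
apr diff a.apr b.apr
```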
## Chat Interface
Interactive chat with language models (supports APR, GGUF, SafeTensors):
```sh
# Chat with a GGUF model (GPU acceleration by default)
# Force CPU inference
# Explicitly request GPU acceleration
# Adjust generation parameters
```
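A hedged sketch of these chat invocations; the `chat` subcommand and all flags are assumptions:

```sh
# Chat with a GGUF model (GPU acceleration by default)
apr chat qwen2.5-coder-1.5b-instruct-q4_k_m.gguf

# Force CPU inference (flag assumed)
apr chat model.gguf --cpu

# Explicitly request GPU acceleration (flag assumed)
apr chat model.gguf --gpu

# Adjust generation parameters (flag names assumed)
apr chat model.gguf --temperature 0.7 --max-tokens 256
```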
## Quantization

### Streaming SafeTensors to Q4K APR (ALB-093)
Quantize sharded HuggingFace SafeTensors models directly to Q4K APR format with bounded memory. No intermediate files required.
```sh
# Quantize sharded SafeTensors model to Q4K APR
# Plan mode — estimate sizes without executing
# Batch quantize to multiple schemes
# JSON output for CI integration
```
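One possible shape for these commands; the `quantize` subcommand and every flag are assumptions, with scheme names taken from the supported-schemes list in this section:

```sh
# Quantize sharded SafeTensors model to Q4K APR (subcommand/flags assumed)
apr quantize ./model-dir --scheme q4k -o model-q4k.apr

# Plan mode: estimate sizes without executing (flag assumed)
apr quantize ./model-dir --scheme q4k --plan

# Batch quantize to multiple schemes (flag assumed)
apr quantize ./model-dir --schemes int8,fp16,q4k

# JSON output for CI integration (flag assumed)
apr quantize ./model-dir --scheme q4k --json
```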
The streaming pipeline reads shards one at a time via mmap, quantizes each tensor individually, and streams the output. Peak memory is bounded by the largest single tensor (~2-4 GB) regardless of total model size.
Supported schemes: `int8` (`i8`, `q8_0`), `int4` (`i4`, `q4_0`), `fp16` (`f16`, `half`), `q4k` (`q4_k`, `q4_k_m`).
Tensor routing: Weight matrices (2D, >= 256 elements) are quantized to Q4K. Norm weights, embeddings, biases, and small tensors are kept at F32 for precision.
## Optional Features

### Pre-flight Capacity Planning
Check if a model fits your GPU before downloading:
```sh
# From HuggingFace (fetches only ~2KB config.json, no weights)
# With quantization override
# From local file
# JSON output
```
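A sketch assuming a `preflight` subcommand (the subcommand name and flags are assumptions; the HuggingFace repo id is only an example):

```sh
# From HuggingFace (fetches only ~2KB config.json, no weights)
apr preflight Qwen/Qwen2.5-Coder-1.5B-Instruct

# With quantization override (flag assumed)
apr preflight Qwen/Qwen2.5-Coder-1.5B-Instruct --quant q4k

# From local file
apr preflight ./model.apr

# JSON output (flag assumed)
apr preflight ./model.apr --json
```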
Reports VRAM budget, throughput estimates, roofline analysis, and contract checks (READY / WARNINGS / BLOCKED verdict).
### Inference Server
Enable the inference feature to serve models via HTTP:
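One way the build might look, assuming `inference` is a Cargo feature and the crate path shown here (both assumptions):

```sh
# Build with the HTTP inference server enabled (crate path assumed)
cargo install --path crates/apr-cli --features inference

# Serve a model (subcommand inferred from the "APR Serve" banner later in this README)
apr serve model.gguf
```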
The server provides an OpenAI-compatible API:
```sh
# Health check
# Chat completions
# Streaming
```
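Since the API is OpenAI-compatible, these requests can be sketched with curl (the host/port and the health-check path are assumptions):

```sh
# Health check (path assumed)
curl http://127.0.0.1:8080/health

# Chat completions
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'

# Streaming (server-sent events)
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
```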
### Debugging with Tracing
Use the X-Trace-Level header to enable inference tracing for debugging:
```sh
# Brick-level tracing (token operations)
# Step-level tracing (forward pass steps)
# Layer-level tracing (per-layer timing)
```
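A curl sketch using the `X-Trace-Level` header from the text (the host/port and request body are assumptions):

```sh
# Brick-level tracing (token operations)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Trace-Level: brick" \
  -d '{"messages": [{"role": "user", "content": "Hi"}]}'

# Step-level tracing: swap the header for -H "X-Trace-Level: step"
# Layer-level tracing: swap the header for -H "X-Trace-Level: layer"
```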
Trace levels:

- `brick`: Token-by-token operation timing
- `step`: Forward pass steps (embed, attention, mlp, lm_head)
- `layer`: Per-layer timing breakdown (24+ layers)
### CUDA GPU Acceleration
Enable CUDA support for NVIDIA GPUs:
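Assuming CUDA support is gated behind a Cargo feature named `cuda` (the feature name and crate path are assumptions):

```sh
# Build with NVIDIA GPU support (feature name and path assumed)
cargo install --path crates/apr-cli --features cuda
```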
#### GPU-Accelerated Server
Start the server with GPU acceleration for maximum throughput:
```sh
# Single-request GPU mode (~83 tok/s on RTX 4090)
# Batched GPU mode - 2.9x faster than Ollama (~850 tok/s)
```
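A sketch of the two modes; the `serve` subcommand is inferred from the server banner below, and the flags are assumptions:

```sh
# Single-request GPU mode (~83 tok/s on RTX 4090)
apr serve model.gguf --gpu

# Batched GPU mode (~850 tok/s; flag assumed)
apr serve model.gguf --gpu --batch
```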
#### Performance Comparison
| Mode | Throughput | vs Ollama | Memory |
|---|---|---|---|
| CPU (baseline) | ~15 tok/s | 0.05x | 1.1 GB |
| GPU (single) | ~83 tok/s | 0.25x | 1.5 GB |
| GPU (batched) | ~850 tok/s | 2.9x | 1.9 GB |
| Ollama | ~333 tok/s | 1.0x | - |
#### GPU Server Output
```text
=== APR Serve ===
Model: qwen2.5-coder-1.5b-instruct-q4_k_m.gguf
Binding: 127.0.0.1:8080
Detected format: GGUF
Loading GGUF model (mmap)...
GGUF loaded: 339 tensors, 26 metadata entries
Building quantized inference model...
Model ready: 28 layers, vocab_size=151936, hidden_dim=1536
Enabling optimized CUDA acceleration (PAR-111)...
Initializing GPU on device 0...
Pre-uploaded 934 MB weights to GPU
CUDA optimized model ready
Performance: 755+ tok/s (2.6x Ollama)
```
#### Example GPU Request

```sh
# Chat completion with GPU acceleration
```
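A sketch of such a request against the OpenAI-compatible endpoint (host/port assumed; the server must already be running in GPU mode):

```sh
# Chat completion with GPU acceleration
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write fizzbuzz in Rust"}]}'
```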
## Examples
```sh
# Run the tracing example
# Run the GPU chat inference example (requires CUDA)
```
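Assuming these are Cargo examples with the names below (both names are guesses):

```sh
# Run the tracing example (example name assumed)
cargo run --example tracing

# Run the GPU chat inference example (requires CUDA; name assumed)
cargo run --example gpu_chat --features cuda
```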
## Performance Testing
Test GPU inference performance:
```sh
# Start GPU server (subcommand and flags are assumptions)
apr serve qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --gpu

# Run benchmark (separate terminal): time repeated completions
for i in $(seq 1 10); do
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}]}' > /dev/null
done
```
## QA and Testing
The apr CLI includes comprehensive QA commands for model validation:
```sh
# Run falsifiable QA checklist
# With custom throughput threshold
# Compare against Ollama
# JSON output for CI integration
```
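A hedged sketch of these validation commands; the `qa` subcommand and every flag here are assumptions:

```sh
# Run falsifiable QA checklist (subcommand assumed)
apr qa model.gguf

# With custom throughput threshold (flag assumed)
apr qa model.gguf --min-throughput 100

# Compare against Ollama (flag assumed)
apr qa model.gguf --compare ollama

# JSON output for CI integration (flag assumed)
apr qa model.gguf --json
```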
For automated QA testing, use the example runners:
```sh
# Full 21-cell QA matrix
# Popperian falsification tests
```
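Assuming the runners are Cargo examples (both names are guesses):

```sh
# Full 21-cell QA matrix (example name assumed)
cargo run --example qa_matrix

# Popperian falsification tests (name assumed)
cargo run --example falsification
```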
## License
MIT