apr-cli
CLI tool for APR model inspection, debugging, and operations.
Installation
Installing the crate from crates.io provides the apr binary.
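A minimal install sketch, assuming the standard crates.io workflow for the apr-cli crate:

```bash
# Install the CLI from crates.io; this places the apr binary on PATH
cargo install apr-cli
```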
Features
- Model Inspection: View APR model structure, metadata, and weights
- Debugging: Hex dumps, tree visualization, flow analysis
- Operations: List, compare, and validate APR models
- TUI Mode: Interactive terminal interface for model exploration
Usage
Typical operations include showing help, inspecting a model, listing the models in a directory, launching the interactive TUI mode, and comparing two models.
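A minimal sketch of these invocations; the subcommand names (inspect, list, tui, diff) are assumptions and may not match the actual CLI:

```bash
# Show help
apr --help

# Inspect a model (subcommand name is an assumption)
apr inspect model.apr

# List models in a directory (subcommand name is an assumption)
apr list ./models

# Interactive TUI mode (subcommand name is an assumption)
apr tui model.apr

# Compare two models (subcommand name is an assumption)
apr diff model_a.apr model_b.apr
```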
Chat Interface
Interactive chat with language models (supports APR, GGUF, SafeTensors):
Chat runs with GPU acceleration by default for GGUF models; CPU inference can be forced, GPU acceleration requested explicitly, and generation parameters adjusted.
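A sketch of chat invocations; the chat subcommand and the device and sampling flags (--cpu, --gpu, --temperature, --max-tokens) are assumptions:

```bash
# Chat with a GGUF model (GPU acceleration by default)
apr chat qwen2.5-coder-1.5b-instruct-q4_k_m.gguf

# Force CPU inference (flag name is an assumption)
apr chat model.gguf --cpu

# Explicitly request GPU acceleration (flag name is an assumption)
apr chat model.gguf --gpu

# Adjust generation parameters (flag names are assumptions)
apr chat model.gguf --temperature 0.7 --max-tokens 256
```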
Optional Features
Inference Server
Enable the inference feature to serve models via HTTP:
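A sketch, assuming the inference cargo feature named above and a serve subcommand inferred from the server output shown later; the bind flag is an assumption:

```bash
# Build the CLI with the HTTP inference server enabled
cargo install apr-cli --features inference

# Serve a model over HTTP (bind flag name is an assumption)
apr serve model.gguf --bind 127.0.0.1:8080
```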
The server provides an OpenAI-compatible API:
It exposes a health check, chat completions, and streaming completions.
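Example requests; the /v1/chat/completions route and the stream field follow the OpenAI convention, while the health-check path is an assumption:

```bash
# Health check (path is an assumption)
curl http://127.0.0.1:8080/health

# Chat completions (OpenAI-compatible route)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'

# Streaming (OpenAI-compatible stream field)
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
```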
Debugging with Tracing
Use the X-Trace-Level header to enable inference tracing for debugging:
Tracing can be requested at brick, step, or layer granularity.
Trace levels:
- brick: Token-by-token operation timing
- step: Forward pass steps (embed, attention, mlp, lm_head)
- layer: Per-layer timing breakdown (24+ layers)
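A tracing sketch; only the X-Trace-Level header and its three values come from the text above, and the rest of the request mirrors the chat-completions example:

```bash
# Layer-level tracing; use "brick" or "step" for the other granularities
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Trace-Level: layer" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'
```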
CUDA GPU Acceleration
Enable CUDA support for NVIDIA GPUs:
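A sketch, assuming CUDA support is gated behind a cargo feature named cuda (the feature name is an assumption):

```bash
# Rebuild with CUDA support (feature name is an assumption)
cargo install apr-cli --features "inference,cuda"
```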
GPU-Accelerated Server
Start the server with GPU acceleration for maximum throughput:
Single-request GPU mode reaches roughly 83 tok/s on an RTX 4090; batched GPU mode reaches roughly 850 tok/s, about 2.9x faster than Ollama.
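A sketch of the two server modes; the --gpu and --batch flags are assumptions:

```bash
# Single-request GPU mode (~83 tok/s on RTX 4090); flag name is an assumption
apr serve model.gguf --gpu

# Batched GPU mode (~850 tok/s, ~2.9x Ollama); flag name is an assumption
apr serve model.gguf --gpu --batch
```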
Performance Comparison
| Mode | Throughput | vs Ollama | Memory |
|---|---|---|---|
| CPU (baseline) | ~15 tok/s | 0.05x | 1.1 GB |
| GPU (single) | ~83 tok/s | 0.25x | 1.5 GB |
| GPU (batched) | ~850 tok/s | 2.9x | 1.9 GB |
| Ollama | ~333 tok/s | 1.0x | - |
GPU Server Output
=== APR Serve ===
Model: qwen2.5-coder-1.5b-instruct-q4_k_m.gguf
Binding: 127.0.0.1:8080
Detected format: GGUF
Loading GGUF model (mmap)...
GGUF loaded: 339 tensors, 26 metadata entries
Building quantized inference model...
Model ready: 28 layers, vocab_size=151936, hidden_dim=1536
Enabling optimized CUDA acceleration (PAR-111)...
Initializing GPU on device 0...
Pre-uploaded 934 MB weights to GPU
CUDA optimized model ready
Performance: 755+ tok/s (2.6x Ollama)
Example GPU Request
Send a chat completion request to the GPU-accelerated server.
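A sketch reusing the OpenAI-compatible route from the server section; the prompt and max_tokens value are illustrative:

```bash
# Chat completion against the GPU-accelerated server
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Rust function that reverses a string."}],
        "max_tokens": 128
      }'
```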
Examples
The crate includes a tracing example and a GPU chat inference example (the latter requires CUDA).
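A sketch, assuming both are cargo examples whose names match their descriptions (the example names are assumptions):

```bash
# Run the tracing example (example name is an assumption)
cargo run --example tracing

# Run the GPU chat inference example (requires CUDA; example name and feature flag are assumptions)
cargo run --example gpu_chat --features cuda
```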
Performance Testing
Test GPU inference performance:
Start the GPU server, then run a benchmark loop from a separate terminal.
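A benchmark sketch; the server flag, request count, and request body are assumptions that mirror the earlier examples:

```bash
# Terminal 1: start the GPU server (flag name is an assumption)
apr serve model.gguf --gpu

# Terminal 2: time a fixed batch of requests
time for i in $(seq 1 20); do
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}' \
    > /dev/null
done
```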
QA and Testing
The apr CLI includes comprehensive QA commands for model validation:
The checklist can be run directly, with a custom throughput threshold, against an Ollama baseline, or with JSON output for CI integration.
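A sketch of the QA invocations; the qa subcommand and every flag shown are assumptions drawn from the descriptions above:

```bash
# Run the falsifiable QA checklist (subcommand name is an assumption)
apr qa model.gguf

# With a custom throughput threshold (flag name is an assumption)
apr qa model.gguf --min-throughput 400

# Compare against Ollama (flag name is an assumption)
apr qa model.gguf --compare-ollama

# JSON output for CI integration (flag name is an assumption)
apr qa model.gguf --json
```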
For automated QA testing, use the example runners:
These cover the full 21-cell QA matrix and the Popperian falsification tests.
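A sketch, assuming the runners are cargo examples and that their names match their descriptions (the example names are assumptions):

```bash
# Full 21-cell QA matrix (example name is an assumption)
cargo run --example qa_matrix

# Popperian falsification tests (example name is an assumption)
cargo run --example falsification
```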
License
MIT