## Features

### Library

- LlamaCppClient: Async OpenAI-compatible API client (chat/completion/embeddings)
- HubClient: Pure Rust HuggingFace Hub downloads with progress callbacks
- Server orchestration: Programmatic llama-server lifecycle management
- Benchmark suite: 5-test triage (throughput, tool calls, codegen, reasoning)

### CLI

| Command | Description |
|---|---|
| `lancor pull <repo> [file]` | Download a GGUF model from HuggingFace Hub |
| `lancor list` | List all cached models |
| `lancor search <query>` | Search HuggingFace Hub for models |
| `lancor rm <repo> <file>` | Delete a cached model file |
| `lancor bench <model\|--all>` | Run the benchmark suite |
## Installation

```toml
[dependencies]
lancor = "0.1.0"
tokio = { version = "1.0", features = ["full"] }
```
## Quick Start

A minimal end-to-end example (assumes a llama-server instance reachable on localhost:8080):

```rust
use lancor::LlamaCppClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = LlamaCppClient::default()?; // connects to localhost:8080
    // ... build a request and call client.chat_completion(request).await?
    Ok(())
}
```
## LlamaCppClient

OpenAI-compatible client for the llama.cpp server (`/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`).
```rust
use lancor::LlamaCppClient;

// Create client
let client = LlamaCppClient::new("http://localhost:8080")?;
let client = LlamaCppClient::with_api_key("http://localhost:8080", "secret")?;
let client = LlamaCppClient::default()?; // localhost:8080

// Chat completion (non-streaming)
// (request type names and argument values below are illustrative)
let request = ChatRequest::new()
    .message("user", "Hello!")
    .temperature(0.7)
    .max_tokens(256);
let response = client.chat_completion(request).await?;

// Streaming chat completion
let request = ChatRequest::new()
    .message("user", "Hello!")
    .stream(true)
    .max_tokens(256);
let mut stream = client.chat_completion_stream(request).await?;
while let Some(chunk) = stream.next().await {
    // handle each streamed chunk
}

// Text completion
let request = CompletionRequest::new()
    .prompt("Once upon a time")
    .max_tokens(128)
    .temperature(0.7);
let response = client.completion(request).await?;

// Embeddings
let request = EmbeddingRequest::new("Some text to embed");
let response = client.embedding(request).await?;
let embedding = &response.data[0].embedding;
```
Request builders support: `temperature`, `max_tokens`, `top_p`, `stream`, `stop`, `chat_template_kwargs` (for chat) and `prompt` (for completion).
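When `stream` is enabled, an OpenAI-compatible server emits Server-Sent Events: one JSON chunk per `data:` line, terminated by `data: [DONE]`. A std-only sketch of splitting such a stream (illustrative plumbing, not lancor's internals):

```rust
/// Extract SSE `data:` payloads from a raw event-stream buffer,
/// stopping at the terminal `[DONE]` sentinel.
fn sse_payloads(raw: &str) -> Vec<&str> {
    raw.lines()
        // keep only lines carrying a payload
        .filter_map(|l| l.strip_prefix("data: "))
        // the stream ends with a literal "[DONE]" marker
        .take_while(|p| *p != "[DONE]")
        .collect()
}

fn main() {
    let raw = "data: one\n\ndata: two\n\ndata: [DONE]\n";
    assert_eq!(sse_payloads(raw), vec!["one", "two"]);
    println!("{:?}", sse_payloads(raw));
}
```

In practice each payload is a JSON `chat.completion.chunk` object that a client deserializes before handing it to the caller.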
## HuggingFace Hub
Download and manage GGUF models directly from HuggingFace Hub.
```rust
use lancor::{HubClient, ProgressFn};

// Create client (auto-detects HF_TOKEN or ~/.cache/huggingface/token)
let hub = HubClient::new()?;

// Search models (query string is illustrative)
let results = hub.search("qwen gguf").await?;
for r in results {
    // r: repo id, download count, etc.
}

// List GGUF files in a repo
let files = hub.list_gguf("unsloth/Qwen3.5-35B-A3B-GGUF").await?;
for f in files {
    // f: file name and size
}

// Download with progress (callback signature illustrative)
let progress: ProgressFn = Box::new(|downloaded, total| {
    // report bytes downloaded vs. total
});
let path = hub.download("unsloth/Qwen3.5-35B-A3B-GGUF", "model-Q4_K_M.gguf", progress).await?;
println!("{}", path.display());

// List cached models
let cached = hub.list_cached()?;
for m in cached {
    // m: cached model entry
}

// Delete cached model
hub.delete("unsloth/Qwen3.5-35B-A3B-GGUF", "model-Q4_K_M.gguf").await?;
```
Cache directory: `~/.cache/lancor/models/` (configurable via `HubClient::with_cache_dir(path)`).
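Judging by the `lancor list` output shown later, a repo id `owner/repo` maps to an `owner--repo` directory under the cache root. A small sketch of that layout (the mapping is inferred from the example path, not from lancor's source):

```rust
use std::path::PathBuf;

/// Build the expected cache path for a repo/file pair, assuming the
/// `owner/repo` id becomes an `owner--repo` directory on disk.
fn cache_path(cache_dir: &str, repo: &str, file: &str) -> PathBuf {
    PathBuf::from(cache_dir)
        .join(repo.replace('/', "--")) // "owner/repo" -> "owner--repo"
        .join(file)
}

fn main() {
    let p = cache_path(
        "/home/user/.cache/lancor/models",
        "unsloth/Qwen3.5-35B-A3B-GGUF",
        "model-Q4_K_M.gguf",
    );
    assert_eq!(
        p.to_str().unwrap(),
        "/home/user/.cache/lancor/models/unsloth--Qwen3.5-35B-A3B-GGUF/model-Q4_K_M.gguf"
    );
    println!("{}", p.display());
}
```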
## Server Orchestration

Programmatic control over `llama-server`, `llama-cli`, `llama-quantize`, and `llama-bench`.
### LlamaServer
```rust
use lancor::{LlamaCppClient, LlamaServer, ServerConfig};

// Configure server (argument values shown are illustrative)
let config = ServerConfig::new("/path/to/model.gguf")
    .host("127.0.0.1")
    .port(8080)
    .gpu_layers(99)      // Offload layers to GPU
    .ctx_size(8192)      // Context length
    .parallel(4)         // Parallel sequences
    .threads(8)          // CPU threads
    .batch_size(512)     // Batch size for prompt processing
    .flash_attn(true)    // Enable flash attention
    .mlock(true)         // Lock model in RAM
    .api_key("secret")   // Require API key
    .arg("--no-mmap");   // Extra args

// Start server
let mut server = LlamaServer::start(config)?; // constructor path illustrative
server.wait_healthy().await?;
println!("server is healthy");

// Use with client
let client = LlamaCppClient::new("http://127.0.0.1:8080")?;
// ... make requests

// Stop server
server.stop()?;
```
`ServerConfig` defaults: `host=127.0.0.1`, `port=8080`, `n_gpu_layers=99`, `ctx_size=8192`, `n_parallel=1`, `cont_batching=true`, `metrics=true`.
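Those defaults correspond roughly to this direct llama-server invocation (standard llama.cpp flags; model path illustrative):

```sh
llama-server -m /path/to/model.gguf \
  --host 127.0.0.1 --port 8080 \
  -ngl 99 -c 8192 -np 1 \
  --cont-batching --metrics
```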
### LlamaCli

Run inference with `llama-cli` (captures stdout):
```rust
use lancor::{run_cli, CliConfig};

// Argument values are illustrative
let config = CliConfig::new("/path/to/model.gguf")
    .prompt("Explain GGUF in one sentence.")
    .predict(128)
    .temperature(0.7)
    .interactive(true); // Enable interactive mode

let output = run_cli(config)?;
println!("{output}");
```
### Quantization

```rust
use lancor::{quantize, QuantType};

// Input/output paths are illustrative
quantize("model-f16.gguf", "model-Q4_K_M.gguf", QuantType::Q4_K_M)?;
```
Supported `QuantType` values: `Q4_0`, `Q4_1`, `Q4_K_S`, `Q4_K_M`, `Q5_0`, `Q5_1`, `Q5_K_S`, `Q5_K_M`, `Q6_K`, `Q8_0`, `IQ2_XXS`, `IQ2_XS`, `IQ3_XXS`, `IQ3_S`, `IQ4_NL`, `IQ4_XS`, `F16`, `F32`.
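The wrapper drives the `llama-quantize` binary, whose direct invocation takes the quant type as a positional argument (file names illustrative):

```sh
llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```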
### Raw llama-bench wrapper

```rust
use lancor::bench;

// Signature illustrative: runs llama-bench against a model and captures its output
let output = bench("/path/to/model.gguf")?;
println!("{output}");
```
## Benchmark Suite
5-test triage for comparing model quantizations and sizes.
```rust
use lancor::bench::{compare, print_table, run_suite, run_suite_managed, BenchConfig};
use lancor::ServerConfig;

// Single model (auto-starts/stops server); arguments illustrative
let result = run_suite_managed("/path/to/model.gguf", ServerConfig::default()).await?;

// Against existing server (config type name illustrative)
let cfg = BenchConfig::new()
    .base_url("http://localhost:8080");
let result = run_suite(cfg).await?;

// Compare multiple models
let models = vec!["/path/to/model-Q4_K_M.gguf", "/path/to/model-Q8_0.gguf"];
let results = compare(models).await?;
print_table(&results);
```
Benchmark tests:
- Throughput: tokens/s for prompt processing and generation
- Tool call: single function call accuracy
- Multi-tool: parallel tool invocation (min 5 tools)
- Codegen: fizzbuzz implementation (score 0-4)
- Reasoning: logic puzzle correctness
Output example:

```
┌──────────────────┬───────┬──────────┬──────────┬──────┬───────┬──────┬───────────┐
│ Model            │ Size  │ PP tok/s │ TG tok/s │ Tool │ Multi │ Code │ Reasoning │
├──────────────────┼───────┼──────────┼──────────┼──────┼───────┼──────┼───────────┤
│ Qwen3.5-35B-Q4_K │ 20.1G │ 45.2     │ 128.7    │ ✓    │ 5/5   │ 4/4  │ ✓         │
└──────────────────┴───────┴──────────┴──────────┴──────┴───────┴──────┴───────────┘
```
JSON export: `lancor::bench::to_json(results)`.
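As context for the Codegen test above, a reference fizzbuzz showing the behavior the benchmark asks the model to produce (the 0-4 scoring rubric itself is lancor's, not shown here):

```rust
/// Classic fizzbuzz: multiples of 3 -> "Fizz", of 5 -> "Buzz",
/// of both -> "FizzBuzz", otherwise the number itself.
fn fizzbuzz(n: u32) -> String {
    match (n % 3, n % 5) {
        (0, 0) => "FizzBuzz".to_string(),
        (0, _) => "Fizz".to_string(),
        (_, 0) => "Buzz".to_string(),
        _ => n.to_string(),
    }
}

fn main() {
    assert_eq!(fizzbuzz(3), "Fizz");
    assert_eq!(fizzbuzz(5), "Buzz");
    assert_eq!(fizzbuzz(15), "FizzBuzz");
    assert_eq!(fizzbuzz(7), "7");
    let out: Vec<String> = (1..=15).map(fizzbuzz).collect();
    println!("{}", out.join(" "));
}
```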
## CLI Reference
### `lancor pull <repo> [file]`

Download a GGUF model from HuggingFace Hub.

```sh
# List available GGUF files in a repo
lancor pull unsloth/Qwen3.5-35B-A3B-GGUF

# Download specific file
lancor pull unsloth/Qwen3.5-35B-A3B-GGUF model-Q4_K_M.gguf
```
### `lancor list`

List all cached models.

```sh
lancor list
# Output:
# unsloth/Qwen3.5-35B-A3B-GGUF: model-Q4_K_M.gguf (20.12 GB)
#   /home/user/.cache/lancor/models/unsloth--Qwen3.5-35B-A3B-GGUF/model-Q4_K_M.gguf
```
### `lancor search <query>`

Search HuggingFace Hub for models.

```sh
lancor search qwen
# Output:
# unsloth/Qwen3.5-35B-A3B-GGUF downloads=12345
# ...
```
### `lancor rm <repo> <file>`
Delete a cached model file.
### `lancor bench <model|--all> [options]`

Run the benchmark suite.

```sh
# Benchmark a single model (auto-manages server)
lancor bench model-Q4_K_M.gguf

# Benchmark all cached models
lancor bench --all

# Benchmark against existing server
lancor bench model-Q4_K_M.gguf --url http://localhost:8080

# JSON output
lancor bench --all --json
```
Benchmark options:

- `--label NAME` — Model label for results table
- `--port PORT` — Server port (default: 8080, for auto-managed)
- `--ngl LAYERS` — GPU layers (default: 99)
- `--ctx SIZE` — Context size (default: 8192)
- `--url URL` — Use existing server instead of starting one
- `--all` — Benchmark all cached GGUF models
- `--json` — Output JSON instead of table
## Requirements

- Rust 1.91+
- llama.cpp binaries on PATH: `llama-server`, `llama-cli`, `llama-quantize`, `llama-bench`
- For HubClient: network access to huggingface.co
## Running llama-server manually
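A typical invocation (standard llama.cpp flags; model path illustrative):

```sh
llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 8080
```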
Then use `LlamaCppClient` to interact with it.
## Ecosystem
| Project | What |
|---|---|
| ares | Agentic AI server — uses lancor for local llama.cpp inference |
| pawan | Self-healing CLI coding agent |
| daedra | Web search MCP server |
| thulp | Execution context engineering |
Built by DIRMACS.
## License
GPL-3.0