# AGENTS.md — llm-manager
## Project overview
**llm-manager** is a terminal UI (TUI) for managing local LLMs. It searches HuggingFace, downloads GGUF models, loads them via llama.cpp's `llama-server`, and lets you chat with them.
**Stack:** Rust (2024 edition), ratatui 0.29, crossterm 0.28, tokio, reqwest, axum.
## Directory structure
```
src/
├── main.rs # Entry point, event loop, model discovery, metrics polling
├── config.rs # Config loading/saving, YAML-based, profiles, presets
├── models.rs # Domain types (SearchResult, DownloadState, ModelSettings, etc.)
├── serve.rs # Standalone serve mode CLI (--model, --profile, --api-port, --api-key)
├── serve_api.rs # Axum-based API proxy server for serve mode
├── backend/
│ ├── hub.rs # HuggingFace API: search, list files, download
│ ├── server.rs # llama.cpp server spawning (resolve_backend_binary, spawn_server)
│ ├── benchmark.rs # Benchmark tuning system (RuntimeOnly and Full modes)
│ ├── hardware.rs # GPU detection (AMD/NVIDIA/Intel), platform detection
│ ├── tls.rs # TLS certificate generation for secure connections
│ └── ws_server.rs # WebSocket metrics server
├── tui/
│ ├── mod.rs # Module declaration, format_size/format_number helpers
│ ├── app.rs # App::new() — all state initialization
│ ├── app/types.rs # App struct and all state sub-structs
│ ├── app/*.rs # app/ — async/sync operations, help, pickers, panels
│ ├── event/ # Keyboard/mouse event handling
│ ├── render.rs # Top-level render dispatcher
│ └── panel/ # Individual panel render functions
├── config/
│ ├── model_config.rs # Per-model YAML config store
│ ├── profiles.rs # Profile YAML store
│ └── presets.rs # System prompt preset YAML store
└── ws_server.rs
```
## Key architectural patterns
### App state machine (`src/tui/app/types.rs`)
`App` holds all state. `models_mode` is the mode enum that controls rendering:
```rust
pub enum ModelsMode {
List, // Local model list
Search { query, results, sort_by, show_readme, page, loading, has_more },
Files { model_id, files, selected_idx, previous_query, previous_results, selected_result },
}
```
`ActivePanel` enum controls which panel has focus:
```rust
pub enum ActivePanel {
Models, Log, ServerSettings, LlmSettings, Profiles,
SystemPromptPresets, SearchReadme, ActiveModel, ModelInfo, Downloads,
}
```
`GlobalMode` enum controls overlay modes:
```rust
pub enum GlobalMode {
Normal,
CmdLine { cmd_line: String },
HostPicker { entries: Vec<(String, String)>, selected: usize },
BackendPicker { entries: Vec<(Backend, Option<String>)>, selected: usize },
ProfilePicker { entries, selected },
PromptPicker { entries, selected, editing, edit_buffer, edit_cursor_pos, confirm_delete },
Confirmation { selected: bool, kind: ConfirmationKind },
RpcManager,
About,
MaxConcurrentPicker { value: String },
BenchTuneSetup { config, selected_idx, bench_mode_selection, editing_prompt, editing_kwargs },
}
```
`ConfirmationKind` variants: `Exit`, `Reset`, `Delete`, `Unload`, `DeleteBackend`.
`LoadingPhase` variants: `ServerStarting`, `LoadingModel`, `LoadingMeta`, `LoadingTensors`, `ServerListening`, `Complete`.
### Log panel expand/collapse
The `App` struct has a `log_expanded: bool` field. When true:
- Layout switches to 2-chunk: status bar + log fills remaining space
- Models panel, Settings panel, and active model info are hidden
- Log panel shows `[Enter] expand` / `[Esc] collapse` hint in status bar
### Event handling (`src/tui/event/mod.rs`)
Key handling is hierarchical:
1. Global shortcuts (Ctrl+C, Tab, Ctrl+H, etc.)
2. CmdLine overlay
3. Exit/Reset confirmation
4. Version picker mode (takes priority when `ModelsMode::VersionPicker`)
5. Search mode (takes priority when `ModelsMode::Search`)
6. Files mode
7. Download mode
8. Normal mode → dispatch to panel-specific handlers
**Important:** Each branch ends with `return` so a key never falls through to a lower-priority handler. A branch added for a new mode must return early in the same way.
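The priority order above can be pictured with a minimal, std-only sketch. `Overlay`, `Mode`, and `dispatch` are hypothetical, trimmed-down stand-ins for `GlobalMode`, `ModelsMode`, and the real handler, and only a few of the eight levels are modelled. The point is the early `return` in every branch.

```rust
// Hypothetical, simplified stand-ins for the real GlobalMode / ModelsMode enums.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Overlay {
    Normal,
    CmdLine,
    Confirmation,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Mode {
    List,
    Search,
    Files,
}

/// Returns the name of the handler that consumed the key.
pub fn dispatch(key: char, ctrl: bool, overlay: Overlay, mode: Mode) -> &'static str {
    // Global shortcuts win over everything else.
    if ctrl && key == 'c' {
        return "global";
    }
    // Overlays swallow keys before any models mode sees them.
    match overlay {
        Overlay::CmdLine => return "cmd_line",
        Overlay::Confirmation => return "confirmation",
        Overlay::Normal => {}
    }
    // Mode-specific branches, each ending in an early return.
    match mode {
        Mode::Search => return "search",
        Mode::Files => return "files",
        Mode::List => {}
    }
    // Nothing claimed the key: hand it to the focused panel.
    "panel"
}
```

Without the early returns, a key typed into the command line could also reach the search or panel handlers.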
### Token count display
Token count (`ctx_used`) is primarily driven by the `/metrics` endpoint. Log-parsing of `n_tokens` is used as a fallback if the API reports 0, ensuring real-time updates during inference. This hybrid approach prevents "stuck" values while maintaining accuracy.
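The fallback rule can be expressed in a few lines. This is a sketch only: `effective_ctx_used` and its parameter names are hypothetical, and the real code may store these values differently.

```rust
/// Prefer the value reported by `/metrics`; fall back to the log-parsed
/// `n_tokens` only when the API reports 0.
pub fn effective_ctx_used(metrics_ctx_used: u64, log_n_tokens: Option<u64>) -> u64 {
    if metrics_ctx_used > 0 {
        metrics_ctx_used
    } else {
        // No log line seen yet: keep showing 0.
        log_n_tokens.unwrap_or(0)
    }
}
```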
### Search filtering (`src/backend/hub.rs`)
Search uses `&filter=gguf` on the HuggingFace API URL so the API itself only returns GGUF models. A post-filter then checks that the model_id contains the search query (case-insensitive). Default 70 results per page, max 200.
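A minimal sketch of the post-filter and page-size rule follows. The function names, and the choice to clamp oversized requests down to the maximum, are assumptions for illustration; the model IDs in the usage note are made up.

```rust
pub const DEFAULT_LIMIT: u32 = 70;
pub const MAX_LIMIT: u32 = 200;

/// Page size to request: the default when unset, never above the maximum.
/// (Clamping is an assumption; the real code may reject large values instead.)
pub fn effective_limit(requested: Option<u32>) -> u32 {
    requested.unwrap_or(DEFAULT_LIMIT).min(MAX_LIMIT)
}

/// Keep only model IDs that contain the query, ignoring case.
pub fn post_filter<'a>(model_ids: &[&'a str], query: &str) -> Vec<&'a str> {
    let needle = query.to_lowercase();
    model_ids
        .iter()
        .copied()
        .filter(|id| id.to_lowercase().contains(&needle))
        .collect()
}
```

For example, filtering `["example-org/Llama-Demo-GGUF", "example-org/Other-GGUF"]` with the query `llama` keeps only the first entry.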
### Panel help
- **Ctrl+H** — Panel-specific help via `render_panel()` (contextual for current panel)
- Panel-specific help content is generated by `App::panel_help_lines()` (app.rs)
### Rendering (`src/tui/render.rs`)
Top-level layout: status bar → top panels → active model → log. The models panel renders differently based on `models_mode`.
### Download cancellation
Download runs in a spawned tokio task. Cancellation uses `Arc<AtomicBool>` shared between the task and the UI. Pressing `c` sets the flag; the download loop checks it each iteration.
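The cancellation pattern looks roughly like this. To stay dependency-free the sketch swaps the tokio task for `std::thread`, and `download_loop` and `run_cancelled_demo` are hypothetical names; only the shared `Arc<AtomicBool>` idea is taken from the description above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Pretends to download `total_chunks` chunks, checking the flag before each.
/// Returns how many chunks were processed before finishing or being cancelled.
pub fn download_loop(cancel: &AtomicBool, total_chunks: usize) -> usize {
    let mut done = 0;
    for _ in 0..total_chunks {
        if cancel.load(Ordering::Relaxed) {
            break; // the UI asked us to stop
        }
        done += 1; // stand-in for "read one chunk and write it to disk"
    }
    done
}

/// Shares one flag between a "UI" side and a worker thread.
pub fn run_cancelled_demo() -> usize {
    let cancel = Arc::new(AtomicBool::new(false));
    let worker_flag = Arc::clone(&cancel);
    // UI side: the user pressed `c`. Set before spawning so the demo is deterministic.
    cancel.store(true, Ordering::Relaxed);
    let worker = std::thread::spawn(move || download_loop(&worker_flag, 10));
    worker.join().expect("worker thread panicked")
}
```

Because the flag is only polled once per iteration, cancellation takes effect at the next chunk boundary rather than immediately.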
### Backend picker
The backend picker allows selecting llama.cpp binary versions per-backend. Triggered from the "LLama.cpp Version" field in LLM Settings.
- `Enter` selects backend+version for the active backend
- `d` deletes a backend version from disk (with `ConfirmationKind::DeleteBackend`)
- `Esc` exits back to settings
**HostPicker** (`GlobalMode::HostPicker`): Shows network interfaces and their IPs. `Enter` selects host, `d` refreshes, `Ctrl+H` closes.
**MaxConcurrentPicker** (`GlobalMode::MaxConcurrentPicker`): Numeric entry modal for max concurrent predictions (1-10).
**BenchTuneSetup** (`GlobalMode::BenchTuneSetup`): Full benchmark configuration modal. `Alt+m` toggles benchmark mode (RuntimeOnly vs Full), `Alt+p` edits prompt, `Alt+n` edits n_predict, `Alt+i` edits iterations, `Alt+c` edits chat template kwargs. Space toggles parameter enablement, Enter starts benchmark.
**YarnRoPESettings** (`GlobalMode::YarnRoPESettings`): YaRN RoPE parameter modal. `selected_field` values: -1 (Yarn RoPE enabled toggle), 0 (rope_scale), 1 (rope_freq_base), 2 (rope_freq_scale). `Enter` edits the selected field or applies the value. `Space` toggles Yarn RoPE enabled. `Up`/`Down` cycles fields. `Esc` cancels editing or exits. Only digits, `.`, `-`, `e`, `E` are accepted as input.
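The input filter and field cycling for the YaRN RoPE modal might be sketched as follows. The function names are hypothetical, and wrap-around at either end of the `-1..=2` range is an assumption; the description above only says that `Up`/`Down` cycles fields.

```rust
/// Only digits, `.`, `-`, `e`, and `E` are accepted while editing a value.
pub fn is_accepted_input(c: char) -> bool {
    c.is_ascii_digit() || matches!(c, '.' | '-' | 'e' | 'E')
}

/// `Down`: move to the next field in -1..=2 (wrap-around assumed).
pub fn next_field(selected: i32) -> i32 {
    if selected >= 2 { -1 } else { selected + 1 }
}

/// `Up`: move to the previous field in -1..=2 (wrap-around assumed).
pub fn prev_field(selected: i32) -> i32 {
    if selected <= -1 { 2 } else { selected - 1 }
}
```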
### Dirty tracking (`is_settings_dirty` in `app.rs`)
Compares each field index-by-index. When a field is dirty, its label is rendered in yellow.
**Index consistency** — all indices must be identical across:
- `settings.rs` dirty check match arms (line ~133)
- `event.rs` `apply_numeric_setting` / `adjust_setting` match arms
- `event.rs` `handle_settings_key` toggle shortcuts (`e` / `Ctrl+E`)
- `event.rs` comment block (line ~836)
- `app.rs` `is_settings_dirty` match arms
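As an illustration of the index-by-index comparison, here is a minimal sketch covering three fields. The `Settings` struct and the `IDX_*` constants are hypothetical; the index values follow the standard field table (Context length 0, Keep in memory 2, GPU Layers 3). Named constants are shown only as one possible way to keep the match arms aligned, not as a description of the current code.

```rust
// Hypothetical, reduced settings struct for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Settings {
    pub context_length: u32,
    pub mlock: bool,
    pub gpu_layers: u32,
}

// Field indices as they appear in the LLM Settings panel.
pub const IDX_CONTEXT_LENGTH: usize = 0;
pub const IDX_MLOCK: usize = 2;
pub const IDX_GPU_LAYERS: usize = 3;

/// True when the field at `idx` differs from the saved value.
/// Indices not modelled in this sketch are reported as clean.
pub fn is_field_dirty(idx: usize, current: &Settings, saved: &Settings) -> bool {
    match idx {
        IDX_CONTEXT_LENGTH => current.context_length != saved.context_length,
        IDX_MLOCK => current.mlock != saved.mlock,
        IDX_GPU_LAYERS => current.gpu_layers != saved.gpu_layers,
        _ => false,
    }
}
```

If one match block used a different number for the same field, the yellow highlight would land on the wrong label, which is why the indices must agree everywhere.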
### LLM Settings panel (55 fields in expert mode, 28 standard)
Standard fields (always visible):
```
Loading (0-2): Context length, System prompt preset, Keep in memory (mlock)
GPU (3-8): GPU Layers, Flash Attention, KV Cache Offload, Cache Type K, Cache Type V, Active Experts
Evaluation (9-11): Eval Batch, Unified KV, Max Concurrent Predictions
Sampling (12-17): Seed, Temperature, Top-k, Top-p, Min P, Max Tokens
Repetition (18-21): Repetition Penalty, Rep. Last N, Presence Penalty, Frequency Penalty
Backend (22): Tags (Enter to edit)
Backend (23): LLama.cpp Version (shows CPU / Vulkan / ROCm / CUDA versions)
Yarn RoPE (24-25): Yarn RoPE (toggle), Yarn Params (opens modal with rope_scale, rope_freq_base, rope_freq_scale)
MTP (26-27): Enable MTP, Draft Tokens
```
Expert fields (toggle with `Ctrl+X`):
```
Expert Loading (28-33): Threads Batch, UBatch Size, Keep, SWA Full, MMap, NUMA
Expert GPU (34-41): Split Mode, Tensor Split, Main GPU, Fit, LoRA, LoRA Scaled, RPC, Embedding
Expert Sampling (42-47): Typical P, Mirostat, Mirostat LR, Mirostat Ent, Ignore EOS, Samplers
Expert DRY (48-51): DRY Multiplier, DRY Base, DRY Allowed Length, DRY Penalty Last N
Expert Server (52-54): Cache Prompt, Cache Reuse, WebUI
```
### Expert mode
Toggle with `Ctrl+X` in the LLM Settings panel. Shows/hides fields 28-54. Resets selected index and clears render cache on toggle.
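The toggle behaviour can be sketched as below. `SettingsView` and its fields are hypothetical stand-ins for the relevant `App` state, and resetting the selection to index 0 is an assumption about what "resets" means here.

```rust
pub const STANDARD_FIELDS: usize = 28; // indices 0-27
pub const TOTAL_FIELDS: usize = 55; // indices 0-54 in expert mode

// Hypothetical stand-in for the relevant parts of `App`.
#[derive(Debug, Default)]
pub struct SettingsView {
    pub expert: bool,
    pub selected: usize,
    pub render_cache: Option<String>,
}

impl SettingsView {
    /// Number of fields the panel currently shows.
    pub fn visible_fields(&self) -> usize {
        if self.expert { TOTAL_FIELDS } else { STANDARD_FIELDS }
    }

    /// `Ctrl+X`: flip expert mode, reset the selection, drop cached rendering.
    pub fn toggle_expert(&mut self) {
        self.expert = !self.expert;
        self.selected = 0; // avoids pointing at a field that is now hidden
        self.render_cache = None;
    }
}
```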
### GPU Layers cycling (`src/models.rs::GpuLayersMode`)
```rust
pub enum GpuLayersMode {
Auto, // llama.cpp auto-detects based on VRAM (default)
Specific(u32), // exact number of layers
All, // -ngl 999 (all layers)
}
```
Arrow keys cycle through modes: `Auto` → `1` → `2` → ... → `N` (total layers) → `All` → `Auto`.
**VRAM estimation** (`src/models.rs::estimate_vram_mib`):
- `Auto` uses a heuristic (~60% of total layers)
- `Specific(n)` uses exactly `n` layers
- `All` uses all layers
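The cycling order and the layer count fed into the estimate can be sketched as follows. `next_mode` and `layers_for_estimate` are hypothetical helpers, not the signature of `estimate_vram_mib`, and rounding the 60% heuristic down is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuLayersMode {
    Auto,
    Specific(u32),
    All,
}

/// Forward cycle: Auto -> 1 -> 2 -> ... -> N (total layers) -> All -> Auto.
pub fn next_mode(mode: GpuLayersMode, total_layers: u32) -> GpuLayersMode {
    match mode {
        // With no known layers there is nothing specific to step through.
        GpuLayersMode::Auto if total_layers == 0 => GpuLayersMode::All,
        GpuLayersMode::Auto => GpuLayersMode::Specific(1),
        GpuLayersMode::Specific(n) if n < total_layers => GpuLayersMode::Specific(n + 1),
        GpuLayersMode::Specific(_) => GpuLayersMode::All,
        GpuLayersMode::All => GpuLayersMode::Auto,
    }
}

/// Layer count assumed to be offloaded when estimating VRAM.
pub fn layers_for_estimate(mode: GpuLayersMode, total_layers: u32) -> u32 {
    match mode {
        GpuLayersMode::Auto => total_layers * 60 / 100, // ~60% heuristic, rounded down
        GpuLayersMode::Specific(n) => n,
        GpuLayersMode::All => total_layers,
    }
}
```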
## Llama.cpp binary management
### Backend selection (`src/models.rs`)
Five main backends plus platform variants:
```rust
pub enum Backend {
Cpu, Vulkan, Rocm, RocmLemonade, Cuda,
CpuArm64, CpuWindows, VulkanWindows,
CudaWindows12_4, CudaWindows13_1, HipWindows,
CpuMacosArm64, CpuMacosX64,
}
```
### Binary storage
```
~/.local/share/llm-manager/bin/
├── llama-server-cpu-{version}/llama-server
├── llama-server-vulkan-{version}/llama-server
├── llama-server-rocm-{version}/llama-server
├── llama-server-rocm-lemonade-{version}/llama-server
└── llama-server-cuda-{version}/llama-server
```
### Per-backend version config
```yaml
llama_cpp_version_cpu: null
llama_cpp_version_vulkan: null
llama_cpp_version_rocm: null
llama_cpp_version_rocm_lemonade: null
llama_cpp_version_cuda: null
```
### Asset names
- **CPU:** `llama-{tag}-bin-ubuntu-x64.tar.gz`
- **Vulkan:** `llama-{tag}-bin-ubuntu-vulkan-x64.tar.gz`
- **ROCm:** `llama-{tag}-bin-ubuntu-rocm-7.2-x64.tar.gz`
- **ROCm Lemonade:** `llama-{tag}-ubuntu-rocm-{gfx_suffix}-x64.zip`
- **CUDA:** `llama.cpp-{tag}-cuda-12.8-amd64.tar.gz`
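The naming patterns above could be produced by a helper along these lines. The function names and the string slugs used to select a backend are hypothetical (the real code matches on the `Backend` enum), and `b0000` in the usage note is a placeholder tag, not a real release.

```rust
/// Archive name for the four backends whose pattern depends only on the tag.
/// Returns `None` for slugs this sketch does not know.
pub fn asset_name(backend_slug: &str, tag: &str) -> Option<String> {
    match backend_slug {
        "cpu" => Some(format!("llama-{tag}-bin-ubuntu-x64.tar.gz")),
        "vulkan" => Some(format!("llama-{tag}-bin-ubuntu-vulkan-x64.tar.gz")),
        "rocm" => Some(format!("llama-{tag}-bin-ubuntu-rocm-7.2-x64.tar.gz")),
        "cuda" => Some(format!("llama.cpp-{tag}-cuda-12.8-amd64.tar.gz")),
        _ => None,
    }
}

/// ROCm Lemonade archives additionally depend on the GPU's gfx suffix.
pub fn lemonade_asset_name(tag: &str, gfx_suffix: &str) -> String {
    format!("llama-{tag}-ubuntu-rocm-{gfx_suffix}-x64.zip")
}
```

For instance, `asset_name("vulkan", "b0000")` yields `llama-b0000-bin-ubuntu-vulkan-x64.tar.gz`. Note that the CUDA archive uses a different prefix (`llama.cpp-`) and architecture suffix (`amd64`) from the others.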
### Binary resolution (`src/backend/hub.rs`)
`resolve_backend_binary(backend, version)` checks whether both the binary and `libllama.so` exist. If either is missing, it downloads and extracts the archive.
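The existence check could look like the sketch below. `install_dir` and `is_installed` are hypothetical helpers, the download step is omitted, and placing `libllama.so` in the same directory as the binary is an assumption.

```rust
use std::path::{Path, PathBuf};

/// Directory for one backend/version pair, following the storage layout above.
pub fn install_dir(bin_root: &Path, backend_slug: &str, version: &str) -> PathBuf {
    bin_root.join(format!("llama-server-{backend_slug}-{version}"))
}

/// True only when both the server binary and the shared library are present.
/// Assumes `libllama.so` sits next to `llama-server`.
pub fn is_installed(dir: &Path) -> bool {
    dir.join("llama-server").is_file() && dir.join("libllama.so").is_file()
}
```

Checking both files guards against a half-extracted archive, where the binary exists but would fail to start because the library is missing.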
## Domain types (`src/models.rs`)
### Server mode
```rust
pub enum ServerMode { Normal, Router, Bench, BenchTune }
```
### Cache types
- **CacheType** (main KV cache): `F16`, `BF16`, `Fq8_0`, `Fq4_1`
- **CacheQuantType** (KV quantization): `F32`, `F16`, `BF16`, `Q8_0`, `Q4_0`, `Q4_1`, `Iq4Nl`, `Q5_0`, `Q5_1`
### Other enums
- **SplitMode:** `None`, `Layer` (default), `Row`, `Tensor`
- **NumMode:** `None` (default), `Distribute`, `Isolate`, `Numactl`
- **RopeScaling:** `None` (default), `Linear`, `Yarn`
- **Mirostat:** `Off` (default), `Mirostat`, `Mirostat2`
- **Samplers:** Semicolon-separated sampler order string
- **ModelState:** `Available`, `Loading`, `Benchmarking`, `Loaded { port, pid }`, `Failed { error }`
- **SearchSort:** `Relevance`, `Downloads`, `Likes`, `Trending`, `CreatedAt`
### Key structs
- **SearchResult** — fields: `model_id`, `model_name`, `tags`, `downloads`, `likes`, `pipeline_tag`, `size`, `parameters`, `capabilities`, `context_length`, `readme`, `quantization`, `license`, `trending_score`, `created_at`
- **GgufMetadata** — parsed GGUF metadata (layers, hidden_size, n_ctx_train, etc.)
- **LoadProgress** — `layers_total`, `layers_loaded`, `tensors_loaded`, `buffers`
- **ServerMetrics** — `loaded`, `tps`, `prompt_tps`, `cpu_usage`, `gpu_mem_used/total`, `ram_used`, `ctx_used/max`, `total_vram_used`, `decoded_tokens`, `throughput`, `latency_per_token_ms`
- **ModelSettings** — 70+ fields covering loading, GPU, sampling, repetition, RoPE, server config
- **BenchTuneConfig/Param/ParamValue/Result/Metrics/Status/Progress/Mode** — benchmark tuning types
## Benchmark Tuning (`src/backend/benchmark.rs`)
Two modes:
- **RuntimeOnly**: Single server, params sent in request body (no server restarts)
- **Full**: New server spawned for each parameter combination
Tunable parameters: temperature, top_p, top_k, repeat_penalty, flash_attn, threads, batch_size, expert_count.
Output formats: Markdown table, JSON, YAML, HTML report.
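To give a feel for how many servers **Full** mode may spawn, the sketch below enumerates parameter combinations. It assumes a full cartesian product over the enabled parameters and represents every value as `f64`; both are simplifications, and `combinations` is a hypothetical helper rather than the benchmark module's API.

```rust
/// Builds every combination of the given parameter values (cartesian product).
pub fn combinations(params: &[(&str, Vec<f64>)]) -> Vec<Vec<(String, f64)>> {
    // Start with a single empty combination and extend it one parameter at a time.
    let mut combos: Vec<Vec<(String, f64)>> = vec![Vec::new()];
    for (name, values) in params {
        let mut next = Vec::with_capacity(combos.len() * values.len());
        for combo in &combos {
            for value in values {
                let mut extended = combo.clone();
                extended.push((name.to_string(), *value));
                next.push(extended);
            }
        }
        combos = next;
    }
    combos
}
```

Under these assumptions, two temperatures and three `top_k` values give six combinations, so run time grows multiplicatively with each enabled parameter.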
## Profiles, System Prompt Presets, RPC Workers, TLS (`src/config/`)
### Profiles (`config/profiles.rs`)
Named sets of model settings. Built-in: Qwen, Gemma, Llama, Mistral, Phi.
### System Prompt Presets (`config/presets.rs`)
Named system prompts. Built-in: General, Coder, Thinker, Mathematician.
### RPC Workers (`config.rs`)
Remote workers for distributed inference. Default port: 50052.
### TLS (`src/backend/tls.rs`)
Self-signed certificate generation (CA + server cert), persisted in `~/.config/llm-manager/tls/`.
### Config struct
```rust
pub struct Config {
pub models_dirs: Vec<PathBuf>,
pub llama_server: PathBuf,
pub default: DefaultParams,
pub model_overrides: ModelConfigStore,
pub profiles: ProfileStore,
pub system_prompt_presets: PresetStore,
pub rpc_workers: Vec<RpcWorker>,
pub ws_server: WsServer,
pub search_limit: u32,
}
```
## Serve mode and API proxy (`src/serve.rs`, `src/serve_api.rs`)
### Standalone serve CLI
```bash
./build.sh serve --model /path/to/model.gguf --api-port 49222 --api-key secret
```
### API Proxy Server
Axum-based HTTP proxy. Key endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, `/v1/messages`, `/v1/embeddings`, `/v1/models`, `/health`, `/metrics`, `/models/load`, `/models/unload`, `/api/status`, and many more. All unmatched paths are auto-proxied.
## Coding rules
### Planning
1. Identify root cause, not just symptoms
2. List affected files and functions
3. Use a `todowrite` tool to track work as a numbered TODO list
4. Mark each item as `in_progress` before starting, `completed` when done
5. Keep the TODO list visible
### Dependencies
- No new dependencies without asking
- Prefer `ratatui` widgets over custom rendering
### Error handling
- Use `anyhow::Result` for async/API functions
- Use `thiserror` for application-specific error types
- Log errors with `app.add_log()` in the TUI
### Naming conventions
- `snake_case` for functions, variables, modules
- `PascalCase` for types, enums, variants
- Module names are lowercase (`backend`, `panel`)
### Async
- `handle_key` is async (for search queries)
- Download is spawned as a tokio task; progress flows through `mpsc`
- Main loop uses `crossterm::event::poll()` with a 100ms timeout
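The progress flow can be sketched without external crates. The example below swaps tokio's channel and task for `std::sync::mpsc` and `std::thread`, and `Progress`, `latest_progress`, and `progress_demo` are hypothetical names. The idea is that the UI loop drains whatever arrived since the last tick and never blocks on the worker.

```rust
use std::sync::mpsc::{self, Receiver};

// Hypothetical progress message; the real type may carry more fields.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Progress {
    pub downloaded: u64,
    pub total: u64,
}

/// Non-blocking: drains everything queued since the last tick and keeps the newest.
pub fn latest_progress(rx: &Receiver<Progress>) -> Option<Progress> {
    rx.try_iter().last()
}

/// A worker sends three updates; the "UI" then reads the most recent one.
pub fn progress_demo() -> Option<Progress> {
    let (tx, rx) = mpsc::channel();
    let worker = std::thread::spawn(move || {
        for downloaded in [25, 50, 100] {
            tx.send(Progress { downloaded, total: 100 })
                .expect("receiver dropped");
        }
    });
    // Joining first makes the demo deterministic; a real UI would poll each tick.
    worker.join().expect("worker thread panicked");
    latest_progress(&rx)
}
```

In the real event loop a call like `latest_progress` would run once per iteration, next to the 100ms `poll()`, so rendering stays responsive while a download is in flight.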
### TUI specifics
- Use `ratatui` widgets when possible
- Style with `Style` / `Color` / `Modifier`, preferring semantic colors:
  - Yellow: headers, active elements
  - Cyan: navigation hints
  - Green: success/completed
  - Red: errors/failure
- Avoid hardcoding terminal dimensions
### Configuration
- Config is YAML-based, stored in `~/.config/llm-manager/`
- New config fields go in `config.rs`; add defaults in `Default` impls
### Testing
- Unit tests in `mod tests` blocks
- Integration testing is manual (run the app)
### Git commits
- **Never commit changes yourself.** Always ask the user before committing.
- If the user explicitly asks you to commit, then do it.
- If the user wants you to stage changes, do that — but still wait for explicit permission to commit.
## Common tasks
### Adding a new panel
1. Create `src/tui/panel/name.rs` with a `render(f, area, app)` function
2. Add `mod name;` to `src/tui/panel/mod.rs`
3. Add a variant to the `ActivePanel` enum in `app/types.rs`
4. Dispatch in `render.rs` and `event.rs`
### Adding a new keyboard shortcut
1. Add to `handle_key()` in `event.rs`
2. Update the status bar in `render_status_bar()` in `render.rs`
3. If it changes state, update the `App` fields in `app/types.rs` and their initialization in `app.rs`
### Adding a new API endpoint
1. Add the function in `src/backend/hub.rs`
2. Call from `event.rs` (usually in the search/files branch)
3. Update `SearchResult` or other types in `models.rs` if needed
### Adding a new backend
1. Add variant to `Backend` enum in `models.rs` with serde/Display impl
2. Add `llama_cpp_version_{backend}` field to `DefaultParams` and `ModelSettings` in `config.rs` and `models.rs`
3. Update `from_settings()` / `apply()` in `config.rs`
4. Update `resolve_backend_binary()` in `hub.rs` for asset name
5. Update `spawn_server()` in `server.rs` for version lookup
6. Update `refresh_cached_versions()` in `app.rs` for directory detection
7. Update version picker in `models.rs` and event handling in `event.rs`