llm-manager 1.1.0

# AGENTS.md — llm-manager

## Project overview

**llm-manager** is a terminal UI (TUI) for managing local LLM models. It searches HuggingFace, downloads GGUF models, loads them via llama.cpp's `llama-server`, and lets you chat with them.

**Stack:** Rust 2024, ratatui 0.29, crossterm 0.28, tokio, reqwest, axum.

## Directory structure

```
src/
├── main.rs          # Entry point, event loop, model discovery, metrics polling
├── config.rs        # Config loading/saving, YAML-based, profiles, presets
├── models.rs        # Domain types (SearchResult, DownloadState, ModelSettings, etc.)
├── serve.rs         # Standalone serve mode CLI (--model, --profile, --api-port, --api-key)
├── serve_api.rs     # Axum-based API proxy server for serve mode
├── backend/
│   ├── hub.rs       # HuggingFace API: search, list files, download
│   ├── server.rs    # llama.cpp server spawning (resolve_backend_binary, spawn_server)
│   ├── benchmark.rs # Benchmark tuning system (RuntimeOnly and Full modes)
│   ├── hardware.rs  # GPU detection (AMD/NVIDIA/Intel), platform detection
│   ├── tls.rs       # TLS certificate generation for secure connections
│   └── ws_server.rs # WebSocket metrics server
├── tui/
│   ├── mod.rs       # Module declaration, format_size/format_number helpers
│   ├── app.rs       # App::new() — all state initialization
│   ├── app/types.rs # App struct and all state sub-structs
│   ├── app/*.rs     # app/ — async/sync operations, help, pickers, panels
│   ├── event/       # Keyboard/mouse event handling
│   ├── render.rs    # Top-level render dispatcher
│   └── panel/       # Individual panel render functions
├── config/
│   ├── model_config.rs  # Per-model YAML config store
│   ├── profiles.rs      # Profile YAML store
│   └── presets.rs       # System prompt preset YAML store
└── ws_server.rs
```

## Key architectural patterns

### App state machine (`src/tui/app/types.rs`)

`App` holds all state. `models_mode` is the mode enum that controls rendering:

```rust
pub enum ModelsMode {
    List,       // Local model list
    Search { query, results, sort_by, show_readme, page, loading, has_more },
    Files { model_id, files, selected_idx, previous_query, previous_results, selected_result },
}
```

`ActivePanel` enum controls which panel has focus:

```rust
pub enum ActivePanel {
    Models, Log, ServerSettings, LlmSettings, Profiles,
    SystemPromptPresets, SearchReadme, ActiveModel, ModelInfo, Downloads,
}
```

`GlobalMode` enum controls overlay modes:

```rust
pub enum GlobalMode {
    Normal,
    CmdLine { cmd_line: String },
    HostPicker { entries: Vec<(String, String)>, selected: usize },
    BackendPicker { entries: Vec<(Backend, Option<String>)>, selected: usize },
    ProfilePicker { entries, selected },
    PromptPicker { entries, selected, editing, edit_buffer, edit_cursor_pos, confirm_delete },
    Confirmation { selected: bool, kind: ConfirmationKind },
    RpcManager,
    About,
    MaxConcurrentPicker { value: String },
    BenchTuneSetup { config, selected_idx, bench_mode_selection, editing_prompt, editing_kwargs },
}
```

`ConfirmationKind` variants: `Exit`, `Reset`, `Delete`, `Unload`, `DeleteBackend`.

`LoadingPhase` variants: `ServerStarting`, `LoadingModel`, `LoadingMeta`, `LoadingTensors`, `ServerListening`, `Complete`.

### Log panel expand/collapse

The `App` struct has a `log_expanded: bool` field. When true:
- Layout switches to 2-chunk: status bar + log fills remaining space
- Models panel, Settings panel, and active model info are hidden
- Log panel shows `[Enter] expand` / `[Esc] collapse` hint in status bar

### Event handling (`src/tui/event/mod.rs`)

Key handling is hierarchical:
1. Global shortcuts (Ctrl+C, Tab, Ctrl+H, etc.)
2. CmdLine overlay
3. Exit/Reset confirmation
4. Version picker mode (takes priority when `ModelsMode::VersionPicker`)
5. Search mode (takes priority when `ModelsMode::Search`)
6. Files mode
7. Download mode
8. Normal mode → dispatch to panel-specific handlers

**Important:** Each branch calls `return` to prevent fallthrough. A new mode needs its own branch that ends in an early `return`.
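
The early-return chain can be sketched as follows. This is a minimal illustration, not the project's actual handler: `Handled`, `UiState`, and `dispatch` are hypothetical names, and only a few of the branches listed above are modelled.

```rust
/// Which layer consumed the key (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
pub enum Handled {
    Global,
    CmdLine,
    Search,
    Files,
    Panel,
}

/// Hypothetical, trimmed-down view of the state the dispatcher inspects.
pub struct UiState {
    pub cmd_line_open: bool,
    pub searching: bool,
    pub browsing_files: bool,
}

/// Checks the layers in priority order. Each branch returns early, so a key
/// is never handled by more than one layer.
pub fn dispatch(state: &UiState, is_global_shortcut: bool) -> Handled {
    if is_global_shortcut {
        return Handled::Global;
    }
    if state.cmd_line_open {
        return Handled::CmdLine;
    }
    if state.searching {
        return Handled::Search;
    }
    if state.browsing_files {
        return Handled::Files;
    }
    // Nothing above claimed the key: fall through to the focused panel.
    Handled::Panel
}
```

With both `cmd_line_open` and `searching` set, `dispatch` returns `Handled::CmdLine`, because the command line sits higher in the hierarchy.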

### Token count display

Token count (`ctx_used`) is primarily driven by the `/metrics` endpoint. Log-parsing of `n_tokens` is used as a fallback if the API reports 0, ensuring real-time updates during inference. This hybrid approach prevents "stuck" values while maintaining accuracy.
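
A minimal sketch of the fallback rule, assuming the log line carries the count as `n_tokens = <number>`. The log format and both function names are assumptions made for illustration; the real parsing may differ.

```rust
/// Extracts the first number that follows `n_tokens` in a log line.
/// The log format is an assumption made for this sketch.
pub fn parse_n_tokens(line: &str) -> Option<u64> {
    let rest = line.split("n_tokens").nth(1)?;
    let digits: String = rest
        .chars()
        .skip_while(|c| !c.is_ascii_digit())
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}

/// Prefers the `/metrics` value and falls back to the log-derived count
/// only when the API reports 0.
pub fn effective_ctx_used(metrics_ctx_used: u64, log_n_tokens: Option<u64>) -> u64 {
    if metrics_ctx_used > 0 {
        metrics_ctx_used
    } else {
        log_n_tokens.unwrap_or(0)
    }
}
```

A non-zero `/metrics` value always wins, so the log-derived count only fills the gap while the API still reports 0.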

### Search filtering (`src/backend/hub.rs`)

Search uses `&filter=gguf` on the HuggingFace API URL so the API itself only returns GGUF models. A post-filter then checks that the `model_id` contains the search query (case-insensitive). The default is 70 results per page, with a maximum of 200.
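
The post-filter and the page-size cap could look roughly like this. The function and constant names are hypothetical; only the case-insensitive substring match and the 70/200 limits come from the description above.

```rust
pub const DEFAULT_LIMIT: u32 = 70;
pub const MAX_LIMIT: u32 = 200;

/// Uses the default page size when none is requested and caps it at the maximum.
pub fn page_limit(requested: Option<u32>) -> u32 {
    requested.unwrap_or(DEFAULT_LIMIT).min(MAX_LIMIT)
}

/// Keeps only the model ids that contain the query, ignoring case.
pub fn post_filter<'a>(model_ids: &[&'a str], query: &str) -> Vec<&'a str> {
    let needle = query.to_lowercase();
    model_ids
        .iter()
        .copied()
        .filter(|id| id.to_lowercase().contains(&needle))
        .collect()
}
```

For example, filtering `["example/Llama-Test-GGUF", "example/Qwen-Test-GGUF"]` with the query `llama` keeps only the first id.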

### Panel help

- **Ctrl+H** — Panel-specific help via `render_panel()` (contextual for current panel)
- Panel-specific help content is generated by `App::panel_help_lines()` (app.rs)

### Rendering (`src/tui/render.rs`)

Top-level layout: status bar → top panels → active model → log. The models panel renders differently based on `models_mode`.

### Download cancellation

Download runs in a spawned tokio task. Cancellation uses `Arc<AtomicBool>` shared between the task and the UI. Pressing `c` sets the flag; the download loop checks it each iteration.
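
The flag-based cancellation can be sketched with the standard library alone. The real code runs inside a tokio task; here a plain loop stands in for the download, and `download_chunks`, `new_cancel_flag`, and the `on_chunk` callback are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Creates the flag that the UI side and the download side both hold.
pub fn new_cancel_flag() -> Arc<AtomicBool> {
    Arc::new(AtomicBool::new(false))
}

/// Simulates a chunked download. Returns `Ok(chunks)` on completion, or
/// `Err(chunks_done)` if the cancel flag was observed first.
pub fn download_chunks(
    total_chunks: usize,
    cancel: &AtomicBool,
    mut on_chunk: impl FnMut(usize),
) -> Result<usize, usize> {
    let mut done = 0;
    for _ in 0..total_chunks {
        // Check the shared flag once per iteration, before fetching more data.
        if cancel.load(Ordering::Relaxed) {
            return Err(done);
        }
        done += 1; // stand-in for reading and writing one chunk
        on_chunk(done);
    }
    Ok(done)
}
```

The UI side keeps a clone of the `Arc` and calls `store(true, Ordering::Relaxed)` when `c` is pressed; the loop then stops at its next check rather than mid-chunk.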

### Backend picker

The backend picker selects the llama.cpp binary version for each backend. It is triggered from the "LLama.cpp Version" field in LLM Settings.

- `Enter` selects backend+version for the active backend
- `d` deletes a backend version from disk (with `ConfirmationKind::DeleteBackend`)
- `Esc` exits back to settings

**HostPicker** (`GlobalMode::HostPicker`): Shows network interfaces and their IPs. `Enter` selects host, `d` refreshes, `Ctrl+H` closes.

**MaxConcurrentPicker** (`GlobalMode::MaxConcurrentPicker`): Numeric entry modal for max concurrent predictions (1-10).
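
Validating the modal's text buffer against the 1-10 range might look like this. `parse_max_concurrent` is a hypothetical helper, not a function confirmed to exist in the codebase.

```rust
/// Parses the modal's text buffer; accepts only whole numbers from 1 to 10.
pub fn parse_max_concurrent(input: &str) -> Option<u8> {
    let n: u8 = input.trim().parse().ok()?;
    (1..=10).contains(&n).then_some(n)
}
```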

**BenchTuneSetup** (`GlobalMode::BenchTuneSetup`): Full benchmark configuration modal. `Alt+m` toggles benchmark mode (RuntimeOnly vs Full), `Alt+p` edits prompt, `Alt+n` edits n_predict, `Alt+i` edits iterations, `Alt+c` edits chat template kwargs. Space toggles parameter enablement, Enter starts benchmark.

**YarnRoPESettings** (`GlobalMode::YarnRoPESettings`): YaRN RoPE parameter modal. `selected_field` values: -1 (Yarn RoPE enabled toggle), 0 (rope_scale), 1 (rope_freq_base), 2 (rope_freq_scale). `Enter` edits the selected field or applies the value. `Space` toggles Yarn RoPE enabled. `Up`/`Down` cycles fields. `Esc` cancels editing or exits. Only digits, `.`, `-`, `e`, `E` are accepted as input.
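
The input restriction for the YaRN fields can be sketched as a character filter followed by a float parse. The helper names are hypothetical, and parsing the buffer as `f64` is an assumption about how the value is applied.

```rust
/// Returns true for the characters the modal accepts: digits, `.`, `-`, `e`, `E`.
pub fn is_accepted_char(c: char) -> bool {
    c.is_ascii_digit() || matches!(c, '.' | '-' | 'e' | 'E')
}

/// Appends the character to the edit buffer only if it is accepted.
pub fn push_if_accepted(buffer: &mut String, c: char) -> bool {
    if is_accepted_char(c) {
        buffer.push(c);
        true
    } else {
        false
    }
}

/// Parses the finished buffer; malformed input such as "1.5.2" yields `None`.
pub fn parse_field(buffer: &str) -> Option<f64> {
    buffer.parse().ok()
}
```

The filter alone does not guarantee a valid number, so the final parse is still needed before a value is applied.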

### Dirty tracking (`is_settings_dirty` in `app.rs`)

Compares each field index-by-index. When a field is dirty, its label is rendered in yellow.

**Index consistency** — all indices must be identical across:
- `settings.rs` dirty check match arms (line ~133)
- `event.rs` `apply_numeric_setting` / `adjust_setting` match arms
- `event.rs` `handle_settings_key` toggle shortcuts (`e` / `Ctrl+E`)
- `event.rs` comment block (line ~836)
- `app.rs` `is_settings_dirty` match arms
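
An index-keyed dirty check might be shaped like the sketch below. The `Settings` struct is a hypothetical three-field stand-in for `ModelSettings`, and comparing against a saved copy is an assumption; the indices 0, 2, and 3 follow the LLM Settings field numbering.

```rust
/// Hypothetical, heavily reduced stand-in for the real settings struct.
#[derive(Clone, PartialEq)]
pub struct Settings {
    pub context_length: u32,
    pub mlock: bool,
    pub gpu_layers: u32,
}

/// Returns true when the field at `idx` differs from the saved value.
/// Every match arm must use the same index as the panel and event code.
pub fn is_field_dirty(current: &Settings, saved: &Settings, idx: usize) -> bool {
    match idx {
        0 => current.context_length != saved.context_length, // Context length
        2 => current.mlock != saved.mlock,                   // Keep in memory (mlock)
        3 => current.gpu_layers != saved.gpu_layers,         // GPU Layers
        _ => false, // indices not modelled in this sketch
    }
}
```

If one of the listed locations uses a different index for the same field, the wrong label turns yellow, which is why the indices must stay aligned.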

### LLM Settings panel (55 fields in expert mode, 28 standard)

Standard fields (always visible):
```
Loading (0-2):   Context length, System prompt preset, Keep in memory (mlock)
GPU (3-8):       GPU Layers, Flash Attention, KV Cache Offload, Cache Type K, Cache Type V, Active Experts
Evaluation (9-11): Eval Batch, Unified KV, Max Concurrent Predictions
Sampling (12-17): Seed, Temperature, Top-k, Top-p, Min P, Max Tokens
Repetition (18-21): Repetition Penalty, Rep. Last N, Presence Penalty, Frequency Penalty
Backend (22):    Tags (Enter to edit)
Backend (23):    LLama.cpp Version (shows CPU / Vulkan / ROCm / CUDA versions)
Yarn RoPE (24-25): Yarn RoPE (toggle), Yarn Params (opens modal with rope_scale, rope_freq_base, rope_freq_scale)
MTP (26-27):     Enable MTP, Draft Tokens
```

Expert fields (toggle with `Ctrl+X`):
```
Expert Loading (28-33): Threads Batch, UBatch Size, Keep, SWA Full, MMap, NUMA
Expert GPU (34-41): Split Mode, Tensor Split, Main GPU, Fit, LoRA, LoRA Scaled, RPC, Embedding
Expert Sampling (42-47): Typical P, Mirostat, Mirostat LR, Mirostat Ent, Ignore EOS, Samplers
Expert DRY (48-51): DRY Multiplier, DRY Base, DRY Allowed Length, DRY Penalty Last N
Expert Server (52-54): Cache Prompt, Cache Reuse, WebUI
```

### Expert mode

Toggle with `Ctrl+X` in the LLM Settings panel. Shows/hides fields 28-54. Resets selected index and clears render cache on toggle.
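
A minimal sketch of the toggle, using the field counts from the tables above. `SettingsPanel` is a hypothetical type, resetting the selection to index 0 is an assumption, and the render-cache clearing is left out.

```rust
pub const STANDARD_FIELDS: usize = 28; // indices 0-27
pub const EXPERT_FIELDS: usize = 27; // indices 28-54

/// Hypothetical panel state holding only what the toggle touches.
pub struct SettingsPanel {
    pub expert_mode: bool,
    pub selected: usize,
}

impl SettingsPanel {
    /// Number of fields currently shown in the panel.
    pub fn visible_fields(&self) -> usize {
        if self.expert_mode {
            STANDARD_FIELDS + EXPERT_FIELDS
        } else {
            STANDARD_FIELDS
        }
    }

    /// Flips expert mode and resets the selection so it cannot point at a
    /// field that has just been hidden (reset target assumed to be 0).
    pub fn toggle_expert(&mut self) {
        self.expert_mode = !self.expert_mode;
        self.selected = 0;
    }
}
```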

### GPU Layers cycling (`src/models.rs::GpuLayersMode`)

```rust
pub enum GpuLayersMode {
    Auto,       // llama.cpp auto-detects based on VRAM (default)
    Specific(u32), // exact number of layers
    All,        // -ngl 999 (all layers)
}
```

Arrow keys cycle through modes: `Auto` → `1` → `2` → ... → `N` (total layers) → `All` → `Auto`.

**VRAM estimation** (`src/models.rs::estimate_vram_mib`):
- `Auto` uses a heuristic (~60% of total layers)
- `Specific(n)` uses exactly `n` layers
- `All` uses all layers
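
The cycling order and the per-mode layer counts can be sketched together. Only the forward direction is shown; `next_mode` and `layers_for_estimate` are hypothetical names, the integer rounding of the 60% heuristic is an assumption, and the real `estimate_vram_mib` returns MiB rather than a layer count.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GpuLayersMode {
    Auto,
    Specific(u32),
    All,
}

/// Advances one step: Auto -> 1 -> 2 -> ... -> N -> All -> Auto.
pub fn next_mode(mode: GpuLayersMode, total_layers: u32) -> GpuLayersMode {
    match mode {
        // A model reporting zero layers has no specific values to step through.
        GpuLayersMode::Auto if total_layers == 0 => GpuLayersMode::All,
        GpuLayersMode::Auto => GpuLayersMode::Specific(1),
        GpuLayersMode::Specific(n) if n < total_layers => GpuLayersMode::Specific(n + 1),
        GpuLayersMode::Specific(_) => GpuLayersMode::All,
        GpuLayersMode::All => GpuLayersMode::Auto,
    }
}

/// Number of layers the VRAM estimate would count for each mode.
pub fn layers_for_estimate(mode: GpuLayersMode, total_layers: u32) -> u32 {
    match mode {
        GpuLayersMode::Auto => total_layers * 6 / 10, // ~60% heuristic, rounded down
        GpuLayersMode::Specific(n) => n.min(total_layers), // clamping is an assumption
        GpuLayersMode::All => total_layers,
    }
}
```

For a 40-layer model, `Auto` counts 24 layers in this sketch, `Specific(10)` counts 10, and `All` counts 40.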

## Llama.cpp binary management

### Backend selection (`src/models.rs`)

Five main backends plus platform variants:

```rust
pub enum Backend {
    Cpu, Vulkan, Rocm, RocmLemonade, Cuda,
    CpuArm64, CpuWindows, VulkanWindows,
    CudaWindows12_4, CudaWindows13_1, HipWindows,
    CpuMacosArm64, CpuMacosX64,
}
```

### Binary storage

```
~/.local/share/llm-manager/bin/
├── llama-server-cpu-{version}/llama-server
├── llama-server-vulkan-{version}/llama-server
├── llama-server-rocm-{version}/llama-server
├── llama-server-rocm-lemonade-{version}/llama-server
└── llama-server-cuda-{version}/llama-server
```
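
Building a binary path from this layout might look like the following. `binary_path` and the `backend_slug` parameter are hypothetical; the slug is the lowercase backend name used in the directory names above (`cpu`, `vulkan`, `rocm`, `rocm-lemonade`, `cuda`).

```rust
use std::path::{Path, PathBuf};

/// Joins `<bin_root>/llama-server-<slug>-<version>/llama-server`.
pub fn binary_path(bin_root: &Path, backend_slug: &str, version: &str) -> PathBuf {
    bin_root
        .join(format!("llama-server-{backend_slug}-{version}"))
        .join("llama-server")
}
```

The version string here is whatever the per-backend version setting holds; the sketch does not validate it.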

### Per-backend version config

```yaml
llama_cpp_version_cpu: null
llama_cpp_version_vulkan: null
llama_cpp_version_rocm: null
llama_cpp_version_rocm_lemonade: null
llama_cpp_version_cuda: null
```

### Asset names

- **CPU:** `llama-{tag}-bin-ubuntu-x64.tar.gz`
- **Vulkan:** `llama-{tag}-bin-ubuntu-vulkan-x64.tar.gz`
- **ROCm:** `llama-{tag}-bin-ubuntu-rocm-7.2-x64.tar.gz`
- **ROCm Lemonade:** `llama-{tag}-ubuntu-rocm-{gfx_suffix}-x64.zip`
- **CUDA:** `llama.cpp-{tag}-cuda-12.8-amd64.tar.gz`
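
The naming patterns above map directly onto a `match`. This sketch uses a reduced, local `Backend` enum with only the five main backends, and `asset_name` is a hypothetical helper; `gfx_suffix` is only used for ROCm Lemonade.

```rust
/// Reduced copy of the backend enum: the five main backends only.
pub enum Backend {
    Cpu,
    Vulkan,
    Rocm,
    RocmLemonade,
    Cuda,
}

/// Builds the release asset file name for a backend and release tag.
pub fn asset_name(backend: &Backend, tag: &str, gfx_suffix: &str) -> String {
    match backend {
        Backend::Cpu => format!("llama-{tag}-bin-ubuntu-x64.tar.gz"),
        Backend::Vulkan => format!("llama-{tag}-bin-ubuntu-vulkan-x64.tar.gz"),
        Backend::Rocm => format!("llama-{tag}-bin-ubuntu-rocm-7.2-x64.tar.gz"),
        Backend::RocmLemonade => format!("llama-{tag}-ubuntu-rocm-{gfx_suffix}-x64.zip"),
        Backend::Cuda => format!("llama.cpp-{tag}-cuda-12.8-amd64.tar.gz"),
    }
}
```

Note that the CUDA asset uses a different prefix (`llama.cpp-`) and architecture suffix (`amd64`) from the others, and ROCm Lemonade ships as a `.zip`.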

### Binary resolution (`src/backend/hub.rs`)

`resolve_backend_binary(backend, version)` checks if the binary + `libllama.so` exist. If not, it downloads and extracts the archive.
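
The existence check could be as small as the sketch below. `is_installed` is a hypothetical helper, and it assumes `libllama.so` sits in the same directory as `llama-server`; the download and extraction steps are not shown.

```rust
use std::path::Path;

/// Returns true when both the server binary and its shared library are present.
pub fn is_installed(version_dir: &Path) -> bool {
    version_dir.join("llama-server").is_file() && version_dir.join("libllama.so").is_file()
}
```

Checking for the shared library as well as the binary catches a partially extracted archive, which would otherwise fail only when the server is spawned.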

## Domain types (`src/models.rs`)

### Server mode

```rust
pub enum ServerMode { Normal, Router, Bench, BenchTune }
```

### Cache types

- **CacheType** (main KV cache): `F16`, `BF16`, `Fq8_0`, `Fq4_1`
- **CacheQuantType** (KV quantization): `F32`, `F16`, `BF16`, `Q8_0`, `Q4_0`, `Q4_1`, `Iq4Nl`, `Q5_0`, `Q5_1`

### Other enums

- **SplitMode:** `None`, `Layer` (default), `Row`, `Tensor`
- **NumaMode:** `None` (default), `Distribute`, `Isolate`, `Numactl`
- **RopeScaling:** `None` (default), `Linear`, `Yarn`
- **Mirostat:** `Off` (default), `Mirostat`, `Mirostat2`
- **Samplers:** Semicolon-separated sampler order string
- **ModelState:** `Available`, `Loading`, `Benchmarking`, `Loaded { port, pid }`, `Failed { error }`
- **SearchSort:** `Relevance`, `Downloads`, `Likes`, `Trending`, `CreatedAt`
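
Splitting the semicolon-separated sampler string into an ordered list might look like this. `parse_samplers` is a hypothetical helper; trimming whitespace and skipping empty entries are assumptions about how lenient the parser should be.

```rust
/// Splits a sampler order string such as "top_k;top_p;temperature" into
/// its entries, preserving order and dropping empty segments.
pub fn parse_samplers(order: &str) -> Vec<String> {
    order
        .split(';')
        .map(str::trim)
        .filter(|name| !name.is_empty())
        .map(str::to_owned)
        .collect()
}
```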

### Key structs

- **SearchResult** — fields: `model_id`, `model_name`, `tags`, `downloads`, `likes`, `pipeline_tag`, `size`, `parameters`, `capabilities`, `context_length`, `readme`, `quantization`, `license`, `trending_score`, `created_at`
- **GgufMetadata** — parsed GGUF metadata (layers, hidden_size, n_ctx_train, etc.)
- **LoadProgress** — `layers_total`, `layers_loaded`, `tensors_loaded`, `buffers`
- **ServerMetrics** — `loaded`, `tps`, `prompt_tps`, `cpu_usage`, `gpu_mem_used/total`, `ram_used`, `ctx_used/max`, `total_vram_used`, `decoded_tokens`, `throughput`, `latency_per_token_ms`
- **ModelSettings** — 70+ fields covering loading, GPU, sampling, repetition, RoPE, server config
- **BenchTuneConfig/Param/ParamValue/Result/Metrics/Status/Progress/Mode** — benchmark tuning types

## Benchmark Tuning (`src/backend/benchmark.rs`)

Two modes:
- **RuntimeOnly**: Single server, params sent in request body (no server restarts)
- **Full**: New server spawned for each parameter combination

Tunable parameters: temperature, top_p, top_k, repeat_penalty, flash_attn, threads, batch_size, expert_count.
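
If "each parameter combination" means the Cartesian product of the enabled parameter values (an assumption; the real tuner may prune or order combinations differently), the set of runs can be enumerated as below. `combinations` is a hypothetical helper, and all values are `f64` for simplicity.

```rust
/// Expands per-parameter value lists into every combination.
/// Each combination is a list of (parameter name, value) pairs.
pub fn combinations(params: &[(&str, Vec<f64>)]) -> Vec<Vec<(String, f64)>> {
    // Start with one empty combination and extend it parameter by parameter.
    let mut out: Vec<Vec<(String, f64)>> = vec![Vec::new()];
    for (name, values) in params {
        let mut next = Vec::with_capacity(out.len() * values.len());
        for combo in &out {
            for value in values {
                let mut extended = combo.clone();
                extended.push((name.to_string(), *value));
                next.push(extended);
            }
        }
        out = next;
    }
    out
}
```

Two temperatures and three `top_k` values give six combinations. In Full mode that would mean six server spawns under this assumption, which is why RuntimeOnly is cheaper for parameters that can be sent in the request body.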

Output formats: Markdown table, JSON, YAML, HTML report.

## Profiles, System Prompt Presets, RPC Workers, TLS (`src/config/`)

### Profiles (`config/profiles.rs`)

Named presets of model settings. Built-in: Qwen, Gemma, Llama, Mistral, Phi.

### System Prompt Presets (`config/presets.rs`)

Named system prompts. Built-in: General, Coder, Thinker, Mathematician.

### RPC Workers (`config.rs`)

Remote workers for distributed inference. Port default: 50052.
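
A worker entry with the default port might be modelled as follows. The field names and helpers are hypothetical (the real `RpcWorker` layout is not shown here), and joining endpoints with commas is an assumption about how the list is handed to `llama-server`.

```rust
pub const DEFAULT_RPC_PORT: u16 = 50052;

/// Hypothetical shape of a remote worker entry.
pub struct RpcWorker {
    pub host: String,
    pub port: u16,
}

impl RpcWorker {
    /// Creates a worker on the default RPC port.
    pub fn new(host: &str) -> Self {
        Self { host: host.to_owned(), port: DEFAULT_RPC_PORT }
    }

    /// Formats the worker as `host:port`.
    pub fn endpoint(&self) -> String {
        format!("{}:{}", self.host, self.port)
    }
}

/// Joins all worker endpoints into one comma-separated string.
pub fn join_endpoints(workers: &[RpcWorker]) -> String {
    workers
        .iter()
        .map(RpcWorker::endpoint)
        .collect::<Vec<_>>()
        .join(",")
}
```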

### TLS (`src/backend/tls.rs`)

Self-signed certificate generation (CA + server cert), persisted in `~/.config/llm-manager/tls/`.

### Config struct

```rust
pub struct Config {
    pub models_dirs: Vec<PathBuf>,
    pub llama_server: PathBuf,
    pub default: DefaultParams,
    pub model_overrides: ModelConfigStore,
    pub profiles: ProfileStore,
    pub system_prompt_presets: PresetStore,
    pub rpc_workers: Vec<RpcWorker>,
    pub ws_server: WsServer,
    pub search_limit: u32,
}
```

## Serve mode and API proxy (`src/serve.rs`, `src/serve_api.rs`)

### Standalone serve CLI

```bash
./build.sh serve --model /path/to/model.gguf --api-port 49222 --api-key secret
```

### API Proxy Server

Axum-based HTTP proxy. Key endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, `/v1/messages`, `/v1/embeddings`, `/v1/models`, `/health`, `/metrics`, `/models/load`, `/models/unload`, `/api/status`, and many more. All unmatched paths are auto-proxied.

## Coding rules

### Planning

1. Identify root cause, not just symptoms
2. List affected files and functions
3. Use a `todowrite` tool to track work as a numbered TODO list
4. Mark each item as `in_progress` before starting, `completed` when done
5. Keep the TODO list visible

### Dependencies

- No new dependencies without asking
- Prefer `ratatui` widgets over custom rendering

### Error handling

- Use `anyhow::Result` for async/API functions
- Use `thiserror` for application-specific error types
- Log errors with `app.add_log()` in the TUI

### Naming conventions

- `snake_case` for functions, variables, modules
- `PascalCase` for types, enums, variants
- Module names are lowercase (`backend`, `panel`)

### Async

- `handle_key` is async (for search queries)
- Download is spawned as a tokio task; progress flows through `mpsc`
- Main loop uses `crossterm::event::poll()` with a 100ms timeout

### TUI specifics

- Use `ratatui` widgets when possible
- Style with `Style` / `Color` / `Modifier` — prefer semantic colors:
  - Yellow: headers, active elements
  - Cyan: navigation hints
  - Green: success/completed
  - Red: errors/failure
- Avoid hardcoding terminal dimensions

### Configuration

- Config is YAML-based, stored in `~/.config/llm-manager/`
- New config fields go in `config.rs`; add defaults in `Default` impls

### Testing

- Unit tests in `mod tests` blocks
- Integration testing is manual (run the app)

### Git commits

- **Never commit changes yourself.** Always ask the user before committing.
- If the user explicitly asks you to commit, then do it.
- If the user wants you to stage changes, do that — but still wait for explicit permission to commit.

## Common tasks

### Adding a new panel

1. Create `src/tui/panel/name.rs` with a `render(f, area, app)` function
2. Add `mod name;` to `src/tui/panel/mod.rs`
3. Add a variant to the `ActivePanel` enum in `app/types.rs`
4. Dispatch in `render.rs` and `event.rs`

### Adding a new keyboard shortcut

1. Add to `handle_key()` in `event.rs`
2. Update the status bar in `render_status_bar()` in `render.rs`
3. If it changes state, update `App` fields in `app.rs`

### Adding a new API endpoint

1. Add the function in `src/backend/hub.rs`
2. Call from `event.rs` (usually in the search/files branch)
3. Update `SearchResult` or other types in `models.rs` if needed

### Adding a new backend

1. Add a variant to the `Backend` enum in `models.rs`, with serde and `Display` impls
2. Add `llama_cpp_version_{backend}` field to `DefaultParams` and `ModelSettings` in `config.rs` and `models.rs`
3. Update `from_settings()` / `apply()` in `config.rs`
4. Update `resolve_backend_binary()` in `hub.rs` for asset name
5. Update `spawn_server()` in `server.rs` for version lookup
6. Update `refresh_cached_versions()` in `app.rs` for directory detection
7. Update version picker in `models.rs` and event handling in `event.rs`