llm-manager 1.1.0

Terminal UI for managing LLMs
# AGENTS.md — llm-manager

## Project overview

**llm-manager** is a terminal UI (TUI) for managing local LLM models. It searches HuggingFace, downloads GGUF models, loads them via llama.cpp's `llama-server`, and lets you chat with them.

**Stack:** Rust 2024, ratatui 0.29, crossterm 0.28, tokio, reqwest, axum.

## Directory structure

```
src/
├── main.rs          # Entry point, event loop, model discovery, metrics polling
├── config.rs        # Config loading/saving, YAML-based, profiles, presets
├── models.rs        # Domain types (SearchResult, DownloadState, ModelSettings, etc.)
├── serve.rs         # Standalone serve mode CLI (--model, --profile, --api-port, --api-key)
├── serve_api.rs     # Axum-based API proxy server for serve mode
├── backend/
│   ├── hub.rs       # HuggingFace API: search, list files, download
│   ├── server.rs    # llama.cpp server spawning (resolve_backend_binary, spawn_server)
│   ├── benchmark.rs # Benchmark tuning system (RuntimeOnly and Full modes)
│   ├── hardware.rs  # GPU detection (AMD/NVIDIA/Intel), platform detection
│   ├── tls.rs       # TLS certificate generation for secure connections
│   └── ws_server.rs # WebSocket metrics server
├── tui/
│   ├── mod.rs       # Module declaration, format_size/format_number helpers
│   ├── app.rs       # App::new() — all state initialization
│   ├── app/types.rs # App struct and all state sub-structs
│   ├── app/*.rs     # app/ — async/sync operations, help, pickers, panels
│   ├── event/       # Keyboard/mouse event handling
│   ├── render.rs    # Top-level render dispatcher
│   └── panel/       # Individual panel render functions
├── config/
│   ├── model_config.rs  # Per-model YAML config store
│   ├── profiles.rs      # Profile YAML store
│   └── presets.rs       # System prompt preset YAML store
└── ws_server.rs
```

## Key architectural patterns

### App state machine (`src/tui/app/types.rs`)

`App` holds all state. `models_mode` is the mode enum that controls rendering:

```rust
pub enum ModelsMode {
    List,       // Local model list
    Search { query, results, sort_by, show_readme, page, loading, has_more },
    Files { model_id, files, selected_idx, previous_query, previous_results, selected_result },
}
```

`ActivePanel` enum controls which panel has focus:

```rust
pub enum ActivePanel {
    Models, Log, ServerSettings, LlmSettings, Profiles,
    SystemPromptPresets, SearchReadme, ActiveModel, ModelInfo, Downloads,
}
```

`GlobalMode` enum controls overlay modes:

```rust
pub enum GlobalMode {
    Normal,
    CmdLine { cmd_line: String },
    HostPicker { entries: Vec<(String, String)>, selected: usize },
    BackendPicker { entries: Vec<(Backend, Option<String>)>, selected: usize },
    ProfilePicker { entries, selected },
    PromptPicker { entries, selected, editing, edit_buffer, edit_cursor_pos, confirm_delete },
    Confirmation { selected: bool, kind: ConfirmationKind },
    RpcManager,
    About,
    MaxConcurrentPicker { value: String },
    BenchTuneSetup { config, selected_idx, bench_mode_selection, editing_prompt, editing_kwargs },
}
```

`ConfirmationKind` variants: `Exit`, `Reset`, `Delete`, `Unload`, `DeleteBackend`.

`LoadingPhase` variants: `ServerStarting`, `LoadingModel`, `LoadingMeta`, `LoadingTensors`, `ServerListening`, `Complete`.

### Log panel expand/collapse

The `App` struct has a `log_expanded: bool` field. When true:
- Layout switches to 2-chunk: status bar + log fills remaining space
- Models panel, Settings panel, and active model info are hidden
- Log panel shows `[Enter] expand` / `[Esc] collapse` hint in status bar

### Event handling (`src/tui/event/mod.rs`)

Key handling is hierarchical:
1. Global shortcuts (Ctrl+C, Tab, Ctrl+H, etc.)
2. CmdLine overlay
3. Exit/Reset confirmation
4. Version picker mode (takes priority when `ModelsMode::VersionPicker`)
5. Search mode (takes priority when `ModelsMode::Search`)
6. Files mode
7. Download mode
8. Normal mode → dispatch to panel-specific handlers

**Important:** Each branch calls `return` to prevent fallthrough. Adding a new mode requires early returns.
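
The early-return rule can be illustrated with a small sketch. The enums here are simplified stand-ins for the real `GlobalMode` and `ModelsMode` (field lists omitted), and `Handled` and `dispatch` are hypothetical names used only to show which layer claimed a key:

```rust
// Simplified stand-ins for the real types; field lists are omitted.
#[derive(Debug, PartialEq)]
enum GlobalMode { Normal, CmdLine }

#[derive(Debug, PartialEq)]
enum ModelsMode { List, Search, Files }

// Hypothetical marker for the layer that consumed the key.
#[derive(Debug, PartialEq)]
enum Handled { Global, CmdLine, Search, Files, Panel }

/// Route one key press. Each branch returns early, so a key consumed by a
/// higher-priority layer never reaches a lower one.
fn dispatch(key: char, ctrl: bool, global: &GlobalMode, models: &ModelsMode) -> Handled {
    // 1. Global shortcuts win regardless of mode.
    if ctrl && key == 'c' {
        return Handled::Global;
    }
    // 2. Overlay modes come next.
    if *global == GlobalMode::CmdLine {
        return Handled::CmdLine;
    }
    // 3. Mode-specific handlers for the models panel.
    match models {
        ModelsMode::Search => return Handled::Search,
        ModelsMode::Files => return Handled::Files,
        ModelsMode::List => {}
    }
    // 4. Nothing above claimed the key: fall to the panel-specific handler.
    Handled::Panel
}
```

A new mode added without its own `return` would silently fall into the panel handler, which is the failure this rule guards against.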

### Token count display

Token count (`ctx_used`) is primarily driven by the `/metrics` endpoint. Log-parsing of `n_tokens` is used as a fallback if the API reports 0, ensuring real-time updates during inference. This hybrid approach prevents "stuck" values while maintaining accuracy.
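
A minimal sketch of the fallback rule, assuming the log line contains the literal text `n_tokens = <number>`. That log format and both function names are assumptions for illustration, not the project's actual parser:

```rust
/// Pick the token count to display.
/// `metrics_ctx_used` is the value reported by `/metrics`;
/// `log_n_tokens` is the latest `n_tokens` parsed from the server log, if any.
fn effective_ctx_used(metrics_ctx_used: u64, log_n_tokens: Option<u64>) -> u64 {
    if metrics_ctx_used > 0 {
        // The API value is authoritative whenever it is non-zero.
        metrics_ctx_used
    } else {
        // The API reported 0: fall back to the log-derived value.
        log_n_tokens.unwrap_or(0)
    }
}

/// Hypothetical log parser: extract the integer following "n_tokens = ".
fn parse_n_tokens(line: &str) -> Option<u64> {
    let rest = line.split("n_tokens = ").nth(1)?;
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}
```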

### Search filtering (`src/backend/hub.rs`)

Search uses `&filter=gguf` on the HuggingFace API URL so the API itself only returns GGUF models. A post-filter then checks that the model_id contains the search query (case-insensitive). Default 70 results per page, max 200.
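
The two filtering stages and the page-size limits might look roughly like this. The function names are hypothetical, the base URL is assumed to be the public HuggingFace models endpoint, and the `search` and `limit` query parameters are assumptions beyond the `filter=gguf` part stated above:

```rust
const DEFAULT_LIMIT: u32 = 70;
const MAX_LIMIT: u32 = 200;

/// Resolve the page size: default 70, never more than 200.
fn page_limit(requested: Option<u32>) -> u32 {
    requested.unwrap_or(DEFAULT_LIMIT).clamp(1, MAX_LIMIT)
}

/// Build the search URL with the server-side GGUF filter.
/// NOTE: a real implementation must percent-encode `query`.
fn search_url(query: &str, limit: u32) -> String {
    format!("https://huggingface.co/api/models?search={query}&filter=gguf&limit={limit}")
}

/// Client-side post-filter: keep only ids containing the query, ignoring case.
fn post_filter<'a>(model_ids: &[&'a str], query: &str) -> Vec<&'a str> {
    let needle = query.to_lowercase();
    model_ids
        .iter()
        .copied()
        .filter(|id| id.to_lowercase().contains(&needle))
        .collect()
}
```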

### Panel help

- **Ctrl+H** — Panel-specific help via `render_panel()` (contextual for current panel)
- Panel-specific help content is generated by `App::panel_help_lines()` (app.rs)

### Rendering (`src/tui/render.rs`)

Top-level layout: status bar → top panels → active model → log. The models panel renders differently based on `models_mode`.

### Download cancellation

Download runs in a spawned tokio task. Cancellation uses `Arc<AtomicBool>` shared between the task and the UI. Pressing `c` sets the flag; the download loop checks it each iteration.
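
The cancellation pattern can be sketched with the standard library alone. This version uses `std::thread` instead of a tokio task so that it needs no dependencies; `Outcome`, `download_loop`, and `spawn_download` are hypothetical names, and `on_chunk` stands in for the real read-and-write step:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Debug, PartialEq)]
enum Outcome {
    Finished,
    Cancelled { chunks_done: usize },
}

/// Process `total_chunks` chunks, checking the shared flag before each one.
fn download_loop(
    cancel: &AtomicBool,
    total_chunks: usize,
    mut on_chunk: impl FnMut(usize),
) -> Outcome {
    for i in 0..total_chunks {
        if cancel.load(Ordering::Relaxed) {
            return Outcome::Cancelled { chunks_done: i };
        }
        on_chunk(i);
    }
    Outcome::Finished
}

/// Spawn the loop on a worker thread and return the flag the UI would set
/// (for example when `c` is pressed) together with the join handle.
fn spawn_download(total_chunks: usize) -> (Arc<AtomicBool>, std::thread::JoinHandle<Outcome>) {
    let cancel = Arc::new(AtomicBool::new(false));
    let worker_flag = Arc::clone(&cancel);
    let handle = std::thread::spawn(move || download_loop(&worker_flag, total_chunks, |_| {}));
    (cancel, handle)
}
```

Because the flag is only read once per iteration, a cancel request takes effect at the next chunk boundary rather than immediately.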

### Backend picker

The backend picker allows selecting llama.cpp binary versions per-backend. Triggered from the "LLama.cpp Version" field in LLM Settings.

- `Enter` selects backend+version for the active backend
- `d` deletes a backend version from disk (with `ConfirmationKind::DeleteBackend`)
- `Esc` exits back to settings

**HostPicker** (`GlobalMode::HostPicker`): Shows network interfaces and their IPs. `Enter` selects host, `d` refreshes, `Ctrl+H` closes.

**MaxConcurrentPicker** (`GlobalMode::MaxConcurrentPicker`): Numeric entry modal for max concurrent predictions (1-10).

**BenchTuneSetup** (`GlobalMode::BenchTuneSetup`): Full benchmark configuration modal. `Alt+m` toggles benchmark mode (RuntimeOnly vs Full), `Alt+p` edits prompt, `Alt+n` edits n_predict, `Alt+i` edits iterations, `Alt+c` edits chat template kwargs. Space toggles parameter enablement, Enter starts benchmark.

**YarnRoPESettings** (`GlobalMode::YarnRoPESettings`): YaRN RoPE parameter modal. `selected_field` values: -1 (Yarn RoPE enabled toggle), 0 (rope_scale), 1 (rope_freq_base), 2 (rope_freq_scale). `Enter` edits the selected field or applies the value. `Space` toggles Yarn RoPE enabled. `Up`/`Down` cycles fields. `Esc` cancels editing or exits. Only digits, `.`, `-`, `e`, `E` are accepted as input.
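
The input rules for these modals can be sketched as small pure functions. All names are hypothetical, and wrapping from field 2 back to -1 on `Down` is an assumption about what "cycles" means here:

```rust
/// Characters accepted while editing a YaRN RoPE field.
fn is_accepted_char(c: char) -> bool {
    c.is_ascii_digit() || matches!(c, '.' | '-' | 'e' | 'E')
}

/// Append `c` to the edit buffer only if it passes the filter.
fn push_filtered(buffer: &mut String, c: char) {
    if is_accepted_char(c) {
        buffer.push(c);
    }
}

/// Move to the next field: -1 (toggle), 0, 1, 2, then back to -1 (assumed wrap).
fn next_field(selected: i8) -> i8 {
    if selected >= 2 { -1 } else { selected + 1 }
}

/// Parse the max-concurrent-predictions entry, accepting only 1..=10.
fn parse_max_concurrent(value: &str) -> Option<u8> {
    value.trim().parse::<u8>().ok().filter(|n| (1..=10).contains(n))
}
```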

### Dirty tracking (`is_settings_dirty` in `app.rs`)

Compares each field index-by-index. When a field is dirty, its label is rendered in yellow.

**Index consistency** — all indices must be identical across:
- `settings.rs` dirty check match arms (line ~133)
- `event.rs` `apply_numeric_setting` / `adjust_setting` match arms
- `event.rs` `handle_settings_key` toggle shortcuts (`e` / `Ctrl+E`)
- `event.rs` comment block (line ~836)
- `app.rs` `is_settings_dirty` match arms
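
To show why the indices must agree everywhere, here is a sketch of an index-based dirty check over only the first three standard fields (context length, system prompt preset, mlock). The `Settings` struct and function names are hypothetical reductions, not the real `ModelSettings`:

```rust
#[derive(Clone, PartialEq)]
struct Settings {
    context_length: u32,          // index 0
    system_prompt_preset: String, // index 1
    mlock: bool,                  // index 2
}

/// Must equal the number of arms in every index-based match.
const FIELD_COUNT: usize = 3;

/// Compare one field of the live settings against the saved copy.
fn is_field_dirty(idx: usize, current: &Settings, saved: &Settings) -> bool {
    match idx {
        0 => current.context_length != saved.context_length,
        1 => current.system_prompt_preset != saved.system_prompt_preset,
        2 => current.mlock != saved.mlock,
        _ => false,
    }
}

/// Indices whose labels would be rendered in yellow.
fn dirty_indices(current: &Settings, saved: &Settings) -> Vec<usize> {
    (0..FIELD_COUNT)
        .filter(|&i| is_field_dirty(i, current, saved))
        .collect()
}
```

If a field were inserted at index 1 in only one of the match blocks, every later index would compare or edit the wrong field, which is why all the listed locations have to change together.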

### LLM Settings panel (55 fields in expert mode, 28 standard)

Standard fields (always visible):
```
Loading (0-2):   Context length, System prompt preset, Keep in memory (mlock)
GPU (3-8):       GPU Layers, Flash Attention, KV Cache Offload, Cache Type K, Cache Type V, Active Experts
Evaluation (9-11): Eval Batch, Unified KV, Max Concurrent Predictions
Sampling (12-17): Seed, Temperature, Top-k, Top-p, Min P, Max Tokens
Repetition (18-21): Repetition Penalty, Rep. Last N, Presence Penalty, Frequency Penalty
Backend (22):    Tags (Enter to edit)
Backend (23):    LLama.cpp Version (shows CPU / Vulkan / ROCm / CUDA versions)
Yarn RoPE (24-25): Yarn RoPE (toggle), Yarn Params (opens modal with rope_scale, rope_freq_base, rope_freq_scale)
MTP (26-27):     Enable MTP, Draft Tokens
```

Expert fields (toggle with `Ctrl+X`):
```
Expert Loading (28-33): Threads Batch, UBatch Size, Keep, SWA Full, MMap, NUMA
Expert GPU (34-41): Split Mode, Tensor Split, Main GPU, Fit, LoRA, LoRA Scaled, RPC, Embedding
Expert Sampling (42-47): Typical P, Mirostat, Mirostat LR, Mirostat Ent, Ignore EOS, Samplers
Expert DRY (48-51): DRY Multiplier, DRY Base, DRY Allowed Length, DRY Penalty Last N
Expert Server (52-54): Cache Prompt, Cache Reuse, WebUI
```

### Expert mode

Toggle with `Ctrl+X` in the LLM Settings panel. Shows/hides fields 28-54. Resets selected index and clears render cache on toggle.
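
A rough sketch of the toggle, assuming the selected index resets to 0 and the render cache is a simple optional value. `SettingsPanel` and its fields are hypothetical:

```rust
const STANDARD_FIELDS: usize = 28; // indices 0..=27
const EXPERT_FIELDS: usize = 27;   // indices 28..=54

struct SettingsPanel {
    expert_mode: bool,
    selected: usize,
    render_cache: Option<String>,
}

impl SettingsPanel {
    /// Number of fields currently shown: 28 standard, 55 with expert mode on.
    fn visible_fields(&self) -> usize {
        if self.expert_mode {
            STANDARD_FIELDS + EXPERT_FIELDS
        } else {
            STANDARD_FIELDS
        }
    }

    /// Flip expert mode; reset the selection so it cannot point at a hidden
    /// field, and drop the cached render so the list is rebuilt.
    fn toggle_expert(&mut self) {
        self.expert_mode = !self.expert_mode;
        self.selected = 0;
        self.render_cache = None;
    }
}
```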

### GPU Layers cycling (`src/models.rs::GpuLayersMode`)

```rust
pub enum GpuLayersMode {
    Auto,       // llama.cpp auto-detects based on VRAM (default)
    Specific(u32), // exact number of layers
    All,        // -ngl 999 (all layers)
}
```

Arrow keys cycle through modes: `Auto` → `1` → `2` → ... → `N` (total layers) → `All` → `Auto`.

**VRAM estimation** (`src/models.rs::estimate_vram_mib`):
- `Auto` uses a heuristic (~60% of total layers)
- `Specific(n)` uses exactly `n` layers
- `All` uses all layers
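
The cycling order and the per-mode layer count can be sketched as two methods on the enum. The method names, the handling of a model with zero layers, and the integer rounding of the 60% heuristic are assumptions:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GpuLayersMode {
    Auto,
    Specific(u32),
    All,
}

impl GpuLayersMode {
    /// Next mode in the cycle: Auto, 1, 2, ..., N, All, then Auto again.
    pub fn next(self, total_layers: u32) -> Self {
        match self {
            // With no layers there is nothing specific to select.
            Self::Auto if total_layers == 0 => Self::All,
            Self::Auto => Self::Specific(1),
            Self::Specific(n) if n < total_layers => Self::Specific(n + 1),
            Self::Specific(_) => Self::All,
            Self::All => Self::Auto,
        }
    }

    /// Layers assumed to sit on the GPU when estimating VRAM.
    pub fn estimated_layers(self, total_layers: u32) -> u32 {
        match self {
            // Roughly 60% of the layers; truncating division is an assumption.
            Self::Auto => (total_layers as u64 * 6 / 10) as u32,
            Self::Specific(n) => n,
            Self::All => total_layers,
        }
    }
}
```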

## Llama.cpp binary management

### Backend selection (`src/models.rs`)

Five main backends plus platform variants:

```rust
pub enum Backend {
    Cpu, Vulkan, Rocm, RocmLemonade, Cuda,
    CpuArm64, CpuWindows, VulkanWindows,
    CudaWindows12_4, CudaWindows13_1, HipWindows,
    CpuMacosArm64, CpuMacosX64,
}
```

### Binary storage

```
~/.local/share/llm-manager/bin/
├── llama-server-cpu-{version}/llama-server
├── llama-server-vulkan-{version}/llama-server
├── llama-server-rocm-{version}/llama-server
├── llama-server-rocm-lemonade-{version}/llama-server
└── llama-server-cuda-{version}/llama-server
```

### Per-backend version config

```yaml
llama_cpp_version_cpu: null
llama_cpp_version_vulkan: null
llama_cpp_version_rocm: null
llama_cpp_version_rocm_lemonade: null
llama_cpp_version_cuda: null
```

### Asset names

- **CPU:** `llama-{tag}-bin-ubuntu-x64.tar.gz`
- **Vulkan:** `llama-{tag}-bin-ubuntu-vulkan-x64.tar.gz`
- **ROCm:** `llama-{tag}-bin-ubuntu-rocm-7.2-x64.tar.gz`
- **ROCm Lemonade:** `llama-{tag}-ubuntu-rocm-{gfx_suffix}-x64.zip`
- **CUDA:** `llama.cpp-{tag}-cuda-12.8-amd64.tar.gz`
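
The naming scheme above maps directly onto a `match`. This sketch covers only the five Linux backends; the `asset_name` function and its `gfx_suffix` parameter are hypothetical, and returning `None` when the suffix is missing is an assumption:

```rust
#[derive(Debug, Clone, Copy)]
enum Backend {
    Cpu,
    Vulkan,
    Rocm,
    RocmLemonade,
    Cuda,
}

/// Build the release asset file name for a backend and release tag.
/// Returns `None` for ROCm Lemonade when no GPU architecture suffix is known.
fn asset_name(backend: Backend, tag: &str, gfx_suffix: Option<&str>) -> Option<String> {
    Some(match backend {
        Backend::Cpu => format!("llama-{tag}-bin-ubuntu-x64.tar.gz"),
        Backend::Vulkan => format!("llama-{tag}-bin-ubuntu-vulkan-x64.tar.gz"),
        Backend::Rocm => format!("llama-{tag}-bin-ubuntu-rocm-7.2-x64.tar.gz"),
        Backend::RocmLemonade => format!("llama-{tag}-ubuntu-rocm-{}-x64.zip", gfx_suffix?),
        Backend::Cuda => format!("llama.cpp-{tag}-cuda-12.8-amd64.tar.gz"),
    })
}
```

Note that the CUDA asset uses a different prefix (`llama.cpp-`) and architecture label (`amd64`) from the others, so a single shared template would not cover it.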

### Binary resolution (`src/backend/hub.rs`)

`resolve_backend_binary(backend, version)` checks if the binary + `libllama.so` exist. If not, it downloads and extracts the archive.
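
The existence check might be sketched as follows, assuming `libllama.so` sits next to `llama-server` in the versioned directory. That location, the helper names, and the `Resolution` type are assumptions; the download and extraction step is left out:

```rust
use std::path::{Path, PathBuf};

/// Directory for one backend/version pair, e.g. `<bin>/llama-server-vulkan-<version>`.
fn install_dir(bin_root: &Path, backend_slug: &str, version: &str) -> PathBuf {
    bin_root.join(format!("llama-server-{backend_slug}-{version}"))
}

/// True only if both the server binary and its shared library are present.
fn is_installed(dir: &Path) -> bool {
    dir.join("llama-server").is_file() && dir.join("libllama.so").is_file()
}

#[derive(Debug, PartialEq)]
enum Resolution {
    /// Path to a usable `llama-server` binary.
    Ready(PathBuf),
    /// Directory that still has to be populated by download and extraction.
    NeedsDownload(PathBuf),
}

fn resolve(bin_root: &Path, backend_slug: &str, version: &str) -> Resolution {
    let dir = install_dir(bin_root, backend_slug, version);
    if is_installed(&dir) {
        Resolution::Ready(dir.join("llama-server"))
    } else {
        Resolution::NeedsDownload(dir)
    }
}
```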

## Domain types (`src/models.rs`)

### Server mode

```rust
pub enum ServerMode { Normal, Router, Bench, BenchTune }
```

### Cache types

- **CacheType** (main KV cache): `F16`, `BF16`, `Fq8_0`, `Fq4_1`
- **CacheQuantType** (KV quantization): `F32`, `F16`, `BF16`, `Q8_0`, `Q4_0`, `Q4_1`, `Iq4Nl`, `Q5_0`, `Q5_1`

### Other enums

- **SplitMode:** `None`, `Layer` (default), `Row`, `Tensor`
- **NumMode:** `None` (default), `Distribute`, `Isolate`, `Numactl`
- **RopeScaling:** `None` (default), `Linear`, `Yarn`
- **Mirostat:** `Off` (default), `Mirostat`, `Mirostat2`
- **Samplers:** Semicolon-separated sampler order string
- **ModelState:** `Available`, `Loading`, `Benchmarking`, `Loaded { port, pid }`, `Failed { error }`
- **SearchSort:** `Relevance`, `Downloads`, `Likes`, `Trending`, `CreatedAt`

### Key structs

- **SearchResult** — fields: `model_id`, `model_name`, `tags`, `downloads`, `likes`, `pipeline_tag`, `size`, `parameters`, `capabilities`, `context_length`, `readme`, `quantization`, `license`, `trending_score`, `created_at`
- **GgufMetadata** — parsed GGUF metadata (layers, hidden_size, n_ctx_train, etc.)
- **LoadProgress** — fields: `layers_total`, `layers_loaded`, `tensors_loaded`, `buffers`
- **ServerMetrics** — fields: `loaded`, `tps`, `prompt_tps`, `cpu_usage`, `gpu_mem_used/total`, `ram_used`, `ctx_used/max`, `total_vram_used`, `decoded_tokens`, `throughput`, `latency_per_token_ms`
- **ModelSettings** — 70+ fields covering loading, GPU, sampling, repetition, RoPE, server config
- **BenchTuneConfig/Param/ParamValue/Result/Metrics/Status/Progress/Mode** — benchmark tuning types

## Benchmark Tuning (`src/backend/benchmark.rs`)

Two modes:
- **RuntimeOnly**: Single server, params sent in request body (no server restarts)
- **Full**: New server spawned for each parameter combination

Tunable parameters: temperature, top_p, top_k, repeat_penalty, flash_attn, threads, batch_size, expert_count.
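
If Full mode explores a complete grid, the set of runs is the Cartesian product of the enabled parameters' candidate values. The sketch below assumes that grid behavior and uses a hypothetical `Param` type with `f64` values only:

```rust
/// One tunable parameter and the candidate values to try.
struct Param {
    name: &'static str,
    values: Vec<f64>,
}

/// Cartesian product of all candidate values, one entry per benchmark run.
fn combinations(params: &[Param]) -> Vec<Vec<(&'static str, f64)>> {
    // Start with a single empty combination and extend it per parameter.
    let mut combos: Vec<Vec<(&'static str, f64)>> = vec![Vec::new()];
    for p in params {
        let mut next = Vec::with_capacity(combos.len() * p.values.len());
        for combo in &combos {
            for &v in &p.values {
                let mut extended = combo.clone();
                extended.push((p.name, v));
                next.push(extended);
            }
        }
        combos = next;
    }
    combos
}
```

The run count multiplies with every enabled parameter, so in Full mode each extra value also costs one more server start.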

Output formats: Markdown table, JSON, YAML, HTML report.

## Profiles, System Prompt Presets, RPC Workers, TLS (`src/config/`)

### Profiles (`config/profiles.rs`)

Named presets of model settings. Built-in: Qwen, Gemma, Llama, Mistral, Phi.

### System Prompt Presets (`config/presets.rs`)

Named system prompts. Built-in: General, Coder, Thinker, Mathematician.

### RPC Workers (`config.rs`)

Remote workers for distributed inference. Port default: 50052.

### TLS (`src/backend/tls.rs`)

Self-signed certificate generation (CA + server cert), persisted in `~/.config/llm-manager/tls/`.

### Config struct

```rust
pub struct Config {
    pub models_dirs: Vec<PathBuf>,
    pub llama_server: PathBuf,
    pub default: DefaultParams,
    pub model_overrides: ModelConfigStore,
    pub profiles: ProfileStore,
    pub system_prompt_presets: PresetStore,
    pub rpc_workers: Vec<RpcWorker>,
    pub ws_server: WsServer,
    pub search_limit: u32,
}
```

## Serve mode and API proxy (`src/serve.rs`, `src/serve_api.rs`)

### Standalone serve CLI

```bash
./build.sh serve --model /path/to/model.gguf --api-port 49222 --api-key secret
```

### API Proxy Server

Axum-based HTTP proxy. Key endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, `/v1/messages`, `/v1/embeddings`, `/v1/models`, `/health`, `/metrics`, `/models/load`, `/models/unload`, `/api/status`, and many more. All unmatched paths are auto-proxied.

## Coding rules

### Planning

1. Identify root cause, not just symptoms
2. List affected files and functions
3. Use a `todowrite` tool to track work as a numbered TODO list
4. Mark each item as `in_progress` before starting, `completed` when done
5. Keep the TODO list visible

### Dependencies

- No new dependencies without asking
- Prefer `ratatui` widgets over custom rendering

### Error handling

- Use `anyhow::Result` for async/API functions
- Use `thiserror` for application-specific error types
- Log errors with `app.add_log()` in the TUI

### Naming conventions

- `snake_case` for functions, variables, modules
- `PascalCase` for types, enums, variants
- Module names are lowercase (`backend`, `panel`)

### Async

- `handle_key` is async (for search queries)
- Download is spawned as a tokio task; progress flows through `mpsc`
- Main loop uses `crossterm::event::poll()` with a 100ms timeout
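
On each tick of the main loop, pending progress messages can be drained without blocking so that only the newest one is rendered. This sketch uses `std::sync::mpsc` rather than an async channel to stay dependency-free; `Progress` and `latest_progress` are hypothetical names:

```rust
use std::sync::mpsc::{self, Receiver};

#[derive(Debug, Clone, Copy, PartialEq)]
struct Progress {
    downloaded: u64,
    total: u64,
}

/// Drain every queued update without blocking and return the most recent one.
/// Returns `None` when nothing arrived since the last tick.
fn latest_progress(rx: &Receiver<Progress>) -> Option<Progress> {
    rx.try_iter().last()
}
```

Keeping only the last message means a fast download cannot queue up more redraws than the loop can handle within its 100ms poll interval.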

### TUI specifics

- Use `ratatui` widgets when possible
- Style with `Style` / `Color` / `Modifier` — prefer semantic colors:
  - Yellow: headers, active elements
  - Cyan: navigation hints
  - Green: success/completed
  - Red: errors/failure
- Avoid hardcoding terminal dimensions

### Configuration

- Config is YAML-based, stored in `~/.config/llm-manager/`
- New config fields go in `config.rs`; add defaults in `Default` impls

### Testing

- Unit tests in `mod tests` blocks
- Integration testing is manual (run the app)

### Git commits

- **Never commit changes yourself.** Always ask the user before committing.
- If the user explicitly asks you to commit, then do it.
- If the user wants you to stage changes, do that — but still wait for explicit permission to commit.

## Common tasks

### Adding a new panel

1. Create `src/tui/panel/name.rs` with a `render(f, area, app)` function
2. Add `mod name;` to `src/tui/panel/mod.rs`
3. Add to `ActivePanel` enum in `app.rs`
4. Dispatch in `render.rs` and `event.rs`

### Adding a new keyboard shortcut

1. Add to `handle_key()` in `event.rs`
2. Update the status bar in `render_status_bar()` in `render.rs`
3. If it changes state, update `App` fields in `app.rs`

### Adding a new API endpoint

1. Add the function in `src/backend/hub.rs`
2. Call from `event.rs` (usually in the search/files branch)
3. Update `SearchResult` or other types in `models.rs` if needed

### Adding a new backend

1. Add variant to `Backend` enum in `models.rs` with serde/Display impl
2. Add `llama_cpp_version_{backend}` field to `DefaultParams` and `ModelSettings` in `config.rs` and `models.rs`
3. Update `from_settings()` / `apply()` in `config.rs`
4. Update `resolve_backend_binary()` in `hub.rs` for asset name
5. Update `spawn_server()` in `server.rs` for version lookup
6. Update `refresh_cached_versions()` in `app.rs` for directory detection
7. Update version picker in `models.rs` and event handling in `event.rs`