manta-cli 1.64.3

Another CLI for ALPS
# Analysis: Should `site_name` Be Added to HTTP API Requests?

**Scope**: Evaluate whether the manta HTTP server should accept a `site_name` parameter per request and route to the matching backend dynamically, given the assumption that a JWT token is valid for any configured site.

---

## JWT Validity: Gate Passes

There is a centralized Keycloak that issues JWT tokens. Each site's Keycloak is configured to accept tokens from the centralized issuer. A single JWT is therefore valid across all configured sites.

**How token validation works today**: The `BearerToken` Axum extractor only parses the `Authorization: Bearer <token>` header — no validation at the manta layer. Every handler passes the raw token string to `backend.validate_api_token()`, which dispatches to csm-rs or ochami-rs and presents the token to that site's backend API. Because site Keycloaks federate with the central one, the token is accepted at every site.

This confirms the assumption. Multi-site routing is viable at the authentication layer.

---

## What "Adding `site_name`" Actually Means

Adding `site_name` to the API is only useful in a **multi-site server** model where one running `manta serve` process handles requests for several backend sites simultaneously.

In the alternative **one-server-per-site** model, the site is implied by which server the client connects to. The CLI already reads `site = "alps"` from config; in HTTP mode it would read `manta_url = "https://manta.alps.example.com"` from the same config block and direct all requests there. No `site_name` in the API is needed.

The value of the multi-site model is operational: fewer processes to run, one TLS certificate, one URL. The cost is a non-trivial refactor described below.

---

## Network Topology

```
[CLI / caller]
      │
      │  (caller may need SOCKS5 to reach the manta server — irrelevant to server internals)
      ▼
[manta HTTP server]
      │
      │  (server needs SOCKS5 to reach the backend: CSM / OpenCHAMI APIs)
      ▼
[Backend: CSM / OpenCHAMI]
```

The two SOCKS5 legs are independent. Only the server-to-backend leg is at issue for multi-site.

---

## SOCKS5: Hard Architectural Constraint

`main.rs` sets the proxy before the tokio runtime starts:

```rust
// main.rs:73–83
unsafe {
    std::env::set_var("SOCKS5", socks_proxy);
}
// then: rt.block_on(run(...))
```

csm-rs reads this env var on every `reqwest::Client` build (106 call sites). This is process-global.

`std::env::set_var` is marked `unsafe` since Rust 1.66 precisely because it is unsound in a multi-threaded context. For a multi-site server, there is no way to set different per-site proxies via env var — the fix is to remove the env-var coupling entirely and thread `socks5_proxy: Option<String>` explicitly through `InfraContext` into csm-rs. This requires centralizing `reqwest::Client` construction in csm-rs.

This refactor is a **hard prerequisite** for multi-site support regardless of whether sites happen to use the same proxy today. The current call to `set_var` at startup must be removed before `manta serve` is safe to run with multiple sites in its config.

---

## `site_name` Touch Points

### Direct consumers requiring explicit `site_name`

| Location | Purpose |
|---|---|
| `handlers::create_session` | Vault path: `manta/data/{site_name}` |
| `handlers::get_session_logs` | Vault path: `manta/data/{site_name}` |
| `handlers::post_sat_file` | Vault path: `manta/data/{site_name}` |
| `handlers::apply_session` | Vault path: `manta/data/{site_name}` |
| `handlers::console_node_ws` | K8s pod console streaming — needs site's K8s credentials |
| `handlers::console_session_ws` | K8s CFS session console streaming — needs site's K8s credentials |
| `service::configuration` (×2) | Vault path construction |
| `service::sat_file` (×2) | Vault path construction |

### Implicit consumers (all 47 handlers)

Every handler receives an `InfraContext` whose `backend: &StaticBackendDispatcher` carries a fixed `base_url`. Switching sites is not a string change — it requires a different `StaticBackendDispatcher` instance pointing at a different URL with a different root certificate.

---

## `ServerState` Refactor Scope

Current (single-site):
```rust
pub struct ServerState {
    pub backend: StaticBackendDispatcher,
    pub site_name: String,
    pub shasta_base_url: String,
    pub shasta_root_cert: Vec<u8>,
    pub vault_base_url: Option<String>,
    pub gitea_base_url: Option<String>,
    pub k8s_api_url: Option<String>,
    pub console_inactivity_timeout: u64,
}
```

Required for multi-site:
```rust
pub struct SiteBackend {
    pub backend: StaticBackendDispatcher,
    pub shasta_base_url: String,
    pub shasta_root_cert: Vec<u8>,
    pub socks5_proxy: Option<String>,   // must be explicit, not env-var
    pub vault_base_url: Option<String>,
    pub gitea_base_url: Option<String>,
    pub k8s_api_url: Option<String>,
    pub k8s_auth: Option<K8sAuth>,      // needed by console WebSocket handlers
}

pub struct ServerState {
    pub sites: HashMap<String, SiteBackend>,
    pub default_site: String,
    pub console_inactivity_timeout: u64,
}
```

Note: `k8s_auth` must be included — `console_node_ws` and `console_session_ws` stream directly from Kubernetes pods and need the site's K8s credentials, which can be certificate-based (`K8sAuth::Native`) or Vault-delegated (`K8sAuth::Vault`). These are not just a URL.

Every handler currently calls `state.infra_context()` once. With multi-site, the lookup must happen at request time. An Axum extractor can centralize this:

```rust
// Sketch — extracts the right SiteBackend from ?site= (defaults to default_site)
pub struct SiteContext(pub Arc<SiteBackend>);

impl FromRequestParts<Arc<ServerState>> for SiteContext { ... }
```

Handler signatures gain one parameter; the lookup and 400 error for unknown sites are in one place.

**Estimate**: ~80 lines of extractor code + one added parameter per handler.

---

## Protocol Placement

| Option | GET? | WebSocket? | Body collision? | Breaking change? |
|---|---|---|---|---|
| Query param `?site=` | Yes | Yes | No | No (optional) |
| Header `X-Manta-Site` | Yes | Yes | No | No (optional) |
| JSON body field | No | No | Yes | Yes |
| URL path `/api/v1/{site}/...` | Yes | Yes | No | Yes (all routes change) |

**Query param** is the best fit: works across all HTTP verbs and WebSocket upgrades, can be made optional with the server's `default_site` as fallback (zero breaking change for existing single-site clients), and is visible in logs.

---

## Security Implication

A multi-site server that accepts `?site=eiger` from any authenticated caller gives every token-holder access to every configured site. Because the centralized Keycloak issues tokens valid at all sites, this is expected behaviour — any holder of a valid JWT can already access any site directly. The manta server does not widen the attack surface.

What the server should still do: validate the token against the *target* site's backend, not the default site. Currently `validate_api_token` is called implicitly on the first real API call — there is no explicit pre-flight check in the manta layer. For multi-site this matters because a request could fail midway through a handler on a site whose backend happens to be unreachable, rather than failing fast at token validation. An explicit early `validate_api_token` call against the resolved `SiteBackend` before entering handler logic is the right fix.

---

## Startup Behavior in Multi-site Mode

Currently `manta serve` reads one site at startup and fails fast if the site is unknown. In multi-site mode:
- Does startup fail if any configured site is unreachable? (strict — bad for partial outages)
- Or does it start with whatever sites are reachable? (lenient — harder to detect misconfig)
- What happens when a client requests an unconfigured site name — 400 or 503?

These are operational questions that need answers before the implementation, not after.

---

## Recommendation

The JWT gate passes. Multi-site routing is architecturally viable. The remaining work breaks into two tracks:

**Track 1 — required regardless of multi-site (do first)**

Remove the env-var SOCKS5 coupling in csm-rs. This is a pre-existing correctness issue: `set_var` before the tokio runtime is a workaround, not a design. Centralise `reqwest::Client` construction in csm-rs so `socks5_proxy` is passed explicitly. This unblocks multi-site and also makes the single-site server cleaner.

**Track 2 — the multi-site feature itself**

1. Add `socks5_proxy: Option<String>` and `k8s_auth: Option<K8sAuth>` to `InfraContext` / `SiteBackend` (after Track 1).
2. Refactor `ServerState` from a single backend to `HashMap<String, SiteBackend>`.
3. Add an Axum extractor that reads `?site=` (optional, defaults to `default_site`), looks up the site, and produces the resolved `InfraContext`. Add an explicit `validate_api_token` call against the target site inside the extractor.
4. Thread the extractor into all 47 handler signatures (one-line change each).
5. Update `manta serve` startup to load all configured sites, not just one.

**Deployment model decision**

With the shared-Keycloak confirmation, both models are now valid:

| Model | Trade-off |
|---|---|
| **One server per site** | Simplest. CLI picks the right manta URL per site from config (same pattern as `shasta_base_url`). No `site_name` in the API. No multi-site refactor. Appropriate if each site is managed independently. |
| **One shared multi-site server** | One process, one TLS cert, one URL. Operators pass `?site=alps`. Requires Track 1 + Track 2 above. Appropriate if the team manages all sites centrally and wants a single control plane. |

**Suggested order**: ship Track 1 (SOCKS5 refactor in csm-rs) first as a standalone improvement, then decide which deployment model to target before starting Track 2.