innisfree 0.4.3

# innisfree — analysis notes

Survey of the v0.4.1 codebase, focused on the three user-supplied
themes: maintainability hygiene, the Kubernetes operator pattern,
and a possible move from Debian + cloud-init to NixOS.

## 1. Codebase shape (as of v0.4.1)

```
src/
├── lib.rs              re-exports the public modules
├── main.rs             clap CLI; subcommands: up, ssh, ip, doctor, clean, proxy
├── config.rs           ServicePort, Protocol, clean_name
├── manager.rs          TunnelManager: orchestrates server + wg + state
├── state.rs            TunnelStateDir: per-service on-disk layout
├── net.rs              /30 subnet picker (10.50.0.1/28 parent)
├── proxy.rs            tokio TCP forwarder (UDP not implemented)
├── ssh.rs              ED25519 keypair generation + write_to(path)
├── doctor.rs           CAP_NET_ADMIN + DIGITALOCEAN_API_TOKEN preflight
├── server.rs           Provider + InnisfreeServer traits, ServerSpec
├── server/
│   ├── cloudinit.rs    CloudConfig YAML builder (Tera templates)
│   └── digitalocean/   only concrete provider
│       ├── client.rs   DoClient: shared reqwest + token + envelope unwrap
│       ├── provider.rs DigitalOceanProvider: implements Provider
│       ├── server.rs   Droplet: implements InnisfreeServer
│       ├── floating_ip.rs
│       └── ssh_key.rs
└── wg/
    ├── mod.rs          WireguardManager / Device / Host / Keypair
    └── runtime.rs      LocalWg: in-process boringtun + rtnetlink
files/                  cloudinit.cfg, wg0.conf.j2, stream.conf.j2 (include_str!'d)
tests/                  integration_test.rs (live DO), static_linkage.rs
```

The trait split (`Provider` factory → `InnisfreeServer` handle) is well
shaped, the `DoClient` is a clean shared HTTP layer, and `TunnelStateDir`
is a single funnel for every path the binary writes. The recent 0.4.x
refactors have left the code in good shape; what follows are the
remaining sharp edges and the structural changes the new use cases
will require.

## 2. Maintainability findings

### 2.1 XDG paths (the prompt's example)

`src/state.rs:107-110` resolves the per-service state dir as:

```rust
home::home_dir()?.join(".config").join("innisfree")
```

This is wrong on two axes:

* **Wrong category.** Everything under that dir is *state*, not *config*:
  ephemeral SSH keypairs, the wg config rendered from runtime data, the
  `ip` readiness marker, the `known_hosts` pinned to a freshly-booted
  cloud node. Per the XDG Base Directory spec, that belongs under
  `$XDG_STATE_HOME` (default `~/.local/state/`), not `$XDG_CONFIG_HOME`.
  If real configuration is ever introduced (e.g. saved provider creds,
  default region), *that* belongs under `$XDG_CONFIG_HOME`.
* **Wrong resolver.** `home::home_dir()` ignores `XDG_*` overrides
  entirely, so users who set a custom `XDG_CONFIG_HOME` or
  `XDG_STATE_HOME` silently get the wrong path.

The state.rs doc comment already flags the migration as a single-edit
site (state.rs:9-10, 105-106), and `home` is only used here — replacing
it with `directories` (or `etcetera` / `xdg`) is genuinely a 10-line
change. The migration should also write a one-shot symlink/copy from
the old location to keep existing `innisfree@<service>.service` units
working without operator action.
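
A minimal sketch of the resolver, using only the standard library so the lookup rule is visible; a crate such as `directories` would replace it in practice. The function names are hypothetical. The rule it encodes is the XDG one: `XDG_STATE_HOME` is honoured only when set, non-empty, and absolute, otherwise the default is `$HOME/.local/state`.

```rust
use std::path::{Path, PathBuf};

/// Resolve the innisfree state directory from explicit environment values.
/// Taking the values as arguments (instead of reading `std::env` here)
/// keeps the function pure and easy to test.
pub fn state_dir(xdg_state_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    let base = match xdg_state_home {
        // The XDG spec says relative paths in these variables are invalid
        // and must be ignored, so fall through to the default for them.
        Some(p) if !p.is_empty() && Path::new(p).is_absolute() => PathBuf::from(p),
        _ => PathBuf::from(home?).join(".local").join("state"),
    };
    Some(base.join("innisfree"))
}

/// Thin wrapper that reads the real process environment.
pub fn state_dir_from_env() -> Option<PathBuf> {
    let xdg = std::env::var("XDG_STATE_HOME").ok();
    let home = std::env::var("HOME").ok();
    state_dir(xdg.as_deref(), home.as_deref())
}
```

The per-service directory would then be `state_dir_from_env()?.join(<service>)`, which leaves the rest of `TunnelStateDir` untouched.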

### 2.2 No locking around the per-service state dir

`TunnelManager::new` (manager.rs:84) calls
`remove_state_for_service(tunnel_name)` unconditionally before
re-creating the dir. That is the only safety net against stale state
from a crashed prior run, but it is also a foot-gun: a second
`innisfree up --name foo` invoked while the first is alive will wipe
the first's keypairs and `known_hosts` out from under it, then race
to create a second droplet of the same name. A `flock`-style
`<dir>/.lock` would catch this cheaply.
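
A rough sketch of such a guard. It is not `flock`: to stay within the standard library it uses an exclusive-create lock file instead, which has the known weakness that a crashed run leaves the file behind (the recorded PID is there to help spot that). `StateLock` is a hypothetical name, not something in the crate today.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Held for the lifetime of a tunnel; removes the lock file when dropped.
pub struct StateLock {
    path: PathBuf,
}

impl StateLock {
    /// Try to take `<dir>/.lock`. `create_new` fails with `AlreadyExists`
    /// when the file is already there, so acquisition is atomic.
    pub fn acquire(dir: &Path) -> io::Result<StateLock> {
        fs::create_dir_all(dir)?;
        let path = dir.join(".lock");
        let mut file = OpenOptions::new().write(true).create_new(true).open(&path)?;
        // Record the owner's PID so a human can judge whether the lock is stale.
        writeln!(file, "{}", std::process::id())?;
        Ok(StateLock { path })
    }
}

impl Drop for StateLock {
    fn drop(&mut self) {
        // Best effort: nothing useful can be done if removal fails.
        let _ = fs::remove_file(&self.path);
    }
}
```

For this to help, the lock would have to be taken before `remove_state_for_service` runs, and the wipe would have to spare the `.lock` file itself.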

### 2.3 Library / binary boundary leaks toward `main.rs`

`src/lib.rs` re-exports every module as `pub`, so the crate is shipped
as a library too — but `src/main.rs` re-declares `mod config; mod
manager; …` rather than `use innisfree::*`. Both copies compile, but
the binary's view is the canonical one for behaviour: the
`DigitalOceanProvider` is constructed in main.rs:148-149, not exposed
anywhere from the library. A second consumer of the library (the
operator binary, see §3) would have to either re-import the DO module
directly or reach back through main.rs. Pull provider construction
into a small library helper (`innisfree::providers::digitalocean()`)
and have main.rs call into it.

### 2.4 The `Provider` trait can only `create` and `destroy`

`src/server.rs:46-69` defines the surface:

```rust
trait Provider { async fn create(&self, spec: &ServerSpec) -> ...; }
trait InnisfreeServer {
    fn ipv4_address(&self) -> Result<IpAddr>;
    async fn assign_floating_ip(&self, ip: IpAddr) -> Result<()>;
    async fn destroy(&self) -> Result<()>;
}
```

There is no `update`, no `lookup_by_name`, no `list`, no `status`. The
CLI happens to be fine with this, since every `up` is a green-field
provision, but a reconcile loop needs more. Specifically:

* **`update_services(&[ServicePort])`** to push a new nginx stream
  config and reload, without rebuilding the droplet.
* **`reattach(name) -> Option<Box<dyn InnisfreeServer>>`** so the
  operator can recover state after a pod restart.
* **`status() -> ServerStatus`** so the operator can populate
  `Service.status.loadBalancer.ingress[]` with both the IP and a
  health flag.

These all map cleanly onto DO API endpoints we already use; the work
is in the trait + the manager glue, not the provider.
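
To make the shape concrete, here is a sketch of the extended surface with an in-memory fake behind it. Several things are assumptions: the methods are synchronous so the example compiles without an async runtime (the real traits are `async`), `reattach` returns a borrowed handle through an associated type rather than `Box<dyn InnisfreeServer>`, and `ServicePort` is reduced to a single field because its real definition is not shown here.

```rust
use std::collections::HashMap;
use std::net::{IpAddr, Ipv4Addr};

/// Simplified stand-in for the crate's `ServicePort`.
#[derive(Clone, Debug, PartialEq)]
pub struct ServicePort {
    pub port: u16,
}

#[derive(Clone, Debug, PartialEq)]
pub struct ServerStatus {
    pub ipv4: IpAddr,
    pub healthy: bool,
    pub observed_ports: Vec<ServicePort>,
}

/// Handle-side additions: report status, accept a new service list in place.
pub trait InnisfreeServer {
    fn status(&self) -> ServerStatus;
    fn update_services(&mut self, services: &[ServicePort]);
}

/// Factory-side addition: find an already-provisioned server by name.
pub trait Provider {
    type Server: InnisfreeServer;
    fn reattach(&mut self, name: &str) -> Option<&mut Self::Server>;
}

/// In-memory fake, used only to exercise the trait shape.
pub struct FakeServer {
    ipv4: IpAddr,
    ports: Vec<ServicePort>,
}

impl InnisfreeServer for FakeServer {
    fn status(&self) -> ServerStatus {
        ServerStatus { ipv4: self.ipv4, healthy: true, observed_ports: self.ports.clone() }
    }
    fn update_services(&mut self, services: &[ServicePort]) {
        // A real implementation would push the nginx config and reload here.
        self.ports = services.to_vec();
    }
}

#[derive(Default)]
pub struct FakeProvider {
    servers: HashMap<String, FakeServer>,
}

impl FakeProvider {
    pub fn insert(&mut self, name: &str, ipv4: Ipv4Addr) {
        let server = FakeServer { ipv4: IpAddr::V4(ipv4), ports: Vec::new() };
        self.servers.insert(name.to_string(), server);
    }
}

impl Provider for FakeProvider {
    type Server = FakeServer;
    fn reattach(&mut self, name: &str) -> Option<&mut FakeServer> {
        self.servers.get_mut(name)
    }
}
```

A fake like this is also what the manager glue could be unit-tested against, without touching the DO API.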

### 2.5 Cloud-init is one-shot

The remote nginx stream config is baked into the cloud-init at droplet
creation (server/cloudinit.rs:90-97 → `/etc/nginx/conf.d/stream/innisfree.conf`).
Adding or removing a service today requires `innisfree clean && innisfree up`
— i.e. a new droplet, a new public IP, and a `known_hosts` rewrite.
That is fine for ad-hoc `up`/`down` use but unworkable for either
operator reconciliation or NixOS-style ongoing config management. The
remote needs an in-band reconfigure path: render the template locally,
SSH-push the file, `nginx -s reload`. The bones for this exist in
`run_ssh_cmd` (manager.rs:245).
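
The "render locally" half of that path could look roughly like the sketch below. The stanza layout is an assumption: the real `stream.conf.j2` is not reproduced here, so this only uses the standard nginx stream directives (`listen <port> [udp]`, `proxy_pass <addr>:<port>`) and a hypothetical upstream address for the local end of the tunnel.

```rust
use std::fmt::Write;
use std::net::Ipv4Addr;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Protocol {
    Tcp,
    Udp,
}

/// Simplified stand-in for the crate's `ServicePort`.
pub struct ServicePort {
    pub port: u16,
    pub protocol: Protocol,
}

/// Render one nginx stream `server` stanza per service, each forwarding
/// the public port to the same port on `upstream` (the wg peer address).
pub fn render_stream_conf(upstream: Ipv4Addr, services: &[ServicePort]) -> String {
    let mut out = String::new();
    for s in services {
        // nginx marks UDP listeners with a trailing `udp` parameter.
        let udp = if s.protocol == Protocol::Udp { " udp" } else { "" };
        // Writing to a String cannot fail, so unwrap is safe here.
        writeln!(out, "server {{").unwrap();
        writeln!(out, "    listen {}{};", s.port, udp).unwrap();
        writeln!(out, "    proxy_pass {}:{};", upstream, s.port).unwrap();
        writeln!(out, "}}").unwrap();
    }
    out
}
```

The rendered string is what would be pushed over SSH to `/etc/nginx/conf.d/stream/innisfree.conf` before the reload.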

### 2.6 Hard-coded provisioning knobs

`server/digitalocean/server.rs:20-26` hard-codes region (`sfo2`), size
(`s-1vcpu-1gb`), image (`debian-13-x64`). None are exposed via CLI.
For a homelab in a different region this is a recompile. Promote them
to CLI flags / env vars (clap already wires both) with the current
values as defaults.

### 2.7 `proxy.rs` only handles TCP

`Protocol::Udp` parses, the nginx remote template renders UDP
correctly, but the local `--dest-ip` proxy uses
`tokio::net::TcpStream` exclusively (proxy.rs:18-37). A user who
declares a UDP service today gets nginx forwarding remote-side, but
the local proxy hop will silently no-op. Either (a) fix `transfer()`
to dispatch on `Protocol`, or (b) reject UDP services in the CLI
when `--dest-ip != 127.0.0.1`.
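
Option (b) is the cheaper of the two and could be a preflight check along these lines. This is a sketch under the assumption stated above, namely that the local proxy hop is only in play when `--dest-ip` is not `127.0.0.1`; the function name and the simplified `ServicePort` are hypothetical.

```rust
use std::net::{IpAddr, Ipv4Addr};

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Protocol {
    Tcp,
    Udp,
}

/// Simplified stand-in for the crate's `ServicePort`.
#[derive(Clone, Copy, Debug)]
pub struct ServicePort {
    pub port: u16,
    pub protocol: Protocol,
}

/// Reject UDP services whenever the TCP-only local proxy would be needed.
pub fn check_proxy_support(dest_ip: IpAddr, services: &[ServicePort]) -> Result<(), String> {
    // With the default destination there is no local proxy hop to worry about.
    if dest_ip == IpAddr::V4(Ipv4Addr::LOCALHOST) {
        return Ok(());
    }
    match services.iter().find(|s| s.protocol == Protocol::Udp) {
        Some(s) => Err(format!(
            "UDP service on port {} cannot be proxied to {}: the local proxy is TCP-only",
            s.port, dest_ip
        )),
        None => Ok(()),
    }
}
```

Failing at argument-parsing time turns the silent no-op into an explicit error before any droplet is created.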

### 2.8 Signal handling

`TunnelManager::block` (manager.rs:196-204) only awaits `signal::ctrl_c`
(SIGINT). The systemd unit hacks around this with `KillSignal=SIGINT`
(innisfree@.service:11). `tokio::signal::unix::signal(SignalKind::terminate())`
plus a `select!` on both is the standard fix. Already on TODO.md.

### 2.9 Spawned proxy task is not joined

main.rs:195-200 does `tokio::spawn(manager::run_proxy(…))` and then
falls through to `mgr.block().await?`. The spawn handle is dropped, so
if the proxy returns an error early — say, port already in use — the
user only sees a `tracing::warn!` from inside `run_proxy`. The exit
code stays zero. Hold the `JoinHandle` and `select!` it against
`block()` so an early proxy failure tears the tunnel down.

### 2.10 `get_server_ip` is a string file

`state.rs:67` writes `ip` as a bare `"<addr>\n"`. It works, but it has
no room to grow — operator status reporting wants more than an IP
(droplet ID, status, last-reconciled timestamp, observed services).
Make this a small JSON document (`status.json`) and keep `ip` as a
backward-compat symlink for one release.

### 2.11 Misc minor

* `serde_yaml = "0.8"` is unmaintained. Migrate to `serde_yml` (the
  community fork) or hand-render — the cloud-init shape is small.
* `lib.rs` doc comment (`Right now, only TCP traffic is supported`) is
  stale; UDP is supported in the templates.
* `WireguardManager::new` uses `remote_name = "innisfree-<svc>-remote"`
  which is purely cosmetic (it never reaches the kernel; only the
  *local* iface name has a length constraint). Worth a comment, since
  the asymmetry with the local-side `pick_local_iface_name` looks like
  a bug at first glance.
* `config.rs:105-115` `clean_name` is a string-replace, not a regex.
  `"innisfreeinnisfreeinnisfree"` becomes `"innisfree-"`; harmless,
  but write a test or replace with a `strip_prefix` loop.

## 3. Kubernetes operator considerations

### 3.1 What you actually want

Stated minimally: when a `Service` of `type=LoadBalancer` appears in
the cluster, an `innisfree-controller` Pod should provision a remote
ingress (DO droplet + tunnel) and write the public IP back into
`status.loadBalancer.ingress[].ip`. Reconcile changes (added/removed
ports, deleted Service) without recreating the droplet. Survive its
own restart — re-attach to existing droplets rather than orphaning
and re-provisioning.

### 3.2 Topology choice

There are two reasonable shapes:

**(A) Controller-only, droplet drives traffic into the cluster.** The
controller runs as a single Deployment; for each Service it stands up
(or updates) a remote droplet whose nginx forwards each public port
back over wg to a *node IP : nodePort* combination. The wg tunnel
terminates on the controller pod (which holds CAP_NET_ADMIN), and the
local proxy hop forwards into the cluster network. This is the closest
mapping to today's `innisfree up`.

**(B) Controller + per-Service tunnel pod.** The controller acts as a
pure reconciler; for each Service it manages, it creates a small
DaemonSet/Deployment of "tunnel pods" that each hold one wg interface
and proxy into the Service ClusterIP. This is closer to how
[klipperlb](https://github.com/k3s-io/klipper-lb) works.

**Recommendation: start with (A).** It needs one CAP_NET_ADMIN pod
total; it reuses the existing single-process `innisfree up` flow with
minor refactoring (managing N tunnels in one process); it punts the
per-pod orchestration question. (B) is the cleaner answer at scale,
but you are running one homelab cluster, not a multi-tenant SaaS.

### 3.3 Multi-tenancy on the cloud side

With (A) there is still a question of *one droplet per Service* vs.
*one droplet for the whole cluster*. At ~$4–6/mo per DO droplet and
hobbyist scale, the shared droplet is right by default:

* nginx happily hosts many `server { listen ...; }` stanzas, one per
  port across all Services.
* Port collisions across Services *must* fail fast — surface as an
  Event on the offending Service.
* Add `loadBalancerClass` discrimination so users can opt a specific
  Service into "give me my own droplet" by selecting a different class
  (e.g. `innisfree.ruin.dev/dedicated`).
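
The fail-fast check for port collisions is a small pure function. The sketch below assumes two things that are not settled above: that the first Service in input order keeps the port (the controller would have to pick a stable order, such as creation time), and that the same port number on TCP and UDP is not a collision. All names are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Protocol {
    Tcp,
    Udp,
}

#[derive(Debug, PartialEq)]
pub struct Collision {
    pub port: u16,
    pub protocol: Protocol,
    /// Service that claimed the port first.
    pub owner: String,
    /// Service that should receive the Event.
    pub offender: String,
}

/// Walk the Services in order and report every later claim on a
/// (port, protocol) pair that an earlier Service already holds.
pub fn find_collisions(services: &[(&str, Vec<(u16, Protocol)>)]) -> Vec<Collision> {
    let mut owners: HashMap<(u16, Protocol), &str> = HashMap::new();
    let mut collisions = Vec::new();
    for (name, ports) in services {
        for &(port, protocol) in ports {
            let key = (port, protocol);
            match owners.get(&key).copied() {
                Some(owner) if owner != *name => collisions.push(Collision {
                    port,
                    protocol,
                    owner: owner.to_string(),
                    offender: name.to_string(),
                }),
                // The same Service listing a port twice is not a cross-Service clash.
                Some(_) => {}
                None => {
                    owners.insert(key, *name);
                }
            }
        }
    }
    collisions
}
```

Each returned `Collision` maps onto one Event on the `offender` Service, and that Service is left without an ingress IP.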

Either way, the operator needs a *cluster-scoped backend object* that
holds the DO API token (Secret reference) and the droplet sizing
defaults. A `ClusterInnisfreeBackend` CRD is the standard shape; that
keeps the per-Service annotation surface minimal.

### 3.4 API surface

Avoid a per-Service CRD if possible — users get a much better
experience if they just write `type: LoadBalancer` and pick a
`loadBalancerClass`. Annotations cover the rest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: minecraft
  annotations:
    innisfree.ruin.dev/floating-ip: "203.0.113.42"
spec:
  type: LoadBalancer
  loadBalancerClass: innisfree.ruin.dev/digitalocean
  ports:
    - port: 25565
      targetPort: 25565
      protocol: TCP
```

Plus one cluster-scoped CRD for backend config:

```yaml
apiVersion: innisfree.ruin.dev/v1alpha1
kind: ClusterInnisfreeBackend
metadata:
  name: default
spec:
  provider: digitalocean
  region: sfo2
  size: s-1vcpu-1gb
  apiTokenSecretRef: { name: do-token, key: token }
  sharing: shared    # or "per-service"
```

### 3.5 Where state lives

The CLI today persists per-tunnel state in `~/.config/innisfree/<svc>/`.
The operator must move that into Kubernetes objects, so a controller
restart can re-attach:

* A per-Service `Secret` holds wg keypairs + the SSH client keypair.
* A per-Service `ConfigMap` (or a status subresource on the
  `ClusterInnisfreeBackend`) holds the droplet ID + assigned IP.
* `TunnelStateDir` becomes one of two implementations behind a small
  `StateStore` trait — local-fs for the CLI, k8s-objects for the
  operator. Same `for_service` / `open` shape; different backing.
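
One possible shape for that trait, sketched with a flat `put`/`get` interface rather than the existing `for_service` / `open` methods (whose signatures are not reproduced here). `InMemory` stands in for the Kubernetes-backed implementation, which would need a client crate; everything named below is hypothetical.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Minimal key/value view of per-service state.
pub trait StateStore {
    fn put(&mut self, service: &str, key: &str, value: &[u8]) -> io::Result<()>;
    fn get(&self, service: &str, key: &str) -> io::Result<Option<Vec<u8>>>;
}

/// Today's behaviour: one directory per service, one file per key.
pub struct LocalFs {
    pub root: PathBuf,
}

impl StateStore for LocalFs {
    fn put(&mut self, service: &str, key: &str, value: &[u8]) -> io::Result<()> {
        let dir = self.root.join(service);
        fs::create_dir_all(&dir)?;
        fs::write(dir.join(key), value)
    }

    fn get(&self, service: &str, key: &str) -> io::Result<Option<Vec<u8>>> {
        match fs::read(self.root.join(service).join(key)) {
            Ok(bytes) => Ok(Some(bytes)),
            // A missing file means "no state yet", not an error.
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
            Err(e) => Err(e),
        }
    }
}

/// Placeholder for the operator-side store (Secret / ConfigMap backed).
#[derive(Default)]
pub struct InMemory {
    entries: HashMap<(String, String), Vec<u8>>,
}

impl StateStore for InMemory {
    fn put(&mut self, service: &str, key: &str, value: &[u8]) -> io::Result<()> {
        self.entries.insert((service.to_string(), key.to_string()), value.to_vec());
        Ok(())
    }

    fn get(&self, service: &str, key: &str) -> io::Result<Option<Vec<u8>>> {
        Ok(self.entries.get(&(service.to_string(), key.to_string())).cloned())
    }
}
```

`TunnelManager` would then take a `StateStore` implementation as a generic parameter and never touch `std::fs` directly.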

### 3.6 Reconciliation loop, sketched

For each watched `Service` with the matching `loadBalancerClass`:

1. **Diff observed vs. desired services** by comparing `spec.ports`
   to the per-Service ConfigMap.
2. **If no droplet exists**, call `Provider::create` (existing path).
3. **If droplet exists and ports changed**, call the new
   `Provider::update_services` to push a new nginx config + reload.
4. **If droplet exists and is healthy**, no-op.
5. **On Service deletion** (caught via finalizer): destroy droplet,
   then remove finalizer.
6. **Always** write `status.loadBalancer.ingress[0].ip` and emit
   Events on the Service for transitions.

Backoff: standard `kube::runtime::controller` exponential backoff;
poll droplet health every minute via a separate background tick.
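
The decision part of the loop above (steps 1 to 5) reduces to a pure function, sketched here so it can be tested without a cluster. Health is deliberately left out, since the steps do not say what an unhealthy droplet should trigger; `Action` and `plan` are hypothetical names.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
pub enum Action {
    Create,
    UpdateServices,
    Noop,
    Destroy,
}

/// Decide what to do for one Service.
///
/// * `deleting`: the Service has a deletion timestamp (finalizer pending).
/// * `desired`:  ports from `spec.ports`.
/// * `observed`: ports recorded for the droplet, or `None` if no droplet exists.
pub fn plan(deleting: bool, desired: &[u16], observed: Option<&[u16]>) -> Action {
    match (deleting, observed) {
        // Step 5: tear down before the finalizer is removed.
        (true, Some(_)) => Action::Destroy,
        // Nothing was ever provisioned; only the finalizer is left to remove.
        (true, None) => Action::Noop,
        // Step 2: green-field provision.
        (false, None) => Action::Create,
        // Steps 3 and 4: compare as sets so ordering changes are not a diff.
        (false, Some(obs)) => {
            let want: BTreeSet<u16> = desired.iter().copied().collect();
            let have: BTreeSet<u16> = obs.iter().copied().collect();
            if want == have {
                Action::Noop
            } else {
                Action::UpdateServices
            }
        }
    }
}
```

Step 6 (status and Events) happens after whichever action runs, so it stays outside this function.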

### 3.7 Where dest_ip comes from

In CLI mode, `--dest-ip` points at "wherever your local service is
bound." In operator mode the controller is in-cluster, so it can
choose:

* **NodePort** (works without CNI assumptions): the controller picks
  any Ready node IP, the Service must auto-allocate a NodePort, and
  the remote nginx forwards `<public_port>` → `<wg_local>:<nodePort>`
  via wg into the controller pod, which then proxies to
  `<node_ip>:<nodePort>`. Simple but adds a hop.
* **ClusterIP** (cleaner, requires the controller pod to participate
  in the cluster network normally): the local proxy targets
  `<service.clusterIP>:<port>`. This works because the controller pod
  *is* in the cluster network. **This is the right default**; fall
  back to NodePort only if the user opts in.
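
A sketch of how the controller might compute the local proxy target for the two modes. `ServiceView` is a hypothetical projection of the few `Service` fields that matter here, and "pick any Ready node" is simplified to "pick the first one supplied".

```rust
use std::net::{IpAddr, SocketAddr};

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TargetMode {
    /// Default: proxy straight to the Service's ClusterIP.
    ClusterIp,
    /// Opt-in fallback: proxy to a node IP on the allocated nodePort.
    NodePort,
}

/// The handful of Service fields the proxy target depends on.
pub struct ServiceView {
    pub cluster_ip: Option<IpAddr>,
    pub port: u16,
    pub node_port: Option<u16>,
}

pub fn proxy_target(
    mode: TargetMode,
    svc: &ServiceView,
    ready_nodes: &[IpAddr],
) -> Result<SocketAddr, String> {
    match mode {
        TargetMode::ClusterIp => svc
            .cluster_ip
            .map(|ip| SocketAddr::new(ip, svc.port))
            .ok_or_else(|| "Service has no ClusterIP".to_string()),
        TargetMode::NodePort => {
            let node = ready_nodes
                .first()
                .ok_or_else(|| "no Ready node available".to_string())?;
            let port = svc
                .node_port
                .ok_or_else(|| "Service has no nodePort allocated".to_string())?;
            Ok(SocketAddr::new(*node, port))
        }
    }
}
```

Either error is a candidate for an Event on the Service rather than a controller-level failure.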

### 3.8 Library shape required

To make all of the above buildable as a separate `innisfree-operator`
crate (or `innisfree --operator` subcommand) without forking, the
library needs:

* `Provider::update_services(&self, server: &dyn InnisfreeServer, &[ServicePort])`
* `Provider::lookup_by_name(&self, name) -> Option<Box<dyn InnisfreeServer>>` (the `reattach` of §2.4)
* `InnisfreeServer::status() -> ServerStatus { id, ipv4, healthy, observed_ports }`
* `StateStore` trait + `LocalFs` impl (today's behaviour) and a
  `Kubernetes` impl in the operator crate.
* `TunnelManager` parameterised over `StateStore`.

These are all extensions, not breaks; the CLI keeps working.

## 4. NixOS for the cloud node

### 4.1 The reconciliation problem this solves

Right now a port change is a green-field redeploy: cloud-init only
runs at first boot, and the rendered nginx + wg files are inputs to
that one-shot. An operator reconcile loop has to invent its own
*push the new file, reload the service* path. NixOS short-circuits
that: a port change becomes "regenerate the system configuration,
push the new closure, run `switch-to-configuration`." Idempotent,
declarative, and the same machinery handles initial provisioning
and incremental updates.

### 4.2 Provisioning options (initial install)

Options for getting NixOS onto a fresh DO droplet, in increasing order
of operational weight:

1. **Custom uploaded image.** Upload a NixOS image to DO once
   ("Custom Images"), then `image: innisfree-base` instead of
   `debian-13-x64`. Single API call to provision; the image is
   versioned alongside the controller. Best UX, modest setup cost.
2. **`nixos-anywhere`** ([nix-community/nixos-anywhere](https://github.com/nix-community/nixos-anywhere)).
   Boots Debian via cloud-init, then ssh-kexecs into a NixOS installer
   and writes the desired flake config to disk. ~5 min from
   `do create` to a working NixOS box. No image upload needed.
   Operationally heavier (the controller needs a working nix store),
   but the canonical answer if the controller already runs nix.
3. **`nixos-infect`-style script.** A shell script in cloud-init that
   converts the running Debian to NixOS in place. Older, fragile, do
   not recommend over the two above.

For the homelab + flake-built-controller setup, **(2) is the right
answer for now, with a path to (1)** once the image-build pipeline
is stable. Both keep the existing "spec.user_data" call for initial
SSH key injection — only the post-install path differs.

### 4.3 Reconciliation options (incremental update)

Once the box is NixOS, three tools can push updates:

1. **`nixos-rebuild switch --target-host innisfree@<ip> --flake .#node`.**
   Plain stdlib. The controller runs this via `xshell`. Works fine
   for one host; awkward to parallelise across many.
2. **[colmena](https://github.com/zhaofengli/colmena).** Multi-host
   declarative deploys. Reads a `colmena.nix`, computes per-host
   closures, pushes in parallel, runs `switch`. Designed for fleets;
   minor overkill for one cluster but reads well.
3. **[deploy-rs](https://github.com/serokell/deploy-rs).** Similar to
   colmena (both are written in Rust), with magic-rollback on broken activations.
   The rollback story is the strongest argument for it — a bad nginx
   config gets reverted automatically.

For a single droplet per cluster: **`nixos-rebuild --target-host`** is
sufficient and means no extra dep. For per-Service droplets,
**deploy-rs** is the better fit (parallel + rollback). Either way,
the controller's job is to render the flake input, kick the tool, and
update Service status — it does not need a long-lived `colmena`-style
process.
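
"Kick the tool" for the single-droplet case amounts to assembling one command line. The sketch below builds it with `std::process::Command` instead of `xshell`, purely to stay dependency-free, and only constructs the command without running it. The helper name and its parameters are hypothetical; the flags are the ones quoted above.

```rust
use std::process::Command;

/// Build (but do not run) the `nixos-rebuild` invocation for one host.
pub fn rebuild_command(user: &str, host: &str, flake_attr: &str) -> Command {
    let mut cmd = Command::new("nixos-rebuild");
    cmd.arg("switch")
        .arg("--target-host")
        .arg(format!("{user}@{host}"))
        .arg("--flake")
        .arg(format!(".#{flake_attr}"));
    cmd
}
```

The controller would call `.status()` on the result and translate a non-zero exit into a failed reconcile.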

### 4.4 What the remote configuration looks like

A thin `nixosModules.innisfree` exported by the flake, parameterised
over `services`:

```nix
{ config, lib, ... }: {
  options.services.innisfree.ports = lib.mkOption {
    type = with lib.types; listOf (submodule { … });
  };

  config = let s = config.services.innisfree.ports; in {
    services.nginx = { enable = true; … };
    networking.firewall.allowedTCPPorts =
      map (p: p.publicPort) (lib.filter (p: p.protocol == "TCP") s);
    networking.wireguard.interfaces.wg0 = { … };
  };
}
```

Per-Service rendering becomes "build a `services.innisfree.ports = […]`
attribute set" — pure data, easy to generate from Rust. The flake
itself is committed; only the runtime input changes.
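
Generating that attribute set from Rust can be plain string rendering, as in this sketch. Only the two submodule fields that appear in the module above (`publicPort`, `protocol`) are emitted; the Rust-side type and function names are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Protocol {
    Tcp,
    Udp,
}

/// Simplified stand-in for the crate's `ServicePort`.
pub struct ServicePort {
    pub public_port: u16,
    pub protocol: Protocol,
}

/// Render a Nix attribute set that assigns `services.innisfree.ports`.
pub fn render_nix_ports(ports: &[ServicePort]) -> String {
    let mut out = String::from("{\n  services.innisfree.ports = [\n");
    for p in ports {
        let proto = match p.protocol {
            Protocol::Tcp => "TCP",
            Protocol::Udp => "UDP",
        };
        // One list element per port; Nix list items are whitespace-separated.
        out.push_str(&format!(
            "    {{ publicPort = {}; protocol = \"{}\"; }}\n",
            p.public_port, proto
        ));
    }
    out.push_str("  ];\n}\n");
    out
}
```

Because the values are a `u16` and a fixed enum, no Nix string escaping is needed; that would change if free-form fields were ever added.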

### 4.5 Costs / reasons to think twice

* **Cold start** is slower: a fresh droplet pulling the activation
  closure can be 1–3 minutes vs. ~30s for cloud-init. Worth measuring
  before committing.
* **Disk:** NixOS keeps generations; on a 25GB droplet with frequent
  reconciles, you'll want a `nix-collect-garbage --delete-older-than 7d`
  cron.
* **Controller dependency:** the operator pod must ship nix. That
  means a much larger image (several hundred MB at minimum) than the
  current scratch-based static binary. Acceptable, but no longer "tiny."
* **Debuggability shifts:** operators familiar with Debian + systemd
  will need to learn nix store layout to triage live issues.

### 4.6 Verdict

NixOS is the right destination *if and only if* the operator pattern
proceeds — the win is reconciliation, and the CLI does not need it.
For the standalone `innisfree up` use case, Debian + cloud-init keeps
working fine and the migration cost has no payoff.

The cleanest sequencing: do the maintainability + operator work
**first**, against the existing Debian remote (since the in-band
reconfigure path is small once the trait extensions land), and treat
the NixOS migration as a Phase 3 swap — by then the `Provider`
abstraction will have the seams to make it a `NixosProvider` next to
`DigitalOceanProvider` (or, more likely, a different *bootstrap layer*
within the DO provider).

## 5. Open questions

* **Crates.io for the operator?** Ship as a separate `innisfree-operator`
  crate in the same workspace, or as a `--operator` subcommand on the
  existing binary? Workspace is cleaner; subcommand is fewer artefacts.
* **Floating IP semantics in shared-droplet mode** are awkward — the
  IP belongs to the droplet, not to a particular Service. Annotation
  works for "give this Service its own droplet"; otherwise ignore.
* **IPv6.** DO droplets can be given a v6 address (it is opt-in at
  creation time); the trait, the wg config, and the nginx stream all
  assume v4 only. Worth scoping into the operator work, since a
  `Service` may want both.
* **Per-Service vs shared droplets** — confirm the default. Shared
  is what I've assumed throughout.
* **Region selection per Service** would matter for multi-region
  homelabs (ha!). Probably YAGNI; backend-level region only.