innisfree 0.4.3

Exposes local services on public IPv4 address, via cloud server.
# innisfree — analysis notes

Survey of the v0.4.1 codebase, focused on the three user-supplied
themes: maintainability hygiene, the Kubernetes operator pattern,
and a possible move from Debian + cloud-init to NixOS.

## 1. Codebase shape (as of v0.4.1)

```
src/
├── lib.rs              re-exports the public modules
├── main.rs             clap CLI; subcommands: up, ssh, ip, doctor, clean, proxy
├── config.rs           ServicePort, Protocol, clean_name
├── manager.rs          TunnelManager: orchestrates server + wg + state
├── state.rs            TunnelStateDir: per-service on-disk layout
├── net.rs              /30 subnet picker (10.50.0.1/28 parent)
├── proxy.rs            tokio TCP forwarder (UDP not implemented)
├── ssh.rs              ED25519 keypair generation + write_to(path)
├── doctor.rs           CAP_NET_ADMIN + DIGITALOCEAN_API_TOKEN preflight
├── server.rs           Provider + InnisfreeServer traits, ServerSpec
├── server/
│   ├── cloudinit.rs    CloudConfig YAML builder (Tera templates)
│   └── digitalocean/   only concrete provider
│       ├── client.rs   DoClient: shared reqwest + token + envelope unwrap
│       ├── provider.rs DigitalOceanProvider: implements Provider
│       ├── server.rs   Droplet: implements InnisfreeServer
│       ├── floating_ip.rs
│       └── ssh_key.rs
└── wg/
    ├── mod.rs          WireguardManager / Device / Host / Keypair
    └── runtime.rs      LocalWg: in-process boringtun + rtnetlink
files/                  cloudinit.cfg, wg0.conf.j2, stream.conf.j2 (include_str!'d)
tests/                  integration_test.rs (live DO), static_linkage.rs
```

The trait split (`Provider` factory → `InnisfreeServer` handle) is well
shaped, the `DoClient` is a clean shared HTTP layer, and `TunnelStateDir`
is a single funnel for every path the binary writes. The recent 0.4.x
refactors have left the code in good shape; what follows are the
remaining sharp edges and the structural changes the new use cases
will require.

## 2. Maintainability findings

### 2.1 XDG paths (the prompt's example)

`src/state.rs:107-110` resolves the per-service state dir as:

```rust
home::home_dir()?.join(".config").join("innisfree")
```

This is wrong on two axes:

* **Wrong category.** Everything under that dir is *state*, not *config*:
  ephemeral SSH keypairs, the wg config rendered from runtime data, the
  `ip` readiness marker, the `known_hosts` pinned to a freshly-booted
  cloud node. Per the XDG Base Directory spec, that belongs under
  `$XDG_STATE_HOME` (default `~/.local/state/`), not `$XDG_CONFIG_HOME`.
  If real configuration is ever introduced (e.g. saved provider creds,
  default region), *that* belongs under `$XDG_CONFIG_HOME`.
* **Wrong resolver.** `home::home_dir()` ignores `XDG_*` overrides
  entirely, so users who set a custom `XDG_CONFIG_HOME` or
  `XDG_STATE_HOME` silently get the wrong path.

The state.rs doc comment already flags the migration as a single-edit
site (state.rs:9-10, 105-106), and `home` is only used here — replacing
it with `directories` (or `etcetera` / `xdg`) is genuinely a 10-line
change. The migration should also write a one-shot symlink/copy from
the old location to keep existing `innisfree@<service>.service` units
working without operator action.
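
A minimal sketch of the resolution rule, using only the standard library so the logic is visible; in practice one of the crates named above would likely do this. The function names `state_root` and `state_root_from_env` are hypothetical, not part of the codebase.

```rust
use std::path::PathBuf;

/// Resolve the innisfree state root per the XDG Base Directory spec.
/// Inputs are passed in explicitly so the rule is easy to test.
fn state_root(xdg_state_home: Option<&str>, home: &str) -> PathBuf {
    let base = match xdg_state_home {
        // The spec requires absolute paths; empty or relative values are ignored.
        Some(p) if !p.is_empty() && PathBuf::from(p).is_absolute() => PathBuf::from(p),
        // Default location for state data: ~/.local/state
        _ => PathBuf::from(home).join(".local").join("state"),
    };
    base.join("innisfree")
}

/// Convenience wrapper that reads the real environment.
/// Returns None when HOME is unset.
fn state_root_from_env() -> Option<PathBuf> {
    let home = std::env::var("HOME").ok()?;
    let xdg = std::env::var("XDG_STATE_HOME").ok();
    Some(state_root(xdg.as_deref(), &home))
}
```

Keeping the pure function separate from the environment lookup means the fallback behaviour can be unit-tested without mutating process-wide variables.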

### 2.2 No locking around the per-service state dir

`TunnelManager::new` (manager.rs:84) calls
`remove_state_for_service(tunnel_name)` unconditionally before
re-creating the dir. That is the only safety net against stale state
from a crashed prior run, but it is also a foot-gun: a second
`innisfree up --name foo` invoked while the first is alive will wipe
the first's keypairs and `known_hosts` out from under it, then race
to create a second droplet of the same name. A `flock`-style
`<dir>/.lock` would catch this cheaply.
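
One way to sketch the guard with the standard library alone is an exclusive-create lock file. Note that this is a swapped-in technique, not `flock(2)` itself: unlike a kernel lock, the file is not released automatically if the process is killed, so a real implementation would need stale-lock handling (or an actual advisory lock). `StateLock` is a hypothetical name.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Guard that removes the lock file when it goes out of scope.
struct StateLock {
    path: PathBuf,
}

impl StateLock {
    /// Try to take the lock for `dir`.
    /// Fails with `ErrorKind::AlreadyExists` if another holder exists.
    fn acquire(dir: &Path) -> io::Result<StateLock> {
        fs::create_dir_all(dir)?;
        let path = dir.join(".lock");
        // create_new is atomic: it errors if the file is already there.
        let mut file = OpenOptions::new().write(true).create_new(true).open(&path)?;
        // Record the owner's PID to help a human debug a stale lock.
        writeln!(file, "{}", std::process::id())?;
        Ok(StateLock { path })
    }
}

impl Drop for StateLock {
    fn drop(&mut self) {
        // Best effort: nothing useful can be done if removal fails.
        let _ = fs::remove_file(&self.path);
    }
}
```

For this to help, the lock would have to be taken before the unconditional wipe, and the wipe would have to leave the lock file in place.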

### 2.3 Library / binary boundary leaks toward `main.rs`

`src/lib.rs` re-exports every module as `pub`, so the crate is shipped
as a library too — but `src/main.rs` re-declares `mod config; mod
manager; …` rather than `use innisfree::*`. Both copies compile, but
the binary's view is the canonical one for behaviour: the
`DigitalOceanProvider` is constructed in main.rs:148-149, not exposed
anywhere from the library. A second consumer of the library (the
operator binary, see §3) would have to either re-import the DO module
directly or reach back through main.rs. Pull provider construction
into a small library helper (`innisfree::providers::digitalocean()`)
and have main.rs call into it.

### 2.4 The `Provider` trait can only `create` and `destroy`

`src/server.rs:46-69` defines the surface:

```rust
trait Provider { async fn create(&self, spec: &ServerSpec) -> ...; }
trait InnisfreeServer {
    fn ipv4_address(&self) -> Result<IpAddr>;
    async fn assign_floating_ip(&self, ip: IpAddr) -> Result<()>;
    async fn destroy(&self) -> Result<()>;
}
```

There is no `update`, no `lookup_by_name`, no `list`, no `status`. The
CLI happens to be fine with this, since every `up` is a green-field
provision, but a reconcile loop needs more. Specifically:

* **`update_services(&[ServicePort])`** to push a new nginx stream
  config and reload, without rebuilding the droplet.
* **`lookup_by_name(name) -> Option<Box<dyn InnisfreeServer>>`** so the
  operator can re-attach to an existing droplet after a pod restart.
* **`status() -> ServerStatus`** so the operator can populate
  `Service.status.loadBalancer.ingress[]` with both the IP and a
  health flag.

These all map cleanly onto DO API endpoints we already use; the work
is in the trait + the manager glue, not the provider.
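
To make the shape concrete, here is a rough sketch of the extended surface against an in-memory stand-in. It is deliberately simplified: the real traits are async and return `Box<dyn InnisfreeServer>`, whereas this version is synchronous, uses bare port numbers instead of `ServicePort`, and invents the names `ReconcileOps` and `FakeProvider` purely for illustration.

```rust
use std::collections::HashMap;
use std::net::{IpAddr, Ipv4Addr};

/// Snapshot of a remote server that a reconcile loop could report.
#[derive(Clone, Debug, PartialEq)]
struct ServerStatus {
    id: String,
    ipv4: IpAddr,
    healthy: bool,
    observed_ports: Vec<u16>,
}

/// Hypothetical extension surface (sync here; the real traits are async).
trait ReconcileOps {
    fn lookup_by_name(&self, name: &str) -> Option<ServerStatus>;
    fn update_services(&mut self, name: &str, ports: &[u16]) -> Result<(), String>;
}

/// In-memory stand-in for a cloud provider, for illustration only.
#[derive(Default)]
struct FakeProvider {
    servers: HashMap<String, ServerStatus>,
}

impl FakeProvider {
    /// Green-field provision, mirroring what `up` does today.
    fn create(&mut self, name: &str, ports: &[u16]) -> ServerStatus {
        let status = ServerStatus {
            id: format!("fake-{}", self.servers.len() + 1),
            ipv4: IpAddr::V4(Ipv4Addr::new(203, 0, 113, 42)),
            healthy: true,
            observed_ports: ports.to_vec(),
        };
        self.servers.insert(name.to_string(), status.clone());
        status
    }
}

impl ReconcileOps for FakeProvider {
    fn lookup_by_name(&self, name: &str) -> Option<ServerStatus> {
        self.servers.get(name).cloned()
    }

    fn update_services(&mut self, name: &str, ports: &[u16]) -> Result<(), String> {
        let server = self
            .servers
            .get_mut(name)
            .ok_or_else(|| format!("no server named {name}"))?;
        // A real provider would push a new nginx config and reload here.
        server.observed_ports = ports.to_vec();
        Ok(())
    }
}
```

A fake like this is also a plausible basis for testing the manager glue without touching the DO API.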

### 2.5 Cloud-init is one-shot

The remote nginx stream config is baked into the cloud-init at droplet
creation (server/cloudinit.rs:90-97 → `/etc/nginx/conf.d/stream/innisfree.conf`).
Adding or removing a service today requires `innisfree clean && innisfree up`
— i.e. a new droplet, a new public IP, and a `known_hosts` rewrite.
That is fine for ad-hoc `up`/`down` use but unworkable for either
operator reconciliation or NixOS-style ongoing config management. The
remote needs an in-band reconfigure path: render the template locally,
SSH-push the file, `nginx -s reload`. The bones for this exist in
`run_ssh_cmd` (manager.rs:245).

### 2.6 Hard-coded provisioning knobs

`server/digitalocean/server.rs:20-26` hard-codes region (`sfo2`), size
(`s-1vcpu-1gb`), image (`debian-13-x64`). None are exposed via CLI.
For a homelab in a different region this is a recompile. Promote them
to CLI flags / env vars (clap already wires both) with the current
values as defaults.
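
A small sketch of the defaulting logic, independent of clap. The environment variable names (`INNISFREE_DO_REGION` and friends) and the `DropletKnobs` type are assumptions for illustration; only the three default values come from the current code.

```rust
/// Droplet provisioning knobs; defaults mirror the values hard-coded today.
#[derive(Debug, PartialEq)]
struct DropletKnobs {
    region: String,
    size: String,
    image: String,
}

/// Build the knobs from any key -> value lookup, so tests need not
/// touch the process environment. The variable names are hypothetical.
fn knobs_from(lookup: impl Fn(&str) -> Option<String>) -> DropletKnobs {
    let get = |key: &str, default: &str| {
        lookup(key)
            // Treat an empty value the same as an unset one.
            .filter(|v| !v.is_empty())
            .unwrap_or_else(|| default.to_string())
    };
    DropletKnobs {
        region: get("INNISFREE_DO_REGION", "sfo2"),
        size: get("INNISFREE_DO_SIZE", "s-1vcpu-1gb"),
        image: get("INNISFREE_DO_IMAGE", "debian-13-x64"),
    }
}

/// Read the knobs from the real process environment.
fn knobs_from_env() -> DropletKnobs {
    knobs_from(|key| std::env::var(key).ok())
}
```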

### 2.7 `proxy.rs` only handles TCP

`Protocol::Udp` parses, the nginx remote template renders UDP
correctly, but the local `--dest-ip` proxy uses
`tokio::net::TcpStream` exclusively (proxy.rs:18-37). A user who
declares a UDP service today gets nginx forwarding remote-side, but
the local proxy hop will silently no-op. Either (a) fix `transfer()`
to dispatch on `Protocol`, or (b) reject UDP services in the CLI
when `--dest-ip != 127.0.0.1`.
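
Option (b) is the smaller change and could look roughly like the check below. The `Protocol` enum here is a local stand-in for the one in `config.rs`, and `check_local_proxy` is a hypothetical helper name.

```rust
use std::net::{IpAddr, Ipv4Addr};

/// Local stand-in for the crate's `Protocol` enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Protocol {
    Tcp,
    Udp,
}

/// Reject service definitions the TCP-only local proxy cannot serve.
/// UDP is allowed only when no local proxy hop is involved,
/// i.e. when the destination is 127.0.0.1.
fn check_local_proxy(protocol: Protocol, dest_ip: IpAddr) -> Result<(), String> {
    let loopback = IpAddr::V4(Ipv4Addr::LOCALHOST);
    if protocol == Protocol::Udp && dest_ip != loopback {
        return Err(format!(
            "UDP services cannot be proxied to {dest_ip}: the local proxy is TCP-only"
        ));
    }
    Ok(())
}
```

Failing at argument-parsing time turns the silent no-op into an immediate, explainable error.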

### 2.8 Signal handling

`TunnelManager::block` (manager.rs:196-204) only awaits `signal::ctrl_c`
(SIGINT). The systemd unit hacks around this with `KillSignal=SIGINT`
(innisfree@.service:11). `tokio::signal::unix::signal(SignalKind::terminate())`
plus a `select!` on both is the standard fix. Already on TODO.md.

### 2.9 Spawned proxy task is not joined

main.rs:195-200 does `tokio::spawn(manager::run_proxy(…))` and then
falls through to `mgr.block().await?`. The spawn handle is dropped, so
if the proxy returns an error early — say, port already in use — the
user only sees a `tracing::warn!` from inside `run_proxy`. The exit
code stays zero. Hold the `JoinHandle` and `select!` it against
`block()` so an early proxy failure tears the tunnel down.

### 2.10 `get_server_ip` is a string file

`state.rs:67` writes `ip` as a bare `"<addr>\n"`. It works, but it has
no room to grow — operator status reporting wants more than an IP
(droplet ID, status, last-reconciled timestamp, observed services).
Make this a small JSON document (`status.json`) and keep `ip` as a
backward-compat symlink for one release.

### 2.11 Misc minor

* `serde_yaml = "0.8"` is unmaintained. Migrate to `serde_yml` (the
  community fork) or hand-render — the cloud-init shape is small.
* `lib.rs` doc comment (`Right now, only TCP traffic is supported`) is
  stale; UDP is supported in the templates.
* `WireguardManager::new` uses `remote_name = "innisfree-<svc>-remote"`
  which is purely cosmetic (it never reaches the kernel; only the
  *local* iface name has a length constraint). Worth a comment, since
  the asymmetry with the local-side `pick_local_iface_name` looks like
  a bug at first glance.
* `config.rs:105-115` `clean_name` is a string-replace, not a regex.
  `"innisfreeinnisfreeinnisfree"` becomes `"innisfree-"`; harmless,
  but write a test or replace with a `strip_prefix` loop.

## 3. Kubernetes operator considerations

### 3.1 What you actually want

Stated minimally: when a `Service` of `type=LoadBalancer` appears in
the cluster, an `innisfree-controller` Pod should provision a remote
ingress (DO droplet + tunnel) and write the public IP back into
`status.loadBalancer.ingress[].ip`. Reconcile changes (added/removed
ports, deleted Service) without recreating the droplet. Survive its
own restart — re-attach to existing droplets rather than orphaning
and re-provisioning.

### 3.2 Topology choice

There are two reasonable shapes:

**(A) Controller-only, droplet drives traffic into the cluster.** The
controller runs as a single Deployment; for each Service it stands up
(or updates) a remote droplet whose nginx forwards each public port
back over wg to a *node IP : nodePort* combination. The wg tunnel
terminates on the controller pod (which holds CAP_NET_ADMIN), and the
local proxy hop forwards into the cluster network. This is the closest
mapping to today's `innisfree up`.

**(B) Controller + per-Service tunnel pod.** The controller acts as a
pure reconciler; for each Service it manages, it creates a small
DaemonSet/Deployment of "tunnel pods" that each hold one wg interface
and proxy into the Service ClusterIP. This is closer to how
[klipperlb](https://github.com/k3s-io/klipper-lb) works.

**Recommendation: start with (A).** It needs one CAP_NET_ADMIN pod
total; it reuses the existing single-process `innisfree up` flow with
minor refactoring (managing N tunnels in one process); it punts the
per-pod orchestration question. (B) is the cleaner answer at scale,
but you are running one homelab cluster, not a multi-tenant SaaS.

### 3.3 Multi-tenancy on the cloud side

With (A) there is still a question of *one droplet per Service* vs.
*one droplet for the whole cluster*. At ~$4–6/mo per DO droplet and
hobbyist scale, the shared droplet is right by default:

* nginx happily hosts many `server { listen ...; }` stanzas, one per
  port across all Services.
* Port collisions across Services *must* fail fast — surface as an
  Event on the offending Service.
* Add `loadBalancerClass` discrimination so users can opt a specific
  Service into "give me my own droplet" by selecting a different class
  (e.g. `innisfree.ruin.dev/dedicated`).
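
The fail-fast collision check from the list above can be sketched as a pure function over the ports each Service requests. The types and the `find_collisions` name are illustrative assumptions; emitting the Kubernetes Event is left out.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Protocol {
    Tcp,
    Udp,
}

/// One conflicting claim on a public port of the shared droplet.
#[derive(Debug, PartialEq)]
struct Collision {
    protocol: Protocol,
    port: u16,
    first: String,  // Service that claimed the port first
    second: String, // Service whose claim is rejected
}

/// Walk Services in order; the first claimant of a (protocol, port)
/// pair wins, and every later claimant is reported as a collision.
fn find_collisions(services: &[(&str, Vec<(Protocol, u16)>)]) -> Vec<Collision> {
    let mut owner: HashMap<(Protocol, u16), &str> = HashMap::new();
    let mut collisions = Vec::new();
    for (name, ports) in services {
        for &(protocol, port) in ports {
            match owner.get(&(protocol, port)) {
                Some(&first) if first != *name => collisions.push(Collision {
                    protocol,
                    port,
                    first: first.to_string(),
                    second: name.to_string(),
                }),
                // Same Service listing the port twice is not a conflict.
                Some(_) => {}
                None => {
                    owner.insert((protocol, port), *name);
                }
            }
        }
    }
    collisions
}
```

TCP and UDP are keyed separately, so a Service on `53/TCP` does not conflict with another on `53/UDP`.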

Either way, the operator needs a *cluster-scoped backend object* that
holds the DO API token (Secret reference) and the droplet sizing
defaults. A `ClusterInnisfreeBackend` CRD is the standard shape; that
keeps the per-Service annotation surface minimal.

### 3.4 API surface

Avoid a per-Service CRD if possible — users get a much better
experience if they just write `type: LoadBalancer` and pick a
`loadBalancerClass`. Annotations cover the rest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: minecraft
  annotations:
    innisfree.ruin.dev/floating-ip: "203.0.113.42"
spec:
  type: LoadBalancer
  loadBalancerClass: innisfree.ruin.dev/digitalocean
  ports:
    - port: 25565
      targetPort: 25565
      protocol: TCP
```

Plus one cluster-scoped CRD for backend config:

```yaml
apiVersion: innisfree.ruin.dev/v1alpha1
kind: ClusterInnisfreeBackend
metadata:
  name: default
spec:
  provider: digitalocean
  region: sfo2
  size: s-1vcpu-1gb
  apiTokenSecretRef: { name: do-token, key: token }
  sharing: shared    # or "per-service"
```

### 3.5 Where state lives

The CLI today persists per-tunnel state in `~/.config/innisfree/<svc>/`.
The operator must move that into Kubernetes objects, so a controller
restart can re-attach:

* A per-Service `Secret` holds wg keypairs + the SSH client keypair.
* A per-Service `ConfigMap` (or a status subresource on the
  `ClusterInnisfreeBackend`) holds the droplet ID + assigned IP.
* `TunnelStateDir` becomes one of two implementations behind a small
  `StateStore` trait — local-fs for the CLI, k8s-objects for the
  operator. Same `for_service` / `open` shape; different backing.
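
As a rough illustration of the split, the sketch below uses a key/value shape rather than the existing `for_service` / `open` methods, so treat the trait signature as an assumption. `InMemory` stands in for the Kubernetes-backed implementation, which would read and write Secrets and ConfigMaps instead.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical storage abstraction: named blobs, grouped per service.
trait StateStore {
    fn put(&mut self, service: &str, key: &str, value: &[u8]) -> io::Result<()>;
    fn get(&self, service: &str, key: &str) -> io::Result<Option<Vec<u8>>>;
}

/// Today's behaviour: one directory per service under a root.
struct LocalFs {
    root: PathBuf,
}

impl StateStore for LocalFs {
    fn put(&mut self, service: &str, key: &str, value: &[u8]) -> io::Result<()> {
        let dir = self.root.join(service);
        fs::create_dir_all(&dir)?;
        fs::write(dir.join(key), value)
    }

    fn get(&self, service: &str, key: &str) -> io::Result<Option<Vec<u8>>> {
        match fs::read(self.root.join(service).join(key)) {
            Ok(bytes) => Ok(Some(bytes)),
            // A missing file means "no such key", not a failure.
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
            Err(e) => Err(e),
        }
    }
}

/// Stand-in for the operator-side implementation.
#[derive(Default)]
struct InMemory {
    blobs: HashMap<(String, String), Vec<u8>>,
}

impl StateStore for InMemory {
    fn put(&mut self, service: &str, key: &str, value: &[u8]) -> io::Result<()> {
        self.blobs
            .insert((service.to_string(), key.to_string()), value.to_vec());
        Ok(())
    }

    fn get(&self, service: &str, key: &str) -> io::Result<Option<Vec<u8>>> {
        Ok(self.blobs.get(&(service.to_string(), key.to_string())).cloned())
    }
}
```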

### 3.6 Reconciliation loop, sketched

For each watched `Service` with the matching `loadBalancerClass`:

1. **Diff observed vs. desired services** by comparing `spec.ports`
   to the per-Service ConfigMap.
2. **If no droplet exists**, call `Provider::create` (existing path).
3. **If droplet exists and ports changed**, call the new
   `Provider::update_services` to push a new nginx config + reload.
4. **If droplet exists, ports are unchanged, and it is healthy**, no-op.
5. **On Service deletion** (caught via finalizer): destroy droplet,
   then remove finalizer.
6. **Always** write `status.loadBalancer.ingress[0].ip` and emit
   Events on the Service for transitions.

Backoff: standard `kube::runtime::controller` exponential backoff;
poll droplet health every minute via a separate background tick.
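
The decision part of the steps above can be isolated as a pure function, which keeps it testable apart from the `kube` plumbing. This is only a sketch: `Action` and `decide` are hypothetical names, ports are reduced to bare numbers, and health checks, status writes, and Events are left out.

```rust
/// What the reconciler should do for one Service on this pass.
#[derive(Debug, PartialEq)]
enum Action {
    Create,
    UpdateServices,
    Destroy,
    Noop,
}

/// Decide the next step.
/// `observed` is None when no droplet exists yet, otherwise the ports
/// the droplet currently forwards. `deleting` is true once the Service
/// has a deletion timestamp and only the finalizer keeps it alive.
fn decide(desired: &[u16], observed: Option<&[u16]>, deleting: bool) -> Action {
    match (deleting, observed) {
        (true, Some(_)) => Action::Destroy,
        // Nothing left to tear down; the finalizer can be removed.
        (true, None) => Action::Noop,
        (false, None) => Action::Create,
        (false, Some(current)) => {
            // Compare as sets so ordering and duplicates do not matter.
            let mut want = desired.to_vec();
            want.sort_unstable();
            want.dedup();
            let mut have = current.to_vec();
            have.sort_unstable();
            have.dedup();
            if want == have {
                Action::Noop
            } else {
                Action::UpdateServices
            }
        }
    }
}
```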

### 3.7 Where dest_ip comes from

In CLI mode, `--dest-ip` points at "wherever your local service is
bound." In operator mode the controller is in-cluster, so it can
choose:

* **NodePort** (works without CNI assumptions): the controller picks
  any Ready node IP, the Service must auto-allocate a NodePort, and
  the remote nginx forwards `<public_port>` → `<wg_local>:<nodePort>`
  via wg into the controller pod, which then proxies to
  `<node_ip>:<nodePort>`. Simple but adds a hop.
* **ClusterIP** (cleaner, requires the controller pod to participate
  in the cluster network normally): the local proxy targets
  `<service.clusterIP>:<port>`. This works because the controller pod
  *is* in the cluster network. **This is the right default**; fall
  back to NodePort only if the user opts in.

### 3.8 Library shape required

To make all of the above buildable as a separate `innisfree-operator`
crate (or `innisfree --operator` subcommand) without forking, the
library needs:

* `Provider::update_services(&self, server: &dyn InnisfreeServer, &[ServicePort])`
* `Provider::lookup_by_name(&self, name) -> Option<Box<dyn InnisfreeServer>>`
* `InnisfreeServer::status() -> ServerStatus { id, ipv4, healthy, observed_ports }`
* `StateStore` trait + `LocalFs` impl (today's behaviour) and a
  `Kubernetes` impl in the operator crate.
* `TunnelManager` parameterised over `StateStore`.

These are all extensions, not breaks; the CLI keeps working.

## 4. NixOS for the cloud node

### 4.1 The reconciliation problem this solves

Right now a port change is a green-field redeploy: cloud-init only
runs at first boot, and the rendered nginx + wg files are inputs to
that one-shot. An operator reconcile loop has to invent its own
*push the new file, reload the service* path. NixOS short-circuits
that: a port change becomes "regenerate the system configuration,
push the new closure, run `switch-to-configuration`." Idempotent,
declarative, and the same machinery handles initial provisioning
and incremental updates.

### 4.2 Provisioning options (initial install)

Options for getting NixOS onto a fresh DO droplet, in increasing order
of operational weight:

1. **Custom uploaded image.** Upload a NixOS image to DO once
   ("Custom Images"), then `image: innisfree-base` instead of
   `debian-13-x64`. Single API call to provision; the image is
   versioned alongside the controller. Best UX, modest setup cost.
2. **`nixos-anywhere`** ([nix-community/nixos-anywhere](https://github.com/nix-community/nixos-anywhere)).
   Boots Debian via cloud-init, then ssh-kexecs into a NixOS installer
   and writes the desired flake config to disk. ~5 min from
   `do create` to a working NixOS box. No image upload needed.
   Operationally heavier (the controller needs a working nix store),
   but the canonical answer if the controller already runs nix.
3. **`nixos-infect`-style script.** A shell script in cloud-init that
   converts the running Debian to NixOS in place. Older, fragile, do
   not recommend over the two above.

For the homelab + flake-built-controller setup, **(2) is the right
answer for now, with a path to (1)** once the image-build pipeline
is stable. Both keep the existing "spec.user_data" call for initial
SSH key injection — only the post-install path differs.

### 4.3 Reconciliation options (incremental update)

Once the box is NixOS, three tools can push updates:

1. **`nixos-rebuild switch --target-host innisfree@<ip> --flake .#node`.**
   Stock NixOS tooling, no extra dependency. The controller runs this
   via `xshell`. Works fine for one host; awkward to parallelise across many.
2. **[colmena](https://github.com/zhaofengli/colmena).** Multi-host
   declarative deploys. Reads a `colmena.nix`, computes per-host
   closures, pushes in parallel, runs `switch`. Designed for fleets;
   minor overkill for one cluster but reads well.
3. **[deploy-rs](https://github.com/serokell/deploy-rs).** Similar to
   colmena, written in Rust, with magic-rollback on broken activations.
   The rollback story is the strongest argument for it — a bad nginx
   config gets reverted automatically.

For a single droplet per cluster: **`nixos-rebuild --target-host`** is
sufficient and means no extra dep. For per-Service droplets,
**deploy-rs** is the better fit (parallel + rollback). Either way,
the controller's job is to render the flake input, kick the tool, and
update Service status — it does not need a long-lived `colmena`-style
process.

### 4.4 What the remote configuration looks like

A thin `nixosModules.innisfree` exported by the flake, parameterised
over `services`:

```nix
{ config, lib, ... }: {
  options.services.innisfree.ports = lib.mkOption {
    type = with lib.types; listOf (submodule { … });
  };

  config = let s = config.services.innisfree.ports; in {
    services.nginx = { enable = true; … };
    networking.firewall.allowedTCPPorts =
      map (p: p.publicPort) (lib.filter (p: p.protocol == "TCP") s);
    networking.wireguard.interfaces.wg0 = { … };
  };
}
```

Per-Service rendering becomes "build a `services.innisfree.ports = […]`
attribute set" — pure data, easy to generate from Rust. The flake
itself is committed; only the runtime input changes.
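
Generating that attribute set from Rust could be as simple as the sketch below. The attribute names `publicPort` and `protocol` follow the module snippet above; the Rust types and the `render_ports` function are hypothetical, and any further attributes the submodule needs are omitted.

```rust
#[derive(Clone, Copy)]
enum Protocol {
    Tcp,
    Udp,
}

impl Protocol {
    /// String form matching the `p.protocol == "TCP"` test in the module.
    fn as_nix(self) -> &'static str {
        match self {
            Protocol::Tcp => "TCP",
            Protocol::Udp => "UDP",
        }
    }
}

struct PortSpec {
    public_port: u16,
    protocol: Protocol,
}

/// Render the runtime input as a Nix list of attribute sets.
/// Only integers and fixed strings are interpolated, so no escaping
/// of user-controlled text is needed here.
fn render_ports(ports: &[PortSpec]) -> String {
    let mut out = String::from("services.innisfree.ports = [\n");
    for p in ports {
        out.push_str(&format!(
            "  {{ publicPort = {}; protocol = \"{}\"; }}\n",
            p.public_port,
            p.protocol.as_nix()
        ));
    }
    out.push_str("];\n");
    out
}
```

Restricting the rendered values to numbers and an enum avoids having to think about Nix string escaping at all.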

### 4.5 Costs / reasons to think twice

* **Cold start** is slower: a fresh droplet pulling the activation
  closure can be 1–3 minutes vs. ~30s for cloud-init. Worth measuring
  before committing.
* **Disk:** NixOS keeps generations; on a 25GB droplet with frequent
  reconciles, you'll want a `nix-collect-garbage --delete-older-than 7d`
  cron.
* **Controller dependency:** the operator pod must ship nix. That
  means a much larger image (several hundred MB at minimum) than the
  current scratch-based static binary. Acceptable, but no longer "tiny."
* **Debuggability shifts:** operators familiar with Debian + systemd
  will need to learn nix store layout to triage live issues.

### 4.6 Verdict

NixOS is the right destination *if and only if* the operator pattern
proceeds — the win is reconciliation, and the CLI does not need it.
For the standalone `innisfree up` use case, Debian + cloud-init keeps
working fine and the migration cost has no payoff.

The cleanest sequencing: do the maintainability + operator work
**first**, against the existing Debian remote (since the in-band
reconfigure path is small once the trait extensions land), and treat
the NixOS migration as a Phase 3 swap — by then the `Provider`
abstraction will have the seams to make it a `NixosProvider` next to
`DigitalOceanProvider` (or, more likely, a different *bootstrap layer*
within the DO provider).

## 5. Open questions

* **Crates.io for the operator?** Ship as a separate `innisfree-operator`
  crate in the same workspace, or as a `--operator` subcommand on the
  existing binary? Workspace is cleaner; subcommand is fewer artefacts.
* **Floating IP semantics in shared-droplet mode** are awkward: the
  IP belongs to the droplet, not to a particular Service. The annotation
  makes sense only when a Service has its own droplet; in shared mode,
  ignore it.
* **IPv6.** DO droplets get a v6 by default; the trait, the wg config,
  and the nginx stream all assume v4 only. Worth scoping into the
  operator work — a `Service` may want both.
* **Per-Service vs shared droplets** — confirm the default. Shared
  is what I've assumed throughout.
* **Region selection per Service** would matter for multi-region
  homelabs (ha!). Probably YAGNI; backend-level region only.