# innisfree — analysis notes
Survey of the v0.4.1 codebase, focused on the three user-supplied
themes: maintainability hygiene, the Kubernetes operator pattern,
and a possible move from Debian + cloud-init to NixOS.
## 1. Codebase shape (as of v0.4.1)
```
src/
├── lib.rs              re-exports the public modules
├── main.rs             clap CLI; subcommands: up, ssh, ip, doctor, clean, proxy
├── config.rs           ServicePort, Protocol, clean_name
├── manager.rs          TunnelManager: orchestrates server + wg + state
├── state.rs            TunnelStateDir: per-service on-disk layout
├── net.rs              /30 subnet picker (10.50.0.1/28 parent)
├── proxy.rs            tokio TCP forwarder (UDP not implemented)
├── ssh.rs              ED25519 keypair generation + write_to(path)
├── doctor.rs           CAP_NET_ADMIN + DIGITALOCEAN_API_TOKEN preflight
├── server.rs           Provider + InnisfreeServer traits, ServerSpec
├── server/
│   ├── cloudinit.rs    CloudConfig YAML builder (Tera templates)
│   └── digitalocean/   only concrete provider
│       ├── client.rs   DoClient: shared reqwest + token + envelope unwrap
│       ├── provider.rs DigitalOceanProvider: implements Provider
│       ├── server.rs   Droplet: implements InnisfreeServer
│       ├── floating_ip.rs
│       └── ssh_key.rs
└── wg/
    ├── mod.rs          WireguardManager / Device / Host / Keypair
    └── runtime.rs      LocalWg: in-process boringtun + rtnetlink
files/                  cloudinit.cfg, wg0.conf.j2, stream.conf.j2 (include_str!'d)
tests/                  integration_test.rs (live DO), static_linkage.rs
```
The trait split (`Provider` factory → `InnisfreeServer` handle) is well
shaped, the `DoClient` is a clean shared HTTP layer, and `TunnelStateDir`
is a single funnel for every path the binary writes. The recent 0.4.x
refactors have left the code in good shape; what follows are the
remaining sharp edges and the structural changes the new use cases
will require.
## 2. Maintainability findings
### 2.1 XDG paths (the prompt's example)
`src/state.rs:107-110` resolves the per-service state dir as:
```rust
home::home_dir()?.join(".config").join("innisfree")
```
This is wrong on two axes:
* **Wrong category.** Everything under that dir is *state*, not *config*:
ephemeral SSH keypairs, the wg config rendered from runtime data, the
`ip` readiness marker, the `known_hosts` pinned to a freshly-booted
cloud node. Per the XDG Base Directory spec, that belongs under
`$XDG_STATE_HOME` (default `~/.local/state/`), not `$XDG_CONFIG_HOME`.
If real configuration is ever introduced (e.g. saved provider creds,
default region), *that* belongs under `$XDG_CONFIG_HOME`.
* **Wrong resolver.** `home::home_dir()` ignores `XDG_*` overrides
entirely, so users who set a custom `XDG_CONFIG_HOME` or
`XDG_STATE_HOME` silently get the wrong path.
The state.rs doc comment already flags the migration as a single-edit
site (state.rs:9-10, 105-106), and `home` is only used here — replacing
it with `directories` (or `etcetera` / `xdg`) is genuinely a 10-line
change. The migration should also write a one-shot symlink/copy from
the old location to keep existing `innisfree@<service>.service` units
working without operator action.
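A minimal sketch of the resolution order the XDG spec describes, using only the standard library. The function names `state_root` and `state_root_from_env` are hypothetical, and a real change would more likely delegate to `directories` or a similar crate as suggested above. This version only illustrates the precedence: an absolute `XDG_STATE_HOME` wins, otherwise fall back to `$HOME/.local/state`.

```rust
use std::path::PathBuf;

/// Resolve the innisfree state root from explicit environment values.
/// Taking the values as parameters (instead of reading the process
/// environment directly) keeps the function pure and easy to test.
/// NOTE: hypothetical helper, not part of the existing codebase.
pub fn state_root(xdg_state_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    // The XDG spec treats relative paths as invalid; an empty value is
    // not absolute either, so both cases fall through to the default.
    let xdg = xdg_state_home
        .map(PathBuf::from)
        .filter(|p| p.is_absolute());
    let base = match xdg {
        Some(p) => p,
        // Spec default: $HOME/.local/state
        None => {
            let home = home.filter(|h| !h.is_empty())?;
            PathBuf::from(home).join(".local").join("state")
        }
    };
    Some(base.join("innisfree"))
}

/// Thin wrapper that reads the real process environment.
pub fn state_root_from_env() -> Option<PathBuf> {
    state_root(
        std::env::var("XDG_STATE_HOME").ok().as_deref(),
        std::env::var("HOME").ok().as_deref(),
    )
}
```

Passing the environment values in as arguments is a design choice for testability. The path checks assume Unix-style absolute paths.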
### 2.2 No locking around the per-service state dir
`TunnelManager::new` (manager.rs:84) calls
`remove_state_for_service(tunnel_name)` unconditionally before
re-creating the dir. That is the only safety net against stale state
from a crashed prior run, but it is also a foot-gun: a second
`innisfree up --name foo` invoked while the first is alive will wipe
the first's keypairs and `known_hosts` out from under it, then race
to create a second droplet of the same name. A `flock`-style
`<dir>/.lock` would catch this cheaply.
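As a rough illustration of the guard, here is a standard-library-only sketch. It is a stand-in technique, not a true `flock`: it relies on atomic `create_new` file creation, so a crashed process leaves a stale `.lock` behind, whereas a kernel-level `flock` is released automatically on exit. The `StateLock` type is hypothetical. A real fix would also have to take the lock before `remove_state_for_service` runs and keep the wipe from deleting the lock file itself.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Guard that owns `<dir>/.lock` and removes it when dropped.
/// NOTE: hypothetical type; uses create_new rather than flock(2).
pub struct StateLock {
    path: PathBuf,
}

impl StateLock {
    /// Try to take the lock for a per-service state dir.
    /// Fails with `ErrorKind::AlreadyExists` if the lock file is present.
    pub fn acquire(dir: &Path) -> io::Result<StateLock> {
        fs::create_dir_all(dir)?;
        let path = dir.join(".lock");
        // `create_new` is atomic: exactly one caller can create the file.
        let mut file = OpenOptions::new()
            .write(true)
            .create_new(true)
            .open(&path)?;
        // Record the owner PID so a human can triage a stale lock.
        writeln!(file, "{}", std::process::id())?;
        Ok(StateLock { path })
    }
}

impl Drop for StateLock {
    fn drop(&mut self) {
        // Best effort: ignore errors during cleanup.
        let _ = fs::remove_file(&self.path);
    }
}
```

A second `innisfree up --name foo` would then fail at `acquire` with `AlreadyExists` instead of wiping the first run's state.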
### 2.3 Library / binary boundary leaks toward `main.rs`
`src/lib.rs` re-exports every module as `pub`, so the crate is shipped
as a library too — but `src/main.rs` re-declares `mod config; mod
manager; …` rather than `use innisfree::*`. Both copies compile, but
the binary's view is the canonical one for behaviour: the
`DigitalOceanProvider` is constructed in main.rs:148-149, not exposed
anywhere from the library. A second consumer of the library (the
operator binary, see §3) would have to either re-import the DO module
directly or reach back through main.rs. Pull provider construction
into a small library helper (`innisfree::providers::digitalocean()`)
and have main.rs call into it.
### 2.4 The `Provider` trait can only `create` and `destroy`
`src/server.rs:46-69` defines the surface:
```rust
trait Provider { async fn create(&self, spec: &ServerSpec) -> ...; }
trait InnisfreeServer {
    fn ipv4_address(&self) -> Result<IpAddr>;
    async fn assign_floating_ip(&self, ip: IpAddr) -> Result<()>;
    async fn destroy(&self) -> Result<()>;
}
```
There is no `update`, no `lookup_by_name`, no `list`, no `status`. The
CLI happens to be fine with this, since every `up` is a green-field
provision, but a reconcile loop needs at least three of them. Specifically:
* **`update_services(&[ServicePort])`** to push a new nginx stream
config and reload, without rebuilding the droplet.
* **`lookup_by_name(name) -> Option<Box<dyn InnisfreeServer>>`** so the
operator can re-attach to an existing droplet after a pod restart.
* **`status() -> ServerStatus`** so the operator can populate
`Service.status.loadBalancer.ingress[]` with both the IP and a
health flag.
These all map cleanly onto DO API endpoints we already use; the work
is in the trait + the manager glue, not the provider.
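To make the proposed surface concrete, the sketch below spells out one possible shape for the additions and exercises it against an in-memory fake. Everything here is an assumption for illustration: the methods are synchronous for brevity (the real traits are async), `update_services` takes plain port numbers instead of `ServicePort`, and `FakeProvider` exists only to show how a caller would use the trait.

```rust
use std::collections::HashMap;
use std::net::{IpAddr, Ipv4Addr};

/// Snapshot a reconcile loop could copy into Service status.
#[derive(Debug, Clone, PartialEq)]
pub struct ServerStatus {
    pub id: String,
    pub ipv4: IpAddr,
    pub healthy: bool,
    pub observed_ports: Vec<u16>,
}

/// Handle to one remote server (synchronous here for brevity).
pub trait InnisfreeServer {
    fn status(&self) -> ServerStatus;
}

/// Provider surface with the lookup and update calls added.
pub trait Provider {
    fn lookup_by_name(&self, name: &str) -> Option<Box<dyn InnisfreeServer>>;
    fn update_services(&mut self, name: &str, ports: &[u16]) -> Result<(), String>;
}

/// In-memory fake used only to exercise the trait shape.
#[derive(Default)]
pub struct FakeProvider {
    servers: HashMap<String, ServerStatus>,
}

struct FakeServer(ServerStatus);

impl InnisfreeServer for FakeServer {
    fn status(&self) -> ServerStatus {
        self.0.clone()
    }
}

impl FakeProvider {
    /// Build a fake that already "has" one server with the given ports.
    pub fn with_server(name: &str, ports: &[u16]) -> Self {
        let mut servers = HashMap::new();
        servers.insert(
            name.to_string(),
            ServerStatus {
                id: format!("fake-{name}"),
                ipv4: IpAddr::V4(Ipv4Addr::new(203, 0, 113, 42)),
                healthy: true,
                observed_ports: ports.to_vec(),
            },
        );
        FakeProvider { servers }
    }
}

impl Provider for FakeProvider {
    fn lookup_by_name(&self, name: &str) -> Option<Box<dyn InnisfreeServer>> {
        let status = self.servers.get(name)?.clone();
        Some(Box::new(FakeServer(status)))
    }

    fn update_services(&mut self, name: &str, ports: &[u16]) -> Result<(), String> {
        let entry = self
            .servers
            .get_mut(name)
            .ok_or_else(|| format!("no server named {name}"))?;
        entry.observed_ports = ports.to_vec();
        Ok(())
    }
}
```

A fake like this would also give the manager glue something to test against without touching the DO API.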
### 2.5 Cloud-init is one-shot
The remote nginx stream config is baked into the cloud-init at droplet
creation (server/cloudinit.rs:90-97 → `/etc/nginx/conf.d/stream/innisfree.conf`).
Adding or removing a service today requires `innisfree clean && innisfree up`
— i.e. a new droplet, a new public IP, and a `known_hosts` rewrite.
That is fine for ad-hoc `up`/`down` use but unworkable for either
operator reconciliation or NixOS-style ongoing config management. The
remote needs an in-band reconfigure path: render the template locally,
SSH-push the file, `nginx -s reload`. The bones for this exist in
`run_ssh_cmd` (manager.rs:245).
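The "render locally" half of that path can be sketched as a pure function. This is an assumption-laden illustration: the `ServicePort` fields and the one-line `server { ... }` layout are stand-ins and do not reproduce the actual `stream.conf.j2` template; only the nginx directives themselves (`listen <port> [udp]`, `proxy_pass <addr>:<port>`) are standard stream-module syntax. The SSH push and reload steps are not shown.

```rust
use std::fmt::Write;
use std::net::Ipv4Addr;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Protocol {
    Tcp,
    Udp,
}

/// Stand-in for the real `ServicePort` in config.rs (fields assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct ServicePort {
    pub port: u16,
    pub protocol: Protocol,
}

/// Render one nginx `stream` server block per service, forwarding each
/// public port to the same port on the wg peer address.
pub fn render_stream_conf(services: &[ServicePort], wg_peer: Ipv4Addr) -> String {
    let mut out = String::new();
    for s in services {
        // nginx marks UDP listeners with a trailing `udp` parameter.
        let udp = if s.protocol == Protocol::Udp { " udp" } else { "" };
        // Writing to a String cannot fail, so the Result is ignored.
        let _ = writeln!(
            out,
            "server {{ listen {}{}; proxy_pass {}:{}; }}",
            s.port, udp, wg_peer, s.port
        );
    }
    out
}
```

Because the output depends only on its inputs, the same function could serve both the first-boot cloud-init render and later in-band updates.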
### 2.6 Hard-coded provisioning knobs
`server/digitalocean/server.rs:20-26` hard-codes region (`sfo2`), size
(`s-1vcpu-1gb`), image (`debian-13-x64`). None are exposed via CLI.
For a homelab in a different region this is a recompile. Promote them
to CLI flags / env vars (clap already wires both) with the current
values as defaults.
### 2.7 `proxy.rs` only handles TCP
`Protocol::Udp` parses, the nginx remote template renders UDP
correctly, but the local `--dest-ip` proxy uses
`tokio::net::TcpStream` exclusively (proxy.rs:18-37). A user who
declares a UDP service today gets nginx forwarding remote-side, but
the local proxy hop will silently no-op. Either (a) fix `transfer()`
to dispatch on `Protocol`, or (b) reject UDP services in the CLI
when `--dest-ip != 127.0.0.1`.
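Option (b) is small enough to sketch in full. The function below is hypothetical (name, signature, and error text are all assumptions) and mirrors the condition stated above: UDP is rejected only when the destination is not `127.0.0.1`, that is, when the TCP-only local proxy hop would be involved.

```rust
use std::net::{IpAddr, Ipv4Addr};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Protocol {
    Tcp,
    Udp,
}

/// Reject UDP services when the local proxy hop would be needed.
/// NOTE: hypothetical preflight check, not existing innisfree code.
pub fn check_udp_supported(
    services: &[(u16, Protocol)],
    dest_ip: IpAddr,
) -> Result<(), String> {
    // With a loopback destination no proxy hop is involved, so UDP is fine.
    if dest_ip == IpAddr::V4(Ipv4Addr::LOCALHOST) {
        return Ok(());
    }
    match services.iter().find(|(_, proto)| *proto == Protocol::Udp) {
        Some((port, _)) => Err(format!(
            "UDP service on port {port} cannot be proxied to {dest_ip}: the local proxy is TCP-only"
        )),
        None => Ok(()),
    }
}
```

Failing at argument-validation time turns the silent no-op into an immediate, explainable error.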
### 2.8 Signal handling
`TunnelManager::block` (manager.rs:196-204) only awaits `signal::ctrl_c`
(SIGINT). The systemd unit hacks around this with `KillSignal=SIGINT`
(innisfree@.service:11). `tokio::signal::unix::signal(SignalKind::terminate())`
plus a `select!` on both is the standard fix. Already on TODO.md.
### 2.9 Spawned proxy task is not joined
main.rs:195-200 does `tokio::spawn(manager::run_proxy(…))` and then
falls through to `mgr.block().await?`. The spawn handle is dropped, so
if the proxy returns an error early — say, port already in use — the
user only sees a `tracing::warn!` from inside `run_proxy`. The exit
code stays zero. Hold the `JoinHandle` and `select!` it against
`block()` so an early proxy failure tears the tunnel down.
### 2.10 `get_server_ip` is a string file
`state.rs:67` writes `ip` as a bare `"<addr>\n"`. It works, but it has
no room to grow — operator status reporting wants more than an IP
(droplet ID, status, last-reconciled timestamp, observed services).
Make this a small JSON document (`status.json`) and keep `ip` as a
backward-compat symlink for one release.
### 2.11 Misc minor
* `serde_yaml = "0.8"` is unmaintained. Migrate to a maintained fork or
hand-render; the cloud-init shape is small. (`serde_yml` is one fork,
but check its maintenance and advisory status before adopting it.)
* `lib.rs` doc comment (`Right now, only TCP traffic is supported`) is
partly stale: the remote templates render UDP, although the local
proxy is still TCP-only (§2.7).
* `WireguardManager::new` uses `remote_name = "innisfree-<svc>-remote"`
which is purely cosmetic (it never reaches the kernel; only the
*local* iface name has a length constraint). Worth a comment, since
the asymmetry with the local-side `pick_local_iface_name` looks like
a bug at first glance.
* `config.rs:105-115` `clean_name` is a string-replace, not a regex.
`"innisfreeinnisfreeinnisfree"` becomes `"innisfree-"`; harmless,
but write a test or replace with a `strip_prefix` loop.
## 3. Kubernetes operator considerations
### 3.1 What you actually want
Stated minimally: when a `Service` of `type=LoadBalancer` appears in
the cluster, an `innisfree-controller` Pod should provision a remote
ingress (DO droplet + tunnel) and write the public IP back into
`status.loadBalancer.ingress[].ip`. Reconcile changes (added/removed
ports, deleted Service) without recreating the droplet. Survive its
own restart — re-attach to existing droplets rather than orphaning
and re-provisioning.
### 3.2 Topology choice
There are two reasonable shapes:
**(A) Controller-only, droplet drives traffic into the cluster.** The
controller runs as a single Deployment; for each Service it stands up
(or updates) a remote droplet whose nginx forwards each public port
back over wg to a *node IP : nodePort* combination. The wg tunnel
terminates on the controller pod (which holds CAP_NET_ADMIN), and the
local proxy hop forwards into the cluster network. This is the closest
mapping to today's `innisfree up`.
**(B) Controller + per-Service tunnel pod.** The controller acts as a
pure reconciler; for each Service it manages, it creates a small
DaemonSet/Deployment of "tunnel pods" that each hold one wg interface
and proxy into the Service ClusterIP. This is closer to how
[klipper-lb](https://github.com/k3s-io/klipper-lb) works.
**Recommendation: start with (A).** It needs one CAP_NET_ADMIN pod
total; it reuses the existing single-process `innisfree up` flow with
minor refactoring (managing N tunnels in one process); it punts the
per-pod orchestration question. (B) is the cleaner answer at scale,
but you are running one homelab cluster, not a multi-tenant SaaS.
### 3.3 Multi-tenancy on the cloud side
With (A) there is still a question of *one droplet per Service* vs.
*one droplet for the whole cluster*. At ~$4–6/mo per DO droplet and
hobbyist scale, the shared droplet is right by default:
* nginx happily hosts many `server { listen ...; }` stanzas, one per
port across all Services.
* Port collisions across Services *must* fail fast — surface as an
Event on the offending Service.
* Add `loadBalancerClass` discrimination so users can opt a specific
Service into "give me my own droplet" by selecting a different class
(e.g. `innisfree.ruin.dev/dedicated`).
Either way, the operator needs a *cluster-scoped backend object* that
holds the DO API token (Secret reference) and the droplet sizing
defaults. A `ClusterInnisfreeBackend` CRD is the standard shape; that
keeps the per-Service annotation surface minimal.
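The fail-fast collision check for the shared-droplet case can be sketched as a small pure function. The names are hypothetical, and the sketch ignores protocol for brevity; since TCP and UDP listeners on the same port number do not actually conflict, a fuller version would key on `(port, protocol)`.

```rust
use std::collections::HashMap;

/// A public port claimed by more than one Service.
#[derive(Debug, Clone, PartialEq)]
pub struct PortCollision {
    pub port: u16,
    pub first: String,
    pub second: String,
}

/// Return the first collision found, scanning Services in the given order.
/// Each entry is (service name, public ports). Protocol is ignored here.
pub fn find_port_collision(services: &[(&str, &[u16])]) -> Option<PortCollision> {
    let mut owner: HashMap<u16, &str> = HashMap::new();
    for &(name, ports) in services {
        for &port in ports {
            // `insert` hands back the previous owner of this port, if any.
            if let Some(prev) = owner.insert(port, name) {
                if prev != name {
                    return Some(PortCollision {
                        port,
                        first: prev.to_string(),
                        second: name.to_string(),
                    });
                }
            }
        }
    }
    None
}
```

The `second` field names the offending Service, which is where the controller would attach the Event.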
### 3.4 API surface
Avoid a per-Service CRD if possible — users get a much better
experience if they just write `type: LoadBalancer` and pick a
`loadBalancerClass`. Annotations cover the rest:
```yaml
apiVersion: v1
kind: Service
metadata:
name: minecraft
annotations:
innisfree.ruin.dev/floating-ip: "203.0.113.42"
spec:
type: LoadBalancer
loadBalancerClass: innisfree.ruin.dev/digitalocean
ports:
- port: 25565
targetPort: 25565
protocol: TCP
```
Plus one cluster-scoped CRD for backend config:
```yaml
apiVersion: innisfree.ruin.dev/v1alpha1
kind: ClusterInnisfreeBackend
metadata:
name: default
spec:
provider: digitalocean
region: sfo2
size: s-1vcpu-1gb
apiTokenSecretRef: { name: do-token, key: token }
sharing: shared # or "per-service"
```
### 3.5 Where state lives
The CLI today persists per-tunnel state in `~/.config/innisfree/<svc>/`.
The operator must move that into Kubernetes objects, so a controller
restart can re-attach:
* A per-Service `Secret` holds wg keypairs + the SSH client keypair.
* A per-Service `ConfigMap` (or a status subresource on the
`ClusterInnisfreeBackend`) holds the droplet ID + assigned IP.
* `TunnelStateDir` becomes one of two implementations behind a small
`StateStore` trait — local-fs for the CLI, k8s-objects for the
operator. Same `for_service` / `open` shape; different backing.
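One way the `StateStore` seam could look is sketched below. The method names (`put` / `get`) and the byte-oriented values are assumptions, not the existing `TunnelStateDir` API, and `InMemory` merely stands in for the Kubernetes-backed implementation so the example stays self-contained. Service names and keys are assumed to be plain file-name-safe strings.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical storage seam: per-service key/value blobs.
pub trait StateStore {
    fn put(&mut self, service: &str, key: &str, value: &[u8]) -> io::Result<()>;
    fn get(&self, service: &str, key: &str) -> io::Result<Option<Vec<u8>>>;
}

/// Local-filesystem backing: `<root>/<service>/<key>`, as the CLI does today.
pub struct LocalFs {
    pub root: PathBuf,
}

impl StateStore for LocalFs {
    fn put(&mut self, service: &str, key: &str, value: &[u8]) -> io::Result<()> {
        let dir = self.root.join(service);
        fs::create_dir_all(&dir)?;
        fs::write(dir.join(key), value)
    }

    fn get(&self, service: &str, key: &str) -> io::Result<Option<Vec<u8>>> {
        match fs::read(self.root.join(service).join(key)) {
            Ok(bytes) => Ok(Some(bytes)),
            // A missing file means "no value yet", not an error.
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
            Err(e) => Err(e),
        }
    }
}

/// In-memory backing; a placeholder for a Secret/ConfigMap-backed store.
#[derive(Default)]
pub struct InMemory {
    entries: HashMap<(String, String), Vec<u8>>,
}

impl StateStore for InMemory {
    fn put(&mut self, service: &str, key: &str, value: &[u8]) -> io::Result<()> {
        self.entries
            .insert((service.to_string(), key.to_string()), value.to_vec());
        Ok(())
    }

    fn get(&self, service: &str, key: &str) -> io::Result<Option<Vec<u8>>> {
        Ok(self
            .entries
            .get(&(service.to_string(), key.to_string()))
            .cloned())
    }
}
```

Keeping the trait synchronous and byte-oriented is a simplification; a Kubernetes-backed store would almost certainly need async methods.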
### 3.6 Reconciliation loop, sketched
For each watched `Service` with the matching `loadBalancerClass`:
1. **Diff observed vs. desired services** by comparing `spec.ports`
to the per-Service ConfigMap.
2. **If no droplet exists**, call `Provider::create` (existing path).
3. **If droplet exists and ports changed**, call the new
`Provider::update_services` to push a new nginx config + reload.
4. **If droplet exists and is healthy**, no-op.
5. **On Service deletion** (caught via finalizer): destroy droplet,
then remove finalizer.
6. **Always** write `status.loadBalancer.ingress[0].ip` and emit
Events on the Service for transitions.
Backoff: requeue failed reconciles with exponential backoff through the
`kube::runtime::controller` error policy; poll droplet health every
minute via a separate background tick.
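The decision part of the loop above (steps 2 to 5) reduces to a pure function over desired and observed ports, which keeps it testable without a cluster. The `Action` enum and `plan` function are hypothetical names. The sketch deliberately leaves out health, since the steps above do not say what to do with an unhealthy droplet whose ports already match.

```rust
/// What the controller should do for one Service on this pass.
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    Create,
    UpdateServices(Vec<u16>),
    Destroy,
    Noop,
}

/// Decide the next action.
/// `desired` is `None` when the Service is being deleted;
/// `observed` is `None` when no droplet exists yet.
pub fn plan(desired: Option<&[u16]>, observed: Option<&[u16]>) -> Action {
    match (desired, observed) {
        // Service deleted but a droplet remains: tear it down.
        (None, Some(_)) => Action::Destroy,
        // Service deleted and nothing to clean up.
        (None, None) => Action::Noop,
        // Service wants ingress but nothing is provisioned.
        (Some(_), None) => Action::Create,
        (Some(want), Some(have)) => {
            // Compare as sorted lists so port ordering does not matter.
            let mut want = want.to_vec();
            let mut have = have.to_vec();
            want.sort_unstable();
            have.sort_unstable();
            if want == have {
                Action::Noop
            } else {
                Action::UpdateServices(want)
            }
        }
    }
}
```

The caller would map each `Action` onto the provider calls named in the steps and then write status and Events regardless of which branch ran.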
### 3.7 Where dest_ip comes from
In CLI mode, `--dest-ip` points at "wherever your local service is
bound." In operator mode the controller is in-cluster, so it can
choose:
* **NodePort** (works without CNI assumptions): the controller picks
any Ready node IP, the Service must auto-allocate a NodePort, and
the remote nginx forwards `<public_port>` → `<wg_local>:<nodePort>`
via wg into the controller pod, which then proxies to
`<node_ip>:<nodePort>`. Simple but adds a hop.
* **ClusterIP** (cleaner, requires the controller pod to participate
in the cluster network normally): the local proxy targets
`<service.clusterIP>:<port>`. This works because the controller pod
*is* in the cluster network. **This is the right default**; fall
back to NodePort only if the user opts in.
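The choice between the two targets can be expressed as a small selection function. This is a sketch under assumptions: `TargetMode`, `ServicePortView`, and `proxy_target` are hypothetical names, and the struct carries only the fields the two modes described above need.

```rust
use std::net::{IpAddr, SocketAddr};

/// Which address the local proxy hop should forward to.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TargetMode {
    /// Default: proxy straight to the Service ClusterIP.
    ClusterIp,
    /// Opt-in: proxy to a Ready node's NodePort.
    NodePort,
}

/// Minimal view of one Service port (fields assumed for illustration).
#[derive(Debug, Clone, PartialEq)]
pub struct ServicePortView {
    pub cluster_ip: Option<IpAddr>,
    pub port: u16,
    pub node_port: Option<u16>,
}

/// Pick the socket address the proxy should connect to.
pub fn proxy_target(
    mode: TargetMode,
    svc: &ServicePortView,
    ready_node_ip: Option<IpAddr>,
) -> Result<SocketAddr, String> {
    match mode {
        TargetMode::ClusterIp => {
            let ip = svc.cluster_ip.ok_or("Service has no ClusterIP")?;
            Ok(SocketAddr::new(ip, svc.port))
        }
        TargetMode::NodePort => {
            let ip = ready_node_ip.ok_or("no Ready node available")?;
            let port = svc.node_port.ok_or("Service has no allocated NodePort")?;
            Ok(SocketAddr::new(ip, port))
        }
    }
}
```

Returning an error when the required field is missing lets the controller surface the problem as an Event on the Service.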
### 3.8 Library shape required
To make all of the above buildable as a separate `innisfree-operator`
crate (or `innisfree --operator` subcommand) without forking, the
library needs:
* `Provider::update_services(&self, server: &dyn InnisfreeServer, services: &[ServicePort])`
* `Provider::lookup_by_name(&self, name) -> Option<Box<dyn InnisfreeServer>>`
* `InnisfreeServer::status() -> ServerStatus { id, ipv4, healthy, observed_ports }`
* `StateStore` trait + `LocalFs` impl (today's behaviour) and a
`Kubernetes` impl in the operator crate.
* `TunnelManager` parameterised over `StateStore`.
These are all extensions, not breaks; the CLI keeps working.
## 4. NixOS for the cloud node
### 4.1 The reconciliation problem this solves
Right now a port change is a green-field redeploy: cloud-init only
runs at first boot, and the rendered nginx + wg files are inputs to
that one-shot. An operator reconcile loop has to invent its own
*push the new file, reload the service* path. NixOS short-circuits
that: a port change becomes "regenerate the system configuration,
push the new closure, run `switch-to-configuration`." Idempotent,
declarative, and the same machinery handles initial provisioning
and incremental updates.
### 4.2 Provisioning options (initial install)
Options for getting NixOS onto a fresh DO droplet, in increasing order
of operational weight:
1. **Custom uploaded image.** Upload a NixOS image to DO once
("Custom Images"), then `image: innisfree-base` instead of
`debian-13-x64`. Single API call to provision; the image is
versioned alongside the controller. Best UX, modest setup cost.
2. **`nixos-anywhere`** ([nix-community/nixos-anywhere](https://github.com/nix-community/nixos-anywhere)).
Boots Debian via cloud-init, then ssh-kexecs into a NixOS installer
and writes the desired flake config to disk. ~5 min from
`do create` to a working NixOS box. No image upload needed.
Operationally heavier (the controller needs a working nix store),
but the canonical answer if the controller already runs nix.
3. **`nixos-infect`-style script.** A shell script in cloud-init that
converts the running Debian to NixOS in place. Older, fragile, do
not recommend over the two above.
For the homelab + flake-built-controller setup, **(2) is the right
answer for now, with a path to (1)** once the image-build pipeline
is stable. Both keep the existing "spec.user_data" call for initial
SSH key injection — only the post-install path differs.
### 4.3 Reconciliation options (incremental update)
Once the box is NixOS, three tools can push updates:
1. **`nixos-rebuild switch --target-host innisfree@<ip> --flake .#node`.**
Stock NixOS tooling, no extra dependency. The controller runs this
via `xshell`. Works fine for one host; awkward to parallelise across many.
2. **[colmena](https://github.com/zhaofengli/colmena).** Multi-host
declarative deploys. Reads a `colmena.nix`, computes per-host
closures, pushes in parallel, runs `switch`. Designed for fleets;
minor overkill for one cluster but reads well.
3. **[deploy-rs](https://github.com/serokell/deploy-rs).** Similar to
colmena, written in Rust, with magic-rollback on broken activations.
The rollback story is the strongest argument for it — a bad nginx
config gets reverted automatically.
For a single droplet per cluster: **`nixos-rebuild --target-host`** is
sufficient and means no extra dep. For per-Service droplets,
**deploy-rs** is the better fit (parallel + rollback). Either way,
the controller's job is to render the flake input, kick the tool, and
update Service status — it does not need a long-lived `colmena`-style
process.
### 4.4 What the remote configuration looks like
A thin `nixosModules.innisfree` exported by the flake, parameterised
over `services`:
```nix
{ config, lib, ... }: {
  options.services.innisfree.ports = lib.mkOption {
    type = with lib.types; listOf (submodule { … });
  };
  config = let s = config.services.innisfree.ports; in {
    services.nginx = { enable = true; … };
    networking.firewall.allowedTCPPorts =
      map (p: p.publicPort) (lib.filter (p: p.protocol == "TCP") s);
    # NixOS exposes WireGuard under `networking.wireguard`, not `services.wireguard`.
    networking.wireguard.interfaces.wg0 = { … };
  };
}
```
Per-Service rendering becomes "build a `services.innisfree.ports = […]`
attribute set" — pure data, easy to generate from Rust. The flake
itself is committed; only the runtime input changes.
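Generating that runtime input from Rust could look roughly like the sketch below. It assumes the submodule has exactly the two fields the module snippet references (`publicPort` and `protocol`); the real submodule is elided above and may carry more. The function name is hypothetical. Using an enum for the protocol avoids having to escape arbitrary strings into Nix syntax.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Protocol {
    Tcp,
    Udp,
}

impl Protocol {
    /// String form compared against in the module (`p.protocol == "TCP"`).
    fn as_nix(self) -> &'static str {
        match self {
            Protocol::Tcp => "TCP",
            Protocol::Udp => "UDP",
        }
    }
}

/// Render a `services.innisfree.ports` assignment as Nix source text.
/// NOTE: field set is an assumption based on the names used in the module.
pub fn render_ports_nix(ports: &[(u16, Protocol)]) -> String {
    let mut out = String::from("services.innisfree.ports = [\n");
    for (port, proto) in ports {
        out.push_str(&format!(
            "  {{ publicPort = {}; protocol = \"{}\"; }}\n",
            port,
            proto.as_nix()
        ));
    }
    out.push_str("];\n");
    out
}
```

The rendered text would be written into whatever file the committed flake imports as its runtime input; how that file is wired in is left open here.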
### 4.5 Costs / reasons to think twice
* **Cold start** is slower: a fresh droplet pulling the activation
closure can be 1–3 minutes vs. ~30s for cloud-init. Worth measuring
before committing.
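* **Disk:** NixOS keeps generations; on a 25GB droplet with frequent
reconciles, you'll want periodic garbage collection, e.g.
`nix-collect-garbage --delete-older-than 7d` on a timer (the `nix.gc`
module options can schedule this declaratively).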
* **Controller dependency:** the operator pod must ship nix. That
means a much larger image (several hundred MB at minimum) than the
current scratch-based static binary. Acceptable, but no longer "tiny."
* **Debuggability shifts:** operators familiar with Debian + systemd
will need to learn nix store layout to triage live issues.
### 4.6 Verdict
NixOS is the right destination *if and only if* the operator pattern
proceeds — the win is reconciliation, and the CLI does not need it.
For the standalone `innisfree up` use case, Debian + cloud-init keeps
working fine and the migration cost has no payoff.
The cleanest sequencing: do the maintainability + operator work
**first**, against the existing Debian remote (since the in-band
reconfigure path is small once the trait extensions land), and treat
the NixOS migration as a Phase 3 swap — by then the `Provider`
abstraction will have the seams to make it a `NixosProvider` next to
`DigitalOceanProvider` (or, more likely, a different *bootstrap layer*
within the DO provider).
## 5. Open questions
* **Packaging for the operator?** Ship as a separate `innisfree-operator`
crate in the same workspace, or as a `--operator` subcommand on the
existing binary? Workspace is cleaner; subcommand is fewer artefacts.
* **Floating IP semantics in shared-droplet mode** are awkward — the
IP belongs to the droplet, not to a particular Service. Annotation
works for "give this Service its own droplet"; otherwise ignore.
* **IPv6.** DO droplets get a v6 by default; the trait, the wg config,
and the nginx stream all assume v4 only. Worth scoping into the
operator work — a `Service` may want both.
* **Per-Service vs shared droplets** — confirm the default. Shared
is what I've assumed throughout.
* **Region selection per Service** would matter for multi-region
homelabs (ha!). Probably YAGNI; backend-level region only.