rayfish 0.1.4

# Rayfish

P2P mesh VPN powered by [iroh](https://iroh.computer). Connects peers by cryptographic identity (EndpointId), not IP address. Dual-stack addressing: stable IPv4 in 100.64.0.0/10 (CGNAT, FNV-1a of identity) and stable IPv6 in 200::/7 (blake3 of identity, 120-bit, never rotates).
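The IPv4 half of that mapping can be pictured with a short sketch. It assumes the 32-bit FNV-1a variant over the raw identity bytes and keeps the low 22 bits of the hash as the host part of `100.64.0.0/10`. The function names and the way a collision index is folded in are illustrative assumptions, not the code in `src/membership.rs`.

```rust
use std::net::Ipv4Addr;

/// 32-bit FNV-1a over a byte slice (standard offset basis and prime).
fn fnv1a32(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5;
    for &b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(0x0100_0193);
    }
    hash
}

/// Hypothetical mapping: the low 22 bits of the hash become the host part of
/// 100.64.0.0/10. Adding `collision_index` before masking is an assumption
/// about how a contested address could be re-seated, not the crate's formula.
fn mesh_ipv4(identity: &[u8], collision_index: u32) -> Ipv4Addr {
    const CGNAT_BASE: u32 = 0x6440_0000; // 100.64.0.0
    const HOST_MASK: u32 = (1 << 22) - 1; // a /10 leaves 22 host bits
    let host = fnv1a32(identity).wrapping_add(collision_index) & HOST_MASK;
    Ipv4Addr::from(CGNAT_BASE | host)
}
```

Because the address is a pure function of the identity, a peer gets the same IPv4 address on every host that derives it. The sketch does not exclude the all-zero or all-ones host values.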

## Build

```bash
cargo -q build                 # add --features tor for Tor transport, --features otel for OTLP span export
cargo -q check
cargo -q test
cargo -q clippy
cargo bench                    # Criterion microbenchmarks of the per-packet data path (benches/forward.rs)
```

The crate splits into a library (`src/lib.rs`, daemon modules as `pub mod`) and a thin binary (`src/main.rs`, the `ray` CLI/IPC client, `use rayfish::…`). The split lets benchmarks (`benches/`) and integration tests reach the internal data path; `cargo install` builds the binary against the in-package library unchanged.

## Run

The daemon (`ray daemon`) owns the TUN device and iroh endpoint and runs as a system service. CLI commands talk to it over Unix-socket IPC.

```bash
sudo ray up                    # install+start the service, then activate the VPN
ray create [--open] [--name n] [--hostname h] [--tor]   # closed by default; --open = public network. Prints room id (public key)
ray join <room-id-or-invite> [--name alias] [--hostname h] [--auto-accept-firewall] [--tor]  # join by room id or one-time invite code; --auto-accept-firewall auto-installs suggested rules (managed node/server)
ray leave <net> | nuke <net>   # nuke = publish empty record then leave
ray hostname <net> <name>      # change hostname on existing network
ray status                     # all networks (works without daemon); per-host traffic, member count excludes self. Ends with a `pending` summary of things awaiting the user (firewall suggestions, join requests, file offers, connection requests) each with the command that clears it
ray <cmd> --json               # global flag: machine-readable JSON for status/firewall show/files/invite list/requests/admin list/ping/netcheck/identityof (color + spinners off)
ray report                     # bundle logs+metrics, open a pre-filled GitHub issue
ray ping <peer> [-c N] [-i ms] # active mesh probe: per-probe RTT + loss + direct/relay path for a peer (hostname/mesh IP/short id). Sends live echo probes (ControlMsg::Ping/Pong over the mesh connection), unlike status's passive snapshot. -c/--count (default 3), -i/--interval ms (default 1000); --json emits the per-probe array
ray netcheck                   # local endpoint diagnostics: bound UDP port (+ fixed-forwardable vs ephemeral fallback), home relay + its latency, public IPv4/IPv6, UDP reachability (iroh net report). --json
ray up [--hostname h] | down   # activate / standby. down takes the data plane (TUN + Magic DNS) offline but stays connected to peers (still online); --hostname sets your default name

ray invite <net> [--expires 7d] [--hostname H] [--qr]   # coordinator-only: mint single-use invite; --qr prints a scannable QR; --hostname binds an authoritative name (overrides joiner choice, rejected on collision)
ray invite <net> --reusable [--expires 30d]          # mint a reusable (multi-use, expiring) key for unattended fleets; rides the signed blob, no hostname binding. Servers: ray join <key> --hostname H --auto-accept-firewall
ray invite <net> list|revoke <id>          # list / revoke invites (reusable keys tagged; revoke propagates via the blob)
ray requests <net>             # coordinator-only: peers awaiting live approval
ray accept <net> <id> | deny <net> <id>    # admit / reject a pending join request
ray connect <contact-id> [--hostname h]    # request a direct 2-peer connection by the peer's contact id (no room id/invite); blocks as pending until they approve
ray connections [approve <id>]             # list incoming connect requests (default) / approve one → mints a 2-peer network with the requester pre-approved
ray contact [id|rotate]        # print (default) or rotate your shareable contact id (also shown at the top of `ray status`)
ray admin <net> add <id> | list            # coordinator-only: grant the network key (co-coordinator) / list key-holders
ray firewall show|default|add|remove ...               # per-device local firewall. Default posture: inbound TCP/UDP denied, inbound ICMP allowed, outbound allowed. `firewall default allow|deny` sets the inbound default. `--port`/`-P` takes a single port, a `start-end` range, or a comma list (`80,443`, `22,8000-9000`) that expands to one rule per item
ray firewall reject on|off     # "fail fast" REJECT mode (opt-in, default off). On = a denied packet gets a TCP RST / ICMP-unreachable reply (both directions) so the initiator fails immediately ("connection refused") instead of hanging; off = silent drop (stealthy). Surfaced in `firewall show`
ray apply <spec> [--prune] [--dry-run] [--invite-missing] [--example]   # declarative deploy (YAML only): create closed nets + suggest firewall + report membership gap. Optional top-level `aliases:` (name → identity string) and `groups:` (name → [alias|hostname]) are coordinator-side shorthand, expanded **client-side** at apply time into a plain hostname-keyed firewall (never reach the blob). An alias names a user (its `user_identity`, or the device endpoint id if unpaired) and expands to all that user's currently-joined device hostnames; a group is a named set of aliases/literal hostnames usable as a rule subject or peer. Resolution precedence subject/peer = group → alias → literal. Aliases resolve only for already-joined members (warn + skip if zero joined); literal hostnames remain the pre-join identifier
ray identityof <net> <host> [--json]   # print a host's identity string to paste into a spec's `aliases:` (user_identity if paired, else device endpoint id). Open read; errors if the host isn't currently joined
ray firewall suggest <net> --subject H [--allow peer:proto:ports] [--deny peer:proto:ports]  # coordinator-only: suggest rules on any network (rides the signed blob). Subject/peer `*` = all hosts / any peer. `--allow`/`--deny` value is `[peer:]proto:ports` — the `peer:` prefix is optional, so a bare `tcp:22` (or `icmp`) means "any peer" (parsed by `main::parse_suggest_token`: a leading protocol keyword ⇒ peer `*`). Token grammar: `proto:ports` (tcp:22, udp:53, tcp:*, any:*) or bare proto (icmp, any, tcp). Suggestions are **additive** — each token becomes one allow/deny rule; an allow-list relies on the node's own inbound default-deny to block the rest (no catch-all is synthesized), denies-only ⇒ blacklist
ray firewall pending <net> | accept <net> | deny <net>  # review/accept/discard queued suggested rules. On a TTY, `pending` is an interactive picker (↑↓ · enter accept · d deny · a all · q done); piped/`--json` falls back to a static table
ray firewall auto-accept <net> on|off  # toggle this node's auto-install of suggested rules for a network (on = install current queue)
ray firewall ssh on|off | allow <net> <peer> [--user u,...] | deny <net> <peer> | show [<net>]   # embedded mesh SSH (Tailscale-style, no SSH keys). `ssh on` runs an SSH server on this node's mesh IPs:22 (seeds an `allow in tcp:22` passthrough, origin `Ssh`); `allow <net> <peer>` authorizes a peer (hostname/mesh-ip/short-id, or `*` = any peer on the network) to log in. `--user`/`-u` (comma list) restricts which local unix users the peer may log in as: omitted ⇒ any **non-root** user (secure default), named accounts ⇒ only those, `*` ⇒ any user incl. root; setting a peer's rule replaces its user list. The per-user check is by uid (a uid-0 account under any name is blocked unless root is granted). Connect with a stock client (`ssh user@host.ray`); the peer is identified by its mesh identity (already proven by the QUIC link → `auth_none` is the gate). Global toggle in settings.toml (`ssh_enabled`); per-network allow list in `networks/<name>.toml` (`ssh_allow`, a list of `SshRule { peer, users }`)
ray mdns on|off                # local peer discovery (default on)
ray config [get [key] | set <key> <value> [--replace] | unset <key>]   # global server overrides; keys: relay, discovery-dns, dns-upstreams. Value is a comma list of presets (rayfish/n0), URLs, or IPv4s (multiple custom relays allowed). Default augments n0; --replace swaps them out. `n0`/empty resets. Written client-side to settings.toml (like mdns); all apply on `sudo ray restart`
ray send <file> <peer>         # file sharing; ray files [accept <id> [--output dir]]
ray pair [<ticket>|backup|restore <code>]              # multi-device identity
ray pair backup [--1password [--vault V] [--item T]]   # encrypted key backup; --1password stores the enc1 blob in 1Password (op CLI)
ray pair restore [<code>|--1password [--vault V] [--item T]]   # restore from a code or from 1Password
ray completions <shell>
ray version | ray --version | ray -V        # print the compiled rayfish version + git sha
ray update [--check] [--force] [--nightly] [--list] [--version V]   # self-update from GitHub releases. Default = latest stable; --nightly tracks the rolling nightly pre-release (rebuilt on every commit to master); --version V pins a specific release (downgrades allowed); --list prints available releases; --check reports current vs latest without installing. Prints the release notes of every pending version (stable: each release in (current, latest]; nightly/pinned: the resolved release body) before updating, and in --check output. Replaces this binary, then (if the service is installed) restarts the daemon onto it (needs root). No persisted channel — each run picks its target from the flag
```
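The `[peer:]proto:ports` grammar accepted by `ray firewall suggest --allow/--deny` can be sketched as a small parser. This is a minimal illustration of the rule described above (a leading protocol keyword means the peer is `*`), not the real `main::parse_suggest_token`. The `SuggestToken` type, the `parse_token` name, and the choice to represent a bare protocol as ports `*` are assumptions.

```rust
/// Parsed form of one `--allow` / `--deny` value (illustrative type).
#[derive(Debug, PartialEq)]
struct SuggestToken {
    peer: String,  // "*" means any peer
    proto: String, // tcp | udp | icmp | any
    ports: String, // "*" means all ports (assumed encoding for a bare proto)
}

const PROTOS: [&str; 4] = ["tcp", "udp", "icmp", "any"];

fn is_proto(s: &str) -> bool {
    PROTOS.contains(&s)
}

fn token(peer: &str, proto: &str, ports: &str) -> SuggestToken {
    SuggestToken {
        peer: peer.to_string(),
        proto: proto.to_string(),
        ports: ports.to_string(),
    }
}

fn parse_token(value: &str) -> Result<SuggestToken, String> {
    // `split` always yields at least one element, so `parts[0]` is safe.
    let parts: Vec<&str> = value.split(':').collect();

    // A leading protocol keyword means the optional `peer:` prefix was omitted.
    let (peer, rest): (&str, &[&str]) = if is_proto(parts[0]) {
        ("*", &parts[..])
    } else {
        (parts[0], &parts[1..])
    };
    if peer.is_empty() {
        return Err(format!("empty peer in token: {value}"));
    }

    match rest {
        // Bare protocol: `icmp`, `any`, `tcp`.
        [proto] if is_proto(*proto) => Ok(token(peer, *proto, "*")),
        // `proto:ports`: `tcp:22`, `udp:53`, `tcp:*`.
        [proto, ports] if is_proto(*proto) && !ports.is_empty() => {
            Ok(token(peer, *proto, *ports))
        }
        _ => Err(format!("invalid token: {value}")),
    }
}
```

Under these assumptions `tcp:22` and `*:tcp:22` parse to the same token, while a peer name with no protocol (`alice`) is rejected.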

**Privilege & access (Tailscale operator model):** the always-root daemon does privileged work; clients are unprivileged. The IPC socket is mode `0666`; authority comes from a per-request `SO_PEERCRED` UID check in `DaemonState::check_authorized()`, not socket permissions. Reads (`status`, `*… show`, `files`) are open to any local user; mutating commands need root or the configured `operator_uid`; `set-operator` is root-only. Only `install`, `restart`, `uninstall`, `start`, `stop`, `set-operator`, and `daemon` need `sudo`; everything else (incl. `up`/`down`) is IPC. `ray up`/`install` auto-grant operator to `$SUDO_USER`.
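The decision at the heart of that model fits in a few lines. The sketch below is a hypothetical pure core for the per-request check. The `Access` enum and `is_authorized` function are illustrative names; the real logic lives in `DaemonState::check_authorized()`, and `peer_uid` stands for the UID the kernel reports for the socket peer.

```rust
/// What a request wants to do (illustrative classification).
#[derive(Clone, Copy, PartialEq)]
enum Access {
    Read,        // e.g. `status`, `files`
    Mutate,      // any state-changing command
    SetOperator, // changing who the operator is
}

/// `peer_uid` is the caller's UID as reported by the kernel for the socket
/// peer; `operator_uid` is the configured operator, if one has been set.
fn is_authorized(access: Access, peer_uid: u32, operator_uid: Option<u32>) -> bool {
    match access {
        // Reads are open to any local user.
        Access::Read => true,
        // Mutations need root or the configured operator.
        Access::Mutate => peer_uid == 0 || operator_uid == Some(peer_uid),
        // Only root may change the operator.
        Access::SetOperator => peer_uid == 0,
    }
}
```

Since the check depends only on the kernel-reported UID, the world-writable socket mode grants no authority by itself.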

```bash
sudo ray install | restart | uninstall      # manage the service unit/plist
sudo ray start | stop                        # start / stop the service. stop = fully offline (closes peer connections); start = back online
sudo ray set-operator <user>                 # authorize a user to run ray without sudo
```

### Cross-compile & deploy

```bash
just cross                     # build for x86_64 Linux
just deploy <ip>               # cross-build release + install + start daemon
just deploy-dev <ip>           # same, debug build
```

## Architecture

```
App → TUN (100.64.x.x / 200::x) → rayfish → iroh QUIC datagrams → peer
```

One iroh Endpoint and TUN device are shared across all networks. Each network gets its own ALPN (`rayfish/net/<version>/<pubkey-prefix>`); the `ProtocolRouter` dispatches incoming connections by ALPN to per-network handlers. The leading `<version>` (`transport::MESH_PROTOCOL_VERSION`) makes the ALPN the **mesh protocol-version gate**: iroh negotiates the ALPN during the QUIC handshake, so peers on different mesh versions share no common ALPN and cannot connect — no in-band version handshake exists. Bumping the constant on a breaking mesh change severs old peers automatically (likewise the versioned `connect`/`files`/`pair` ALPNs gate their own protocols).
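A small model shows why the version segment acts as a gate. The helper names below (`alpn_for`, `negotiate`) are hypothetical, the numeric version type is an assumption, and `negotiate` only models the byte-exact matching that ALPN selection performs; it is not iroh's API.

```rust
/// Hypothetical helper mirroring the `rayfish/net/<version>/<pubkey-prefix>`
/// layout. A numeric version is an assumption made for this sketch.
fn alpn_for(mesh_version: u32, pubkey_prefix: &str) -> Vec<u8> {
    format!("rayfish/net/{mesh_version}/{pubkey_prefix}").into_bytes()
}

/// Model of ALPN selection: the first offered protocol that the acceptor
/// also supports, compared byte for byte. `None` means the handshake fails.
fn negotiate<'a>(offered: &'a [Vec<u8>], supported: &[Vec<u8>]) -> Option<&'a [u8]> {
    offered
        .iter()
        .find(|alpn| supported.contains(*alpn))
        .map(|alpn| alpn.as_slice())
}
```

Two peers on the same network but different mesh versions produce different byte strings, so `negotiate` returns `None` and no connection forms.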

### Modules

Ownership plus the non-obvious invariants only; read the file for the mechanics.

- `src/main.rs` — thin clap CLI + IPC client (`Cli`/`Command`, `main` dispatch, tracing/panic plumbing, shared presentation helpers). `ray daemon` (hidden) runs the foreground daemon loop; `build.rs` stamps the git short SHA into `RAY_GIT_SHA`. Command handlers live in `src/cli/` (one file per domain); the module wiring lets any handler resolve the others and the shared helpers by bare name (each file opens `use crate::*;`, `cli/mod.rs` re-exports, `main.rs` does `use cli::*`). Handlers and the consts/enums they cross are `pub(crate)`.
- `src/daemon/` — daemon process (directory module). `mod.rs` holds core types + process wiring (`DaemonState`, `NetworkState`, `ProtocolRouter`, accept-state structs, `run_daemon`/`build_daemon`, the `handle_request` IPC dispatcher, spawn_* background tasks); IPC handlers are split by domain into `src/daemon/handlers/` (each an extra `impl DaemonState` block, opening `use super::super::*;`). Every `DaemonState` method is `pub(crate)`. `MeshCtx` bundles the daemon-wide cheap-`Clone` handles (identity, `PeerTable`, `tun_tx`, stats, blob store, firewall, DNS tables, `DeviceUserMap`) so a new daemon-wide dependency is one field, not a param at every spawn/reconverge call site. **Admission gate:** `CoordinatorAcceptState::handle_connection` -> `admit_peer` (open / valid invite / pre-approved) vs queue-as-pending (closed); runs on **any node holding the network key** (`register_coordinator_handler` at startup, `promote_to_coordinator` on `AdminGrant`). Fresh joins dial in `coordinator_dial_order` (minter first, then other `is_coordinator` members). Diagnostics (`ping`/`netcheck`) and `ray connect` handlers live here too; both are open reads.
- `src/ipc.rs` — `IpcMessage` enum (all IPC requests/responses), `MsgpackCodec` (length-prefixed msgpack), socket at `/var/run/rayfish/rayfish.sock`. `NetworkRole::Direct` is display-only.
- `src/identity.rs` — persistent Ed25519 keypair (`<config_dir>/secret_key`, `0600`); device certs.
- `src/onepassword.rs` — `op` CLI wrapper for `ray pair backup/restore --1password`; transports the already-encrypted `enc1…` blob only, CLI-side.
- `src/invite.rs` — coordinator-only **single-use** invite ledger (`invites/<network>.toml`, `0600`); only the blake3 hash is persisted, never the secret. `redeem` burns at admission, `restore` un-burns if `admit_peer` later rejects the join (so a collision doesn't lock the holder out). Cross-coordinator gossip via `record_shared`/`burn_by_hash`. Reusable keys live in the signed blob (`membership::ReusableKey`), not here.
- `src/membership.rs` — IPv4/IPv6 derivation, `MemberList`/`ApprovedList`, `GroupBlob` with canonical msgpack + blake3 hashing. Members carry optional `user_identity`/`device_cert`, `is_coordinator`, and `collision_index`. `assign_ip` picks the lowest free collision index at admission; `resolve_ip_tiebreak` re-seats contested entries in identity order (lowest keeps its index) before a fetched roster is applied. `validate_reusable_key` is the pure reusable-key admission decision. `SuggestedFirewall` lives in `ray-proto` so it crosses IPC, rides the blob, and parses from a spec uniformly.
- `src/transport.rs` — iroh endpoint setup + per-protocol ALPNs. **The version segment in each ALPN is that protocol's compatibility gate** (`network_alpn` = `rayfish/net/<MESH_PROTOCOL_VERSION>/<prefix>`; `CONNECT_ALPN`/`FILES_ALPN`/`PAIR_ALPN` versioned independently) — bump it on a breaking change to sever old peers. Binds a **fixed UDP port** `RAYFISH_LISTEN_PORT` (41383) for stable manual port-forwarding, falling back to ephemeral if in use. The endpoint is shared by every network on the node, and a router forwards a given port to a single host, so the forward benefits only one node per LAN. Relay/discovery overrides from `config::ServerOverride` are no-ops when unset (default bind is byte-for-byte unchanged). Optional Tor transport (`tor` feature).
- `src/tun.rs` — async dual-stack TUN (`TunReader`/`TunWriter`). `route_peer_range()` installs the `200::/7` route and **must run after link-up**: Linux won't install an IPv6 connected route while the link is down, and without that route peer traffic leaks out the host default route. `route_self_loopback` sends our own dual-stack addresses via `lo0` so self-traffic is answered locally (macOS only; a point-to-point `utun` lacks the auto loopback route).
- `src/forward.rs` — TUN <-> peer forwarding, firewall enforcement, labeled drop counters. `run_mesh` intercepts UDP to `MAGIC_DNS_V4:53` and answers in-daemon, so Magic DNS never binds host port 53. On a firewall deny with `reject` mode on, emits a `reject::build_reject` reply. **Ingress anti-spoofing:** `evaluate_inbound` drops a datagram whose source IP isn't the sending peer's assigned mesh IP — this is what lets `ssh.rs` trust the socket source IP as peer identity. Also runs the SSH port NAT (see `ssh.rs`).
- `src/ssh.rs` — embedded mesh SSH server, Tailscale-style: a stock `ssh user@host.ray` lands here over the already-authenticated mesh link, so `auth_none` is the gate. **Port handling:** can't bind `<mesh-ip>:22` while a host sshd holds `0.0.0.0:22`, so `russh` binds `:SSH_LISTEN_PORT` (30022) and `forward.rs` runs a userspace `:22`<->`:30022` NAT (portable, leaves the host sshd untouched, gated by `set_ssh_nat_active`). Admits iff the peer (or `*`) is in a shared network's `ssh_allow` **and** the requested unix user passes `UserPolicy::permits` (uid-checked, so a uid-0 account can't bypass the non-root default). Drops privileges in `pre_exec` (fail-closed). `russh` uses the `ring` backend (aws-lc-rs's C build breaks cross-compilation).
- `src/reject.rs` — "fail fast" REJECT reply builder (`ray firewall reject on`): TCP RST for denied TCP, ICMP unreachable otherwise, src/dst swapped. Returns `None` (stay silent) for loop-risk cases: incoming RST/ICMP error, multicast/broadcast source, too-short packet.
- `src/dht.rs` — one pkarr record per network, signed by the per-network secret key so it can't be spoofed (the pkarr address *is* the network public key); carries blob hash + seed peers + `m,<mesh-version>`. Plus a per-user contact record for `ray connect`. Follows the `discovery-dns` config: one client = one URL, so only the **first** discovery URL is used for publish/resolve.
- `src/control.rs` — length-prefixed msgpack control protocol over QUIC streams (`JoinRequest`/Welcome/`JoinPending`/`AdminGrant`/`InviteShare`/`InviteUsed`/`MemberSync`/`Ping`/`Pong`/…). `ConnectMsg` is a separate enum for `ray connect` over `CONNECT_ALPN`. Invite gossip is ignored unless the sending peer is `is_coordinator` in the verified roster. `Ping`/`Pong` reply over a *fresh* `open_bi` stream (readers drop the request stream's send half) and are unknown to old peers (graceful 100% loss, no ALPN bump).
- `src/peers.rs` — `PeerTable` (dual v4/v6 DashMaps), `DeviceUserMap`. A peer keeps one virtual IP across every network, so each `PeerEntry` holds a *set* of connections (`network -> Connection`) and stays reachable while it shares one live connection. `remove_peer_from_network` drops one network's route, `remove` drops it everywhere.
- `src/config.rs` — config storage: **sharded, atomic, per-network** (globals in `settings.toml`, each network in `networks/<name>.toml`), replacing the old single `networks.toml` whose non-atomic rewrites raced and dropped networks. Writes go through `write_file` (temp + `rename`) and are targeted, so a write to one network can't clobber another. `pending_hostname` is the durable rename intent: unlike `my_hostname` it is **not** overwritten by a stale blob, so the rename keeps re-sending until the signed roster confirms it. `config_dir()` = `/etc/rayfish` (Linux) / `~/.config/rayfish` (macOS); secret-bearing files `0600 root:root`, others `0640 root:rayfish`. `load()` runs a one-time `migrate_legacy` split. `ServerOverride` resolver/validator helpers back `ray config`.
- `src/apply.rs` — declarative `ray apply` spec (`DeploySpec`, **YAML only**; the `config` crate lowercases keys, so names must be lowercase). `expand_firewall` is the pure client-side expansion of aliases/groups to concrete hostnames (precedence group -> alias -> literal; `*` untouched) — aliases/groups never reach the blob. Orchestrator is `main::ipc_apply`.
- `src/firewall.rs` — per-device firewall (direction/proto/port/peer + optional arrival-`network`), `ArcSwap` for lock-free reads. Secure-by-default posture via `default_inbound`/`default_outbound` + the seeded removable `allow in icmp` rule. `RuleOrigin` (`Local` | `Network(net)`) records provenance so reconvergence replaces the `Network(net)` set without touching `Local` rules. `materialize_suggestions` is purely additive (one rule per token, no synthesized catch-all: an allow-list relies on the node's own inbound default-deny).
- `src/dns.rs` — Magic DNS responder for the `.ray` TLD, reached via `100.100.100.53` through the TUN (no host port 53 bind). `sync_network_hostnames` rebuilds a network's forward+reverse entries from its roster on every roster update (roster is the single source of truth).
- `src/dns_config.rs` — OS DNS config (`DnsConfigurator`). Points the OS at `100.100.100.53`; Linux detection chain systemd-resolved -> NetworkManager -> resolvectl -> resolvconf -> `/etc/resolv.conf` takeover. **Anti-trample:** inotify re-assert + an NM `dns=none` drop-in; both the drop-in and the `.before-rayfish` backup are marker-guarded so we never touch an operator's own config. **Crash safety:** the panic hook restores resolv.conf synchronously before `abort()`, and `restore_stale_backups()` cleans up on next start (else a crash blackholes DNS).
- `src/hostname.rs` / `src/network_name.rs` — hostname + local-alias generation and collision resolution (`resolve_collision` appends `-1`, `-2`, …).
- `src/stats.rs` — iroh-metrics `ForwardMetrics`/`PeerMetrics`, Prometheus export on `:9090`; `snapshot()` for `ray report`.
- **CLI presentation** (all gated on `style::is_enabled()` = TTY + not `NO_COLOR`/`--json`): `src/style.rs` (ANSI palette + glyphs), `src/layout.rs` (width-aware column aligner; `main::table()` is the shared list helper), `src/progress.rs` (indicatif spinners), `src/picker.rs` (crossterm inline picker for `ray firewall pending`). Firewall rules cross IPC as pre-stringified `ray_proto::ipc::FirewallRuleView`.
- `src/logdir.rs` — daemon log dir (`/var/log/rayfish` Linux, `/Library/Logs/rayfish` macOS); rolling daily files, 7 retained, bundled by `ray report`.
- `src/ratelimit.rs` — `ControlGate`: per-connection token bucket + strike counter over inbound control messages; `check()` returns `Allow`/`Drop`/`Close`. One per control-listener task.
- `src/shutdown.rs` — SIGINT/SIGTERM via `CancellationToken`.
- `src/audit.rs` — append-only audit log held by `PeerTable`; logs `connect`/`disconnect` on a peer's first/last connection. Best-effort.
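The `ControlGate` entry in the list above describes a token bucket combined with a strike counter. A minimal sketch of that technique follows. The `Gate` type, its parameters, and the rule that a connection closes once strikes reach a limit are assumptions for illustration, and the clock is passed in as seconds so the behaviour is deterministic.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    Drop,
    Close,
}

/// Illustrative token bucket with a strike counter (not the real `ControlGate`).
struct Gate {
    tokens: f64,
    capacity: f64,
    refill_per_sec: f64,
    strikes: u32,
    max_strikes: u32,
    last: f64, // timestamp of the previous check, in seconds
}

impl Gate {
    fn new(capacity: f64, refill_per_sec: f64, max_strikes: u32) -> Self {
        Gate { tokens: capacity, capacity, refill_per_sec, strikes: 0, max_strikes, last: 0.0 }
    }

    /// Called once per inbound control message with the current time.
    fn check(&mut self, now: f64) -> Verdict {
        // Refill for the elapsed time, capped at the bucket capacity.
        let elapsed = (now - self.last).max(0.0);
        self.last = now;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            return Verdict::Allow;
        }
        // Over budget: count a strike, and close once the limit is reached.
        self.strikes += 1;
        if self.strikes >= self.max_strikes {
            Verdict::Close
        } else {
            Verdict::Drop
        }
    }
}
```

With `Gate::new(2.0, 1.0, 2)`, four messages arriving at the same instant yield `Allow`, `Allow`, `Drop`, `Close`.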

### Key flows

- **Create:** generate per-network `SecretKey` → derive addresses → build initial `GroupBlob` → publish blob + signed pkarr record → persist keys + `group_mode` → print public key as the room id. Closed (`Restricted`) by default; `--open` for public.
- **Access modes & admission:** the room id (network public key) is a published discovery key, **never** an admission credential. **Open** networks auto-admit any peer that reaches a coordinator. **Closed** networks gate three ways: a one-time **invite** (coordinator-only local ledger, gossiped via `InviteShare`/`InviteUsed` so any coordinator can redeem a cross-minted one); a **reusable key** (hash rides the signed `GroupBlob.reusable_keys` — multi-use, expiring, revocation propagates via the blob; `validate_reusable_key`; admits non-authoritatively — joiner-chosen hostname, suffix on collision); or **live approval** (unknown peer queued in `NetworkState.pending`, surfaced via `ray requests`, admitted with `ray accept`). The handler is `CoordinatorAcceptState`, run by **any node holding the network key** (`register_coordinator_handler` at startup, `promote_to_coordinator` on `AdminGrant`). The admitting coordinator assigns the joiner's IPv4 via `assign_ip` (lowest free collision index).
- **Join handshake:** resolve pkarr record → fetch + verify `GroupBlob` → dial in `coordinator_dial_order` (invite-pinned minter first, then other `is_coordinator` members, skipping self) until one replies `Welcome` → send `JoinRequest { invite_secret? }` first → coordinator replies `Welcome` (admitted), `JoinPending` (closed, awaiting `ray accept` — the joiner retries with backoff on the *same* coordinator; `JoinPending` is not a fallback trigger), or `JoinDenied`. The secret is matched first against the local single-use ledger, then the verified blob's `reusable_keys`; a single-use match burns, a reusable one does not. `ray join <reusable-key> --hostname H --auto-accept-firewall` is the unattended-server path. Then connect to other members with `MeshHello` and poll pkarr for blob updates. Reconnecting/restoring members use the legacy coordinator-speaks-first handshake (`initial = false`).
- **Gatekeeper:** any coordinator (any network-key holder) can approve identities and broadcast `MemberApproved`; once approved, any peer can welcome that identity. So admitting a fresh joiner survives any single coordinator being offline — the joiner dials the full coordinator set. The coordinator need not be online for *member* reconnects at all.
- **DHT (single-record):** one pkarr record per network signed by the per-network secret key. The pkarr address *is* the network public key, so records can't be spoofed (MITM-resistant). `spawn_group_poller()` polls the record every 60s and refetches the blob when its hash changes.
- **Reachability model (segmentation-first):** a network is a reachability boundary — two peers exchange packets iff they share ≥1 network (a QUIC connection only exists within a shared network, so connection existence enforces it). Coarse access is the network split; the per-device firewall is the fine-grained layer (directional, port-, and network-scoped). Declarative provisioning of networks + suggested firewalls is `ray apply` (Phase B).
- **Firewall (local + coordinator suggestions):** per-device, first-match-wins, persisted in `firewall.toml`, with a stateful conntrack so return traffic for outbound flows passes under a deny default. **Secure-by-default inbound** (`default_inbound` serde-default `Deny`, `default_outbound` serde-default `Allow`): inbound TCP/UDP denied, inbound ICMP allowed, outbound allowed (conntrack lets return traffic back). ICMP-allow is the seeded, removable `allow in icmp` rule (not a special case) — deleting it makes the deny default cover ICMP. `ray firewall add` inserts at the front (newest wins) and merges by selector (`firewall::same_selector`, ignoring action), so toggling allow↔deny never accumulates dead rules. Applies to **all installs on upgrade** — an older `firewall.toml` missing the new fields deserializes into the secure posture (the seeded ICMP rule ships only with a fresh config, so an existing file keeps exactly its own rules). `ray firewall default allow|deny` flips the inbound default; neither touches outbound. On **any** network the coordinator (any network-key holder) can **suggest** rules — advisory, riding the signed `GroupBlob` (keyed by subject hostname; `*` subject = every node). Each node materializes rules for its own hostname (+ `*`), resolving peer hostnames → identities from the blob's member list (`*` peer = any), expanding each `proto:ports` token into one rule **additively** — no catch-all is synthesized, so an allow-list whitelists only by relying on the node's own inbound default-deny (denies-only = blacklist; empty subject = nothing) — see `src/firewall.rs`. Consent is **per-node, per-network**: **auto-accept** (`ray join --auto-accept-firewall` / `ray firewall auto-accept <net> on`, persisted as `config.auto_accept_firewall`) or manual `ray firewall accept|deny` (`pending_suggestions`). Hostname authority (so "allow from alice" resolves to the real alice) comes from **invite binding**, not a network flag. 
Rules re-materialize on every verified reconverge — the 60s poller, or a **payload-free** `BlobUpdated`/`MemberSync` *trigger* that reconverges from the network-key-signed pkarr record (`reconverge_and_apply`/`fetch_verified_blob`); `Local` rules are never touched. Trust model: suggestions come only from the verified blob (signed record → hash → blob → rules), never from a control message — those are triggers only. **Fail-fast (REJECT) mode** is an opt-in per-device toggle (`config.reject`, serde-default false, set via `ray firewall reject on|off`, shown in `firewall show`): when on, a denied packet is answered with a TCP RST / ICMP-unreachable (`src/reject.rs`) instead of being silently dropped, so the initiator's socket fails immediately rather than hanging to a timeout. Both deny directions reject (local outbound → injected into our TUN; remote inbound → sent back over the peer connection, where the initiator's conntrack admits the RST and the seeded `allow in icmp` rule admits the ICMP error). Default off keeps the stealthy drop posture.
- **Multiple admins = shared network key.** An admin is any machine holding the per-network secret; `ray admin <net> add <id>` grants the key to a member over the authenticated mesh ALPN (`AdminGrant`), making it a co-coordinator that can publish the signed blob, suggest firewall rules, and **admit fresh joiners**. The granter also sets `is_coordinator = true` on the grantee and republishes so the full coordinator set is visible in the blob — joiners use this for dial-fallback. The grantee persists the key and, on `AdminGrant`, calls `promote_to_coordinator` to swap from `MemberAcceptState` to `CoordinatorAcceptState`. `ray admin <net> list` shows the local node + granted identities (local record; the shared key is not attributable).
- **Declarative apply (`ray apply`):** reconcile networks against a spec — **YAML only**, a `networks:` map of `<name> → SuggestedFirewall` (`*` subject/peer = all hosts / any peer), plus optional top-level `aliases:`/`groups:`. The orchestrator (`main::ipc_apply`) fetches `Status` once, **expands** each network's firewall client-side (`apply::expand_firewall` against a per-network `resolve_identity_hosts` closure built from the roster), then per spec network: `Create` (closed) if absent (never joins), then publishes the *expanded* firewall block as suggestions (idempotent — replaces the live set). `--prune` publishes exactly the spec's subjects, dropping out-of-band suggestions; without it, spec subjects merge over the live set. `--dry-run` echoes the **expanded** spec (or, with no aliases/groups, the normalized spec — that path needs no daemon); `--example` prints a template. **Aliases & groups** are pure spec sugar (never reach the blob): an alias names a *user* by identity and expands to all of that user's currently-joined device hostnames (resolving `user_identity`, else the device endpoint id; the coordinator's own device matches by its `Status` endpoint id); a group is a named set of aliases/literal hostnames usable as a subject or peer. Identity strings are validated/canonicalized CLI-side (`canonicalize_aliases`, errors on a bad value); an alias resolving to zero joined hosts emits a `note:` and no rule (it materializes on a later apply once the user joins). **A user has no mesh identity until a device joins/pairs, so aliases are post-join only; literal hostnames stay the provision-ahead identifier** (invite-bound, authoritative). `ray identityof <net> <host>` prints the string to paste into `aliases:`. **Membership diff:** expected = union of subject + peer hostnames of the *expanded* spec (excluding `*`); joined = this node + peers from `Status`. 
The gap is reported as `ray invite <net> --hostname <missing>` commands; `--invite-missing` mints them via IPC. Because an invite-bound hostname is authoritative, the spec's hostnames are exactly the names admitted nodes carry — so suggestions always resolve the peers they name. No lock file; the live signed blob is state.
- **Direct connections (`ray connect`):** a friend-request flow linking two peers with no shared room id or invite. Each node has a standing, **rotatable contact key** (`AppConfig.contact_secret_key`, distinct from transport and per-network keys), published to pkarr while active (`dht::publish_contact`, `_rayfish_contact` = `contact_pubkey → endpoint`) and advertised over `CONNECT_ALPN` (`rayfish/connect/1`). `ray connect <contact-id>` resolves → endpoint, dials, sends `ConnectMsg::Request{from_contact_id, from_endpoint, hostname}`; the recipient queues it (`pending_connects`) and replies `Pending`, so the initiator polls with backoff (`spawn_connect_retry`). `ray connections approve <id>` mints a 2-peer network via `create_network_inner(.., direct=true, pre_approve=Some((peer, hostname)))` — restricted, auto-named `me-peer`, requester pre-approved. The minter records `(room_id, coordinator)` in `approved_connects`; the initiator's next poll gets `ConnectMsg::Approved`, joins normally (→ `Welcome`), flags it `direct` (`join_direct`). A direct network is real (firewall/DNS/mesh apply) but `ray status` shows role `[direct]` (`NetworkRole::Direct`) and hides the room id. **Edge cases:** offline recipient → clean "contact offline" (publisher is active-gated); maps keyed by transport endpoint id survive contact-key rotation (old id stops resolving after its 300s TTL); duplicate requests idempotent; if both peers connect *and* approve at once, only the higher `endpoint.id()` mints (the lower defers via `outgoing_connects`), so exactly one network forms.
- **File sharing:** `ray send` adds the file to iroh-blobs and sends a `FileOffer` over `FILES_ALPN`; receiver queues it; `ray files accept` fetches the blob by hash and verifies it.
- **Pairing:** primary issues a ticket (`bs58(endpoint_id || secret)`) over `PAIR_ALPN`; secondary authenticates and receives a `DeviceCert` binding its transport key to the primary's user identity. Backup/restore encrypts the identity key (argon2 + chacha20poly1305) into an `enc1…` base58 blob (`make_backup_blob`). `--1password` (alias `--op`) on backup/restore transports that blob to/from a 1Password item (default title `Rayfish Identity`, optional `--vault`) via the `op` CLI (`src/onepassword.rs`, create-or-update, secret piped via stdin not argv). 1Password is transport only — the blob stays password-encrypted, so a vault compromise alone can't unlock the key. All `op` calls are CLI-side in the user's context, never from the root daemon.
- **Hostname change:** `ray hostname` propagates immediately and is coordinator-authoritative. The coordinator keeps a continuous per-member control reader (`spawn_coordinator_control_reader`); a member's rename re-sends `MeshHello`, the coordinator resolves collisions (`name`/`name-1`/…), updates roster + DNS, republishes the blob, broadcasts a payload-free `MemberSync` *trigger*. The member applies its name optimistically and is corrected when it reconverges from the signed record (on `MemberSync` or the 60s poller). The coordinator renaming itself runs the same republish+broadcast directly. **Reliable delivery:** a member's rename is persisted as `config.pending_hostname` (a durable intent), so it survives a flaky coordinator link or a daemon restart. The node announces the pending name fresh from config (`outgoing_hostname`) on every (re)connect — never a value captured at startup — and `drain_pending_rename` re-sends `MeshHello(pending)` to every roster coordinator after each reconverge until the blob reflects it. Because the drain *dials* coordinators, the coordinator's accept-side control reader always reads the hello regardless of which side first established the mesh link. `apply_roster_to_dns` is pending-aware: while a rename is unconfirmed it keeps showing/persisting the requested name and overrides the node's own DNS entry, instead of letting a stale blob revert it; once confirmed (`rename_satisfied`, which also accepts a coordinator-assigned `name-N` collision suffix) it clears the intent and follows the blob. Receivers rebuild DNS from the roster on every verified reconverge via `apply_roster_to_dns` → `dns::sync_network_hostnames` (the roster is the single source of truth for `*.ray`), clearing stale names. 
Admission hostname authority follows the **invite binding** (not a network flag): an invite-bound hostname (`ray invite --hostname`) is assigned exactly, and a clash with a different identity is rejected — no silent rename — so no peer can claim another's name to take its suggested firewall rules (`hostname::admission_hostname`). A joiner-chosen (free) hostname keeps collision resolution.
- **Reconnection:** per-peer reader detects drop → coordinator removes the dead peer; joiner reconnects with exponential backoff (1s–30s) then re-sends `MeshHello`.
- **Control-plane abuse defense:** `MemberSync`/`BlobUpdated` triggers (and `MeshHello`/invite gossip) are cheap to send but expensive to process and carry no per-message auth, so both control read loops (member listener in `join_mesh_shared`, `spawn_coordinator_control_reader`) gate each connection with a per-task token bucket — `ratelimit::ControlGate` (`src/ratelimit.rs`, the `ratelimit` crate + a strike counter). Over-budget messages are dropped; a peer that *sustains* a flood trips `Verdict::Close` and the connection is closed with `forward::ABUSE_CODE` (a non-intentional disconnect; the peer may reconnect — no quarantine). To stop a trigger burst from fanning into N reconverges, `MemberSync`/`BlobUpdated` now only `notify_one()` a **per-network debounced reconverge worker** (~300ms coalesce, single-in-flight) instead of awaiting `reconverge_and_apply` inline — so several coordinators broadcasting after one roster change collapse into a single pkarr resolve + reconverge, and a slow reconverge never blocks the accept loop. The pending-join queue is still unbounded (out of scope; `TODO(abuse-hardening)` in the closed-network admission path).
- **Leave:** `ray leave` gracefully closes its connections with `forward::LEAVE_CODE` before local teardown. Peers see `DisconnectEvent.intentional = true`: the coordinator prunes the member, republishes the blob, then broadcasts a payload-free `MemberSync` trigger so other members reconverge from the (already-republished) signed record and drop it immediately; the 60s poller is the backstop. A plain timeout/reset is *not* intentional, so an offline-but-not-departed peer stays a known member.
- **up/down (data plane) vs start/stop (whole daemon):** the daemon connects every saved network at startup (control plane) and keeps those connections for its whole lifetime, dropping them only on `leave`/`nuke`/shutdown. `activate()`/`deactivate()` toggle only the **data plane**: TUN link up/down, peer-range + loopback routes, Magic DNS config, and the inbound forward gate (the shared TUN writer drops packets while `active` is false). So `ray down` is standby: the node stays connected and online to peers (still receiving roster/blob/firewall updates) but carries no traffic and resolves no `.ray` names. `ray up` is near-instant (no re-dial). To go fully offline, `sudo ray stop` exits the daemon (connections close cleanly, peers see offline); `sudo ray start` brings it back with both planes on.
- **Report:** `ray report` → daemon `build_report()` gathers sysinfo + a `ForwardMetrics::snapshot()` + the *sanitized* `StatusResponse` (no secret keys) + recent log files, writes a `.tgz` to `/tmp`, and chowns it to the calling UID. The CLI prints the path and opens a pre-filled GitHub issue (`REPORT_REPO_URL`) to attach the bundle. Local-first, so the user reviews it before sharing; a managed upload service can later replace the GitHub step.
- **Self-update (`ray update`):** queries the GitHub releases API (`rayfish/rayfish`, the repo `install.sh` pulls from), maps host OS/arch to the published asset (`release_asset_name` → `ray-{os}-{arch}`), fetches the asset's `.sha256` sidecar first, decides whether a swap is needed, downloads the binary, **verifies SHA-256** before touching anything, then atomically swaps the running binary via `self-replace`. If the service is installed it goes through the full install path so the daemon comes back on the new binary. Needs root when the service is installed or the binary's dir isn't user-writable (`require_root`); `--check`, `--list`, and `ray version`/`--version` don't. Raw release binaries aren't archived, so no tar/gzip here. **Three targets, chosen per-invocation (no persisted channel):** default **stable** hits `/releases/latest` and gates on `semver` (`version_is_newer`, strictly-newer unless `--force`); **`--nightly`** hits `/releases/tags/nightly` (the rolling pre-release rebuilt on every commit to master by `.github/workflows/nightly.yml`) and — since nightlies share a `CARGO_PKG_VERSION` — decides up-to-date by comparing the published checksum against the **running binary's** SHA-256 (`sha256_hex`), not the version; **`--version X`** hits `/releases/tags/vX` and is "current" only if `X` equals the running version, so it can downgrade. `--list` enumerates `/releases` (newest first, `[pre-release]`/`(installed)` annotated). **Release notes:** before any swap (and inside `--check` when behind), `print_pending_changelog` surfaces what the update brings — stable walks `/releases?per_page=100` and prints the `body` of each non-prerelease in `(current, latest]` newest-first; nightly/pinned print the single resolved release's `body` (the `GhRelease.body` field, git-cliff output from `release.yml`). Best-effort: a fetch/parse failure prints nothing and never blocks the update. 
`build.rs` stamps the git short SHA into `RAY_GIT_SHA`; `FULL_VERSION` = `CARGO_PKG_VERSION (sha)` is what `ray version`/`--version`/`ray report` print so a nightly build is identifiable.
- **Tor (optional):** `--tor` adds `TorCustomTransport` alongside relay; onion address derived from the iroh `SecretKey`. Needs a Tor daemon (`ControlPort 9051`).
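
The per-connection gate described under **Control-plane abuse defense** can be sketched as a hand-rolled token bucket plus a strike counter. This is a minimal illustration using only the standard library, not the `ratelimit`-crate-backed `ratelimit::ControlGate` itself. `SketchGate`, its parameters, the `Allow`/`Drop` variant names, and the reset-strikes-on-allow policy are all assumptions; only `Verdict::Close` is named in the notes above.

```rust
use std::time::Instant;

/// Outcome of gating one control message. Only `Close` appears in the notes
/// above; `Allow` and `Drop` are hypothetical names for this sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Verdict {
    Allow,
    Drop,
    Close,
}

/// Hypothetical per-connection gate: a token bucket with a strike counter.
pub struct SketchGate {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
    strikes: u32,
    max_strikes: u32,
}

impl SketchGate {
    pub fn new(capacity: u32, refill_per_sec: f64, max_strikes: u32, now: Instant) -> Self {
        Self {
            capacity: capacity as f64,
            tokens: capacity as f64, // start with a full bucket
            refill_per_sec,
            last: now,
            strikes: 0,
            max_strikes,
        }
    }

    /// Gate one incoming message observed at time `now`.
    pub fn check(&mut self, now: Instant) -> Verdict {
        // Refill in proportion to elapsed time, never beyond capacity.
        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
        if now > self.last {
            self.last = now;
        }
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            self.strikes = 0; // assumed policy: an in-budget message clears strikes
            Verdict::Allow
        } else {
            // Over budget: drop the message, and close once the flood is sustained.
            self.strikes += 1;
            if self.strikes >= self.max_strikes {
                Verdict::Close
            } else {
                Verdict::Drop
            }
        }
    }
}
```

Passing `now` in explicitly keeps the sketch deterministic under test. A read loop would call `check(Instant::now())` once per message, skip the message on `Drop`, and close the connection on `Close`.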

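The admission rule described under **Hostname change** (an invite-bound name is assigned exactly or rejected; a free name gets a `name-1`, `name-2`, … suffix on collision) might look roughly like the following. This is a sketch under stated assumptions, not the body of `hostname::admission_hostname`: the function name `admit_hostname`, the `Admission` enum, and the roster shape (hostname mapped to the holder's identity string) are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    /// The hostname the admitted node will carry.
    Assigned(String),
    /// An invite-bound name is already held by a different identity.
    Rejected,
}

/// Decide the hostname for a joining node.
///
/// `roster` maps hostname -> identity of the current holder (assumed shape).
/// `invite_bound` is true when the name came from `ray invite --hostname`.
pub fn admit_hostname(
    roster: &HashMap<String, String>,
    identity: &str,
    requested: &str,
    invite_bound: bool,
) -> Admission {
    // A name only clashes when someone *else* holds it.
    let held_by_other = |name: &str| {
        roster
            .get(name)
            .map_or(false, |owner| owner.as_str() != identity)
    };

    if !held_by_other(requested) {
        return Admission::Assigned(requested.to_string());
    }
    if invite_bound {
        // Authoritative name: no silent rename, the clash is an error.
        return Admission::Rejected;
    }
    // Free name: try name-1, name-2, ... until one is available.
    let mut n = 1u32;
    loop {
        let candidate = format!("{requested}-{n}");
        if !held_by_other(&candidate) {
            return Admission::Assigned(candidate);
        }
        n += 1;
    }
}
```

Rejecting rather than renaming on the invite-bound path is what keeps suggested firewall rules tied to the node the operator named.
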
## Conventions

- Use `cargo -q` for all cargo commands; `tracing` for logging. `main::init_tracing` composes the layers (console + file + optional OTLP) with **split filters**: the console (and CLI output) stays at `info`, while the rolling daily files under `logdir::log_dir()` capture our crate at `debug` (`info,rayfish=debug`; dependencies stay at `info` so iroh/quinn don't flood the file), so `ray report` bundles carry the verbose detail without the console getting noisy. The global registry gate is the permissive (file) filter; the console layer carries its own `info` per-layer filter. `RUST_LOG` overrides both. `init_tracing` returns a `LogGuard` that must stay alive for the life of the process.
- Tracing carries spans: network lifecycle handlers (`create/join/leave/nuke_network`) use `#[tracing::instrument]`, and the per-peer reader (`forward::spawn_peer_reader`) + reconnect loop wrap their tasks in `info_span!("peer"/"reconnect", net=…, peer=…)` so report-bundle logs are correlatable per peer/network.
- `otel` feature (off by default): adds a `tracing-opentelemetry` layer exporting spans over OTLP/HTTP. Active only when `OTEL_EXPORTER_OTLP_ENDPOINT` (or `..._TRACES_ENDPOINT`) is set; flushed on shutdown via `LogGuard::drop`.
- Panics are fail-fast in the daemon: `main::install_panic_hook` (set only for `ray daemon`) records the panic via `tracing::error!`, synchronously appends it to `panic.log`, restores DNS via `dns_config::emergency_restore_resolv_conf()` (so a crash never blackholes DNS; see `dns_config`), then calls `std::process::abort()`. The service unit restarts it (`Restart=on-failure` / launchd `KeepAlive`); `panic.log` is bundled by `ray report` (and flags the issue title/body). A live-but-broken daemon wouldn't trip the restart, so we crash cleanly rather than limp.
- Never share I/O resources (TUN, sockets, streams) behind a Mutex — split into read/write halves. Avoid Mutex generally: prefer channels, atomics, or `RwLock`/`ArcSwap` for fast non-async state.
- CLI subcommands carry short `visible_alias`es (clap), so help lists them and completions pick them up: `create`→`new`, `leave`→`rm`, `status`→`st`/`ls`, `version`→`ver`, `update`→`upgrade`; action verbs `list`→`ls`, `remove`→`rm`/`del`, `show`→`ls`/`list`, `add`→`a`, `revoke`→`rm`, `approve`→`ok`. Aliases must be unique within each `#[derive(Subcommand)]` enum.
- ALPN per network: `rayfish/net/<version>/<pubkey-prefix>` (`MESH_PROTOCOL_VERSION` then first 16 hex chars). File ALPN `rayfish/files/1`, pairing ALPN `rayfish/pair/1`, connect ALPN `rayfish/connect/1`. **The version segment in every ALPN is that protocol's compatibility gate.** Each protocol versions independently: bump `MESH_PROTOCOL_VERSION` for a breaking mesh change, `FILES_ALPN`'s `/1` for a breaking file-transfer change, `CONNECT_ALPN`'s `/1` for a breaking `ray connect` change, `PAIR_ALPN`'s `/1` for a breaking pairing change. Because iroh negotiates the ALPN at the QUIC handshake, peers on different versions of a protocol share no common ALPN and can't connect — the gate is transport-enforced, with no in-band version check. **Rule of thumb: when you change one of these wire protocols in a backward-incompatible way, bump its ALPN version in the same change.** **Surfacing the failure:** the ALPN gate fails opaquely (no connection forms, so no reason can be sent). Two things recover a useful message: (1) `ray join` compares the coordinator's signed `m,<mesh-version>` from the network record *before dialing* and bails with a precise "this network runs vX, this build speaks vY — run `ray update`"; (2) `transport::connect_to_peer_with_alpn` maps an ALPN-mismatch connect error (`is_alpn_mismatch`, matches the "no known protocol"/"no application protocol" handshake error) to a "peer may be running an incompatible version (run ray update)" hint — a heuristic, used on every dial path (join/connect/file/pair) as the fallback when there's no signed version to pre-check.
- TUN MTU 1280 (IPv6 minimum link MTU, RFC 8200 §5; matches Tailscale's default, while stock WireGuard defaults to 1420). Wire format (control + IPC): 4-byte BE length + msgpack body.
- Room id = per-network public key string (discovery only). On a closed network, joining needs a one-time invite or operator approval; on an open network the room id alone is enough to join. Invite code = `bs58(pubkey || coordinator || secret)`. Local aliases (adjective-noun-noun) are display-only.
- Config under `config::config_dir()` (`/etc/rayfish` on Linux, `~/.config/rayfish` on macOS): `secret_key`, `device_cert`, `settings.toml`, `networks/<name>.toml` (one per network), `firewall.toml`, `invites/<network>.toml` (coordinator-only). Pre-migration installs auto-split the old `networks.toml` on first load (kept as `networks.toml.bak`). On Linux the tree is `root:rayfish`; secret-bearing files are `0600 root:root`. CLI commands that write identity directly (e.g. `ray pair restore`) need root on Linux since the tree is under `/etc`.
- Keep commit subjects conventional (`feat`/`fix`/`docs`/`style`/`ci`/…): release notes are generated from them by git-cliff (`cliff.toml`). `release.yml` renders the tag's grouped changelog + a `prev...new` compare link; `nightly.yml` lists commits since the last stable tag.
- Always update docs (`CLAUDE.md`, `README.md`) after finishing a feature or significant change.
- Keep `CHANGELOG.md` current as part of every change, plan, or implementation (not just at release time). Add a user-facing entry under `## [Unreleased]` in the existing Keep a Changelog format (`Added`/`Changed`/`Fixed`/`Performance`), describing behavior from the user's perspective, not the commit. On release, rename `[Unreleased]` to the new version and add a fresh empty `[Unreleased]` plus the compare-link reference at the bottom. Skip pure-internal churn (refactors, test/CI/chore-only commits) that has no user-visible effect.
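
The wire format noted above (4-byte BE length + msgpack body) can be illustrated with a small framing helper. This is a minimal sketch using only the standard library: it treats the msgpack body as opaque bytes, and the names `write_frame`, `read_frame`, and the `MAX_FRAME_LEN` cap are hypothetical (the notes above do not state a maximum frame size).

```rust
use std::io::{self, Read, Write};

/// Hypothetical upper bound so a bogus length prefix can't force a huge allocation.
const MAX_FRAME_LEN: usize = 1 << 20;

/// Write one frame: a 4-byte big-endian length, then the (already encoded) body.
pub fn write_frame<W: Write>(w: &mut W, body: &[u8]) -> io::Result<()> {
    if body.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    w.write_all(&(body.len() as u32).to_be_bytes())?;
    w.write_all(body)
}

/// Read one frame and return its body bytes, still msgpack-encoded.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut body = vec![0u8; len];
    r.read_exact(&mut body)?;
    Ok(body)
}
```

For example, writing the two-byte msgpack value `[1]` (`0x91 0x01`) yields the six bytes `00 00 00 02 91 01`, and `read_frame` on those bytes returns the original two-byte body.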