//! Single-instance guard for the trusty-memory daemon.
//!
//! Why: macOS launchd `KeepAlive { SuccessfulExit: false }` (i.e. `OnSuccess`)
//! respawns the daemon whenever it exits with a non-zero code. When a second
//! daemon instance fails to bind (EADDRINUSE — the first instance already owns
//! port 7070 and/or the UDS socket), it exits non-zero, which launchd interprets
//! as a crash and spawns yet another copy. The resulting zombie herd (69 observed
//! in the wild) exhausts file descriptors on top of the existing fd-limit bug.
//!
//! The fix: before attempting to bind, probe the discovery files. If a healthy
//! daemon is already responding to `/health`, exit **0** (success). Launchd
//! treats exit-0 as "clean shutdown" and does NOT respawn (SuccessfulExit:false
//! = restart only on non-zero). This collapses the zombie herd immediately on
//! the next invocation without touching launchd config.
//!
//! What: exposes [`single_instance_check`] (async, for real daemon startups)
//! and [`StartupAction`] (pure enum, for unit testing the decision logic).
//!
//! Test: `startup_action_*` unit tests cover every branch including the
//! stale-socket-vs-live-socket distinction.
use std::path::Path;
/// What the daemon startup should do after the single-instance check.
///
/// Why: separating the decision from the I/O lets us unit-test the logic
/// with injected probe results rather than spinning up real TCP listeners.
/// What: three variants covering the full decision tree.
/// Test: `startup_action_from_probe_result_*` tests in this module.
/// Decide what to do based on the result of an HTTP health probe.
///
/// Why: the single-instance check reduces to "did the health probe succeed?".
/// Encoding the decision as a pure function (rather than embedding it in the
/// async probe body) makes the logic unit-testable without actual network I/O.
/// What: `probe_ok = true` → [`StartupAction::ExitAlreadyRunning`];
/// `probe_ok = false` → [`StartupAction::Proceed`].
/// Test: `startup_action_from_probe_result_when_alive`,
/// `startup_action_from_probe_result_when_dead`.
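// Illustrative sketch only (the original item is not present in this excerpt):
// the mapping described above, written as a pure function. Only the two
// variants named in these docs are modelled; the real `StartupAction` has a
// third variant that is not shown here, so it is left out rather than guessed.
// `exit_code_for` is a hypothetical helper added purely for illustration.

```rust
/// Two of the variants named in the docs; the third is omitted on purpose.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartupAction {
    /// No healthy daemon answered; continue with the normal bind.
    Proceed,
    /// A healthy daemon answered `/health`; the caller should exit 0.
    ExitAlreadyRunning,
}

/// Pure mapping from the probe outcome to the startup decision.
pub fn startup_action_from_probe_result(probe_ok: bool) -> StartupAction {
    if probe_ok {
        StartupAction::ExitAlreadyRunning
    } else {
        StartupAction::Proceed
    }
}

/// Hypothetical helper: the exit code implied by an action, if any.
/// `Some(0)` is what keeps launchd from respawning under
/// `SuccessfulExit: false`; `None` means "do not exit, keep starting up".
pub fn exit_code_for(action: StartupAction) -> Option<i32> {
    match action {
        StartupAction::ExitAlreadyRunning => Some(0),
        StartupAction::Proceed => None,
    }
}
```

// Because the function takes a plain `bool`, both branches can be tested
// without opening a socket.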
/// Perform the single-instance check at daemon startup.
///
/// Why: launchd's `KeepAlive { SuccessfulExit: false }` respawns any non-zero
/// exit, so a second daemon instance that fails to bind causes an endless
/// respawn storm. Exiting 0 (when another healthy instance is detected)
/// short-circuits this because `SuccessfulExit: false` means "restart only on
/// non-zero exits", so exit 0 is treated as a voluntary clean shutdown.
/// What: reads the `http_addr` discovery file; if it contains a reachable
/// address whose `/health` responds with HTTP 200, returns
/// [`StartupAction::ExitAlreadyRunning`]. Otherwise returns
/// [`StartupAction::Proceed`] so the caller continues with normal bind.
/// Errors from reading the addr file or from the network call are silently
/// treated as "not running" (returns `Proceed`), so a missing or stale file
/// never blocks a cold start.
/// Test: integration — run `trusty-memory serve --foreground` twice in the
/// same session and observe the second exits 0 without trying to bind; the
/// unit tests in this module cover the decision logic.
pub async
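// A rough, std-only sketch of the probe described above, under several
// assumptions that are NOT taken from the original code: the discovery file
// holds a literal `IP:port` (for example `127.0.0.1:7070`), the probe is
// blocking rather than async, and a single bounded read is enough to capture
// the HTTP status line. `probe_health` and `try_probe` are hypothetical names.

```rust
use std::fs;
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::path::Path;
use std::time::Duration;

/// Returns true only if `addr_file` names an address whose `/health`
/// endpoint answers HTTP 200 within `timeout`. Every failure (missing file,
/// unparsable address, refused connection, timeout, non-200 status) maps to
/// `false`, so a missing or stale file never blocks a cold start.
pub fn probe_health(addr_file: &Path, timeout: Duration) -> bool {
    try_probe(addr_file, timeout).unwrap_or(false)
}

fn try_probe(addr_file: &Path, timeout: Duration) -> Option<bool> {
    // Assumption: the file contains one `IP:port`, possibly with a newline.
    let raw = fs::read_to_string(addr_file).ok()?;
    let addr: SocketAddr = raw.trim().parse().ok()?;

    let mut stream = TcpStream::connect_timeout(&addr, timeout).ok()?;
    stream.set_read_timeout(Some(timeout)).ok()?;
    stream.set_write_timeout(Some(timeout)).ok()?;

    let request =
        format!("GET /health HTTP/1.1\r\nHost: {addr}\r\nConnection: close\r\n\r\n");
    stream.write_all(request.as_bytes()).ok()?;

    // Only the status line matters, so one bounded read is used here.
    let mut buf = [0u8; 64];
    let n = stream.read(&mut buf).ok()?;
    let head = String::from_utf8_lossy(&buf[..n]);
    let status_line = head.lines().next()?;

    // Status line shape: `HTTP/1.1 200 OK`; the second token is the code.
    Some(status_line.split_whitespace().nth(1) == Some("200"))
}
```

// The boolean result would feed the pure decision step, so the I/O and the
// decision stay separately testable.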
/// Single-instance check with up to `max_retries` additional probes.
///
/// Why (issue #1152, Tier 3): a single probe can miss a daemon that is
/// mid-boot — it wrote the addr file but hasn't yet answered `/health`.
/// Retrying with a short sleep lets a slow-boot daemon be detected and
/// this caller exit 0 (stopping the launchd respawn storm) rather than
/// proceeding to open redb, which would trigger `DatabaseAlreadyOpen`.
/// What: calls `single_instance_check` up to `1 + max_retries` times,
/// sleeping `delay_ms` between calls and stopping on the first
/// non-`Proceed` result. Returns the final `StartupAction`.
/// Test: covered by the unit tests for `startup_action_from_probe_result`;
/// the retry path is exercised by the integration guard in `main.rs`.
pub async
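// A minimal sketch of the retry policy described above, not the original
// implementation. It is synchronous (`std::thread::sleep` instead of an async
// sleep) and takes the check as a closure so it can run without a daemon.
// `check_with_retries` is a hypothetical name, and the enum again models only
// the two variants named in these docs.

```rust
use std::thread::sleep;
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartupAction {
    Proceed,
    ExitAlreadyRunning,
}

/// Calls `check` up to `1 + max_retries` times, sleeping `delay_ms`
/// milliseconds between attempts, and stops on the first result that is not
/// `Proceed`. Returns the last result observed.
pub fn check_with_retries<F>(mut check: F, max_retries: u32, delay_ms: u64) -> StartupAction
where
    F: FnMut() -> StartupAction,
{
    // The first probe always runs, even when `max_retries` is 0.
    let mut action = check();
    for _ in 0..max_retries {
        if action != StartupAction::Proceed {
            break;
        }
        // Give a mid-boot daemon time to start answering `/health`.
        sleep(Duration::from_millis(delay_ms));
        action = check();
    }
    action
}
```

// With `max_retries = 2`, a daemon that never answers costs three probes and
// two sleeps before startup proceeds; one that answers on the second probe
// ends the loop after a single sleep.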