bairelay 1.1.1

RTSP Relay for Reolink Baichuan cameras
Documentation
# Bairelay — Implementation Notes

Day-to-day implementation knowledge: API gotchas, design decisions, and load-bearing details that any contributor needs in working memory.

Static structure and dependency tables live in `docs/architecture.md`.

---

## `bairelay_neolink_core` API surface

`BcCamera` is the production type. `CameraDriver` is the dyn-compatible trait the binary holds — production code mostly takes `Arc<dyn CameraDriver>`. `CameraHandle` keeps a parallel private `bc_camera_concrete: Arc<BcCamera>` for the two operations that aren't on the trait (`logout()` during shutdown; `StreamSource::start` for video pull loops).

### Methods we rely on in production

| Method                                | Notes                                                     |
|---------------------------------------|-----------------------------------------------------------|
| `BcCamera::new(&BcCameraOpt)`         | Connects via discovery. 2–30 s depending on method.       |
| `login_with_maxenc(MaxEncryption)`    | MD5 challenge-response auth.                              |
| `logout()`                            | Concrete-only (not on `CameraDriver`); call with 5 s timeout. |
| `get_linktype()`                      | Used for keepalive (NOT `ping()`; see "Keepalive").       |
| `get_support()`                       | Capability probe (PTZ, talk, etc.). Cache result.         |
| `battery_info()`                      | Returns `BatteryInfo`. **`voltage` is i32 millivolts**, not volts. |
| `listen_on_motion()`                  | Returns `MotionData` with `next_motion()` async stream.   |
| `listen_on_floodlight()`             | Returns `mpsc::Receiver<FloodlightStatusList>`.           |
| `is_floodlight_tasks_enabled()`      | Polls schedule state — distinct from on/off.              |
| `get_pirstate()`                      | Returns `RfAlarmCfg`. One-shot query, NOT a stream.       |
| `pir_set(bool)`                       | Toggle PIR sensor.                                        |
| `send_ptz(Direction, f32)`            | Always pair with `Direction::Stop` (see Control commands). |
| `set_floodlight_manual(bool, u16)`    | Second arg is duration **seconds**. Use 30.               |
| `zoom_to(u32)`                        | Pre-multiply by 1000 (`level * 1000.0` for MQTT float).   |
| `set_ptz_preset(u8, String)`          | Preset ID + name.                                         |
| `start_video(kind, channel, record)` / `stop_video(kind)` | Returns `StreamData`; covered by `VideoStream` trait. |
| `get_snapshot()`                      | JPEG; each call wakes battery cameras — use sparingly.    |

### Calls that can hang

Every async `BcCamera` method can hang indefinitely if the camera drops the TCP connection without sending RST. Always wrap in `tokio::time::timeout`:

| Operation         | Deadline |
|-------------------|----------|
| Connect           | 30 s     |
| Keepalive         | 5 s      |
| Pollers           | 10 s     |
| Control commands  | 30 s     |
| Logout            | 5 s      |

### Calls that spawn internal tasks

`listen_on_motion()` and `listen_on_floodlight()` spawn internal tokio tasks inside `bairelay_neolink_core` that hold references to the `BcConnection`. Dropping the outer `BcCamera` (Arc refcount → 0) does not clean them up immediately — they're parked on channel reads.

**Shutdown sequence (per session, inside `CameraHandle::teardown_session_tasks`):**

`run_connected_session` is the per-session lifecycle coordinator: it sets up via `spawn_session_tasks`, runs the keepalive loop, then on connection loss / shutdown cancels `session_cancel` and hands off to `teardown_session_tasks`, which performs:

1. Wait up to **2 seconds** for `JoinSet` tasks (motion / battery / floodlight / PIR / preview-aggregator) to exit gracefully.
2. After the deadline, call `abort_all()` on the `JoinSet`.
3. `stop_all_stream_sources().await` — fires cancel + awaits each reader task with a 7 s budget. Reader tasks hold their own `Arc<BcCamera>` clones outside the `JoinSet`; without this await the next session's `start_video` can race a still-running previous reader.
4. `logout()` with a 5 s timeout (`LOGOUT_TIMEOUT`). Uses the local `concrete: Option<Arc<BcCamera>>` parameter — the trait surface intentionally omits `logout()`.
5. Clear `bc_camera` + `bc_camera_concrete`, drop the local `concrete` Arc, transition to `Disconnected`.

Without the abort deadline, shutdown hangs because internal tasks hold references that prevent cleanup. Without step 3's await, a detached reader keeps `Arc<BcCamera>` alive for up to its own `STOP_VIDEO_TIMEOUT` (5 s), which bairelay_neolink_core can't reliably handle when the next `BcCamera::new` is already in flight.

### XML parsing brittleness

Reolink firmware routinely adds and removes XML fields between model generations. `crates/core/src/bc/xml.rs` carries `#[serde(default)]` on every field that could plausibly be omitted. When a new firmware surfaces a parse error in production, the fix is almost always one of:

- Add `#[serde(default)]` to the missing field in the relevant struct.
- Add a regression test capturing the new XML payload shape in `crates/core/src/bc/xml_tests.rs`.

Inbound deserialisation tolerates absent fields; outbound serialisation still writes the field with its current value, so there's no wire-level regression for messages we send.

### Error variants of note

- `MissingAbility` — camera doesn't expose the feature. Maps to CLI exit code 6 ("unsupported"), distinct from a real protocol bug. Display: `camera does not support 'X': requested Y permission, has Z`.
- `UnintelligibleReply { reply, why }` / `UnintelligibleXml { reply, why }` — camera replied but the bytes don't parse. Display surfaces `why` (`unexpected camera reply: {why}`).

### `ptzMode = "none"` is fixed-mount, not PTZ

Reolink fixed-mount cameras (e.g. Argus Altas) report `<ptzMode>none</ptzMode>` in the support XML — non-empty but explicitly NOT a motorised mount. A naive `is_empty()` check classifies them as PTZ-capable and HA discovery emits four phantom Pan buttons. The single point of truth is `crate::capabilities::ptz_mode_indicates_ptz`: `none` / `0` / `false` (case-insensitive) and empty all return `false`. Real motorised cameras advertise `pt`, `ptz`, `3d`. Note this only stops *new* publications — leaked retained discovery topics for cameras previously misclassified must be cleaned up manually with `mosquitto_pub -r -m ""`.
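
The classification rule above can be sketched as a small pure function. This is an illustrative re-implementation rather than the actual body of `crate::capabilities::ptz_mode_indicates_ptz`; trimming surrounding whitespace is an assumption, not something the notes confirm.

```rust
/// Illustrative sketch of the rule described above; the real helper may
/// differ in detail.
pub fn ptz_mode_indicates_ptz(ptz_mode: &str) -> bool {
    // Assumption: surrounding whitespace is not significant.
    let mode = ptz_mode.trim();
    if mode.is_empty() {
        return false;
    }
    // `none`, `0`, and `false` explicitly mean "no motorised mount",
    // compared case-insensitively.
    !["none", "0", "false"]
        .iter()
        .any(|negative| mode.eq_ignore_ascii_case(negative))
}
```

With this shape, a fixed-mount camera reporting `none` is rejected, while `pt`, `ptz`, and `3d` are accepted.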

### IR LED is write-only

`crates/core/src/bc_protocol/ledstate.rs::irled_light_set` writes the IR night-vision mode but bairelay_neolink_core has no getter, so bairelay's HA `select` entity (`build_ir`) emits `state_topic: None`. HA shows it as "unknown" forever — fire-and-forget control. Don't confuse with PIR (passive IR motion sensor), which has a separate `get_pirstate` reader.

### Time + DST split across two Bc messages

Reolink cameras autonomously track DST in a Bc message *separate* from `<SystemGeneral>`. Setting the clock requires modelling both. Load-bearing semantics:

- `msg_id = 104` (`GetGeneral`) `<SystemGeneral>`: `<timeZone>` is the **base UTC offset in seconds, Reolink-inverted** (positive = west of UTC, negative = east, **DST excluded**). `<hour>` is **UTC**, not local. `<year>`/`<month>`/`<day>`/`<minute>`/`<second>` are likewise UTC.
- `msg_id = 106` (`GetDst`) `<Dst>`: `<enable>` (0/1), `<offset>` (hours), plus `<startMonth>`/`<startWeekIndex>`/`<startWeekday>`/`<startHour|Minute|Second>` and the symmetric `<end…>` family. Camera applies DST internally on display: `displayed_local = <hour> + (-<timeZone>/3600) + dst_offset_if_in_window`.
- `msg_id = 105` (`SetGeneral`) accepts the same `<SystemGeneral>` shape. Because the camera double-applies DST otherwise, a SET that wants to land at the host's current local time must compute `<timeZone>` as the host's **base** offset (subtract DST if the host is currently in DST) and `<hour>`/etc. as **UTC**. Sending host-local-effective-offset + host-local-wallclock causes a `+dst_offset` drift.
- `msg_id = 107` (presumed `SetDst`) is not currently observed on the wire — the Reolink Mac client only reads DST. Don't write DST unless the operator explicitly asks.
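
The sign convention and the DST split are easy to get wrong, so here is a minimal arithmetic sketch. The function names and parameters are hypothetical (they are not part of `bairelay_neolink_core`), and the display formula assumes whole-hour offsets, as in the formula quoted above.

```rust
/// `<timeZone>` value for a `SetGeneral` request.
///
/// `host_offset_secs_east`: the host's current effective UTC offset in
/// seconds, conventional sign (positive = east), including DST if active.
/// `host_dst_secs`: the DST seconds currently included in that offset
/// (0 when the host is not in DST).
pub fn reolink_time_zone(host_offset_secs_east: i32, host_dst_secs: i32) -> i32 {
    // Strip DST to get the base offset, then invert the sign (Reolink
    // uses positive = west of UTC).
    -(host_offset_secs_east - host_dst_secs)
}

/// Hour the camera displays, following
/// `displayed_local = <hour> + (-<timeZone>/3600) + dst_offset_if_in_window`.
pub fn displayed_local_hour(
    hour_utc: i32,
    time_zone: i32,
    dst_offset_hours: i32,
    in_dst_window: bool,
) -> i32 {
    let dst = if in_dst_window { dst_offset_hours } else { 0 };
    (hour_utc + (-time_zone / 3600) + dst).rem_euclid(24)
}
```

For a host on UTC+2 with one hour of DST included, the base offset is UTC+1, so `<timeZone>` is `-3600` and 12:00 UTC displays as 14:00. Sending `-7200` instead would display 15:00, which is the `+dst_offset` drift described above.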

`<SystemGeneral>` also surfaces `<deviceId>`, `<loginLock>`, `<lockTime>`, `<allowedTimes>` on current Argus firmware; these are **not** modelled in `crates/core/src/bc/xml.rs::SystemGeneral` and therefore aren't preserved by the read-modify-write in `set_time`. A SET silently round-trips them to default. Either model them or accept the round-trip.

---

## Wake lock and grace period

The wake-lock counter (`src/wake_lock.rs`) needs **two separate `Notify` instances**:

```rust
struct WakeLockInner {
    count: AtomicUsize,
    notify_release: Notify,   // 1 → 0
    notify_acquire: Notify,   // 0 → 1
}
```

- `notify_acquire` fires on 0 → 1 transition (idle-disconnect cameras need this to wake up when work arrives).
- `notify_release` fires on 1 → 0 transition (the grace-period timer listens for this to start its countdown).

A single-Notify design loses the acquire edge: a wakeup command would acquire the lock but the sleeping camera's `notified().await` (looking for release) would never fire.

**`notify_one()` not `notify_waiters()`.** Both transitions use `Notify::notify_one()`, which stores a single permit when no waiter is registered. A late `.notified().await` therefore still fires. `notify_waiters()` would drop the edge if nobody is currently parked, producing a TOCTOU where an idle-disconnect camera that has logged "waiting for wake lock…" but has not yet polled `wait_for_acquire()` would miss startup-wake's acquire and park forever.

**Stale-permit drain in `wait_for_acquire`.** `notify_one`'s stored permit closes the TOCTOU above but lingers across sessions: a permit set by a probe's 0→1 acquire stays in the slot until consumed. When grace fires and the run loop loops back to `wait_for_acquire().await`, the stale permit resolves the future immediately even though `count == 0`. The camera then reconnects with no actual wake-lock holder, nothing keeps the new session alive, grace fires again one grace period later, and the cycle repeats. The fix re-checks `is_idle()` after each notification and parks again on a stale permit, returning only when the count is actually held.

**`idle_since` timestamp for the watchdog gate.** `WakeLockCounter` records the `Instant` of the most recent 1→0 release in a `Mutex<Option<Instant>>` and clears it on the next 0→1 acquire. The watchdog checks `idle_since().elapsed() >= grace` instead of the instantaneous `is_idle()` it used to read. Without the elapsed gate, the watchdog tick races every freshly arrived `control/wakeup` MQTT publish and can tear a session down a few hundred ms before the wakeup acquire lands. With the gate, `grace_period.rs` handles the normal disconnect path and the watchdog is the safety net its docs always claimed: it fires only if grace_period failed to fire (e.g. it was never spawned, or it panicked).
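
As a rough illustration of the elapsed gate, the decision can be written as a pure function over an optional release instant. The name `watchdog_may_disconnect` and its signature are hypothetical; the real check lives on `WakeLockCounter` and the watchdog.

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch of the watchdog gate.
///
/// `idle_since` is the instant of the most recent 1→0 release, or `None`
/// while the wake lock is held.
pub fn watchdog_may_disconnect(
    idle_since: Option<Instant>,
    now: Instant,
    grace: Duration,
) -> bool {
    match idle_since {
        // Idle: fire only once the full grace period has elapsed.
        Some(released_at) => now.saturating_duration_since(released_at) >= grace,
        // Held: never fire.
        None => false,
    }
}
```

A release that happened a few hundred milliseconds ago does not pass the gate, so a wakeup acquire arriving in that window can still land.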

The grace period timer is reset by any new `acquire()` before its countdown expires. Default grace is 45 s, configurable via `idle_disconnect_timeout_secs`. Keep this strictly above `stream_prune_grace_secs` (default 30 s) so a cached `StreamSource` can't outlive its Baichuan session.

---

## MQTT bridge

`rumqttc::AsyncClient` is `Clone`. The bridge creates one MQTT connection in `main`, clones the `SharedMqttClient` into each per-camera task, and polls the `MqttEventLoop` in a single background task.

### Subscription strategy

Subscribe to ALL camera topics on `ConnAck` (broker connect/reconnect), **not** when individual cameras connect. Critical: sleeping cameras need to receive `control/wakeup` commands before they can come online.

### Event loop must not block

Control dispatch is spawned as a separate `tokio::spawn` task from the event loop. Dispatching inline would let a 10-second PTZ command block all MQTT processing.

### `rumqttc` 10 KiB packet-size default

`rumqttc 0.24`'s `MqttOptions` defaults `max_incoming_packet_size` and `max_outgoing_packet_size` to **10 KiB** — surprising given MQTT itself has no such default. Any new `MqttOptions` instance must call `set_max_packet_size` (we use 16 MiB) **before** the first JPEG flows.

The error rumqttc emits when the cap is hit reads `broker's maximum packet size of '10240'` — misleading; the cap is client-side. Audit this on every rumqttc bump.

### Last Will limitations

MQTT supports only one Last Will per connection. The bridge sets a global LWT on `{topic_prefix}/status` with payload `"offline"`. Per-camera status is published as `"disconnected"` during graceful shutdown. On crash, only the global LWT fires — per-camera topics retain stale `"connected"` until the next restart.

### Topic prefix

`mqtt.topic_prefix` config knob (default `"bairelay"`, drop-in compat `"neolink"`) drives every owned topic path and every HA discovery `identifier` / `unique_id`. At startup, validation rejects an empty prefix and any prefix containing characters other than alphanumerics, underscores, and hyphens.

### Shutdown ordering

The MQTT event loop is the **last subsystem to shut down**. The Ctrl+C handler cancels the global token (cameras, RTSP, watchdog, startup-wake); each per-camera teardown publishes its final `connection_state = false`; only after `orchestrator.run().await` returns does `main()` cancel a separate `mqtt_cancel: CancellationToken` and await the event-loop task with a 2 s timeout against a wedged broker.

Without the separate token, per-camera teardown races the event-loop exit and hits `Failed to send mqtt requests to eventloop`.

---

## Keepalive

Use `BcCamera::get_linktype()`, **not** `BcCamera::ping()` — matches neolink. Some cameras return `UnintelligibleReply` for link-type queries: treat that as success (camera is alive but doesn't support the query), not failure.

Allow 5 consecutive failures before disconnecting. A single dropped packet shouldn't kill the connection.

Cadence shared between production and tests via `KEEPALIVE_INTERVAL = 5 s` and `KEEPALIVE_MAX_FAILURES = 5` constants in `src/camera.rs`. Two decision-table helpers are exported for testing under a paused virtual clock:

- `classify_keepalive_tick(result)` — folds `Ok(Ok)` / `UnintelligibleReply-still-ok` / `Ok(Err)` / `Elapsed` into a `KeepaliveTickOutcome::{Ok, Failed}` binary outcome.
- `advance_keepalive_counter(state, outcome, max_failures) -> (next_failures, should_break)` — drives the consecutive-failure counter (reset on Ok, saturate at max_failures, signal break at exactly `next == max_failures`).
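
The two helpers can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in `src/camera.rs`: `TickResult` is a hypothetical stand-in for the nested timeout/camera result that the real `classify_keepalive_tick` receives, and the counter type is assumed to be `u32`.

```rust
/// Binary outcome of one keepalive tick.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeepaliveTickOutcome {
    Ok,
    Failed,
}

/// Hypothetical stand-in for the real tick result type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TickResult {
    Replied,
    UnintelligibleReply,
    CameraError,
    TimedOut,
}

/// Fold the four cases into a binary outcome. An unintelligible reply still
/// proves the camera is alive, so it counts as Ok.
pub fn classify_keepalive_tick(result: TickResult) -> KeepaliveTickOutcome {
    match result {
        TickResult::Replied | TickResult::UnintelligibleReply => KeepaliveTickOutcome::Ok,
        TickResult::CameraError | TickResult::TimedOut => KeepaliveTickOutcome::Failed,
    }
}

/// Returns `(next_failures, should_break)`: reset on Ok, saturate at
/// `max_failures`, and signal a break when the counter reaches it.
pub fn advance_keepalive_counter(
    failures: u32,
    outcome: KeepaliveTickOutcome,
    max_failures: u32,
) -> (u32, bool) {
    match outcome {
        KeepaliveTickOutcome::Ok => (0, false),
        KeepaliveTickOutcome::Failed => {
            let next = failures.saturating_add(1).min(max_failures);
            (next, next == max_failures)
        }
    }
}
```

Because both functions are pure, a test can drive them tick by tick without a camera or a real clock.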

---

## Pollers

Floodlight and PIR default to **disabled**:

```toml
enable_floodlight = false
enable_pir = false
```

Battery and motion default to enabled. Floodlight and PIR stay off by default so that cameras without those features don't generate unsupported-feature errors at connect.

PIR state (`get_pirstate()`) is a configuration query, **not** a live sensor — it changes only when explicitly set via `control/pir`. Publish once on connect, re-publish after control commands. Do NOT poll periodically.

Floodlight has two aspects on separate topics:

- `status/floodlight` — actual on/off state, published via `listen_on_floodlight()` event stream.
- `status/floodlight_tasks` — schedule enabled/disabled, published via periodic `is_floodlight_tasks_enabled()` polling.

These are distinct topics. Don't conflate them.

Update intervals (`battery_update`, `preview_update`, `floodlight_update`) must be ≥ 500 ms. Validation rejects shorter.

---

## Control commands

Match neolink's wire behaviour exactly:

- **PTZ.** Send the move command with speed = 32.0, sleep `amount / speed` seconds (clamped to 0–10 s), then **always** send `Direction::Stop`. Without the stop, the camera keeps moving forever.
- **Zoom.** MQTT payload is a float (e.g. `1.5`). Call `zoom_to((level * 1000.0) as u32)`.
- **Siren.** `bc.siren()` takes no arguments. Only fire on `"on"` payload; ignore `"off"`.
- **Wakeup.** Acquires a wake lock immediately (triggers the camera connection), then waits for the camera to actually connect before starting the countdown. 2-minute connection timeout, respects the cancellation token.
- **Reply convention.** After every control command, publish `"OK"` or `"FAIL"` (non-retained) to the same control topic. Enables HA automations to confirm command success.
- **Query responses.** `query/battery` and `query/pir` serialise the full struct to XML via `quick_xml::se::to_writer` and publish to `status/battery` and `status/pir` respectively (non-retained).
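
The PTZ hold time and the zoom scaling above are simple arithmetic. The sketch below shows one way to compute them; the helper names are hypothetical, and treating a NaN quotient as a zero-length hold is an assumption.

```rust
use std::time::Duration;

/// Speed used for every PTZ move, per the notes above.
pub const PTZ_SPEED: f32 = 32.0;

/// How long to wait between the move command and `Direction::Stop`:
/// `amount / speed` seconds, clamped to 0–10 s.
pub fn ptz_hold_duration(amount: f32, speed: f32) -> Duration {
    let secs = amount / speed;
    if secs.is_nan() {
        // Assumption: an undefined quotient (e.g. 0 / 0) means "don't move".
        return Duration::ZERO;
    }
    Duration::from_secs_f32(secs.clamp(0.0, 10.0))
}

/// Convert the MQTT float payload into the integer `zoom_to` expects.
pub fn zoom_level_to_wire(level: f32) -> u32 {
    // `as u32` saturates: negative payloads become 0.
    (level * 1000.0) as u32
}
```

A caller would send the move, sleep for `ptz_hold_duration(amount, PTZ_SPEED)`, then send the stop unconditionally.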

---

## Wire-format gotchas

### ADPCM ser/de asymmetry

`BcMedia::Adpcm` does **not** round-trip byte-exact through `serialize` → `deserialize`. The serialiser pads on `data.len() % 8`; the deserialiser pads on `(data.len() + 4) % 8` (the wire `payload_size` includes the 4-byte sub-header). The capture pipeline writes raw camera-side bytes through a single `serialize` per packet, so on-disk `.bcmedia` files parse cleanly. **Do not** re-serialise captured packets to "normalise" them — the first ADPCM packet desyncs the loop.
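
To make the asymmetry concrete, the sketch below computes the padding each side would expect. It assumes that "pads on `x % 8`" means "pad with enough bytes to bring `x` up to a multiple of 8"; that reading and the function names are assumptions, not taken from the crate.

```rust
/// Bytes needed to bring `len` up to the next multiple of 8 (assumed rule).
fn pad_to_8(len: usize) -> usize {
    (8 - len % 8) % 8
}

/// Padding the serialiser writes: based on the payload length alone.
pub fn serializer_pad(data_len: usize) -> usize {
    pad_to_8(data_len)
}

/// Padding the deserialiser skips: based on payload plus the 4-byte
/// sub-header, which the wire `payload_size` includes.
pub fn deserializer_pad(data_len: usize) -> usize {
    pad_to_8(data_len + 4)
}
```

Under this reading the two values never agree for any payload length, which is why a re-serialised ADPCM packet desyncs the reader.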

### HEVC NAL whitelist (`is_decodable_nal`)

`crates/rtsp/src/codec/nal.rs::is_decodable_nal` is the single point of truth for "what NAL units a standard receiver expects". Outbound `Frame::Video` is filtered through it in `handle_iframe`, `handle_pframe`, and the bridging-replay path.

- **H.264**: `nal_unit_type` in `1..=13` (VCL slices 1..=5, SEI 6, SPS 7, PPS 8, AUD 9, end of sequence 10, end of stream 11, filler data 12, SPS extension 13). Drops type 0 and 14..=31.
- **H.265**: type in `{0..=9, 16..=21, 32..=40}` AND `nuh_layer_id == 0`. Keeps standard VCL (0..=9), IRAP keyframes (16..=21), VPS/SPS/PPS/AUD/EOS/EOB/FD/SEI (32..=40). Drops reserved 10..=15 and 22..=31, reserved non-VCL 41..=47, and unspecified 48..=63 — Reolink Argus emits `UNSPEC62` as proprietary metadata; ffmpeg's RTP-HEVC depacketizer rejects it. Drops anything with `nuh_layer_id != 0` (ffmpeg logs `Multi-layer HEVC coding is not implemented`).
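
A minimal sketch of the whitelist follows. It is not the code in `crates/rtsp/src/codec/nal.rs`: the `Codec` enum and the signature are hypothetical, and the input is assumed to be a single NAL unit with any Annex B start code already stripped.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Codec {
    H264,
    H265,
}

/// Illustrative whitelist check on a NAL unit (header first, no start code).
pub fn is_decodable_nal(codec: Codec, nal: &[u8]) -> bool {
    match codec {
        Codec::H264 => match nal.first() {
            // H.264: nal_unit_type is the low 5 bits of the first byte.
            Some(&b0) => (1..=13).contains(&(b0 & 0x1F)),
            None => false,
        },
        Codec::H265 => {
            if nal.len() < 2 {
                return false;
            }
            // H.265: 6-bit type after the forbidden bit, then a 6-bit
            // nuh_layer_id spanning the two header bytes.
            let nal_type = (nal[0] >> 1) & 0x3F;
            let layer_id = ((nal[0] & 0x01) << 5) | (nal[1] >> 3);
            matches!(nal_type, 0..=9 | 16..=21 | 32..=40) && layer_id == 0
        }
    }
}
```

For example, an H.265 header of `0x7C 0x01` decodes to type 62 and is dropped, while `0x26 0x01` (type 19, layer 0) passes.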

In addition to the whitelist, outbound `Frame::Video` strips VPS / SPS / PPS (H.265) and SPS / PPS (H.264) on every keyframe. SDP `sprop-vps` / `sprop-sps` / `sprop-pps` fmtp attributes already carry them out-of-band and every practical client consumes them during DESCRIBE / SETUP. The strip is load-bearing for HA + go2rtc: putting VPS + SPS + PPS + IDR at the same RTP timestamp would let HA's `ffmpeg:` re-publish aggregate the three small NALs into an HEVC Aggregation Packet (NAL type 48 per RFC 7798 §4.4.2); go2rtc's RTPDepay passes the AP through opaquely and its `/api/frame.jpeg` transcoder exits with status 183.

### HE-AAC AOT branches

`handle_aac` adds 1024 ticks for AAC-LC (AOT=2), 2048 for HE-AAC (AOT=5) and HE-AACv2 (AOT=29), drops + one-shot-warns for unsupported AOTs. The ADTS path can only yield AOT ∈ {1, 2, 3, 4} (2-bit profile field), so AOT=5 / AOT=29 branches are reachable only by direct `aac_samples_per_au` calls — kept defensively in case a non-ADTS carrier ever feeds `handle_aac`.
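
The AOT branches can be summarised in a few lines. The signature of `aac_samples_per_au` below is an assumption (the notes only name the function), and `aot_from_adts_profile` is a hypothetical helper added to show why the ADTS path cannot reach the HE-AAC branches.

```rust
/// ADTS carries a 2-bit profile field; the Audio Object Type is profile + 1,
/// so only AOT 1..=4 can come out of an ADTS header.
pub fn aot_from_adts_profile(profile: u8) -> u8 {
    (profile & 0b11) + 1
}

/// Samples per access unit for the supported AOTs; `None` means the caller
/// should drop the packet (and warn once). Signature is assumed.
pub fn aac_samples_per_au(aot: u8) -> Option<u32> {
    match aot {
        2 => Some(1024),      // AAC-LC
        5 | 29 => Some(2048), // HE-AAC, HE-AACv2
        _ => None,
    }
}
```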

---

## Audio + video pacers

`bairelay::stream_source::media_pacer_task(rx, broadcast, cancel, max_lead, initial_latency, snap_on_past)` is the generic pacer; `audio_pacer_task` and `video_pacer_task` are thin wrappers. Each per-source `StreamSource` spawns one of each.

Pacer contract:

- Items are `PacedFrame { frame: Frame, duration: Duration }` pushed via bounded `mpsc::Sender::try_send` from the audio / video handlers (overflow logs once and drops the new packet — preserves recent audio).
- The pacer maintains an absolute emit cursor `next_emit_at`. Each iteration:
  - Receives an item.
  - Computes `target = next_emit_at`. Re-anchor on two conditions: (1) `target > now + max_lead` (cursor too far in the future — queue overflow / startup burst) — always snap to `now`; (2) `target < now` AND `snap_on_past = true` (cursor in the past — queue ran dry while upstream was idle) — snap to `now`.
  - If `target > now`, sleep until target. Else (`target == now` after snap, or past with snap disabled) emit immediately.
  - Broadcasts the frame.
  - Sets `next_emit_at = target + item.duration` — absolute, NOT `now_after + duration`. Per-iteration scheduler overhead would otherwise drift the cursor forward by ~1 ms per frame and the long-term slope would diverge from `clock_rate`.
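
The cursor handling in the contract above reduces to two small pure functions. This is a sketch of the decision logic only, with hypothetical names; the real `media_pacer_task` also owns the channel receive, the sleep, and the broadcast.

```rust
use std::time::{Duration, Instant};

/// Decide when the next item should be emitted.
pub fn resolve_target(
    next_emit_at: Instant,
    now: Instant,
    max_lead: Duration,
    snap_on_past: bool,
) -> Instant {
    if next_emit_at > now + max_lead {
        // Cursor too far in the future: always snap to now.
        now
    } else if next_emit_at < now && snap_on_past {
        // Cursor in the past and snapping enabled: snap to now.
        now
    } else {
        // In window, or in the past with snapping disabled (burst-drain).
        next_emit_at
    }
}

/// Advance the cursor from the target, not from the post-emit wall time,
/// so scheduler overhead does not accumulate as drift.
pub fn advance_cursor(target: Instant, item_duration: Duration) -> Instant {
    target + item_duration
}
```

The caller sleeps until the returned target if it lies in the future, emits, then stores `advance_cursor(target, item.duration)` as the new `next_emit_at`.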

Audio: `max_lead = 2 s`, `initial_latency = 500 ms`, `snap_on_past = true`. AAC-LC AU duration = `1024 / sample_rate` (= 64 ms at 16 kHz). G.711 µ-law = `samples.len() / 8 000`. The `aac_frames` field of the ADTS header (parsed from `number_of_raw_data_blocks_in_frame + 1` per ISO 13818-7 §6.2) multiplies the per-AU sample count when a stream ever packs multiple frames per ADTS packet.
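
The per-packet durations quoted above can be checked with integer arithmetic. The helper names below are hypothetical, and the sketch covers only the AAC-LC case (1024 samples per frame).

```rust
use std::time::Duration;

/// Duration of one ADTS packet carrying `aac_frames` AAC-LC frames.
/// Returns `None` for a zero sample rate.
pub fn aac_packet_duration(sample_rate_hz: u32, aac_frames: u32) -> Option<Duration> {
    if sample_rate_hz == 0 {
        return None;
    }
    let samples = 1024u64 * u64::from(aac_frames);
    Some(Duration::from_nanos(
        samples * 1_000_000_000 / u64::from(sample_rate_hz),
    ))
}

/// Duration of a G.711 µ-law packet at the fixed 8 kHz rate.
pub fn g711_duration(sample_count: usize) -> Duration {
    Duration::from_nanos(sample_count as u64 * 1_000_000_000 / 8_000)
}
```

At 16 kHz a single-frame AAC-LC packet comes out at exactly 64 ms, matching the figure above.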

The `snap_on_past = true` choice for audio is load-bearing. Without it, the audio pacer would keep its absolute anchor across upstream idle periods, so when the queue ran dry between Argus's GOP-aligned audio bursts the next iteration's `target` would be hundreds of ms in the past — the pacer would fire the entire just-arrived burst onto the wire back-to-back (zero sleep_until) until the cursor caught up. Receivers like mpv reported `audio end or underrun` every ~10–20 s with audible 0.1 s glitches under that earlier behaviour. Snapping the cursor forward on past-cursor restores 64 ms-spaced wire output even after a silence; the 500 ms `initial_latency` cushion absorbs the BcMedia decoder jitter that arises because the same TCP stream carries 4 K HEVC keyframes ahead of audio packets in the camera-side serialisation. On real Argus: per-RTP-packet inter-arrival p50=64.0 ms / p99=65.1 ms / max=207 ms over 60 s, 0 mpv underruns over 3 min playback.

Video: `max_lead = 3 s`, `initial_latency = 1.5 s`, `snap_on_past = false`. Argus 4K-HEVC bursts each GOP in ~900 ms then idles ~1.1 s; the 1.5 s startup buffer keeps the per-source mpsc queue stocked across each upstream silence so the wire cadence stays at PTS rate continuously, and the cursor never lands in the past in practice. Per-frame duration = `(pts_90khz - state.last_video_pts_90khz) / 90 000`; the very first frame's duration is `Duration::ZERO` and the initial-latency wait dominates. Burst-drain on past-cursor (the policy with `snap_on_past = false`) preserves the long-term wallclock-PTS slope at exactly `clock_rate`, which matters more for video than for audio because per-frame PTS continuity drives downstream re-muxers' DTS reasoning.

### Pacer vs gap-bridging — division of labour

Two mechanisms cooperate to keep the wire stream usable, addressing different scales of upstream irregularity:

- **Audio + video pacers (above)** absorb sub-second bursty delivery (Argus's normal "GOP burst then 1.1 s idle" pattern). The video pacer's 1.5 s pre-buffer means a typical idle period is invisible to the receiver.
- **Gap-bridging** (`GapState::Bridging`, `emit_replay_frame_if_bridging`, the 200 ms ticker in the translator loop) handles **multi-second upstream silences** — camera reconnects, wifi hiccups, motion-pause flapping. When `last_live_frame_at` exceeds `gap_threshold_secs` (default 3 s) the source flips to `Bridging` and re-broadcasts the cached `VideoBurst::iframe_nals` with a synthesised PTS, plus drops live audio so A-V stays correlated on resume. Bridging writes to `broadcast::Sender` directly, bypassing the pacer.

The 3 s threshold sits above the pacer's 1.5 s buffer: a normal inter-burst gap (~1.1 s) is absorbed by the pacer and ends well before bridging would fire, so only a true camera-side silence reaches bridging. With the pacer in place, bridging rarely fires day-to-day; it's still load-bearing for the "show the last frame instead of a buffering spinner" experience during real silences.

---

## Per-kind session dispatch

`crates/rtsp/src/server/session_task.rs::run` is a small coordinator. After the PLAY gate it spawns two child tasks — `video_dispatch_loop` and `audio_dispatch_loop` — each holding its own `broadcast::Receiver` (via `resubscribe()`). Each child loops on `frames.recv()`, skips frames of the wrong kind, lazily resolves its `RuntimeTrack` from `session_tracks` on track-miss, and writes RTP packets via the track's transport. The coordinator then awaits cancel + the JoinSet's first completion; either path cascades cancel to the surviving child, awaits both joins, closes transports, and removes the session.

Why split: the unified loop the per-kind dispatchers replaced consumed `Frame::Video` and `Frame::Audio` from a single `tokio::select!` arm. A 4 K HEVC IDR (~150–370 RTP packets after FU fragmentation) monopolised the dispatch task's TCP-write loop while audio frames piled up in the broadcast queue; receivers saw audio drain in a burst once the video write completed. Splitting into two parallel consumers means the audio loop's `frames.recv()` is always free to fire, and the per-packet TCP-interleaved write mutex (held only for one `$-framed` packet at a time, ≤ MTU) lets audio packets interleave between video FU fragments at packet granularity.

Audio bursting on the wire ALSO needed the audio pacer's `snap_on_past` fix above — the dispatch split alone is necessary but not sufficient.

Asymmetry: the video child takes ownership of the original `frames` receiver passed into `run()` (preserving its read position so any pre-PLAY buffered frames + lag detection both flow through it), while the audio child gets a fresh `resubscribe()`. Audio losing a few buffered frames at session start is benign — receivers expect to start at "now" anyway.

`tracks_changed` is plumbed through `run`'s signature for API compatibility, but the children don't subscribe to it. Late-SETUP append picks up the new track via the per-child lazy resolution on the next matching-kind frame; using `notify_one` for two waiters would race (only one wakes per call).

---

## RTCP

The session task does not emit periodic Sender Reports. The receiver falls back to RTP-arrival-time A-V sync, which is what live camera feeds actually need at our latency targets — the audio + video pacers hold the wire cadence at `clock_rate` and the receiver derives slope from successive RTP packets directly.

`build_sender_report`, `ntp_now`, and `ntp_minus` are kept in `crates/rtsp/src/server/rtcp.rs` so any future SR-emitting context (e.g. a recording sink that needs precise NTP↔RTP) is one wire-up away. The SR ticker arm has been removed from `run` (per-kind dispatch coordinator no longer hosts a `select!`); an SR-emitting future would spawn its own ticker task.

---

## Authentication

If `login_with_maxenc()` returns an authentication error (`AuthFailed`, `CameraLoginFail`, `Credential error`), **stop the retry loop permanently**. Don't hammer the camera with bad credentials every 2 – 60 seconds forever.

Reconnect on transient network failures still uses `ReconnectBackoff::sleep_with_cancel` (initial 2 s, doubling, capped at 60 s) — the same path production uses and the `drive_reconnect_with_backoff` test seam exercises.
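
The backoff schedule itself is a one-liner. The sketch below shows only the delay progression; it is not the real `ReconnectBackoff` type, and the constant and function names are hypothetical.

```rust
use std::time::Duration;

pub const INITIAL_BACKOFF: Duration = Duration::from_secs(2);
pub const MAX_BACKOFF: Duration = Duration::from_secs(60);

/// Next delay after a failed reconnect: double, capped at 60 s.
pub fn next_backoff(current: Duration) -> Duration {
    current.saturating_mul(2).min(MAX_BACKOFF)
}
```

Starting from `INITIAL_BACKOFF`, successive delays run 2, 4, 8, 16, 32, 60, 60 seconds.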

---

## Configuration

### Validation rules

- Camera names: `[A-Za-z0-9_-]` only (MQTT topic safety).
- Password: required (`None` rejected).
- `bind_port = 0` rejected.
- Update intervals (`battery_update`, `preview_update`, `floodlight_update`): ≥ 500 ms.
- `idle_disconnect`: defaults `false`. Must be explicitly enabled for battery cameras.
- `gap_threshold_secs`: must be finite + positive.
- `idle_disconnect_timeout_secs`: must be finite + positive.
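
Several of these rules are simple predicates. The sketch below uses hypothetical function names and is not the project's actual validation code; rejecting an empty camera name is an assumption.

```rust
use std::time::Duration;

pub const MIN_UPDATE_INTERVAL: Duration = Duration::from_millis(500);

/// Camera names must consist solely of `[A-Za-z0-9_-]`.
/// Assumption: an empty name is also rejected.
pub fn is_valid_camera_name(name: &str) -> bool {
    !name.is_empty()
        && name
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-')
}

/// Update intervals must be at least 500 ms.
pub fn is_valid_update_interval(interval: Duration) -> bool {
    interval >= MIN_UPDATE_INTERVAL
}

/// `gap_threshold_secs` and `idle_disconnect_timeout_secs` must be finite
/// and positive.
pub fn is_valid_positive_secs(secs: f64) -> bool {
    secs.is_finite() && secs > 0.0
}
```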

### Neolink-compat fields

`PauseConfig` accepts `on_motion` / `on_client` / `on_disconnect` / `motion_timeout` / `mode` as `Option<_>` for drop-in compat. They warn at startup and have no effect.

`pause.timeout` is a soft alias for `idle_disconnect_timeout_secs` (camera-level). Precedence: explicit `idle_disconnect_timeout_secs` wins; otherwise `pause.timeout` (with warn); otherwise default 45 s. Resolution happens once at session entry via `config::resolve_idle_disconnect_timeout`. The default sits 15 s above `stream_prune_grace_secs` (30 s) — see the invariant on `Config::stream_prune_grace_secs`.
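
The precedence can be sketched as below. The real `config::resolve_idle_disconnect_timeout` may take different arguments and return a different type; this version assumes both inputs are optional seconds that have already passed validation (finite and positive).

```rust
use std::time::Duration;

pub const DEFAULT_IDLE_DISCONNECT_TIMEOUT: Duration = Duration::from_secs(45);

/// Result of the resolution, plus whether the alias warning should be logged.
#[derive(Debug, PartialEq)]
pub struct ResolvedTimeout {
    pub timeout: Duration,
    pub warn_alias_used: bool,
}

/// Hypothetical sketch: explicit value wins, then `pause.timeout` (with a
/// warning), then the 45 s default. Inputs are assumed to be validated.
pub fn resolve_idle_disconnect_timeout(
    explicit_secs: Option<f64>,
    pause_timeout_secs: Option<f64>,
) -> ResolvedTimeout {
    match (explicit_secs, pause_timeout_secs) {
        (Some(secs), _) => ResolvedTimeout {
            timeout: Duration::from_secs_f64(secs),
            warn_alias_used: false,
        },
        (None, Some(secs)) => ResolvedTimeout {
            timeout: Duration::from_secs_f64(secs),
            warn_alias_used: true,
        },
        (None, None) => ResolvedTimeout {
            timeout: DEFAULT_IDLE_DISCONNECT_TIMEOUT,
            warn_alias_used: false,
        },
    }
}
```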

`#[serde(deny_unknown_fields)]` on all config structs makes typos and truly-unknown keys fail at startup.

---

## Logging

Default filter: `info` for bairelay, `warn` for `rumqttc`. Override with `RUST_LOG=bairelay=debug`. `rumqttc`'s debug output is extremely noisy (multiline ping messages every 30 s) — suppress it unless specifically debugging MQTT protocol issues.

All MQTT publishes log their values at debug level. All control commands log dispatch and OK / FAIL result.

`-v` / `-vv` / `-vvv` on the CLI map to `info` / `debug` / `trace` via `run_support::cli_output_mode`.

---

## Test infrastructure

Test helpers in `bairelay_neolink_core` (`FakeCameraBuilder`, `MockConnection`, `BcCamera::from_mock_connection`, `BcCamera::test_set_ability`, `MotionData::test_new`) are gated behind a `test-util` Cargo feature so release builds cannot accidentally substitute a fake for a real camera. The crate's own `cfg(test)` unit tests always see them; downstream test crates opt in via `[dev-dependencies] bairelay_neolink_core = { ..., features = ["test-util"] }` (the binary already does). Helpers in `bairelay_mqtt` (`mock_client()`) and the binary (`PacketSource`, `MockVideoStream`, `CameraHandle::set_driver_for_test`, `StreamSource::start_with_packet_source`) compile unconditionally — they're scoped tightly enough that there's no production-leakage risk.

### `FakeCameraBuilder` (`bairelay_neolink_core::bc_protocol::fake_camera`)

Closure-per-method `CameraDriver` impl:

```rust
let fake = FakeCameraBuilder::new()
    .with_battery_info(|| Ok(info))
    .with_snapshot(|| Ok(jpeg_bytes))
    .with_motion_stream(motion_data)
    .build();
let driver: Arc<dyn CameraDriver> = fake.clone();
```

Side-effecting setters (`reboot`, `pir_set`, `set_floodlight_manual`, etc.) record their args in a public `FakeCalls` struct so tests assert dispatch:

```rust
assert_eq!(*fake.calls().reboot.lock().unwrap(), 1);
assert_eq!(fake.calls().pir_set.lock().unwrap().clone(), vec![true]);
```

Unset reads panic with `FakeCamera: <method> not configured for this test` via a single `unset(method) -> !` helper. `*_pending()` builders return a never-resolving future for testing 30 s timeout error branches under `tokio::time::pause`.

### `MockConnection` (`bairelay_neolink_core::bc_protocol::connection::mock`)

Scriptable request → reply harness for testing `BcCamera` command modules without a real socket:

```rust
let mock = MockConnection::new()
    .expect_msg(MSG_ID_BATTERY_INFO)
    .reply_with(|req| reply_200_xml(req, BcXml { ... }))
    .build()
    .await;
let cam = BcCamera::from_mock_connection(mock).await;
let info = cam.battery_info().await?;
```

Three reply helpers:

| Helper                              | Use                                          |
|-------------------------------------|----------------------------------------------|
| `reply_200_xml(req, xml)`           | Standard 200 OK with `BcXml` payload.        |
| `reply_200_empty(req)`              | 200 OK, no payload (reboot, setters).        |
| `reply_err_code(req, code)`         | Non-200 response code (error-path tests).    |

`reply_with_many(|req| vec![Bc, Bc, ...])` scripts multi-reply exchanges (e.g. snap.rs binary chunks). `reply_none()` drops the request silently — emulates a camera that ack'd via msg-id match but doesn't answer; pair with `tokio::time::timeout(Duration::from_millis(200), op)` so the test doesn't hang. `MockInjector` (`mock.injector()`) supports unsolicited inbound messages for `subscribe_to_id` push-semantics tests.

`reply_with_xml(|req, xml| ...)` and `reply_with_xml_opt(|req, xml| ...)` deserialise the request's `BcXml` payload and hand a borrowed reference to the closure before it builds the reply. Use these for set / control tests that need to pin the wire-shape of the request payload — without them, a regression that mapped (e.g.) `LightState::On` → `"close"` instead of `"open"` would still pass a shallow `reply_200_empty` test. Panics if the request is header-only or carries a binary payload — those shapes are caller errors when the test is trying to inspect a set/control request.

### `bairelay_mqtt::test_support::mock_client()`

Returns `(SharedMqttClient, MockHandle)`. The handle exposes:

- `published()` — `Vec<(topic, payload, retained)>` of all publishes.
- `published_topics()` / `published_payloads()` — projections.
- `count()` — total publish count.

The mock doesn't connect to a broker; `EventLoop::poll` is leaked at test construction time so the request channel stays alive.

### Other test seams

| Seam                                                   | Purpose                                          |
|--------------------------------------------------------|--------------------------------------------------|
| `MotionData::test_new(rx)`                             | Wire a caller-supplied `mpsc::Receiver`.         |
| `CameraHandle::set_driver_for_test(driver)`            | Install fake driver + flip state to Connected.   |
| `CameraHandle::run_connected_session_for_test`         | Drive `run()`'s Connected arm directly.          |
| `StreamSource::start_with_packet_source`               | Drive translator loop with scripted `BcMedia`.   |
| `drive_reconnect_with_backoff` + `ReconnectOutcome`    | Pure reconnect-loop timing contract.             |
| `ScriptedDiscoverer` (`test_support` in discovery.rs)  | Script local/remote/map/relay outcomes.          |
| `MockVideoStream`                                      | Drive `drain_first_iframe` with scripted frames. |

`ReconnectBackoff::sleep_with_cancel(&CancellationToken) -> bool` is shared between production `CameraHandle::run` and the test seam — the "wait for next delay, bail on cancel" contract lives in one place.
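
The "wait for next delay, bail on cancel" contract can be illustrated without tokio. The sketch below deliberately swaps `CancellationToken` for a `std::sync::mpsc` channel so it runs on the standard library alone. The free function and the polarity of the returned `bool` (`true` = delay elapsed, `false` = cancelled) are assumptions, not the real method's documented behaviour.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Waits for `delay`, returning early if a cancel signal arrives.
/// Returns `true` when the full delay elapsed and `false` on cancel
/// (assumed polarity; the real method's meaning of `bool` may differ).
fn sleep_with_cancel(delay: Duration, cancel: &Receiver<()>) -> bool {
    match cancel.recv_timeout(delay) {
        // Nothing arrived before the deadline: the caller should retry now.
        Err(RecvTimeoutError::Timeout) => true,
        // A message or a dropped sender both count as cancellation here.
        Ok(()) | Err(RecvTimeoutError::Disconnected) => false,
    }
}
```

Keeping the wait and the cancel check in one function means the production loop and the test seam cannot drift apart on what "cancelled" means.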

### Hang-protection discipline

Every mock-based "camera doesn't answer" error-path test wraps the operation in `tokio::time::timeout(Duration::from_millis(200), op)`. A test that awaits a channel/socket with no guaranteed sender hangs `cargo test` indefinitely.

Every test in the core crate completes in under 1 second of wall time.

---

## TLS / `rtsps://`

### Plain + TLS run as parallel listeners

When `certificate` is set, `main.rs` spawns two `RtspServer::serve` tasks — one with `tls = None` on `bind_port`, one with `tls = Some(...)` on `tls_bind_port` (default 8555). Plain stays running unless the operator explicitly opts into TLS-only by setting `bind_port = 0`. Neolink replaces plain with TLS on the same port; bairelay does not — drop-in users get plain on 8554 plus TLS on 8555.
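
The listener decision described above reduces to a small table. The sketch below is illustrative only: `Listener` and `plan_listeners` are hypothetical names, and the real `main.rs` spawns `RtspServer::serve` tasks rather than returning a list.

```rust
/// Hypothetical description of one listener to spawn.
#[derive(Debug, PartialEq)]
enum Listener {
    Plain { port: u16 },
    Tls { port: u16 },
}

/// Hypothetical planner: decides which listeners to start from the config.
fn plan_listeners(bind_port: u16, tls_bind_port: u16, has_certificate: bool) -> Vec<Listener> {
    let mut out = Vec::new();
    // `bind_port = 0` is the explicit TLS-only opt-in, so plain is skipped.
    if bind_port != 0 {
        out.push(Listener::Plain { port: bind_port });
    }
    // TLS only runs when a certificate is configured.
    if has_certificate {
        out.push(Listener::Tls { port: tls_bind_port });
    }
    // Assumption: `bind_port = 0` without a certificate yields no listeners;
    // how the real binary reports that case is not covered here.
    out
}
```

With the defaults, `plan_listeners(8554, 8555, true)` gives plain on 8554 plus TLS on 8555, which is the drop-in behaviour described above.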

### `rustls 0.23` crypto provider install

`rustls 0.23` requires a default `CryptoProvider` to be installed once per process before any cert is loaded or any handshake fires. `main.rs` does:

```rust
let _ = rustls::crypto::aws_lc_rs::default_provider().install_default();
```

`install_default()` returns `Err(Arc<CryptoProvider>)` if a provider is already installed; the `let _ =` binding discards that result, since a second install attempt is effectively a no-op for our purposes. Tests share a process and call the same function via a `OnceLock` to avoid the `Err` on repeated `#[test]` setup.
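
The once-per-process guard used by tests can be sketched with `std::sync::OnceLock`. To stay self-contained, the block uses a counting stand-in instead of the real `rustls` call; `ensure_crypto_provider`, `install_provider_stand_in`, and the statics are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the stand-in installer actually ran.
static INSTALL_RUNS: AtomicUsize = AtomicUsize::new(0);

/// Set exactly once, the first time any caller asks for the provider.
static PROVIDER_INSTALLED: OnceLock<()> = OnceLock::new();

/// Stand-in for the real provider install; a real helper would call
/// `rustls::crypto::aws_lc_rs::default_provider().install_default()` here.
fn install_provider_stand_in() {
    INSTALL_RUNS.fetch_add(1, Ordering::SeqCst);
}

/// Safe to call from every test's setup: only the first call installs.
fn ensure_crypto_provider() {
    PROVIDER_INSTALLED.get_or_init(install_provider_stand_in);
}
```

`get_or_init` also blocks concurrent callers until the first initialisation finishes, so parallel tests never race into a second install.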

### `TlsConfig::build` runs at startup, not on first connection

`bairelay_rtsp::server::TlsConfig` wraps `Arc<rustls::ServerConfig>`. The builder runs `with_single_cert` and (for `Request`/`Require`) `WebPkiClientVerifier::builder(roots).build()` at config-load time, so a bad cert/key pair, empty cert chain, or empty roots store fails fast with a useful operator message instead of surfacing on the first incoming connection.

### Scheme/transport mismatch returns 400

`scheme_matches_transport(uri, is_tls)` rejects a request whose URI scheme contradicts the actual transport (`rtsp://` over TLS or `rtsps://` over plain). Defends against a confused or hostile client cross-routing between the two parallel listeners.
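
A minimal sketch of such a check follows. It reuses the function name from above, but the body is an illustration rather than the crate's code. Case-insensitive scheme matching and letting scheme-less request URIs (such as `*`) through are assumptions.

```rust
/// Returns true when `uri` starts with `scheme`, ignoring ASCII case.
/// Works on bytes so a non-ASCII URI can never cause a slicing panic.
fn has_scheme(uri: &str, scheme: &str) -> bool {
    uri.as_bytes()
        .get(..scheme.len())
        .map_or(false, |prefix| prefix.eq_ignore_ascii_case(scheme.as_bytes()))
}

/// Sketch: does the URI scheme agree with the transport the request arrived on?
fn scheme_matches_transport(uri: &str, is_tls: bool) -> bool {
    if has_scheme(uri, "rtsps://") {
        // `rtsps://` is only valid on the TLS listener.
        is_tls
    } else if has_scheme(uri, "rtsp://") {
        // `rtsp://` is only valid on the plain listener.
        !is_tls
    } else {
        // Assumption: URIs without either scheme (e.g. `*`) are not rejected here.
        true
    }
}
```

A caller would answer 400 whenever this returns `false`, before any path routing happens.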

### TLS handshake timeout: 10 s

`tokio::time::timeout(Duration::from_secs(10), acc.accept(stream))` caps each handshake. A slow or malicious client otherwise pins an accept task indefinitely. Same shape as our keepalive bound.

### Test certs

Two paths produce the same PKI shape:

- `tests/scripts/gen-test-certs.sh` produces self-signed CA + server leaf (SAN: `localhost`, `127.0.0.1`, `::1`, `host.docker.internal`) + client leaf, all valid 10 years. Output `tests/test-certs/` is gitignored. Used by `tests/scripts/manual-verify.sh --tls`.
- `crates/rtsp/tests/tls_handshake.rs` uses `rcgen` to mint ephemeral CA + leaves in-memory. `cargo test` is hermetic — no openssl or filesystem state required.

### `client_auth = "require"` rejection timing

Under TLS 1.3 with rustls 0.23, the server's "client cert required" alert can arrive either during the handshake or on the first write/read, depending on whether the client refused to send a cert in response to `CertificateRequest` mid-handshake. Tests accept either failure mode (`tls_handshake::require_mode_rejects_client_without_cert`).

### TOML key placement: top-level vs section-scoped

`certificate`, `tls_bind_port`, `tls_client_auth`, `tls_client_ca` are **top-level** config keys. Any tooling that builds a `config.toml` programmatically must place them before the first `[section]` header (typically `[mqtt]` or `[[cameras]]`). TOML parses scalar key/value pairs as members of whichever table they appear under, so appending them at the end of an operator's config silently scopes them to the last table — and `[cameras.mqtt]` does not have `#[serde(deny_unknown_fields)]`, so the keys are dropped without any error message. `manual-verify.sh --tls` learned this the hard way; its awk wrapper now inserts before the first `^[[:space:]]*\[` line.
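
For tooling written in Rust rather than awk, the same "insert before the first table header" rule might look like the sketch below. `insert_top_level_keys` is a hypothetical helper, not part of bairelay, and it shares the awk regex's limitation: a multi-line array whose continuation line starts with `[` would be mistaken for a table header.

```rust
/// Inserts `top_level` (one or more `key = value` lines) before the first
/// line that opens a TOML table, i.e. the first line matching `^\s*\[`.
/// If the config has no table header at all, the keys are appended at the end,
/// which still leaves them at top level.
fn insert_top_level_keys(config: &str, top_level: &str) -> String {
    let mut out = String::new();
    let mut inserted = false;
    for line in config.lines() {
        // First `[section]` or `[[array]]` header: put the keys right before it.
        if !inserted && line.trim_start().starts_with('[') {
            out.push_str(top_level.trim_end());
            out.push('\n');
            inserted = true;
        }
        out.push_str(line);
        out.push('\n');
    }
    if !inserted {
        out.push_str(top_level.trim_end());
        out.push('\n');
    }
    out
}
```

Appending the same keys after `[[cameras]]` instead would scope them to the last table, which is exactly the silent-drop failure described above.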

### Client-cert revocation: deliberately not checked

When `tls_client_auth = "request"` or `"require"`, bairelay verifies the client cert against `tls_client_ca` via `WebPkiClientVerifier` — chain validity, SAN/EKU, and signature. **CRL and OCSP are not consulted.** Revoking a leaked client cert is a rotate-the-server-CA operation, not a "publish a CRL entry" one. Acceptable in a single-operator home-LAN deployment; explicitly out of scope for any future fleet / multi-tenant use of the binary. Bumping to a verifier with CRL support (e.g., `rustls::server::WebPkiClientVerifier::builder(roots).with_crls(...)`) is a one-call swap if the threat model ever grows.

### PEM file size: not capped on read

`src/tls_load.rs::load_server_tls` reads the cert / CA PEM via `std::fs::read` with no upfront size cap. A pathological 10 GiB PEM would balloon RAM before parsing fails. Realistically, PEMs are operator-supplied and typically a few KiB, so the only effective bound on the read is available memory. A `metadata().len() > 1 MiB` guard would be worth adding if the binary ever ingests untrusted PEMs (CDN cert-bundle download, etc.). It is not added today because it would clutter the loader for a non-issue at current scope.
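
If the guard is ever needed, it could look roughly like this. The sketch is not code from `tls_load.rs`: `read_pem_capped` and `MAX_PEM_BYTES` are hypothetical names, and the error kind is an arbitrary choice.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Assumed cap, taken from the 1 MiB figure mentioned above.
const MAX_PEM_BYTES: u64 = 1024 * 1024;

/// Reads a PEM file, refusing anything larger than `max_bytes`.
/// The size is checked via metadata before any bytes are read.
fn read_pem_capped(path: &Path, max_bytes: u64) -> io::Result<Vec<u8>> {
    let len = fs::metadata(path)?.len();
    if len > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("{} is {len} bytes; limit is {max_bytes}", path.display()),
        ));
    }
    // Note: the file could still grow between the check and the read;
    // a stricter variant would read through `Read::take(max_bytes + 1)`.
    fs::read(path)
}
```

A loader would call `read_pem_capped(path, MAX_PEM_BYTES)` in place of the bare `std::fs::read`.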

---

## Coverage policy

Workspace coverage is measured via `cargo tarpaulin` from project root with `tarpaulin.toml` defaults set to `workspace = true`, `all-targets = true`, `skip-clean = true`, and `fail-under = 87`. Current baseline is **~88.6 %** measured against the full workspace (no source-file exclusions). The 87 % gate is intentionally a hair below the actual baseline so natural coverage drift doesn't trip CI on every PR; raising it requires a coordinated push past the structural under-counters in `src/main.rs` (the `#[tokio::main]` body), `src/cli.rs` (clap-derive macro output), and `src/oneshot/runner.rs` (real-camera socket bind). The per-crate gate stays tighter — each non-binary crate stays ≥ 90 %; `bairelay_wake_server` sits at 94.9 % on its own.

Per-file targets:

- Pure-logic modules: ≥ 95 %.
- Command modules and pollers: ≥ 90 %.
- I/O-adjacent modules (`bc_protocol.rs::find_camera`, `camera.rs::run`, `stream_source.rs`): ≥ 85 % via test seams; the residual is real-socket bind paths.

Files at < 85 % with a documented reason:

- `src/main.rs` — `#[tokio::main]` body, signal handlers, real socket binds. Fundamentally untestable at unit level.
- `src/oneshot/runner.rs` — wraps `BcCamera::new + login_with_maxenc + logout`. Real socket required.
- `src/cli.rs` — `clap` derive macro output.
- `crates/core/src/bc_protocol/connection/{discovery,udpsource}.rs` — UDP socket internals; the trait-level `CameraDiscoverer` fallback chain is fully covered, the per-method UDP frame parsing covered partially.

Run coverage:

```
cargo tarpaulin 2>&1 | tail -5
```

(The flags above come from `tarpaulin.toml`; no extra arguments needed.)

---

## Wake server

The full wire-level reverse-engineering — every variant, every field, the camera boot sequence, the wake handshake, the cloud topology — lives in `docs/cloud-interception.md` § Part I. The `pushx.reolink.com` motion-push listener (the second cloud surface bairelay intercepts) is documented in § Part II of the same file. This section captures the bairelay-specific implementation notes that aren't in those references.

### Use the UDP source addr, not `<dev>` from D2R_HB

Cameras put their own LAN IP / port in the `<dev>` block of `D2R_HB`. That value is useless for sending replies back to a NAT'd camera; the public-mapped address is what the OS reports from `recv_from`. The handler in `crates/wake-server/src/register.rs::handle_heartbeat` always upserts with `src` (the address from `recv_from`), never `hb.dev`. The reference implementation calls this out explicitly, and bairelay matches it.

### Wake burst: fire-and-forget + first-error-then-break

The 10 × `R2D_C` packets are sent from a `tokio::spawn`'d task so the register loop stays responsive to other clients. The burst breaks early on the first send error rather than retrying — a vanished camera or a transient ENETUNREACH means the next heartbeat would fix the address anyway, and a tight retry loop on a stale path floods logs without changing outcomes.
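
The break-on-first-error shape of the burst can be sketched independently of the socket. `send_burst` is a hypothetical helper that takes the send operation as a closure; the real code sends `R2D_C` over UDP from a spawned task.

```rust
use std::io;

/// Sends up to `count` packets via `send_one`, stopping at the first error.
/// Returns how many packets were sent successfully.
fn send_burst<F>(count: usize, mut send_one: F) -> usize
where
    F: FnMut(usize) -> io::Result<()>,
{
    let mut sent = 0;
    for i in 0..count {
        if send_one(i).is_err() {
            // No retry: the next heartbeat refreshes the address anyway,
            // and hammering a stale path only floods the logs.
            break;
        }
        sent += 1;
    }
    sent
}
```

Returning the count lets the caller log a single "sent N of 10" line instead of one line per failed packet.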

### Stale-on-lookup, not background sweep

`CameraRegistry::lookup_fresh` returns `None` when `now - last_seen > stale_after`. Stale entries are read-as-missing and replied to with `R2C_C_R{rsp = -1}`; the next heartbeat resurrects the entry. Zero background scheduler overhead, identical correctness to a periodic sweep.
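
A minimal sketch of the stale-on-lookup idea, assuming a registry keyed by camera UID and a caller-supplied `now` so tests can control time. `CameraRegistrySketch`, `Entry`, and the `upsert` signature are hypothetical; only `lookup_fresh` and `stale_after` are names taken from the text above.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// One registered camera: where to reach it and when it last checked in.
struct Entry {
    addr: SocketAddr,
    last_seen: Instant,
}

/// Hypothetical registry keyed by camera UID.
struct CameraRegistrySketch {
    stale_after: Duration,
    entries: HashMap<String, Entry>,
}

impl CameraRegistrySketch {
    fn new(stale_after: Duration) -> Self {
        Self { stale_after, entries: HashMap::new() }
    }

    /// Called on every heartbeat with the `recv_from` source address,
    /// never with the LAN address the camera reports about itself.
    fn upsert(&mut self, uid: &str, src: SocketAddr, now: Instant) {
        self.entries.insert(uid.to_owned(), Entry { addr: src, last_seen: now });
    }

    /// Stale entries read as missing; nothing is deleted and nothing is swept.
    fn lookup_fresh(&self, uid: &str, now: Instant) -> Option<SocketAddr> {
        let entry = self.entries.get(uid)?;
        if now.saturating_duration_since(entry.last_seen) > self.stale_after {
            None
        } else {
            Some(entry.addr)
        }
    }
}
```

Because a stale entry stays in the map, the next heartbeat simply overwrites it and the camera becomes reachable again without any extra bookkeeping.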

### Why no `core::discovery` change

Bairelay's outgoing discovery (`crates/core/src/bc_protocol/connection/discovery.rs`) targets `p2p*.reolink.com:9999` for relay/map modes via DNS. Operator-side DNS redirect makes both cameras and bairelay-itself's discovery resolve to the local box, hitting our wake server. The wake server crate is unaware of `discovery.rs`; `discovery.rs` is unaware of the wake server. Same code path for cloud and self-host — only the DNS answer differs.

### `u32` wraparound in `bcudp::xml_crypto`

`crates/core/src/bcudp/xml_crypto.rs::{encrypt, decrypt}` use `i.wrapping_add(offset)` rather than plain `i + offset` because the `offset` argument is the packet's `tid: u32` and real cameras emit transaction IDs ≥ ~`0x60000000`. A naive add overflows `u32` in debug builds (panic) for any payload long enough to push `i + offset` past `u32::MAX`. The XOR cipher is symmetric over the full `u32` domain, so wrap-around is correct semantics — encrypt-then-decrypt with the same wrapping recovers the plaintext byte-for-byte. Regression tests in `xml_crypto.rs` cover the high-tid path (`tid = 0xFFFF_FFF0` with a multi-byte payload).
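
The effect of the wrapping add can be shown with a toy cipher. The key below is a made-up placeholder, not the real Baichuan key table, and `xor_with_offset` is a hypothetical function that only mirrors the indexing pattern described above.

```rust
/// Placeholder key for illustration only; NOT the real Baichuan key table.
const SKETCH_KEY: [u8; 8] = [0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF];

/// XORs each byte with a key byte selected by `(i + offset) mod 8`,
/// where the sum is allowed to wrap around `u32::MAX`.
/// Applying the function twice with the same offset restores the input.
fn xor_with_offset(offset: u32, buf: &[u8]) -> Vec<u8> {
    buf.iter()
        .enumerate()
        .map(|(i, byte)| {
            // A plain `i as u32 + offset` would panic in debug builds
            // once the sum passes `u32::MAX`; `wrapping_add` does not.
            let idx = (i as u32).wrapping_add(offset);
            byte ^ SKETCH_KEY[(idx % 8) as usize]
        })
        .collect()
}
```

In this sketch the key period (8) divides 2^32, so the wrapped index selects the same key byte an unbounded sum would; with `tid = 0xFFFF_FFF0` the wrap happens at the seventeenth payload byte.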