duet 0.8.7

bi-directional synchronization
# Duet Hardening Plan

This document tracks the remaining hardening work in Duet. It started as a
permission-failure audit, but the remaining work is now mostly about apply
transactionality, recovery, diagnostics, and platform metadata scope.

Duet has two permission-related responsibilities:

- It synchronizes Unix mode bits for files and directories as metadata.
- It depends on OS permissions to scan, read, write, remove, chmod, set mtimes,
  launch the server, and save state.

The highest-risk bugs around permission failures being mistaken for deletions
have been fixed. The remaining work is broader: make apply failures recoverable,
make diagnostics consistently structured, and document or explicitly scope
platform-specific metadata behavior.

## Fixed And Tested

Most of these issues are covered by active tests in
`tests/permission_failures.rs`; additional unit tests cover protocol and marker
helpers.

- Scanner failures now propagate instead of being accepted as partial scans.
  Unreadable local or remote subdirectories no longer look like deletions.
- Scanner filesystem operations now return path-aware errors for directory
  reads, directory-entry reads, metadata reads, and symlink target reads.
- Checksum, signature, and detail-collection read failures now abort before
  remote mutation in the tested permission-denied cases.
- State loading uses `try_exists` and reports state open/read/decode errors
  instead of treating inaccessible state as missing.
- Local state save errors are no longer ignored, and remote state save errors
  retain path/error context.
- Apply-side filesystem panics for remove/create/symlink/chmod/mtime paths were
  converted to `Result` errors with path context.
- Local and remote apply both run a preflight check before applying detailed
  changes. Preflight now checks source reads, destination metadata targets,
  destination parent availability/writability for common mutations, and local and
  remote state-save paths before filesystem mutation starts.
- Already-synced read-only destination directories can be temporarily made
  writable for child file updates and then restored.
- Remote RPC handlers wrap sync failures in a shared `StructuredSyncError`
  envelope with side, operation, optional path, classified error kind, and source
  message. RPC paths backed by `color_eyre::Report` preserve exposed source
  chains as structured `source:` entries instead of relying only on formatted
  debug text.
- Streamed apply temp files use bounded-length names, so long destination file
  names no longer create overlong temporary path components.
- Streamed and non-streamed file-content writes now use the same side-local
  temporary-output primitive before renaming into place. Non-streamed diff apply
  restores directly into that temporary file instead of materializing the full
  updated file in memory first.
- Setup/orchestration paths that previously used permission-triggerable
  `expect`/`unwrap` calls now return contextual errors.
- SSH/session setup now adds targeted hints for common private-key and SSH config
  permission failures, and server launch errors include command/log context. SSH
  session, profile-loading, local-base expansion, remote-profile parsing, and
  server-launch failures now use the shared structured error renderer even when
  they occur before the RPC server is available. Remote server setup also uses
  the shared renderer for log path, log directory, log file, default state-dir,
  and server task startup failures.
- Applied Unix modes are masked to permission/special bits before `chmod`, so
  file-type bits from `symlink_metadata` are not passed back to the OS.
- `README.md` documents the user-facing metadata model: file contents,
  directories, symlink targets, Unix mode bits, and mtimes are synchronized;
  ownership, ACLs, xattrs, platform-specific permission models, and symlink
  permissions are not.
- Local and remote apply now create a side-local recovery marker before applying
  changes, record a shared attempt id, the apply/state-save phase, affected
  paths, operation summaries, unstaged-operation classifications, and staged
  temporary file paths, append committed-operation records as side-local actions
  complete, preserve committed-step records for file renames and direct
  remove/create/metadata operations, and remove the marker after state save
  succeeds. A later sync refuses to run if the marker remains, with phase- and
  operation-aware recovery instructions instead of silently continuing from an
  unknown partial-apply state.
- New peers prepare the remote apply marker before local mutation starts, so both
  sides have recovery markers before the concurrent apply phase begins.
- Permission tests now cover a representative race where the remote destination
  becomes unwritable after remote apply recovery is prepared; the next sync is
  blocked by the recovery marker instead of proceeding from unknown state.
- The hidden maintenance command `_recover <statefile>` prints any unfinished
  apply marker and the same recovery advice used by sync failure reporting.

## Remaining Work

### 1. Apply Needs A Prepare/Commit Protocol

Current preflight catches common permission failures before mutation, and apply
now records recovery markers while filesystem changes and state saves are in
progress. The markers include the side, base, state file, shared apply attempt
id, current phase, affected paths, compact operation summaries,
unstaged-operation classifications for direct commit operations, plus
committed-operation records for side-local actions that completed before
interruption. While apply is in progress, file content writes also record staged
temporary file paths that may need cleanup after a crash, and staged file writes
record committed-step entries after the temp file is renamed into place. Direct
remove, create, symlink, and metadata operations also record committed-step
entries after each step succeeds. New peers prepare both local and remote markers
before concurrent apply begins. This prevents a later run from silently
continuing after an interrupted apply. Sync is still not a true transaction:
local and remote apply can still mutate files before a later non-preflighted
error, crash, or race is detected.

Target design:

- `prepare`: both sides validate the selected actions again, create a per-sync
  apply attempt id, and stage all content writes into side-local temporary files.
- `prepared`: both sides report staged paths, metadata operations, removals, and
  any operation that cannot be safely staged.
- `commit`: both sides perform the shortest possible rename/remove/chmod/utime
  window. File content replacement should be rename-based wherever the platform
  allows it.
- `finish`: state is saved only after both sides report commit success.
- `recover`: if a process dies after `prepare` or during `commit`, the next run
  detects the attempt id, reports which phase may have partially completed, and
  either cleans abandoned staged files or asks the user to inspect committed
  paths before continuing.

Remaining work:

- Add RPC methods for `prepare`, `commit`, `finish`, and `recover`, advertised by
  a protocol capability. The current protocol has a prepare marker RPC with a
  shared attempt id, but not a full staged prepare/commit/recover sequence.
- Move streamed and non-streamed apply through the same staged apply engine.
  File-content writes now share the same temporary-output primitive, but the full
  staged engine still needs to cover directory, symlink, metadata, and removal
  operations.
- Persist enough committed-operation metadata to resume automatically after a
  crash. The current marker records completed side-local action summaries, but it
  does not yet support automatic replay/rollback.
- Close any remaining gaps in step-level commit records for multi-step
  replacements, chmod, and utime as the staged apply engine is built out.
- Use staged-file records to offer safe automatic cleanup for abandoned temp
  files that were never renamed into place.
- Keep state saving after both sides have committed successfully.

### 2. Preflight And Apply Recovery Are Still Best-Effort

Current preflight catches common permission failures before mutation, including
source reads, destination parent/metadata checks, and local/remote state-save
paths. It is still a snapshot: the filesystem can change after preflight, and
some capabilities cannot be proven without attempting the operation.

When a post-preflight apply or state-save error occurs, Duet now reports recovery
advice explaining that filesystem changes may have been partially applied and
state may not have been saved. If a process exits before cleanup, the recovery
marker blocks the next sync until the user inspects the listed paths and removes
the marker. Marker recovery advice is tailored to the recorded phase and to
planned destructive, metadata, or file-content operations, and it calls out when
staged-file, unstaged-operation, committed-step, or committed-operation records
are present. The hidden `_recover <statefile>` maintenance command can print the
same marker and advice without starting a sync. This is still not a resumable
apply protocol.

Remaining work:

- Improve chmod and utime preflight where ownership/platform support makes that
  possible.
- Expand preflight coverage for less common replacement/remove combinations as
  they are found.
- Add more race tests as new post-preflight failure modes are found.
- Replace broad unstaged-operation guidance with automatic step-level recovery
  once apply attempts can safely replay or roll back recorded commit points.

### 3. Sync Errors Are Only Partly Structured

Remote sync errors now use the shared `StructuredSyncError` envelope inside
`RPCErrorKind::Other`. This is a practical improvement over generic strings and
removes the earlier RPC-only error type boundary, but it is not a complete
end-to-end error model yet.

The client parses this envelope for concise remote error rendering. Local setup
paths can also use the same renderer for classified user-facing diagnostics. The
original RPC payload remains line-oriented and machine-readable. RPC and setup
paths backed by `color_eyre::Report` now preserve exposed source chains as
structured `source:` lines, though some call sites still only provide formatted
messages or non-Report error values.

Remaining work:

- Convert remaining formatted-message and non-Report call sites to typed errors
  with structured source chains where practical.
- Extend structured setup errors to remaining log/state setup paths that fail
  before the normal RPC server is running.

### 4. Metadata Model Is Still Unix Mode Bits Only

Duet records and syncs mode bits, but it does not sync ownership, ACLs, xattrs,
or platform-specific permission models. Symlink permissions are intentionally
ignored. When applying metadata, Duet masks the recorded mode to Unix
permission/special bits (`0o7777`) before calling `chmod`.

Remaining work:

- Decide whether uid/gid support is in scope. If it is, gate it behind explicit
  capability checks and clear behavior for non-root users.

### 5. Server, Profile, And SSH Setup Diagnostics Are Still Evolving

Most permission-triggerable setup `expect`/`unwrap` paths in profile loading,
remote command expansion, RPC setup, and orchestration have been converted to
contextual errors. SSH setup adds hints for common key/config permission
failures, and server launch errors include the command and server log path where
available. SSH session, profile-loading, local-base expansion,
remote-profile parsing, server-launch, and remote server log/state setup
diagnostics now use the shared structured renderer. Some setup failures still
happen before normal sync error handling is established and can produce weak
diagnostics.

Known examples:

- Server startup failures before RPC initialization still depend on child
  process stderr/log context, though their final setup error is now classified
  and rendered consistently when the child reaches Duet setup code.

Remaining work:

- Preserve non-Duet remote process failures, shell failures, and SSH transport
  failures with enough stderr/log context to distinguish remote config, temp, and
  base-path permission problems before RPC starts.

### 6. Restricted Sync Can Still Enumerate Readable Ancestors Broadly

Restricted sync now benefits from propagated scanner errors, so inaccessible
ancestors fail safely. It has not been optimized to walk only the direct path
components needed for the restricted subtree.

Remaining work:

- Optimize restricted scans to avoid broad ancestor enumeration where possible.
- Keep path-aware failure messages for any ancestor that must still be read or
  statted.

### 7. Permission-Denied Skip Is Out Of Scope For Default Sync

The current policy is fail-fast. That is safer than silently skipping unreadable
paths, but it means one inaccessible path aborts the sync.

Decision: keep fail-fast as the default and do not add implicit skip behavior.
Skipping permission-denied paths is only acceptable as a future explicit opt-in
mode, because both sides must know exactly which paths were skipped so they are
not interpreted as deletions or legitimate updates.

Remaining work:

- If an opt-in skip mode is implemented, communicate skipped permission-denied
  paths to both sides so they are not interpreted as removals or legitimate
  updates.
- Make any skipped-path behavior visible in the action plan and final summary.

## Current Priority Order

1. Complete the transactional or resumable apply protocol described above.
2. Add tests and phase/path-specific recovery advice for races and
   post-preflight failures.
3. Extend `StructuredSyncError` into a full end-to-end error model with
   structured source chains and setup-error rendering.