mxsh 0.2.0 - Docs.rs

# Refactor Plan

This document is based on `BUGS.md` and a source pass through the shell
execution, expansion, job, trap, fd, frontend, and embedding layers.  The
current bug list should not be attacked as independent point fixes.  Most of
the failures are places where a shell semantic contract is implicit, spread
across several modules, and therefore enforced differently by foreground
commands, background commands, command substitutions, pipelines, embedded
sessions, and the CLI.

The plan is to create a few explicit contracts and move existing behavior
behind them:

- execution context and session ownership
- child launch and fd mapping
- job ownership and process groups
- signal/trap ownership
- typed word expansion products
- simple-command transactions and control-flow outcomes
- uniform session finalization and byte-safe frontend input

The parser and AST can mostly stay.  The refactor should focus on the boundary
between AST execution and OS/runtime effects.

## Source Diagnosis

The most important cross-cutting problems are visible in these places:

- `src/shell/state.rs` stores unrelated state in one struct: shell variables,
  frame, jobs, traps, cwd, fd maps, process-global signal ownership, transient
  expansion status, active command ids, and embedding definition.  Cloning or
  serializing `ShellState` is therefore never semantically neutral.
- The old fork, detached-session, background-state snapshot, and advanced
  session snapshot/restore paths all created shell variants by mutating a
  clone.  That is the root of several bugs where jobs, traps, signal owners, fd
  maps, and process isolation are kept or dropped by accident.
- Snapshotting `ShellState` must be removed as an implementation technique.
  Public APIs must not expose live session clone, snapshot, or restore
  semantics.  Reset-style behavior must be backed by purpose-built checkpoint
  data with explicit ownership for variables, definitions, cwd, fds, traps,
  jobs, and process globals.  If any execution path still persists or restores
  a whole `ShellState` snapshot after this refactor, the refactor has failed.
- Child process launch is assembled in several sites:
  `run_external_simple_command`, `spawn_external_in_path`,
  `run_prepared_external_stage`, `spawn_external_pipeline_stages`,
  `spawn_background_job`, and `builtin_exec`.  These sites do not share one
  representation of the child fd table, signal dispositions, process group, cwd,
  env, or PATH resolution.
- `src/sys/backend.rs` builds `posix_spawn` file actions directly from stdio
  fds and passed fds.  Because there is no normalized child fd table, close
  actions can be ordered incorrectly, implementation fds can leak, and closed
  shell fds become invalid `dup2` sources.
- `src/shell/expand.rs` collapses expansion results to `String` or
  `Vec<String>` too early.  Quote provenance, glob patterns, tilde eligibility,
  multi-field `"$@"`, and command-substitution status are lost before
  command-name, assignment, redirection, `for`, and parameter-pattern contexts
  can apply their distinct rules.
- `src/shell/simple_command.rs`, `src/shell/resolve.rs`,
  `src/shell/compound.rs`, and `src/shell/builtins.rs` use status integers plus
  mutable side channels (`BranchControl`, `exit_code`, command-substitution
  status, temporary assignment restoration) instead of one simple-command and
  control-flow result type.
- CLI and embedded execution finalize sessions differently.  The CLI calls
  `finalize_shell_run`; embedded APIs return directly from `run` and
  `run_program`.

## Refactoring Thesis

Add explicit shell semantic objects and make every execution path use them.
Then fix behavior by enforcing those objects' invariants, not by patching each
bug's call site.

The main new internal contracts should be:

```rust
enum ExecutionContextKind {
    TopLevel,
    Function,
    DotScript,
    Subshell,
    CommandSubstitution,
    PipelineStage,
    BackgroundJob,
    DetachedSession,
}

struct ExecutionContext {
    kind: ExecutionContextKind,
    logical_shell_pid: u32,
    job_scope: JobScope,
    trap_scope: TrapScope,
    fd_scope: FdScope,
    process_global_scope: ProcessGlobalScope,
    errexit_exempt: bool,
}

struct CommandOutcome {
    status: i32,
    control: ControlFlow,
}
```

All current ad hoc state transforms should become constructors such as
`state.fork_for(ExecutionContextKind::CommandSubstitution)` and
`state.checkpoint_for_background(...)`.  These constructors must be the only
way to create subshell, background, detached, or pipeline-stage state.

Non-negotiable success criterion: this refactor must delete wholesale shell
state snapshotting.  A snapshot or fork API may produce an explicit checkpoint
object, but it must not clone, serialize, persist, or later restore a complete
`ShellState`.  Leaving that pattern in place means the refactor did not solve
the core ownership problem.

## Track 1: Split State By Ownership

Create internal state components inside `src/shell/state.rs` or new sibling
modules:

- `VariableStore`: shell variables, attributes, readonly checks, `allexport`,
  positional frame stack, and private builtin state such as `getopts`.
- `DefinitionStore`: functions and aliases.  Function values should store both
  body and definition-time redirections.
- `PathState`: logical cwd, physical cwd cache if needed, `PWD`/`OLDPWD`
  updates, and path resolution against shell cwd.
- `FdTable`: shell-visible fd mappings, named fds, ambient fd policy, and stdio.
- `TrapTable`: trap actions plus their origin and child disposition intent.
- `JobTable`: jobs owned by this execution context only.
- `ExecutionContext`: the context kind and inherited policy decisions.

This removes the current ambiguity where `ShellState::clone()` copies jobs,
traps, and process-global ownership with no caller-visible policy.

It also removes shell-state snapshotting from the design.  Each checkpoint
constructor must choose the owned components it carries and the owned components
it intentionally drops.  Restoring a checkpoint must rebuild a live
`ShellState` through normal constructors instead of assigning a stored
`ShellState` back into the session.

Expected bug impact:

- The removed advanced session snapshot/restore API could leave the live
  session process-isolated.
- Subshells and command substitutions can reap the parent's background jobs.
- Background jobs run inherited EXIT traps from the parent shell.
- Background jobs get a different `$$` from the invoking shell.
- Background machine jobs drop shell inherited-fd mappings.
- EXIT traps in subshells and command substitutions are skipped.
- Embedded run APIs skip EXIT traps when a session exits.
- CLI `RunOutcome` reports the pre-EXIT-trap status.
- `getopts` state is exposed through user variables and can be made readonly.

Important invariant: a context may inherit variables, aliases, functions, cwd,
and fd mappings, but it must not inherit the parent's job table unless it can
also inherit the parent's wait ownership.  In practice, subshell-style contexts
should receive an empty job table or read-only job view.

## Track 2: Centralize Child Launch And Fd Policy

Add a single `ChildLaunchPlan` builder used by every external command path and
by `exec` replacement:

```rust
struct ChildLaunchPlan {
    program: String,
    argv: Vec<String>,
    env: Vec<(String, String)>,
    cwd: PathBuf,
    process_group: ProcessGroupPlan,
    stdio: [ChildFdDisposition; 3],
    extra_fds: BTreeMap<i32, ChildFdDisposition>,
    signal_dispositions: ChildSignalPlan,
}

enum ChildFdDisposition {
    OpenFrom(sys::FileDescriptor),
    Closed,
}
```

The plan should be created from `FdTable`, redirections, inherited fd policy,
and trap/signal policy.  `src/sys/backend.rs` should receive this normalized
plan or an equivalent lower-level action list, not independently infer fd
semantics from stdio plus `passed_fds`.

Required fd invariants:

- A shell-visible closed fd is represented as a child-side close action, never
  as `dup2(-1, n)`.
- Redirection helper fds and duplicated source fds are close-on-exec in the
  parent and explicitly closed in the child after all `dup2` users are applied.
- If two child fds map from the same parent fd, all `dup2` actions run before
  closing the shared source.
- The shell chooses one ambient-fd policy at startup: import open fds into
  `FdTable`, or close unknown fds before external exec.  It must not reject
  builtins using fd 3 while letting external commands inherit fd 3 anyway.
- `exec` replacement uses the same path lookup, fd plan, environment, cwd, and
  script fallback as normal external command execution.

Expected bug impact:

- Duplicated stdio redirections leak implementation fds into external commands.
- Ambient inherited file descriptors are untracked but still leak to external
  commands.
- Shared inherited parent fds are closed before all child mappings are applied.
- Closed fd redirections break external commands.
- `exec` with redirections only does not persist redirects.
- `exec` replacement commands ignore redirections.
- `exec` command lookup ignores unexported shell `PATH`.
- `exec` replacement stops at non-executable PATH entries and lacks shell-script
  fallback.
- Failed `exec` temporarily changes the host process cwd.
- Background machine jobs drop shell inherited-fd mappings.

Implementation notes:

- Put the builder near shell semantics, for example `src/shell/launch.rs`.
- Keep `src/sys/backend.rs` dumb: it should only validate and apply ordered
  spawn actions.
- Extend `Runtime::exec_replace` carefully.  Prefer adding a default method that
  accepts a launch plan and returns `Unsupported` for runtimes that cannot do
  process replacement.

## Track 3: Make Jobs Own Processes, Not Pids

Replace the current mix of `display_pid`, runtime handles, wrapper machine
processes, and pid-only waits with a job ownership model:

```rust
struct Job {
    job_id: u32,
    command_id: String,
    process_group: Option<ProcessGroupId>,
    leader: sys::ProcessHandle,
    display_pid: Option<u32>,
    members: Vec<JobMember>,
    state: JobState,
}
```

Rules:

- `wait PID` may wait only for a process recorded in the current context's
  `JobTable`.  Remove the fallback to `Runtime::wait_display_pid` for unknown
  pids.
- `$!` is derived from the job's displayed asynchronous command pid.  For a
  simple external background command, avoid the `mxsh --machine` wrapper and
  spawn the command directly.  For a compound asynchronous list, the wrapper is
  the shell job process and must own the process group it creates.
- Foreground interactive jobs get their own process group before they run, and
  the terminal is handed to that group for the duration of the foreground wait.
- Stopped processes remain in `JobTable` as stopped jobs.  A stopped status is
  not collapsed into `128 + signal` unless there is no job-control context.
- `claim_foreground` records `tcgetpgrp`, not `getpgrp`, so release restores the
  previous terminal owner.

Expected bug impact:

- `$!` names the background wrapper instead of the actual asynchronous command.
- Embedded `wait PID` can reap host-owned child processes.
- `wait` without operands returns a job status instead of zero.
- Foreground interactive jobs share the shell's process group and stopped jobs
  are lost.
- Stopped background-machine children are reported as completed jobs.
- Subshells and command substitutions can reap the parent's background jobs.

This track should be done with `Track 2`, because job semantics depend on the
same launch path that sets process groups and signal dispositions.

## Track 4: Replace Trap Globals With A Signal Hub

The current trap code uses process-global arrays:

- `PENDING_TRAPS`
- `TRAP_OWNERS`
- per-state `trap_owner_id`

That model cannot safely support multiple managed embedded sessions and cannot
derive child signal dispositions from process-isolated trap changes.

Introduce a process-global signal hub with explicit session registrations:

- A managed session acquires a `SignalRegistration` token.
- Each signal has a reference-counted installed handler and a set of interested
  session tokens.
- Pending signal counts are per token, not per process-global owner slot.
- Restoring one session removes only that session's registration.
- A process-isolated context can change its `TrapTable` and child disposition
  plan without installing OS handlers in the parent process.
- Child signal defaults/ignores are computed from `TrapTable`, not hard-coded in
  `configure_spawn_attributes`.

Expected bug impact:

- Managed embedded sessions clobber each other's process-global signal traps.
- `trap ''` after a caught trap does not make child commands ignore the signal.
- `trap '' PIPE` is not inherited by external commands.
- Traps set inside subshells and command substitutions are not applied to child
  commands.
- Background jobs run inherited EXIT traps from the parent shell.

For EXIT specifically, store trap origin.  Background and detached contexts
should not run an inherited parent EXIT trap merely because they were rebuilt
from a checkpoint.  Subshell and command-substitution contexts should run an
EXIT trap that was installed inside that context before returning.

## Track 5: Typed Expansion Results

Replace the public-in-module pattern of `expand_word_nosplit -> String` and
`expand_word -> Vec<String>` with one expansion engine that returns structured
fields:

```rust
struct ExpansionResult {
    fields: Vec<ExpandedField>,
    last_command_substitution_status: Option<i32>,
}

struct ExpandedField {
    value: String,
    glob_pattern: PatternText,
    quote_mask: QuoteMask,
    tilde_allowed: bool,
}

enum ExpansionMode {
    CommandName,
    Argument,
    AssignmentValue { assignment_name: String },
    RedirectionTarget,
    ForWordList,
    CasePattern,
    ParameterRemovalPattern,
    ArithmeticText,
}
```

Mode-specific rules should be table-driven instead of encoded in separate
helper paths.  The result must preserve enough quote information to decide
whether glob metacharacters are active and whether tilde expansion is allowed.

Expected bug impact:

- Command-name expansion drops extra fields and skips globbing.
- Empty command-name expansion skips later word expansions and loses
  substitution status.
- `for ... in` word lists skip tilde and pathname expansion.
- Redirection targets do not perform tilde expansion.
- Quoted default/alternate expansions with `"$@"` collapse arguments.
- Quoted pattern-removal metacharacters are treated as active patterns.
- Shortest suffix removal misses the empty suffix.
- Quoted tildes in assignment values are expanded after quote removal.
- `set -u` exempts unset positional parameters and `$!`.
- `${parameter:=word}` assigns to positional parameters instead of failing.
- `${#}` and `${#@}` mishandle special parameters.
- `${#var}` reports UTF-8 bytes instead of characters.
- Field splitting keeps a trailing empty field for mixed whitespace and
  non-whitespace IFS.
- Arithmetic expansion does not perform shell expansions inside the expression.

Do this before a large builtin cleanup.  Many builtin bugs are actually caused
by declaration-style operands being expanded through the ordinary argv path.

## Track 6: Simple-Command Transactions

Create a `SimpleCommandTransaction` that plans and executes exactly the POSIX
simple-command phases:

1. Save expansion status.
2. Process assignment words left to right.
3. Apply redirections.
4. Resolve command using the proper lookup order.
5. Execute the utility.
6. Commit or roll back temporary assignments and redirections according to the
   command kind.
7. Classify errors for non-interactive abort rules.

The transaction should return `CommandOutcome`, not just `i32`.

Important rules:

- Special builtins are resolved before shell functions.
- `command` suppresses function lookup but still honors command policy.
- Assignment-only commands use the last command-substitution status across both
  assignments and redirects.
- Assignment words are applied left to right, so later assignments can see
  earlier ones.
- Prefix assignments before regular builtins are temporary, but restoring them
  must not clobber variables the builtin intentionally wrote.
- `exec` with no command commits its redirections instead of restoring them.
- Redirection and variable-assignment errors for special builtins trigger the
  required non-interactive abort behavior.

Expected bug impact:

- Assignment-only readonly failures do not abort non-interactive scripts.
- Assignment-only command status ignores later redirection command
  substitutions.
- Assignment words expand as a batch instead of left-to-right.
- Prefix-assignment restoration clobbers regular builtin side effects.
- Special-builtin redirection and readonly errors do not abort non-interactive
  scripts.
- Special-builtin utility errors do not abort non-interactive scripts.
- Shell functions incorrectly override special builtins.
- `command` bypasses the unspecified-utility command policy.
- `exec` with redirections only does not persist redirects.
- `export` and `readonly` assignment operands are field-split.
- Attribute-only `export` and `readonly` create empty variables.

This is also the right place to make write failures from output-producing
builtins visible in the builtin status.  `shell_out` and `shell_outln` should
return `io::Result<()>` or a shell status instead of being diagnostics-only
helpers.

## Track 7: Control Flow And Errexit As Values

Replace sticky mutable `BranchControl` plus status rewriting with structured
control flow:

```rust
enum ControlFlow {
    None,
    Break(u32),
    Continue(u32),
    Return(i32),
    Exit(i32),
}
```

Compound command helpers should propagate `CommandOutcome` unchanged unless
they are the context that consumes it.  `!`, `if`, `while`, `until`, and
short-circuit lists should not reinterpret `Return(7)` as an ordinary status to
invert or test.

Add an `ErrexitContext` stack so `set -e` is evaluated with POSIX exemptions:

- conditions of `if`, `while`, and `until`
- non-final commands in `&&` and `||`
- inverted pipelines beginning with `!`
- other contexts already required by the shell language profile

Expected bug impact:

- `break n` and `continue n` can escape past all loops and stop the script.
- `return` status is lost in control-flow conditions and negated pipelines.
- `set -e` exits in POSIX-exempt control-flow contexts.
- `return` is rejected inside sourced scripts.
- Dot scripts ignore positional arguments.
- Function calls overwrite `$0`.
- Function definition redirections are dropped.

Function and dot-script frames should be explicit frame-stack entries:

- function frames replace `$1...` but preserve `$0`
- dot-script frames may temporarily replace positional parameters
- `return` is valid in function and dot-script frames
- function definition redirections are part of the stored function object and
  are applied when the function is invoked

## Track 8: Path, Cwd, And CLI Boundaries

Move path behavior into `PathState` and a small set of path-resolution helpers:

- `resolve_against_shell_cwd`
- `resolve_dot_path`
- `resolve_cd_target`
- `resolve_command_path`
- `resolve_redirection_target`

Rules:

- Relative path entries in `PATH` and `CDPATH` are relative to the shell cwd,
  not the process cwd.
- `cd` supports `CDPATH`, `-L`, and `-P`.
- Logical cwd is the default.  Physical resolution is requested explicitly.
- `PWD` and `OLDPWD` updates go through normal readonly-aware variable APIs.

Expected bug impact:

- Dot-script PATH search uses the process cwd instead of the shell cwd.
- `cd` ignores `CDPATH` and treats `-L` / `-P` as directory names.
- `cd` always resolves physical paths and loses logical `PWD`.
- `cd` bypasses readonly `PWD` / `OLDPWD`.
- Redirection targets do not perform tilde expansion.
- Failed `exec` temporarily changes the host process cwd.

For CLI input, replace `std::env::args()` and `std::env::vars()` with
`args_os` and `vars_os` at the boundary.  The internal shell can still be
UTF-8-first initially, but invalid Unicode must be rejected with a controlled
diagnostic rather than a panic.

Expected bug impact:

- Non-UTF-8 argv or environment entries crash the CLI.
- `-s` treats the first operand as a script file instead of a positional
  parameter.
- `-c` keeps parsing command operands as shell options.
- Bare `--` is rejected instead of reading standard input.

## Track 9: Builtin Semantics After The Core Contracts

Once expansion, simple-command transactions, path handling, and output errors
are centralized, builtin fixes become smaller and less repetitive:

- parse `exit` and `return` operands with too-many-argument checks
- parse `trap --`, numeric reset operands, and reusable trap output
- validate shell identifiers in `export`, `readonly`, `unset`, `read`, and
  `for`
- make `shift` reject invalid and negative counts
- make `set -a` apply to `for`, `read`, `${name:=word}`, and arithmetic
  assignment
- make arithmetic assignment use `env_set` and report readonly failures
- implement shell arithmetic value re-evaluation, octal constants, and
  short-circuiting `&&` / `||`
- add POSIX `printf` conversions
- add symbolic `umask`
- make bare `set -`, `set +`, `command`, and `command --` no-op compatible
- implement `ulimit -f` in 512-byte blocks

These are not all the same size, but they should be handled after the core
contracts so they do not recreate local expansion, assignment, fd, or status
logic.

## Implementation Order

1. Add regression coverage from `BUGS.md`.
   Use the reproducers as named tests grouped by subsystem.  It is acceptable
   to land failing tests behind an ignored/systemic-refactor feature only if the
   next patch immediately starts burning them down, but each refactor track
   should have direct tests before code changes.

2. Add `ExecutionContext`, split job scope from cloned state, and replace
   snapshot/fork constructors with explicit checkpoint/fork constructors.
   This limits accidental parent-state leakage before the launch and job work
   starts.  This step is not complete while any public session API exposes live
   clone, snapshot, or restore semantics backed by a stored `ShellState`.

3. Add `FdTable` and `ChildLaunchPlan`.
   Route all external spawn sites through it, then route `exec` replacement
   through the same command lookup and fd planning path.

4. Replace job ownership and foreground process-group handling.
   Remove unknown-pid wait fallback and make background, foreground, and
   stopped jobs share one state machine.

5. Install the signal hub and derive child signal plans from `TrapTable`.
   This should be integrated with `ChildLaunchPlan` and session finalization.

6. Replace expansion return types with structured fields and modes.
   Convert command resolution, argv construction, assignment values, redirection
   targets, `for` word lists, parameter patterns, and arithmetic text to use the
   new engine.

7. Replace simple-command execution with transactions.
   This is where special builtin abort behavior, assignment restoration,
   declaration utility operands, and `exec` redirection commit become uniform.

8. Replace status/control-flow side channels with `CommandOutcome`.
   Add `ErrexitContext` and frame-stack handling for functions and dot scripts.

9. Clean up builtins, path helpers, and CLI byte handling.
   At this point many BUGS.md entries should be reduced to ordinary utility
   conformance work.

## High-Severity Coverage Map

The current `BUGS.md` contains more than eight `High` entries.  The high
severity items map to the tracks as follows:

- Background jobs run inherited EXIT traps from the parent shell:
  Tracks 1, 3, 4.
- `$!` names the background wrapper instead of the actual asynchronous command:
  Track 3.
- Removed advanced session snapshot/restore semantics could leave the live
  session process-isolated: Track 1.  The required fix is to stop storing
  `ShellState` as an advanced-session snapshot payload.
- Subshells and command substitutions can reap the parent's background jobs:
  Tracks 1, 3.
- Duplicated stdio redirections leak implementation fds into external commands:
  Track 2.
- Ambient inherited file descriptors are untracked but still leak to external
  commands: Track 2.
- Embedded `wait PID` can reap host-owned child processes: Track 3.
- Non-UTF-8 argv or environment entries crash the CLI: Track 8.
- `command` bypasses the unspecified-utility command policy: Track 6.
- Shared inherited parent fds are closed before all child mappings are applied:
  Track 2.
- Foreground interactive jobs share the shell's process group and stopped jobs
  are lost: Track 3.
- Stopped background-machine children are reported as completed jobs:
  Track 3.
- Managed embedded sessions clobber each other's process-global signal traps:
  Track 4.

## Definition Of Done

Each track is done when:

- every affected `BUGS.md` reproducer has a regression test
- the public behavior is the same across CLI, `Shell`, `SessionState`, and
  `Planner` where the API promises the same semantics
- Unix runtime and test runtimes exercise the same shell-level contract
- no external command launch bypasses `ChildLaunchPlan`
- no state clone for execution bypasses `ExecutionContext`
- no word expansion caller receives only a plain string until the caller has
  selected an expansion mode
- `cargo hack --feature-powerset test` passes after code changes

This is deliberately a small number of large contracts.  The important outcome
is that adding a new execution form or builtin becomes forced through the same
state, fd, expansion, trap, job, and finalization paths as the existing shell.