# Refactor Plan
This document is based on `BUGS.md` and a source pass through the shell
execution, expansion, job, trap, fd, frontend, and embedding layers. The
current bug list should not be attacked as independent point fixes. Most of
the failures are places where a shell semantic contract is implicit, spread
across several modules, and therefore enforced differently by foreground
commands, background commands, command substitutions, pipelines, embedded
sessions, and the CLI.
The plan is to create a few explicit contracts and move existing behavior
behind them:
- execution context and session ownership
- child launch and fd mapping
- job ownership and process groups
- signal/trap ownership
- typed word expansion products
- simple-command transactions and control-flow outcomes
- uniform session finalization and byte-safe frontend input
The parser and AST can mostly stay. The refactor should focus on the boundary
between AST execution and OS/runtime effects.
## Source Diagnosis
The most important cross-cutting problems are visible in these places:
- `src/shell/state.rs` stores unrelated state in one struct: shell variables,
frame, jobs, traps, cwd, fd maps, process-global signal ownership, transient
expansion status, active command ids, and embedding definition. Cloning or
serializing `ShellState` is therefore never semantically neutral.
- The old fork, detached-session, background-state snapshot, and advanced
session snapshot/restore paths all created shell variants by mutating a
clone. That is the root of several bugs where jobs, traps, signal owners, fd
maps, and process isolation are kept or dropped by accident.
- Snapshotting `ShellState` must be removed as an implementation technique.
Public APIs must not expose live session clone, snapshot, or restore
semantics. Reset-style behavior must be backed by purpose-built checkpoint
data with explicit ownership for variables, definitions, cwd, fds, traps,
jobs, and process globals. If any execution path still persists or restores
a whole `ShellState` snapshot after this refactor, the refactor has failed.
- Child process launch is assembled in several sites:
`run_external_simple_command`, `spawn_external_in_path`,
`run_prepared_external_stage`, `spawn_external_pipeline_stages`,
`spawn_background_job`, and `builtin_exec`. These sites do not share one
representation of the child fd table, signal dispositions, process group, cwd,
env, or PATH resolution.
- `src/sys/backend.rs` builds `posix_spawn` file actions directly from stdio
fds and passed fds. Because there is no normalized child fd table, close
actions can be ordered incorrectly, implementation fds can leak, and closed
shell fds become invalid `dup2` sources.
- `src/shell/expand.rs` collapses expansion results to `String` or
`Vec<String>` too early. Quote provenance, glob patterns, tilde eligibility,
multi-field `"$@"`, and command-substitution status are lost before
command-name, assignment, redirection, `for`, and parameter-pattern contexts
can apply their distinct rules.
- `src/shell/simple_command.rs`, `src/shell/resolve.rs`,
`src/shell/compound.rs`, and `src/shell/builtins.rs` use status integers plus
mutable side channels (`BranchControl`, `exit_code`, command-substitution
status, temporary assignment restoration) instead of one simple-command and
control-flow result type.
- CLI and embedded execution finalize sessions differently. The CLI calls
`finalize_shell_run`; embedded APIs return directly from `run` and
`run_program`.
## Refactoring Thesis
Add explicit shell semantic objects and make every execution path use them.
Then fix behavior by enforcing those objects' invariants, not by patching each
bug's call site.
The main new internal contracts should be:
```rust
enum ExecutionContextKind {
TopLevel,
Function,
DotScript,
Subshell,
CommandSubstitution,
PipelineStage,
BackgroundJob,
DetachedSession,
}
struct ExecutionContext {
kind: ExecutionContextKind,
logical_shell_pid: u32,
job_scope: JobScope,
trap_scope: TrapScope,
fd_scope: FdScope,
process_global_scope: ProcessGlobalScope,
errexit_exempt: bool,
}
struct CommandOutcome {
status: i32,
control: ControlFlow,
}
```
All current ad hoc state transforms should become constructors such as
`state.fork_for(ExecutionContextKind::CommandSubstitution)` and
`state.checkpoint_for_background(...)`. These constructors must be the only
way to create subshell, background, detached, or pipeline-stage state.
Non-negotiable success criterion: this refactor must delete wholesale shell
state snapshotting. A snapshot or fork API may produce an explicit checkpoint
object, but it must not clone, serialize, persist, or later restore a complete
`ShellState`. Leaving that pattern in place means the refactor did not solve
the core ownership problem.
## Track 1: Split State By Ownership
Create internal state components inside `src/shell/state.rs` or new sibling
modules:
- `VariableStore`: shell variables, attributes, readonly checks, `allexport`,
positional frame stack, and private builtin state such as `getopts`.
- `DefinitionStore`: functions and aliases. Function values should store both
body and definition-time redirections.
- `PathState`: logical cwd, physical cwd cache if needed, `PWD`/`OLDPWD`
updates, and path resolution against shell cwd.
- `FdTable`: shell-visible fd mappings, named fds, ambient fd policy, and stdio.
- `TrapTable`: trap actions plus their origin and child disposition intent.
- `JobTable`: jobs owned by this execution context only.
- `ExecutionContext`: the context kind and inherited policy decisions.
This removes the current ambiguity where `ShellState::clone()` copies jobs,
traps, and process-global ownership with no caller-visible policy.
It also removes shell-state snapshotting from the design. Each checkpoint
constructor must choose the owned components it carries and the owned components
it intentionally drops. Restoring a checkpoint must rebuild a live
`ShellState` through normal constructors instead of assigning a stored
`ShellState` back into the session.
Expected bug impact:
- The removed advanced session snapshot/restore API could leave the live
session process-isolated.
- Subshells and command substitutions can reap the parent's background jobs.
- Background jobs run inherited EXIT traps from the parent shell.
- Background jobs get a different `$$` from the invoking shell.
- Background machine jobs drop shell inherited-fd mappings.
- EXIT traps in subshells and command substitutions are skipped.
- Embedded run APIs skip EXIT traps when a session exits.
- CLI `RunOutcome` reports the pre-EXIT-trap status.
- `getopts` state is exposed through user variables and can be made readonly.
Important invariant: a context may inherit variables, aliases, functions, cwd,
and fd mappings, but it must not inherit the parent's job table unless it can
also inherit the parent's wait ownership. In practice, subshell-style contexts
should receive an empty job table or read-only job view.
## Track 2: Centralize Child Launch And Fd Policy
Add a single `ChildLaunchPlan` builder used by every external command path and
by `exec` replacement:
```rust
struct ChildLaunchPlan {
program: String,
argv: Vec<String>,
env: Vec<(String, String)>,
cwd: PathBuf,
process_group: ProcessGroupPlan,
stdio: [ChildFdDisposition; 3],
extra_fds: BTreeMap<i32, ChildFdDisposition>,
signal_dispositions: ChildSignalPlan,
}
enum ChildFdDisposition {
OpenFrom(sys::FileDescriptor),
Closed,
}
```
The plan should be created from `FdTable`, redirections, inherited fd policy,
and trap/signal policy. `src/sys/backend.rs` should receive this normalized
plan or an equivalent lower-level action list, not independently infer fd
semantics from stdio plus `passed_fds`.
Required fd invariants:
- A shell-visible closed fd is represented as a child-side close action, never
as `dup2(-1, n)`.
- Redirection helper fds and duplicated source fds are close-on-exec in the
parent and explicitly closed in the child after all `dup2` users are applied.
- If two child fds map from the same parent fd, all `dup2` actions run before
closing the shared source.
- The shell chooses one ambient-fd policy at startup: import open fds into
`FdTable`, or close unknown fds before external exec. It must not reject
builtins using fd 3 while letting external commands inherit fd 3 anyway.
- `exec` replacement uses the same path lookup, fd plan, environment, cwd, and
script fallback as normal external command execution.
Expected bug impact:
- Duplicated stdio redirections leak implementation fds into external commands.
- Ambient inherited file descriptors are untracked but still leak to external
commands.
- Shared inherited parent fds are closed before all child mappings are applied.
- Closed fd redirections break external commands.
- `exec` with redirections only does not persist redirects.
- `exec` replacement commands ignore redirections.
- `exec` command lookup ignores unexported shell `PATH`.
- `exec` replacement stops at non-executable PATH entries and lacks shell-script
fallback.
- Failed `exec` temporarily changes the host process cwd.
- Background machine jobs drop shell inherited-fd mappings.
Implementation notes:
- Put the builder near shell semantics, for example `src/shell/launch.rs`.
- Keep `src/sys/backend.rs` dumb: it should only validate and apply ordered
spawn actions.
- Extend `Runtime::exec_replace` carefully. Prefer adding a default method that
accepts a launch plan and returns `Unsupported` for runtimes that cannot do
process replacement.
## Track 3: Make Jobs Own Processes, Not Pids
Replace the current mix of `display_pid`, runtime handles, wrapper machine
processes, and pid-only waits with a job ownership model:
```rust
struct Job {
job_id: u32,
command_id: String,
process_group: Option<ProcessGroupId>,
leader: sys::ProcessHandle,
display_pid: Option<u32>,
members: Vec<JobMember>,
state: JobState,
}
```
Rules:
- `wait PID` may wait only for a process recorded in the current context's
`JobTable`. Remove the fallback to `Runtime::wait_display_pid` for unknown
pids.
- `$!` is derived from the job's displayed asynchronous command pid. For a
simple external background command, avoid the `mxsh --machine` wrapper and
spawn the command directly. For a compound asynchronous list, the wrapper is
the shell job process and must own the process group it creates.
- Foreground interactive jobs get their own process group before they run, and
the terminal is handed to that group for the duration of the foreground wait.
- Stopped processes remain in `JobTable` as stopped jobs. A stopped status is
not collapsed into `128 + signal` unless there is no job-control context.
- `claim_foreground` records `tcgetpgrp`, not `getpgrp`, so release restores the
previous terminal owner.
Expected bug impact:
- `$!` names the background wrapper instead of the actual asynchronous command.
- Embedded `wait PID` can reap host-owned child processes.
- `wait` without operands returns a job status instead of zero.
- Foreground interactive jobs share the shell's process group and stopped jobs
are lost.
- Stopped background-machine children are reported as completed jobs.
- Subshells and command substitutions can reap the parent's background jobs.
This track should be done with `Track 2`, because job semantics depend on the
same launch path that sets process groups and signal dispositions.
## Track 4: Replace Trap Globals With A Signal Hub
The current trap code uses process-global arrays:
- `PENDING_TRAPS`
- `TRAP_OWNERS`
- per-state `trap_owner_id`
That model cannot safely support multiple managed embedded sessions and cannot
derive child signal dispositions from process-isolated trap changes.
Introduce a process-global signal hub with explicit session registrations:
- A managed session acquires a `SignalRegistration` token.
- Each signal has a reference-counted installed handler and a set of interested
session tokens.
- Pending signal counts are per token, not per process-global owner slot.
- Restoring one session removes only that session's registration.
- A process-isolated context can change its `TrapTable` and child disposition
plan without installing OS handlers in the parent process.
- Child signal defaults/ignores are computed from `TrapTable`, not hard-coded in
`configure_spawn_attributes`.
Expected bug impact:
- Managed embedded sessions clobber each other's process-global signal traps.
- `trap ''` after a caught trap does not make child commands ignore the signal.
- `trap '' PIPE` is not inherited by external commands.
- Traps set inside subshells and command substitutions are not applied to child
commands.
- Background jobs run inherited EXIT traps from the parent shell.
For EXIT specifically, store trap origin. Background and detached contexts
should not run an inherited parent EXIT trap merely because they were rebuilt
from a checkpoint. Subshell and command-substitution contexts should run an
EXIT trap that was installed inside that context before returning.
## Track 5: Typed Expansion Results
Replace the public-in-module pattern of `expand_word_nosplit -> String` and
`expand_word -> Vec<String>` with one expansion engine that returns structured
fields:
```rust
struct ExpansionResult {
fields: Vec<ExpandedField>,
last_command_substitution_status: Option<i32>,
}
struct ExpandedField {
value: String,
glob_pattern: PatternText,
quote_mask: QuoteMask,
tilde_allowed: bool,
}
enum ExpansionMode {
CommandName,
Argument,
AssignmentValue { assignment_name: String },
RedirectionTarget,
ForWordList,
CasePattern,
ParameterRemovalPattern,
ArithmeticText,
}
```
Mode-specific rules should be table-driven instead of encoded in separate
helper paths. The result must preserve enough quote information to decide
whether glob metacharacters are active and whether tilde expansion is allowed.
Expected bug impact:
- Command-name expansion drops extra fields and skips globbing.
- Empty command-name expansion skips later word expansions and loses
substitution status.
- `for ... in` word lists skip tilde and pathname expansion.
- Redirection targets do not perform tilde expansion.
- Quoted default/alternate expansions with `"$@"` collapse arguments.
- Quoted pattern-removal metacharacters are treated as active patterns.
- Shortest suffix removal misses the empty suffix.
- Quoted tildes in assignment values are expanded after quote removal.
- `set -u` exempts unset positional parameters and `$!`.
- `${parameter:=word}` assigns to positional parameters instead of failing.
- `${#}` and `${#@}` mishandle special parameters.
- `${#var}` reports UTF-8 bytes instead of characters.
- Field splitting keeps a trailing empty field for mixed whitespace and
non-whitespace IFS.
- Arithmetic expansion does not perform shell expansions inside the expression.
Do this before a large builtin cleanup. Many builtin bugs are actually caused
by declaration-style operands being expanded through the ordinary argv path.
## Track 6: Simple-Command Transactions
Create a `SimpleCommandTransaction` that plans and executes exactly the POSIX
simple-command phases:
1. Save expansion status.
2. Process assignment words left to right.
3. Apply redirections.
4. Resolve command using the proper lookup order.
5. Execute the utility.
6. Commit or roll back temporary assignments and redirections according to the
command kind.
7. Classify errors for non-interactive abort rules.
The transaction should return `CommandOutcome`, not just `i32`.
Important rules:
- Special builtins are resolved before shell functions.
- `command` suppresses function lookup but still honors command policy.
- Assignment-only commands use the last command-substitution status across both
assignments and redirects.
- Assignment words are applied left to right, so later assignments can see
earlier ones.
- Prefix assignments before regular builtins are temporary, but restoring them
must not clobber variables the builtin intentionally wrote.
- `exec` with no command commits its redirections instead of restoring them.
- Redirection and variable-assignment errors for special builtins trigger the
required non-interactive abort behavior.
Expected bug impact:
- Assignment-only readonly failures do not abort non-interactive scripts.
- Assignment-only command status ignores later redirection command
substitutions.
- Assignment words expand as a batch instead of left-to-right.
- Prefix-assignment restoration clobbers regular builtin side effects.
- Special-builtin redirection and readonly errors do not abort non-interactive
scripts.
- Special-builtin utility errors do not abort non-interactive scripts.
- Shell functions incorrectly override special builtins.
- `command` bypasses the unspecified-utility command policy.
- `exec` with redirections only does not persist redirects.
- `export` and `readonly` assignment operands are field-split.
- Attribute-only `export` and `readonly` create empty variables.
This is also the right place to make write failures from output-producing
builtins visible in the builtin status. `shell_out` and `shell_outln` should
return `io::Result<()>` or a shell status instead of being diagnostics-only
helpers.
## Track 7: Control Flow And Errexit As Values
Replace sticky mutable `BranchControl` plus status rewriting with structured
control flow:
```rust
enum ControlFlow {
None,
Break(u32),
Continue(u32),
Return(i32),
Exit(i32),
}
```
Compound command helpers should propagate `CommandOutcome` unchanged unless
they are the context that consumes it. `!`, `if`, `while`, `until`, and
short-circuit lists should not reinterpret `Return(7)` as an ordinary status to
invert or test.
Add an `ErrexitContext` stack so `set -e` is evaluated with POSIX exemptions:
- conditions of `if`, `while`, and `until`
- non-final commands in `&&` and `||`
- inverted pipelines beginning with `!`
- other contexts already required by the shell language profile
Expected bug impact:
- `break n` and `continue n` can escape past all loops and stop the script.
- `return` status is lost in control-flow conditions and negated pipelines.
- `set -e` exits in POSIX-exempt control-flow contexts.
- `return` is rejected inside sourced scripts.
- Dot scripts ignore positional arguments.
- Function calls overwrite `$0`.
- Function definition redirections are dropped.
Function and dot-script frames should be explicit frame-stack entries:
- function frames replace `$1...` but preserve `$0`
- dot-script frames may temporarily replace positional parameters
- `return` is valid in function and dot-script frames
- function definition redirections are part of the stored function object and
are applied when the function is invoked
## Track 8: Path, Cwd, And CLI Boundaries
Move path behavior into `PathState` and a small set of path-resolution helpers:
- `resolve_against_shell_cwd`
- `resolve_dot_path`
- `resolve_cd_target`
- `resolve_command_path`
- `resolve_redirection_target`
Rules:
- Relative path entries in `PATH` and `CDPATH` are relative to the shell cwd,
not the process cwd.
- `cd` supports `CDPATH`, `-L`, and `-P`.
- Logical cwd is the default. Physical resolution is requested explicitly.
- `PWD` and `OLDPWD` updates go through normal readonly-aware variable APIs.
Expected bug impact:
- Dot-script PATH search uses the process cwd instead of the shell cwd.
- `cd` ignores `CDPATH` and treats `-L` / `-P` as directory names.
- `cd` always resolves physical paths and loses logical `PWD`.
- `cd` bypasses readonly `PWD` / `OLDPWD`.
- Redirection targets do not perform tilde expansion.
- Failed `exec` temporarily changes the host process cwd.
For CLI input, replace `std::env::args()` and `std::env::vars()` with
`args_os` and `vars_os` at the boundary. The internal shell can still be
UTF-8-first initially, but invalid Unicode must be rejected with a controlled
diagnostic rather than a panic.
Expected bug impact:
- Non-UTF-8 argv or environment entries crash the CLI.
- `-s` treats the first operand as a script file instead of a positional
parameter.
- `-c` keeps parsing command operands as shell options.
- Bare `--` is rejected instead of reading standard input.
## Track 9: Builtin Semantics After The Core Contracts
Once expansion, simple-command transactions, path handling, and output errors
are centralized, builtin fixes become smaller and less repetitive:
- parse `exit` and `return` operands with too-many-argument checks
- parse `trap --`, numeric reset operands, and reusable trap output
- validate shell identifiers in `export`, `readonly`, `unset`, `read`, and
`for`
- make `shift` reject invalid and negative counts
- make `set -a` apply to `for`, `read`, `${name:=word}`, and arithmetic
assignment
- make arithmetic assignment use `env_set` and report readonly failures
- implement shell arithmetic value re-evaluation, octal constants, and
short-circuiting `&&` / `||`
- add POSIX `printf` conversions
- add symbolic `umask`
- make bare `set -`, `set +`, `command`, and `command --` no-op compatible
- implement `ulimit -f` in 512-byte blocks
These are not all the same size, but they should be handled after the core
contracts so they do not recreate local expansion, assignment, fd, or status
logic.
## Implementation Order
1. Add regression coverage from `BUGS.md`.
Use the reproducers as named tests grouped by subsystem. It is acceptable
to land failing tests behind an ignored/systemic-refactor feature only if the
next patch immediately starts burning them down, but each refactor track
should have direct tests before code changes.
2. Add `ExecutionContext`, split job scope from cloned state, and replace
snapshot/fork constructors with explicit checkpoint/fork constructors.
This limits accidental parent-state leakage before the launch and job work
starts. This step is not complete while any public session API exposes live
clone, snapshot, or restore semantics backed by a stored `ShellState`.
3. Add `FdTable` and `ChildLaunchPlan`.
Route all external spawn sites through it, then route `exec` replacement
through the same command lookup and fd planning path.
4. Replace job ownership and foreground process-group handling.
Remove unknown-pid wait fallback and make background, foreground, and
stopped jobs share one state machine.
5. Install the signal hub and derive child signal plans from `TrapTable`.
This should be integrated with `ChildLaunchPlan` and session finalization.
6. Replace expansion return types with structured fields and modes.
Convert command resolution, argv construction, assignment values, redirection
targets, `for` word lists, parameter patterns, and arithmetic text to use the
new engine.
7. Replace simple-command execution with transactions.
This is where special builtin abort behavior, assignment restoration,
declaration utility operands, and `exec` redirection commit become uniform.
8. Replace status/control-flow side channels with `CommandOutcome`.
Add `ErrexitContext` and frame-stack handling for functions and dot scripts.
9. Clean up builtins, path helpers, and CLI byte handling.
At this point many BUGS.md entries should be reduced to ordinary utility
conformance work.
## High-Severity Coverage Map
The current `BUGS.md` contains more than eight `High` entries. The high
severity items map to the tracks as follows:
- Background jobs run inherited EXIT traps from the parent shell:
Tracks 1, 3, 4.
- `$!` names the background wrapper instead of the actual asynchronous command:
Track 3.
- Removed advanced session snapshot/restore semantics could leave the live
session process-isolated: Track 1. The required fix is to stop storing
`ShellState` as an advanced-session snapshot payload.
- Subshells and command substitutions can reap the parent's background jobs:
Tracks 1, 3.
- Duplicated stdio redirections leak implementation fds into external commands:
Track 2.
- Ambient inherited file descriptors are untracked but still leak to external
commands: Track 2.
- Embedded `wait PID` can reap host-owned child processes: Track 3.
- Non-UTF-8 argv or environment entries crash the CLI: Track 8.
- `command` bypasses the unspecified-utility command policy: Track 6.
- Shared inherited parent fds are closed before all child mappings are applied:
Track 2.
- Foreground interactive jobs share the shell's process group and stopped jobs
are lost: Track 3.
- Stopped background-machine children are reported as completed jobs:
Track 3.
- Managed embedded sessions clobber each other's process-global signal traps:
Track 4.
## Definition Of Done
Each track is done when:
- every affected `BUGS.md` reproducer has a regression test
- the public behavior is the same across CLI, `Shell`, `SessionState`, and
`Planner` where the API promises the same semantics
- Unix runtime and test runtimes exercise the same shell-level contract
- no external command launch bypasses `ChildLaunchPlan`
- no state clone for execution bypasses `ExecutionContext`
- no word expansion caller receives only a plain string until the caller has
selected an expansion mode
- `cargo hack --feature-powerset test` passes after code changes
This is deliberately a small number of large contracts. The important outcome
is that adding a new execution form or builtin becomes forced through the same
state, fd, expansion, trap, job, and finalization paths as the existing shell.