# OS / String Plan

## Goal

Make `epsh` correct for Unix shell data that is not valid UTF-8.

The target is not "use `OsString` everywhere". The target is:

- shell syntax stays as shell syntax
- shell data becomes byte-preserving
- OS boundaries use `OsStr` / `OsString` / `Path` on Rust APIs
- libc boundaries use raw bytes / `CString`

## Core Position

`OsString` is the right Rust boundary type for filesystem and process APIs on Unix.
It is not the right universal runtime type for a POSIX shell.

The shell's internal data model should distinguish:

- Shell syntax
  - reserved words
  - operators
  - variable names
  - function names
  - type: `String`
- Shell data
  - expanded words
  - variable values
  - positional parameters
  - command names
  - argv entries
  - pathnames
  - environment values
  - type: byte-preserving shell-owned type, e.g. `ShellBytes(Vec<u8>)`
- OS interface
  - `Path` / `PathBuf`
  - `OsStr` / `OsString`
  - `CString`

## Why The Current Model Is Wrong

Today `epsh` uses `String` for most shell data and a PUA encoding shim in
`src/encoding.rs` to preserve invalid bytes in some paths. That is a useful
bridge, but it is not a correct runtime model.

Current correctness gaps:

- `execvp` converts shell strings with `CString::new(a.as_bytes())`, which sends
  UTF-8 bytes for PUA codepoints instead of the original bytes.
  - `src/builtins.rs`
- external spawn uses `String` argv/env directly
  - `src/eval.rs`
- environment import uses `std::env::vars()`, which panics if any name or value
  is not valid Unicode
  - `src/var.rs`
- CLI argv import uses `std::env::args()`, which panics if any argument is not
  valid Unicode
  - `src/main.rs`
- glob drops non-UTF-8 names with `to_str()`
  - `src/glob.rs`
- path-based builtins and tests operate on `&str` paths
  - `src/redirect.rs`
  - `src/builtins.rs`
  - `src/test_cmd.rs`
- `PWD` is written with `to_string_lossy()`, which replaces invalid bytes with
  U+FFFD, so the stored value may no longer name the real directory
  - `src/builtins.rs`
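
The `to_str()` and `to_string_lossy()` gaps can be reproduced in isolation. The following is a minimal sketch, not taken from the `epsh` source; the function names are hypothetical and only standard library calls are used.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;

/// What a `to_str()`-based code path sees: `None` for any name that is not
/// valid UTF-8, so the entry is typically skipped.
fn via_to_str(raw: &[u8]) -> Option<String> {
    OsStr::from_bytes(raw).to_str().map(str::to_owned)
}

/// What a `to_string_lossy()`-based code path stores: each invalid sequence is
/// replaced with U+FFFD, so the result no longer names the same file.
fn via_lossy(raw: &[u8]) -> Vec<u8> {
    OsStr::from_bytes(raw)
        .to_string_lossy()
        .into_owned()
        .into_bytes()
}

/// The byte-preserving path: `OsStr` hands back exactly what it was given.
fn via_bytes(raw: &[u8]) -> Vec<u8> {
    OsStr::from_bytes(raw).as_bytes().to_vec()
}
```

For a Latin-1 name such as `b"caf\xe9.txt"`, `via_to_str` returns `None`, `via_lossy` returns different bytes than it was given, and only `via_bytes` round-trips.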

## Desired Invariants

These invariants define success.

1. Any non-NUL byte sequence accepted by the shell as data remains exact until an
   operation explicitly transforms it.
2. A pathname containing non-UTF-8 bytes can be:
   - passed as an argument
   - redirected to or from
   - globbed
   - used by `cd`
   - checked by `test` / `[`
   - sourced with `.`
3. Environment values with non-UTF-8 bytes survive:
   - import into the shell
   - export from shell variables
   - prefix assignment to an external command
   - exec/spawn to a child process
4. `argv` bytes seen by a child process exactly match the bytes produced by shell
   expansion.
5. `cwd` and path resolution remain byte-correct at the OS boundary.

## Type Design

Introduce a dedicated internal byte string type:

```rust
pub struct ShellBytes(Vec<u8>);
```

Minimum API:

```rust
impl ShellBytes {
    pub fn as_bytes(&self) -> &[u8];
    pub fn from_vec(bytes: Vec<u8>) -> Self;
    pub fn into_vec(self) -> Vec<u8>;
    pub fn from_str_lossless(s: &str) -> Self;
    pub fn to_os_string(&self) -> OsString;
    pub fn from_os_str(s: &OsStr) -> Self;
    pub fn to_cstring(&self) -> Result<CString, NulError>;
}
```

On Unix these conversions should use `std::os::unix::ffi::{OsStrExt, OsStringExt}`.

Do not route new code through the PUA encoding layer except as a temporary
compatibility bridge.
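
One way the minimum API above could be filled in is sketched below. This is an illustration under assumptions, not the crate's implementation: the derives are added for convenience, and `from_str_lossless` is assumed to copy the UTF-8 bytes of the `&str` unchanged (it does not decode the PUA bridge).

```rust
use std::ffi::{CString, NulError, OsStr, OsString};
use std::os::unix::ffi::{OsStrExt, OsStringExt};

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ShellBytes(Vec<u8>);

impl ShellBytes {
    pub fn as_bytes(&self) -> &[u8] {
        &self.0
    }

    pub fn from_vec(bytes: Vec<u8>) -> Self {
        ShellBytes(bytes)
    }

    pub fn into_vec(self) -> Vec<u8> {
        self.0
    }

    /// Assumption: "lossless" means the UTF-8 bytes are copied as they are.
    pub fn from_str_lossless(s: &str) -> Self {
        ShellBytes(s.as_bytes().to_vec())
    }

    /// On Unix an `OsString` is an arbitrary byte sequence, so this cannot fail.
    pub fn to_os_string(&self) -> OsString {
        OsString::from_vec(self.0.clone())
    }

    pub fn from_os_str(s: &OsStr) -> Self {
        ShellBytes(s.as_bytes().to_vec())
    }

    /// Fails only when the data contains an interior NUL byte.
    pub fn to_cstring(&self) -> Result<CString, NulError> {
        CString::new(self.0.clone())
    }
}
```

The only fallible conversion is `to_cstring`, which matches the "non-NUL byte sequence" wording of the first invariant.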

## Data Model Changes

Keep as `String`:

- variable names
- function names
- reserved words
- operator tokens
- parser control structures

Move to `ShellBytes`:

- `WordPart::Literal`
- `WordPart::SingleQuoted`
- heredoc literal bodies
- expanded fields
- command substitution output
- shell variable values
- positional parameters
- `$0`
- command names after expansion
- path arguments after expansion
- environment values

Likely split AST and runtime concerns:

- parser may continue to build syntax-oriented structures first
- expansion output should become byte-based as early as possible
- long term, lexer/parser should become byte-native instead of `Vec<char>`

## Environment Model

Do not treat the inherited process environment as if every entry were valid shell
syntax and valid UTF-8.

Use two concepts:

- shell variables
  - keyed by valid shell identifier
  - values are `ShellBytes`
- inherited external environment
  - raw `name=value` byte entries or equivalent structured form
  - may contain names or values not representable as shell identifiers

Spawn/exec should merge:

- exported shell variables
- preserved inherited environment entries not shadowed by shell exports
- command prefix assignments

This avoids losing environment state during import/export cycles.
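
The merge order can be sketched as follows. The `merge_env` function and the `EnvEntry` alias are hypothetical names; the precedence (inherited, then exported, then prefix assignments) is an assumption that follows the usual shell behavior, and handling of unset or unexported variables is left out.

```rust
use std::collections::BTreeMap;

/// One environment entry as raw bytes: (name, value).
type EnvEntry = (Vec<u8>, Vec<u8>);

/// Build the environment for a child process.
/// Precedence, lowest to highest: inherited entries, exported shell
/// variables, command prefix assignments.
fn merge_env(
    inherited: &[EnvEntry],
    exported: &[(String, Vec<u8>)],
    prefix: &[(String, Vec<u8>)],
) -> Vec<EnvEntry> {
    // A BTreeMap keyed by raw name bytes gives a deterministic output order.
    let mut merged: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();

    // Inherited entries are kept even if their names are not shell identifiers.
    for (name, value) in inherited {
        merged.insert(name.clone(), value.clone());
    }

    // Shell-side entries shadow inherited ones; prefix assignments come last.
    for (name, value) in exported.iter().chain(prefix) {
        merged.insert(name.as_bytes().to_vec(), value.clone());
    }

    merged.into_iter().collect()
}
```

Keeping names as raw bytes on the inherited side is what lets an entry such as `odd-name=<bytes>` pass through to children even though the shell itself can never assign to it.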

## Migration Phases

### Phase 1: Boundary Helpers

Add a small module for Unix conversions between:

- `ShellBytes` and `OsStr`
- `ShellBytes` and `CString`
- `ShellBytes` and current PUA `String` bridge

This phase should not change behavior yet. It creates the conversion seams.

### Phase 2: Runtime Values

Convert runtime-expanded values from `String` to `ShellBytes`:

- expansion results
- variable values
- positional parameters
- command substitution output

Keep parser-facing syntax names as `String`.

### Phase 3: Process Boundaries

Fix argv/env correctness for external commands:

- replace `std::env::args()` with `args_os()` in `src/main.rs`
- replace `std::env::vars()` with `vars_os()` or raw Unix env handling in `src/var.rs`
- feed `Command::new`, `.args`, and `.env` with `OsStr` / `OsString`
- build `CString` for direct `execvp` from raw shell bytes, not UTF-8 string bytes
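
A rough sketch of the `std`-level side of this phase is shown below. All function names are hypothetical, and clearing the child's environment before setting it is an assumption made so that the child sees exactly what the shell computed; the direct `execvp` path is not covered here.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::{OsStrExt, OsStringExt};
use std::process::Command;

/// Import the process environment without assuming UTF-8.
fn import_env() -> Vec<(Vec<u8>, Vec<u8>)> {
    std::env::vars_os()
        .map(|(k, v)| (k.into_vec(), v.into_vec()))
        .collect()
}

/// Import CLI arguments without assuming UTF-8.
fn import_args() -> Vec<Vec<u8>> {
    std::env::args_os().map(|a| a.into_vec()).collect()
}

/// Build a `Command` whose program, argv, and environment are raw bytes.
fn build_command(program: &[u8], args: &[Vec<u8>], env: &[(Vec<u8>, Vec<u8>)]) -> Command {
    let mut cmd = Command::new(OsStr::from_bytes(program));
    cmd.args(args.iter().map(|a| OsStr::from_bytes(a)));
    cmd.env_clear();
    cmd.envs(
        env.iter()
            .map(|(k, v)| (OsStr::from_bytes(k), OsStr::from_bytes(v))),
    );
    cmd
}

/// Read back the argv and explicitly set environment of a `Command` as bytes,
/// useful for asserting that nothing was re-encoded on the way in.
fn command_bytes(cmd: &Command) -> (Vec<Vec<u8>>, Vec<(Vec<u8>, Vec<u8>)>) {
    let args = cmd.get_args().map(|a| a.as_bytes().to_vec()).collect();
    let env = cmd
        .get_envs()
        .filter_map(|(k, v)| Some((k.as_bytes().to_vec(), v?.as_bytes().to_vec())))
        .collect();
    (args, env)
}
```

`command_bytes` only inspects the builder; the integration tests described later are still needed to confirm what the child process actually receives.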

### Phase 4: Path Boundaries

Change path resolution and path-using builtins to accept byte-preserving values:

- `resolve_path`
- redirections
- `.`
- `cd`
- `test` / `[`
- `which`

`cwd` may remain `PathBuf`, but conversion into it must be byte-correct.
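
As an illustration of a byte-correct conversion into `PathBuf`, consider the sketch below. The names `resolve_path_bytes` and `path_to_bytes` are hypothetical and the signatures are assumptions; the existing `resolve_path` in `epsh` may look different.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;
use std::path::{Path, PathBuf};

/// Resolve a shell path argument against the shell's working directory
/// without ever going through `str`.
fn resolve_path_bytes(cwd: &Path, arg: &[u8]) -> PathBuf {
    let p = Path::new(OsStr::from_bytes(arg));
    if p.is_absolute() {
        p.to_path_buf()
    } else {
        cwd.join(p)
    }
}

/// Convert a path back to shell data, for example when storing `PWD`.
fn path_to_bytes(p: &Path) -> Vec<u8> {
    p.as_os_str().as_bytes().to_vec()
}
```

Because `Path` on Unix wraps the same bytes as `OsStr`, both directions are exact and neither can fail.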

### Phase 5: Glob

Rewrite glob to operate on Unix directory entry bytes.

Requirements:

- do not drop non-UTF-8 names
- preserve exact entry bytes in matches
- keep existing wildcard semantics
- preserve current behavior for dotfiles and escaped glob metacharacters

This is one of the highest-value correctness changes because it materially affects
what filenames the shell can see.
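
A byte-oriented matcher for a single directory might look like the sketch below. It is deliberately reduced: only `*` and `?` are handled, `?` matches one byte rather than one character (an assumption about the target semantics), and bracket expressions and escaped metacharacters are omitted. The names `wildcard_match` and `glob_dir` are hypothetical.

```rust
use std::fs;
use std::io;
use std::os::unix::ffi::OsStringExt;
use std::path::Path;

/// Match `name` against `pat`, where `*` matches any run of bytes and `?`
/// matches exactly one byte. Uses the usual single-star backtracking scheme.
fn wildcard_match(pat: &[u8], name: &[u8]) -> bool {
    let (mut p, mut n) = (0usize, 0usize);
    // Position of the last `*` in the pattern and the name offset it covers.
    let mut star: Option<(usize, usize)> = None;

    while n < name.len() {
        if p < pat.len() && pat[p] == b'*' {
            star = Some((p, n));
            p += 1;
        } else if p < pat.len() && (pat[p] == b'?' || pat[p] == name[n]) {
            p += 1;
            n += 1;
        } else if let Some((sp, sn)) = star {
            // Let the last `*` absorb one more byte and retry from there.
            p = sp + 1;
            n = sn + 1;
            star = Some((sp, sn + 1));
        } else {
            return false;
        }
    }
    // Only trailing stars may remain in the pattern.
    while p < pat.len() && pat[p] == b'*' {
        p += 1;
    }
    p == pat.len()
}

/// List entries of `dir` whose raw names match `pat`, sorted by byte value.
fn glob_dir(dir: &Path, pat: &[u8]) -> io::Result<Vec<Vec<u8>>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(dir)? {
        let name = entry?.file_name().into_vec();
        // A leading dot must be matched by a literal dot in the pattern.
        if name.starts_with(b".") && !pat.starts_with(b".") {
            continue;
        }
        if wildcard_match(pat, &name) {
            out.push(name);
        }
    }
    out.sort();
    Ok(out)
}
```

The key difference from a `to_str()`-based glob is that `file_name().into_vec()` never discards an entry, so a name like `caf\xe9.txt` is still a candidate for `*.txt`.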

### Phase 6: Public API Review

Review public API surface in `eval.rs` and tests:

- `set_var`
- `get_var`
- `set_args`
- `run_script`
- `resolve_path`

Possible approach:

- keep current UTF-8 convenience APIs
- add byte-safe Unix-specific APIs alongside them
- avoid breaking embedders unnecessarily until the byte model is stable

### Phase 7: Parser / Lexer Rewrite

Move the shell core from `String` / `char` parsing to byte-native parsing.

This is the end-state cleanup phase. It removes the remaining conceptual mismatch
between shell semantics and Unicode-centric tokenization.

## Correctness Measurement

The improvement should be measured at OS-observable boundaries, not only with
unit tests over internal helpers.

The question is not "did conversion code run".
The question is "can `epsh` now preserve bytes that the current build loses".

### Test Philosophy

Each new test should exercise behavior that fails on the build before the change
and passes after it.

Good tests:

- assert exact child `argv` bytes
- assert exact child env bytes
- assert a non-UTF-8 filename is matched by glob
- assert redirection opens the intended non-UTF-8 filename
- assert `cd` and `test -e` succeed on non-UTF-8 paths
- fail on current `main` or would have failed before the migration step that
  fixes them

Weak tests:

- only checking helper conversion functions
- only round-tripping through PUA strings
- only testing Unicode-valid filenames

## Recommended Test Harness

Add small helper executables for integration tests that print raw bytes in a
stable representation, preferably hex.

Suggested fixtures:

- `tests/fixtures/show_argv.rs`
  - prints each argument as lowercase hex, one line per arg
- `tests/fixtures/show_env.rs`
  - prints selected environment values as lowercase hex
- `tests/fixtures/stat_path.rs`
  - checks whether the path passed in argv exists and reports the raw bytes it saw

If keeping fixtures as Rust binaries is awkward, a small Perl helper is also fine
because Perl handles raw Unix bytes well and the repo already uses Perl in the
conformance flow.
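
The core of an argv-dump fixture could be as small as the sketch below. The helper names `hex` and `dump_argv` are hypothetical, and the output format (one lowercase hex line per argument) follows the description above.

```rust
use std::ffi::OsString;
use std::os::unix::ffi::OsStrExt;

/// Encode bytes as lowercase hex, two digits per byte.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Produce one hex line per argument, using the raw bytes of each argument.
fn dump_argv<I: IntoIterator<Item = OsString>>(args: I) -> Vec<String> {
    args.into_iter().map(|a| hex(a.as_bytes())).collect()
}
```

A fixture's `main` would then print each line of `dump_argv(std::env::args_os().skip(1))`. An empty argument produces an empty line, which keeps the argument count visible in the output.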

## Tests That Materially Show Improvement

### 1. External argv byte round-trip

Run `epsh -c` with an argument containing bytes like `0x80`, `0xff`, or mixed
valid UTF-8 plus invalid bytes, then invoke the argv-dump helper.

Expected:

- child sees exact original bytes

Current behavior likely fails because direct exec/spawn goes through UTF-8
`String` paths.

### 2. External env byte round-trip

Set a shell variable to non-UTF-8 bytes, export it, and run the env-dump helper.

Expected:

- child sees exact raw bytes for the exported variable

Add a second case for prefix assignment:

- `X=<bytes> helper`

### 3. Inherited env preservation

Launch `epsh` from the integration test with a parent env value containing
non-UTF-8 bytes.

Expected:

- child command run by `epsh` still sees those exact bytes unless explicitly
  overwritten

This exposes the current `std::env::vars()` UTF-8 assumption.
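
The shape of such a test is sketched below. The shell under test is passed in as a parameter because the path to the `epsh` binary in the test suite is not specified here, and an inner `sh` with `printf` stands in for the env-dump helper. The function name `env_seen_by_child` is hypothetical.

```rust
use std::ffi::OsStr;
use std::io;
use std::os::unix::ffi::OsStrExt;
use std::process::Command;

/// Launch `shell -c <script>` with `name` set to `value` in the parent
/// environment. The script runs a child process that prints `$name`.
/// Returns the raw bytes that child wrote to stdout.
fn env_seen_by_child(shell: &str, name: &str, value: &[u8]) -> io::Result<Vec<u8>> {
    // Single quotes keep the shell under test from expanding the variable
    // itself, so the value must survive import and re-export to the child.
    let script = format!("sh -c 'printf %s \"${}\"'", name);
    let out = Command::new(shell)
        .arg("-c")
        .arg(script)
        .env(name, OsStr::from_bytes(value))
        .output()?;
    Ok(out.stdout)
}
```

A test would call this with the `epsh` binary as `shell` and assert that the returned bytes equal the value that was set, for example `b"a\xc0\xffz"`.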

### 4. Non-UTF-8 script path

Execute a script file whose pathname contains invalid UTF-8 bytes.

Expected:

- `epsh script-path` works

This forces `main.rs` argv import and file open paths to be correct.

### 5. Redirection to non-UTF-8 filename

Use `printf` or `echo` with output redirection into a filename containing invalid
UTF-8 bytes, then verify the file exists and contains the expected bytes.

Expected:

- redirect succeeds
- opened file is the exact intended path

### 6. Input redirection from non-UTF-8 filename

Create a file with a non-UTF-8 pathname and read it with `< file`.

Expected:

- command receives the file contents

### 7. `cd` into non-UTF-8 directory

Create a directory with invalid UTF-8 bytes and run:

```sh
cd <dir> && pwd
```

Expected:

- `cd` succeeds
- `pwd` and shell state remain consistent

Prefer checking shell behavior over exact display formatting if output escaping is
still under transition.

### 8. `test -e`, `-f`, `-d`, `-L` on non-UTF-8 paths

These should succeed for matching file kinds.

This directly measures fixes in `src/test_cmd.rs`.

### 9. `.` on non-UTF-8 path

Source a script whose filename contains invalid UTF-8 bytes.

Expected:

- script executes successfully

### 10. `which` / PATH search with non-UTF-8 directory entries

Set `PATH` to include a directory whose name contains invalid UTF-8 bytes and
place an executable inside it.

Expected:

- `command -v`
- `type`
- external execution by bare command name

all find the executable.

### 11. Glob matches non-UTF-8 filenames

Create files whose names include invalid UTF-8 bytes and run a glob that should
match them.

Expected:

- matches are returned
- unmatched files are not returned
- ordering remains stable

This is one of the clearest before/after tests because current glob explicitly
drops non-UTF-8 names.

### 12. Command substitution preserves non-UTF-8 bytes

Have a helper emit raw bytes to stdout, capture them with `$(...)`, and pass the
result to another helper that prints argv bytes.

Expected:

- bytes survive command substitution unchanged except for POSIX newline trimming
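
The newline trimming itself is easy to express on bytes. A minimal sketch, with a hypothetical function name:

```rust
/// Remove all trailing `\n` bytes from captured command output, as command
/// substitution requires. Every other byte, including `\r` and bytes that are
/// not valid UTF-8, is left untouched.
fn trim_trailing_newlines(mut output: Vec<u8>) -> Vec<u8> {
    while output.last() == Some(&b'\n') {
        output.pop();
    }
    output
}
```

Working on `Vec<u8>` here means the captured output never has to be validated or re-encoded before it is split into fields.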

### 13. Here-doc and variable expansion with raw bytes

Where supported by the parser and runtime model, verify that raw bytes stored in
variables survive expansion into:

- argv
- redirection targets
- environment values

This guards against regressions once runtime values become byte-based.

## Metrics

Track progress with a simple matrix of capabilities.

Rows:

- argv to child
- env to child
- inherited env
- script path
- redirection output
- redirection input
- `cd`
- `test`
- `.`
- PATH search
- glob
- command substitution

Columns:

- ASCII
- valid UTF-8 non-ASCII
- invalid UTF-8 bytes

Expected progression:

- ASCII stays green throughout
- valid UTF-8 stays green throughout
- invalid UTF-8 starts red in many rows and turns green phase by phase

The main success metric is the count of green cells in the `invalid UTF-8 bytes`
column. That is the part of the matrix where correctness is currently missing.

## Minimum Acceptance Bar

The migration is not complete until:

1. child `argv` and env tests prove exact byte preservation
2. glob sees and matches non-UTF-8 filenames
3. redirections and path-based builtins work on non-UTF-8 paths
4. inherited environment is not silently dropped or rewritten because of UTF-8
   assumptions

## Non-Goals

- making shell syntax identifiers support arbitrary bytes
- changing user-facing display/escaping policy before byte correctness is in place
- cross-platform abstraction beyond Unix semantics for this migration