unilang_parser 0.34.0

Parser for Unilang CLI instruction syntax.
# CLI Integration Guide: Avoiding the Argv Misuse Pitfall

## Overview

This guide explains the critical difference between parsing shell arguments and parsing strings, and why using the wrong parsing method leads to broken quote handling in CLI applications.

---

## The Problem: Shell Arguments Are Pre-Tokenized

When your CLI application receives arguments from the shell, the shell has **already performed tokenization**, handling quotes, escapes, and whitespace. The `argv` you receive is a vector of **pre-split tokens**.

### Example

```bash
$ my-app .deploy --name "Production Server" --region "us-east-1"
```

The shell passes to your app:
```rust
argv = [
  "my-app",
  ".deploy",
  "--name",
  "Production Server",    // ← Single token (quotes removed by shell)
  "--region",
  "us-east-1"             // ← Single token
]
```

Notice:
- Quotes are **removed** by the shell
- Values with spaces are **single tokens**
- Whitespace inside quotes is **preserved**

---

## The Pitfall: Re-Tokenizing Pre-Tokenized Arguments

### ❌ Incorrect Pattern (Breaks Quote Handling)

```rust
use unilang_parser :: { Parser, UnilangParserOptions };

fn main() -> Result< (), Box<dyn std ::error ::Error >> {
  let argv : Vec<String> = std ::env ::args().collect();

  // MISTAKE #1: Join pre-tokenized argv back into a string
  let joined = argv.join(" ");
  // joined = "my-app .deploy --name Production Server --region us-east-1"

  // MISTAKE #2: Use parse_repl_input (which splits on whitespace) on shell argv
  let parser = Parser ::new( UnilangParserOptions ::default() );
  let instruction = parser.parse_repl_input(&joined)?;

  // Result: "Production Server" becomes TWO separate arguments: "Production" and "Server"
  // User expectation: BROKEN ❌

  Ok(())
}
```

### What Went Wrong

1. **Shell tokenized:** `"Production Server"` → single token `Production Server`
2. **You joined:** `["--name", "Production Server"]``"--name Production Server"`
3. **You re-split:** `"--name Production Server"``["--name", "Production", "Server"]`
4. **Result:** Original intent lost - user's quoted argument is now split incorrectly

### Why split_whitespace() Fails

`split_whitespace()` has **no quote handling**. It doesn't know that `"Production Server"` was originally quoted. All it sees is whitespace to split on.

---

## The Solution: Use parse_from_argv()

### ✅ Correct Pattern (Preserves Shell Tokenization)

```rust
use unilang_parser :: { Parser, UnilangParserOptions };

fn main() -> Result< (), Box<dyn std ::error ::Error >> {
  let argv : Vec<String> = std ::env ::args().collect();

  // CORRECT: Pass pre-tokenized argv directly
  let parser = Parser ::new( UnilangParserOptions ::default() );
  let instruction = parser.parse_from_argv(&argv)?;

  // Result: "Production Server" remains a SINGLE argument
  // User expectation: PRESERVED ✅

  Ok(())
}
```

### Why This Works

`parse_from_argv()` treats each argv element as a **pre-tokenized unit**. It doesn't re-split on whitespace because the shell already did that work correctly.

---

## When to Use Each Method

### Use `parse_from_argv(&argv)` When:

✅ Building a CLI application that receives arguments from the shell
✅ Receiving `std ::env ::args()` or similar
✅ User invokes your app like: `my-app .command "value with spaces"`

**Why:** Shell has already tokenized. Re-tokenizing breaks quote handling.

### Use `parse_repl_input(input)` When:

✅ Parsing instruction strings from configuration files
✅ Accepting user input from a REPL or interactive prompt
✅ Processing embedded instructions in source code or scripts
✅ Input is a **raw string** that hasn't been shell-tokenized

**Why:** String needs tokenization. Quotes need to be parsed.

---

## Technical Deep Dive

### What parse_from_argv() Does

1. Takes pre-tokenized argv slice
2. Skips the first element (program name)
3. Treats each element as a complete token (no re-splitting)
4. Applies unilang syntax rules (command paths, named args, etc.)

```rust
pub fn parse_from_argv< S >( &self, argv : &[ S ] ) -> Result< GenericInstruction >
where
  S : AsRef< str > + core ::fmt ::Debug,
{
  // Convert argv to ParsedItem slices (no re-tokenization)
  let items = argv
    .iter()
    .skip( 1 )  // Skip program name
    .enumerate()
    .map( | ( idx, token ) | ParsedItem {
      raw : token.as_ref(),
      source_location : SourceLocation ::SliceSegment { slice_index : idx + 1 },
    })
    .collect();

  self.parse_instruction_from_items( items )
}
```

### What parse_repl_input() Does

1. Takes a raw string
2. Uses `strs_tools` to tokenize (split on whitespace, handle quotes)
3. Applies unilang syntax rules

```rust
pub fn parse_repl_input( &self, input : &str ) -> Result< GenericInstruction >
{
  // Tokenize the input string (handles quotes, escapes)
  let items = self.itemize_string( input )?;

  self.parse_instruction_from_items( items )
}
```

### Key Difference

- **parse_from_argv:** No tokenization (assumes pre-tokenized)
- **parse_repl_input:** Full tokenization (assumes raw string)

---

## Common Scenarios

### Scenario 1: Standard CLI Application

```rust
// User runs: my-app .build --output "dist/my app"

fn main() -> Result< (), Box<dyn std ::error ::Error >> {
  let argv : Vec<String> = std ::env ::args().collect();

  let parser = Parser ::new( UnilangParserOptions ::default() );
  let instruction = parser.parse_from_argv(&argv)?;  // ✅ Correct

  // instruction.named_args["output"] = "dist/my app" (single value, space preserved)
  Ok(())
}
```

### Scenario 2: Configuration File Parsing

```rust
// Config file contains: .deploy region::"us-east-1" name::"Production Server"

fn parse_config( config_content : &str ) -> Result< GenericInstruction, Box<dyn std ::error ::Error >> {
  let parser = Parser ::new( UnilangParserOptions ::default() );
  let instruction = parser.parse_repl_input(config_content)?;  // ✅ Correct

  // instruction.named_args["name"] = "Production Server" (quotes parsed correctly)
  Ok(())
}
```

### Scenario 3: Interactive REPL

```rust
// User types: .search query::"rust parser" limit::10

fn handle_repl_input( input : &str ) -> Result< GenericInstruction, Box<dyn std ::error ::Error >> {
  let parser = Parser ::new( UnilangParserOptions ::default() );
  let instruction = parser.parse_repl_input(input)?;  // ✅ Correct

  Ok(())
}
```

### Scenario 4: ❌ WRONG - Re-tokenizing Shell Argv

```rust
// User runs: my-app .deploy --name "Production Server"

fn main() -> Result< (), Box<dyn std ::error ::Error >> {
  let argv : Vec<String> = std ::env ::args().collect();
  let joined = argv[1..].join(" ");  // ❌ WRONG: Loses token boundaries

  let parser = Parser ::new( UnilangParserOptions ::default() );
  let instruction = parser.parse_repl_input(&joined)?;

  // BROKEN: instruction now has TWO args: "Production" and "Server"
  Ok(())
}
```

---

## Summary: Decision Tree

```
Is your input coming from the shell (std::env::args)?
│
├─ YES → Use parse_from_argv(&argv)
│         Reason: Shell already tokenized
│
└─ NO → Is your input a raw string?
        ├─ YES → Use parse_repl_input(input)
        │         Reason: String needs tokenization
        └─ Are you converting argv to a string then re-parsing?
                  → ❌ STOP! This is the pitfall.
                    Use parse_from_argv(&argv) instead.
```

---

## Real-World Production Bug

This pitfall caused a real production bug in the `wflow` project:

**User Command:**
```bash
$ wflow .languages --path "src/my project"
```

**Expected Behavior:**
- `--path` argument receives single value: `"src/my project"`

**Actual Behavior (Bug):**
- argv was joined: `".languages --path src/my project"`
- Re-split on whitespace: `[".languages", "--path", "src/my", "project"]`
- Parser saw TWO positional args: `"src/my"` and `"project"`
- Command failed: "unexpected positional argument 'project'"

**Root Cause:**
Code used `argv.join(" ")` followed by `parse_repl_input()`, destroying the shell's tokenization.

**Fix:**
Changed to `parse_from_argv(&argv)`, preserving shell tokenization.

---

## Prevention Checklist

When integrating unilang_parser into a CLI application:

- [ ] Are you receiving arguments from the shell (`std ::env ::args()`)?
- [ ] Are you using `parse_from_argv(&argv)` (NOT `parse_repl_input`)?
- [ ] Are you avoiding `argv.join(" ")` or similar string concatenation?
- [ ] Have you tested with arguments containing spaces (e.g., `--name "foo bar"`)?
- [ ] Have you verified quote handling is preserved end-to-end?

---

## Related Documentation

- [Task 086: Prevent Argv Misuse Pitfall]../task/086_prevent_argv_misuse_pitfall.md - Original task specification
- [Parser Engine Source]../src/parser_engine.rs - Implementation of parse_from_argv() and parse_repl_input()
- [Argv Multiword Bug Test]../tests/argv_multiword_bug_test.rs - Regression test for this pitfall

---

## Questions?

If you're unsure which method to use, ask yourself:

**"Has the input already been tokenized by something else (like the shell)?"**

- **YES** → Use `parse_from_argv()` (don't re-tokenize)
- **NO** → Use `parse_repl_input()` (needs tokenization)