aether-mcp-utils 0.5.13

# CLAUDE.md - mcp-utils

## Writing Evals

MCP utility evals are ordinary Rust tests that use Crucible helpers. Nextest discovers, filters, schedules, and reports them.

### Eval Directory Structure

Fixtures and prompts can still live on disk:

```
tests/evals/<eval_name>/
├── prompt.md
└── src/
    └── fixture files
```

Load those files from a `#[tokio::test]`, call `run_eval`, then assert on the returned report and filesystem state.

### Creating a New Eval

```rust
use crucible::{Workspace, run_eval};

#[tokio::test]
async fn edit_single_file_eval() -> Result<(), crucible::EvalRunError> {
    let tests_dir = std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("tests");
    let prompt = std::fs::read_to_string(tests_dir.join("evals/edit_single_file/prompt.md"))
        .expect("eval prompt should be readable");
    let workspace = Workspace::from_dir(tests_dir.join("evals/edit_single_file/src"))?;

    let report = run_eval(&runner(), prompt, workspace).await?;

    let contents = std::fs::read_to_string(report.path("src/main.rs")).expect("output file should be readable");
    assert!(contents.contains("Hello, World!"), "{}", report.failure_context());
    assert!(report.tool_called("edit_file"), "{}", report.failure_context());
    Ok(())
}
```

Use test names ending in `_eval` only for real provider-backed evals selected by the repository `evals` nextest group. Fake-agent coverage is normal unit or integration testing and should use normal test names.

### Assertion Style

Prefer normal Rust assertions:

- `assert!(report.path("out.txt").exists(), "{}", report.failure_context())` verifies a file or directory exists.
- `assert!(contents.contains("expected text"), "{}", report.failure_context())` checks file contents.
- `assert!(report.tool_called("edit_file"), "{}", report.failure_context())` checks that a tool was called.
- `assert_eq!(report.tool_call_count("edit_file"), 1, "{}", report.failure_context())` checks exact call count.
- `report.tool_calls("edit_file")` returns calls with `arguments` and `arguments_json()` for detailed inspection.
- Use `std::process` or `tokio::process` directly when a shell command is clearer than a state assertion.

### Workspace Options

- `Workspace::empty()` creates a fresh empty temp directory.
- `Workspace::from_dir(path)` copies fixture contents into a fresh temp directory.
- `Workspace::from_git_repo(url, start_sha, gold_sha, subdir)` clones a repository and checks out the start commit.


### LLM Judge Helper

```rust
use crucible::{BinaryMetric, LlmJudgeContext};
use schemars::schema_for;

fn yes_no_prompt(question: &str, _ctx: &LlmJudgeContext) -> String {
    format!(
        "{question}\n\nRespond with JSON matching this schema:\n{}\n\nOnly return JSON.",
        serde_json::to_string_pretty(&schema_for!(BinaryMetric)).unwrap()
    )
}
```

Run LLM judges explicitly on the report:

```rust
let judgment = report.judge(&judge_llm, |ctx| yes_no_prompt("Did this pass?", ctx)).await;
assert!(judgment.passed(), "{}\n\n{}", judgment.reason(), report.failure_context());
```

### Best Practices

- Keep each eval focused on one primary MCP behavior.
- Prefer file and tool-call assertions before adding an LLM judge.
- Use `BinaryMetric` or `NumericMetric` schemas for judge responses.
- Keep eval prompts clear and user-like.
- Assert on `EvalReport` directly with normal Rust assertions.
- Run evals with `cargo nextest run` or `just test`.