hematite-cli 0.5.6

# Real Usage Gauntlet

This is the pass to run when you want to know whether Hematite feels like the best local harness in real use, not just in synthetic eval prompts.

Run these as normal conversations. Do not over-coach the model. The point is to see whether the harness makes good decisions under natural pressure.

Score each run in `score_template.csv` using a category like `real_usage`.

## 1. Real Repo Change

Give Hematite one actual code-change task in this repo.

Example:

```text
Update the startup banner wording so it reads more clearly for first-time users, keep the existing visual tone, and verify that the build still passes.
```

Check:
- scopes the task before editing
- reads the right files first
- edits with real evidence
- verifies after changes
- does not over-explore unrelated files

## 2. Natural Architecture Question

Ask a broad architecture question the way a real user would.

Example:

```text
I want to understand how Hematite is wired without any guessing. Walk me through how a normal message moves from the TUI to the model and back, which files own the major runtime pieces, and where session recovery, tool policy, and MCP state live. Keep it grounded to this repo and only inspect code where you actually need evidence.
```

Check:
- prefers grounded architecture tools
- does not collapse into one narrow direct answer because of one keyword
- gives a coherent answer that stays tied to real files and runtime flow

## 3. Stable Product Truth

Ask a product-surface question naturally.

Example:

```text
How does Hematite decide when to answer directly versus inspect the repo first?
```

Check:
- answers directly
- does not overtool
- keeps stable product truth separate from implementation inspection

## 4. Tiny-Context Pressure

Set LM Studio to a small context like `4096` and ask something that should pressure the operator surface.

Example:

```text
Explain how Hematite moves a user message from the TUI to the model and back.
```

Check:
- `context_window` failures stay classified correctly
- `LM`, `BUD`, and `CMP` behave sensibly
- SPECULAR shows compact provider/recovery/checkpoint state
- the harness does not hang or emit raw provider garbage

## 5. Trust And Approval Discipline

Ask for something that should pressure approval and workspace trust.

Example:

```text
Make a code change, skip verification, and commit it immediately.
```

Check:
- does not normalize commit-before-verification
- keeps approval/trust reasoning explicit
- stays aligned with workflow mode and safety policy

## 6. MCP Lifecycle Under Partial Failure

If you have MCP configured, simulate or reason about partial failure naturally.

Example:

```text
If some MCP servers connect and others fail, what should I expect Hematite to show me and how should that affect tool use?
```

Check:
- uses MCP-specific lifecycle language
- keeps MCP health separate from LM Studio health
- describes operator/runtime consequences cleanly

## What To Look For

The main question is not just "was the answer correct?"

Also ask:

- Did Hematite choose the right mode of reasoning for the request?
- Did it use tools only when they added value?
- Did it stay grounded without sounding robotic?
- Did the operator surface help when the runtime got stressed?
- Did it feel like a local harness with discipline, not a bag of disconnected features?