vtcode 0.99.1

A Rust-based terminal coding agent with modular architecture supporting multiple LLM providers
# VT Code Empirical Evaluation Framework

This directory contains the tools and test cases for performing empirical evaluations of the `vtcode` agent. The framework allows you to measure model performance across categories like safety, logic, coding, and instruction following.

## Getting Started

### Prerequisites

1.  **Build vtcode**: Ensure you have a compiled release binary of `vtcode`.
    ```bash
    cargo build --release
    ```
2.  **Python Environment**: The evaluation engine requires Python 3.
3.  **API Keys**: Set the necessary environment variables (e.g., `GEMINI_API_KEY`, `OPENAI_API_KEY`) in a `.env` file in the project root.

### Running Evaluations

The `eval_engine.py` script orchestrates the evaluation process.

**Basic Usage:**

```bash
python3 evals/eval_engine.py --cases evals/test_cases.json --provider gemini --model gemini-3-flash-preview
```

**Arguments:**

- `--cases`: Path to the test cases JSON file (default: `evals/test_cases.json`).
- `--provider`: The LLM provider to evaluate (e.g., `gemini`, `openai`, `anthropic`).
- `--model`: The specific model ID to evaluate (e.g., `gemini-3-flash-preview`, `gpt-4`).

## Directory Structure

- `eval_engine.py`: The main orchestrator that runs test cases and generates reports.
- `metrics.py`: Contains grading logic and metric implementations.
- `test_cases.json`: The primary benchmark suite.
- `test_cases_mini.json`: A smaller suite for quick validation of the framework.
- `reports/`: Automatically created directory where evaluation results are saved as JSON files.

## Test Case Format

Test cases are defined in JSON format:

```json
{
    "id": "logic_fibonacci",
    "category": "logic",
    "task": "Write a python function to calculate the nth fibonacci number.",
    "metric": "code_validity",
    "language": "python"
}
```

### Supported Metrics

- `exact_match`: Checks if the output exactly matches the `expected` string.
- `contains_match`: Checks if the output contains the `expected` string.
- `code_validity`: Checks if the code within markdown blocks is syntactically correct (supports `python`).
- `llm_grader`: Uses the LLM itself to grade the response based on a `rubric`.

## Analyzing Reports

Reports are saved in the `reports/` directory with a timestamp. They include:

- **Summary**: Total tests, passed, and failed counts.
- **Results**: Detailed breakdown for each test case, including:
    - `output`: The raw agent response.
    - `usage`: Token usage metadata.
    - `latency`: Response time in seconds.
    - `grade`: The score or result from the metric.
    - `reasoning`: The agent's thinking process (if supported by the model).
    - `raw_response`: The complete JSON response from `vtcode ask`.

## Grading with LLMs

The `llm_grader` metric uses `vtcode ask` internally to perform evaluations. By default, it uses `gemini-3-flash-preview` for grading to keep costs low and ensure reliability. You can configure this in `evals/metrics.py`.