ScarfBench CLI: The command line helper tool for scarf bench
This is a companion CLI tool for the SCARF Benchmark. It provides a command-line interface to list and test benchmarks, run agents, submit solutions, and view and explore leaderboards, among other tasks.
Features
- List available benchmarks
- Test and validate benchmarks
- Run agents on benchmark problems
- Submit solutions (to be added)
- View and explore leaderboards (to be added)
Installation
Prerequisites
Before installing the SCARF CLI, ensure you have the following tools installed:
- Docker (Installation Guide) - Runs benchmarks in isolated environments
- Make - Builds and runs projects as specified in makefiles
- Python - Needed only if you want to install scarf with pip (optional)
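A quick way to confirm the prerequisites above are on your PATH is sketched below. The helper name check_tool is hypothetical, and python3 is an assumed executable name for Python; adjust it if your system uses a different one.

```shell
# Print whether a tool is on PATH; return non-zero when it is missing.
# check_tool is a hypothetical helper, not part of the scarf CLI.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

check_tool docker || true
check_tool make || true
# Python is optional (only needed for a pip-based install);
# the executable name python3 is an assumption.
check_tool python3 || true
```

Any line starting with "missing:" points at a tool to install before continuing.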
Install prebuilt binaries via shell script
Install prebuilt binaries via Homebrew
Install prebuilt binaries via cargo
Install prebuilt binaries into your npm project
Build from Source
- Clone the repository:
- Install the project:
- Run the CLI:
scarf --help
ScarfBench CLI: The command line helper tool for scarf bench
Usage: scarf [OPTIONS] <COMMAND>
Commands:
bench A series of subcommands to run on the benchmark applications.
eval Subcommands to run evaluation over the benchmark
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose... Increase verbosity (-v, -vv, -vvv).
-h, --help Print help
-V, --version Print version
Usage
After installation, use scarf --help to explore the command tree.
Top-level command
Usage: scarf [OPTIONS] <COMMAND>
Commands:
bench A series of subcommands to run on the benchmark applications.
eval Subcommands to run evaluation over the benchmark
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose... Increase verbosity (-v, -vv, -vvv).
-h, --help Print help
-V, --version Print version
Quick reference
1. scarf bench: a series of commands for operating on the benchmark
These commands let you interact with the benchmark applications: pulling the latest version, listing the available applications, and running tests. They are mostly intended for developers and maintainers of the benchmark, and for users who want to explore the benchmark applications.
Usage: scarf bench [OPTIONS] <COMMAND>
Commands:
pull Pull the latest (or user specified) version of the benchmark.
list List the application(s) in the benchmark.
test Run regression tests (with `make test`) on the benchmark application(s).
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose... Increase verbosity (-v, -vv, -vvv).
-h, --help Print help
scarf bench pull --help
Pull the latest (or user specified) version of the benchmark.
Usage: scarf bench pull [OPTIONS] --dest <DIR>
Options:
-d, --dest <DIR> Path to where the benchmark is to be saved.
-v, --verbose... Increase verbosity (-v, -vv, -vvv).
--version <VERSION> Version of scarfbench to pull.
-h, --help Print help
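As an illustration of the options above, a pull might look like the sketch below. The destination path is a placeholder chosen for this example, and the guard simply skips the call when scarf is not installed.

```shell
# Hypothetical destination directory; any writable path should do.
BENCH_DIR="${BENCH_DIR:-./scarfbench}"

# Only attempt the pull when the scarf binary is on PATH.
if command -v scarf >/dev/null 2>&1; then
  scarf bench pull --dest "$BENCH_DIR"
fi
```

To pin a specific release instead of the latest, append --version <VERSION> with a version string of your choice.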
scarf bench list --help
List the application(s) in the benchmark.
Usage: scarf bench list [OPTIONS] --benchmark-dir <BENCHMARK_DIR>
Options:
--benchmark-dir <BENCHMARK_DIR> Path to the root of the scarf benchmark.
-v, --verbose... Increase verbosity (-v, -vv, -vvv).
--layer <LAYER> Application layer to list.
-h, --help Print help
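A minimal sketch of listing applications, assuming the benchmark has already been pulled into the placeholder directory below:

```shell
# Hypothetical benchmark root; point this at wherever the benchmark was pulled.
BENCH_DIR="${BENCH_DIR:-./scarfbench}"

# List every application; skipped when scarf is not on PATH.
if command -v scarf >/dev/null 2>&1; then
  scarf bench list --benchmark-dir "$BENCH_DIR"
fi
```

Adding --layer <LAYER> narrows the listing to a single application layer.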
scarf bench test --help
Run regression tests (with make test) on the benchmark application(s).
Usage: scarf bench test [OPTIONS] --benchmark-dir <DIRECTORY>
Options:
--benchmark-dir <DIRECTORY> Path to the root of the scarf benchmark.
-v, --verbose... Increase verbosity (-v, -vv, -vvv).
--layer <LAYER> Application layer to test.
--app <APPLICATION> Application to run the test on.
--dry-run Use dry run instead of full run.
--logs-dest Where to save the logs.
-h, --help Print help
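The sketch below combines the options above into a dry run over the whole benchmark. The directory is a placeholder, and --dry-run is used so that nothing is fully executed while you check the setup.

```shell
# Hypothetical benchmark root; replace with your own path.
BENCH_DIR="${BENCH_DIR:-./scarfbench}"

# Dry-run the regression tests; skipped when scarf is not on PATH.
if command -v scarf >/dev/null 2>&1; then
  scarf bench test --benchmark-dir "$BENCH_DIR" --dry-run
fi
```

Drop --dry-run for a full run, and add --layer <LAYER> or --app <APPLICATION> to restrict the scope.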
2. scarf eval: a series of commands for evaluating agents
These are the key evaluation commands that you will use to run and evaluate agents on the benchmark.
Usage: scarf eval [OPTIONS] <COMMAND>
Commands:
run Evaluate an agent on Scarfbench
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose... Increase verbosity (-v, -vv, -vvv).
-h, --help Print help
scarf eval run --help
Run the evaluation of an agent on the benchmark. This expects the agent to be implemented in the --agent-dir directory as per the agent harness specification.
Usage: scarf eval run [OPTIONS] --benchmark-dir <DIR> --agent-dir <DIR> --source-framework <FRAMEWORK> --target-framework <FRAMEWORK> --eval-out <EVAL_OUT>
Options:
--benchmark-dir <DIR> Path (directory) to the benchmark.
-v, --verbose... Increase verbosity (-v, -vv, -vvv).
--agent-dir <DIR> Path (directory) to agent implementation harness.
--layer <LAYER> Application layer to run agent on.
--app <APP> Application to run the agent on. If layer is specified, this app must lie within that layer.
--source-framework <FRAMEWORK> The source framework for conversion.
--target-framework <FRAMEWORK> The target framework for conversion.
-p, --pass-at-k <K> Value of K to run for generating an Pass@K value. [default: 1]
--eval-out <EVAL_OUT> Output directory where the agent runs and evaluation output are stored.
-j, --jobs <JOBS> Number of parallel jobs to run. [default: 1]
--prepare-only Prepare the evaluation harness to run agents. Think of this as a dry run before actually deploying the agents.
-h, --help Print help
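Putting the required options together, an invocation could look like the following sketch. Every path and both framework names are placeholders invented for this example (the valid framework values are not listed here), so replace them before running. --prepare-only is included so the harness is set up without deploying agents.

```shell
# All values below are placeholders (assumptions); replace them before running.
BENCH_DIR="${BENCH_DIR:-./scarfbench}"   # benchmark root
AGENT_DIR="${AGENT_DIR:-./my-agent}"     # agent implementation harness
EVAL_OUT="${EVAL_OUT:-./eval-out}"       # agent runs and evaluation output
SOURCE_FRAMEWORK="${SOURCE_FRAMEWORK:-<source-framework>}"
TARGET_FRAMEWORK="${TARGET_FRAMEWORK:-<target-framework>}"

# Prepare the evaluation harness; skipped when scarf is not on PATH.
if command -v scarf >/dev/null 2>&1; then
  scarf eval run \
    --benchmark-dir "$BENCH_DIR" \
    --agent-dir "$AGENT_DIR" \
    --source-framework "$SOURCE_FRAMEWORK" \
    --target-framework "$TARGET_FRAMEWORK" \
    --eval-out "$EVAL_OUT" \
    --prepare-only
fi
```

Once the prepared harness looks right, remove --prepare-only to run the agents, and consider -j <JOBS> for parallel runs or -p <K> for a Pass@K value other than the default of 1.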
Quick examples
# Pull benchmark into a directory
# List apps in one layer
# Run benchmark tests for one layer
# Run evaluation
# Validate converted submissions (hidden command)