ScarfBench CLI: The command line helper tool for scarf bench
This is a companion CLI tool for the SCARF Benchmark. It provides a command-line interface to list and test benchmarks and to run agents. Submitting solutions and exploring the leaderboard are planned but not yet available.
Features
- List available benchmarks
- Test and validate benchmarks
- Run agents on benchmark problems
- Submit solutions (to be added)
- View and explore leaderboards (to be added)
Installation
Prerequisites
Before installing the SCARF CLI, ensure you have the following tools installed:
- Docker (Installation Guide) - Runs benchmarks in isolated environments
- Make - Builds and runs projects as specified in makefiles
- Python (optional) - Needed only if you want to install scarf with pip
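Before installing, it can help to confirm that the prerequisites are on your PATH. The following is a minimal sketch, not part of the SCARF CLI itself; it assumes the Python interpreter is installed under the name python3, which may differ on your system.

```shell
# Check that the required tools are available on PATH.
missing=0
for tool in docker make; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
    missing=$((missing + 1))
  fi
done

# Python is optional (only needed for a pip-based install).
# Assumption: the interpreter is named python3 on this system.
if command -v python3 >/dev/null 2>&1; then
  echo "found:   python3 (optional)"
else
  echo "absent:  python3 (optional, pip install unavailable)"
fi

echo "$missing required tool(s) missing"
```

If the final line reports anything other than zero, install the missing tools before continuing.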
Install prebuilt binaries via shell script
Install prebuilt binaries via Homebrew
Install prebuilt binaries via cargo
Install prebuilt binaries into your npm project
Build from Source
- Clone the repository:
- Install the project:
- Run the CLI:
scarf --help
███████╗ ██████╗ █████╗ ██████╗ ███████╗ ██████╗ ███████╗ ███╗ ██╗ ██████╗ ██╗ ██╗
██╔════╝ ██╔════╝ ██╔══██╗ ██╔══██╗ ██╔════╝ ██╔══██╗ ██╔════╝ ████╗ ██║ ██╔════╝ ██║ ██║
███████╗ ██║ ███████║ ██████╔╝ █████╗ ██████╔╝ █████╗ ██╔██╗ ██║ ██║ ███████║
╚════██║ ██║ ██╔══██║ ██╔══██╗ ██╔══╝ ██╔══██╗ ██╔══╝ ██║╚██╗██║ ██║ ██╔══██║
███████║ ╚██████╗ ██║ ██║ ██║ ██║ ██║ ██████╔╝ ███████╗ ██║ ╚████║ ╚██████╗ ██║ ██║
╚══════╝ ╚═════╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚══════╝ ╚═╝ ╚═══╝ ╚═════╝ ╚═╝ ╚═╝
ScarfBench CLI: The command line helper tool for scarf bench
Usage: scarf [OPTIONS] <COMMAND>
Commands:
bench A series of subcommands to run on the benchmark applications.
eval Subcommands to run evaluation over the benchmark
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose... Increase verbosity (-v, -vv, -vvv).
-h, --help Print help
-V, --version Print version
Usage
After installation, you can use the SCARF CLI to interact with the SCARF Benchmark. Here are some common commands:
1. List Benchmarks
❯ scarf bench list --help
███████╗ ██████╗ █████╗ ██████╗ ███████╗ ██████╗ ███████╗ ███╗ ██╗ ██████╗ ██╗ ██╗
██╔════╝ ██╔════╝ ██╔══██╗ ██╔══██╗ ██╔════╝ ██╔══██╗ ██╔════╝ ████╗ ██║ ██╔════╝ ██║ ██║
███████╗ ██║ ███████║ ██████╔╝ █████╗ ██████╔╝ █████╗ ██╔██╗ ██║ ██║ ███████║
╚════██║ ██║ ██╔══██║ ██╔══██╗ ██╔══╝ ██╔══██╗ ██╔══╝ ██║╚██╗██║ ██║ ██╔══██║
███████║ ╚██████╗ ██║ ██║ ██║ ██║ ██║ ██████╔╝ ███████╗ ██║ ╚████║ ╚██████╗ ██║ ██║
╚══════╝ ╚═════╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚══════╝ ╚═╝ ╚═══╝ ╚═════╝ ╚═╝ ╚═╝
List the application(s) in the benchmark.
Usage: scarf bench list [OPTIONS] --benchmark-dir <ROOT>
Options:
--benchmark-dir <ROOT> Path to the root of the scarf benchmark.
-v, --verbose... Increase verbosity (-v, -vv, -vvv). If RUST_LOG is set, it takes precedence.
--layer <LAYER> Application layer to list.
-h, --help Print help
The command lists the applications in the benchmark. Pass --layer to restrict the listing to a single layer.
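A listing invocation could look like the following sketch. It uses only the flags shown in the help text above. The checkout path is a hypothetical placeholder, and the layer name persistence is borrowed from the testing example in this README, so treat both as assumptions to adapt to your setup.

```shell
# Hypothetical location of a local benchmark checkout; adjust as needed.
BENCHMARK_DIR="${BENCHMARK_DIR:-$HOME/scarf-benchmark}"
# Assumed layer name, taken from the testing example in this README.
LAYER="persistence"

if command -v scarf >/dev/null 2>&1 && [ -d "$BENCHMARK_DIR" ]; then
  # List every application in the benchmark.
  scarf bench list --benchmark-dir "$BENCHMARK_DIR"
  # List only the applications in one layer.
  scarf bench list --benchmark-dir "$BENCHMARK_DIR" --layer "$LAYER"
else
  echo "skipped: scarf or $BENCHMARK_DIR is not available"
fi
```

Add -v, -vv, or -vvv to either command for more verbose logging. Per the help text, RUST_LOG takes precedence over these flags when it is set.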
2. Test Benchmark Layer(s)
You can use the scarf bench test command to test specific benchmark layers or the whole benchmark. Here are some examples:
❯ scarf bench test --help
███████╗ ██████╗ █████╗ ██████╗ ███████╗ ██████╗ ███████╗ ███╗ ██╗ ██████╗ ██╗ ██╗
██╔════╝ ██╔════╝ ██╔══██╗ ██╔══██╗ ██╔════╝ ██╔══██╗ ██╔════╝ ████╗ ██║ ██╔════╝ ██║ ██║
███████╗ ██║ ███████║ ██████╔╝ █████╗ ██████╔╝ █████╗ ██╔██╗ ██║ ██║ ███████║
╚════██║ ██║ ██╔══██║ ██╔══██╗ ██╔══╝ ██╔══██╗ ██╔══╝ ██║╚██╗██║ ██║ ██╔══██║
███████║ ╚██████╗ ██║ ██║ ██║ ██║ ██║ ██████╔╝ ███████╗ ██║ ╚████║ ╚██████╗ ██║ ██║
╚══════╝ ╚═════╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚══════╝ ╚═╝ ╚═══╝ ╚═════╝ ╚═╝ ╚═╝
Run regression tests (with `make test`) on the benchmark application(s).
Usage: scarf bench test [OPTIONS] --benchmark-dir <DIRECTORY>
Options:
--benchmark-dir <DIRECTORY> Path to the root of the scarf benchmark.
-v, --verbose... Increase verbosity (-v, -vv, -vvv).
--layer <LAYER> Application layer to test.
--app <APPLICATION> Application to run the test on.
--dry-run Use dry run instead of full run.
-h, --help Print help
For example, to test the persistence layer:
This will run make test in every app in the persistence layer and provide a summary of the results.
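The persistence-layer run described above can be sketched as follows, using only flags from the help text. The checkout path is a hypothetical placeholder. The exact behavior of --dry-run is not documented here beyond "use dry run instead of full run", so the sketch simply issues it first as a cautious step.

```shell
# Hypothetical location of a local benchmark checkout; adjust as needed.
BENCHMARK_DIR="${BENCHMARK_DIR:-$HOME/scarf-benchmark}"
TEST_LAYER="persistence"

if command -v scarf >/dev/null 2>&1 && [ -d "$BENCHMARK_DIR" ]; then
  # Dry run first (the help text describes this as a dry run instead of a full run).
  scarf bench test --benchmark-dir "$BENCHMARK_DIR" --layer "$TEST_LAYER" --dry-run
  # Full run over every app in the layer.
  scarf bench test --benchmark-dir "$BENCHMARK_DIR" --layer "$TEST_LAYER"
else
  echo "skipped: scarf or $BENCHMARK_DIR is not available"
fi
```

To narrow the run to one application, the help text lists an --app <APPLICATION> option that takes the application name.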