elinor-cli 0.1.3

# elinor-cli

[![Crates.io](https://img.shields.io/crates/v/elinor-cli)](https://crates.io/crates/elinor-cli)
[![Build Status](https://github.com/kampersanda/elinor/actions/workflows/ci.yml/badge.svg)](https://github.com/kampersanda/elinor/actions)

elinor-cli is a set of command-line tools for evaluating IR systems:

- [elinor-evaluate](#elinor-evaluate) evaluates the ranking metrics of the system.
- [elinor-compare](#elinor-compare) compares the metrics of multiple systems with statistical tests.
- [elinor-convert](#elinor-convert) converts the TREC format into the JSONL format for elinor-evaluate.

## Installation

Simply use cargo to install from crates.io.

```sh
cargo install elinor-cli
```

## Ubiquitous language

Elinor uses the following terms for convenience:

- *True relevance score* means the relevance judgment provided by human assessors.
- *Predicted relevance score* means the similarity score predicted by the system.

## elinor-evaluate

elinor-evaluate evaluates the ranking metrics of the system.

### Input format

elinor-evaluate requires two JSONL files of true and predicted relevance scores.
Each line in the JSONL file should be a JSON object with the following fields:

- `query_id`: The ID of the query.
- `doc_id`: The ID of the document.
- `score`: The relevance score of the query-document pair.
  - If it is a true one, the score should be a non-negative integer (e.g., 0, 1, 2).
  - If it is a predicted one, the score can be a float (e.g., 0.1, 0.5, 1.0).

An example of the JSONL file for the true relevance scores is:

```jsonl
{"query_id":"q_1","doc_id":"d_1","score":2}
{"query_id":"q_1","doc_id":"d_7","score":0}
{"query_id":"q_2","doc_id":"d_3","score":2}
```

An example of the JSONL file for the predicted relevance scores is:

```jsonl
{"query_id":"q_1","doc_id":"d_1","score":0.65}
{"query_id":"q_1","doc_id":"d_4","score":0.23}
{"query_id":"q_2","doc_id":"d_3","score":0.48}
```

The specifications are:

- There is no need to sort the lines in the JSONL files.
- The query-document pairs should be unique in each file.
- The query IDs in the true and predicted files should be the same.
- In binary metrics (e.g., Precision, Recall, F1),
  true relevance scores more than 0 are considered relevant.

Sample JSONL files are available in the [`test-data/sample`](../test-data/sample/) directory.

### Example usage

Here is example usage with sample JSONL files in the [`test-data/sample`](../test-data/sample/) directory.

If you want to evaluate the Precision@3, Average Precision (AP), Reciprocal Rank (RR), and nDCG@3 metrics, run:

```sh
elinor-evaluate \
  --true-jsonl test-data/sample/true.jsonl \
  --pred-jsonl test-data/sample/pred_1.jsonl \
  --metrics precision@3 ap rr ndcg@3
```

The available metrics are shown in [Metric](https://docs.rs/elinor/latest/elinor/metrics/enum.Metric.html).

The output will show several basic statistics and the macro-averaged scores for each metric:

```
n_queries_in_true       8
n_queries_in_pred       8
n_docs_in_true  20
n_docs_in_pred  24
n_relevant_docs    14
precision@3     0.5833
ap      0.8229
rr      0.8125
ndcg@3  0.8286
```

The detailed results can be saved to a CSV file by specifying the `--output-csv` option:

```sh
elinor-evaluate \
  --true-jsonl test-data/sample/true.jsonl \
  --pred-jsonl test-data/sample/pred_1.jsonl \
  --output-csv test-data/sample/pred_1.csv \  # Specify output CSV path
  --metrics precision@3 ap rr ndcg@3
```

The CSV file will contain the scores for each query:

```csv
query_id,precision@3,ap,rr,ndcg@3
q_1,0.6666666666666666,0.5833333333333333,0.5,0.66967181649423
q_2,0.6666666666666666,1.0,1.0,0.8597186998521972
q_3,0.6666666666666666,0.5833333333333333,0.5,0.6199062332840657
q_4,0.6666666666666666,0.5833333333333333,0.5,0.66967181649423
q_5,0.3333333333333333,1.0,1.0,1.0
q_6,0.6666666666666666,0.8333333333333333,1.0,0.9502344167898356
q_7,0.3333333333333333,1.0,1.0,1.0
q_8,0.6666666666666666,1.0,1.0,0.8597186998521972
```

The CSV files can be input to elinor-compare to compare the metrics of multiple systems.

## elinor-compare

elinor-compare compares the metrics of multiple systems with statistical tests.

This tool supports several statistical tests and reports various statistics for in-depth analysis.
This tool is designed not only for IR systems but also for any systems that can be evaluated with metrics.

### Input format

elinor-compare requires multiple CSV files that contain the scores of the metrics for each query,
such as the output of elinor-evaluate.

Precisely, the CSV files should have the following columns:

- `topic_id`: The ID of the topic (e.g., query).
  - The colum name is arbitrary.
  - The column names must be the same across the CSV files.
  - The topic IDs should be the same across the CSV files.
- `metric_1`, `metric_2`, ...: The scores of the metrics for the query.
  - The column names are the metric names.
  - The column names should be the same across the CSV files.
  - The metric scores should be floats.

Sample CSV files are available in the [`test-data/sample`](../test-data/sample/) directory.

### Example usage: Comparing two systems

Here is example usage with sample CSV files in the [`test-data/sample`](../test-data/sample/) directory.

If you want to compare the metrics of two systems, run:

```sh
elinor-compare \
  --input-csvs test-data/sample/pred_1.csv \
  --input-csvs test-data/sample/pred_2.csv
```

The output will be:

```
# Basic statistics
+-----------+-------+
| Key       | Value |
+-----------+-------+
| n_systems | 2     |
| n_topics  | 8     |
| n_metrics | 4     |
+-----------+-------+

# Alias
+----------+-----------------------------+
| Alias    | Path                        |
+----------+-----------------------------+
| System_1 | test-data/sample/pred_1.csv |
| System_2 | test-data/sample/pred_2.csv |
+----------+-----------------------------+

# Means
+-------------+----------+----------+
| Metric      | System_1 | System_2 |
+-------------+----------+----------+
| precision@3 | 0.5833   | 0.2917   |
| ap          | 0.8229   | 0.4479   |
| rr          | 0.8125   | 0.5625   |
| ndcg@3      | 0.8286   | 0.4649   |
+-------------+----------+----------+

# Two-sided paired Student's t-test for (System_1 - System_2)
+-------------+--------+--------+--------+--------+---------+---------+
| Metric      | Mean   | Var    | ES     | t-stat | p-value | 95% MOE |
+-------------+--------+--------+--------+--------+---------+---------+
| precision@3 | 0.2917 | 0.0774 | 1.0485 | 2.9656 | 0.0209  | 0.2326  |
| ap          | 0.3750 | 0.1012 | 1.1789 | 3.3343 | 0.0125  | 0.2659  |
| rr          | 0.2500 | 0.0714 | 0.9354 | 2.6458 | 0.0331  | 0.2234  |
| ndcg@3      | 0.3637 | 0.1026 | 1.1356 | 3.2119 | 0.0148  | 0.2677  |
+-------------+--------+--------+--------+--------+---------+---------+

# Two-sided paired Bootstrap test (n_resamples = 10000)
+-------------+---------+
| Metric      | p-value |
+-------------+---------+
| precision@3 | 0.0240  |
| ap          | 0.0292  |
| rr          | 0.0602  |
| ndcg@3      | 0.0283  |
+-------------+---------+

# Fisher's randomized test (n_iters = 10000)
+-------------+---------+
| Metric      | p-value |
+-------------+---------+
| precision@3 | 0.0596  |
| ap          | 0.0657  |
| rr          | 0.1248  |
| ndcg@3      | 0.0612  |
+-------------+---------+
```

See the following documentation for more details about the statistical tests:

- [Student's t-test](https://docs.rs/elinor/latest/elinor/statistical_tests/student_t_test/struct.StudentTTest.html)
- [Bootstrap test](https://docs.rs/elinor/latest/elinor/statistical_tests/bootstrap_test/struct.BootstrapTest.html)
- [Fisher's randomized test](https://docs.rs/elinor/latest/elinor/statistical_tests/randomized_tukey_hsd_test/struct.RandomizedTukeyHsdTest.html)

### Example usage: Comparing three systems

If you want to compare the metrics of three (or more) systems, run:

```sh
elinor-compare \
  --input-csvs test-data/sample/pred_1.csv \
  --input-csvs test-data/sample/pred_2.csv \
  --input-csvs test-data/sample/pred_3.csv
```

The output will be:

```
# Basic statistics
+-----------+-------+
| Key       | Value |
+-----------+-------+
| n_systems | 3     |
| n_topics  | 8     |
| n_metrics | 4     |
+-----------+-------+

# Alias
+----------+-----------------------------+
| Alias    | Path                        |
+----------+-----------------------------+
| System_1 | test-data/sample/pred_1.csv |
| System_2 | test-data/sample/pred_2.csv |
| System_3 | test-data/sample/pred_3.csv |
+----------+-----------------------------+

# precision@3
## System means
+----------+--------+---------+
| System   | Mean   | 95% MOE |
+----------+--------+---------+
| System_1 | 0.5833 | 0.1498  |
| System_2 | 0.2917 | 0.1498  |
| System_3 | 0.4167 | 0.1498  |
+----------+--------+---------+
## Two-way ANOVA without replication
+-----------------+------------+----+----------+--------+---------+
| Factor          | Variation  | DF | Variance | F-stat | p-value |
+-----------------+------------+----+----------+--------+---------+
| Between-systems | 0.3426     | 2  | 0.1713   | 4.3898 | 0.0331  |
| Between-topics  | 0.3287     | 7  | 0.0470   | 1.2034 | 0.3623  |
| Residual        | 0.5463     | 14 | 0.0390   |        |         |
+-----------------+------------+----+----------+--------+---------+
## Effect sizes for Tukey HSD test
+----------+----------+----------+----------+
| ES       | System_1 | System_2 | System_3 |
+----------+----------+----------+----------+
| System_1 | 0.0000   | 1.4765   | 0.8437   |
| System_2 | -1.4765  | 0.0000   | -0.6328  |
| System_3 | -0.8437  | 0.6328   | 0.0000   |
+----------+----------+----------+----------+
## p-values for randomized Tukey HSD test (n_iters = 10000)
+----------+----------+----------+----------+
| p-value  | System_1 | System_2 | System_3 |
+----------+----------+----------+----------+
| System_1 | 1.0000   | 0.0248   | 0.2511   |
| System_2 | 0.0248   | 1.0000   | 0.6557   |
| System_3 | 0.2511   | 0.6557   | 1.0000   |
+----------+----------+----------+----------+

(The statistics for the other metrics will be shown as well.)
```

See the following documentation for more details about the statistical tests:

- [Two-way ANOVA without replication](https://docs.rs/elinor/latest/elinor/statistical_tests/two_way_anova_without_replication/struct.TwoWayAnovaWithoutReplication.html)
- [Tukey HSD test](https://docs.rs/elinor/latest/elinor/statistical_tests/tukey_hsd_test/struct.TukeyHsdTest.html)
- [Randomized Tukey HSD test](https://docs.rs/elinor/latest/elinor/statistical_tests/randomized_tukey_hsd_test/struct.RandomizedTukeyHsdTest.html)

### Example usage: Printing the tables in a tab-separated format

If you set `--print-mode raw`, the tables will be printed in a tab-separated format,
enabling you to copy and paste them into a spreadsheet:

```sh
elinor-compare \
  --input-csvs test-data/sample/pred_1.csv \
  --input-csvs test-data/sample/pred_2.csv \
  --print-mode raw
```

## elinor-convert

elinor-convert converts the TREC format into the JSONL format for elinor-evaluate.

For [Qrels](https://trec.nist.gov/data/qrels_eng/) files:

```sh
elinor-convert \
  --input-trec qrels.trec \
  --output-jsonl qrels.jsonl \
  --rel-type true
```

For [Run](https://faculty.washington.edu/levow/courses/ling573_SPR2011/hw/trec_eval_desc.htm) files:

```sh
elinor-convert \
  --input-trec run.trec \
  --output-jsonl run.jsonl \
  --rel-type pred
```

## Licensing

Licensed under either of

- Apache License, Version 2.0
  ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license
  ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.