# elinor-cli
[crates.io](https://crates.io/crates/elinor-cli)
[GitHub Actions](https://github.com/kampersanda/elinor/actions)
elinor-cli is a set of command-line tools for evaluating IR systems:
- [elinor-evaluate](#elinor-evaluate) evaluates the ranking metrics of the system.
- [elinor-compare](#elinor-compare) compares the metrics of multiple systems with statistical tests.
- [elinor-convert](#elinor-convert) converts the TREC format into the JSONL format for elinor-evaluate.
## Installation
Install from crates.io with Cargo:
```sh
cargo install elinor-cli
```
## Ubiquitous language
Elinor uses the following terms for convenience:
- *True relevance score* means the relevance judgment provided by human assessors.
- *Predicted relevance score* means the similarity score predicted by the system.
## elinor-evaluate
elinor-evaluate evaluates the ranking metrics of the system.
### Input format
elinor-evaluate requires two JSONL files of true and predicted relevance scores.
Each line in the JSONL file should be a JSON object with the following fields:
- `query_id`: The ID of the query.
- `doc_id`: The ID of the document.
- `score`: The relevance score of the query-document pair.
  - If it is a true one, the score should be a non-negative integer (e.g., 0, 1, 2).
  - If it is a predicted one, the score can be a float (e.g., 0.1, 0.5, 1.0).
An example of the JSONL file for the true relevance scores is:
```jsonl
{"query_id":"q_1","doc_id":"d_1","score":2}
{"query_id":"q_1","doc_id":"d_7","score":0}
{"query_id":"q_2","doc_id":"d_3","score":2}
```
An example of the JSONL file for the predicted relevance scores is:
```jsonl
{"query_id":"q_1","doc_id":"d_1","score":0.65}
{"query_id":"q_1","doc_id":"d_4","score":0.23}
{"query_id":"q_2","doc_id":"d_3","score":0.48}
```
The specifications are:
- There is no need to sort the lines in the JSONL files.
- The query-document pairs should be unique in each file.
- The query IDs in the true and predicted files should be the same.
- For binary metrics (e.g., Precision, Recall, F1),
  documents with a true relevance score greater than 0 are considered relevant.
Sample JSONL files are available in the [`test-data/sample`](../test-data/sample/) directory.
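The rules above can be checked before running elinor-evaluate. The following Python sketch is not part of elinor-cli; the function names `load_scores` and `check_same_queries` are hypothetical, and skipping blank lines is an assumption rather than a documented behavior. It only illustrates the constraints on uniqueness, score types, and matching query IDs.

```python
import json


def load_scores(lines, is_true):
    """Parse JSONL lines into {(query_id, doc_id): score}, enforcing the rules above."""
    scores = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are tolerated
        record = json.loads(line)
        key = (record["query_id"], record["doc_id"])
        if key in scores:
            raise ValueError(f"line {line_no}: duplicate query-document pair {key}")
        score = record["score"]
        if is_true:
            # True relevance scores must be non-negative integers.
            if isinstance(score, bool) or not isinstance(score, int) or score < 0:
                raise ValueError(f"line {line_no}: invalid true score {score!r}")
        scores[key] = score
    return scores


def check_same_queries(true_scores, pred_scores):
    """Raise if the two files do not cover the same set of query IDs."""
    true_queries = {query_id for query_id, _ in true_scores}
    pred_queries = {query_id for query_id, _ in pred_scores}
    if true_queries != pred_queries:
        raise ValueError(f"query IDs differ: {sorted(true_queries ^ pred_queries)}")


# The example records shown earlier in this section.
true_lines = [
    '{"query_id":"q_1","doc_id":"d_1","score":2}',
    '{"query_id":"q_1","doc_id":"d_7","score":0}',
    '{"query_id":"q_2","doc_id":"d_3","score":2}',
]
pred_lines = [
    '{"query_id":"q_1","doc_id":"d_1","score":0.65}',
    '{"query_id":"q_1","doc_id":"d_4","score":0.23}',
    '{"query_id":"q_2","doc_id":"d_3","score":0.48}',
]

true_scores = load_scores(true_lines, is_true=True)
pred_scores = load_scores(pred_lines, is_true=False)
check_same_queries(true_scores, pred_scores)
```

In practice, the `lines` argument could be an open file handle, since iterating over a text file yields one line at a time.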
### Example usage
Here is an example that uses the sample JSONL files in the [`test-data/sample`](../test-data/sample/) directory.
If you want to evaluate the Precision@3, Average Precision (AP), Reciprocal Rank (RR), and nDCG@3 metrics, run:
```sh
elinor-evaluate \
--true-jsonl test-data/sample/true.jsonl \
--pred-jsonl test-data/sample/pred_1.jsonl \
--metrics precision@3 ap rr ndcg@3
```
The available metrics are shown in [Metric](https://docs.rs/elinor/latest/elinor/metrics/enum.Metric.html).
The output will show several basic statistics and the macro-averaged scores for each metric:
```
n_queries_in_true 8
n_queries_in_pred 8
n_docs_in_true 20
n_docs_in_pred 24
n_relevant_docs 14
precision@3 0.5833
ap 0.8229
rr 0.8125
ndcg@3 0.8286
```
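To make "macro-averaged" concrete, here is a minimal Python sketch that computes Precision@k and Reciprocal Rank per query and then averages over queries. It is an illustration under stated assumptions, not elinor's implementation: ties are broken by document ID, unjudged documents count as non-relevant, and Precision@k divides by `k` even when fewer than `k` documents are retrieved. The data are the small JSONL examples from the input format section.

```python
def rank_docs(pred):
    """Sort doc IDs by predicted score, highest first (assumption: ties broken by doc ID)."""
    return [doc for doc, _ in sorted(pred.items(), key=lambda kv: (-kv[1], kv[0]))]


def precision_at_k(true, pred, k):
    """Fraction of the top-k ranked documents whose true score is greater than 0."""
    top_k = rank_docs(pred)[:k]
    hits = sum(1 for doc in top_k if true.get(doc, 0) > 0)
    return hits / k  # assumption: always divide by k


def reciprocal_rank(true, pred):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(rank_docs(pred), start=1):
        if true.get(doc, 0) > 0:
            return 1.0 / rank
    return 0.0


def macro_average(values):
    """Plain mean over queries: every query has equal weight."""
    return sum(values) / len(values)


# {query_id: {doc_id: score}} built from the example JSONL records.
true = {"q_1": {"d_1": 2, "d_7": 0}, "q_2": {"d_3": 2}}
pred = {"q_1": {"d_1": 0.65, "d_4": 0.23}, "q_2": {"d_3": 0.48}}

p_at_3 = {q: precision_at_k(true[q], pred[q], 3) for q in true}
rr = {q: reciprocal_rank(true[q], pred[q]) for q in true}

print("precision@3", round(macro_average(list(p_at_3.values())), 4))
print("rr", round(macro_average(list(rr.values())), 4))
```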
The detailed results can be saved to a CSV file by specifying the `--output-csv` option:
```sh
elinor-evaluate \
--true-jsonl test-data/sample/true.jsonl \
--pred-jsonl test-data/sample/pred_1.jsonl \
--output-csv test-data/sample/pred_1.csv \
--metrics precision@3 ap rr ndcg@3
```
The CSV file will contain the scores for each query:
```csv
query_id,precision@3,ap,rr,ndcg@3
q_1,0.6666666666666666,0.5833333333333333,0.5,0.66967181649423
q_2,0.6666666666666666,1.0,1.0,0.8597186998521972
q_3,0.6666666666666666,0.5833333333333333,0.5,0.6199062332840657
q_4,0.6666666666666666,0.5833333333333333,0.5,0.66967181649423
q_5,0.3333333333333333,1.0,1.0,1.0
q_6,0.6666666666666666,0.8333333333333333,1.0,0.9502344167898356
q_7,0.3333333333333333,1.0,1.0,1.0
q_8,0.6666666666666666,1.0,1.0,0.8597186998521972
```
The CSV files can be input to elinor-compare to compare the metrics of multiple systems.
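The per-query CSV and the summary printed on the terminal are related by a simple mean over queries. The Python sketch below, which uses only the standard library and a hypothetical helper named `column_means`, recomputes the averages from the CSV content shown above.

```python
import csv
import io

# The per-query scores from the CSV shown above.
CSV_TEXT = """\
query_id,precision@3,ap,rr,ndcg@3
q_1,0.6666666666666666,0.5833333333333333,0.5,0.66967181649423
q_2,0.6666666666666666,1.0,1.0,0.8597186998521972
q_3,0.6666666666666666,0.5833333333333333,0.5,0.6199062332840657
q_4,0.6666666666666666,0.5833333333333333,0.5,0.66967181649423
q_5,0.3333333333333333,1.0,1.0,1.0
q_6,0.6666666666666666,0.8333333333333333,1.0,0.9502344167898356
q_7,0.3333333333333333,1.0,1.0,1.0
q_8,0.6666666666666666,1.0,1.0,0.8597186998521972
"""


def column_means(csv_text, id_column="query_id"):
    """Return {metric_name: mean over rows} for every column except the ID column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    metrics = [name for name in rows[0] if name != id_column]
    return {m: sum(float(row[m]) for row in rows) / len(rows) for m in metrics}


means = column_means(CSV_TEXT)
for name, value in means.items():
    print(f"{name}\t{value:.4f}")
```

Rounded to four decimals, the means are 0.5833, 0.8229, 0.8125, and 0.8286, which agree with the summary output of elinor-evaluate shown earlier.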
## elinor-compare
elinor-compare compares the metrics of multiple systems with statistical tests.
This tool supports several statistical tests and reports various statistics for in-depth analysis.
It is not limited to IR systems: it can compare any systems whose output is scored with per-topic metrics.
### Input format
elinor-compare requires multiple CSV files that contain the scores of the metrics for each query,
such as the output of elinor-evaluate.
More precisely, the CSV files should have the following columns:
- `topic_id`: The ID of the topic (e.g., query).
  - The column name is arbitrary.
  - The column name must be the same across the CSV files.
  - The topic IDs should be the same across the CSV files.
- `metric_1`, `metric_2`, ...: The scores of the metrics for the topic.
  - The column names are the metric names.
  - The column names should be the same across the CSV files.
  - The metric scores should be floats.
Sample CSV files are available in the [`test-data/sample`](../test-data/sample/) directory.
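These requirements can be checked up front with a few lines of Python. The sketch below is not part of elinor-cli; `read_table` and `check_compatible` are hypothetical names, and it assumes that the topic ID is the first column and that the column order must match across files.

```python
import csv
import io


def read_table(csv_text):
    """Return (header, {topic_id: [scores]}), treating the first column as the topic ID."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    table = {row[0]: [float(x) for x in row[1:]] for row in reader if row}
    return header, table


def check_compatible(csv_texts):
    """Raise ValueError if any file disagrees with the first on columns or topic IDs."""
    first_header, first_table = read_table(csv_texts[0])
    for index, text in enumerate(csv_texts[1:], start=2):
        header, table = read_table(text)
        if header != first_header:
            raise ValueError(f"file {index}: column names differ: {header}")
        if set(table) != set(first_table):
            raise ValueError(f"file {index}: topic IDs differ")


# Hypothetical inputs for illustration only.
system_1 = "query_id,ap,rr\nq_1,0.5,1.0\nq_2,0.8,0.5\n"
system_2 = "query_id,ap,rr\nq_2,0.4,0.5\nq_1,0.3,1.0\n"
check_compatible([system_1, system_2])  # passes: row order does not matter here
```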
### Example usage: Comparing two systems
Here is an example that uses the sample CSV files in the [`test-data/sample`](../test-data/sample/) directory.
If you want to compare the metrics of two systems, run:
```sh
elinor-compare \
--input-csvs test-data/sample/pred_1.csv \
--input-csvs test-data/sample/pred_2.csv
```
The output will be:
```
# Basic statistics
+-----------+-------+
| n_systems | 2 |
| n_topics | 8 |
| n_metrics | 4 |
+-----------+-------+
# Alias
+----------+-----------------------------+
| System_1 | test-data/sample/pred_1.csv |
| System_2 | test-data/sample/pred_2.csv |
+----------+-----------------------------+
# Means
+-------------+----------+----------+
| precision@3 | 0.5833 | 0.2917 |
| ap | 0.8229 | 0.4479 |
| rr | 0.8125 | 0.5625 |
| ndcg@3 | 0.8286 | 0.4649 |
+-------------+----------+----------+
# Two-sided paired Student's t-test for (System_1 - System_2)
+-------------+--------+--------+--------+--------+---------+---------+
| precision@3 | 0.2917 | 0.0774 | 1.0485 | 2.9656 | 0.0209 | 0.2326 |
| ap | 0.3750 | 0.1012 | 1.1789 | 3.3343 | 0.0125 | 0.2659 |
| rr | 0.2500 | 0.0714 | 0.9354 | 2.6458 | 0.0331 | 0.2234 |
| ndcg@3 | 0.3637 | 0.1026 | 1.1356 | 3.2119 | 0.0148 | 0.2677 |
+-------------+--------+--------+--------+--------+---------+---------+
# Two-sided paired Bootstrap test (n_resamples = 10000)
+-------------+---------+
| precision@3 | 0.0240 |
| ap | 0.0292 |
| rr | 0.0602 |
| ndcg@3 | 0.0283 |
+-------------+---------+
# Fisher's randomized test (n_iters = 10000)
+-------------+---------+
| precision@3 | 0.0596 |
| ap | 0.0657 |
| rr | 0.1248 |
| ndcg@3 | 0.0612 |
+-------------+---------+
```
See the following documentation for more details about the statistical tests:
- [Student's t-test](https://docs.rs/elinor/latest/elinor/statistical_tests/student_t_test/struct.StudentTTest.html)
- [Bootstrap test](https://docs.rs/elinor/latest/elinor/statistical_tests/bootstrap_test/struct.BootstrapTest.html)
- [Fisher's randomized test](https://docs.rs/elinor/latest/elinor/statistical_tests/randomized_tukey_hsd_test/struct.RandomizedTukeyHsdTest.html)
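For intuition, the following Python sketch computes the basic quantities behind a paired t-test and a sign-flipping randomization test on per-topic score differences. It is a simplified illustration, not elinor's implementation: the function names, the seed, and the sample scores are hypothetical, the randomization test assumes the absolute mean difference as its statistic, and the t-test p-value is omitted because it needs the Student's t distribution, which is not in Python's standard library.

```python
import math
import random
import statistics


def paired_t_summary(a, b):
    """Mean, unbiased variance, effect size, and t-statistic of the paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = statistics.fmean(diffs)
    var = statistics.variance(diffs)  # divides by n - 1
    return {
        "mean": mean,
        "var": var,
        "effect_size": mean / math.sqrt(var),
        "t_stat": mean / math.sqrt(var / n),
    }


def sign_flip_p_value(a, b, n_iters=10000, seed=0):
    """Two-sided p-value estimated by randomly flipping the sign of each difference."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.fmean(diffs))
    count = 0
    for _ in range(n_iters):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            count += 1
    return count / n_iters


# Hypothetical per-topic scores of two systems on four topics.
system_a = [0.6, 0.7, 0.5, 0.9]
system_b = [0.5, 0.5, 0.4, 0.6]

summary = paired_t_summary(system_a, system_b)
p_value = sign_flip_p_value(system_a, system_b)
print(summary)
print(p_value)
```

As a sanity check on the formulas: a mean difference of 0.2917 with a variance of 0.0774 over 8 topics gives an effect size of about 1.05 and a t-statistic of about 2.97, which is consistent with the precision@3 row of the t-test table above if its first two columns are read as the mean and the variance.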
### Example usage: Comparing three systems
If you want to compare the metrics of three (or more) systems, run:
```sh
elinor-compare \
--input-csvs test-data/sample/pred_1.csv \
--input-csvs test-data/sample/pred_2.csv \
--input-csvs test-data/sample/pred_3.csv
```
The output will be:
```
# Basic statistics
+-----------+-------+
| n_systems | 3 |
| n_topics | 8 |
| n_metrics | 4 |
+-----------+-------+
# Alias
+----------+-----------------------------+
| System_1 | test-data/sample/pred_1.csv |
| System_2 | test-data/sample/pred_2.csv |
| System_3 | test-data/sample/pred_3.csv |
+----------+-----------------------------+
# precision@3
## System means
+----------+--------+---------+
| System_1 | 0.5833 | 0.1498 |
| System_2 | 0.2917 | 0.1498 |
| System_3 | 0.4167 | 0.1498 |
+----------+--------+---------+
## Two-way ANOVA without replication
+-----------------+------------+----+----------+--------+---------+
| Between-systems | 0.3426 | 2 | 0.1713 | 4.3898 | 0.0331 |
| Between-topics | 0.3287 | 7 | 0.0470 | 1.2034 | 0.3623 |
| Residual | 0.5463 | 14 | 0.0390 | | |
+-----------------+------------+----+----------+--------+---------+
## Effect sizes for Tukey HSD test
+----------+----------+----------+----------+
| System_1 | 0.0000 | 1.4765 | 0.8437 |
| System_2 | -1.4765 | 0.0000 | -0.6328 |
| System_3 | -0.8437 | 0.6328 | 0.0000 |
+----------+----------+----------+----------+
## p-values for randomized Tukey HSD test (n_iters = 10000)
+----------+----------+----------+----------+
| System_1 | 1.0000 | 0.0248 | 0.2511 |
| System_2 | 0.0248 | 1.0000 | 0.6557 |
| System_3 | 0.2511 | 0.6557 | 1.0000 |
+----------+----------+----------+----------+
(The statistics for the other metrics will be shown as well.)
```
See the following documentation for more details about the statistical tests:
- [Two-way ANOVA without replication](https://docs.rs/elinor/latest/elinor/statistical_tests/two_way_anova_without_replication/struct.TwoWayAnovaWithoutReplication.html)
- [Tukey HSD test](https://docs.rs/elinor/latest/elinor/statistical_tests/tukey_hsd_test/struct.TukeyHsdTest.html)
- [Randomized Tukey HSD test](https://docs.rs/elinor/latest/elinor/statistical_tests/randomized_tukey_hsd_test/struct.RandomizedTukeyHsdTest.html)
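The ANOVA table can be reproduced from a topics-by-systems score matrix with textbook formulas. The Python sketch below is a hedged illustration rather than elinor's code: `two_way_anova` is a hypothetical name, the score matrix is made up, and p-values are left out because they require the F distribution, which is not in Python's standard library.

```python
def two_way_anova(scores):
    """Two-way ANOVA without replication; scores[i][j] is system j's score on topic i."""
    n_topics = len(scores)
    n_systems = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n_topics * n_systems)
    system_means = [sum(row[j] for row in scores) / n_topics for j in range(n_systems)]
    topic_means = [sum(row) / n_systems for row in scores]

    # Sums of squares: total variation split into systems, topics, and residual.
    ss_systems = n_topics * sum((m - grand_mean) ** 2 for m in system_means)
    ss_topics = n_systems * sum((m - grand_mean) ** 2 for m in topic_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_residual = ss_total - ss_systems - ss_topics

    df_systems = n_systems - 1
    df_topics = n_topics - 1
    df_residual = df_systems * df_topics
    ms_residual = ss_residual / df_residual
    return {
        "system_means": system_means,
        "df": (df_systems, df_topics, df_residual),
        "ss": (ss_systems, ss_topics, ss_residual),
        "f_systems": (ss_systems / df_systems) / ms_residual,
        "f_topics": (ss_topics / df_topics) / ms_residual,
    }


# Hypothetical scores: 3 topics (rows) by 3 systems (columns).
scores = [
    [0.6, 0.2, 0.4],
    [0.8, 0.4, 0.3],
    [0.7, 0.3, 0.5],
]
result = two_way_anova(scores)
print(result)
```

With 3 systems and 8 topics, these formulas give 2, 7, and 14 degrees of freedom, matching the table above, and the between-systems F-statistic is the ratio of the two mean squares (0.1713 / 0.0390, about 4.39).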
### Example usage: Printing the tables in a tab-separated format
If you set `--print-mode raw`, the tables will be printed in a tab-separated format,
enabling you to copy and paste them into a spreadsheet:
```sh
elinor-compare \
--input-csvs test-data/sample/pred_1.csv \
--input-csvs test-data/sample/pred_2.csv \
--print-mode raw
```
## elinor-convert
elinor-convert converts the TREC format into the JSONL format for elinor-evaluate.
For [Qrels](https://trec.nist.gov/data/qrels_eng/) files:
```sh
elinor-convert \
--input-trec qrels.trec \
--output-jsonl qrels.jsonl \
--rel-type true
```
For [Run](https://faculty.washington.edu/levow/courses/ling573_SPR2011/hw/trec_eval_desc.htm) files:
```sh
elinor-convert \
--input-trec run.trec \
--output-jsonl run.jsonl \
--rel-type pred
```
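To show how the two TREC layouts map onto the JSONL fields, here is a rough Python sketch of the conversion. It assumes the conventional whitespace-separated layouts (`query_id iteration doc_id relevance` for Qrels and `query_id Q0 doc_id rank score run_tag` for Run files); the function name `convert_trec_line` is hypothetical, and the way elinor-convert handles edge cases such as negative judgments is not covered.

```python
import json


def convert_trec_line(line, rel_type):
    """Convert one TREC line into a JSONL record for elinor-evaluate (sketch)."""
    fields = line.split()
    if rel_type == "true":
        # Qrels: query_id, iteration (ignored), doc_id, relevance
        query_id, _, doc_id, relevance = fields[:4]
        score = int(relevance)
    elif rel_type == "pred":
        # Run: query_id, "Q0" (ignored), doc_id, rank (ignored), score, run_tag
        query_id, _, doc_id, _, similarity = fields[:5]
        score = float(similarity)
    else:
        raise ValueError(f"unknown rel_type: {rel_type!r}")
    record = {"query_id": query_id, "doc_id": doc_id, "score": score}
    return json.dumps(record, separators=(",", ":"))


qrels_record = convert_trec_line("q_1 0 d_1 2", "true")
run_record = convert_trec_line("q_1 Q0 d_1 1 0.65 my_run", "pred")
print(qrels_record)  # {"query_id":"q_1","doc_id":"d_1","score":2}
print(run_record)    # {"query_id":"q_1","doc_id":"d_1","score":0.65}
```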
## Licensing
Licensed under either of
- Apache License, Version 2.0
([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license
([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.