# Touchstone
Touchstone is a Rust library for evaluating streaming anomaly detectors on labeled time-series benchmark datasets. Point it at a directory of CSVs, register one or more detectors, call run(), and get back a Polars DataFrame with one row per (dataset, detector) pair.
Touchstone is built in the spirit of TimeEval [2], a Python benchmarking toolkit for time-series anomaly detection algorithms. If you are looking for datasets, the TimeEval evaluation paper [1] provides a large collection, available from the TimeEval Datasets page, that is already formatted for direct use with Touchstone.
## Quickstart
Add to `Cargo.toml`:

```toml
[dependencies]
touchstone = "0.1"
```
## Implementing the Detector Trait
Your algorithm must implement a single trait:
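A minimal sketch of the trait's shape, reconstructed from the points below (the trait name and exact signature are assumptions; check the crate docs), together with a trivial implementation:

```rust
/// Sketch of the detector interface described below; the real trait
/// may differ in name and may carry additional methods.
pub trait Detector {
    /// Called once per row, in order. `point` holds the feature columns
    /// for the current time step. Returns an anomaly score (higher means
    /// more anomalous), or `f32::NAN` while warming up.
    fn update(&mut self, point: &[f32]) -> f32;
}

/// Trivial detector for illustration: scores each point by its first feature.
struct FirstFeature;

impl Detector for FirstFeature {
    fn update(&mut self, point: &[f32]) -> f32 {
        point.first().copied().unwrap_or(f32::NAN)
    }
}
```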
- `point` is a slice of `f32` features for the current time step. The length matches the number of feature columns in the dataset.
- Return an anomaly score as `f32`. Higher values mean more anomalous.
- Return `f32::NAN` during warmup or whenever a score is not yet meaningful. NaN points are excluded from metric computation.
- Scores are min-max normalized to `[0, 1]` before any metric is computed, so the absolute scale of your scores does not matter.
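The normalization step can be pictured as a small standalone function (illustrative only, not Touchstone's internal code): NaN scores pass through unchanged and are skipped when locating the minimum and maximum.

```rust
/// Min-max normalize scores to [0, 1], ignoring NaN entries.
/// Sketch of the behavior described above, not the library's code.
fn minmax_normalize(scores: &[f32]) -> Vec<f32> {
    let finite = scores.iter().copied().filter(|s| !s.is_nan());
    let min = finite.clone().fold(f32::INFINITY, f32::min);
    let max = finite.fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    scores
        .iter()
        .map(|&s| {
            if s.is_nan() {
                f32::NAN // warmup points stay excluded
            } else if range == 0.0 {
                0.0 // constant scores collapse to 0
            } else {
                (s - min) / range
            }
        })
        .collect()
}
```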
## Running an Evaluation
```rust
use std::path::Path;
// `Experiment` and the signatures below are assumed from context;
// this snippet's original imports and arguments were lost.
use touchstone::Experiment;

let mut experiment = Experiment::new(Path::new("data/"));
experiment.add_detector(/* name, detector instance */);
let results = experiment.run();
println!("{:?}", results);
```
## Output DataFrame
`run()` returns a DataFrame with this schema:
| column | type | description |
|---|---|---|
| `dataset` | String | dataset filename (without extension) |
| `detector` | String | name passed to `add_detector` |
| `roc_auc` | f64 | ROC-AUC |
| `pr_auc` | f64 | Precision-Recall AUC |
| `average_precision` | f64 | Average Precision |
| `precision` | f64 | Precision at 90th-percentile threshold |
| `recall` | f64 | Recall at 90th-percentile threshold |
| `f1` | f64 | F1 at 90th-percentile threshold |
| `range_precision` | f64 | Range-based Precision (Tatbul et al., NeurIPS 2018) |
| `range_recall` | f64 | Range-based Recall |
| `range_f_score` | f64 | Range-based F-score |
| `range_auc` | f64 | Range-based AUC |
| `range_pr_vus` | f64 | PR-VUS (Paparrizos et al., PVLDB 2022) |
| `range_roc_vus` | f64 | ROC-VUS |
| `time_sec` | f64 | wall-clock seconds for this detector on this dataset |
If a dataset fails to load or a detector produces only NaN scores, the metric columns for that row contain NaN.
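The thresholded metrics can be sketched as follows, under the assumption that "the 90th-percentile threshold" means flagging scores at or above the 90th percentile as anomalous (Touchstone's exact percentile computation and tie handling may differ):

```rust
/// Precision, recall, and F1 when scores at or above the 90th
/// percentile are predicted anomalous. Illustrative sketch only.
fn prf_at_p90(scores: &[f32], labels: &[u8]) -> (f64, f64, f64) {
    let mut sorted: Vec<f32> = scores.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let idx = ((sorted.len() as f64) * 0.9) as usize;
    let threshold = sorted[idx.min(sorted.len() - 1)];

    let (mut tp, mut fp, mut fnv) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&s, &l) in scores.iter().zip(labels) {
        match (s >= threshold, l == 1) {
            (true, true) => tp += 1.0,
            (true, false) => fp += 1.0,
            (false, true) => fnv += 1.0,
            _ => {}
        }
    }
    let precision = if tp + fp > 0.0 { tp / (tp + fp) } else { 0.0 };
    let recall = if tp + fnv > 0.0 { tp / (tp + fnv) } else { 0.0 };
    let f1 = if precision + recall > 0.0 {
        2.0 * precision * recall / (precision + recall)
    } else {
        0.0
    };
    (precision, recall, f1)
}
```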
## Custom Metrics
If the default metric set does not suit your needs, swap it out entirely by adding metrics before calling `run()`:
```rust
use std::path::Path;
// `Experiment` and the signatures below are assumed from context;
// this snippet's original imports and arguments were lost.
use touchstone::Experiment;

let mut experiment = Experiment::new(Path::new("data/"));
experiment.add_detector(/* name, detector instance */);
// Registering any metric replaces the default metric set.
experiment.add_metric(/* first metric */);
experiment.add_metric(/* second metric */);
```
Implement `Metric` for fully custom scoring:

```rust
use touchstone::Metric;

// The method name and signature below are illustrative; the original
// snippet was lost, so check the trait's actual definition.
struct MyMetric;

impl Metric for MyMetric {
    fn compute(&self, scores: &[f32], labels: &[u8]) -> f64 {
        // ... custom scoring over normalized scores and binary labels ...
        todo!()
    }
}
```
## Dataset Format
Datasets are CSV files with no assumed column names:

```csv
timestamp, feature_1, ..., feature_N, label
2016-04-20 10:35:12, 1.2, 3.4, 0
2016-04-20 10:35:13, 5.6, 7.8, 1
```
- Column 1: timestamp, parsed but ignored
- Columns 2 … N: features, cast to `f32` and passed as `point` to `update()`
- Last column: binary anomaly label, `0` (normal) or `1` (anomaly)
Touchstone passes every row to `update()` in order, simulating a streaming environment. Each detector gets a fresh instance per dataset.
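The replay that the harness performs can be pictured roughly like this std-only sketch (the `replay` helper is hypothetical; Touchstone's real CSV handling will differ, and whether files carry a header row is an assumption here):

```rust
/// Replay one dataset through a scoring closure, row by row, and
/// collect (scores, labels). Sketch of the harness behavior only.
fn replay<F: FnMut(&[f32]) -> f32>(csv: &str, mut update: F) -> (Vec<f32>, Vec<u8>) {
    let mut scores = Vec::new();
    let mut labels = Vec::new();
    // Skip the first line, assumed to be a header row.
    for line in csv.lines().skip(1) {
        let fields: Vec<&str> = line.split(',').map(str::trim).collect();
        // Column 1 is the timestamp (ignored), the last column is the
        // label, and everything in between is a feature.
        let features: Vec<f32> = fields[1..fields.len() - 1]
            .iter()
            .map(|f| f.parse().unwrap())
            .collect();
        labels.push(fields[fields.len() - 1].parse().unwrap());
        scores.push(update(&features));
    }
    (scores, labels)
}
```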
## Running the Built-in Example
This runs a rolling z-score detector (window = 20) against all datasets in data/ and prints the results.
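A rolling z-score detector like the example's can be reproduced as a standalone sketch. The window size of 20 comes from the example; the struct name and exact scoring details here are assumptions, not the example's actual code.

```rust
use std::collections::VecDeque;

/// Rolling z-score detector: scores each point by how many standard
/// deviations its first feature sits from the mean of the previous
/// `window` values. Sketch of the built-in example, not its code.
struct RollingZScore {
    window: usize,
    buf: VecDeque<f32>,
}

impl RollingZScore {
    fn new(window: usize) -> Self {
        Self { window, buf: VecDeque::new() }
    }

    /// Same shape as the detector trait's update(): NaN during warmup.
    fn update(&mut self, point: &[f32]) -> f32 {
        let x = point[0];
        if self.buf.len() < self.window {
            self.buf.push_back(x);
            return f32::NAN; // still warming up
        }
        let n = self.buf.len() as f32;
        let mean = self.buf.iter().sum::<f32>() / n;
        let var = self.buf.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
        let score = if var == 0.0 { 0.0 } else { (x - mean).abs() / var.sqrt() };
        // Slide the window forward.
        self.buf.pop_front();
        self.buf.push_back(x);
        score
    }
}
```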
## References
If you use Touchstone or the TimeEval dataset collection in your work, please cite:
[1] S. Schmidl, P. Wenig, and T. Papenbrock. "Anomaly Detection in Time Series: A Comprehensive Evaluation." PVLDB 15(9), 2022. (Dataset collection and evaluation methodology.)
[2] P. Wenig, S. Schmidl, and T. Papenbrock. "TimeEval: A Benchmarking Toolkit for Time Series Anomaly Detection Algorithms." PVLDB 15(12), 2022.
[3] Touchstone