Expand description
anomstream-core — core detectors + streaming primitives + cross-cut contracts.
This crate is the math-first floor of the
anomstream workspace.
It implements the Random Cut Forest (RCF) algorithm from Guha et al.
(ICML 2016) conformant with the
AWS SageMaker RCF specification: reservoir sampling without
replacement (Park et al. 2004), random cuts weighted by per-dimension
range, anomaly score averaged across trees, and hyperparameter bounds
matching the AWS reference (feature_dim, num_trees,
num_samples_per_tree).
Beyond the core forest, the crate ships a set of companion
primitives — per-feature drift detectors, normalisers, streaming
stats, frequency sketches — reused across detection pipelines so
callers can compose RandomCutForest + PerFeatureEwma +
PerFeatureCusum + FeatureDriftDetector + Normalizer + …
without reimplementing the underlying math.
§Workspace charter
anomstream-core = detectors + streaming primitives + cross-cut
contracts. Three categories of cross-cutting contracts live here
(not in a sibling crate) because every downstream layer depends
on them:
metrics::MetricsSink— telemetry trait consumed by every detector, the hot-path sampler, and the triage pipeline.severity::Severity+severity::SeverityBands— classification vocabulary used by both the bare forest (domain::AnomalyScore::severity) and the triage layer’sAlertRecord/AlertClusterer.forest::ForestSnapshot— read-only health view that lets downstream triage consume forest state without reaching into reservoir internals.
The scope is kept deliberately tight: streaming multivariate
anomaly primitives only; no protocol parsers, no ONNX runtimes,
no IP-centric trackers, no SOC-specific triage opinion. Those
belong in anomstream-triage
and anomstream-hotpath.
§Architecture
+-----------------+ +------------------+
| ForestBuilder | | RcfConfig |
+--------+--------+ +---------+--------+
| |
v v
+------+---------------------------+------+
| RandomCutForest | <- aggregate root
| |
| PointStore (ring buffer, repository) |
| |
| trees: Vec<(RandomCutTree, Sampler)> |
+-+---------------------+-----------------+
| |
v v
+---+---------+ +-------+--------+
| RandomCutTr | | ReservoirSampl |
+-------------+ +----------------+
|
v via Visitor trait
+---+----------------+ +-----------------------+
| ScalarScoreVisitor | | AttributionVisitor |
+--------------------+ +-----------------------+Modules are organised in layered fashion: domain (value objects),
tree (storage + cut tree), sampler (reservoir), visitor (scoring
strategies), forest (aggregate root), thresholded (adaptive
threshold layer on top of the forest), pool (bounded per-tenant
detector pool with LRU eviction), bootstrap (cold-start replay
helpers for restart resumption from an upstream TSDB),
group_score (named-group decomposition of the per-dim
attribution vector), attribution_stability (inter-tree
dispersion + confidence for attribution), and meta_drift
(two-sided CUSUM change-point detector over the score stream).
The persistence module is gated behind the serde feature;
pool is gated behind std.
§Companion primitives
Reusable streaming primitives that compose with the forest:
| Module | Purpose |
|---|---|
online_stats | Welford streaming mean + variance |
count_min_sketch | Probabilistic frequency sketch (std-gated) |
normalize | MinMax / ZScore / None per-feature transforms |
per_feature_ewma | Parallel univariate EWMA z-score detector |
per_feature_cusum | Parallel two-sided CUSUM change-point detector |
severity | Ordinal severity bands + classification |
The companion layer is policy-free — detectors return raw
statistics (z-scores, CUSUM magnitudes, min-max transforms);
callers map them to alert severity via SeverityBands or a
custom rule. per_feature_cusum intentionally co-exists with
meta_drift::MetaDriftDetector (scalar CUSUM on the score
stream): use the per-feature variant for attribution, the
meta variant for score-level regime change.
§Example
use anomstream_core::ForestBuilder;
let mut forest = ForestBuilder::<4>::new()
.num_trees(100)
.sample_size(256)
.seed(42)
.build()?;
for point in stream_of_points {
forest.update(point.clone())?;
let score = forest.score(&point)?;
if f64::from(score) > 1.5 {
eprintln!("anomaly: {score}");
}
}§Conformance
anomstream-core enforces the AWS SageMaker hyperparameter bounds at build time:
| Parameter | Range | Default |
|---|---|---|
feature_dim | [1, 10000] | required |
num_trees | [50, 1000] | 100 |
num_samples_per_tree | [1, 2048] | 256 |
time_decay | [0, 1] | 0.1 / sample_size (tracks AWS Java CompactSampler; pass 0.0 to disable recency bias) |
See the crate’s README.md for the full conformance matrix and the
comparison against krcf and the AWS Java port.
§References
- Sudipto Guha, Nina Mishra, Gourav Roy, Okke Schrijvers. “Robust Random Cut Forest Based Anomaly Detection on Streams.” International Conference on Machine Learning, pp. 2712–2721. 2016.
- Byung-Hoon Park, George Ostrouchov, Nagiza F. Samatova, Al Geist. “Reservoir-based random sampling with replacement from data stream.” SIAM International Conference on Data Mining, pp. 492–496. 2004.
- AWS
SageMakerRCF reference.
Re-exports§
pub use adwin::AdwinDetector;pub use adwin::DEFAULT_DELTA as ADWIN_DEFAULT_DELTA;pub use adwin::DEFAULT_WINDOW_CAP as ADWIN_DEFAULT_WINDOW;pub use attribution_stability::AttributionStability;pub use bloom::BloomFilter;pub use bloom::DEFAULT_FALSE_POSITIVE_RATE as BLOOM_DEFAULT_FPR;pub use bloom::MAX_HASHES as BLOOM_MAX_HASHES;pub use bootstrap::BootstrapReport;pub use config::ForestBuilder;pub use config::RcfConfig;pub use count_min_sketch::CountMinSketch;pub use domain::AnomalyScore;pub use domain::BoundingBox;pub use domain::Cut;pub use domain::DiVector;pub use domain::Point;pub use drift_aware::DriftAwareForest;pub use drift_aware::DriftRecoveryConfig;pub use dynamic_forest::DynamicForest;pub use early_term::EarlyTermConfig;pub use early_term::EarlyTermScore;pub use ensemble::chi_squared_survival_even;pub use ensemble::fisher_combine;pub use error::RcfError;pub use error::RcfResult;pub use feature_drift::DriftLevel;pub use feature_drift::FeatureDriftDetector;pub use forensic::ForensicBaseline;pub use forest::ForestSnapshot;pub use forest::PointStore;pub use forest::RandomCutForest;pub use group_score::FeatureGroup;pub use group_score::FeatureGroups;pub use group_score::FeatureGroupsBuilder;pub use group_score::GroupScores;pub use histogram::HistogramConfig;pub use histogram::ScoreHistogram;pub use hyperloglog::DEFAULT_PRECISION as HLL_DEFAULT_PRECISION;pub use hyperloglog::HyperLogLog;pub use hyperloglog::MAX_PRECISION as HLL_MAX_PRECISION;pub use hyperloglog::MIN_PRECISION as HLL_MIN_PRECISION;pub use matrix_profile::MIN_WINDOW as MATRIX_PROFILE_MIN_WINDOW;pub use matrix_profile::MatrixProfile;pub use meta_drift::CusumConfig;pub use meta_drift::DriftKind;pub use meta_drift::DriftVerdict;pub use meta_drift::MetaDriftDetector;pub use metrics::MetricsSink;pub use metrics::NoopSink;pub use normalize::NormParams;pub use normalize::NormStrategy;pub use normalize::Normalizer;pub use online_stats::OnlineStats;pub use per_feature_cusum::DriftDirection;pub use per_feature_cusum::PerFeatureCusum;pub use per_feature_cusum::PerFeatureCusumAccumulator;pub use per_feature_cusum::PerFeatureCusumAlert;pub use per_feature_cusum::PerFeatureCusumConfig;pub use per_feature_cusum::PerFeatureCusumResult;pub use per_feature_ewma::EwmaAccumulator;pub use per_feature_ewma::PerFeatureEwma;pub use per_feature_ewma::PerFeatureEwmaConfig;pub use per_feature_ewma::PerFeatureEwmaResult;pub use pool::ReadinessSummary;pub use pool::TenantForestPool;pub use sampler::ReservoirSampler;pub use sampler::SamplerOp;pub use score_ci::DEFAULT_Z_FACTOR as DEFAULT_CI_Z_FACTOR;pub use score_ci::ScoreWithConfidence;pub use severity::Severity;pub use severity::SeverityBands;pub use shingled::ShingledForest;pub use shingled::ShingledForestBuilder;pub use space_saving::DEFAULT_CAPACITY as SPACE_SAVING_DEFAULT_CAPACITY;pub use space_saving::HeavyHitter;pub use space_saving::HeavyHitterEntry;pub use space_saving::SpaceSaving;pub use tdigest::Centroid;pub use tdigest::DEFAULT_COMPRESSION as TDIGEST_DEFAULT_COMPRESSION;pub use tdigest::TDigest;pub use thresholded::AnomalyGrade;pub use thresholded::EmaStats;pub use thresholded::ThresholdMode;pub use thresholded::ThresholdedConfig;pub use thresholded::ThresholdedForest;pub use thresholded::ThresholdedForestBuilder;pub use tree::InternalData;pub use tree::LeafData;pub use tree::NodeRef;pub use tree::NodeStore;pub use tree::NodeView;pub use tree::NodeViewMut;pub use tree::PointAccessor;pub use tree::RandomCutTree;pub use tsb_ad_m::TsbAdMDataset;pub use univariate_spot::DEFAULT_ALERT_P as SPOT_DEFAULT_ALERT_P;pub use univariate_spot::DEFAULT_QUANTILE as SPOT_DEFAULT_QUANTILE;pub use univariate_spot::PotDetector;pub use visitor::AttributionVisitor;pub use visitor::ScalarScoreVisitor;pub use visitor::ScoreAttributionVisitor;pub use visitor::Visitor;pub use vus_pr::DEFAULT_MAX_BUFFER as VUS_PR_DEFAULT_MAX_BUFFER;pub use vus_pr::range_auc_pr;pub use vus_pr::vus_pr;pub use vus_pr::vus_pr_with_buffer;
Modules§
- adwin
ADWIN(ADaptiveWINdowing) — streaming change-point detector with automatic window sizing.- attribution_
stability - Inter-tree dispersion of the per-dim attribution vector.
- bloom
- Bloom filter — probabilistic set membership with a bounded
false-positive rate (
fpr) and zero false negatives. - bootstrap
- Cold-start bootstrap: warm a detector from historical data before exposing it to live traffic.
- config
- Forest configuration and the
ForestBuilderentry point. - count_
min_ sketch - Count-Min Sketch — probabilistic frequency estimation in constant memory.
- domain
- Domain primitives: pure value objects + dimension helpers.
- drift_
aware - Shadow-forest drift recovery — wraps a live
crate::RandomCutForestplus an optional shadow that warms on the post-drift stream, then atomically replaces the primary once the shadow has seen enough observations. - dynamic_
forest - Runtime-dim wrapper for
RandomCutForest— unblocks heterogeneous multi-tenant deployments where every tenant ships its own feature-vector width (MSSP pools, per-tenant feature extractors). - early_
term - Early-termination scoring — stop traversing trees once the running per-tree mean has converged tightly enough to be actionable.
- ensemble
- Fisher’s method — combine K independent p-values into a joint anomaly score.
- error
- Error types used across the crate.
- feature_
drift - Input-feature drift detector — PSI + KL divergence over a frozen baseline distribution.
- forensic
- Imputation-like forensic baseline — answer “what would this dim have looked like if the point were normal?” by aggregating the per-dim distribution of the forest’s currently-held sample points.
- forest
- Forest aggregate root.
- group_
score - Score decomposition by named feature groups.
- histogram
- Fixed-bin histogram for score / grade / CUSUM streams.
- hyperloglog
HyperLogLog— probabilistic distinct-count (cardinality) estimator.- matrix_
profile - Matrix profile — batch time-series discord / motif detector (STOMP, Zhu et al. 2016).
- meta_
drift - CUSUM change-point detector over the anomaly-score stream.
- metrics
- Metrics sink abstraction for observability wiring.
- normalize
- Per-feature min-max / z-score normalizer.
- online_
stats - Welford online mean + variance accumulator.
- per_
feature_ cusum - Per-feature two-sided CUSUM change-point detector.
- per_
feature_ ewma - Per-feature exponentially-weighted moving average detector.
- persistence
- Optional persistence helpers for
crate::RandomCutForestandcrate::ThresholdedForest. - pool
- Pools of detectors keyed by tenant / stream id.
- sampler
- Streaming reservoir sampling primitives.
- score_
ci - Anomaly score with confidence interval.
- serde_
util - Serde adapters shared across the workspace.
- severity
- Ordinal severity bands derived from a raw anomaly score.
- shingled
- Internal shingling on top of
crate::RandomCutForest. - space_
saving - Space-Saving — deterministic top-K heavy hitters in
O(K)memory. - tdigest
- Streaming quantile estimator — Ted Dunning’s t-digest (Computing Extremely Accurate Quantiles Using t-Digests, 2019).
- thresholded
- Adaptive-threshold layer on top of
crate::RandomCutForest. - tree
- Tree algorithm primitives.
- tsb_
ad_ m - TSB-AD-M — CSV loader for the multivariate split of the
Time-Series Benchmark for Anomaly Detection (Liu & Paparrizos,
NeurIPS2024). - univariate_
spot - Streaming Peaks-Over-Threshold (SPOT / DSPOT) per-dimension anomaly detector — Siffer et al., Anomaly Detection in Streams with Extreme Value Theory, KDD 2017.
- visitor
Visitortrait used bycrate::tree::RandomCutTree::traverseto dispatch per-node callbacks during a root→leaf walk, plus the two production visitors:- vus_pr
- VUS-PR — Volume Under the Surface, Precision-Recall variant. Threshold-free, length-aware quality metric for time-series anomaly detection (Paparrizos et al. VLDB 2022).