Skip to main content

Crate anomstream_core

Crate anomstream_core 

Source
Expand description

anomstream-core — core detectors + streaming primitives + cross-cut contracts.

This crate is the math-first floor of the anomstream workspace. It implements the Random Cut Forest (RCF) algorithm from Guha et al. (ICML 2016) conformant with the AWS SageMaker RCF specification: reservoir sampling without replacement (Park et al. 2004), random cuts weighted by per-dimension range, anomaly score averaged across trees, and hyperparameter bounds matching the AWS reference (feature_dim, num_trees, num_samples_per_tree).

Beyond the core forest, the crate ships a set of companion primitives — per-feature drift detectors, normalisers, streaming stats, frequency sketches — reused across detection pipelines so callers can compose RandomCutForest + PerFeatureEwma + PerFeatureCusum + FeatureDriftDetector + Normalizer + … without reimplementing the underlying math.

§Workspace charter

anomstream-core = detectors + streaming primitives + cross-cut contracts. Three categories of cross-cutting contracts live here (not in a sibling crate) because every downstream layer depends on them:

The scope is kept deliberately tight: streaming multivariate anomaly primitives only; no protocol parsers, no ONNX runtimes, no IP-centric trackers, no SOC-specific triage opinion. Those belong in anomstream-triage and anomstream-hotpath.

§Architecture

+-----------------+        +------------------+
|  ForestBuilder  |        |  RcfConfig       |
+--------+--------+        +---------+--------+
         |                           |
         v                           v
  +------+---------------------------+------+
  |          RandomCutForest                |   <- aggregate root
  |                                         |
  |   PointStore (ring buffer, repository)  |
  |                                         |
  |   trees: Vec<(RandomCutTree, Sampler)>  |
  +-+---------------------+-----------------+
    |                     |
    v                     v
+---+---------+   +-------+--------+
| RandomCutTr |   | ReservoirSampl |
+-------------+   +----------------+
    |
    v  via Visitor trait
+---+----------------+    +-----------------------+
| ScalarScoreVisitor |    | AttributionVisitor    |
+--------------------+    +-----------------------+

Modules are organised in layered fashion: domain (value objects), tree (storage + cut tree), sampler (reservoir), visitor (scoring strategies), forest (aggregate root), thresholded (adaptive threshold layer on top of the forest), pool (bounded per-tenant detector pool with LRU eviction), bootstrap (cold-start replay helpers for restart resumption from an upstream TSDB), group_score (named-group decomposition of the per-dim attribution vector), attribution_stability (inter-tree dispersion + confidence for attribution), and meta_drift (two-sided CUSUM change-point detector over the score stream). The persistence module is gated behind the serde feature; pool is gated behind std.

§Companion primitives

Reusable streaming primitives that compose with the forest:

ModulePurpose
online_statsWelford streaming mean + variance
count_min_sketchProbabilistic frequency sketch (std-gated)
normalizeMinMax / ZScore / None per-feature transforms
per_feature_ewmaParallel univariate EWMA z-score detector
per_feature_cusumParallel two-sided CUSUM change-point detector
severityOrdinal severity bands + classification

The companion layer is policy-free — detectors return raw statistics (z-scores, CUSUM magnitudes, min-max transforms); callers map them to alert severity via SeverityBands or a custom rule. per_feature_cusum intentionally co-exists with meta_drift::MetaDriftDetector (scalar CUSUM on the score stream): use the per-feature variant for attribution, the meta variant for score-level regime change.

§Example

use anomstream_core::ForestBuilder;

let mut forest = ForestBuilder::<4>::new()
    .num_trees(100)
    .sample_size(256)
    .seed(42)
    .build()?;

for point in stream_of_points {
    forest.update(point.clone())?;
    let score = forest.score(&point)?;
    if f64::from(score) > 1.5 {
        eprintln!("anomaly: {score}");
    }
}

§Conformance

anomstream-core enforces the AWS SageMaker hyperparameter bounds at build time:

ParameterRangeDefault
feature_dim[1, 10000]required
num_trees[50, 1000]100
num_samples_per_tree[1, 2048]256
time_decay[0, 1]0.1 / sample_size (tracks AWS Java CompactSampler; pass 0.0 to disable recency bias)

See the crate’s README.md for the full conformance matrix and the comparison against krcf and the AWS Java port.

§References

  1. Sudipto Guha, Nina Mishra, Gourav Roy, Okke Schrijvers. “Robust Random Cut Forest Based Anomaly Detection on Streams.” International Conference on Machine Learning, pp. 2712–2721. 2016.
  2. Byung-Hoon Park, George Ostrouchov, Nagiza F. Samatova, Al Geist. “Reservoir-based random sampling with replacement from data stream.” SIAM International Conference on Data Mining, pp. 492–496. 2004.
  3. AWS SageMaker RCF reference.

Re-exports§

pub use adwin::AdwinDetector;
pub use adwin::DEFAULT_DELTA as ADWIN_DEFAULT_DELTA;
pub use adwin::DEFAULT_WINDOW_CAP as ADWIN_DEFAULT_WINDOW;
pub use attribution_stability::AttributionStability;
pub use bloom::BloomFilter;
pub use bloom::DEFAULT_FALSE_POSITIVE_RATE as BLOOM_DEFAULT_FPR;
pub use bloom::MAX_HASHES as BLOOM_MAX_HASHES;
pub use bootstrap::BootstrapReport;
pub use config::ForestBuilder;
pub use config::RcfConfig;
pub use count_min_sketch::CountMinSketch;
pub use domain::AnomalyScore;
pub use domain::BoundingBox;
pub use domain::Cut;
pub use domain::DiVector;
pub use domain::Point;
pub use drift_aware::DriftAwareForest;
pub use drift_aware::DriftRecoveryConfig;
pub use dynamic_forest::DynamicForest;
pub use early_term::EarlyTermConfig;
pub use early_term::EarlyTermScore;
pub use ensemble::chi_squared_survival_even;
pub use ensemble::fisher_combine;
pub use error::RcfError;
pub use error::RcfResult;
pub use feature_drift::DriftLevel;
pub use feature_drift::FeatureDriftDetector;
pub use forensic::ForensicBaseline;
pub use forest::ForestSnapshot;
pub use forest::PointStore;
pub use forest::RandomCutForest;
pub use group_score::FeatureGroup;
pub use group_score::FeatureGroups;
pub use group_score::FeatureGroupsBuilder;
pub use group_score::GroupScores;
pub use histogram::HistogramConfig;
pub use histogram::ScoreHistogram;
pub use hyperloglog::DEFAULT_PRECISION as HLL_DEFAULT_PRECISION;
pub use hyperloglog::HyperLogLog;
pub use hyperloglog::MAX_PRECISION as HLL_MAX_PRECISION;
pub use hyperloglog::MIN_PRECISION as HLL_MIN_PRECISION;
pub use matrix_profile::MIN_WINDOW as MATRIX_PROFILE_MIN_WINDOW;
pub use matrix_profile::MatrixProfile;
pub use meta_drift::CusumConfig;
pub use meta_drift::DriftKind;
pub use meta_drift::DriftVerdict;
pub use meta_drift::MetaDriftDetector;
pub use metrics::MetricsSink;
pub use metrics::NoopSink;
pub use normalize::NormParams;
pub use normalize::NormStrategy;
pub use normalize::Normalizer;
pub use online_stats::OnlineStats;
pub use per_feature_cusum::DriftDirection;
pub use per_feature_cusum::PerFeatureCusum;
pub use per_feature_cusum::PerFeatureCusumAccumulator;
pub use per_feature_cusum::PerFeatureCusumAlert;
pub use per_feature_cusum::PerFeatureCusumConfig;
pub use per_feature_cusum::PerFeatureCusumResult;
pub use per_feature_ewma::EwmaAccumulator;
pub use per_feature_ewma::PerFeatureEwma;
pub use per_feature_ewma::PerFeatureEwmaConfig;
pub use per_feature_ewma::PerFeatureEwmaResult;
pub use pool::ReadinessSummary;
pub use pool::TenantForestPool;
pub use sampler::ReservoirSampler;
pub use sampler::SamplerOp;
pub use score_ci::DEFAULT_Z_FACTOR as DEFAULT_CI_Z_FACTOR;
pub use score_ci::ScoreWithConfidence;
pub use severity::Severity;
pub use severity::SeverityBands;
pub use shingled::ShingledForest;
pub use shingled::ShingledForestBuilder;
pub use space_saving::DEFAULT_CAPACITY as SPACE_SAVING_DEFAULT_CAPACITY;
pub use space_saving::HeavyHitter;
pub use space_saving::HeavyHitterEntry;
pub use space_saving::SpaceSaving;
pub use tdigest::Centroid;
pub use tdigest::DEFAULT_COMPRESSION as TDIGEST_DEFAULT_COMPRESSION;
pub use tdigest::TDigest;
pub use thresholded::AnomalyGrade;
pub use thresholded::EmaStats;
pub use thresholded::ThresholdMode;
pub use thresholded::ThresholdedConfig;
pub use thresholded::ThresholdedForest;
pub use thresholded::ThresholdedForestBuilder;
pub use tree::InternalData;
pub use tree::LeafData;
pub use tree::NodeRef;
pub use tree::NodeStore;
pub use tree::NodeView;
pub use tree::NodeViewMut;
pub use tree::PointAccessor;
pub use tree::RandomCutTree;
pub use tsb_ad_m::TsbAdMDataset;
pub use univariate_spot::DEFAULT_ALERT_P as SPOT_DEFAULT_ALERT_P;
pub use univariate_spot::DEFAULT_QUANTILE as SPOT_DEFAULT_QUANTILE;
pub use univariate_spot::PotDetector;
pub use visitor::AttributionVisitor;
pub use visitor::ScalarScoreVisitor;
pub use visitor::ScoreAttributionVisitor;
pub use visitor::Visitor;
pub use vus_pr::DEFAULT_MAX_BUFFER as VUS_PR_DEFAULT_MAX_BUFFER;
pub use vus_pr::range_auc_pr;
pub use vus_pr::vus_pr;
pub use vus_pr::vus_pr_with_buffer;

Modules§

adwin
ADWIN (ADaptive WINdowing) — streaming change-point detector with automatic window sizing.
attribution_stability
Inter-tree dispersion of the per-dim attribution vector.
bloom
Bloom filter — probabilistic set membership with a bounded false-positive rate (fpr) and zero false negatives.
bootstrap
Cold-start bootstrap: warm a detector from historical data before exposing it to live traffic.
config
Forest configuration and the ForestBuilder entry point.
count_min_sketch
Count-Min Sketch — probabilistic frequency estimation in constant memory.
domain
Domain primitives: pure value objects + dimension helpers.
drift_aware
Shadow-forest drift recovery — wraps a live crate::RandomCutForest plus an optional shadow that warms on the post-drift stream, then atomically replaces the primary once the shadow has seen enough observations.
dynamic_forest
Runtime-dim wrapper for RandomCutForest — unblocks heterogeneous multi-tenant deployments where every tenant ships its own feature-vector width (MSSP pools, per-tenant feature extractors).
early_term
Early-termination scoring — stop traversing trees once the running per-tree mean has converged tightly enough to be actionable.
ensemble
Fisher’s method — combine K independent p-values into a joint anomaly score.
error
Error types used across the crate.
feature_drift
Input-feature drift detector — PSI + KL divergence over a frozen baseline distribution.
forensic
Imputation-like forensic baseline — answer “what would this dim have looked like if the point were normal?” by aggregating the per-dim distribution of the forest’s currently-held sample points.
forest
Forest aggregate root.
group_score
Score decomposition by named feature groups.
histogram
Fixed-bin histogram for score / grade / CUSUM streams.
hyperloglog
HyperLogLog — probabilistic distinct-count (cardinality) estimator.
matrix_profile
Matrix profile — batch time-series discord / motif detector (STOMP, Zhu et al. 2016).
meta_drift
CUSUM change-point detector over the anomaly-score stream.
metrics
Metrics sink abstraction for observability wiring.
normalize
Per-feature min-max / z-score normalizer.
online_stats
Welford online mean + variance accumulator.
per_feature_cusum
Per-feature two-sided CUSUM change-point detector.
per_feature_ewma
Per-feature exponentially-weighted moving average detector.
persistence
Optional persistence helpers for crate::RandomCutForest and crate::ThresholdedForest.
pool
Pools of detectors keyed by tenant / stream id.
sampler
Streaming reservoir sampling primitives.
score_ci
Anomaly score with confidence interval.
serde_util
Serde adapters shared across the workspace.
severity
Ordinal severity bands derived from a raw anomaly score.
shingled
Internal shingling on top of crate::RandomCutForest.
space_saving
Space-Saving — deterministic top-K heavy hitters in O(K) memory.
tdigest
Streaming quantile estimator — Ted Dunning’s t-digest (Computing Extremely Accurate Quantiles Using t-Digests, 2019).
thresholded
Adaptive-threshold layer on top of crate::RandomCutForest.
tree
Tree algorithm primitives.
tsb_ad_m
TSB-AD-M — CSV loader for the multivariate split of the Time-Series Benchmark for Anomaly Detection (Liu & Paparrizos, NeurIPS 2024).
univariate_spot
Streaming Peaks-Over-Threshold (SPOT / DSPOT) per-dimension anomaly detector — Siffer et al., Anomaly Detection in Streams with Extreme Value Theory, KDD 2017.
visitor
Visitor trait used by crate::tree::RandomCutTree::traverse to dispatch per-node callbacks during a root→leaf walk, plus the two production visitors:
vus_pr
VUS-PR — Volume Under the Surface, Precision-Recall variant. Threshold-free, length-aware quality metric for time-series anomaly detection (Paparrizos et al. VLDB 2022).