
Anomaly Grid

 █████╗ ███╗   ██╗ ██████╗ ███╗   ███╗ █████╗ ██╗  ██╗   ██╗
██╔══██╗████╗  ██║██╔═══██╗████╗ ████║██╔══██╗██║  ╚██╗ ██╔╝
███████║██╔██╗ ██║██║   ██║██╔████╔██║███████║██║   ╚████╔╝ 
██╔══██║██║╚██╗██║██║   ██║██║╚██╔╝██║██╔══██║██║    ╚██╔╝  
██║  ██║██║ ╚████║╚██████╔╝██║ ╚═╝ ██║██║  ██║███████╗██║   
╚═╝  ╚═╝╚═╝  ╚═══╝ ╚═════╝ ╚═╝     ╚═╝╚═╝  ╚═╝╚══════╝╚═╝   
[ANOMALY-GRID v0.2.2] - SEQUENCE ANOMALY DETECTION ENGINE


A Rust library implementing variable-order Markov chains for sequence anomaly detection in finite alphabets.

What This Library Does

  • Variable-Order Markov Models: Builds contexts of length 1 to max_order from training sequences
  • Hierarchical Context Selection: Uses longest available context, falls back to shorter contexts
  • Configurable Laplace Smoothing: Adjustable α parameter for probability estimation (default: 1.0)
  • Information-Theoretic Scoring: Shannon entropy and KL divergence with lazy computation
  • Memory-Optimized Storage: String interning, trie-based contexts, and small collections
  • Sliding Window Detection: Analyzes test sequences using overlapping windows
  • Parallel Batch Processing: Processes multiple sequences using Rayon

Known Limitations So Far

Memory Complexity: O(|alphabet|^max_order) theoretical, optimized in practice

Measured results with substantial datasets:

| Scenario           | Alphabet Size | Max Order | Elements Processed | Theoretical Max Contexts | Actual Contexts | Memory Usage | Efficiency |
|--------------------|---------------|-----------|--------------------|--------------------------|-----------------|--------------|------------|
| IoT Monitoring     | 20            | 3         | 1.6M               | 8,420                    | 830             | 17.4 MB      | 9.9%       |
| Security Analysis  | 50            | 3         | 1.8M               | 127,550                  | 19,867          | 352.7 MB     | 15.6%      |
| Enterprise Monitor | 100           | 2         | 2.4M               | 10,100                   | 2,892           | 89.9 MB      | 28.6%      |
| Pattern Detection  | 30            | 4         | 2.2M               | 837,930                  | 8,028           | 135.5 MB     | 1.0%       |

Memory optimizations reduce actual usage:

  • String interning reduces duplicate string storage
  • Trie-based storage shares common context prefixes
  • On-demand computation avoids storing precomputed values
  • Small collections use memory-efficient data structures

Algorithm Limitations

  • Standard Markov Model: Basic implementation, no theoretical optimality guarantees
  • Greedy Context Selection: Uses longest available context (hierarchical fallback)
  • Configurable Smoothing: Laplace smoothing with configurable α parameter (default: 1.0)
  • Prefix Compression: Uses trie-based storage for memory efficiency

Practical Constraints

  • Memory Bound: Will fail on large state spaces
  • Sequential Processing: Not designed for real-time, high-volume streams (no plans to change this on my own; it is more than I can take on)

Installation

[dependencies]
anomaly-grid = "0.2.2"

Basic Usage

use anomaly_grid::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create detector (max_order determines memory usage)
    let mut detector = AnomalyDetector::new(3)?;
    
    // Train on normal patterns
    let normal_sequence = vec![
        "A".to_string(), "B".to_string(), "C".to_string(),
        "A".to_string(), "B".to_string(), "C".to_string(),
    ];
    detector.train(&normal_sequence)?;
    
    // Detect anomalies
    let test_sequence = vec![
        "A".to_string(), "X".to_string(), "Y".to_string(),
    ];
    let anomalies = detector.detect_anomalies(&test_sequence, 0.1)?;
    
    for anomaly in anomalies {
        println!("Sequence: {:?}, Likelihood: {:.6}", 
                 anomaly.sequence, anomaly.likelihood);
    }
    
    Ok(())
}

API Reference

AnomalyDetector

// Constructors - return Result due to validation
pub fn new(max_order: usize) -> AnomalyGridResult<Self>
pub fn with_config(config: AnomalyGridConfig) -> AnomalyGridResult<Self>

// Training - builds context tree from sequence
pub fn train(&mut self, sequence: &[String]) -> AnomalyGridResult<()>

// Detection - sliding window analysis
pub fn detect_anomalies(&self, sequence: &[String], threshold: f64) -> AnomalyGridResult<Vec<AnomalyScore>>
pub fn detect_anomalies_with_monitoring(&mut self, sequence: &[String], threshold: f64) -> AnomalyGridResult<Vec<AnomalyScore>>

// Performance monitoring and optimization
pub fn performance_metrics(&self) -> &PerformanceMetrics
pub fn optimize(&mut self, optimization_config: &OptimizationConfig) -> AnomalyGridResult<()>

// Batch processing - parallel analysis
pub fn batch_process_sequences(
    sequences: &[Vec<String>],
    config: &AnomalyGridConfig,
    threshold: f64,
) -> AnomalyGridResult<Vec<Vec<AnomalyScore>>>
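
A minimal batch-processing sketch, assuming AnomalyGridConfig implements Default (not shown in the listing above); adapt the configuration to your own setup:

use anomaly_grid::*;

fn batch_example(sequences: &[Vec<String>]) -> Result<(), Box<dyn std::error::Error>> {
    // Assumption: AnomalyGridConfig provides a Default configuration.
    let config = AnomalyGridConfig::default();
    // All sequences are analyzed in parallel (Rayon) with the same config and threshold.
    let results = AnomalyDetector::batch_process_sequences(sequences, &config, 0.1)?;
    for (i, scores) in results.iter().enumerate() {
        println!("sequence {}: {} anomalous windows", i, scores.len());
    }
    Ok(())
}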

AnomalyScore

pub struct AnomalyScore {
    pub sequence: Vec<String>,      // The analyzed window
    pub likelihood: f64,            // P(sequence|model) ∈ [0,1]
    pub log_likelihood: f64,        // ln(likelihood)
    pub information_score: f64,     // Average -log₂(P(x))
    pub anomaly_strength: f64,      // Normalized score ∈ [0,1]
}

Mathematical Implementation

Probability Estimation

P(next_state | context) = (count + α) / (total_count + α × vocab_size)

where α is configurable (default: 1.0, Laplace smoothing)
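
As an illustration only (not the library's internal code), the smoothed estimate can be computed as:

// Laplace-smoothed transition probability, as defined above.
fn smoothed_probability(count: u64, total_count: u64, vocab_size: u64, alpha: f64) -> f64 {
    (count as f64 + alpha) / (total_count as f64 + alpha * vocab_size as f64)
}

// Example: 3 observations of one transition out of 10 total, vocabulary of 4 symbols, α = 1.0:
// smoothed_probability(3, 10, 4, 1.0) == 4.0 / 14.0 ≈ 0.286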

Shannon Entropy

H(X) = -∑ P(x) log₂ P(x)
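
For reference, a standalone computation of the entropy (in bits) of a transition distribution, illustrative rather than the library's internals:

fn shannon_entropy(probabilities: &[f64]) -> f64 {
    probabilities.iter()
        .filter(|&&p| p > 0.0)      // treat 0 · log₂(0) as 0
        .map(|&p| -p * p.log2())
        .sum()
}

// shannon_entropy(&[0.5, 0.5]) == 1.0 bit; a deterministic transition gives 0.0 bits.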

Context Selection

  1. Try context of length max_order
  2. If no transitions found, try max_order-1
  3. Continue until context of length 1
  4. If still no match, use uniform probability
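
A sketch of this fallback strategy; the has_transitions closure stands in for the lookup into the trained context tree, which is internal to the library:

fn select_context<'a, F>(window: &'a [String], max_order: usize, has_transitions: F) -> &'a [String]
where
    F: Fn(&[String]) -> bool,
{
    // Try the longest suffix of the window first, then progressively shorter ones.
    for order in (1..=max_order.min(window.len())).rev() {
        let context = &window[window.len() - order..];
        if has_transitions(context) {
            return context; // longest context with observed transitions wins
        }
    }
    &[] // no context matched at any order: fall back to a uniform probability
}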

Anomaly Scoring

anomaly_strength = tanh((log_likelihood_component × likelihood_weight + info_score × information_weight) / normalization_factor)

where weights are configurable (default: likelihood_weight=0.7, information_weight=0.3, normalization_factor=10.0)
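
Illustrative only, using the default weights; the exact log-likelihood component is computed internally by the library:

fn anomaly_strength(log_likelihood_component: f64, info_score: f64) -> f64 {
    let (likelihood_weight, information_weight, normalization_factor) = (0.7, 0.3, 10.0);
    ((log_likelihood_component * likelihood_weight + info_score * information_weight)
        / normalization_factor)
        .tanh()
}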

Complexity Analysis

Theoretical Complexity

  • Training: O(n × max_order × |alphabet|)
  • Detection: O(m × max_order) where m = test sequence length
  • Memory: O(|alphabet|^max_order)

Note: Actual performance depends heavily on your data characteristics, sequence patterns, and system resources. Profile your specific use case.

Configuration

Choosing max_order

  • Order 2: Lower memory usage, captures immediate dependencies
  • Order 3: Higher memory usage, captures longer patterns
  • Order 4+: Exponentially higher memory usage

Recommendation: Start with order 2, measure memory usage, then increase if needed.
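
One way to measure this on your own data, sketched here with the monitoring API described below:

use anomaly_grid::*;

fn compare_orders(training: &[String]) -> Result<(), Box<dyn std::error::Error>> {
    for order in 2..=4 {
        let mut detector = AnomalyDetector::new(order)?;
        detector.train(training)?;
        let metrics = detector.performance_metrics();
        println!(
            "order {}: {} contexts, ~{} bytes",
            order, metrics.context_count, metrics.estimated_memory_bytes
        );
    }
    Ok(())
}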

Choosing threshold

  • Lower values (0.001): More sensitive, more detections
  • Higher values (0.1): Less sensitive, fewer detections

Recommendation: Experiment with your data to find appropriate values.
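
For example, a quick sweep over candidate thresholds on a held-out test sequence (a sketch, with an arbitrary set of thresholds):

use anomaly_grid::*;

fn sweep_thresholds(detector: &AnomalyDetector, test: &[String]) -> Result<(), Box<dyn std::error::Error>> {
    for threshold in [0.001, 0.01, 0.05, 0.1] {
        let anomalies = detector.detect_anomalies(test, threshold)?;
        println!("threshold {}: {} windows flagged", threshold, anomalies.len());
    }
    Ok(())
}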

Memory Estimation

fn estimate_max_contexts(alphabet_size: usize, max_order: usize) -> usize {
    (1..=max_order)
        .map(|order| alphabet_size.pow(order as u32))
        .sum()
}

Note: Actual context count depends on sequence patterns and may be much lower.
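
The "Theoretical Max Contexts" column in the limitations table above follows directly from this sum:

assert_eq!(estimate_max_contexts(20, 3), 8_420);     // IoT Monitoring
assert_eq!(estimate_max_contexts(50, 3), 127_550);   // Security Analysis
assert_eq!(estimate_max_contexts(100, 2), 10_100);   // Enterprise Monitor
assert_eq!(estimate_max_contexts(30, 4), 837_930);   // Pattern Detection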

Performance Monitoring

The library includes comprehensive performance monitoring:

let mut detector = AnomalyDetector::new(3)?;
detector.train(&sequence)?;

// Get performance metrics after training
let metrics = detector.performance_metrics();
println!("Training time: {} ms", metrics.training_time_ms);
println!("Context count: {}", metrics.context_count);
println!("Memory estimate: {} bytes", metrics.estimated_memory_bytes);

// Detection with monitoring (updates metrics)
let anomalies = detector.detect_anomalies_with_monitoring(&test_sequence, 0.1)?;
let updated_metrics = detector.performance_metrics();
println!("Detection time: {} ms", updated_metrics.detection_time_ms);

// Apply memory optimizations
use anomaly_grid::OptimizationConfig;
let opt_config = OptimizationConfig::default();
detector.optimize(&opt_config)?;

Suitable Use Cases

Potentially Good Fit

  • System logs with limited event types
  • Network protocols with small command sets
  • User workflows with simple action sequences
  • IoT sensors with categorical states
  • Educational purposes and algorithm learning

Likely Poor Fit

  • Natural language processing (large vocabularies)
  • High-resolution sensor data (continuous values)
  • Real-time processing (high-volume streams)
  • Large state spaces (many unique states)

Recommendation: Test with your specific data to determine suitability.

Error Handling

All functions return AnomalyGridResult<T> which is Result<T, AnomalyGridError>.

Common errors:

  • InvalidMaxOrder: max_order = 0
  • SequenceTooShort: sequence length < min_sequence_length
  • InvalidThreshold: threshold not in [0,1]
  • EmptyContextTree: detection attempted before training
  • MemoryLimitExceeded: too many contexts created
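
A minimal sketch of handling these errors at a call site, assuming AnomalyGridError implements Display (standard for error types):

use anomaly_grid::*;

fn guarded_detection(detector: &AnomalyDetector, test: &[String]) -> Vec<AnomalyScore> {
    match detector.detect_anomalies(test, 0.1) {
        Ok(scores) => scores,
        Err(e) => {
            // e.g. EmptyContextTree if detection runs before training,
            // or InvalidThreshold if the threshold is outside [0, 1]
            eprintln!("detection failed: {e}");
            Vec::new()
        }
    }
}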

Dependencies

[dependencies]
rayon = "1.10.0"    # Parallel batch processing
smallvec = "1.13.0" # Memory-efficient small collections

Minimal dependencies for core functionality and memory optimization.

Testing

# Run all tests
cargo test

# Run comprehensive validation
cargo test --test comprehensive_test_runner

# Run examples
cargo run --example quick_start

Version History

v0.2.2

  • String interning for duplicate elimination
  • Trie-based context storage with prefix sharing
  • On-demand computation with lazy evaluation and caching
  • Small collections optimization using SmallVec
  • Memory pooling infrastructure
  • Enhanced Configuration: Configurable smoothing, weights, and memory limits
  • Expanded Test Suite: 72 tests with mathematical validation

v0.2.0

  • Major refactoring: encapsulated the main functions and expanded the test suite
  • Documentation reduction: keeping everything documented in prose outside the code became chaotic, so the README was simplified; examples and comments remain in the code
  • API consistency: All constructors return Result<T, AnomalyGridError>

Benchmarking Your Use Case

To evaluate if this library suits your needs:

  1. Start small: Test with a subset of your data
  2. Monitor memory: Use the performance metrics
  3. Profile thoroughly: Test with realistic data volumes
  4. Compare alternatives: Benchmark against other solutions

When This Library May Not Be Suitable

  • Production systems requiring guaranteed performance
  • Applications with strict memory constraints
  • Real-time systems with strict latency requirements
  • Systems requiring theoretical optimality guarantees
  • Applications needing adaptive or online learning

License

MIT License - see LICENSE file.