RuVector Dataset Discovery Framework
Find hidden patterns and connections in massive datasets that traditional tools miss.
RuVector turns your data (research papers, climate records, financial filings) into a connected graph, then uses vector similarity, minimum-cut analysis, and causality testing to spot emerging trends, cross-domain relationships, and regime shifts before they become obvious.
Why RuVector?
Most data analysis tools excel at answering questions you already know to ask. RuVector is different: it helps you discover what you don't know you're looking for.
Real-world examples:
- Research: Spot a new field forming 6-12 months before it gets a name, by detecting when papers start citing across traditional boundaries
- Climate: Detect regime shifts in weather patterns that correlate with economic disruptions
- Finance: Find companies whose narratives are diverging from their peers, often an early warning signal
Features
| Feature | What It Does | Why It Matters |
|---|---|---|
| Vector Memory | Stores data as 128-dimensional embeddings | Similar concepts cluster together automatically |
| Graph Structure | Connects related items with weighted edges | Reveals hidden relationships in your data |
| Min-Cut Analysis | Measures how "connected" your network is | A falling min-cut signals regime changes and fragmentation |
| Cross-Domain Detection | Finds bridges between different fields | Discovers unexpected correlations (e.g., climate ↔ finance) |
| Causality Testing | Checks if changes in X predict changes in Y | Moves beyond correlation to actionable insights |
| Statistical Rigor | Reports p-values and effect sizes | Know which findings are real vs. noise |
Performance
- 8.8x faster batch vector insertion (parallel processing)
- 2.9x faster similarity computation (SIMD acceleration)
- Works with millions of records on standard hardware
Quick Start
Prerequisites
# Ensure you're in the ruvector workspace
Run Your First Example
# 1. Performance benchmark - see the speed improvements
# 2. Discovery hunter - find patterns in sample data
# 3. Cross-domain analysis - detect bridges between fields
Domain-Specific Examples
# Climate: Detect weather regime shifts
# Finance: Monitor corporate filing coherence
What You'll See
Discovery Results:
Pattern: Climate ↔ Finance bridge detected
Strength: 0.73 (strong connection)
P-value: 0.031 (statistically significant)
→ Drought indices may predict utility sector
  performance with a 3-period lag
The Discovery Thesis
RuVector's unique combination of vector memory, graph structures, and dynamic minimum cut algorithms enables discoveries that most analysis tools miss:
- Emerging patterns before they have names: Detect topic splits and merges as cut boundaries shift over time
- Non-obvious cross-domain bridges: Find small "connector" subgraphs where disciplines quietly start citing each other
- Causal leverage maps: Link funders, labs, venues, and downstream citations to spot high-impact intervention points
- Regime shifts in time series: Use coherence breaks to flag fundamental changes in system behavior
Tutorial
1. Creating the Engine
use ;
use ;
let config = OptimizedConfig ;
let mut engine = new;
2. Adding Data
use HashMap;
use Utc;
// Single vector
let vector = SemanticVector ;
let node_id = engine.add_vector;
// Batch insertion (8.8x faster)
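To make the insertion step more concrete, here is a minimal sketch of what adding a vector to a similarity graph can involve: compare the new vector against the stored ones and create a weighted edge wherever cosine similarity reaches the threshold. `TinyIndex`, its fields, and this `add_vector` are hypothetical stand-ins written for illustration, not the RuVector engine; the 0.65 threshold mirrors the `similarity_threshold` default listed in the configuration reference.

```rust
/// Hypothetical in-memory index used only for illustration (not the RuVector API).
struct TinyIndex {
    /// Minimum cosine similarity required to create an edge.
    threshold: f32,
    vectors: Vec<Vec<f32>>,
    /// Edges as (older node, newer node, cosine similarity).
    edges: Vec<(usize, usize, f32)>,
}

/// Plain scalar cosine similarity; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

impl TinyIndex {
    fn new(threshold: f32) -> Self {
        TinyIndex { threshold, vectors: Vec::new(), edges: Vec::new() }
    }

    /// Stores the vector, links it to every sufficiently similar
    /// existing vector, and returns the new node id.
    fn add_vector(&mut self, v: Vec<f32>) -> usize {
        let id = self.vectors.len();
        for (other, existing) in self.vectors.iter().enumerate() {
            let sim = cosine(existing, &v);
            if sim >= self.threshold {
                self.edges.push((other, id, sim));
            }
        }
        self.vectors.push(v);
        id
    }
}
```

This brute-force comparison is linear in the number of stored vectors per insert, which is one reason a batched, parallel insertion path pays off on large inputs.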
3. Computing Coherence
let snapshot = engine.compute_coherence;
println!;
println!;
println!;
Interpretation:
| Min-cut Trend | Meaning |
|---|---|
| Rising | Network consolidating, stronger connections |
| Falling | Fragmentation, potential regime change |
| Stable | Steady state, consistent structure |
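The interpretation table can be sketched as a small classifier over two consecutive min-cut values. This assumes that sensitivity acts as a relative-change threshold, which is one plausible reading of `mincut_sensitivity`; the `Trend` enum and `classify_trend` function are hypothetical names, not part of the framework.

```rust
/// Direction of the min-cut between two snapshots (illustrative type).
#[derive(Debug, PartialEq)]
enum Trend {
    Rising,
    Falling,
    Stable,
}

/// Classifies the relative change from `prev` to `curr`.
/// `sensitivity` is treated as a relative threshold (an assumption):
/// changes within +/- sensitivity count as Stable.
fn classify_trend(prev: f64, curr: f64, sensitivity: f64) -> Trend {
    if prev == 0.0 {
        // No baseline to compare against; any positive value is growth.
        return if curr > 0.0 { Trend::Rising } else { Trend::Stable };
    }
    let relative_change = (curr - prev) / prev;
    if relative_change > sensitivity {
        Trend::Rising
    } else if relative_change < -sensitivity {
        Trend::Falling
    } else {
        Trend::Stable
    }
}
```

With a threshold of 0.12, a drop from 100.0 to 80.0 (a 20% fall) would be flagged as `Falling`, while a move from 72.5 to 73.3 stays `Stable`.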
4. Pattern Detection
let patterns = engine.detect_patterns_with_significance;
for pattern in patterns.iter.filter
Pattern Types:
| Type | Description | Example |
|---|---|---|
| `CoherenceBreak` | Min-cut dropped significantly | Network fragmentation crisis |
| `Consolidation` | Min-cut increased | Market convergence |
| `BridgeFormation` | Cross-domain connections | Climate-finance link |
| `Cascade` | Temporal causality | Climate → Finance lag-3 |
| `EmergingCluster` | New dense subgraph | Research topic emerging |
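As a rough illustration of filtering detected patterns by significance, the sketch below models the five pattern types as an enum and keeps only patterns whose p-value falls below a chosen threshold. The `Pattern` struct, its fields, and the `significant` helper are assumptions made for this example; the real pattern type returned by the engine may be shaped differently.

```rust
/// The pattern kinds from the table above, modeled as a plain enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PatternType {
    CoherenceBreak,
    Consolidation,
    BridgeFormation,
    Cascade,
    EmergingCluster,
}

/// Hypothetical pattern record: a kind plus the p-value attached to it.
#[derive(Debug)]
struct Pattern {
    kind: PatternType,
    p_value: f64,
}

/// Returns references to the patterns with p_value strictly below `alpha`.
fn significant(patterns: &[Pattern], alpha: f64) -> Vec<&Pattern> {
    patterns.iter().filter(|p| p.p_value < alpha).collect()
}
```

Calling `significant(&patterns, 0.05)` on a list that contains p-values of 0.031, 0.042, and 0.20 would keep the first two and drop the third.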
5. Cross-Domain Analysis
// Check coupling strength
let stats = engine.stats;
let coupling = stats.cross_domain_edges as f64 / stats.total_edges as f64;
println!;
// Domain coherence scores
for domain in
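The coupling ratio above divides two edge counts, so on an empty graph it evaluates `0.0 / 0.0` and yields NaN. A guarded variant might look like the sketch below; `EdgeStats` is a hypothetical stand-in for whatever `engine.stats` returns, reusing only the two field names shown in the snippet.

```rust
/// Hypothetical stand-in for the engine statistics used above.
struct EdgeStats {
    cross_domain_edges: usize,
    total_edges: usize,
}

/// Fraction of edges that connect different domains.
/// Returns 0.0 for a graph with no edges instead of NaN.
fn coupling(stats: &EdgeStats) -> f64 {
    if stats.total_edges == 0 {
        0.0
    } else {
        stats.cross_domain_edges as f64 / stats.total_edges as f64
    }
}
```

A value near 0.0 means the domains are almost disconnected; a value near 1.0 means most edges cross domain boundaries.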
Performance Benchmarks
| Operation | Baseline | Optimized | Speedup |
|---|---|---|---|
| Vector Insertion | 133ms | 15ms | 8.84x |
| SIMD Cosine | 432ms | 148ms | 2.91x |
| Pattern Detection | 524ms | 655ms | 0.80x (slower than baseline) |
Datasets
1. OpenAlex (Research Intelligence)
Best for: Emerging field detection, cross-discipline bridges
- 250M+ works, 90M+ authors
- Native graph structure
- Bulk download + API access
use ;
let radar = new;
let frontiers = radar.detect_emerging_topics;
2. NOAA + NASA (Climate Intelligence)
Best for: Regime shift detection, anomaly prediction
- Weather observations, satellite imagery
- Time series → graph transformation
- Economic risk modeling
use ;
let detector = new;
let shifts = detector.detect_shifts;
3. SEC EDGAR (Financial Intelligence)
Best for: Corporate risk signals, peer divergence
- XBRL financial statements
- 10-K/10-Q filings
- Narrative + fundamental analysis
use ;
let monitor = new;
let alerts = monitor.analyze_filing;
Directory Structure
examples/data/
├── README.md                          # This file
├── Cargo.toml                         # Workspace manifest
├── framework/                         # Core discovery framework
│   ├── src/
│   │   ├── lib.rs                     # Framework exports
│   │   ├── ruvector_native.rs         # Native engine with Stoer-Wagner
│   │   ├── optimized.rs               # SIMD + parallel optimizations
│   │   ├── coherence.rs               # Coherence signal computation
│   │   ├── discovery.rs               # Pattern detection
│   │   └── ingester.rs                # Data ingestion
│   └── examples/
│       ├── cross_domain_discovery.rs  # Cross-domain patterns
│       ├── optimized_benchmark.rs     # Performance comparison
│       └── discovery_hunter.rs        # Novel pattern search
├── openalex/                          # OpenAlex integration
├── climate/                           # NOAA/NASA integration
└── edgar/                             # SEC EDGAR integration
Configuration Reference
OptimizedConfig
| Parameter | Default | Description |
|---|---|---|
| `similarity_threshold` | 0.65 | Minimum cosine similarity for edges |
| `mincut_sensitivity` | 0.12 | Sensitivity to coherence changes |
| `cross_domain` | true | Enable cross-domain discovery |
| `batch_size` | 256 | Parallel batch size |
| `use_simd` | true | Enable SIMD acceleration |
| `significance_threshold` | 0.05 | P-value threshold |
| `causality_lookback` | 10 | Temporal lookback periods |
| `causality_min_correlation` | 0.6 | Minimum correlation for causality |
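For readers who want to see the defaults in code form, the sketch below mirrors the table as a struct with a `Default` implementation and shows how a single field could be overridden for exploratory runs. `ConfigSketch` is a hypothetical mirror of the table, and the field types (`f64`, `usize`, `bool`) are assumptions; the real `OptimizedConfig` may use different types or may not implement `Default`.

```rust
/// Hypothetical mirror of the configuration table (types are assumed).
#[derive(Debug, Clone, PartialEq)]
struct ConfigSketch {
    similarity_threshold: f64,
    mincut_sensitivity: f64,
    cross_domain: bool,
    batch_size: usize,
    use_simd: bool,
    significance_threshold: f64,
    causality_lookback: usize,
    causality_min_correlation: f64,
}

impl Default for ConfigSketch {
    /// Defaults copied from the table above.
    fn default() -> Self {
        ConfigSketch {
            similarity_threshold: 0.65,
            mincut_sensitivity: 0.12,
            cross_domain: true,
            batch_size: 256,
            use_simd: true,
            significance_threshold: 0.05,
            causality_lookback: 10,
            causality_min_correlation: 0.6,
        }
    }
}

/// Exploration profile: lower similarity threshold, everything else default.
fn exploration_config() -> ConfigSketch {
    ConfigSketch {
        similarity_threshold: 0.45,
        ..Default::default()
    }
}
```

Struct update syntax keeps overrides short and makes it obvious which parameters deviate from the documented defaults.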
Discovery Examples
Climate-Finance Bridge
Detected: Climate ↔ Finance bridge
Strength: 0.73
Connections: 197
Hypothesis: Drought indices may predict
utility sector performance with lag-2
Regime Shift Detection
Min-cut trajectory:
t=0: 72.5 (baseline)
t=1: 73.3 (+1.1%)
t=2: 74.5 (+1.6%) ← Consolidation
Effect size: 2.99 (large)
P-value: 0.042 (significant)
Causality Pattern
Climate → Finance causality detected
F-statistic: 4.23
Optimal lag: 3 periods
Correlation: 0.67
P-value: 0.031
Algorithms
Stoer-Wagner Min-Cut
Computes the global minimum cut of a weighted undirected graph.
- Complexity: O(VE + V² log V)
- Use: Network coherence measurement
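To show what the min-cut computation involves, here is a compact Stoer-Wagner sketch over a symmetric adjacency matrix. It is the simple array-based variant, which runs in O(V³); reaching the O(VE + V² log V) bound quoted above requires a priority queue for the maximum-adjacency ordering. The function name and matrix representation are choices made for this example and are not taken from `ruvector_native.rs`.

```rust
/// Weight of the global minimum cut of a weighted undirected graph given as
/// a symmetric adjacency matrix. Returns None for graphs with < 2 vertices.
fn stoer_wagner_min_cut(mut w: Vec<Vec<f64>>) -> Option<f64> {
    let n = w.len();
    if n < 2 {
        return None;
    }
    // active[v] becomes false once v has been merged into another vertex.
    let mut active = vec![true; n];
    let mut best = f64::INFINITY;

    for phase in 0..n - 1 {
        // One "minimum cut phase": grow a set A in maximum-adjacency order.
        let mut in_a = vec![false; n];
        let mut conn = vec![0.0_f64; n]; // total weight from A to each vertex
        let (mut prev, mut last) = (usize::MAX, usize::MAX);

        for _ in 0..n - phase {
            // Pick the active vertex outside A most tightly connected to A.
            let mut sel = usize::MAX;
            for v in 0..n {
                if active[v] && !in_a[v] && (sel == usize::MAX || conn[v] > conn[sel]) {
                    sel = v;
                }
            }
            in_a[sel] = true;
            prev = last;
            last = sel;
            for v in 0..n {
                if active[v] && !in_a[v] {
                    conn[v] += w[sel][v];
                }
            }
        }

        // Cut of the phase: separates the last-added vertex from all others.
        best = best.min(conn[last]);

        // Merge `last` into `prev` by summing their edge weights.
        for v in 0..n {
            if active[v] && v != prev && v != last {
                w[prev][v] += w[last][v];
                w[v][prev] = w[prev][v];
            }
        }
        active[last] = false;
    }
    Some(best)
}
```

For a triangle with edge weights 1, 2, and 3, the cheapest cut isolates the vertex touching the edges of weight 1 and 2, so the function returns `Some(3.0)`.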
SIMD Cosine Similarity
Processes 8 floats per iteration using AVX2.
- Speedup: 2.9x vs scalar
- Fallback: Chunked scalar (4 floats)
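The chunked scalar fallback can be sketched as follows: process four floats per iteration into independent accumulators, then handle the leftover elements separately. This is a guess at the general shape of such a fallback, not the code in `optimized.rs`; the function name is hypothetical, and the AVX2 path itself is not shown.

```rust
/// Cosine similarity using 4-wide chunked accumulation (scalar code only).
/// Returns 0.0 if either vector has zero norm. Panics on length mismatch.
fn cosine_chunked(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");

    // Separate accumulators per lane give the compiler room to vectorize.
    let mut dot = [0.0_f32; 4];
    let mut norm_a = [0.0_f32; 4];
    let mut norm_b = [0.0_f32; 4];

    let chunks_a = a.chunks_exact(4);
    let chunks_b = b.chunks_exact(4);
    // Elements that do not fill a whole chunk of four.
    let (tail_a, tail_b) = (chunks_a.remainder(), chunks_b.remainder());

    for (x, y) in chunks_a.zip(chunks_b) {
        for i in 0..4 {
            dot[i] += x[i] * y[i];
            norm_a[i] += x[i] * x[i];
            norm_b[i] += y[i] * y[i];
        }
    }

    // Reduce the lanes, then fold in the tail elements.
    let mut d: f32 = dot.iter().sum();
    let mut sa: f32 = norm_a.iter().sum();
    let mut sb: f32 = norm_b.iter().sum();
    for (x, y) in tail_a.iter().zip(tail_b) {
        d += x * y;
        sa += x * x;
        sb += y * y;
    }

    if sa == 0.0 || sb == 0.0 {
        0.0
    } else {
        d / (sa.sqrt() * sb.sqrt())
    }
}
```

Because the lanes are summed in a different order than a straight loop, results can differ from a naive implementation in the last few bits, which is normal for floating-point reductions.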
Granger Causality
Tests if past values of X predict Y.
- Compute cross-correlation at lags 1..k
- Find optimal lag with max |correlation|
- Calculate F-statistic
- Convert to p-value
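The first three steps can be sketched as a lag scan: correlate X at time t with Y at time t + lag, keep the lag with the largest absolute correlation, and derive an F-statistic from it. The F formula used here, F = r²(n − 2) / (1 − r²), is the standard one for a simple linear regression with n paired samples. This is a simplification of a full Granger test (which compares autoregressive models with and without X), and the function names are hypothetical. The p-value conversion is omitted because it needs the F-distribution CDF, which the Rust standard library does not provide.

```rust
/// Pearson correlation of two equal-length slices; 0.0 if either is constant.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mean_x) * (b - mean_y);
        sxx += (a - mean_x) * (a - mean_x);
        syy += (b - mean_y) * (b - mean_y);
    }
    if sxx == 0.0 || syy == 0.0 {
        0.0
    } else {
        (sxy / (sxx * syy).sqrt()).clamp(-1.0, 1.0)
    }
}

/// Scans lags 1..=max_lag and returns (lag, correlation, F-statistic) for the
/// lag with the largest |correlation|. Returns None if the series differ in
/// length or are too short for any lag. F is infinite when |r| == 1.
fn best_lag(x: &[f64], y: &[f64], max_lag: usize) -> Option<(usize, f64, f64)> {
    if x.len() != y.len() {
        return None;
    }
    let mut best: Option<(usize, f64, f64)> = None;
    for lag in 1..=max_lag {
        if lag + 3 > x.len() {
            break; // need at least three paired samples
        }
        let xs = &x[..x.len() - lag]; // X at time t
        let ys = &y[lag..]; // Y at time t + lag
        let r = pearson(xs, ys);
        let n = xs.len() as f64;
        let f = r * r * (n - 2.0) / (1.0 - r * r);
        if best.map_or(true, |(_, best_r, _)| r.abs() > best_r.abs()) {
            best = Some((lag, r, f));
        }
    }
    best
}
```

A high correlation at some lag is only a candidate signal; as noted under Best Practices, temporal patterns still need validation by someone who knows the domain.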
Best Practices
- Start with low thresholds: use `similarity_threshold: 0.45` for exploration
- Use batch insertion: `add_vectors_batch()` is about 8.8x faster
- Monitor coherence trends: the min-cut trajectory can signal regime changes
- Filter by significance: focus on `p_value < 0.05`
- Validate causality: temporal patterns need domain expertise
Troubleshooting
| Problem | Solution |
|---|---|
| No patterns detected | Lower mincut_sensitivity to 0.05 |
| Too many edges | Raise similarity_threshold to 0.70 |
| Slow performance | Use --features parallel --release |
| Memory issues | Reduce batch_size |
License
MIT OR Apache-2.0