ruvector-data-framework 0.3.0

Core discovery framework for RuVector dataset integrations - find hidden patterns in massive datasets using vector memory, graph structures, and dynamic min-cut algorithms
Documentation
# Real Data Discovery Example

This example demonstrates RuVector's discovery engine on **real academic research papers** fetched from the OpenAlex API.

## What It Does

Fetches actual climate-finance research papers across multiple topics:
- **Climate risk finance** (20 papers)
- **Stranded assets** (15 papers)
- **Carbon pricing markets** (15 papers)
- **Physical climate risk** (15 papers)
- **Transition risk disclosure** (15 papers)

Then runs RuVector's discovery engine to detect:
- Cross-topic bridges (papers connecting different research areas)
- Emerging research clusters
- Consolidation/fragmentation trends
- Anomalous coherence patterns

## Running the Example

```bash
cd examples/data/framework
cargo run --example real_data_discovery
```

## Example Output

```
╔══════════════════════════════════════════════════════════════╗
║     Real Climate-Finance Research Discovery with OpenAlex    ║
║              Powered by RuVector Discovery Engine            ║
╚══════════════════════════════════════════════════════════════╝

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📡 Phase 1: Fetching Research Papers from OpenAlex API

   Querying topics:
   • climate risk finance: fetching 20 papers... ✓ 20 papers
   • stranded assets energy: fetching 15 papers... ✓ 15 papers
   • carbon pricing markets: fetching 15 papers... ✓ 15 papers
   ...

   Total papers fetched: 80
```

## Features

### Real API Integration
- Uses OpenAlex's public API (no authentication required)
- Polite API usage with rate limiting
- Graceful fallback to synthetic data if API fails

### Semantic Analysis
- Simple bag-of-words embeddings (128-dim)
- Converts paper titles + abstracts to vectors
- Preserves citation and concept relationships

### Discovery Engine
- **Graph construction**: Builds semantic graph from paper embeddings
- **Coherence computation**: Dynamic minimum cut algorithm
- **Pattern detection**: Multi-signal trend analysis
  - Cross-topic bridges
  - Emerging clusters
  - Research consolidation/fragmentation
  - Anomaly detection

### Performance
- Processes ~8 papers/second
- Handles 50-100 papers comfortably
- Scalable to larger datasets with optimized backend

## API Rate Limits

OpenAlex allows polite API usage without authentication:
- ~10 requests/second with polite headers
- Built-in retry logic for rate limit errors
- Automatic fallback if API unavailable

To be extra polite, the client includes an email in requests (configurable in the code).

## Customization

### Fetch Different Topics

Edit the `queries` vector in `main()`:

```rust
let queries = vec![
    ("topic_id", "your search query", 20),  // 20 papers
    ("another_topic", "another query", 15), // 15 papers
];
```

### Adjust Discovery Thresholds

Modify the `DiscoveryConfig`:

```rust
let discovery_config = DiscoveryConfig {
    min_signal_strength: 0.01,    // Lower = more patterns
    emergence_threshold: 0.15,     // Cluster growth threshold
    bridge_threshold: 0.25,        // Cross-topic connection threshold
    anomaly_sigma: 2.0,            // Anomaly sensitivity
    ..Default::default()
};
```

### Change Coherence Settings

Adjust the `CoherenceConfig`:

```rust
let coherence_config = CoherenceConfig {
    min_edge_weight: 0.3,          // Similarity threshold
    window_size_secs: 86400 * 365, // Time window (1 year)
    approximate: true,             // Use fast approximate min-cut
    ..Default::default()
};
```

## Understanding Results

### Cross-Topic Bridges
Papers that connect different research areas. High bridge frequency indicates interdisciplinary research.

```
🌉 Cross-Topic Bridges: 3
   1. Climate risk papers bridging to finance literature
      Confidence: 0.85
      Entities: 12 papers
```

### Emerging Clusters
New research areas forming over time. Indicates novel directions.

```
🌱 Emerging Research Clusters: 2
   1. Emerging structure detected: 5 new nodes over 3 windows
      Strength: Moderate
```

### Consolidation/Fragmentation
Shows whether topics are converging or diverging.

```
📈 Consolidating Topics: 1
   • Strengthening trend detected: 3.2% per window
```

## Extending the Example

### Use Advanced Embeddings

Replace `SimpleEmbedder` with a real embedding model:

```rust
// Instead of SimpleEmbedder
use sentence_transformers::SentenceTransformer;

let model = SentenceTransformer::load("all-MiniLM-L6-v2")?;
let embedding = model.encode(&text)?;
```

### Integrate with RuVector Core

Use `ruvector-core` for production vector search:

```rust
use ruvector_core::HnswIndex;

let mut index = HnswIndex::new(128)?;
for record in &records {
    if let Some(embedding) = &record.embedding {
        index.insert(&record.id, embedding)?;
    }
}
```

### Export Results

Save discoveries to JSON:

```rust
use std::fs::File;
use std::io::Write;

let json = serde_json::to_string_pretty(&patterns)?;
let mut file = File::create("discoveries.json")?;
file.write_all(json.as_bytes())?;
```

## Troubleshooting

### API Errors
If you see frequent API errors:
1. Check your internet connection
2. The example will automatically fall back to synthetic data
3. For large queries, add delays between requests

### No Patterns Detected
This is normal with small datasets! Try:
1. Fetching more papers (increase limits)
2. Lowering thresholds in `DiscoveryConfig`
3. Fetching more diverse topics to find bridges

### Out of Memory
For large datasets:
1. Reduce the number of papers fetched
2. Use the `approximate` coherence engine
3. Process in batches

## Learn More

- OpenAlex API: https://docs.openalex.org
- RuVector Discovery: `/examples/data/framework/`
- Min-cut algorithms: `/crates/ruvector-cluster/`