scirs2-datasets 0.4.1

Datasets module for SciRS2 (scirs2-datasets)
# Data Generation Tutorial

This tutorial shows how to generate synthetic datasets with SciRS2 for classification, regression, clustering, non-linear pattern, and time series tasks.

## Overview

SciRS2 provides data generators for:

- **Classification**: Linear and non-linear classification problems
- **Regression**: Single and multi-output regression with noise
- **Clustering**: Blob-like and hierarchical clustering datasets
- **Non-linear patterns**: Spirals, moons, circles, swiss roll
- **Time series**: Synthetic time series with trends and seasonality
- **Corrupted data**: Datasets with missing values and outliers

## Classification Data Generation

### Basic Classification Dataset

```rust
use scirs2_datasets::make_classification;

// Generate a basic classification dataset
let dataset = make_classification(
    1000,      // n_samples: number of samples
    20,        // n_features: total number of features
    5,         // n_classes: number of target classes
    3,         // n_clusters_per_class: clusters per class
    10,        // n_informative: number of informative features
    Some(42),  // random_state: for reproducibility
)?;

println!("Classification dataset:");
println!("  Samples: {}", dataset.n_samples());
println!("  Features: {}", dataset.n_features());
println!("  Classes: 5");
```

### Advanced Classification with Custom Parameters

```rust
use scirs2_datasets::generators::ClassificationConfig;

// Advanced configuration
let config = ClassificationConfig {
    n_samples: 2000,
    n_features: 50,
    n_classes: 4,
    n_clusters_per_class: 2,
    n_informative: 20,
    n_redundant: 5,
    n_repeated: 3,
    class_sep: 1.5,        // Class separation (higher = easier)
    flip_y: 0.01,          // Label noise (1% of labels flipped)
    weights: Some(vec![0.4, 0.3, 0.2, 0.1]), // Class imbalance
    random_state: Some(42),
};

let dataset = config.generate()?;

// Check class distribution
if let Some(target) = &dataset.target {
    let mut class_counts = std::collections::HashMap::new();
    for &class in target.iter() {
        *class_counts.entry(class as i32).or_insert(0) += 1;
    }
    println!("Class distribution: {:?}", class_counts);
}
```
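To make the effect of `weights` concrete, the following sketch shows one way per-class sample counts could be derived from class weights, using largest-remainder rounding so the counts always sum to `n_samples`. The helper `class_counts_from_weights` is hypothetical and written only for illustration; the generator in SciRS2 may allocate samples differently.

```rust
/// Split `n_samples` across classes in proportion to `weights`.
/// Hypothetical helper for illustration; not part of scirs2-datasets.
fn class_counts_from_weights(n_samples: usize, weights: &[f64]) -> Vec<usize> {
    let total: f64 = weights.iter().sum();
    // Ideal (fractional) share of each class.
    let shares: Vec<f64> = weights
        .iter()
        .map(|w| w / total * n_samples as f64)
        .collect();
    // Start from the floor of every share.
    let mut counts: Vec<usize> = shares.iter().map(|s| s.floor() as usize).collect();
    let assigned: usize = counts.iter().sum();
    // Hand out the samples lost to rounding, largest fractional part first.
    let mut order: Vec<usize> = (0..weights.len()).collect();
    order.sort_by(|&a, &b| {
        let fa = shares[a] - shares[a].floor();
        let fb = shares[b] - shares[b].floor();
        fb.partial_cmp(&fa).unwrap()
    });
    for &i in order.iter().take(n_samples.saturating_sub(assigned)) {
        counts[i] += 1;
    }
    counts
}
```

With the weights from the configuration above, `class_counts_from_weights(2000, &[0.4, 0.3, 0.2, 0.1])` gives 800, 600, 400, and 200 samples per class, before any label noise from `flip_y` is applied.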

## Regression Data Generation

### Single-Output Regression

```rust
use scirs2_datasets::make_regression;

let dataset = make_regression(
    500,       // n_samples
    10,        // n_features
    5,         // n_informative: number of features that affect target
    0.1,       // noise: standard deviation of gaussian noise
    Some(42),  // random_state
)?;

println!("Regression dataset:");
println!("  Samples: {}", dataset.n_samples());
println!("  Features: {}", dataset.n_features());

// Target statistics
if let Some(target) = &dataset.target {
    let mean = target.mean().unwrap();
    let std = target.std(0.0);
    println!("  Target mean: {:.3}, std: {:.3}", mean, std);
}
```
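The argument to `std` is the delta degrees of freedom (`ddof`): `0.0` gives the population standard deviation and `1.0` the sample standard deviation. A minimal plain-Rust sketch of that calculation follows; `std_with_ddof` is a hypothetical helper, not a SciRS2 or ndarray function.

```rust
/// Standard deviation with a configurable delta degrees of freedom.
/// Hypothetical helper that mirrors the meaning of the `ddof` argument.
fn std_with_ddof(values: &[f64], ddof: f64) -> f64 {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    // Sum of squared deviations from the mean.
    let sum_sq: f64 = values.iter().map(|v| (v - mean).powi(2)).sum();
    // Divide by (n - ddof): n for the population, n - 1 for a sample.
    (sum_sq / (n - ddof)).sqrt()
}
```

For small targets the two conventions differ noticeably; for the 500 samples generated above the difference is small.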

### Multi-Output Regression

```rust
use scirs2_datasets::generators::RegressionConfig;

let config = RegressionConfig {
    n_samples: 1000,
    n_features: 15,
    n_targets: 3,          // Multiple output targets
    n_informative: 10,
    noise: 0.05,
    bias: 100.0,           // Bias term added to targets
    tail_strength: 0.5,    // Tail strength for heavy-tailed noise
    random_state: Some(42),
};

let dataset = config.generate()?;
println!("Multi-output regression: {} targets", 3);
```

## Clustering Data Generation

### Blob Clusters

```rust
use scirs2_datasets::make_blobs;

// Generate blob-like clusters
let dataset = make_blobs(
    800,       // n_samples
    2,         // n_features (2D for visualization)
    4,         // n_centers: number of clusters
    1.0,       // cluster_std: standard deviation of clusters
    Some(42),  // random_state
)?;

println!("Blob clusters:");
println!("  Samples: {}", dataset.n_samples());
println!("  Clusters: 4");
```

### Custom Cluster Centers

```rust
use scirs2_datasets::generators::BlobConfig;
use ndarray::Array2;

// Define custom cluster centers
let centers = Array2::from_shape_vec(
    (3, 2),
    vec![0.0, 0.0,    // Center 1
         5.0, 5.0,    // Center 2
         0.0, 5.0],   // Center 3
)?;

let config = BlobConfig {
    n_samples: 600,
    centers: Some(centers),
    cluster_std: vec![0.5, 1.0, 1.5], // Different std for each cluster
    random_state: Some(42),
};

let dataset = config.generate()?;
```

## Non-Linear Pattern Generation

### Two Moons

```rust
use scirs2_datasets::make_moons;

// Generate two interleaving half circles
let dataset = make_moons(
    400,       // n_samples
    0.1,       // noise level
    Some(42),  // random_state
)?;

println!("Two moons pattern: {} samples", dataset.n_samples());
```
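The geometry behind the pattern can be sketched without any noise. The version below assumes the construction popularized by scikit-learn (an upper unit half circle and a lower half circle shifted to center `(1.0, 0.5)`); the function name `two_moons` is hypothetical and SciRS2 may place or scale the arcs differently.

```rust
use std::f64::consts::PI;

/// Noise-free two-moons points as (x, y, label) triples.
/// Assumes the scikit-learn-style layout; illustration only.
fn two_moons(n_samples: usize) -> Vec<(f64, f64, u8)> {
    let n_upper = n_samples / 2;
    let n_lower = n_samples - n_upper;
    let mut points = Vec::with_capacity(n_samples);
    // Upper moon: half circle of radius 1 around the origin.
    for i in 0..n_upper {
        let t = PI * i as f64 / (n_upper.max(2) - 1) as f64;
        points.push((t.cos(), t.sin(), 0));
    }
    // Lower moon: mirrored half circle around (1.0, 0.5).
    for i in 0..n_lower {
        let t = PI * i as f64 / (n_lower.max(2) - 1) as f64;
        points.push((1.0 - t.cos(), 0.5 - t.sin(), 1));
    }
    points
}
```

Adding Gaussian noise with the chosen noise level to each coordinate then blurs the two arcs into each other.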

### Concentric Circles

```rust
use scirs2_datasets::make_circles;

let dataset = make_circles(
    500,       // n_samples
    0.05,      // noise level
    0.6,       // factor: scale factor between inner and outer circle
    Some(42),
)?;

println!("Concentric circles: {} samples", dataset.n_samples());
```
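As a rough illustration of the `factor` parameter, the sketch below places half of the points on a unit circle and the other half on a circle of radius `factor`, with no noise. The function `concentric_circles` is hypothetical and only assumes the usual meaning of `factor` as the inner-to-outer radius ratio.

```rust
use std::f64::consts::PI;

/// Noise-free concentric circles as (x, y, label) triples.
/// Label 0 is the outer circle (radius 1), label 1 the inner (radius `factor`).
fn concentric_circles(n_samples: usize, factor: f64) -> Vec<(f64, f64, u8)> {
    let n_outer = n_samples / 2;
    let n_inner = n_samples - n_outer;
    let mut points = Vec::with_capacity(n_samples);
    for i in 0..n_outer {
        let t = 2.0 * PI * i as f64 / n_outer as f64;
        points.push((t.cos(), t.sin(), 0));
    }
    for i in 0..n_inner {
        let t = 2.0 * PI * i as f64 / n_inner as f64;
        points.push((factor * t.cos(), factor * t.sin(), 1));
    }
    points
}
```

A `factor` close to 1.0 puts the circles nearly on top of each other and makes the classes harder to separate.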

### Spiral Patterns

```rust
use scirs2_datasets::make_spirals;

let dataset = make_spirals(
    600,       // n_samples
    2,         // n_spirals: number of spiral arms
    0.1,       // noise level
    Some(42),
)?;

println!("Spiral pattern: {} samples", dataset.n_samples());
```

### Swiss Roll (3D Manifold)

```rust
use scirs2_datasets::make_swiss_roll;

let dataset = make_swiss_roll(
    1000,      // n_samples
    0.1,       // noise level
    Some(42),
)?;

println!("Swiss roll: {} samples, {} features", 
         dataset.n_samples(), dataset.n_features());
assert_eq!(dataset.n_features(), 3); // 3D embedding
```
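The three features come from rolling a 2D sheet into a spiral. The sketch below uses the parameterization found in scikit-learn's `make_swiss_roll` as an assumption (SciRS2 may use different constants); `swiss_roll_point` is a hypothetical helper that maps two values in `[0, 1]` to a 3D point.

```rust
use std::f64::consts::PI;

/// Map two unit-interval parameters onto the Swiss roll surface.
/// Constants follow the scikit-learn convention; illustration only.
fn swiss_roll_point(u: f64, v: f64) -> [f64; 3] {
    // Angle along the spiral, from 1.5*pi to 4.5*pi.
    let t = 1.5 * PI * (1.0 + 2.0 * u);
    // x and z trace the spiral; y is the position across the sheet.
    [t * t.cos(), 21.0 * v, t * t.sin()]
}
```

The distance of a point from the y axis equals `t`, which is why manifold learning methods can recover `t` as the unrolled coordinate.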

## Time Series Generation

### Basic Time Series

```rust
use scirs2_datasets::make_time_series;

let dataset = make_time_series(
    100,       // n_timesteps
    3,         // n_features/variables
    true,      // with_trend: include linear trend
    true,      // with_seasonality: include seasonal component
    0.1,       // noise_level
    Some(42),
)?;

println!("Time series:");
println!("  Timesteps: {}", dataset.n_samples());
println!("  Variables: {}", dataset.n_features());
```
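To show what "trend" and "seasonality" mean here, the following sketch composes a linear trend with a single sinusoidal seasonal component and no noise. The function `trend_plus_seasonality` and its parameters `slope`, `amplitude`, and `period` are hypothetical; the actual components used by `make_time_series` are not specified in this tutorial.

```rust
use std::f64::consts::PI;

/// Noise-free series: linear trend plus one sinusoidal seasonal component.
/// Hypothetical helper for illustration only.
fn trend_plus_seasonality(
    n_timesteps: usize,
    slope: f64,
    amplitude: f64,
    period: usize,
) -> Vec<f64> {
    (0..n_timesteps)
        .map(|t| {
            let t = t as f64;
            let trend = slope * t;
            let seasonal = amplitude * (2.0 * PI * t / period as f64).sin();
            trend + seasonal
        })
        .collect()
}
```

Because the seasonal term repeats every `period` steps, values one period apart differ only by the trend, `slope * period`.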

### Advanced Time Series with Custom Patterns

```rust
use scirs2_datasets::generators::{NoiseType, SeasonalPattern, TimeSeriesConfig, TrendType};

let config = TimeSeriesConfig {
    n_timesteps: 365,  // One year of daily data
    n_features: 2,
    trend_type: TrendType::Polynomial(2), // Quadratic trend
    seasonal_patterns: vec![
        SeasonalPattern {
            period: 7,     // Weekly seasonality
            amplitude: 2.0,
        },
        SeasonalPattern {
            period: 30,    // Approximately monthly seasonality
            amplitude: 1.0,
        },
    ],
    noise_type: NoiseType::ARMA(1, 1),
    noise_level: 0.2,
    random_state: Some(42),
};

let dataset = config.generate()?;
```

## Data Corruption and Noise

### Adding Missing Values

```rust
use scirs2_datasets::{make_classification, utils::add_missing_values};

let mut dataset = make_classification(500, 10, 3, 2, 8, Some(42))?;

// Add 10% missing values randomly
add_missing_values(&mut dataset.data, 0.1, Some(42))?;

println!("Added missing values to dataset");
```
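One way random masking with a seed can work is sketched below in plain Rust: a small xorshift generator picks a fixed number of distinct cells and overwrites them with `NaN`. Both `XorShift64` and `mask_with_nan` are hypothetical and exist only for this illustration; whether `add_missing_values` masks an exact count or decides per cell is not stated here.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        XorShift64(seed.max(1)) // the state must be non-zero
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Replace `round(rate * len)` randomly chosen cells with NaN.
/// Returns the number of masked cells. Hypothetical helper.
fn mask_with_nan(values: &mut [f64], rate: f64, seed: u64) -> usize {
    let n_missing = ((values.len() as f64 * rate).round() as usize).min(values.len());
    let mut rng = XorShift64::new(seed);
    let mut indices: Vec<usize> = (0..values.len()).collect();
    // Partial Fisher-Yates shuffle: each step fixes one distinct random index.
    for i in 0..n_missing {
        let remaining = (indices.len() - i) as u64;
        let j = i + (rng.next_u64() % remaining) as usize;
        indices.swap(i, j);
        values[indices[i]] = f64::NAN;
    }
    n_missing
}
```

Using the same seed masks the same cells on every run, which keeps downstream imputation experiments comparable.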

### Adding Outliers

```rust
use scirs2_datasets::{make_regression, utils::add_outliers};

let mut dataset = make_regression(300, 5, 4, 0.05, Some(42))?;

// Add 5% outliers
add_outliers(&mut dataset.data, 0.05, 3.0, Some(42))?; // 3.0 = outlier strength

println!("Added outliers to dataset");
```

## Combining Datasets

### Concatenating Datasets

```rust
use scirs2_datasets::{make_classification, utils::concatenate_datasets};

let dataset1 = make_classification(200, 10, 2, 1, 8, Some(42))?;
let dataset2 = make_classification(300, 10, 2, 1, 8, Some(43))?;

let combined = concatenate_datasets(&[dataset1, dataset2])?;
println!("Combined dataset: {} samples", combined.n_samples()); // 500 samples
```

### Feature Augmentation

```rust
use scirs2_datasets::{make_classification, utils::add_polynomial_features};

let mut dataset = make_classification(100, 3, 2, 1, 3, Some(42))?;

// Add polynomial features (degree 2)
add_polynomial_features(&mut dataset.data, 2)?;

println!("Augmented features: {}", dataset.n_features());
```
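For intuition about how the feature count grows, the sketch below expands one row to degree 2 by appending every square and pairwise product to the original values. `degree2_features` is a hypothetical helper; whether `add_polynomial_features` also adds a constant bias column or orders terms this way is an assumption not confirmed by this tutorial.

```rust
/// Expand a row to degree 2: original values, then all products x_i * x_j with i <= j.
/// Hypothetical helper; no bias column is added.
fn degree2_features(row: &[f64]) -> Vec<f64> {
    let mut out = row.to_vec();
    for i in 0..row.len() {
        for j in i..row.len() {
            out.push(row[i] * row[j]);
        }
    }
    out
}
```

Under these assumptions, 3 input features become 3 + 6 = 9 features, and `n` inputs become `n + n * (n + 1) / 2`.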

## Best Practices for Data Generation

### Reproducible Experiments

```rust
use scirs2_datasets::make_classification;

// Always pass a fixed random_state for reproducible results
let seed = 42;
let dataset1 = make_classification(100, 10, 2, 1, 8, Some(seed))?;
let dataset2 = make_classification(100, 10, 2, 1, 8, Some(seed))?;

// The same seed and parameters produce identical data
assert_eq!(dataset1.data, dataset2.data);
```

### Realistic Data Characteristics

```rust
use scirs2_datasets::generators::ClassificationConfig;

// Create realistic, challenging classification data
let config = ClassificationConfig {
    n_samples: 1000,
    n_features: 50,
    n_classes: 5,
    n_clusters_per_class: 2, // Required field, as in the earlier configuration
    n_informative: 20,     // Not all features are informative
    n_redundant: 10,       // Some features are linear combinations
    n_repeated: 5,         // Some features are duplicated
    class_sep: 0.8,        // Moderate class separation (not too easy)
    flip_y: 0.05,          // 5% label noise (realistic)
    weights: Some(vec![0.4, 0.25, 0.2, 0.1, 0.05]), // Imbalanced classes
    random_state: Some(42),
};

let realistic_dataset = config.generate()?;
```

### Memory-Efficient Generation

```rust
use scirs2_datasets::{make_classification, utils::{concatenate_datasets, Dataset}};

// For large datasets, generate in fixed-size chunks and concatenate them.
// Note: `Dataset` is assumed to be exported from `utils`; adjust the path if needed.
fn generate_large_dataset(
    total_samples: usize,
    batch_size: usize,
) -> Result<Dataset, Box<dyn std::error::Error>> {
    let mut batches = Vec::new();

    for i in (0..total_samples).step_by(batch_size) {
        let current_batch_size = std::cmp::min(batch_size, total_samples - i);
        let batch = make_classification(
            current_batch_size, 10, 3, 2, 8,
            Some(42 + i as u64), // Different seed for each batch
        )?;
        batches.push(batch);
    }

    // `?` converts the library error into the boxed error type
    Ok(concatenate_datasets(&batches)?)
}

let large_dataset = generate_large_dataset(100_000, 1000)?;
```
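The chunking arithmetic in the loop above can be checked on its own. The sketch below computes the start offset, size, and seed of every batch without generating any data; `batch_plan` is a hypothetical helper that mirrors the `step_by` loop and the `42 + i` seeding scheme.

```rust
/// Plan batches as (start offset, batch size, seed) triples.
/// Hypothetical helper that mirrors the chunked generation loop.
fn batch_plan(total_samples: usize, batch_size: usize, base_seed: u64) -> Vec<(usize, usize, u64)> {
    assert!(batch_size > 0, "batch_size must be positive");
    (0..total_samples)
        .step_by(batch_size)
        .map(|start| {
            // The last batch may be smaller than batch_size.
            let size = batch_size.min(total_samples - start);
            (start, size, base_seed + start as u64)
        })
        .collect()
}
```

For example, `batch_plan(2500, 1000, 42)` yields batches of 1000, 1000, and 500 samples with seeds 42, 1042, and 2042, so no two batches share a seed.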

## Performance Considerations

### Benchmarking Data Generation

```rust
use scirs2_datasets::make_classification;
use std::time::Instant;

let start = Instant::now();
let dataset = make_classification(10_000, 100, 5, 2, 50, Some(42))?;
let duration = start.elapsed();

// as_secs_f64 keeps sub-millisecond precision; as_millis returns an integer
println!("Generated {} samples in {:.2}ms",
         dataset.n_samples(), duration.as_secs_f64() * 1000.0);
println!("Throughput: {:.1} samples/s",
         dataset.n_samples() as f64 / duration.as_secs_f64());
```
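When several generators need timing, the measurement can be factored into a small helper. The sketch below is one possible shape, using only the standard library; `time_it` and `samples_per_second` are hypothetical names, and a single run like this is only a rough estimate compared with a dedicated benchmarking harness.

```rust
use std::time::{Duration, Instant};

/// Run a closure once and return its result with the elapsed wall-clock time.
fn time_it<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}

/// Convert a sample count and elapsed time into samples per second.
fn samples_per_second(n_samples: usize, elapsed: Duration) -> f64 {
    n_samples as f64 / elapsed.as_secs_f64()
}
```

A call such as `time_it(|| make_classification(10_000, 100, 5, 2, 50, Some(42)))` would return the generator's `Result` together with the time it took.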

### Parallel Generation

```rust
use rayon::prelude::*; // requires the `rayon` crate as a dependency
use scirs2_datasets::make_classification;

// Generate multiple datasets in parallel
let seeds: Vec<u64> = (0..10).collect();
let datasets: Vec<_> = seeds.par_iter()
    .map(|&seed| make_classification(1000, 20, 3, 2, 15, Some(seed)))
    .collect::<Result<Vec<_>, _>>()?;

println!("Generated {} datasets in parallel", datasets.len());
```

This tutorial covered the SciRS2 generators for classification, regression, clustering, non-linear patterns, and time series, along with utilities for corrupting and combining datasets. Use them to build synthetic datasets for algorithm development, testing, and benchmarking.