u-insight 0.10.0

# u-insight

[![Crates.io](https://img.shields.io/crates/v/u-insight.svg)](https://crates.io/crates/u-insight)
[![NuGet](https://img.shields.io/nuget/v/UInsight.svg)](https://www.nuget.org/packages/UInsight)
[![docs.rs](https://docs.rs/u-insight/badge.svg)](https://docs.rs/u-insight)
[![CI](https://github.com/iyulab/u-insight/actions/workflows/ci.yml/badge.svg)](https://github.com/iyulab/u-insight/actions/workflows/ci.yml)

A statistical analysis and data profiling engine in Rust with C FFI bindings.

## What's New in 0.9.1

- **BREAKING — Rust**: `InsightError::NonNumericColumn` variant removed. The 0.9.0 audit redirected all internal call sites to `DegenerateData`, leaving the variant unused. Removed per `Delete over deprecate` policy. External `match` arms over `InsightError` must drop the corresponding branch.

## What's New in 0.9.0

- **Kendall tau-b correlation** added to `CorrelationMethod` (Pearson / Spearman / Kendall)
- **Outlier fences exposed** on `OutlierResult` — `lower_fence`, `upper_fence`, `center`, `spread`
- **`detect_outliers_slice(&[f64], method)`** helper for raw-slice input
- **`vif_analysis()` and `condition_number()`** standalone multicollinearity diagnostics
- **WASM**: `correlation_matrix` accepts optional `_method` field (`"pearson"` | `"spearman"` | `"kendall"`); new `detect_univariate_outliers`, `vif_diagnostic`, `condition_number_diagnostic`
- **C#**: `CorrelationMethodKind` enum + `Correlate(data, method)` parameter
- **BREAKING — Rust**: Non-finite numeric inputs now return `InsightError::DegenerateData` (was `NonNumericColumn`); audit covered 11 call sites
- **BREAKING — Rust**: `OutlierResult` gained 4 new fields (exhaustive pattern matches must be updated)
- **BREAKING — FFI**: `insight_correlation` signature gained a `method: u32` parameter (use `INSIGHT_CORR_PEARSON` = 0 to keep prior behaviour)
- **BREAKING — C#**: `Correlate(...)` now takes an optional `CorrelationMethodKind` parameter (default `Pearson` keeps existing call-sites compiling)
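
Kendall tau-b, added in 0.9.0, measures rank agreement by counting concordant and discordant pairs and correcting for ties. The block below is a minimal standard-library sketch of that definition, not the crate's implementation. The function name `kendall_tau_b` is hypothetical, inputs are assumed finite, and the pair scan is the naive O(n²) version.

```rust
/// Kendall tau-b for two equal-length samples (naive O(n^2) pair scan).
/// Assumes finite inputs. Returns None when lengths differ, n < 2,
/// or either sample is constant (the denominator would be zero).
fn kendall_tau_b(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let (mut concordant, mut discordant) = (0i64, 0i64);
    let (mut ties_x_only, mut ties_y_only) = (0i64, 0i64);
    for i in 0..x.len() {
        for j in (i + 1)..x.len() {
            let dx = x[i] - x[j];
            let dy = y[i] - y[j];
            if dx == 0.0 && dy == 0.0 {
                continue; // tied in both samples: contributes to neither factor
            } else if dx == 0.0 {
                ties_x_only += 1;
            } else if dy == 0.0 {
                ties_y_only += 1;
            } else if (dx > 0.0) == (dy > 0.0) {
                concordant += 1;
            } else {
                discordant += 1;
            }
        }
    }
    // Pairs not tied in x, and pairs not tied in y.
    let untied_x = (concordant + discordant + ties_y_only) as f64;
    let untied_y = (concordant + discordant + ties_x_only) as f64;
    let denom = (untied_x * untied_y).sqrt();
    if denom == 0.0 {
        return None;
    }
    Some((concordant - discordant) as f64 / denom)
}
```

Calling `kendall_tau_b(&[1.0, 2.0, 3.0], &[1.0, 3.0, 2.0])` counts two concordant pairs and one discordant pair, giving a coefficient of one third.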

## Overview

u-insight transforms raw tabular data into actionable statistical insights. It operates in two distinct layers with **opposite assumptions about input data quality**:

```
CSV (raw)
  │
  ├─→ Profiling ─→ "What is the state of this data?"
  │     Tolerates dirty data (missing values, type mismatches expected)
  │
  │   (external preprocessing)
  │
  └─→ Analysis  ─→ "What can we learn from this data?"
        Requires clean numeric data (no NaN, no missing)
```

Built on `u-analytics` (statistical algorithms) and `u-numflow` (math primitives).
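
The contrast between the two layers can be illustrated with a small standard-library sketch. Nothing here is the crate's API: `profile_cells`, `strict_mean`, and `SketchError` are hypothetical stand-ins, and only the variant name `DegenerateData` is borrowed from the changelog above. The profiling-style function counts bad cells and carries on, while the analysis-style function rejects non-finite input outright.

```rust
/// Stand-in error type; only the variant name comes from this README.
#[derive(Debug, PartialEq)]
enum SketchError {
    DegenerateData,
}

/// Profiling-style pass: raw cells may be empty, unparsable, or non-finite.
/// Returns (valid_count, missing_or_invalid_count, mean_of_valid).
fn profile_cells(cells: &[&str]) -> (usize, usize, Option<f64>) {
    let valid: Vec<f64> = cells
        .iter()
        .filter_map(|c| c.trim().parse::<f64>().ok())
        .filter(|v| v.is_finite()) // "NaN" and "inf" parse, so drop them here
        .collect();
    let mean = if valid.is_empty() {
        None
    } else {
        Some(valid.iter().sum::<f64>() / valid.len() as f64)
    };
    (valid.len(), cells.len() - valid.len(), mean)
}

/// Analysis-style pass: refuses empty or non-finite input up front.
fn strict_mean(values: &[f64]) -> Result<f64, SketchError> {
    if values.is_empty() || values.iter().any(|v| !v.is_finite()) {
        return Err(SketchError::DegenerateData);
    }
    Ok(values.iter().sum::<f64>() / values.len() as f64)
}
```

With the cells `["1.0", "", "abc", "3.0"]`, the first function reports two valid values and two invalid ones, whereas passing a slice containing `f64::NAN` to the second returns an error instead of a partial result.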

## Modules

### Data Layer

| Module | Description |
|--------|-------------|
| `dataframe` | Column-major tabular data model (DataFrame, Column, DataType) |
| `csv_parser` | CSV parsing with automatic type inference |
| `error` | Error types (InsightError) |

### Profiling Layer (dirty data tolerated)

| Module | Description |
|--------|-------------|
| `profiling` | Column-level and dataset-level data profiling — descriptive stats, missing analysis, outlier flagging (IQR/Z-score/Modified Z-score), diagnostic flags |
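
As an illustration of IQR-based outlier flagging, here is a minimal sketch of Tukey fences in plain Rust. It is not the crate's code: the helper names are hypothetical, the quartiles use linear interpolation (the common "type 7" rule), and the multiplier `k` (conventionally 1.5) is an assumption about typical usage rather than a documented default.

```rust
/// Linear-interpolation quantile on already sorted, non-empty data.
fn quantile_sorted(sorted: &[f64], p: f64) -> f64 {
    let pos = p * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (pos - lo as f64) * (sorted[hi] - sorted[lo])
}

/// Tukey fences: (lower_fence, upper_fence) = (Q1 - k*IQR, Q3 + k*IQR).
/// Returns None for empty or non-finite input.
fn iqr_fences(data: &[f64], k: f64) -> Option<(f64, f64)> {
    if data.is_empty() || data.iter().any(|v| !v.is_finite()) {
        return None;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(f64::total_cmp);
    let q1 = quantile_sorted(&sorted, 0.25);
    let q3 = quantile_sorted(&sorted, 0.75);
    let iqr = q3 - q1;
    Some((q1 - k * iqr, q3 + k * iqr))
}

/// Marks each value that falls outside the fences.
fn flag_outliers(data: &[f64], k: f64) -> Vec<bool> {
    match iqr_fences(data, k) {
        Some((lower, upper)) => data.iter().map(|&v| v < lower || v > upper).collect(),
        None => vec![false; data.len()],
    }
}
```

For the values 1 through 8 followed by 100, the quartiles are 3 and 7, so with `k = 1.5` the fences are -3 and 13 and only the last value is flagged.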

### Analysis Layer (clean data required)

| Module | Description |
|--------|-------------|
| `analysis` | Correlation (Pearson/Spearman/Kendall), regression (simple/multiple OLS), Cramer's V contingency analysis |
| `clustering` | K-Means++ (auto-K, Gap Statistic), Mini-Batch K-Means, DBSCAN, Hierarchical Agglomerative (Single/Complete/Average/Ward), HDBSCAN |
| `distribution` | ECDF, histogram bins (Sturges/Scott/FD), QQ-plot, normality tests (KS, Jarque-Bera, Shapiro-Wilk, Anderson-Darling), Grubbs test, distribution fitting |
| `pca` | Principal Component Analysis with auto-scaling option |
| `isolation_forest` | Isolation Forest anomaly detection (Liu et al. 2008) |
| `lof` | Local Outlier Factor (LOF) density-based anomaly detection |
| `mahalanobis` | Mahalanobis distance multivariate outlier detection |
| `feature_importance` | Variance threshold, correlation filter, VIF, condition number, composite importance, ANOVA F-test selection, Mutual Information, Permutation Importance |
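
To make the VIF diagnostic concrete: the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the others. With exactly two predictors, R² reduces to the squared Pearson correlation between them. The sketch below covers only that special case using the standard library; `pearson` and `vif_two_predictors` are hypothetical names, not the crate's functions.

```rust
/// Pearson correlation of two equal-length samples; None if undefined.
fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx) * (a - mx);
        syy += (b - my) * (b - my);
    }
    if sxx == 0.0 || syy == 0.0 {
        return None; // a constant column has no defined correlation
    }
    Some(sxy / (sxx * syy).sqrt())
}

/// VIF for the two-predictor case, where R^2 equals r^2.
/// Perfectly collinear predictors yield infinity.
fn vif_two_predictors(x1: &[f64], x2: &[f64]) -> Option<f64> {
    let r = pearson(x1, x2)?;
    let r2 = r * r;
    if r2 >= 1.0 {
        Some(f64::INFINITY)
    } else {
        Some(1.0 / (1.0 - r2))
    }
}
```

Uncorrelated predictors give a VIF of 1, and the value grows without bound as the predictors approach perfect collinearity.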

### FFI Layer

| Module | Description |
|--------|-------------|
| `ffi` | C FFI bindings — 32 functions, 20 `#[repr(C)]` structs, auto-generated C header via cbindgen |

## Quick Start

```rust
use u_insight::csv_parser::CsvParser;
use u_insight::profiling::profile_dataframe;

// 1. Parse CSV
let csv = "name,value,active\nAlice,1.5,true\nBob,2.3,false\nCharlie,3.1,true\n";
let df = CsvParser::new().parse_str(csv).unwrap();

// 2. Profile
let profiles = profile_dataframe(&df);
```

### Clustering

```rust
use u_insight::clustering::{kmeans, dbscan, KMeansConfig, DbscanConfig};

let data = vec![
    vec![0.0, 0.0], vec![0.5, 0.5],
    vec![10.0, 10.0], vec![10.5, 10.5],
];

// K-Means
let km = kmeans(&data, &KMeansConfig::new(2)).unwrap();
assert_eq!(km.k, 2);

// DBSCAN
let db = dbscan(&data, &DbscanConfig::new(1.5, 2)).unwrap();
assert_eq!(db.n_clusters, 2);
```

### Distribution Analysis

```rust
use u_insight::distribution::{distribution_analysis, DistributionConfig};

let data: Vec<f64> = (0..50).map(|i| (i as f64 - 25.0) * 0.2).collect();
let result = distribution_analysis(&data, &DistributionConfig::default()).unwrap();
println!("Normal: {}", result.normality.is_normal);
```

## C FFI

u-insight builds as `cdylib` + `staticlib` for cross-language interop. A C header (`u_insight.h`) is auto-generated by cbindgen at build time.

### Profiling

| Function | Description |
|----------|-------------|
| `insight_profile_csv` | Profile a CSV string → opaque context |
| `insight_profile_free` | Free profile context |
| `insight_profile_row_count` | Row count from profile |
| `insight_profile_col_count` | Column count from profile |
| `insight_profile_column` | Get column summary |

### Clustering

| Function | Description |
|----------|-------------|
| `insight_kmeans` | K-Means++ clustering |
| `insight_mini_batch_kmeans` | Mini-Batch K-Means clustering |
| `insight_dbscan` | DBSCAN density-based clustering |
| `insight_hierarchical` | Hierarchical Agglomerative clustering (4 linkages) |
| `insight_hdbscan` | HDBSCAN clustering with membership probabilities |
| `insight_gap_statistic` | Gap statistic for optimal K selection |

### Dimensionality Reduction

| Function | Description |
|----------|-------------|
| `insight_pca` | Principal Component Analysis |

### Anomaly Detection

| Function | Description |
|----------|-------------|
| `insight_isolation_forest` | Isolation Forest anomaly detection |
| `insight_lof` | Local Outlier Factor detection |
| `insight_mahalanobis` | Mahalanobis distance outlier detection |

### Statistical Analysis

| Function | Description |
|----------|-------------|
| `insight_correlation` | Correlation matrix; the `method` parameter selects the method (`INSIGHT_CORR_PEARSON` = 0) |
| `insight_regression` | Simple linear regression |
| `insight_cramers_v` | Cramer's V contingency analysis |

### Distribution

| Function | Description |
|----------|-------------|
| `insight_distribution` | Normality testing (KS, JB, SW, AD) |

### Feature Importance

| Function | Description |
|----------|-------------|
| `insight_feature_importance` | Composite feature importance scores |
| `insight_anova_select` | ANOVA F-test feature selection |
| `insight_mutual_info` | Mutual information feature ranking |
| `insight_permutation_importance` | Permutation importance for regression |

### Memory Management

| Function | Description |
|----------|-------------|
| `insight_free_labels` | Free u32 label arrays |
| `insight_free_i32_array` | Free i32 arrays |
| `insight_free_f64_array` | Free f64 arrays |
| `insight_free_anova_features` | Free ANOVA feature arrays |
| `insight_free_mi_features` | Free MI feature arrays |
| `insight_free_perm_features` | Free permutation importance arrays |

### Error & Version

| Function | Description |
|----------|-------------|
| `insight_last_error` | Last error message (thread-local) |
| `insight_clear_error` | Clear error state |
| `insight_version` | Library version string |

All FFI functions use `catch_unwind` to prevent panics from crossing the FFI boundary.
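
The pattern can be sketched as follows. This is an illustration under stated assumptions rather than the crate's source: `sketch_mean`, the status codes, and the `LAST_ERROR` slot are hypothetical, while `std::panic::catch_unwind` and `thread_local!` are the standard-library pieces such a design typically relies on. A real export would also carry a `no_mangle` attribute, omitted here for brevity.

```rust
use std::cell::RefCell;
use std::panic::{self, AssertUnwindSafe};

thread_local! {
    /// Per-thread slot holding the most recent error message.
    static LAST_ERROR: RefCell<Option<String>> = RefCell::new(None);
}

// Hypothetical status codes for this sketch.
const SKETCH_OK: i32 = 0;
const SKETCH_ERR_NULL: i32 = 1;
const SKETCH_ERR_PANIC: i32 = 2;

fn set_error(msg: &str) {
    LAST_ERROR.with(|slot| *slot.borrow_mut() = Some(msg.to_string()));
}

fn last_error() -> Option<String> {
    LAST_ERROR.with(|slot| slot.borrow().clone())
}

/// Hypothetical C-ABI function: writes the mean of `len` doubles to `out`.
pub extern "C" fn sketch_mean(data: *const f64, len: usize, out: *mut f64) -> i32 {
    if data.is_null() || out.is_null() {
        set_error("null pointer");
        return SKETCH_ERR_NULL;
    }
    // Any panic inside the closure is caught here instead of unwinding into C.
    let result = panic::catch_unwind(AssertUnwindSafe(|| {
        // SAFETY: the caller promises `data` points to `len` readable f64 values.
        let slice = unsafe { std::slice::from_raw_parts(data, len) };
        assert!(!slice.is_empty(), "empty input"); // deliberate panic for len == 0
        slice.iter().sum::<f64>() / slice.len() as f64
    }));
    match result {
        Ok(mean) => {
            // SAFETY: `out` is non-null (checked above) and writable per the contract.
            unsafe { *out = mean };
            SKETCH_OK
        }
        Err(_) => {
            set_error("internal panic");
            SKETCH_ERR_PANIC
        }
    }
}
```

The caller only ever sees an integer status and can fetch the message afterwards from the same thread, which mirrors the role `insight_last_error` plays in the table above.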

## C# Binding (UInsight)

Install via NuGet — native libraries are bundled automatically:

```bash
dotnet add package UInsight
```

```csharp
using UInsight;

using var client = new InsightClient();
Console.WriteLine(client.GetVersion());

var data = new double[,] { {0,0}, {1,1}, {10,10}, {11,11} };
var result = client.KMeans(data, k: 2);
Console.WriteLine($"K={result.K}, WCSS={result.Wcss:F2}");
```

The binding is in `bindings/csharp/UInsight/` with:

- `Interop/NativeLibrary.cs` — `[LibraryImport]` declarations for all 32 FFI functions
- `Interop/NativeStructs.cs` — `[StructLayout]` mappings for all 20 C structs
- `InsightClient.cs` — High-level managed API (automatic memory management)
- `InsightException.cs` — Error code to exception conversion

## Test Status

```
357 lib tests + 49 doc-tests = 406 total
0 clippy warnings
Build: lib + cdylib + staticlib
C header: auto-generated via cbindgen (20 structs, 32 functions)
```

## Scope & Non-Goals

**In Scope:**
- Data profiling (dirty data → quality report + diagnostic flags)
- Statistical analysis (clean data → patterns + relationships)
- Correlation, regression, clustering, PCA, anomaly detection
- Feature importance and selection (ANOVA, MI, Permutation)
- Distribution analysis and normality testing
- C FFI for cross-language use
- C# binding (UInsight NuGet package)

**Out of Scope:**
- Visualization / charting
- Data cleaning / transformation / imputation
- ML model training / deployment
- Deep learning

## Requirements

- Rust 1.75+
- Dependencies: `u-analytics`, `u-numflow`

## WebAssembly / npm

Available as an npm package via [wasm-pack](https://rustwasm.github.io/wasm-pack/).

```bash
npm install @iyulab/u-insight
```

### Quick Start

```javascript
import init, { describe, kmeans } from '@iyulab/u-insight';

await init();
const stats = describe({ col1: [1, 2, 3], col2: [4, 5, 6] });
const clusters = kmeans([[0, 0], [1, 1], [10, 10], [11, 11]], 2);
```

### Functions

#### `describe(data) -> [ColumnResult]`

Descriptive statistics per column. Input: column-major `{ "col1": [1,2,3] }`.

**Output:** Array of `{ name, data_type, numeric: { count, min, max, mean, median, std_dev, variance, skewness, kurtosis, q1, q3, iqr, p5, p95, ... } }`.

#### `correlation_matrix(data) -> CorrelationResult`

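Correlation matrix, Pearson unless the optional `_method` field (`"pearson"` | `"spearman"` | `"kendall"`) selects another method. Input: column-major `{ "col1": [1,2,3], "col2": [4,5,6] }`.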

**Output:**
```json
{ "names": ["col1","col2"], "matrix": [1,0.99,0.99,1], "n": 2, "high_pairs": [{ "col_a": "col1", "col_b": "col2", "r": 0.99, "p_value": 0.01 }] }
```
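
The example output suggests that `matrix` is a flat array holding an `n` × `n` matrix. Assuming row-major order (which is indistinguishable from column-major for a symmetric correlation matrix), entry (i, j) sits at index `i * n + j`. The helpers below are a hypothetical sketch of reading such an array; they are not part of the package.

```rust
/// Reads entry (i, j) from a flat n-by-n matrix, assuming row-major layout.
fn entry(matrix: &[f64], n: usize, i: usize, j: usize) -> Option<f64> {
    if matrix.len() != n * n || i >= n || j >= n {
        return None;
    }
    Some(matrix[i * n + j])
}

/// Lists index pairs (i, j) with i < j whose |r| meets the threshold.
fn strong_pairs(matrix: &[f64], n: usize, threshold: f64) -> Vec<(usize, usize, f64)> {
    let mut pairs = Vec::new();
    for i in 0..n {
        for j in (i + 1)..n {
            if let Some(r) = entry(matrix, n, i, j) {
                if r.abs() >= threshold {
                    pairs.push((i, j, r));
                }
            }
        }
    }
    pairs
}
```

Applied to the array `[1, 0.99, 0.99, 1]` with `n = 2` and a threshold of 0.9, the scan yields the single pair (0, 1) with r = 0.99.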

#### `kmeans(data, k) -> KMeansResult`

K-Means++ clustering on row-major data `[[x,y,...], ...]`.

**Output:**
```json
{ "k": 3, "labels": [0,0,1,1,2,2], "centroids": [[...]], "wcss": 5.2, "iterations": 12, "cluster_sizes": [2,2,2] }
```

#### `silhouette(data, labels, k) -> SilhouetteResult`

Silhouette analysis for an existing clustering assignment. Works with any clustering output (`kmeans`, `dbscan`, `hierarchical`, etc.). `data` is row-major `[[x,y,...], ...]`, `labels` is one cluster id per row (each `< k`), `k` is the number of distinct clusters. O(n²) — use sparingly on very large inputs.

**Output:**
```json
{ "avg": 0.74, "per_sample": [0.81, 0.79, 0.62, ...] }
```

`avg` ranges from -1 (wrong cluster) to +1 (well-separated); singleton-cluster points report 0.0 in `per_sample`.
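
For reference, the silhouette of a point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster. The following standard-library sketch shows where the O(n²) cost comes from. It assumes Euclidean distance, non-empty data, and labels below `k`; the zero result for a single-cluster input is a convention chosen for this sketch, and none of this is the package's actual implementation.

```rust
fn euclid(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f64>().sqrt()
}

/// Returns (avg, per_sample). Assumes non-empty data and every label < k.
fn silhouette(data: &[Vec<f64>], labels: &[usize], k: usize) -> (f64, Vec<f64>) {
    let n = data.len();
    let mut sizes = vec![0usize; k];
    for &l in labels {
        sizes[l] += 1;
    }
    let mut per_sample = Vec::with_capacity(n);
    for i in 0..n {
        let own = labels[i];
        if sizes[own] <= 1 {
            per_sample.push(0.0); // singleton cluster
            continue;
        }
        // Total distance from point i to the members of each cluster.
        let mut sums = vec![0.0_f64; k];
        for j in 0..n {
            if j != i {
                sums[labels[j]] += euclid(&data[i], &data[j]);
            }
        }
        let a = sums[own] / (sizes[own] - 1) as f64;
        let b = (0..k)
            .filter(|&c| c != own && sizes[c] > 0)
            .map(|c| sums[c] / sizes[c] as f64)
            .fold(f64::INFINITY, f64::min);
        let denom = a.max(b);
        // No other cluster, or all distances zero: report 0.0.
        let s = if b.is_finite() && denom > 0.0 { (b - a) / denom } else { 0.0 };
        per_sample.push(s);
    }
    let avg = per_sample.iter().sum::<f64>() / n as f64;
    (avg, per_sample)
}
```

Two tight pairs of points placed ten units apart score close to 0.9 on average, reflecting well-separated clusters.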

#### `pca(data, n_components) -> PcaResult`

Principal Component Analysis on row-major data.

**Output:**
```json
{ "n_components": 2, "n_features": 4, "eigenvalues": [3.1,0.9], "explained_variance_ratio": [0.77,0.23], "cumulative_variance_ratio": [0.77,1.0], "loadings": [[...]], "scores": [[...]], "means": [...], "stds": [...] }
```
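
The relationship between `eigenvalues`, `explained_variance_ratio`, and `cumulative_variance_ratio` can be sketched as below. The sketch assumes the listed eigenvalues account for all of the variance, as the numbers in the example output do; when fewer components are kept, the denominator would normally be the total variance over all components. The function name `variance_ratios` is hypothetical.

```rust
/// Returns (explained_variance_ratio, cumulative_variance_ratio).
/// Assumes the eigenvalues passed in cover the total variance.
fn variance_ratios(eigenvalues: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let total: f64 = eigenvalues.iter().sum();
    if total <= 0.0 {
        return (Vec::new(), Vec::new());
    }
    let ratios: Vec<f64> = eigenvalues.iter().map(|e| e / total).collect();
    let mut running = 0.0;
    let cumulative: Vec<f64> = ratios
        .iter()
        .map(|r| {
            running += r;
            running
        })
        .collect();
    (ratios, cumulative)
}
```

For eigenvalues 3.1 and 0.9 the ratios come out at roughly 0.775 and 0.225, and the cumulative series ends at 1.0.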

#### `dbscan(data, config) -> DbscanResult`

DBSCAN density-based clustering. `config`: `{ "epsilon": 1.5, "min_samples": 3 }`.

**Output:**
```json
{ "labels": [0,0,null,1,1], "n_clusters": 2, "noise_count": 1, "cluster_sizes": [2,2], "core_points": [true,true,false,true,true] }
```

#### `hierarchical(data, config) -> HierarchicalResult`

Hierarchical agglomerative clustering. `config`: `{ "linkage": "ward", "n_clusters": 3 }` or `{ "linkage": "single", "distance_threshold": 5.0 }`.

**Output:**
```json
{ "merges": [{ "cluster_a": 0, "cluster_b": 1, "distance": 1.2, "size": 2 }], "labels": [0,0,1,1,2], "n_clusters": 3 }
```

#### `isolation_forest(data, config) -> IsolationForestResult`

Isolation Forest anomaly detection. `config`: `{ "n_estimators": 100, "contamination": 0.1, "seed": 42 }`.

**Output:**
```json
{ "scores": [0.45, 0.82], "anomalies": [false, true], "threshold": 0.65, "anomaly_count": 1, "anomaly_fraction": 0.5 }
```

#### `lof(data, config) -> LofResult`

Local Outlier Factor anomaly detection. `config`: `{ "k": 20, "threshold": 1.5 }`.

**Output:**
```json
{ "scores": [1.0, 2.3], "anomalies": [false, true], "threshold": 1.5, "anomaly_count": 1, "anomaly_fraction": 0.5 }
```

#### `distribution_analysis(data, config) -> DistributionResult`

Distribution analysis on a 1-D array. `config`: `{ "bin_method": "freedman_diaconis", "significance_level": 0.05, "compute_ecdf": true, "compute_histogram": true, "compute_qq_plot": true, "fit_distributions": false }`.

**Output:**
```json
{ "n": 100, "ecdf": { "values": [...], "probabilities": [...] }, "histogram": { "n_bins": 10, "bin_width": 0.5, "edges": [...], "counts": [...] }, "qq_plot": { "theoretical": [...], "sample": [...] }, "normality": { "shapiro_wilk": { "statistic": 0.98, "p_value": 0.45, "rejected": false }, "is_normal": true }, "fits": [] }
```

#### `regression(data) -> RegressionResult`

OLS regression analysis.

**Input:**
```json
{ "predictors": { "x1": [1,2,3,4,5] }, "target": [2.1, 3.9, 6.1, 7.9, 10.1], "target_name": "y" }
```

**Output:**
```json
{ "target_name": "y", "predictor_names": ["x1"], "r_squared": 0.99, "adj_r_squared": 0.99, "coefficients": [0.1, 2.0], "p_values": [0.9, 0.0001], "vif": [1.0], "f_p_value": 0.0001 }
```

#### `feature_importance(data) -> FeatureImportanceResult`

Feature importance via permutation, ANOVA, or mutual information.

**Input:**
```json
{ "features": { "f1": [1,2,3], "f2": [5,4,3] }, "target": [0,0,1], "method": "permutation", "n_repeats": 5, "seed": 42 }
```

**Output:**
```json
{ "method": "permutation", "features": [{ "name": "f1", "index": 0, "score": 0.8, "std_dev": 0.1 }], "baseline_score": 0.5 }
```

## Related

- [u-analytics](https://github.com/iyulab/u-analytics) -- Statistical analytics
- [u-numflow](https://github.com/iyulab/u-numflow) -- Mathematical primitives

## License

MIT License