u-insight 0.10.0

A statistical analysis and data profiling engine in Rust with C FFI bindings.

What's New in 0.9.1

  • BREAKING — Rust: InsightError::NonNumericColumn variant removed. The 0.9.0 audit redirected all internal call sites to DegenerateData, leaving the variant unused. Removed per Delete over deprecate policy. External match arms over InsightError must drop the corresponding branch.

What's New in 0.9.0

  • Kendall tau-b correlation added to CorrelationMethod (Pearson / Spearman / Kendall)
  • Outlier fences exposed on OutlierResult: lower_fence, upper_fence, center, spread
  • detect_outliers_slice(&[f64], method) helper for raw-slice input
  • vif_analysis() and condition_number() standalone multicollinearity diagnostics
  • WASM: correlation_matrix accepts optional _method field ("pearson" | "spearman" | "kendall"); new detect_univariate_outliers, vif_diagnostic, condition_number_diagnostic
  • C#: CorrelationMethodKind enum + Correlate(data, method) parameter
  • BREAKING — Rust: Non-finite numeric inputs now return InsightError::DegenerateData (was NonNumericColumn); audit covered 11 call sites
  • BREAKING — Rust: OutlierResult gained 4 new fields (exhaustive pattern matches must be updated)
  • BREAKING — FFI: insight_correlation signature gained a method: u32 parameter (use INSIGHT_CORR_PEARSON = 0 to keep prior behaviour)
  • BREAKING — C#: Correlate(...) now takes an optional CorrelationMethodKind parameter (default Pearson keeps existing call-sites compiling)
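For readers unfamiliar with the Kendall option added in 0.9.0, the following is a minimal std-only sketch of tau-b using the plain O(n²) pairwise definition. It is not the crate's implementation; the function name kendall_tau_b and its Option return are assumptions made for illustration.

```rust
/// Kendall tau-b for two equal-length samples (illustrative, not u-insight code).
/// Returns None if the lengths differ, n < 2, or either variable is constant.
pub fn kendall_tau_b(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let (mut concordant, mut discordant) = (0i64, 0i64);
    // Pairs tied only in x and pairs tied only in y.
    let (mut ties_x, mut ties_y) = (0i64, 0i64);
    for i in 0..x.len() {
        for j in (i + 1)..x.len() {
            let dx = x[i] - x[j];
            let dy = y[i] - y[j];
            if dx == 0.0 && dy == 0.0 {
                // Tied in both variables: contributes to neither term.
            } else if dx == 0.0 {
                ties_x += 1;
            } else if dy == 0.0 {
                ties_y += 1;
            } else if (dx > 0.0) == (dy > 0.0) {
                concordant += 1;
            } else {
                discordant += 1;
            }
        }
    }
    // tau-b = (C - D) / sqrt((C + D + Tx) * (C + D + Ty))
    let pairs = concordant + discordant;
    let denom = (((pairs + ties_x) * (pairs + ties_y)) as f64).sqrt();
    if denom == 0.0 {
        None
    } else {
        Some((concordant - discordant) as f64 / denom)
    }
}
```

Unlike tau-a, the tau-b denominator corrects for ties, which matters for discretized or rounded columns.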

Overview

u-insight transforms raw tabular data into actionable statistical insights. It operates in two distinct layers with opposite assumptions about input data quality:

CSV (raw)
  │
  ├─→ Profiling ─→ "What is the state of this data?"
  │     Tolerates dirty data (missing values, type mismatches expected)
  │
  │   (external preprocessing)
  │
  └─→ Analysis  ─→ "What can we learn from this data?"
        Requires clean numeric data (no NaN, no missing)

Built on u-analytics (statistical algorithms) and u-numflow (math primitives).
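The "external preprocessing" step in the diagram is left to the caller. As a rough illustration of the simplest possible version, the sketch below drops every row that contains a non-finite value before the data is handed to the analysis layer. The helper name drop_non_finite_rows is hypothetical and is not part of u-insight.

```rust
/// Hypothetical pre-analysis filter: keep only rows in which every cell is finite
/// (no NaN, no +/- infinity). Row-major input, one Vec<f64> per row.
pub fn drop_non_finite_rows(rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
    rows.iter()
        .filter(|row| row.iter().all(|v| v.is_finite()))
        .cloned()
        .collect()
}
```

Listwise deletion like this is only one option; imputation or column removal may suit a given dataset better.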

Modules

Data Layer

Module      Description
dataframe   Column-major tabular data model (DataFrame, Column, DataType)
csv_parser  CSV parsing with automatic type inference
error       Error types (InsightError)

Profiling Layer (dirty data tolerated)

Module     Description
profiling  Column-level and dataset-level data profiling: descriptive stats, missing analysis, outlier flagging (IQR/Z-score/Modified Z-score), diagnostic flags
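To make the IQR flagging concrete, here is a small std-only sketch of Tukey fences. The struct reuses the field names lower_fence, upper_fence, center and spread from the 0.9.0 notes, but the meanings given to center (median) and spread (IQR), the linear-interpolation quantile rule, and the choice to skip non-finite values are all assumptions, not confirmed crate behaviour.

```rust
/// Linear-interpolation quantile on an already sorted, non-empty slice.
fn quantile_sorted(sorted: &[f64], p: f64) -> f64 {
    let pos = p * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Illustrative result type; field meanings are assumed (center = median, spread = IQR).
pub struct FenceSketch {
    pub lower_fence: f64,
    pub upper_fence: f64,
    pub center: f64,
    pub spread: f64,
}

/// Tukey fences Q1 - k*IQR and Q3 + k*IQR, ignoring non-finite values.
/// Returns None when no finite value remains.
pub fn iqr_fences(data: &[f64], k: f64) -> Option<FenceSketch> {
    let mut sorted: Vec<f64> = data.iter().copied().filter(|v| v.is_finite()).collect();
    if sorted.is_empty() {
        return None;
    }
    sorted.sort_by(|a, b| a.total_cmp(b));
    let q1 = quantile_sorted(&sorted, 0.25);
    let q3 = quantile_sorted(&sorted, 0.75);
    let spread = q3 - q1;
    Some(FenceSketch {
        lower_fence: q1 - k * spread,
        upper_fence: q3 + k * spread,
        center: quantile_sorted(&sorted, 0.5),
        spread,
    })
}

/// True for every value strictly outside the fences; non-finite values are not flagged.
pub fn flag_outliers(data: &[f64], k: f64) -> Vec<bool> {
    match iqr_fences(data, k) {
        Some(f) => data.iter().map(|v| *v < f.lower_fence || *v > f.upper_fence).collect(),
        None => vec![false; data.len()],
    }
}
```

The conventional multiplier is k = 1.5; a larger k flags only more extreme points.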

Analysis Layer (clean data required)

Module              Description
analysis            Correlation (Pearson/Spearman/Kendall), regression (simple/multiple OLS), Cramer's V contingency analysis
clustering          K-Means++ (auto-K, Gap Statistic), Mini-Batch K-Means, DBSCAN, Hierarchical Agglomerative (Single/Complete/Average/Ward), HDBSCAN
distribution        ECDF, histogram bins (Sturges/Scott/FD), QQ-plot, normality tests (KS, Jarque-Bera, Shapiro-Wilk, Anderson-Darling), Grubbs test, distribution fitting
pca                 Principal Component Analysis with auto-scaling option
isolation_forest    Isolation Forest anomaly detection (Liu et al. 2008)
lof                 Local Outlier Factor (LOF) density-based anomaly detection
mahalanobis         Mahalanobis distance multivariate outlier detection
feature_importance  Variance threshold, correlation filter, VIF, condition number, composite importance, ANOVA F-test selection, Mutual Information, Permutation Importance
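The three histogram rules named for the distribution module differ only in how they pick the bin count or width. The sketch below writes out the textbook formulas in std-only Rust; the function names are hypothetical, and the crate's rounding and edge-case handling may differ.

```rust
/// Sturges: number of bins = ceil(log2(n)) + 1.
pub fn sturges_bins(n: usize) -> usize {
    if n == 0 {
        0
    } else {
        (n as f64).log2().ceil() as usize + 1
    }
}

/// Scott: bin width = 3.49 * std_dev * n^(-1/3).
pub fn scott_width(std_dev: f64, n: usize) -> f64 {
    3.49 * std_dev / (n as f64).cbrt()
}

/// Freedman-Diaconis (FD): bin width = 2 * IQR * n^(-1/3).
pub fn fd_width(iqr: f64, n: usize) -> f64 {
    2.0 * iqr / (n as f64).cbrt()
}

/// Convert a bin width into a bin count over [min, max]; at least one bin.
pub fn bins_from_width(min: f64, max: f64, width: f64) -> usize {
    if !(width > 0.0) || !width.is_finite() {
        return 1;
    }
    ((max - min) / width).ceil().max(1.0) as usize
}
```

Sturges depends only on n, while Scott and FD adapt to the spread of the data; FD uses the IQR and is therefore less sensitive to outliers than Scott.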

FFI Layer

Module  Description
ffi     C FFI bindings: 32 functions, 20 #[repr(C)] structs, auto-generated C header via cbindgen

Quick Start

use u_insight::csv_parser::CsvParser;
use u_insight::profiling::profile_dataframe;

// 1. Parse CSV
let csv = "name,value,active\nAlice,1.5,true\nBob,2.3,false\nCharlie,3.1,true\n";
let df = CsvParser::new().parse_str(csv).unwrap();

// 2. Profile
let profiles = profile_dataframe(&df);

Clustering

use u_insight::clustering::{kmeans, dbscan, KMeansConfig, DbscanConfig};

let data = vec![
    vec![0.0, 0.0], vec![0.5, 0.5],
    vec![10.0, 10.0], vec![10.5, 10.5],
];

// K-Means
let km = kmeans(&data, &KMeansConfig::new(2)).unwrap();
assert_eq!(km.k, 2);

// DBSCAN
let db = dbscan(&data, &DbscanConfig::new(1.5, 2)).unwrap();
assert_eq!(db.n_clusters, 2);

Distribution Analysis

use u_insight::distribution::{distribution_analysis, DistributionConfig};

let data: Vec<f64> = (0..50).map(|i| (i as f64 - 25.0) * 0.2).collect();
let result = distribution_analysis(&data, &DistributionConfig::default()).unwrap();
println!("Normal: {}", result.normality.is_normal);

C FFI

u-insight builds as cdylib + staticlib for cross-language interop. A C header (u_insight.h) is auto-generated by cbindgen at build time.

Profiling

Function                   Description
insight_profile_csv        Profile a CSV string → opaque context
insight_profile_free       Free profile context
insight_profile_row_count  Row count from profile
insight_profile_col_count  Column count from profile
insight_profile_column     Get column summary

Clustering

Function                   Description
insight_kmeans             K-Means++ clustering
insight_mini_batch_kmeans  Mini-Batch K-Means clustering
insight_dbscan             DBSCAN density-based clustering
insight_hierarchical       Hierarchical Agglomerative clustering (4 linkages)
insight_hdbscan            HDBSCAN clustering with membership probabilities
insight_gap_statistic      Gap statistic for optimal K selection

Dimensionality Reduction

Function     Description
insight_pca  Principal Component Analysis

Anomaly Detection

Function                  Description
insight_isolation_forest  Isolation Forest anomaly detection
insight_lof               Local Outlier Factor detection
insight_mahalanobis       Mahalanobis distance outlier detection

Statistical Analysis

Function             Description
insight_correlation  Correlation matrix (method: u32 parameter; INSIGHT_CORR_PEARSON = 0 selects Pearson)
insight_regression   Simple linear regression
insight_cramers_v    Cramer's V contingency analysis

Distribution

Function              Description
insight_distribution  Normality testing (KS, JB, SW, AD)

Feature Importance

Function                        Description
insight_feature_importance      Composite feature importance scores
insight_anova_select            ANOVA F-test feature selection
insight_mutual_info             Mutual information feature ranking
insight_permutation_importance  Permutation importance for regression

Memory Management

Function                     Description
insight_free_labels          Free u32 label arrays
insight_free_i32_array       Free i32 arrays
insight_free_f64_array       Free f64 arrays
insight_free_anova_features  Free ANOVA feature arrays
insight_free_mi_features     Free MI feature arrays
insight_free_perm_features   Free permutation importance arrays
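The reason a dedicated free function exists per array type is that memory allocated by Rust must be released by Rust, with the same element type and length. The sketch below shows one common way to implement that handoff. The names export_f64_array and sketch_free_f64_array are hypothetical, and the real insight_free_f64_array signature may differ.

```rust
/// Hand a Vec<f64> to a foreign caller as a raw pointer plus length.
/// Ownership moves to the caller, who must later pass both values to the free function.
pub fn export_f64_array(values: Vec<f64>, out_len: &mut usize) -> *mut f64 {
    let boxed: Box<[f64]> = values.into_boxed_slice();
    *out_len = boxed.len();
    Box::into_raw(boxed) as *mut f64
}

/// Hypothetical counterpart to a function such as insight_free_f64_array.
/// A real export would also carry a no_mangle attribute; it is omitted here.
///
/// # Safety
/// `ptr` and `len` must come from a single call to `export_f64_array`,
/// and the array must not be freed twice.
pub unsafe extern "C" fn sketch_free_f64_array(ptr: *mut f64, len: usize) {
    if ptr.is_null() {
        return; // Freeing null is a no-op, mirroring C's free().
    }
    unsafe {
        // Rebuild the boxed slice so Rust's allocator releases it.
        let slice = std::ptr::slice_from_raw_parts_mut(ptr, len);
        drop(Box::from_raw(slice));
    }
}
```

Passing such a pointer to C's free() instead would be undefined behaviour, because the allocators are not guaranteed to match.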

Error & Version

Function             Description
insight_last_error   Last error message (thread-local)
insight_clear_error  Clear error state
insight_version      Library version string

All FFI functions use catch_unwind to prevent panics from crossing the FFI boundary.
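As a rough picture of how catch_unwind and a thread-local error slot can work together, consider the std-only sketch below. The names ffi_guard, sketch_mean and last_error, and the return codes 0, -1 and -2, are assumptions for illustration; they are not the crate's actual symbols or error codes.

```rust
use std::cell::RefCell;
use std::panic::{catch_unwind, AssertUnwindSafe};

thread_local! {
    // One error slot per thread, so concurrent callers do not overwrite each other.
    static LAST_ERROR: RefCell<Option<String>> = RefCell::new(None);
}

fn set_last_error(msg: &str) {
    LAST_ERROR.with(|e| *e.borrow_mut() = Some(msg.to_string()));
}

/// Last error recorded on the calling thread, if any.
pub fn last_error() -> Option<String> {
    LAST_ERROR.with(|e| e.borrow().clone())
}

/// Run `f`, turning both ordinary errors and panics into integer status codes:
/// 0 = success, -1 = error reported by `f`, -2 = panic caught.
pub fn ffi_guard<F>(f: F) -> i32
where
    F: FnOnce() -> Result<(), String>,
{
    match catch_unwind(AssertUnwindSafe(f)) {
        Ok(Ok(())) => 0,
        Ok(Err(msg)) => {
            set_last_error(&msg);
            -1
        }
        Err(_) => {
            set_last_error("panic caught at FFI boundary");
            -2
        }
    }
}

/// Hypothetical entry point: writes the mean of `len` doubles to `out`.
///
/// # Safety
/// `data` must point to `len` readable f64 values and `out` to one writable f64.
pub unsafe extern "C" fn sketch_mean(data: *const f64, len: usize, out: *mut f64) -> i32 {
    ffi_guard(|| {
        if data.is_null() || out.is_null() || len == 0 {
            return Err("null pointer or empty input".to_string());
        }
        let values = unsafe { std::slice::from_raw_parts(data, len) };
        let mean = values.iter().sum::<f64>() / len as f64;
        unsafe { *out = mean };
        Ok(())
    })
}
```

Unwinding across an extern "C" boundary is not something a C caller can handle, which is why the guard converts a panic into a status code and stores the message for later retrieval.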

C# Binding (UInsight)

Install via NuGet — native libraries are bundled automatically:

dotnet add package UInsight

using UInsight;

using var client = new InsightClient();
Console.WriteLine(client.GetVersion());

var data = new double[,] { {0,0}, {1,1}, {10,10}, {11,11} };
var result = client.KMeans(data, k: 2);
Console.WriteLine($"K={result.K}, WCSS={result.Wcss:F2}");

The binding is in bindings/csharp/UInsight/ with:

  • Interop/NativeLibrary.cs: [LibraryImport] declarations for all 32 FFI functions
  • Interop/NativeStructs.cs: [StructLayout] mappings for all 20 C structs
  • InsightClient.cs: High-level managed API (automatic memory management)
  • InsightException.cs: Error code to exception conversion

Test Status

357 lib tests + 49 doc-tests = 406 total
0 clippy warnings
Build: lib + cdylib + staticlib
C header: auto-generated via cbindgen (20 structs, 32 functions)

Scope & Non-Goals

In Scope:

  • Data profiling (dirty data → quality report + diagnostic flags)
  • Statistical analysis (clean data → patterns + relationships)
  • Correlation, regression, clustering, PCA, anomaly detection
  • Feature importance and selection (ANOVA, MI, Permutation)
  • Distribution analysis and normality testing
  • C FFI for cross-language use
  • C# binding (UInsight NuGet package)

Out of Scope:

  • Visualization / charting
  • Data cleaning / transformation / imputation
  • ML model training / deployment
  • Deep learning

Requirements

  • Rust 1.75+
  • Dependencies: u-analytics, u-numflow

WebAssembly / npm

Available as an npm package via wasm-pack.

npm install @iyulab/u-insight

Quick Start

import init, { describe, kmeans } from '@iyulab/u-insight';

await init();
const stats = describe({ col1: [1, 2, 3], col2: [4, 5, 6] });

Functions

describe(data) -> [ColumnResult]

Descriptive statistics per column. Input: column-major { "col1": [1,2,3] }.

Output: Array of { name, data_type, numeric: { count, min, max, mean, median, std_dev, variance, skewness, kurtosis, q1, q3, iqr, p5, p95, ... } }.

correlation_matrix(data) -> CorrelationResult

Correlation matrix, Pearson unless the optional _method field ("pearson" | "spearman" | "kendall") selects another method. Input: column-major { "col1": [1,2,3], "col2": [4,5,6] }.

Output:

{ "names": ["col1","col2"], "matrix": [1,0.99,0.99,1], "n": 2, "high_pairs": [{ "col_a": "col1", "col_b": "col2", "r": 0.99, "p_value": 0.01 }] }
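The "matrix" field above is a flat array rather than a nested one. The Rust sketch below, which is independent of the WASM API, builds a Pearson matrix in that flat form; the row-major indexing (entry (i, j) at index i * p + j) is an assumption, although for a symmetric matrix row-major and column-major coincide. Function names are hypothetical.

```rust
/// Pearson correlation of two equal-length samples.
/// Returns None for mismatched lengths, n < 2, or a constant column.
pub fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    let n = x.len();
    if n < 2 || n != y.len() {
        return None;
    }
    let mx = x.iter().sum::<f64>() / n as f64;
    let my = y.iter().sum::<f64>() / n as f64;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mx, b - my);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        None
    } else {
        Some(sxy / (sxx * syy).sqrt())
    }
}

/// Flat p x p correlation matrix for p columns; entry (i, j) lives at index i * p + j.
pub fn correlation_matrix_flat(columns: &[Vec<f64>]) -> Option<Vec<f64>> {
    let p = columns.len();
    let mut matrix = vec![1.0; p * p]; // Diagonal stays at 1.0.
    for i in 0..p {
        for j in (i + 1)..p {
            let r = pearson(&columns[i], &columns[j])?;
            matrix[i * p + j] = r;
            matrix[j * p + i] = r;
        }
    }
    Some(matrix)
}
```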

kmeans(data, k) -> KMeansResult

K-Means++ clustering on row-major data [[x,y,...], ...].

Output:

{ "k": 3, "labels": [0,0,1,1,2,2], "centroids": [[...]], "wcss": 5.2, "iterations": 12, "cluster_sizes": [2,2,2] }
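The "centroids" and "wcss" fields can be recomputed from "labels" alone, which is a handy sanity check on any clustering result. A minimal Rust sketch, with hypothetical function names and no connection to the WASM API, assuming every label is smaller than k:

```rust
/// Mean of the rows assigned to each of the k clusters (empty clusters stay at zero).
pub fn centroids(data: &[Vec<f64>], labels: &[usize], k: usize) -> Vec<Vec<f64>> {
    let dim = data.first().map_or(0, |row| row.len());
    let mut sums = vec![vec![0.0; dim]; k];
    let mut counts = vec![0usize; k];
    for (row, &c) in data.iter().zip(labels) {
        counts[c] += 1;
        for (s, v) in sums[c].iter_mut().zip(row) {
            *s += v;
        }
    }
    for (sum, &n) in sums.iter_mut().zip(&counts) {
        if n > 0 {
            for s in sum.iter_mut() {
                *s /= n as f64;
            }
        }
    }
    sums
}

/// Within-cluster sum of squares: squared Euclidean distance of each row to its centroid.
pub fn wcss(data: &[Vec<f64>], labels: &[usize], centroids: &[Vec<f64>]) -> f64 {
    data.iter()
        .zip(labels)
        .map(|(row, &c)| {
            row.iter()
                .zip(&centroids[c])
                .map(|(a, b)| (a - b).powi(2))
                .sum::<f64>()
        })
        .sum()
}
```

Lower WCSS means tighter clusters, but it always decreases as k grows, so it cannot be used on its own to choose k.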

silhouette(data, labels, k) -> SilhouetteResult

Silhouette analysis for an existing clustering assignment. Works with any clustering output (kmeans, dbscan, hierarchical, etc.). data is row-major [[x,y,...], ...], labels is one cluster id per row (each < k), k is the number of distinct clusters. O(n²) — use sparingly on very large inputs.

Output:

{ "avg": 0.74, "per_sample": [0.81, 0.79, 0.62, ...] }

avg ranges from -1 (wrong cluster) to +1 (well-separated); singleton-cluster points report 0.0 in per_sample.
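The O(n²) cost comes from comparing every point with every other point. The std-only Rust sketch below spells out the standard definition s = (b - a) / max(a, b) with Euclidean distance. It is an illustration rather than the library's code, and it assumes every label is smaller than k.

```rust
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(p, q)| (p - q).powi(2)).sum::<f64>().sqrt()
}

/// Returns (average silhouette, per-sample silhouettes).
/// a = mean distance to the other members of the point's own cluster,
/// b = smallest mean distance to the members of any other non-empty cluster.
pub fn silhouette(data: &[Vec<f64>], labels: &[usize], k: usize) -> (f64, Vec<f64>) {
    let n = data.len();
    let mut sizes = vec![0usize; k];
    for &l in labels {
        sizes[l] += 1;
    }
    let mut per_sample = Vec::with_capacity(n);
    for i in 0..n {
        // Sum of distances from point i to the members of each cluster.
        let mut sums = vec![0.0; k];
        for j in 0..n {
            if i != j {
                sums[labels[j]] += euclidean(&data[i], &data[j]);
            }
        }
        let own = labels[i];
        if sizes[own] <= 1 {
            per_sample.push(0.0); // Singleton cluster: reported as 0.0.
            continue;
        }
        let a = sums[own] / (sizes[own] - 1) as f64;
        let b = (0..k)
            .filter(|&c| c != own && sizes[c] > 0)
            .map(|c| sums[c] / sizes[c] as f64)
            .fold(f64::INFINITY, f64::min);
        let denom = a.max(b);
        let s = if b.is_finite() && denom > 0.0 { (b - a) / denom } else { 0.0 };
        per_sample.push(s);
    }
    let avg = if n == 0 { 0.0 } else { per_sample.iter().sum::<f64>() / n as f64 };
    (avg, per_sample)
}
```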

pca(data, n_components) -> PcaResult

Principal Component Analysis on row-major data.

Output:

{ "n_components": 2, "n_features": 4, "eigenvalues": [3.1,0.9], "explained_variance_ratio": [0.77,0.23], "cumulative_variance_ratio": [0.77,1.0], "loadings": [[...]], "scores": [[...]], "means": [...], "stds": [...] }
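The two ratio fields follow directly from the eigenvalues. A small Rust sketch of that arithmetic, independent of the WASM API: total_variance is the sum of all eigenvalues, which for standardized input equals the number of features. The function name, and the choice of denominator, are assumptions.

```rust
/// Returns (explained_variance_ratio, cumulative_variance_ratio) for the kept components.
/// `total_variance` is the sum over ALL eigenvalues, not only the kept ones.
pub fn explained_variance(eigenvalues: &[f64], total_variance: f64) -> (Vec<f64>, Vec<f64>) {
    let ratio: Vec<f64> = eigenvalues.iter().map(|e| e / total_variance).collect();
    let mut running = 0.0;
    let cumulative: Vec<f64> = ratio
        .iter()
        .map(|r| {
            running += r;
            running
        })
        .collect();
    (ratio, cumulative)
}
```

A common use is to keep the smallest number of components whose cumulative ratio passes a chosen threshold such as 0.9.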

dbscan(data, config) -> DbscanResult

DBSCAN density-based clustering. config: { "epsilon": 1.5, "min_samples": 3 }.

Output:

{ "labels": [0,0,null,1,1], "n_clusters": 2, "noise_count": 1, "cluster_sizes": [2,2], "core_points": [true,true,false,true,true] }
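Noise points carry a null label, which maps naturally to Option in Rust. The sketch below derives n_clusters, noise_count and cluster_sizes from a label array shaped like the one above; it is a standalone illustration with a hypothetical function name, and it assumes cluster ids are contiguous from 0.

```rust
/// Returns (n_clusters, noise_count, cluster_sizes) for labels where None marks noise.
pub fn summarize_labels(labels: &[Option<usize>]) -> (usize, usize, Vec<usize>) {
    // Highest cluster id + 1, or 0 when every point is noise.
    let n_clusters = labels.iter().flatten().max().map_or(0, |m| m + 1);
    let mut cluster_sizes = vec![0usize; n_clusters];
    let mut noise_count = 0;
    for label in labels {
        match label {
            Some(c) => cluster_sizes[*c] += 1,
            None => noise_count += 1,
        }
    }
    (n_clusters, noise_count, cluster_sizes)
}
```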

hierarchical(data, config) -> HierarchicalResult

Hierarchical agglomerative clustering. config: { "linkage": "ward", "n_clusters": 3 } or { "linkage": "single", "distance_threshold": 5.0 }.

Output:

{ "merges": [{ "cluster_a": 0, "cluster_b": 1, "distance": 1.2, "size": 2 }], "labels": [0,0,1,1,2], "n_clusters": 3 }

isolation_forest(data, config) -> IsolationForestResult

Isolation Forest anomaly detection. config: { "n_estimators": 100, "contamination": 0.1, "seed": 42 }.

Output:

{ "scores": [0.45, 0.82], "anomalies": [false, true], "threshold": 0.65, "anomaly_count": 1, "anomaly_fraction": 0.5 }
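The scores come from the normalization in Liu et al. (2008): s = 2^(-E[h] / c(n)), where E[h] is a point's mean path length over the trees and c(n) is the average path length of an unsuccessful binary-search-tree lookup among n points. A short Rust sketch of just that formula follows; the function names are hypothetical, the harmonic number is approximated by ln(i) plus Euler's constant, and the tree building itself is not shown.

```rust
const EULER_GAMMA: f64 = 0.577_215_664_9;

/// c(n) = 2 * H(n - 1) - 2 * (n - 1) / n, with H(i) approximated by ln(i) + gamma.
pub fn average_path_length(n: usize) -> f64 {
    match n {
        0 | 1 => 0.0,
        2 => 1.0,
        _ => {
            let m = (n - 1) as f64;
            2.0 * (m.ln() + EULER_GAMMA) - 2.0 * m / n as f64
        }
    }
}

/// Anomaly score in (0, 1]; requires n >= 2. Shorter paths give scores closer to 1.
pub fn anomaly_score(mean_path_length: f64, n: usize) -> f64 {
    2f64.powf(-mean_path_length / average_path_length(n))
}
```

A point whose mean path length equals c(n) scores 0.5; points that are isolated after very few splits approach 1 and are the ones flagged once they exceed the threshold.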

lof(data, config) -> LofResult

Local Outlier Factor anomaly detection. config: { "k": 20, "threshold": 1.5 }.

Output:

{ "scores": [1.0, 2.3], "anomalies": [false, true], "threshold": 1.5, "anomaly_count": 1, "anomaly_fraction": 0.5 }

distribution_analysis(data, config) -> DistributionResult

Distribution analysis on a 1-D array. config: { "bin_method": "freedman_diaconis", "significance_level": 0.05, "compute_ecdf": true, "compute_histogram": true, "compute_qq_plot": true, "fit_distributions": false }.

Output:

{ "n": 100, "ecdf": { "values": [...], "probabilities": [...] }, "histogram": { "n_bins": 10, "bin_width": 0.5, "edges": [...], "counts": [...] }, "qq_plot": { "theoretical": [...], "sample": [...] }, "normality": { "shapiro_wilk": { "statistic": 0.98, "p_value": 0.45, "rejected": false }, "is_normal": true }, "fits": [] }
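The "ecdf" field pairs sorted values with cumulative probabilities. A minimal Rust sketch of the usual i/n definition, independent of the WASM API; whether the library merges tied values or uses a different plotting position is not stated, so treat both as assumptions.

```rust
/// Empirical CDF: returns (sorted values, probabilities) with probability i/n
/// for the i-th smallest value (1-based). Ties are kept as separate entries.
pub fn ecdf(data: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let mut values = data.to_vec();
    values.sort_by(|a, b| a.total_cmp(b));
    let n = values.len() as f64;
    let probabilities = (1..=values.len()).map(|i| i as f64 / n).collect();
    (values, probabilities)
}
```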

regression(data) -> RegressionResult

OLS regression analysis.

Input:

{ "predictors": { "x1": [1,2,3,4,5] }, "target": [2.1, 3.9, 6.1, 7.9, 10.1], "target_name": "y" }

Output:

{ "target_name": "y", "predictor_names": ["x1"], "r_squared": 0.99, "adj_r_squared": 0.99, "coefficients": [0.1, 2.0], "p_values": [0.9, 0.0001], "vif": [1.0], "f_p_value": 0.0001 }
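For a single predictor, the coefficients and R² have closed forms, which makes it easy to cross-check a result by hand. The Rust sketch below computes them with the textbook least-squares formulas; it is not the library's solver, the function name is hypothetical, and the (intercept, slope, r_squared) return order is an assumption.

```rust
/// Simple least-squares fit y = intercept + slope * x.
/// Returns (intercept, slope, r_squared), or None for mismatched lengths,
/// fewer than two points, or a constant x or y.
pub fn simple_ols(x: &[f64], y: &[f64]) -> Option<(f64, f64, f64)> {
    let n = x.len();
    if n < 2 || n != y.len() {
        return None;
    }
    let mx = x.iter().sum::<f64>() / n as f64;
    let my = y.iter().sum::<f64>() / n as f64;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mx, b - my);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    let slope = sxy / sxx;
    let intercept = my - slope * mx;
    let r_squared = (sxy * sxy) / (sxx * syy);
    Some((intercept, slope, r_squared))
}
```

With several predictors there is no such shortcut and the normal equations (or a QR decomposition) are needed instead.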

feature_importance(data) -> FeatureImportanceResult

Feature importance via permutation, ANOVA, or mutual information.

Input:

{ "features": { "f1": [1,2,3], "f2": [5,4,3] }, "target": [0,0,1], "method": "permutation", "n_repeats": 5, "seed": 42 }

Output:

{ "method": "permutation", "features": [{ "name": "f1", "index": 0, "score": 0.8, "std_dev": 0.1 }], "baseline_score": 0.5 }

License

MIT License