u-insight
A statistical analysis and data profiling engine in Rust with C FFI bindings.
What's New in 0.9.1
- BREAKING (Rust): `InsightError::NonNumericColumn` variant removed. The 0.9.0 audit redirected all internal call sites to `DegenerateData`, leaving the variant unused. Removed per the "delete over deprecate" policy. External `match` arms over `InsightError` must drop the corresponding branch.
What's New in 0.9.0
- Kendall tau-b correlation added to `CorrelationMethod` (Pearson / Spearman / Kendall)
- Outlier fences exposed on `OutlierResult`: `lower_fence`, `upper_fence`, `center`, `spread`
- `detect_outliers_slice(&[f64], method)` helper for raw-slice input
- `vif_analysis()` and `condition_number()` standalone multicollinearity diagnostics
- WASM: `correlation_matrix` accepts optional `_method` field (`"pearson"` | `"spearman"` | `"kendall"`); new `detect_univariate_outliers`, `vif_diagnostic`, `condition_number_diagnostic`
- C#: `CorrelationMethodKind` enum + `Correlate(data, method)` parameter
- BREAKING (Rust): Non-finite numeric inputs now return `InsightError::DegenerateData` (was `NonNumericColumn`); audit covered 11 call sites
- BREAKING (Rust): `OutlierResult` gained 4 new fields (exhaustive pattern matches must be updated)
- BREAKING (FFI): `insight_correlation` signature gained a `method: u32` parameter (use `INSIGHT_CORR_PEARSON` = 0 to keep prior behaviour)
- BREAKING (C#): `Correlate(...)` now takes an optional `CorrelationMethodKind` parameter (default `Pearson` keeps existing call-sites compiling)
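For readers unfamiliar with the newly added method, here is a minimal, standalone sketch of Kendall tau-b, computed as (C - D) / sqrt((C + D + Tx)(C + D + Ty)), where C and D count concordant and discordant pairs and Tx, Ty count pairs tied only in x or only in y. The function name `kendall_tau_b` and its `Option` return are assumptions made for illustration; this is not u-insight's implementation, which may use a faster algorithm and a different error type.

```rust
/// Kendall tau-b via the O(n^2) pairwise definition.
/// Hypothetical helper for illustration, not part of the u-insight API.
/// Returns None for mismatched lengths, fewer than two points,
/// non-finite values, or a constant variable (zero denominator).
fn kendall_tau_b(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    if x.iter().chain(y.iter()).any(|v| !v.is_finite()) {
        return None;
    }
    let (mut concordant, mut discordant) = (0i64, 0i64);
    let (mut ties_x, mut ties_y) = (0i64, 0i64);
    for i in 0..x.len() {
        for j in (i + 1)..x.len() {
            let dx = x[i] - x[j];
            let dy = y[i] - y[j];
            if dx == 0.0 && dy == 0.0 {
                // Tied in both variables: contributes to neither count.
            } else if dx == 0.0 {
                ties_x += 1;
            } else if dy == 0.0 {
                ties_y += 1;
            } else if (dx > 0.0) == (dy > 0.0) {
                concordant += 1;
            } else {
                discordant += 1;
            }
        }
    }
    let base = (concordant + discordant) as f64;
    let denom = ((base + ties_x as f64) * (base + ties_y as f64)).sqrt();
    if denom == 0.0 {
        return None;
    }
    Some((concordant - discordant) as f64 / denom)
}
```

Rejecting non-finite input up front mirrors the release note above about non-finite values being treated as an error rather than silently propagated.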
Overview
u-insight transforms raw tabular data into actionable statistical insights. It operates in two distinct layers with opposite assumptions about input data quality:
CSV (raw)
 │
 ├─→ Profiling ─→ "What is the state of this data?"
 │       Tolerates dirty data (missing values, type mismatches expected)
 │
 │   (external preprocessing)
 │
 └─→ Analysis ─→ "What can we learn from this data?"
         Requires clean numeric data (no NaN, no missing)
Built on u-analytics (statistical algorithms) and u-numflow (math primitives).
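Because the analysis layer requires clean numeric input, callers may want to validate data before handing it over. The sketch below shows one way to do that. The enum `CheckError` is a stand-in whose variant only borrows the name `DegenerateData` from `InsightError`; its payload and the helper `require_finite` are hypothetical and not part of the crate.

```rust
/// Stand-in error type for this sketch. The variant name echoes
/// `InsightError::DegenerateData`; the fields are an assumption.
#[derive(Debug, PartialEq)]
enum CheckError {
    DegenerateData { column: usize, row: usize },
}

/// Hypothetical pre-check: reports the first NaN or infinite value
/// found in column-major data, or Ok(()) when every value is finite.
fn require_finite(columns: &[Vec<f64>]) -> Result<(), CheckError> {
    for (column, values) in columns.iter().enumerate() {
        if let Some(row) = values.iter().position(|v| !v.is_finite()) {
            return Err(CheckError::DegenerateData { column, row });
        }
    }
    Ok(())
}
```

Running a check like this between the profiling step and the analysis step makes the "external preprocessing" boundary in the diagram explicit in calling code.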
Modules
Data Layer
| Module | Description |
|---|---|
| `dataframe` | Column-major tabular data model (`DataFrame`, `Column`, `DataType`) |
| `csv_parser` | CSV parsing with automatic type inference |
| `error` | Error types (`InsightError`) |
Profiling Layer (dirty data tolerated)
| Module | Description |
|---|---|
| `profiling` | Column-level and dataset-level data profiling: descriptive stats, missing analysis, outlier flagging (IQR/Z-score/Modified Z-score), diagnostic flags |
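To make the IQR outlier flagging concrete, here is a minimal standalone sketch. It assumes linear-interpolation quantiles and the conventional multiplier k = 1.5; the quantile method u-insight actually uses is not stated in this README, and the function names are hypothetical. In keeping with the profiling layer's tolerance for dirty data, non-finite values are skipped rather than rejected.

```rust
/// Linear-interpolation quantile on a sorted, non-empty slice.
fn quantile_sorted(sorted: &[f64], p: f64) -> f64 {
    let h = p * (sorted.len() - 1) as f64;
    let (lo, hi) = (h.floor() as usize, h.ceil() as usize);
    sorted[lo] + (h - lo as f64) * (sorted[hi] - sorted[lo])
}

/// Returns (lower_fence, upper_fence) = (Q1 - k*IQR, Q3 + k*IQR),
/// computed over the finite values only. None if no finite value exists.
fn iqr_fences(data: &[f64], k: f64) -> Option<(f64, f64)> {
    let mut sorted: Vec<f64> = data.iter().copied().filter(|v| v.is_finite()).collect();
    if sorted.is_empty() {
        return None;
    }
    sorted.sort_by(|a, b| a.total_cmp(b));
    let q1 = quantile_sorted(&sorted, 0.25);
    let q3 = quantile_sorted(&sorted, 0.75);
    let iqr = q3 - q1;
    Some((q1 - k * iqr, q3 + k * iqr))
}

/// Indices (into the original slice) of finite values outside the fences.
fn iqr_outlier_indices(data: &[f64], k: f64) -> Vec<usize> {
    match iqr_fences(data, k) {
        Some((lo, hi)) => data
            .iter()
            .enumerate()
            .filter(|&(_, &v)| v.is_finite() && (v < lo || v > hi))
            .map(|(i, _)| i)
            .collect(),
        None => Vec::new(),
    }
}
```

For the values 1 through 8 plus 100, this gives Q1 = 3 and Q3 = 7, so the fences are -3 and 13 and only the value 100 is flagged.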
Analysis Layer (clean data required)
| Module | Description |
|---|---|
| `analysis` | Correlation (Pearson/Spearman/Kendall), regression (simple/multiple OLS), Cramér's V contingency analysis |
| `clustering` | K-Means++ (auto-K, Gap Statistic), Mini-Batch K-Means, DBSCAN, Hierarchical Agglomerative (Single/Complete/Average/Ward), HDBSCAN |
| `distribution` | ECDF, histogram bins (Sturges/Scott/FD), QQ-plot, normality tests (KS, Jarque-Bera, Shapiro-Wilk, Anderson-Darling), Grubbs test, distribution fitting |
| `pca` | Principal Component Analysis with auto-scaling option |
| `isolation_forest` | Isolation Forest anomaly detection (Liu et al. 2008) |
| `lof` | Local Outlier Factor (LOF) density-based anomaly detection |
| `mahalanobis` | Mahalanobis distance multivariate outlier detection |
| `feature_importance` | Variance threshold, correlation filter, VIF, condition number, composite importance, ANOVA F-test selection, Mutual Information, Permutation Importance |
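As a worked illustration of one entry in the table, Cramér's V can be computed from a contingency table as sqrt(chi² / (n · (min(rows, cols) - 1))). The sketch below is standalone; the function name `cramers_v`, the `f64` count representation, and the `Option` return are assumptions, and no bias correction is applied. It does not reflect the signature of the crate's own function.

```rust
/// Cramér's V for a contingency table of observed counts.
/// Hypothetical helper for illustration only.
/// Returns None for ragged tables, fewer than 2 rows or columns,
/// or any empty row/column (expected count would be zero).
fn cramers_v(table: &[Vec<f64>]) -> Option<f64> {
    let rows = table.len();
    let cols = table.first()?.len();
    if rows < 2 || cols < 2 || table.iter().any(|r| r.len() != cols) {
        return None;
    }
    let row_sums: Vec<f64> = table.iter().map(|r| r.iter().sum::<f64>()).collect();
    let col_sums: Vec<f64> = (0..cols)
        .map(|j| table.iter().map(|r| r[j]).sum::<f64>())
        .collect();
    let n: f64 = row_sums.iter().sum();
    if row_sums.iter().chain(col_sums.iter()).any(|&s| s <= 0.0) {
        return None;
    }
    // Pearson chi-squared statistic against the independence model.
    let mut chi2 = 0.0;
    for i in 0..rows {
        for j in 0..cols {
            let expected = row_sums[i] * col_sums[j] / n;
            let diff = table[i][j] - expected;
            chi2 += diff * diff / expected;
        }
    }
    let k = (rows.min(cols) - 1) as f64;
    Some((chi2 / (n * k)).sqrt())
}
```

A perfectly associated 2×2 table such as [[10, 0], [0, 10]] yields V = 1, while a uniform table such as [[5, 5], [5, 5]] yields V = 0.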
FFI Layer
| Module | Description |
|---|---|
| `ffi` | C FFI bindings: 32 functions, 20 `#[repr(C)]` structs, auto-generated C header via cbindgen |
Quick Start
use u_insight::csv_parser::CsvParser;
use u_insight::profiling::profile_dataframe;
// 1. Parse CSV
let csv = "name,value,active\nAlice,1.5,true\nBob,2.3,false\nCharlie,3.1,true\n";
let df = CsvParser::new().parse_str(csv).unwrap();
// 2. Profile
let profiles = profile_dataframe(&df);
Clustering
use ;
let data = vec!;
// K-Means
let km = kmeans.unwrap;
assert_eq!;
// DBSCAN
let db = dbscan.unwrap;
assert_eq!;
Distribution Analysis
use ;
let data: = .map.collect;
let result = distribution_analysis.unwrap;
println!;
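The distribution module lists Sturges, Scott, and FD (Freedman-Diaconis) as histogram binning rules. As a rough illustration of how two of them choose a bin count, the sketch below uses the textbook formulas: FD bin width = 2 · IQR / n^(1/3), and Sturges bin count = ceil(log2 n) + 1. The function names are hypothetical, linear-interpolation quantiles are assumed, and the crate's own handling of edge cases may differ.

```rust
/// Linear-interpolation quantile on a sorted, non-empty slice.
fn interp_quantile(sorted: &[f64], p: f64) -> f64 {
    let h = p * (sorted.len() - 1) as f64;
    let (lo, hi) = (h.floor() as usize, h.ceil() as usize);
    sorted[lo] + (h - lo as f64) * (sorted[hi] - sorted[lo])
}

/// Freedman-Diaconis bin count: ceil(range / (2 * IQR / cbrt(n))).
/// Returns None for fewer than two points, non-finite input, or IQR == 0,
/// in which case a caller would need a fallback rule.
fn freedman_diaconis_bins(data: &[f64]) -> Option<usize> {
    if data.len() < 2 || data.iter().any(|v| !v.is_finite()) {
        return None;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let iqr = interp_quantile(&sorted, 0.75) - interp_quantile(&sorted, 0.25);
    if iqr <= 0.0 {
        return None;
    }
    let width = 2.0 * iqr / (n as f64).cbrt();
    let range = sorted[n - 1] - sorted[0];
    Some((range / width).ceil() as usize)
}

/// Sturges bin count: ceil(log2(n)) + 1.
fn sturges_bins(n: usize) -> usize {
    (n as f64).log2().ceil() as usize + 1
}
```

FD adapts to spread and sample size, so it tends to suit skewed or heavy-tailed data better than Sturges, which depends on n alone.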
C FFI
u-insight builds as cdylib + staticlib for cross-language interop. A C header (u_insight.h) is auto-generated by cbindgen at build time.
Profiling
| Function | Description |
|---|---|
| `insight_profile_csv` | Profile a CSV string → opaque context |
| `insight_profile_free` | Free profile context |
| `insight_profile_row_count` | Row count from profile |
| `insight_profile_col_count` | Column count from profile |
| `insight_profile_column` | Get column summary |
Clustering
| Function | Description |
|---|---|
| `insight_kmeans` | K-Means++ clustering |
| `insight_mini_batch_kmeans` | Mini-Batch K-Means clustering |
| `insight_dbscan` | DBSCAN density-based clustering |
| `insight_hierarchical` | Hierarchical Agglomerative clustering (4 linkages) |
| `insight_hdbscan` | HDBSCAN clustering with membership probabilities |
| `insight_gap_statistic` | Gap statistic for optimal K selection |
Dimensionality Reduction
| Function | Description |
|---|---|
| `insight_pca` | Principal Component Analysis |
Anomaly Detection
| Function | Description |
|---|---|
| `insight_isolation_forest` | Isolation Forest anomaly detection |
| `insight_lof` | Local Outlier Factor detection |
| `insight_mahalanobis` | Mahalanobis distance outlier detection |
Statistical Analysis
| Function | Description |
|---|---|
| `insight_correlation` | Correlation matrix (Pearson, Spearman, or Kendall, selected by the `method` parameter) |
| `insight_regression` | Simple linear regression |
| `insight_cramers_v` | Cramér's V contingency analysis |
Distribution
| Function | Description |
|---|---|
| `insight_distribution` | Normality testing (KS, JB, SW, AD) |
Feature Importance
| Function | Description |
|---|---|
| `insight_feature_importance` | Composite feature importance scores |
| `insight_anova_select` | ANOVA F-test feature selection |
| `insight_mutual_info` | Mutual information feature ranking |
| `insight_permutation_importance` | Permutation importance for regression |
Memory Management
| Function | Description |
|---|---|
| `insight_free_labels` | Free u32 label arrays |
| `insight_free_i32_array` | Free i32 arrays |
| `insight_free_f64_array` | Free f64 arrays |
| `insight_free_anova_features` | Free ANOVA feature arrays |
| `insight_free_mi_features` | Free MI feature arrays |
| `insight_free_perm_features` | Free permutation importance arrays |
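The table lists one free function per array type. One common reason for that design in Rust FFI is that memory handed to C must be returned to the same allocator with the same element type and length. The sketch below shows that ownership hand-off in general terms. Both function names are hypothetical, and the real `insight_free_*` signatures (including whether they take a length) are not shown in this README.

```rust
/// Hands ownership of a Vec<f64> to the caller as (pointer, length).
/// Hypothetical helper showing how a library might export an array.
fn into_raw_f64(values: Vec<f64>) -> (*mut f64, usize) {
    let boxed: Box<[f64]> = values.into_boxed_slice();
    let len = boxed.len();
    // Box::into_raw leaks the allocation; the caller now owns it.
    (Box::into_raw(boxed) as *mut f64, len)
}

/// Hypothetical counterpart in the style of a typed free function.
/// A real export would also carry a no-mangle attribute.
///
/// # Safety
/// `ptr` must come from `into_raw_f64` with the same `len`,
/// and must not be used or freed again afterwards.
pub unsafe extern "C" fn demo_free_f64_array(ptr: *mut f64, len: usize) {
    if ptr.is_null() {
        return; // freeing null is a no-op, as with C's free()
    }
    // Rebuild the boxed slice so Rust's allocator releases it.
    let slice = std::ptr::slice_from_raw_parts_mut(ptr, len);
    drop(unsafe { Box::from_raw(slice) });
}
```

Whatever the exact signatures, callers should release each returned array with the matching `insight_free_*` function rather than the C runtime's `free`, since the two may use different allocators.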
Error & Version
| Function | Description |
|---|---|
| `insight_last_error` | Last error message (thread-local) |
| `insight_clear_error` | Clear error state |
| `insight_version` | Library version string |
All FFI functions use catch_unwind to prevent panics from crossing the FFI boundary.
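The pattern described above, combined with a thread-local last-error slot, can be sketched as follows. Everything named `demo_*` or `DEMO_*` is hypothetical and chosen for illustration; the crate's actual error codes and function signatures are not documented here. The function deliberately indexes without a bounds check so that an out-of-range index triggers a panic for `catch_unwind` to intercept.

```rust
use std::cell::RefCell;
use std::panic::{catch_unwind, AssertUnwindSafe};

thread_local! {
    // Per-thread error slot, similar in spirit to a "last error" getter.
    static LAST_ERROR: RefCell<Option<String>> = RefCell::new(None);
}

// Hypothetical status codes for this sketch.
const DEMO_OK: i32 = 0;
const DEMO_ERR_INVALID: i32 = 1;
const DEMO_ERR_PANIC: i32 = 2;

fn set_last_error(msg: &str) {
    LAST_ERROR.with(|e| *e.borrow_mut() = Some(msg.to_string()));
}

fn last_error() -> Option<String> {
    LAST_ERROR.with(|e| e.borrow().clone())
}

/// Hypothetical FFI entry point: writes data[index] to `out`.
///
/// # Safety
/// `data` must point to `len` readable f64 values; `out` must be writable.
pub unsafe extern "C" fn demo_element_at(
    data: *const f64,
    len: usize,
    index: usize,
    out: *mut f64,
) -> i32 {
    if data.is_null() || out.is_null() {
        set_last_error("null pointer argument");
        return DEMO_ERR_INVALID;
    }
    // Any panic inside the closure is converted into an Err here
    // instead of unwinding into the foreign caller.
    let result = catch_unwind(AssertUnwindSafe(|| {
        let values = unsafe { std::slice::from_raw_parts(data, len) };
        values[index] // panics when index >= len
    }));
    match result {
        Ok(value) => {
            unsafe { *out = value };
            DEMO_OK
        }
        Err(_) => {
            set_last_error("panic caught at FFI boundary");
            DEMO_ERR_PANIC
        }
    }
}
```

Note that `catch_unwind` only intercepts unwinding panics; a build configured with `panic = "abort"` would still terminate the process.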
C# Binding (UInsight)
Install via NuGet — native libraries are bundled automatically:
using UInsight;
using var client = new InsightClient();
Console.WriteLine(client.GetVersion());
var data = new double[,] { {0,0}, {1,1}, {10,10}, {11,11} };
var result = client.KMeans(data, k: 2);
Console.WriteLine($"K={result.K}, WCSS={result.Wcss:F2}");
The binding is in bindings/csharp/UInsight/ with:
- `Interop/NativeLibrary.cs`: `[LibraryImport]` declarations for all 32 FFI functions
- `Interop/NativeStructs.cs`: `[StructLayout]` mappings for all 20 C structs
- `InsightClient.cs`: High-level managed API (automatic memory management)
- `InsightException.cs`: Error code to exception conversion
Test Status
357 lib tests + 49 doc-tests = 406 total
0 clippy warnings
Build: lib + cdylib + staticlib
C header: auto-generated via cbindgen (20 structs, 32 functions)
Scope & Non-Goals
In Scope:
- Data profiling (dirty data → quality report + diagnostic flags)
- Statistical analysis (clean data → patterns + relationships)
- Correlation, regression, clustering, PCA, anomaly detection
- Feature importance and selection (ANOVA, MI, Permutation)
- Distribution analysis and normality testing
- C FFI for cross-language use
- C# binding (UInsight NuGet package)
Out of Scope:
- Visualization / charting
- Data cleaning / transformation / imputation
- ML model training / deployment
- Deep learning
Requirements
- Rust 1.75+
- Dependencies: `u-analytics`, `u-numflow`
WebAssembly / npm
Available as an npm package via wasm-pack.
Quick Start
import init from '@iyulab/u-insight';
await ;
const stats = ;
Functions
describe(data) -> [ColumnResult]
Descriptive statistics per column. Input: column-major { "col1": [1,2,3] }.
Output: Array of { name, data_type, numeric: { count, min, max, mean, median, std_dev, variance, skewness, kurtosis, q1, q3, iqr, p5, p95, ... } }.
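Note that `describe` and `correlation_matrix` take column-major input, while the clustering and anomaly-detection functions below take row-major input. The sketch below shows the transposition between the two layouts in plain Rust. The function name and the slice-of-pairs representation are assumptions made to keep the example dependency-free; they are not part of the package.

```rust
/// Converts named columns into row-major rows, preserving column order.
/// Hypothetical helper; returns None if there are no columns or the
/// columns have different lengths.
fn columns_to_rows(columns: &[(&str, Vec<f64>)]) -> Option<Vec<Vec<f64>>> {
    let n_rows = columns.first()?.1.len();
    if columns.iter().any(|(_, values)| values.len() != n_rows) {
        return None;
    }
    Some(
        (0..n_rows)
            .map(|r| columns.iter().map(|(_, values)| values[r]).collect())
            .collect(),
    )
}
```

For example, columns `x = [1, 2, 3]` and `y = [4, 5, 6]` become the rows `[1, 4]`, `[2, 5]`, `[3, 6]`.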
correlation_matrix(data) -> CorrelationResult
Correlation matrix. Input: column-major { "col1": [1,2,3], "col2": [4,5,6] }. An optional `_method` field selects "pearson", "spearman", or "kendall".
Output:
kmeans(data, k) -> KMeansResult
K-Means++ clustering on row-major data [[x,y,...], ...].
Output:
silhouette(data, labels, k) -> SilhouetteResult
Silhouette analysis for an existing clustering assignment. Works with any clustering output (kmeans, dbscan, hierarchical, etc.). data is row-major [[x,y,...], ...], labels is one cluster id per row (each < k), k is the number of distinct clusters. O(n²) — use sparingly on very large inputs.
Output:
avg ranges from -1 (wrong cluster) to +1 (well-separated); singleton-cluster points report 0.0 in per_sample.
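To show where the O(n²) cost and the singleton rule come from, here is a minimal standalone sketch of the silhouette coefficient with Euclidean distance. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). The function name and tuple return are assumptions for illustration and do not reflect the package's result type.

```rust
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Returns (average score, per-sample scores). Hypothetical helper.
/// None if data is empty, lengths differ, or a label is >= k.
fn silhouette_scores(
    data: &[Vec<f64>],
    labels: &[usize],
    k: usize,
) -> Option<(f64, Vec<f64>)> {
    let n = data.len();
    if n == 0 || labels.len() != n || labels.iter().any(|&l| l >= k) {
        return None;
    }
    let mut sizes = vec![0usize; k];
    for &l in labels {
        sizes[l] += 1;
    }
    let mut per_sample = Vec::with_capacity(n);
    for i in 0..n {
        let own = labels[i];
        if sizes[own] <= 1 {
            per_sample.push(0.0); // singleton clusters score 0 by convention
            continue;
        }
        // Sum of distances from point i to every cluster (the O(n^2) part).
        let mut sums = vec![0.0; k];
        for j in 0..n {
            if i != j {
                sums[labels[j]] += euclidean(&data[i], &data[j]);
            }
        }
        let a = sums[own] / (sizes[own] - 1) as f64;
        let b = (0..k)
            .filter(|&c| c != own && sizes[c] > 0)
            .map(|c| sums[c] / sizes[c] as f64)
            .fold(f64::INFINITY, f64::min);
        let denom = a.max(b);
        let s = if b.is_finite() && denom > 0.0 { (b - a) / denom } else { 0.0 };
        per_sample.push(s);
    }
    let avg = per_sample.iter().sum::<f64>() / n as f64;
    Some((avg, per_sample))
}
```

Two tight, well-separated pairs such as (0,0), (0,1) and (10,10), (10,11) give an average above 0.9, which matches the interpretation of values near +1.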
pca(data, n_components) -> PcaResult
Principal Component Analysis on row-major data.
Output:
dbscan(data, config) -> DbscanResult
DBSCAN density-based clustering. config: { "epsilon": 1.5, "min_samples": 3 }.
Output:
hierarchical(data, config) -> HierarchicalResult
Hierarchical agglomerative clustering. config: { "linkage": "ward", "n_clusters": 3 } or { "linkage": "single", "distance_threshold": 5.0 }.
Output:
isolation_forest(data, config) -> IsolationForestResult
Isolation Forest anomaly detection. config: { "n_estimators": 100, "contamination": 0.1, "seed": 42 }.
Output:
lof(data, config) -> LofResult
Local Outlier Factor anomaly detection. config: { "k": 20, "threshold": 1.5 }.
Output:
distribution_analysis(data, config) -> DistributionResult
Distribution analysis on a 1-D array. config: { "bin_method": "freedman_diaconis", "significance_level": 0.05, "compute_ecdf": true, "compute_histogram": true, "compute_qq_plot": true, "fit_distributions": false }.
Output:
regression(data) -> RegressionResult
OLS regression analysis.
Input:
Output:
feature_importance(data) -> FeatureImportanceResult
Feature importance via permutation, ANOVA, or mutual information.
Input:
Output:
Related
- u-analytics -- Statistical analytics
- u-numflow -- Mathematical primitives
License
MIT License