Module statistics

Structs

AnalysisContext
AnalysisResults
CategoricalStatistics
ColumnStatistics
ColumnStats
ComputeOptions
CorrelationMatrix
CorrelationPair
DistributionAnalysis
DistributionCharacteristics
DistributionInfo
NumericStatistics
OutlierAnalysis
OutlierRow
PercentileBreakdown

Enums

DistributionType
IqrPosition
OutlierMethod

Constants

SAMPLING_THRESHOLD
Default sampling threshold: datasets of this size or larger are sampled. Used as a fallback when sample_size is None; the app itself uses the value from its config.
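
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the crate's code: the constant's value and the helper names effective_threshold and should_sample are hypothetical, and the sketch assumes that sample_size, when set, simply replaces the default threshold.

```rust
/// Hypothetical value for illustration only; the real constant's value
/// is not shown on this page.
const SAMPLING_THRESHOLD: usize = 10_000;

/// Returns the threshold to apply: the caller's `sample_size` if set,
/// otherwise the module-level default (assumed behavior).
fn effective_threshold(sample_size: Option<usize>) -> usize {
    sample_size.unwrap_or(SAMPLING_THRESHOLD)
}

/// A dataset is sampled when its size is at least the threshold (`>=`),
/// matching the "datasets >= this size" wording of the constant's docs.
fn should_sample(row_count: usize, sample_size: Option<usize>) -> bool {
    row_count >= effective_threshold(sample_size)
}
```

Under these assumptions, should_sample(n, None) compares n against SAMPLING_THRESHOLD, while should_sample(n, Some(500)) compares it against 500.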

Functions

analysis_results_from_describe
Builds describe-only AnalysisResults from a list of column statistics.
calculate_fit_quality
Calculates fit quality (p-value) for a given distribution type.
calculate_theoretical_bin_probabilities
Calculates probabilities for each bin defined by bin_boundaries.
calculate_theoretical_probability_in_interval
Calculates the probability that a value falls in [lower, upper] for the given distribution.
collect_lazy
Collects a LazyFrame into a DataFrame.
compute_correlation_matrix
Computes pairwise Pearson correlation matrix for all numeric columns.
compute_correlation_pair
Computes correlation statistics for a pair of columns.
compute_correlation_statistics
Computes correlation matrix if not already present in results.
compute_describe_column
Computes describe statistics for a single column of an already-collected DataFrame.
compute_describe_from_lazy
Computes describe statistics from a LazyFrame without materializing all rows. When sampling is disabled, runs a single aggregating collect (as Polars describe does), so performance is similar. When sampling is enabled, samples first and then runs describe on the sample.
compute_describe_single_aggregation
Computes describe statistics in a single aggregation pass over the DataFrame. Uses one collect() with aggregated expressions for all columns (count, null_count, mean, std, min, percentiles, max).
compute_distribution_statistics
Computes distribution statistics for numeric columns.
compute_statistics
Computes statistics for a LazyFrame with default options.
compute_statistics_with_options
Computes comprehensive statistics for a LazyFrame.
sample_dataframe
Samples a LazyFrame for analysis when row count exceeds threshold. Used by chunked describe.
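
The descriptions of calculate_theoretical_probability_in_interval and calculate_theoretical_bin_probabilities above suggest the usual approach of differencing a cumulative distribution function (CDF). The sketch below illustrates that idea only; it is not the crate's implementation. The function names, signatures, and the choice of an exponential distribution are all assumptions made for the example.

```rust
/// CDF of an exponential distribution with rate `lambda` (assumed > 0).
/// Chosen here only because it has a closed form in std Rust.
fn exponential_cdf(x: f64, lambda: f64) -> f64 {
    if x <= 0.0 {
        0.0
    } else {
        1.0 - (-lambda * x).exp()
    }
}

/// P(lower <= X <= upper) = F(upper) - F(lower) for a continuous
/// distribution with CDF `F`. Empty or inverted intervals yield 0.
fn probability_in_interval(lower: f64, upper: f64, cdf: impl Fn(f64) -> f64) -> f64 {
    if upper <= lower {
        return 0.0;
    }
    // Clamp to guard against tiny floating-point excursions outside [0, 1].
    (cdf(upper) - cdf(lower)).clamp(0.0, 1.0)
}

/// One probability per bin: n boundaries define n - 1 adjacent bins,
/// so fewer than two boundaries produce an empty result.
fn bin_probabilities(bin_boundaries: &[f64], cdf: impl Fn(f64) -> f64) -> Vec<f64> {
    bin_boundaries
        .windows(2)
        .map(|w| probability_in_interval(w[0], w[1], &cdf))
        .collect()
}
```

With a closure such as |x| exponential_cdf(x, 1.0) as the CDF, calling bin_probabilities on the boundaries [0.0, 1.0, 2.0, 3.0] returns three values whose sum equals the CDF evaluated at 3.0, since adjacent bins tile the interval without overlap.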