# Descriptive Statistics Theory
Descriptive statistics summarize and describe the main features of a dataset. This chapter covers aprender's statistics module, focusing on quantiles, five-number summaries, and histogram generation with adaptive binning.
## Quantiles and Percentiles
### Definition
A **quantile** divides a dataset into equal-sized groups. The q-th quantile (0 ≤ q ≤ 1) is the value below which a proportion q of the data falls.
**Percentiles** are quantiles multiplied by 100:
- 25th percentile = 0.25 quantile (Q1)
- 50th percentile = 0.50 quantile (median, Q2)
- 75th percentile = 0.75 quantile (Q3)
### R-7 Method (Hyndman & Fan)
There are 9 different quantile calculation methods. Aprender uses **R-7**, the default in R, NumPy, and Pandas, which provides smooth interpolation.
**Algorithm**:
1. Sort the data (or use QuickSelect for single quantile)
2. Compute position: `h = (n - 1) * q`
3. If h is integer: return `data[h]`
4. Otherwise: linear interpolation between `data[floor(h)]` and `data[ceil(h)]`
**Interpolation formula**:
```text
Q(q) = data[h_floor] + (h - h_floor) * (data[h_ceil] - data[h_floor])
```
### Implementation
```rust,ignore
use aprender::stats::DescriptiveStats;
use trueno::Vector;
let data = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0, 5.0]);
let stats = DescriptiveStats::new(&data);
let median = stats.quantile(0.5).unwrap();
assert!((median - 3.0).abs() < 1e-6);
let q25 = stats.quantile(0.25).unwrap();
let q75 = stats.quantile(0.75).unwrap();
println!("IQR: {:.2}", q75 - q25);
```
### QuickSelect Optimization
Naive approach: Sort the entire array (O(n log n))
**QuickSelect** (Floyd-Rivest SELECT algorithm):
- Average case: O(n)
- Worst case: O(n²) (rare with good pivot selection)
- **10-100x faster for single quantiles on large datasets**
Rust's `select_nth_unstable` uses Hoare's selection algorithm with median-of-medians pivot selection.
```rust,ignore
// Inside quantile() implementation
let mut working_copy = self.data.as_slice().to_vec();
working_copy.select_nth_unstable_by(h_floor, |a, b| {
a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal)
});
let value = working_copy[h_floor];
```
### Time Complexity
| Single quantile | O(n log n) | O(n) average |
| Multiple quantiles | O(n log n) | O(n log n) (reuse sorted) |
**Best practice**: For 3+ quantiles, sort once and reuse:
```rust,ignore
let percentiles = stats.percentiles(&[25.0, 50.0, 75.0]).unwrap();
```
## Five-Number Summary
### Definition
The five-number summary provides a robust description of data distribution:
1. **Minimum**: Smallest value
2. **Q1 (25th percentile)**: Lower quartile
3. **Median (50th percentile)**: Middle value
4. **Q3 (75th percentile)**: Upper quartile
5. **Maximum**: Largest value
**Interquartile Range (IQR)**:
```text
IQR = Q3 - Q1
```
The IQR measures the spread of the middle 50% of data, resistant to outliers.
### Outlier Detection
**1.5 × IQR Rule** (Tukey's fences):
```text
Lower fence = Q1 - 1.5 * IQR
Upper fence = Q3 + 1.5 * IQR
```
Values outside these fences are potential outliers.
**3 × IQR Rule** (extreme outliers):
```text
Extreme lower = Q1 - 3 * IQR
Extreme upper = Q3 + 3 * IQR
```
### Implementation
```rust,ignore
use aprender::stats::DescriptiveStats;
use trueno::Vector;
let data = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0, 5.0, 100.0]); // 100 is outlier
let stats = DescriptiveStats::new(&data);
let summary = stats.five_number_summary().unwrap();
println!("Min: {:.1}", summary.min);
println!("Q1: {:.1}", summary.q1);
println!("Median: {:.1}", summary.median);
println!("Q3: {:.1}", summary.q3);
println!("Max: {:.1}", summary.max);
let iqr = stats.iqr().unwrap();
let lower_fence = summary.q1 - 1.5 * iqr;
let upper_fence = summary.q3 + 1.5 * iqr;
// 100.0 > upper_fence → outlier detected
```
### Applications
- **Exploratory Data Analysis**: Quick distribution overview
- **Quality Control**: Detect defects in manufacturing
- **Anomaly Detection**: Find unusual values in sensor data
- **Data Validation**: Identify data entry errors
## Histogram Binning Methods
### Overview
Histograms visualize data distribution by grouping values into bins. Choosing the right number of bins is critical:
- **Too few bins**: Over-smoothing, miss important features
- **Too many bins**: Noise dominates, hard to interpret
### Freedman-Diaconis Rule
**Formula**:
```text
bin_width = 2 * IQR * n^(-1/3)
n_bins = ceil((max - min) / bin_width)
```
**Characteristics**:
- **Outlier-resistant**: Uses IQR instead of standard deviation
- **Adaptive**: Adjusts to data spread
- **Best for**: Skewed distributions, data with outliers
**Time complexity**: O(n log n) for full sort (or O(n) with QuickSelect for Q1/Q3)
### Sturges' Rule
**Formula**:
```text
n_bins = ceil(log2(n)) + 1
```
**Characteristics**:
- **Fast**: O(1) computation
- **Simple**: Only depends on sample size
- **Best for**: Normal distributions, quick exploration
- **Warning**: Underestimates bins for non-normal data
**Example**: 1000 samples → 11 bins, 1M samples → 21 bins
### Scott's Rule
**Formula**:
```text
bin_width = 3.5 * σ * n^(-1/3)
n_bins = ceil((max - min) / bin_width)
```
where σ is standard deviation.
**Characteristics**:
- **Statistically optimal**: Minimizes integrated mean squared error (IMSE)
- **Sensitive to outliers**: Uses standard deviation
- **Best for**: Normal or near-normal distributions
**Time complexity**: O(n) for mean and stddev
### Square Root Rule
**Formula**:
```text
n_bins = ceil(sqrt(n))
```
**Characteristics**:
- **Very fast**: O(1) computation
- **Simple heuristic**: No statistical basis
- **Best for**: Quick exploration, initial EDA
**Example**: 100 samples → 10 bins, 10K samples → 100 bins
### Bayesian Blocks (Placeholder)
**Status**: Future implementation (O(n²) dynamic programming)
**Characteristics**:
- **Adaptive**: Finds optimal change points
- **Non-uniform**: Bins can have different widths
- **Best for**: Time series, event data with varying density
Currently falls back to Freedman-Diaconis.
### Comparison Table
| Freedman-Diaconis | O(n log n) | ✅ Yes (uses IQR) | Skewed data, outliers |
| Sturges | O(1) | ❌ No | Normal distributions |
| Scott | O(n) | ❌ No (uses σ) | Near-normal data |
| Square Root | O(1) | ❌ No | Quick exploration |
| Bayesian Blocks | O(n²) | ✅ Yes | Time series, events |
### Implementation
```rust,ignore
use aprender::stats::{BinMethod, DescriptiveStats};
use trueno::Vector;
let data = Vector::from_slice(&[/* your data */]);
let stats = DescriptiveStats::new(&data);
// Use Freedman-Diaconis for outlier-resistant binning
let hist = stats.histogram_method(BinMethod::FreedmanDiaconis).unwrap();
println!("Bins: {} bins created", hist.bins.len());
for (i, (&lower, &count)) in hist.bins.iter().zip(hist.counts.iter()).enumerate() {
let upper = if i < hist.bins.len() - 1 { hist.bins[i + 1] } else { data.max().unwrap() };
println!("[{:.1} - {:.1}): {} samples", lower, upper, count);
}
```
### Density vs Count
Histograms can show counts or probability density:
**Counts**: Number of samples in each bin (default)
**Density**: Normalized so area = 1
```text
density[i] = count[i] / (n * bin_width[i])
```
Density allows comparison across different sample sizes.
## Performance Characteristics
### Quantile Computation (1M samples)
| Full sort | 45 ms | O(n log n), reusable for multiple quantiles |
| QuickSelect (single) | 0.8 ms | O(n) average, 56x faster |
| QuickSelect (5 quantiles) | 4 ms | Still 11x faster (partially sorted) |
**Recommendation**: Use QuickSelect for 1-2 quantiles, full sort for 3+.
### Histogram Generation (1M samples)
| Freedman-Diaconis | 52 ms | Includes IQR computation |
| Sturges | 8 ms | Just sorting + binning |
| Scott | 10 ms | Includes stddev computation |
| Square Root | 8 ms | Just sorting + binning |
### Memory Usage
All methods operate on a single copy of the data (O(n) memory):
- Quantiles: O(n) working copy for partial sort
- Histograms: O(n) for sorting + O(k) for bins (k ≪ n)
## Real-World Applications
### Exploratory Data Analysis (EDA)
**Problem**: Understand data distribution before modeling.
**Approach**:
1. Compute five-number summary
2. Identify outliers with 1.5 × IQR rule
3. Generate histogram with Freedman-Diaconis
4. Check for skewness, multimodality
**Example**: Analyzing house prices, salary distributions.
### Quality Control (Manufacturing)
**Problem**: Detect defective parts in production.
**Approach**:
1. Measure dimensions of parts
2. Compute Q1, Q3, IQR
3. Set control limits at Q1 - 3×IQR and Q3 + 3×IQR
4. Flag parts outside limits
**Example**: Bolt diameter tolerance, circuit board resistance.
### Anomaly Detection (Security)
**Problem**: Find unusual login times or network traffic.
**Approach**:
1. Compute median and IQR of normal behavior
2. New observation outside Q3 + 1.5×IQR → alert
3. Histogram shows temporal patterns (e.g., night-time access)
**Example**: Fraud detection, intrusion detection systems.
### A/B Testing
**Problem**: Compare two groups (treatment vs control).
**Approach**:
1. Compute five-number summary for both groups
2. Compare medians (more robust than means)
3. Check if distributions overlap using IQR
4. Histogram shows distribution differences
**Example**: Website conversion rates, drug trial outcomes.
## Toyota Way Principles
### Muda (Waste Elimination)
**QuickSelect for single quantiles**: Avoids O(n log n) full sort when only one quantile is needed.
**Benchmark** (1M samples):
- Full sort: 45 ms
- QuickSelect: 0.8 ms
- **56x speedup**
### Poka-Yoke (Error Prevention)
**Outlier-resistant methods**:
- Freedman-Diaconis uses IQR (robust to outliers)
- Median preferred over mean (robust to skew)
**Example**: Dataset with outlier (100x normal values):
- Mean: biased by outlier
- Median: unaffected
- IQR-based bins: capture true distribution
### Heijunka (Load Balancing)
**Adaptive binning**: Methods like Freedman-Diaconis adjust bin count to data characteristics, avoiding over/under-binning.
## Best Practices
### Quantile Computation
```rust,ignore
// ✅ Good: Single quantile with QuickSelect
let median = stats.quantile(0.5).unwrap();
// ✅ Good: Multiple quantiles with single sort
let percentiles = stats.percentiles(&[25.0, 50.0, 75.0, 90.0]).unwrap();
// ❌ Avoid: Multiple calls to quantile() (sorts each time)
let q1 = stats.quantile(0.25).unwrap();
let q2 = stats.quantile(0.50).unwrap(); // Sorts again!
let q3 = stats.quantile(0.75).unwrap(); // Sorts again!
```
### Outlier Detection
```rust,ignore
// ✅ Conservative: 1.5 × IQR (flags ~0.7% of normal data)
let lower = q1 - 1.5 * iqr;
let upper = q3 + 1.5 * iqr;
// ✅ Strict: 3 × IQR (flags ~0.003% of normal data)
let lower_extreme = q1 - 3.0 * iqr;
let upper_extreme = q3 + 3.0 * iqr;
```
### Histogram Method Selection
```rust,ignore
// Outliers present or skewed data
let hist = stats.histogram_method(BinMethod::FreedmanDiaconis).unwrap();
// Normal distribution, quick exploration
let hist = stats.histogram_method(BinMethod::Sturges).unwrap();
// Need statistical optimality (IMSE)
let hist = stats.histogram_method(BinMethod::Scott).unwrap();
```
## Common Pitfalls
### Using Mean Instead of Median
**Problem**: Mean is sensitive to outliers.
**Example**: Salaries [30K, 35K, 40K, 45K, 500K]
- Mean: 130K (misleading, inflated by 500K)
- Median: 40K (robust, represents typical salary)
### Too Few Histogram Bins
**Problem**: Over-smoothing hides important features.
**Solution**: Use Freedman-Diaconis or Scott for adaptive binning.
### Ignoring IQR for Spread
**Problem**: Standard deviation inflated by outliers.
**Example**: Response times [10ms, 12ms, 15ms, 20ms, 5000ms]
- Stddev: ~1000ms (dominated by outlier)
- IQR: 8ms (captures typical variation)
## Further Reading
**Quantile Methods**:
- Hyndman, R.J., Fan, Y. (1996). "Sample Quantiles in Statistical Packages"
- Floyd, R.W., Rivest, R.L. (1975). "Algorithm 489: The Algorithm SELECT"
**Histogram Binning**:
- Freedman, D., Diaconis, P. (1981). "On the Histogram as a Density Estimator"
- Sturges, H.A. (1926). "The Choice of a Class Interval"
- Scott, D.W. (1979). "On Optimal and Data-Based Histograms"
**Outlier Detection**:
- Tukey, J.W. (1977). "Exploratory Data Analysis"
## Summary
- **Quantiles**: R-7 method with QuickSelect optimization (10-100x faster)
- **Five-number summary**: Robust description using min, Q1, median, Q3, max
- **IQR**: Outlier-resistant measure of spread (Q3 - Q1)
- **Histograms**: Four binning methods (Freedman-Diaconis recommended for outliers)
- **Outlier detection**: 1.5 × IQR rule (conservative) or 3 × IQR (strict)
- **Toyota Way**: Eliminates waste (QuickSelect), prevents errors (IQR), adapts to data