# Oaxaca-Blinder Decomposition in Rust
[](https://crates.io/crates/oaxaca_blinder)
[](https://docs.rs/oaxaca_blinder)
A high-performance Rust library for performing Oaxaca-Blinder decomposition, designed for economists, data scientists, and HR analysts. It decomposes the gap in an outcome variable (like wage) between two groups into "explained" (characteristics) and "unexplained" (discrimination/coefficients) components.
## 🚀 Feature Support
| **OLS Mean Decomposition** | ✅ |
| **Quantile Decomposition (Machado-Mata)** | ✅ |
| **Quantile Decomposition (RIF Regression)** | ✅ |
| **Categorical Normalization (Yun)** | ✅ |
| **Bootstrapped Standard Errors** | ✅ |
| **Budget Optimization Solver** | ✅ |
| **JMP Decomposition (Time Series)** | ✅ |
| **DFL Reweighting (Counterfactuals)** | ✅ |
| **Sample Weights** | ❌ |
---
## 🖥️ Command Line Interface (CLI)
Don't want to write Rust code? You can use the `oaxaca-cli` tool directly from your terminal to analyze CSV files.
### Installation
```bash
cargo install oaxaca_blinder --features cli
```
### Usage
**Basic Decomposition:**
```bash
oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
--predictors education experience --categorical sector
```
**Using R-style Formula:**
```bash
oaxaca-cli --data wage.csv --group gender --reference F \
--formula "wage ~ education + experience + C(sector)"
```
**With Sample Weights (WLS):**
```bash
oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
--predictors education experience \
--weights sampling_weight
```
**With Heckman Correction (Selection Bias):**
```bash
oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
--predictors education experience \
--selection-outcome employed \
--selection-predictors education experience age marital_status
```
**Export Results:**
```bash
oaxaca-cli --data wage.csv ... --output-json results.json --output-markdown report.md
```
Supports both `--analysis-type mean` (default) and `--analysis-type quantile`.
---
## ⚡ Quick Start (Rust)
Add to `Cargo.toml`:
```toml
[dependencies]
oaxaca_blinder = "0.1.0"
polars = { version = "0.38", features = ["lazy", "csv"] }
```
### Basic OLS Decomposition
```rust
use polars::prelude::*;
use oaxaca_blinder::{OaxacaBuilder, ReferenceCoefficients};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let df = df!(
"wage" => &[25.0, 30.0, 35.0, 40.0, 45.0, 20.0, 22.0, 28.0, 32.0, 38.0],
"education" => &[16.0, 18.0, 14.0, 20.0, 16.0, 12.0, 14.0, 16.0, 12.0, 18.0],
"gender" => &["M", "M", "M", "M", "M", "F", "F", "F", "F", "F"]
)?;
let results = OaxacaBuilder::new(df, "wage", "gender", "F")
.predictors(&["education"])
.reference_coefficients(ReferenceCoefficients::Pooled)
.run()?;
results.summary();
Ok(())
}
```
---
## 💰 Policy Simulation: Budget Optimization
**"The Cheapest Fix"**
This unique feature is designed for HR analytics. It answers: *"Given a limited budget, how can we reduce the pay gap as much as possible?"*
It identifies individuals in the disadvantaged group with the largest negative unexplained residuals (i.e., the most "underpaid" relative to their qualifications) and calculates the optimal raises.
```rust
// Scenario: You have $200,000 to reduce the gap to 5%
let adjustments = results.optimize_budget(200_000.0, 0.05);
for adj in adjustments {
println!("Give ${:.2} raise to employee #{}", adj.adjustment, adj.index);
}
```
---
## 📊 Quantile Decomposition Strategies
The library supports two robust methods for decomposing the wage gap across the distribution:
| **Machado-Mata (Simulation)** | Constructing full counterfactual distributions and "glass ceiling" analysis. | `QuantileDecompositionBuilder` |
| **RIF Regression (Analytical)** | Fast, detailed decomposition of specific quantiles (e.g., "Why is the 90th percentile gap so large?"). | `OaxacaBuilder::decompose_quantile(0.9)` |
### Example: RIF Decomposition
```rust
// Fast decomposition of the 90th percentile gap
let results = OaxacaBuilder::new(df, "wage", "gender", "F")
.predictors(&["education", "experience"])
.decompose_quantile(0.9)?;
```
---
## 📈 Visualizing DFL Reweighting
**DiNardo-Fortin-Lemieux (DFL)** reweighting is a non-parametric alternative that allows you to visualize what the wage distribution of Group B would look like if they had the characteristics of Group A.
The `run_dfl` function returns density vectors perfect for plotting in Python (matplotlib) or Rust (plotters).
```rust
use oaxaca_blinder::run_dfl;
let dfl = run_dfl(&df, "wage", "gender", "F", &["education", "experience"])?;
// dfl.grid <- X-axis (Wage levels)
// dfl.density_a <- Actual Group A Density
// dfl.density_b <- Actual Group B Density
// dfl.density_b_counterfactual <- "What B would earn with A's characteristics"
```
*Tip: Plot `density_b` vs `density_b_counterfactual` to visualize the "explained" gap.*
---
## ⏱️ Benchmarks
Designed for performance, utilizing Rust's speed and parallelization (Rayon) for bootstrapping.
**Performance vs Python (`statsmodels`)**
*Dataset: 100k rows, 10 predictors*
| **Raw Decomposition** | **0.02s** 🚀 | 0.26s |
| **With 100 Bootstrap Reps** | **0.65s** 🚀 | N/A (No built-in bootstrap) |
*Rust's raw decomposition is over **10x faster** than statsmodels.*
### Performance Comparison (Real-World Data)
**Rust Implementation**: 4.32 seconds
**R Implementation**: ~1.99 minutes (119.4 seconds)
**Speedup**: ~27.6x faster
### Results Validation
The results are nearly identical, confirming the correctness of the Rust implementation:
| **Total Gap** | 3.2101 | 3.210084 | ~0.000016 |
| **Explained** | -0.8097 | -0.8097 | Exact match (4 decimals) |
| **Unexplained** | 4.0198 | 4.0198 | Exact match (4 decimals) |
**Detailed Components (Selected):**
| **Education** | Explained | -0.7852 | -0.7852 |
| **Experience** | Unexplained | 2.0670 | 2.0670 |
| **Intercept** | Unexplained | 2.1405 | 2.1405 |
*The minor differences in standard errors (e.g., Rust SE for unexplained is 0.0314 vs R SE 0.0311) are expected due to the random nature of bootstrapping.*
---
## 📚 Theory & Methodology
<details>
<summary><strong>Deep Dive: The Indexing Problem & Reference Groups</strong></summary>
The decomposition depends on the choice of the non-discriminatory coefficient vector $\beta^*$. The general decomposition equation is:
<div align="center">
<img src="https://latex.codecogs.com/svg.image?\Delta\bar{Y}=\underbrace{(\bar{X}_A-\bar{X}_B)'\beta^*}_{\text{Explained}}+\underbrace{\bar{X}_A'(\beta_A-\beta^*)+\bar{X}_B'(\beta^*-\beta_B)}_{\text{Unexplained}}" alt="Oaxaca Decomposition Equation" />
</div>
This library supports:
- **Group A / Group B**: Uses $\beta_A$ or $\beta_B$ as the reference.
- **Pooled (Neumark)**: Uses $\beta^*$ from a pooled regression of both groups.
- **Weighted (Cotton)**: Uses a weighted average: $\beta^* = w\beta_A + (1-w)\beta_B$.
</details>
<details>
<summary><strong>Deep Dive: Categorical Variables (Yun Normalization)</strong></summary>
Standard detailed decomposition is sensitive to the choice of the omitted base category for dummy variables. This library implements **Yun's normalization**, which transforms coefficients to be invariant to the base category choice:
<div align="center">
<img src="https://latex.codecogs.com/svg.image?\tilde{\beta}_{k}=\beta_{k}+\bar{\beta}_k" alt="Yun Normalization Equation" />
</div>
Where $\bar{\beta}_k$ is the mean of the coefficients for the categorical variable $k$. This ensures robust detailed results.
</details>
<details>
<summary><strong>Deep Dive: JMP Decomposition</strong></summary>
The **Juhn-Murphy-Pierce (JMP)** method decomposes the *change* in the gap over time (or between distributions) into three components:
<div align="center">
<img src="https://latex.codecogs.com/svg.image?\Delta\bar{Y}=\underbrace{\Delta&space;X\beta}_{\text{Quantity&space;Effect}}+\underbrace{X\Delta\beta}_{\text{Price&space;Effect}}+\underbrace{\Delta\epsilon}_{\text{Gap&space;Effect}}" alt="JMP Decomposition Equation" />
</div>
1. **Quantity Effect**: Changes in observable characteristics ($X$).
2. **Price Effect**: Changes in returns to characteristics ($\beta$).
3. **Gap Effect**: Changes in the distribution of unobserved residuals.
</details>
---
## License
This project is licensed under the MIT License.