oaxaca_blinder 0.2.0

A Rust library for performing Oaxaca-Blinder decomposition on Polars DataFrames, with support for categorical variables and bootstrapped standard errors.
Documentation

Oaxaca-Blinder Decomposition in Rust

crates.io docs.rs

A high-performance Rust library for performing Oaxaca-Blinder decomposition, designed for economists, data scientists, and HR analysts. It decomposes the gap in an outcome variable (like wage) between two groups into "explained" (characteristics) and "unexplained" (discrimination/coefficients) components.

🚀 Feature Support

Feature Support
OLS Mean Decomposition
Quantile Decomposition (Machado-Mata)
Quantile Decomposition (RIF Regression)
Categorical Normalization (Yun)
Bootstrapped Standard Errors
Budget Optimization Solver
JMP Decomposition (Time Series)
DFL Reweighting (Counterfactuals)
Sample Weights

🖥️ Command Line Interface (CLI)

Don't want to write Rust code? You can use the oaxaca-cli tool directly from your terminal to analyze CSV files.

Installation

cargo install oaxaca_blinder --features cli

Usage

Basic Decomposition:

oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
    --predictors education experience --categorical sector

Using R-style Formula:

oaxaca-cli --data wage.csv --group gender --reference F \
    --formula "wage ~ education + experience + C(sector)"

With Sample Weights (WLS):

oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
    --predictors education experience \
    --weights sampling_weight

With Heckman Correction (Selection Bias):

oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
    --predictors education experience \
    --selection-outcome employed \
    --selection-predictors education experience age marital_status

Export Results:

oaxaca-cli --data wage.csv ... --output-json results.json --output-markdown report.md

Supports both --analysis-type mean (default) and --analysis-type quantile.


⚡ Quick Start (Rust)

Add to Cargo.toml:

[dependencies]
oaxaca_blinder = "0.1.0"
polars = { version = "0.38", features = ["lazy", "csv"] }

Basic OLS Decomposition

use polars::prelude::*;
use oaxaca_blinder::{OaxacaBuilder, ReferenceCoefficients};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let df = df!(
        "wage" => &[25.0, 30.0, 35.0, 40.0, 45.0, 20.0, 22.0, 28.0, 32.0, 38.0],
        "education" => &[16.0, 18.0, 14.0, 20.0, 16.0, 12.0, 14.0, 16.0, 12.0, 18.0],
        "gender" => &["M", "M", "M", "M", "M", "F", "F", "F", "F", "F"]
    )?;

    let results = OaxacaBuilder::new(df, "wage", "gender", "F")
        .predictors(&["education"])
        .reference_coefficients(ReferenceCoefficients::Pooled)
        .run()?;

    results.summary();
    Ok(())
}

💰 Policy Simulation: Budget Optimization

"The Cheapest Fix"

This unique feature is designed for HR analytics. It answers: "Given a limited budget, how can we reduce the pay gap as much as possible?"

It identifies individuals in the disadvantaged group with the largest negative unexplained residuals (i.e., the most "underpaid" relative to their qualifications) and calculates the optimal raises.

// Scenario: You have $200,000 to reduce the gap to 5%
let adjustments = results.optimize_budget(200_000.0, 0.05);

for adj in adjustments {
    println!("Give ${:.2} raise to employee #{}", adj.adjustment, adj.index);
}

📊 Quantile Decomposition Strategies

The library supports two robust methods for decomposing the wage gap across the distribution:

Method Best For... Builder
Machado-Mata (Simulation) Constructing full counterfactual distributions and "glass ceiling" analysis. QuantileDecompositionBuilder
RIF Regression (Analytical) Fast, detailed decomposition of specific quantiles (e.g., "Why is the 90th percentile gap so large?"). OaxacaBuilder::decompose_quantile(0.9)

Example: RIF Decomposition

// Fast decomposition of the 90th percentile gap
let results = OaxacaBuilder::new(df, "wage", "gender", "F")
    .predictors(&["education", "experience"])
    .decompose_quantile(0.9)?;

📈 Visualizing DFL Reweighting

DiNardo-Fortin-Lemieux (DFL) reweighting is a non-parametric alternative that allows you to visualize what the wage distribution of Group B would look like if they had the characteristics of Group A.

The run_dfl function returns density vectors perfect for plotting in Python (matplotlib) or Rust (plotters).

use oaxaca_blinder::run_dfl;

let dfl = run_dfl(&df, "wage", "gender", "F", &["education", "experience"])?;

// dfl.grid                   <- X-axis (Wage levels)
// dfl.density_a              <- Actual Group A Density
// dfl.density_b              <- Actual Group B Density
// dfl.density_b_counterfactual <- "What B would earn with A's characteristics"

Tip: Plot density_b vs density_b_counterfactual to visualize the "explained" gap.


⏱️ Benchmarks

Designed for performance, utilizing Rust's speed and parallelization (Rayon) for bootstrapping.

Performance vs Python (statsmodels) Dataset: 100k rows, 10 predictors

Method Rust (oaxaca_blinder) Python (statsmodels)
Raw Decomposition 0.02s 🚀 0.26s
With 100 Bootstrap Reps 0.65s 🚀 N/A (No built-in bootstrap)

Rust's raw decomposition is over 10x faster than statsmodels.

Performance Comparison (Real-World Data)

Rust Implementation: 4.32 seconds R Implementation: ~1.99 minutes (119.4 seconds) Speedup: ~27.6x faster

Results Validation

The results are nearly identical, confirming the correctness of the Rust implementation:

Metric Rust Result R Result Difference
Total Gap 3.2101 3.210084 ~0.000016
Explained -0.8097 -0.8097 Exact match (4 decimals)
Unexplained 4.0198 4.0198 Exact match (4 decimals)

Detailed Components (Selected):

Variable Component Rust Contribution R Contribution
Education Explained -0.7852 -0.7852
Experience Unexplained 2.0670 2.0670
Intercept Unexplained 2.1405 2.1405

The minor differences in standard errors (e.g., Rust SE for unexplained is 0.0314 vs R SE 0.0311) are expected due to the random nature of bootstrapping.


📚 Theory & Methodology

The decomposition depends on the choice of the non-discriminatory coefficient vector $\beta^*$. The general decomposition equation is:

This library supports:

  • Group A / Group B: Uses $\beta_A$ or $\beta_B$ as the reference.
  • Pooled (Neumark): Uses $\beta^*$ from a pooled regression of both groups.
  • Weighted (Cotton): Uses a weighted average: $\beta^* = w\beta_A + (1-w)\beta_B$.

Standard detailed decomposition is sensitive to the choice of the omitted base category for dummy variables. This library implements Yun's normalization, which transforms coefficients to be invariant to the base category choice:

Where $\bar{\beta}_k$ is the mean of the coefficients for the categorical variable $k$. This ensures robust detailed results.

The Juhn-Murphy-Pierce (JMP) method decomposes the change in the gap over time (or between distributions) into three components:

  1. Quantity Effect: Changes in observable characteristics ($X$).
  2. Price Effect: Changes in returns to characteristics ($\beta$).
  3. Gap Effect: Changes in the distribution of unobserved residuals.

License

This project is licensed under the MIT License.