greeners 1.0.0 - Docs.rs

# Formula API - R/Python-style Syntax for Greeners

Greeners now supports model specification using R/Python-style formulas (similar to Python's `statsmodels` and R's `lm()`).

## Basic Syntax

```rust
use greeners::{OLS, DataFrame, Formula, CovarianceType};
use ndarray::Array1;
use std::collections::HashMap;

// Create data
let mut data = HashMap::new();
data.insert("y".to_string(), Array1::from(vec![...]));
data.insert("x1".to_string(), Array1::from(vec![...]));
data.insert("x2".to_string(), Array1::from(vec![...]));

let df = DataFrame::new(data)?;

// Specify formula (R/Python syntax)
let formula = Formula::parse("y ~ x1 + x2")?;

// Estimate model
let result = OLS::from_formula(&formula, &df, CovarianceType::HC1)?;
```

## Python (statsmodels) Equivalents

### Basic OLS

**Python:**
```python
import statsmodels.formula.api as smf
model = smf.ols('y ~ x1 + x2', data=df).fit()
```

**Rust (Greeners):**
```rust
let formula = Formula::parse("y ~ x1 + x2")?;
let result = OLS::from_formula(&formula, &df, CovarianceType::NonRobust)?;
```

### OLS with Robust Standard Errors (HC1)

**Python:**
```python
model = smf.ols('y ~ x1 + x2', data=df).fit(cov_type='HC1')
```

**Rust (Greeners):**
```rust
let formula = Formula::parse("y ~ x1 + x2")?;
let result = OLS::from_formula(&formula, &df, CovarianceType::HC1)?;
```

### WLS (Weighted Least Squares)

**Python:**
```python
model = smf.wls('fte ~ treated + t + effect', data=df, weights=weights).fit(cov_type='HC1')
```

**Rust (Greeners):**
```rust
let formula = Formula::parse("fte ~ treated + t + effect")?;
let result = FGLS::wls_from_formula(&formula, &df, &weights)?;
```

### No Intercept

**Python:**
```python
model = smf.ols('y ~ x1 + x2 - 1', data=df).fit()
# Or: model = smf.ols('y ~ 0 + x1 + x2', data=df).fit()
```

**Rust (Greeners):**
```rust
let formula = Formula::parse("y ~ x1 + x2 - 1")?;
// Or: let formula = Formula::parse("y ~ 0 + x1 + x2")?;
let result = OLS::from_formula(&formula, &df, CovarianceType::NonRobust)?;
```

## Formula Syntax

### General Format
```
dependent ~ independent1 + independent2 + ... + independentN
```

### Intercept Control

- **With intercept (default):**
  - `y ~ x1 + x2`
  - `y ~ 1 + x1 + x2` (explicit)

- **Without intercept:**
  - `y ~ x1 + x2 - 1`
  - `y ~ 0 + x1 + x2`

- **Intercept only:**
  - `y ~ 1`

## DataFrame Structure

### Loading from CSV (Recommended!)

The easiest way to work with data is to load it from a CSV file with headers:

```rust
use greeners::DataFrame;

// Load data from CSV file (just like pandas.read_csv!)
let df = DataFrame::from_csv("data.csv")?;

println!("Loaded {} rows and {} columns", df.n_rows(), df.n_cols());
println!("Column names: {:?}", df.column_names());

// Access individual columns
let y_column = df.get("y")?;
```

### Creating Manually

You can also create DataFrames manually:

```rust
use greeners::DataFrame;
use ndarray::Array1;
use std::collections::HashMap;

let mut data = HashMap::new();
data.insert("column1".to_string(), Array1::from(vec![1.0, 2.0, 3.0]));
data.insert("column2".to_string(), Array1::from(vec![4.0, 5.0, 6.0]));

let df = DataFrame::new(data)?;

// Access columns
let col = df.get("column1")?;

// Information
println!("Rows: {}", df.n_rows());
println!("Columns: {}", df.n_cols());
println!("Names: {:?}", df.column_names());
```

## Available Methods

### OLS
- `OLS::from_formula(formula, data, cov_type)` - Estimate OLS using formula

### FGLS/WLS
- `FGLS::wls_from_formula(formula, data, weights)` - WLS using formula
- `FGLS::cochrane_orcutt_from_formula(formula, data)` - Cochrane-Orcutt AR(1) using formula

### Difference-in-Differences (DiD)
- `DiffInDiff::from_formula(formula, data, treated_var, post_var, cov_type)` - DiD using formula

### IV/2SLS
- `IV::from_formula(endog_formula, instrument_formula, data, cov_type)` - IV/2SLS using formulas

### Discrete Choice Models
- `Logit::from_formula(formula, data)` - Logit using formula
- `Probit::from_formula(formula, data)` - Probit using formula

### Quantile Regression
- `QuantileReg::from_formula(formula, data, tau, n_boot)` - Quantile regression using formula

### Panel Data
- `FixedEffects::from_formula(formula, data, entity_ids)` - Fixed Effects using formula
- `RandomEffects::from_formula(formula, data, entity_ids)` - Random Effects using formula
- `BetweenEstimator::from_formula(formula, data, entity_ids)` - Between Estimator using formula

## Complete Examples

See examples in the `examples/` folder:
- **`csv_formula_example.rs`** - Load CSV files and run regressions (RECOMMENDED!)
- `formula_example.rs` - General formula API demonstration
- `did_formula_example.rs` - Difference-in-Differences using formulas
- `quickstart_formula.rs` - Quick start example from README

Run with:
```bash
cargo run --example csv_formula_example
cargo run --example formula_example
cargo run --example did_formula_example
cargo run --example quickstart_formula
```

### CSV File Format

Your CSV file should have headers in the first row:

```csv
y,x1,x2,x3
10.5,1.2,2.3,0.5
12.3,2.1,3.1,0.7
15.2,3.5,4.2,0.9
...
```

Then load and use with formulas:

```rust
let df = DataFrame::from_csv("data.csv")?;
let formula = Formula::parse("y ~ x1 + x2")?;
let result = OLS::from_formula(&formula, &df, CovarianceType::HC1)?;
```

## Supported Covariance Types

```rust
pub enum CovarianceType {
    /// Standard errors (homoscedasticity assumed)
    NonRobust,

    /// White's robust errors (HC1) - heteroscedasticity only
    HC1,

    /// Newey-West (HAC) - heteroscedasticity + autocorrelation
    /// The usize parameter is the number of lags
    NeweyWest(usize),

    /// Clustered standard errors - for panel/grouped data
    /// The Vec<usize> contains cluster IDs for each observation
    /// Critical for panel data and hierarchical structures
    Clustered(Vec<usize>),
}
```

### Using Clustered Standard Errors (NEW in v0.2.0)

Clustered standard errors account for within-cluster correlation:

```rust
use greeners::{OLS, DataFrame, Formula, CovarianceType};

// Panel data example: 3 firms × 5 time periods = 15 observations
let cluster_ids = vec![
    0,0,0,0,0,  // Firm 0 (5 time periods)
    1,1,1,1,1,  // Firm 1 (5 time periods)
    2,2,2,2,2,  // Firm 2 (5 time periods)
];

let formula = Formula::parse("profit ~ advertising + rd_spending")?;
let result = OLS::from_formula(&formula, &df, CovarianceType::Clustered(cluster_ids))?;

println!("{}", result);  // Shows "Clustered (3 clusters)" in output
```

**When to use clustered SE:**
- Panel data (repeated observations per entity: firms, individuals, countries)
- Hierarchical data (students in schools, patients in hospitals)
- Experiments with cluster randomization (villages, classrooms)
- Geographic clustering (cities, regions, countries)
- Any situation where observations within groups are likely correlated

**Why it matters:**
- Standard OLS SE assume independence → underestimated standard errors
- Clustered SE correct for within-cluster correlation → proper inference
- Ignoring clustering leads to over-rejection of null hypotheses (false discoveries)

## Current Limitations

1. **Interactions:** Not yet supported (e.g., `y ~ x1 * x2`)
2. **Functions:** Not yet supported (e.g., `y ~ log(x1)`)
3. **Categorical factors:** Not yet supported (e.g., `y ~ C(category)`)
4. **I():** Not yet supported (e.g., `y ~ I(x1**2)`)

These features may be added in future versions.

## Advantages

1. **Familiar syntax:** If you come from R or Python, the syntax is immediately recognizable
2. **Less verbose:** No need to manually construct matrices
3. **Type-safe:** Errors caught at compile time whenever possible
4. **Performance:** Native Rust speed with memory safety
5. **Integration:** Works with all existing estimators

## Performance

The formula API adds minimal overhead:
1. Formula parsing: O(k) where k = number of variables
2. Design matrix construction: O(n*k) where n = observations

For repeated analyses with the same structure, you can parse the formula once and reuse it.