Crate fiasto

§Fiasto: High-Performance Statistical Formula Parser

Pronounced like fiasco, but with a t instead of a c

§(F)ormulas (I)n (AST) (O)ut

A language-agnostic, modern parser and lexer for Wilkinson’s formulas.

§Motivation

Formula parsing and materialization is normally done in a single library. Python, for example, has patsy/formulaic/formulae which all do parsing & materialization. R’s model.matrix also handles formula parsing and design matrix creation.

There is nothing wrong with this coupling, but I wanted to try decoupling parsing from materialization. I thought this would allow a focused library that could be used from multiple languages or dataframe libraries. This package has a clear purpose: to parse and/or lex formulas and return structured JSON metadata.

Note: Technically, an AST is not returned. A simplified, structured intermediate representation (IR) in the form of JSON is returned. This JSON IR ought to be easy for many language bindings to use.

§🎯 Simple API

The library exposes a clean, focused API:

parse_formula() - Takes a Wilkinson’s formula string and returns structured JSON metadata
lex_formula() - Tokenizes a formula string and returns JSON describing each token

“Only two functions?! What kind of library is this?!”

An easy-to-maintain library with a small surface area. The best kind.

§Output Format

The parser returns a variable-centric JSON structure in which each variable is described by its roles, transformations, interactions, and random effects. This makes it easy to understand the complete model structure and generate appropriate design matrices. wayne is a Python package that can take this JSON and generate design matrices for use in statistical modeling.

§Features

Comprehensive Formula Support: Full R/Wilkinson notation including complex random effects
Variable-Centric Output: Variables are first-class citizens with detailed metadata
Advanced Random Effects: brms-style syntax with correlation control and grouping options
High Performance: Zero-copy processing and efficient tokenization
Pretty Error Messages: Colored, contextual error reporting with syntax highlighting
Robust Error Recovery: Graceful handling of malformed formulas with specific error types
Language Agnostic Output: JSON format for easy integration with various programming languages
Comprehensive Documentation: Detailed usage examples and grammar rules
Comprehensive Metadata: Variable roles, transformations, interactions, and relationships
Automatic Naming For Generated Columns: Consistent, descriptive names for transformed and interaction terms
Dual API: Both parsing and lexing functions for flexibility
Efficient tokenization: using one of the fastest lexer generators for Rust (logos crate)
Fast pattern matching: using match statements and enum-based token handling. Rust match statements are zero-cost abstractions.
Minimal string copying: with extensive use of string slices (&str) where possible
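
To make the last three points concrete, here is a minimal hand-written sketch of enum-based, zero-copy tokenization. It is not fiasto's lexer (which is generated with the logos crate); the `Token` enum and `tokenize` function are hypothetical names that cover only column names, `~`, and `+`.

```rust
/// Hypothetical token type for illustration only; fiasto's real token
/// set is generated with logos and has many more variants.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    /// Borrows the name from the input instead of allocating a String.
    ColumnName(&'a str),
    Tilde,
    Plus,
}

/// Tokenize a tiny subset of formula syntax without copying any text.
fn tokenize(input: &str) -> Result<Vec<Token<'_>>, String> {
    let mut tokens = Vec::new();
    let mut chars = input.char_indices().peekable();
    while let Some(&(start, c)) = chars.peek() {
        match c {
            c if c.is_whitespace() => {
                chars.next();
            }
            '~' => {
                chars.next();
                tokens.push(Token::Tilde);
            }
            '+' => {
                chars.next();
                tokens.push(Token::Plus);
            }
            c if c.is_alphabetic() || c == '_' => {
                // Advance to the end of the identifier, then slice the input.
                let mut end = start;
                while let Some(&(i, ch)) = chars.peek() {
                    if ch.is_alphanumeric() || ch == '_' {
                        end = i + ch.len_utf8();
                        chars.next();
                    } else {
                        break;
                    }
                }
                tokens.push(Token::ColumnName(&input[start..end]));
            }
            other => {
                return Err(format!("unexpected character {other:?} at byte {start}"));
            }
        }
    }
    Ok(tokens)
}
```

Because `ColumnName` holds a `&str` slice of the input, the returned tokens cannot outlive the formula string, and no per-token allocation is needed.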

§Use Cases:

Formula Validation: Check if formulas are valid against datasets before expensive computation
Cross-Platform Model Specs: Define models once, implement in multiple statistical frameworks
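
As a rough sketch of the formula-validation use case, the check below compares the variable names used in a formula with the column names of a dataset. `missing_columns` is a hypothetical helper, not part of fiasto; in practice the variable names could be read from the parser's JSON output, and here they are passed in as plain slices.

```rust
use std::collections::HashSet;

/// Hypothetical helper: return the formula variables that are absent
/// from the dataset, preserving the order in which they were given.
fn missing_columns<'a>(formula_vars: &[&'a str], dataset_cols: &[&str]) -> Vec<&'a str> {
    // A set makes each membership check O(1) on average.
    let available: HashSet<&str> = dataset_cols.iter().copied().collect();
    formula_vars
        .iter()
        .copied()
        .filter(|name| !available.contains(*name))
        .collect()
}
```

An empty result means every variable in the formula has a matching dataset column, so the model can be fitted without a late "column not found" failure.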

§Quick Start `parse_formula`

To parse a formula and get JSON metadata:

use fiasto::parse_formula;

// Parse a simple linear model
let result = parse_formula("y ~ x + z");
match result {
    Ok(metadata) => println!("{}", serde_json::to_string_pretty(&metadata).unwrap()),
    Err(e) => eprintln!("Error: {}", e),
}

§Intercept-Only, No-Intercept, and Multivariate Models

All model types are fully supported:

use fiasto::parse_formula;

// Parse an intercept-only model
let result = parse_formula("y ~ 1");
match result {
    Ok(metadata) => {
        // The metadata will include an "intercept" column
        // and has_intercept will be true
        println!("{}", serde_json::to_string_pretty(&metadata).unwrap());
    }
    Err(e) => eprintln!("Error: {}", e),
}

// Parse a no-intercept model
let result = parse_formula("y ~ 0");
match result {
    Ok(metadata) => {
        // The metadata will NOT include an "intercept" column
        // and has_intercept will be false
        println!("{}", serde_json::to_string_pretty(&metadata).unwrap());
    }
    Err(e) => eprintln!("Error: {}", e),
}

// Parse a multivariate model
let result = parse_formula("bind(y1, y2) ~ x + z");
match result {
    Ok(metadata) => {
        // The metadata will include both y1 and y2 as response variables
        // with ID 1, and x, z as predictors with IDs 2, 3
        println!("{}", serde_json::to_string_pretty(&metadata).unwrap());
    }
    Err(e) => eprintln!("Error: {}", e),
}

For the simple formula y ~ x + z from the Quick Start above, parse_formula prints a JSON object like:

{
  "all_generated_columns": [
    "y",
    "x",
    "z"
  ],
  "columns": {
    "x": {
      "generated_columns": [
        "x"
      ],
      "id": 2,
      "interactions": [],
      "random_effects": [],
      "roles": [
        "FixedEffect"
      ],
      "transformations": []
    },
    "y": {
      "generated_columns": [
        "y"
      ],
      "id": 1,
      "interactions": [],
      "random_effects": [],
      "roles": [
        "Response"
      ],
      "transformations": []
    },
    "z": {
      "generated_columns": [
        "z"
      ],
      "id": 3,
      "interactions": [],
      "random_effects": [],
      "roles": [
        "FixedEffect"
      ],
      "transformations": []
    }
  },
  "formula": "y ~ x + z",
  "metadata": {
    "family": null,
    "has_intercept": true,
    "has_uncorrelated_slopes_and_intercepts": false,
    "is_random_effects_model": false
  }
}
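
One way a consumer might use this variable-centric structure is to pick out the columns that play a given role. The sketch below mirrors only three fields of each "columns" entry in plain Rust types; `ColumnInfo`, `columns_with_role`, and `example_columns` are hypothetical names for illustration, and the map is built by hand rather than deserialized from the JSON.

```rust
use std::collections::BTreeMap;

/// Hypothetical mirror of one entry under "columns" in the JSON above.
/// Only a subset of the fields is modeled.
#[derive(Debug, Clone)]
struct ColumnInfo {
    id: u32,
    roles: Vec<String>,
    generated_columns: Vec<String>,
}

/// Return the generated column names of every variable that has `role`,
/// ordered by the variable's `id`.
fn columns_with_role(columns: &BTreeMap<String, ColumnInfo>, role: &str) -> Vec<String> {
    let mut matching: Vec<&ColumnInfo> = columns
        .values()
        .filter(|info| info.roles.iter().any(|r| r == role))
        .collect();
    matching.sort_by_key(|info| info.id);
    matching
        .into_iter()
        .flat_map(|info| info.generated_columns.iter().cloned())
        .collect()
}

/// Hand-built copy of the "columns" map from the `y ~ x + z` output above.
fn example_columns() -> BTreeMap<String, ColumnInfo> {
    let mut columns = BTreeMap::new();
    let entries = [("y", 1, "Response"), ("x", 2, "FixedEffect"), ("z", 3, "FixedEffect")];
    for (name, id, role) in entries {
        columns.insert(
            name.to_string(),
            ColumnInfo {
                id,
                roles: vec![role.to_string()],
                generated_columns: vec![name.to_string()],
            },
        );
    }
    columns
}
```

Calling `columns_with_role(&example_columns(), "FixedEffect")` selects x and z, which is the kind of lookup a design-matrix builder would need before materializing predictors.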

§Quick Start `lex_formula`

To lex a formula and get token information:

use fiasto::lex_formula;

// Lex a simple linear model
let result = lex_formula("y ~ x + z");
match result {
    Ok(tokens) => println!("{}", serde_json::to_string_pretty(&tokens).unwrap()),
    Err(e) => eprintln!("Error: {}", e),
}

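This prints one JSON object per token. Tokens have the following shape (this sample comes from a formula that uses mpg as a column name):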

{ "token": "ColumnName", "lexeme": "mpg" }
{ "token": "Tilde", "lexeme": "~" }
{ "token": "Plus", "lexeme": "+" }

§Run Examples

You can run the examples in the examples/ directory with the command cargo run --example <example_name>. For example:

cargo run --example intercept_only - Demonstrates intercept-only model parsing
cargo run --example 03 - Demonstrates parsing a complex formula shown below

use fiasto::parse_formula;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input = "y ~ x + poly(x, 2) + poly(x1, 4) + log(x1) - 1, family = gaussian";

    println!("Testing public parse_formula function:");
    println!("Input: {}", input);

    let result = parse_formula(input)?;

    println!("FORMULA METADATA (as JSON):");
    println!("{}", result);
    println!("{}", serde_json::to_string_pretty(&result)?);

    println!("\n\n");

    Ok(())
}

§Supported Syntax

§Basic Models

Linear models: y ~ x + z
Intercept-only models: y ~ 1
No-intercept models: y ~ 0
Multivariate models: bind(y1, y2) ~ x + z
Polynomial terms: y ~ poly(x, 3)
Interactions: y ~ x:z or y ~ x*z
Family specification: y ~ x, family = gaussian

§Random Effects

Random intercepts: (1 | group)
Random slopes: (0 + x | group)
Correlated effects: (x | group)
Uncorrelated effects: (x || group)
Advanced grouping: (1 | gr(group, cor = FALSE))
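
The random-effects forms above share one shape: parentheses around an effects expression, a `|` or `||` separator, and a grouping expression. The sketch below splits a term along those lines. It is a simplified illustration, not fiasto's parser; `RandomTerm` and `split_random_term` are hypothetical names, and the grouping expression (for example `gr(group, cor = FALSE)`) is returned unparsed.

```rust
/// Hypothetical description of one random-effects term.
#[derive(Debug, PartialEq)]
struct RandomTerm<'a> {
    /// Text to the left of the bar, e.g. "1", "0 + x", or "x".
    effects: &'a str,
    /// Text to the right of the bar, left unparsed.
    group: &'a str,
    /// True when the term uses `||` (the uncorrelated form) instead of `|`.
    double_bar: bool,
}

/// Split a term such as "(x || group)" into its parts.
/// Returns None if the parentheses or the bar are missing.
fn split_random_term(term: &str) -> Option<RandomTerm<'_>> {
    let inner = term.trim().strip_prefix('(')?.strip_suffix(')')?;
    // Try `||` first, because a `||` separator also contains a single bar.
    let (effects, group, double_bar) = match inner.split_once("||") {
        Some((lhs, rhs)) => (lhs, rhs, true),
        None => {
            let (lhs, rhs) = inner.split_once('|')?;
            (lhs, rhs, false)
        }
    };
    Some(RandomTerm {
        effects: effects.trim(),
        group: group.trim(),
        double_bar,
    })
}
```

Checking for `||` before `|` matters: splitting on the single bar first would leave a stray `|` at the start of the group text for uncorrelated terms.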

Modules§

internal

Functions§

lex_formula: Lex a formula and return JSON describing each token.
parse_formula: Parse a statistical formula string and return comprehensive metadata as JSON
