Robin Sparkless

PySpark-style DataFrames in Rust—no JVM. A DataFrame library that mirrors PySpark’s API and semantics while using Polars as the execution engine.



Why Robin Sparkless?

  • Familiar API — SparkSession, DataFrame, Column, and PySpark-like functions so you can reuse patterns without the JVM.
  • Polars under the hood — Fast, native Rust execution with Polars for IO, expressions, and aggregations.
  • Persistence options — Global temp views (cross-session in-memory) and disk-backed saveAsTable via spark.sql.warehouse.dir.
  • Sparkless backend target — Designed to power Sparkless (the Python PySpark replacement) as a Rust execution engine.

Features

| Area | What's included |
|---|---|
| Core | SparkSession, DataFrame, Column; lazy by default (transformations extend the plan; only actions like collect, show, count, write materialize); filter, select, with_column, order_by, group_by, joins |
| IO | CSV, Parquet, JSON via SparkSession::read_* |
| Expressions | col(), lit(), when/then/otherwise, coalesce, cast, type/conditional helpers |
| Aggregates | count, sum, avg, min, max, and more; multi-column groupBy |
| Window | row_number, rank, dense_rank, lag, lead, first_value, last_value, and others with .over() |
| Arrays & maps | array_*, explode, create_map, map_keys, map_values, and related functions |
| Strings & JSON | String functions (upper, lower, substring, regexp_*, etc.), get_json_object, from_json, to_json |
| Datetime & math | Date/time extractors and arithmetic, year/month/day, math (sin, cos, sqrt, pow, …) |
| Optional SQL | spark.sql("SELECT ...") with temp views, global temp views (cross-session), and tables: createOrReplaceTempView, createOrReplaceGlobalTempView, table(name), table("global_temp.name"), df.write().saveAsTable(name, mode=...), spark.catalog().listTables() — enable with --features sql |
| Optional Delta | read_delta(path) or read_delta(table_name), read_delta_with_version, write_delta, write_delta_table(name) — enable with --features delta (path I/O); table-by-name works with sql only |
| UDFs | Pure-Rust UDFs registered in a session-scoped registry; see docs/UDF_GUIDE.md |
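
These pieces compose the same way they do in PySpark: a file loaded with one of the read_* methods can be filtered with the expression API shown in the quick start below. A minimal sketch, assuming read_csv takes a path, returns a DataFrame, and that the file has an age column; check docs.rs for the exact reader signatures and options:

use robin_sparkless::{col, lit_i64, SparkSession};

fn load_and_filter() -> Result<(), Box<dyn std::error::Error>> {
    let spark = SparkSession::builder().app_name("csv-demo").get_or_create();
    // "people.csv" is a placeholder path; read_parquet and read_json follow the same pattern.
    let people = spark.read_csv("people.csv")?;
    let adults = people.filter(col("age").gt(lit_i64(18).into_expr()).into_expr())?;
    adults.show(Some(20))?;
    Ok(())
}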

Parity: 200+ fixtures validated against PySpark. Known differences from PySpark are documented in docs/PYSPARK_DIFFERENCES.md. Out-of-scope items (XML, UDTF, streaming, RDD) are in docs/DEFERRED_SCOPE.md. Full parity status: docs/PARITY_STATUS.md.


Installation

Rust

Add to your Cargo.toml:

[dependencies]
robin-sparkless = "0.12.0"

Optional features:

robin-sparkless = { version = "0.12.0", features = ["sql"] }   # spark.sql(), temp views
robin-sparkless = { version = "0.12.0", features = ["delta"] }  # Delta Lake read/write
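
With the sql feature enabled, a DataFrame can be registered as a temp view and queried through spark.sql. A minimal sketch; the feature table above lists the PySpark-style names (createOrReplaceTempView, spark.sql), while the snake_case spellings and return types used below are assumptions to verify against docs.rs:

use robin_sparkless::SparkSession;

fn sql_demo() -> Result<(), Box<dyn std::error::Error>> {
    let spark = SparkSession::builder().app_name("sql-demo").get_or_create();
    let df = spark.create_dataframe(
        vec![(1, 25, "Alice".to_string()), (2, 30, "Bob".to_string())],
        vec!["id", "age", "name"],
    )?;

    // Register a temp view (assumed snake_case form of createOrReplaceTempView), then query it.
    df.create_or_replace_temp_view("people")?;
    let adults = spark.sql("SELECT name, age FROM people WHERE age >= 30")?;
    adults.show(Some(10))?;
    Ok(())
}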

Quick start

Rust

use robin_sparkless::{col, lit_i64, SparkSession};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark = SparkSession::builder().app_name("demo").get_or_create();

    // Create a DataFrame from rows (id, age, name)
    let df = spark.create_dataframe(
        vec![
            (1, 25, "Alice".to_string()),
            (2, 30, "Bob".to_string()),
            (3, 35, "Charlie".to_string()),
        ],
        vec!["id", "age", "name"],
    )?;

    // Filter and show (filter takes an Expr; convert a Column with .into_expr())
    let adults = df.filter(col("age").gt(lit_i64(26).into_expr()).into_expr())?;
    adults.show(Some(10))?;

    Ok(())
}

Output (from show; run with cargo run --example demo):

shape: (2, 3)
┌─────┬─────┬─────────┐
│ id  ┆ age ┆ name    │
│ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ str     │
╞═════╪═════╪═════════╡
│ 2   ┆ 30  ┆ Bob     │
│ 3   ┆ 35  ┆ Charlie │
└─────┴─────┴─────────┘

You can also wrap an existing Polars DataFrame with DataFrame::from_polars(polars_df). See docs/QUICKSTART.md for joins, window functions, and more.
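
For instance, a frame built with the Polars df! macro can be handed straight to the PySpark-style API. A small sketch, assuming polars is also a direct dependency of your crate (at a version compatible with robin-sparkless) and that from_polars takes the frame by value and returns the wrapped DataFrame directly:

use polars::prelude::*;

fn wrap_polars() -> Result<(), Box<dyn std::error::Error>> {
    // Build a plain Polars DataFrame, then wrap it for the PySpark-style API.
    let pl_df = df!("id" => &[1i64, 2, 3], "name" => &["a", "b", "c"])?;
    let wrapped = robin_sparkless::DataFrame::from_polars(pl_df);
    wrapped.show(Some(5))?;
    Ok(())
}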

Embedding robin-sparkless in your app

Use the prelude for one-stop imports, and optionally configure the session from environment variables. Results can be returned as JSON for bindings or CLI tools.

[dependencies]
robin-sparkless = "0.12.0"

use robin_sparkless::prelude::*;

fn main() -> Result<(), robin_sparkless::EngineError> {
    // Optional: configure from env (ROBIN_SPARKLESS_WAREHOUSE_DIR, etc.)
    let config = SparklessConfig::from_env();
    let spark = SparkSession::from_config(&config);

    let df = spark
        .create_dataframe(
            vec![
                (1i64, 10i64, "a".to_string()),
                (2i64, 20i64, "b".to_string()),
                (3i64, 30i64, "c".to_string()),
            ],
            vec!["id", "value", "label"],
        )
        .map_err(robin_sparkless::EngineError::from)?;
    let filtered = df
        .filter(col("id").gt(lit_i64(1).into_expr()).into_expr())
        .map_err(robin_sparkless::EngineError::from)?;
    let json = filtered.to_json_rows()?;
    println!("{}", json);
    Ok(())
}

Example output (from the snippet above or cargo run --example embed_readme; JSON key order may vary):

[{"id":2,"value":20,"label":"b"},{"id":3,"value":30,"label":"c"}]

Run the embed_basic example: cargo run --example embed_basic. For a minimal FFI surface (no Polars types), use robin_sparkless::prelude::embed; use the *_engine() methods and schema helpers (StructType::to_json, schema_from_json) so bindings rely only on EngineError and robin-sparkless types. See docs/EMBEDDING.md.
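
As a sketch of that boundary, the helper below reuses only the calls already shown in this README and hands back nothing but a JSON String or an EngineError; a real binding would swap in the *_engine() methods and the schema helpers described in docs/EMBEDDING.md, and the helper itself is illustrative:

use robin_sparkless::prelude::*;

// Everything that crosses this function's boundary is a String or an EngineError,
// never a Polars type.
fn rows_as_json(spark: &SparkSession) -> Result<String, robin_sparkless::EngineError> {
    let df = spark
        .create_dataframe(
            vec![(1i64, 10i64, "a".to_string()), (2i64, 20i64, "b".to_string())],
            vec!["id", "value", "label"],
        )
        .map_err(robin_sparkless::EngineError::from)?;
    let json = df.to_json_rows()?;
    Ok(json)
}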

Development

Prerequisites: Rust (see rust-toolchain.toml).

| Command | Description |
|---|---|
| cargo build | Build (Rust only) |
| cargo test | Run Rust tests |
| make test | Run Rust tests (wrapper for cargo test) |
| make check | Rust only: format check, clippy, audit, deny, Rust tests. Use make -j5 check to run the five jobs in parallel. |
| make check-full | Full Rust check suite (what CI runs): fmt --check, clippy, audit, deny, tests. |
| make fmt | Format Rust code (run before check if you want to fix formatting). |
| make test-parity-phase-a … make test-parity-phase-g | Run parity fixtures for a specific phase (see PARITY_STATUS). |
| make test-parity-phases | Run all parity phases (A–G) via the parity harness. |
| make sparkless-parity | When SPARKLESS_EXPECTED_OUTPUTS is set and PySpark/Java are available, convert Sparkless fixtures, regenerate expected outputs from PySpark, and run Rust parity tests. |
| cargo bench | Benchmarks (robin-sparkless vs Polars) |
| cargo doc --open | Build and open API docs |

CI runs format, clippy, audit, deny, Rust tests, and parity tests on push/PR (see .github/workflows/ci.yml).


Documentation

| Resource | Description |
|---|---|
| Read the Docs | Full docs: quickstart, Rust usage, Sparkless integration (MkDocs) |
| docs.rs | Rust API reference |
| QUICKSTART | Build, usage, optional features, benchmarks |
| User Guide | Everyday usage (Rust) |
| Persistence Guide | Global temp views, disk-backed saveAsTable |
| UDF Guide | Scalar, vectorized, and grouped UDFs |
| PySpark Differences | Known divergences |
| Roadmap | Development phases, Sparkless integration |
| RELEASING | Publishing to crates.io |

See CHANGELOG.md for version history.


License

MIT