
Robin Sparkless

PySpark-style DataFrames in Rust—no JVM. A DataFrame library that mirrors PySpark’s API and semantics while using Polars as the execution engine.



Why Robin Sparkless?

  • Familiar API — SparkSession, DataFrame, Column, and PySpark-like functions so you can reuse patterns without the JVM.
  • Polars under the hood — Fast, native Rust execution with Polars for IO, expressions, and aggregations.
  • Persistence options — Global temp views (cross-session in-memory) and disk-backed saveAsTable via spark.sql.warehouse.dir.
  • Rust-first, Python optional — Use it as a Rust library or build the Python extension via PyO3 for a drop-in style API.
  • Sparkless backend target — Designed to power Sparkless (the Python PySpark replacement) so Sparkless can run on this engine via PyO3.

Features

  • Core — SparkSession, DataFrame, Column; filter, select, with_column, order_by, group_by, joins
  • IO — CSV, Parquet, JSON via SparkSession::read_*
  • Expressions — col(), lit(), when/then/otherwise, coalesce, cast, type/conditional helpers
  • Aggregates — count, sum, avg, min, max, and more; multi-column groupBy (see the sketch after this list)
  • Window — row_number, rank, dense_rank, lag, lead, first_value, last_value, and others with .over()
  • Arrays & maps — array_*, explode, create_map, map_keys, map_values, and related functions
  • Strings & JSON — String functions (upper, lower, substring, regexp_*, etc.), get_json_object, from_json, to_json
  • Datetime & math — Date/time extractors and arithmetic, year/month/day, math (sin, cos, sqrt, pow, …)
  • Optional SQL — spark.sql("SELECT ...") with temp views, global temp views (cross-session), and tables: createOrReplaceTempView, createOrReplaceGlobalTempView, table(name), table("global_temp.name"), df.write().saveAsTable(name, mode=...), spark.catalog().listTables(); enable with --features sql
  • Optional Delta — read_delta(path) or read_delta(table_name), read_delta_with_version, write_delta, write_delta_table(name); enable with --features delta (path I/O); table-by-name works with sql only
  • UDFs — Scalar and vectorized Python UDFs via spark.udf().register(...), grouped vectorized pandas UDFs for group_by().agg(...) (function_type="grouped_agg"), and pure-Rust UDFs; see docs/UDF_GUIDE.md
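
For example, the aggregate helpers compose like their PySpark counterparts. A minimal sketch, assuming group_by, agg, count, and alias mirror their PySpark names as the list above suggests; the exact Rust signatures are an assumption here, so verify them against the docs.rs reference:

use robin_sparkless::{col, count, SparkSession};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark = SparkSession::builder().app_name("agg-demo").get_or_create();
    let df = spark.create_dataframe(
        vec![
            (1, 25, "Alice".to_string()),
            (2, 30, "Bob".to_string()),
            (3, 30, "Carol".to_string()),
        ],
        vec!["id", "age", "name"],
    )?;

    // Hypothetical call shapes: group_by/agg/count/alias are named in the
    // feature list above, but check docs.rs for the exact signatures.
    let by_age = df.group_by(vec!["age"])?.agg(vec![count(col("id")).alias("n")])?;
    by_age.show(Some(10))?;
    Ok(())
}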

Parity: 200+ fixtures validated against PySpark. Known differences from PySpark are documented in docs/PYSPARK_DIFFERENCES.md. Out-of-scope items (XML, UDTF, streaming, RDD) are in docs/DEFERRED_SCOPE.md. Full parity status: docs/PARITY_STATUS.md.


Installation

Rust

Add to your Cargo.toml:

[dependencies]
robin-sparkless = "0.8.4"

Optional features:

robin-sparkless = { version = "0.8.4", features = ["sql"] }   # spark.sql(), temp views
robin-sparkless = { version = "0.8.4", features = ["delta"] }  # Delta Lake read/write

Python (PyO3)

Install from PyPI (Python 3.8+):

pip install robin-sparkless

Or build from source with maturin:

pip install maturin
maturin develop --features pyo3
# With optional SQL and/or Delta:
maturin develop --features "pyo3,sql"
maturin develop --features "pyo3,delta"
maturin develop --features "pyo3,sql,delta"

Then use the robin_sparkless module; see docs/PYTHON_API.md.


Quick start

Rust

use robin_sparkless::{col, lit_i64, SparkSession};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark = SparkSession::builder().app_name("demo").get_or_create();

    // Create a DataFrame from rows (id, age, name)
    let df = spark.create_dataframe(
        vec![
            (1, 25, "Alice".to_string()),
            (2, 30, "Bob".to_string()),
            (3, 35, "Charlie".to_string()),
        ],
        vec!["id", "age", "name"],
    )?;

    // Filter and show
    let adults = df.filter(col("age").gt(lit_i64(26)))?;
    adults.show(Some(10))?;

    Ok(())
}

Output (from show):

shape: (2, 3)
┌─────┬─────┬─────────┐
│ id  ┆ age ┆ name    │
│ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ str     │
╞═════╪═════╪═════════╡
│ 2   ┆ 30  ┆ Bob     │
│ 3   ┆ 35  ┆ Charlie │
└─────┴─────┴─────────┘

You can also wrap an existing Polars DataFrame with DataFrame::from_polars(polars_df). See docs/QUICKSTART.md for joins, window functions, and more.
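
A minimal sketch of that interop, assuming from_polars is a plain constructor as the sentence above suggests and that your Cargo.toml also depends on polars (verify the exact signature on docs.rs):

use polars::df;
use robin_sparkless::DataFrame;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a plain Polars frame with the df! macro...
    let pl_df = df!("id" => [1i64, 2, 3], "name" => ["a", "b", "c"])?;
    // ...then wrap it to get the PySpark-style API shown above.
    let df = DataFrame::from_polars(pl_df);
    df.show(Some(10))?;
    Ok(())
}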

Python

import robin_sparkless as rs

spark = rs.SparkSession.builder().app_name("demo").get_or_create()
df = spark.create_dataframe(
    [(1, 25, "Alice"), (2, 30, "Bob"), (3, 35, "Charlie")],
    ["id", "age", "name"],
)
filtered = df.filter(rs.col("age") > rs.lit(26))  # or .gt(rs.lit(26))
print(filtered.collect())

Output:

[{'id': 2, 'age': 30, 'name': 'Bob'}, {'id': 3, 'age': 35, 'name': 'Charlie'}]
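
If you built with the sql feature, the temp-view workflow from the feature list registers a DataFrame and queries it back. A minimal Rust sketch, assuming snake_case spellings of the methods named above (e.g. create_or_replace_temp_view for createOrReplaceTempView); verify the names on docs.rs:

use robin_sparkless::SparkSession;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark = SparkSession::builder().app_name("sql-demo").get_or_create();
    let df = spark.create_dataframe(
        vec![(1, 25, "Alice".to_string()), (2, 30, "Bob".to_string())],
        vec!["id", "age", "name"],
    )?;
    // Assumed snake_case spelling; check docs.rs for the exact method name.
    df.create_or_replace_temp_view("people")?;
    let adults = spark.sql("SELECT name FROM people WHERE age > 26")?;
    adults.show(Some(10))?;
    Ok(())
}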

Development

Prerequisites: Rust (see rust-toolchain.toml). For Python tests: Python 3.8+, maturin, pytest. For the full check (lint + type-check): ruff and mypy (installed by the Makefile when needed).

  • cargo build — Build (Rust only)
  • cargo build --features pyo3 — Build with the Python extension
  • cargo test — Run Rust tests
  • make test — Run Rust + Python tests (creates a venv, runs maturin develop --features pyo3,sql,delta, then pytest)
  • make check — Rust only: format check, clippy, audit, deny, Rust tests; use make -j5 check to run the five jobs in parallel
  • make check-full — Full CI: check plus Python lint (ruff, mypy) and Python tests; use make -j7 check-full to run all seven jobs in parallel (five Rust + two Python), or -j3 for the three top-level jobs
  • make fmt — Format Rust code (run before check if you want to fix formatting)
  • make test-parity-phase-a through make test-parity-phase-g — Run parity fixtures for a specific phase (see PARITY_STATUS)
  • make lint-python — Python only: ruff format --check, ruff check, mypy
  • cargo bench — Benchmarks (robin-sparkless vs Polars)
  • cargo doc --open — Build and open the API docs
  • make gap-analysis — PySpark gap analysis (clones the Spark repo, extracts APIs, produces docs/GAP_ANALYSIS_PYSPARK_REPO.md)
  • make gap-analysis-quick — Quick gap analysis (uses the existing pyspark_api_from_repo.json)

CI runs format, clippy, audit, deny, Rust tests, Python lint (ruff, mypy), and Python tests on push/PR (see .github/workflows/ci.yml).


Documentation

  • Read the Docs — Full docs: quickstart, Python API, Sparkless integration (MkDocs)
  • docs.rs — Rust API reference
  • PyPI — Python package (wheels for Linux, macOS, Windows)
  • QUICKSTART — Build, usage, optional features, benchmarks
  • User Guide — Everyday usage (Rust and Python)
  • Persistence Guide — Global temp views, disk-backed saveAsTable
  • UDF Guide — Scalar, vectorized, and grouped UDFs
  • PySpark Differences — Known divergences
  • Rust–Python parity cross-check — Column/function binding parity (Rust vs Python)
  • Roadmap — Development phases, Sparkless integration
  • RELEASING — Publishing to crates.io

See CHANGELOG.md for version history.


License

MIT