robin-sparkless 4.2.0

# Robin-Sparkless Readiness for Post-Refactor Merge

While Sparkless implements the [refactor plan](SPARKLESS_REFACTOR_PLAN.md) (serializable logical plan and optional `materialize_from_plan`), we can prepare on the robin-sparkless side so that when the new contract lands, integration is a thin adapter rather than a large translation layer. This document lists **concrete steps we can take here** in parallel.

---

## 1. Plan Interpreter: Execute a Serialized Op List

**Goal**: Add an entry point that takes `(data, schema, logical_plan)` and returns rows, using only robin-sparkless’s existing DataFrame API.

- **Rust**: Add a function (e.g. in `src/session.rs` or a new `src/plan.rs`) such as `execute_plan(session, data, schema, plan: &[PlanOp]) -> Result<DataFrame>` where `PlanOp` is a struct that mirrors the serialized format (op name + payload as serde-friendly types). The implementation loops over the plan and calls existing `filter`, `select`, `join`, etc. based on op name.
- **Python**: Expose it, e.g. `robin_sparkless.execute_plan(data, schema, logical_plan)` returning a list of dicts (or a DataFrame that can be collected). Sparkless’s `RobinMaterializer.materialize_from_plan(data, schema, logical_plan)` would then be a one-liner that calls this and converts rows to Sparkless `Row`.
- **Format**: Start with a minimal schema we control (see §2). When Sparkless publishes their format, we either align to it or add a small adapter that maps their plan to our `PlanOp` shape.

**Outcome**: Sparkless backend becomes “call `execute_plan` and convert to Row”; no Column-tree translation in Python.

---

## 2. Propose a Minimal Logical Plan Schema (Optional Coordination)

**Goal**: So that Sparkless and robin-sparkless don’t diverge, we can publish a minimal “backend plan format” we’re willing to consume.

- **Op list**: List of `{"op": "filter"|"select"|"limit"|... , "payload": ...}`. Payload is op-specific.
- **Payload shapes** (minimal set we need):
  - `filter`: expression tree (see below).
  - `select`: `["col1", "col2"]` or list of expression trees for computed columns.
  - `withColumn`: `{"name": "x", "expr": <expression tree>}`.
  - `join`: `{"other_data": [...], "other_schema": [...], "on": ["id"], "how": "inner"}` (other side as data + schema so it’s serializable).
  - `union`: same idea, other as data + schema.
  - `orderBy`: `{"columns": ["a","b"], "ascending": [true, false]}`.
  - `limit` / `offset`: `{"n": 10}`.
  - `groupBy`: `{"group_by": ["a"], "aggs": [{"agg": "sum", "column": "b"}, ...]}`.
  - `distinct`: `{}`.
  - `drop`: `{"columns": ["x"]}`.
  - `withColumnRenamed`: `{"old": "a", "new": "b"}`.
- **Expression tree**: Recursive structure we can interpret, e.g. `{"col": "age"}`, `{"lit": 30}`, `{"op": "gt", "left": {"col": "age"}, "right": {"lit": 30}}`, `{"op": "and", "left": {...}, "right": {...}}`, and for function calls `{"fn": "upper", "args": [{"col": "name"}]}`. Document the set of ops and functions we support so Sparkless can serialize to that subset (or we extend our interpreter).

**Action**: Add a short doc (e.g. `docs/LOGICAL_PLAN_FORMAT.md`) that defines this schema. Sparkless refactor can target it; if they choose a different format, we add a thin “plan adapter” that converts their plan to ours before execution.

**Outcome**: Clear contract; less risk of incompatible plans at merge time.

---

## 3. Expression Interpreter from Structured Form

**Goal**: Our plan interpreter must evaluate “expression trees” (filter conditions, withColumn exprs, select exprs) that are already in serialized form (dict/list/primitives).

- **Rust**: Add a module that turns a serialized expression (e.g. `{"op": "gt", "left": {"col": "age"}, "right": {"lit": 30}}`) into a Polars `Expr` (or our `Column`). Recursively handle `col`, `lit`, comparison ops, logical ops, and function calls. **Done**: The expression interpreter in `src/plan/expr.rs` now supports **all scalar functions** in robin-sparkless that are valid in filter/select/withColumn (string, math, datetime, type/conditional, binary/bit, array/list, map/struct, misc), delegating to `crate::functions` and `Column`; see [LOGICAL_PLAN_FORMAT.md](LOGICAL_PLAN_FORMAT.md).
- **Coverage**: Full. Any function available in `functions.rs` / `Column` for scalar expressions can be used in plan expression trees.
- **Python**: No need to expose the expression interpreter directly; it’s used internally by `execute_plan`.

**Outcome**: We can run plans that include filter/select/withColumn with structured expressions without any Python-side Column translation.

---

## 4. Tests Against Fixture Plans

**Goal**: Lock in behavior of our plan interpreter and catch regressions.

- **Fixture files**: Add JSON fixtures under e.g. `tests/fixtures/plans/` that represent a logical plan (schema + input rows + op list + expected output rows). Same structure as existing parity fixtures but with a “plan” instead of “operations” in the current robin format.
- **Test**: Load fixture, call `execute_plan` (or the Rust equivalent in tests), assert output schema and rows match expected. **Done**: Three plan fixtures — `filter_select_limit.json`, `join_simple.json`, `with_column_functions.json` (withColumn with upper, when); `plan_parity_fixtures` test in `tests/parity.rs`.
- **Reuse**: When Sparkless publishes sample serialized plans (e.g. from their Phase 1 tests), we can add those as fixtures and run them here to ensure we stay aligned.

**Outcome**: Safe refactors; clear compatibility with whatever plan format we commit to.

---

## 5. Python API: Flexible DataFrame Creation from Rows

**Goal**: Sparkless will pass `data` as list of dicts and `schema` as a list of (name, type) or similar.

- **Done (#372)**: Python **`createDataFrame(data, schema=None)`** accepts list of dicts (schema inferred), list of tuples with column names, or explicit schema as list of `(name, dtype_str)` or StructType-like. Sparkless can call `spark.createDataFrame(data, schema)` for arbitrary schemas.
- **Backward compatibility**: `create_dataframe(data, column_names)` remains for 3-tuple rows; `_create_dataframe_from_rows(data, schema)` is internal/compatibility.

**Outcome**: Robin backend can handle any schema Sparkless sends, not only the PoC’s 3-column case.

---

## 6. Summary Table

| Item | Owner | Delivers |
|------|--------|----------|
| Plan interpreter (Rust + Python) | robin-sparkless | `execute_plan(data, schema, plan)` → rows |
| Minimal plan schema doc | robin-sparkless | `docs/LOGICAL_PLAN_FORMAT.md` (optional coordination with Sparkless) |
| Expression interpreter (from dict tree) | robin-sparkless | Filter/select/withColumn exprs from serialized form |
| Plan-based fixtures and tests | robin-sparkless | Regression tests for plan execution |
| Flexible DataFrame creation (`createDataFrame` / `create_dataframe_from_rows`) | robin-sparkless | Python: `createDataFrame(data, schema=None)`; Rust: `create_dataframe_from_rows(rows, schema)` |

Doing these in parallel with Sparkless’s refactor means that when they add `materialize_from_plan` and emit a logical plan, we already have an interpreter and tests; the Sparkless robin backend then just wires their plan into our `execute_plan` and converts results to `Row`.