# Parity Status (PySpark vs Robin Sparkless)
This doc is the **living parity matrix** for `robin-sparkless`.
- **Oracle**: PySpark (fixtures generated by `tests/gen_pyspark_cases.py`)
- **Harness**: `tests/parity.rs`
- **Fixtures**: `tests/fixtures/*.json` (operations format); `tests/fixtures/plans/*.json` (plan format, see [LOGICAL_PLAN_FORMAT.md](LOGICAL_PLAN_FORMAT.md)); `tests/fixtures/phase_manifest.json` (phase-to-fixture mapping for phase-specific tests)
- **Sparkless integration**: Robin-sparkless is designed to replace Sparkless's backend. Sparkless has 270+ expected_outputs; a fixture converter can convert those to robin-sparkless format. See [SPARKLESS_INTEGRATION_ANALYSIS.md](SPARKLESS_INTEGRATION_ANALYSIS.md) §4.
Status as of **February 2026**: **PASSING (212 fixtures; 0 skipped)**.

- **Phase G** ✅ **COMPLETED**: Parity fixture expansion. 201 hand-written fixtures passing (filter_age_lt_25, filter_name_eq, select_single_column, groupby_count_desc, limit_one, orderby_desc, with_column_lit, distinct_all, fillna_simple, filter_then_select, groupby_sum_simple, filter_ge, filter_ne, filter_le, filter_or_simple, filter_eq_lit, select_reorder, and 40+ more added).
- **Phase C** ✅ **COMPLETED**: DataFrameReader/Writer parity. `spark.read().option/options/format/load/table/csv/parquet/json`; `df.write().option/options/partition_by/parquet/csv/json`; fixtures `read_csv_with_options`, `read_table`.
- **Phase D** ✅ **COMPLETED**: DataFrame method gaps. `df.createOrReplaceTempView`, `df.corr(col1,col2)`, `df.cov(col1,col2)`, `toDF`/`toJSON`/`toPandas`, `columns`, `cache`, `hint`, `repartitionByRange`, `sortWithinPartitions`, `sameSemantics`, `semanticHash`, `isLocal`, `inputFiles`, `writeTo` (stub).
- **Phase E** ✅ **COMPLETED**: SparkSession & Catalog stubs. `spark.catalog()`, `spark.conf()`, `spark.range()`, `spark.version`, `spark.newSession()`, `spark.stop()`, `spark.getActiveSession()`, `spark.getDefaultSession()`, `spark.udf()` (stub); Catalog 27 methods (functional: dropTempView, listTables, tableExists, etc.; stubs: cacheTable, createTable, etc.).
- **Gap closure** (Feb 2026): bitmap (5), make_dt_interval, make_ym_interval, to_timestamp_ltz/ntz, sequence, shuffle, inline, inline_outer, regr_* (9); DataFrame cube, rollup, write, data, toLocalIterator, persist/unpersist and stubs (rdd, foreach, foreachPartition, mapInPandas, mapPartitions, storageLevel, isStreaming, withWatermark).
- **Signature alignment** (optional params and two-arg when): fixtures `position_start`, `assert_true_err_msg`, `like_escape_char`, `ilike_escape_char`, `months_between_round_off`, `parse_url_key`, `make_timestamp_timezone`, `to_timestamp_format`, `to_char_format`, `when_two_arg` added.
- **Phase 25** ✅ **COMPLETED**: Plan interpreter (`execute_plan`), expression interpreter, [LOGICAL_PLAN_FORMAT.md](LOGICAL_PLAN_FORMAT.md), plan fixtures in `tests/fixtures/plans/` (filter_select_limit, join_simple, with_column_functions), `plan_parity_fixtures` test; `create_dataframe_from_rows` (Rust + Python). Remaining: Phase 26 (crate publish), Phase 27 (Sparkless integration).
- **Phase 24** ✅ **COMPLETED**: bit (bit_and, bit_or, bit_xor, bit_count, bit_get, bitwise_not/bitwiseNOT), control (assert_true, raise_error), JVM stubs (broadcast, spark_partition_id, input_file_name, monotonically_increasing_id, current_catalog, current_database, current_schema, current_user, user), random (rand, randn with per-row values when used in with_column/with_columns), crypto (aes_encrypt, aes_decrypt, try_aes_decrypt; AES-128-GCM). Fixtures `with_bit_ops`, `with_rand_seed`, `with_jvm_stubs`. See [PYSPARK_DIFFERENCES.md](PYSPARK_DIFFERENCES.md) for crypto semantics.
- **Phase 23** ✅ **COMPLETED**: JSON/URL/misc (isin, url_decode, url_encode, json_array_length, parse_url, hash, shift_left, shift_right, version, equal_null, stack); fixtures `with_isin`, `with_url_decode`, `with_url_encode`, `json_array_length_test`, `with_hash`, `with_shift_left`.
- **Phase 22** ✅ **COMPLETED**: Datetime extensions (curdate, now, localtimestamp, date_diff, dateadd, datepart, extract, date_part, unix_micros, unix_millis, unix_seconds, dayname, weekday, make_timestamp, make_timestamp_ntz, make_interval, timestampadd, timestampdiff, days, hours, minutes, months, years, from_utc_timestamp, to_utc_timestamp, convert_timezone, current_timezone, to_timestamp); fixtures `with_dayname`, `with_weekday`, `with_extract`, `with_unix_micros`, `make_timestamp_test`, `timestampadd_test`, `from_utc_timestamp_test`.
- **Phase 21** ✅ **COMPLETED**: String (btrim, locate, conv), binary (hex, unhex, bin, getbit), type (to_char, to_varchar, to_number, try_to_number, try_to_timestamp), array (arrays_overlap, arrays_zip, explode_outer, posexplode_outer, array_agg), map (str_to_map), struct (transform_keys, transform_values).
- **Phase 20** ✅ **COMPLETED**: Ordering (asc, desc, nulls_first/last), aggregates (median, mode, stddev_pop, var_pop, try_sum, try_avg), numeric (bround, negate, positive, cot, csc, sec, e, pi); fixtures `groupby_median`, `with_bround`; OrderBy supports optional nulls_first.
- **Phase 19** ✅ **COMPLETED**: Aggregates (any_value, bool_and, bool_or, count_if, max_by, min_by, percentile, product, collect_list, collect_set), try_* (try_divide, try_add, try_subtract, try_multiply), misc (width_bucket, elt, bit_length, typeof); fixtures `groupby_any_value`, `groupby_product`, `try_divide`, `width_bucket`.
- **Phase 18** ✅ **COMPLETED**: array/map/struct (map_filter, zip_with, map_zip_with).
- **Phase 17** ✅ **COMPLETED**: Datetime/unix, math (pmod, factorial).
- **Phase 16** ✅ **COMPLETED**: String/regex.
- **Phase 15** ✅ **COMPLETED**: aliases, string, math, array_distinct.
- **Phase 14**: Math (sin, cos, tan, asin, acos, atan, atan2, degrees, radians, signum), datetime (quarter, weekofyear, dayofweek, dayofyear, add_months, months_between, next_day), type/conditional (cast, try_cast, isnan, greatest, least); parity parser extended; fixtures `math_sin_cos`, `datetime_quarter_week`.
- **Phase 13**: String/binary/collection batch 1: ascii, format_number, overlay, position, char, chr, base64, unbase64, sha1, sha2, md5, array_compact implemented in Rust; parity parser and fixtures `string_ascii`, `string_format_number`.
- **Phase 12**: DataFrame methods implemented in Rust and exposed in Python: sample, random_split, first, head, tail, take, is_empty, to_json, to_pandas, explain, print_schema, checkpoint, repartition, coalesce, offset, summary, to_df, select_expr, col_regex, with_columns, with_columns_renamed, stat (cov/corr), na (fill/drop), freq_items, approx_quantile, crosstab, melt, except_all, intersect_all, sample_by, and Spark no-ops. Parity fixtures for first/head/offset: `first_row`, `head_n`, `offset_n`.
- **Phase 11**: Parity harness supports date, timestamp, and boolean in fixture input; datetime fixtures `date_add_sub`, `datediff`, `datetime_hour_minute`; String 6.4 fixtures `string_soundex`, `string_levenshtein`, `string_crc32`, `string_xxhash64`.
- **Window fixtures**: percent_rank, cume_dist, ntile, nth_value are covered (multi-step workaround in harness).
- **Phase 6**: Array functions `array_position`, `array_remove`, `posexplode` are **implemented** (via Polars list.eval); array fixtures `array_contains`, `element_at`, `array_size`, `array_sum`; array extensions (exists, forall, filter, transform, array_sum, array_mean).
- **Phase 8**: array_flatten, array_repeat **implemented** via map UDFs; Map (create_map, map_keys, map_values, map_entries, map_from_arrays **implemented**; Map as List(Struct{key, value})).
- **JSON**: get_json_object, from_json, to_json implemented.

CI runs format, clippy, audit, deny, and all tests (including parity). Python smoke tests live in `tests/python/` (run via `make test` or `make test-python`); see [EMBEDDING.md](EMBEDDING.md).
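
The Phase 8 note that a map is stored as `List(Struct{key, value})` can be pictured with a small plain-Python model. This is only a sketch of the representation, assuming each entry behaves like a `{"key": ..., "value": ...}` record; the function bodies are illustrative and are not the robin-sparkless implementation.

```python
# Plain-Python model of "Map as List(Struct{key, value})".
# Assumption: each map entry is a dict with exactly the fields "key" and "value".

def create_map(*kv):
    """Build a map from alternating key, value arguments."""
    if len(kv) % 2 != 0:
        raise ValueError("create_map expects an even number of arguments")
    return [{"key": kv[i], "value": kv[i + 1]} for i in range(0, len(kv), 2)]

def map_from_arrays(keys, values):
    """Build a map by pairing two equal-length lists."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    return [{"key": k, "value": v} for k, v in zip(keys, values)]

def map_keys(m):
    """List of keys, in entry order."""
    return [entry["key"] for entry in m]

def map_values(m):
    """List of values, in entry order."""
    return [entry["value"] for entry in m]

def map_entries(m):
    """With this representation the entries are the stored list itself."""
    return list(m)

m = create_map("a", 1, "b", 2)
print(map_keys(m), map_values(m))
```

With this layout `map_entries` is close to a no-op, which is one plausible reason to choose a list-of-structs encoding.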
## Phase test coverage
Parity fixtures are grouped into phases (A–G) defined in `tests/fixtures/phase_manifest.json`. Run phase-specific tests:
```bash
# Run a single phase directly by setting PARITY_PHASE (here: phase A)
PARITY_PHASE=a cargo test pyspark_parity_fixtures
# Per-phase Makefile targets
make test-parity-phase-a # Phase A: signature alignment
make test-parity-phase-b # Phase B: high-value functions
make test-parity-phase-c # Phase C: Reader/Writer
make test-parity-phase-d # Phase D: DataFrame methods (read_table)
make test-parity-phase-e # Phase E: SparkSession/Catalog (no fixtures; passes)
make test-parity-phase-f # Phase F: behavioral (assert_true, raise_error)
make test-parity-phase-g # Phase G: fixture expansion
make test-parity-phases # All phases
```
Python phase smoke tests: `test_phase_a_signature_alignment`, `test_phase_b_functions`, `test_phase_c_reader_writer`, `test_phase_d_dataframe_methods`, `test_phase_e_spark_session_catalog`, `test_phase_f_behavioral`. When adding new fixtures, add the fixture name to the appropriate phase in `phase_manifest.json`. See [TEST_CREATION_GUIDE.md](TEST_CREATION_GUIDE.md) for phase testing details.
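
As a rough illustration of the phase-to-fixture mapping, the sketch below looks up and registers fixture names per phase. The manifest layout shown here (phase letter mapped to a list of fixture names) is an assumption made for the example; check the real `phase_manifest.json` for its actual structure, and treat the phase assignments as illustrative.

```python
import json

# Hypothetical manifest layout: {"<phase letter>": ["<fixture name>", ...]}.
# The real tests/fixtures/phase_manifest.json may be structured differently.
EXAMPLE_MANIFEST = json.loads("""
{
  "a": ["position_start", "when_two_arg"],
  "c": ["read_csv_with_options", "read_table"],
  "e": []
}
""")

def fixtures_for_phase(manifest, phase):
    """Return the fixture names registered for a phase letter (case-insensitive)."""
    return list(manifest.get(phase.lower(), []))

def add_fixture(manifest, phase, name):
    """Register a fixture under a phase, keeping names unique and sorted."""
    names = set(manifest.get(phase.lower(), []))
    names.add(name)
    manifest[phase.lower()] = sorted(names)
    return manifest

print(fixtures_for_phase(EXAMPLE_MANIFEST, "A"))
```

A phase with an empty list (like `e` above) simply selects no fixtures, which matches the note that Phase E has no fixtures and passes.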
---
## Legend
- **✅ Covered**: Covered by one or more fixtures (listed)
- **🚧 Not yet covered**: Supported/partially supported but missing fixture coverage
- **❌ Not implemented**: Not implemented in the Rust API yet
- **⚠️ Diverges**: Implemented but intentionally differs from PySpark (must be documented)
## Coverage Matrix (high level)
| Area | Capability | Status | Fixtures |
| --- | --- | --- | --- |
| Data creation | `SparkSession::create_dataframe` (simple rows) | ✅ Covered | `filter_age_gt_30`, `groupby_count`, `groupby_with_nulls` (and most others) |
| Data creation | `SparkSession::create_dataframe_from_rows` (arbitrary schema) | ✅ Covered | Used by plan interpreter; plan fixtures |
| Plan execution | `execute_plan` (serialized logical plan) | ✅ Covered | `tests/fixtures/plans/filter_select_limit`, `join_simple`, `with_column_functions` (plan_parity_fixtures) |
| IO | `read_csv` | ✅ Covered | `read_csv` |
| IO | `read_parquet` | ✅ Covered | `read_parquet` |
| IO | `read_json` | ✅ Covered | `read_json` |
| IO | `spark.read().option/options().csv` (reader options) | ✅ Covered | `read_csv_with_options` |
| IO | `spark.read().table(name)` (temp view) | ✅ Covered | `read_table` |
| DataFrame | `select` | ✅ Covered | many (e.g. `filter_age_gt_30`) |
| DataFrame | `filter` basic comparisons | ✅ Covered | `filter_age_gt_30` |
| DataFrame | `filter` nested boolean logic | ✅ Covered | `filter_and_or`, `filter_nested`, `filter_not` |
| DataFrame | `orderBy` | ✅ Covered | many (e.g. `filter_age_gt_30`, `groupby_count`) |
| GroupBy | `groupBy(...).count()` | ✅ Covered | `groupby_count`, `groupby_with_nulls` |
| GroupBy | `groupBy(...).sum()` | ✅ Covered | `groupby_sum` |
| GroupBy | `groupBy(...).avg()` | ✅ Covered | `groupby_avg` |
| GroupBy | `groupBy(...).min()` | ✅ Covered | `groupby_min` |
| GroupBy | `groupBy(...).max()` | ✅ Covered | `groupby_max` |
| GroupBy | groupBy with NULL keys | ✅ Covered | `groupby_null_keys` |
| GroupBy | groupBy single-row groups / single group | ✅ Covered | `groupby_single_row_groups`, `groupby_single_group` |
| GroupBy | multi-agg `agg([..])` | ✅ Covered | `groupby_multi_agg` |
| GroupBy | stddev, variance, count_distinct in agg | ✅ Covered | `groupby_stddev_count_distinct` |
| DataFrame | `withColumn` (arithmetic) | ✅ Covered | `type_coercion_mixed` |
| DataFrame | `withColumn` (logical/boolean) | ✅ Covered | `with_logical_column` |
| DataFrame | `withColumn` (mixed arithmetic + comparison) | ✅ Covered | `with_arithmetic_logical_mix` |
| Functions | `when().then().otherwise()` | ✅ Covered | `when_otherwise`, `when_then_otherwise` |
| Functions | `coalesce()` | ✅ Covered | `coalesce` |
| Null semantics | NULL equality/inequality | ✅ Covered | `null_comparison_equality` |
| Null semantics | NULL ordering comparisons | ✅ Covered | `null_comparison_ordering` |
| Null semantics | `eqNullSafe` | ✅ Covered | `null_safe_equality` |
| Null semantics | NULLs inside filter predicates | ✅ Covered | `null_in_filter` |
| Type coercion | numeric comparison coercion (int vs double) | ✅ Covered | `type_coercion_numeric` |
| Type coercion | numeric arithmetic coercion (int + double) | ✅ Covered | `type_coercion_mixed` |
| Joins | inner/left/right/outer joins | ✅ Covered | `inner_join`, `left_join`, `right_join`, `outer_join` |
| Joins | join with NULL keys (inner: nulls excluded) | ✅ Covered | `join_null_keys` |
| Joins | join with duplicate keys (cartesian match) | ✅ Covered | `join_duplicate_keys` |
| Windows | row_number, rank, dense_rank, lag, lead | ✅ Covered | `row_number_window`, `rank_window`, `lag_lead_window` |
| Strings | upper, lower, substring, concat, concat_ws | ✅ Covered | `string_upper_lower`, `string_substring`, `string_concat` |
| Strings | length, trim, ltrim, rtrim, regexp_extract, regexp_replace, split, initcap | ✅ Covered | `string_length_trim` |
| Config | `spark.sql.caseSensitive` (case-insensitive column resolution) | ✅ Covered | `case_insensitive_columns` |
| DataFrame | `union` / `unionAll` | ✅ Covered | `union_all` |
| DataFrame | `unionByName` | ✅ Covered | `union_by_name` |
| DataFrame | `distinct` / `dropDuplicates` | ✅ Covered | `distinct` |
| DataFrame | `drop` (columns) | ✅ Covered | `drop_columns` |
| DataFrame | `dropna` | ✅ Covered | `dropna` |
| DataFrame | `fillna` (single value) | ✅ Covered | `fillna` |
| DataFrame | `limit` | ✅ Covered | `limit` |
| DataFrame | `withColumnRenamed` | ✅ Covered | `with_column_renamed` |
| Array/List | array, array_contains, element_at, size/array_size, array_join, array_sort, array_slice, explode; array_position, array_remove, posexplode (implemented) | ✅ Covered | `array_contains`, `element_at`, `array_size` |
| Windows | first_value, last_value, percent_rank | ✅ Covered | `first_value_window`, `last_value_window`, `percent_rank_window` |
| Windows | cume_dist, ntile, nth_value | ✅ Covered | `cume_dist_window`, `ntile_window`, `nth_value_window` (multi-step workaround in harness) |
| Strings | regexp_extract_all, regexp_like | ✅ Covered | `regexp_extract_all`, `regexp_like` |
| Strings | repeat, reverse, instr, lpad, rpad | ✅ Covered | `string_repeat_reverse`, `string_lpad_rpad` |
| Strings | mask, translate, substring_index; soundex, levenshtein, crc32, xxhash64 (Phase 8) | ✅ Covered | `string_mask`, `string_translate`, `string_substring_index`, `string_soundex`, `string_levenshtein`, `string_crc32`, `string_xxhash64` |
| Strings (Phase 13) | ascii, format_number, overlay, position, char, chr, base64, unbase64, sha1, sha2, md5 | ✅ Implemented | `string_ascii`, `string_format_number` |
| Strings (Phase 16) | regexp_count, regexp_instr, regexp_substr, split_part, find_in_set, format_string, printf | ✅ Covered | `regexp_count`, `regexp_substr`, `regexp_instr`, `split_part`, `find_in_set`, `format_string` |
| Datetime (Phase 17) | unix_timestamp, from_unixtime, make_date, timestamp_seconds/millis/micros, unix_date, date_from_unix_date | ✅ Covered | `unix_timestamp`, `from_unixtime`, `make_date`, `timestamp_seconds`, `timestamp_millis`, `timestamp_micros`, `unix_date`, `date_from_unix_date` |
| Math (Phase 17) | pmod, factorial | ✅ Covered | `pmod`, `factorial` |
| Array | array_sum, array_exists, forall, filter, transform; array_flatten, array_repeat (Phase 8); array_compact (Phase 13) | ✅ Implemented | `array_sum` |
| Map | create_map, map_keys, map_values, map_entries, map_from_arrays (Phase 8) | 🚧 Not yet covered | No fixture yet |
| JSON | get_json_object, from_json, to_json (Phase 10) | ✅ get_json_object covered | `json_get_json_object` |
| Math | sqrt, pow, exp, log | ✅ Covered | `math_sqrt_pow` |
| GroupBy | first, last, approx_count_distinct in agg | ✅ Covered | `groupby_first_last` |
| GroupBy (Phase 19) | any_value, bool_and, bool_or, product, collect_list, collect_set, count_if, percentile, max_by, min_by | ✅ Covered | `groupby_any_value`, `groupby_product` |
| Misc (Phase 19) | try_divide, try_add, try_subtract, try_multiply, width_bucket, elt, bit_length, typeof | ✅ Covered | `try_divide`, `width_bucket` |
| DataFrame | replace, crossJoin, describe, subtract, intersect | ✅ Covered | `replace`, `cross_join`, `describe`, `subtract`, `intersect` |
| SQL | `SparkSession::sql()` (optional `sql` feature) | ✅ Implemented | No fixture (SQL translated to DataFrame ops; parity via DataFrame fixtures) |
| Datetime | year, month, day, to_date, date_format; current_date, date_add, hour, etc. | ✅ Covered | `date_add_sub`, `datediff`, `datetime_hour_minute` |
| DataFrame (Phase 12) | first, head, offset, sample, to_json, summary, stat, select_expr, freq_items, crosstab, melt, etc. (Rust + PyO3) | ✅ first/head/offset/summary covered | `first_row`, `head_n`, `offset_n`, `summary`; additional Phase 12 ops implemented, fixtures TBD |
| DataFrame (Phase D) | createOrReplaceTempView, corr(col1,col2), cov(col1,col2), toDF/toJSON/toPandas, columns, cache, hint, repartitionByRange, sortWithinPartitions, sameSemantics, semanticHash, isLocal, inputFiles, writeTo (stub) | ✅ Implemented | Python: `test_phase_d_dataframe_methods`; table read via `read_table` fixture |
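
The null-semantics rows in the matrix (`null_comparison_equality`, `null_safe_equality`, `null_in_filter`, `join_null_keys`) all rest on SQL three-valued logic as PySpark applies it. The following is a minimal plain-Python model of that behaviour, with `None` standing in for NULL; the helper names are hypothetical and this is not the robin-sparkless implementation.

```python
def sql_eq(a, b):
    """SQL `=`: the result is NULL (None) when either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b

def eq_null_safe(a, b):
    """Null-safe equality (`eqNullSafe`): always True or False, never NULL."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

def sql_filter(rows, predicate):
    """Keep rows whose predicate is exactly True; NULL results are dropped."""
    return [row for row in rows if predicate(row) is True]

rows = [
    {"a": 1, "b": 1},
    {"a": None, "b": None},
    {"a": 2, "b": None},
]

plain = sql_filter(rows, lambda r: sql_eq(r["a"], r["b"]))
safe = sql_filter(rows, lambda r: eq_null_safe(r["a"], r["b"]))
print(len(plain), len(safe))
```

The same rule explains the `join_null_keys` row: an inner join on `=` never matches NULL keys, because the comparison never evaluates to true.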
## Fixture Index
| Fixture | What it covers |
| --- | --- |
| `filter_age_gt_30` | Filter + select + orderBy (baseline) |
| `filter_and_or` | AND/OR precedence + parentheses |
| `filter_nested` | Nested boolean logic |
| `filter_not` | NOT / negation |
| `groupby_count` | groupBy + count + orderBy |
| `groupby_with_nulls` | groupBy with NULLs |
| `groupby_sum` | groupBy + sum |
| `groupby_avg` | groupBy + avg |
| `groupby_min` | groupBy + min |
| `groupby_max` | groupBy + max |
| `groupby_null_keys` | groupBy with NULL keys |
| `groupby_single_row_groups` | groupBy with single-row groups (each key once) |
| `groupby_single_group` | groupBy with single group (all same key) |
| `join_null_keys` | inner join with NULL join keys (nulls excluded) |
| `join_duplicate_keys` | inner join with duplicate keys (multiple matches) |
| `case_insensitive_columns` | case-insensitive column resolution (filter/select/orderBy with mixed-case names) |
| `read_csv` | CSV read path + operations |
| `read_parquet` | Parquet read path + operations |
| `read_json` | JSON read path + operations |
| `read_csv_with_options` | spark.read.option("header","true").csv(path) with reader_options |
| `read_table` | spark.read.table("name") via table_source (temp view) |
| `with_logical_column` | Logical columns/expressions in withColumn |
| `with_arithmetic_logical_mix` | Mixed arithmetic + comparison in withColumn |
| `when_otherwise` | when/then/otherwise |
| `when_then_otherwise` | chained when |
| `coalesce` | coalesce null handling |
| `null_comparison_equality` | NULL equality/inequality semantics |
| `null_comparison_ordering` | NULL ordering semantics |
| `null_safe_equality` | eqNullSafe semantics |
| `null_in_filter` | NULLs in filter predicates |
| `type_coercion_numeric` | int/double comparison coercion |
| `type_coercion_mixed` | int+double arithmetic coercion |
| `inner_join` | inner join on dept_id |
| `left_join` | left join + orderBy |
| `right_join` | right join + orderBy |
| `outer_join` | outer join + orderBy |
| `groupby_multi_agg` | groupBy + multiple aggregations in one agg() |
| `groupby_stddev_count_distinct` | groupBy + stddev and count_distinct in agg |
| `row_number_window` | row_number() over partition by dept order by salary desc |
| `rank_window` | rank() over partition with ties |
| `lag_lead_window` | lag and lead over partition |
| `string_upper_lower` | upper(), lower() |
| `string_substring` | substring() 1-based |
| `string_concat` | concat(), concat_ws() |
| `string_length_trim` | length(), trim() in withColumn |
| `union_all` | union (vertical stack, same schema) |
| `union_by_name` | unionByName (align columns by name) |
| `distinct` | distinct (drop duplicate rows) |
| `drop_columns` | drop(columns) |
| `dropna` | dropna (drop rows with nulls) |
| `fillna` | fillna (fill nulls with value) |
| `limit` | limit(n) |
| `with_column_renamed` | withColumnRenamed(old, new) |
| `array_contains` | split + array_contains(col, lit) |
| `element_at` | split + element_at(col, 1-based index) |
| `array_size` | split + size(col) |
| `first_value_window` | first_value over partition |
| `last_value_window` | last_value over partition |
| `percent_rank_window` | percent_rank over partition |
| `cume_dist_window` | cume_dist over partition |
| `ntile_window` | ntile(n) over partition |
| `nth_value_window` | nth_value over partition |
| `regexp_like` | regexp_like(col, pattern) boolean match |
| `regexp_extract_all` | regexp_extract_all(col, pattern) list of matches |
| `string_repeat_reverse` | repeat(col, n), reverse(col) |
| `string_lpad_rpad` | lpad(col, len, pad), rpad(col, len, pad) |
| `math_sqrt_pow` | sqrt(col), pow(col, exp) |
| `groupby_first_last` | groupBy + first(name), last(name) |
| `groupby_any_value` | groupBy + any_value(column) |
| `groupby_product` | groupBy + product(column) |
| `try_divide` | try_divide(col, col) — null on divide-by-zero |
| `width_bucket` | width_bucket(value, min, max, num_bucket) |
| `cross_join` | crossJoin (cartesian product) |
| `describe` | describe() summary statistics |
| `summary` | summary() (same as describe) |
| `replace` | replace(column, old_value, new_value) |
| `subtract` | subtract (set difference) |
| `intersect` | intersect (set intersection) |
| `first_row` | first() – first row as one-row DataFrame |
| `head_n` | head(n) – first n rows |
| `offset_n` | offset(n) – skip first n rows |
| `string_mask` | mask(col) – replace upper/lower/digit with X/x/n |
| `string_translate` | translate(col, from_str, to_str) |
| `string_substring_index` | substring_index(col, delim, count) before/after nth delim |
| `array_sum` | array(cols) + array_sum(col) |
| `json_get_json_object` | get_json_object(col, '$.path') |
| `date_add_sub` | date_add(col('d'), 7), date_sub(col('d'), 3) |
| `datediff` | datediff(col('end'), col('start')) |
| `datetime_hour_minute` | hour(col('ts')), minute(col('ts')) with timestamp input |
| `string_soundex` | soundex(col('name')) |
| `string_levenshtein` | levenshtein(col('a'), col('b')) |
| `string_crc32` | crc32(col('s')) |
| `string_xxhash64` | xxhash64(col('s')) |
| `string_ascii` | ascii(col('name')) → first-char code point |
| `string_format_number` | format_number(col('value'), 2) → fixed-decimal string |
| `phase15_aliases_nvl_isnull` | nvl, nvl2, isnull, isnotnull (Phase 15) |
| `string_left_right_replace` | left, right, replace, startswith, endswith, contains, like, ilike, rlike |
| `math_cosh_cbrt` | cosh, sinh, tanh, acosh, asinh, atanh, cbrt, expm1, log1p, log10, log2, rint, hypot |
| `array_distinct` | array_distinct(col) — *skipped* (order divergence) |
| `regexp_count` | regexp_count(col, pattern) – count non-overlapping matches |
| `regexp_substr` | regexp_substr(col, pattern) – first match substring |
| `regexp_instr` | regexp_instr(col, pattern) – 1-based position of first match |
| `split_part` | split_part(col, delim, part_num) – 1-based part of split |
| `find_in_set` | find_in_set(col('str'), col('set')) – 1-based index in comma-delimited list |
| `format_string` | format_string('%d %s', col('a'), col('b')) – printf-style formatting |
| `unix_timestamp` | unix_timestamp(col), unix_timestamp(col, format) – string to seconds |
| `from_unixtime` | from_unixtime(col), from_unixtime(col, format) – seconds to formatted string |
| `make_date` | make_date(year, month, day) – build date from parts |
| `timestamp_seconds` | timestamp_seconds(col) – seconds epoch to timestamp |
| `timestamp_millis` | timestamp_millis(col) – millis epoch to timestamp |
| `timestamp_micros` | timestamp_micros(col) – micros epoch to timestamp |
| `unix_date` | unix_date(col) – date to days since epoch |
| `date_from_unix_date` | date_from_unix_date(col) – days to date |
| `pmod` | pmod(a, b) – positive modulus |
| `factorial` | factorial(n) – n! for n 0..20 |
| `with_bit_ops` | bit operations (bit_and, bit_or, bit_xor, bit_count, bit_get) via withColumn |
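
Several index entries carry short semantic notes (1-based `substring`, `try_divide` returning null on divide-by-zero, `pmod` as positive modulus, `mask` replacing upper/lower/digit with X/x/n). The sketch below restates those notes as plain-Python reference functions. It is an illustration under simplifying assumptions (positive start positions, positive divisors, ASCII input) and is not the code the harness runs.

```python
def substring(s, pos, length):
    """1-based substring; this sketch only models pos >= 1."""
    if pos < 1:
        raise ValueError("this sketch only models pos >= 1")
    start = pos - 1
    return s[start:start + length]

def try_divide(a, b):
    """Division that yields None (NULL) instead of failing on a zero divisor."""
    if a is None or b is None or b == 0:
        return None
    return a / b

def pmod(a, b):
    """Positive modulus; this sketch assumes b > 0."""
    if b <= 0:
        raise ValueError("this sketch only models b > 0")
    return ((a % b) + b) % b

def mask(s):
    """Replace uppercase with 'X', lowercase with 'x', digits with 'n'."""
    out = []
    for ch in s:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("n")
        else:
            out.append(ch)  # other characters are left unchanged
    return "".join(out)

print(substring("hello", 1, 3), try_divide(1, 0), pmod(-7, 3), mask("Ab-12"))
```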
## Next additions to the matrix (recommended)
- Add more join edge-case fixtures (e.g. left/outer with null keys) if needed.
- **ROADMAP Phases 16–27**: Phases 16–25 are completed (Phases 20–24: full parity in 5 parts; Phase 25: plan interpreter, readiness for post-refactor merge). Remaining: Phase 26 (publish crate on crates.io) and Phase 27 (Sparkless integration, 200+ tests). See [ROADMAP.md](ROADMAP.md), [GAP_ANALYSIS_SPARKLESS_3.28.md](GAP_ANALYSIS_SPARKLESS_3.28.md).
## Sparkless Test Conversion
Sparkless ([github.com/eddiethedean/sparkless](https://github.com/eddiethedean/sparkless)) has 270+ JSON expected outputs in `tests/expected_outputs/`. These can drive robin-sparkless parity tests via a fixture converter that maps Sparkless JSON format → robin-sparkless fixture format. See [SPARKLESS_INTEGRATION_ANALYSIS.md](SPARKLESS_INTEGRATION_ANALYSIS.md) §4 for:
- Fixture format comparison (input_data vs input/rows; expected_output vs expected)
- Conversion steps per test
- Priority order: parity/dataframe, parity/functions, then parity/sql
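
A fixture converter along these lines could be sketched as follows. Only the two renames (`input_data` to input rows, `expected_output` to `expected`) and the priority order come from the list above; reading "input/rows" as a nested `input.rows` field, the `test_name` key, and the path prefixes are assumptions made for the example, so defer to SPARKLESS_INTEGRATION_ANALYSIS.md §4 for the real formats.

```python
# Priority order from the list above; treating these as path prefixes is an assumption.
PRIORITY = ["parity/dataframe", "parity/functions", "parity/sql"]

def convert_case(case):
    """Map one Sparkless expected-output record to a robin-sparkless-style fixture.

    Hypothetical shapes: `test_name` and the nested `input.rows` layout are
    illustrative guesses, not confirmed field names.
    """
    return {
        "name": case.get("test_name", "unnamed"),
        "input": {"rows": case["input_data"]},
        "expected": case["expected_output"],
    }

def conversion_order(paths):
    """Sort test paths so higher-priority categories are converted first."""
    def rank(path):
        for index, prefix in enumerate(PRIORITY):
            if path.startswith(prefix):
                return index
        return len(PRIORITY)  # anything unrecognised goes last
    return sorted(paths, key=rank)

sample = {
    "test_name": "filter_basic",
    "input_data": [{"age": 31}, {"age": 20}],
    "expected_output": [{"age": 31}],
}
fixture = convert_case(sample)
ordered = conversion_order([
    "parity/sql/select.json",
    "parity/functions/upper.json",
    "parity/dataframe/filter.json",
])
print(fixture["name"], ordered[0])
```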