substrait-explain 0.2.0

# Substrait Text Format Grammar

This document describes the grammar for the human-readable Substrait text format used by `substrait-explain`. This format allows you to write Substrait query plans in a concise, readable text format that can be parsed back into full Substrait protobuf plans.

## Overview

The Substrait text format consists of two main sections:

1. **Extensions Section** (optional) - Defines URIs and function/type extensions
2. **Plan Section** - Contains the actual query plan with relations

## Design Principles

The grammar is designed around several concrete choices that make it practical and consistent:

### 1. Single-Line, Structured Relations

All relations follow the same structure: `Name[arguments => columns]`

- **Name**: The relation type (Read, Filter, Project, etc.)
- **Arguments**: Relation-specific: input expressions, field references, or function calls
  - Arguments follow a regular pattern (tuple, input expression, etc.) or combination, and should map directly to Substrait proto fields. Uses tuples for compound arguments, with literals, expressions, and enums for values.
- **Arrow**: `=>` separates arguments from output columns
- **Columns**: Output column names and types

Every relation fits on one line with indentation showing hierarchy. This uniform pattern makes it easy to parse any relation, understand input/output structure, and add new relation types.

### 2. SQL-Like References, Literals, and Enums

- Field references: `$0`, `$1`, etc.
- Types are shown inline with literals and column names: `42:i64`, `'hello':string`
- Nullability is explicit: `string?` for nullable, `string` for non-nullable

This prevents ambiguity and makes plans self-documenting while being familiar to SQL developers.

### 3. Extension Support and Structured Syntax

- Extensions section defines URIs and function/type mappings.
- Function calls can include anchors: `add#10@1($0, $1)`.
- Clear structural boundaries: `[]` for relations, `<>` for types, `()` for functions.
- Maintains full Substrait compatibility while keeping the text format readable and parseable.

### 4. Hierarchical Organization

- Section headers (`===`) separate major components.
- 2-space indentation shows query plan hierarchy.
- Consistent formatting across all document elements.

The format maps directly to Substrait protobuf messages, with relations, expressions, types, and extensions corresponding to their respective protobuf structures.

## Grammar Notation

This document uses **PEG (Parsing Expression Grammar)** notation:

- **`"text"`** - Literal text
- **`element?`** - Optional element
- **`element*`** - Zero or more repetitions
- **`element+`** - One or more repetitions
- **`element1 / element2`** - Choice (try element1 first)
  - _Implementation Note: Pest uses `|` instead of `/`_
- **`element1 element2`** - Sequence
  - _Implementation Note: Pest uses `~` for explicit concatenation_

## Basic Example

```rust
# use substrait_explain::parser::Parser;
#
# let plan_text = r#"
=== Extensions
URIs:
  @  1: https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml
Functions:
  ## 10 @  1: add
  ## 11 @  1: multiply

=== Plan
Root[result]
  Project[$0, $1, add($0, $1)]
    Read[orders => quantity:i32?, price:i64]
# "#;
#
# let plan = match Parser::parse(plan_text) {
#     Ok(plan) => plan,
#     Err(e) => panic!("{}", e),
# };
# assert_eq!(plan.relations.len(), 1);
```

## Document Structure

A Substrait text format document consists of two main sections with specific formatting rules.

### Sections

The document uses `===` headers to separate major sections:

- **`=== Extensions`** - Defines URIs and function/type mappings (optional)
- **`=== Plan`** - Contains the actual query plan (required)

#### Extension format

```text
=== Extensions
URIs:
  @  uri_anchor: uri
  …
Functions:
  ##  anchor @  uri_anchor: name
  …
Types:
  ##  anchor @  uri_anchor: name
  …
Type Variations:
  ##  anchor @  uri_anchor: name
  …
```

Where `anchor` and `uri_anchor` are integers, `uri` is a text URI, and function, type, and type variation names are identifiers or quoted text.

### Plan Hierarchy and Indentation

Relations use indentation to show the query plan hierarchy:

- **Root level**: No indentation (typically `Root` relation)
- **Child relations**: Indented with 2 spaces per level
- **Each relation**: On its own line with format `Name[arguments => columns]`

#### Example

```rust
# use substrait_explain::parser::Parser;
#
# let plan_text = r#"
=== Extensions
URIs:
  @  1: https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml
Functions:
  ## 10 @  1: gt

=== Plan
Root[result]                   // Level 0 (no indentation)
  Project[$0, $1]              // Level 1 (2 spaces)
    Filter[gt($0, 10) => $0]   // Level 2 (4 spaces)
      Read[data => a:i64]      // Level 3 (6 spaces)
# "#;
#
# let plan = match Parser::parse(plan_text) {
#     Ok(plan) => plan,
#     Err(e) => panic!("{}", e),
# };
# assert_eq!(plan.relations.len(), 1);
```

## Basic Terminals

### Character Classes

- **`letter`**` := [a-zA-Z]` - Alphabetic characters
- **`digit`**` := [0-9]` - Numeric digits

### `name` and `identifier`

- **`name`**` := identifier / quoted_name`
  - Used for column names, function names, etc. It can be unquoted if it's a valid identifier, or using "double quotes" if special characters are required (much like SQL)
  - Examples: `function_name`, `"quoted name"`
- **`identifier`**` := letter (letter / digit / "_")*`
  - Used for columns, function names, etc. that are proper identifiers.
  - Examples: `table_name`, `my_function`, `col1`
- **`quoted_name`**` := '"' ("\\" . / !'"' .)* '"'`
  - Used for columns, function names, etc. that are not valid as identifiers, and thus need quoting.
  - Examples: `"function name"`, `"table.name"`, `"table\.name"`, `"function \"with some\nescapes\""`

### `enum`

Enum fields in arguments are represented as &-prefixed variants (e.g., `&AscNullsFirst`), matching the Substrait proto definition. This applies to all enum fields in relation arguments.

#### Syntax

`enum := "&" identifier`

#### Examples

- `&AscNullsFirst`, `&AscNullsLast`, `&DescNullsFirst`, `&DescNullsLast` - sort directions

### `literal`

A literal can come in the form of an integer, float, boolean, or string, and can have an optional additional type:

`literal := (float / integer / boolean / string) (":" type)?`

- **`integer`**` := "-"? digit+`
  - Examples: `42`, `-10`, `0`
  - Default to `i64` type; other integer types may be assigned
- **`float`**` := "-"? digit+ "." digit+`
  - Examples: `3.14`, `-2.5`, `1.0`
  - Default to `fp64` type; other float types may be assigned
- **`boolean`**` := "true" / "false"`
  - Examples: `true`, `false`
  - May only be boolean type
- **`string`**` := "'" ("\\" . / !"'" .)* "'"`
  - Examples: `'hello'`, `'table name'`, `'C:\path\to\file'`, `'line1\nline2'`, `'quote\'s here'`
  - Default to `string` type; other types may also be assigned
- **`typed_literal`**` := string ":" type`
  - String literals with type annotations for non-primitive types
  - Examples: `'2023-01-01':date`, `'2023-12-25T14:30:45.123':timestamp`

All literal types (`integer`, `float`, `boolean`, and `string`) are now supported in the current implementation. Support for typed literals (string literals with non-primitive type annotations like `'2023-01-01':date`) remains to be implemented.

## Types

The type syntax in this grammar follows the [standard Substrait type definition syntax](https://substrait.io/types/type_parsing/), with extensions to support anchors and URI references for user-defined types.

### Type Syntax Overview

All types follow this general pattern:

```text
type := "u!"? name anchor? uri_anchor? nullability? parameters?
```

Where:

- **`"u!"`** - Optional prefix for user-defined types
- **`name`** - Type name (case-insensitive, lowercase preferred)
- **`anchor`**` := "#" integer` - Extension anchor (e.g., `#10`)
- **`uri_anchor`**` := "@" integer` - URI anchor (e.g., `@1`)
- **`nullability`**` := "?"` - Optional nullability indicator (defaults to non-nullable)
- **`parameters`**` := "<" (param ("," param)*)? ">"` - Optional type parameters
- **`param`**` := type / integer / name` - Type parameter (type, integer, or name)

### Simple Types

Simple types are the basic Substrait types with optional nullability.

#### Syntax

`simple_type_name nullability?`

#### Simple Type Names

From [official Substrait grammar](https://raw.githubusercontent.com/substrait-io/substrait/refs/heads/main/grammar/SubstraitType.g4), `simple_type_name` can be any of these literal strings:

- `boolean`, `i8`, `i16`, `i32`, `i64`
- `fp32`, `fp64`
- `string`, `binary`
- `timestamp`, `timestamp_tz`, `date`, `time`
- `interval_year`, `uuid`

#### Nullability

- `?` - nullable
- `⁉` - unspecified nullability (not generally valid)
- (nothing) - non-nullable

##### Examples:

```rust
# use substrait_explain::parser::Parser;

let plan_text = r#"
=== Plan
Root[result]
  Project[$0, $1, $2, $3]
    Read[data => int_field:i64, string_field:string?, created_at:timestamp?, user_id:uuid]
"#;
#
# let plan = match Parser::parse(plan_text) {
#     Ok(plan) => plan,
#     Err(e) => panic!("{}", e),
# };
# assert_eq!(plan.relations.len(), 1);
```

### Compound Types

Compound types follow the same syntax as standard Substrait parameterized types.

#### Examples

// TODO: This example uses `map` type, which is not yet implemented in the parser.

```text
use substrait_explain::parser::Parser;

let plan_text = r#"
=== Plan
Root[result]
  Project[$0, $1, $2]
    Read[data => list_field:list<i64>, map_field:map<string, i64>, struct_field:struct<i64, string?>]
"#;

let plan = Parser::parse(plan_text).unwrap();
assert_eq!(plan.relations.len(), 1);
```

### User-Defined Types

User-defined types extend the standard Substrait UDT syntax to support anchors and URI references.

#### Syntax

`"u!"? name anchor? uri_anchor? nullability? parameters?`

#### Key differences from standard Substrait

- The `u!` prefix is optional (can be omitted when anchors are present)
- Adds optional `anchor` and `uri_anchor` for extension references
- Maintains compatibility with standard Substrait UDT syntax

#### Examples

```rust
# use substrait_explain::parser::Parser;
#
# let plan_text = r#"
=== Extensions
URIs:
  @  1: https://example.com/types
  @  2: https://example.com/functions
Types:
  ##  8 @  1: point
  ##  9 @  1: custom_type
Functions:
  ## 10 @  2: add

=== Plan
Root[result]
  Project[$0, $1, $2]
    Read[data => point_field:point#8@1?<i8>, custom_field:custom_type#9, prefixed_field:u!custom_type]
# "#;
#
# let plan = Parser::parse(plan_text).unwrap();
# assert_eq!(plan.relations.len(), 1);
```

## Expressions

#### Syntax

`expression := function_call / reference / literal`

### Examples

```text
add($3, 10)            // Simple function call
add#10@2(#3, 10):int   // Function call with anchors and type
```

### Field References

Currently, only references to fields in the Relations' input are supported.

#### Syntax

`reference := "$" integer`

#### Examples

```rust
use substrait_explain::parser::Parser;

let plan_text = r#"
=== Plan
Root[result]
  Project[$0, $1, $42]
    Read[data => field0:i64, field1:string, field42:boolean]
"#;

let plan = Parser::parse(plan_text).unwrap();
assert_eq!(plan.relations.len(), 1);
```

### Function Calls

#### Syntax

`function_call := name anchor? uri_anchor? "(" (expression ("," expression)*)? ")" (":" type)?`

#### Components

- `name` - function name
- `anchor` - optional anchor (e.g., `#10`)
- `uri_anchor` - optional URI anchor (e.g., `@1`)
- `expression` - as above
- `type` - optional output type

### Aggregate Measures

Aggregate measures are used in the output of Aggregate relations. They can be either field references (to pass through existing fields) or aggregate function calls (to compute aggregates).

#### Syntax

- `aggregate_measure := name anchor? uri_anchor? "(" expression ")" (":" type)?` - aggregate function call with optional extension anchors and output type
- Field references: `$0`, `$1`, ...

#### Examples

- `sum($2)`
- `count($1)`
- `avg($3):fp64`
- `$0` (field reference to grouping field)

## Relations

Relations represent the operations in a query plan. Each relation is displayed on a single line with indentation showing the hierarchy.

### General Relation Grammar

All relations follow this general pattern:

#### Syntax

```text
relation := name "[" (arguments ("," named_arguments)? ("=>" columns)?)? "]"
columns := name ("," name)* / reference_list
```

Where:

- **`name`**: The type of operation (Read, Filter, Project, Root, etc.)
- **`arguments`**: Input expressions, field references, function calls, or other parameters (optional)
- **`named_arguments`**: Named arguments (optional)
- **`=>`**: Separator between arguments and output columns (optional, only present when both arguments and columns are specified)
- **`columns`**: Output column names and types, or field references for pass-through (all relations specify outputs, but format varies)

#### Example

```text
RelationName[arguments, named_arguments => columns]
```

#### Special cases

- **Root relation**: Only specifies output column names, no arguments or `=>` separator
- **Project relation**: Only specifies expressions, no `=>` separator or output columns
- Some relations may use '...' instead of column names when they pass through all fields

The exact structure varies by relation type, but all follow this basic pattern.

### Arguments

Arguments in relations can be literals, expressions, enums, or tuples thereof.

#### Syntax

```text
argument := literal / expression / enum / tuple
tuple := "(" argument ("," argument)* ")"
arguments := argument ("," argument)*
named_arguments := name "=" argument ("," name "=" argument)*
```

#### Examples

- Simple arguments: `$0`, `42`, `'hello'`, `&AscNullsFirst`
- Tuple arguments: `($0, &AscNullsFirst)`, `(limit=10, offset=5)`
- Named arguments: `limit=10`, `offset=5`

### Root Relation

#### Syntax

`"Root" "[" (name ("," name)*)? "]"`

#### Example

```rust
# use substrait_explain::parser::Parser;
#
# let plan_text = r#"
=== Plan
Root[c, d]           // root with output columns c and d
  Project[$0, $1]
    Read[data => a:i64, b:string]
# "#;
#
# let plan = Parser::parse(plan_text).unwrap();
# assert_eq!(plan.relations.len(), 1);
```

### Read Relation

#### Syntax

`"Read" "[" table_name "=>" (named_column ("," named_column)*)? "]"`

#### Components

- `table_name := name ("." name)*` - table name, optionally qualified with schema/database
- `named_column := name ":" type` - column name with type annotation

#### Example

```rust
# use substrait_explain::parser::Parser;
#
# let plan_text = r#"
=== Plan
Root[result]
  Project[$0, $1]
    Read[schema.table => a:i64, b:string?]
Root[result2]
  Project[$0, $1]
    Read[orders => quantity:i32?, price:i64]
# "#;
#
# let plan = Parser::parse(plan_text).unwrap();
# assert_eq!(plan.relations.len(), 2);
```

### Filter Relation

#### Syntax

`"Filter" "[" expression "=>" reference_list "]"`

#### Components

- `expression` - boolean expression for filtering
- `reference_list := reference ("," reference)*` - comma-separated list of field references to pass through

#### Example

```rust
# use substrait_explain::parser::Parser;
#
# let plan_text = r#"
=== Extensions
URIs:
  @  1: https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml
Functions:
  ## 10 @  1: gt

=== Plan
Root[result]
  Filter[gt($2, 100) => $0, $1, $2]
    Project[$0, $1, $2]
      Read[data => a:i64, b:string, c:i32]
# "#;
#
# let plan = Parser::parse(plan_text).unwrap();
# assert_eq!(plan.relations.len(), 1);
```

### Project Relation

#### Syntax

`"Project" "[" (expression ("," expression)*)? "]"`

#### Components

- `expression` - field reference, function call, or literal (see Expressions section)

#### Example

```rust
# use substrait_explain::parser::Parser;
#
# let plan_text = r#"
=== Plan
Root[result]
  Project[$1, 42]                    // project field 1 and literal 42
    Read[data => a:i64, b:string]
# "#;
#
# let plan = Parser::parse(plan_text).unwrap();
# assert_eq!(plan.relations.len(), 1);
```

### Aggregate Relation

#### Syntax

`"Aggregate" "[" group_by "=>" aggregate_output "]"`

#### Components

- `group_by := reference_list | "_"` - comma-separated list of field references for grouping, or `_` for global aggregation
- `aggregate_output := (reference | aggregate_measure) ("," (reference | aggregate_measure))*` - comma-separated list of output items
- `aggregate_measure` - field references or aggregate function calls. See [Aggregate Measures section](#aggregate-measures)

#### Example

```rust
# use substrait_explain::parser::Parser;
#
# let plan_text = r#"
=== Extensions
URIs:
  @  1: https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate.yaml
Functions:
  ## 10 @  1: sum
  ## 11 @  1: count

=== Plan
Root[result]
  Aggregate[$0 => $0, sum($1), count($2)]           // Group by field 0
    Read[orders => category:string, amount:i64]
# "#;
#
# let plan = Parser::parse(plan_text).unwrap();
# assert_eq!(plan.relations.len(), 1);
```

### Sort Relation

The Sort relation specifies sort fields and directions for ordering the input:

Sort[($0, &AscNullsFirst), ($1, &DescNullsLast) => $0, $1]

#### Syntax

```text
sort_relation := "Sort" "[" sort_fields "=>" reference_list "]"
sort_fields := sort_field ("," sort_field)*
sort_field := "(" reference "," sort_direction ")"
sort_direction := "&AscNullsFirst" / "&AscNullsLast" / "&DescNullsFirst" / "&DescNullsLast"
```

#### Components

- Each sort field is a tuple: `(reference, sort_direction)`
- Sort directions follow the general `enum` syntax and specify null handling
- The columns after `=>` specify the output field order (typically a reference list)

### Join Relation

**Syntax**: `"Join" "[" join_type "," expression "=>" reference_list "]"`

**Components**:

- `join_type` - Join type enum with `&` prefix (e.g., `&Inner`, `&Left`, `&Right`, `&Outer`)
- `expression` - Join condition (boolean expression relating left and right inputs)  
- `reference_list` - Comma-separated list of field references for output columns

**Field Reference Mapping**:

For joins, field references map to the combined schema of left and right inputs:
- `$0`, `$1`, ... refer to left input fields
- `$n`, `$n+1`, ... refer to right input fields (where n = number of left fields)

**Example**:

```rust
# use substrait_explain::parser::Parser;
#
# let plan_text = r#"
=== Extensions
URIs:
  @  1: https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml
Functions:
  ## 10 @  1: eq

=== Plan
Root[user_orders]
  Join[&Inner, eq($0, $2) => $0, $1, $3]
    Read[users => id:i64, name:string]        // Fields $0, $1
    Read[orders => user_id:i64, amount:i32]   // Fields $2, $3
# "#;
#
# let plan = Parser::parse(plan_text).unwrap();
# assert_eq!(plan.relations.len(), 1);
```

## Complete Example

A complete query that joins users and orders tables, calculates total order value, filters for high-value orders, and groups by user to show total revenue per customer:

```rust
# use substrait_explain::parser::Parser;
#
# let plan_text = r#"
=== Extensions
URIs:
  @  1: https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml
  @  2: https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml
  @  3: https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate.yaml
Functions:
  ## 10 @  1: eq
  ## 11 @  1: gt
  ## 12 @  2: multiply
  ## 13 @  3: sum

=== Plan
Root[customer_revenue]
  Aggregate[$0, $1 => $0, $1, sum($3)]
    Filter[gt($3, 100) => $0, $1, $2, $3]
      Project[$0, $1, $2, multiply($4, $5)]
        Join[&Inner, eq($0, $3) => $0, $1, $2, $3, $4, $5]
          Read[users => id:i64, name:string, region:string]
          Read[orders => user_id:i64, quantity:i32, price:i64]
# "#;
#
# let plan = Parser::parse(plan_text).unwrap();
# assert_eq!(plan.relations.len(), 1);
```