xan 0.56.0 - Docs.rs

# Expression language cheatsheet

Some `xan` commands such as `xan map`, `xan filter`, `xan groupby` etc. allow
users to evaluate custom expressions to filter/transform/create values from
the processed CSV data.

The expression language (nicknamed `moonblade` here and there) should feel very
familiar to users of high-level dynamically typed languages like Python or
JavaScript. It is, however, limited, memory-efficient, tailored for CSV data and
therefore faster than the aforementioned languages.

It only supports evaluating simple expressions, but comes loaded with many
helpful functions that you can review using [`xan help functions`](./functions.md)
and that should be enough for most typical tasks. To read about aggregation
capabilities, use [`xan help aggs`](./aggs.md) instead.

Note finally that it was not designed to be sandboxed nor to be particularly safe,
so precautions regarding evaluating untrusted expressions provided by users should
apply here too.

## Summary

- [Basic examples](#basic-examples)
- [Referencing columns](#referencing-columns)
- [Literal values & data types](#literal-values--data-types)
- [Operators & calling functions](#operators--calling-functions)
- [Indexing & slicing](#indexing--slicing)
- [Higher-order functions](#higher-order-functions)
- [Constant evaluation](#constant-evaluation)
- [Named expressions](#named-expressions)
- [Multiple lines & comments](#multiple-lines--comments)
- [Implementation details & design choices](#implementation-details--design-choices)

## Basic examples

```python
# Checking that the value of the "count" column is over 10
count > 10

# Lowercasing a "text" column
lower(text)

# Checking that a lowercased name has some desired value (mind the "eq" operator):
lower(name) eq "john"

# Checking that a "name" column is one of the provided values:
name in ["john", "lucy", "mary"]

# Formatting a full name from a "first_name" and "last_name" column:
fmt("{} {}", first_name, last_name)

# Getting the first part of a mime type:
split(mimetype, "/")[0]
```

A common pitfall is forgetting that string operators differ from numerical
ones, as in Perl. String equality is `eq`, not `==`, and string
concatenation is `++`, not `+`. Read the design choices section if you
want to understand why.

## Referencing columns

Columns can be referenced directly by name if the name only contains alphanumeric
characters or underscores and does not start with a digit:

```python
# Computing the ratio between "tweet_count" and "retweet_count":
tweet_count / retweet_count
```

If a column name contains forbidden characters, or if you need to access
columns with duplicate names, use the `col` function:

```python
# Column name with spaces:
col("Name of Movie")
# Second column named "text":
col("text", 1)
```

It is also possible to access columns by their zero-based index (negative indices
are also accepted):

```python
# Third column:
col(2)
# Last column:
col(-1)
```

If an identifier or a `col` call refers to a nonexistent column in the target
CSV file, `xan` will usually raise an error before even attempting to evaluate
the given expression. This can be a problem when you want to process
many different files with slightly different column names. For such cases, you
can use "unsure" identifiers, suffixed with `?`, or the `col?` function like so:

```python
# Will return the "text" column or the "content" one if not found
text? || content?
# Same, but for columns with complicated names
col?("Subscription Date")
```

Note that the language does not have user-defined variables and identifiers always
refer to some column or some program-defined value like in `xan tokenize --flatmap`
or `xan dedup --choose`.

That said, the pipeline operator (`|`) exposes the result of its left-hand side
expression to the right-hand side through the special `_` identifier. This can
sometimes be used to avoid repetition:

```python
split(date, '/') | fmt('{}-{}-{}', _[2], _[0], _[1])
```
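For readers more at home in Python, the pipeline above does roughly what the following sketch does. It is only an analogy meant to show how `_` stands for the left-hand result: the `reformat_date` helper and the sample date are hypothetical, and `str.format` stands in for xan's `fmt`.

```python
def reformat_date(date: str) -> str:
    # Left-hand side of the pipeline: split(date, '/')
    parts = date.split("/")
    # Right-hand side: `_` is the value of the left-hand side, i.e. `parts`
    return "{}-{}-{}".format(parts[2], parts[0], parts[1])


print(reformat_date("12/25/2023"))  # 2023-12-25
```

Without the pipeline, the `split(date, '/')` call would have to be written three times in the expression.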

## Literal values & data types

```python
# Integers
1
# Integers can contain underscores for readability
10_000

# Floats
0.5

# Booleans
true
false

# Null value
null

# Strings (single quotes, double quotes, backticks)
"hello"
'hello'
`hello`

# Typical escaping
"Hello\nThis is world!"
# Supported: \n, \r, \t, \\, \", \', \0, \x67 and \u{1F60A}

# Binary strings (single quotes, double quotes, backticks)
b"hello"
b'hello'
b`hello`

# Regexes
/john/

# Case-insensitive regexes
/john/i

# Lists
[1, 2, 3]
["one", "two"]

# Maps
{"one": 1, "two": 2}
{one: 1, two: 2}
{leaf: "hello", nested: [1, 2, 3]}
```

## Operators & calling functions

Operators:

```python
# Unary operators:
-count
!has_description

# Binary operators:
count1 + count2
count1 < count2

# Nested expressions:
(count1 > 1) || count2
```

Functions:

```python
# Simple call
trim(name)

# Nested call
trim(concat(name, " ", surname))

# Using the operator "." is the same as calling a function with left operand
# as first argument
name.trim()
# is equivalent to:
trim(name)

"data".pathjoin(filename)
# is equivalent to:
pathjoin("data", filename)

# Some functions accept named arguments:
read(path, encoding="utf8")
```

For a full list of available operators and functions, check out [`xan help functions`](./functions.md).

## Indexing & slicing

Indexing and slicing work a lot like in Python and JavaScript:

```python
# Zero-based indexing:
list[1]

# Negative indexing:
list[-2]

# Slicing:
list[1:4]
list[:4]
list[1:]

# Negative slicing:
list[1:-3]
list[:-2]
list[-4:]

# Key-based indexing:
map["name"]
# Same as:
map.name
```
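As a point of comparison, here is what the equivalent operations return in Python itself. The `items` and `record` values are made up for illustration, and this only shows Python's behavior: edge cases such as out-of-range indices may be handled differently by `xan`.

```python
items = ["a", "b", "c", "d", "e", "f"]
record = {"name": "john"}

# Zero-based and negative indexing
print(items[1])    # b
print(items[-2])   # e

# Slicing (the end bound is exclusive)
print(items[1:4])  # ['b', 'c', 'd']
print(items[:4])   # ['a', 'b', 'c', 'd']
print(items[1:])   # ['b', 'c', 'd', 'e', 'f']

# Negative slicing
print(items[1:-3])  # ['b', 'c']
print(items[:-2])   # ['a', 'b', 'c', 'd']
print(items[-4:])   # ['c', 'd', 'e', 'f']

# Key-based indexing (Python dicts have no `record.name` shorthand)
print(record["name"])  # john
```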

## Higher-order functions

Higher-order functions, such as `map` or `filter`, also exist in the language
and can be given anonymous functions like so:

```javascript
map(numbers, x => x + 2)
filter(users, name => "john" in name)
```
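For comparison, a rough Python rendering of the same two calls could look like the sketch below. The sample `numbers` and `users` values are hypothetical, and the sketch assumes that `in` applied to two strings is a substring test, as it is in Python.

```python
numbers = [1, 2, 3]
users = ["john smith", "lucy", "johnny"]

# map(numbers, x => x + 2)
shifted = list(map(lambda x: x + 2, numbers))

# filter(users, name => "john" in name)
johns = list(filter(lambda name: "john" in name, users))

print(shifted)  # [3, 4, 5]
print(johns)    # ['john smith', 'johnny']
```

Note the argument order: in the expressions above the collection comes first and the anonymous function second, whereas Python's built-ins take the function first.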

## Constant evaluation

Note that the language always performs some level of static analysis of the
given expression to assess which parts actually need to run for each of the
processed CSV rows.

This means that constant parts of the expressions will be evaluated only once
when parsed, then folded into a new, simpler expression.

This can be very useful when, for instance, reading some JSON file to perform
one lookup per row like so:

```python
# Here, "config.json" will only be read once when parsing the expression,
# not once per processed CSV row, which is fortunate.
read_json("config.json").name
```

To debug and/or experiment with the expression static analysis, check out the
`xan eval --explain` command.
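To make the idea of constant folding more concrete, here is a minimal Python sketch over a toy expression tree. It is not how `xan` is implemented: the `Const`, `Column` and `Add` node types and the `fold` and `evaluate` helpers are all hypothetical and only illustrate the general technique.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Column, Add]


def fold(expr: Expr) -> Expr:
    """Replace sub-expressions that depend on no column by their value."""
    if isinstance(expr, Add):
        left, right = fold(expr.left), fold(expr.right)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value + right.value)
        return Add(left, right)
    return expr


def evaluate(expr: Expr, row: dict) -> float:
    """Evaluate an expression against one CSV row (cells are strings)."""
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Column):
        return float(row[expr.name])
    return evaluate(expr.left, row) + evaluate(expr.right, row)


# count + (1 + 2) is folded once, before any row is processed
folded = fold(Add(Column("count"), Add(Const(1), Const(2))))
rows = [{"count": "10"}, {"count": "20"}]
print([evaluate(folded, row) for row in rows])  # [13.0, 23.0]
```

In this toy version, the `1 + 2` addition runs a single time in `fold`, and each row only pays for the remaining `count + 3`.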

## Named expressions

Some commands, typically `xan map`, `xan agg`, `xan groupby` and `xan select -e`,
let users provide a series of named expressions, separated by commas, rather
than a single expression.

Here is how they work:

```python
# Anonymous expressions (names will be created from the stringified expressions)
sum(retweets), retweets / replies

# Named expressions
sum(retweets) as total_retweets, retweets / replies as ratio

# Names with special characters
sum(retweets) as "Total Retweets"
```

In some commands, namely `xan select -e` and `xan map`, named expressions can
even return multiple values at once, in order to avoid repeating
computations (notice the parentheses around the names):

```python
# Splitting full names
full_name.split(" ") as (first_name, last_name)

# Extracting data from a JSON cell
data.parse_json() | [_.name, _.meta[2].age] as (name, age)
```

Note that the expression must return a list containing exactly as many items
as there are names, or an error will be raised.
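The length requirement can be pictured with the following Python sketch. The `assign` helper and the sample names are hypothetical; the sketch only mirrors the rule stated above and says nothing about the exact error `xan` reports.

```python
def assign(names, values):
    # The number of returned values must match the number of names exactly
    if len(values) != len(names):
        raise ValueError(
            f"expected {len(names)} values, got {len(values)}"
        )
    return dict(zip(names, values))


columns = assign(("first_name", "last_name"), "John Smith".split(" "))
print(columns)  # {'first_name': 'John', 'last_name': 'Smith'}
```

A full name such as `"John Ronald Smith"` splits into three parts and would therefore be rejected when only two names are given.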

## Multiple lines & comments

Expressions can be written on multiple lines freely:

```python
sum(
  retweets +
  replies
)
```

Comments can be added starting with `#`:

```python
# Summing
sum(
  retweets + # we add retweets,
  replies # and replies
)
```

## Implementation details & design choices

The `moonblade` expression language uses a PEG-like parser and a tree-walking
interpreter over a single `enum` of dynamic data types. It does not rely on
garbage collection to operate.

Since CSV data contains only strings and is not typed whatsoever, some design choices
were made to make sure expressions would stick to this reality, all while
remaining comfortable to write:

- To avoid requiring users to explicitly cast their values to some numerical
representation, for instance, we use two sets of operators for string-like types
and for other types (`eq` vs. `==` for instance). This is reminiscent of Perl.

- Most functions semantically encode what type they will operate on. Consider the
difference between the aggregation functions `min` and `lex_first`. The former
considers numbers, the latter strings.
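The following Python sketch illustrates why the distinction matters when every cell is a string. It is only an analogy: the sample cells are made up, and the comments map each line to the operator or function it is meant to echo, without claiming to reproduce the exact casting rules used by `xan`.

```python
# Two CSV cells holding the same number written differently
a, b = "10", "10.0"

numeric_equal = float(a) == float(b)  # in the spirit of `==`
string_equal = a == b                 # in the spirit of `eq`

print(numeric_equal)  # True
print(string_equal)   # False

cells = ["9", "10", "100"]

numeric_min = min(cells, key=float)  # in the spirit of `min`
lexicographic_first = min(cells)     # in the spirit of `lex_first`

print(numeric_min)          # 9
print(lexicographic_first)  # 10
```

With a single set of operators, the language would have to guess which of the two readings the user meant, or force an explicit cast on every comparison.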