dbgen 0.4.1

Generate random test cases for databases
Documentation
Template reference
==================

File syntax
-----------

The template file should consist of one CREATE TABLE statement, with expressions telling how a value
should be generated inside `{{ … }}` blocks:

```sql
CREATE TABLE "database"."schema"."table" (
    column_1    COLUMN_TYPE_1,
        {{ value_1 }}
    column_2    COLUMN_TYPE_2,
        {{ value_2 }}
    -- ...
    column_n    COLUMN_TYPE_N,
        {{ value_n }}
    INDEX(some_index),
    INDEX(more_index)
) OPTION_1 = 1 /*, ... */;
```

Expression syntax
-----------------

`dbgen` supports an SQL-like expression syntax. These expressions will be re-evaluated for every new
row generated.

### Literals

`dbgen` supports numbers and string literals.

* **Integers**

    Decimal and hexadecimal numbers are supported. The value must be between 0 and
    2<sup>64</sup> − 1.

    Examples: `0`, `3`, `18446744073709551615`, `0X1234abcd`, `0xFFFFFFFFFFFFFFFF`

* **Floating point numbers**

    Numbers will be stored in IEEE-754 double-precision format.

    Examples: `0.0`, `1.5`, `.5`, `2.`, `1e100`, `1.38e-23`, `6.02e+23`

* **Strings**

    Strings must be encoded as UTF-8, and written between single quotes (double-quoted strings are
    *not* supported). To represent a single quote in the string, use `''`.

    Examples: `'Hello'`, `'10 o''clock'`

### Operators

From highest to lowest precedence:

1. unary `-`, unary `+`, function call
2. `*`, `/`
3. `+`, `-`, `||`
4. `=`, `<>`, `<`, `>`, `<=`, `>=`, `IS`, `IS NOT`
5. unary `NOT`
6. `AND`
7. `OR`
8. `:=`

* **Concatenation `||`**

    The `||` will concatenate two strings together. If either side is not a string, they will first
    be converted into a string.

* **Comparison `=`, `<>`, `<`, `>`, `<=`, `>=`**

    These operators will return TRUE, FALSE or NULL. When comparing two values, `dbgen` follows
    these rules:

    - Comparing with NULL always return NULL.
    - Numbers are ordered by values.
    - Strings are ordered lexicographically in the UTF-8 binary collation.
    - Comparing two values with different types (e.g. `'4' < 5`) will abort the program.

* **Identity `IS`, `IS NOT`**

    These operators will return TRUE or FALSE. `dbgen` follows these rules:

    - `NULL IS NULL` is TRUE.
    - Values having different types are not identical (`'4' IS 5` is FALSE).
    - Values having the same types compare like the `=` and `<>` operators.

    These operators are a generalization of standard SQL's `IS [NOT] {TRUE|FALSE|NULL}` operators.

* **Logical operators `NOT`, `AND`, `OR`**

    These operators will first convert the input into a nullable boolean value
    (TRUE, FALSE or NULL):

    - NULL remains NULL.
    - Nonzero numbers become TRUE, `0` and `0.0` becomes FALSE, and `NaN` becomes NULL.
    - All other types cannot be converted to a boolean and will abort the program.

    The trinary logic operates like this:

    |       AND |  TRUE |  NULL | FALSE |
    |----------:|:-----:|:-----:|:-----:|
    |  **TRUE** |  TRUE |  NULL | FALSE |
    |  **NULL** |  NULL |  NULL | FALSE |
    | **FALSE** | FALSE | FALSE | FALSE |

    |    OR     |  TRUE |  NULL | FALSE |
    |----------:|:-----:|:-----:|:-----:|
    |  **TRUE** |  TRUE |  TRUE |  TRUE |
    |  **NULL** |  TRUE |  NULL |  NULL |
    | **FALSE** |  TRUE |  NULL | FALSE |

    |   NOT     | value |
    |----------:|:-----:|
    |  **TRUE** | FALSE |
    |  **NULL** |  NULL |
    | **FALSE** |  TRUE |

* **Assignment `:=`**

    The assignment expression `@ident := f()` would evaluate the RHS `f()` and save into the local
    variable `@local`. The same value can later be extracted using `@local`. This can be used to
    generate correlated columns, for instance:

    ```sql
    CREATE TABLE _ (
        "first"  BOOLEAN NOT NULL {{ rand.bool(0.5) }},
        "second" BOOLEAN NOT NULL {{ @a := rand.bool(0.5) }},
        "third"  BOOLEAN NOT NULL {{ @a }}
    );
    ```

    The first and second columns are entirely independent, but the second and third column will
    always have the same value.

### Symbols

* **rownum**: The current row number. The first row has value 1.
* **NULL**: The null value.
* **TRUE**: Equals to 1.
* **FALSE**: Equals to 0.

### Random functions

* **rand.regex('[0-9a-z]+', 'i', 100)**

    Generates a random string satisfying the regular expression. The second and third parameters are
    optional. If provided, they specify respectively the regex flags, and maximum repeat count for
    the unbounded repetition operators (`+`, `*` and `{n,}`).

    The input string should satisfy the syntax of the Rust regex package. The flags is a string
    composed of these letters:

    * `x` (ignore whitespace)
    * `i` (case insensitive)
    * `s` (dot matches new-line)
    * `u` (enable Unicode mode)
    * `a` (disable Unicode mode)
    * `o` (recognize octal escapes)

    The flags `m` (multi-line) and `U` (ungreedy) does not affect string generation and are ignored.

* **rand.range(7, 19)**

    Generates a random integer uniformly distributed in the half-open interval 7 ≤ *x* < 19.
    The length of the range must be less than 2<sup>64</sup>.

* **rand.range_inclusive(8, 35)**

    Generates a random integer uniformly distributed in the closed interval 8 ≤ *x* ≤ 35.
    The length of the range must be less than 2<sup>64</sup>.

* **rand.uniform(2.4, 7.5)**

    Generates a random floating point number uniformly distributed in the half-open interval
    2.4 ≤ *x* < 7.5.

* **rand.uniform_inclusive(1.6, 8.4)**

    Generates a random floating point number uniformly distributed in the closed interval
    1.6 ≤ *x* ≤ 8.4.

* **rand.bool(0.3)**

    Generates a random boolean (0 or 1) with probability 0.3 of getting "1". Also known as the
    Bernoulli distribution.

* **rand.zipf(26, 0.8)**

    Generates a random integer in the closed interval 1 ≤ *x* ≤ 26 using [Zipfian distribution]
    with an exponent of 0.8.

    With Zipfian distribution, the smallest values will appear more often.

    [Zipfian distribution]: https://en.wikipedia.org/wiki/Zipf's_law

* **rand.log_normal(2.0, 3.0)**

    Generates a random positive number using the [log-normal distribution]
    (log *N*(*µ*, *σ*<sup>2</sup>)) with *μ* = 2.0 and *σ* = 3.0.

    The median of this distribution is exp(*µ*).

    [log-normal distribution]: https://en.wikipedia.org/wiki/Log-normal_distribution

* **rand.finite_f32()**, **rand.finite_f64()**

    Generates a random finite IEEE-754 binary32 or binary64 floating-point number.

    The numbers are uniform in its *bit-pattern* across the entire supported range
    (±3.4 × 10<sup>38</sup> for `f32`, ±1.8 × 10<sup>308</sup> for `f64`)

### Date and Time

* **TIMESTAMP '2016-01-02 15:04:05.999'**

    Converts an ISO-8601-formatted string into a timestamp, using the time zone specified by the
    `--time-zone` flag. The timestamp is internally stored as UTC.

    If a time zone observes DST, there will be some time values which are impossible or ambiguous.
    Both of these cases will cause an "invalid timestamp" error.

* **TIMESTAMP WITH TIME ZONE '2016-01-02 15:04:05.999 Asia/Hong_Kong'**

    Converts an ISO-8601-formatted string into a timestamp, using the time zone specified inside
    the string. The timestamp is internally stored as UTC.

    Only names in the `tz` database are recognized. The time zone will **not** be printed together
    with the timestamp.

* **INTERVAL 30 MINUTE**

    Creates a time interval. The inner expression should evaluate a number (can be negative). Valid
    units are:

    - MICROSECOND
    - MILLISECOND
    - SECOND
    - MINUTE
    - HOUR
    - DAY
    - WEEK

    Intervals can be added to or subtracted from timestamps, and can therefore be used to generate
    a random timestamp.

* **rand.u31_timestamp()**

    Generates a random timestamp distributed uniformly between 1970-01-01 00:00:01 and
    2038-01-19 03:14:07 (UTC). There are exactly 2<sup>31</sup>−1 seconds between these two time.

### Miscellaneous

* **CASE *value* WHEN *p1* THEN *r1* WHEN *p2* THEN *r2* ELSE *ro* END**

    Equivalent to the SQL `CASE … WHEN` expression.

    If *value* equals to *p1* (i.e. `(value = p1) IS TRUE`), then the expression's value is *r1*,
    etc. If the *value* does not equal to any of the listed pattern, the value *ro* will be
    returned. If the ELSE branch is missing, returns NULL.

* **@local**

    Gets the previous assigned local variable. If the variable was undefined, this will return NULL.

* **greatest(*x*, *y*, *z*)**

    Returns the largest of all given values. NULL values are ignored.

* **least(*x*, *y*, *z*)**

    Returns the smallest of all given values. NULL values are ignored.