Feature: Convert
Convert between Parquet, Avro, ORC, CSV, JSON, YAML, and XLSX file formats.
Scenario: Parquet to Avro
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table.avro`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table.avro"
And the file "$TEMPDIR/table.avro" should exist
Scenario: Avro to Parquet
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.parquet`
Then the command should succeed
And the output should contain "Converted fixtures/userdata5.avro to $TEMPDIR/userdata5.parquet"
And the file "$TEMPDIR/userdata5.parquet" should exist
Scenario: Avro to ORC
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.orc --select id,first_name --limit 10`
Then the command should succeed
And the output should contain "Converted fixtures/userdata5.avro to $TEMPDIR/userdata5.orc"
And the file "$TEMPDIR/userdata5.orc" should exist
Scenario: ORC to Parquet
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.orc --select id,first_name --limit 10`
Then the command should succeed
When I run `datu convert $TEMPDIR/userdata5.orc $TEMPDIR/userdata5.parquet`
Then the command should succeed
And the output should contain "Converted $TEMPDIR/userdata5.orc to $TEMPDIR/userdata5.parquet"
And the file "$TEMPDIR/userdata5.parquet" should exist
Scenario: Parquet to ORC
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/table.orc --select id,first_name --limit 10`
Then the command should succeed
And the output should contain "Converted fixtures/userdata5.avro to $TEMPDIR/table.orc"
And the file "$TEMPDIR/table.orc" should exist
Scenario: Parquet to CSV
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table.csv`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table.csv"
And the file "$TEMPDIR/table.csv" should exist
And the first line of that file should contain "one,two"
And that file should have 4 lines
Scenario: CSV to Parquet
Given a file "fixtures/table.csv"
When I run `datu convert fixtures/table.csv $TEMPDIR/table_from_csv.parquet`
Then the command should succeed
And the output should contain "Converted fixtures/table.csv to $TEMPDIR/table_from_csv.parquet"
And the file "$TEMPDIR/table_from_csv.parquet" should exist
Scenario: CSV to JSON
Given a file "fixtures/table.csv"
When I run `datu convert fixtures/table.csv $TEMPDIR/table_from_csv.json`
Then the command should succeed
And the output should contain "Converted fixtures/table.csv to $TEMPDIR/table_from_csv.json"
And the file "$TEMPDIR/table_from_csv.json" should exist
And the file "$TEMPDIR/table_from_csv.json" should be valid JSON
And the file "$TEMPDIR/table_from_csv.json" should contain "one"
And the file "$TEMPDIR/table_from_csv.json" should contain "two"
Scenario: CSV to Parquet with --input-headers=false
Given a file "fixtures/no_header.csv"
When I run `datu convert fixtures/no_header.csv $TEMPDIR/no_header.parquet --input-headers=false`
Then the command should succeed
And the output should contain "Converted fixtures/no_header.csv to $TEMPDIR/no_header.parquet"
And the file "$TEMPDIR/no_header.parquet" should exist
And that file should be a valid Parquet file
And that file should have 3 records
Scenario: Avro to CSV
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.csv`
Then the command should succeed
And the output should contain "Converted fixtures/userdata5.avro to $TEMPDIR/userdata5.csv"
And the file "$TEMPDIR/userdata5.csv" should exist
And the first line of that file should contain "id,first_name"
And that file should have 1001 lines
Scenario: ORC to CSV
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.orc --select id,first_name --limit 10`
Then the command should succeed
When I run `datu convert $TEMPDIR/userdata5.orc $TEMPDIR/userdata5.csv`
Then the command should succeed
And the output should contain "Converted $TEMPDIR/userdata5.orc to $TEMPDIR/userdata5.csv"
And the file "$TEMPDIR/userdata5.csv" should exist
And the first line of that file should contain "id,first_name"
Scenario: Parquet to CSV with --select
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table_select.csv --select two,four`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table_select.csv"
And the file "$TEMPDIR/table_select.csv" should exist
And the first line of that file should contain "two,four"
And that file should have 4 lines
Scenario: Avro to CSV with --select
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5_select.csv --select id,first_name,email`
Then the command should succeed
And the output should contain "Converted fixtures/userdata5.avro to $TEMPDIR/userdata5_select.csv"
And the file "$TEMPDIR/userdata5_select.csv" should exist
And the first line of that file should contain "id,first_name,email"
And that file should have 1001 lines
Scenario: Parquet to Avro with --limit
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table_limit.avro --limit 2`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table_limit.avro"
And the file "$TEMPDIR/table_limit.avro" should exist
Scenario: Avro to CSV with --limit and --select
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5_limit_select.csv --limit 3 --select id,email`
Then the command should succeed
And the output should contain "Converted fixtures/userdata5.avro to $TEMPDIR/userdata5_limit_select.csv"
And the file "$TEMPDIR/userdata5_limit_select.csv" should exist
And the first line of that file should contain "id,email"
And that file should have 4 lines
Scenario: Parquet to JSON
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table.json`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table.json"
And the file "$TEMPDIR/table.json" should exist
And the file "$TEMPDIR/table.json" should be valid JSON
And the file "$TEMPDIR/table.json" should contain:
```
{"one":-1.0,"two":"foo","three":true,"four":"2022-12-23T00:00:00Z","five":"2022-12-23T11:43:49","__index_level_0__":"a"}
{"two":"bar","three":false,"four":"2021-12-23T00:00:00Z","five":"2021-12-23T12:44:50","__index_level_0__":"b"}
{"one":2.5,"two":"baz","four":"2020-12-23T00:00:00Z","five":"2020-12-23T13:45:51","__index_level_0__":"c"}
```
Scenario: Parquet to JSON (with `--json-pretty`)
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table_sparse.json --json-pretty`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table_sparse.json"
And the file "$TEMPDIR/table_sparse.json" should exist
And the file "$TEMPDIR/table_sparse.json" should be valid JSON
And the file "$TEMPDIR/table_sparse.json" should contain:
```
[
{
"__index_level_0__": "a",
"five": "2022-12-23T11:43:49",
"four": "2022-12-23T00:00:00Z",
"one": -1.0,
"three": true,
"two": "foo"
},
{
"__index_level_0__": "b",
"five": "2021-12-23T12:44:50",
"four": "2021-12-23T00:00:00Z",
"three": false,
"two": "bar"
},
{
"__index_level_0__": "c",
"five": "2020-12-23T13:45:51",
"four": "2020-12-23T00:00:00Z",
"one": 2.5,
"two": "baz"
}
]
```
Scenario: Avro to JSON
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.json --json-pretty`
Then the command should succeed
And the output should contain "Converted fixtures/userdata5.avro to $TEMPDIR/userdata5.json"
And the file "$TEMPDIR/userdata5.json" should exist
And the file "$TEMPDIR/userdata5.json" should be valid JSON
Scenario: Parquet to YAML (default sparse)
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table.yaml`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table.yaml"
And the file "$TEMPDIR/table.yaml" should exist
And the file "$TEMPDIR/table.yaml" should be valid YAML
And the file "$TEMPDIR/table.yaml" should contain:
```
- one: -1
two: foo
three: true
four: "2022-12-23T00:00:00Z"
five: "2022-12-23T11:43:49"
__index_level_0__: a
- two: bar
three: false
four: "2021-12-23T00:00:00Z"
five: "2021-12-23T12:44:50"
__index_level_0__: b
- one: 2.5
two: baz
four: "2020-12-23T00:00:00Z"
five: "2020-12-23T13:45:51"
__index_level_0__: c
```
Scenario: Parquet to JSON with sparse=false
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table_no_sparse.json --sparse=false`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table_no_sparse.json"
And the file "$TEMPDIR/table_no_sparse.json" should exist
And that file should be valid JSON
And that file should contain "one"
And that file should contain "null"
Scenario: Parquet to YAML with sparse=false
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table_no_sparse.yaml --sparse=false`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table_no_sparse.yaml"
And the file "$TEMPDIR/table_no_sparse.yaml" should exist
And that file should be valid YAML
And that file should contain "one:"
Scenario: Avro to YAML
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.yaml`
Then the command should succeed
And the output should contain "Converted fixtures/userdata5.avro to $TEMPDIR/userdata5.yaml"
And the file "$TEMPDIR/userdata5.yaml" should exist
And that file should be valid YAML
And that file should contain "id:"
And that file should contain "first_name:"
Scenario: ORC to JSON
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.orc --select id,first_name --limit 10`
Then the command should succeed
When I run `datu convert $TEMPDIR/userdata5.orc $TEMPDIR/userdata5.json --json-pretty`
Then the command should succeed
And the output should contain "Converted $TEMPDIR/userdata5.orc to $TEMPDIR/userdata5.json"
And the file "$TEMPDIR/userdata5.json" should exist
And the file "$TEMPDIR/userdata5.json" should be valid JSON
Scenario: ORC to YAML
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.orc --select id,first_name --limit 10`
Then the command should succeed
When I run `datu convert $TEMPDIR/userdata5.orc $TEMPDIR/userdata5.yaml`
Then the command should succeed
And the output should contain "Converted $TEMPDIR/userdata5.orc to $TEMPDIR/userdata5.yaml"
And the file "$TEMPDIR/userdata5.yaml" should exist
And that file should be valid YAML
And that file should contain "id:"
And that file should contain "first_name:"
Scenario: Parquet to YAML with --select
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table_select.yaml --select two,four`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table_select.yaml"
And the file "$TEMPDIR/table_select.yaml" should exist
And that file should be valid YAML
And that file should contain "two:"
And that file should contain "four:"
Scenario: Parquet to YAML with --limit
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table_limit.yaml --limit 2`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table_limit.yaml"
And the file "$TEMPDIR/table_limit.yaml" should exist
And that file should be valid YAML
Scenario: Avro to YAML with .yml extension
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.yml`
Then the command should succeed
And the output should contain "Converted fixtures/userdata5.avro to $TEMPDIR/userdata5.yml"
And the file "$TEMPDIR/userdata5.yml" should exist
And that file should be valid YAML
And that file should contain "email:"
Scenario: Parquet to XLSX
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table.xlsx`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table.xlsx"
And the file "$TEMPDIR/table.xlsx" should exist
Scenario: Avro to XLSX
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.xlsx`
Then the command should succeed
And the output should contain "Converted fixtures/userdata5.avro to $TEMPDIR/userdata5.xlsx"
And the file "$TEMPDIR/userdata5.xlsx" should exist
Scenario: ORC to CSV with --select
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.orc --select id,first_name,email --limit 10`
Then the command should succeed
When I run `datu convert $TEMPDIR/userdata5.orc $TEMPDIR/userdata5_select.csv --select id,first_name,email`
Then the command should succeed
And the output should contain "Converted $TEMPDIR/userdata5.orc to $TEMPDIR/userdata5_select.csv"
And the file "$TEMPDIR/userdata5_select.csv" should exist
And the first line of that file should contain "id,first_name,email"
Scenario: ORC to XLSX
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.orc --select id,first_name --limit 10`
Then the command should succeed
When I run `datu convert $TEMPDIR/userdata5.orc $TEMPDIR/userdata5.xlsx`
Then the command should succeed
And the output should contain "Converted $TEMPDIR/userdata5.orc to $TEMPDIR/userdata5.xlsx"
And the file "$TEMPDIR/userdata5.xlsx" should exist
Scenario: ORC to Parquet with --limit
Given a file "fixtures/userdata5.avro"
When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.orc --select id,first_name --limit 10`
Then the command should succeed
When I run `datu convert $TEMPDIR/userdata5.orc $TEMPDIR/userdata5_limit.parquet --limit 5`
Then the command should succeed
And the output should contain "Converted $TEMPDIR/userdata5.orc to $TEMPDIR/userdata5_limit.parquet"
And the file "$TEMPDIR/userdata5_limit.parquet" should exist
Scenario: Parquet to XLSX with --select
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table_select.xlsx --select two,four`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table_select.xlsx"
And the file "$TEMPDIR/table_select.xlsx" should exist
Scenario: Parquet to XLSX with --limit
Given a file "fixtures/table.parquet"
When I run `datu convert fixtures/table.parquet $TEMPDIR/table_limit.xlsx --limit 2`
Then the command should succeed
And the output should contain "Converted fixtures/table.parquet to $TEMPDIR/table_limit.xlsx"
And the file "$TEMPDIR/table_limit.xlsx" should exist