# datu - a data file utility
datu is intended to be a lightweight, fast, and versatile CLI tool for reading, querying, and converting data in a variety of file formats, such as Parquet, XLSX, CSV, and even f3.

It is used non-interactively: you invoke a subcommand with arguments on the command line, either directly or from scripts in automated pipelines.

Internally, it uses a pipeline architecture that aids extensibility and testing, and that allows even large datasets to be processed in parallel when the input and output formats support it.
## How it Works Internally
Internally, datu constructs a pipeline based on the subcommand and its arguments. For example, a `convert` invocation with `--select` constructs a pipeline that reads the input, selects only the specified columns, and writes the output.
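The idea can be sketched as a chain of lazy stages, each consuming and producing rows. This is a conceptual illustration only, not datu's actual code; the stage names and row representation are hypothetical:

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]
Stage = Callable[[Iterable[Row]], Iterable[Row]]

def select(columns: List[str]) -> Stage:
    # Keep only the requested columns in each row.
    def stage(rows: Iterable[Row]) -> Iterable[Row]:
        for row in rows:
            yield {c: row[c] for c in columns}
    return stage

def limit(n: int) -> Stage:
    # Stop after n rows, so downstream stages never see more.
    def stage(rows: Iterable[Row]) -> Iterable[Row]:
        for i, row in enumerate(rows):
            if i >= n:
                break
            yield row
    return stage

def run_pipeline(source: Iterable[Row], stages: List[Stage]) -> List[Row]:
    # Thread the row stream through each stage in order.
    rows = source
    for s in stages:
        rows = s(rows)
    return list(rows)

rows = [{"id": 1, "name": "a", "email": "x"},
        {"id": 2, "name": "b", "email": "y"}]
out = run_pipeline(rows, [select(["id", "name"]), limit(1)])
print(out)  # [{'id': 1, 'name': 'a'}]
```

Because each stage is a generator, rows stream through one at a time rather than being materialized in full, which is what makes a pipeline like this suitable for large datasets.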
## Examples
### schema
Display the schema of a Parquet or Avro file (column names, types, and nullability). Useful for inspecting file structure without reading data.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro).
Usage:
Options:
| Option | Description |
|---|---|
| `--output <FORMAT>` | Output format: `csv`, `json`, or `yaml`. Case insensitive. Default: `csv`. |
Output formats:

- `csv` (default): one line per column, e.g. `name: String (UTF8), nullable`.
- `json`: pretty-printed JSON array of objects with `name`, `data_type`, `nullable`, and optionally `converted_type` (Parquet).
- `yaml`: YAML list of mappings with the same fields.
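For example, the `json` output for a file with a single non-nullable `id` column might look like the following (illustrative values only; the exact layout may differ):

```json
[
  {
    "name": "id",
    "data_type": "Int64",
    "nullable": false
  }
]
```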
Examples:

```shell
# Default CSV-style output
datu schema data.parquet

# JSON output
datu schema data.parquet --output json

# YAML output (e.g. for config or tooling)
datu schema data.avro --output yaml
```
### convert
Convert data between supported formats. Input and output formats are inferred from file extensions.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro).
Supported output formats: CSV (.csv), JSON (.json), Parquet (.parquet, .parq), Avro (.avro), XLSX (.xlsx).
Usage:
Options:
| Option | Description |
|---|---|
| `--select <COLUMNS>...` | Columns to include. If not specified, all columns are written. Column names can be given as multiple arguments or as comma-separated values (e.g. `--select id,name,email` or `--select id --select name --select email`). |
| `--limit <N>` | Maximum number of records to read from the input. |
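The documented `--select` behavior (a repeatable flag that also accepts comma-separated values) amounts to flattening all occurrences into one ordered column list. A minimal sketch of that parsing, hypothetical and not datu's actual code:

```python
from typing import List

def parse_select(values: List[str]) -> List[str]:
    # Flatten repeated --select occurrences and comma-separated
    # values into a single ordered list of column names.
    cols: List[str] = []
    for v in values:
        cols.extend(c.strip() for c in v.split(",") if c.strip())
    return cols

# Both documented spellings yield the same column list:
print(parse_select(["id,name,email"]))          # ['id', 'name', 'email']
print(parse_select(["id", "name", "email"]))    # ['id', 'name', 'email']
```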
Examples:

```shell
# Parquet to CSV (all columns)
datu convert data.parquet data.csv

# Parquet to Avro (first 1000 rows)
datu convert data.parquet data.avro --limit 1000

# Avro to CSV, only specific columns
datu convert data.avro data.csv --select id,name,email

# Parquet to Parquet with column subset
datu convert data.parquet subset.parquet --select id,name

# Parquet or Avro to Excel (.xlsx)
datu convert data.parquet report.xlsx

# Parquet or Avro to JSON
datu convert data.avro data.json
```
### head
Print the first N rows of a Parquet or Avro file as CSV to stdout.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro).
Usage:
Options:
| Option | Description |
|---|---|
| `-n, --number <N>` | Number of rows to print. Default: 10. |
| `--select <COLUMNS>...` | Columns to include. If not specified, all columns are printed. Same format as `convert --select`. |
Examples:

```shell
# First 10 rows (default)
datu head data.parquet

# First 100 rows
datu head data.parquet -n 100

# First 20 rows, specific columns
datu head data.parquet -n 20 --select id,name
```
### tail
Print the last N rows of a Parquet or Avro file as CSV to stdout.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro).
Usage:
Options:
| Option | Description |
|---|---|
| `-n, --number <N>` | Number of rows to print. Default: 10. |
| `--select <COLUMNS>...` | Columns to include. If not specified, all columns are printed. Same format as `convert --select`. |
Examples:

```shell
# Last 10 rows (default)
datu tail data.parquet

# Last 50 rows
datu tail data.parquet -n 50

# Last 20 rows, specific columns
datu tail data.avro -n 20 --select id,name

# Redirect tail output to a file
datu tail data.parquet -n 100 > last100.csv
```
## Version
Print the installed datu version: