# datu - a data file utility

*Datu* (Filipino) - a traditional chief or local leader
datu is intended to be a lightweight, fast, and versatile CLI tool for reading, querying, and converting data in various file formats, such as Parquet, Avro, ORC, CSV, and XLSX.

It is used non-interactively: you invoke a subcommand with arguments from the command line, or from scripts in automated pipelines.

Internally, it uses a pipeline architecture that aids extensibility and testing, and that allows even large datasets to be processed in parallel when the input and output formats support it.
## Installation
Prerequisites: Rust ~> 1.95 (or recent stable)
To install from source:

```shell
# from the repository root
cargo install --path .
```
## How it Works Internally
Internally, datu constructs a pipeline based on the command and arguments.
For example, the following invocation

```shell
datu convert data.parquet data.csv --select id,name
```

constructs a pipeline that reads the input, selects only the specified columns, and writes the output.
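As a rough sketch of this architecture (the stage names and the `Vec<String>` row representation are made up for this example and are not datu's actual internals), each pipeline stage transforms a batch of rows and hands the result to the next stage:

```rust
// Illustrative stage-based pipeline; names and types are hypothetical,
// not datu's real internal API.

type Row = Vec<String>;

trait Stage {
    fn process(&self, rows: Vec<Row>) -> Vec<Row>;
}

/// Keeps only the columns at the given indices (like `--select`).
struct Select {
    indices: Vec<usize>,
}

impl Stage for Select {
    fn process(&self, rows: Vec<Row>) -> Vec<Row> {
        rows.into_iter()
            .map(|row| self.indices.iter().map(|&i| row[i].clone()).collect())
            .collect()
    }
}

/// Stops after `n` rows (like `--limit`).
struct Limit {
    n: usize,
}

impl Stage for Limit {
    fn process(&self, rows: Vec<Row>) -> Vec<Row> {
        rows.into_iter().take(self.n).collect()
    }
}

/// Runs each stage in order over the rows read from the input.
fn run(stages: &[Box<dyn Stage>], mut rows: Vec<Row>) -> Vec<Row> {
    for stage in stages {
        rows = stage.process(rows);
    }
    rows
}

fn main() {
    let rows: Vec<Row> = vec![
        vec!["1".into(), "ana".into(), "a@example.com".into()],
        vec!["2".into(), "ben".into(), "b@example.com".into()],
        vec!["3".into(), "cai".into(), "c@example.com".into()],
    ];
    // select columns 0 and 1, then keep the first 2 rows
    let stages: Vec<Box<dyn Stage>> = vec![
        Box::new(Select { indices: vec![0, 1] }),
        Box::new(Limit { n: 2 }),
    ];
    let out = run(&stages, rows);
    assert_eq!(out.len(), 2);
    assert_eq!(out[0], vec!["1".to_string(), "ana".to_string()]);
}
```

A design like this keeps each stage independently testable, and stages that only see row batches can process batches in parallel.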
## Supported Formats
| Format | Read | Write | Display |
|---|---|---|---|
| Parquet (.parquet, .parq) | ✓ | ✓ | — |
| Avro (.avro) | ✓ | ✓ | — |
| ORC (.orc) | ✓ | ✓ | — |
| CSV (.csv) | — | ✓ | ✓ |
| JSON (.json) | — | ✓ | ✓ |
| JSON (pretty) | — | — | ✓ |
| XLSX (.xlsx) | — | ✓ | — |
| YAML | — | — | ✓ |
- Read — Input file formats for `convert`, `schema`, `head`, and `tail`.
- Write — Output file formats for `convert`.
- Display — Output format when printing to stdout (`schema`, `head`, `tail` via `--output`: `csv`, `json`, `json-pretty`, `yaml`).
## Examples
### schema
Display the schema of a Parquet, Avro, or ORC file (column names, types, and nullability). Useful for inspecting file structure without reading data.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), ORC (.orc).
Usage:

```shell
datu schema [OPTIONS] <FILE>
```
Options:
| Option | Description |
|---|---|
| `--output <FORMAT>` | Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case-insensitive. Default: `csv`. |
Output formats:

- `csv` (default): one line per column, e.g. `name: String (UTF8), nullable`.
- `json`: JSON array of objects with `name`, `data_type`, `nullable`, and optionally `converted_type` (Parquet).
- `json-pretty`: same as `json`, but pretty-printed for readability.
- `yaml`: YAML list of mappings with the same fields.
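As an illustration, the `json-pretty` output for a two-column file might look like the following (column names and type strings here are made up for the example):

```json
[
  { "name": "id", "data_type": "Int64", "nullable": false },
  { "name": "email", "data_type": "String (UTF8)", "nullable": true }
]
```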
Examples:
```shell
# Default CSV-style output
datu schema data.parquet

# JSON output
datu schema data.parquet --output json

# JSON pretty-printed
datu schema data.parquet --output json-pretty

# YAML output (e.g. for config or tooling)
datu schema data.parquet --output yaml
```
### convert
Convert data between supported formats. Input and output formats are inferred from file extensions.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), ORC (.orc).
Supported output formats: CSV (.csv), JSON (.json), Parquet (.parquet, .parq), Avro (.avro), ORC (.orc), XLSX (.xlsx).
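Because formats are inferred from file extensions, the dispatch boils down to a small case-insensitive lookup. A hypothetical sketch (not datu's actual code):

```rust
// Hypothetical extension-to-format inference; illustrative only.

/// Maps a path's extension to a format name, case-insensitively.
/// Returns None for unsupported extensions.
fn infer_format(path: &str) -> Option<&'static str> {
    let ext = path.rsplit('.').next()?.to_ascii_lowercase();
    match ext.as_str() {
        "parquet" | "parq" => Some("parquet"), // two extensions, one format
        "avro" => Some("avro"),
        "orc" => Some("orc"),
        "csv" => Some("csv"),
        "json" => Some("json"),
        "xlsx" => Some("xlsx"),
        _ => None,
    }
}

fn main() {
    assert_eq!(infer_format("data.PARQ"), Some("parquet"));
    assert_eq!(infer_format("report.xlsx"), Some("xlsx"));
    assert_eq!(infer_format("notes.txt"), None);
}
```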
Usage:

```shell
datu convert [OPTIONS] <INPUT> <OUTPUT>
```
Options:
| Option | Description |
|---|---|
| `--select <COLUMNS>...` | Columns to include. If not specified, all columns are written. Column names can be given as multiple arguments or as comma-separated values (e.g. `--select id,name,email` or `--select id --select name --select email`). |
| `--limit <N>` | Maximum number of records to read from the input. |
| `--sparse` | For JSON/YAML: omit keys with null/missing values. Default: true. Use `--no-sparse` to include default values (e.g. empty string). |
| `--json-pretty` | When converting to JSON, format output with indentation and newlines. Ignored for other output formats. |
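The `--sparse` behavior can be pictured with a small sketch (illustrative only, not datu's serializer): under sparse output a missing value drops the key entirely, while `--no-sparse` emits a default value instead.

```rust
// Illustrative sparse vs. non-sparse JSON serialization of one record;
// not datu's actual implementation.

fn to_json(pairs: &[(&str, Option<&str>)], sparse: bool) -> String {
    let fields: Vec<String> = pairs
        .iter()
        .filter_map(|(key, value)| match (value, sparse) {
            // sparse: omit keys whose value is missing
            (None, true) => None,
            // non-sparse: emit a default value (empty string)
            (None, false) => Some(format!("\"{}\":\"\"", key)),
            (Some(v), _) => Some(format!("\"{}\":\"{}\"", key, v)),
        })
        .collect();
    format!("{{{}}}", fields.join(","))
}

fn main() {
    let record = [("id", Some("1")), ("email", None)];
    assert_eq!(to_json(&record, true), r#"{"id":"1"}"#);
    assert_eq!(to_json(&record, false), r#"{"id":"1","email":""}"#);
}
```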
Examples:
```shell
# Parquet to CSV (all columns)
datu convert data.parquet data.csv

# Parquet to Avro (first 1000 rows)
datu convert data.parquet data.avro --limit 1000

# Avro to CSV, only specific columns
datu convert data.avro data.csv --select id,name,email

# Parquet to Parquet with column subset
datu convert data.parquet subset.parquet --select id,name

# Parquet, Avro, or ORC to Excel (.xlsx)
datu convert data.parquet report.xlsx

# Parquet or Avro to ORC
datu convert data.avro data.orc

# Parquet or Avro to JSON
datu convert data.parquet data.json
```
### head
Print the first N rows of a Parquet, Avro, or ORC file to stdout (default CSV; use `--output` for other formats).
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), ORC (.orc).
Usage:

```shell
datu head [OPTIONS] <FILE>
```
Options:
| Option | Description |
|---|---|
| `-n, --number <N>` | Number of rows to print. Default: 10. |
| `--output <FORMAT>` | Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case-insensitive. Default: `csv`. |
| `--select <COLUMNS>...` | Columns to include. If not specified, all columns are printed. Same format as `convert --select`. |
Examples:
```shell
# First 10 rows (default)
datu head data.parquet

# First 100 rows
datu head data.parquet -n 100

# First 20 rows, specific columns
datu head data.parquet -n 20 --select id,name
```
### tail
Print the last N rows of a Parquet, Avro, or ORC file to stdout (default CSV; use `--output` for other formats).
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), ORC (.orc).
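On row-grouped formats such as Parquet and ORC, a `tail` can skip ahead using row-group metadata rather than scanning the whole file. A sketch of the arithmetic (illustrative; not necessarily how datu implements it):

```rust
// Given the row counts of each row group, find where the last `n` rows
// begin, so earlier row groups can be skipped entirely. Illustrative only.

/// Returns (index of first row group to read, rows to skip inside it).
fn tail_start(group_sizes: &[usize], n: usize) -> (usize, usize) {
    let mut remaining = n;
    // walk the row groups from the end until `n` rows are accounted for
    for i in (0..group_sizes.len()).rev() {
        if group_sizes[i] >= remaining {
            return (i, group_sizes[i] - remaining);
        }
        remaining -= group_sizes[i];
    }
    (0, 0) // n >= total rows: read everything
}

fn main() {
    // five row groups of 100 rows; the last 150 rows start in group 3,
    // after skipping its first 50 rows
    assert_eq!(tail_start(&[100, 100, 100, 100, 100], 150), (3, 50));
    // asking for more rows than exist reads the whole file
    assert_eq!(tail_start(&[10, 10], 50), (0, 0));
}
```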
Usage:

```shell
datu tail [OPTIONS] <FILE>
```
Options:
| Option | Description |
|---|---|
| `-n, --number <N>` | Number of rows to print. Default: 10. |
| `--output <FORMAT>` | Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case-insensitive. Default: `csv`. |
| `--select <COLUMNS>...` | Columns to include. If not specified, all columns are printed. Same format as `convert --select`. |
Examples:
```shell
# Last 10 rows (default)
datu tail data.parquet

# Last 50 rows
datu tail data.parquet -n 50

# Last 20 rows, specific columns
datu tail data.parquet -n 20 --select id,name

# Redirect tail output to a file
datu tail data.parquet -n 100 > last-rows.csv
```
## Version
Print the installed datu version:

```shell
datu --version
```