# datu - a data file utility

*Datu* (Filipino) - a traditional chief or local leader

datu is intended to be a lightweight, fast, and versatile CLI tool for reading, querying, and converting data in various file formats, such as Parquet, Avro, ORC, CSV, JSON, YAML, and XLSX.
## Installation
Prerequisites: Rust ~> 1.95 (or recent stable)
To install from source:
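A typical from-source install, assuming a standard Cargo project checkout (the exact repository layout is an assumption):

```shell
# Build and install the binary from a local clone of the repository
cargo install --path .
```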
## Supported Formats
| Format | Read | Write | Display |
|---|---|---|---|
| Parquet (.parquet, .parq) | ✓ | ✓ | — |
| Avro (.avro) | ✓ | ✓ | — |
| ORC (.orc) | ✓ | ✓ | — |
| CSV (.csv) | ✓ | ✓ | ✓ |
| JSON (.json) | ✓ | ✓ | ✓ |
| XLSX (.xlsx) | — | ✓ | — |
| JSON (pretty) | — | — | ✓ |
| YAML | — | — | ✓ |
- Read — Input file formats for `convert`, `count`, `schema`, `head`, and `tail`.
- Write — Output file formats for `convert`.
- Display — Output format when printing to stdout (`schema`, `head`, `tail` via `--output`: `csv`, `json`, `json-pretty`, `yaml`).
File type detection: By default, file types are inferred from extensions. Use `--input <TYPE>` (`-I`) to override input format detection, and `--output <TYPE>` (`-O`, `convert` only) to override output format detection. Valid types: `avro`, `csv`, `json`, `orc`, `parquet`, `xlsx`, `yaml`.
CSV options: When reading CSV files, the `--input-headers` option controls whether the first row is treated as column names. When the option is omitted, or given as a bare `--input-headers`, it defaults to true (header present); use `--input-headers=false` for headerless CSV. Applies to `convert`, `count`, `schema`, `head`, and `tail`.
## Usage
datu can be used non-interactively as a typical command-line utility, or it can be run without a command to enter interactive mode, which provides a REPL-like interface.
For example, the command

```shell
datu convert input.parquet output.csv --select id,email
```

and, interactively, the REPL pipeline

```
> read("input.parquet") |> select(:id, :email) |> write("output.csv")
```

perform the same conversion and column filtering.
## Commands

### convert
Convert data between supported formats. Input and output formats are inferred from file extensions, or can be specified explicitly with --input and --output.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), JSON (.json), ORC (.orc).
Supported output formats: CSV (.csv), JSON (.json), Parquet (.parquet, .parq), Avro (.avro), ORC (.orc), XLSX (.xlsx).
Usage (CLI):
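A plausible invocation shape, assuming input and output paths are positional arguments:

```shell
datu convert [OPTIONS] <INPUT> <OUTPUT>
```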
Usage (REPL):
```
read("table.parquet") |> select(:id, :email) |> write("table.csv")
```
Options:
| Option | Description |
|---|---|
| `-I, --input <TYPE>` | Input file type (`avro`, `csv`, `json`, `orc`, `parquet`, `xlsx`, `yaml`). Overrides extension-based detection. |
| `-O, --output <TYPE>` | Output file type (`avro`, `csv`, `json`, `orc`, `parquet`, `xlsx`, `yaml`). Overrides extension-based detection. |
| `--select <COLUMNS>...` | Columns to include. If not specified, all columns are written. Column names can be given as multiple arguments or as comma-separated values (e.g. `--select id,name,email` or `--select id --select name --select email`). |
| `--limit <N>` | Maximum number of records to read from the input. |
| `--sparse` | For JSON/YAML: omit keys with null/missing values. Default: true. Use `--sparse=false` to include default values (e.g. empty string). |
| `--json-pretty` | When converting to JSON, format output with indentation and newlines. Ignored for other output formats. |
| `--input-headers [BOOL]` | For CSV input: whether the first row is a header. Default: true when omitted. Use `--input-headers=false` for headerless CSV. |
Examples:
- Parquet to CSV (all columns)
- CSV to Parquet (with automatic type inference)
- Parquet to Avro (first 1000 rows)
- Avro to CSV, only specific columns
- CSV to JSON with headerless input
- Parquet to Parquet with column subset
- JSON to CSV or Parquet
- Parquet, Avro, CSV, JSON, or ORC to Excel (.xlsx)
- Parquet, Avro, or JSON to ORC
- Parquet, Avro, or JSON to JSON
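Hedged command sketches for the scenarios above, in order; file names and the positional `INPUT OUTPUT` form are assumptions, and only documented flags are used:

```shell
datu convert data.parquet data.csv                        # Parquet → CSV, all columns
datu convert data.csv data.parquet                        # CSV → Parquet, inferred types
datu convert data.parquet data.avro --limit 1000          # first 1000 rows
datu convert data.avro data.csv --select id,name,email    # specific columns
datu convert data.csv data.json --input-headers=false     # headerless CSV input
datu convert data.parquet subset.parquet --select id,email
datu convert data.json data.csv                           # likewise data.parquet
datu convert data.parquet report.xlsx                     # to Excel
datu convert data.avro data.orc                           # to ORC
datu convert data.parquet data.json --json-pretty         # to pretty-printed JSON
```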
### schema
Display the schema of a Parquet, Avro, CSV, or ORC file (column names, types, and nullability). Useful for inspecting file structure without reading data. CSV schema uses type inference from the data.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).
Usage (CLI):
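Likely of the form below, assuming the file path is the sole positional argument:

```shell
datu schema [OPTIONS] <FILE>
```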
Usage (REPL):
```
read("file") |> schema()
```
Use `schema()` after a read to print the schema of the data in the pipeline; for a single file, `read("data.parquet") |> schema()` is equivalent to the CLI command.
Options:
| Option | Description |
|---|---|
| `-I, --input <TYPE>` | Input file type (`avro`, `csv`, `json`, `orc`, `parquet`, `xlsx`, `yaml`). Overrides extension-based detection. |
| `--output <FORMAT>` | Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case insensitive. Default: `csv`. |
| `--input-headers [BOOL]` | For CSV input: whether the first row is a header. Default: true when omitted. Use `--input-headers=false` for headerless CSV. |
Output formats:
- `csv` (default): One line per column, e.g. `name: String (UTF8), nullable`.
- `json`: JSON array of objects with `name`, `data_type`, `nullable`, and optionally `converted_type` (Parquet).
- `json-pretty`: Same as `json` but pretty-printed for readability.
- `yaml`: YAML list of mappings with the same fields.
Examples:
- Default CSV-style output
- JSON output
- JSON pretty-printed
- YAML output (e.g. for config or tooling)
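Plausible commands for the cases above (file names illustrative, flags as documented):

```shell
datu schema data.parquet                     # default CSV-style output
datu schema data.avro --output json          # JSON output
datu schema data.csv --output json-pretty    # pretty-printed JSON
datu schema data.orc --output yaml           # YAML output
```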
### count
Return the number of rows in a Parquet, Avro, CSV, or ORC file.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).
Usage (CLI):
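Presumably of the form (path positional, an assumption):

```shell
datu count [OPTIONS] <FILE>
```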
Usage (REPL):
```
count("file")
```

Counts rows in a file directly; `read("file") |> count()` is equivalent.
Options:
| Option | Description |
|---|---|
| `-I, --input <TYPE>` | Input file type (`avro`, `csv`, `json`, `orc`, `parquet`, `xlsx`, `yaml`). Overrides extension-based detection. |
| `--input-headers [BOOL]` | For CSV input: whether the first row is a header. Default: true when omitted. Use `--input-headers=false` for headerless CSV. |
Examples:
- Count rows in a Parquet file
- Count rows in an Avro, CSV, or ORC file
- Count rows in a headerless CSV file
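Sketches of the above (file names illustrative):

```shell
datu count data.parquet                      # Parquet
datu count data.avro                         # likewise data.csv, data.orc
datu count data.csv --input-headers=false    # headerless CSV
```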
### sample
Print N randomly sampled rows from a Parquet, Avro, CSV, or ORC file to stdout (default CSV; use --output for other formats). For Parquet and ORC, sampling uses file metadata to determine total row count and selects random indices; for Avro and CSV, reservoir sampling is used.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).
Usage (CLI):
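A plausible shape, assuming the file path is the sole positional argument:

```shell
datu sample [OPTIONS] <FILE>
```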
Usage (REPL):
```
read("file") |> sample(n)
```

Prints `n` random rows to stdout. Chain `|> write("out.csv")` to write to a file, e.g. `read("data.parquet") |> sample(5) |> write("sample.csv")`.
Options:
| Option | Description |
|---|---|
| `-I, --input <TYPE>` | Input file type (`avro`, `csv`, `json`, `orc`, `parquet`, `xlsx`, `yaml`). Overrides extension-based detection. |
| `-n, --number <N>` | Number of rows to sample. Default: 10. |
| `--output <FORMAT>` | Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case insensitive. Default: `csv`. |
| `--sparse` | For JSON/YAML: omit keys with null/missing values. Default: true. Use `--sparse=false` to include default values. |
| `--select <COLUMNS>...` | Columns to include. If not specified, all columns are printed. Same format as `convert --select`. |
| `--input-headers [BOOL]` | For CSV input: whether the first row is a header. Default: true when omitted. Use `--input-headers=false` for headerless CSV. |
Examples:
- 10 random rows (default)
- 5 random rows
- 20 random rows, specific columns
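Sketches of the above (file names illustrative):

```shell
datu sample data.parquet                             # 10 random rows (default)
datu sample data.parquet -n 5                        # 5 random rows
datu sample data.parquet -n 20 --select id,email     # specific columns
```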
### head
Print the first N rows of a Parquet, Avro, CSV, or ORC file to stdout (default CSV; use --output for other formats).
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).
Usage (CLI):
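Presumably of the form (path positional, an assumption):

```shell
datu head [OPTIONS] <FILE>
```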
Usage (REPL):
```
read("file") |> head(n)
```

Prints the first `n` rows to stdout. Chain `|> write("out.csv")` to write to a file, e.g. `read("data.parquet") |> head(10) |> write("first10.csv")`.
Options:
| Option | Description |
|---|---|
| `-I, --input <TYPE>` | Input file type (`avro`, `csv`, `json`, `orc`, `parquet`, `xlsx`, `yaml`). Overrides extension-based detection. |
| `-n, --number <N>` | Number of rows to print. Default: 10. |
| `--output <FORMAT>` | Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case insensitive. Default: `csv`. |
| `--sparse` | For JSON/YAML: omit keys with null/missing values. Default: true. Use `--sparse=false` to include default values. |
| `--select <COLUMNS>...` | Columns to include. If not specified, all columns are printed. Same format as `convert --select`. |
| `--input-headers [BOOL]` | For CSV input: whether the first row is a header. Default: true when omitted. Use `--input-headers=false` for headerless CSV. |
Examples:
- First 10 rows (default)
- First 100 rows
- First 20 rows, specific columns
- Head from a headerless CSV file
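Sketches of the above (file names illustrative):

```shell
datu head data.parquet                               # first 10 rows (default)
datu head data.parquet -n 100                        # first 100 rows
datu head data.parquet -n 20 --select id,email       # specific columns
datu head data.csv --input-headers=false             # headerless CSV
```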
### tail
Print the last N rows of a Parquet, Avro, CSV, or ORC file to stdout (default CSV; use --output for other formats).
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).
Note: For Avro and CSV files, `tail` requires a full file scan, since these formats do not support random access to the end of the file.
Usage (CLI):
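Presumably of the form (path positional, an assumption):

```shell
datu tail [OPTIONS] <FILE>
```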
Usage (REPL):
```
read("file") |> tail(n)
```

Prints the last `n` rows to stdout. Chain `|> write("out.csv")` to write to a file, e.g. `read("data.parquet") |> tail(10) |> write("last10.csv")`.
Options:
| Option | Description |
|---|---|
| `-I, --input <TYPE>` | Input file type (`avro`, `csv`, `json`, `orc`, `parquet`, `xlsx`, `yaml`). Overrides extension-based detection. |
| `-n, --number <N>` | Number of rows to print. Default: 10. |
| `--output <FORMAT>` | Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case insensitive. Default: `csv`. |
| `--sparse` | For JSON/YAML: omit keys with null/missing values. Default: true. Use `--sparse=false` to include default values. |
| `--select <COLUMNS>...` | Columns to include. If not specified, all columns are printed. Same format as `convert --select`. |
| `--input-headers [BOOL]` | For CSV input: whether the first row is a header. Default: true when omitted. Use `--input-headers=false` for headerless CSV. |
Examples:
- Last 10 rows (default)
- Last 50 rows
- Last 20 rows, specific columns
- Redirect tail output to a file
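Sketches of the above (file names illustrative):

```shell
datu tail data.parquet                               # last 10 rows (default)
datu tail data.parquet -n 50                         # last 50 rows
datu tail data.parquet -n 20 --select id,email       # specific columns
datu tail data.parquet > last10.csv                  # redirect output to a file
```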
### version
Print the installed datu version.
Usage (CLI):
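The command is simply:

```shell
datu version
```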
Usage (REPL):
No equivalent; run `datu version` from the shell.
## Interactive Mode (REPL)

Running `datu` without any command starts an interactive REPL (Read-Eval-Print Loop):

```
>
```
In the REPL, you compose data pipelines using the `|>` (pipe) operator to chain functions together. The general pattern is:

```
read("input") |> ... |> write("output")
```
### Functions

#### read(path)

Read a data file. Supported formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), JSON (.json), ORC (.orc). CSV files are assumed to have a header row by default.

```
> read("data.parquet") |> write("data.csv")
> read("data.csv") |> write("data.parquet")
```
#### write(path)

Write data to a file. The output format is inferred from the file extension. Supported formats: CSV (.csv), JSON (.json), YAML (.yaml), Parquet (.parquet, .parq), Avro (.avro), ORC (.orc), XLSX (.xlsx).

```
> read("data.parquet") |> write("output.json")
```
#### select(columns...)

Select and reorder columns. Columns can be specified using symbol syntax (`:name`) or string syntax (`"name"`).

```
> read("data.parquet") |> select(:id, :email) |> write("subset.csv")
> read("data.parquet") |> select("id", "email") |> write("subset.csv")
```
Columns appear in the output in the order they are listed, so `select` can also be used to reorder columns:

```
> read("data.parquet") |> select(:email, :id) |> write("reordered.csv")
```
#### head(n)

Take the first `n` rows.

```
> read("data.parquet") |> head(10) |> write("first10.csv")
```
#### sample(n)

Take `n` random rows from the data. Default: 10.

```
> read("data.parquet") |> sample(5) |> write("sampled.csv")
```
#### tail(n)

Take the last `n` rows.

```
> read("data.parquet") |> tail(10) |> write("last10.csv")
```
### Composing Pipelines

Functions can be chained in any order to build more complex pipelines:

```
> read("users.avro") |> select(:id, :first_name, :email) |> head(5) |> write("top5.json")
> read("data.parquet") |> select(:two, :one) |> tail(1) |> write("last_row.csv")
> read("data.parquet") |> select(:id, :email) |> sample(5) |> write("sample.csv")
```
## How it Works Internally
Internally, datu constructs a pipeline based on the command and arguments. When possible, it uses the DataFusion DataFrame API for efficiency and performance. However, ORC, XLSX, YAML, and pretty-printed JSON aren't natively supported by DataFusion, so datu uses internal adapters for those formats.
For example, a `convert` invocation that reads `input.parquet`, selects the `id`, `name`, and `email` columns, and writes YAML constructs a pipeline composed of:
- a (DataFrame) Parquet reader step that reads the `input.parquet` file and filters for only the `id`, `name`, and `email` columns, which chains to
- a YAML writer step that writes those columns to `output.yaml`
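Such an invocation might look like the following (the positional `INPUT OUTPUT` form is an assumption):

```shell
datu convert input.parquet output.yaml --select id,name,email
```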