datu 0.2.2

datu - a data file utility

Datu (Filipino) - a traditional chief or local leader

datu is intended to be a lightweight, fast, and versatile CLI tool for reading, querying, and converting data in various file formats, such as Parquet, XLSX, CSV, and even f3.

It is used non-interactively: you invoke a subcommand with arguments on the CLI or from scripts for automated pipelines.

Internally, it uses a pipeline architecture that aids extensibility and testing. The same architecture allows parallel processing, even of large datasets, when the input and output formats support it.

Installation

Prerequisites: Rust ~> 1.95 (or recent stable)

cargo install datu

To install from source:

cargo install --git https://github.com/aisrael/datu

How it Works Internally

Internally, datu constructs a pipeline based on the command and arguments.

For example, the following invocation

datu convert input.parquet output.csv --select id,name,email

constructs a pipeline that reads the input, selects only the specified columns, and writes the output.
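
The read, select, and write stages described above can be pictured with a minimal sketch. This is not datu's actual code: the `Row` type and the `read_rows`, `select_columns`, `write_csv`, and `run_pipeline` functions are hypothetical names chosen for illustration, and the rows come from memory rather than from a Parquet file.

```rust
// Illustrative only: a three-stage pipeline (read -> select -> write).
// All names here are hypothetical and are not taken from datu's source.

/// A row is an ordered list of (column name, value) pairs.
type Row = Vec<(String, String)>;

/// Stage 1: a stand-in reader that yields rows from memory.
fn read_rows() -> impl Iterator<Item = Row> {
    let raw = vec![
        vec![("id", "1"), ("name", "Ana"), ("email", "ana@example.com"), ("age", "31")],
        vec![("id", "2"), ("name", "Ben"), ("email", "ben@example.com"), ("age", "45")],
    ];
    raw.into_iter().map(|row| {
        row.into_iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect::<Row>()
    })
}

/// Stage 2: keep only the requested columns, in the requested order.
fn select_columns<'a>(
    rows: impl Iterator<Item = Row> + 'a,
    columns: &'a [&'a str],
) -> impl Iterator<Item = Row> + 'a {
    rows.map(move |row| {
        columns
            .iter()
            .filter_map(|wanted| {
                row.iter()
                    .find(|(name, _)| name.as_str() == *wanted)
                    .cloned()
            })
            .collect::<Row>()
    })
}

/// Stage 3: render rows as CSV text (no quoting or escaping, for brevity).
fn write_csv(rows: impl Iterator<Item = Row>, columns: &[&str]) -> String {
    let mut out = columns.join(",");
    out.push('\n');
    for row in rows {
        let values: Vec<&str> = row.iter().map(|(_, v)| v.as_str()).collect();
        out.push_str(&values.join(","));
        out.push('\n');
    }
    out
}

/// Wire the stages together, mirroring `--select id,name,email`.
fn run_pipeline(columns: &[&str]) -> String {
    write_csv(select_columns(read_rows(), columns), columns)
}
```

Calling `run_pipeline(&["id", "name", "email"])` returns the CSV text with the `age` column dropped. Because each stage is an iterator adapter, rows flow through one at a time instead of being collected between stages.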

Supported Formats

Format                      Read  Write  Display
Parquet (.parquet, .parq)   yes   yes    no
Avro (.avro)                yes   yes    no
ORC (.orc)                  yes   yes    no
CSV (.csv)                  no    yes    yes
JSON (.json)                no    yes    yes
JSON (pretty)               no    yes    yes
XLSX (.xlsx)                no    yes    no
YAML                        no    no     yes
  • Read — Input file formats for convert, schema, head, and tail.
  • Write — Output file formats for convert.
  • Display — Output format when printing to stdout (schema, head, tail via --output: csv, json, json-pretty, yaml).

Examples

schema

Display the schema of a Parquet, Avro, or ORC file (column names, types, and nullability). Useful for inspecting file structure without reading data.

Supported input formats: Parquet (.parquet, .parq), Avro (.avro), ORC (.orc).

Usage:

datu schema <FILE> [OPTIONS]

Options:

Option             Description
--output <FORMAT>  Output format: csv, json, json-pretty, or yaml. Case insensitive. Default: csv.

Output formats:

  • csv (default): One line per column, e.g. name: String (UTF8), nullable.
  • json: JSON array of objects with name, data_type, nullable, and optionally converted_type (Parquet).
  • json-pretty: Same as json but pretty-printed for readability.
  • yaml: YAML list of mappings with the same fields.
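
The case-insensitive `--output` value and its csv default could be resolved along these lines. This is a sketch under assumptions: `DisplayFormat` and `parse_display_format` are hypothetical names, and the error message is invented for illustration, not copied from datu.

```rust
/// The four display formats listed above. Names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DisplayFormat {
    Csv,
    Json,
    JsonPretty,
    Yaml,
}

/// Resolve an optional `--output` value.
/// `None` models the flag being absent, which falls back to csv.
fn parse_display_format(value: Option<&str>) -> Result<DisplayFormat, String> {
    match value {
        None => Ok(DisplayFormat::Csv),
        // Lower-case first so that "YAML", "Yaml", and "yaml" are equivalent.
        Some(raw) => match raw.to_ascii_lowercase().as_str() {
            "csv" => Ok(DisplayFormat::Csv),
            "json" => Ok(DisplayFormat::Json),
            "json-pretty" => Ok(DisplayFormat::JsonPretty),
            "yaml" => Ok(DisplayFormat::Yaml),
            // Illustrative error text only.
            other => Err(format!("unknown output format: {other}")),
        },
    }
}
```

With this sketch, `parse_display_format(Some("YAML"))` and `parse_display_format(Some("yaml"))` both resolve to the YAML variant, and an omitted flag resolves to CSV.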

Examples:

# Default CSV-style output
datu schema data.parquet

# JSON output
datu schema data.parquet --output json

# JSON pretty-printed
datu schema data.parquet --output json-pretty

# YAML output (e.g. for config or tooling)
datu schema events.avro --output yaml

# Short -o flag; format names are case insensitive
datu schema events.avro -o YAML

convert

Convert data between supported formats. Input and output formats are inferred from file extensions.

Supported input formats: Parquet (.parquet, .parq), Avro (.avro), ORC (.orc).

Supported output formats: CSV (.csv), JSON (.json), Parquet (.parquet, .parq), Avro (.avro), ORC (.orc), XLSX (.xlsx).
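
Inferring formats from file extensions, using the lists above, might look like the following sketch. The `FileFormat` enum and the `format_from_path`, `is_readable`, and `plan_convert` functions are hypothetical names, and matching extensions exactly in lower case is an assumption; how datu treats upper-case extensions is not stated here.

```rust
use std::path::Path;

/// File formats named in the lists above. Names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FileFormat {
    Parquet,
    Avro,
    Orc,
    Csv,
    Json,
    Xlsx,
}

/// Map a file extension to a format.
/// Assumption: matching is exact and lower-case only.
fn format_from_path(path: &str) -> Option<FileFormat> {
    let ext = Path::new(path).extension()?.to_str()?;
    match ext {
        "parquet" | "parq" => Some(FileFormat::Parquet),
        "avro" => Some(FileFormat::Avro),
        "orc" => Some(FileFormat::Orc),
        "csv" => Some(FileFormat::Csv),
        "json" => Some(FileFormat::Json),
        "xlsx" => Some(FileFormat::Xlsx),
        _ => None,
    }
}

/// Only Parquet, Avro, and ORC are listed as input formats.
fn is_readable(format: FileFormat) -> bool {
    matches!(
        format,
        FileFormat::Parquet | FileFormat::Avro | FileFormat::Orc
    )
}

/// Resolve the (input, output) formats for a conversion, or say why not.
fn plan_convert(input: &str, output: &str) -> Result<(FileFormat, FileFormat), String> {
    let from = format_from_path(input)
        .filter(|f| is_readable(*f))
        .ok_or_else(|| format!("unsupported input format: {input}"))?;
    let to = format_from_path(output)
        .ok_or_else(|| format!("unsupported output format: {output}"))?;
    Ok((from, to))
}
```

In this sketch, `plan_convert("data.parquet", "report.xlsx")` succeeds, while `plan_convert("data.csv", "data.parquet")` fails because CSV is not listed as an input format.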

Usage:

datu convert <INPUT> <OUTPUT> [OPTIONS]

Options:

Option                 Description
--select <COLUMNS>...  Columns to include. If not specified, all columns are written. Column names can be given as multiple arguments or as comma-separated values (e.g. --select id,name,email or --select id --select name --select email).
--limit <N>            Maximum number of records to read from the input.
--sparse               For JSON/YAML: omit keys with null/missing values. Default: true. Use --no-sparse to include default values (e.g. empty string).
--json-pretty          When converting to JSON, format output with indentation and newlines. Ignored for other output formats.
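
The two accepted spellings of `--select` (repeated arguments and comma-separated values) can be normalized into a single column list. The sketch below shows one way to do that. `parse_select` is a hypothetical helper, and trimming whitespace and dropping empty names are assumptions rather than documented datu behavior.

```rust
/// Flatten `--select` values into one ordered list of column names.
/// Each value may be a single name ("id") or a comma-separated list ("id,name").
/// Assumption: surrounding whitespace is trimmed and empty names are dropped.
fn parse_select(values: &[&str]) -> Vec<String> {
    values
        .iter()
        .flat_map(|value| value.split(','))
        .map(str::trim)
        .filter(|name| !name.is_empty())
        .map(String::from)
        .collect()
}
```

Both `parse_select(&["id,name,email"])` and `parse_select(&["id", "name", "email"])` produce the same three-element list, which is why the two command-line forms are interchangeable.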

Examples:

# Parquet to CSV (all columns)
datu convert data.parquet data.csv

# Parquet to Avro (first 1000 rows)
datu convert data.parquet data.avro --limit 1000

# Avro to CSV, only specific columns
datu convert events.avro events.csv --select id,timestamp,user_id

# Parquet to Parquet with column subset
datu convert input.parq output.parquet --select one,two,three

# Parquet, Avro, or ORC to Excel (.xlsx)
datu convert data.parquet report.xlsx

# Parquet or Avro to ORC
datu convert data.parquet data.orc

# Parquet or Avro to JSON
datu convert data.parquet data.json
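
For automated pipelines, the same conversions can be driven from a program. The sketch below builds a `datu convert` command line with Rust's standard `std::process::Command`. It uses only the flags documented above. `build_convert` and `run` are hypothetical helper names, and treating a non-zero exit status as failure is an assumption about datu's behavior.

```rust
use std::process::Command;

/// Build (but do not run) a `datu convert` invocation.
/// Hypothetical helper; only flags documented above are used.
fn build_convert(input: &str, output: &str, select: &[&str], limit: Option<u64>) -> Command {
    let mut cmd = Command::new("datu");
    cmd.arg("convert").arg(input).arg(output);
    if !select.is_empty() {
        // Comma-separated form, as in `--select id,name,email`.
        cmd.arg("--select").arg(select.join(","));
    }
    if let Some(n) = limit {
        cmd.arg("--limit").arg(n.to_string());
    }
    cmd
}

/// Run the command and turn a non-zero exit status into an error.
/// Assumption: datu signals failure through its exit status.
fn run(mut cmd: Command) -> std::io::Result<()> {
    let status = cmd.status()?;
    if status.success() {
        Ok(())
    } else {
        Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("datu exited with {status}"),
        ))
    }
}
```

A caller would write `run(build_convert("events.avro", "events.csv", &["id", "timestamp", "user_id"], None))` and handle the returned `Result`. Keeping construction separate from execution makes the argument list easy to inspect or log before anything runs.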

head

Print the first N rows of a Parquet, Avro, or ORC file to stdout (default CSV; use --output for other formats).

Supported input formats: Parquet (.parquet, .parq), Avro (.avro), ORC (.orc).

Usage:

datu head <INPUT> [OPTIONS]

Options:

Option                 Description
-n, --number <N>       Number of rows to print. Default: 10.
--output <FORMAT>      Output format: csv, json, json-pretty, or yaml. Case insensitive. Default: csv.
--select <COLUMNS>...  Columns to include. If not specified, all columns are printed. Same format as convert --select.

Examples:

# First 10 rows (default)
datu head data.parquet

# First 100 rows
datu head data.parquet -n 100
datu head data.avro --number 100
datu head data.orc --number 100

# First 20 rows, specific columns
datu head data.parquet -n 20 --select id,name,email

tail

Print the last N rows of a Parquet, Avro, or ORC file to stdout (default CSV; use --output for other formats).

Supported input formats: Parquet (.parquet, .parq), Avro (.avro), ORC (.orc).

Usage:

datu tail <INPUT> [OPTIONS]

Options:

Option                 Description
-n, --number <N>       Number of rows to print. Default: 10.
--output <FORMAT>      Output format: csv, json, json-pretty, or yaml. Case insensitive. Default: csv.
--select <COLUMNS>...  Columns to include. If not specified, all columns are printed. Same format as convert --select.
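
The difference between taking the first N and the last N rows of a stream can be sketched as follows. This illustrates the general technique only. The `head` and `tail` functions are hypothetical, and datu may well use file metadata such as row counts to skip ahead instead of scanning every row.

```rust
use std::collections::VecDeque;

/// First `n` rows: stop reading as soon as enough rows have been seen.
fn head<T>(rows: impl Iterator<Item = T>, n: usize) -> Vec<T> {
    rows.take(n).collect()
}

/// Last `n` rows: keep a sliding window so memory stays bounded by `n`,
/// no matter how many rows the input has.
fn tail<T>(rows: impl Iterator<Item = T>, n: usize) -> Vec<T> {
    if n == 0 {
        return Vec::new();
    }
    let mut window: VecDeque<T> = VecDeque::with_capacity(n);
    for row in rows {
        if window.len() == n {
            // Window is full: drop the oldest row before adding the newest.
            window.pop_front();
        }
        window.push_back(row);
    }
    window.into_iter().collect()
}
```

In this sketch `head` can stop early, while `tail` has to consume the whole input, which is one reason a reader that knows the total row count up front can serve `tail` more cheaply.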

Examples:

# Last 10 rows (default)
datu tail data.parquet

# Last 50 rows
datu tail data.parquet -n 50
datu tail data.avro --number 50
datu tail data.orc --number 50

# Last 20 rows, specific columns
datu tail data.parquet -n 20 --select id,name,email

# Redirect tail output to a file
datu tail data.parquet -n 1000 > last1000.csv

version

Print the installed datu version:

datu version