§datu - a data file utility
Datu (Filipino) - a traditional chief or local leader
datu is intended to be a lightweight, fast, and versatile CLI tool for reading, querying, and converting data in various file formats, such as Parquet, Avro, ORC, CSV, JSON, YAML, and XLSX.
§Installation
Prerequisites: Rust ~> 1.95 (or recent stable)

datu convert data.parquet data.csv --limit 100 --select id,email
cargo install datu

To install from source:

cargo install --git https://github.com/aisrael/datu

§Supported Formats
| Format | Read | Write | Display |
|---|---|---|---|
| Parquet (.parquet, .parq) | ✓ | ✓ | — |
| Avro (.avro) | ✓ | ✓ | — |
| ORC (.orc) | ✓ | ✓ | — |
| CSV (.csv) | ✓ | ✓ | ✓ |
| JSON (.json) | ✓ | ✓ | ✓ |
| XLSX (.xlsx) | — | ✓ | — |
| JSON (pretty) | — | — | ✓ |
| YAML | — | — | ✓ |
- Read — Input file formats for convert, count, schema, head, and tail.
- Write — Output file formats for convert.
- Display — Output format when printing to stdout (schema, head, tail via --output: csv, json, json-pretty, yaml).
File type detection: By default, file types are inferred from extensions. Use --input <TYPE> (-I) to override input format detection, and --output <TYPE> (-O, convert only) to override output format detection. Valid types: avro, csv, json, orc, parquet, xlsx, yaml.
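The extension-based detection described above can be pictured with a small sketch. Note this is illustrative only: datu re-exports file_type::resolve_file_type for this purpose, but the function below is a hypothetical stand-in, not its actual signature.

```rust
// Hypothetical sketch of extension-based file type detection.
// datu exposes file_type::resolve_file_type; this is NOT its real API.
fn detect(path: &str) -> Option<&'static str> {
    // Take the text after the last '.' and compare case-insensitively.
    match path.rsplit('.').next()?.to_ascii_lowercase().as_str() {
        "parquet" | "parq" => Some("parquet"),
        "avro" => Some("avro"),
        "orc" => Some("orc"),
        "csv" => Some("csv"),
        "json" => Some("json"),
        "xlsx" => Some("xlsx"),
        "yaml" => Some("yaml"),
        _ => None, // unknown extension: caller must use --input/--output
    }
}

fn main() {
    assert_eq!(detect("data.PARQ"), Some("parquet"));
    assert_eq!(detect("notes.txt"), None);
    println!("ok");
}
```

When detection returns no match (or the extension lies about the contents), the --input/--output overrides take precedence.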
CSV options: When reading CSV files, the --input-headers option controls whether the first row is treated as column names. When omitted, or when given as bare --input-headers, it defaults to true (header present); use --input-headers=false for headerless CSV. Applies to convert, count, schema, head, and tail.
§Usage
datu can be used non-interactively as a typical command-line utility, or it can be run without specifying a command to enter interactive mode, which provides a REPL-like interface.
For example, the command

datu convert table.parquet --select id,email table.csv

and, interactively, the REPL pipeline

datu
> read("table.parquet") |> select(:id, :email) |> write("table.csv")

perform the same conversion and column filtering.
§Commands
§convert
Convert data between supported formats. Input and output formats are inferred from file extensions, or can be specified explicitly with --input and --output.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), JSON (.json), ORC (.orc).
Supported output formats: CSV (.csv), JSON (.json), Parquet (.parquet, .parq), Avro (.avro), ORC (.orc), XLSX (.xlsx).
Usage (CLI):
datu convert <INPUT> <OUTPUT> [OPTIONS]

Usage (REPL):

read("table.parquet") |> select(:id, :email) |> write("table.csv")

Options:
| Option | Description |
|---|---|
-I, --input <TYPE> | Input file type (avro, csv, json, orc, parquet, xlsx, yaml). Overrides extension-based detection. |
-O, --output <TYPE> | Output file type (avro, csv, json, orc, parquet, xlsx, yaml). Overrides extension-based detection. |
--select <COLUMNS>... | Columns to include. If not specified, all columns are written. Column names can be given as multiple arguments or as comma-separated values (e.g. --select id,name,email or --select id --select name --select email). |
--limit <N> | Maximum number of records to read from the input. |
--sparse | For JSON/YAML: omit keys with null/missing values. Default: true. Use --sparse=false to include default values (e.g. empty string). |
--json-pretty | When converting to JSON, format output with indentation and newlines. Ignored for other output formats. |
--input-headers [BOOL] | For CSV input: whether the first row is a header. Default: true when omitted. Use --input-headers=false for headerless CSV. |
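The effect of --sparse on JSON/YAML output can be sketched as follows. This is illustrative only, not datu's actual code; the function and field names are invented:

```rust
use std::collections::BTreeMap;

// Illustrative sketch (not datu's code): with sparse output, keys whose
// value is missing are omitted entirely from the emitted record.
fn render_sparse(record: &BTreeMap<&str, Option<&str>>) -> String {
    let fields: Vec<String> = record
        .iter()
        // Drop entries whose value is None; keep the rest as "key": "value".
        .filter_map(|(k, v)| (*v).map(|val| format!("\"{k}\": \"{val}\"")))
        .collect();
    format!("{{{}}}", fields.join(", "))
}

fn main() {
    let mut rec = BTreeMap::new();
    rec.insert("email", Some("a@example.com"));
    rec.insert("id", Some("1"));
    rec.insert("name", None);
    // BTreeMap iterates in key order, so this prints:
    // {"email": "a@example.com", "id": "1"}
    println!("{}", render_sparse(&rec));
}
```

With --sparse=false, the "name" key would instead appear with a default value (e.g. an empty string).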
Examples:
# Parquet to CSV (all columns)
datu convert data.parquet data.csv
# CSV to Parquet (with automatic type inference)
datu convert data.csv data.parquet
# Parquet to Avro (first 1000 rows)
datu convert data.parquet data.avro --limit 1000
# Avro to CSV, only specific columns
datu convert events.avro events.csv --select id,timestamp,user_id
# CSV to JSON with headerless input
datu convert data.csv output.json --input-headers=false
# Parquet to Parquet with column subset
datu convert input.parq output.parquet --select one,two,three
# JSON to CSV or Parquet
datu convert data.json data.csv
datu convert data.json data.parquet
# Parquet, Avro, CSV, JSON, or ORC to Excel (.xlsx)
datu convert data.parquet report.xlsx
# Parquet, Avro, or JSON to ORC
datu convert data.parquet data.orc
# Parquet, Avro, or JSON to JSON
datu convert data.parquet data.json

§schema
Display the schema of a Parquet, Avro, CSV, or ORC file (column names, types, and nullability). Useful for inspecting file structure without reading data. CSV schema uses type inference from the data.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).
Usage (CLI):
datu schema <FILE> [OPTIONS]

Usage (REPL):

read("file") |> schema()

Use schema() after a read to print the schema of the data in the pipeline; for a single file, read("data.parquet") |> schema() is equivalent to datu schema data.parquet.
Options:
| Option | Description |
|---|---|
-I, --input <TYPE> | Input file type (avro, csv, json, orc, parquet, xlsx, yaml). Overrides extension-based detection. |
--output <FORMAT> | Output format: csv, json, json-pretty, or yaml. Case insensitive. Default: csv. |
--input-headers [BOOL] | For CSV input: whether the first row is a header. Default: true when omitted. Use --input-headers=false for headerless CSV. |
Output formats:
- csv (default): One line per column, e.g. name: String (UTF8), nullable.
- json: JSON array of objects with name, data_type, nullable, and optionally converted_type (Parquet).
- json-pretty: Same as json but pretty-printed for readability.
- yaml: YAML list of mappings with the same fields.
Examples:
# Default CSV-style output
datu schema data.parquet
# JSON output
datu schema data.parquet --output json
# JSON pretty-printed
datu schema data.parquet --output json-pretty
# YAML output (e.g. for config or tooling)
datu schema events.avro --output yaml
datu schema events.avro -o YAML

§count
Return the number of rows in a Parquet, Avro, CSV, or ORC file.
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).
Usage (CLI):
datu count <FILE> [OPTIONS]

Usage (REPL):

count("file")

Count rows in a file directly, or use read("file") |> count().
Options:
| Option | Description |
|---|---|
-I, --input <TYPE> | Input file type (avro, csv, json, orc, parquet, xlsx, yaml). Overrides extension-based detection. |
--input-headers [BOOL] | For CSV input: whether the first row is a header. Default: true when omitted. Use --input-headers=false for headerless CSV. |
Examples:
# Count rows in a Parquet file
datu count data.parquet
# Count rows in an Avro, CSV, or ORC file
datu count events.avro
datu count data.csv
datu count data.orc
# Count rows in a headerless CSV file
datu count data.csv --input-headers=false

§sample
Print N randomly sampled rows from a Parquet, Avro, CSV, or ORC file to stdout (default CSV; use --output for other formats). For Parquet and ORC, sampling uses file metadata to determine total row count and selects random indices; for Avro and CSV, reservoir sampling is used.
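The reservoir-sampling strategy mentioned above, used when the total row count is not known up front, can be sketched as Algorithm R. This is illustrative only, not datu's code; the toy LCG stands in for a real random number generator:

```rust
// Illustrative sketch of reservoir sampling (Algorithm R), the strategy
// described above for Avro/CSV inputs, whose total row count is unknown
// until the file has been fully scanned. Not datu's actual code.
fn reservoir_sample<I: Iterator<Item = String>>(rows: I, n: usize, seed: u64) -> Vec<String> {
    let mut state = seed;
    let mut rand_below = move |bound: u64| -> u64 {
        // Minimal LCG (Knuth's MMIX constants), for self-containment only.
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (state >> 33) % bound
    };
    let mut reservoir: Vec<String> = Vec::with_capacity(n);
    for (i, row) in rows.enumerate() {
        if reservoir.len() < n {
            // Fill the reservoir with the first n rows.
            reservoir.push(row);
        } else {
            // Replace a random slot with probability n / (i + 1),
            // giving every row an equal chance of ending up in the sample.
            let j = rand_below(i as u64 + 1) as usize;
            if j < n {
                reservoir[j] = row;
            }
        }
    }
    reservoir
}

fn main() {
    let rows = (0..1_000).map(|i| format!("row{i}"));
    let sample = reservoir_sample(rows, 5, 42);
    assert_eq!(sample.len(), 5);
    println!("{sample:?}");
}
```

This single-pass approach is why sampling Avro and CSV does not need the file's row count, whereas Parquet and ORC can instead pick random indices from metadata.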
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).
Usage (CLI):
datu sample <INPUT> [OPTIONS]

Usage (REPL):

read("file") |> sample(n)

Prints n random rows to stdout. Chain |> write("out.csv") to write to a file, e.g. read("data.parquet") |> sample(5) |> write("sample.csv").
Options:
| Option | Description |
|---|---|
-I, --input <TYPE> | Input file type (avro, csv, json, orc, parquet, xlsx, yaml). Overrides extension-based detection. |
-n, --number <N> | Number of rows to sample. Default: 10. |
--output <FORMAT> | Output format: csv, json, json-pretty, or yaml. Case insensitive. Default: csv. |
--sparse | For JSON/YAML: omit keys with null/missing values. Default: true. Use --sparse=false to include default values. |
--select <COLUMNS>... | Columns to include. If not specified, all columns are printed. Same format as convert --select. |
--input-headers [BOOL] | For CSV input: whether the first row is a header. Default: true when omitted. Use --input-headers=false for headerless CSV. |
Examples:
# 10 random rows (default)
datu sample data.parquet
# 5 random rows
datu sample data.parquet -n 5
datu sample data.avro --number 5
datu sample data.csv -n 5
datu sample data.orc --number 5
# 20 random rows, specific columns
datu sample data.parquet -n 20 --select id,name,email

§head
Print the first N rows of a Parquet, Avro, CSV, or ORC file to stdout (default CSV; use --output for other formats).
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).
Usage (CLI):
datu head <INPUT> [OPTIONS]

Usage (REPL):

read("file") |> head(n)

Prints the first n rows to stdout. Chain |> write("out.csv") to write to a file, e.g. read("data.parquet") |> head(10) |> write("first10.csv").
Options:
| Option | Description |
|---|---|
-I, --input <TYPE> | Input file type (avro, csv, json, orc, parquet, xlsx, yaml). Overrides extension-based detection. |
-n, --number <N> | Number of rows to print. Default: 10. |
--output <FORMAT> | Output format: csv, json, json-pretty, or yaml. Case insensitive. Default: csv. |
--sparse | For JSON/YAML: omit keys with null/missing values. Default: true. Use --sparse=false to include default values. |
--select <COLUMNS>... | Columns to include. If not specified, all columns are printed. Same format as convert --select. |
--input-headers [BOOL] | For CSV input: whether the first row is a header. Default: true when omitted. Use --input-headers=false for headerless CSV. |
Examples:
# First 10 rows (default)
datu head data.parquet
# First 100 rows
datu head data.parquet -n 100
datu head data.avro --number 100
datu head data.csv -n 100
datu head data.orc --number 100
# First 20 rows, specific columns
datu head data.parquet -n 20 --select id,name,email
# Head from a headerless CSV file
datu head data.csv --input-headers=false

§tail
Print the last N rows of a Parquet, Avro, CSV, or ORC file to stdout (default CSV; use --output for other formats).
Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).
Note: For Avro and CSV files, tail requires a full file scan, since these formats do not support random access to the end of the file.
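The full-scan behavior for such formats can be pictured as streaming every row while keeping only the most recent n in a bounded buffer. A minimal sketch, not datu's actual implementation:

```rust
use std::collections::VecDeque;

// Sketch of how `tail` can work for formats without random access:
// stream every row, retaining only the last n in a bounded buffer.
// Illustrative only; not datu's actual implementation.
fn tail_rows<I: Iterator<Item = String>>(rows: I, n: usize) -> Vec<String> {
    let mut buf: VecDeque<String> = VecDeque::with_capacity(n + 1);
    for row in rows {
        buf.push_back(row);
        if buf.len() > n {
            // Drop the oldest row once the buffer exceeds n.
            buf.pop_front();
        }
    }
    buf.into_iter().collect()
}

fn main() {
    let rows = (1..=100).map(|i| format!("row{i}"));
    // Prints ["row98", "row99", "row100"]
    println!("{:?}", tail_rows(rows, 3));
}
```

Memory stays proportional to n rather than to the file size, but the whole file must still be read, which is the cost the note above refers to.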
Usage (CLI):
datu tail <INPUT> [OPTIONS]

Usage (REPL):

read("file") |> tail(n)

Prints the last n rows to stdout. Chain |> write("out.csv") to write to a file, e.g. read("data.parquet") |> tail(10) |> write("last10.csv").
Options:
| Option | Description |
|---|---|
-I, --input <TYPE> | Input file type (avro, csv, json, orc, parquet, xlsx, yaml). Overrides extension-based detection. |
-n, --number <N> | Number of rows to print. Default: 10. |
--output <FORMAT> | Output format: csv, json, json-pretty, or yaml. Case insensitive. Default: csv. |
--sparse | For JSON/YAML: omit keys with null/missing values. Default: true. Use --sparse=false to include default values. |
--select <COLUMNS>... | Columns to include. If not specified, all columns are printed. Same format as convert --select. |
--input-headers [BOOL] | For CSV input: whether the first row is a header. Default: true when omitted. Use --input-headers=false for headerless CSV. |
Examples:
# Last 10 rows (default)
datu tail data.parquet
# Last 50 rows
datu tail data.parquet -n 50
datu tail data.avro --number 50
datu tail data.csv -n 50
datu tail data.orc --number 50
# Last 20 rows, specific columns
datu tail data.parquet -n 20 --select id,name,email
# Redirect tail output to a file
datu tail data.parquet -n 1000 > last1000.csv

§version
Print the installed datu version.
Usage (CLI):
datu version

Usage (REPL):

No equivalent; run datu version from the shell.
§Interactive Mode (REPL)
Running datu without any command starts an interactive REPL (Read-Eval-Print Loop):
datu
>

In the REPL, you compose data pipelines using the |> (pipe) operator to chain functions together. The general pattern is:

read("input") |> ... |> write("output")

§Functions
§read(path)
Read a data file. Supported formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), JSON (.json), ORC (.orc). CSV files are assumed to have a header row by default.
> read("data.parquet") |> write("data.csv")
> read("data.csv") |> write("data.parquet")

§write(path)
Write data to a file. The output format is inferred from the file extension. Supported formats: CSV (.csv), JSON (.json), YAML (.yaml), Parquet (.parquet, .parq), Avro (.avro), ORC (.orc), XLSX (.xlsx).
> read("data.parquet") |> write("output.json")

§select(columns...)
Select and reorder columns. Columns can be specified using symbol syntax (:name) or string syntax ("name").
> read("data.parquet") |> select(:id, :email) |> write("subset.csv")
> read("data.parquet") |> select("id", "email") |> write("subset.csv")

Columns appear in the output in the order they are listed, so select can also be used to reorder columns:

> read("data.parquet") |> select(:email, :id) |> write("reordered.csv")

§head(n)
Take the first n rows.
> read("data.parquet") |> head(10) |> write("first10.csv")

§sample(n)
Take n random rows from the data. Default: 10.
> read("data.parquet") |> sample(5) |> write("sampled.csv")

§tail(n)
Take the last n rows.
> read("data.parquet") |> tail(10) |> write("last10.csv")

§Composing Pipelines
Functions can be chained in any order to build more complex pipelines:
> read("users.avro") |> select(:id, :first_name, :email) |> head(5) |> write("top5.json")
> read("data.parquet") |> select(:two, :one) |> tail(1) |> write("last_row.csv")
> read("data.parquet") |> select(:id, :email) |> sample(5) |> write("sample.csv")

§How it Works Internally
Internally, datu constructs a pipeline based on the command and arguments. When possible, it uses the DataFusion DataFrame API for efficiency and performance. However, ORC, XLSX, YAML, and pretty-printed JSON aren't natively supported by DataFusion, so datu uses internal adapters for those formats.
For example, the following invocation
datu convert input.parquet output.yaml --select id,name,email

constructs a pipeline composed of:

- a (DataFrame) Parquet reader step that reads the input.parquet file and filters for only the id, name, and email columns, which chains to
- a YAML writer step that writes the id, name, and email columns from input.parquet to output.yaml
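The step-chaining idea above can be sketched abstractly as stages that each transform a batch of rows. The types and names here are hypothetical; datu's real pipeline is built on the DataFusion DataFrame API where possible, with internal adapters elsewhere:

```rust
// Abstract sketch of a reader -> select -> writer style pipeline.
// All types and names are hypothetical, not datu's actual internals.
type Row = Vec<(String, String)>; // (column name, value) pairs

trait Step {
    fn run(&self, rows: Vec<Row>) -> Vec<Row>;
}

struct SelectStep {
    columns: Vec<String>,
}

impl Step for SelectStep {
    fn run(&self, rows: Vec<Row>) -> Vec<Row> {
        // Keep only the requested columns, in the requested order.
        rows.into_iter()
            .map(|row| {
                self.columns
                    .iter()
                    .filter_map(|c| row.iter().find(|(k, _)| k == c).cloned())
                    .collect()
            })
            .collect()
    }
}

// Each step consumes the previous step's output, mirroring how the
// reader step chains into the writer step in the example above.
fn run_pipeline(steps: &[Box<dyn Step>], input: Vec<Row>) -> Vec<Row> {
    steps.iter().fold(input, |rows, step| step.run(rows))
}

fn main() {
    let input = vec![vec![
        ("id".to_string(), "1".to_string()),
        ("name".to_string(), "Ada".to_string()),
        ("age".to_string(), "36".to_string()),
    ]];
    let steps: Vec<Box<dyn Step>> = vec![Box::new(SelectStep {
        columns: vec!["name".to_string(), "id".to_string()],
    })];
    println!("{:?}", run_pipeline(&steps, input));
}
```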
Re-exports§
pub use errors::Error;
pub use file_type::DISPLAY_PIPELINE_INPUTS_FOR_CLI;
pub use file_type::FileType;
pub use file_type::resolve_file_type;
pub use utils::get_total_rows_result;
Modules§
- cli: Contains shared types for the datu CLI and implementation.
- errors: Error types for the datu crate.
- file_type
- pipeline: The pipeline module is the core of the datu crate.
- utils

Macros§

- select_spec: Macro to build a SelectSpec of ColumnSpec::Exact entries from column names.
Constants§
- VERSION: The datu crate version.
Type Aliases§
- Result: Result type alias for datu operations.