arrs

A small command-line tool for inspecting Arrow-based datasets. Today it reads Lance datasets; the core is format-agnostic so other Arrow-backed formats can be added without touching commands or output.

Features

Stream or random-access any Lance dataset from the shell.
Print rows as JSONL, CSV, or a pretty table.
Project columns with `--columns` / `--exclude-columns`.
Choose how binary payloads are rendered: hidden behind a placeholder, hex (`\xHH`), or base64.
Timestamps are rendered as ISO-8601, NaN/Infinity are handled, and nested lists and structs are preserved in JSONL.

Install

Via `uv`

uv tool install rust-arrs

From the repository

# From a clone of this repo:
cargo install --path .

# Or run directly from the checkout:
cargo run --release -- <command> [args…]

Commands

Command	What it does
`cat`	Concatenate one or more datasets and print every row.
`head`	Print the first `N` rows (default `10`).
`tail`	Print the last `N` rows (default `10`).
`take`	Print specific rows by index (see grammar below).
`rowcount`	Print the number of rows.
`sample`	Print `N` random rows, sampled without replacement. Pass `--seed` for reproducibility.
`schema`	Print the logical (Arrow) or physical (Lance-native) schema.
`versions`	(Lance) List versions of the dataset.
`branches`	(Lance) List branches of the dataset.
`tags`	(Lance) List tags across every branch.
`indices`	(Lance) List indices defined on the dataset.

Global flags

Flag	Default	Purpose
`--format <csv|jsonl|table>`	per-cmd	Output format. Defaults to `table` for `versions`/`branches`/`tags`/`indices`, `jsonl` everywhere else.
`--binary-format <...>`	`none`	`none` → `BINARY_DATA` placeholder; `hex` → `\xHH` escapes; `base64` → Base64-encoded text.
`--columns <a,b,…>`	–	Comma-separated include list. User order is preserved.
`--exclude-columns <a,b,…>`	–	Comma-separated exclude list. Takes precedence over `--columns`.
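
The interaction between the two projection flags can be sketched in a few lines of Python. This is only an illustration of the rules in the table above, not the tool's actual code: `project_columns` is a hypothetical helper, and how unknown column names are treated is not specified here, so the sketch ignores that case.

```python
def project_columns(schema_cols, include=None, exclude=None):
    """Hypothetical helper: resolve the output column order."""
    # With no include list, fall back to the dataset's own column order.
    # With one, the user's order is kept as given.
    selected = list(include) if include else list(schema_cols)
    # The exclude list takes precedence over the include list.
    dropped = set(exclude or [])
    return [col for col in selected if col not in dropped]


print(project_columns(["id", "score", "raw_tokens"], include=["score", "id"]))
print(project_columns(["id", "score", "raw_tokens"], exclude=["raw_tokens"]))
```

The first call keeps the user's order (`score` before `id`); the second drops only the excluded column.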

Examples

# How many rows?
arrs rowcount dataset.lance

# Peek at the first 5 rows as JSONL (default).
arrs head -n 5 dataset.lance

# Last 3 rows, CSV, dropping a noisy column.
arrs tail -n 3 --format csv --exclude-columns raw_tokens dataset.lance

# Specific rows by index: last row, row 0, rows 2 through 4.
arrs take --indices '-1,0,2:4' dataset.lance

# Reproducible random sample of 100 rows.
arrs sample -n 100 --seed 42 dataset.lance

# Concatenate two partitions (must share the same schema) and keep two columns.
arrs cat --columns id,score part_a.lance part_b.lance

# Inspect schemas.
arrs schema dataset.lance                 # arrow (logical)
arrs schema --type physical dataset.lance # lance-native (field ids, encoding…)

Lance versioning, branches and tags

Lance datasets carry a per-branch linear version history; tags are named references to specific (branch, version) pairs. Three flags select which state to read from:

Flag	Meaning
`--branch <name>`	Read from the named branch (default: `main`).
`--version <N>`	Read version `N` on the chosen branch (default: latest version).
`--tag <name>`	Read the tagged `(branch, version)` pair.

`--version` and `--tag` are mutually exclusive.

# Inspect a previous snapshot.
arrs head -n 5 --version 3 dataset.lance
arrs rowcount --tag release-2026-04 dataset.lance
arrs cat --branch dev --columns id,score dataset.lance

# List metadata.
arrs versions dataset.lance                       # every version on main
arrs versions --tagged-only dataset.lance         # only tagged versions
arrs versions --branch dev dataset.lance          # every version on `dev`
arrs branches dataset.lance
arrs tags dataset.lance                           # cross-branch tag listing
arrs indices dataset.lance

Binary columns

Binary payloads can blow up output size and clutter a terminal, so by default they are collapsed to a placeholder:

$ arrs head -n 1 dataset.lance
{"id":1,"data":"BINARY_DATA",…}

$ arrs head -n 1 --binary-format hex dataset.lance
{"id":1,"data":"\\x48\\x65\\x6c\\x6c\\x6f",…}

$ arrs head -n 1 --binary-format base64 dataset.lance
{"id":1,"data":"SGVsbG8=",…}

The placeholder semantics apply recursively: binary nested inside a struct or list is also rendered as `BINARY_DATA` under the default.
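
As a rough model of the three modes and the recursive rule, here is a minimal Python sketch. It assumes rows have already been decoded into plain Python values (dicts for structs, lists for lists, `bytes` for binary); `render_binary` and `render_value` are hypothetical names for illustration, not part of arrs.

```python
import base64

PLACEHOLDER = "BINARY_DATA"


def render_binary(value: bytes, mode: str = "none") -> str:
    """Render one binary payload according to the chosen mode."""
    if mode == "none":
        return PLACEHOLDER
    if mode == "hex":
        # One \xHH escape per byte, lower-case hex digits.
        return "".join(f"\\x{b:02x}" for b in value)
    if mode == "base64":
        return base64.b64encode(value).decode("ascii")
    raise ValueError(f"unknown binary format: {mode}")


def render_value(value, mode: str = "none"):
    """Apply the binary rendering recursively through structs and lists."""
    if isinstance(value, (bytes, bytearray)):
        return render_binary(bytes(value), mode)
    if isinstance(value, dict):
        return {k: render_value(v, mode) for k, v in value.items()}
    if isinstance(value, list):
        return [render_value(v, mode) for v in value]
    return value


print(render_value({"id": 1, "nested": {"data": b"Hello"}}))
print(render_binary(b"Hello", "base64"))
```

In the JSONL examples above the hex form appears with doubled backslashes because JSON escapes each backslash inside a string.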

`--indices` grammar

`take --indices` accepts a comma-separated list of expressions. Order is preserved and duplicates are emitted as-is.

Expression	Meaning
`N`	single row (negatives count from the end)
`A:B`	inclusive range, both ends
`:B`	rows 0 through B
`A:`	row A through the last row
`-5:-1`	last 5 rows
`3,1,1,0:2`	`[3, 1, 1, 0, 1, 2]`
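
To make the grammar concrete, the following Python sketch expands an expression string into row numbers for a dataset of known length. `expand_indices` is a hypothetical helper written only to mirror the table above; the handling of out-of-range indices or reversed ranges is not specified in this README, so the sketch does not validate them.

```python
def expand_indices(spec: str, num_rows: int) -> list[int]:
    """Hypothetical helper: expand an --indices string into row numbers."""

    def resolve(token: str) -> int:
        # Negative indices count from the end of the dataset.
        i = int(token)
        return i + num_rows if i < 0 else i

    rows = []
    for expr in spec.split(","):
        expr = expr.strip()
        if ":" in expr:
            start_s, end_s = expr.split(":", 1)
            # An omitted bound defaults to the first or the last row.
            start = resolve(start_s) if start_s else 0
            end = resolve(end_s) if end_s else num_rows - 1
            # Ranges are inclusive on both ends.
            rows.extend(range(start, end + 1))
        else:
            rows.append(resolve(expr))
    return rows


print(expand_indices("3,1,1,0:2", 10))  # [3, 1, 1, 0, 1, 2]
print(expand_indices("-5:-1", 10))      # [5, 6, 7, 8, 9]
```

Order and duplicates fall out naturally because each expression is appended in the order it was written.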

Output format notes

JSONL

One JSON object per line; keys match the projected column order.
NaN / ±Infinity are emitted as the strings `"NaN"` / `"Infinity"` / `"-Infinity"`.
Timestamps are ISO-8601 (`2024-01-01T00:00:00.000000`, with an offset when the Arrow type carries a timezone).
Lists → JSON arrays; structs → JSON objects; maps → JSON objects with stringified keys.
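
The JSONL rules above can be approximated in Python as follows. This is a sketch of the encoding rules only, assuming rows are plain Python dicts with `float` and `datetime` values; `to_jsonable` and `to_jsonl_line` are hypothetical names, and the real tool works on Arrow arrays rather than Python objects.

```python
import json
import math
from datetime import datetime


def to_jsonable(value):
    """Map one value onto something strict JSON can represent."""
    if isinstance(value, float) and not math.isfinite(value):
        # Non-finite floats become strings, since JSON has no literal for them.
        if math.isnan(value):
            return "NaN"
        return "Infinity" if value > 0 else "-Infinity"
    if isinstance(value, datetime):
        # Microsecond precision; the offset is included only for aware datetimes.
        return value.isoformat(timespec="microseconds")
    if isinstance(value, dict):
        # Structs and maps both become objects; map keys are stringified.
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    return value


def to_jsonl_line(row: dict) -> str:
    """Encode one row as a single compact JSON line, keys in row order."""
    return json.dumps(to_jsonable(row), separators=(",", ":"), allow_nan=False)


print(to_jsonl_line({"id": 1, "x": float("nan"), "ts": datetime(2024, 1, 1)}))
```

Passing `allow_nan=False` makes `json.dumps` raise if a non-finite float slips through, instead of writing the bare `NaN` token that strict JSON parsers reject.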

CSV

First line is a header row: `col1,col2,col3`. Column names containing commas, newlines, or quotes are quoted per RFC 4180.
Nulls are emitted as empty cells; non-finite floats as `NaN` / `inf` / `-inf`.
Nested types (list, struct, map) and duration/interval types are rejected; use JSONL for those.
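
For readers who want to reproduce the same conventions elsewhere, here is a small Python sketch built on the standard `csv` module, which applies RFC 4180-style quoting by default. `csv_cell` and `to_csv` are hypothetical helpers; the sketch assumes rows are lists of plain Python values and is not how arrs itself is implemented.

```python
import csv
import io
import math


def csv_cell(value):
    """Convert one value into a CSV cell following the notes above."""
    if value is None:
        return ""  # nulls become empty cells
    if isinstance(value, float) and math.isnan(value):
        return "NaN"  # Python would otherwise write "nan"
    if isinstance(value, (list, dict)):
        # Nested values have no flat CSV representation.
        raise TypeError("nested values are not supported in CSV; use JSONL")
    return value


def to_csv(columns, rows) -> str:
    """Write a header row followed by the data rows."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)  # names with commas or quotes get quoted
    for row in rows:
        writer.writerow([csv_cell(v) for v in row])
    return buf.getvalue()


print(to_csv(["id", "a,b"], [[1, None], [2, float("nan")]]))
```

The header comes out as `id,"a,b"` because the second column name contains a comma.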

Table

Pretty Unicode borders on a TTY, ASCII grid when piped (so `arrs … | grep` stays sane).
Same primitive rendering as CSV (ISO-8601, NaN/inf, empty for null).
Nested cells (list, struct, map) are JSON-encoded inside the cell, e.g. `["id"]` for a single-element list. Strictly more permissive than CSV.
Buffers all rows before emitting (column widths require the full table). Default for the four metadata commands (small row counts), opt-in for row-producing commands; prefer jsonl/csv when streaming large datasets.

arrs-cli 0.1.3