arrs-cli 0.1.3

Command-line tool for inspecting Lance and other Arrow-based datasets.

arrs

A small command-line tool for inspecting Arrow-based datasets. Today it reads Lance datasets; the core is format-agnostic so other Arrow-backed formats can be added without touching commands or output.

Features

  • Stream or random-access any Lance dataset from the shell.
  • Print rows as JSONL, CSV, or a pretty table.
  • Project columns with --columns / --exclude-columns.
  • Choose how binary payloads are rendered: hidden behind a placeholder, hex (\xHH), or base64.
  • ISO-8601 timestamps and explicit NaN/Infinity handling in every format; nested lists & structs preserved in JSONL.

Install

Via uv

uv tool install rust-arrs

From the repository

# From a clone of this repo:
cargo install --path .

# Or run directly from the checkout:
cargo run --release -- <command> [args…]

Commands

Command   What it does
cat       Concatenate one or more datasets and print every row.
head      Print the first N rows (default 10).
tail      Print the last N rows (default 10).
take      Print specific rows by index (see grammar below).
rowcount  Print the number of rows.
sample    Print N random rows, no replacement. --seed for reproducibility.
schema    Print the logical (Arrow) or physical (Lance-native) schema.
versions  (Lance) List versions of the dataset.
branches  (Lance) List branches of the dataset.
tags      (Lance) List tags across every branch.
indices   (Lance) List indices defined on the dataset.
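
The sample row promises N distinct rows and a reproducible draw under --seed. A minimal Python sketch of that contract follows; sample_indices is a hypothetical helper, not part of arrs, and clamping N to the row count is an assumption about edge-case behaviour.

```python
import random


def sample_indices(row_count, n, seed=None):
    """Pick up to n distinct row indices from range(row_count)."""
    # A private generator keeps the seed local to this call.
    rng = random.Random(seed)
    # Assumption: asking for more rows than exist returns every row once.
    return rng.sample(range(row_count), min(n, row_count))
```

Calling it twice with the same seed yields the same indices, which is the property --seed is meant to give.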

Global flags

Flag                        Default  Purpose
--format <csv|jsonl|table>  per-cmd  Output format. Defaults to table for versions/branches/tags/indices, jsonl everywhere else.
--binary-format <...>       none     none: BINARY_DATA placeholder; hex: \xHH escapes; base64.
--columns <a,b,…>                    Comma-separated include list. User order is preserved.
--exclude-columns <a,b,…>            Comma-separated exclude list. Takes precedence over --columns.
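
How --columns and --exclude-columns combine can be sketched in a few lines of Python. The project function below is hypothetical and only illustrates the two rules stated in the table (user order preserved, exclude wins); rejecting unknown column names is an assumption.

```python
def project(schema, columns=None, exclude=None):
    """Return the output column order for a dataset with the given schema."""
    excluded = set(exclude or [])
    # With no include list, fall back to schema order.
    selected = list(columns) if columns else list(schema)
    # Assumption: naming a column that does not exist is an error.
    unknown = [c for c in selected if c not in schema]
    if unknown:
        raise KeyError(f"unknown columns: {unknown}")
    # Exclusion is applied last, so it takes precedence over inclusion.
    return [c for c in selected if c not in excluded]
```

For example, project(["id", "score", "raw_tokens"], columns=["score", "id"], exclude=["id"]) returns ["score"].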

Examples

# How many rows?
arrs rowcount dataset.lance

# Peek at the first 5 rows as JSONL (default).
arrs head -n 5 dataset.lance

# Last 3 rows, CSV, dropping a noisy column.
arrs tail -n 3 --format csv --exclude-columns raw_tokens dataset.lance

# Specific rows by index: last row, row 0, rows 2 through 4.
arrs take --indices '-1,0,2:4' dataset.lance

# Reproducible random sample of 100 rows.
arrs sample -n 100 --seed 42 dataset.lance

# Concatenate two partitions (must share the same schema) and keep two columns.
arrs cat --columns id,score part_a.lance part_b.lance

# Inspect schemas.
arrs schema dataset.lance                 # arrow (logical)
arrs schema --type physical dataset.lance # lance-native (field ids, encoding…)

Lance versioning, branches and tags

Lance datasets carry a per-branch linear version history; tags are named references to specific (branch, version) pairs. Three flags select which state to read from:

Flag             Meaning
--branch <name>  Read from the named branch (default: main).
--version <N>    Read version N on the chosen branch (default: latest version).
--tag <name>     Read the tagged (branch, version) pair.

--version and --tag are mutually exclusive.
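
The selection rules can be summarised as a small resolver. This Python sketch is illustrative only: resolve_target and the tags mapping are hypothetical names, and letting a tag decide the branch on its own is an assumption drawn from the "(branch, version) pair" wording.

```python
def resolve_target(branch=None, version=None, tag=None, tags=None):
    """Return (branch, version); a version of None means 'latest'."""
    if version is not None and tag is not None:
        raise ValueError("--version and --tag are mutually exclusive")
    if tag is not None:
        # Assumption: a tag fully determines both branch and version.
        return (tags or {})[tag]
    return (branch or "main", version)
```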

# Inspect a previous snapshot.
arrs head -n 5 --version 3 dataset.lance
arrs rowcount --tag release-2026-04 dataset.lance
arrs cat --branch dev --columns id,score dataset.lance

# List metadata.
arrs versions dataset.lance                       # every version on main
arrs versions --tagged-only dataset.lance         # only tagged versions
arrs versions --branch dev dataset.lance          # every version on `dev`
arrs branches dataset.lance
arrs tags dataset.lance                           # cross-branch tag listing
arrs indices dataset.lance

Binary columns

Binary payloads can blow up output size and clutter a terminal, so by default they are collapsed to a placeholder:

$ arrs head -n 1 dataset.lance
{"id":1,"data":"BINARY_DATA",…}

$ arrs head -n 1 --binary-format hex dataset.lance
{"id":1,"data":"\\x48\\x65\\x6c\\x6c\\x6f",…}

$ arrs head -n 1 --binary-format base64 dataset.lance
{"id":1,"data":"SGVsbG8=",…}

The placeholder semantics apply recursively: binary nested inside a struct or list is also rendered as BINARY_DATA under the default.
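
The three renderings and the recursive rule can be reproduced in Python. This is a sketch of the behaviour described above, not the tool's code; render_binary is a hypothetical name, and Python dicts and lists stand in for Arrow structs and lists.

```python
import base64


def render_binary(value, mode="none"):
    """Render bytes the way each --binary-format value is described."""
    if isinstance(value, (bytes, bytearray)):
        if mode == "none":
            return "BINARY_DATA"
        if mode == "hex":
            # One \xHH escape per byte, lowercase hex digits.
            return "".join(f"\\x{b:02x}" for b in value)
        if mode == "base64":
            return base64.b64encode(value).decode("ascii")
        raise ValueError(f"unknown binary format: {mode}")
    # Recurse so that binary nested in structs or lists is rendered too.
    if isinstance(value, dict):
        return {k: render_binary(v, mode) for k, v in value.items()}
    if isinstance(value, list):
        return [render_binary(v, mode) for v in value]
    return value
```

The doubled backslashes in the hex example above come from JSON string escaping: the cell value is \x48\x65…, and JSON encodes each backslash as \\.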

--indices grammar

take --indices accepts a comma-separated list of expressions. Order is preserved and duplicates are emitted as-is.

Expression  Meaning
N           single row (negatives count from the end)
A:B         inclusive range, both ends
:B          rows 0 through B
A:          row A through the last row
-5:-1       last 5 rows
3,1,1,0:2   [3, 1, 1, 0, 1, 2]
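
To make the grammar concrete, here is a small Python sketch that expands an --indices string into row numbers. expand_indices is a hypothetical helper; raising on out-of-range indices and yielding nothing for a reversed range are assumptions, since the table does not cover those cases.

```python
def expand_indices(spec, row_count):
    """Expand a comma-separated index expression into a list of row numbers."""

    def resolve(token):
        i = int(token)
        if i < 0:
            i += row_count  # negatives count from the end
        if not 0 <= i < row_count:
            # Assumption: out-of-range indices are an error.
            raise IndexError(f"index {token} out of range for {row_count} rows")
        return i

    out = []
    for expr in spec.split(","):
        expr = expr.strip()
        if ":" in expr:
            start, _, stop = expr.partition(":")
            lo = resolve(start) if start else 0
            hi = resolve(stop) if stop else row_count - 1
            out.extend(range(lo, hi + 1))  # both ends inclusive
        else:
            out.append(resolve(expr))
    return out  # order and duplicates are kept as written
```

With a 10-row dataset, expand_indices("-1,0,2:4", 10) gives [9, 0, 2, 3, 4], matching the take example earlier.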

Output format notes

JSONL

  • One JSON object per line; keys match the projected column order.
  • NaN / ±Infinity are emitted as the strings "NaN" / "Infinity" / "-Infinity", since JSON has no literals for them.
  • Timestamps are ISO-8601 (2024-01-01T00:00:00.000000, with offset when the arrow type carries a timezone).
  • Lists → JSON arrays; structs → JSON objects; maps → JSON objects with stringified keys.
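
The JSONL rules above map onto a short conversion function. The Python sketch below is one way to get the same shape of output; to_json_value and to_jsonl_line are hypothetical names, and Python floats, datetimes, dicts and lists stand in for the Arrow types.

```python
import json
import math
from datetime import datetime


def to_json_value(value):
    """Convert one cell into something json.dumps can emit as strict JSON."""
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
        return value
    if isinstance(value, datetime):
        # Microsecond precision; an offset is appended only for aware datetimes.
        return value.isoformat(timespec="microseconds")
    if isinstance(value, dict):
        # Covers structs and maps alike; map keys are stringified.
        return {str(k): to_json_value(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_value(v) for v in value]
    return value


def to_jsonl_line(row):
    """Serialise one row dict; key order follows the dict's insertion order."""
    return json.dumps(to_json_value(row), separators=(",", ":"), allow_nan=False)
```

Passing allow_nan=False makes json.dumps raise if a non-finite float slips through unconverted, rather than writing invalid JSON.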

CSV

  • First line is a header row: col1,col2,col3. Column names containing commas, newlines, or quotes are quoted per RFC 4180.
  • Nulls are emitted as empty cells; non-finite floats as NaN / inf / -inf.
  • Types with no flat CSV representation (list, struct, map, duration, interval) are rejected; use JSONL for those.
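
For comparison, Python's csv module applies the same RFC 4180 quoting rules, so the CSV behaviour can be approximated as below. csv_cell and to_csv are hypothetical helpers; the sketch uses "\n" line endings for readability (RFC 4180 itself specifies CRLF) and only models the nested-type rejection, not duration or interval.

```python
import csv
import io
import math


def csv_cell(value):
    """Map one cell to its CSV text following the rules listed above."""
    if value is None:
        return ""  # nulls become empty cells
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "inf" if value > 0 else "-inf"
    if isinstance(value, (list, tuple, dict)):
        raise TypeError("nested values are not representable in CSV; use JSONL")
    return value


def to_csv(header, rows):
    """Write a header row plus data rows, quoting only where required."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow([csv_cell(v) for v in row])
    return buf.getvalue()
```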

Table

  • Pretty Unicode borders on a TTY, ASCII grid when piped (so arrs … | grep stays sane).
  • Same primitive rendering as CSV (ISO-8601, NaN/inf, empty for null).
  • Nested cells (list, struct, map) are JSON-encoded inside the cell — e.g. ["id"] for a single-element list. Strictly more permissive than CSV.
  • Buffers all rows before emitting (column widths require the full table). Default for the four metadata commands (small row counts), opt-in for row-producing commands; prefer jsonl/csv when streaming large datasets.
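
Why the table format has to buffer is easiest to see in code: column widths depend on the widest cell in every row. The Python sketch below is a simplified stand-in for the real renderer; render_table is a hypothetical name, the border characters are illustrative, and TTY detection via sys.stdout.isatty() is an assumption about how the switch is made.

```python
import sys


def render_table(header, rows, unicode_borders=None):
    """Render a grid; every row is needed up front to size the columns."""
    if unicode_borders is None:
        # Assumption: pretty borders only when stdout is a terminal.
        unicode_borders = sys.stdout.isatty()
    if unicode_borders:
        v, h, left, mid, right = "│", "─", "├", "┼", "┤"
    else:
        v, h, left, mid, right = "|", "-", "+", "+", "+"

    # Stringify everything first; nulls become empty cells.
    cells = [[str(c) for c in header]]
    cells += [["" if c is None else str(c) for c in row] for row in rows]
    # This pass over all rows is the reason the format cannot stream.
    widths = [max(len(row[i]) for row in cells) for i in range(len(header))]

    def line(row):
        return v + v.join(f" {c.ljust(w)} " for c, w in zip(row, widths)) + v

    sep = left + mid.join(h * (w + 2) for w in widths) + right
    return "\n".join([line(cells[0]), sep] + [line(r) for r in cells[1:]])
```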