arrs
A small command-line tool for inspecting Arrow-based datasets. Today it reads Lance datasets; the core is format-agnostic so other Arrow-backed formats can be added without touching commands or output.
Features
- Stream or random-access any Lance dataset from the shell.
- Print rows as JSONL, CSV, or a pretty table.
- Project columns with
--columns/--exclude-columns. - Choose how binary payloads are rendered: hidden behind a placeholder, hex
(
\xHH), or base64. - ISO-8601 timestamps,
NaN/Infinityhandled, nested lists & structs preserved in JSONL.
Install
Via uv
From the repository
# From a clone of this repo:
# Or run directly from the checkout:
Commands
| Command | What it does |
|---|---|
cat |
Concatenate one or more datasets and print every row. |
head |
Print the first N rows (default 10). |
tail |
Print the last N rows (default 10). |
take |
Print specific rows by index (see grammar below). |
rowcount |
Print the number of rows. |
sample |
Print N random rows, no replacement. --seed for reproducibility. |
schema |
Print the logical (Arrow) or physical (Lance-native) schema. |
versions |
(Lance) List versions of the dataset. |
branches |
(Lance) List branches of the dataset. |
tags |
(Lance) List tags across every branch. |
indices |
(Lance) List indices defined on the dataset. |
Global flags
| Flag | Default | Purpose |
|---|---|---|
--format <csv|jsonl|table> |
per-cmd | Output format. Defaults to table for versions/branches/tags/indices, jsonl everywhere else. |
--binary-format <...> |
none |
none → BINARY_DATA placeholder; hex → \xHH; base64. |
--columns <a,b,…> |
– | Comma-separated include list. User order is preserved. |
--exclude-columns <a,b,…> |
– | Comma-separated exclude list. Takes precedence over --columns. |
Examples
# How many rows?
# Peek at the first 5 rows as JSONL (default).
# Last 3 rows, CSV, dropping a noisy column.
# Specific rows by index: last row, row 0, rows 2 through 4.
# Reproducible random sample of 100 rows.
# Concatenate two partitions (must share the same schema) and keep two columns.
# Inspect schemas.
Lance versioning, branches and tags
Lance datasets carry a per-branch linear version history; tags are named
references to specific (branch, version) pairs. Three flags select which
state to read from:
| Flag | Meaning |
|---|---|
--branch <name> |
Read from the named branch (default: main). |
--version <N> |
Read version N on the chosen branch. (default: latest version) |
--tag <name> |
Read the tagged (branch, version) |
--version and --tag are mutually exclusive.
# Inspect a previous snapshot.
# List metadata.
Binary columns
Binary payloads can blow up output size and clutter a terminal, so by default they are collapsed to a placeholder:
}
}
}
The placeholder semantics apply recursively: binary nested inside a struct or
list is also rendered as BINARY_DATA under the default.
--indices grammar
take --indices accepts a comma-separated list of expressions. Order is
preserved and duplicates are emitted as-is.
| Expression | Meaning |
|---|---|
N |
single row (negatives count from the end) |
A:B |
inclusive range, both ends |
:B |
rows 0 through B |
A: |
row A through the last row |
-5:-1 |
last 5 rows |
3,1,1,0:2 |
[3, 1, 1, 0, 1, 2] |
Output format notes
JSONL
- One JSON object per line; keys match the projected column order.
NaN/±Infinityemit the strings"NaN"/"Infinity"/"-Infinity".- Timestamps are ISO-8601 (
2024-01-01T00:00:00.000000, with offset when the arrow type carries a timezone). - Lists → JSON arrays; structs → JSON objects; maps → JSON objects with stringified keys.
CSV
- First line is a header row:
col1,col2,col3. Column names containing,, newlines, or quotes are quoted per RFC 4180. - Nulls emit as empty cells;
NaN/inf/-inffor floats. - Nested types (list, struct, map, duration, interval) are rejected, use JSONL for those.
Table
- Pretty Unicode borders on a TTY, ASCII grid when piped (so
arrs … | grepstays sane). - Same primitive rendering as CSV (ISO-8601,
NaN/inf, empty for null). - Nested cells (list, struct, map) are JSON-encoded inside the cell —
e.g.
["id"]for a single-element list. Strictly more permissive than CSV. - Buffers all rows before emitting (column widths require the full table).
Default for the four metadata commands (small row counts), opt-in for
row-producing commands; prefer
jsonl/csvwhen streaming large datasets.