# arrs
A small command-line tool for inspecting Arrow-based datasets. Today it reads
[Lance](https://lance.org/) datasets, but the core is format-agnostic, so
other Arrow-backed formats can be added without touching commands or output.
## Features
- Stream or random-access any Lance dataset from the shell.
- Print rows as **JSONL**, **CSV**, or a pretty **table**.
- Project columns with `--columns` / `--exclude-columns`.
- Choose how binary payloads are rendered: hidden behind a placeholder, hex
(`\xHH`), or base64.
- Render timestamps as ISO-8601, handle `NaN`/`Infinity` floats, and preserve
nested lists and structs in JSONL.
## Install
### Via `uv`
```sh
uv tool install rust-arrs
```
### From the repository
```sh
# From a clone of this repo:
cargo install --path .
# Or run directly from the checkout:
cargo run --release -- <command> [args…]
```
## Commands
| Command | Description |
| --- | --- |
| `cat` | Concatenate one or more datasets and print every row. |
| `head` | Print the first `N` rows (default `10`). |
| `tail` | Print the last `N` rows (default `10`). |
| `take` | Print specific rows by index (see grammar below). |
| `rowcount` | Print the number of rows. |
| `sample` | Print `N` random rows, sampled without replacement. Pass `--seed` for reproducibility. |
| `schema` | Print the logical (Arrow) or physical (Lance-native) schema. |
| `versions` | (Lance) List versions of the dataset. |
| `branches` | (Lance) List branches of the dataset. |
| `tags` | (Lance) List tags across every branch. |
| `indices` | (Lance) List indices defined on the dataset. |
## Global flags
| Flag | Default | Description |
| --- | --- | --- |
| `--format <csv\|jsonl\|table>` | per command | Output format. Defaults to `table` for `versions`/`branches`/`tags`/`indices`, `jsonl` everywhere else. |
| `--binary-format <...>` | `none` | `none` → `BINARY_DATA` placeholder; `hex` → `\xHH`; `base64` → base64 text. |
| `--columns <a,b,…>` | – | Comma-separated include list. User order is preserved. |
| `--exclude-columns <a,b,…>` | – | Comma-separated exclude list. Takes precedence over `--columns`. |

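
The interaction between `--columns` and `--exclude-columns` can be modelled in a few lines of Python. This is only an illustrative sketch, not the tool's implementation: `project_columns` is a hypothetical helper, and two behaviours are assumptions rather than documented facts (all columns in schema order when no include list is given, and an error on unknown column names).

```python
def project_columns(schema_columns, include=None, exclude=None):
    """Return the columns to print, in output order (illustrative model only)."""
    excluded = set(exclude or [])
    # With an include list, the user's order wins.
    # Assumption: without one, every column is kept in schema order.
    selected = list(include) if include else list(schema_columns)
    # Assumption: naming a column that is not in the schema is an error.
    unknown = [name for name in selected if name not in schema_columns]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")
    # Exclusion is applied last, so it overrides inclusion.
    return [name for name in selected if name not in excluded]


schema = ["id", "score", "raw_tokens"]
print(project_columns(schema, include=["score", "id"]))
print(project_columns(schema, include=["score", "id"], exclude=["id"]))
```

The first call keeps the user's `score, id` order; the second drops `id` because the exclude list takes precedence.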
## Examples
```sh
# How many rows?
arrs rowcount dataset.lance
# Peek at the first 5 rows as JSONL (default).
arrs head -n 5 dataset.lance
# Last 3 rows, CSV, dropping a noisy column.
arrs tail -n 3 --format csv --exclude-columns raw_tokens dataset.lance
# Specific rows by index: last row, row 0, rows 2 through 4.
arrs take --indices '-1,0,2:4' dataset.lance
# Reproducible random sample of 100 rows.
arrs sample -n 100 --seed 42 dataset.lance
# Concatenate two partitions (must share the same schema) and keep two columns.
arrs cat --columns id,score part_a.lance part_b.lance
# Inspect schemas.
arrs schema dataset.lance # arrow (logical)
arrs schema --type physical dataset.lance # lance-native (field ids, encoding…)
```
### Lance versioning, branches and tags
Lance datasets carry a per-branch linear version history; tags are named
references to specific `(branch, version)` pairs. Three flags select which
state to read from:

| Flag | Meaning |
| --- | --- |
| `--branch <name>` | Read from the named branch (default: `main`). |
| `--version <N>` | Read version `N` on the chosen branch (default: the latest version). |
| `--tag <name>` | Read the tagged `(branch, version)` pair. |

`--version` and `--tag` are mutually exclusive.
```sh
# Inspect a previous snapshot.
arrs head -n 5 --version 3 dataset.lance
arrs rowcount --tag release-2026-04 dataset.lance
arrs cat --branch dev --columns id,score dataset.lance
# List metadata.
arrs versions dataset.lance # every version on main
arrs versions --tagged-only dataset.lance # only tagged versions
arrs versions --branch dev dataset.lance # every version on `dev`
arrs branches dataset.lance
arrs tags dataset.lance # cross-branch tag listing
arrs indices dataset.lance
```
### Binary columns
Binary payloads can blow up output size and clutter a terminal, so by default
they are collapsed to a placeholder:
```sh
$ arrs head -n 1 dataset.lance
{"id":1,"data":"BINARY_DATA",…}
$ arrs head -n 1 --binary-format hex dataset.lance
{"id":1,"data":"\\x48\\x65\\x6c\\x6c\\x6f",…}
$ arrs head -n 1 --binary-format base64 dataset.lance
{"id":1,"data":"SGVsbG8=",…}
```
The placeholder semantics apply recursively: binary nested inside a struct or
list is also rendered as `BINARY_DATA` under the default.
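
If a downstream script needs the original bytes back, the `hex` and `base64` renderings can be reversed. The sketch below assumes the hex form is strictly a run of `\xHH` escapes, as in the example output above; `decode_hex` and `decode_base64` are hypothetical helper names, and the sample lines are the outputs above with the elided fields removed.

```python
import base64
import json
import re

# A hex payload is assumed to be zero or more \xHH escapes and nothing else.
_HEX_RUN = re.compile(r"(?:\\x[0-9a-fA-F]{2})*")


def decode_hex(text: str) -> bytes:
    """Turn a string of \\xHH escapes back into raw bytes."""
    if not _HEX_RUN.fullmatch(text):
        raise ValueError("not a \\xHH-encoded payload")
    return bytes.fromhex(text.replace("\\x", ""))


def decode_base64(text: str) -> bytes:
    """Turn a base64 string back into raw bytes, rejecting stray characters."""
    return base64.b64decode(text, validate=True)


# Raw string: the JSON text contains doubled backslashes, as printed by the tool.
hex_line = r'{"id":1,"data":"\\x48\\x65\\x6c\\x6c\\x6f"}'
b64_line = '{"id":1,"data":"SGVsbG8="}'

print(decode_hex(json.loads(hex_line)["data"]))
print(decode_base64(json.loads(b64_line)["data"]))
```

Both calls recover the same five bytes, `b'Hello'`. The `BINARY_DATA` placeholder carries no payload, so there is nothing to decode under the default.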
### `--indices` grammar
`take --indices` accepts a comma-separated list of expressions. Order is
preserved and duplicates are emitted as-is.

| Expression | Meaning |
| --- | --- |
| `N` | single row (negatives count from the end) |
| `A:B` | inclusive range, both ends |
| `:B` | rows 0 through B |
| `A:` | row A through the last row |
| `-5:-1` | last 5 rows |
| `3,1,1,0:2` | `[3, 1, 1, 0, 1, 2]` |

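
The grammar is small enough to model in a few lines. The following Python sketch expands an `--indices` expression into concrete row numbers for a dataset of `num_rows` rows. It is an illustration of the rules in the table, not the tool's parser: `expand_indices` is a hypothetical name, and the handling of out-of-range indices (an error) and of ranges whose start lies after their end (no rows) is an assumption.

```python
def expand_indices(spec: str, num_rows: int) -> list[int]:
    """Expand e.g. '3,1,1,0:2' into [3, 1, 1, 0, 1, 2] (illustrative model)."""

    def resolve(token: str) -> int:
        index = int(token)
        if index < 0:
            index += num_rows  # negatives count from the end
        # Assumption: an index outside the dataset is an error.
        if not 0 <= index < num_rows:
            raise IndexError(f"row {token} out of range for {num_rows} rows")
        return index

    rows: list[int] = []
    for expr in spec.split(","):
        expr = expr.strip()
        if not expr:
            raise ValueError("empty expression in index list")
        if ":" in expr:
            start_text, end_text = (part.strip() for part in expr.split(":", 1))
            # An omitted bound defaults to the first or last row.
            start = resolve(start_text) if start_text else 0
            end = resolve(end_text) if end_text else num_rows - 1
            # Ranges are inclusive on both ends.
            # Assumption: start > end yields no rows.
            rows.extend(range(start, end + 1))
        else:
            rows.append(resolve(expr))
    return rows


print(expand_indices("3,1,1,0:2", 10))
print(expand_indices("-5:-1", 10))
```

For a 10-row dataset, the first call reproduces the last row of the table, and the second yields rows 5 through 9, the last five.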
## Output format notes
**JSONL**
- One JSON object per line; keys match the projected column order.
- `NaN` / `±Infinity` emit the strings `"NaN"` / `"Infinity"` / `"-Infinity"`.
- Timestamps are ISO-8601 (`2024-01-01T00:00:00.000000`, with a UTC offset
when the Arrow type carries a timezone).
- Lists → JSON arrays; structs → JSON objects; maps → JSON objects with
stringified keys.
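
Because non-finite floats arrive as JSON strings, a consumer that wants numeric values has to convert them back. A minimal Python sketch follows, assuming the caller already knows which top-level columns are floats; `restore_floats` is a hypothetical helper, and the sample lines are made up for illustration.

```python
import json
import math

NON_FINITE = {"NaN", "Infinity", "-Infinity"}


def restore_floats(row: dict, float_columns: set) -> dict:
    """Convert "NaN"/"Infinity"/"-Infinity" strings back to floats.

    Only the named top-level columns are touched, so a string column that
    happens to contain the text "NaN" is left alone.
    """
    restored = dict(row)
    for name in float_columns:
        value = restored.get(name)
        if isinstance(value, str) and value in NON_FINITE:
            restored[name] = float(value)  # float() accepts all three spellings
    return restored


# Hypothetical output lines, one JSON object per line.
lines = [
    '{"id":1,"score":"NaN"}',
    '{"id":2,"score":"-Infinity"}',
    '{"id":3,"score":0.5}',
]
rows = [restore_floats(json.loads(line), {"score"}) for line in lines]
print(math.isnan(rows[0]["score"]), rows[1]["score"], rows[2]["score"])
```

Floats nested inside lists or structs would need a recursive variant of the same check.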
**CSV**
- First line is a header row: `col1,col2,col3`. Column names
containing `,`, newlines, or quotes are quoted per RFC 4180.
- Nulls are emitted as empty cells; non-finite floats as `NaN` / `inf` / `-inf`.
- Nested types (list, struct, map, duration, interval) are rejected; use
JSONL for those.
**Table**
- Pretty Unicode borders on a TTY, ASCII grid when piped (so
`arrs … | grep` stays sane).
- Same primitive rendering as CSV (ISO-8601, `NaN`/`inf`, empty for null).
- Nested cells (list, struct, map) are JSON-encoded inside the cell —
e.g. `["id"]` for a single-element list. Strictly more permissive than CSV.
- **Buffers all rows** before emitting (column widths require the full table).
It is the default for the four metadata commands, which have small row counts,
and opt-in for row-producing commands; prefer `jsonl`/`csv` when streaming
large datasets.