tva

Tab-separated Values Assistant. Fast, reliable TSV processing toolkit in Rust.

Overview

tva (pronounced "Tee-Va") is a high-performance command-line toolkit written in Rust for processing tabular data. It brings the safety and speed of modern systems programming to the classic Unix philosophy.

Inspiration

eBay's tsv-utils (discontinued): The primary reference for functionality and performance.
GNU Datamash: Statistical operations.
R's tidyr: Reshaping concepts (longer, wider).

Use Cases

"Middle Data": Files too large for Excel/Pandas but too small for distributed systems (Spark/Hadoop).
Data Pipelines: Robust CLI-based ETL steps compatible with awk, sort, etc.
Exploration: Fast summary statistics, sampling, and filtering on raw data.

Design Principles

Single Binary: A standalone executable with no dependencies, easy to deploy.
Header Aware: Manipulate columns by name or index.
Fail-fast: Strict validation ensures data integrity (no silent truncation).
Streaming: Stateless processing designed for infinite streams and large files.
TSV-first: Prioritizes the reliability and simplicity of tab-separated values.
Performance: Single-pass execution with minimal memory overhead.

See Design Documentation for details.

Read the documentation online

Installation

Current release: 0.2.4

# Clone the repository and install via cargo
cargo install --force --path .

Or install the pre-compiled binary via the cross-platform package manager cbp (supports older Linux systems with glibc 2.17+):

cbp install tva

You can also download the pre-compiled binaries from the Releases page.

Running Examples

The examples in the documentation use sample data located in the docs/data/ directory. To run these examples yourself, we recommend cloning the repository:

git clone https://github.com/wang-q/tva.git
cd tva

Then you can run the commands exactly as shown in the docs (e.g., tva select -f 1 docs/data/input.csv).

Alternatively, you can download individual files from the docs/data directory on GitHub.

Commands

Selection & Sampling

select: Select and reorder columns.
slice: Slice rows by index (keep or drop). Supports multiple ranges and header preservation.
sample: Randomly sample rows (Bernoulli, reservoir, weighted).

Filtering

filter: Filter rows based on numeric, string, regex, or date criteria.

Ordering

sort: Sorts rows based on one or more key fields.
reverse: Reverses the order of lines (like tac), optionally keeping the header at the top.
transpose: Swaps rows and columns (matrix transposition).

Statistics & Summary

stats: Calculate summary statistics (sum, mean, median, min, max, etc.) with grouping.
bin: Discretize numeric values into bins (useful for histograms).
uniq: Deduplicate rows or count unique occurrences (supports equivalence classes).

Reshaping

longer: Reshape wide to long (unpivot). Requires a header row.
wider: Reshape long to wide (pivot). Supports aggregation via --op (sum, count, etc.).
fill: Fill missing values in selected columns (down/LOCF, const).
blank: Replace consecutive identical values in selected columns with empty strings (sparsify).

Combining & Splitting

join: Join two files based on common keys (inner, left, outer, anti).
append: Concatenate multiple TSV files, handling headers correctly.
split: Split a file into multiple files (by size, key, or random).

Formatting & Utilities

check: Validate TSV file structure (column counts, encoding).
nl: Add line numbers to rows.
keep-header: Run a shell command on the body of a TSV file, preserving the header.

Import & Export

from: Convert other formats to TSV (csv, xlsx, html).
to: Convert TSV to other formats (csv, xlsx, md).

Author

Qiang Wang wang-q@outlook.com

tva 0.2.4

tva