tva
Tab-separated Values Assistant. Fast, reliable TSV processing toolkit in Rust.
Overview
tva (pronounced "Tee-Va") is a high-performance command-line toolkit written in Rust for processing tabular data. It brings the safety and speed of modern systems programming to the classic Unix philosophy.
Inspiration
- eBay's tsv-utils (discontinued): The primary reference for functionality and performance.
- GNU Datamash: Statistical operations.
- R's tidyr: Reshaping concepts (longer, wider).
Use Cases
- "Middle Data": Files too large for Excel/Pandas but too small for distributed systems (Spark/Hadoop).
- Data Pipelines: Robust CLI-based ETL steps compatible with awk, sort, etc.
- Exploration: Fast summary statistics, sampling, and filtering on raw data.
Design Principles
- Single Binary: A standalone executable with no dependencies, easy to deploy.
- Header Aware: Manipulate columns by name or index.
- Fail-fast: Strict validation ensures data integrity (no silent truncation).
- Streaming: Stateless processing designed for infinite streams and large files.
- TSV-first: Prioritizes the reliability and simplicity of tab-separated values.
- Performance: Single-pass execution with minimal memory overhead.
See Design Documentation for details.
Installation
Current release: 0.2.4
# Clone the repository and install via cargo
Or install the pre-compiled binary via the cross-platform package manager cbp (supports older Linux systems with glibc 2.17+):
You can also download the pre-compiled binaries from the Releases page.
Running Examples
The examples in the documentation use sample data located in the docs/data/ directory. To run these examples yourself, we recommend cloning the repository:
Then you can run the commands exactly as shown in the docs (e.g., tva select -f 1 docs/data/input.csv).
Alternatively, you can download individual files from the docs/data directory on GitHub.
Commands
Selection & Sampling
- select: Select and reorder columns.
- slice: Slice rows by index (keep or drop). Supports multiple ranges and header preservation.
- sample: Randomly sample rows (Bernoulli, reservoir, weighted).
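As an illustration of the reservoir mode mentioned for sample, the sketch below shows the classic Algorithm R in Python. It is not tva's implementation: the function names `reservoir_sample` and `sample_tsv`, the `seed` parameter, and the choice to treat the first line as a header are all assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(lines: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Return a uniform random sample of up to k lines from a stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Pick a slot in [0, i]; the line is kept only if the slot is inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir


def sample_tsv(lines: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Hypothetical helper: keep the first line as a header and sample k of the remaining rows."""
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return []
    return [header] + reservoir_sample(it, k, seed)


rows = ["id\tvalue"] + [f"{i}\t{i * i}" for i in range(100)]
picked = sample_tsv(rows, 5, seed=1)
```

The reservoir holds at most k rows at any time, so memory use does not grow with the length of the input.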
Filtering
- filter: Filter rows based on numeric, string, regex, or date criteria.
Ordering
- sort: Sorts rows based on one or more key fields.
- reverse: Reverses the order of lines (like tac), optionally keeping the header at the top.
- transpose: Swaps rows and columns (matrix transposition).
Statistics & Summary
- stats: Calculate summary statistics (sum, mean, median, min, max, etc.) with grouping.
- bin: Discretize numeric values into bins (useful for histograms).
- uniq: Deduplicate rows or count unique occurrences (supports equivalence classes).
Reshaping
- longer: Reshape wide to long (unpivot). Requires a header row.
- wider: Reshape long to wide (pivot). Supports aggregation via --op (sum, count, etc.).
- fill: Fill missing values in selected columns (down/LOCF, const).
- blank: Replace consecutive identical values in selected columns with empty strings (sparsify).
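The fill (down/LOCF) and blank operations described above are roughly inverses of each other. The Python sketch below illustrates the idea only and is not how tva implements them: the function names `fill_down` and `blank_repeats`, the zero-based column indices, and the assumption that a missing value is an empty string are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Sequence


def fill_down(rows: Iterable[Sequence[str]], cols: Sequence[int]) -> Iterator[List[str]]:
    """Replace empty cells in the chosen columns with the last non-empty value seen (LOCF)."""
    last: Dict[int, str] = {}
    for row in rows:
        out = list(row)
        for c in cols:
            if out[c] == "":
                if c in last:
                    out[c] = last[c]
            else:
                last[c] = out[c]
        yield out


def blank_repeats(rows: Iterable[Sequence[str]], cols: Sequence[int]) -> Iterator[List[str]]:
    """Replace a value with an empty string when it equals the previous row's value."""
    prev: Dict[int, str] = {}
    for row in rows:
        out = list(row)
        for c in cols:
            value = out[c]
            if c in prev and prev[c] == value:
                out[c] = ""
            prev[c] = value  # compare against the original value, not the blanked one
        yield out


dense = [["a", "1"], ["a", "2"], ["b", "3"], ["b", "4"]]
sparse = list(blank_repeats(dense, [0]))
restored = list(fill_down(sparse, [0]))
```

Both functions are generators that keep only one remembered value per selected column, which fits the streaming design described earlier in this README.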
Combining & Splitting
- join: Join two files based on common keys (inner, left, outer, anti).
- append: Concatenate multiple TSV files, handling headers correctly.
- split: Split a file into multiple files (by size, key, or random).
Formatting & Utilities
- check: Validate TSV file structure (column counts, encoding).
- nl: Add line numbers to rows.
- keep-header: Run a shell command on the body of a TSV file, preserving the header.
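To make the column-count validation behind check more concrete, here is a minimal Python sketch of a fail-fast structure check. It is an assumption-laden illustration, not tva's code: the function name `check_tsv`, the error message format, and the rule that the first line fixes the expected field count are all invented for this example, and encoding validation is left out.

```python
from typing import Iterable, Tuple


def check_tsv(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (line_count, field_count); raise ValueError at the first inconsistent line."""
    expected = None
    count = 0
    for lineno, line in enumerate(lines, start=1):
        # Fields are separated by tabs, so field count is tab count plus one.
        n = line.rstrip("\r\n").count("\t") + 1
        if expected is None:
            expected = n  # the first line defines the expected width
        elif n != expected:
            raise ValueError(f"line {lineno}: expected {expected} fields, found {n}")
        count += 1
    return count, expected or 0


good = ["name\tage\n", "ann\t31\n", "bob\t27\n"]
bad = ["name\tage\n", "ann\t31\n", "bob\n"]
result = check_tsv(good)
```

Stopping at the first bad line, rather than padding or truncating it, matches the fail-fast principle listed under Design Principles.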
Import & Export
- from: Convert other formats to TSV (csv, xlsx, html).
- to: Convert TSV to other formats (csv, xlsx, md).
Author
Qiang Wang <wang-q@outlook.com>
License
MIT. Copyright by Qiang Wang.