tva 0.2.4

Tab-separated Values Assistant
Documentation

tva

Tab-separated Values Assistant. Fast, reliable TSV processing toolkit in Rust.

Build codecov Crates.io license Documentation

Overview

tva (pronounced "Tee-Va") is a high-performance command-line toolkit written in Rust for processing tabular data. It brings the safety and speed of modern systems programming to the classic Unix philosophy.

Inspiration

Use Cases

  • "Middle Data": Files too large for Excel/Pandas but too small for distributed systems (Spark/Hadoop).
  • Data Pipelines: Robust CLI-based ETL steps compatible with awk, sort, etc.
  • Exploration: Fast summary statistics, sampling, and filtering on raw data.

Design Principles

  • Single Binary: A standalone executable with no dependencies, easy to deploy.
  • Header Aware: Manipulate columns by name or index.
  • Fail-fast: Strict validation ensures data integrity (no silent truncation).
  • Streaming: Stateless processing designed for infinite streams and large files.
  • TSV-first: Prioritizes the reliability and simplicity of tab-separated values.
  • Performance: Single-pass execution with minimal memory overhead.

See Design Documentation for details.

Read the documentation online

Installation

Current release: 0.2.4

# Clone the repository and install via cargo
cargo install --force --path .

Or install the pre-compiled binary via the cross-platform package manager cbp (supports older Linux systems with glibc 2.17+):

cbp install tva

You can also download the pre-compiled binaries from the Releases page.

Running Examples

The examples in the documentation use sample data located in the docs/data/ directory. To run these examples yourself, we recommend cloning the repository:

git clone https://github.com/wang-q/tva.git
cd tva

Then you can run the commands exactly as shown in the docs (e.g., tva select -f 1 docs/data/input.csv).

Alternatively, you can download individual files from the docs/data directory on GitHub.

Commands

Selection & Sampling

  • select: Select and reorder columns.
  • slice: Slice rows by index (keep or drop). Supports multiple ranges and header preservation.
  • sample: Randomly sample rows (Bernoulli, reservoir, weighted).

Filtering

  • filter: Filter rows based on numeric, string, regex, or date criteria.

Ordering

  • sort: Sorts rows based on one or more key fields.
  • reverse: Reverses the order of lines (like tac), optionally keeping the header at the top.
  • transpose: Swaps rows and columns (matrix transposition).

Statistics & Summary

  • stats: Calculate summary statistics (sum, mean, median, min, max, etc.) with grouping.
  • bin: Discretize numeric values into bins (useful for histograms).
  • uniq: Deduplicate rows or count unique occurrences (supports equivalence classes).

Reshaping

  • longer: Reshape wide to long (unpivot). Requires a header row.
  • wider: Reshape long to wide (pivot). Supports aggregation via --op (sum, count, etc.).
  • fill: Fill missing values in selected columns (down/LOCF, const).
  • blank: Replace consecutive identical values in selected columns with empty strings (sparsify).

Combining & Splitting

  • join: Join two files based on common keys (inner, left, outer, anti).
  • append: Concatenate multiple TSV files, handling headers correctly.
  • split: Split a file into multiple files (by size, key, or random).

Formatting & Utilities

  • check: Validate TSV file structure (column counts, encoding).
  • nl: Add line numbers to rows.
  • keep-header: Run a shell command on the body of a TSV file, preserving the header.

Import & Export

  • from: Convert other formats to TSV (csv, xlsx, html).
  • to: Convert TSV to other formats (csv, xlsx, md).

Author

Qiang Wang wang-q@outlook.com

License

MIT. Copyright by Qiang Wang.