qsv 0.26.2

A high performance CSV command line toolkit.

qsv: Ultra-fast CSV data-wrangling CLI toolkit

qsv is a command line program for indexing, slicing, analyzing, splitting, enriching, validating & joining CSV files. Commands are simple, fast and composable.

NOTE: qsv is a fork of the popular xsv utility, merging several pending PRs since xsv 0.13.0's release, along with additional features & commands for data-wrangling. See FAQ for more details. (NEW and EXTENDED commands are marked accordingly).

Available commands

Command Description
apply Apply a series of string, date, currency & geocoding transformations to a CSV column. It also has some basic NLP functions (similarity, sentiment analysis, profanity, eudex & language detection). (NEW)
behead Drop headers from a CSV. (NEW)
cat Concatenate CSV files by row or by column.
count[^1] Count the rows in a CSV file. (Instantaneous with an index.)
dedup[^2] Remove redundant rows. (NEW)
enum Add a new column enumerating rows by adding a column of incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value. (NEW)
exclude[^1] Removes a set of CSV data from another set based on the specified columns. (NEW)
explode Explode rows into multiple ones by splitting a column value based on the given separator. (NEW)
fetch Fetches HTML/data from web pages or web services for every row in a URL column. (NEW/WIP)
fill Fill empty values. (NEW)
fixlengths Force a CSV to have same-length records by either padding or truncating them.
flatten A flattened view of CSV records. Useful for viewing one record at a time, e.g. qsv slice -i 5 data.csv | qsv flatten.
fmt Reformat a CSV with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.) (EXTENDED)
foreach Loop over a CSV to execute bash commands. (*nix only) (NEW)
frequency[^1][^3] Build frequency tables of each column. (Uses parallelism to go faster if an index is present.)
generate Generate test data by profiling a CSV using Markov decision process machine learning. (NEW)
headers Show the headers of a CSV. Or show the intersection of all headers between many CSV files.
index Create an index for a CSV. This is very quick & provides constant time indexing into the CSV file.
input Read a CSV with exotic quoting/escaping rules.
join[^1] Inner, outer, cross, anti & semi joins. Uses a simple hash index to make it fast. (EXTENDED)
jsonl Convert newline-delimited JSON to CSV. (NEW)
lua Execute a Lua script over CSV lines to transform, aggregate or filter them. (NEW)
partition Partition a CSV based on a column value.
pseudo Pseudonymise the values of the given column by replacing them with incremental identifiers. (NEW)
rename Rename the columns of a CSV efficiently. (NEW)
replace Replace CSV data using a regex. (NEW)
reverse[^2] Reverse order of rows in a CSV. Unlike the sort --reverse command, it preserves the order of rows with the same key. (NEW)
sample[^1] Randomly draw rows (with optional seed) from a CSV using reservoir sampling (i.e., use memory proportional to the size of the sample). (EXTENDED)
search Run a regex over a CSV. Applies the regex to each field individually & shows only matching rows. (EXTENDED)
searchset Run multiple regexes over a CSV in a single pass. Applies the regexes to each field individually & shows only matching rows. (NEW)
select Select, re-order, duplicate or drop columns. (EXTENDED)
slice[^1][^2] Slice rows from any part of a CSV. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice). (EXTENDED)
sort Sorts CSV data in alphabetical, numerical, reverse or random (with optional seed) order. (EXTENDED)
split[^1][^3] Split one CSV file into many CSV files of N chunks. (EXTENDED)
stats[^1][^2][^3] Show data type & descriptive statistics of each column in a CSV. (i.e., sum, min/max, min/max length, mean, stddev, variance, quartiles, IQR, lower/upper fences, skew, median, mode, cardinality & nullcount) (EXTENDED)
table[^2] Show aligned output of a CSV using elastic tabstops. (EXTENDED)
transpose[^2] Transpose rows/columns of a CSV. (NEW)

[^1]: uses an index when available. join always uses indices.
[^2]: loads the entire CSV into memory. Note that stats & transpose have modes that do not load the entire CSV into memory.
[^3]: runs parallel jobs by default (use --jobs option to adjust)
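
Because each command reads and writes CSV, commands can be chained with ordinary shell pipes. The sketch below wraps two such pipelines in shell functions. The function names peek_record and index_and_count are hypothetical helpers, not part of qsv, and data.csv in the usage comments is a placeholder file name.

```shell
# Hypothetical helper: show one record vertically, one field per line.
# Uses the same pipeline as the flatten example in the command list.
# Usage: peek_record FILE ROW_INDEX
peek_record() {
    qsv slice -i "$2" "$1" | qsv flatten
}

# Hypothetical helper: build an index, then count rows.
# Per the command list, count is instantaneous once an index exists.
# Usage: index_and_count FILE
index_and_count() {
    qsv index "$1" && qsv count "$1"
}

# Example calls (data.csv is a placeholder):
#   peek_record data.csv 5
#   index_and_count data.csv
```

Wrapping pipelines in functions like this is only a convenience; the same commands can be typed directly at the prompt.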

Installation

Binaries for Windows, Linux and macOS are available from GitHub.

Alternatively, you can compile from source by installing Cargo (Rust's package manager) and installing qsv using Cargo:

cargo install qsv

If you encounter compilation errors, ensure you're using the exact versions of the dependencies qsv was built with by issuing:

cargo install qsv --locked

Compiling from this repository also works similarly:

git clone https://github.com/jqnatividad/qsv.git
cd qsv
cargo build --release
# or if you encounter compilation errors
cargo build --release --locked

The compiled binary will end up in ./target/release/qsv.

To enable optional features, use the --features or --all-features options, e.g.:

cargo install qsv --features apply,generate,selfupdate,lua,foreach
# or
cargo install qsv --all-features

# or when compiling from a local repo
cargo build --release --features apply,generate,selfupdate,lua,foreach
# or
cargo build --release --all-features

Minimum Supported Rust Version

Building qsv requires Rust version 1.56+.

Tab Completion

qsv's command-line options are quite extensive. Thankfully, since it uses docopt for CLI processing, we can take advantage of docopt.rs' tab completion support to make it easier to use qsv at the command-line (currently, only bash shell is supported):

# install docopt-wordlist
cargo install docopt

# IMPORTANT: run these commands from the root directory of your qsv git repository
# to setup bash qsv tab completion
echo "DOCOPT_WORDLIST_BIN=\"$(which docopt-wordlist)\"" >> $HOME/.bash_completion
echo "source \"$(pwd)/scripts/docopt-wordlist.bash\"" >> $HOME/.bash_completion
echo "complete -F _docopt_wordlist_commands qsv" >> $HOME/.bash_completion

Recognized file formats

qsv recognizes CSV (.csv file extension) and TSV files (.tsv and .tab file extensions). CSV files are assumed to have "," (comma) as a delimiter, and TSV files, "\t" (tab) as a delimiter. The delimiter is a single ASCII character that can be set either by the --delimiter command-line option or with the QSV_DEFAULT_DELIMITER environment variable.
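
The rule above can be sketched as a small shell function. infer_delimiter is a hypothetical name used only for illustration, and treating every unrecognized extension as comma-delimited is an assumption made for the sketch, not documented qsv behavior.

```shell
# Hypothetical sketch of the delimiter rule: QSV_DEFAULT_DELIMITER wins
# when set; otherwise the file extension decides.
# Usage: infer_delimiter FILE
infer_delimiter() {
    if [ -n "${QSV_DEFAULT_DELIMITER:-}" ]; then
        printf '%s' "$QSV_DEFAULT_DELIMITER"
        return
    fi
    case "${1##*.}" in
        tsv|tab) printf '\t' ;;   # TSV files: tab
        *)       printf ',' ;;    # CSV (and, by assumption, anything else): comma
    esac
}

# Example calls (file names are placeholders):
#   infer_delimiter data.csv    # comma
#   infer_delimiter data.tab    # tab
```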

Environment Variables

  • QSV_DEFAULT_DELIMITER - single ASCII character to use as delimiter. Overrides --delimiter option. Defaults to "," (comma) for CSV files and "\t" (tab) for TSV files, when not set. Note that this will also set the delimiter for qsv's output.
  • QSV_NO_HEADERS - when set, the first row will NOT be interpreted as headers. Supersedes QSV_TOGGLE_HEADERS.
  • QSV_TOGGLE_HEADERS - if set to 1, inverts qsv's header behavior: no headers becomes the default, and passing --no-headers means the first row is interpreted as headers.
  • QSV_MAX_JOBS - number of jobs to use for multi-threaded commands (currently frequency, split and stats). If not set, max_jobs is set to number of logical processors divided by three. See Parallelization for more info.
  • QSV_REGEX_UNICODE - if set, makes the search, searchset and replace commands unicode-aware. By default, for increased performance, these commands are not unicode-aware: they ignore unicode values when matching and panic when unicode characters are used in the regex.
  • QSV_RDR_BUFFER_CAPACITY - set to change reader buffer size (bytes - default when not set: 16384)
  • QSV_WTR_BUFFER_CAPACITY - set to change writer buffer size (bytes - default when not set: 65536)
  • QSV_COMMENTS - set to a comment character which will ignore any lines (including the header) that start with this character (default: comments disabled).
  • QSV_LOG_LEVEL - set to the desired logging level: off (default), error, warn, info, debug or trace.
  • QSV_LOG_DIR - when logging is enabled, the directory where the log files will be stored. If the specified directory does not exist, qsv will attempt to create it. If not set, the log files are created in the directory where qsv was started. See Logging for more info.
  • QSV_NO_UPDATE - prohibit self-update version check of the latest qsv release published on GitHub.

NOTE: To get a list of all environment variables with the QSV_ prefix, run qsv --envlist.
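
As a minimal sketch, the variables above can be exported for the current shell session before running qsv. The values shown are illustrative assumptions, not recommended settings, and setting QSV_NO_UPDATE to 1 assumes that any non-empty value is enough to enable it.

```shell
# Illustrative session setup; every value here is an assumption.
export QSV_DEFAULT_DELIMITER=';'               # single ASCII character
export QSV_LOG_LEVEL='info'                    # off, error, warn, info, debug or trace
export QSV_LOG_DIR="${TMPDIR:-/tmp}/qsv-logs"  # qsv attempts to create it if missing
export QSV_NO_UPDATE=1                         # assumed: any value disables the update check

# Show what is now set in the environment, without needing qsv itself.
env | grep '^QSV_' | sort
```

To scope a setting to a single invocation instead, prefix the command with the assignment, e.g. QSV_LOG_LEVEL=debug qsv stats data.csv (data.csv is a placeholder).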

Feature Flags

  • mimalloc (default) - use the mimalloc allocator.
  • apply - enable apply command. This swiss-army knife of CSV transformations is very powerful, but it has a lot of dependencies that increase both compile time and binary size.
  • generate - enable generate command. The test data generator also has a large dependency tree.

The following two commands are also very powerful, but they can be abused and present "foot-shooting" scenarios.

  • lua - enable lua command.
  • foreach - enable foreach command.

Performance Tuning

CPU Optimization

Modern CPUs have various features that the Rust compiler can take advantage of to increase performance. If you want the compiler to take advantage of these CPU-specific speed-ups, set this environment variable BEFORE installing/compiling qsv:

On Linux and macOS:

export CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'

On Windows Powershell:

$env:CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'

Do note though that the resulting binary will only run on machines with the same architecture as the machine you installed/compiled from.
To find out your CPU architecture and other valid values for target-cpu:

rustc --print target-cpus

# to find out what CPU features are used by the Rust compiler WITHOUT specifying target-cpu
rustc --print cfg | grep -i target_feature

# to find out what additional CPU features will be used by the Rust compiler when you specify target-cpu=native
rustc --print cfg -C target-cpu=native | grep -i target_feature

# to get a short explanation of each CPU target-feature
rustc --print target-features

Memory Allocator

By default, qsv uses an alternative allocator - mimalloc, a performance-oriented allocator from Microsoft. If you want to use the standard allocator, use the --no-default-features flag when installing/compiling qsv, e.g.:

cargo install qsv --no-default-features

or

cargo build --release --no-default-features

To find out what memory allocator qsv is using, run qsv --version. After the qsv version number, the allocator used is displayed ("standard" or "mimalloc"). Note that mimalloc is not supported on the x86_64-pc-windows-gnu and arm targets, and you'll need to use the "standard" allocator on those platforms.

Buffer size

Depending on your filesystem's configuration (e.g. block size, file system type, writing to remote file systems (e.g. sshfs, efs, nfs), SSD or rotating magnetic disks, etc.), you can also fine-tune qsv's read/write buffers.

By default, the read buffer size is set to 16k. You can change it by setting the environment variable QSV_RDR_BUFFER_CAPACITY in bytes.

The same is true with the write buffer (default: 64k) with the QSV_WTR_BUFFER_CAPACITY environment variable.
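
Since both variables take a size in bytes, a small conversion helper avoids arithmetic slips. This is a sketch only: kib_to_bytes is a hypothetical helper, and the 128k and 512k sizes are arbitrary illustrative values, not tuning advice.

```shell
# Hypothetical helper: convert KiB to bytes.
kib_to_bytes() {
    echo $(( $1 * 1024 ))
}

# The documented defaults, for reference.
kib_to_bytes 16   # prints 16384 (default read buffer)
kib_to_bytes 64   # prints 65536 (default write buffer)

# Illustrative larger buffers for the current session (arbitrary values).
export QSV_RDR_BUFFER_CAPACITY="$(kib_to_bytes 128)"
export QSV_WTR_BUFFER_CAPACITY="$(kib_to_bytes 512)"
```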

Parallelization

Several commands support parallelization/multi-threading - stats, frequency and split.

Previously, these commands spawned a number of jobs equal to the number of logical processors. After extensive benchmarking, it turns out doing so often results in the multi-threaded runs running slower than single-threaded runs.

Parallelized jobs do increase performance - to a point. After a certain number of threads, returns not only diminish; the parallelization overhead actually results in slower runs.

Starting with qsv 0.22.0, a heuristic of setting the maximum number of jobs to the number of logical processors divided by 3 is applied. The user can still manually override this using the --jobs command-line option or the QSV_MAX_JOBS environment variable, but testing shows negative returns start at around this point.

These observations were gathered using the benchmark script, using a relatively large file (520 MB, 41-column, 1M-row sample of NYC's 311 data). Performance will vary based on environment - CPU architecture, amount of memory, operating system, I/O speed, and the number of background tasks, so this heuristic will not work for every situation.

To find out your jobs setting, call qsv --version. The second to the last number is the number of jobs qsv will use for multi-threaded commands. The last number is the number of logical processors detected by qsv.
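
The divide-by-three heuristic can be approximated in the shell to see what it would give on a particular machine. This is a sketch under stated assumptions, not qsv's actual implementation: logical_cpus and default_jobs are hypothetical helper names, and integer division with a floor of one job is assumed.

```shell
# Hypothetical helper: number of logical processors, falling back to 1.
logical_cpus() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Hypothetical helper: logical processors divided by 3 (integer division),
# never less than 1 (the floor of 1 is an assumption).
# Usage: default_jobs [CPU_COUNT]
default_jobs() {
    cpus="${1:-$(logical_cpus)}"
    n=$(( cpus / 3 ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

default_jobs 12   # prints 4
default_jobs 2    # prints 1
default_jobs      # value for the current machine
```

To override the heuristic for real runs, pass --jobs to a multi-threaded command or export QSV_MAX_JOBS.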

Benchmarking for Performance

Use and fine-tune the benchmark script when tweaking qsv's performance to your environment. Don't be afraid to change the benchmark data and the qsv commands to something that is more representative of your workloads.

Use the generated benchmark TSV files to measure and compare performance across platforms. You'd be surprised how performance varies across environments - e.g. qsv's join performs abysmally on Windows' WSL running Ubuntu 20.04 LTS, taking 172.44 seconds. On the same machine, running in a VirtualBox VM with the same Ubuntu version, join finished in 1.34 seconds - two orders of magnitude faster!

However, stats runs nearly twice as fast on WSL as on the VirtualBox VM - 2.80 seconds vs 5.33 seconds for the stats_index benchmark.
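
The ratios behind these comparisons can be checked with a one-line calculation. The speedup function below is a hypothetical helper that divides the slower time by the faster one; the timings are the ones quoted above.

```shell
# Hypothetical helper: ratio of a slower time to a faster time, one decimal place.
# Usage: speedup SLOW_SECONDS FAST_SECONDS
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

speedup 172.44 1.34   # join, WSL vs VirtualBox: prints 128.7
speedup 5.33 2.80     # stats_index, VirtualBox vs WSL: prints 1.9
```

The same helper works for timings taken from your own benchmark runs.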

License

Dual-licensed under MIT or the UNLICENSE.

Sponsor

qsv was made possible by datHere - Data Infrastructure Engineering.
Standards-based, best-of-breed, open source solutions to make your Data Useful, Usable & Used.

Naming Collision

This project is unrelated to Intel's Quick Sync Video.