qsv is a command line program for indexing, slicing, analyzing, splitting and joining CSV files. Commands should be simple, fast and composable:
- Simple tasks should be easy.
- Performance trade offs should be exposed in the CLI interface.
- Composition should not come at the expense of performance.
This README contains information on how to
install qsv
, in addition to
a quick tour of several commands.
Dual-licensed under MIT or the UNLICENSE.
NOTE: qsv is a fork of the popular xsv CSV utility, merging several pending PRs since the 0.13.0 release that I found useful. It also implements percentage sampling. |
---|
Available commands
- cat - Concatenate CSV files by row or by column.
- count - Count the rows in a CSV file. (Instantaneous with an index.)
- fill - Fill empty values. (NEW)
- fixlengths - Force a CSV file to have same-length records by either padding or truncating them.
- flatten - A flattened view of CSV records. Useful for viewing one record
at a time. e.g.,
qsv slice -i 5 data.csv | qsv flatten
. - fmt - Reformat CSV data with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.)
- frequency - Build frequency tables of each column in CSV data. (Uses parallelism to go faster if an index is present.)
- headers - Show the headers of CSV data. Or show the intersection of all headers between many CSV files.
- index - Create an index for a CSV file. This is very quick and provides constant time indexing into the CSV file.
- input - Read CSV data with exotic quoting/escaping rules.
- join - Inner, outer and cross joins. Uses a simple hash index to make it fast.
- partition - Partition CSV data based on a column value.
- sample - Randomly draw rows from CSV data using reservoir sampling (i.e., use memory proportional to the size of the sample). (EXTENDED)
- rename - Rename the columns of CSV data efficiently. (NEW)
- reverse - Reverse order of rows in CSV data.
- search - Run a regex over CSV data. Applies the regex to each field individually and shows only matching rows.
- select - Select or re-order columns from CSV data. (EXTENDED)
- slice - Slice rows from any part of a CSV file. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice).
- sort - Sort CSV data.
- split - Split one CSV file into many CSV files of N chunks.
- stats - Show basic types and statistics of each column in the CSV file. (i.e., mean, standard deviation, median, range, nullcount, etc.) (EXTENDED)
- table - Show aligned output of any CSV data using elastic tabstops.
- transpose - Transpose rows/columns of CSV data. (NEW)
A whirlwind tour
Let's say you're playing with some of the data from the Data Science Toolkit, which contains several CSV files. Maybe you're interested in the population counts of each city in the world. So grab the data and start examining it:
The next thing you might want to do is get an overview of the kind of data that
appears in each column. The stats
command will do this for you:
|
The qsv table
command takes any CSV data and formats it into aligned columns
using elastic tabstops. You'll
notice that it even gets alignment right with respect to Unicode characters.
So, this command takes about 12 seconds to run on my machine, but we can speed it up by creating an index and re-running the command:
|
Which cuts it down to about 8 seconds on my machine. (And creating the index takes less than 2 seconds.)
Notably, the same type of "statistics" command in another CSV command line toolkit takes about 2 minutes to produce similar statistics on the same data set.
Creating an index gives us more than just faster statistics gathering. It also makes slice operations extremely fast because only the sliced portion has to be parsed. For example, let's say you wanted to grab the last 10 records:
|
These commands are instantaneous because they run in time and memory proportional to the size of the slice (which means they will scale to arbitrarily large CSV data).
Switching gears a little bit, you might not always want to see every column in the CSV data. In this case, maybe we only care about the country, city and population. So let's take a look at 10 random rows:
| |
Whoops! It seems some cities don't have population counts. How pervasive is that?
)
(The qsv frequency
command builds a frequency table for each column in the
CSV data. This one only took 5 seconds.)
So it seems that most cities do not have a population count associated with them at all. No matter—we can adjust our previous command so that it only shows rows with a population count:
| | | |
:warning: NOTE: The
tee
command reads from standard input and writes to both standard output and one or more files at the same time. We do this so we can create thesample.csv
file we need for the next step, and pipe it to theqsv table
command.
Erk. Which country is at
? No clue, but the Data Science Toolkit has a CSV
file called countrynames.csv
. Let's grab it and do a join so we can see which
countries these are:
|
|
| | | | | |
Whoops, now we have two columns called Country
and an Abbrev
column that we
no longer need. This is easy to fix by re-ordering columns with the qsv select
command:
| |
|
| | | | | |
Perhaps we can do this with the original CSV data? Indeed we can—because
joins in qsv
are fast.
|
|
|
|
The !Abbrev,Country[1]
syntax means, "remove the Abbrev
column and remove
the second occurrence of the Country
column." Since we joined with
countrynames.csv
first, the first Country
name (fully expanded) is now
included in the CSV data.
This qsv join
command takes about 7 seconds on my machine. The performance
comes from constructing a very simple hash index of one of the CSV data files
given. The join
command does an inner join by default, but it also has left,
right and full outer join support too.
Installation
Binaries for Windows, Linux and macOS are available from Github.
Alternatively, you can compile from source by
installing Cargo
(Rust's package manager)
and installing qsv
using Cargo:
Compiling from this repository also works similarly:
Compilation will probably take a few minutes depending on your machine. The
binary will end up in ./target/release/qsv
.
If you want to squeeze more performance from your build, set this environment variable before compiling:
Do note though that the resulting binary will only run on machines with the
same architecture as the machine you compiled from. To find out your CPU
architecture and other valid values for target-cpu
:
Benchmarks
I've compiled some very rough
benchmarks of
various qsv
commands.