scrubcsv: Remove bad lines from a CSV file and normalize the rest
This is a CSV cleaning tool based on BurntSushi's
excellent csv
library. It's
intended to be used for cleaning up and normalizing large data sets before
feeding them to other CSV parsers, at the cost of discarding the occasional
row. This program may further mangle syntactically-invalid CSV data!
See below for details.
Installing and using
To install, first install Rust if you haven't already:
|
Then install scrubcsv
using Cargo:
Run it:
)
For more options, run:
Data cleaning notes
We assume that, given hundreds of gigabytes of CSV from many sources, many files will contain a few unparsable lines.
Lines of the following form:
Name,Phone
"Robert "Bob" Smith",(202) 555-1212
...are invalid according the RFC 4180 because the quotes around "Bob"
are
not escaped. The creator the file probably intended to write:
Name,Phone
"Robert ""Bob"" Smith",(202) 555-1212
scrubcsv
will currently output this as:
Name,Phone
"Robert Bob"" Smith""",(202) 555-1212
If the resulting line has the wrong number of columns, it will be discarded. The precise details of cleanup and discarding are subject to change. The goal is to preserve data in valid CSV files, and to make a best effort to salvage or discard records that can't be parsed without being too picky about the details.
Performance notes
This is designed to be relatively fast. For comparison purposes, on particular laptop:
cat /dev/zero | pv > /dev/null
shows a throughput of about 5 GB/s.- The original raw output string-writing routines in
scrubcsv
could reach about 3.5 GB/s. - The
csv
parser can reach roughly 235 MB/s in zero-copy mode. - With various levels of processing,
scrubcsv
hits 49 to 125 MB/s. - A lot of old-school C command-line tools hit about 50 to 75 MB/s.
Unfortunately, we can't really use csv
's zero-copy mode because we need
to see an entire row at once to decide whether or not it's valid before
deciding to output it. We could, I suppose, memmove
each field as we see
it into an existing buffer to avoid malloc
overhead (which is almost
certianly the bottleneck here), but that would require more code. Still,
file an issue if performance is a problem. We could probably make this a
maybe two to four times faster (and it would be fun to optimize).