simd_csv/lib.rs
/*!
The `simd-csv` crate provides specialized readers & writers of CSV data able
to leverage [SIMD](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) instructions.

It has been designed to fit the [xan](https://github.com/medialab/xan) command line tool's
requirements, but can be used by anyone to speed up CSV parsing.

It is less flexible and user-friendly than the [`csv`](https://docs.rs/csv/) crate,
so one should make sure the performance gain is worth it before going further.

This crate is not a port of [simdjson](https://arxiv.org/abs/1902.08318) branchless logic
applied to CSV parsing. It uses a somewhat novel approach instead, mixing traditional state
machine logic with [`memchr`](https://docs.rs/memchr/latest/memchr/)-like SIMD-accelerated
string searching. See the [design notes](#design-notes) for more details.
# Examples

*Reading a CSV file while amortizing allocations*

```no_run
use std::fs::File;
use simd_csv::{Reader, ByteRecord};

let mut reader = Reader::from_reader(File::open("data.csv")?);
let mut record = ByteRecord::new();

while reader.read_byte_record(&mut record)? {
    for cell in record.iter() {
        dbg!(cell);
    }
}
# Ok::<(), Box<dyn std::error::Error>>(())
```

*Using a builder to configure your reader*

```no_run
use std::fs::File;
use simd_csv::ReaderBuilder;

let mut reader = ReaderBuilder::new()
    .delimiter(b'\t')
    .buffer_capacity(16 * (1 << 10))
    .from_reader(File::open("data.csv")?);
# Ok::<(), Box<dyn std::error::Error>>(())
```

*Using the zero-copy reader*

```no_run
use std::fs::File;
use simd_csv::ZeroCopyReader;

let mut reader = ZeroCopyReader::from_reader(File::open("data.csv")?);

while let Some(record) = reader.read_byte_record()? {
    // Only unescaping the third column:
    dbg!(record.unescape(2));
}
# Ok::<(), Box<dyn std::error::Error>>(())
```

*Counting records as fast as possible using the splitter*

```no_run
use std::fs::File;
use simd_csv::Splitter;

let mut splitter = Splitter::from_reader(File::open("data.csv")?);

println!("{}", splitter.count_records()?);
# Ok::<(), Box<dyn std::error::Error>>(())
```

# Readers

From least to most performant, and from most integrated to most barebones:

- [`Reader`], [`ReaderBuilder`]: a streaming copy reader, unescaping quoted data on the fly.
  This is the closest thing you will find to the [`csv`](https://docs.rs/csv/) crate `Reader`.
- [`ZeroCopyReader`], [`ZeroCopyReaderBuilder`]: a streaming zero-copy reader that only finds cell
  delimiters and does not unescape quoted data.
- [`Splitter`], [`SplitterBuilder`]: a streaming zero-copy splitter that only
  finds record boundaries, not cell delimiters.
- [`LineReader`]: a streaming zero-copy line splitter that does not handle quoting at all.

You can also find more exotic readers like:

- [`TotalReader`], [`TotalReaderBuilder`]: a reader optimized for use cases where
  CSV data is fully loaded into memory or with memory maps.
- [`Seeker`], [`SeekerBuilder`]: a reader able to find record start positions in a seekable CSV stream.
  This can be very useful for parallelization, or more creative uses like performing binary
  search in a sorted file.
- [`ReverseReader`], [`ReaderBuilder`]: a reader able to read a seekable CSV stream in reverse, in amortized linear time.

# Writers

- [`Writer`], [`WriterBuilder`]: a typical CSV writer.

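Below is a short, hedged sketch of writing CSV data. Note that `from_writer`,
`write_record` and `flush` here are assumptions modeled on the [`csv`](https://docs.rs/csv/)
crate's API, not confirmed methods; refer to the [`Writer`] documentation for
the real ones:

```ignore
use std::io;
use simd_csv::Writer;

// Assumed constructor, mirroring `Reader::from_reader`:
let mut writer = Writer::from_writer(io::stdout());

// Assumed `csv`-like methods:
writer.write_record(&["name", "surname"])?;
writer.write_record(&["béatrice", "babka"])?;
writer.flush()?;
```
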
# Supported targets

- On `x86_64` targets, `sse2` instructions are used. `avx2` instructions
  will also be used if their availability is detected at runtime.
- On `aarch64` targets, `neon` instructions are used.
- On `wasm` targets, `simd128` instructions are used.
- Everywhere else, the library will fall back to [SWAR](https://en.wikipedia.org/wiki/SWAR)
  techniques or scalar implementations.

Using `RUSTFLAGS='-C target-cpu=native'` should not be required when compiling
this crate: it already uses SIMD instructions tied to your `target_arch`, and it
relies on runtime detection to find better SIMD instructions (typically `avx2`).

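At runtime, you can inspect which SIMD instruction set the string searcher
selected through [`searcher_simd_instructions`]. A minimal sketch (the exact
return type is whatever the function's own documentation says; printing with
`{:?}` assumes it implements `Debug`):

```ignore
println!("{:?}", simd_csv::searcher_simd_instructions());
```
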
# Design notes

## Regarding performance

This crate's CSV parser has been cautiously designed to offer "reasonable" performance
by combining traditional state machine logic with SIMD-accelerated string searching.

I say "reasonable" because you cannot expect to parse 16/32 times faster than a state-of-the-art
scalar implementation like the [`csv`](https://docs.rs/csv/) crate. What's more, the
throughput of the SIMD-accelerated parser remains very data-dependent. Sometimes
you will go up to ~8 times faster, sometimes you will only go as fast as scalar code.
(Remember also that CSV parsing is often an IO-bound task, even more so than for other
data formats usually expected to fit into memory, like JSON.)

As a rule of thumb, the larger your records and cells, the greater the
performance boost vs. a scalar byte-by-byte implementation. This also means that
in the worst cases, this crate's parser will just be on par with scalar code. I
have done everything in my power to ensure this SIMD parser is never slower (I think
one of the reasons why SIMD CSV parsers are not yet very prevalent is that they
tend to suffer in real-life cases where scalar code outperforms them).

Also, note that this crate is geared towards parsing **streams** of CSV data
only quoted when needed (i.e. not written with a `QUOTE_ALWAYS` policy).

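For reference, here is the same record written with minimal quoting, then with
a `QUOTE_ALWAYS` policy (note that the last field needs quoting in both cases
because it contains a delimiter):

```txt
béatrice,babka,"gâteau, marbré"
"béatrice","babka","gâteau, marbré"
```
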
To paint a more faithful picture, here is a table summing up the performance boost
you might observe vs. the [`csv`](https://docs.rs/csv/) crate while parsing
CSV data streamed from SSD:

| file               | split | zero_copy | copy  |
| ------------------ | ----- | --------- | ----- |
| articles.csv       | ~4.8x | ~4.7x     | ~4.0x |
| blogs.csv          | ~4.2x | ~2.5x     | ~2.0x |
| numbers.csv        | ~3.6x | ~2.0x     | ~1.5x |
| quote-always.csv   | ~1.1x | ~1.2x     | ~1.0x |
| random.csv         | ~2.8x | ~2.1x     | ~1.7x |
| range.csv          | ~1.2x | ~1.4x     | ~1.1x |
| tweets.csv         | ~6.0x | ~2.9x     | ~2.5x |
| worldcitiespop.csv | ~3.1x | ~2.0x     | ~1.7x |
| worst-case.csv     | ~1.0x | ~1.1x     | ~0.9x |
| mediapart.tsv      | ~3.1x | ~3.1x     | ~2.7x |

The **split** column reports gains using a [`Splitter`], the **zero_copy** column
reports gains using a [`ZeroCopyReader`], and the **copy** column reports
gains using a [`Reader`].

Comparison is done including IO, on `x86_64` with `avx2` support.

Here is a description of the files:

- **articles.csv**: 11GB summing up 3M press articles from a French
  media outlet, with a column containing the full text of the articles.
- **blogs.csv**: 1GB with ~10 medium-sized columns summing up information about
  1M blogs.
- **numbers.csv**: 200MB of the same 9 columns containing a number from 1 to 9,
  repeated 10M times.
- **quote-always.csv**: same as **random.csv** but written using a needless
  `QUOTE_ALWAYS` policy.
- **random.csv**: 500MB of 6 columns containing a random `f32` number.
- **range.csv**: 400MB with a single column containing an incremental integer
  up to ~50M.
- **tweets.csv**: 6.5GB representing ~7M tweets. Lots of columns, most of them
  mostly empty.
- **worldcitiespop.csv**: 500MB of repeated CSV data from world cities population open data.
- **worst-case.csv**: 100MB repeating a single column containing just "1" 50M times.
- **mediapart.tsv**: 3GB representing an export of a tar archive containing 10k
  articles downloaded from a French media outlet, with a column containing raw HTML.

## Regarding simdjson techniques

I have tried very hard to apply [simdjson](https://arxiv.org/abs/1902.08318) tricks
to make this crate's parser as branchless as possible, but I couldn't make it as
fast as the state-machine/SIMD string searching hybrid.

`PCLMULQDQ` & shuffling tricks in this context only add more complexity and overhead
to the SIMD sections of the code, all while making them less "democratic", since you
need specific SIMD instructions that are not available everywhere if you don't want
to fall back to slower instructions.

Said differently, those techniques seem overkill in practice for CSV parsing.
But it is also possible I am not competent enough to make them work properly, and
I won't hesitate to move towards them if proven wrong.

## Hybrid design

This crate's CSV parser follows a hybrid approach where we maintain a traditional
state machine, but search for structural characters in the byte stream using
SIMD string searching techniques like the ones implemented in the excellent
[`memchr`](https://docs.rs/memchr/latest/memchr/) crate.

The idea is to compare 16/32 bytes of data at once with splats of structural
characters like `\n`, `"` or `,`. We then extract a "move mask" that is
handled as a bit string, so we can find whether and where some character
was found using typical bit-twiddling.

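Here is a minimal, self-contained illustration of a single 16-byte probe
(simplified for exposition; this is not the crate's actual code):

```
// On `x86_64`, a single `sse2` probe inspects 16 bytes at once.
#[cfg(target_arch = "x86_64")]
fn find_first(block: &[u8; 16], needle: u8) -> Option<usize> {
    use std::arch::x86_64::*;

    // SAFETY: `sse2` is always available on `x86_64`.
    let mask = unsafe {
        // Splat the needle across all 16 lanes of a SIMD register.
        let splat = _mm_set1_epi8(needle as i8);
        // Load 16 bytes of input (unaligned load).
        let chunk = _mm_loadu_si128(block.as_ptr() as *const __m128i);
        // Byte-wise equality, compressed into a 16-bit "move mask".
        _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, splat)) as u32
    };

    if mask == 0 {
        None
    } else {
        // The mask is a plain bit string: bit i is set iff block[i] == needle.
        Some(mask.trailing_zeros() as usize)
    }
}

// Scalar stand-in so this example runs on any target.
#[cfg(not(target_arch = "x86_64"))]
fn find_first(block: &[u8; 16], needle: u8) -> Option<usize> {
    block.iter().position(|&b| b == needle)
}

assert_eq!(find_first(b"name,surname\n---", b','), Some(4));
```
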
This ultimately means that branching happens on each structural character rather
than on each byte, which is very good. But this is also the reason why CSV data
with a very high density of structural characters will not get parsed much faster
than with the equivalent scalar code.

## Two-speed SIMD branches

This crate's CSV parser actually uses two different modes of SIMD string searching:

1. when reading unquoted CSV data, the parser uses an amortized variant of the
   [`memchr`](https://docs.rs/memchr/latest/memchr/) routines where move masks
   containing more than a single match are kept and consumed progressively on subsequent calls,
   instead of restarting a search from the character just next to an earlier match, as the
   `memchr_iter` routine does (see the sketch at the end of this section).
2. when reading quoted CSV data, the parser uses the optimized & unrolled functions
   of the [`memchr`](https://docs.rs/memchr/latest/memchr/) crate directly to find the next
   quote as fast as possible.

This might seem weird, but this is the best tradeoff for performance. Counter-intuitively,
using larger SIMD registers like `avx2` for mode 1 actually hurts overall performance.
Similarly, using the amortized routine to scan quoted data is actually slower than
using the unrolled functions of [`memchr`](https://docs.rs/memchr/latest/memchr/).

This actually makes sense if you consider that the longer a field is, the more
probable it is to contain a character requiring the field to be quoted. What's more,
the density of quotes to be found in a quoted field is usually lower than the density
of structural characters in an unquoted CSV stream. So if you use larger SIMD registers
on the unquoted stream, you will end up (1) throttling the SIMD part of the code too much
because of the inner branching (when hitting a delimiter or a newline) and (2) often
discarding too much work when hitting a record end or a quoted field.

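Here is a scalar stand-in showing the amortized idea from mode 1 (illustrative
only; the real searcher works on SIMD move masks and matches several structural
bytes at once):

```
// One block scan yields a bitmask of *all* matches in the block, which is
// then consumed lazily across calls instead of being recomputed.
struct Amortized<'a> {
    haystack: &'a [u8],
    pos: usize,  // start of the next block to scan
    base: usize, // haystack offset of the block behind `mask`
    mask: u32,   // pending matches in that block
}

impl<'a> Amortized<'a> {
    fn new(haystack: &'a [u8]) -> Self {
        Self { haystack, pos: 0, base: 0, mask: 0 }
    }

    fn next_match(&mut self, needle: u8) -> Option<usize> {
        loop {
            if self.mask != 0 {
                // Pop the lowest pending match instead of rescanning.
                let bit = self.mask.trailing_zeros() as usize;
                self.mask &= self.mask - 1;
                return Some(self.base + bit);
            }
            if self.pos >= self.haystack.len() {
                return None;
            }
            // In the real parser this is a single 16-byte SIMD
            // splat/compare/movemask; here the mask is built byte by byte.
            let end = (self.pos + 16).min(self.haystack.len());
            self.base = self.pos;
            for (i, &b) in self.haystack[self.pos..end].iter().enumerate() {
                if b == needle {
                    self.mask |= 1 << i;
                }
            }
            self.pos = end;
        }
    }
}

let mut searcher = Amortized::new(b"a,b,c\nd,e,f\n");
assert_eq!(searcher.next_match(b','), Some(1));
assert_eq!(searcher.next_match(b','), Some(3)); // served from the cached mask
```
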
## Copy amortization

Frequently copying tiny amounts of data is quite detrimental to overall performance.
As such, and to make sure the copying [`Reader`] remains as fast as possible,
I decided to change the design of the [`ByteRecord`] to save fields as fully-fledged
ranges over the underlying byte slice, instead of only delimiting them implicitly
by the offsets separating them, as is done in the [`csv`](https://docs.rs/csv/) crate.

This means I am able to copy large swathes of unquoted data at once instead of
copying fields one by one. This also means I keep delimiter characters and sometimes
inconsequential double quotes in the underlying byte slice (but don't worry, the
user will never actually see them), so that copies remain as vectorized as possible.

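A toy model of this range-based layout (illustrative only; the real
[`ByteRecord`] is more involved): the delimiter stays in the copied buffer,
but the field ranges exclude it, so the user never sees it:

```
use std::ops::Range;

struct RangeRecord {
    buffer: Vec<u8>,           // one contiguous copy of the record
    fields: Vec<Range<usize>>, // each field is a range into `buffer`
}

impl RangeRecord {
    fn field(&self, i: usize) -> Option<&[u8]> {
        self.fields.get(i).map(|r| &self.buffer[r.clone()])
    }
}

// "john,landis" is copied with a single memcpy, comma included:
let record = RangeRecord {
    buffer: b"john,landis".to_vec(),
    fields: vec![0..4, 5..11],
};

assert_eq!(record.field(0), Some(&b"john"[..]));
assert_eq!(record.field(1), Some(&b"landis"[..]));
```
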
# Caveats

## "Nonsensical" CSV data

To remain as fast as possible, "nonsensical" CSV data is handled by this
crate differently than it might traditionally be done.

For instance, this crate's CSV parser has no concept of "beginning of field",
which means opening quotes in the middle of a field might corrupt the output.
(I would say producing such data is immoral in the first place, but traditional
parsers tend to deal with this case more gracefully.)

As an example, given the following CSV data:

```txt
name,surname\njoh"n,landis\nbéatrice,babka
```

Cautious parsers would produce the following result:

| name     | surname |
| -------- | ------- |
| joh"n    | landis  |
| béatrice | babka   |

While this crate's parser would produce the following unaligned result:

| name                         | surname |
| ---------------------------- | ------- |
| joh"n,landis\nbéatrice,babka | \<eof\> |

Also keep in mind that fields opening and closing quotes multiple
times might lose some characters here & there (especially whitespace), because
the parser's state machine is not geared towards this at all.

Rest assured that morally valid & sensical CSV data will still be parsed
correctly ;)

## Regarding line terminators

To avoid needless branching and SIMD overhead, this crate's CSV parser
expects line terminators to be either CRLF or a single LF, but not a single CR.

Also, to avoid state machine overhead related to CRLF at buffer boundaries
when streaming, and to make sure we skip empty lines of the file (we
don't parse them as empty records), one edge case has been deemed an
acceptable loss: leading CR characters will be trimmed from the beginning
of records.

For instance, given the following CSV data:

```txt
name,surname\n\rjohn,landis\r\nbéatrice,babka
```

A morally correct parser recognizing CRLF or LF line terminators should return:

| name     | surname |
| -------- | ------- |
| \rjohn   | landis  |
| béatrice | babka   |

While this crate returns:

| name     | surname |
| -------- | ------- |
| john     | landis  |
| béatrice | babka   |

*/
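
// Convenience macro building a `ByteRecord` from any number of values
// exposing `.as_bytes()`, e.g. string slices (only used in tests, hence
// the `allow` below).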
#[allow(unused_macros)]
macro_rules! brec {
    () => {{
        $crate::records::ByteRecord::new()
    }};

    ($($x: expr),*) => {{
        let mut r = $crate::records::ByteRecord::new();

        $(
            r.push_field($x.as_bytes());
        )*

        r
    }};
}

mod buffer;
mod core;
mod debug;
mod error;
mod ext;
mod line_reader;
mod reader;
mod records;
mod searcher;
mod seeker;
mod splitter;
mod total_reader;
mod utils;
mod writer;
mod zero_copy_reader;

pub use error::{Error, ErrorKind, Result};
pub use line_reader::LineReader;
pub use reader::{Reader, ReaderBuilder, ReverseReader};
pub use records::{ByteRecord, ZeroCopyByteRecord};
pub use searcher::searcher_simd_instructions;
pub use seeker::{Seeker, SeekerBuilder};
pub use splitter::{Splitter, SplitterBuilder};
pub use total_reader::{TotalReader, TotalReaderBuilder};
pub use utils::unescape;
pub use writer::{Writer, WriterBuilder};
pub use zero_copy_reader::{ZeroCopyReader, ZeroCopyReaderBuilder};