# qsv-dateparser optimization — session handoff
**Repo to work in:** https://github.com/jqnatividad/qsv-dateparser (you own it)
**Crate version in use by qsv:** `qsv-dateparser = "0.15"` (qsv `Cargo.toml:262`)
**Goal:** speed up date parsing — the dominant cost of `qsv stats --infer-dates`
and, via the stats cache, of `frequency`, `schema`, `tojsonl`, `sqlp`, `joinp`,
`pivotp`, `describegpt`.
## Why this, and not qsv/stats.rs
`src/cmd/stats.rs` is already near-optimal for byte-identical output:
- `FieldType::from_sample` (stats.rs:5210) already parses each value **once** and
returns `(FieldType, i64 timestamp, f64 value)`; `Stats::add` (stats.rs:3897)
consumes those directly via `add_with_parsed` / `add_numeric_value`. No
double-parse, no per-value `Vec` allocation in the numeric path.
- Date inference is gated by `--dates-whitelist` (default `sniff`), so only
sniff-identified columns are even attempted in the normal path.
The remaining cost lives entirely inside this crate's parse dispatch.
## Measured baseline (M4 Max, release `-F all_features`, 514MB NYC-311, 1M rows)
| `stats` | 1.46s |
| `stats --everything` | 2.30s |
| `stats --everything --infer-dates` | **5.65s** |
| `stats --typesonly` | 0.78s |
| `stats --typesonly --infer-dates` | 4.11s |
`--infer-dates` adds ~3.4s — larger than all stats accumulation combined.
### Decomposition (controlled single 1M-row columns)
| genuine ISO datetimes `2017-11-17 05:00:30` | 77ms | 123ms | +46ms |
| genuine slash datetimes `11/17/2017 05:00:30 AM` | — | 130ms | ~+53ms |
| **non-date string** `category_value_N` | 72ms | 121ms | **+49ms** |
**Key finding:** parsing a *real* date is cheap (~50ns/value). The expense is
**failed** parse attempts on non-date string columns — every value runs the full
regex `is_match` dispatch chain before failing. NYC-311 has many such columns, so
the aggregate failed-parse cost is what produces the +3.4s.
At 10M rows the non-date string column went 574ms → 766ms with `--infer-dates`
(+192ms / +33%), confirming the overhead scales linearly with row count and is
attributable to per-value dispatch, not setup.
## How the current dispatch works
Entry: `parse_with_preference(input, dmy)` (lib.rs:247) →
`Parse::new_with_preference(&Utc, midnight, dmy).parse(input)`.
`Parse::parse` (src/datetime.rs:61) is a sequential `or_else` chain:
```
rfc2822 → unix_timestamp → slash_mdy_family → slash_ymd_family
→ ymd_family → month_ymd → month_mdy_family → month_dmy_family
→ Err("did not match any formats")
```
Each `*_family` (datetime.rs:74,91,105,116,136) starts with a `regex!` (a
`OnceLock<Regex>`) `is_match` guard, then tries sub-formats. For a non-date
string, the worst case runs: `parse_from_rfc2822` (fails), `fast_float2::parse`
(fails), then **5–6 regex `is_match` probes**, all failing, per value.
`regex!` macro = lazily-compiled case-insensitive `Regex` in a `OnceLock`
(datetime.rs:9). Compilation is amortized; the per-call cost is the `is_match`
scans.
## Optimization ideas (in priority order)
1. **Cheap structural pre-filter before the regex chain.** Before any regex,
do a single byte scan of `input` and bail to `Err` early when it cannot be a
date: e.g. no ASCII digit present, or length outside plausible bounds, or
first byte neither digit nor ASCII letter. Most non-date strings die here for
the cost of one short scan instead of 5–6 regex probes. This is the biggest,
lowest-risk win because the hot case is *failures*.
2. **Single-pass classification via `regex::RegexSet`.** Replace the sequential
`is_match` chain in `Parse::parse` with one `RegexSet` over the family
anchored patterns; the set reports which family (if any) matches in a single
pass, then dispatch to that family's sub-parser. Avoids re-scanning the input
once per family.
3. **Order families by real-world frequency** (after the pre-filter): ISO `ymd`
and `slash_mdy` dominate real data; `rfc2822` (tried first today) is rare and
does a full `parse_from_rfc2822` attempt on every value. Move rare/expensive
formats later.
4. **Avoid `fast_float2::parse` on obvious non-numbers** in `unix_timestamp`
(datetime.rs:149) — the structural pre-filter (#1) can also short-circuit this
for inputs containing `/`, `-` in date positions, or letters.
## Verification plan (do this in qsv-dateparser)
1. The crate has a bench harness: `benches/parse.rs` and unit tests in
`src/lib.rs` (`parse_in_local`, `parse_with_timezone_in_utc`,
`parse_with_preference_and_timezone_in_utc`, plus `parse_unambiguous_dmy`,
`parse_iso_t_no_tz`). Run `cargo test` — **all accepted-format cases in the
lib.rs doc comment (lines 28-119, 135-150) must still pass.** Correctness is
non-negotiable: any reordering/pre-filter must not change which strings parse
or what they parse to.
2. Add benches covering BOTH hot cases: (a) a genuine date string, (b) a non-date
string that fails — the failure path is what we're optimizing. `cargo bench`
before/after.
3. Integration check back in qsv: bump the path/version dependency to the local
qsv-dateparser, rebuild `cargo build --release -F all_features`, and re-run the
decomposition benchmarks above. Target: shrink the +49ms non-date-string
overhead substantially while the genuine-date +46ms stays flat.
4. Regenerate/verify nothing changes in qsv's stats output:
`cargo test stats -F all_features` (stats tests compare exact output).
## Repro snippets
Generate the test columns (1M rows):
```bash
# non-date string column (the failure hot path)
awk 'BEGIN{print "s"; for(i=0;i<1000000;i++) printf "category_value_%d\n", i%500}' > strcol.csv
# genuine ISO datetimes
awk 'BEGIN{srand(7); print "d"; for(i=0;i<1000000;i++){y=2010+int(rand()*15);m=1+int(rand()*12);d=1+int(rand()*28);printf "%04d-%02d-%02d %02d:%02d:%02d\n",y,m,d,int(rand()*24),int(rand()*60),int(rand()*60)}}' > iso_dt.csv
BIN=target/release/qsv
hyperfine --warmup 1 --runs 3 \
--command-name strcol_nodate "$BIN stats --force strcol.csv" \
--command-name strcol_dates "$BIN stats --force --infer-dates strcol.csv" \
--command-name iso_dates "$BIN stats --force --infer-dates iso_dt.csv"
```
## Files to touch (qsv-dateparser)
- `src/datetime.rs` — `Parse::parse` (dispatch chain, ~line 61), the `*_family`
guards (74/91/105/116/136), `unix_timestamp` (149). Pre-filter goes at the top
of `parse`.
- `src/lib.rs` — public `parse*` entry points (no signature change needed); tests.
- `benches/parse.rs` — add failure-path bench.
After a release: bump `qsv-dateparser` version, update `qsv/Cargo.toml:262`.