csv-managed 1.0.2

High-performance command-line toolkit for streaming, indexing, and transforming CSV datasets
# csv-managed

`csv-managed` is a Rust command-line utility for high‑performance exploration and transformation of CSV data at scale. It emphasizes streaming, typed operations, and reproducible workflows via schema (`-schema.yml`) and index (`.idx`) files.

## Implemented Features

| Area | Description |
|------|-------------|
| Delimiters & Encodings | Read/write comma, tab, pipe, semicolon, or any single ASCII delimiter; independent `--input-encoding` / `--output-encoding`; stdin/stdout streaming (`-`). See: [process](#process), [index](#index). |
| Schema Discovery (probe / infer) | Fast sample (`--sample-rows`) or full scan detection of String, Integer, Float, Boolean, Date, DateTime, Time, Guid, Currency; optional mapping & replace scaffolds (`--mapping`, `--replace-template`); overrides via `--override`; NA placeholder normalization (`--na-behavior`, `--na-fill`); unified diff comparisons against existing schemas (`--diff existing-schema.yml`). See: [schema](#schema). |
| Headerless CSV Detection | Automatically detects presence/absence of a header row (sampling up to the first 6 rows). Override detection with `--assume-header=<true\|false>`. For headerless files, synthetic column names `field_0..field_N` are generated and persisted with `has_headers: false`; downstream commands (`process`, `stats`, `verify`, `append`) treat the first row as data. Edit the schema's `has_headers` flag if further adjustments are needed. |
| Schema Preview | `schema infer --preview` emits the probe table and resulting YAML (respecting `--replace-template`, overrides, and NA placeholder handling) without writing files—use with `-o` for a dry run before persisting. See: [schema](#schema). |
| Manual Schema Authoring | Inline column specs (`-c name:type->Alias`), value replacements (`--replace column=value->new`), persisted to `-schema.yml` (legacy `*-schema.yml` accepted). See: [schema](#schema). |
| Snapshot Regression | `--snapshot <file>` for `schema probe` / `schema infer` writes or validates golden layout & inferred types; guards against formatting/inference drift. See: [Snapshot vs Schema Verify](#snapshot-vs-schema-verify). |
| Column Listing | `schema columns` renders column positions, types, and aliases derived from schema mapping. See: [schema columns](#schema-columns). |
| Value Normalization | Per-column `replace` arrays applied before parsing; flexible boolean token parsing with selectable output format (`process --boolean-format`). See: [schema](#schema), [process](#process). |
| Datatype Transformations | Schema-driven `datatype_mappings` chains convert and standardize values (string→datetime→date, float rounding, string casing) before replacements; toggle via `process --apply-mappings` / `--skip-mappings`. See: [schema](#schema), [process](#process). |
| Fixed Decimal Datatype | Columns may declare `decimal(precision,scale)` (up to 28 digits of precision). Parsing enforces integer/scale limits, supports `round`/`truncate` strategies, and normalizes output for downstream analytics. See: [schema](#schema), [datatype mappings](#datatype-mappings). |
| Indexing | Multi-variant B-tree index files with mixed asc/desc columns; named specs (`--spec name=col:asc,...`) and covering expansion (`--covering`) for prefix/direction permutations. See: [index](#index) and detailed guide: [Indexing & Sorting](docs/indexing-and-sorting.md). |
| Sort & Stream Processing | `process` selects best index variant (longest matching prefix) or falls back to stable in-memory multi-column sort while streaming transformations. See: [process](#process). |
| Filtering & Projection | Typed comparison filters (`= != > >= < <= contains startswith endswith`), multi-flag AND semantics; Evalexpr predicates (`--filter-expr`) with temporal helpers; column include/exclude; row limiting; optional 1-based row numbers. See: [process](#process), [Expression Reference](#expression-reference). |
| Temporal Expression Helpers | Functions like `date_diff_days`, `datetime_format`, `time_diff_seconds` usable in derives and `--filter-expr`. See: [process](#process), [Expression Reference](#expression-reference). |
| Derived Columns | Evalexpr-based expressions referencing header names or positional aliases (`cN`); arithmetic, string, conditional, temporal operations. See: [process](#process), [Expression Reference](#expression-reference). |
| Append | Concatenate multiple inputs with header (and optional schema) validation, enforcing consistent types pre-merge. See: [append](#append). |
| Verification | `schema verify` streams each row against declared types; reporting tiers via `--report-invalid[:detail[:summary]] [LIMIT]`. Non‑zero exit code when any invalid value encountered. See: [schema](#schema), [Snapshot vs Schema Verify](#snapshot-vs-schema-verify). |
| Statistics & Frequency | `stats` computes count, mean, median, min, max, std dev for numeric & temporal columns; `--frequency` distinct counts with optional `--top`; filters apply prior to aggregation. See: [stats](#stats). |
| Preview & Table Rendering | `process --preview` elastic table for quick inspection (defaults `--limit` to 10); `process --table` formatted output when streaming to stdout. See: [process](#process). |
| Joins (engine) | Hash-join engine retained for upcoming streaming pipelines; CLI command temporarily disabled while v1.6.0 join workflows are redesigned. |
| Installation & Tooling | `install` convenience wrapper around `cargo install`; tag-based release workflow; logging via `RUST_LOG`. See: [install](#install). |
| Streaming & Memory Efficiency | Forward-only iteration for verify, stats, filtering, projection, and indexed sorted reads; minimizes heap usage for large files. See: [process](#process), [schema](#schema). |
| Error Reporting & Diagnostics | Contextual errors (I/O, parsing, schema mismatch, expression eval); highlighted invalid cells; snapshot mismatch failures surface layout drifts early. See: [schema](#schema), [process](#process). |

### Mini Derived & Filter Expression Cheat Sheets

#### Derived Expression Patterns

| Pattern | Example | Description |
|---------|---------|-------------|
| Header reference | `total_with_tax=amount*1.0825` | Multiply numeric column values. |
| Positional alias | `margin=c5-c3` | Use `cN` alias (0-based). |
| Conditional flag | `high_value=if(amount>1000,1,0)` | 1/0 indicator via `if(cond, then, else)`. |
| Date math | `ship_eta=date_add(ordered_at,2)` | Add days to a date column. |
| Date diff | `ship_lag=date_diff_days(shipped_at,ordered_at)` | Days between two dates. |
| Time diff | `window=time_diff_seconds(end_time,start_time)` | Seconds between two times. |
| Boolean normalization | `is_shipped=if(status="shipped",true,false)` | Emit canonical booleans. |
| String concat | `channel_tag=concat(channel,"-",region)` | Join string columns. |
| Guid passthrough | `id_copy=id` | Duplicate a Guid column. |
| Row number | `row_index=row_number` | Sequential number (with `--row-numbers`). |

#### Filter vs Filter-Expr Cheat Sheet

| Aspect | --filter | --filter-expr |
|--------|----------|---------------|
| Syntax style | Simple operator tokens | Full Evalexpr expression |
| Supported operators | = != > >= < <= contains startswith endswith | arithmetic (`+ - * / %`), logic (`&&`, `\|\|`), functions (`if()`, `concat()`, `date_*`, `time_*`) |
| Type awareness | Typed parsing per column | Evaluates on parsed typed values |
| Temporal helpers | Compare canonical values | date_diff_days, date_add, time_diff_seconds, datetime_format, etc. |
| Boolean logic | Repeat flag = AND chain | `&&` / `\|\|` operators or `if(...)` nesting |
| String literals | Bare or quoted if spaces | Must use double quotes (outer shell may use single) |
| Column reference | Header name | Header name or positional alias cN |
| Row number | Provided when --row-numbers | Variable row_number when --row-numbers |
| Example | --filter "status = shipped" | --filter-expr 'if(amount>1000 && status="shipped", true, false)' |
| Temporal example | --filter "ordered_at >= 2024-01-01" | --filter-expr 'date_diff_days(shipped_at, ordered_at) >= 2' |
| Complex gating | Multiple --filter flags | Single rich conditional expression |

Common `--filter-expr` snippets:

```text
date_diff_days(shipped_at, ordered_at) > 1
concat(channel, "-", region) = "web-US"
if(amount * 1.0825 > 500, 1, 0)
time_diff_seconds(end_time, start_time) >= 3600
```

### Expression Reference

Unified reference for derived column expressions, filter vs filter-expr usage, temporal helpers, common pitfalls, and a combined example.

#### 1. Derived Columns (Recap)

Use `--derive name=expression`. Expressions may reference:

* Header names (normalized after schema mapping)
* Positional aliases `cN` (0-based, so `c0` is first data column)
* Functions (see Temporal Helpers below)
* `row_number` (only when `--row-numbers` enabled)

#### 2. Filter vs Filter-Expr (Recap)

Two parallel mechanisms:

* `--filter` provides concise typed comparisons (auto-parsed per datatype; AND chaining across repeats).
* `--filter-expr` evaluates a full Evalexpr expression after parsing typed values (supports arithmetic, string, conditional, temporal helpers, boolean logic).

Mix them freely; conditions combine with AND semantics overall (a row passes only when it satisfies every `--filter` and every `--filter-expr` evaluates to true).

#### 3. Temporal Helpers (Full List)

| Function | Description |
|----------|-------------|
| `date_add(date, days)` / `date_sub(date, days)` | Shift a date forward or backward by whole days. |
| `date_diff_days(end, start)` | Difference in days between two dates (can be negative). |
| `date_format(date, "%d %b %Y")` | Render a date with a custom chrono-compatible format string. |
| `datetime_add_seconds(ts, seconds)` | Shift a datetime by an offset in seconds. |
| `datetime_diff_seconds(end, start)` | Difference between datetimes in seconds. |
| `datetime_to_date(ts)` / `datetime_to_time(ts)` | Extract date or time portions from a datetime. |
| `datetime_format(ts, "%Y-%m-%dT%H:%M")` | Custom formatting for datetimes. |
| `time_add_seconds(time, seconds)` | Shift an `HH:MM[:SS]` time of day by seconds. |
| `time_diff_seconds(end, start)` | Difference between two times (seconds). |

All helpers accept canonical strings (`YYYY-MM-DD`, `YYYY-MM-DD HH:MM:SS`, `HH:MM:SS`). Time arguments accept `HH:MM`. Fractional numeric offsets are truncated.
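For intuition, the semantics of a few helpers can be sketched in plain Python (illustrative only; the tool implements these in Rust, and the function names below merely mirror the table above):

```python
from datetime import date, datetime, timedelta

# Illustrative re-implementations of three helpers. Inputs use the
# canonical formats noted above (YYYY-MM-DD, HH:MM[:SS]).
def date_add(d: str, days: float) -> str:
    # Fractional numeric offsets are truncated, per the note above.
    return (date.fromisoformat(d) + timedelta(days=int(days))).isoformat()

def date_diff_days(end: str, start: str) -> int:
    # Difference in whole days; may be negative.
    return (date.fromisoformat(end) - date.fromisoformat(start)).days

def time_diff_seconds(end: str, start: str) -> int:
    def parse(t: str) -> datetime:
        # HH:MM is accepted alongside HH:MM:SS.
        return datetime.strptime(t if t.count(":") == 2 else t + ":00", "%H:%M:%S")
    return int((parse(end) - parse(start)).total_seconds())
```

For example, `date_diff_days("2024-01-03", "2024-01-05")` yields `-2`, matching the "can be negative" behavior documented above.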

#### 4. Common Pitfalls

| Pitfall | Guidance |
|---------|----------|
| Quoting (PowerShell) | Wrap the whole expression in single quotes; use double quotes for string literals inside: `'channel="web"'`. |
| Quoting (cmd.exe) | Escape inner double quotes with backslashes: `"channel=\"web\""`. |
| Positional alias indexing | `c0` is first column, not `c1`; verify header order after mapping. |
| Mixed filter logic | Multiple --filter flags AND together; to OR conditions use --filter-expr with `(a \|\| b)`. |
| Row number usage | `row_number` available only if `--row-numbers` was set before derives/filters execute. |
| Temporal comparisons | Prefer helpers (e.g. `date_diff_days`) over manual string comparison for correctness across formats. |
| Transform vs replace ordering | `datatype_mappings` run first, followed by schema `replace` mappings, then typed parsing & expressions; design expressions based on the fully normalized values. |
| Boolean output format | `--boolean-format` affects derived boolean rendering; logic still works with internal canonical bool. |
| Performance & median | Median computation and large numeric derives may buffer many values in memory; limit columns or avoid heavy expressions for huge files. |
| Using snapshots | Snapshots guard inference output only; they do not validate expression correctness. |

#### 5. Combined Filtering Example

Example mixing concise filters and one complex temporal expression:

```powershell
./target/release/csv-managed.exe process \
  -i ./data/orders.csv \
  -m ./data/orders-schema.yml \
  --filter "status = shipped" \
  --filter "amount >= 100" \
  --filter-expr 'date_diff_days(shipped_at, ordered_at) >= 2 && (region = "US" || region = "CA")' \
  --derive 'ship_lag_days=date_diff_days(shipped_at, ordered_at)' \
  --row-numbers \
  -C order_id,ordered_at,shipped_at,ship_lag_days,amount,status,region,row_number \
  --limit 25
```

#### 6. Quick Expression Validation Tip

Start with a narrower column selection (`-C`) and a small `--limit` to confirm derived outputs before removing the limit for full processing.

#### 7. Function Index (Alphabetical)

Helper functions usable in `--derive` and `--filter-expr` (temporal & formatting):

`date_add`, `date_diff_days`, `date_format`, `date_sub`, `datetime_add_seconds`, `datetime_diff_seconds`, `datetime_format`, `datetime_to_date`, `datetime_to_time`, `time_add_seconds`, `time_diff_seconds`

#### 8. Debug Tip

Set an environment variable to increase internal logging verbosity:

PowerShell:

```powershell
$env:RUST_LOG='csv_managed=debug'
```

cmd.exe:

```batch
set RUST_LOG=csv_managed=debug
```

Future enhancement: a debug mode may emit expression parse/normalize traces (e.g., tokenization, type coercions) at `debug` level tagged with `expr:` prefixes. Until that lands, no expression AST logging is implemented; this note documents the intended usage.

A fuller `process` invocation combining index-backed sorting, filters, derives, projection, and output formatting (cmd.exe):

```batch
.\target\release\csv-managed.exe process ^
  -i ./data/orders.csv ^
  -m ./data/orders-schema.yml ^
  -x ./data/orders.idx ^
  --index-variant default ^
  --sort order_date:asc,customer_id:asc ^
  --filter "status = shipped" ^
  --filter "amount >= 100" ^
  --derive "total_with_tax=amount*1.0825" ^
  --derive "channel=\"online\"" ^
  -C order_id,customer_id,amount,total_with_tax ^
  --exclude-columns internal_flag ^
  --row-numbers ^
  --boolean-format one-zero ^
  --output-delimiter pipe
```

If `--index-variant` is omitted, `process` automatically chooses the variant that covers the longest prefix of the requested `--sort` columns and directions.
When no indexed variant matches the requested sort signature, the command falls back to an in-memory stable sort while continuing to stream rows wherever possible.
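The longest-prefix selection can be sketched as follows (a minimal illustration with assumed data shapes — each variant modeled as an ordered list of `(column, direction)` pairs — not the actual index internals):

```python
from typing import Optional

def best_variant(variants: dict, requested: list) -> Optional[str]:
    """Pick the variant sharing the longest leading run of
    (column, direction) pairs with the requested sort; None means no
    variant matches and the caller falls back to an in-memory stable sort."""
    best_name, best_len = None, 0
    for name, cols in variants.items():
        matched = 0
        for have, want in zip(cols, requested):
            if have != want:
                break
            matched += 1
        if matched > best_len:
            best_name, best_len = name, matched
    return best_name
```

With the `default`/`recent` specs from the Quick Start, a requested sort of `order_date:asc,customer_id:asc,amount:asc` would select `default` (matching prefix length 2).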

## Installation

```bash
cargo build --release
```

Binary (Windows): `target\release\csv-managed.exe`

Install from crates.io:

```bash
cargo install csv-managed
```

Install locally from the workspace (useful when developing):

```bash
cargo install --path .
```

After building, the CLI can re-run installation on the current machine:

```powershell
./target/release/csv-managed.exe install --locked
```

The helper wraps `cargo install csv-managed` and accepts `--version`, `--force`, `--locked`, and `--root` pass-through options.

> Release automation: push a tag like `v0.1.0` and provide a `CRATES_IO_TOKEN` repository secret; the GitHub Actions release workflow will build archives and execute `cargo publish --locked` automatically.

Logging examples:

```powershell
$env:RUST_LOG='info'
```

```batch
set RUST_LOG=info
```

## Quick Start

```powershell
# 1. Infer schema
./target/release/csv-managed.exe schema infer -i ./data/orders.csv -o ./data/orders-schema.yml --sample-rows 0
# 2. Build index (optional for sorted reads)
./target/release/csv-managed.exe index -i ./data/orders.csv -o ./data/orders.idx --spec default=order_date:asc,customer_id:asc --spec recent=order_date:desc --schema ./data/orders-schema.yml
# 3. Process with filters / derives / sort
./target/release/csv-managed.exe process -i ./data/orders.csv -m ./data/orders-schema.yml -x ./data/orders.idx --index-variant default --sort order_date:asc,customer_id:asc --filter "status = shipped" --derive 'total_with_tax=amount*1.0825' --row-numbers -o ./data/orders_filtered.csv
# 4. Normalize legacy tokens via schema replacements
./target/release/csv-managed.exe process -i ./data/orders.csv -o ./data/orders_clean.csv --schema ./data/orders-schema.yml
# 5. Summary statistics
./target/release/csv-managed.exe stats -i ./data/orders.csv -m ./data/orders-schema.yml
# 5b. Temporal summary statistics
./target/release/csv-managed.exe stats -i ./tests/data/stats_temporal.csv -m ./tests/data/stats_temporal-schema.yml --columns ordered_at --columns ordered_at_ts --columns ship_time
# 6. Frequency counts (top 10)
./target/release/csv-managed.exe stats -i ./data/orders.csv -m ./data/orders-schema.yml --frequency --top 10
# 7. Preview first 15 rows
./target/release/csv-managed.exe process -i ./data/orders.csv --preview --limit 15
# 8. Append monthly extracts
./target/release/csv-managed.exe append -i jan.csv -i feb.csv -i mar.csv -m orders-schema.yml -o q1.csv
# 9. Verify integrity (summary default)
./target/release/csv-managed.exe schema verify -m orders-schema.yml -i q1.csv
#     Investigate failures with highlighted samples (optional limit)
./target/release/csv-managed.exe schema verify -m orders-schema.yml -i orders_invalid.csv --report-invalid:detail:summary 5
```

For more advanced derived column and filtering patterns (bucketing, temporal calculations, chained logic), see `docs/expressions.md`.

## Command Reference

Detailed `--help` output for every command is mirrored in `docs/cli-help.md` for quick reference.

### schema

Define schemas manually or discover them via `probe` / `infer`; verify datasets against a saved schema and optionally enforce value replacements.

| Subcommand / Flag | Description |
|-------------------|-------------|
| `schema probe` | Display inferred columns and types in a console table (no file written). |
| `schema infer` | Infer and optionally persist a `-schema.yml` file (`-o/--output`). Legacy `*-schema.yml` written only if explicitly requested. |
| `schema verify` | Stream-validate one or more files against a schema (`-m/--schema`). |
| `-i, --input <FILE>` | Input CSV for `probe` or `infer`. Repeat `-i` for multiple inputs in `verify`. |
| `-o, --output <FILE>` | Destination schema file (alias `--schema` retained). Prefer `-schema.yml` for new files; legacy `*-schema.yml` still accepted. |
| `-m, --schema <FILE>` | Schema file to use with `verify` (or as destination alias with `infer`). |
| `-c, --column <SPEC>` | Manual column definitions (`name:type`, or `name:type->Alias`). Repeatable / comma list. |
| `--replace <SPEC>` | Value replacement directive (`column=value->replacement`) for manual schema authoring. |
| `--sample-rows <N>` | Rows to sample during inference (`0` = full scan). |
| `--delimiter <VAL>` | Override input delimiter (`comma`, `tab`, `semicolon`, `pipe`, or single ASCII). |
| `--input-encoding <ENC>` | Character encoding of input (defaults `utf-8`). |
| `--mapping` | Emit column mapping templates (aliases) to stdout when probing/inferring, including a `suggested` snake_case column for quick copy/paste. |
| `--replace-template` | Inject empty `replace` arrays per column when inferring. |
| `--override <SPEC>` | Force specific inferred types (`amount:Float`, `id:Integer`). Repeatable. |
| `--snapshot <PATH>` | Capture/compare probe or infer output against a golden snapshot. Writes if missing, fails on drift; see [Snapshot Internals](#snapshot-internals). |
| `--report-invalid[:detail[:summary]] [LIMIT]` | (verify) Base flag adds summary of invalid columns; `:detail` adds ANSI highlighted row samples; `:summary` adds aggregate column violation counts (combine both as needed). LIMIT (after flags) caps row samples when `:detail` present. |

Behavior notes:

* `schema probe` renders an elastic table of inferred columns plus sample-based hints; footer indicates scan scope and any decoding skips.
* `schema infer` shares all probe options and adds persistence, mapping templates, and optional replace scaffolding. Pass `--preview` to review the resulting YAML (including `--replace-template` scaffolding and NA placeholder replacements); mapping templates still emit when `--mapping` is set. Combine with `--diff existing-schema.yml` to emit a unified diff against a saved schema (and optionally `-o` to perform a dry run before persisting).
* `schema verify` streams every row, applying `replace` mappings before type parsing; any invalid value triggers non‑zero exit code. Reporting tiers: base (`--report-invalid`), detail (`--report-invalid:detail [LIMIT]`), combined (`--report-invalid:detail:summary [LIMIT]`).
* `--snapshot` applies to `probe` and `infer`, guarding the textual layout & inference heuristics (see [Snapshot Internals](#snapshot-internals) and [Snapshot vs Schema Verify](#snapshot-vs-schema-verify)).
* Datatype inference uses a majority-based voting algorithm over the sampled (or full) row set; tie scenarios fall back to the most general viable type.
* Header presence is auto-detected: up to the first 6 physical rows are sampled. If the first row appears header-like (alphabetic tokens, dictionary matches, or position vs data heuristics) it is treated as the header; otherwise synthetic names `field_0`, `field_1`, ... are generated and the schema records `has_headers: false` so downstream operations read the first physical row as data.
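The voting idea can be reduced to a toy sketch (the real algorithm, full type lattice, and placeholder rules are documented in `docs/schema-inference.md`; the candidate checks below are deliberately simplified):

```python
from collections import Counter

PLACEHOLDERS = {"", "na", "n/a", "#na", "#n/a", "null", "missing"}

def vote_type(values):
    # Higher number = more general; used only to break ties.
    generality = {"Boolean": 0, "Integer": 0, "Float": 1, "String": 2}

    def classify(v: str) -> str:
        if v.lower() in {"true", "false"}:
            return "Boolean"
        try:
            int(v)
            return "Integer"
        except ValueError:
            pass
        try:
            float(v)
            return "Float"
        except ValueError:
            return "String"

    # Placeholder tokens are excluded from voting entirely.
    votes = Counter(
        classify(v) for v in values if v.strip().lower() not in PLACEHOLDERS
    )
    if not votes:
        return "String"  # nothing but placeholders
    top = max(votes.values())
    tied = [t for t, c in votes.items() if c == top]
    return max(tied, key=lambda t: generality[t])
```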

#### Headerless CSV Support

Some datasets (exported sensors, legacy ETL extracts) omit a header row. `csv-managed` now infers this automatically:

* Samples up to the first 6 rows even when a schema is not yet available.
* Classifies each token in row 1 as header-like (contains letters, or matches a curated dictionary such as `id`, `date`, `status`, `amount`) or data-like (parses as numeric, boolean, date/time, GUID, etc.).
* Compares row‑1 tokens to subsequent rows column-by-column to accumulate header vs data “signals”.
* Resolves ties using dictionary hits and relative counts of header-like vs data-like tokens.
* When determined headerless, assigns zero-based synthetic names `field_0..field_{N-1}`. These are treated exactly like real headers (may be referenced directly or via positional aliases `cN`). You can later rename via schema edits or mapping templates.
* Explicit override: pass `--assume-header true|false` to `schema probe` or `schema infer` when you already know whether the first row is a header. The flag bypasses the heuristic and persists the choice into the generated schema.
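A rough illustration of the signal counting (the dictionary here is a tiny stand-in for the curated one, and the per-column comparison against later rows is omitted):

```python
import re

HEADER_WORDS = {"id", "date", "status", "amount", "name", "time"}  # stand-in dictionary

def looks_like_data(tok: str) -> bool:
    # Data-like: boolean, numeric, or ISO-date-shaped (simplified checks).
    if tok.lower() in {"true", "false"}:
        return True
    try:
        float(tok)
        return True
    except ValueError:
        pass
    return bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", tok))

def first_row_is_header(rows) -> bool:
    first = rows[0]
    header_signals = sum(
        1
        for tok in first
        if not looks_like_data(tok)
        and (any(c.isalpha() for c in tok) or tok.lower() in HEADER_WORDS)
    )
    data_signals = sum(1 for tok in first if looks_like_data(tok))
    return header_signals > data_signals
```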

Schema persistence:

```yaml
schema_version: 1.0
has_headers: false
columns:
  - name: field_0
    datatype: Integer
  - name: field_1
    datatype: Float
```

To override a mistaken classification, edit the saved schema file, set `has_headers: true` (or `false`), and, if necessary, adjust column names. All commands that accept `-m/--schema` honor this flag; if omitted it defaults to `true` for backward compatibility.

Quick probe example (headerless):

```powershell
./target/release/csv-managed.exe schema probe -i ./tests/data/sensor_readings_no_header.csv --preview
```

Processing with synthetic names:

```powershell
./target/release/csv-managed.exe process -i sensor_readings_no_header.csv -m sensor_readings_no_header-schema.yml -C field_0,field_2 --limit 5 --preview
```

If you prefer stable, semantic names, run `schema infer --mapping` and then rename the generated YAML entries (or apply the mapping template) before further processing.

##### FAQ: Header Detection Issues

**Q: The tool decided my file was headerless but the first row actually contains column names.**  
Cause: Most first-row tokens parsed as data-like (numbers, dates) and lacked alphabetic/dictionary signals.  
Fix: Open the saved schema, set `has_headers: true`, and rename the `field_#` entries to proper names (or rerun `schema infer` after inserting a temporary clearly alphabetic header line). Future runs using the schema will honor your correction.

**Q: My first data row was treated as a header.**  
Cause: Tokens looked alphabetic (e.g. sensor IDs like `A1`, `B2`) or matched common header words; limited sample rows produced ambiguous signals.  
Fix: Edit schema to `has_headers: false`. If you want semantic names, rename the auto-generated `field_#` entries after toggling. Optionally add an explicit dummy header row upstream for clarity, or add an additional early data row before inferring.

**Q: Different files in a batch mix headered and headerless layouts.**  
Fix: Normalize upstream or create two schemas (one with `has_headers: true`, one with `false`) and process in separate invocations before combining.

**Q: Can I force behavior via a CLI flag?**  
Yes. Append `--assume-header true` or `--assume-header false` to `schema probe` / `schema infer`. The inferred schema records the chosen value in `has_headers`, so downstream commands inherit the override.

#### Column Naming Conventions (snake_case Preference)

`csv-managed` recommends (not enforces) snake_case header names. You will see a `suggested` snake_case column emitted when using `--mapping` with `schema probe` or `schema infer`. Adopting this convention yields several practical benefits:

* Shell & CLI ergonomics – No spaces or punctuation to quote/escape in PowerShell, cmd.exe, Bash, or when passing expressions (e.g. `--derive total_with_tax=amount*1.0825`).
* Expression reliability – Evalexpr tokenization stays simple: lowercase + underscore avoids accidental case mismatches or the need for backticks/quotes around names containing spaces or symbols.
* Cross‑language portability – Aligns with common JSON, Python, Rust, and many data engineering tool defaults; easier re‑use of column names across scripts, notebooks, and code generators.
* Stable diffs & snapshots – Lowercase, delimiter‑normalized headers reduce noisy changes caused by accidental capitalization or extra spaces, keeping `--snapshot` hash churn low.
* Safer normalizations – Underscores avoid confusion with minus signs, arithmetic operators, or CSV delimiters inside derived expressions and filters.
* Interoperability – Tools like `awk`, `grep`, SQL-like DSLs, and future pipeline integrations consume simple tokens more predictably.
* YAML friendliness – Snake_case keys avoid quoting in YAML and reduce risk of special-character parsing surprises.
* Index & mapping predictability – Consistent naming improves human recognition of multi-column index variants and reduces ambiguity in covering expansions.

Recommended guidelines:

1. Begin with a letter (`a-z`). If the raw header starts with a digit, prefix with `col_` or reorder around a domain hint (e.g., `2024 Sales` → `col_2024_sales` or `sales_2024`).
2. Lowercase all characters.
3. Replace contiguous spaces or punctuation (`[\s\-\/\.#$%&()]`) with a single underscore.
4. Collapse multiple underscores into one.
5. Remove trailing underscores.
6. Preserve digits where meaningful (e.g., `http2_requests`).
7. Resolve collisions by appending a domain qualifier or ordinal (`amount` vs `amount_original`).

Typical conversions:

| Original Header | Suggested | Rationale |
|-----------------|-----------|-----------|
| `Order Date` | `order_date` | Spaces → underscore, lowercase. |
| `OrderDate` | `order_date` | CamelCase split. |
| `  Gross$ Amount (USD) ` | `gross_amount_usd` | Trim, punctuation stripped, semantic unit retained. |
| `Customer-ID` | `customer_id` | Hyphen → underscore. |
| `SKU#` | `sku` | Symbol removed; short code already unique. |
| `Total.Net` | `total_net` | Period → underscore. |
| `ShipTime(s)` | `ship_time_s` | Parentheses removed; unit suffix preserved. |

Deriving snake_case from existing headers:

1. Trim leading/trailing whitespace.
2. Replace punctuation/spaces with underscores.
3. Insert underscores between lowercase/uppercase boundaries in CamelCase (`OrderDate` → `Order_Date`).
4. Lowercase the result.
5. Collapse multiple underscores.
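
The steps above can be sketched in Python (an illustrative sketch, not the tool's actual normalizer):

```python
import re

def to_snake_case(header: str) -> str:
    """Derive a snake_case column name following the five steps above."""
    s = header.strip()                             # 1. trim whitespace
    s = re.sub(r"[\s\-/\.#$%&()]+", "_", s)        # 2. punctuation/spaces -> underscore
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", s)  # 3. CamelCase boundaries
    s = s.lower()                                  # 4. lowercase
    return re.sub(r"_+", "_", s).strip("_")        # 5. collapse & strip underscores
```

For example, `to_snake_case("  Gross$ Amount (USD) ")` yields `gross_amount_usd`, matching the conversion table above.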

Why not enforce? Some domains require legacy casing (`CustomerID`, `MFRPartNo`). The schema mapping mechanism (`--mapping`) lets you retain raw names while assigning snake_case aliases for processing/derives. Use whichever name feels most natural in downstream expressions—aliases appear post‑mapping.

Rule of thumb: If a header would need quoting in a shell, convert it to snake_case before committing the schema.

#### Datatype Inference Overview

The precise inference algorithm, placeholder handling rules, currency promotion thresholds, decimal precision calculation, and tie behaviors are documented in detail in `docs/schema-inference.md`. The summary below highlights only the most essential operational points:

* Sampling: `--sample-rows <N>` (default 2000) or full scan with `--sample-rows 0`.
* Voting: Each non-empty, non-placeholder value registers observations for boolean, numeric (integer/decimal/float), date, datetime, time, guid, and currency suitability. Placeholder tokens (e.g. `NA`, `N/A`, `#NA`, `#N/A`, `null`, `missing`, dashed lines) are ignored for voting but can be converted to replacements depending on `--na-behavior`.
* Decimal vs Float: Consistent fixed-scale numbers within precision limits (<=28) yield `decimal(p,s)`; overflow or mixed scales degrade to `Float`.
* Currency: Promoted ahead of Decimal/Float only when 100% of non-empty numeric values fit allowed currency scales (0,2,4) AND ≥30% bear a symbol; otherwise normal majority rules apply (majority currency with at least one symbol can still win later in the decision chain).
* Unclassified Values: Presence of any unclassified token forces a fallback to `String` for that column (guard against accidental narrow type assignment).
* Overrides: `--override name:Type` applies post-inference and is indicated in the probe table's override column (`type`).
* NA Placeholders: `--na-behavior fill --na-fill <TOKEN>` injects replacement entries mapping each observed placeholder to the fill token; `--na-behavior empty` maps them to empty strings.
* Snapshots: `--snapshot` captures the rendered probe/infer output (including hash, summaries, placeholder suggestions) for regression safety.
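
The currency promotion rule above can be sketched as a standalone predicate (a simplified illustration, not the actual implementation; symbol detection here assumes a leading `$` only, and the helper name is hypothetical):

```python
ALLOWED_CURRENCY_SCALES = (0, 2, 4)

def promote_currency(values: list[str]) -> bool:
    """True when every non-empty value fits an allowed currency scale
    and at least 30% of them carry a symbol."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return False

    def scale(v: str) -> int:
        # Count decimal places after stripping symbol and thousands separators.
        digits = v.lstrip("$").replace(",", "")
        return len(digits.split(".")[1]) if "." in digits else 0

    fits = all(scale(v) in ALLOWED_CURRENCY_SCALES for v in non_empty)
    symbol_share = sum(v.startswith("$") for v in non_empty) / len(non_empty)
    return fits and symbol_share >= 0.30
```

A value like `12.345` (scale 3) defeats promotion outright, while an all-numeric column with fewer than 30% symbols falls back to the normal majority rules.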

For full pseudocode, edge cases (leading zeros, exponential numeric forms, parentheses negatives, precision overflow), tie scenarios, and troubleshooting guidance, read: [`docs/schema-inference.md`](docs/schema-inference.md). Additional usage examples remain in [`docs/schema-examples.md`](docs/schema-examples.md).

#### Overrides vs Mappings vs Replacements

Decision guide for choosing schema mutation mechanisms:

| Purpose | Overrides (`--override name:Type`) | Datatype Mappings (`datatype_mappings`) | Value Replacements (`replace:` or `--replace`) |
|---------|------------------------------------|-------------------------------------------|------------------------------------------------|
| Force a final datatype despite mixed raw values | Yes | No (transforms, not votes) | No |
| Parse & convert string representations (dates, numbers) | Sometimes (if inference failed) | Yes (String→DateTime→Date, String→Float, etc.) | No |
| Normalize token casing / trim whitespace | No | Yes (String→String with `trim`, `lowercase`, `uppercase`) | Sometimes (for isolated tokens) |
| Round/truncate numeric precision or currency scale | No | Yes (`strategy: round` / `truncate`) | No |
| Canonicalize categorical variants (Pending→Open) | No | Rarely (String→String strategies) | Yes (map each variant) |
| Standardize NA-style placeholders | No | Not required | Yes (auto via `--na-behavior`, or manual) |
| Preserve raw column while exposing cleaned alias | Use with `--mapping` if type must differ | Yes (chain to target type + rename) | Yes (post-mapping cleanup) |
| Stabilize dirty ID column with stray non-numeric tokens | Yes (override to Integer) | Optional (String→Integer) | Yes (replace stray tokens) |
| Enforce fixed decimal spec (decimal(p,s)) | Yes if inference chose Float | Yes (String/Float→decimal) | No |
| Simplify DateTime to Date | Prefer mappings DateTime→Date | Yes | No |

Quick heuristics:

* Prefer inference first; apply overrides only for true domain guarantees.
* Introduce replacements early for categorical harmonization and placeholder cleanup.
* Use mappings for structural/format transformations that must occur before validation.
* Avoid overrides when a mapping chain can safely yield the desired final representation.
* Re-run `schema infer --diff existing.yml` after adding mappings or replacements to audit changes.

See `docs/schema-inference.md` for algorithm specifics influencing when overrides are necessary.

PowerShell (inference mode):

```powershell
./target/release/csv-managed.exe schema infer `
  -i ./data/orders.csv `
  -o ./data/orders-schema.yml `
  --delimiter tab `
  --sample-rows 0 `
  --mapping `
  --replace-template `
  --na-behavior fill --na-fill NULL
```

PowerShell (explicit columns with replacements):

```powershell
./target/release/csv-managed.exe schema `
  -o ./schemas/orders-schema.yml `
  -c id:integer->Identifier `
  -c "customer_id:integer->Customer ID,order_date:date,amount:float,status:string" `
  --replace status=Pending->Open `
  --replace "status=Closed (Legacy)->Closed"
```

cmd.exe:

```batch
./target/release/csv-managed.exe schema ^
  -o ./schemas/orders-schema.yml ^
  -c id:integer->Identifier ^
  -c "customer_id:integer->Customer ID,order_date:date,amount:float,status:string"
```

### index

Build a B-Tree index for specified key columns (ascending order optimization).

| Flag | Description |
|------|-------------|
| `-i, --input <FILE>` | Input CSV file. |
| `-o, --index <FILE>` | Output `.idx` file. |
| `-C, --columns <LIST>` | Legacy single ascending index (comma list). Superseded by `--spec`. |
| `--spec <SPEC>` | Repeatable: `name=col_a:asc,col_b:desc` or `col_a:asc`. Builds named variants per index file. |
| `--covering <SPEC>` | Generate prefix combinations with optional direction branches using `\|`, e.g. `geo=date:asc\|desc,customer:asc`. |
| `-m, --schema <FILE>` | Optional schema file. |
| `--limit <N>` | Stop after N rows (partial index). |
| `--delimiter <VAL>` | Input delimiter. |

PowerShell:

```powershell
./target/release/csv-managed.exe index `
  -i ./data/orders.csv `
  -o ./data/orders.idx `
  --spec default=order_date:asc,customer_id:asc `
  --spec recent=order_date:desc `
  -m ./data/orders-schema.yml
```

cmd.exe:

```batch
./target/release/csv-managed.exe index ^
  -i ./data/orders.csv ^
  -o ./data/orders.idx ^
  --spec default=order_date:asc,customer_id:asc ^
  --spec recent=order_date:desc ^
  -m ./data/orders-schema.yml
```

`--spec` accepts comma-separated `column:direction` tokens. Prefix with `name=` to label the variant (e.g. `fast=col_a:asc,col_b:desc`). When omitted, the variant is anonymous but still usable for automatic matching.
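
The token grammar can be illustrated with a tiny parser sketch (a hypothetical helper; defaulting a missing direction to `asc` is an assumption, and the CLI may require an explicit direction):

```python
def parse_spec(spec: str):
    """Split 'name=col_a:asc,col_b:desc' into (name, [(column, direction), ...])."""
    if "=" in spec:
        name, _, body = spec.partition("=")   # named variant
    else:
        name, body = None, spec               # anonymous variant
    keys = []
    for token in body.split(","):
        column, _, direction = token.partition(":")
        keys.append((column, direction or "asc"))  # assumed default direction
    return name, keys
```

So `parse_spec("default=order_date:asc,customer_id:asc")` produces the named variant with two ascending keys, while `parse_spec("col_a:asc")` yields an anonymous variant.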

### Snapshot Internals

Snapshot files captured by `schema probe --snapshot` or `schema infer --snapshot` now contain structured diagnostics to make regression reviews easier:

* **Header+Type Hash** – a SHA-256 digest that locks the column ordering and inferred datatypes. Any change to headers or datatypes produces a new hash even if table formatting stays the same.
* **Observations Column** – the probe table now carries the per-column sampling summary inline (`non_empty`, `empty`, representative samples, and placeholder mentions), keeping drift diagnostics attached to the header row without duplicating content in separate sections.

When a snapshot mismatch occurs, these diagnostics highlight exactly which aspect changed—structure, type inference, or observed value distribution—before you decide whether to refresh the snapshot.
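
The header+type hash behaves like a digest over the ordered (header, datatype) pairs. The exact serialization is internal to csv-managed; this sketch only demonstrates why any header or type change (including reordering) produces a new digest:

```python
import hashlib

def header_type_hash(columns: list[tuple[str, str]]) -> str:
    """SHA-256 over ordered (name, datatype) pairs (illustrative serialization)."""
    payload = "\n".join(f"{name}:{dtype}" for name, dtype in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```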

### process

Transform pipeline: sort, filter, derive, project, exclude, boolean formatting, row numbers, delimiter changes, optional table output.

| Flag | Description |
|------|-------------|
| `-i, --input <FILE>` | Input CSV (required). |
| `-o, --output <FILE>` | Output file (stdout if omitted). |
| `-m, --schema <FILE>` | Schema file. |
| `-x, --index <FILE>` | Index file for accelerated sort matching asc/desc directives. |
| `--index-variant <NAME>` | Pin to a named variant stored in the index file (requires matching `--sort`). |
| `--sort <SPEC>` | Repeatable: `column[:asc\|desc]`. Comma list or multiple uses. |
| `-C, --columns <LIST>` | Inclusion list (repeatable). |
| `--exclude-columns <LIST>` | Exclusion list (repeatable). |
| `--derive <name=expr>` | Derived column (repeatable). |
| `--filter <expr>` | Filter expression (repeatable; AND). |
| `--filter-expr <EXPR>` | Evalexpr-based filter evaluated per row; supports temporal helpers (`date_add`, `date_diff_days`, `time_diff_seconds`, etc.). Use double-quoted string literals for constants (e.g., `"06:00:00"`). |
| `--row-numbers` | Prepend `row_number`. |
| `--limit <N>` | Emit at most N rows. |
| `--delimiter <VAL>` | Input delimiter. |
| `--output-delimiter <VAL>` | Output delimiter override. |
| `--boolean-format <FORMAT>` | Normalize boolean output. Formats: `original`, `true-false`, `one-zero`. |
| `--preview` | Render a preview table on stdout (defaults `--limit` to 10; cannot be combined with `--output`). |
| `--table` | Render as formatted table when streaming to stdout (ignored when writing to a file). |
| (see Expression Reference) | Advanced derived, filter, and temporal helper syntax. |

PowerShell:

```powershell
./target/release/csv-managed.exe process `
  -i ./data/orders.csv `
  -m ./data/orders-schema.yml `
  -x ./data/orders.idx `
  --index-variant default `
  --sort order_date:asc,customer_id:asc `
  --filter "status = shipped" `
  --filter "amount >= 100" `
  --derive 'total_with_tax=amount*1.0825' `
  --derive 'channel="online"' `
  -C order_id,customer_id,amount,total_with_tax `
  --exclude-columns internal_flag `
  --row-numbers `
  --boolean-format one-zero `
  --output-delimiter pipe
```

cmd.exe:

```batch
./target/release/csv-managed.exe process ^
  -i ./data/orders.csv ^
  -m ./data/orders-schema.yml ^
  -x ./data/orders.idx ^
  --index-variant default ^
  --sort order_date:asc,customer_id:asc ^
  --filter "status = shipped" ^
  --filter "amount >= 100" ^
  --derive "total_with_tax=amount*1.0825" ^
  --derive "channel=\"online\"" ^
  -C order_id,customer_id,amount,total_with_tax ^
  --exclude-columns internal_flag ^
  --row-numbers ^
  --boolean-format one-zero ^
  --output-delimiter pipe
```

If `--index-variant` is omitted, `process` automatically chooses the variant that covers the longest prefix of the requested `--sort` columns and directions.

Schema-driven replacements defined in the schema file (`*-schema.yml`) are always applied before parsing, so `process` can clean and transform data in a single pass.

### append

Append multiple CSV files into a single output. Ensures consistent headers (baseline or schema enforced).

| Flag | Description |
|------|-------------|
| `-i, --input <FILE>` | Repeatable input files (first defines header). |
| `-o, --output <FILE>` | Output file (stdout if omitted). |
| `-m, --schema <FILE>` | Optional schema for strict validation. |
| `--delimiter <VAL>` | Delimiter for all inputs. |

Example:

```powershell
./target/release/csv-managed.exe append -i jan.csv -i feb.csv -i mar.csv -m orders-schema.yml -o q1.csv
```

### Snapshot vs Schema Verify

The `--snapshot` flag (used with `schema probe` or `schema infer`) and the `schema verify` subcommand serve **complementary but distinct** purposes:

| Aspect | Snapshot (`schema probe --snapshot` / `schema infer --snapshot`) | Schema verify (`schema verify`) |
|--------|------------------------------------------------------------------|---------------------------------|
| Primary goal | Guard against unintended changes in probe/infer formatting or inference heuristics (layout, ordering, inferred types) | Enforce that actual CSV row values conform to a declared schema (types, headers, replacements) |
| Domain | Developer regression / output stability | Data quality / contractual correctness |
| Data scanned | Headers + optional sampled rows (based on `--sample-rows`) | Entire file(s), streaming every row |
| Artifact | A snapshot text file (golden layout); created if missing, compared if present | No artifact on success; optional ANSI-highlighted report on failure |
| Validation granularity | Whole rendered output string (byte/line comparison) | Per‑cell parsing & typed normalization |
| Failure cause | Rendered output differs from saved snapshot | Any cell cannot be parsed/mapped to its declared datatype |
| Typical CI use | Lock down formatting & inference behavior so docs/tests stay stable | Block ingestion of malformed or schema‑incompatible data |
| Performance profile | Very fast (sample + render) | Potentially heavy for large files; optimized via streaming |
| Update workflow | Rerun with `--snapshot` intentionally to refresh after accepted changes | Update schema file separately as data definitions evolve |

#### When to Use Which

Use a snapshot when you want to ensure the *presentation and inference logic* of schema discovery has not drifted (e.g., after refactors or heuristic tweaks). Use `schema verify` when validating real datasets prior to append, stats, indexing, joins, or downstream ML pipelines.

#### Example Workflow

```powershell
# 1. Infer schema and create/update snapshot of inference layout
./target/release/csv-managed.exe schema infer -i data.csv -o data-schema.yml --snapshot infer.snap --sample-rows 0

# 2. Commit both data-schema.yml and infer.snap

# 3. Later, validate new extracts against the frozen schema
./target/release/csv-managed.exe schema verify -m data-schema.yml -i new_extract.csv --report-invalid:detail:summary 25
```

If inference heuristics or display formatting changes intentionally, refresh the snapshot:

```powershell
./target/release/csv-managed.exe schema probe -i data.csv --snapshot infer.snap --sample-rows 10
```

If the snapshot file was removed first, the command recreates it; if it still exists and differs, the command fails, prompting a conscious review. Keep snapshots small by combining them with modest `--sample-rows` values; full scans are unnecessary for layout regression.

#### Summary

*Snapshot = regression guard on inferred schema presentation.*  
*Verify = runtime enforcement of data correctness against a schema.*

### preview

Display first N rows in an elastic table.

| Flag | Description |
|------|-------------|
| `-i, --input <FILE>` | Input file. |
| `--rows <N>` | Number of data rows (default 10). |
| `--delimiter <VAL>` | Input delimiter. |

### stats

Summary statistics for numeric and temporal columns.

Supported types:

* Numeric: Integer, Float
* Temporal: Date, DateTime, Time

Temporal values are internally converted to numeric metrics for aggregation:

* Date => days from Common Era (CE)
* DateTime => epoch seconds (UTC naive)
* Time => seconds from midnight

They are rendered back to canonical forms; standard deviation for Date reports `days` and for DateTime/Time reports `seconds`.
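
The conversions can be expressed directly with standard date/time types (a sketch; the epoch and ordinal conventions shown are illustrative, not csv-managed's internal representation):

```python
from datetime import date, datetime, time

def date_to_ce_days(d: date) -> int:
    """Days from the Common Era: ordinal 1 is 0001-01-01."""
    return d.toordinal()

def datetime_to_epoch_seconds(dt: datetime) -> int:
    """Naive (timezone-less) seconds since the Unix epoch."""
    return int((dt - datetime(1970, 1, 1)).total_seconds())

def time_to_seconds(t: time) -> int:
    """Seconds from midnight."""
    return t.hour * 3600 + t.minute * 60 + t.second
```

Aggregates (mean, median, standard deviation) run over these integers and are rendered back to `YYYY-MM-DD` / `HH:MM:SS` forms for display.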

| Flag | Description |
|------|-------------|
| `-i, --input <FILE or ->` | Input file or `-` (stdin; requires schema). |
| `-m, --schema <FILE>` | Schema file (recommended). |
| `-C, --columns <LIST>` | Restrict to listed columns (defaults to numeric & temporal columns, or all columns when `--frequency` is used). |
| `--delimiter <VAL>` | Input delimiter. |
| `--frequency` | Emit distinct value counts instead of summary statistics. |
| `--top <N>` | Limit to the top N values per column when `--frequency` is used (0 = all). |
| `--limit <N>` | Scan at most N rows (0 = all). |
| (see Expression Reference) | Extended filter / temporal helper functions. |

> NOTE (Piped / Streaming Usage): When feeding `stats` from a prior `process` stage via stdin (`-i -`), the incoming header row must match the schema exactly (same columns, order, and count). Avoid adding derived columns, dropping columns (`--columns` / `--exclude-columns`), or reordering headers if you will reuse the original schema downstream. Restrict upstream changes to row-level filters, sorts, limits, encoding normalization. See: [Designing Multi-Stage Pipelines](docs/pipelines.md#header-shape-invariance-between-typed-stages).

#### Temporal stats example

Given a temporal schema file:

```yaml
schema_version: "1.0"
columns:
  - name: id
    datatype: Integer
  - name: ordered_at
    datatype: Date
  - name: ordered_at_ts
    datatype: DateTime
  - name: shipped_at
    datatype: Date
  - name: shipped_at_ts
    datatype: DateTime
  - name: ship_time
    datatype: Time
  - name: status
    datatype: String
```

Run stats over temporal columns:

```powershell
./target/release/csv-managed.exe stats -i ./data/orders_temporal.csv -m ./data/orders_temporal-schema.yml `
  --columns ordered_at --columns ordered_at_ts --columns ship_time
```

Sample output (elastic table formatting):

```text
| column         | count | min                | max                | mean                | median              | std_dev        |
| ordered_at     | 4     | 2024-01-01         | 2024-02-10         | 2024-01-31          | 2024-01-06          | 15.56 days     |
| ordered_at_ts  | 4     | 2024-01-01 04:45:00| 2024-02-10 14:00:00| 2024-01-30 17:03:45 | 2024-01-06 05:57:30 | 1345678 seconds|
| ship_time      | 4     | 06:00:00           | 16:30:00           | 09:37:30            | 08:00:00            | 12810 seconds  |
```

Mean and median for Time represent the central tendency of seconds-from-midnight values, rendered back into `HH:MM:SS`.

Apply filters to restrict the rows included in the calculation:

```powershell
./target/release/csv-managed.exe stats -i ./data/stats_schema.csv -m ./data/stats_schema-schema.yml `
  --columns quantity --filter "status=good"
```

`--filter` accepts the same column comparisons as `process --filter`. For complex predicates, repeat `--filter` or add `--filter-expr` for Evalexpr-based expressions. Filters apply to both summary statistics and `--frequency` output.

#### Frequency counts (--frequency)

`stats --frequency` reports distinct value counts per column. By default, every column is included; use `-C/--columns` to target a subset. Combine with `--top` to cap the number of values displayed per column (0 = all).

Example combining `--frequency` with filters over the Big 5 dataset:

```powershell
./target/release/csv-managed.exe stats `
  -i ./tests/data/big_5_players_stats_2023_2024.csv `
  --frequency `
  -C Squad `
  --filter "Player=Max Aarons"
```

Sample output (elastic table formatting):

```text
| column | value       | count | percent |
| Squad  | Bournemouth | 1     | 100.00% |
```

### join

The hash-join engine remains part of the codebase, but the standalone `join` subcommand has been withdrawn from the CLI while we redesign a streaming-friendly workflow for v1.6.0. Existing scripts should transition to `process`-first pipelines (filters, derives, preview, append) until the new join interface lands. Follow the roadmap in [`.todos/plan-v1.6.0.md`](.todos/plan-v1.6.0.md) for progress updates on the pipeline-oriented join strategy.

### install

Wrapper around `cargo install csv-managed` with a friendlier interface.

| Flag | Description |
|------|-------------|
| `--version <SEMVER>` | Install a specific published version. |
| `--force` | Reinstall even if already installed. |
| `--locked` | Pass `--locked` to respect `Cargo.lock`. |
| `--root <DIR>` | Target an alternate installation directory. |

Example:

```powershell
./target/release/csv-managed.exe install --locked
```

### schema columns

List schema columns and their data types in a formatted table.

| Flag | Description |
|------|-------------|
| `-m, --schema <FILE>` | Schema file describing the columns to list. |

PowerShell:

```powershell
./target/release/csv-managed.exe schema columns `
  --schema ./data/orders-schema.yml
```

cmd.exe:

```batch
./target/release/csv-managed.exe schema columns ^
  --schema ./data/orders-schema.yml
```

### Data Types

| Type | Examples | Notes |
|------|----------|-------|
| String | any UTF‑8 | Normalized header names usable in expressions. |
| Integer | `42`, `-7` | 64-bit signed. |
| Float | `3.14`, `2` | f64; integers accepted. |
| Boolean | `true/false`, `t/f`, `yes/no`, `y/n`, `1/0` | Parsing flexible; output format selectable. |
| Date | `2024-08-01`, `08/01/2024`, `01/08/2024` | Canonical output `YYYY-MM-DD`. |
| DateTime | `2024-08-01T13:45:00`, `2024-08-01 13:45` | Naive (no timezone). |
| Time | `06:00:00`, `14:30`, `08:01:30` | Canonical output `HH:MM:SS`; inference accepts `HH:MM[:SS]`. |
| Currency | `$12.34`, `123.4567` | Enforces 2 or 4 decimal places; thousands separators and leading symbols permitted; outputs normalized decimals. |
| Decimal | `123.4567`, `(1,234.50)`, `decimal(18,4)` | Fixed precision/scale numeric type (max precision 28). Accepts optional sign, parentheses for negatives, and separators; schema mappings can `round` or `truncate` to the declared scale. |
| Guid | `550e8400-e29b-41d4-a716-446655440000`, `550E8400E29B41D4A716446655440000` | Case-insensitive; accepts hyphenated or 32-hex representations. |

> See the [Expression Reference](#expression-reference) for temporal helper usage (date/time arithmetic & formatting), boolean output formatting considerations, and quoting rules affecting String, Date, DateTime, Time parsing in derived expressions and filters.
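
The flexible Boolean parsing and the `process --boolean-format` output modes can be sketched as follows (token sets are taken from the table above; case-insensitive matching and the function names are assumptions):

```python
TRUE_TOKENS = {"true", "t", "yes", "y", "1"}
FALSE_TOKENS = {"false", "f", "no", "n", "0"}

def parse_boolean(raw: str) -> bool:
    """Accept any documented boolean token, ignoring case and whitespace."""
    token = raw.strip().lower()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    raise ValueError(f"not a boolean: {raw!r}")

def format_boolean(value: bool, fmt: str, raw: str) -> str:
    """Render a parsed boolean per the --boolean-format setting."""
    if fmt == "true-false":
        return "true" if value else "false"
    if fmt == "one-zero":
        return "1" if value else "0"
    return raw  # 'original' keeps the source token unchanged
```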

### Datatype Mappings

`datatype_mappings` let you declare an ordered chain of conversions that run **before** value replacements and final type parsing. Author them inside each column object in the schema file.

Key points:

* Capitalization: Use capitalized data types (`String`, `Integer`, `Float`, `Boolean`, `Date`, `DateTime`, `Time`, `Guid`, `Currency`) in production schema files. Decimal columns are declared with lowercase tokens such as `decimal(18,4)` to capture precision and scale.
* File naming: Prefer `<name>-schema.yml` for new schemas; older naming conventions still load but are deprecated.
* Replacement arrays: Use `replace` (array of `{ value, replacement }`). Older internal key `value_replacements` is auto-translated to `replace` when saving.
* Order matters: Each mapping consumes the previous output; declare from raw → intermediate → final.
* Strategies: `round` (numeric, including Currency and Decimal), `trim` / `lowercase` / `uppercase` (String→String), `truncate` (Float→Integer, Currency, Decimal). Rounding scale defaults to `4` for floats, the schema-declared scale for decimals, and the detected precision (2 or 4) for currency unless `options.scale` is provided.
* Options: Provide an `options` object for format guidance (e.g. a datetime `format`) or numeric rounding scale. Currency mappings accept `options.scale` of `2` or `4` to set precision explicitly.
* Failure: Any mapping parse error invalidates the row for that column during `schema verify`.

Example converting an ISO‑8601 timestamp with trailing `Z` to a date and rounding a decimal:

```yaml
- name: ordered_raw
  datatype: Date
  rename: ordered_at
  datatype_mappings:
    - from: String
      to: DateTime
      options:
        format: "%Y-%m-%dT%H:%M:%SZ"
    - from: DateTime
      to: Date

- name: amount_raw
  datatype: Float
  rename: amount
  datatype_mappings:
    - from: String
      to: Float
      strategy: round
      options:
        scale: 4
```

Built‑in fallback DateTime formats (used when no explicit `options.format` is specified):

```text
%Y-%m-%d %H:%M:%S
%Y-%m-%dT%H:%M:%S
%d/%m/%Y %H:%M:%S
%m/%d/%Y %H:%M:%S
%Y-%m-%d %H:%M
%Y-%m-%dT%H:%M
```

Common chrono tokens:

| Token | Meaning |
|-------|---------|
| `%Y`  | 4‑digit year |
| `%m`  | Month (01–12) |
| `%d`  | Day of month (01–31) |
| `%H`  | Hour (00–23) |
| `%M`  | Minute (00–59) |
| `%S`  | Second (00–60) |
| `%f`  | Fractional seconds (nanoseconds) |

Validation flow:

1. Raw value ingested.
2. `datatype_mappings` chain executes.
3. Value replacements apply.
4. Final parsing validates against the declared column `datatype`.
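
In pseudo-Python, the per-cell flow looks like this (all callables are hypothetical stand-ins for the schema-driven steps):

```python
def validate_cell(raw, mappings, replacements, final_parse):
    """Apply the mapping chain, then replacements, then the final parse."""
    value = raw
    for mapping in mappings:                 # step 2: ordered conversion chain
        value = mapping(value)               # any failure invalidates the row
    value = replacements.get(value, value)   # step 3: exact-match replacement
    return final_parse(value)                # step 4: declared datatype parse
```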

See extended examples in `docs/schema-examples.md`.

#### Schema File Format (Overview)

Minimal schema file example (YAML):

```yaml
schema_version: "1.0"
has_headers: true
columns:
  - name: id
    datatype: Integer
  - name: ordered_raw
    rename: ordered_at
    datatype: Date
    datatype_mappings:
      - from: String
        to: DateTime
        options:
          format: "%Y-%m-%dT%H:%M:%SZ"
      - from: DateTime
        to: Date
  - name: amount_raw
    rename: amount
    datatype: Float
    datatype_mappings:
      - from: String
        to: Float
        strategy: round
        options:
          scale: 4

  - name: measurement_raw
    rename: measurement
    datatype: decimal(18,4)
    datatype_mappings:
      - from: String
        to: decimal(18,4)
        strategy: truncate
    replace:
      - value: "N/A"
        replacement: "0"
```

Transformation order per cell:

1. Raw string value read.
2. `datatype_mappings` chain executes sequentially.
3. `replace` entries apply (exact match, case-sensitive for strings).
4. Final parse into declared `datatype`.

Header flag semantics:

* `has_headers: true` (default) – the first physical row is treated as column headers.
* `has_headers: false` – column names come from the schema (synthetic or previously persisted); the first physical row is processed as data.
* When absent (older schemas), the flag defaults to `true` for backward compatibility.


`round` uses `options.scale` (default 4). `truncate` converts Float→Integer removing fractional part (toward zero). String→String strategies: `trim`, `lowercase`, `uppercase`. Invalid strategy/type pairing marks the row invalid during `schema verify`.
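
The two numeric strategies can be sketched with Python's `decimal` module (the half-up rounding mode here is an assumption; the docs only specify the target scale):

```python
from decimal import Decimal, ROUND_HALF_UP

def apply_round(value: str, scale: int = 4) -> Decimal:
    """'round' strategy: round to the configured scale (default 4)."""
    return Decimal(value).quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)

def apply_truncate_to_integer(value: float) -> int:
    """'truncate' (Float->Integer): drop the fractional part, toward zero."""
    return int(value)
```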

#### Verify Reporting Tiers

| Flag Example | Exit Code | Row Samples | Column Summary | Notes |
|--------------|-----------|-------------|----------------|-------|
| `--report-invalid` | non-zero if any invalid | no | yes | Fast overview of violating columns. |
| `--report-invalid:detail` | non-zero | yes (all or limited) | no | Shows `row \| column` (and raw value when combined with `:summary`). |
| `--report-invalid:detail:summary` | non-zero | yes | yes | Combines row samples + aggregate counts. |
| `--report-invalid:detail 5` | non-zero | yes (5 rows) | no | Limit only affects samples. |
| `--report-invalid:detail:summary 5` | non-zero | yes (5 rows) | yes | Summary counts remain total, not truncated. |

Base operation without any `--report-invalid` flag logs overall invalid count to stderr when failures occur.

#### Upcoming Schema Enhancements (Roadmap)

Planned (not yet implemented): primary key & hashed signature indexes, date/time format re-writing transforms, automatic candidate key suggestion, batch JSON definition ingestion.

### Streaming & Pipelines

`csv-managed` supports Unix-style and Windows command chaining via stdin (`-i -`) and stdout. Use a single dash (`-`) as the input file to instruct commands to read from stdin. This enables incremental transformation without writing intermediate files and keeps memory usage low for large datasets.

Key points:

* Use `-i -` (or `--input -`) to read streamed CSV data.
* A schema (`-m/--schema`) is strongly recommended (and required for typed operations like `stats`, inferred datatypes, temporal helpers, currency/decimal parsing, and most filters).
* Column selection (`--columns` / `-C`), exclusion, derives, filters, and table/preview output all work with piped data.
* When chaining multiple commands, each downstream command must specify `-i -` explicitly.
* You can mix file inputs and streamed inputs (e.g., pipe the first file, pass a second file normally to `append`).
* If you rename or drop columns in an upstream stage, make sure the downstream schema matches the transformed header (the original schema will reject it).
* `process --preview` cannot be combined with `-o`; in a pipeline it simply renders the elastic table and stops (no downstream data). For chaining transformations, omit `--preview`.
* Header validation in downstream typed stages accepts either the original column names or their `name_mapping` (snake_case) forms. However, adding or removing columns (e.g. with `--derive`, `--exclude-columns`) requires an updated schema; otherwise a header mismatch will occur.
* For `stats`, only numeric / temporal / decimal / currency columns are profiled. If you project or rename away those typed columns upstream, `stats` may find zero eligible columns unless you specify them explicitly with `-C/--columns`.
* Encoding normalization pipelines can combine `--input-encoding` (upstream) and `--output-encoding` (downstream or same stage with `-o`) to standardize heterogeneous CSV sources.

Cross-reference: For advanced multi-stage design patterns (schema evolution, encoding normalization, safe header modifications) see: [Designing Multi-Stage Pipelines](docs/pipelines.md).

Roadmap & Backlog: A consolidated product roadmap by release (including upcoming versions, deprecations, and larger epics) now lives in [`.plan/backlog.md`](.plan/backlog.md). Refer there for planned feature sequencing beyond the high‑level notes below (e.g. join redesign v1.6.0, primary key indexing, streaming joins).

#### PowerShell Examples

Process streamed input and project a few columns:

```powershell
Get-Content .\tests\data\big_5_players_stats_2023_2024.csv |
  .\target\release\csv-managed.exe process -i - --schema .\tests\data\big_5_players_stats-schema.yml `
  --columns Player --columns Squad --columns Performance_Gls --limit 5 --table
```

Chain `process` into `stats` (filter rows, then compute statistics):

```powershell
Get-Content .\tests\data\big_5_players_stats_2023_2024.csv |
  .\target\release\csv-managed.exe process -i - --schema .\tests\data\big_5_players_stats-schema.yml `
  --filter "Performance_Gls >= 10" --limit 40 |
  .\target\release\csv-managed.exe stats -i - --schema .\tests\data\big_5_players_stats-schema.yml -C Performance_Gls
```

Append using a streamed first file and a second file on disk:

```powershell
Get-Content .\tests\data\big_5_players_stats_2023_2024.csv |
  .\target\release\csv-managed.exe append -i - -i .\tmp\big_5_preview.csv `
  --schema .\tests\data\big_5_players_stats-schema.yml -o .\tmp\players_union.csv
```

Stats from a filtered streamed transformation (filter first, stats second):

```powershell
Get-Content .\tests\data\big_5_players_stats_2023_2024.csv |
  .\target\release\csv-managed.exe process -i - --schema .\tests\data\big_5_players_stats-schema.yml `
  --filter "Performance_Gls >= 5" |
  .\target\release\csv-managed.exe stats -i - --schema .\tests\data\big_5_players_stats-schema.yml -C Performance_Gls
```

Encoding normalization (Windows-1252 -> UTF-8 in-memory, then stats):

```powershell
Get-Content .\tmp\big_5_windows1252.csv |
  .\target\release\csv-managed.exe process -i - --input-encoding windows-1252 --schema .\tests\data\big_5_players_stats-schema.yml `
  --columns Player --columns Performance_Gls --limit 25 |
  .\target\release\csv-managed.exe stats -i - --schema .\tests\data\big_5_players_stats-schema.yml -C Performance_Gls
```

#### cmd.exe Examples

Use `type` instead of `Get-Content`:

```batch
type tests\data\big_5_players_stats_2023_2024.csv | target\release\csv-managed.exe process -i - --schema tests\data\big_5_players_stats-schema.yml --columns Player --limit 5 --table
```

All relative paths in this section assume the current directory is the project root (where `Cargo.toml` lives).

For convenience, define helper variables before running the piped examples:

```batch
set "BIN=target\release\csv-managed.exe"
set "SCHEMA=tests\data\big_5_players_stats-schema.yml"
```

Chained transformation and stats:

```batch
type tests\data\big_5_players_stats_2023_2024.csv ^
| "%BIN%" process -i - --schema "%SCHEMA%" --filter "Performance_Gls >= 10" --limit 25 ^
| "%BIN%" stats -i - --schema "%SCHEMA%" -C Performance_Gls
```

Encoding normalization with explicit input encoding (Windows-1252 source) and chained stats:

```batch
type tmp\big_5_windows1252.csv ^
| "%BIN%" process -i - --input-encoding windows-1252 --schema "%SCHEMA%" --columns Player --columns Performance_Gls --limit 25 ^
| "%BIN%" stats -i - --schema "%SCHEMA%" -C Performance_Gls
```

Appending (mix streamed & file inputs):

```batch
type tests\data\big_5_players_stats_2023_2024.csv | "%BIN%" append -i - -i tmp\big_5_preview.csv --schema "%SCHEMA%" -o tmp\players_union.csv
```

#### Troubleshooting Pipelines

| Symptom | Cause | Fix |
|---------|-------|-----|
| Command hangs waiting for input | Upstream command didn't produce data yet or previous stage failed silently | Add `--preview --limit 5` temporarily to inspect stage output; verify file path and schema. |
| "Column not found" in downstream stage | Mapping/rename changed header names between stages | Use `schema columns` or inspect first stage output to confirm header normalization. |
| Stats reports zero rows | Filter removed all rows upstream | Remove filters or reduce strictness; test with `--limit 50` before stats. |
| Invalid datatype errors in chained stats | Missing schema or misdeclared types for piped data | Pass the correct `--schema` in every stage needing typed parsing. |

#### When a Schema Is Required

`stats`, typed filters (`--filter` numeric/temporal comparisons), temporal helpers, currency & decimal enforcement, boolean normalization, and datatype mappings all rely on schema context. For raw string-only projection, you may omit the schema—but performance and correctness on mixed datatypes improve markedly with it.

#### Minimal No-Schema Pipeline (String-Only)

```powershell
Get-Content .\tests\data\big_5_players_stats_2023_2024.csv |
  .\target\release\csv-managed.exe process -i - --columns Player --limit 3 --table
```

#### Validation Test Pattern (Recommended)

Use `cargo test` with `assert_cmd` to lock in pipeline behavior:

```rust
use assert_cmd::Command;

// `schema` and `input` are paths to the schema YAML and input CSV fixtures.
Command::cargo_bin("csv-managed")?
    .args(["process", "-i", "-", "--schema", schema, "--columns", "Player", "--limit", "3"])
    .write_stdin(std::fs::read_to_string(input)?)
    .assert()
    .success();
```

See `tests/stdin_pipeline.rs` for full chained examples.

#### Performance Note

Each stage streams rows forward; only derived expressions and the optional in-memory sort fallback allocate per row. Avoid unnecessarily wide projections early; narrowing inputs reduces the CPU and memory footprint downstream.

---

#### Encoding Normalization Pipeline

Normalize a legacy Windows‑1252 encoded file to UTF‑8 while projecting a limited set of columns.

PowerShell:

```powershell
Get-Content .\tmp\big_5_windows1252.csv |
  .\target\release\csv-managed.exe process -i - --input-encoding windows-1252 --schema .\tests\data\big_5_players_stats-schema.yml `
  --columns Player --columns Squad --limit 5 --table
```

cmd.exe:

```batch
type tmp\big_5_windows1252.csv | target\release\csv-managed.exe process -i - --input-encoding windows-1252 --schema tests\data\big_5_players_stats-schema.yml --columns Player --columns Squad --limit 5 --table
```

Bash / zsh:

```bash
cat tmp/big_5_windows1252.csv | ./target/release/csv-managed process -i - --input-encoding windows-1252 --schema tests/data/big_5_players_stats-schema.yml --columns Player --columns Squad --limit 5 --table
```

> Note: `cmd.exe` treats an unquoted `>` as a redirection operator. Either double-quote the whole filter expression, or, when it is unquoted, remove the spaces and prefix the `>` with `^` (e.g., `Performance_Gls^>=10`). In `.bat` files the unquoted form may need a doubled caret (`Performance_Gls^^>=10`) because the first parsing pass strips one `^`.

Write normalized UTF‑8 output:
```powershell
Get-Content .\tmp\big_5_windows1252.csv |
  .\target\release\csv-managed.exe process -i - --input-encoding windows-1252 --schema .\tests\data\big_5_players_stats-schema.yml `
  --columns Player --columns Squad -o .\tmp\big_5_normalized_utf8.csv --output-encoding utf-8
```

> Tip: Place encoding normalization first; downstream stages assume UTF‑8 only.

### Boolean Formatting Examples

```powershell
./target/release/csv-managed.exe process -i orders.csv -m orders-schema.yml --boolean-format one-zero -C shipped_flag -o shipped.csv
./target/release/csv-managed.exe process -i orders.csv -m orders-schema.yml --boolean-format true-false --table -C shipped_flag
```

### Table Output

`--table` renders transformation results as an elastic-width ASCII table (stdout only; cannot combine with `--output`).

### Index Strategy

Index stores byte offsets keyed by concatenated column values. A single `.idx` can hold multiple named variants, each with its own mix of ascending/descending columns. `process` picks the variant that best matches the requested `--sort` signature or you can force one via `--index-variant`. When no compatible variant exists, the command falls back to in-memory sorting.
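As an illustrative sketch of the idea (not the actual `csv-managed` internals), an index of this shape is an ordered map from concatenated key values to row byte offsets, and sorted emission becomes a sequence of seeks rather than an in-memory sort:

```rust
use std::collections::BTreeMap;

/// Build a toy byte-offset index over in-memory CSV text: each data row is
/// keyed by the concatenated values of the chosen columns. (Last row wins on
/// duplicate keys in this simplified version.)
fn build_index(csv: &str, key_cols: &[usize]) -> BTreeMap<String, usize> {
    let mut index = BTreeMap::new();
    let mut offset = 0usize;
    for (i, line) in csv.split_inclusive('\n').enumerate() {
        if i > 0 {
            // Skip the header row; key the rest by the selected columns.
            let fields: Vec<&str> = line.trim_end().split(',').collect();
            let key = key_cols
                .iter()
                .map(|&c| fields[c])
                .collect::<Vec<_>>()
                .join("\u{1f}"); // unit separator keeps concatenated keys unambiguous
            index.insert(key, offset);
        }
        offset += line.len();
    }
    index
}

/// Emit rows in key order by "seeking" to each stored byte offset.
fn rows_sorted_by_key(csv: &str, index: &BTreeMap<String, usize>) -> Vec<String> {
    index
        .values()
        .map(|&off| csv[off..].lines().next().unwrap().to_string())
        .collect()
}

fn main() {
    let csv = "Player,Gls\nZidane,7\nAlonso,3\nBenzema,12\n";
    let idx = build_index(csv, &[0]);
    for row in rows_sorted_by_key(csv, &idx) {
        println!("{row}");
    }
}
```

A real index variant additionally records each column's sort direction, which is what `process` compares against the requested `--sort` signature when choosing a variant.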

### Performance Considerations

* Indexed sort avoids loading all rows into memory.
* Early filtering cuts sort & derive workload.
* Derived expressions are evaluated per emitted row; keep them lean.
* Median requires storing column values (potential memory impact for huge numeric columns).
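The median bullet is the memory caveat worth internalizing: mean and count fold in constant space, while median must buffer the whole column before sorting. A hypothetical illustration (not the tool's implementation):

```rust
/// Mean folds in O(1) memory: only a running sum and count survive.
/// (Returns NaN for an empty iterator.)
fn streaming_mean(values: impl Iterator<Item = f64>) -> f64 {
    let (sum, n) = values.fold((0.0, 0u64), |(s, n), v| (s + v, n + 1));
    sum / n as f64
}

/// Median must retain every value before sorting: O(n) memory.
fn buffered_median(values: impl Iterator<Item = f64>) -> f64 {
    let mut buf: Vec<f64> = values.collect();
    buf.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mid = buf.len() / 2;
    if buf.len() % 2 == 1 {
        buf[mid]
    } else {
        (buf[mid - 1] + buf[mid]) / 2.0
    }
}

fn main() {
    let goals = [10.0, 3.0, 7.0, 12.0, 5.0];
    println!("mean   = {}", streaming_mean(goals.iter().copied()));
    println!("median = {}", buffered_median(goals.iter().copied()));
}
```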

### Error Handling

* Rich `anyhow` contexts (I/O, parsing, evaluation, schema, index).
* Fast failure on unknown columns, invalid expressions, header/schema mismatches.
* Invalid UTF‑8 rows error (never silently skipped).

### Logging

Set `RUST_LOG=csv_managed=debug` (or `info`) for insight into phases (index use, inference, filtering).

### Testing

```bash
cargo test
```

Integration tests cover schema inference, indexing, and `process` (filters, derives, sorting, delimiters). Additional tests are planned for joins and stats frequency scenarios.

### Contributing

1. Fork & branch (`feat/<name>`).
2. Add tests (unit + integration) for new behavior.
3. Run `cargo fmt && cargo clippy && cargo test` before PR.
4. Update README (move items from roadmap when implemented).

### License

See `LICENSE`.

### Support

Open issues for bugs, enhancements, or documentation gaps. Pull requests welcome.