dataprof 0.4.78

High-performance data profiler with ISO 8000/25012 quality metrics for CSV, JSON/JSONL, and Parquet files
# Changelog

All notable changes to DataProfiler will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.4.78] - 2025-10-23

### Improved

- **CSV Parsing Fallback Messaging**
  - Integrated fallback messages into verbosity system (0=quiet, 1=normal, 2=verbose, 3=debug)
  - Default behavior (verbosity 0-1) now suppresses fallback messages for cleaner output
  - Verbose mode (-vv) shows informational messages for debugging
  - Changed message tone from error-like "⚠️ Strict CSV parsing failed" to info-like "ℹ️ Using flexible CSV parsing"
  - Fallback behavior is intentional by design (fast strict path → robust fallback for malformed data)

## [0.4.77] - 2025-10-16

### Performance

- **Analysis Module Optimization**
  - Pre-compiled regex patterns with lazy_static (eliminated runtime compilation overhead; see the sketch after this list)
  - Single-pass numeric type checking in inference (reduced iterations)
  - IEEE 754 compliant sorting using total_cmp (safer NaN handling)
  - Consistent whitespace handling across all analysis components
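
A minimal Rust sketch of the two techniques above, assuming only the `lazy_static` and `regex` crates mentioned in this entry; the pattern and helper are illustrative, not the crate's actual definitions.

```rust
use lazy_static::lazy_static;
use regex::Regex;

lazy_static! {
    // Compiled once on first use instead of on every call.
    static ref EMAIL_RE: Regex =
        Regex::new(r"^[\w.+-]+@[\w-]+\.[\w.-]+$").expect("valid regex");
}

/// Sort f64 values without panicking on NaN, using IEEE 754 total ordering.
fn sort_values(values: &mut [f64]) {
    values.sort_by(|a, b| a.total_cmp(b));
}

fn main() {
    assert!(EMAIL_RE.is_match("user@example.com"));

    let mut v = vec![3.0, f64::NAN, 1.0];
    sort_values(&mut v);
    println!("{v:?}"); // positive NaN sorts after the finite values
}
```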

### Fixed

- IT phone regex pattern: corrected anchor and grouping syntax
- Unsafe unwrap on partial_cmp in metrics module (prevented potential panic with NaN)
- Clippy len_zero lint in column tests

### Added

- 43 unit tests for analysis module (inference: 17, patterns: 14, column: 12)
- --detailed flag implementation in info command (shows version and enabled features)

### Performance Impact

- Regex compilation overhead: eliminated
- Type inference: +15-25% estimated improvement
- Pattern detection: +20-30% estimated improvement

## [0.4.75] - 2025-10-09

### 🧹 **REFACTORING: Post-First-Month Cleanup (Phase 1 & 2)**

- **REMOVED:** Legacy and deprecated code cleanup
  - Removed unused `run_subcommand_mode()` function from `main.rs`
  - Removed deprecated `_ml_code_enabled` parameter from `batch_results.rs`
  - Removed deprecated `calculate_comprehensive_metrics_static()` method
  - Removed backward compatibility alias `calculate_overall_quality_score()`
  - Consolidated `api/simple.rs` into `api/mod.rs` (eliminated redundant wrapper)
  - **Impact:** ~200 lines of technical debt eliminated

- **IMPROVED:** Arrow Profiler now production-ready with complete feature parity
  - Fixed sample collection: `ColumnAnalyzer` now properly exposes samples for quality metrics
  - Extended Arrow type support: Added Boolean, Date32/64, Timestamp (all 4 variants), Binary/LargeBinary
  - Fixed `process_as_string_array()`: Now uses Arrow's `array_value_to_string()` instead of placeholders
  - Binary arrays displayed as hex strings (first 8 bytes) for inspection
  - **Impact:** Arrow profiler achieves complete ISO 8000/25012 quality metrics support

- **TESTING:** All test suites passing with zero regressions
  - 75/75 library tests passing
  - Arrow-specific tests verified
  - Compilation time stable with incremental builds

### 🎯 **FEATURE PARITY: Python Bindings Match Rust Core**

- **NEW:** Python bindings now support Parquet in batch processing
  - `batch_analyze_glob()` and `batch_analyze_directory()` now accept `.parquet` files
  - `PyBatchAnalyzer.add_file()` with automatic format detection (CSV/JSON/Parquet)
  - `PyBatchAnalyzer.analyze_batch()` with automatic format detection
  - Conditional compilation via `#[cfg(feature = "parquet")]` for clean builds (sketched below)
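
A minimal sketch of the feature gating mentioned above; the function is a hypothetical stand-in, and only the `#[cfg(feature = "parquet")]` mechanism is the point.

```rust
use std::path::Path;

#[cfg(feature = "parquet")]
fn analyze_any(path: &Path) -> Result<(), String> {
    // With the feature enabled, route .parquet files to the columnar code path.
    if path.extension().is_some_and(|e| e == "parquet") {
        println!("analyzing {} as Parquet", path.display());
    }
    Ok(())
}

#[cfg(not(feature = "parquet"))]
fn analyze_any(_path: &Path) -> Result<(), String> {
    // Without the feature, surface a clear runtime message instead of a build break.
    Err("Parquet support not enabled; rebuild with `--features parquet`".to_string())
}

fn main() {
    let _ = analyze_any(Path::new("data.parquet"));
}
```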

- **IMPROVED:** Batch HTML reports now display Parquet metadata
  - Extended `build_files_context()` to include Parquet metadata per file
  - New "📦 Parquet File Metadata" section in batch dashboard file details
  - Shows: row groups, compression codec, version, compressed size, compression ratio
  - Schema summary displayed in formatted code blocks

- **IMPROVED:** Enhanced Python type hints and documentation
  - Updated docstrings for batch functions to indicate Parquet support
  - `PyBatchAnalyzer` class documentation mentions format detection
  - Consistent API documentation between single and batch operations

- **FIXED:** Clippy warnings resolved
  - Removed unused imports in `src/python/types.rs`
  - Fixed wildcard-in-or-patterns warnings in batch format detection

### 🚀 **NEW: Production-Ready Parquet Support with Extended Type Coverage**

- **NEW:** Apache Parquet format support with native columnar processing
  - `analyze_parquet_with_quality()` - Direct Parquet file analysis (usage sketched after this list)
  - `analyze_parquet_with_config()` - Configurable batch size for performance tuning
  - `ParquetConfig` - Adaptive batch sizing (1KB-32KB based on file size)
  - `is_parquet_file()` - Robust format detection via magic number ("PAR1")
  - Full integration with unified `DataProfiler::auto()` API
  - Automatic format detection with two-tier approach:
    - Fast path: File extension check (`.parquet`)
    - Robust path: Magic number validation (works without extension)
  - Comprehensive ISO 8000/25012 quality metrics for Parquet data
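
A usage sketch for the entry points named above. The function names and the sample file come from this changelog, but the exact signatures, return types, and report fields are assumptions for illustration.

```rust
use std::path::Path;
use dataprof::{analyze_parquet_with_quality, is_parquet_file, DataProfiler};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = Path::new("examples/test_data/ecommerce.parquet");

    // Format detection via the "PAR1" magic number (assumed to return bool).
    if is_parquet_file(path) {
        // Direct Parquet analysis with ISO 8000/25012 quality metrics.
        let report = analyze_parquet_with_quality(path)?;
        // quality_score() returns Option<f64> per this changelog.
        println!("quality score: {:?}", report.quality_score());
    }

    // Alternatively, let the unified API detect the format automatically.
    // analyze_file() is an assumed method name on the unified profiler.
    let _auto_report = DataProfiler::auto().analyze_file(path)?;
    Ok(())
}
```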

- **IMPROVED:** Complete Arrow type coverage - **21 types supported** (from 7)
  - Integer types: `Int8`, `Int16`, `Int32`, `Int64`, `UInt8`, `UInt16`, `UInt32`, `UInt64`
  - Date/Time types: `Date32`, `Date64`, `Timestamp` (4 variants), `Duration` (4 variants)
  - Numeric types: `Float32`, `Float64`, `Decimal128`, `Decimal256`
  - Binary types: `Binary`, `LargeBinary`
  - Generic fallback: Uses Arrow `ArrayFormatter` for complex types (List, Struct, Map)
  - Type fidelity preserved: Timestamp ≠ Date ≠ Integer in type inference

- **NEW:** Parquet metadata exposure in quality reports
  - `ParquetMetadata` struct with row groups, compression, version, schema
  - Compressed size tracking and compression codec detection
  - Available in JSON exports and programmatic API

- **NEW:** Test data and examples
  - `generate_test_parquets.rs` - Script to create realistic test files
  - 3 sample Parquet files in `examples/test_data/`:
    - `simple.parquet` - Basic types demo (1.7KB)
    - `ecommerce.parquet` - Business data with Decimal/Timestamp (2.5KB)
    - `sensors.parquet` - IoT time-series with Date64 (1.9KB)

- **FIXED:** "stream did not contain valid UTF-8" error in unified API
  - `AdaptiveProfiler` now detects Parquet files before attempting text parsing
  - Report command now uses `DataProfiler::auto()` for format detection
  - Graceful error message when Parquet feature is not enabled

- **IMPROVED:** Test coverage: **12 integration tests** (from 7)
  - Extended types, binary, decimal, mixed types, custom batch size, adaptive sizing
  - All tests passing ✅

### 🔧 **REFACTOR: Config Module Technical Debt Cleanup - Issue #98**

- **FIXED:** Critical compilation error - removed broken ML config references
- **IMPROVED:** Magic numbers → 27+ documented named constants with rationale
- **NEW:** Builder pattern for fluent configuration API
  - `DataprofConfigBuilder` with 30+ chainable methods (usage sketched below)
  - Preset configurations: `ci_preset()`, `interactive_preset()`, `production_quality_preset()`
  - ISO quality profiles: `iso_quality_profile_strict()`, `iso_quality_profile_lenient()`
- **IMPROVED:** Config file loading with auto-discovery
  - Fixed TODO in `core_logic.rs` - proper config file loading implemented
  - Auto-discovery: `.dataprof.toml`, `~/.config/dataprof/config.toml`, `dataprof.toml`
  - Enhanced logging: `✓ Loaded configuration from...` with clear feedback
- **IMPROVED:** Comprehensive validation with actionable error messages
  - 20+ validation checks with `→ Fix:` and `→ Recommended:` guidance
  - New validations: chunk size, memory limits, concurrent operations, database settings
- **REFACTOR:** Consolidated overlapping config structures (no breaking changes)
  - Removed dead code from `QualityConfig`: `null_threshold`, `detect_duplicates`, `detect_mixed_types`, `check_date_formats`
  - Single source of truth: all quality control via `IsoQualityThresholds`
  - Eliminated 4 unused fields + 4 constants + 2 builder methods
  - Cleaner API: `quality_enabled` + `iso_thresholds` only
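
A sketch of the fluent configuration API described above. `DataprofConfigBuilder`, `ci_preset()`, and `iso_quality_profile_strict()` are named in this entry; the individual chained setter and the `build()` signature are assumptions.

```rust
use dataprof::DataprofConfigBuilder;

fn main() {
    // Start from the CI preset, tighten thresholds, then build and validate.
    let _config = DataprofConfigBuilder::ci_preset()
        .iso_quality_profile_strict() // strict profile, e.g. for finance/healthcare
        .chunk_size(8_192)            // hypothetical setter, shown for illustration
        .build()
        .expect("configuration should pass validation");
}
```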

### 🎉 **NEW: JSON Batch Export - Issue #95**

- **NEW:** **📊 Complete JSON Export for Batch Processing**
  - `--json <path>` flag for batch command to export structured JSON reports
  - `--format json` option for batch mode with stdout or file output
  - Comprehensive JSON structure: summary, per-file reports, errors, aggregated metrics
  - Full ISO 8000/25012 compliance with all 5 dimensions in JSON output
  - CI/CD integration ready with machine-readable quality assessment

- **IMPROVED:** **🧹 Formatters Architecture Cleanup**
  - Removed duplicate structures (JsonSummary, JsonQuality)
  - DataQualityMetrics as single source of truth for all quality metrics
  - Full separation of concerns: no redundant quality score calculations
  - Cleaner JSON output without duplicate metrics
  - Removed legacy/unused formatter functions

## [0.4.70] - 2025-10-02 - "Quality-First Pivot: ISO 8000/25012 Focus Edition"

### ⚠️ **BREAKING: ML Features and Script Generation Removed (~7200 lines)**

**Strategic Pivot**: DataProfiler now focuses **exclusively** on ISO 8000/25012 data quality assessment.

#### **Removed Features**:
- ❌ ML readiness scoring and assessment engine
- ❌ ML feature analysis and recommendations
- ❌ Python/pandas preprocessing script generation
- ❌ Code snippet generation for data preprocessing
- ❌ `dataprof ml` CLI command
- ❌ `--ml`, `--ml-score`, `--ml-code`, `--output-script` flags

#### **Removed Modules** (~5000 lines):
- `src/analysis/ml_readiness.rs` (ML scoring engine)
- `src/analysis/code_generator.rs` (script generation)
- `src/cli/commands/ml.rs` (ML CLI command)
- `src/cli/commands/script_generator.rs` (script generation CLI)
- `src/database/ml_readiness_simple.rs` (database ML support)
- ML sections from `output/display.rs`, `output/html.rs`, `output/batch_results.rs`

#### **Removed Python Bindings** (~1500 lines):
- `src/python/ml.rs` (entire ML module)
- All `PyMl*`, `PyFeature*`, `PyPreprocessing*` classes
- Functions: `ml_readiness_score()`, `analyze_csv_for_ml()`, `feature_analysis_dataframe()`
- `ml_readiness_score_with_logging()`

#### **Removed Documentation & Tests** (~700 lines):
- `docs/python/ML_FEATURES.md`
- `python/examples/ml_readiness_example.py`
- `python/examples/sklearn_integration_example.py`
- `python/tests/test_ml_readiness.py`
- ML-related tests in `tests/cli_basic_tests.rs`, `tests/database_integration.rs`

#### **API Compatibility**:
- Deprecated fields in `BatchConfig` and `BatchResult` kept with `#[deprecated]` attribute
- Functions accepting ML parameters now accept `Option<&()>` placeholders
- Existing data quality features remain **100% functional**

#### **Migration Guide**:
If you were using ML features:
1. **CLI**: Remove `--ml*` flags from commands
2. **Python**: Remove calls to `ml_readiness_score()` and related functions
3. **Focus**: Use ISO 8000/25012 data quality metrics for data assessment

**Rationale**: Simplify codebase, eliminate maintenance burden, focus on core competency (data quality).

---

### 🏗️ **Major Architecture Refactoring: Eliminated Tech Debt (~730 lines removed)**

#### **Database Connectors** (~650 lines eliminated)
- **REMOVED:** DuckDB connector (unstable, 486 lines)
- **REMOVED:** ~150 lines of duplicated code across PostgreSQL, MySQL, SQLite connectors
- **ADDED:** `database/connectors/common.rs` - Shared query building functions
  - `build_count_query()` - Unified count query generation
  - `build_batch_query()` - Unified batch query with LIMIT/OFFSET (both helpers sketched below)
- **IMPROVED:** Error handling - replaced `.unwrap_or(None)` with `.ok()`
- **IMPROVED:** All connectors now use common validation and query building
- **FIXED:** Removed circular dependency and unused imports
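
Illustrative stand-ins for the two shared helpers named above, showing the kind of query construction they centralize; these bodies are not the crate's implementation in `database/connectors/common.rs`.

```rust
fn build_count_query(table: &str) -> String {
    format!("SELECT COUNT(*) FROM {table}")
}

fn build_batch_query(table: &str, limit: usize, offset: usize) -> String {
    format!("SELECT * FROM {table} LIMIT {limit} OFFSET {offset}")
}

fn main() {
    assert_eq!(build_count_query("orders"), "SELECT COUNT(*) FROM orders");
    // Each connector pages through rows with the same LIMIT/OFFSET pattern.
    assert_eq!(
        build_batch_query("orders", 10_000, 20_000),
        "SELECT * FROM orders LIMIT 10000 OFFSET 20000"
    );
}
```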

#### **CSV Parser** (~39 lines eliminated)
- **REMOVED:** ~200 lines of duplicated initialization/processing logic
- **ADDED:** 5 reusable helper functions in `parsers/csv.rs`:
  - `initialize_columns()` - Initialize HashMap from headers
  - `process_records_to_columns()` - Convert row-oriented to column-oriented
  - `process_csv_record()` - Process single CSV record from reader
  - `analyze_columns()` - Analyze all columns and return profiles
  - `analyze_columns_fast()` - Fast analysis mode
- **REFACTORED:** All 6 CSV functions now use shared helpers:
  - `analyze_csv_robust()`
  - `analyze_csv_with_sampling()`
  - `analyze_csv()`
  - `analyze_csv_fast()`
  - `try_strict_csv_parsing()`
  - `try_strict_csv_parsing_fast()`
- **IMPROVED:** DRY principle - single source of truth for common operations

#### **StreamingProfiler God Object Refactoring** (350 → 309 lines)
- **PROBLEM:** `StreamingProfiler::analyze_file()` was a God Object (224 lines, 7 responsibilities)
- **SOLUTION:** Split into focused modules following Single Responsibility Principle
- **ADDED:** `engines/streaming/chunk_processor.rs` (153 lines)
  - Handles chunk processing and sampling logic
  - `ProcessingStats` - Track rows, chunks, bytes processed
  - Testable in isolation
- **ADDED:** `engines/streaming/report_builder.rs` (120 lines)
  - Handles report construction from processed data
  - Delegates to existing `analyze_column()` and `QualityChecker`
  - Testable in isolation
- **FIXED:** Wired existing `ProgressManager` from `output/progress.rs`
  - Before: StreamingProfiler manually created `EnhancedProgressBar` (16 duplicate lines)
  - After: Uses `manager.create_enhanced_file_progress()` (existing module)
  - Eliminated 40 lines of progress tracking duplication
- **IMPROVED:** `StreamingProfiler` now acts as clean coordinator (309 lines)
  - Delegates to: `ProgressManager`, `ChunkProcessor`, `ReportBuilder`
  - Single responsibility: orchestration
  - Much easier to maintain and extend

**Architecture Benefits:**
- **-730 total lines** of duplicated/dead code removed
- **+3 focused modules** (ChunkProcessor, ReportBuilder, common.rs)
- **Zero breaking changes** - public API unchanged
- **All tests passing** - no regressions
- **Tech debt eliminated** - no more God Objects or code duplication

### 🎯 **Refactoring: Code Deduplication & Architecture Improvements**

- **REMOVED:** `utils/sampler.rs` module (125 lines) - duplicated functionality from `core/sampling/`
- **IMPROVED:** `analyze_csv_with_sampling()` now uses modern `SamplingStrategy::adaptive()`
- **IMPROVED:** Unified output system - `analyze` command now uses `output_with_adaptive_formatter()`
- **DEPRECATED:** Legacy display functions in `output/display.rs` (use `output_with_adaptive_formatter()` instead)
- **FIXED:** All clippy warnings with `-D warnings` flag (26 issues resolved)
  - Removed unused functions and imports
  - Fixed `ptr_arg` lint (`&PathBuf` → `&Path`)
  - Derived `Default` trait instead of manual implementation
  - Fixed test warnings (unused `vec![]`, unnecessary comparisons)

**Code Quality:**
- **-125 lines** from sampler removal
- **-80 lines** from analyze command refactor
- Eliminated output code duplication across commands
- Single formatter system for all output formats (JSON, CSV, Text, Plain)

### 🧹 **BREAKING: Legacy Code Cleanup & Architecture Simplification**

- **REMOVED:** Legacy `check` subcommand (use `analyze` instead)
- **REMOVED:** Old `src/commands/` directory (replaced by `src/cli/commands/`)
- **REMOVED:** Legacy single-file CLI mode (now subcommands-only: `analyze`, `ml`, `report`, `batch`)
- **REMOVED:** `QualityReport::quality_score()` penalty-based scoring (now uses ISO 8000/25012 via `DataQualityMetrics`)
- **REMOVED:** Unused CLI files (`cli_parser.rs`, `validation.rs`, `smart_defaults.rs`, `args_v2.rs`)
- **REMOVED:** `display_profile()` function (use `display_data_quality_metrics()` instead)

- **IMPROVED:** **Simplified Architecture**
  - Unified command files: merged `*_impl.rs` into main command modules
  - Single quality scoring system based on ISO 8000/25012 standards
  - Cleaner separation: 4 subcommands + utilities
  - **~2400 lines of code removed** (reduced maintenance burden)

- **IMPROVED:** **DataQualityMetrics Integration**
  - `QualityReport::quality_score()` now returns `Option<f64>` using `DataQualityMetrics::overall_score()`
  - Weighted ISO formula (sketched after this list): Completeness (30%), Consistency (25%), Uniqueness (20%), Accuracy (15%), Timeliness (10%)
  - All CLI commands use consistent DataQualityMetrics as base analysis layer
  - ML analysis remains optional layer on top of base quality metrics
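
The weights above come straight from this entry; the function below is an illustrative sketch of the weighted combination, not the crate's `DataQualityMetrics::overall_score()` implementation, and the 0-100 scale is an assumption.

```rust
fn overall_score(
    completeness: f64,
    consistency: f64,
    uniqueness: f64,
    accuracy: f64,
    timeliness: f64,
) -> f64 {
    // Weighted ISO formula: 30/25/20/15/10.
    0.30 * completeness
        + 0.25 * consistency
        + 0.20 * uniqueness
        + 0.15 * accuracy
        + 0.10 * timeliness
}

fn main() {
    let score = overall_score(98.0, 92.0, 85.0, 90.0, 70.0);
    println!("overall quality score: {score:.1}"); // 89.9
}
```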

- **CLI USAGE:**
  ```bash
  # New subcommand-only CLI
  dataprof analyze data.csv              # Base analysis
  dataprof analyze data.csv --detailed   # Detailed metrics
  dataprof analyze data.csv --ml         # With ML layer
  dataprof ml data.csv --script prep.py  # ML focus
  dataprof report data.csv               # HTML report
  dataprof batch examples/ --parallel    # Batch processing
  ```

### 🏆 **NEW: ISO 8000/25012 Compliance & Configurable Quality Thresholds**

- **NEW:** **📊 ISO-Compliant Quality Metrics System (5 Dimensions)**
  - `IsoQualityThresholds` configuration struct for industry-specific standards
  - Three preset profiles: Default (general), Strict (finance/healthcare), Lenient (exploratory/marketing)
  - Configurable thresholds for all 5 quality dimensions
  - Full compliance with ISO 8000-8, ISO 8000-61, ISO 8000-110, ISO 25012 standards

- **NEW:** **⏰ Timeliness Metrics (ISO 8000-8)**
  - Future date detection (dates beyond the current date; see the sketch after this list)
  - Stale data ratio calculation (configurable age threshold: 2/5/10 years for strict/default/lenient)
  - Temporal ordering violations (e.g., end_date < start_date, updated_at < created_at)
  - Supports multiple date formats: YYYY-MM-DD, DD/MM/YYYY, DD-MM-YYYY, YYYY/MM/DD
  - Industry-specific freshness requirements
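
A sketch of the timeliness checks described above, assuming the `chrono` crate. Only the YYYY-MM-DD format is parsed here for brevity, and the 5-year threshold mirrors the default profile mentioned in this entry; this is not the crate's implementation.

```rust
use chrono::{Duration, NaiveDate, Utc};

/// Count future dates and the stale-data ratio for a column of date strings.
fn timeliness(values: &[&str], max_age_years: i64) -> (usize, f64) {
    let today = Utc::now().date_naive();
    let stale_cutoff = today - Duration::days(365 * max_age_years);

    let dates: Vec<NaiveDate> = values
        .iter()
        .filter_map(|v| NaiveDate::parse_from_str(v, "%Y-%m-%d").ok())
        .collect();

    let future = dates.iter().filter(|d| **d > today).count();
    let stale = dates.iter().filter(|d| **d < stale_cutoff).count();
    let stale_ratio = if dates.is_empty() {
        0.0
    } else {
        stale as f64 / dates.len() as f64
    };
    (future, stale_ratio)
}

fn main() {
    let column = ["2031-01-01", "2024-06-15", "2001-03-02"];
    let (future, stale_ratio) = timeliness(&column, 5);
    println!("future dates: {future}, stale ratio: {stale_ratio:.2}");
}
```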

- **IMPROVED:** **🔬 Unified Outlier Detection (IQR Method)**
  - Replaced the 3-sigma rule with the ISO 25012 compliant IQR (Interquartile Range) method (sketched after this list)
  - More robust: not affected by extreme outliers like 3-sigma
  - Configurable IQR multiplier: 1.5 (default), 1.0 (strict), 2.0 (lenient)
  - Configurable minimum sample size for detection
  - Deprecated legacy `check_outliers()` 3-sigma method in `QualityChecker`
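
A sketch of IQR-based outlier detection with the configurable multiplier and minimum sample size described above; the quartile interpolation is deliberately simplified and is not the crate's implementation.

```rust
fn iqr_outliers(values: &[f64], multiplier: f64) -> Vec<f64> {
    if values.len() < 4 {
        return Vec::new(); // below a minimum sample size, skip detection
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Nearest-rank quartiles: simple, adequate for a sketch.
    let q = |p: f64| sorted[((sorted.len() - 1) as f64 * p).round() as usize];
    let (q1, q3) = (q(0.25), q(0.75));
    let iqr = q3 - q1;
    let (low, high) = (q1 - multiplier * iqr, q3 + multiplier * iqr);

    values.iter().copied().filter(|v| *v < low || *v > high).collect()
}

fn main() {
    let data = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0];
    // 1.5 is the default multiplier; 1.0 (strict) and 2.0 (lenient) are the other presets.
    println!("outliers: {:?}", iqr_outliers(&data, 1.5));
}
```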

- **ADDED:** **⚙️ Configurable Quality Thresholds**
  - `max_null_percentage`: Threshold for reporting columns with excessive nulls (default: 50%)
  - `high_cardinality_threshold`: Threshold for detecting ID-like columns (default: 95%)
  - `outlier_iqr_multiplier`: IQR sensitivity for outlier detection (default: 1.5)
  - `duplicate_report_threshold`: Threshold for reporting duplicate issues (default: 5%)
  - `min_type_consistency`: Minimum acceptable type consistency percentage (default: 95%)

- **REFACTORED:** **🔧 MetricsCalculator Architecture**
  - Now instance-based instead of static methods for configuration support
  - Constructor methods: `new()`, `strict()`, `lenient()`, `with_thresholds()` (usage sketched below)
  - Backward compatibility maintained with deprecated static method
  - All quality metrics now respect configurable ISO thresholds
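
A usage sketch for the instance-based calculator. The constructor names come from this entry; the threshold field names match the configurable thresholds listed above, but the exact struct definition (public fields, `Default` impl) is an assumption.

```rust
use dataprof::{IsoQualityThresholds, MetricsCalculator};

fn main() {
    // Preset profiles.
    let _default = MetricsCalculator::new();
    let _strict = MetricsCalculator::strict();
    let _lenient = MetricsCalculator::lenient();

    // Or supply custom ISO thresholds (field names as listed above, types assumed).
    let thresholds = IsoQualityThresholds {
        max_null_percentage: 30.0,
        outlier_iqr_multiplier: 1.0,
        ..Default::default()
    };
    let _custom = MetricsCalculator::with_thresholds(thresholds);
}
```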

- **UPDATED:** **📖 Enhanced Documentation**
  - Updated `WHAT_DATAPROF_DOES.md` with ISO compliance details
  - Added comparison table for Default/Strict/Lenient thresholds
  - Documented IQR method advantages over 3-sigma
  - Added example: `examples/iso_compliance.rs` demonstrating all threshold profiles

- **ADDED:** **🧪 Comprehensive ISO Compliance Test Suite**
  - 14 comprehensive tests covering all 5 ISO dimensions
  - Tests for configurable thresholds across all profiles (default/strict/lenient)
  - Validation of IQR outlier detection vs deprecated 3-sigma
  - Timeliness metrics verification (future dates, stale data, temporal violations)
  - ISO reproducibility and audit trail tests
  - Test file: `tests/iso_compliance_test.rs`

### 🎯 **Benefits for ISO Certification**
  - ✅ Auditable threshold configuration per ISO 8000/25012
  - ✅ Industry-standard outlier detection (IQR vs deprecated 3-sigma)
  - ✅ Reproducible quality metrics with clear rationale
  - ✅ Support for regulatory compliance (finance, healthcare)
  - ✅ Clear separation: Quality Metrics (ISO) → ML Insights (domain-specific)
  - ✅ **5 dimensions tracked**: Completeness, Consistency, Uniqueness, Accuracy, Timeliness

---

### **BREAKING: Unified Database ML Implementation**

- **BREAKING CHANGE:** Database mode now uses full `MlReadinessEngine` instead of simplified `MLReadinessScore`
  - Complete feature parity with single file and batch modes
  - Access to advanced feature analysis, preprocessing suggestions, and interaction warnings
  - Structured recommendations with category, priority, and implementation effort
  - Code snippets automatically generated from ML recommendations

- **IMPROVED:** Database preprocessing script generation
  - Scripts now contain actual code from `MlRecommendation` structures
  - Framework-specific imports and variable documentation included
  - Priority-based preprocessing steps with implementation guidance

- **IMPROVED:** Database command display logic
  - Removed 100+ lines of hardcoded code snippets
  - Reuses `display_ml_score_with_code()` helper for consistency
  - Unified display logic across all modes (single file, batch, database)

- **ADDED:** Comprehensive database ML integration test
  - New `test_sqlite_ml_readiness_full_score()` validates full ML score structure
  - Verifies component scores (completeness, consistency, type_suitability, feature_quality)
  - Tests feature analysis presence and recommendation structure

### 🤖 **NEW: Enhanced ML Readiness Analysis & Feature Intelligence**

- **NEW:** **🎯 Advanced ML Feature Analysis**
  - Enhanced feature suitability scoring using actual column statistics (min/max/mean/length)
  - Precise numeric scaling assessment based on value ranges and magnitudes
  - Intelligent text feature analysis distinguishing short vs long text with character count analysis
  - Improved categorical cardinality evaluation with exact unique count assessment
  - ID column detection for data leakage prevention

- **NEW:** **⚠️ Feature Interaction Warnings System**
  - Curse of dimensionality detection (features vs samples ratio analysis)
  - Data leakage risk identification (ID-like columns with high uniqueness)
  - High cardinality feature overload warnings
  - Feature type diversity analysis (all-numeric vs all-categorical warnings)
  - Insufficient features detection for dataset size

- **NEW:** **🔗 DataQualityMetrics Integration for ML**
  - Combined ML readiness and data quality scoring with intelligent weighting
  - Quality impact quantification on ML performance
  - Enhanced penalty system for consistency issues affecting ML algorithms
  - Integrated completeness, accuracy, and uniqueness factors in ML assessment

- **NEW:** **💡 Enhanced ML Recommendations with Code Generation**
  - Priority-based recommendation system (Critical/High/Medium/Low)
  - Framework-specific code snippet generation (pandas/scikit-learn/feature-engine)
  - Implementation effort assessment (Trivial/Easy/Moderate/Significant/Complex)
  - ML-specific preprocessing pipeline suggestions with dependency ordering

- **NEW:** **🐍 Extended Python Bindings for ML Features**
  - New `PyFeatureInteractionWarning` class exposing all warning types
  - `quality_integration_score` field in `PyMlReadinessScore`
  - `feature_warnings` array with severity levels and recommendations
  - Full backward compatibility with existing ML analysis functions

### 🐍 **NEW: Python PyDataQualityMetrics Integration & Database ML Pipeline**

- **NEW:** **📊 Complete PyDataQualityMetrics Python Bindings**
  - Added comprehensive `PyDataQualityMetrics` class with all 4-dimension metrics
  - Rich HTML representation for Jupyter notebooks with interactive dashboards
  - Individual dimension summary methods (completeness, consistency, uniqueness, accuracy)
  - Overall quality score calculation with intelligent weighting
  - Dictionary export for pandas integration and analysis workflows
  - Added dedicated `calculate_data_quality_metrics()` function for standalone usage
  - Full integration with existing `PyQualityReport` for seamless compatibility

- **NEW:** **🗃️ Database ML Code Snippets & Script Generation**
  - Enhanced database command with ML code snippets support (`--ml-code`)
  - Database-specific preprocessing script generation (`--output-script`)
  - PostgreSQL, MySQL, SQLite integration with ML readiness pipeline
  - Context-aware database preprocessing recommendations with connection handling
  - Complete database ML pipeline: Analysis → Code Snippets → Script Generation
  - Real-time streaming with comprehensive DataQualityMetrics display

- **ENHANCED:** **🔧 Feature Parity Across All Interfaces**
  - Complete consistency between CLI, Python bindings, and database interfaces
  - All analysis modes now support: Quality Metrics + ML Scoring + Code Generation
  - Batch processing with enhanced DataQualityMetrics display per file
  - Database analysis with streaming progress and comprehensive quality assessment
  - Python test suite expanded with PyDataQualityMetrics verification tests
  - End-to-end verification: CSV → JSON → Database → Python → Batch modes

### 📊 **NEW: Complete DataQualityMetrics Display Integration**

- **NEW:** **🎯 Comprehensive Data Quality Metrics CLI Display**
  - Implemented complete visual display of industry-standard 4-dimension metrics
  - Beautiful CLI output with icons, colors, and assessment indicators for Completeness, Consistency, Uniqueness, Accuracy
  - Overall weighted data quality score calculation and categorization (Excellent/Good/Fair/Poor)
  - Context-aware formatting with actionable insights and recommendations

- **ENHANCED:** **🔄 Full Batch Processing Metrics Integration**
  - Extended batch processing to display comprehensive data quality metrics per file
  - Compact metrics summary in per-file analysis with all 4 dimensions
  - Aggregated quality assessment across multiple files in batch operations
  - Consistent metrics display format between single-file and batch modes

- **COMPLETED:** **🔧 End-to-End DataQualityMetrics Pipeline**
  - Fixed incomplete metrics exposure throughout the project (was "half baked")
  - JSON output: ✅ Already working (comprehensive structured metrics)
  - CLI text output: ✅ Now fully implemented with rich display
  - Batch processing: ✅ Integrated with per-file metrics summary
  - Database connectors: ✅ Using enhanced quality analysis
  - All analysis modes now expose the complete industry-standard metrics

- **COMPLETED:** **🎨 Enhanced HTML Output with DataQualityMetrics**
  - Beautiful HTML reports with comprehensive 4-dimension metrics dashboard
  - Interactive score circle with overall quality assessment and color coding
  - Metric cards with icons for Completeness, Consistency, Uniqueness, Accuracy
  - Improved UX: DataQualityMetrics first, legacy issues hidden to avoid redundancy
  - PlainFormatter updated with structured metrics summary
  - Responsive design with mobile-friendly metric grid layout

- **COMPLETED:** **🧹 Code Quality Improvements (Issue #85 Phase 4)**
  - Fixed benchmark compilation issues with proper clippy compliance
  - Added comprehensive database integration tests for DataQualityMetrics
  - Verified no critical dead code or unused imports in main codebase
  - Documented benchmark function patterns to prevent future issues
  - Legacy quality functions maintained for backward compatibility

### 🤖 **NEW: Enhanced Batch Processing with ML Pipeline Features**

- **NEW:** **🔄 Complete ML Batch Processing Integration**
  - Extended batch processing to support all single-file ML features
  - Unified ML readiness analysis across multiple files with intelligent aggregation
  - Parallel ML scoring with configurable concurrency (`--parallel`, `--max-concurrent`)
  - Cross-file recommendation analysis with pattern recognition and consolidation
  - **CLI flags:** `--ml-score`, `--ml-code`, `--output-script` now fully support batch mode

- **NEW:** **📊 Enhanced HTML Dashboard for Batch Analysis**
  - Interactive batch dashboard with comprehensive ML readiness overview
  - Per-file drill-down with detailed ML recommendations and code snippets
  - Aggregated quality metrics with distribution analysis and trend visualization
  - JavaScript-enhanced user experience with expandable file details
  - **Performance stats:** Processing speed, success rates, and artifact generation tracking

- **NEW:** **🐍 Automated Batch Script Generation**
  - Complete Python preprocessing pipeline generation from batch ML analysis
  - Aggregated recommendations with optimized common pattern detection
  - Parallel processing template with ThreadPoolExecutor and robust error handling
  - Ready-to-execute scripts with proper imports and configuration management
  - **Output:** Production-ready Python scripts for immediate ML pipeline integration

- **ENHANCED:** **🎯 Improved ML Metrics Display and Accuracy**
  - **FIXED:** ML score calculation bug (corrected percentage display from >8000% to proper 0-100% range)
  - Enhanced readiness categorization with accurate thresholds (Ready ≥80%, Good 60-80%, etc.)
  - Consistent ML score formatting across all output modes (terminal, HTML, scripts)
  - Improved aggregation algorithms for batch-level ML readiness assessment

- **ENHANCED:** **⚡ Performance Optimizations for Large Batch Operations**
  - Optimized memory usage for ML analysis across multiple files
  - Improved processing speed with intelligent parallel execution (2.5→4.4 files/sec)
  - Enhanced progress reporting with per-file and batch-level metrics
  - Smart resource management for concurrent ML scoring operations

- **IMPROVED:** **🎯 More Realistic ML Readiness Scoring Algorithm**
  - **CRITICAL FIX**: Completeness scoring now properly penalizes high per-column missing rates
  - Enhanced penalties for datasets with ≥50% missing values (0.1 score vs previous lenient calculation)
  - Progressive penalty system (sketched after this list): ≥30% (0.3), ≥20% (0.5), ≥10% (0.7), ≥5% (0.85), <5% (1.0)
  - More accurate ML readiness classifications (problematic datasets now correctly rated as "Good" vs "Ready")
  - Improved credibility of ML scoring system for production use cases
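
A sketch of the progressive penalty mapping listed above: the per-column missing-value ratio is mapped to a completeness factor. The thresholds come from this entry; the function itself is illustrative, not the crate's scoring code.

```rust
fn completeness_factor(missing_ratio: f64) -> f64 {
    match missing_ratio {
        r if r >= 0.50 => 0.1,
        r if r >= 0.30 => 0.3,
        r if r >= 0.20 => 0.5,
        r if r >= 0.10 => 0.7,
        r if r >= 0.05 => 0.85,
        _ => 1.0,
    }
}

fn main() {
    for missing in [0.02, 0.07, 0.25, 0.60] {
        println!("{:>3.0}% missing -> factor {}", missing * 100.0, completeness_factor(missing));
    }
}
```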

### 🔧 **CRITICAL FIX: Smart Auto-Recovery System - Delimiter Detection**

- **FIXED:** **🛠️ Automatic Delimiter Detection Now Fully Functional**
  - Resolved critical bug where delimiter detection was disabled by default
  - Enhanced algorithm to prefer delimiters that yield higher field counts (sketched after this list)
  - CLI now uses robust parsing by default for intelligent CSV handling
  - **Supported delimiters:** Comma (`,`), Semicolon (`;`), Pipe (`|`), Tab (`\t`)
  - **Test Results:** All delimiters now correctly detect 4 columns vs 1
  - Backward compatibility maintained for existing workflows
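
A sketch of delimiter detection that prefers the candidate yielding the most fields, as described above; the real implementation also applies confidence scoring, which is omitted here.

```rust
fn detect_delimiter(sample: &str) -> char {
    let candidates = [',', ';', '|', '\t'];
    let first_line = sample.lines().next().unwrap_or("");

    candidates
        .into_iter()
        .max_by_key(|d| first_line.split(*d).count())
        .unwrap_or(',')
}

fn main() {
    let csv = "id;name;age;city\n1;Ada;36;Turin\n";
    assert_eq!(detect_delimiter(csv), ';'); // 4 columns detected instead of 1
    println!("detected delimiter: {:?}", detect_delimiter(csv));
}
```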

### 🎯 Enhanced User Experience & Terminal Intelligence - Issue #79

- **NEW:** **🖥️ Intelligent Terminal Detection & Adaptive Output**
  - Automatic detection of terminal vs pipe/redirect contexts
  - Smart output format selection (rich interactive vs machine-readable)
  - Context-aware color and emoji support with graceful fallbacks
  - CI/CD-optimized output for seamless automation integration

- **NEW:** **📊 Enhanced Progress Indicators with Memory Intelligence**
  - Real-time memory tracking with leak detection and optimization hints
  - Comprehensive throughput metrics (MB/s, rows/s, columns/s) with smart ETAs
  - Performance-aware progress templates with adaptive update frequencies
  - Memory usage display with estimated peak consumption forecasting

- **NEW:** **🔧 Smart Auto-Recovery System**
  - Automatic delimiter detection (comma, semicolon, tab, pipe) with confidence scoring
  - Intelligent encoding detection and conversion (UTF-8, Latin-1, CP1252)
  - Multi-strategy error recovery with detailed logging and fallback mechanisms
  - Contextual recovery suggestions with success rate tracking

- **NEW:** **🚀 Real-time Performance Intelligence**
  - Advanced performance analytics with intelligent optimization recommendations
  - Memory-aware suggestions based on system resources and file characteristics
  - Adaptive algorithm selection with real-time processing hints
  - Performance bottleneck detection with actionable remediation steps

- **NEW:** **🧠 Memory-Aware Recommendations System**
  - Comprehensive memory tracking with resource lifecycle management
  - Smart streaming mode suggestions for large files and memory-constrained environments
  - Intelligent chunk size optimization based on available system resources
  - Memory leak detection with detailed allocation/deallocation reporting

- **ENHANCED:** **⚡ API Improvements & Backward Compatibility**
  - `ProgressManager::with_memory_tracking()` - Enable enhanced tracking features
  - `EnhancedProgressBar` - Advanced progress display with performance metrics
  - `PerformanceIntelligence` - Real-time system analysis and optimization guidance
  - `AutoRecoveryManager` - Configurable error recovery with strategy patterns
  - All existing APIs preserved with full backward compatibility

- **TECHNICAL:** **🔧 Code Quality & Performance**
  - Added `is-terminal` dependency for robust terminal detection
  - Comprehensive clippy fixes across all feature sets
  - 71/71 tests passing with full regression protection
  - Enhanced error handling with `Clone` trait support
  - Memory-efficient streaming profiler with intelligent hint generation

### 📈 **Impact & Results**
- **🎯 Enhanced Developer Experience**: Rich terminal interfaces with actionable insights
- **🤖 Seamless CI/CD Integration**: Auto-optimal output for scripts and automation pipelines
- **🔧 Reduced Manual Intervention**: Automatic handling of common parsing and processing issues
- **⚡ Optimized Performance**: Real-time guidance for better processing efficiency and resource utilization
- **🛡️ Professional Quality**: Comprehensive error recovery with intelligent fallback strategies

### 📚 **Documentation Overhaul**

#### Comprehensive Documentation Updates
- **UPDATED:** `README.md` - Complete rewrite focusing on ISO 8000/25012 quality assessment
  - Removed all ML readiness references
  - Updated CLI examples with new subcommand structure (`analyze`, `report`, `batch`, `database`, `benchmark`)
  - Added both HTML report GIFs (HTML.gif and HTMLbatch.gif)
  - Updated CI/CD section with batch processing support
  - Clear binary path note for Windows users (`target/release/dataprof-cli.exe`)

- **UPDATED:** `docs/WHAT_DATAPROF_DOES.md` - Complete accuracy audit
  - Fixed dimension count: 5 dimensions (was incorrectly stating 4)
  - Removed references to deleted `src/utils/quality.rs` file
  - Removed redundant section 3.2 (moved to configuration reference only)
  - All source file references verified and updated
  - Version and audit dates updated to 2025-10-02

- **UPDATED:** `docs/python/API_REFERENCE.md`
  - Removed ML recommendation and code generation sections
  - Added comprehensive `PyDataQualityMetrics` documentation
  - Updated all 5 quality dimensions with ISO standards
  - Removed `feature_analysis_dataframe()` references

- **UPDATED:** `docs/python/INTEGRATIONS.md`
  - Removed ML readiness and feature analysis functions
  - Updated scikit-learn integration to use quality-based preprocessing
  - Focus on data type and quality metrics instead of ML features

- **UPDATED:** `docs/guides/CLI_USAGE_GUIDE.md`
  - Updated command syntax to subcommand structure
  - Documented all 5 commands: analyze, report, batch, database, benchmark
  - Removed all `--ml*` flag references
  - Updated examples to match actual CLI implementation

- **FIXED:** `src/analysis/metrics.rs` module comment
  - Updated from 4 to 5 quality dimensions
  - Added Timeliness to ISO standard references

### 🎯 **Summary of v0.4.70 Release**

This release represents a **strategic pivot** from ML-focused tooling to **pure ISO 8000/25012 data quality assessment**:

- **~7200 lines removed**: Eliminated ML readiness scoring, feature analysis, and code generation
- **~730 lines cleaned**: Removed tech debt, duplicated code, and legacy modules
- **5 ISO dimensions**: Completeness, Consistency, Uniqueness, Accuracy, Timeliness
- **Complete documentation**: All docs updated, verified, and accurate
- **Modern CLI**: Clean subcommand structure with batch processing support
- **Enhanced HTML reports**: Beautiful dashboards with comprehensive quality metrics

**Migration Path**: Remove `--ml*` flags from CLI commands, use ISO quality metrics for data assessment.

---

## [0.4.61] - 2025-09-26

- **MIGRATION:** From the GNU GPL-3.0 license to MIT.

## [0.4.6] - 2025-09-26

### 🚀 CI/CD Performance Optimizations - Issue #65

- **NEW:** **Path Filters** - Skip unnecessary CI runs for documentation-only changes
- **NEW:** **Workflow Cancellation** - Auto-cancel superseded runs to save resources
- **NEW:** **Draft PR Detection** - Skip expensive workflows on draft PRs
- **NEW:** **Unified Caching Strategy** - Improved cache sharing across workflows
- **OPTIMIZED:** Merged `quick-benchmarks.yml` into main benchmarks workflow
- **REMOVED:** Duplicate security audit from CI workflow (consolidated in security-advanced)
- **IMPROVED:** Test execution consolidation to eliminate redundancy
- **RESULT:** 30-40% CI time reduction for typical development workflows

### 🤖 Enhanced ML Recommendations with Actionable Code Snippets - Issue #71

- **NEW:** 🐍 **Actionable Code Generation for ML Preprocessing**
  - Ready-to-use Python code snippets for every ML recommendation
  - Framework-specific implementations (pandas, scikit-learn)
  - Context-aware code generation based on actual data characteristics
  - Required imports automatically included with each recommendation
  - Variable substitution for column names, thresholds, and strategies

- **NEW:** 🔧 **Comprehensive Preprocessing Code Templates**
  - **Missing Values**: `df['col'].fillna(strategy)`, `SimpleImputer` patterns
  - **Categorical Encoding**: `pd.get_dummies()`, `LabelEncoder()`, `OneHotEncoder()`
  - **Feature Scaling**: `StandardScaler()`, `MinMaxScaler()`, `RobustScaler()`
  - **Date Engineering**: Extract year, month, day, weekday, quarter features
  - **Outlier Handling**: IQR-based capping, z-score filtering, `IsolationForest`
  - **Text Preprocessing**: TF-IDF vectorization, tokenization patterns
  - **Mixed Types**: Data type standardization and cleaning

- **NEW:** 🖥️ **Enhanced CLI with Script Generation**
  - `--ml-code` flag: Display actionable code snippets in terminal output
  - `--output-script <path>` flag: Generate complete preprocessing Python scripts
  - Script includes all preprocessing steps, imports, error handling, and progress indicators
  - Generated scripts are immediately executable and production-ready

- **NEW:** 📊 **Extended MlRecommendation Data Structure**
  - `code_snippet: Option<String>` - Ready-to-use Python code
  - `framework: Option<String>` - Framework used (pandas, scikit-learn, etc.)
  - `imports: Vec<String>` - Required import statements
  - `variables: HashMap<String, String>` - Variables for customization

- **NEW:** 🐍 **Enhanced Python Bindings (PyMlRecommendation)**
  - All new fields exposed to Python API with proper type annotations
  - Backward compatible with existing code
  - New properties: `code_snippet`, `framework`, `imports`, `variables`

- **NEW:** 💻 **Interactive Code Display**
  - Syntax-highlighted code display in CLI output
  - Priority-based color coding for recommendations
  - Framework and import information clearly displayed
  - Code snippets properly formatted with indentation

- **NEW:** 📝 **Complete Script Generation Engine**
  - Generates full preprocessing pipelines with proper Python structure
  - Groups recommendations by priority (Critical → High → Medium)
  - Includes data loading, preprocessing steps, and result saving
  - Error handling and progress indicators included
  - Modular design allows easy customization

- **UPDATED:** 📚 **Documentation and Examples**
  - Enhanced ML_FEATURES.md with comprehensive code snippet documentation
  - Updated API_REFERENCE.md with new PyMlRecommendation properties
  - Added `code_snippets_showcase_example()` in Python examples
  - Updated README.md with new CLI usage examples and feature highlights
  - Complete usage examples for all new functionalities

### 🔧 Python Binding Improvements

- **FIXED:** 🐍 **PyO3 Migration to IntoPyObject** - Issue #70
  - Migrated 33 deprecated `IntoPy::into_py` calls to `IntoPyObject::into_pyobject`
  - Updated Python bindings for PyO3 v0.23.0+ compatibility
  - Fixed type annotations and error handling for new Result-based API
  - Resolved all deprecation warnings in Python modules

### 🔒 Security Enhancements - Issue #41 (Medium-term tasks)

#### Enhanced Security Infrastructure

- **NEW:** 🛡️ **Advanced Security Scanning Workflow** (`.github/workflows/security-advanced.yml`)
  - Comprehensive security pipeline with multiple scanners: cargo-audit, cargo-deny, Semgrep, TruffleHog
  - Static Application Security Testing (SAST) with security-focused Clippy rules
  - Secrets and sensitive data scanning with custom pattern detection
  - Database security validation and performance impact analysis
  - SARIF reporting integration with GitHub Security tab
  - Weekly scheduled scans and manual dispatch options

#### Security Testing Integration

- **ENHANCED:** 📋 **Security Testing Documentation** (`docs/TESTING.md`)
  - Comprehensive security testing guide integrated into main testing documentation
  - SQL injection prevention testing with 350+ attack pattern coverage
  - Error sanitization tests for credential and sensitive data protection
  - Security performance impact validation and CI/CD integration
  - Security test environment setup and monitoring procedures

#### Release & Performance Improvements

- **IMPROVED:** 🚀 **Release Workflow Robustness** (`.github/workflows/release.yml`)
  - Enhanced cross-compilation support for ARM64 targets using latest cross-rs
  - Improved Windows compatibility with PYO3 environment fixes
  - CPU compatibility verification for Python wheels
  - Robust error handling with multiple fallback strategies

- **IMPROVED:** 📊 **Benchmark Workflow Reliability** (`.github/workflows/benchmarks.yml`)
  - Timeout protection for external tool comparisons
  - Graceful fallback strategies for CI environment limitations
  - Enhanced Python dependency installation with retry mechanisms
  - Performance regression analysis with comprehensive reporting

#### Developer Productivity

- **NEW:** 🛠️ **Unified Security Command** (`justfile`)
  - `just security-scan` command for comprehensive security validation
  - Combines dependency audit, policy validation, security tests, and security-focused linting
  - Integration with existing development workflow for pre-commit security checks

### 🚀 Development Environment & Developer Experience - Issue #58

#### Phase 3: Comprehensive Documentation & Guides

- **NEW:** 📚 **Complete Development Documentation** (`docs/DEVELOPMENT.md`)
  - Comprehensive development guide with quick start, architecture overview, and daily workflows
  - Multiple development environment options (native, VS Code dev containers, GitHub Codespaces)
  - Performance optimization guidelines, security best practices, and release process
  - Project statistics, code quality standards, and contribution guidelines
- **NEW:** 🧪 **Detailed Testing Guide** (`docs/TESTING.md`)
  - Multi-layered testing approach: unit, integration, CLI, database, security, and performance tests
  - Test execution strategies for development workflow, CI, and pre-release validation
  - Code coverage targets (>90% unit, >80% integration) with property-based testing examples
  - Debugging tips, custom test attributes, and advanced testing techniques (fuzzing, load testing)
- **NEW:** 🛠️ **IDE Setup Guide** (`docs/IDE_SETUP.md`)
  - Complete setup instructions for VS Code, JetBrains IDEs, Vim/Neovim, Emacs, and Helix
  - IDE comparison matrix with debugging, database tools, and container support ratings
  - Pre-configured VS Code dev containers, debugging configurations, and extension recommendations
  - Universal development setup with essential tools and environment configuration
- **NEW:** 🔧 **Troubleshooting Guide** (`docs/TROUBLESHOOTING.md`)
  - Comprehensive issue resolution for setup, build, container, database, testing, and IDE problems
  - Platform-specific solutions (Windows/WSL2, macOS, Linux) with diagnostic commands
  - Performance troubleshooting, security validation, and network connectivity solutions
  - Quick diagnostics section and emergency debugging procedures

#### Phase 4: Quality Tooling & Developer Productivity

- **NEW:** 🐛 **Enhanced VS Code Debugging** (`.vscode/dataprof.code-workspace`)
  - 10 specialized debug configurations: unit tests, CLI variations, database tests, Arrow integration
  - Engine-specific debugging (streaming, memory profiling) with targeted logging
  - Custom input prompts for flexible debugging scenarios
  - Pre-configured environment variables and debug symbols
- **NEW:** ✂️ **VS Code Code Snippets** (`.vscode/dataprof.code-snippets`)
  - 20+ DataProfiler-specific code snippets for common patterns
  - Engine implementation, column analysis, database connectors, CLI commands
  - Test patterns (unit, integration, property-based, benchmarks) with AAA structure
  - Error handling, async functions, configuration structures, and documentation templates
- **NEW:** 📦 **Advanced Dependency Management** (`justfile` + `deny.toml`)
  - 15+ new dependency management commands: health checks, security audits, license compliance
  - Smart update system with backup/restore, specific package updates, and safety verification
  - Comprehensive dependency reports with outdated, security, and unused dependency analysis
  - `cargo-deny` integration for license compliance and advanced dependency analysis
- **NEW:** 🛡️ **Dependency Security Policy** (`deny.toml`)
  - Whitelist of approved licenses (MIT, Apache-2.0, BSD variants) with exceptions handling
  - Security advisory monitoring with vulnerability denial and unmaintained crate warnings
  - Duplicate dependency detection with platform-specific skip rules
  - Registry and git source validation for supply chain security

#### Standardized Development Environment Setup (Phase 1 & 2)

- **NEW:** 🐳 **Development Containers** (`.devcontainer/`)
  - VS Code dev container configuration with full Rust development stack
  - Multi-stage Dockerfile (development/testing/production environments)
  - Pre-configured extensions: Rust Analyzer, Docker, Database tools, GitHub Copilot
  - Volume caching for cargo dependencies and target directory
  - Automated setup with one-command environment initialization
- **NEW:** 🗃️ **Database Development Services** (`docker-compose.yml`)
  - PostgreSQL 15 with pre-loaded test schemas and sample data
  - MySQL 8.0 with comprehensive data type testing tables
  - Redis for caching tests and MinIO for S3-compatible storage
  - Admin tools: pgAdmin and phpMyAdmin (optional profiles)
  - Health checks and proper initialization scripts
- **NEW:** 🛠️ **Enhanced Task Automation** (`justfile` expansion)
  - 25+ new database and development commands
  - Cross-platform setup scripts (Bash + PowerShell) with robust error handling
  - Database management: `db-setup`, `db-connect-postgres`, `db-connect-mysql`, `db-status`
  - Testing workflows: `test-postgres`, `test-mysql`, `test-all-db`
  - One-command complete setup: `setup-complete`
- **NEW:** 📁 **VS Code Workspace Configuration** (`.vscode/dataprof.code-workspace`)
  - Comprehensive workspace settings with Rust-specific optimizations
  - Debug configurations for unit tests and CLI executable
  - Task definitions for common development workflows
  - Extension recommendations and editor settings
- **NEW:** 📊 **Development Test Data** (`.devcontainer/test-data/`)
  - Sample CSV files with various data patterns and edge cases
  - Pre-loaded database tables with 8 sample records per service
  - Views and stored procedures for testing database integrations
- **ENHANCED:** 🔧 **Cross-Platform Setup Scripts**
  - Windows PowerShell script with parameter support and logging
  - Enhanced Bash script with error handling and mode selection (minimal/full/update)
  - Automatic platform detection in justfile
  - Comprehensive prerequisite checking and tool installation

**🎯 Combined Results (Phases 1-4):**

- **Setup time reduced from hours to < 5 minutes** with one-command environment initialization
- **Consistent development environment across platforms** (Windows, macOS, Linux) with dev containers
- **Comprehensive documentation suite** covering development, testing, IDE setup, and troubleshooting
- **Enhanced developer productivity** with 20+ code snippets, 10 debug configurations, and 15+ dependency commands
- **Automated quality assurance** with security audits, license compliance, and dependency health monitoring
- **Professional onboarding experience** with multi-IDE support and extensive troubleshooting guides

### 🏗️ Code Architecture & Maintainability Improvements

#### Statistical Rigor Framework & Engine Selection Testing - Issue #60

- **NEW:** 📊 **Statistical Rigor Framework** (`src/core/stats.rs`) - see the sketch after this list
  - 95% confidence intervals with t-distribution for small samples (<30)
  - IQR-based outlier detection and removal for data quality
  - Coefficient of variation measurement (target <5% for acceptable variance)
  - Regression detection using confidence interval comparison
  - Statistical significance validation for benchmark results
- **NEW:** 🎯 **Engine Selection Benchmarking** (`src/testing/engine_benchmarks.rs`)
  - Real integration with all profiling engines (Streaming, MemoryEfficient, TrueStreaming, Arrow)
  - Cross-platform memory tracking via system APIs (Windows/Linux/macOS)
  - AdaptiveProfiler accuracy testing with 85% target threshold
  - Performance vs accuracy trade-off analysis with efficiency scoring
  - Systematic engine comparison with statistical significance validation
- **NEW:** 🔬 **Comprehensive Metrics System** (`src/testing/result_collection.rs`)
  - 17 metric types: performance, quality, engine-specific, statistical
  - MetricMeasurement with confidence intervals and sample metadata
  - Performance vs accuracy analysis with automated trade-off rating
  - TradeoffRating system (Excellent/Good/Acceptable/Poor)
- **NEW:** 📈 **Statistical Benchmark Suite** (`benches/statistical_benchmark.rs`)
  - Statistical rigor testing with controlled datasets
  - Engine selection accuracy measurement and validation
  - Performance-accuracy trade-off comprehensive analysis
  - Automated quality criteria validation and reporting
- **ENHANCED:** 🚀 **GitHub Pages Dashboard** (`.github/workflows/benchmarks.yml`)
  - Professional design with statistical rigor metrics display
  - Real-time data loading from benchmark results
  - Statistical confidence indicators and engine accuracy tracking
  - Mobile-responsive design with comprehensive status grid

#### Consolidated & Modernized Benchmarking Infrastructure - Issue #59

- **NEW:** 🏗️ **Unified benchmarking system** - Consolidated fragmented benchmark files into comprehensive suite
  - Replaced `simple_benchmarks.rs`, `memory_benchmarks.rs`, `large_scale_benchmarks.rs` with `unified_benchmarks.rs`
  - Standardized dataset patterns: Basic, Mixed, Numeric, Wide, Deep, Unicode, Messy
  - Implemented dataset size categories: Micro (<1MB), Small (1-10MB), Medium (10-100MB), Large (100MB-1GB)
- **NEW:** 🎯 **Domain-specific benchmark suite** (`domain_benchmarks.rs`)
  - Transaction data patterns for financial/e-commerce testing
  - Time-series data for IoT/monitoring scenarios
  - Streaming data patterns for real-time processing validation
  - Cross-domain comparison and adaptive engine testing
- **NEW:** 📊 **Unified result collection system** (`src/testing/result_collection.rs`)
  - JSON-based result aggregation for CI/CD integration
  - Precise timing and memory collection across all benchmarks
  - GitHub Pages dashboard integration with performance regression tracking
- **NEW:** 🗂️ **Standardized dataset structure** (`tests/fixtures/standard_datasets/`)
  - Organized micro/small/medium/large/realistic dataset hierarchy
  - Realistic data patterns beyond synthetic test data
  - Comprehensive dataset generator with configurable patterns and sizes
- **NEW:** ⚙️ **Enhanced CI/CD workflow** (`.github/workflows/benchmarks.yml`)
  - Manual triggers for unified and domain-specific benchmarks
  - Automated performance dashboard updates
  - Cross-platform memory detection and regression analysis

#### Pre-commit Hooks & Code Quality - Issue #59 & Related

- **FIXED:** 🔧 **Clippy warnings** - Resolved "too many arguments" error by refactoring `add_criterion_result` function
  - Introduced `CriterionResultParams` struct to improve API ergonomics and maintainability (sketched below)
  - Updated all benchmark files (`unified_benchmarks.rs`, `domain_benchmarks.rs`) to use structured parameters
  - Fixed ownership issues without using clone operations for better performance
- **FIXED:** 🛠️ **Format string optimization** - Eliminated unnecessary `format!` calls within `writeln!` macros
  - Improved code efficiency in domain dataset generation (`tests/fixtures/domain_datasets.rs`)
  - Better adherence to Rust formatting best practices
- **IMPROVED:** **Development workflow** - Enhanced pre-commit hook reliability and code quality checks
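
The `CriterionResultParams` change above is the standard fix for clippy's `too_many_arguments`: bundle the related positional arguments into one struct. A hedged sketch of the pattern follows; the field names are invented for illustration and are not the real definition.

```rust
// Sketch of the params-struct refactor; field names are illustrative only.
struct CriterionResultParams {
    benchmark_name: String,
    dataset: String,
    rows: u64,
    mean_ms: f64,
    std_dev_ms: f64,
    throughput_mb_s: f64,
}

struct ResultCollector {
    results: Vec<CriterionResultParams>,
}

impl ResultCollector {
    // One struct parameter instead of six positional arguments keeps call
    // sites readable and silences clippy::too_many_arguments.
    fn add_criterion_result(&mut self, params: CriterionResultParams) {
        self.results.push(params);
    }
}

fn main() {
    let mut collector = ResultCollector { results: Vec::new() };
    collector.add_criterion_result(CriterionResultParams {
        benchmark_name: "unified/basic".into(),
        dataset: "small_mixed.csv".into(),
        rows: 100_000,
        mean_ms: 41.3,
        std_dev_ms: 1.2,
        throughput_mb_s: 240.0,
    });
    println!("collected {} results", collector.results.len());
}
```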

#### CI/CD Pipeline Improvements

- **MAJOR REFACTOR:** 🏗️ **Complete CI/CD workflow optimization** - Consolidated 6 workflows with composite actions
  - **Composite Actions**: Created reusable actions for Rust setup, Python deps, system deps, test execution, and benchmark running
  - **Eliminated Duplication**: Removed 24+ duplicate Rust setups across workflows with unified caching strategy
  - **Performance Gains**: Quick benchmarks <8min, parallel execution, intelligent caching with 80%+ hit rates
  - **Reliability**: Network retry logic, fallback installations, comprehensive timeout controls
  - **Maintainability**: External GitHub Pages template, unified naming, consistent error handling
- **OPTIMIZED:** 🎯 **Workflow specialization** - Clear separation of concerns across development lifecycle
  - `ci.yml`: Core testing (main/master) with parallel test matrix and security audits
  - `staging-dev.yml`: Development feedback (staging) with quick validation and integration tests
  - `quick-benchmarks.yml`: PR performance checks with micro/small datasets
  - `benchmarks.yml`: Comprehensive performance suite with external tool comparison
- **ENHANCED:** 📊 **GitHub Pages dashboard** - Modular performance tracking with external template
  - Separated 830-line embedded HTML into maintainable template system
  - Real-time performance metrics with structured JSON result collection
  - Historical trend analysis with 90-day artifact retention

#### Major Refactoring Initiative - Issue #52

- **REFACTORED:** 📁 **Main CLI structure** (`src/main.rs` → organized modules) (e4896e1)
  - Split 1,450-line main.rs into specialized modules: `cli/`, `commands/`, `output/`, `error/`
  - Improved separation of concerns for CLI argument parsing, command execution, and output formatting
  - Enhanced maintainability and code organization

- **REFACTORED:** 🐍 **Python bindings architecture** (`src/python.rs` → organized modules) (3280186)
  - Modularized 1,468-line python.rs into focused modules: `types/`, `analysis/`, `batch/`, `ml/`, `dataframe/`, `logging/`, `processor/`
  - Better organization of Python API surface and improved code discoverability
  - Preserved all existing functionality with comprehensive test coverage

- **REFACTORED:** 🛡️ **Database security utilities** (`src/database/security.rs` → organized modules) (da92c36)
  - Broke down 848-line security.rs into specialized modules: `sql_validation/`, `ssl_config/`, `credentials/`, `connection_security/`, `environment/`, `utils/`
  - Enhanced security code maintainability and module separation
  - All 32 database and security tests verified and passing

#### Development Experience Improvements

- **IMPROVED:** 🧪 **CLI test performance** - Optimized test execution from 3+ minutes to ~2.5 minutes
- **IMPROVED:** 🔧 **Database feature testing** - Comprehensive test coverage with feature flags enabled
- **VERIFIED:** **Refactoring integrity** - All existing functionality preserved through extensive testing

#### Technical Benefits

- **Maintainability**: Large files broken down for easier navigation and modification
- **Code Organization**: Clear module boundaries and responsibilities
- **Developer Productivity**: Faster compilation and better IDE support
- **Future-Proofing**: Easier to add new features within organized structure

## [0.4.53] - 2025-09-20 - "CPU Compatibility & Build System Fixes"

### 🔧 Critical Bug Fixes & Build System Improvements

#### CPU Compatibility & Multi-Architecture Support

- **FIXED:** 🚨 **Critical "Illegal instruction" errors** with Python wheels on older CPUs (f53ea50)
- **NEW:** 🏗️ **Multi-target build system** - Separate baseline and optimized wheel builds
- **NEW:** 🛡️ **Universal CPU compatibility** - PyPI wheels use conservative `target-cpu=x86-64`
- **NEW:** **Performance-optimized builds** - Available in GitHub Releases for modern CPUs
- **FIXED:** 🔧 **ARM64 target architecture support** - RUSTFLAGS now apply to all platforms (d41ba0c)
- **NEW:** 🔍 **Automated CPU instruction verification** - CI prevents AVX instructions in baseline builds

#### Build System & Development Environment

- **IMPROVED:** 📦 **Cargo.lock consistency** - Fixed line endings and version conflicts (6265ab2)
- **IMPROVED:** 🧹 **Dependency management** - Updated gitignore and cleaned dependencies (766789f)
- **FIXED:** 🛠️ **Clippy warnings** - Resolved dead_code warning in memory tracker (3ad2952)
- **IMPROVED:** 🏗️ **Conservative local builds** - `.cargo/config.toml` configured for compatibility

#### Technical Implementation Details

- **Release Workflow Enhancements**: Dual-build strategy with CPU profiling
- **Cross-Platform Support**: ARM64 macOS and Linux targets properly configured
- **Quality Assurance**: Automated objdump analysis prevents compatibility regressions
- **Documentation**: Clear wheel type distinction for users

#### User Experience Improvements

- **Zero installation failures** - Baseline wheels work on all x86-64 CPUs
- **Transparent performance** - Users can choose optimized wheels if desired
- **Developer-friendly** - Local builds use safe, compatible settings
- **CI/CD reliability** - All architectures properly handled in release pipeline

### 📋 Related Issues Resolved

- **Issue #51**: ✅ Error message sanitization implemented and verified
- **Issue #53**: ✅ Memory tracker stack trace collection implemented and verified

### 🔄 Migration & Compatibility

- **100% Backward Compatible** - No breaking changes to APIs or CLI interfaces
- **Automatic PyPI Compatibility** - Users get working wheels by default
- **Optional Performance** - Advanced users can use optimized wheels from GitHub Releases
- **Developer Workflow** - Local builds automatically use safe CPU targeting

### 📊 Performance & Quality

- **Zero Regressions** - All existing functionality preserved
- **Enhanced Reliability** - Reduced build failures and CPU compatibility issues
- **Better CI/CD** - Improved cross-platform build consistency
- **Quality Gates** - Automated verification prevents compatibility regressions

### 🚀 Files Changed Summary

- `.cargo/config.toml` - Conservative CPU targeting for local development
- `.github/workflows/release.yml` - Multi-target build system with verification
- `CHANGELOG.md` - Updated with v0.4.53 changes
- `notebooks/` - Added comprehensive demo notebooks for v0.4.5 features
- `src/core/memory_tracker.rs` - Fixed clippy warnings
- `Cargo.lock` - Version and line ending consistency fixes

## [0.4.4]

### 🎉 Major Features Added

#### Python Bindings ML/AI Enhancement - PR #49

- **NEW:** 🤖 **Complete ML Readiness Assessment System** - Comprehensive ML suitability scoring with feature analysis
- **NEW:** 📊 **ML Feature Analysis** - Automated feature type detection (numeric_ready, categorical_needs_encoding, temporal_needs_engineering, etc.)
- **NEW:** 🚫 **Blocking Issues Detection** - Critical ML workflow blockers (missing targets, all-null features, data leakage)
- **NEW:** 💡 **ML Preprocessing Recommendations** - Actionable suggestions with priority levels and implementation guidance
- **NEW:** 🐼 **Enhanced Pandas Integration** - DataFrame outputs for profiles and ML analysis
- **NEW:** 🔧 **Context Managers** - `PyBatchAnalyzer`, `PyMlAnalyzer`, `PyCsvProcessor` for resource management
- **NEW:** 📱 **Jupyter Notebook Support** - Rich HTML displays with interactive ML readiness reports
- **NEW:** 🔗 **Scikit-learn Integration** - Pipeline building examples and feature selection workflows
- **NEW:** 📝 **Python Logging Integration** - Native Python logging with configurable levels
- **NEW:** 🎯 **Type Safety** - Complete type hints with mypy compatibility and `py.typed` marker

#### Organized Python Documentation Structure

- **NEW:** 📚 **Restructured Documentation** - Organized `docs/python/` with focused guides:
  - `README.md` - Comprehensive overview and quick start guide
  - `API_REFERENCE.md` - Complete function and class reference
  - `ML_FEATURES.md` - ML workflow integration and recommendations guide
  - `INTEGRATIONS.md` - Ecosystem integrations (pandas, scikit-learn, Jupyter, Airflow, dbt)
- **Enhanced:** Main README.md with updated wiki navigation links

#### CLI Enhancement & Production Readiness - PR #48

- **NEW:** 🚀 **Production-ready CLI experience** with comprehensive testing and validation
- **NEW:** 📊 **Progress indicators** using indicatif for all long-running operations
- **NEW:** **Input validation** with helpful error messages and suggestions
- **NEW:** 🔧 **Enhanced help system** with practical examples and use cases
- **NEW:** 🎯 **Unix-standard exit codes** for proper shell integration
- **NEW:** 📋 **Comprehensive CLI testing** - 19 integration tests covering all functionality
- **NEW:** 🔒 **Security audit integration** with cargo-audit for vulnerability scanning

#### Database ML Readiness & Production Features - PR #47

- **NEW:** 🤖 **ML Readiness Assessment** - Automatic scoring system for database tables and columns
- **NEW:** 📊 **Intelligent Sampling Strategies** - Random, systematic, stratified, and temporal sampling for large datasets (>1M rows)
- **NEW:** 🔒 **Production Security** - SSL/TLS encryption, credential management, and environment variable support
- **NEW:** 🔄 **Connection Reliability** - Retry logic with exponential backoff and connection health monitoring
- **NEW:** **CI/CD Optimization** - Streamlined workflows leveraging default database features

### ⚡ Performance & Reliability Improvements

- **Enhanced:** Connection retry logic with exponential backoff for database operations
- **Enhanced:** Memory optimization for large dataset processing
- **Enhanced:** Streaming processing with configurable batch sizes
- **Optimized:** Build times through CI/CD workflow streamlining

### 🛠️ Technical Enhancements

#### CLI Core Components

- **NEW:** `src/output/progress.rs` - Progress management system with beautiful indicators
- **NEW:** `src/core/validation.rs` - Input validation with helpful suggestions
- **NEW:** `tests/cli_basic_tests.rs` - Comprehensive CLI test suite (19 tests)
- **Enhanced:** `src/main.rs` with improved UX and error handling

#### Database Capabilities

- **NEW:** `profile_database_with_ml()` function returning quality report and ML assessment
- **Enhanced:** `DatabaseConfig` with security, sampling, and retry options
- **NEW:** Environment variable support for production deployments
- **NEW:** Feature engineering recommendations and data quality warnings

#### Sampling & Analysis

- **NEW:** Multiple sampling strategies for different use cases (systematic sampling sketched after this list):
  - **Random sampling** for general analysis
  - **Systematic sampling** for evenly distributed data
  - **Stratified sampling** for maintaining category proportions
  - **Temporal sampling** for time-series data
- **NEW:** Automatic sample size optimization with confidence intervals
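
Of the strategies above, systematic sampling is the simplest to picture: keep every k-th row so roughly `target` evenly spaced rows survive. A toy Rust sketch of that idea (not the dataprof implementation):

```rust
// Illustrative systematic sampling: pick every k-th row from a slice.
fn systematic_sample<T: Clone>(rows: &[T], target: usize) -> Vec<T> {
    if target == 0 || rows.is_empty() {
        return Vec::new();
    }
    // Step size chosen so that about `target` rows survive.
    let step = (rows.len() / target).max(1);
    rows.iter().step_by(step).take(target).cloned().collect()
}

fn main() {
    let rows: Vec<u32> = (0..1_000_000).collect();
    let sample = systematic_sample(&rows, 10);
    println!("{sample:?}"); // [0, 100000, 200000, ...]
}
```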

### 🔒 Security & Production Readiness

- **NEW:** SSL/TLS encryption with certificate validation for database connections
- **NEW:** Secure credential loading from environment variables
- **NEW:** Connection string masking in logs for security (sketched below)
- **NEW:** Security validation with actionable warnings
- **VALIDATED:** Zero vulnerabilities found in security audit
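
Connection string masking typically means redacting the password segment of a URL-style DSN before it reaches any log line. A hedged sketch of that idea; the function name and DSN handling are assumptions, not dataprof's actual security module:

```rust
// postgres://user:secret@host:5432/db  ->  postgres://user:***@host:5432/db
fn mask_connection_string(dsn: &str) -> String {
    if let Some(scheme_end) = dsn.find("://") {
        let rest = &dsn[scheme_end + 3..];
        if let (Some(colon), Some(at)) = (rest.find(':'), rest.find('@')) {
            // Only mask when a password segment (user:pass@) is present.
            if colon < at {
                return format!(
                    "{}{}:***{}",
                    &dsn[..scheme_end + 3],
                    &rest[..colon],
                    &rest[at..]
                );
            }
        }
    }
    dsn.to_string()
}

fn main() {
    // Credentials loaded from the environment, never hard-coded.
    let dsn = std::env::var("DATABASE_URL")
        .unwrap_or_else(|_| "postgres://app:hunter2@db.internal:5432/analytics".into());
    println!("connecting with {}", mask_connection_string(&dsn));
}
```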

### 📊 Testing & Quality Assurance

- **NEW:** 81 new unit tests for database features
- **NEW:** 18 database integration tests covering all functionality
- **NEW:** 156 lines of comprehensive test coverage
- **ACHIEVEMENT:** All 19 CLI integration tests passing
- **MAINTAINED:** All existing tests continue to pass

### 🐛 Bug Fixes & Stability

- **FIXED:** Clippy warning for manual implementation of `.is_multiple_of()` in sampling strategies
- **FIXED:** HTML report generation with JSON format output
- **FIXED:** Output directory validation for current directory usage
- **FIXED:** Configuration file structure validation
- **FIXED:** Case-insensitive quality assessment matching
- **FIXED:** Test assertions aligned with actual CLI behavior

### 📚 Documentation & Developer Experience

- **NEW:** Comprehensive database connector guide with examples
- **NEW:** Security best practices and production deployment guide
- **NEW:** ML readiness assessment documentation
- **NEW:** Sampling strategy selection guide
- **Enhanced:** CLI help system with practical usage examples

### 🚀 New Python Features

```python
# ML readiness assessment
import dataprof

ml_score = dataprof.ml_readiness_score("data.csv")
print(f"ML Ready: {ml_score.is_ml_ready()} ({ml_score.overall_score:.1f}%)")

# Enhanced pandas integration
profiles_df = dataprof.analyze_csv_dataframe("data.csv")
features_df = dataprof.feature_analysis_dataframe("data.csv")

# Context managers for resource management
with dataprof.PyBatchAnalyzer() as batch:
    batch.add_file("file1.csv")
    batch.add_file("file2.csv")
    results = batch.get_results()

# Logging integration
dataprof.configure_logging(level="INFO")
profiles = dataprof.analyze_csv_with_logging("data.csv")
```

### 🚀 New CLI Features

```bash
# Enhanced CLI with progress indicators
dataprof data.csv --quality --html report.html --progress

# Comprehensive help with examples
dataprof --help

# Batch processing with progress feedback
dataprof /data/folder --progress --recursive

# ML readiness assessment
dataprof database.db --ml-assessment
```

### 🔄 Migration & Compatibility

- **GUARANTEED:** Zero breaking changes - all existing APIs remain compatible
- **MAINTAINED:** Full backward compatibility for CLI interface
- **EXTENDED:** Configuration options without deprecation

## [0.4.1] - 2025-09-15 - "Intelligent Engine Selection & Seamless Arrow Integration"

### 🎉 Major Features Added

#### Issue #36: Intelligent Engine Selection with Seamless Arrow Integration

- **NEW:** 🚀 **DataProfiler::auto()** - Intelligent automatic engine selection (RECOMMENDED)
- **NEW:** 🧠 **Multi-factor scoring algorithm** - Engine selection based on file size, columns, data types, memory pressure, and processing context
- **NEW:** 🔄 **Runtime Arrow detection** - No compile-time dependencies required, seamless feature detection
- **NEW:** **Transparent fallback mechanism** - Automatic engine fallback with detailed logging and recovery
- **NEW:** 📊 **Performance benchmarking tools** - Built-in engine comparison with `--benchmark` CLI option
- **NEW:** 🔧 **Engine information display** - System status and recommendations with `--engine-info`
- **NEW:** 🎯 **AdaptiveProfiler** - Advanced profiler with intelligent selection, fallback, and performance logging

### ⚡ Performance Improvements

- **10-15% average improvement** with intelligent selection vs manual engine choice
- **47x faster incremental compilation** - Fixed Windows hard linking issues (4m48s → 0.29s)
- **Zero overhead** for existing code - new features are opt-in
- **Optimal engine selection** based on real-time system analysis and file characteristics

### 🛠️ Technical Enhancements

#### Intelligent Engine Selection Algorithm

- **Multi-factor scoring system** considering:
  - **File characteristics**: Size (MB), column count, data type distribution, complexity
  - **System resources**: Available memory, CPU cores, memory pressure
  - **Processing context**: Batch analysis, quality-focused, streaming-required
- **Engine-specific optimization thresholds**:
  - **Arrow**: >100MB files, >20 columns, numeric majority, high memory available
  - **TrueStreaming**: >500MB files, high memory pressure, streaming operations
  - **MemoryEfficient**: 50-500MB files, moderate memory pressure, balanced workloads
  - **Streaming**: <50MB files, simple operations, resource-constrained environments
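
A toy picker built from the thresholds listed above shows how the scoring collapses into a decision. The real `AdaptiveProfiler` also weighs data-type distribution, memory pressure, and processing context; the cutoffs below are taken from the list, not from the code, and the names are illustrative.

```rust
#[derive(Debug)]
enum Engine {
    Arrow,
    TrueStreaming,
    MemoryEfficient,
    Streaming,
}

// Simplified selection using only file size, column count, and free memory.
fn pick_engine(file_mb: f64, columns: usize, available_memory_mb: f64) -> Engine {
    if file_mb > 500.0 || available_memory_mb < 500.0 {
        // Very large inputs or tight memory: stream the whole file.
        Engine::TrueStreaming
    } else if file_mb > 100.0 && columns > 20 && available_memory_mb > 1024.0 {
        // Wide, memory-rich workloads favour columnar Arrow processing.
        Engine::Arrow
    } else if file_mb >= 50.0 {
        Engine::MemoryEfficient
    } else {
        Engine::Streaming
    }
}

fn main() {
    println!("{:?}", pick_engine(250.0, 40, 8192.0)); // Arrow
    println!("{:?}", pick_engine(10.0, 5, 8192.0));   // Streaming
}
```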

#### Runtime Arrow Detection

- **Feature-agnostic design** - Works with or without Arrow compilation
- **Dynamic capability detection** at runtime instead of compile-time
- **Seamless integration** - Arrow automatically available when feature is enabled
- **Graceful degradation** - Intelligent fallback when Arrow is unavailable
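
The runtime-detection idea above can be pictured as a plain capability function that call sites query, instead of scattering `#[cfg(feature = "arrow")]` attributes through the codebase. A minimal sketch, assuming an `arrow` cargo feature; the function names are illustrative:

```rust
/// True when the crate was built with the `arrow` feature. Callers branch on
/// this at runtime; the value itself is fixed at build time.
fn arrow_available() -> bool {
    cfg!(feature = "arrow")
}

fn choose_engine_name() -> &'static str {
    if arrow_available() {
        "arrow"
    } else {
        // Graceful degradation: fall back to the streaming engine.
        "streaming"
    }
}

fn main() {
    println!("selected engine: {}", choose_engine_name());
}
```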

#### Transparent Fallback System

- **Automatic recovery** from engine failures with detailed logging
- **Performance monitoring** of fallback attempts and success rates
- **User-friendly messaging** explaining engine selection and fallback reasoning
- **Configurable fallback chains** with per-engine success tracking
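
A compact sketch of the fallback chain described above: try engines in preference order, log each failure, and return the first report that succeeds. Types and engine names are illustrative, not the actual dataprof API.

```rust
#[derive(Debug)]
struct Report {
    engine: &'static str,
}

// Stand-in for the real per-engine analysis; Arrow always "fails" here so
// the fallback path is exercised.
fn run_engine(engine: &'static str, path: &str) -> Result<Report, String> {
    if engine == "arrow" {
        Err(format!("arrow engine unavailable for {path}"))
    } else {
        Ok(Report { engine })
    }
}

fn analyze_with_fallback(path: &str) -> Result<Report, String> {
    let chain = ["arrow", "memory-efficient", "streaming"];
    let mut last_err = String::new();
    for engine in chain {
        match run_engine(engine, path) {
            Ok(report) => return Ok(report),
            Err(e) => {
                eprintln!("engine {engine} failed: {e}; falling back");
                last_err = e;
            }
        }
    }
    Err(last_err)
}

fn main() {
    match analyze_with_fallback("data.csv") {
        Ok(r) => println!("analysis completed with {}", r.engine),
        Err(e) => eprintln!("all engines failed: {e}"),
    }
}
```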

### 🚀 New APIs and CLI Features

#### Enhanced Rust API

```rust
use dataprof::{DataProfiler, AdaptiveProfiler, ProcessingType};

// 🚀 RECOMMENDED: Intelligent automatic selection
let profiler = DataProfiler::auto()
    .with_logging(true)
    .with_performance_logging(true);

let report = profiler.analyze_file("data.csv")?;

// Advanced adaptive profiler with context
let adaptive = AdaptiveProfiler::new()
    .with_fallback(true)
    .with_performance_logging(true);

let report = adaptive.analyze_file_with_context(
    "large_data.csv",
    ProcessingType::BatchAnalysis
)?;

// Benchmark all engines on your data
let performances = profiler.benchmark_engines("data.csv")?;
```

#### Enhanced CLI Interface

```bash
# 🚀 NEW: Intelligent automatic selection (RECOMMENDED)
dataprof --engine auto data.csv  # Default behavior

# Show system capabilities and engine recommendations
dataprof --engine-info

# Benchmark all engines and compare performance
dataprof --benchmark data.csv

# Manual engine override (legacy compatibility maintained)
dataprof --engine arrow data.csv         # Force Arrow
dataprof --engine streaming data.csv     # Force streaming
dataprof --engine memory-efficient data.csv  # Force memory-efficient
```

### 🔧 Infrastructure & Quality Improvements

#### Windows Development Optimization

- **FIXED:** Hard linking compilation warnings causing 4+ minute build times
- **ADDED:** Optimized `.cargo/config.toml` with shorter target directory path
- **RESULT:** Incremental compilation time reduced from 4m48s to 0.29s (159x improvement)

#### Code Quality & Testing

- **110 total tests** - All passing with comprehensive coverage
- **8 new adaptive engine tests** - Covering selection, fallback, benchmarking, API compatibility
- **Deterministic test execution** - Fixed flaky tests caused by real system resource variation
- **All clippy warnings resolved** - Clean code with no warnings
- **Cross-platform compatibility** - Windows, Linux, macOS validation

#### Comprehensive Documentation

- **Updated Apache Arrow integration guide** with v0.4.1 features and decision matrix
- **Performance comparison tables** with historical benchmark data
- **Migration guide** maintaining 100% backward compatibility
- **CLI usage examples** with new engine selection options
- **API reference** with intelligent selection best practices

### 🐛 Bug Fixes & Stability

- **FIXED:** Flaky engine selection tests caused by system resource variation during parallel execution
- **FIXED:** Clippy warnings: collapsible_if, unused imports, field_reassign_with_default
- **FIXED:** Type inference errors in adaptive engine tests (integer * float operations)
- **FIXED:** Memory pressure calculation inconsistencies in test environment
- **FIXED:** Windows CRLF line ending warnings in git operations

### 📊 Performance Validation & Benchmarking

#### Engine Selection Decision Matrix

| Factor               | Arrow Score          | Memory-Efficient | True Streaming | Standard          |
| -------------------- | -------------------- | ---------------- | -------------- | ----------------- |
| **File Size**        | >100MB: ✅            | 50-200MB: ✅      | >500MB: ✅      | <50MB: ✅          |
| **Column Count**     | >20 cols: ✅          | 10-50 cols: ✅    | Any: ✅         | <20 cols: ✅       |
| **Data Types**       | Numeric majority: ✅  | Mixed: ✅         | Complex: ✅     | Simple: ✅         |
| **Memory Available** | >1GB: ✅              | 500MB-1GB: ✅     | <500MB: ✅      | Any: ✅            |
| **Processing Type**  | Batch/Aggregation: ✅ | Quality Check: ✅ | Streaming: ✅   | Quick Analysis: ✅ |

#### Historical Performance Comparison

| File Size | Standard | Arrow | Memory-Eff | True Stream | Auto Selected |
| --------- | -------- | ----- | ---------- | ----------- | ------------- |
| 10MB      | 0.8s     | 1.2s  | 0.6s       | 0.9s        | Memory-Eff ✅  |
| 100MB     | 2.1s     | 0.8s  | 1.4s       | 1.8s        | Arrow ✅       |
| 500MB     | 12.3s    | 3.2s  | 8.1s       | 4.9s        | Arrow ✅       |
| 1GB       | 28.7s    | 5.9s  | 18.2s      | 9.1s        | Arrow ✅       |
| 5GB       | 156s     | 24s   | 89s        | 31s         | Arrow ✅       |

### 🔄 Migration & Compatibility

#### Zero Breaking Changes Guarantee

- **100% API backward compatibility** - All existing code continues to work unchanged
- **CLI compatibility** - All existing commands work identically with new features as opt-in
- **Python bindings compatibility** - No changes to existing Python interface
- **Configuration compatibility** - All existing configuration options preserved

#### Enhanced Capabilities (Additive Only)

```rust
// All v0.4.0 code continues to work unchanged
let profiler = DataProfiler::columnar();
let report = profiler.analyze_csv_file("data.csv")?;

// v0.4.1 adds new capabilities without breaking existing APIs
let adaptive = DataProfiler::auto();  // NEW: Intelligent selection
let report = adaptive.analyze_file("data.csv")?;  // Enhanced with fallback
```

### 🎯 Summary of Achievements

#### ✅ Acceptance Criteria Completed

1. **✅ Intelligent engine selection** - Multi-factor scoring algorithm implemented and validated
2. **✅ Runtime Arrow detection** - No compile-time dependencies, seamless feature detection
3. **✅ Transparent fallback mechanism** - Comprehensive logging and automatic recovery
4. **✅ Performance improvement** - 10-15% average improvement achieved through optimal selection
5. **✅ Zero breaking changes** - Full backward compatibility maintained and verified
6. **✅ Comprehensive documentation** - Decision matrix, migration guide, and usage examples

#### 🚀 Key Benefits Delivered

- **📈 Better Performance**: Automatic selection of optimal engine for specific data and system characteristics
- **🛡️ Enhanced Reliability**: Transparent fallback ensures analysis always completes successfully
- **🔍 Better Observability**: Built-in benchmarking, performance logging, and selection reasoning
- **⚡ Improved User Experience**: `--engine-info` and `--benchmark` commands for informed decisions
- **🚀 Future-Proof Design**: Runtime detection enables optional Arrow without compilation requirements

#### 📊 Technical Metrics

- **110 tests passing** with 0 failures across all platforms
- **47x compilation speed improvement** on Windows development
- **10-15% performance improvement** with intelligent vs manual selection
- **Zero overhead** for existing users - new features are completely opt-in
- **Deterministic engine selection** with comprehensive test coverage

The intelligent engine selection system provides seamless, automatic performance optimization while maintaining full API compatibility and delivering measurable performance improvements through data-driven engine selection.

---

### 🚀 Performance Claims Validation & Benchmarking Improvements

#### Issue #35: Fix DataProfiler CLI crash in benchmark comparison script

- **FIXED:** Resolved DataProfiler CLI crashes during benchmark comparison execution
- **IMPROVED:** Stable and reliable benchmark script execution
- **ENHANCED:** Consistent JSON output generation for CI/CD workflows

#### Issue #38: Validate and refine performance claims with comprehensive benchmarking

- **NEW:** Comprehensive benchmark matrix testing (3 file sizes × 4 data types × 3 tools = 36 combinations)
- **NEW:** Performance regression detection with automated CI validation and configurable thresholds
- **NEW:** `scripts/comprehensive_benchmark_matrix.py` - Full matrix testing suite for systematic validation
- **NEW:** `scripts/performance_regression_check.py` - Automated regression detection with baseline tracking
- **NEW:** `docs/performance-guide.md` - Complete performance guide with decision matrices and optimization tips
- **NEW:** Performance wiki documentation with user guidance for tool selection
- **REFINED:** Corrected performance claims from "13x faster than pandas" to **"20x more memory efficient than pandas"**
- **IMPROVED:** GitHub Pages dashboard with comprehensive benchmark results and trend analysis
- **ENHANCED:** Organized benchmark results in dedicated `benchmark-results/` directory structure
- **FIXED:** Linux compilation errors in memory benchmarks with proper fallback handling

### 🔧 Technical Improvements

#### Performance & Reliability

- **IMPROVED:** Enhanced GitHub Actions benchmark workflows with comprehensive CI validation
- **IMPROVED:** Updated `.gitignore` to properly handle generated benchmark results
- **IMPROVED:** Benchmark scripts now use organized file structure for cleaner repository management

#### Documentation

- **UPDATED:** README.md with accurate performance claims based on systematic benchmarking
- **ADDED:** Performance decision matrices to help users choose the right tool for their use case
- **ENHANCED:** Wiki documentation with comprehensive performance guidance

### 📊 Performance Analysis Results

Based on systematic benchmarking across different data sizes and types:

- **Speed**: ~1.0x (comparable to pandas) - competitive performance
- **Memory**: **20x more memory efficient** than pandas - significant advantage
- **Best use cases**: Large files, memory-constrained environments, production pipelines
- **Scalability**: Unlimited file size through streaming (pandas limited by RAM)

## [0.4.0] - 2025-09-14 - "Quality Assurance & Performance Validation Edition"

### 🎉 Major Features Added

#### Performance Benchmarking CI/CD - Issue #23

- **NEW:** Comprehensive performance benchmarking suite for performance validation
- **NEW:** Large-scale benchmark testing (1MB-1GB datasets) with realistic mixed data types
- **NEW:** Memory profiling and leak detection with advanced usage pattern analysis
- **NEW:** External tool comparison automation (pandas, polars) with regression detection
- **NEW:** Dedicated CI workflow for performance validation on PR and push events
- **NEW:** Historical performance tracking with trend analysis and GitHub Pages integration
- **NEW:** Automated performance regression alerts with configurable thresholds

#### Comprehensive Test Coverage Infrastructure - Issue #21

- **NEW:** Complete test infrastructure overhaul with 95%+ code coverage targets
- **NEW:** Multi-tier testing strategy: unit, integration, end-to-end, and property-based tests
- **NEW:** Cross-platform CI validation (Linux, macOS, Windows) with full feature matrix
- **NEW:** Performance regression testing and memory leak detection in CI
- **NEW:** Test data generation utilities and fixtures for consistent validation
- **NEW:** Coverage reporting with HTML output and GitHub Actions integration

#### Modular Architecture Refactoring - Issue #20

- **NEW:** Complete lib.rs modularization with clean separation of concerns
- **NEW:** Public API redesign with consistent naming conventions and error handling
- **NEW:** Engine abstraction layer supporting local, streaming, and columnar processing
- **NEW:** Feature-gated modules for optional functionality (databases, arrow, python)
- **NEW:** Enhanced documentation with rustdoc examples and API usage guides
- **NEW:** Backward compatibility maintained while improving internal architecture

### ⚡ Performance Improvements

- **Validated 10x performance claims** through automated CI benchmarking
- **Memory leak detection** preventing performance degradation over time
- **Benchmark-driven optimization** with continuous performance monitoring
- **Engine selection optimization** based on dataset characteristics and system resources

### 🛠️ Quality & Infrastructure Improvements

- **IMPROVED:** CI/CD reliability with comprehensive test matrix and error handling
- **IMPROVED:** Code organization with modular architecture and clean interfaces
- **IMPROVED:** Release workflow with automated validation and performance checks
- **IMPROVED:** Developer experience with better tooling and documentation
- **IMPROVED:** Security posture with comprehensive testing and validation

### 🔧 Technical Enhancements

#### Benchmark Suite Architecture

- **NEW:** Three-tier benchmark system: simple (validation), large-scale (performance), memory (profiling)
- **NEW:** Criterion-based statistical benchmarking with HTML report generation (skeleton sketched below)
- **NEW:** Cross-platform memory detection with Windows/Linux/macOS compatibility
- **NEW:** Dataset generation utilities for consistent and reproducible testing
- **NEW:** JSON output format for programmatic analysis and trend tracking
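
For reference, the general shape of a Criterion benchmark is sketched below; the actual dataprof suites layer dataset generation and memory tracking on top of this, and the measured function here is a stand-in. The `sample_size(10)` config mirrors the minimum sample sizing this release uses for statistical significance.

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Stand-in for the profiling work being measured.
fn parse_and_sum(data: &str) -> f64 {
    data.lines().filter_map(|l| l.parse::<f64>().ok()).sum()
}

fn bench_profiling(c: &mut Criterion) {
    let data = (0..10_000).map(|i| i.to_string()).collect::<Vec<_>>().join("\n");
    c.bench_function("parse_and_sum_10k", |b| {
        // black_box stops the compiler from optimising the measured work away.
        b.iter(|| parse_and_sum(black_box(&data)))
    });
}

criterion_group! {
    name = benches;
    // Statistical significance needs enough samples; the suite requires >= 10.
    config = Criterion::default().sample_size(10);
    targets = bench_profiling
}
criterion_main!(benches);
```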

#### Test Infrastructure Components

- **NEW:** Automated test discovery and execution across all feature combinations
- **NEW:** Property-based testing for edge case validation and robustness
- **NEW:** Integration test scenarios covering real-world usage patterns
- **NEW:** Performance baseline establishment and regression detection
- **NEW:** Test data management with controlled fixtures and generators

#### Modular Library Design

- **NEW:** Clean API surface with consistent error types and handling patterns
- **NEW:** Engine abstraction supporting multiple processing strategies
- **NEW:** Feature composition allowing selective functionality inclusion
- **NEW:** Documentation-driven development with comprehensive examples
- **NEW:** Type-safe configuration with builder patterns and validation

### 🐛 Bug Fixes & Stability

- **FIXED:** CI workflow stability issues with proper error handling and retries
- **FIXED:** Cross-platform compatibility problems in test execution
- **FIXED:** Memory profiling accuracy on Windows systems
- **FIXED:** Benchmark statistical significance with proper sample sizing (≥10)
- **FIXED:** GitHub Actions runner compatibility using standard ubuntu-latest

### 📚 Documentation & Developer Experience

- **ENHANCED:** Complete API documentation with usage examples and best practices
- **ENHANCED:** Architecture documentation explaining design decisions and trade-offs
- **ENHANCED:** Contributing guidelines with development workflow and testing requirements
- **ENHANCED:** Performance benchmarking documentation with comparison methodologies
- **ENHANCED:** CI/CD documentation explaining workflow triggers and job dependencies

### 🚀 New Development Workflows

#### Performance Validation Pipeline

```bash
# Automated on every PR to staging/main
cargo bench --bench simple_benchmarks     # Lightweight validation
cargo bench --bench memory_benchmarks     # Memory leak detection
python scripts/benchmark_comparison.py    # External tool comparison
```

#### Test Coverage Validation

```bash
# Comprehensive test execution
cargo test --all-features                 # All feature combinations
cargo test --test integration_*           # Integration scenarios
cargo tarpaulin --out Html                # Coverage reporting
```

#### Modular Development Pattern

```rust
// Clean public API with engine abstraction
use dataprof::{DataProfiler, ProfilerEngine};

let profiler = DataProfiler::builder()
    .engine(ProfilerEngine::Streaming)
    .sample_size(10000)
    .build()?;

let report = profiler.analyze_file("data.csv")?;
```

### 📈 Performance Validation Results

**Benchmark Claims Validation:**

- **10x faster than pandas** verified through automated CI testing
- **Memory efficiency** validated with leak detection and usage profiling
- **Regression protection** with continuous monitoring and CI failure thresholds
- **Cross-platform consistency** with identical performance characteristics

**Test Coverage Metrics:**

- **95%+ code coverage** across all modules and feature combinations
- **100% API coverage** with documentation examples and usage validation
- **Cross-platform testing** ensuring consistent behavior across environments
- **Performance regression detection** with statistical significance validation

### 🔄 Migration & Compatibility

**No Breaking Changes:**

- All existing APIs remain fully compatible
- CLI interface unchanged with new functionality opt-in
- Python bindings maintain backward compatibility
- Configuration options extended without deprecation

**New Capabilities:**

- Enhanced performance monitoring and validation
- Comprehensive test infrastructure for contributors
- Modular architecture supporting future extensions
- Automated quality assurance in development workflow

---

## [0.3.6] - 2025-09-11 - "Apache Arrow Integration Edition"

### 🎉 Major Features Added

#### Apache Arrow Columnar Processing

- **NEW:** Apache Arrow integration for columnar data processing with 13x performance boost
- **NEW:** `ArrowProfiler` engine with SIMD acceleration for numeric operations
- **NEW:** Automatic engine selection (Arrow for files >500MB, streaming for smaller)
- **NEW:** Zero-copy operations and memory-efficient batch processing
- **NEW:** Support for all Arrow native types (Float64/32, Int64/32, Utf8, Date, etc.)
- **NEW:** Configurable batch sizes and memory limits for optimal performance

#### Enhanced Public API

- **NEW:** `DataProfiler::columnar()` method for explicit Arrow profiler access
- **NEW:** Transparent engine selection in Python bindings (`engine="arrow"` parameter)
- **NEW:** Memory monitoring with configurable limits and batch size optimization
- **NEW:** Progress tracking for large batch operations

#### Community & Development

- **NEW:** Code of Conduct added for community guidelines and inclusive development
- **NEW:** Streamlined release workflow with human-readable automation
- **NEW:** Enhanced CI/CD with proper matrix strategy for cross-platform builds

### ⚡ Performance Improvements

- **13x faster** processing for large datasets using Apache Arrow columnar format
- **Memory efficient** batch processing with configurable memory limits (default: 512MB)
- **SIMD acceleration** for numeric statistical calculations
- **Automatic optimization** based on file size and system capabilities

### 🛠️ Improvements

- **IMPROVED:** Maturin build process with proper Python interpreter detection
- **IMPROVED:** Database connector stability with SQLite `:memory:` support
- **IMPROVED:** Error handling in streaming profiler operations
- **IMPROVED:** Security audit resolution with dependency updates

### 🐛 Bug Fixes

- **FIXED:** Streaming profiler test failures in multi-threaded scenarios
- **FIXED:** SQLite in-memory database connection handling
- **FIXED:** Compilation errors in database feature combinations
- **FIXED:** GitHub Actions workflow matrix strategy syntax
- **FIXED:** Duplicate `thiserror` dependency entries in Cargo.lock

### 📚 Technical Details

- Arrow profiler processes data in configurable batches (default: 8,192 rows)
- Automatic type inference with Arrow schema detection
- Memory usage optimization with batch size scaling based on available RAM
- Feature-gated compilation to avoid unnecessary dependencies
- Full backward compatibility with existing APIs

### 🚀 New Usage Patterns

#### Rust API

```rust
use dataprof::DataProfiler;

// Explicit Arrow profiler
let profiler = DataProfiler::columnar()
    .batch_size(16384)
    .memory_limit_mb(1024);

let report = profiler.analyze_csv_file("large_data.csv")?;
```

#### Python API

```python
import dataprof

# Automatic Arrow selection for large files
profiles = dataprof.analyze_csv_file("huge_dataset.csv")

# Explicit Arrow engine
report = dataprof.analyze_csv_with_quality("data.csv", engine="arrow")
```

#### CLI Usage

```bash
# Arrow engine automatically selected for large files
dataprof large_dataset.csv

# Force Arrow profiler
dataprof --engine arrow data.csv
```

---

## [0.3.5] - 2025-09-08 - "Database Connectors & Memory Safety Edition"

### 🎉 Major Features Added

#### Database Connectors System

- **NEW:** Direct database profiling support for PostgreSQL, MySQL, SQLite, and DuckDB
- **NEW:** Async database connection handling with tokio runtime
- **NEW:** Feature-gated database dependencies to avoid conflicts
- **NEW:** Native SQL query execution for data analysis without exports
- **NEW:** Production-ready database feature combinations

#### Memory Safety & Performance

- **NEW:** Comprehensive memory leak detection system with `MemoryTracker`
- **NEW:** RAII patterns for automatic resource cleanup
- **NEW:** Memory-mapped file tracking for large CSV processing
- **NEW:** Reduced memory allocations by 28% (68→49 clone() calls optimized)
- **NEW:** Public memory monitoring APIs: `check_memory_leaks()`, `get_memory_usage_stats()`

#### Enhanced Testing & Quality

- **NEW:** 5 comprehensive memory leak test scenarios
- **NEW:** Real-world testing with files up to 260KB and 5000+ records
- **NEW:** Error condition memory safety validation
- **NEW:** Miri integration for undefined behavior detection

### 🛠️ Improvements

- **IMPROVED:** CI/CD workflows streamlined (319→100 lines, 75% faster)
- **IMPROVED:** Conventional release automation added
- **IMPROVED:** Safe error handling (eliminated unsafe unwrap() calls)
- **IMPROVED:** Build optimization with dependency cleanup

### 🐛 Bug Fixes

- **FIXED:** Unsafe `unwrap()` calls replaced with proper error handling
- **FIXED:** Memory tracker timestamp fallback for system time errors
- **FIXED:** Database feature conflict resolution

### 📚 Technical Details

- Memory leak detection uses size-based (configurable MB threshold) + age-based (60s) criteria
- Database connectors support all major production databases with feature flags
- RAII `TrackedResource` wrapper ensures automatic cleanup
- All tests pass including memory safety validation
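
The RAII idea behind `TrackedResource` can be sketched with a toy tracker: the wrapper registers an allocation when it is created, and its `Drop` impl deregisters it, so anything still registered at check time is a leak candidate. Everything below is illustrative, not dataprof's `MemoryTracker`.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Global toy tracker mapping resource names to byte counts.
fn tracker() -> &'static Mutex<HashMap<String, usize>> {
    static TRACKER: OnceLock<Mutex<HashMap<String, usize>>> = OnceLock::new();
    TRACKER.get_or_init(|| Mutex::new(HashMap::new()))
}

struct TrackedResource {
    name: String,
}

impl TrackedResource {
    fn new(name: &str, bytes: usize) -> Self {
        tracker().lock().unwrap().insert(name.to_string(), bytes);
        Self { name: name.to_string() }
    }
}

impl Drop for TrackedResource {
    fn drop(&mut self) {
        // Automatic cleanup: the entry disappears even on early returns or panics.
        tracker().lock().unwrap().remove(&self.name);
    }
}

fn main() {
    {
        let _buf = TrackedResource::new("csv_mmap", 64 * 1024 * 1024);
        // ... profiling work using the tracked buffer ...
    } // `_buf` dropped here; its tracking entry is removed automatically

    let leaked: Vec<String> = tracker().lock().unwrap().keys().cloned().collect();
    println!("outstanding allocations: {leaked:?}"); // []
}
```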

## [0.3.0] - 2025-01-06 - "Streaming Edition"

### 🎉 Major Features Added

#### Python Bindings & Library Integration

- **NEW:** Complete Python bindings using PyO3 for `pip install dataprof`
- **NEW:** Full API coverage with Python classes: `ColumnProfile`, `QualityReport`, `BatchResult`
- **NEW:** Comprehensive Python documentation in `PYTHON.md` with integration examples
- **NEW:** Multi-platform wheel distribution via GitHub Actions (Linux, Windows, macOS)
- **NEW:** PyPI publishing pipeline with automated releases

#### High-Performance Batch Processing

- **NEW:** Batch processing system for directories and glob patterns
- **NEW:** Parallel file processing with configurable concurrency
- **NEW:** Multi-threaded execution using Rayon for maximum performance
- **NEW:** Progress tracking and comprehensive batch statistics
- **NEW:** Smart file filtering with extension and exclusion pattern support

#### Advanced Streaming Architecture

- **NEW:** Memory-efficient streaming engine for large datasets (GB+ files)
- **NEW:** Three streaming strategies: MemoryMapped, TrueStreaming, SimpleColumnar
- **NEW:** Adaptive chunk size optimization based on available memory
- **NEW:** Progress bars for long-running operations
- **NEW:** SIMD acceleration for numeric operations

#### Intelligent Sampling System

- **NEW:** Reservoir sampling algorithm for consistent results (sketched below)
- **NEW:** Multiple sampling strategies: Random, Systematic, Progressive, Adaptive
- **NEW:** Deterministic sampling with configurable seeds
- **NEW:** Weighted sampling support for biased datasets
- **NEW:** Dynamic sampling ratio based on dataset characteristics
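
Reservoir sampling (Algorithm R) with a fixed seed is the core of the deterministic, consistent sampling described above. A dependency-free sketch follows, with a tiny xorshift generator standing in for the configurable-seed RNG; the real sampler is more elaborate.

```rust
// Minimal deterministic PRNG so the example needs no external crates.
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        Self { state: seed.max(1) }
    }
    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}

/// Keep a uniform sample of `k` items from a stream of unknown length.
fn reservoir_sample<I: Iterator<Item = String>>(stream: I, k: usize, seed: u64) -> Vec<String> {
    let mut rng = XorShift64::new(seed);
    let mut reservoir: Vec<String> = Vec::with_capacity(k);
    for (i, item) in stream.enumerate() {
        if i < k {
            reservoir.push(item);
        } else {
            // Replace an existing slot with probability k / (i + 1).
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            if j < k {
                reservoir[j] = item;
            }
        }
    }
    reservoir
}

fn main() {
    let rows = (0..1_000_000).map(|i| format!("row-{i}"));
    let sample = reservoir_sample(rows, 5, 42);
    println!("{sample:?}"); // same 5 rows on every run with seed 42
}
```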

### ⚡ Performance Improvements

- **10-100x faster** than pandas for basic profiling operations
- **Zero-copy parsing** where possible to minimize memory allocation
- **SIMD vectorization** for statistical calculations on numeric data
- **Memory-mapped I/O** for efficient large file processing
- **Parallel batch processing** utilizing all available CPU cores
- **Streaming architecture** handles datasets larger than available RAM

### 🔧 Technical Enhancements

#### Robust Data Processing

- **Enhanced:** CSV parsing with flexible delimiter detection
- **Enhanced:** Support for malformed CSV files with varying field counts
- **NEW:** JSON and JSONL file analysis capabilities
- **NEW:** Advanced data type inference with pattern recognition
- **NEW:** Email, phone number, and URL pattern detection

#### Quality Assessment System

- **Enhanced:** Comprehensive quality scoring (0-100) with severity weighting (see the sketch after this list)
- **NEW:** Mixed data type detection in columns
- **NEW:** Mixed date format detection with format cataloging
- **NEW:** Statistical outlier detection with configurable thresholds
- **NEW:** Quality issue categorization by severity (High/Medium/Low)
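
A toy version of severity-weighted scoring makes the idea concrete: each detected issue subtracts a penalty scaled by its severity, and the result is clamped to 0-100. The weights below are invented for illustration and are not dataprof's actual values.

```rust
enum Severity {
    High,
    Medium,
    Low,
}

// Illustrative penalties: severe issues cost far more than minor ones.
fn penalty(severity: &Severity) -> f64 {
    match severity {
        Severity::High => 15.0,
        Severity::Medium => 5.0,
        Severity::Low => 1.0,
    }
}

fn quality_score(issues: &[Severity]) -> f64 {
    let total: f64 = issues.iter().map(penalty).sum();
    (100.0 - total).clamp(0.0, 100.0)
}

fn main() {
    let issues = [Severity::High, Severity::Medium, Severity::Low, Severity::Low];
    println!("quality score: {:.1}", quality_score(&issues)); // 78.0
}
```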

#### Error Handling & Diagnostics

- **NEW:** Comprehensive error categorization and recovery strategies
- **NEW:** Detailed error suggestions for common issues
- **NEW:** CSV diagnostics with automatic delimiter and encoding detection
- **NEW:** Graceful handling of corrupted or incomplete files

### 📚 Documentation & Integration

#### Library-First Architecture

- **BREAKING:** Restructured as library-first with CLI as secondary interface
- **NEW:** Complete Rust API documentation with usage examples
- **NEW:** Integration guides for Airflow, dbt, and Jupyter notebooks
- **NEW:** Professional project documentation structure

#### Developer Experience

- **NEW:** Comprehensive test suite with 64+ tests across all modules
- **NEW:** Integration tests for real-world scenarios
- **NEW:** Performance benchmarks and regression testing
- **NEW:** GitHub Actions CI/CD for Rust and Python builds
- **NEW:** Pre-commit hooks for code quality assurance

### 🐛 Bug Fixes

- **Fixed:** Memory leaks in large file processing
- **Fixed:** Inconsistent sampling results across runs
- **Fixed:** CSV parsing edge cases with quoted fields
- **Fixed:** Progress bar accuracy for streaming operations
- **Fixed:** Cross-platform compatibility issues on Windows

### 📦 Dependencies & Build System

- **Updated:** Rust minimum version to 1.70+
- **Added:** PyO3 0.22 for Python bindings
- **Added:** Rayon for parallel processing
- **Added:** Maturin for Python package building
- **Added:** SIMD intrinsics for performance optimization
- **Removed:** Dependency on Polars (replaced with custom streaming engine)

### 🚀 API Changes

#### New Python API

```python
import dataprof

# Single file analysis
profiles = dataprof.analyze_csv_file("data.csv")
report = dataprof.analyze_csv_with_quality("data.csv")

# Batch processing
result = dataprof.batch_analyze_directory("/data", recursive=True)
result = dataprof.batch_analyze_glob("/data/**/*.csv")
```

#### Enhanced Rust API

```rust
use dataprof::*;

// Streaming analysis
let report = analyze_csv_robust("large_file.csv")?;

// Batch processing
let processor = BatchProcessor::new();
let result = processor.process_directory(path)?;
```

#### New CLI Features

```bash
# Streaming mode for large files
dataprof --streaming --progress large_dataset.csv

# Batch processing
dataprof --recursive /data/warehouse/
dataprof --glob "**/*.csv" --parallel

# Advanced sampling
dataprof --sample 10000 huge_dataset.csv
```

### 📈 Performance Benchmarks

| Dataset Size | DataProfiler v0.3 | pandas.info() | Speedup |
| ------------ | ----------------- | ------------- | ------- |
| 1MB CSV      | 12ms              | 150ms         | 12.5x   |
| 10MB CSV     | 85ms              | 800ms         | 9.4x    |
| 100MB CSV    | 650ms             | 6.2s          | 9.5x    |
| 1GB CSV      | 4.2s              | 45s           | 10.7x   |

### 🔄 Migration Guide from v0.1

#### CLI Changes

- **No breaking changes** - all existing CLI commands work identically
- **New features** are opt-in with new flags (`--streaming`, `--parallel`, `--glob`)

#### Library Usage

- **New:** Import `dataprof` crate instead of building from source
- **Enhanced:** More comprehensive API with streaming and batch capabilities
- **Compatible:** Existing `analyze_csv()` function unchanged

### 🎯 Integration Examples

#### Airflow DAG Quality Gate

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
import dataprof

def quality_gate(**context):
    report = dataprof.analyze_csv_with_quality(context['params']['file'])
    if report.quality_score() < 80:
        raise ValueError(f"Quality too low: {report.quality_score()}")
```

#### Jupyter Data Exploration

```python
import dataprof
import matplotlib.pyplot as plt

report = dataprof.analyze_csv_with_quality("dataset.csv")
print(f"Quality Score: {report.quality_score():.1f}%")

# Visualize null percentages
null_data = [(p.name, p.null_percentage) for p in report.column_profiles]
columns, percentages = zip(*null_data)
plt.bar(columns, percentages)
plt.title('Data Completeness by Column')
```

### 📋 File Changes Summary

- **79 files changed** in this release
- **25+ new modules** added for streaming and batch processing
- **6 new workflows** for CI/CD automation
- **3 comprehensive documentation** files added
- **Complete test suite** with integration and performance tests

---

## [0.1.0] - 2024-12-15 - "Initial Release"

### 🎉 Features

- Basic CSV data profiling and quality analysis
- CLI tool with colored terminal output
- HTML report generation
- Smart sampling for large datasets
- Pattern detection for emails and phone numbers
- Quality issue detection (nulls, duplicates, outliers)

### 📦 Core Components

- Rust-based CLI tool using Clap
- Polars integration for data processing
- Terminal styling with colored output
- Basic error handling and reporting

---

**Legend:**

- 🎉 **Major Features** - New functionality
- ⚡ **Performance** - Speed improvements
- 🔧 **Technical** - Architecture changes
- 📚 **Documentation** - Docs and guides
- 🐛 **Bug Fixes** - Issues resolved
- 📦 **Dependencies** - Library updates
- 🚀 **API Changes** - Interface modifications
- 📈 **Benchmarks** - Performance data
- 🔄 **Migration** - Upgrade guidance
- 🎯 **Examples** - Usage demonstrations