# Testing Guide for SAS7BDAT Reader

This guide explains how to run tests and benchmarks for the SAS7BDAT reader library.

## Test Suite Overview

The test suite consists of three main components:

1. **Integration Tests** - Verify the reader works with all test files
2. **Regression Tests** - Ensure specific data values remain correct
3. **Benchmarks** - Track performance over time to catch regressions

## Quick Start

```bash
# Run all tests
cargo test

# Run all benchmarks
cargo bench

# Run specific test suite
cargo test --test integration_tests
cargo test --test regression_tests

# Run specific benchmark
cargo bench --bench read_benchmarks
cargo bench --bench schema_benchmarks
```

## Test Organization

```
tests/
├── common/
│   └── mod.rs              # Shared test utilities
├── integration_tests.rs     # Tests for all 178 test files
└── regression_tests.rs      # Tests for specific data values

benches/
├── read_benchmarks.rs       # Performance benchmarks for reading
└── schema_benchmarks.rs     # Performance benchmarks for schema inference
```
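
The shared utilities module is where helpers such as `test_data_path` (used in the examples later in this guide) live. A minimal sketch of what `tests/common/mod.rs` might contain, assuming the helper simply resolves names against `tests/sas/data/`:

```rust
// tests/common/mod.rs -- illustrative sketch, not necessarily the exact contents.
use std::path::PathBuf;

/// Resolve a test file name to its location under tests/sas/data/.
pub fn test_data_path(name: &str) -> PathBuf {
    PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("tests")
        .join("sas")
        .join("data")
        .join(name)
}
```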

## Running Tests

### All Tests

```bash
# Run all tests (unit + integration + regression)
cargo test

# Run with output visible (doesn't capture println!)
cargo test -- --nocapture

# Run tests in parallel with specific thread count
cargo test -- --test-threads=4
```

### Integration Tests

Tests that verify the reader works with all test files:

```bash
# Run all integration tests
cargo test --test integration_tests

# Run specific integration test
cargo test --test integration_tests test_all_files_can_be_opened
cargo test --test integration_tests test_compressed_files
cargo test --test integration_tests test_batch_reading_matches_full_read

# See which tests are available
cargo test --test integration_tests -- --list
```

**Integration tests include:**
- `test_all_files_can_be_opened` - Verify all 178 files can be opened (sketched after this list)
- `test_all_files_can_read_metadata` - Check metadata access
- `test_all_files_can_read_data` - Read all files completely
- `test_compressed_files` - Specifically test compressed files
- `test_batch_reading_matches_full_read` - Verify batch consistency
- `test_numeric_files` - Test numeric-only files
- `test_parallel_reading` - Test multi-threaded reading
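
As a concrete illustration, a test like `test_all_files_can_be_opened` can simply walk the data directory and assert that every file opens. The sketch below is not the actual test; the `Sas7bdatReader` import path is an assumption, and the open API is taken from the "Writing Good Tests" example later in this guide.

```rust
// Illustrative sketch only -- the real test may differ.
// The import path is an assumption; adjust it to the crate's actual layout.
use polars_readstat_rs::Sas7bdatReader;
use std::fs;

#[test]
fn test_all_files_can_be_opened() {
    let data_dir = concat!(env!("CARGO_MANIFEST_DIR"), "/tests/sas/data");
    for entry in fs::read_dir(data_dir).unwrap() {
        let path = entry.unwrap().path();
        if path.extension().map_or(false, |ext| ext == "sas7bdat") {
            // Every test file should open without error.
            assert!(
                Sas7bdatReader::open(&path).is_ok(),
                "failed to open {}",
                path.display()
            );
        }
    }
}
```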

### Regression Tests

Tests that verify specific data values to catch regressions:

```bash
# Run all regression tests
cargo test --test regression_tests

# Run specific regression test
cargo test --test regression_tests test_test1_data_values
cargo test --test regression_tests test_known_file_dimensions

# See available regression tests
cargo test --test regression_tests -- --list
```

**Regression tests include:**
- `test_test1_data_values` - Verify specific values from test1.sas7bdat
- `test_known_file_dimensions` - Check row/column counts (sketched after this list)
- `test_numeric_data_integrity` - Verify numeric data is valid
- `test_compressed_data_validity` - Ensure decompression works
- `test_batch_consistency` - Batch reading produces same results
- `test_schema_inference_consistency` - Schema inference doesn't break data
- `test_edge_cases` - Handle edge cases without panics
- `test_metadata_accuracy` - Metadata matches actual data
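
For example, a dimension check like `test_known_file_dimensions` pins the known shape of a file so that silent parsing regressions are caught. The sketch below reuses the 10x4 shape of `test1.sas7bdat` from the "Writing Good Tests" example later in this guide; the import path is an assumption.

```rust
// Illustrative sketch only; the import path is an assumption.
use polars_readstat_rs::Sas7bdatReader;

mod common;

#[test]
fn test_known_file_dimensions() {
    let path = common::test_data_path("test1.sas7bdat");
    let reader = Sas7bdatReader::open(&path).unwrap();
    let df = reader.read_all().unwrap();
    // Pin the exact shape so a parsing regression shows up immediately.
    assert_eq!((df.height(), df.width()), (10, 4));
}
```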

### Unit Tests

Unit tests are embedded in source files:

```bash
# Run unit tests only
cargo test --lib

# Run tests for specific module
cargo test reader::
cargo test arrow_output::
```
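
Inside a source module, unit tests sit in a `#[cfg(test)] mod tests` block next to the code they exercise. A generic sketch (the function and values here are purely illustrative, not the crate's real internals):

```rust
// Somewhere in src/reader.rs (names are illustrative only).
fn page_count(file_size: u64, page_size: u64) -> u64 {
    file_size / page_size
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn page_count_rounds_down() {
        // 10_000 / 4_096 truncates to 2.
        assert_eq!(page_count(10_000, 4_096), 2);
    }
}
```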

## Running Benchmarks

### Prerequisites

Benchmarks use [Criterion.rs](https://github.com/bheisler/criterion.rs) for statistical analysis. Criterion is pulled in as a dev-dependency of the crate, so there is nothing extra to install: `cargo bench` builds and runs the bench targets directly.

### All Benchmarks

```bash
# Run all benchmarks
cargo bench

# This runs both read_benchmarks and schema_benchmarks
```

### Read Performance Benchmarks

```bash
# Run only read benchmarks
cargo bench --bench read_benchmarks

# Run specific benchmark group
cargo bench --bench read_benchmarks batch_reading
cargo bench --bench read_benchmarks parallel_reading
```

**Available benchmarks:**
- `open_test1` - File opening overhead (sketched after this list)
- `read_small_file` - Reading small files (test1)
- `read_large_file` - Reading large files (1M rows)
- `read_compressed` - Decompression performance
- `batch_reading` - Different batch sizes
- `parallel_reading` - Different thread counts
- `metadata_access` - Metadata retrieval speed
- `multiple_files` - Opening/reading multiple files
- `file_types` - Compare numeric vs compressed files
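
A Criterion benchmark is an ordinary Rust file under `benches/`, registered in `Cargo.toml` as a `[[bench]]` target with `harness = false`. The sketch below shows roughly what the `open_test1` benchmark might look like; the reader import path is an assumption.

```rust
// benches/read_benchmarks.rs -- illustrative sketch of a single Criterion benchmark.
// The reader import path is an assumption; adjust it to the crate's actual layout.
use criterion::{criterion_group, criterion_main, Criterion};
use polars_readstat_rs::Sas7bdatReader;
use std::path::Path;

fn bench_open_test1(c: &mut Criterion) {
    let path = Path::new(concat!(env!("CARGO_MANIFEST_DIR"), "/tests/sas/data/test1.sas7bdat"));
    // Measure only the cost of opening the file (header and metadata parsing).
    c.bench_function("open_test1", |b| b.iter(|| Sas7bdatReader::open(path).unwrap()));
}

criterion_group!(benches, bench_open_test1);
criterion_main!(benches);
```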

### Schema Inference Benchmarks

```bash
# Run only schema benchmarks
cargo bench --bench schema_benchmarks

# Run specific schema benchmark
cargo bench --bench schema_benchmarks default_vs_inferred
```

**Available benchmarks:**
- `schema_inference` - Time to infer schema
- `default_vs_inferred` - Compare default vs inferred read
- `schema_inference_only` - Just schema inference time
- `schema_cast_overhead` - Type casting overhead
- `arrow_conversion` - Arrow conversion performance
- `streaming_with_schema` - Streaming with schema
- `batch_size_schema` - Batch size impact with schema
- `memory_schema` - Large file memory impact

### Benchmark Comparison

Save a baseline and compare after changes:

```bash
# Save current performance as baseline
cargo bench -- --save-baseline before

# Make your changes...

# Compare against baseline
cargo bench -- --baseline before

# View detailed comparison
open target/criterion/report/index.html
```

### Benchmark Output

Criterion generates detailed reports in `target/criterion/`:
- HTML reports with graphs
- Statistical analysis
- Comparison with previous runs

View the reports:
```bash
# Open HTML report in browser
open target/criterion/report/index.html

# Or on Linux
xdg-open target/criterion/report/index.html
```

## Advanced Testing

### Faster Test Running with cargo-nextest

Install cargo-nextest for faster test execution:

```bash
cargo install cargo-nextest

# Run tests with nextest (often noticeably faster on large suites)
cargo nextest run

# Run specific test suite
cargo nextest run --test integration_tests
```

### Auto-run Tests on File Changes

Install cargo-watch:

```bash
cargo install cargo-watch

# Auto-run tests when files change
cargo watch -x test

# Auto-run specific test
cargo watch -x "test --test integration_tests"

# Auto-run benchmarks
cargo watch -x bench
```

### Code Coverage

Install cargo-tarpaulin:

```bash
cargo install cargo-tarpaulin

# Generate coverage report
cargo tarpaulin --out Html

# Open coverage report
open tarpaulin-report.html
```

## Continuous Integration

### GitHub Actions Example

Create `.github/workflows/test.yml`:

```yaml
name: Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - name: Run tests
        run: cargo test --all
      - name: Run benchmarks
        run: cargo bench --no-run  # Compile but don't run

  benchmark:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - name: Benchmark
        run: cargo bench -- --save-baseline pr-${{ github.event.number }}
```

## Test Best Practices

### Before Committing

```bash
# Full test suite
cargo test

# Check formatting
cargo fmt --check

# Check for warnings
cargo clippy

# Run benchmarks to check for regressions
cargo bench
```

### Adding New Tests

1. **Integration test** - Add to `tests/integration_tests.rs` if testing overall functionality
2. **Regression test** - Add to `tests/regression_tests.rs` if verifying specific values
3. **Unit test** - Add to source file's `#[cfg(test)] mod tests` if testing internal logic
4. **Benchmark** - Add to appropriate bench file if measuring performance

### Writing Good Tests

```rust
// Integration tests pull the shared helpers in via the `common` module;
// also `use` the reader type from the crate itself.
mod common;
use crate::common::test_data_path;

#[test]
fn test_descriptive_name() {
    // Arrange - set up test data
    let path = test_data_path("test1.sas7bdat");

    // Act - perform the action
    let reader = Sas7bdatReader::open(&path).unwrap();
    let df = reader.read_all().unwrap();

    // Assert - verify the result
    assert_eq!(df.height(), 10);
    assert_eq!(df.width(), 4);
}
```

## Performance Tracking

### Establishing Baselines

When setting up performance tracking:

```bash
# 1. Run benchmarks on main branch
git checkout main
cargo bench -- --save-baseline main

# 2. Switch to feature branch
git checkout feature-branch

# 3. Make changes and benchmark
cargo bench -- --baseline main

# 4. Review differences
# Criterion will show % change from baseline
```

### Interpreting Results

- **Green** (improvement): Your change made it faster
- **Red** (regression): Your change made it slower
- **Yellow** (noise): No significant change

Criterion uses statistical analysis to filter out noise.

## Troubleshooting

### Tests Failing

```bash
# Run with verbose output
cargo test -- --nocapture --test-threads=1

# Run specific failing test
cargo test test_name -- --exact --nocapture
```

### Benchmarks Not Running

```bash
# Criterion is a dev-dependency, not a separate tool; confirm it is declared
grep criterion Cargo.toml

# Compile benchmarks without running
cargo bench --no-run

# Check for compilation errors
cargo check --benches
```

### Missing Test Files

If tests fail because test files are missing:

```bash
# Check if test data exists
ls tests/sas/data/

# The test suite expects SAS7BDAT files in tests/sas/data/
# Make sure your test files are in the correct location
```

## Summary

| Command | Purpose |
|---------|---------|
| `cargo test` | Run all tests |
| `cargo test --test integration_tests` | Run integration tests |
| `cargo test --test regression_tests` | Run regression tests |
| `cargo bench` | Run all benchmarks |
| `cargo bench --bench read_benchmarks` | Run read benchmarks |
| `cargo bench --bench schema_benchmarks` | Run schema benchmarks |
| `cargo bench -- --save-baseline name` | Save performance baseline |
| `cargo bench -- --baseline name` | Compare against baseline |
| `cargo nextest run` | Fast test runner |
| `cargo watch -x test` | Auto-run tests on changes |
| `cargo tarpaulin --out Html` | Generate coverage report |

## Next Steps

1. Run `cargo test` to verify all tests pass
2. Run `cargo bench` to establish performance baselines
3. Set up CI/CD with GitHub Actions
4. Add new tests as you develop features
5. Monitor benchmarks to catch performance regressions