datu 0.3.4

datu - a data file utility
Documentation
# datu Version Notes

## v0.3.4

### Highlights

- **REPL and pipeline**: Aggregate functions (`avg`, `min`, `max`) with validation and clearer selection and error handling.
- **XLSX**: Refactored writing to use newer Arrow array APIs.
- **Quality**: Parquet validation helper; REPL and conversion tests refactored for clarity.
- **Docs**: Pipeline diagram and documentation layout updates.

### Improvements

- **Pipeline**
  - Unified dispatch for construction; continued modularization after the v0.3.3 pipeline work.

- **REPL**
  - Aggregate support and incremental test cleanup.

- **Tooling**
  - Release workflow docs updated so `make-release` can run on `release/*` branches.

### Changelog Stats

- 18 commits
- 53 files changed
- 5528 insertions
- 5252 deletions

## v0.3.3

### Highlights

- **Pipeline**: Large refactor toward a unified CLI and REPL pipeline—shared `ReadArgs`, consolidated readers, and `heads_or_tails` handling.
- **Writes**: Replaced per-format write markers with a single `WriteResult` type for clearer sink outcomes.
- **Planning errors**: Added `PipelinePlanningError::ConflictingOptions` when pipeline options are mutually exclusive.

### Improvements

- **CLI and REPL**
  - Shared read path and arguments between interactive and command-line use.

- **Pipeline internals**
  - Unified write completion signaling via `WriteResult`.
  - Stricter validation for conflicting pipeline options.

### Changelog Stats

- 4 commits
- 46 files changed
- 4714 insertions
- 2643 deletions

## v0.3.2

### Highlights

- **Convert command** now uses the DataFrame API for streamlined file processing; internal `resolve_input_file_type` renamed to `resolve_file_type`.
- **Avro**: Added compatibility for Int16 fields in record batches.
- **Error handling**: Replaced `anyhow` with `eyre`.
- **REPL**: Refactored evaluation to use `exec_*` naming and improved test structure; feature files updated with REPL equivalents.
- **Docs**: README updates for JSON support and REPL usage.
- **CI**: Switched to `actions-rust-lang/setup-rust-toolchain@v1`, Rust 1.94.0, no nightly.

### Improvements

- **Convert**
  - Implemented DataFrame API support for file processing in the convert command.

- **Avro pipeline**
  - Int16 fields in record batches are now handled correctly when writing Avro.

- **REPL**
  - Evaluation methods refactored to `exec_*` naming convention.
  - Test structure and REPL feature scenarios updated.

- **Project**
  - Added `.clocignore`.
  - Dependency and version updates in Cargo files.
  - CI workflow and toolchain adjustments.

### Changelog Stats

- 16 commits
- 28 files changed
- 754 insertions
- 610 deletions

## v0.3.1

### Highlights

- Added a `sample` CLI command and REPL function for randomly sampling rows from data files.
- Replaced the `spinoff` spinner with an `indicatif` progress bar in the `convert` command, showing row-level progress for Parquet and ORC inputs.
- Improved REPL with incremental execution and persistent history across sessions.
- Internal refactoring of record-batch writing and tail slicing into shared pipeline helpers.

### New Features

- **`sample` command (CLI and REPL)**
  - `datu sample <FILE> [-n N]` — print N random rows from a Parquet, Avro, CSV, or ORC file (default: 10).
  - Uses index-based sampling for Parquet and ORC (metadata-aware); reservoir sampling for Avro and CSV.
  - Supports `--select`, `--output`, `--sparse`, and `--has-headers` options (same as `head`/`tail`).
  - REPL: `read("file") |> sample(5)` — samples 5 random rows from the pipeline.

- **Progress bar for `convert`**
  - Parquet and ORC inputs show a row-level progress bar (using file metadata for total row count).
  - Avro and CSV inputs show an animated spinner (total row count not available upfront).

### Improvements

- **REPL enhancements**
  - Incremental pipeline execution: stages are evaluated as they are parsed.
  - Command history is now persisted across sessions.

- **Internal refactoring**
  - Extracted record-batch writing into a shared sink harness (`batch_write` module).
  - Extracted tail batch slicing into a shared pipeline helper.

### Changelog Stats

- 7 commits
- 20 files changed
- 1317 insertions
- 157 deletions

## v0.3.0

### Highlights

- Added an interactive REPL mode: running `datu` without a subcommand now starts an interactive pipeline shell.
- Added CSV input support across key commands: `convert`, `count`, `schema`, `head`, and `tail`.
- Refactored the pipeline engine toward an async/DataFusion DataFrame-based architecture for cleaner composition and execution.
- Expanded test coverage with new REPL feature tests and reorganized CLI feature tests.

### New Features

- **Interactive REPL**
  - New REPL entrypoint and parser/evaluator flow.
  - Supports pipeline-style expressions (for example: `read("input.parquet") |> select(:id, :email) |> write("output.csv")`).
  - Added REPL integration tests and feature scenarios.

- **CSV as a first-class input format**
  - CSV can now be read by:
    - `datu convert`
    - `datu count`
    - `datu schema`
    - `datu head`
    - `datu tail`
  - Added `--has-headers[=bool]` to control CSV header handling.

### Improvements

- **Pipeline refactor**
  - Introduced and reworked modules for datasource and I/O concerns (`dataframe`, `datasource`, `read`, `write`, `select`).
  - Async pipeline execution improvements and clearer separation of responsibilities.

- **Documentation**
  - README expanded with command usage updates, CSV examples/options, and REPL usage examples.

- **Project metadata**
  - Added MIT `LICENSE`.
  - Updated CI/config and dependencies for new runtime/testing capabilities.

### Breaking / Behavioral Changes

- Running `datu` with no subcommand now launches the REPL.
- Commands that now accept CSV input may behave differently based on `--has-headers` (defaults to `true` when omitted).

### Testing

- CLI feature scenarios reorganized under `features/cli/*`.
- New REPL feature files and integration tests (`tests/repl.rs`).

### Changelog Stats

- 10 commits
- 46 files changed
- 5374 insertions
- 698 deletions