# csv-managed
csv-managed is a Rust command-line utility for high‑performance exploration and transformation of CSV data at scale. It emphasizes streaming, typed operations, and reproducible workflows via schema (.schema) and index (.idx) files.
## Implemented Features

| Area | Description |
|---|---|
| Delimiters & encodings | Read/write comma, tab, pipe, semicolon, or any ASCII delimiter; override output delimiter; stream via stdin/stdout (`-`) with explicit `--input-encoding` / `--output-encoding`. |
| Schema inference (probe) | Sample or full-scan detection of String, Integer, Float, Boolean, Date, DateTime, and Guid columns; optional `--mapping` and `--replace` templates saved to `.schema`. |
| Schema authoring & listing | `schema` builds manual definitions with renames and replacements; `columns` prints column positions, names, types, and output aliases. |
| Value normalization | Schema replace arrays and boolean normalization convert legacy tokens before typed operations; `process --boolean-format` controls the emitted true/false representation. |
| Indexing (index) | Build B-tree index files with multiple named variants, mixed asc/desc columns, and combo expansion for shared prefixes. |
| Sorting (process) | Streams through matching index variants for ordered reads and falls back to a stable multi-column in-memory sort when needed. |
| Filtering & projection | Type-aware filters (`= != > >= < <= contains startswith endswith`), column inclusion/exclusion, row limits, and optional 1-based row numbers. |
| Derived columns | Evalexpr-powered expressions provide arithmetic, comparison, conditional, and string operations referencing headers or positional aliases (`cN`). |
| Append (append) | Concatenate files with header validation and optional schema enforcement to guarantee consistent types before merging. |
| Verification (verify) | Validates each file against the schema; default output is a column summary. `--report-invalid:detail[:summary] [LIMIT]` adds ANSI-highlighted row samples with optional sample limits. |
| Stats & frequency | `stats` streams count/mean/median/min/max/stddev per numeric column; `frequency` reports distinct value counts with an optional top-N cap. |
| Preview & table output | `preview` shows the first N rows as an elastic table; `process --table` renders transformed output as a table on stdout. |
| Joins (join) | Hash join supports inner, left, right, and full outer joins with schema-driven typing and replacement normalization for keys. |
| Installation | `install` wraps `cargo install` with convenience flags (`--locked`, `--force`, `--root`, `--version`) and matches the release workflow. |
## Installation

Binary (Windows): `target\release\csv-managed.exe`
Install from crates.io with `cargo install csv-managed`.
Install locally from the workspace (useful when developing), e.g. with `cargo install --path .`.
After building, the CLI can re-run installation on the current machine:
./target/release/csv-managed.exe install --locked
The helper wraps cargo install csv-managed and accepts --version, --force, --locked, and --root pass-through options.
Release automation: push a tag like `v0.1.0` and provide a `CRATES_IO_TOKEN` repository secret; the GitHub Actions release workflow will build archives and execute `cargo publish --locked` automatically.
Logging examples:

PowerShell: `$env:RUST_LOG='info'`
cmd.exe: `set RUST_LOG=info`
## Quick Start
# 1. Infer schema
./target/release/csv-managed.exe probe -i ./data/orders.csv -m ./data/orders.schema --sample-rows 0
# 2. Build index (optional for sorted reads)
./target/release/csv-managed.exe index -i ./data/orders.csv -o ./data/orders.idx --spec default=order_date:asc,customer_id:asc --spec recent=order_date:desc --schema ./data/orders.schema
# 3. Process with filters / derives / sort
./target/release/csv-managed.exe process -i ./data/orders.csv -m ./data/orders.schema -x ./data/orders.idx --index-variant default --sort order_date:asc,customer_id:asc --filter "status = shipped" --derive 'total_with_tax=amount*1.0825' --row-numbers -o ./data/orders_filtered.csv
# 4. Normalize legacy tokens via schema replacements
./target/release/csv-managed.exe process -i ./data/orders.csv -o ./data/orders_clean.csv --schema ./data/orders.schema
# 5. Summary statistics
./target/release/csv-managed.exe stats -i ./data/orders.csv -m ./data/orders.schema
# 6. Frequency counts (top 10)
./target/release/csv-managed.exe frequency -i ./data/orders.csv -m ./data/orders.schema --top 10
# 7. Preview first 15 rows
./target/release/csv-managed.exe preview -i ./data/orders.csv --rows 15
# 8. Join customers with orders
./target/release/csv-managed.exe join --left ./data/orders.csv --right ./data/customers.csv --left-key customer_id --right-key id --type inner -o joined.csv
# 9. Append monthly extracts
./target/release/csv-managed.exe append -i jan.csv -i feb.csv -i mar.csv -m orders.schema -o q1.csv
# 10. Verify integrity (summary default)
./target/release/csv-managed.exe verify -m orders.schema -i q1.csv
# Investigate failures with highlighted samples (optional limit)
./target/release/csv-managed.exe verify -m orders.schema -i orders_invalid.csv --report-invalid:detail:summary 5
## Command Reference

Detailed `--help` output for every command is mirrored in `docs/cli-help.md` for quick reference.
### probe

Infer column types and produce a `.schema` JSON file.

| Flag | Description |
|---|---|
| `-i, --input <FILE>` | Input CSV file. |
| `-m, --schema <FILE>` | Output schema file. |
| `--sample-rows <N>` | Sample size (0 = full scan). |
| `--delimiter <VAL>` | Input delimiter. |
| `--mapping` | Insert default lowercase_with_underscores `name_mapping` aliases into the `.schema` file and print templates to stdout. |
| `--replace` | Emit empty `replace` arrays for each column as a template for future value substitutions. |
PowerShell:
./target/release/csv-managed.exe probe `
-i ./data/orders.csv `
-m ./data/orders.schema `
--delimiter tab `
--sample-rows 0

### schema

Author a schema manually, with typed columns, renames, and value replacements. Use --replace to normalize legacy tokens or synonyms before validation and processing; each entry populates the column's replace array in the schema file.

PowerShell:

./target/release/csv-managed.exe schema `
-o ./schemas/orders.schema `
-c id:integer->Identifier `
-c "customer_id:integer->Customer ID,order_date:date,amount:float,status:string" `
--replace "status=Pending->Open" `
--replace "status=Closed (Legacy)->Closed"

cmd.exe:

./target/release/csv-managed.exe schema ^
-o ./schemas/orders.schema ^
-c id:integer->Identifier ^
-c "customer_id:integer->Customer ID,order_date:date,amount:float,status:string"
### index

Build a B-tree index over specified key columns; named variants may mix ascending and descending directions.

| Flag | Description |
|---|---|
| `-i, --input <FILE>` | Input CSV file. |
| `-o, --index <FILE>` | Output `.idx` file. |
| `-C, --columns <LIST>` | Legacy single ascending index (comma list). Superseded by `--spec`. |
| `--spec <SPEC>` | Repeatable: `name=col_a:asc,col_b:desc` or `col_a:asc`. Builds named variants per index file. |
| `--combo <SPEC>` | Generate prefix combinations with optional direction branches using `\|`, e.g. `geo=date:asc\|desc,customer:asc`. |
| `-m, --schema <FILE>` | Optional schema file. |
| `--limit <N>` | Stop after N rows (partial index). |
| `--delimiter <VAL>` | Input delimiter. |
PowerShell:
./target/release/csv-managed.exe index `
-i ./data/orders.csv `
-o ./data/orders.idx `
--spec default=order_date:asc,customer_id:asc `
--spec recent=order_date:desc `
-m ./data/orders.schema
cmd.exe:
./target/release/csv-managed.exe index ^
-i ./data/orders.csv ^
-o ./data/orders.idx ^
--spec default=order_date:asc,customer_id:asc ^
--spec recent=order_date:desc ^
-m ./data/orders.schema
`--spec` accepts comma-separated `column:direction` tokens. Prefix with `name=` to label the variant (e.g. `fast=col_a:asc,col_b:desc`). When omitted, the variant is anonymous but still usable for automatic matching.
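The exact expansion rules for `--combo` are internal to the index command; as a rough, hypothetical sketch of one plausible reading of "prefix combinations with optional direction branches" (each `|` branch multiplied out, every shared prefix kept as its own variant):

```python
from itertools import product

def expand_combo(spec: str) -> list[list[tuple[str, str]]]:
    """Hypothetical sketch of --combo expansion: each column may list several
    directions separated by '|'; every direction choice and every shared
    prefix becomes its own index variant."""
    columns = []
    for token in spec.split(","):
        name, _, dirs = token.partition(":")
        columns.append([(name, d) for d in (dirs or "asc").split("|")])
    variants = []
    for choice in product(*columns):           # pick one direction per column
        for end in range(1, len(choice) + 1):  # emit every shared prefix
            prefix = list(choice[:end])
            if prefix not in variants:
                variants.append(prefix)
    return variants

variants = expand_combo("date:asc|desc,customer:asc")
# four variants: date:asc / date:asc,customer:asc / date:desc / date:desc,customer:asc
```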
### process

Transform pipeline: sort, filter, derive, project, exclude, boolean formatting, row numbers, delimiter changes, optional table output.

| Flag | Description |
|---|---|
| `-i, --input <FILE>` | Input CSV (required). |
| `-o, --output <FILE>` | Output file (stdout if omitted). |
| `-m, --schema <FILE>` | Schema file. |
| `-x, --index <FILE>` | Index file for accelerated sort matching asc/desc directives. |
| `--index-variant <NAME>` | Pin to a named variant stored in the index file (requires a matching `--sort`). |
| `--sort <SPEC>` | Repeatable: `column[:asc or :desc]`. Comma list or multiple uses. |
| `-C, --columns <LIST>` | Inclusion list (repeatable). |
| `--exclude-columns <LIST>` | Exclusion list (repeatable). |
| `--derive <name=expr>` | Derived column (repeatable). |
| `--filter <expr>` | Filter expression (repeatable; AND). |
| `--row-numbers` | Prepend `row_number`. |
| `--limit <N>` | Emit at most N rows. |
| `--delimiter <VAL>` | Input delimiter. |
| `--output-delimiter <VAL>` | Output delimiter override. |
| `--boolean-format <FORMAT>` | Normalize boolean output. Formats: original, true-false, one-zero. |
| `--table` | Render as a formatted table (stdout only; incompatible with `--output`). |
PowerShell:
./target/release/csv-managed.exe process `
-i ./data/orders.csv `
-m ./data/orders.schema `
-x ./data/orders.idx `
--index-variant default `
--sort order_date:asc,customer_id:asc `
--filter "status = shipped" `
--filter "amount >= 100" `
--derive 'total_with_tax=amount*1.0825' `
--derive 'channel="online"' `
-C order_id,customer_id,amount,total_with_tax `
--exclude-columns internal_flag `
--row-numbers `
--boolean-format one-zero `
--output-delimiter pipe
cmd.exe:
./target/release/csv-managed.exe process ^
-i ./data/orders.csv ^
-m ./data/orders.schema ^
-x ./data/orders.idx ^
--index-variant default ^
--sort order_date:asc,customer_id:asc ^
--filter "status = shipped" ^
--filter "amount >= 100" ^
--derive "total_with_tax=amount*1.0825" ^
--derive "channel=\"online\"" ^
-C order_id,customer_id,amount,total_with_tax ^
--exclude-columns internal_flag ^
--row-numbers ^
--boolean-format one-zero ^
--output-delimiter pipe
If --index-variant is omitted, process automatically chooses the variant that covers the longest prefix of the requested --sort columns and directions.
Schema-driven replacements defined in the .schema file are always applied before parsing, so process can clean and transform data in a single pass.
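That ordering (replacement first, typed parse second) can be pictured with a minimal sketch; the column names and replace entries below are invented for illustration:

```python
# Sketch: schema replace arrays rewrite the raw token before the value is
# parsed into its declared type (hypothetical schema, not the .schema format).
schema = {
    "status": {"type": "string", "replace": {"Pending": "Open", "Closed (Legacy)": "Closed"}},
    "amount": {"type": "float", "replace": {"N/A": "0"}},
}

def normalize(column: str, raw: str):
    spec = schema[column]
    raw = spec["replace"].get(raw, raw)  # replacement happens first...
    if spec["type"] == "float":
        return float(raw)                # ...then the typed parse
    return raw

assert normalize("status", "Pending") == "Open"
assert normalize("amount", "N/A") == 0.0
```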
### append

Append multiple CSV files into a single output, ensuring consistent headers (baseline or schema enforced).

| Flag | Description |
|---|---|
| `-i, --input <FILE>` | Repeatable input files (the first defines the header). |
| `-o, --output <FILE>` | Output file (stdout if omitted). |
| `-m, --schema <FILE>` | Optional schema for strict validation. |
| `--delimiter <VAL>` | Delimiter for all inputs. |
Example:
./target/release/csv-managed.exe append -i jan.csv -i feb.csv -i mar.csv -m orders.schema -o q1.csv
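The header check can be pictured roughly like this; a simplified sketch of the idea (first file fixes the expected header), not the actual implementation:

```python
# Sketch of append's baseline header validation: the first file defines the
# expected header and every later file must match before its rows are copied.
def append_files(files):
    expected, out = None, []
    for header, rows in files:
        if expected is None:
            expected = header
            out.append(header)
        elif header != expected:
            raise ValueError(f"header mismatch: {header} != {expected}")
        out.extend(rows)
    return out

jan = (["order_id", "amount"], [["1", "10"]])
feb = (["order_id", "amount"], [["2", "20"]])
assert append_files([jan, feb]) == [["order_id", "amount"], ["1", "10"], ["2", "20"]]
```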
### verify

Validate one or more CSV files against a schema definition.

| Flag | Description |
|---|---|
| `-m, --schema <FILE>` | Schema file. |
| `-i, --input <FILE>` | Repeatable input files. |
| `--delimiter <VAL>` | Input delimiter. |
| `--report-invalid[:detail[:summary]] [LIMIT]` | Summarizes invalid columns by default; append `:detail` for row samples, `:detail:summary` for both, and optionally add LIMIT to cap sample rows. |
### preview

Display the first N rows in an elastic table.

| Flag | Description |
|---|---|
| `-i, --input <FILE>` | Input file. |
| `--rows <N>` | Number of data rows (default 10). |
| `--delimiter <VAL>` | Input delimiter. |
### stats

Summary statistics for numeric columns.

| Flag | Description |
|---|---|
| `-i, --input <FILE or ->` | Input file or `-` (stdin; requires schema). |
| `-m, --schema <FILE>` | Schema file (recommended). |
| `-C, --columns <LIST>` | Restrict to listed numeric columns. |
| `--delimiter <VAL>` | Input delimiter. |
| `--limit <N>` | Scan at most N rows (0 = all). |
### frequency

Distinct value counts per column.

| Flag | Description |
|---|---|
| `-i, --input <FILE or ->` | Input file or stdin (`-`; schema required for stdin). |
| `-m, --schema <FILE>` | Optional schema (required when reading stdin). |
| `-C, --columns <LIST>` | Limit to listed columns. |
| `--delimiter <VAL>` | Input delimiter. |
| `--top <N>` | Limit to top N values (0 = all). |
### join

Join two CSV files on one or more key columns.

| Flag | Description |
|---|---|
| `--left <FILE or ->` | Left input file or stdin (`-`; requires `--left-schema`). |
| `--right <FILE>` | Right input file (file path). |
| `-o, --output <FILE>` | Output file (stdout if omitted). |
| `--left-key <COLS>` | Comma-separated left key columns. |
| `--right-key <COLS>` | Comma-separated right key columns. |
| `--type <inner/left/right/full>` | Join type (default inner). |
| `--left-schema <FILE>` | Left schema (required if left is stdin). |
| `--right-schema <FILE>` | Right schema. |
| `--delimiter <VAL>` | Input delimiter. |
Example:
./target/release/csv-managed.exe join `
--left orders.csv `
--right customers.csv `
--left-key customer_id `
--right-key id `
--type left `
-o orders_with_customers.csv
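The hash-join strategy named in the feature table can be sketched in a few lines. This is an illustrative model of the technique (build a hash table on one side, probe with the other, pad unmatched rows for outer joins), not the tool's code; the sample rows are invented:

```python
from collections import defaultdict

def hash_join(left, right, left_key, right_key, how="inner"):
    """Sketch of a hash join: build on the right side, probe with the left."""
    table = defaultdict(list)
    for r in right:
        table[r[right_key]].append(r)
    out, matched_right = [], set()
    for l in left:
        hits = table.get(l[left_key], [])
        for r in hits:
            out.append({**l, **r})
            matched_right.add(id(r))
        if not hits and how in ("left", "full"):
            out.append({**l})               # pad unmatched left rows
    if how in ("right", "full"):
        for rows in table.values():
            out.extend({**r} for r in rows if id(r) not in matched_right)
    return out

orders = [{"customer_id": 1, "amount": 10}, {"customer_id": 3, "amount": 5}]
customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bo"}]
inner = hash_join(orders, customers, "customer_id", "id", "inner")
assert len(inner) == 1 and inner[0]["name"] == "Ada"
```

Building the table on the smaller side keeps memory bounded by the smaller input, which is the usual reason to prefer a hash join for one-pass CSV processing.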
### columns

List schema columns and their data types in a formatted table.

| Flag | Description |
|---|---|
| `-m, --schema <FILE>` | Schema file describing the columns to list. |

PowerShell:

./target/release/csv-managed.exe columns `
--schema ./data/orders.schema

cmd.exe:

./target/release/csv-managed.exe columns ^
--schema ./data/orders.schema

### install

Wrapper around `cargo install csv-managed` with a friendlier interface.

| Flag | Description |
|---|---|
| `--version <SEMVER>` | Install a specific published version. |
| `--force` | Reinstall even if already installed. |
| `--locked` | Pass `--locked` to respect Cargo.lock. |
| `--root <DIR>` | Target an alternate installation directory. |

Example:

./target/release/csv-managed.exe install --locked
## Derived Column Expression Notes

| Access Pattern | Example | Meaning |
|---|---|---|
| Normalized header | `total_with_tax=amount*1.0825` | Uses inferred numeric type. |
| Positional alias | `c3*1.1` | Fourth column (0-based). |
| String literal (PS) | `'tag="promo"'` | Single quotes wrap inner quotes. |
| String literal (cmd) | `"tag=\"promo\""` | Escaped inner quotes. |
| Row number | `row_index=row_number` | Available when `--row-numbers` is enabled. |
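How a positional alias resolves can be shown with a tiny sketch; the column names here are invented, and real expression evaluation is delegated to evalexpr inside the tool:

```python
# Sketch: cN aliases are 0-based, so c3 names the fourth column.
header = ["order_id", "customer_id", "order_date", "amount"]
row = {"order_id": "7", "customer_id": "42", "order_date": "2024-08-01", "amount": "100"}

def resolve(alias: str) -> str:
    """Map a cN alias to its header name; pass real column names through."""
    if alias.startswith("c") and alias[1:].isdigit():
        return header[int(alias[1:])]
    return alias

# c3*1.1 therefore reads the "amount" column:
value = float(row[resolve("c3")]) * 1.1
assert resolve("c3") == "amount"
assert round(value, 6) == 110.0
```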
## Filter Syntax Examples
PowerShell:
--filter "status = shipped" --filter "amount >= 100" --filter "customer_id startswith 1"
Mixed operators:
--filter "order_date >= 2024-01-01" --filter "description contains urgent" --filter "region != US"
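Because filters are type-aware, `order_date >= 2024-01-01` is a date comparison rather than a string comparison, and repeated `--filter` flags AND together. A simplified Python model of that evaluation (parsing rules and operator set abbreviated; column names invented):

```python
from datetime import date
import operator

# Simplified stand-ins for the documented filter operators.
OPS = {"=": operator.eq, "!=": operator.ne, ">": operator.gt,
       ">=": operator.ge, "<": operator.lt, "<=": operator.le,
       "contains": lambda a, b: b in a,
       "startswith": str.startswith, "endswith": str.endswith}

def matches(row, filters, types):
    def coerce(col, raw):
        if types[col] == "date":
            return date.fromisoformat(raw)   # typed, not lexicographic
        if types[col] == "float":
            return float(raw)
        return raw
    ok = True
    for col, op, rhs in filters:             # repeated filters AND together
        ok &= OPS[op](coerce(col, row[col]), coerce(col, rhs))
    return ok

types = {"order_date": "date", "amount": "float", "region": "string"}
row = {"order_date": "2024-02-10", "amount": "150", "region": "EU"}
filters = [("order_date", ">=", "2024-01-01"), ("amount", ">=", "100"), ("region", "!=", "US")]
assert matches(row, filters, types)
```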
## Data Types

| Type | Examples | Notes |
|---|---|---|
| String | any UTF‑8 | Normalized header names usable in expressions. |
| Integer | `42`, `-7` | 64-bit signed. |
| Float | `3.14`, `2` | f64; integers accepted. |
| Boolean | `true/false`, `t/f`, `yes/no`, `y/n`, `1/0` | Parsing is flexible; output format selectable. |
| Date | `2024-08-01`, `08/01/2024`, `01/08/2024` | Canonical output YYYY-MM-DD. |
| DateTime | `2024-08-01T13:45:00`, `2024-08-01 13:45` | Naive (no timezone). |
| Guid | `550e8400-e29b-41d4-a716-446655440000`, `550E8400E29B41D4A716446655440000` | Case-insensitive; accepts hyphenated or 32-hex representations. |
Future work: Decimal, Time-only, Currency.
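The flexible boolean and date parsing described in the table can be sketched as follows. The accepted spellings are taken from the table above; the tool's actual inference order and full format list are not reproduced here:

```python
from datetime import datetime

TRUE, FALSE = {"true", "t", "yes", "y", "1"}, {"false", "f", "no", "n", "0"}
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y"]  # subset; the tool accepts more

def parse_boolean(raw: str) -> bool:
    low = raw.strip().lower()            # case-insensitive, per the table
    if low in TRUE:
        return True
    if low in FALSE:
        return False
    raise ValueError(f"not a boolean: {raw!r}")

def parse_date(raw: str) -> str:
    for fmt in DATE_FORMATS:
        try:
            # canonical output form is YYYY-MM-DD
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    raise ValueError(f"not a date: {raw!r}")

assert parse_boolean("Yes") is True and parse_boolean("0") is False
assert parse_date("08/01/2024") == "2024-08-01"
```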
## Stdin/Stdout Usage

Use `-` for streaming input where supported. Example:
Get-Content orders.csv | ./target/release/csv-managed.exe stats -i - -m orders.schema
## Boolean Formatting Examples
./target/release/csv-managed.exe process -i orders.csv -m orders.schema --boolean-format one-zero -C shipped_flag -o shipped.csv
./target/release/csv-managed.exe process -i orders.csv -m orders.schema --boolean-format true-false --table -C shipped_flag
## Table Output

`--table` renders transformation results as an elastic-width ASCII table (stdout only; cannot be combined with `--output`).
## Index Strategy

The index stores byte offsets keyed by concatenated column values. A single `.idx` file can hold multiple named variants, each with its own mix of ascending/descending columns. `process` picks the variant that best matches the requested `--sort` signature, or you can force one via `--index-variant`. When no compatible variant exists, the command falls back to in-memory sorting.
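One plausible selection rule, consistent with "covers the longest prefix of the requested sort", is sketched below; the variant names and the exact matching policy are illustrative assumptions, not the tool's implementation:

```python
# Sketch: choose the stored variant whose columns form the longest fully
# matched prefix of the requested sort signature.
def best_variant(variants, sort_spec):
    best_name, best_len = None, 0
    for name, columns in variants.items():
        n = 0
        while n < min(len(columns), len(sort_spec)) and columns[n] == sort_spec[n]:
            n += 1
        if n == len(columns) and n > best_len:  # variant fully used as a prefix
            best_name, best_len = name, n
    return best_name

variants = {
    "default": [("order_date", "asc"), ("customer_id", "asc")],
    "recent": [("order_date", "desc")],
}
assert best_variant(variants, [("order_date", "asc"), ("customer_id", "asc")]) == "default"
assert best_variant(variants, [("amount", "asc")]) is None  # -> in-memory sort
```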
## Performance Considerations
- Indexed sort avoids loading all rows into memory.
- Early filtering cuts sort & derive workload.
- Derived expressions are evaluated per emitted row; keep them lean.
- Median requires storing column values (potential memory impact for huge numeric columns).
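The last point follows from how streaming statistics work: count, mean, min, max, and stddev need only constant state per column (e.g. Welford's online algorithm), while the median needs the stored values. A minimal sketch of that distinction, not the tool's code:

```python
import math

class Running:
    """Streaming stats: O(1) state except for the median's value buffer."""
    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0
        self.min = self.max = None
        self.values = []                  # only needed for the median

    def push(self, x: float):
        self.count += 1
        delta = x - self.mean             # Welford update keeps mean/variance
        self.mean += delta / self.count   # numerically stable in one pass
        self.m2 += delta * (x - self.mean)
        self.min = x if self.min is None else min(self.min, x)
        self.max = x if self.max is None else max(self.max, x)
        self.values.append(x)

    def stddev(self) -> float:
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0

    def median(self) -> float:
        s = sorted(self.values)           # memory cost grows with row count
        mid = len(s) // 2
        return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

r = Running()
for x in [10.0, 20.0, 30.0, 40.0]:
    r.push(x)
assert (r.mean, r.median()) == (25.0, 25.0)
```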
## Error Handling

- Rich `anyhow` contexts (I/O, parsing, evaluation, schema, index).
- Fast failure on unknown columns, invalid expressions, header/schema mismatches.
- Invalid UTF‑8 rows raise an error (never silently skipped).
## Logging

Set `RUST_LOG=csv_managed=debug` (or `info`) for insight into phases (index use, inference, filtering).
## Testing

Integration tests cover probe, index, and process (filters, derives, sort, delimiters). Additional tests are planned for joins, stats, and frequency.
## Contributing

- Fork & branch (`feat/<name>`).
- Add tests (unit + integration) for new behavior.
- Run `cargo fmt && cargo clippy && cargo test` before opening a PR.
- Update the README (move items from the roadmap when implemented).
## License
See LICENSE.
## Support
Open issues for bugs, enhancements, or documentation gaps. Pull requests welcome.