# grab
`grab` is a high-performance, declarative stream processor for delimited text data.
It is designed to replace fragile shell pipelines (`awk`, `cut`, `sed`) with a structured approach to data extraction and manipulation. Instead of relying on complex, column-based syntax, `grab` lets you define your data schema upfront, turning messy, brittle pipelines into readable, maintainable, and verifiable data flows.
## Key Features
- **High Performance:** Processes up to ~17 million fields/sec (often limited only by system pipe throughput).
- **Safety First:** Strict UTF-8 validation and schema enforcement by default.
- **JQ's Best Friend:** Turns messy delimited text into structured JSON input for `jq`.
- **Zero Dependencies:** Single static binary (~800KB), statically linked against musl; no system libc required.
## Quick Start
To create JSON objects from a CSV file, you can use the following command:
```bash
# users.csv:
# 1,John,Doe,555-1234,555-5678,London,UK
# 2,Jane,Smith,555-8765,555-4321,New York,USA
grab --mapping id,_,last,phones:2,_:g --json < users.csv
# Output:
# {"id":"1","last":"Doe","phones":["555-1234","555-5678"]}
# {"id":"2","last":"Smith","phones":["555-8765","555-4321"]}
```
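The mapping tokens in the example can be read as: a bare name keeps one field, `_` skips one, `name:N` collects the next N fields into an array, and `_:g` greedily skips everything remaining. As an illustration only (this is not `grab`'s actual implementation, just the semantics inferred from the example above), the mini-language can be sketched in a few lines of Python:

```python
# Sketch of grab's mapping mini-language, inferred from the Quick Start
# example above. Tokens: `name` keeps one field, `_` skips one,
# `name:N` collects N fields into a list, `_:g` greedily skips the rest.
import csv
import io
import json

def apply_mapping(spec: str, row: list[str]) -> dict:
    out = {}
    i = 0
    for token in spec.split(","):
        name, _, count = token.partition(":")
        if count == "g":            # greedy skip: consume the remainder
            i = len(row)
        elif count:                 # name:N -> list of N fields
            n = int(count)
            if name != "_":
                out[name] = row[i:i + n]
            i += n
        else:                       # single field; `_` means skip it
            if name != "_":
                out[name] = row[i]
            i += 1
    return out

data = ("1,John,Doe,555-1234,555-5678,London,UK\n"
        "2,Jane,Smith,555-8765,555-4321,New York,USA\n")
for row in csv.reader(io.StringIO(data)):
    print(json.dumps(apply_mapping("id,_,last,phones:2,_:g", row),
                     separators=(",", ":")))
```

Running the sketch on the sample rows reproduces the JSON lines shown above.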
## Benchmark
Benchmarks were done as follows:
- **Machine**: Lenovo Thinkpad E15 Gen 2
- **Dataset**: 2 million rows of CSV data with 12 columns (~350MB, ~24 million fields)
### All columns
Even while validating the schema, ensuring UTF-8 correctness, and handling errors across all 24 million fields, `grab` sustains a throughput of 8.9 million fields per second.
```bash
hyperfine --warmup 3 --runs 5 "./grab --mapping index,customer_id,first_name,last_name,company,city,country,phones:2,email,subscription_date,website --skip 1 --json < .demo/2mil.csv > /dev/null"
# Results
# Time (mean ± σ): 2.677 s ± 0.012 s [User: 2.631 s, System: 0.046 s]
# Range (min … max): 2.662 s … 2.691 s 5 runs
# Throughput: 8.9 million fields/s
```
### Filtering and taking a subset
When `grab` is used as intended, mapping only the fields we care about and skipping the rest, performance improves significantly: throughput reaches 17.1 million fields per second (skipped fields included).
```bash
hyperfine --warmup 3 --runs 5 "./grab --mapping _:2,first_name,last_name,_:3,phones:2,email,_:g --skip 1 --json < .demo/2mil.csv > /dev/null"
# Results
# Time (mean ± σ): 1.397 s ± 0.014 s [User: 1.357 s, System: 0.040 s]
# Range (min … max): 1.381 s … 1.412 s 5 runs
# Throughput: 17.1 million fields/s
```
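The throughput figures follow directly from the dataset size and the measured mean times (and are rounded down in the text); a quick sanity check:

```python
# Cross-check the reported throughput from the benchmark's own numbers:
# 2 million rows x 12 columns = 24 million fields per run.
fields = 2_000_000 * 12

all_columns = fields / 2.677   # mean time for the full-schema run
partial_map = fields / 1.397   # mean time with partial map + greedy skip

print(f"all columns: {all_columns / 1e6:.2f} M fields/s")   # ~8.97
print(f"partial map: {partial_map / 1e6:.2f} M fields/s")   # ~17.18
```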
### Note
Profiling shows that a significant portion of the execution time is spent in system calls and kernel-space I/O; `grab` often operates at the theoretical throughput limit of the system pipe.
### TL;DR
| Scenario | Throughput | Mean time |
| --- | --- | --- |
| All columns with full schema validation | 8.9 million fields/s | 2.68 s |
| Partial map + greedy skip | **17.1 million fields/s** | **1.40 s** |
## Installation
### Binaries
Precompiled binaries for Linux are available on the releases page.
### Cargo
You can also install `grab` using Cargo:
```bash
cargo install grab-cli
```
### Source
To build from source, clone the repository and run:
```bash
cargo build --release
```
## Contributing
As of now, `grab` is in early development and not yet accepting contributions. However, if you're interested in contributing or have ideas for features, please reach out directly.