dply 0.1.2 - Docs.rs

dply is a command line tool for viewing, querying, and writing CSV and Parquet files, inspired by dplyr and powered by polars.

Usage overview

A dply pipeline consists of a number of functions to read, transform, or write data to disk.

The following is an example of a three-step pipeline that reads a Parquet file, selects all columns whose name contains amount, and shows the first few rows[^1]:

$ dply -c 'parquet("nyctaxi.parquet") | select(contains("amount")) | head()'
shape: (10, 4)
┌─────────────┬────────────┬──────────────┬──────────────┐
│ fare_amount ┆ tip_amount ┆ tolls_amount ┆ total_amount │
│ ---         ┆ ---        ┆ ---          ┆ ---          │
│ f64         ┆ f64        ┆ f64          ┆ f64          │
╞═════════════╪════════════╪══════════════╪══════════════╡
│ 14.5        ┆ 3.76       ┆ 0.0          ┆ 22.56        │
│ 6.5         ┆ 0.0        ┆ 0.0          ┆ 9.8          │
│ 11.5        ┆ 2.96       ┆ 0.0          ┆ 17.76        │
│ 18.0        ┆ 4.36       ┆ 0.0          ┆ 26.16        │
│ 12.5        ┆ 3.25       ┆ 0.0          ┆ 19.55        │
│ 19.0        ┆ 0.0        ┆ 0.0          ┆ 22.3         │
│ 8.5         ┆ 0.0        ┆ 0.0          ┆ 11.8         │
│ 6.0         ┆ 2.0        ┆ 0.0          ┆ 11.3         │
│ 12.0        ┆ 3.26       ┆ 0.0          ┆ 19.56        │
│ 9.0         ┆ 2.56       ┆ 0.0          ┆ 15.36        │
└─────────────┴────────────┴──────────────┴──────────────┘
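To make the semantics of this pipeline concrete, here is a rough plain-Python sketch of what the select(contains("amount")) and head() steps compute: keep the columns whose name contains a substring, then keep the first rows. This is not how dply works internally (dply delegates to polars). The helper names select_contains and head, the sample rows, and the default of 10 rows (inferred from the shape of the output above) are assumptions made for illustration.

```python
# Illustrative sketch only: plain-Python stand-ins, not dply or polars code.

def select_contains(rows, pattern):
    """Keep only the columns whose name contains `pattern`."""
    return [{k: v for k, v in row.items() if pattern in k} for row in rows]


def head(rows, n=10):
    """Return the first `n` rows (10 is assumed from the output above)."""
    return rows[:n]


# Hypothetical rows: the amounts mirror the first two table rows above,
# the payment_type values are made up for the example.
rows = [
    {"payment_type": "Credit card", "fare_amount": 14.5, "tip_amount": 3.76,
     "tolls_amount": 0.0, "total_amount": 22.56},
    {"payment_type": "Cash", "fare_amount": 6.5, "tip_amount": 0.0,
     "tolls_amount": 0.0, "total_amount": 9.8},
]

result = head(select_contains(rows, "amount"))
for row in result:
    print(row)
```

The payment_type column is dropped because its name does not contain amount, while the four amount columns are kept in their original order.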

A simple pipeline can be passed as a command line argument with the -c flag or through standard input. For more complex pipelines, it is convenient to store the pipeline in a file and run dply with the file name as a command line argument.

For example, the NYC taxi test file[^1] has payment_type and total_amount columns. Say we want to find, for each payment type, the minimum, maximum, and mean amount paid and the number of payments, sorted by the number of payments in descending order. We can write the following pipeline in a dply file:

# Compute some statistics on the payment types
parquet("nyctaxi.parquet") |
    group_by(payment_type) |
    summarize(
        mean_price = mean(total_amount),
        min_price = min(total_amount),
        max_price = max(total_amount),
        n = n()
    ) |
    arrange(desc(n)) |
    show()

and then run the script:

$ dply payments.dply
shape: (5, 5)
┌──────────────┬────────────┬───────────┬───────────┬─────┐
│ payment_type ┆ mean_price ┆ min_price ┆ max_price ┆ n   │
│ ---          ┆ ---        ┆ ---       ┆ ---       ┆ --- │
│ str          ┆ f64        ┆ f64       ┆ f64       ┆ u32 │
╞══════════════╪════════════╪═══════════╪═══════════╪═════╡
│ Credit card  ┆ 22.378757  ┆ 8.5       ┆ 84.36     ┆ 185 │
│ Cash         ┆ 18.458491  ┆ 3.3       ┆ 63.1      ┆ 53  │
│ Unknown      ┆ 26.847778  ┆ 9.96      ┆ 54.47     ┆ 9   │
│ Dispute      ┆ -0.5       ┆ -8.3      ┆ 7.3       ┆ 2   │
│ No charge    ┆ 8.8        ┆ 8.8       ┆ 8.8       ┆ 1   │
└──────────────┴────────────┴───────────┴───────────┴─────┘
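For readers unfamiliar with dplyr-style verbs, the steps of the pipeline above can be sketched in plain Python as follows. This is only an illustration of what group_by, summarize, and arrange(desc(n)) compute, not dply's implementation (dply delegates to polars). The input rows are hypothetical and are not taken from nyctaxi.parquet.

```python
# Illustrative sketch only: plain Python, not dply or polars code.
from collections import defaultdict
from statistics import mean

# Hypothetical (payment_type, total_amount) rows, not read from nyctaxi.parquet.
rows = [
    ("Credit card", 22.56),
    ("Cash", 9.8),
    ("Credit card", 17.76),
    ("Credit card", 26.16),
    ("Cash", 11.8),
]

# group_by(payment_type): collect the amounts that belong to each group.
groups = defaultdict(list)
for payment_type, total_amount in rows:
    groups[payment_type].append(total_amount)

# summarize(...): produce one output row per group.
summary = [
    {
        "payment_type": key,
        "mean_price": mean(amounts),
        "min_price": min(amounts),
        "max_price": max(amounts),
        "n": len(amounts),
    }
    for key, amounts in groups.items()
]

# arrange(desc(n)): sort by group size, largest group first.
summary.sort(key=lambda row: row["n"], reverse=True)

for row in summary:
    print(row)
```

Each group collapses to a single row, which is why the dply output above has one row per payment type rather than one row per trip.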

[^1]: The file nyctaxi.parquet in the tests/data folder is a 250-row Parquet file sampled from the NYC trip record data.

Supported functions

dply supports the following functions:

arrange Sorts rows by column values
count Counts the unique values of columns
csv Reads or writes a dataframe in CSV format
distinct Retains unique rows
filter Filters rows that satisfy given predicates
glimpse Shows a dataframe overview
group_by and summarize Performs grouped aggregations
head Shows the first few dataframe rows in table format
mutate Creates or mutates columns
parquet Reads or writes a dataframe in Parquet format
relocate Changes column positions
rename Renames columns
select Selects columns
show Shows all dataframe rows

More examples can be found in the tests folder.

Installation

Binaries generated by the release GitHub action for Linux, macOS (x86), and Windows are available on the releases page.

You can also install dply using Cargo:

cargo install dply

or by building it from this repository:

git clone https://github.com/vincev/dply-rs
cd dply-rs
cargo install --path .