cudf-polars

GPU execution engine for Polars DataFrames using NVIDIA libcudf.

cudf-polars offloads Polars DataFrame operations to the GPU, which can speed up filter, sort, groupby, join, and other data-intensive operations.

Prerequisites

NVIDIA GPU with CUDA support
CUDA Toolkit 12.x
libcudf (built from cudf-sys / cudf-cxx in this workspace)
Rust 1.85+

Quick Start

use polars_core::prelude::*;
use polars_lazy::prelude::*;
use cudf_polars::collect_gpu;

fn main() {
    let df = df!(
        "id"    => [1i32, 2, 3, 1, 2, 3],
        "value" => [10.0f64, 20.0, 30.0, 40.0, 50.0, 60.0],
    ).unwrap();

    // Execute the lazy query on the GPU with a single call
    let result = collect_gpu(
        df.lazy()
          .filter(col("value").gt(lit(25.0)))
          .group_by([col("id")])
          .agg([col("value").sum()])
    ).unwrap();
    println!("{}", result);
}
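To sanity-check the GPU result, it can help to have a plain-Rust reference for the same query. The sketch below is not part of cudf-polars; `filter_group_sum` is a hypothetical helper that uses only the standard library to reproduce the filter and group-by sum from the Quick Start.

```rust
use std::collections::BTreeMap;

/// CPU reference for the Quick Start query (hypothetical helper, not a crate API):
/// keep rows where `value > threshold`, then sum `value` per `id`.
/// A BTreeMap keeps the keys sorted, so the result is deterministic.
fn filter_group_sum(ids: &[i32], values: &[f64], threshold: f64) -> BTreeMap<i32, f64> {
    let mut sums = BTreeMap::new();
    for (&id, &value) in ids.iter().zip(values) {
        if value > threshold {
            // Accumulate the surviving row into its group.
            *sums.entry(id).or_insert(0.0) += value;
        }
    }
    sums
}
```

For the Quick Start data and a threshold of 25.0 this gives 40.0 for id 1, 50.0 for id 2, and 90.0 for id 3. Polars does not guarantee group order unless maintain_order is requested, so sort the GPU result by id before comparing.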

Supported Operations

Category	Operation	API	Status
Transfer	CPU -> GPU	`GpuDataFrame::from_polars()`	Done
	GPU -> CPU	`GpuDataFrame::to_polars()`	Done
Selection	Column select	`GpuDataFrame::select_columns()`	Done
	Row slice	`GpuDataFrame::slice()`	Done
Filter	Boolean mask	`GpuDataFrame::apply_boolean_mask()`	Done
Sort	Sort by key columns	`GpuDataFrame::sort_by_key()`	Done
GroupBy	Aggregation	`GpuDataFrame::groupby()`	Done
Dedup	Distinct rows	`GpuDataFrame::distinct()`	Done
Join	Inner/Left/Full	`Table::inner_join()` etc.	Done
	Semi/Anti/Cross	`Table::left_semi_join()` etc.	Done
Union	Vertical concat	`concatenate_tables()`	Done
HConcat	Horizontal concat	Column collection	Done
Binary Ops	Column-column	`Column::binary_op()`	Done
	Column-scalar	`Column::binary_op_scalar()`	Done
Ternary	when/then/otherwise	`Column::copy_if_else()`	Done
Expression	Polars expr -> GPU	`cudf_polars::expr`	Done
Plan Exec	Full plan execution	`cudf_polars::execute_plan()`	Done

Supported Aggregations (GroupBy)

Sum, Min, Max, Count, Mean, Median, Variance, Std, Nunique, First, Last, Quantile.

Supported Data Types

Polars Type	cudf Type
Int8	INT8
Int16	INT16
Int32	INT32
Int64	INT64
UInt8	UINT8
UInt16	UINT16
UInt32	UINT32
UInt64	UINT64
Float32	FLOAT32
Float64	FLOAT64
Boolean	BOOL8
String	STRING
Date	TIMESTAMP_DAYS
Datetime	TIMESTAMP_{MS,US,NS}
Duration	DURATION_{MS,US,NS}
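The table above can be read as a total function from Polars types to cudf type names, with unsupported types rejected up front. The following is a minimal sketch of that idea, not the crate's actual conversion code: `PolarsType`, `TimeUnit`, and `cudf_type_name` are simplified stand-ins invented for illustration (the real Polars List, Struct, and Categorical types carry extra data), and the cudf names follow the shorthand used in the table.

```rust
/// Simplified stand-in for the Polars time units (assumption for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Milliseconds,
    Microseconds,
    Nanoseconds,
}

/// Simplified stand-in for the Polars data types mentioned in this README.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PolarsType {
    Int8, Int16, Int32, Int64,
    UInt8, UInt16, UInt32, UInt64,
    Float32, Float64,
    Boolean, String, Date,
    Datetime(TimeUnit),
    Duration(TimeUnit),
    Categorical, List, Struct,
}

/// Suffix used in the table for each time unit.
fn unit_suffix(unit: TimeUnit) -> &'static str {
    match unit {
        TimeUnit::Milliseconds => "MS",
        TimeUnit::Microseconds => "US",
        TimeUnit::Nanoseconds => "NS",
    }
}

/// Map a Polars type to the cudf type name from the table,
/// or return an explicit error for types listed as unsupported.
fn cudf_type_name(dtype: PolarsType) -> Result<String, String> {
    let name = match dtype {
        PolarsType::Int8 => "INT8".to_string(),
        PolarsType::Int16 => "INT16".to_string(),
        PolarsType::Int32 => "INT32".to_string(),
        PolarsType::Int64 => "INT64".to_string(),
        PolarsType::UInt8 => "UINT8".to_string(),
        PolarsType::UInt16 => "UINT16".to_string(),
        PolarsType::UInt32 => "UINT32".to_string(),
        PolarsType::UInt64 => "UINT64".to_string(),
        PolarsType::Float32 => "FLOAT32".to_string(),
        PolarsType::Float64 => "FLOAT64".to_string(),
        PolarsType::Boolean => "BOOL8".to_string(),
        PolarsType::String => "STRING".to_string(),
        PolarsType::Date => "TIMESTAMP_DAYS".to_string(),
        PolarsType::Datetime(unit) => format!("TIMESTAMP_{}", unit_suffix(unit)),
        PolarsType::Duration(unit) => format!("DURATION_{}", unit_suffix(unit)),
        PolarsType::Categorical | PolarsType::List | PolarsType::Struct => {
            return Err(format!("unsupported type for GPU execution: {dtype:?}"));
        }
    };
    Ok(name)
}
```

Checking every column's type this way before any data is transferred would let a caller fail fast, or stay on the CPU, without paying for a partial upload.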

Benchmark

cargo run --example benchmark -p cudf-polars --features gpu-tests --release

Architecture

Polars DataFrame
      |
      v  (Arrow C Data Interface)
cudf-polars::convert   -- zero-copy CPU <-> GPU bridge
      |
      v
cudf-polars::GpuDataFrame -- named GPU columns
      |
      v
cudf (Rust)  ->  cudf-cxx (C++ bridge)  ->  libcudf (NVIDIA)

execute_plan() takes an IRPlan obtained from polars-lazy's to_alp_optimized().

LazyFrame Integration

use polars_lazy::frame::LazyFrame;
use cudf_polars::engine::execute_plan;

// group_by() alone returns a LazyGroupBy; agg() turns it back into a LazyFrame.
let lf: LazyFrame = df.lazy().filter(...).group_by(...).agg(...);
let plan = lf.to_alp_optimized()?;
let result = execute_plan(plan)?;
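Since plan execution returns a Result, one possible pattern is to try the GPU first and fall back to the regular CPU engine when the GPU path reports an error. The sketch below shows only the control flow with generic closures. `collect_with_fallback` is a hypothetical helper, not part of cudf-polars, and it assumes that both paths can be made to return the same error type.

```rust
use std::fmt::Display;

/// Try the GPU path first; if it fails, report why and run the CPU path.
/// Hypothetical helper: `gpu` and `cpu` are caller-supplied closures.
fn collect_with_fallback<T, E: Display>(
    gpu: impl FnOnce() -> Result<T, E>,
    cpu: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    match gpu() {
        Ok(out) => Ok(out),
        Err(err) => {
            // Surface the reason so silent fallbacks do not hide unsupported plans.
            eprintln!("GPU execution failed ({err}); falling back to CPU");
            cpu()
        }
    }
}
```

In this crate, the `gpu` closure might wrap `execute_plan(plan)` and the `cpu` closure might call `collect()` on a clone of the same LazyFrame; whether the error types line up without conversion is an assumption to verify.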

Testing

# Run GPU e2e tests (56 tests + 1 doctest)
cargo test -p cudf-polars --features gpu-tests

# Python polars-gpu integration (81 tests)
python tests/polars_gpu_integration.py

Limitations

Polars version: Compatible with polars 0.53.0.
Unsupported types: Categorical, List, and Struct columns return explicit errors.
Unsupported expressions: Window functions with order_by, expression-level Sort/Filter/Slice.
Unsupported IR nodes: Cache, MapFunction (rename, explode, melt), ExtContext.
Multi-file Parquet: Only the first file of a multi-file scan is read.
GroupBy maintain_order: Approximated by sorting on the key columns, which does not preserve the true input order.
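The maintain_order limitation is easiest to see on a small input. The following standard-library sketch contrasts true input-order grouping with the key-sorted approximation; both functions are illustrative stand-ins and do not reflect how cudf-polars or libcudf implement group-by.

```rust
use std::collections::HashMap;

/// Group sums emitted in order of each key's first appearance
/// (what maintain_order promises).
fn group_sum_input_order(keys: &[i32], values: &[f64]) -> Vec<(i32, f64)> {
    let mut index: HashMap<i32, usize> = HashMap::new();
    let mut out: Vec<(i32, f64)> = Vec::new();
    for (&k, &v) in keys.iter().zip(values) {
        // Look up the output slot for this key, if one exists already.
        let slot = index.get(&k).copied();
        match slot {
            Some(i) => out[i].1 += v,
            None => {
                index.insert(k, out.len());
                out.push((k, v));
            }
        }
    }
    out
}

/// Same sums, but emitted in ascending key order
/// (the key-column sort approximation described above).
fn group_sum_key_sorted(keys: &[i32], values: &[f64]) -> Vec<(i32, f64)> {
    let mut out = group_sum_input_order(keys, values);
    out.sort_by_key(|&(k, _)| k);
    out
}
```

For keys `[3, 1, 3, 2]` and values `[1.0, 2.0, 3.0, 4.0]`, the first function yields groups in the order 3, 1, 2, while the second yields 1, 2, 3. The sums agree, but code that relies on first-appearance order will see a different row order on the GPU path.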

License

Apache-2.0 OR MIT

cudf-polars 0.3.1