# Ironbeam
A data processing framework for Rust inspired by Apache Beam and Google Cloud Dataflow. Ironbeam provides a declarative API for building batch data pipelines with support for transformations, aggregations, joins, and I/O operations.
## Features

- Declarative pipeline API with fluent interface
- Stateless operations: `map`, `filter`, `flat_map`, `map_batches`
- Stateful operations: `group_by_key`, `combine_values`, keyed aggregations
- Built-in combiners: `Sum`, `Min`, `Max`, `Average`, `DistinctCount`, `TopK`
- Join support: inner, left, right, and full outer joins
- Side inputs for enriching streams with auxiliary data
- Sequential and parallel execution modes
- Type-safe with compile-time correctness
- Optional I/O backends: JSON Lines, CSV, Parquet
- Optional compression: gzip, zstd, bzip2, xz
- Metrics collection and checkpointing support
## Installation

Add to your `Cargo.toml`:

```toml
[dependencies]
ironbeam = "1.0.0"
```
By default, all features are enabled. To use a minimal configuration:

```toml
[dependencies]
ironbeam = { version = "1.0.0", default-features = false }
```
Available feature flags:

- `io-jsonl` - JSON Lines support
- `io-csv` - CSV support
- `io-parquet` - Parquet support
- `compression-gzip` - gzip compression
- `compression-zstd` - zstd compression
- `compression-bzip2` - bzip2 compression
- `compression-xz` - xz compression
- `parallel-io` - parallel I/O operations
- `metrics` - pipeline metrics collection
- `checkpointing` - checkpoint and recovery support
## Quick Start

```rust
use ironbeam::*;

// Illustrative sketch: the import path and the source constructor
// (`from_vec`) are assumptions; see the API docs for the exact names.
let p = Pipeline::default();
let results = p
    .from_vec(vec![1, 2, 3, 4, 5])
    .map(|x| x * 2)
    .filter(|x| *x > 4)
    .collect_seq()?;
```
## Core Concepts

### Pipeline

A `Pipeline` is the container for your computation graph. Create one with `Pipeline::default()`, then attach data sources and transformations.
### PCollection

A `PCollection<T>` represents a distributed collection of elements. Collections are immutable, lazy, and type-safe. Transformations create new collections, and computation happens when you call an execution method like `collect_seq()` or `collect_par()`.
### Transformations

Stateless operations work on individual elements:

- `map` - transform each element
- `filter` - keep elements matching a predicate
- `flat_map` - transform each element into zero or more outputs
- `map_batches` - process elements in batches
Stateful operations work on keyed data:

- `key_by` - convert to a keyed collection
- `map_values` - transform values while preserving keys
- `group_by_key` - group values by key
- `combine_values` - aggregate values per key
- `top_k_per_key` - select top K values per key
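As a standalone illustration of the `group_by_key` semantics (this sketch is independent of Ironbeam's API; the function below is written from scratch):

```rust
use std::collections::HashMap;

/// Group values by key: the core semantics behind a `group_by_key` step.
fn group_by_key<K: std::hash::Hash + Eq, V>(pairs: Vec<(K, V)>) -> HashMap<K, Vec<V>> {
    let mut groups: HashMap<K, Vec<V>> = HashMap::new();
    for (k, v) in pairs {
        // Append each value to the bucket for its key, preserving input order.
        groups.entry(k).or_default().push(v);
    }
    groups
}

fn main() {
    let grouped = group_by_key(vec![("a", 1), ("b", 2), ("a", 3)]);
    println!("{:?}", grouped["a"]); // prints [1, 3]
}
```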
### Combiners

Combiners provide efficient aggregation. Built-in options include:

- `Count` - count elements
- `Sum` - sum numeric values
- `Min` / `Max` - find minimum/maximum
- `AverageF64` - compute averages
- `DistinctCount` - count unique values
- `TopK` - select top K elements
Custom combiners can be implemented via the `CombineFn` trait.
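Ironbeam's exact `CombineFn` signatures are not shown here, so the following is a self-contained sketch of the usual combiner pattern (create an accumulator, feed it inputs, merge accumulators, extract the result); the trait is defined locally and only approximates what the real trait may look like:

```rust
// Local approximation of a combiner trait; Ironbeam's actual `CombineFn`
// signatures may differ.
trait CombineFn<I, A, O> {
    fn create_accumulator(&self) -> A;
    fn add_input(&self, acc: A, input: I) -> A;
    fn merge_accumulators(&self, a: A, b: A) -> A;
    fn extract_output(&self, acc: A) -> O;
}

/// A product combiner, as an example of a custom aggregation.
struct Product;

impl CombineFn<i64, i64, i64> for Product {
    fn create_accumulator(&self) -> i64 { 1 }
    fn add_input(&self, acc: i64, input: i64) -> i64 { acc * input }
    fn merge_accumulators(&self, a: i64, b: i64) -> i64 { a * b }
    fn extract_output(&self, acc: i64) -> i64 { acc }
}

/// Fold a slice through a combiner, the way a runner would per key.
fn combine_all<I: Copy, A, O>(f: &impl CombineFn<I, A, O>, inputs: &[I]) -> O {
    let acc = inputs
        .iter()
        .fold(f.create_accumulator(), |a, &x| f.add_input(a, x));
    f.extract_output(acc)
}

fn main() {
    println!("{}", combine_all(&Product, &[2, 3, 4])); // prints 24
}
```

The merge step is what lets a runner combine partial results computed on different shards without reprocessing inputs.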
## Joins

Keyed collections support all standard join operations:

```rust
// Illustrative: the argument to `join_inner` is an assumption.
let joined = left_collection.join_inner(&right_collection)?;
```

Available methods: `join_inner`, `join_left`, `join_right`, `join_full`.
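Semantically, an inner join pairs up values that share a key. A self-contained sketch of that semantics (independent of Ironbeam's API; the hash-join function below is written from scratch):

```rust
use std::collections::HashMap;

/// Inner-join two keyed datasets: emit (key, left, right) for every
/// left/right pair whose keys match.
fn inner_join<K, L, R>(left: &[(K, L)], right: &[(K, R)]) -> Vec<(K, L, R)>
where
    K: std::hash::Hash + Eq + Clone,
    L: Clone,
    R: Clone,
{
    // Build an index over the right side, then probe it with the left side.
    let mut index: HashMap<&K, Vec<&R>> = HashMap::new();
    for (k, r) in right {
        index.entry(k).or_default().push(r);
    }
    let mut out = Vec::new();
    for (k, l) in left {
        if let Some(rs) = index.get(k) {
            for r in rs {
                out.push(((*k).clone(), (*l).clone(), (**r).clone()));
            }
        }
    }
    out
}

fn main() {
    let left = [("a", 1), ("b", 2)];
    let right = [("a", "x"), ("c", "y")];
    println!("{:?}", inner_join(&left, &right)); // prints [("a", 1, "x")]
}
```

A left join would additionally emit unmatched left rows with a `None` right side, and the full outer join emits unmatched rows from both sides.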
## Side Inputs

Enrich pipelines with auxiliary data:

```rust
// Illustrative: the argument to `side_hashmap` and the `enrich` helper
// are hypothetical; only the method names come from the API.
let lookup = side_hashmap(reference_data);
let enriched = data.map_with_side(lookup, |item, side| enrich(item, side));
```
## Execution

Execute pipelines in sequential or parallel mode:

```rust
let results = collection.collect_seq()?; // Single-threaded
let results = collection.collect_par()?; // Multi-threaded with Rayon
```
## I/O Examples

### JSON Lines

```rust
// Illustrative: the import path and the reader name are assumptions;
// `write_jsonl` is part of the API.
use ironbeam::*;

let data = p.read_jsonl("events.jsonl")?;
data.write_jsonl("events_out.jsonl")?;
```
### CSV

```rust
// Illustrative: the reader name is an assumption; `write_csv` is from the API.
let data = p.read_csv("input.csv")?;
data.write_csv("output.csv")?;
```
### Parquet

```rust
// Illustrative: the reader name is an assumption; `write_parquet` is from the API.
let data = p.read_parquet("input.parquet")?;
data.write_parquet("output.parquet")?;
```
### Compression

Compression is automatically detected by file extension:

```rust
// Illustrative: the reader name is an assumption.
// Read compressed file (codec inferred from `.gz`)
let data = p.read_jsonl("input.jsonl.gz")?;

// Write compressed file (codec inferred from `.zst`)
data.write_jsonl("output.jsonl.zst")?;
```
Supported extensions: `.gz`, `.zst`, `.bz2`, `.xz`
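Extension-based detection presumably resembles the following self-contained sketch (the `Codec` enum and `detect_codec` function are illustrative, not Ironbeam's API):

```rust
#[derive(Debug, PartialEq)]
enum Codec {
    Gzip,
    Zstd,
    Bzip2,
    Xz,
    None,
}

/// Pick a codec by sniffing the path's final extension.
fn detect_codec(path: &str) -> Codec {
    match path.rsplit('.').next() {
        Some("gz") => Codec::Gzip,
        Some("zst") => Codec::Zstd,
        Some("bz2") => Codec::Bzip2,
        Some("xz") => Codec::Xz,
        _ => Codec::None, // unknown extension: treat the file as uncompressed
    }
}

fn main() {
    println!("{:?}", detect_codec("events.jsonl.gz")); // prints Gzip
    println!("{:?}", detect_codec("events.jsonl")); // prints None
}
```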
## Advanced Features

### Windowing

Group time-series data into fixed or sliding windows:

```rust
// Illustrative: the window-size argument is an assumption.
let windowed = data
    .window_fixed(Duration::from_secs(60))
    .group_by_key()
    .combine_values(Sum);
```
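Fixed windowing assigns each timestamp to the window that contains it. As a standalone sketch of that assignment (not Ironbeam's internals; timestamps and window sizes are in seconds here):

```rust
/// Start of the fixed window (of `size` seconds) containing `timestamp`.
fn window_start(timestamp: u64, size: u64) -> u64 {
    // Round the timestamp down to the nearest multiple of the window size.
    timestamp - timestamp % size
}

fn main() {
    // With 60-second windows, timestamps 120..=179 share one window.
    println!("{}", window_start(125, 60)); // prints 120
}
```

Sliding windows generalize this: each element lands in every window whose interval covers its timestamp, so one element can belong to several windows.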
### Checkpointing

Save and restore the pipeline's state:

```rust
// Illustrative: the checkpoint path and the recovery entry point
// are assumptions; only `checkpoint` is named by the API above.
data.checkpoint("checkpoints/stage1")?;

// Later, recover from checkpoint
let recovered = Pipeline::recover("checkpoints/stage1")?;
```
### Metrics

Collect pipeline execution metrics:

```rust
// Illustrative: `get_metrics` is from the API; the printed format is an assumption.
let result = collection.collect_seq()?;
let metrics = p.get_metrics();
println!("{metrics:?}");
```
## Examples

The `examples/` directory contains complete demonstrations:

- `etl_pipeline.rs` - extract, transform, load workflow
- `advanced_joins.rs` - join operations
- `windowing_aggregations.rs` - time-based windowing
- `combiners_showcase.rs` - using built-in combiners
- `compressed_io.rs` - working with compressed files
- `checkpointing_demo.rs` - checkpoint and recovery
- `metrics_example.rs` - collecting metrics
- `testing_pipeline.rs` - testing patterns

Run examples with:

```sh
cargo run --example etl_pipeline
```
## Testing

Run tests with:

```sh
cargo test
```

For coverage, a tool such as `cargo-tarpaulin` can be used:

```sh
cargo tarpaulin
```
## Requirements

This project uses the Rust 2024 edition and requires a recent Rust toolchain.