🚇 ClickHouse DataFusion Integration
A high-performance Rust library that integrates ClickHouse with Apache DataFusion, enabling seamless querying across ClickHouse and other data sources.
Built on clickhouse-arrow for optimal performance and DataFusion for advanced SQL analytics.
Why clickhouse-datafusion?
- 🚀 High Performance: Built on clickhouse-arrow for optimal data transfer and Arrow format efficiency
- ⚡ Connection Pooling: clickhouse-arrow provides connection pooling for scalability
- 🔗 Federation Support: Join
ClickHousetables with otherDataFusionsources seamlessly - ⛓️ Two-tier Execution: Delegate complexity to
ClickHouseleveraging additional optimizations at the edge - 🛠️
ClickHouseUDFs: Direct access toClickHousefunctions inDataFusionSQL queries - 📊 Advanced Analytics: Support for window functions, CTEs, subqueries, and complex JOINs
- 🎯 Arrow Native: Native Apache Arrow integration for zero-copy data processing
- 🔄 Schema Flexibility: Optional schema coercion after fetching data for automatic type compatibility
Quick Start
Add to your Cargo.toml:
[]
= "0.1.4"
Basic Usage
use ClickHouseEngine;
use ;
use SessionContext;
async
Preserving custom SessionContext configuration
If you add bespoke configuration to a SessionContext before wrapping it (for example adding
custom catalogs or enabling SessionContext::enable_url_table so URLs can be queried directly),
use ClickHouseSessionContext::with_session_transform
to re-apply those changes after the ClickHouse-specific rebuild:
use SessionContext;
use ClickHouseSessionContext;
let base = new.enable_url_table;
let ctx = from
.with_session_transform;
This keeps your customisation intact while the library layers in its own query planner and UDFs.
With Federation (Cross-DBMS Queries)
use ClickHouseEngine;
use ;
use FederatedContext;
use SessionContext;
async
ClickHouse Functions
Access ClickHouse's powerful functions directly in DataFusion SQL:
Direct Function Calls
-- Mathematical functions
SELECT clickhouse(sigmoid(price), 'Float64') as price_sigmoid FROM products;
-- String functions
SELECT clickhouse(`base64Encode`(name), 'Utf8') as b64_name FROM users;
-- Array functions
SELECT clickhouse(`arrayJoin`(tags), 'Utf8') as individual_tags FROM articles;
Lambda Functions
-- Transform arrays with lambda functions
SELECT
names,
clickhouse(`arrayMap`($x, concat($x, '_processed'), names), 'List(Utf8)') as processed_names
FROM user_data;
-- Filter arrays
SELECT clickhouse(`arrayFilter`($x, length($x) > 3, tags), 'List(Utf8)') as long_tags
FROM content;
Complex Analytics
-- Window functions with ClickHouse functions
SELECT
user_id,
clickhouse(exp(revenue), 'Float64') as exp_revenue,
SUM(revenue) OVER (PARTITION BY user_id ORDER BY date) as running_total
FROM sales_data;
-- CTEs with ClickHouse functions
WITH processed_data AS (
SELECT
user_id,
clickhouse(`arrayJoin`(event_types), 'Utf8') as event_type
FROM user_events
)
SELECT event_type, COUNT(*) as event_count
FROM processed_data
GROUP BY event_type;
Architecture
Core Components
ClickHouseBuilder: Main configuration entry pointClickHouseSessionContext: EnhancedDataFusioncontext withClickHouseUDF support- Table Providers:
DataFusionintegration layer forClickHousetables - Federation: Cross-database query support via datafusion-federation
- UDF System:
ClickHousefunction integration with intelligent pushdown - Function Analyzer: Advanced optimization for UDF placement
Key Features
Connection Pooling
let clickhouse = new
.configure_pool
.configure_client
.build_catalog
.await?;
Schema Coercion
let builder = new
.with_schema_coercion // Enable automatic type coercion
.build_catalog
.await?;
Schema Management
// Create tables from Arrow schema
let schema = new;
clickhouse.with_new_table.create.await?;
Federation Support
When the federation feature is enabled, clickhouse-datafusion can join ClickHouse tables with other DataFusion sources:
Note: The current release uses
datafusion-federationv0.4.7 from crates.io for publishing compatibility. This version has a known issue withUNNESToperations due to an upstream DataFusion bug in expression handling. If you needUNNESTsupport in federated queries, please track PR #135 for the fix.
-- Join ClickHouse with Parquet files
SELECT
ch.user_id,
ch.total_purchases,
parquet.user_segment
FROM clickhouse.analytics.user_stats ch
JOIN local_parquet.user_segments parquet
ON ch.user_id = parquet.user_id
WHERE ch.total_purchases > 1000;
-- Federated aggregations
SELECT
segment,
AVG(clickhouse(log(total_purchases), 'Float64')) as avg_log_purchases
FROM (
SELECT
ch.user_id,
ch.total_purchases,
csv.segment
FROM clickhouse.sales.users ch
JOIN local_csv.segments csv ON ch.user_id = csv.user_id
)
GROUP BY segment;
Features
- default: Core functionality with
ClickHouseintegration - federation: Enable cross-database queries
- cloud:
ClickHouseCloud support - test-utils: Testing utilities for development
Development
Running Tests
# All tests with multiple feature combinations
# Specific test types
# Coverage reports
Environment Variables
RUST_LOG=debug- Enable debug loggingDISABLE_CLEANUP=true- Keep test containers runningDISABLE_CLEANUP_ON_ERROR=true- Keep containers on test failure
Examples
See the examples directory for a complete working example:
- Basic Integration: Simple
ClickHousequerying - Federation: Cross-database joins
- UDF Usage:
ClickHousefunction examples - Schema Management: Table creation and management
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
License
Licensed under the Apache License, Version 2.0. See LICENSE for details.