🚇 ClickHouse
DataFusion
Integration
A high-performance Rust library that integrates ClickHouse
with Apache DataFusion
, enabling seamless querying across ClickHouse
and other data sources.
Built on clickhouse-arrow for optimal performance and DataFusion for advanced SQL analytics.
Why clickhouse-datafusion?
- 🚀 High Performance: Built on clickhouse-arrow for optimal data transfer and Arrow format efficiency
- ⚡ Connection Pooling: clickhouse-arrow provides connection pooling for scalability
- 🔗 Federation Support: Join
ClickHouse
tables with otherDataFusion
sources seamlessly - ⛓️ Two-tier Execution: Delegate complexity to
ClickHouse
leveraging additional optimizations at the edge - 🛠️
ClickHouse
UDFs: Direct access toClickHouse
functions inDataFusion
SQL queries - 📊 Advanced Analytics: Support for window functions, CTEs, subqueries, and complex JOINs
- 🎯 Arrow Native: Native Apache Arrow integration for zero-copy data processing
- 🔄 Schema Flexibility: Optional schema coercion after fetching data for automatic type compatibility
Quick Start
Add to your Cargo.toml
:
[]
= "0.1.2"
Basic Usage
use ClickHouseEngine;
use ;
use SessionContext;
async
With Federation (Cross-DBMS Queries)
use ClickHouseEngine;
use ;
use FederatedContext;
use SessionContext;
async
ClickHouse
Functions
Access ClickHouse
's powerful functions directly in DataFusion
SQL:
Direct Function Calls
-- Mathematical functions
SELECT clickhouse(sigmoid(price), 'Float64') as price_sigmoid FROM products;
-- String functions
SELECT clickhouse(`base64Encode`(name), 'Utf8') as b64_name FROM users;
-- Array functions
SELECT clickhouse(`arrayJoin`(tags), 'Utf8') as individual_tags FROM articles;
Lambda Functions
-- Transform arrays with lambda functions
SELECT
names,
clickhouse(`arrayMap`($x, concat($x, '_processed'), names), 'List(Utf8)') as processed_names
FROM user_data;
-- Filter arrays
SELECT clickhouse(`arrayFilter`($x, length($x) > 3, tags), 'List(Utf8)') as long_tags
FROM content;
Complex Analytics
-- Window functions with ClickHouse functions
SELECT
user_id,
clickhouse(exp(revenue), 'Float64') as exp_revenue,
SUM(revenue) OVER (PARTITION BY user_id ORDER BY date) as running_total
FROM sales_data;
-- CTEs with ClickHouse functions
WITH processed_data AS (
SELECT
user_id,
clickhouse(`arrayJoin`(event_types), 'Utf8') as event_type
FROM user_events
)
SELECT event_type, COUNT(*) as event_count
FROM processed_data
GROUP BY event_type;
Architecture
Core Components
ClickHouseBuilder
: Main configuration entry pointClickHouseSessionContext
: EnhancedDataFusion
context withClickHouse
UDF support- Table Providers:
DataFusion
integration layer forClickHouse
tables - Federation: Cross-database query support via datafusion-federation
- UDF System:
ClickHouse
function integration with intelligent pushdown - Function Analyzer: Advanced optimization for UDF placement
Key Features
Connection Pooling
let clickhouse = new
.configure_pool
.configure_client
.build_catalog
.await?;
Schema Coercion
let builder = new
.with_schema_coercion // Enable automatic type coercion
.build_catalog
.await?;
Schema Management
// Create tables from Arrow schema
let schema = new;
clickhouse.with_new_table.create.await?;
Federation Support
When the federation
feature is enabled, clickhouse-datafusion can join ClickHouse
tables with other DataFusion
sources:
Note: The current release uses
datafusion-federation
v0.4.7 from crates.io for publishing compatibility. This version has a known issue withUNNEST
operations due to an upstream DataFusion bug in expression handling. If you needUNNEST
support in federated queries, please track PR #135 for the fix.
-- Join ClickHouse with Parquet files
SELECT
ch.user_id,
ch.total_purchases,
parquet.user_segment
FROM clickhouse.analytics.user_stats ch
JOIN local_parquet.user_segments parquet
ON ch.user_id = parquet.user_id
WHERE ch.total_purchases > 1000;
-- Federated aggregations
SELECT
segment,
AVG(clickhouse(log(total_purchases), 'Float64')) as avg_log_purchases
FROM (
SELECT
ch.user_id,
ch.total_purchases,
csv.segment
FROM clickhouse.sales.users ch
JOIN local_csv.segments csv ON ch.user_id = csv.user_id
)
GROUP BY segment;
Features
- default: Core functionality with
ClickHouse
integration - federation: Enable cross-database queries
- cloud:
ClickHouse
Cloud support - test-utils: Testing utilities for development
Development
Running Tests
# All tests with multiple feature combinations
# Specific test types
# Coverage reports
Environment Variables
RUST_LOG=debug
- Enable debug loggingDISABLE_CLEANUP=true
- Keep test containers runningDISABLE_CLEANUP_ON_ERROR=true
- Keep containers on test failure
Examples
See the examples directory for a complete working example:
- Basic Integration: Simple
ClickHouse
querying - Federation: Cross-database joins
- UDF Usage:
ClickHouse
function examples - Schema Management: Table creation and management
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
License
Licensed under the Apache License, Version 2.0. See LICENSE for details.