# hdbconnect-arrow
Apache Arrow integration for the hdbconnect SAP HANA driver. Converts HANA result sets to Arrow RecordBatch format, enabling zero-copy interoperability with the entire Arrow ecosystem.
## Why Arrow?
Apache Arrow is the universal columnar data format for analytics. By converting SAP HANA data to Arrow, you unlock seamless integration with:
| Category | Tools |
|---|---|
| DataFrames | Polars, pandas, Vaex, Dask |
| Query engines | DataFusion, DuckDB, ClickHouse, Ballista |
| ML/AI | Ray, Hugging Face Datasets, PyTorch, TensorFlow |
| Data lakes | Delta Lake, Apache Iceberg, Lance |
| Visualization | Perspective, Graphistry, Falcon |
| Languages | Rust, Python, R, Julia, Go, Java, C++ |
> [!TIP]
> Arrow's columnar format enables vectorized processing — operations run 10-100x faster than row-by-row iteration.
## Installation

Add the crate to your `Cargo.toml`:

```toml
[dependencies]
hdbconnect-arrow = "0.3"
```

Or with cargo-add:

```sh
cargo add hdbconnect-arrow
```

> [!IMPORTANT]
> Requires Rust 1.88 or later.
## Usage

### Basic batch processing
```rust
use hdbconnect::Connection;
use hdbconnect_arrow::{BatchConfig, HanaBatchProcessor};
use std::sync::Arc;
```
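A fuller sketch of the flow, assuming `HanaBatchProcessor` is built from the result set metadata plus a `BatchConfig` and fed row by row with an `append_row`-style method; the connection URL, the query, and every name except `flush()` (which also appears in the Polars example below) should be treated as placeholders:

```rust
use hdbconnect::Connection;
use hdbconnect_arrow::{BatchConfig, HanaBatchProcessor};
use std::num::NonZeroUsize;

fn fetch_sales_batch() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder URL and query.
    let mut connection = Connection::new("hdbsql://USER:PASSWORD@hana-host:39017")?;
    let result_set = connection.query("SELECT ID, NAME, AMOUNT FROM SALES")?;

    // Constructor and append_row are assumed names; flush() returns
    // Result<Option<RecordBatch>> as in the Polars example below.
    let config = BatchConfig::new(NonZeroUsize::new(10_000).unwrap());
    let mut processor = HanaBatchProcessor::new(result_set.metadata(), config)?;
    for row in result_set {
        processor.append_row(&row?)?;
    }
    let batch = processor.flush()?.expect("at least one batch");
    println!("rows: {}, columns: {}", batch.num_rows(), batch.num_columns());
    Ok(())
}
```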
### Schema mapping
```rust
use hdbconnect_arrow::{hana_field_to_arrow, hana_type_to_arrow};
use hdbconnect::TypeId;

// Convert individual types (argument shape shown schematically)
let arrow_type = hana_type_to_arrow(TypeId::DECIMAL, 18, 2);
// Returns: DataType::Decimal128(18, 2)

// Convert entire field metadata (`field_metadata` is a HANA result set field)
let arrow_field = hana_field_to_arrow(&field_metadata);
```
### Custom batch size
```rust
use hdbconnect_arrow::BatchConfig;
use std::num::NonZeroUsize;

// 10_000 rows per batch is an illustrative value; tune it for your workload.
let config = BatchConfig::new(NonZeroUsize::new(10_000).unwrap());
```
## Performance
The crate is optimized for high-throughput data transfer with several performance enhancements in v0.3.2:
### Optimization Techniques
- Enum-based dispatch — Eliminates vtable overhead by replacing `Box<dyn HanaCompatibleBuilder>` with `BuilderEnum`, resulting in ~10-20% performance improvement through better cache locality and monomorphized dispatch
- Homogeneous loop hoisting — Detects schemas where all columns share the same type and hoists the dispatch match outside the row loop for +4-8% throughput on wide tables (100+ columns)
- Zero-copy decimal conversion — Uses `Cow::Borrowed` to avoid cloning `BigInt` during decimal conversion, improving decimal throughput by +222% (55 → 177 Melem/s) and saving 8 MB per 1M decimals
- String capacity pre-sizing — Extracts `max_length` from HANA field metadata to pre-allocate optimal buffer sizes, reducing reallocation overhead by 2-3x per string column
- Batch processing — Configurable batch sizes to balance memory usage and throughput
- Builder reuse — Builders reset between batches, eliminating repeated allocations
> [!TIP]
> For large result sets, use `LendingBatchIterator` to stream data with constant memory usage.
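A sketch of that streaming pattern; the constructor, the shape of `next()`, and the batch size are assumptions about the crate's GAT-based iterator, so check the crate docs for the real entry point:

```rust
use hdbconnect::{HdbResult, ResultSet};
use hdbconnect_arrow::{BatchConfig, LendingBatchIterator};
use std::num::NonZeroUsize;

// Hypothetical entry point: wrap a HANA result set in the lending iterator.
// Builders are reused between batches, so memory stays flat regardless of
// how many rows the result set contains.
fn stream(result_set: ResultSet) -> HdbResult<()> {
    let config = BatchConfig::new(NonZeroUsize::new(65_536).unwrap());
    let mut batches = LendingBatchIterator::new(result_set, config);

    // A lending iterator cannot be driven by a plain `for` loop: each borrowed
    // batch must go out of scope before `next()` can be called again.
    while let Some(batch) = batches.next() {
        println!("streamed {} rows", batch.num_rows());
    }
    Ok(())
}
```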
### Profiling Support

Enable the optional `profiling` feature flag to integrate the dhat heap profiler:

```toml
[dependencies]
hdbconnect-arrow = { version = "0.3", features = ["profiling"] }
```

This enables allocation tracking with zero impact on release builds, since the instrumentation is compiled out when the feature is disabled. See `src/profiling.rs` for usage examples.
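A minimal sketch of what that wiring typically looks like with plain dhat, assuming the `profiling` feature simply gates the allocator and profiler behind `cfg(feature = "profiling")`; the actual setup lives in `src/profiling.rs`:

```rust
// Installed only when the crate is built with `--features profiling` (assumed gate).
#[cfg(feature = "profiling")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    // Start heap profiling; stats are written to dhat-heap.json when this drops.
    #[cfg(feature = "profiling")]
    let _profiler = dhat::Profiler::new_heap();

    // ... run the HANA-to-Arrow workload you want to profile ...
}
```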
## Ecosystem integration
Query HANA data with SQL using Apache DataFusion:
```rust
use datafusion::prelude::*;
use hdbconnect_arrow::collect_batches_from_hana;

// Table name, query, and the helper's argument list are shown schematically.
let batches = collect_batches_from_hana(&mut connection, "SELECT * FROM SALES")?;
let ctx = SessionContext::new();
ctx.register_batch("sales", batches[0].clone())?;
let df = ctx.sql("SELECT REGION, SUM(AMOUNT) FROM sales GROUP BY REGION").await?;
df.show().await?;
```
Load Arrow data directly into DuckDB:
```rust
use duckdb::Connection;

let conn = Connection::open_in_memory()?;
// Table name and argument shape are illustrative; `register_arrow` exposes
// the batch to SQL under the given name.
conn.register_arrow("hana_data", &batch)?;
let mut stmt = conn.prepare("SELECT * FROM hana_data WHERE AMOUNT > 100")?;
let result = stmt.query_arrow([])?;
```
Convert to Polars DataFrame:
```rust
use polars::prelude::*;

let batch = processor.flush()?.unwrap();
// Column names are illustrative; depending on your polars/arrow versions the
// RecordBatch conversion may need an interop step.
let df = DataFrame::try_from(batch)?;
let result = df
    .lazy()
    .filter(col("AMOUNT").gt(lit(100)))
    .group_by([col("REGION")])
    .agg([col("AMOUNT").sum()])
    .collect()?;
```
Serialize Arrow data for storage or network transfer:
```rust
use arrow::ipc::writer::FileWriter;
use parquet::arrow::ArrowWriter;
use std::fs::File;

// Arrow IPC (Feather) format; file names are placeholders
let file = File::create("hana_data.arrow")?;
let mut writer = FileWriter::try_new(file, &batch.schema())?;
writer.write(&batch)?;
writer.finish()?;

// Parquet format
let file = File::create("hana_data.parquet")?;
let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
writer.write(&batch)?;
writer.close()?;
```
Export Arrow data to Python without copying (requires pyo3):
```rust
use arrow::pyarrow::PyArrowType;
use pyo3::prelude::*;

// Python: df = pl.from_arrow(get_hana_data())
```
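A sketch of the exporting side, assuming a `#[pyfunction]` named `get_hana_data` as in the Python call above; the tiny stand-in batch keeps the example self-contained, whereas the real function would return batches produced by `HanaBatchProcessor`:

```rust
use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::pyarrow::PyArrowType;
use arrow::record_batch::RecordBatch;
use pyo3::prelude::*;
use std::sync::Arc;

// PyArrowType hands the batch across the FFI boundary via the Arrow C Data
// Interface, so Python sees the same buffers without a copy.
#[pyfunction]
fn get_hana_data() -> PyResult<PyArrowType<RecordBatch>> {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(schema, vec![Arc::new(Int64Array::from(vec![1, 2, 3]))])
        .map_err(|e| pyo3::exceptions::PyValueError::new_err(e.to_string()))?;
    Ok(PyArrowType(batch))
}
```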
## Features

Enable optional features in `Cargo.toml`:

```toml
[dependencies]
hdbconnect-arrow = { version = "0.3", features = ["async", "test-utils", "profiling"] }
```
| Feature | Description | Default |
|---|---|---|
| `async` | Async support via `hdbconnect_async` | No |
| `test-utils` | Expose `MockRow`/`MockRowBuilder` for testing | No |
| `profiling` | dhat heap profiler integration for performance analysis | No |

> [!TIP]
> Enable `test-utils` in dev-dependencies for unit testing without a HANA connection.
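A minimal dev-dependencies entry for that setup (the version shown matches the one used above):

```toml
[dev-dependencies]
hdbconnect-arrow = { version = "0.3", features = ["test-utils"] }
```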
## Type mapping
| HANA Type | Arrow Type | Notes |
|---|---|---|
| TINYINT | UInt8 | Unsigned in HANA |
| SMALLINT | Int16 | |
| INT | Int32 | |
| BIGINT | Int64 | |
| REAL | Float32 | |
| DOUBLE | Float64 | |
| DECIMAL(p,s) | Decimal128(p,s) | Full precision preserved |
| CHAR, VARCHAR | Utf8 | |
| NCHAR, NVARCHAR | Utf8 | Unicode strings |
| CLOB, NCLOB | LargeUtf8 | Large text |
| BLOB | LargeBinary | Large binary |
| DATE | Date32 | Days since epoch |
| TIME | Time64(Nanosecond) | |
| TIMESTAMP | Timestamp(Nanosecond) | |
| BOOLEAN | Boolean | |
| GEOMETRY, POINT | Binary | WKB format |
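As a concrete example of the DECIMAL mapping, a DECIMAL(18,2) column arrives as an Arrow `Decimal128Array` whose values are unscaled integers; this small sketch uses plain arrow-rs APIs, and the column index and scale are assumptions about the query at hand:

```rust
use arrow::array::{Array, Decimal128Array};
use arrow::record_batch::RecordBatch;

fn print_amounts(batch: &RecordBatch) {
    // Column 0 is assumed to be a HANA DECIMAL(18, 2) mapped to Decimal128(18, 2).
    let amounts = batch
        .column(0)
        .as_any()
        .downcast_ref::<Decimal128Array>()
        .expect("expected a Decimal128 column");
    for i in 0..amounts.len() {
        if !amounts.is_null(i) {
            // value(i) is the unscaled i128; divide by 10^scale to interpret it.
            println!("{}", amounts.value(i) as f64 / 100.0);
        }
    }
}
```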
## API overview

- `HanaBatchProcessor` — Converts HANA rows to Arrow `RecordBatch` with configurable batch sizes
- `BatchConfig` — Configuration for batch processing (uses `NonZeroUsize` for type-safe batch size)
- `SchemaMapper` — Maps HANA result set metadata to Arrow schemas
- `BuilderFactory` — Creates appropriate Arrow array builders for HANA types
- `TypeCategory` — Centralized HANA type classification enum
- `BuilderEnum` — Enum-wrapped builder for static dispatch (eliminates vtable overhead)
- `BuilderKind` — Discriminant identifying builder type for schema profiling
- `SchemaProfile` — Classifies schemas as homogeneous or mixed for optimized processing paths
- `HanaCompatibleBuilder` — Trait for Arrow builders that accept HANA values
- `FromHanaValue` — Sealed trait for type-safe value conversion
- `BatchProcessor` — Core batch processing interface
- `LendingBatchIterator` — GAT-based streaming iterator for large result sets
- `RowLike` — Row abstraction for testing without a HANA connection
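The dispatch-related items follow a common Rust pattern: instead of boxing a trait object per column, each concrete builder becomes an enum variant, so per-row dispatch is a `match` rather than a vtable call. A minimal illustration of that pattern, not the crate's actual definitions:

```rust
use arrow::array::{ArrayRef, Int64Builder, StringBuilder};
use std::sync::Arc;

// Illustrative stand-in for BuilderEnum: one variant per concrete Arrow builder.
enum ColumnBuilder {
    Int64(Int64Builder),
    Utf8(StringBuilder),
}

impl ColumnBuilder {
    // The hot row loop dispatches with a monomorphized match instead of
    // calling through a Box<dyn Trait>, keeping builders inline in memory.
    fn append_null(&mut self) {
        match self {
            ColumnBuilder::Int64(b) => b.append_null(),
            ColumnBuilder::Utf8(b) => b.append_null(),
        }
    }

    fn finish(&mut self) -> ArrayRef {
        match self {
            ColumnBuilder::Int64(b) => Arc::new(b.finish()),
            ColumnBuilder::Utf8(b) => Arc::new(b.finish()),
        }
    }
}
```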
When the `test-utils` feature is enabled:

```rust
use hdbconnect_arrow::{MockRow, MockRowBuilder};

// Builder entry point and column values shown schematically.
let row = MockRowBuilder::new()
    .push_i64(42)
    .push_string("hello")
    .push_null()
    .build();
```
## Part of pyhdb-rs
This crate is part of the pyhdb-rs workspace, providing the Arrow integration layer for the Python SAP HANA driver.
Related crates:
- `hdbconnect-py` — PyO3 bindings exposing Arrow data to Python
## Resources

- [Apache Arrow](https://arrow.apache.org/) — Official Arrow project
- [Arrow Rust](https://github.com/apache/arrow-rs) — Rust implementation
- [DataFusion](https://datafusion.apache.org/) — Query engine built on Arrow
- [Powered by Arrow](https://arrow.apache.org/powered_by/) — Projects using Arrow
## MSRV policy

> [!NOTE]
> Minimum Supported Rust Version: 1.88. MSRV increases are minor version bumps.
## License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.