# OxiGDAL ETL
Streaming ETL (Extract, Transform, Load) framework for continuous geospatial data processing.
## Features
- Async Streaming: Built on tokio for high-performance async I/O
- Backpressure Handling: Automatic backpressure management to prevent memory overflow
- Error Recovery: Configurable error handling and retry logic
- Checkpointing: State persistence for fault tolerance
- Monitoring: Built-in metrics and logging with tracing
- Resource Limits: Control parallelism and memory usage
- Flexible Pipeline: Fluent API for composing ETL workflows
## Architecture
The ETL framework consists of several key components:
### Sources
Data inputs from various sources:
- File: Local file system (GeoTIFF, GeoJSON, etc.)
- HTTP/S3: Streaming from HTTP endpoints and S3
- STAC: STAC catalog integration (optional feature)
- Kafka: Real-time streaming (optional feature)
- PostGIS: PostgreSQL/PostGIS database (optional feature)
- Custom: User-defined sources
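A user-defined source can be pictured as a minimal pull-based trait. This is an illustrative sketch only: the trait name, method, and shape are assumptions, and the crate's real source trait is presumably async and richer.

```rust
// Hypothetical sketch of a pull-based source trait; the crate's
// actual trait is likely async and more featureful.
trait Source {
    type Item;
    // Return the next element, or None once the source is exhausted.
    fn next_item(&mut self) -> Option<Self::Item>;
}

// A user-defined source that replays an in-memory vector,
// handy for tests and examples.
struct VecSource<T> {
    items: std::vec::IntoIter<T>,
}

impl<T> VecSource<T> {
    fn new(items: Vec<T>) -> Self {
        Self { items: items.into_iter() }
    }
}

impl<T> Source for VecSource<T> {
    type Item = T;
    fn next_item(&mut self) -> Option<T> {
        self.items.next()
    }
}

fn main() {
    let mut src = VecSource::new(vec![1, 2, 3]);
    let mut collected = Vec::new();
    while let Some(x) = src.next_item() {
        collected.push(x);
    }
    assert_eq!(collected, vec![1, 2, 3]);
}
```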
### Transforms
Data transformations:
- Map: Element-wise transformations
- Filter: Conditional filtering
- FlatMap: One-to-many transformations
- Reduce: Aggregations
- GroupBy: Grouping operations
- Window: Sliding/tumbling windows
- Join: Spatial/attribute joins
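These stages compose much like Rust's iterator adapters. As a rough analogy (plain iterators standing in for the framework's async stream combinators):

```rust
fn main() {
    let readings = vec![1, -2, 3, 4];

    // Map scales each element, Filter drops the non-positive ones,
    // and FlatMap expands each survivor into a pair.
    let out: Vec<i32> = readings
        .into_iter()
        .map(|x| x * 10)           // Map: element-wise transform
        .filter(|x| *x > 0)        // Filter: conditional drop
        .flat_map(|x| [x, x + 1])  // FlatMap: one-to-many
        .collect();

    assert_eq!(out, vec![10, 11, 30, 31, 40, 41]);
}
```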
### Sinks
Data outputs to various destinations:
- File: Local file system
- S3/Azure/GCS: Cloud storage (optional features)
- PostGIS: PostgreSQL/PostGIS database (optional feature)
- Kafka: Message streaming (optional feature)
- Custom: User-defined sinks
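Symmetric to sources, a user-defined sink can be sketched as a push-based trait. Again an assumption for illustration, not the crate's actual API, which is presumably async and fallible:

```rust
// Hypothetical sketch of a push-based sink trait.
trait Sink {
    type Item;
    fn write(&mut self, item: Self::Item);
}

// A user-defined sink that collects items in memory,
// useful for tests and debugging.
struct CollectSink<T> {
    items: Vec<T>,
}

impl<T> CollectSink<T> {
    fn new() -> Self {
        Self { items: Vec::new() }
    }
}

impl<T> Sink for CollectSink<T> {
    type Item = T;
    fn write(&mut self, item: T) {
        self.items.push(item);
    }
}

fn main() {
    let mut sink = CollectSink::new();
    for feature_id in ["a", "b", "c"] {
        sink.write(feature_id);
    }
    assert_eq!(sink.items, vec!["a", "b", "c"]);
}
```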
### Pipeline
Fluent API for composing ETL workflows with:
- Source-transform-sink composition
- Backpressure management
- Error recovery and retries
- Checkpointing for fault tolerance
- Parallel processing
### Scheduler
Task scheduling and execution:
- Cron-based scheduling (optional feature)
- Event-triggered execution
- Retry on failure
- Resource limits
- Monitoring
## Usage

The snippets below are illustrative sketches: the exact item paths, type names, and method signatures depend on the crate's public API.

### Basic Pipeline

```rust
use oxigdal_etl::prelude::*; // hypothetical prelude path
use std::path::PathBuf;

#[tokio::main]
async fn main() -> Result<(), EtlError> {
    // Read a local file, transform each element, and write the result.
    let pipeline = Pipeline::builder()
        .source(FileSource::new(PathBuf::from("input.tif")))
        .transform(Map::new(process_item))
        .sink(FileSink::new(PathBuf::from("output.tif")))
        .build()?;
    pipeline.run().await
}
```

### Streaming Pipeline with STAC

Requires the `stac` feature.

```rust
use oxigdal_etl::prelude::*; // hypothetical prelude path

#[tokio::main]
async fn main() -> Result<(), EtlError> {
    // Stream items from a STAC catalog and filter them on the fly.
    let pipeline = Pipeline::builder()
        .source(StacSource::new("https://example.com/stac"))
        .transform(Filter::new(is_wanted_item))
        .sink(sink)
        .build()?;
    pipeline.run().await
}
```

### Window Operations

```rust
use oxigdal_etl::prelude::*; // hypothetical prelude path
use std::time::Duration;

// Group elements into fixed, non-overlapping 60-second windows.
let window = Window::tumbling_time(Duration::from_secs(60));

// Process stream with windowing
let pipeline = Pipeline::builder()
    .source(source)
    .transform(window)
    .sink(sink)
    .build()?;
```

### Scheduled Tasks

Requires the `scheduler` feature.

```rust
use oxigdal_etl::prelude::*; // hypothetical prelude path
use std::time::Duration;

let scheduler = Scheduler::new();

// Add scheduled task; `run_pipeline` is a user-defined task.
let config = TaskConfig::new(schedule).max_retries(3);
scheduler.add_task(config, run_pipeline).await?;

// Start scheduler
scheduler.start().await?;
```
## Feature Flags
- `std` (default): Enable standard library support
- `kafka`: Enable Kafka source and sink
- `postgres`: Enable PostgreSQL/PostGIS support
- `s3`: Enable Amazon S3 support
- `stac`: Enable STAC catalog support
- `http`: Enable HTTP source support
- `scheduler`: Enable cron-based scheduling
- `all`: Enable all optional features
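Assuming the crate follows standard Cargo feature conventions, enabling a subset of the optional integrations might look like this (the package name and version are placeholders; feature names are taken from the list above):

```toml
[dependencies]
# Pull in only the optional integrations you need.
oxigdal-etl = { version = "0.1", features = ["s3", "stac", "scheduler"] }
```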
## Performance
The ETL framework is designed for high performance:
- Async I/O with tokio for maximum throughput
- Automatic backpressure to prevent memory overflow
- Parallel processing with configurable parallelism
- Zero-copy operations where possible
- Efficient buffering and batching
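The backpressure behaviour can be illustrated with a bounded channel, the standard mechanism for it in Rust. Here std's `sync_channel` stands in for the framework's internal async channels:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Push `items` through a bounded channel of the given capacity and
// sum them on the consumer side. With a small capacity the producer
// blocks on `send` whenever the buffer is full, so a slow consumer
// automatically throttles a fast producer instead of letting the
// queue grow without bound -- the essence of backpressure.
fn sum_through_bounded_channel(items: Vec<u64>, capacity: usize) -> u64 {
    let (tx, rx) = sync_channel::<u64>(capacity);
    let producer = thread::spawn(move || {
        for i in items {
            tx.send(i).expect("receiver hung up"); // blocks when full
        }
    });
    let total: u64 = rx.iter().sum(); // consume at the receiver's pace
    producer.join().expect("producer panicked");
    total
}

fn main() {
    let total = sum_through_bounded_channel((0..10).collect(), 2);
    assert_eq!(total, 45); // 0 + 1 + ... + 9
}
```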
## Error Handling
Comprehensive error handling:
- Type-safe error types with thiserror
- Configurable error recovery
- Retry logic with exponential backoff
- Detailed error messages
- No `unwrap()` or `panic!()` in production code
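The retry-with-exponential-backoff pattern described above can be sketched with std only. The function name and delay values are illustrative, not taken from the crate:

```rust
use std::time::Duration;

// Retry a fallible operation up to `max_retries` times, doubling the
// delay after each failure (exponential backoff). Returns the first
// success, or the last error once the attempts are exhausted.
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    max_retries: u32,
    base_delay: Duration,
) -> Result<T, E> {
    let mut delay = base_delay;
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                std::thread::sleep(delay);
                delay *= 2; // exponential backoff
                attempt += 1;
            }
        }
    }
}

fn main() {
    // An operation that fails twice, then succeeds.
    let mut calls = 0;
    let result = retry_with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err("transient failure") } else { Ok(calls) }
        },
        5,
        Duration::from_millis(1),
    );
    assert_eq!(result, Ok(3));
}
```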
## License
Apache-2.0
## Authors
COOLJAPAN OU (Team Kitasan)