# MIDAS Processor
A high-performance Rust tool for converting UK Met Office MIDAS weather datasets from BADC-CSV format to optimized Parquet files for efficient analysis.
## Overview
MIDAS Processor is part of a climate research toolkit designed to process historical UK weather data from the CEDA Archive. It transforms the original BADC-CSV format into modern, optimized Parquet files with significant performance improvements for analytical workloads.
### The MIDAS Ecosystem
This tool works as part of a complete climate data processing pipeline:
- midas-fetcher: Downloads MIDAS datasets from CEDA
- midas-processor (this tool): Converts BADC-CSV to optimized Parquet
- Analysis tools: Python/R analysis of the resulting Parquet files
### What is MIDAS?
MIDAS (Met Office Integrated Data Archive System) contains historical weather observations from 1000+ UK land-based weather stations, spanning from the late 19th century to present day. The datasets include:
- Daily rainfall observations (161k+ files, ~5GB)
- Daily temperature observations (41k+ files, ~2GB)
- Wind observations (12k+ files, ~9GB)
- Solar radiation observations (3k+ files, ~2GB)
## Key Features

### 🚀 Performance Optimizations
- Large row groups (500K rows)
- Smart compression (Snappy, ZSTD, LZ4 options)
- Column statistics for query pruning
- Memory-efficient streaming for large datasets
### 🔍 Intelligent Processing
- Automatic dataset discovery from midas-fetcher cache
- Schema detection and validation
### 💻 User-Friendly Interface
- Interactive dataset selection when run without arguments
- Simple command-line interface with sensible defaults
- Comprehensive progress reporting with file counts and timing
- Verbose mode for debugging and optimization insights
## Installation

### Prerequisites
- Rust 1.70+ (uses Rust 2024 edition features)
- 8GB+ RAM recommended for large datasets
### From Source

```bash
# From a local checkout of the repository:
cargo build --release
```
### From Crates.io

```bash
# Assumes the crate is published under the tool's name
cargo install midas-processor
```
## Usage

### Quick Start

1. Download datasets using `midas-fetcher`
2. Run the processor with no arguments:

```bash
midas-processor
```

This will show the available datasets and let you select one interactively.
### Common Usage Patterns

```bash
# Interactive dataset selection
midas-processor

# Process a specific dataset (path is illustrative)
midas-processor /path/to/cache/uk-daily-rain-obs

# Custom output location
midas-processor /path/to/cache/uk-daily-rain-obs --output-path ./rain.parquet

# High compression for archival
midas-processor /path/to/cache/uk-daily-rain-obs --compression zstd

# Schema analysis only (no conversion)
midas-processor /path/to/cache/uk-daily-rain-obs --discovery-only

# Combine options
midas-processor /path/to/cache/uk-daily-rain-obs --compression zstd --verbose
```
### Command-Line Options

| Option | Description | Default |
|---|---|---|
| `DATASET_PATH` | Path to MIDAS dataset directory (optional) | Auto-discover |
| `--output-path` | Custom output location | `../parquet/{dataset}.parquet` |
| `--compression` | Compression algorithm (`snappy`/`zstd`/`lz4`/`none`) | `snappy` |
| `--discovery-only` | Analyze schema without converting | `false` |
| `--verbose` | Enable detailed logging | `false` |
## Technical Details

### Data Structure Optimizations
- Station-Timestamp Sorting: Data is sorted by `station_id`, then `ob_end_time`, for optimal query performance
- Large Row Groups: 500K rows per group for better compression and fewer metadata operations
- Column Statistics: Enabled for all columns to allow query engines to skip irrelevant data
- Memory Streaming: Processes datasets larger than available RAM through streaming execution
### Quality Control
- Header validation: Ensures BADC-CSV headers are correctly parsed
- Schema consistency: Validates column structures across files
- Error reporting: Detailed error messages with file locations
- Missing data handling: Graceful handling of incomplete or corrupted files
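For readers unfamiliar with the input format: a BADC-CSV file is metadata records followed by a `data` marker, data rows, and an `end data` terminator. A simplified splitter (an illustrative sketch, not this tool's parser; the sample content is made up):

```python
# Split a BADC-CSV file into its header records and data rows.
# Simplified: real files carry richer metadata records before `data`.
def split_badc_csv(text: str):
    lines = text.strip().splitlines()
    data_start = lines.index("data")       # marker ending the header
    data_end = lines.index("end data")     # terminator after the rows
    header = lines[:data_start]
    rows = [line.split(",") for line in lines[data_start + 1 : data_end]]
    return header, rows

sample = """Conventions,G,BADC-CSV,1
title,G,daily rainfall observations
data
1234,2020-01-01,4.2
1234,2020-01-02,0.0
end data"""

header, rows = split_badc_csv(sample)
```

Header validation in practice means checking that these markers exist and that the metadata records parse, which is why a truncated or corrupted file is reported rather than silently producing bad rows.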
## Integration with Analysis Tools
### Python (Polars)

```python
import polars as pl
from datetime import datetime

# Fast station-based query (file path and column names are illustrative)
station_data = (
    pl.scan_parquet("../parquet/uk-daily-rain-obs.parquet")
    .filter(pl.col("station_id") == 1234)
    .collect()
)

# Time range analysis: the lazy scan pushes the predicate down so
# row-group statistics can be used to skip irrelevant data
recent = (
    pl.scan_parquet("../parquet/uk-daily-rain-obs.parquet")
    .filter(pl.col("ob_end_time").is_between(
        datetime(2020, 1, 1), datetime(2020, 12, 31)
    ))
    .select(["station_id", "ob_end_time", "prcp_amt"])
    .collect()
)
```
### Python (Pandas)

```python
import pandas as pd

# Read only the columns you need (path and column names are illustrative)
df = pd.read_parquet(
    "../parquet/uk-daily-rain-obs.parquet",
    columns=["station_id", "ob_end_time", "prcp_amt"],
)

# Station-specific analysis
station = df[df["station_id"] == 1234]
```
### R

```r
library(arrow)
library(dplyr)

# Lazy evaluation with Arrow (path and column names are illustrative)
rain_data <- open_dataset("../parquet/uk-daily-rain-obs.parquet")

# Efficient aggregation: filtering and grouping are pushed down to Arrow
monthly_totals <- rain_data %>%
  filter(station_id == 1234) %>%
  group_by(year = lubridate::year(ob_end_time),
           month = lubridate::month(ob_end_time)) %>%
  summarise(total_rain = sum(prcp_amt, na.rm = TRUE)) %>%
  collect()
```
## Troubleshooting

### Common Issues

- Memory issues: 8GB+ RAM is recommended for the largest datasets; memory-efficient streaming keeps usage bounded, but close other memory-hungry processes if conversion fails
- Performance issues: Conversion is I/O-heavy, so check whether storage is the bottleneck; fast local storage outperforms network drives
- Cache directory not found: Run `midas-fetcher` first so that its cache directory exists for auto-discovery
### Error Messages

- "No MIDAS datasets found in cache": Run `midas-fetcher` first to download datasets
- "Failed to parse header": The BADC-CSV file may be corrupted; check the source data
- "Configuration file not found": The dataset type was not recognized; check the file structure
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Changelog
See CHANGELOG.md for detailed version history and release notes.
## Contributing
We welcome contributions! Please see our contributing guidelines for details.
### Development Setup

```bash
# Standard Rust workflow from a local checkout
cargo build
cargo test
```
### Code Style

- Use `cargo fmt` for formatting
- Ensure `cargo clippy` passes without warnings
- Add tests for new functionality
- Update documentation for API changes
## Citation
If you use this tool in your research, please cite:
## Support
- Documentation: See docs/ directory
- Issues: Report bugs via GitHub Issues
- Discussions: Ask questions in GitHub Discussions
## Acknowledgments
- UK Met Office: For providing the MIDAS datasets
- CEDA: For hosting and maintaining the climate data archive
- BADC: For developing the CSV format standards
- Polars Project: For the high-performance DataFrame library enabling fast processing