XML2ARROW
A Rust crate for efficiently converting XML data to Apache Arrow format.
A Python version of this library is also available on GitHub: https://github.com/mluttikh/xml2arrow-python
Overview
xml2arrow provides a high-performance solution for transforming XML documents into Apache Arrow tables. It leverages the quick-xml parser for efficient XML processing and the arrow crate for building Arrow data structures. This makes it ideal for handling large XML datasets and integrating them into data processing pipelines that utilize the Arrow ecosystem.
Features
- 🚀 High-performance XML parsing using quick-xml
- 📊 Flexible Mapping: Map complex XML structures to Apache Arrow with YAML
- 🔄 Nested Structure Support: Handle deeply nested XML hierarchies
- 🎯 Customizable Type Conversion: Automatically convert data types and apply unit conversions
- 💡 Attribute & Element Extraction: Seamlessly extract XML attributes or elements
Usage
xml2arrow converts XML data to Apache Arrow format using a YAML configuration file.
1. Configuration File (YAML):
The YAML configuration defines the mapping between your XML structure and Arrow tables and fields.
tables:
  - name: <table_name>         # The name of the resulting Arrow table
    xml_path: <xml_path>       # The XML path to the *parent* element of the table's row elements
    levels:                    # Index levels for nested XML structures
      - <level1>
      - <level2>
    fields:
      - name: <field_name>     # The name of the Arrow field
        xml_path: <field_path> # The XML path to the field within a row
        data_type: <data_type> # The Arrow data type (see below)
        nullable: <true|false> # Whether the field can be null
        scale: <number>        # Optional scaling factor for floats
        offset: <number>       # Optional offset for floats
  - name: ...                  # Define additional tables as needed
- tables: A list of table configurations. Each entry defines a separate Arrow table.
  - name: The name of the resulting Arrow RecordBatch (table).
  - xml_path: An XPath-like string specifying the parent element of the row elements. For example, for <library><book>...</book><book>...</book></library>, the xml_path would be /library.
  - levels: An array of strings representing parent tables for creating indexes in nested structures. For /library/shelves/shelf/books/book, use levels: ["shelves", "books"]. This creates indexes named <shelves> and <books>.
  - fields: A list of field configurations (columns) for the Arrow table.
    - name: The name of the field in the Arrow schema.
    - xml_path: An XPath-like string selecting the field's value. Use @ to select attributes (e.g., /library/book/@id).
    - data_type: The Arrow data type. Supported types:
      - Boolean (false, true, 0 or 1)
      - Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64
      - Float32, Float64
      - Utf8 (Strings)
    - nullable (Optional): Whether the field can be null (defaults to false).
    - scale (Optional): A scaling factor for float fields.
    - offset (Optional): An offset value for float fields.
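The type conversion rules above can be illustrated with a small standalone sketch. It is not the crate's implementation: parse_bool and convert_float are hypothetical helpers, the Boolean check accepts only the four spellings listed, and the order value * scale + offset is an assumption, since the field descriptions do not say which is applied first.

```rust
/// Hypothetical helper: interpret the Boolean spellings listed above
/// ("true", "false", "1", "0"). Anything else is treated as invalid.
fn parse_bool(raw: &str) -> Option<bool> {
    match raw.trim() {
        "true" | "1" => Some(true),
        "false" | "0" => Some(false),
        _ => None,
    }
}

/// Hypothetical helper: parse a float and apply an optional scale and offset.
/// Assumption: the result is `value * scale + offset`, with a missing scale
/// treated as 1.0 and a missing offset treated as 0.0.
fn convert_float(raw: &str, scale: Option<f64>, offset: Option<f64>) -> Option<f64> {
    let value: f64 = raw.trim().parse().ok()?;
    Some(value * scale.unwrap_or(1.0) + offset.unwrap_or(0.0))
}
```

With these helpers, convert_float("950.5", Some(100.0), None) turns a pressure in hPa into Pa, and convert_float("25.0", None, Some(273.15)) shifts a Celsius reading to Kelvin. Text that does not parse yields None rather than a default value.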
2. Parsing the XML
use std::fs::File;
use std::io::BufReader;
use ;
Example
This example demonstrates how to convert meteorological station data from XML to Arrow format.
1. XML Data (stations.xml)
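<report>
  <header>
    <title>Meteorological Station Data</title>
    <created_by>National Weather Service</created_by>
    <creation_time>2024-12-30T13:59:15Z</creation_time>
  </header>
  <monitoring_stations>
    <monitoring_station id="MS001">
      <location>
        <latitude>-61.39110459389277</latitude>
        <longitude>48.08662749089257</longitude>
        <elevation>547.1050788360882</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature>35.486545480326114</temperature>
          <pressure>950.439973486407</pressure>
          <humidity>49.77716576844861</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature>29.095166644493865</temperature>
          <pressure>1049.3215015450517</pressure>
          <humidity>32.5687148391251</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Arctic Tundra area, used for Scientific Research.</description>
        <install_date>2024-03-31</install_date>
      </metadata>
    </monitoring_station>
    <monitoring_station id="MS002">
      <location>
        <latitude>11.891496388319311</latitude>
        <longitude>135.09336983543022</longitude>
        <elevation>174.53349357280004</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature>24.791842953632283</temperature>
          <pressure>989.4054287187706</pressure>
          <humidity>57.70794884397625</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature>15.153690541845911</temperature>
          <pressure>1001.413052919951</pressure>
          <humidity>45.45094598045342</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:49:15Z</timestamp>
          <temperature>-4.022555715139081</temperature>
          <pressure>1000.5225751769922</pressure>
          <humidity>70.40117458947834</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:54:15Z</timestamp>
          <temperature>25.852920542644185</temperature>
          <pressure>953.762785698162</pressure>
          <humidity>42.62088244545566</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Desert area, used for Weather Forecasting.</description>
        <install_date>2024-01-17</install_date>
      </metadata>
    </monitoring_station>
  </monitoring_stations>
</report>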
2. Configuration File (stations.yaml)
tables:
  - name: report
    xml_path: /
    levels:
    fields:
      - name: title
        xml_path: /report/header/title
        data_type: Utf8
        nullable: false
      - name: created_by
        xml_path: /report/header/created_by
        data_type: Utf8
        nullable: false
      - name: creation_time
        xml_path: /report/header/creation_time
        data_type: Utf8
        nullable: false
  - name: stations
    xml_path: /report/monitoring_stations
    levels:
      - station
    fields:
      - name: id
        xml_path: /report/monitoring_stations/monitoring_station/@id # Path to an attribute
        data_type: Utf8
        nullable: false
      - name: latitude
        xml_path: /report/monitoring_stations/monitoring_station/location/latitude
        data_type: Float32
        nullable: false
      - name: longitude
        xml_path: /report/monitoring_stations/monitoring_station/location/longitude
        data_type: Float32
        nullable: false
      - name: elevation
        xml_path: /report/monitoring_stations/monitoring_station/location/elevation
        data_type: Float32
        nullable: false
      - name: description
        xml_path: /report/monitoring_stations/monitoring_station/metadata/description
        data_type: Utf8
        nullable: false
      - name: install_date
        xml_path: /report/monitoring_stations/monitoring_station/metadata/install_date
        data_type: Utf8
        nullable: false
  - name: measurements
    xml_path: /report/monitoring_stations/monitoring_station/measurements
    levels:
      - station # Link to the 'stations' table by element order
      - measurement
    fields:
      - name: timestamp
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/timestamp
        data_type: Utf8
        nullable: false
      - name: temperature
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/temperature
        data_type: Float64
        nullable: false
        offset: 273.15 # Convert from Celsius to Kelvin
      - name: pressure
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/pressure
        data_type: Float64
        nullable: false
        scale: 100.0 # Convert from hPa to Pa
      - name: humidity
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/humidity
        data_type: Float64
        nullable: false
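The xml_path values in this configuration are slash-separated element names, optionally ending in an @-prefixed attribute name. As a rough illustration of that convention only, the following sketch splits such a path into its parts. split_xml_path is a hypothetical helper and says nothing about how xml2arrow itself parses or matches paths.

```rust
/// Hypothetical helper: split an xml_path into its element segments and an
/// optional trailing attribute name (the part after '@').
/// Empty segments, such as the one before a leading '/', are ignored.
fn split_xml_path(path: &str) -> (Vec<&str>, Option<&str>) {
    let mut elements: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    // Only the last segment may name an attribute.
    let attribute = elements
        .last()
        .copied()
        .and_then(|last| last.strip_prefix('@'));
    if attribute.is_some() {
        elements.pop();
    }
    (elements, attribute)
}
```

For the id field above, this separates the three element names (report, monitoring_stations, monitoring_station) from the attribute name id. A path without an @ segment, such as /report/header/title, yields only element names.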
3. Parsing the XML
use std::fs::File;
use std::io::BufReader;
use ;
4. Expected Record Batches (Conceptual)
- report:
┌─────────────────────────────┬──────────────────────────┬──────────────────────┐
│ title                       ┆ created_by               ┆ creation_time        │
│ ---                         ┆ ---                      ┆ ---                  │
│ str                         ┆ str                      ┆ str                  │
╞═════════════════════════════╪══════════════════════════╪══════════════════════╡
│ Meteorological Station Data ┆ National Weather Service ┆ 2024-12-30T13:59:15Z │
└─────────────────────────────┴──────────────────────────┴──────────────────────┘
- stations:
┌───────────┬───────┬────────────┬────────────┬────────────┬────────────────────────┬──────────────┐
│ <station> ┆ id    ┆ latitude   ┆ longitude  ┆ elevation  ┆ description            ┆ install_date │
│ ---       ┆ ---   ┆ ---        ┆ ---        ┆ ---        ┆ ---                    ┆ ---          │
│ u32       ┆ str   ┆ f32        ┆ f32        ┆ f32        ┆ str                    ┆ str          │
╞═══════════╪═══════╪════════════╪════════════╪════════════╪════════════════════════╪══════════════╡
│ 0         ┆ MS001 ┆ -61.391106 ┆ 48.086628  ┆ 547.105103 ┆ Located in the Arctic  ┆ 2024-03-31   │
│           ┆       ┆            ┆            ┆            ┆ Tundra a…              ┆              │
│ 1         ┆ MS002 ┆ 11.891497  ┆ 135.093369 ┆ 174.533493 ┆ Located in the Desert  ┆ 2024-01-17   │
│           ┆       ┆            ┆            ┆            ┆ area, us…              ┆              │
└───────────┴───────┴────────────┴────────────┴────────────┴────────────────────────┴──────────────┘
- measurements:
┌───────────┬───────────────┬──────────────────────┬─────────────┬───────────────┬───────────┐
│ <station> ┆ <measurement> ┆ timestamp            ┆ temperature ┆ pressure      ┆ humidity  │
│ ---       ┆ ---           ┆ ---                  ┆ ---         ┆ ---           ┆ ---       │
│ u32       ┆ u32           ┆ str                  ┆ f64         ┆ f64           ┆ f64       │
╞═══════════╪═══════════════╪══════════════════════╪═════════════╪═══════════════╪═══════════╡
│ 0         ┆ 0             ┆ 2024-12-30T12:39:15Z ┆ 308.636545  ┆ 95043.997349  ┆ 49.777166 │
│ 0         ┆ 1             ┆ 2024-12-30T12:44:15Z ┆ 302.245167  ┆ 104932.150155 ┆ 32.568715 │
│ 1         ┆ 0             ┆ 2024-12-30T12:39:15Z ┆ 297.941843  ┆ 98940.542872  ┆ 57.707949 │
│ 1         ┆ 1             ┆ 2024-12-30T12:44:15Z ┆ 288.303691  ┆ 100141.305292 ┆ 45.450946 │
│ 1         ┆ 2             ┆ 2024-12-30T12:49:15Z ┆ 269.127444  ┆ 100052.257518 ┆ 70.401175 │
│ 1         ┆ 3             ┆ 2024-12-30T12:54:15Z ┆ 299.002921  ┆ 95376.27857   ┆ 42.620882 │
└───────────┴───────────────┴──────────────────────┴─────────────┴───────────────┴───────────┘
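In the measurements table, the <station> and <measurement> columns read as zero-based positions in document order, with the inner index restarting at 0 for each station. The sketch below reproduces that numbering from a list of child counts. It is an interpretation of the table above, not the crate's code, and nested_indexes is a hypothetical helper.

```rust
/// Hypothetical helper: given the number of child rows under each parent
/// (for example, measurements per station), produce (parent_index, child_index)
/// pairs in document order. The child index restarts at 0 for every parent.
fn nested_indexes(children_per_parent: &[u32]) -> Vec<(u32, u32)> {
    let mut rows = Vec::new();
    for (parent, &count) in children_per_parent.iter().enumerate() {
        for child in 0..count {
            rows.push((parent as u32, child));
        }
    }
    rows
}
```

Calling nested_indexes(&[2, 4]) for the two stations in this example gives the same six index pairs as the first two columns of the measurements table. The <station> value can then be used to look up the matching row in the stations table.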