XML2ARROW
A Rust crate for efficiently converting XML data to Apache Arrow format.
Overview
xml2arrow provides a high-performance solution for transforming XML documents into Apache Arrow tables. It leverages the quick-xml parser for efficient XML processing and the arrow crate for building Arrow data structures. This makes it ideal for handling large XML datasets and integrating them into data processing pipelines that use the Arrow ecosystem.
Features
- 🚀 High-performance XML parsing using quick-xml
- 📊 Flexible Mapping: Map complex XML structures to Apache Arrow with YAML
- 🔄 Nested Structure Support: Handle deeply nested XML hierarchies
- 🎯 Customizable Type Conversion: Automatically convert data types and apply unit conversions
- 💡 Attribute & Element Extraction: Seamlessly extract XML attributes or elements
Usage
- Create a Configuration File (YAML):
The configuration file (YAML format) defines how your XML structure maps to Arrow tables and fields. Here's a detailed explanation of the configuration structure:
tables:
  - name: <table_name>         # The name of the resulting Arrow table
    xml_path: <xml_path>       # The XML path to the *parent* element of the table's row elements
    levels:                    # Index levels for nested XML structures
      - <level1>
      - <level2>
    fields:
      - name: <field_name>     # The name of the Arrow field
        xml_path: <field_path> # The XML path to the field within a row
        data_type: <data_type> # The Arrow data type (see below)
        nullable: <true|false> # Whether the field can be null
        scale: <number>        # Optional scaling factor for floats
        offset: <number>       # Optional offset for floats
  - name: ...
- tables: A list of table configurations. Each entry defines a separate Arrow table to be extracted from the XML.
  - name: The name given to the resulting Arrow RecordBatch (which represents a table).
  - xml_path: An XPath-like string that specifies the XML element that is the parent of the elements representing rows in the table. For example, if your XML contains <library><book>...</book><book>...</book></library>, the xml_path would be /library.
  - levels: An array of strings that represent parent tables to create an index for nested structures. If the XML structure is /library/shelfs/shelf/books/book, you should define levels like this: levels: ["shelfs", "books"]. This will create indexes named <shelfs> and <books>.
  - fields: A list of field configurations for each column in the Arrow table.
    - name: The name of the field in the Arrow schema.
    - xml_path: An XPath-like string that specifies the XML element or attribute containing the field's value. To select an attribute, append @ followed by the attribute name to the element's path. For example, /library/book/@id selects the id attribute of the book element.
    - data_type: The Arrow data type of the field. Supported types are:
      - Boolean (true or false)
      - Int8
      - UInt8
      - Int16
      - UInt16
      - Int32
      - UInt32
      - Int64
      - UInt64
      - Float32
      - Float64
      - Utf8 (strings)
    - nullable: A boolean value indicating whether the field can contain null values. This field is optional and defaults to false if not specified.
    - scale (optional): A scaling factor for float fields (e.g., to convert units).
    - offset (optional): An offset value for float fields (e.g., to convert units).
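The attribute convention for xml_path can be illustrated with a small standalone Rust sketch. The helper name split_field_path is hypothetical and is not part of the xml2arrow API; it only shows how a path such as /library/book/@id separates into element segments and an attribute name, assuming the attribute is always the last segment.

```rust
/// Hypothetical helper (not part of xml2arrow): splits an XPath-like field
/// path into its element segments and an optional trailing attribute name.
fn split_field_path(path: &str) -> (Vec<&str>, Option<&str>) {
    // Split on '/' and drop the empty segment produced by a leading slash.
    let mut segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();

    // Assumption: an attribute, if present, is the last segment and starts with '@'.
    let attribute = segments
        .last()
        .copied()
        .and_then(|last| last.strip_prefix('@'));

    // Remove the attribute segment so only element names remain.
    if attribute.is_some() {
        segments.pop();
    }
    (segments, attribute)
}
```

Calling split_field_path("/library/book/@id") yields the element segments library and book together with the attribute name id, while a path without an @ segment yields no attribute.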
- Parse the XML
use std::fs::File;
use std::io::BufReader;
use
Example
Suppose we have the following XML file (stations.xml):
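<report>
  <header>
    <title>Meteorological Station Data</title>
    <created_by>National Weather Service</created_by>
    <creation_time>2024-12-30T13:59:15Z</creation_time>
  </header>
  <monitoring_stations>
    <monitoring_station id="MS001">
      <location>
        <latitude>-61.39110459389277</latitude>
        <longitude>48.08662749089257</longitude>
        <elevation>547.1050788360882</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature>35.486545480326114</temperature>
          <pressure>950.439973486407</pressure>
          <humidity>49.77716576844861</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature>29.095166644493865</temperature>
          <pressure>1049.3215015450517</pressure>
          <humidity>32.5687148391251</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Arctic Tundra area, used for Scientific Research.</description>
        <install_date>2024-03-31</install_date>
      </metadata>
    </monitoring_station>
    <monitoring_station id="MS002">
      <location>
        <latitude>11.891496388319311</latitude>
        <longitude>135.09336983543022</longitude>
        <elevation>174.53349357280004</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature>24.791842953632283</temperature>
          <pressure>989.4054287187706</pressure>
          <humidity>57.70794884397625</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature>15.153690541845911</temperature>
          <pressure>1001.413052919951</pressure>
          <humidity>45.45094598045342</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:49:15Z</timestamp>
          <temperature>-4.022555715139081</temperature>
          <pressure>1000.5225751769922</pressure>
          <humidity>70.40117458947834</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:54:15Z</timestamp>
          <temperature>25.852920542644185</temperature>
          <pressure>953.762785698162</pressure>
          <humidity>42.62088244545566</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Desert area, used for Weather Forecasting.</description>
        <install_date>2024-01-17</install_date>
      </metadata>
    </monitoring_station>
  </monitoring_stations>
</report>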
We can define a YAML configuration file (stations.yaml) to specify how to convert the XML data to Arrow tables:
tables:
  - name: report
    xml_path: /
    levels:
    fields:
      - name: title
        xml_path: /report/header/title
        data_type: Utf8
        nullable: false
      - name: created_by
        xml_path: /report/header/created_by
        data_type: Utf8
        nullable: false
      - name: creation_time
        xml_path: /report/header/creation_time
        data_type: Utf8
        nullable: false
  - name: stations
    xml_path: /report/monitoring_stations
    levels:
      - station
    fields:
      - name: id
        xml_path: /report/monitoring_stations/monitoring_station/@id # Path to an attribute
        data_type: Utf8
        nullable: false
      - name: latitude
        xml_path: /report/monitoring_stations/monitoring_station/location/latitude
        data_type: Float32
        nullable: false
      - name: longitude
        xml_path: /report/monitoring_stations/monitoring_station/location/longitude
        data_type: Float32
        nullable: false
      - name: elevation
        xml_path: /report/monitoring_stations/monitoring_station/location/elevation
        data_type: Float32
        nullable: false
      - name: description
        xml_path: /report/monitoring_stations/monitoring_station/metadata/description
        data_type: Utf8
        nullable: false
      - name: install_date
        xml_path: /report/monitoring_stations/monitoring_station/metadata/install_date
        data_type: Utf8
        nullable: false
  - name: measurements
    xml_path: /report/monitoring_stations/monitoring_station/measurements
    levels:
      - station
      - measurement
    fields:
      - name: timestamp
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/timestamp
        data_type: Utf8
        nullable: false
      - name: temperature
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/temperature
        data_type: Float64
        nullable: false
        offset: 273.15 # Convert from Celsius to Kelvin
      - name: pressure
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/pressure
        data_type: Float64
        nullable: false
        scale: 100.0 # Convert from hPa to Pa
      - name: humidity
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/humidity
        data_type: Float64
        nullable: false
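The unit conversions in this configuration amount to simple arithmetic. Here is a minimal sketch, assuming the converted value is computed as raw * scale + offset with defaults of 1.0 and 0.0. The order of operations when both are set is an assumption, since this example never sets both on the same field. The function name apply_conversion is hypothetical and not part of xml2arrow.

```rust
/// Hypothetical helper (not part of xml2arrow): applies an optional scale
/// and an optional offset to a raw value parsed from the XML.
fn apply_conversion(raw: f64, scale: Option<f64>, offset: Option<f64>) -> f64 {
    // Assumption: scale is applied first, then offset; missing values are neutral.
    raw * scale.unwrap_or(1.0) + offset.unwrap_or(0.0)
}
```

For the first measurement, apply_conversion(35.486545480326114, None, Some(273.15)) gives roughly 308.6365 (Celsius to Kelvin) and apply_conversion(950.439973486407, Some(100.0), None) gives roughly 95043.997 (hPa to Pa).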
Here's how to use xml2arrow to parse the XML and YAML files and get the resulting Arrow tables:
use std::fs::File;
use std::io::BufReader;
use
- report:
┌─────────────────────────────┬──────────────────────────┬──────────────────────┐
│ title                       ┆ created_by               ┆ creation_time        │
│ ---                         ┆ ---                      ┆ ---                  │
│ str                         ┆ str                      ┆ str                  │
╞═════════════════════════════╪══════════════════════════╪══════════════════════╡
│ Meteorological Station Data ┆ National Weather Service ┆ 2024-12-30T13:59:15Z │
└─────────────────────────────┴──────────────────────────┴──────────────────────┘
- stations:
┌───────────┬───────┬────────────┬────────────┬────────────┬────────────────────────┬──────────────┐
│ <station> ┆ id    ┆ latitude   ┆ longitude  ┆ elevation  ┆ description            ┆ install_date │
│ ---       ┆ ---   ┆ ---        ┆ ---        ┆ ---        ┆ ---                    ┆ ---          │
│ u32       ┆ str   ┆ f32        ┆ f32        ┆ f32        ┆ str                    ┆ str          │
╞═══════════╪═══════╪════════════╪════════════╪════════════╪════════════════════════╪══════════════╡
│ 0         ┆ MS001 ┆ -61.391106 ┆ 48.086628  ┆ 547.105103 ┆ Located in the Arctic  ┆ 2024-03-31   │
│           ┆       ┆            ┆            ┆            ┆ Tundra a…              ┆              │
│ 1         ┆ MS002 ┆ 11.891497  ┆ 135.093369 ┆ 174.533493 ┆ Located in the Desert  ┆ 2024-01-17   │
│           ┆       ┆            ┆            ┆            ┆ area, us…              ┆              │
└───────────┴───────┴────────────┴────────────┴────────────┴────────────────────────┴──────────────┘
- measurements:
┌───────────┬───────────────┬──────────────────────┬─────────────┬───────────────┬───────────┐
│ <station> ┆ <measurement> ┆ timestamp            ┆ temperature ┆ pressure      ┆ humidity  │
│ ---       ┆ ---           ┆ ---                  ┆ ---         ┆ ---           ┆ ---       │
│ u32       ┆ u32           ┆ str                  ┆ f64         ┆ f64           ┆ f64       │
╞═══════════╪═══════════════╪══════════════════════╪═════════════╪═══════════════╪═══════════╡
│ 0         ┆ 0             ┆ 2024-12-30T12:39:15Z ┆ 308.636545  ┆ 95043.997349  ┆ 49.777166 │
│ 0         ┆ 1             ┆ 2024-12-30T12:44:15Z ┆ 302.245167  ┆ 104932.150155 ┆ 32.568715 │
│ 1         ┆ 2             ┆ 2024-12-30T12:39:15Z ┆ 297.941843  ┆ 98940.542872  ┆ 57.707949 │
│ 1         ┆ 3             ┆ 2024-12-30T12:44:15Z ┆ 288.303691  ┆ 100141.305292 ┆ 45.450946 │
│ 1         ┆ 4             ┆ 2024-12-30T12:49:15Z ┆ 269.127444  ┆ 100052.257518 ┆ 70.401175 │
│ 1         ┆ 5             ┆ 2024-12-30T12:54:15Z ┆ 299.002921  ┆ 95376.27857   ┆ 42.620882 │
└───────────┴───────────────┴──────────────────────┴─────────────┴───────────────┴───────────┘
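
The index columns of the measurements table can be reproduced with a short sketch. It assumes, based only on the output above, that each level's counter keeps increasing across the whole document rather than restarting under each parent (the <measurement> index continues at 2 for the second station). The function index_columns and its input shape are hypothetical and not part of xml2arrow.

```rust
/// Hypothetical helper (not part of xml2arrow): given the number of
/// measurements under each station, builds the `<station>` and
/// `<measurement>` index columns, one entry per measurement row.
fn index_columns(measurements_per_station: &[usize]) -> (Vec<u32>, Vec<u32>) {
    let mut station_col = Vec::new();
    let mut measurement_col = Vec::new();

    // Assumption: the measurement counter runs over the whole document
    // and is not reset when a new station starts.
    let mut next_measurement = 0u32;

    for (station, &count) in measurements_per_station.iter().enumerate() {
        for _ in 0..count {
            station_col.push(station as u32);
            measurement_col.push(next_measurement);
            next_measurement += 1;
        }
    }
    (station_col, measurement_col)
}
```

With two measurements under the first station and four under the second, index_columns(&[2, 4]) produces the station column 0, 0, 1, 1, 1, 1 and the measurement column 0 through 5, matching the table.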