UniParc XML parser
- Introduction
- Usage
- Table schema
- Installation
- Output files
- Benchmarks
- Roadmap
- FAQ (Frequently Asked Questions)
- FUQ (Frequently Used Queries)
Introduction
Process the UniParc XML file (`uniparc_all.xml.gz`) downloaded from the UniProt website into CSV files that can be loaded into a relational database.
Usage
Uncompressed XML data can be piped into `uniparc_xml_parser` to generate a set of output files. For example, decompressing on the fly with zcat:
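```bash
$ zcat uniparc_all.xml.gz | uniparc_xml_parser
```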
The output is a set of CSV (or more specifically TSV) files, one file per table in the schema described below.
Table schema
The generated CSV files conform to a relational schema in which cross-reference tables (such as `xref`) reference the main `uniparc` table.
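As a minimal sketch of how these files could be loaded into a relational database (the column names, types, and file names below are illustrative assumptions, not the parser's authoritative schema):

```sql
-- Illustrative sketch only: column names, types, and file names are
-- assumptions, not the authoritative schema produced by the parser.
CREATE TABLE uniparc (
    uniparc_id TEXT PRIMARY KEY  -- UniParc accession, e.g. UPI0000000001
    -- ...additional columns per the actual schema...
);

CREATE TABLE xref (
    xref_id    BIGINT PRIMARY KEY,  -- incremental unique index (see FAQ)
    uniparc_id TEXT NOT NULL REFERENCES uniparc (uniparc_id)
    -- ...additional columns per the actual schema...
);

-- Bulk-load the TSV output (PostgreSQL; 'text' format is tab-delimited):
COPY uniparc FROM '/path/to/uniparc.tsv' WITH (FORMAT text);
COPY xref    FROM '/path/to/xref.tsv'    WITH (FORMAT text);
```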
Installation
Binaries
Linux binaries are available here: https://gitlab.com/ostrokach/uniparc_xml_parser/-/packages.
Cargo
Use `cargo` to compile and install `uniparc_xml_parser` for your target platform:
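```bash
# Compiles the latest release from crates.io and installs the binary.
$ cargo install uniparc_xml_parser
```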
Conda
Use `conda` to install precompiled binaries:
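A likely invocation (the channel name below is an assumption):

```bash
# NOTE: the conda channel is an assumption; substitute the channel
# that actually hosts the package.
$ conda install -c ostrokach-forge uniparc_xml_parser
```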
Output files
Parquet
Parquet files containing the processed data are available at the following URL and are updated monthly: http://uniparc.data.proteinsolver.org/.
Google BigQuery
The data can also be queried directly using Google BigQuery: https://console.cloud.google.com/bigquery?project=ostrokach-data&p=ostrokach-data&page=dataset&d=uniparc.
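For example, a simple sanity-check query (the table name within the dataset is an assumption):

```sql
-- The table name inside the `uniparc` dataset is an assumption.
SELECT COUNT(*) AS num_entries
FROM `ostrokach-data.uniparc.uniparc`;
```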
Benchmarks
Parsing 10,000 XML entries takes around 30 seconds (the process is mostly IO-bound):
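One way to reproduce a rough benchmark (the `head -n` cutoff standing in for ~10,000 entries is an illustrative assumption):

```bash
# The line count below is an assumption; the number of lines per XML
# entry varies, so adjust the cutoff to taste.
$ time zcat uniparc_all.xml.gz | head -n 500000 | uniparc_xml_parser
```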
The actual `uniparc_all.xml.gz` file has around 373,914,570 elements.
Roadmap
- Keep everything in bytes all the way until output.
FAQ (Frequently Asked Questions)
Why not split `uniparc_all.xml.gz` into multiple small files and process them in parallel?
- Splitting the file requires reading the entire file. If we're reading the entire file anyway, why not parse it as we read it?
- Having a single process which parses `uniparc_all.xml.gz` makes it easier to create an incremental unique index column (e.g. `xref.xref_id`).
FUQ (Frequently Used Queries)
TODO