biorustlings 0.0.2

Scripts for parsing UniParc XML files downloaded from the UniProt website into CSV files.

Dependencies
- xml-rs ^0.6 normal
Versions
- 0.0.2 (2017-11-13)

UniParc XML parser

Process the UniParc XML file (uniparc_all.xml.gz) downloaded from the UniProt website into CSV files that can be loaded into a relational database.

Example

Parsing 1 million lines takes about 5.5 seconds:

$ mkdir uniparc
$ time bash -c "zcat tests/uniparc_1mil.xml.gz | uniparc_xml_parser >/dev/null"

real    0m5.564s
user    0m5.528s
sys     0m0.132s

The full uniparc_all.xml.gz file contains about 5 billion lines.

FAQ

Why not split `uniparc_all.xml.gz` into multiple small files and process them in parallel?

Splitting the file requires reading the entire file. If we're reading the entire file anyway, why not parse it as we read it?
Having a single process that parses uniparc_all.xml.gz also makes it easier to create an auto-incrementing unique index column for each output table (e.g. UniparcXRef.idx, Property.idx, etc.).
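
The index argument can be illustrated with a minimal sketch. This is not the crate's actual implementation: it uses only the standard library and matches tags by line prefix instead of using a real XML parser, and `IdxCounter`, `index_stream`, the one-element-per-line input, and the parent `idx` stored on each property row are all assumptions made for illustration. It shows that when a single process reads the stream in order, each table's `idx` is just a local counter, and no coordination between workers is needed.

```rust
use std::io::{self, BufRead, Write};

/// Hands out consecutive, unique values for one table's `idx` column.
/// (Hypothetical helper, not part of the crate.)
struct IdxCounter {
    next: u64,
}

impl IdxCounter {
    fn new() -> Self {
        IdxCounter { next: 0 }
    }

    /// Returns the next unused index and advances the counter.
    fn next_idx(&mut self) -> u64 {
        let idx = self.next;
        self.next += 1;
        idx
    }
}

/// Reads `input` once, line by line, and writes one row per `<dbReference>`
/// and one row per `<property>` element it sees.
///
/// Assumption: every element starts on its own line, so a prefix check is
/// enough. A real parser would use a streaming XML reader instead.
///
/// Returns the number of rows written to each output.
fn index_stream<R: BufRead, X: Write, P: Write>(
    input: R,
    mut xref_out: X,
    mut prop_out: P,
) -> io::Result<(u64, u64)> {
    let mut xref_idx = IdxCounter::new();
    let mut prop_idx = IdxCounter::new();
    // Index of the most recent <dbReference>, used to link properties to it.
    let mut current_xref: Option<u64> = None;

    for line in input.lines() {
        let line = line?;
        let tag = line.trim_start();
        if tag.starts_with("<dbReference") {
            let idx = xref_idx.next_idx();
            current_xref = Some(idx);
            writeln!(xref_out, "{}", idx)?;
        } else if tag.starts_with("<property") {
            if let Some(parent) = current_xref {
                // Row layout (assumed): property idx, parent xref idx.
                writeln!(prop_out, "{},{}", prop_idx.next_idx(), parent)?;
            }
        }
    }
    Ok((xref_idx.next, prop_idx.next))
}
```

With several parallel workers, each worker would need either a pre-assigned index range or a shared counter to keep the `idx` values unique, and both require knowing or coordinating row counts before the rows are written.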