biorustlings 0.0.2

Scripts for parsing UniParc XML files downloaded from the UniProt website into CSV files.
Owner: ostrokach

UniParc XML parser

Process the UniParc XML file (uniparc_all.xml.gz) downloaded from the UniProt website into CSV files that can be loaded into a relational database.

Example

Parsing 1 million lines takes about 5.5 seconds:

$ mkdir uniparc
$ time bash -c "zcat tests/uniparc_1mil.xml.gz | uniparc_xml_parser >/dev/null"

real    0m5.564s
user    0m5.528s
sys     0m0.132s

The actual uniparc_all.xml.gz file contains about 5 billion lines.
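
As a rough back-of-the-envelope extrapolation from the timing above, and assuming throughput stays constant over the whole file (which has not been measured here), a full parse would take on the order of 5.564 s × 5,000 ≈ 27,820 s, or roughly 7.7 hours. The helper below is purely illustrative and is not part of the crate:

```rust
/// Linearly extrapolates a benchmark timing to a larger input.
/// Assumption: parsing time scales linearly with the number of lines.
fn estimated_seconds(sample_secs: f64, sample_lines: f64, total_lines: f64) -> f64 {
    sample_secs * (total_lines / sample_lines)
}

/// Converts seconds to hours for easier reading.
fn seconds_to_hours(secs: f64) -> f64 {
    secs / 3600.0
}
```

Calling `estimated_seconds(5.564, 1e6, 5e9)` uses the wall-clock time from the benchmark and the approximate line count quoted above.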

FAQ

Why not split uniparc_all.xml.gz into multiple small files and process them in parallel?

  • Splitting the file requires reading the entire file. If we're reading the entire file anyway, why not parse it as we read it?
  • Having a single process which parses uniparc_all.xml.gz makes it easier to create an incremental unique index column (e.g. UniparcXRef.idx, Property.idx, etc.).
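
The second point can be sketched as follows. This is a minimal illustration, not the crate's actual implementation: `IdxCounter` and `index_rows` are hypothetical names, and matching lines by substring stands in for real XML parsing. It shows why a single sequential pass makes unique incremental indexes trivial: one counter per output table is enough, with no coordination between workers.

```rust
use std::io::{BufRead, Write};

/// Hands out consecutive integer ids; one instance per output table
/// (hypothetical helper, not taken from the crate).
struct IdxCounter {
    next: u64,
}

impl IdxCounter {
    fn new() -> Self {
        IdxCounter { next: 0 }
    }

    /// Returns the current id and advances the counter.
    fn next_idx(&mut self) -> u64 {
        let idx = self.next;
        self.next += 1;
        idx
    }
}

/// Reads `input` line by line in a single pass and writes one `idx,line`
/// row to `out` for every line containing `marker`.
/// Returns the number of rows written.
fn index_rows<R: BufRead, W: Write>(
    input: R,
    marker: &str,
    out: &mut W,
) -> std::io::Result<u64> {
    let mut counter = IdxCounter::new();
    for line in input.lines() {
        let line = line?;
        if line.contains(marker) {
            // The id is unique because only this loop ever advances the counter.
            let idx = counter.next_idx();
            writeln!(out, "{},{}", idx, line.trim())?;
        }
    }
    Ok(counter.next)
}
```

With several parallel workers, each one would need either a reserved id range or a post-processing step to renumber rows, which is the extra complexity the single-process design avoids.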