Module ncbi_report

Parser for NCBI assembly report files.

NCBI assembly reports list each contig under several naming conventions, together with its length. The key columns are:

  • Sequence-Name: The primary name (e.g., “1”, “X”, “MT”)
  • GenBank-Accn: GenBank accession (e.g., “CM000663.2”)
  • RefSeq-Accn: RefSeq accession (e.g., “NC_000001.11”)
  • UCSC-style-name: UCSC-style name (e.g., “chr1”)
  • Sequence-Length: Length in base pairs

All non-empty names from these columns become aliases for matching.
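The alias collection described above can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: `ContigAliases` and `collect_aliases` are hypothetical names, the sketch assumes tab-separated rows with a `#`-prefixed header line, and it assumes that the placeholder "na" should be skipped along with empty values.

```rust
use std::collections::HashMap;

/// Hypothetical record for this sketch; not the module's `NcbiContigEntry`.
#[derive(Debug, PartialEq)]
struct ContigAliases {
    length: u64,
    aliases: Vec<String>,
}

/// The four name-bearing columns listed above.
const NAME_COLUMNS: [&str; 4] = [
    "Sequence-Name",
    "GenBank-Accn",
    "RefSeq-Accn",
    "UCSC-style-name",
];

/// Collect every usable name from one data row.
/// Columns are located by header name rather than by position.
fn collect_aliases(header: &str, row: &str) -> Option<ContigAliases> {
    // Assumption: the header line starts with '#' and is tab-separated.
    let columns: Vec<&str> = header
        .trim_start_matches('#')
        .trim()
        .split('\t')
        .collect();
    let fields: Vec<&str> = row.trim_end().split('\t').collect();

    // Pair each column name with the field at the same position.
    let by_name: HashMap<&str, &str> = columns
        .iter()
        .copied()
        .zip(fields.iter().copied())
        .collect();

    // A row without a parseable length is rejected.
    let length = by_name.get("Sequence-Length")?.parse().ok()?;

    // Assumption: "na" marks a missing value and is not an alias.
    let aliases = NAME_COLUMNS
        .iter()
        .filter_map(|column| by_name.get(column).copied())
        .map(str::trim)
        .filter(|value| !value.is_empty() && *value != "na")
        .map(String::from)
        .collect();

    Some(ContigAliases { length, aliases })
}
```

Looking columns up by header name keeps the sketch independent of column order, at the cost of building a small map per row.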

§UCSC-style Name Generation for Patches

For GRCh38 assembly reports prior to p13, the UCSC-style-name column shows “na” for fix-patches and novel-patches. However, UCSC does assign names to these patches following a specific convention:

  • Format: chr{chromosome}_{accession}v{version}_{suffix}
  • Suffix: fix for fix-patches, alt for novel-patches
  • Example: GenBank accession KN196472.1 on chromosome 1 as a fix-patch becomes chr1_KN196472v1_fix

This module can optionally generate these UCSC-style names when they are missing from the assembly report. This is controlled by the generate_ucsc_names parameter.
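As a rough sketch of the convention, the name can be assembled from the chromosome, the versioned GenBank accession, and the patch type. The names `PatchKind`, `sketch_ucsc_patch_name`, and `resolve_ucsc_name` are hypothetical and do not reflect the actual signatures of `PatchType` or `generate_ucsc_patch_name`; only the `generate_ucsc_names` flag name comes from the description above.

```rust
/// Hypothetical stand-in for the module's `PatchType` enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PatchKind {
    Fix,
    Novel,
}

/// Build `chr{chromosome}_{accession}v{version}_{suffix}` from a versioned
/// accession such as "KN196472.1". Returns `None` if the accession has no
/// numeric version after the dot.
fn sketch_ucsc_patch_name(
    chromosome: &str,
    genbank_accn: &str,
    kind: PatchKind,
) -> Option<String> {
    let (accession, version) = genbank_accn.split_once('.')?;
    let version_is_numeric =
        !version.is_empty() && version.chars().all(|c| c.is_ascii_digit());
    if accession.is_empty() || !version_is_numeric {
        return None;
    }
    let suffix = match kind {
        PatchKind::Fix => "fix",
        PatchKind::Novel => "alt",
    };
    Some(format!("chr{}_{}v{}_{}", chromosome, accession, version, suffix))
}

/// Prefer the name reported in the UCSC-style-name column; generate one only
/// when it is missing ("na" or empty), the flag is set, and the row is a patch.
fn resolve_ucsc_name(
    reported: &str,
    generate_ucsc_names: bool,
    chromosome: &str,
    genbank_accn: &str,
    kind: Option<PatchKind>,
) -> Option<String> {
    let reported = reported.trim();
    if !reported.is_empty() && reported != "na" {
        return Some(reported.to_string());
    }
    if !generate_ucsc_names {
        return None;
    }
    sketch_ucsc_patch_name(chromosome, genbank_accn, kind?)
}
```

Keeping the reported name whenever one is present means generation only fills gaps and never overrides what the assembly report states.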

§Sources and References

§Verification

The naming convention has been verified against UCSC’s official chromosome size files for GRCh38.p12 and cross-referenced with NCBI assembly reports for p12-p14.

Structs§

NcbiContigEntry
A parsed contig from an NCBI assembly report with all naming variants

Enums§

PatchType
Patch type for NCBI assembly report entries

Functions§

generate_ucsc_patch_name
Generate a UCSC-style name for a patch contig.
parse_ncbi_report_text
Parse an NCBI assembly report from text.