Parser for NCBI assembly report files.
NCBI assembly reports contain rich metadata about contigs including multiple naming conventions. The key columns are:
- Sequence-Name: The primary name (e.g., “1”, “X”, “MT”)
- GenBank-Accn: GenBank accession (e.g., “CM000663.2”)
- RefSeq-Accn: RefSeq accession (e.g., “NC_000001.11”)
- UCSC-style-name: UCSC-style name (e.g., “chr1”)
- Sequence-Length: Length in base pairs
All non-empty names from these columns become aliases for matching.
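As a rough illustration of how these columns can be turned into aliases, the sketch below reads one tab-separated data row and collects the non-empty names. It is not this module's implementation: the helper name `collect_aliases`, the fixed column positions (taken from the standard 10-column report layout), and the choice to treat “na” as empty are all assumptions made for the example.

```rust
/// Hypothetical helper (not part of this module's API): collect alias names
/// and the sequence length from one tab-separated assembly report row.
fn collect_aliases(row: &str) -> Option<(Vec<String>, u64)> {
    // Header and comment lines in assembly reports start with '#'.
    if row.starts_with('#') {
        return None;
    }
    let cols: Vec<&str> = row.split('\t').map(str::trim).collect();
    if cols.len() < 10 {
        return None;
    }

    // Assumed positions: Sequence-Name, GenBank-Accn, RefSeq-Accn, UCSC-style-name.
    let name_columns = [0usize, 4, 6, 9];
    let mut aliases: Vec<String> = Vec::new();
    for &i in &name_columns {
        let value = cols[i];
        // Assumption: "na" is treated the same as an empty field.
        let is_missing = value.is_empty() || value == "na";
        let is_duplicate = aliases.iter().any(|a| a.as_str() == value);
        if !is_missing && !is_duplicate {
            aliases.push(value.to_string());
        }
    }

    // Assumed position of Sequence-Length (base pairs).
    let length = cols[8].parse::<u64>().ok()?;
    Some((aliases, length))
}
```

For a chromosome 1 row carrying the example values above, this would yield the aliases “1”, “CM000663.2”, “NC_000001.11”, and “chr1”.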
§UCSC-style Name Generation for Patches
For GRCh38 assembly reports prior to p13, the UCSC-style-name column shows “na”
for fix-patches and novel-patches. However, UCSC does assign names to these
patches following a specific convention:
- Format: chr{chromosome}_{accession}v{version}_{suffix}
- Suffix: _fix for fix-patches, _alt for novel-patches
- Example: GenBank accession KN196472.1 on chromosome 1 as a fix-patch becomes chr1_KN196472v1_fix
This module can optionally generate these UCSC-style names when they are missing
from the assembly report. This is controlled by the generate_ucsc_names parameter.
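The naming convention described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the signature or behavior of generate_ucsc_patch_name: the names `PatchKind` and `sketch_ucsc_patch_name` are hypothetical, and rejecting accessions that lack a numeric version is a choice made for the example.

```rust
/// Hypothetical stand-in for the module's patch type enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PatchKind {
    Fix,
    Novel,
}

/// Hypothetical sketch: build a UCSC-style patch name from a chromosome
/// label (e.g. "1") and a versioned GenBank accession (e.g. "KN196472.1").
fn sketch_ucsc_patch_name(
    chromosome: &str,
    genbank_accn: &str,
    kind: PatchKind,
) -> Option<String> {
    // Split "KN196472.1" into the accession "KN196472" and the version "1".
    let (accession, version) = genbank_accn.split_once('.')?;

    // Assumption: refuse input without a purely numeric version.
    let version_ok = !version.is_empty() && version.chars().all(|c| c.is_ascii_digit());
    if chromosome.is_empty() || accession.is_empty() || !version_ok {
        return None;
    }

    // The separating underscore is supplied by the format string below.
    let suffix = match kind {
        PatchKind::Fix => "fix",
        PatchKind::Novel => "alt",
    };

    Some(format!("chr{}_{}v{}_{}", chromosome, accession, version, suffix))
}
```

Called with chromosome “1”, accession “KN196472.1”, and the fix variant, the sketch produces chr1_KN196472v1_fix, matching the example in the list above.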
§Sources and References
- UCSC FAQ on chromosome naming: https://genome.ucsc.edu/FAQ/FAQdownloads.html
- UCSC Patches blog post: https://genome-blog.soe.ucsc.edu/blog/2019/02/22/patches/
- UCSC hg38.p12 chrom.sizes: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/p12/
- GRC Patches documentation: https://www.ncbi.nlm.nih.gov/grc/help/patches/
§Verification
The naming convention has been verified against UCSC’s official chromosome size files for GRCh38.p12 and cross-referenced with NCBI assembly reports for p12-p14.
Structs§
- NcbiContigEntry - A parsed contig from an NCBI assembly report with all naming variants
Enums§
- PatchType - Patch type for NCBI assembly report entries
Functions§
- generate_ucsc_patch_name - Generate a UCSC-style name for a patch contig.
- parse_ncbi_report_text - Parse an NCBI assembly report from text.