but first, how is this better than any other option? yeah, just check the image below.
googled 'get chromosome sizes from fasta', grab every command/tool I found and benchmarked it. surprisingly, you can lose 14 seconds of your life just waiting for those chrom sizes to be calculated. crazy.
What's new on v.0.0.33?
- --fasta now is --sequence, change based on .fa/.fa.gz/.2bit inputs
- implementation of stdin mode!
- --sequence now defaults to -, so omitting it reads stdin; help text mentions stdin.
- SIMD newline/CR counting via bytecount
- Switched gzip inflate to flate2 with the faster zlib-ng-compat backend and enlarged output buffering
Usage
Binary
Usage: chromsize --output <OUTPUT>
Arguments:
-s, --sequence <SEQUENCE>: Sequence file
-o, --output <OUTPUT>: path to chrom.sizes
Options:
-t, --threads <THREADS>: number of threads
-a, --accession-only only keep the accession id part of the header
--help: print help
--version: print version
examples:
- stream:
cat genome.fa | chromsize -o tmp.chromsizes - explicit stdin:
zcat genome.fa.gz | chromsize -o tmp.chromsizes -s - - file input:
chromsize -s genome.fa -o tmp.chromsizes
input format (FASTA/2bit) and gzip compression are auto-detected (from files or stdin) even if the extension is missing.
crate: https://crates.io/crates/chromsize
Installation
to install rust and use chromsize on your system follow this steps:
- get installer:
curl https://sh.rustup.rs -sSf | shon unix, or go here for other options - run
cargo install chromsize(make sure~/.cargo/binis in your$PATHbefore running it) - use
chromsizewith the required arguments
Library
use chromsize;
Python
build the port to install it as a pkg:
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize/py-chromsize
hatch shell
maturin develop --release
use it as a binary wrapper [for .fa and .2bit]:
import chromsize as cs
input = "/path/to/genome.fa"
output = "/path/to/chrom.sizes"
cs.write_chromsizes(input, output)
or just get them directly
import chromsize as cs
input = "/path/to/genome.fa"
sizes = cs.get_chromsizes(input)
>>> print(sizes)
[
('chr1', 123),
('chr2', 456),
...
]
Build
to build chromsize from this repo, do:
- get rust
- run
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize - run
cargo run --release -- -s <SEQUENCE> -o <OUTPUT>
Container image
to build the development container image:
- run
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize - initialize docker with
start dockerorsystemctl start docker - build the image
docker image build --tag chromsize . - run
docker run --rm -v "[dir_where_your_fa_is]:/dir" chromsize -s /dir/<SEQUENCE> -o /dir/<OUTPUT>
Conda
to use chromsize through Conda just:
conda install chromsize -c biocondaorconda create -n chromsize -c bioconda chromsize
Nextflow (not available yet)
Benchmark
do not believe me? run the benchmark on your own:
- get .fa from any species you want (or download the ones I used from UCSC/NCBI)
- install hyperfine: https://github.com/sharkdp/hyperfine
- go to chromsize/bench and modify the
ASSEMBLIESconst with the .fa you've download - run
cargo run release --bin chromsize-benchmark -- -d /dir/where/my/fastas/are -a show-output ignore-failure
here is all the info and metadata from my experiment:
which tools I used?
| Tool | Command | Reference | Discussion |
|---|---|---|---|
| seqkit | seqkit fx2tab --length --name --header-line {assembly} > chrom.sizes |
1 | 2 |
| chromsize | target/release/chromsize -s {assembly} -o chrom.sizes |
3 | |
| pyfaidx | faidx {assembly} -i chromsizes > chrom.sizes |
4 | 5 |
| samtools | samtools faidx {assembly} && wait | cut -f1,2 {assembly}.fai > chrom.sizes |
6 | 5 |
| faSize | faSize -detailed -tab {assembly} > chrom.sizes |
7 | |
| awk1 | awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' {assembly} > chrom.sizes |
8 | 9 |
| awk2 | awk '/^>/{if (l!=") print l; print; l=0; next}{l+=length($0)}END{print l}' {assembly} > chrom.sizes |
8 | 9 |
| bioawk1 | bioawk -c fastx '{print > $name ORS length($seq)}' {assembly} > chrom.sizes |
10 | 9 |
| awk3 | cat {assembly} | awk '$0 ~ > {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' > chrom.sizes |
8 | 11 |
| bioawk2 | bioawk -c fastx '{ print $name, length($seq) }' < {assembly} > chrom.sizes |
10 | 2 |
detailed data?
| Species | Assembly | Size (Gb) | chromsize | seqKit | awk1 | awk2 | awk3 | bioawk1 | bioawk2 | faSize | pyfaidx | samtools |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S. cerevisiae | R64 | 0.01 | 0.004 | 0.016 (X 4.0) | 0.043 (X 10.7) | 0.043 (X 10.7) | 0.05 (X 12.5) | 0.03 (X 7.5) | 0.03 (X 7.5) | 0.054 (X 13.5) | 0.101 (X 25.2) | 0.064 (X 16.0) |
| C. elegans | ce11 | 0.10 | 0.02 | 0.103 (X 5.1) | 0.409 (X 20.4) | 0.408 (X 20.4) | 0.492 (X 24.6) | 0.274 (X 13.7) | 0.274 (X 13.7) | 0.426 (X 21.3) | 0.225 (X 11.2) | 0.472 (X 23.6) |
| D. melanogaster | dm6 | 0.14 | 0.028 | 0.147 (X 5.2) | 0.581 (X 20.7) | 0.583 (X 20.8) | 0.714 (X 25.5) | 0.426 (X 15.2) | 0.418 (X 14.9) | 0.633 (X 22.6) | 0.337 (X 12.0) | 0.667 (X 23.8) |
| D. rerio | danRer11 | 1.37 | 0.22 | 0.742 (X 3.4) | 6.815 (X 31.0) | 6.803 (X 30.9) | 8.216 (X 37.3) | 3.946 (X 17.9) | 3.95 (X 18.0) | 7.202 (X 32.7) | 3.029 (X 13.8) | 7.633 (X 34.7) |
| C. familiaris | canFam4 | 2.48 | 0.311 | 1.209 (X 3.9) | 10.158 (X 32.7) | 10.124 (X 32.6) | 12.206 (X 39.2) | 6.55 (X 21.1) | 6.518 (X 21.0) | 10.671 (X 34.3) | 4.741 (X 15.2) | 11.394 (X 36.6) |
| H. sapiens | GRCh38 | 3.10 | 0.43 | 1.696 (X 3.9) | 12.393 (X 28.8) | 12.432 (X 28.9) | 13.681 (X 31.8) | 7.414 (X 17.2) | 7.284 (X 16.9) | 13.102 (X 30.5) | 6.37 (X 14.8) | 14.074 (X 32.7) |
| B. bombina | aBomBom1 | 9.80 | 1.554 | 8.501 (X 5.5) | 41.676 (X 26.8) | 41.696 (X 26.8) | 49.064 (X 31.6) | 24.202 (X 15.6) | 24.374 (X 15.7) | 43.856 (X 28.2) | 19.755 (X 12.7) | 45.387 (X 29.2) |
| A. mexicanum | AmbMex60DD | 28.20 | 3.327 | 14.375 (X 4.3) | 118.923 (X 35.7) | 118.422 (X 35.6) | 137.781 (X 41.4) | 57.626 (X 17.3) | 57.591 (X 17.3) | 121.257 (X 36.4) | 54.82 (X 16.5) | 128.374 (X 38.6) |
| P. annectens | PAN1.0 | 40.10 | 4.606 | 18.664 (X 4.1) | 167.85 (X 36.4) | 165.701 (X 36.0) | 196.833 (X 42.7) | 91.747 (X 19.9) | 91.924 (X 20.0) | 170.475 (X 37.0) | 77.707 (X 16.9) | 181.562 (X 39.4) |
how well performs with .gz?
CHM13-T2T.fa.gz
| Tool | Cores | Time |
|---|---|---|
| seqkit | 16 | 18.993 s ± 0.132 s |
| chromsize | default (max_cpus: 16) | 7.631 s ± 0.010 s |
| seqkit | default (4) | 18.525 s ± 0.520 s |
| chromsize | 4 | 8.035 s ± 0.077 s |
| seqkit | 2 | 18.535 s ± 0.376 s |
| chromsize | 2 | 8.284 s ± 0.030 s |