chromsize 0.0.32

just get your chrom sizes
Documentation
<p align="center">
  <h1 align="center">
    chromsize
  </h1>

  <p align="center">
    <a href="https://img.shields.io/badge/version-0.1.0dev-green" target="_blank">
      <img alt="Version Badge" src="https://img.shields.io/badge/version-0.0.1-green">
    </a>
    <a href="https://crates.io/crates/chromsize" target="_blank">
      <img alt="Crates.io Version" src="https://img.shields.io/crates/v/chromsize">
    </a>
    <a href="https://github.com/alejandrogzi/chromsize" target="_blank">
      <img alt="GitHub License" src="https://img.shields.io/github/license/alejandrogzi/chromsize?color=blue">
    </a>
    <a href="https://crates.io/crates/chromsize" target="_blank">
      <img alt="Crates.io Total Downloads" src="https://img.shields.io/crates/d/chromsize">
    </a>
  </p>


  <p align="center">
    annoyed to have to create an index and cut it? <br>
    have to look for that old script every time? <br>
    got you. just get your chrom sizes. very fast.
  </p>


but first, how is this better than any other option? yeah, just check the image below.

</p>

<p align="center">
    <img width=700 align="center" src="./assets/chromsize.png">
</p>


googled 'get chromosome sizes from fasta', grab every command/tool I found and benchmarked it. surprisingly, you can lose 14 seconds of your life just waiting for those chrom sizes to be calculated. crazy.

> What's new on v.0.0.32?
> - now accepts .2bit files as input too!
> - --fasta argument now is --input (or -i) [accounts for .2bit files]

## Usage
### Binary
``` rust
Usage: chromsize --input <FASTA/FASTA.GZ/2BIT> --output <OUTPUT> [-t <THREADS>]

Arguments:
    -i, --input <FASTA>: FASTA file
    -o, --output <OUTPUT>: path to chrom.sizes

Options:
    -t, --threads <THREADS>: number of threads [default: your max ncpus]
    -a, --accession-only     only keep the accession id part of the header (stop after blank)
    --help: print help
    --version: print version
```
#### crate: [https://crates.io/crates/chromsize]https://crates.io/crates/chromsize

### Installation
to install rust and use chromsize on your system follow this steps:
1. get installer: `curl https://sh.rustup.rs -sSf | sh` on unix, or go [here]https://www.rust-lang.org/tools/install for other options
2. run `cargo install chromsize` (make sure `~/.cargo/bin` is in your `$PATH` before running it)
4. use `chromsize` with the required arguments


### Library
``` rust
use chromsize;

fn main() {
    let input = PathBuf::new("/path/to/fasta.fa"); // INFO: can be .2bit too
    let output = PathBuf::new("/path/to/chrom.sizes");

    let sizes: Vec<(String, u64)> = chromsize::chromsize(&input);
    chromsize::write(sizes, &output)
}
```
### Python
build the port to install it as a pkg:
```
git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize/py-chromsize
hatch shell
maturin develop --release
```
use it as a binary wrapper [for .fa and .2bit]:
``` python3
import chromsize as cs

input = "/path/to/fasta.fa" # INFO: can be .2bit too
output = "/path/to/chrom.sizes"
cs.write_chromsizes(input, output)
```
or just get them directly
``` python3
import chromsize as cs

input = "/path/to/fasta.fa" # INFO: can be .2bit too
sizes = cs.get_chromsizes(input)

>>> print(sizes)
[
    ('chr1', 123),
    ('chr2', 456),
    ...
]
```

## Build
to build chromsize from this repo, do:

1. get rust
2. run `git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize`
3. run `cargo run --release -- -i <FASTA/FASTA.GZ/2BIT> -o <OUTPUT>`

## Container image
to build the development container image:
1. run `git clone https://github.com/alejandrogzi/chromsize.git && cd chromsize`
2. initialize docker with `start docker` or `systemctl start docker`
3. build the image `docker image build --tag chromsize .`
4. run `docker run --rm -v "[dir_where_your_fa_is]:/dir" chromsize -f /dir/<INPUT> -o /dir/<OUTPUT>`

## Conda
to use chromsize through Conda just:
1. `conda install chromsize -c bioconda` or `conda create -n chromsize -c bioconda chromsize`

> ## Nextflow (not available yet)

## Benchmark

do not believe me? run the benchmark on your own:

1. get .fa from any species you want (or download the ones I used from UCSC/NCBI)
2. install hyperfine: [https://github.com/sharkdp/hyperfine](https://github.com/sharkdp/hyperfine)
3. go to chromsize/bench and modify the `ASSEMBLIES` const with the .fa you've download
4. run `cargo run release --bin chromsize-benchmark -- -d /dir/where/my/fastas/are -a show-output ignore-failure`

here is all the info and metadata from my experiment:

### which tools I used?

| Tool      | Command                                                                                                          | Reference | Discussion |
|-----------|------------------------------------------------------------------------------------------------------------------|-----------|------------|
| seqkit    | `seqkit fx2tab --length --name --header-line {assembly} > chrom.sizes`                                           | [1]       | [2]        |
| chromsize | `target/release/chromsize -f {assembly} -o chrom.sizes`                                                          | [3]       |            |
| pyfaidx   | `faidx {assembly} -i chromsizes > chrom.sizes`                                                                   | [4]       | [5]        |
| samtools  | `samtools faidx {assembly} && wait \| cut -f1,2 {assembly}.fai > chrom.sizes`                                      | [6]       | [5]        |
| faSize    | `faSize -detailed -tab {assembly} > chrom.sizes`                                                                 | [7]       |            |
| awk1      | `awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' {assembly} > chrom.sizes` | [8]       | [9]        |
| awk2      | `awk '/^>/{if (l!=") print l; print; l=0; next}{l+=length($0)}END{print l}' {assembly} > chrom.sizes`             | [8]       | [9]        |
| bioawk1   | `bioawk -c fastx '{print > $name ORS length($seq)}' {assembly} > chrom.sizes`                                     | [10]      | [9]        |
| awk3      | `cat {assembly} \| awk '$0 ~ > {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' > chrom.sizes` | [8]       | [11]       |
| bioawk2   | `bioawk -c fastx '{ print $name, length($seq) }' < {assembly} > chrom.sizes`                                      | [10]      | [2]        |

[1]: https://github.com/shenwei356/seqkit
[2]: https://www.biostars.org/p/118954/
[3]: https://github.com/alejandrogzi/chromsize
[4]: https://github.com/mdshw5/pyfaidx
[5]: https://www.biostars.org/p/173963/
[6]: https://github.com/samtools/samtools
[7]: https://github.com/ENCODE-DCC/kentUtils/blob/master/bin/linux.x86_64/faSize
[8]: https://www.gnu.org/software/gawk/manual/gawk.html
[9]: https://stackoverflow.com/questions/23992646/sequence-length-of-fasta-file
[10]: https://github.com/lh3/bioawk
[11]: https://www.danielecook.com/generate-fasta-sequence-lengths/

### detailed data?

| Species        | Assembly    | Size (Gb) | chromsize | seqKit        | awk1          | awk2          | awk3          | bioawk1        | bioawk2        | faSize        | pyfaidx        | samtools       |
|----------------|-------------|-----------|-----------|---------------|---------------|---------------|---------------|----------------|----------------|---------------|----------------|----------------|
| S. cerevisiae  | R64         | 0.01      | 0.004     | 0.016 (X 4.0) | 0.043 (X 10.7)| 0.043 (X 10.7)| 0.05 (X 12.5) | 0.03 (X 7.5)   | 0.03 (X 7.5)   | 0.054 (X 13.5)| 0.101 (X 25.2) | 0.064 (X 16.0) |
| C. elegans     | ce11        | 0.10      | 0.02      | 0.103 (X 5.1) | 0.409 (X 20.4)| 0.408 (X 20.4)| 0.492 (X 24.6)| 0.274 (X 13.7) | 0.274 (X 13.7) | 0.426 (X 21.3)| 0.225 (X 11.2) | 0.472 (X 23.6) |
| D. melanogaster| dm6         | 0.14      | 0.028     | 0.147 (X 5.2) | 0.581 (X 20.7)| 0.583 (X 20.8)| 0.714 (X 25.5)| 0.426 (X 15.2) | 0.418 (X 14.9) | 0.633 (X 22.6)| 0.337 (X 12.0) | 0.667 (X 23.8) |
| D. rerio       | danRer11    | 1.37      | 0.22      | 0.742 (X 3.4) | 6.815 (X 31.0)| 6.803 (X 30.9)| 8.216 (X 37.3)| 3.946 (X 17.9) | 3.95 (X 18.0)  | 7.202 (X 32.7)| 3.029 (X 13.8) | 7.633 (X 34.7) |
| C. familiaris  | canFam4     | 2.48      | 0.311     | 1.209 (X 3.9) | 10.158 (X 32.7)| 10.124 (X 32.6)| 12.206 (X 39.2)| 6.55 (X 21.1)  | 6.518 (X 21.0) | 10.671 (X 34.3)| 4.741 (X 15.2) | 11.394 (X 36.6)|
| H. sapiens     | GRCh38      | 3.10      | 0.43      | 1.696 (X 3.9) | 12.393 (X 28.8)| 12.432 (X 28.9)| 13.681 (X 31.8)| 7.414 (X 17.2) | 7.284 (X 16.9) | 13.102 (X 30.5)| 6.37 (X 14.8)  | 14.074 (X 32.7)|
| B. bombina     | aBomBom1    | 9.80      | 1.554     | 8.501 (X 5.5) | 41.676 (X 26.8)| 41.696 (X 26.8)| 49.064 (X 31.6)| 24.202 (X 15.6)| 24.374 (X 15.7)| 43.856 (X 28.2)| 19.755 (X 12.7)| 45.387 (X 29.2)|
| A. mexicanum   | AmbMex60DD  | 28.20     | 3.327     | 14.375 (X 4.3)| 118.923 (X 35.7)| 118.422 (X 35.6)| 137.781 (X 41.4)| 57.626 (X 17.3)| 57.591 (X 17.3)| 121.257 (X 36.4)| 54.82 (X 16.5) | 128.374 (X 38.6)|
| P. annectens   | PAN1.0      | 40.10     | 4.606     | 18.664 (X 4.1)| 167.85 (X 36.4)| 165.701 (X 36.0)| 196.833 (X 42.7)| 91.747 (X 19.9)| 91.924 (X 20.0)| 170.475 (X 37.0)| 77.707 (X 16.9)| 181.562 (X 39.4)|

### how well performs with .gz?

#### CHM13-T2T.fa.gz

| Tool      | Cores                      | Time                   |
|-----------|----------------------------|------------------------|
| seqkit    | 16                         | 18.993 s ± 0.132 s     |
| chromsize | default (max_cpus: 16)     | 7.631 s ± 0.010 s      |
| seqkit    | default (4)                | 18.525 s ± 0.520 s     |
| chromsize | 4                          | 8.035 s ± 0.077 s      |
| seqkit    | 2                          | 18.535 s ± 0.376 s     |
| chromsize | 2                          | 8.284 s ± 0.030 s      |