rustybam 0.1.25

bioinformatics toolkit in rust

rustybam

rustybam is a bioinformatics toolkit written in the Rust programming language, focused on manipulating alignment (bam and PAF), annotation (bed), and sequence (fasta and fastq) files.

Usage

rustybam [OPTIONS] <SUBCOMMAND>

or

rb [OPTIONS] <SUBCOMMAND>

Subcommands

The full manual for each subcommand can be found in the docs.

SUBCOMMANDS:
    stats          Get percent identity stats from a sam/bam/cram or PAF
    bed-length     Count the number of bases in a bed file [aliases: bedlen, bl, bedlength]
    filter         Filter PAF records in various ways
    invert         Invert the target and query sequences in a PAF along with the CIGAR string
    liftover       Liftover target sequence coordinates onto query sequence using a PAF
    trim-paf       Trim paf records that overlap in query sequence [aliases: trim, tp]
    orient         Orient paf records so that most of the bases are in the forward direction
    break-paf      Break PAF records with large indels into multiple records (useful for
                   SafFire) [aliases: breakpaf, bp]
    paf-to-sam     Convert a PAF file into a SAM file. Warning, all alignments will be marked as
                   primary! [aliases: paftosam, p2s, paf2sam]
    fasta-split    Reads in a fasta from stdin and divides into files (can compress by adding
                   .gz) [aliases: fastasplit, fasplit]
    fastq-split    Reads in a fastq from stdin and divides into files (can compress by adding
                   .gz) [aliases: fastqsplit, fqsplit]
    get-fasta      Mimic bedtools getfasta but allow for bgzip in both bed and fasta inputs
                   [aliases: getfasta, gf]
    nucfreq        Get the frequencies of each bp at each position
    repeat         Report the longest exact repeat length at every position in a fasta
    suns           Extract the intervals in a genome (fasta) that are made up of SUNs
    help           Print this message or the help of the given subcommand(s)

Install

conda

mamba install -c bioconda rustybam

cargo

cargo install rustybam

Pre-compiled binaries

Download from releases (may be slower than locally compiled versions).

Source

git clone https://github.com/mrvollger/rustybam.git
cd rustybam
cargo build --release

and the executables will be built here:

target/release/{rustybam,rb}

Examples

PAF or BAM statistics

For BAM files with extended CIGAR operations we can calculate statistics about the alignment and report them in BED format.

rustybam stats {input.bam} > {stats.bed}

The same can be done with PAF files as long as they were generated with -c --eqx (the minimap2 options that write an extended =/X CIGAR into the PAF).

rustybam stats --paf {input.paf} > {stats.bed}
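Why the extended CIGAR matters: a plain M operation covers both matches and mismatches, so identity can only be computed once the aligner writes = and X separately. The sketch below illustrates the idea with awk. The helper cigar_identity is hypothetical and not part of rustybam, and it uses one simple definition (matches over matches plus mismatches, ignoring indels); the columns rustybam stats actually reports may be defined differently.

```shell
# Hypothetical helper, NOT part of rustybam: percent identity from an
# extended CIGAR string, counting only '=' (match) and 'X' (mismatch).
cigar_identity() {
    printf '%s\n' "$1" | awk '{
        cigar = $0; matches = 0; mismatches = 0
        # Consume one <length><op> pair at a time from the front of the string.
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(cigar, 1, RLENGTH - 1) + 0
            op  = substr(cigar, RLENGTH, 1)
            if (op == "=") matches += len
            else if (op == "X") mismatches += len
            cigar = substr(cigar, RLENGTH + 1)
        }
        # Without = or X operations (e.g. only M) identity is undefined here.
        if (matches + mismatches == 0) { print "NA"; exit }
        printf "%.2f\n", 100 * matches / (matches + mismatches)
    }'
}

cigar_identity "90=5X5=10I20D"
```

A CIGAR made only of M operations (for example 10M) gives NA in this sketch, which is the situation the -c --eqx requirement avoids.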

PAF liftovers

I have a PAF and I want to subset it for just a particular region in the reference.

With rustybam it's easy:

rustybam liftover \
     --bed <(printf "chr1\t0\t250000000\n") \
     input.paf > trimmed.paf

But I also want the alignment statistics for the region.

No problem, rustybam liftover does not just trim the coordinates but also the CIGAR so it is ready for rustybam stats:

rustybam liftover \
    --bed <(printf "chr1\t0\t250000000\n") \
    input.paf \
    | rustybam stats --paf \
    > trimmed.stats.bed

Okay, but Evan asked for an "align slider" so I need to realign in chunks.

No need, just make your bed query to rustybam liftover a set of sliding windows and it will do the rest.

rustybam liftover \
    --bed <(bedtools makewindows -w 100000 \
        -b <(printf "chr1\t0\t250000000\n") \
        ) \
    input.paf \
    | rustybam stats --paf \
    > trimmed.stats.bed
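If you want to see what the window BED looks like, or you do not have bedtools at hand, the windows can be sketched with awk. The helper make_windows below is hypothetical (not part of rustybam or bedtools) and assumes non-overlapping, fixed-width windows with the last one clipped at the region end.

```shell
# Hypothetical helper: print fixed-width, non-overlapping BED windows.
# usage: make_windows <chrom> <start> <end> <width>
make_windows() {
    awk -v chrom="$1" -v start="$2" -v end="$3" -v width="$4" 'BEGIN {
        for (s = start + 0; s < end + 0; s += width) {
            e = s + width
            if (e > end + 0) e = end + 0   # clip the final window
            printf "%s\t%d\t%d\n", chrom, s, e
        }
    }'
}

# Small region so the output is easy to inspect.
make_windows chr1 0 250000 100000
```

The output can be passed to liftover the same way as above, for example with --bed <(make_windows chr1 0 250000000 100000).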

You can also use rustybam breakpaf to break PAF records at indels above a certain size to get more "miropeats"-like intervals.

rustybam breakpaf --max-size 1000 input.paf \
    | rustybam liftover \
    --bed <(printf "chr1\t0\t250000000\n") \
    | rustybam stats --paf \
    > trimmed.stats.bed

Yeah but how do I visualize the data?

Try out SafFire!

Split fastx files

Split a fasta file between stdout and two other files, one compressed and one uncompressed.

cat {input.fasta} | rustybam fasta-split two.fa.gz three.fa

Split a fastq file between stdout and two other files, one compressed and one uncompressed.

cat {input.fastq} | rustybam fastq-split two.fq.gz three.fq

Extract from a fasta

This tool is designed to mimic bedtools getfasta, but it also allows the fasta to be bgzipped.

samtools faidx {seq.fa(.gz)}
rb get-fasta --name --strand --bed {regions.of.interest.bed} --fasta {seq.fa(.gz)}

TODO

  • Finish implementing trim-paf.
  • Add a bedtools getfasta like operation that actually works with bgzipped input.
    • implement bed12/split
  • Allow sam or paf for operations:
    • make a sam header from a PAF file
    • convert sam record to paf record
    • convert paf record to sam record
  • Add D4 for Nucfreq.
  • Finish implementing suns.