fasta_windows 0.2.3

Make quick statistics in windows from a fasta file.
Documentation

fasta_windows

Written for Darwin Tree of Life chromosomal level genome assemblies. The executable takes a fasta formatted file and calculates some statistics of interest in windows:

  • GC content
  • GC proportion
  • GC skew
  • Proportion of G's, C's, A's, T's, N's
  • Shannon entropy
  • Di/tri/tetranucleotide shannon diversity
  • Di/tri/tetranucleotide frequency arrays

Output files can be visualised using fw_plot or grouped using fw_group.

Download

The easiest way to get fasta_windows is through conda/bioconda.

conda create -n fasta_windows -c bioconda fasta_windows

Usage

Fasta windows 0.2.3
Max Brown <mb39@sanger.ac.uk>
Quickly compute statistics over a fasta file in windows.

USAGE:
    fasta_windows [FLAGS] [OPTIONS] --fasta <fasta> --output <output>

FLAGS:
    -d, --description    Add an extra column to _windows.tsv output with fasta header descriptions.
    -h, --help           Prints help information
    -m, --masked         Consider only uppercase nucleotides in the calculations.
    -V, --version        Prints version information

OPTIONS:
    -f, --fasta <fasta>                The input fasta file.
    -o, --output <output>              Output filename for the TSV's (without extension).
    -w, --window_size <window_size>    Integer size of window for statistics to be computed over. [default: 1000]

Building

Building requires Rust.

git clone https://github.com/tolkit/fasta_windows
cd fasta_windows
cargo build --release
# ./target/release/fasta_windows is the executable
# show help
./target/release/fasta_windows --help

The default window size is 1kb.

Output

Output is now a tsv with bed-like format in the first three columns:

ID      start   end     GC_prop GC_skew Shannon_entropy Prop_Gs Prop_Cs Prop_As Prop_Ts Prop_Ns Dinucleotide_Shannon_false      Trinucleotide_Shannon_false Tetranucleotide_Shannon_false
SUPER_1 0       1000    0.452   -0.270  1.929   0.165   0.287   0.361   0.187   0       2.646   3.929   5.134
SUPER_1 1000    2000    0.34    -0.335  1.896   0.113   0.227   0.346   0.314   0       2.617   3.872   5.015
SUPER_1 2000    3000    0.388   -0.912  1.627   0.017   0.371   0.407   0.205   0       1.858   2.049   2.096
SUPER_1 3000    4000    0.634   -0.167  1.933   0.264   0.37    0.199   0.167   0       2.671   3.980   5.215
SUPER_1 4000    5000    0.591   -0.184  1.954   0.241   0.35    0.236   0.173   0       2.701   4.020   5.232
SUPER_1 5000    6000    0.599   -0.229  1.948   0.231   0.368   0.212   0.189   0       2.679   3.991   5.209
SUPER_1 6000    7000    0.596   -0.164  1.961   0.249   0.347   0.214   0.19    0       2.694   3.994   5.206
SUPER_1 7000    8000    0.602   -0.193  1.950   0.243   0.359   0.178   0.22    0       2.672   3.974   5.184
SUPER_1 8000    9000    0.453   -0.214  1.977   0.178   0.275   0.292   0.255   0       2.725   4.031   5.237

Also output (non-optional at the moment), are three more TSV's, which are the arrays of di/tri/tetranucleotide frequencies in each window. These files are large, especially as tetranucleotide frequencies will contain 4e4 columns. The kmers are sorted lexicographically from left -> right (AA(AA) to TT(TT)).

e.g. for dinucleotide frequencies:

ID	start	end	AA	AC	AG	AT	CA	CC	CG	CT	GA	GC	GG	GT	TA	TC	TG	TT
SUPER_1 0       1000    122     120     45      73      134     68      39      46      50      55      45      15      54      44 36       53
SUPER_1 1000    2000    140     83      32      90      85      54      22      66      30      25      19      39      91      65 40       118
SUPER_1 2000    3000    216     181     4       5       4       181     5       181     3       8       3       3       183     1  516
SUPER_1 3000    4000    40      61      54      44      80      137     86      66      54      99      76      35      24      73 48       22
SUPER_1 4000    5000    55      68      75      38      88      138     66      57      58      78      59      46      35      65 41       32
SUPER_1 5000    6000    32      71      63      46      85      137     71      75      65      66      65      34      30      94 31       34
SUPER_1 6000    7000    47      62      63      42      91      132     60      64      58      84      74      32      18      69 51       52
SUPER_1 7000    8000    29      49      64      35      67      143     52      97      58      82      72      31      24      85 55       56
SUPER_1 8000    9000    114     67      43      68      63      86      52      73      51      49      43      35      64      73 40       78
SUPER_1 9000    10000   97      97      44      63      72      95      50      67      46      44      33      46      85      49 42       69

Comments, updates & bugs

As of version 0.2.2, I've removed canonical kmers as an option; it was really computationally expensive and I couldn't think of a way to efficienty add it in. End users that wish this are pointed in the direction of fw_group, which will at some point soon provide this functionality.

The masked (-m) flag only affects GC content, GC proportion, GC skew, proportion of G's, C's, A's, T's, N's. Kmers are coerced to uppercase automatically. Shannon index counts only uppercase nucleotides.

Please use, test, and let me know if there are any bugs or features you want implemented. Either raise an issue, or email me (see email in usage).