fastats
CLI to generate statistics from FASTA files:
- Generates BED files for non-masked (A|C|G|T), soft-masked (a|c|g|t), and hard-masked regions (n|N), per sequence.
- Stores overall statistics (GC content, ratios of masked bases) to stdout and JSON.
Details
CLI to generate FASTA file statistics (masking, GC content, etc.).
Usage: fastats [OPTIONS] <FASTA_FILE>
Arguments:
<FASTA_FILE>
Options:
-o, --output-dir <OUTPUT_DIR>
The output directory for the BED and summary files. [default: .]
-q, --quiet
Do not print results on stdout.
--ignore-iupac
Enable this to avoid failing when encountering a sequence character that is not in ('A', 'C', 'T', 'G', 'N', 'a', 'c', 't', 'g', 'n').
--no-bed-output
Do not store masking regions into BED files.
--match-regex <SEQUENCE_MATCH_REGEX>
Regular expression to focus the analysis on sequences matching a specific regular expression. [default: .*]
-h, --help
Print help
-V, --version
Print version
Sample output
Bed file per sequence
For each sequence, BED files reporting the non-masked, soft-masked, and hard-masked regions are written. They use the simple three-column BED format (sequence name, start, end). Sample output:
chr9 0 10000
chr9 40529470 40529480
...
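To illustrate how masking classes map to three-column BED intervals, here is a minimal Python sketch. It is not the tool's actual implementation: the function names (`classify`, `mask_intervals`, `to_bed_lines`) and the class labels are hypothetical, and it assumes the standard BED convention of 0-based, half-open coordinates.

```python
from itertools import groupby

NON_MASKED = frozenset("ACGT")
SOFT_MASKED = frozenset("acgt")
HARD_MASKED = frozenset("nN")


def classify(base):
    """Return a (hypothetical) masking label for a single base."""
    if base in NON_MASKED:
        return "non_masked"
    if base in SOFT_MASKED:
        return "soft_masked"
    if base in HARD_MASKED:
        return "hard_masked"
    return "other"  # e.g. ambiguous IUPAC codes


def mask_intervals(sequence):
    """Group consecutive bases of the same class into (start, end) spans.

    Spans are 0-based and half-open, as in the BED format.
    """
    intervals = {}
    position = 0
    for label, run in groupby(sequence, key=classify):
        length = sum(1 for _ in run)
        intervals.setdefault(label, []).append((position, position + length))
        position += length
    return intervals


def to_bed_lines(name, spans):
    """Format spans as tab-separated three-column BED lines."""
    return [f"{name}\t{start}\t{end}" for start, end in spans]
```

For example, `to_bed_lines("chr9", mask_intervals("NNACgtnA")["hard_masked"])` yields one line per hard-masked run.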
Summary statistics
Summary statistics are printed to stdout and written to a summary.json file.
Sample output:
Usage examples
Get sorted list of sequence names
fastats hg38.fasta | jq '.[].sequence_name'
Calculate the overall sequence length
fastats hg38.fasta | jq '.[].sequence_length' | paste -sd+ | bc
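The same total can be computed from the JSON summary in Python. This is a sketch under assumptions: it presumes the JSON is a top-level array of per-sequence objects, each with a `sequence_length` field (as the jq filter above suggests), and the helper name `total_sequence_length` is hypothetical.

```python
import json


def total_sequence_length(summary_text):
    """Sum the sequence_length field over all records in the JSON summary.

    Assumes a top-level JSON array of per-sequence objects.
    """
    records = json.loads(summary_text)
    return sum(record["sequence_length"] for record in records)
```

The text passed in could come from the summary.json file or from the captured stdout of a fastats run, assuming both have the same structure.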
Print stats for all sequences without a _ in the name
fastats hg38.fasta --match-regex "[^_]*"
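Whether the pattern is matched against the whole sequence name or anywhere within it is not stated here. The following Python sketch shows why this matters for a pattern like `[^_]*`: under search-style (unanchored) matching it accepts every name, because it can match an empty or partial string, whereas an anchored form such as `^[^_]*$` only accepts names without an underscore. The sequence names are illustrative examples only.

```python
import re

# Illustrative sequence names (not taken from any fastats output).
names = ["chr1", "chr1_KI270706v1_random", "chrUn_GL000195v1"]

# Unanchored: the pattern may match any substring, including the empty one.
unanchored = [name for name in names if re.search(r"[^_]*", name)]

# Anchored: the whole name must consist of non-underscore characters.
anchored = [name for name in names if re.search(r"^[^_]*$", name)]
```

If the filter appears to let every sequence through, anchoring the expression with `^` and `$` is a reasonable thing to try.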
Notes
- The base n is counted as hard-masked, not soft-masked, so the sum of all non-masked, soft-masked, hard-masked, and non-supported IUPAC code bases equals the overall sequence length.
- Ambiguous IUPAC codes (i.e., any code except N, A, C, G, or T) are not supported. To ingest sequences containing such IUPAC codes, use --ignore-iupac.
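The counting rule in these notes can be checked with a short Python sketch. It only illustrates the invariant and is not the tool's own code; the function name `count_classes` and the category labels are hypothetical.

```python
from collections import Counter


def count_classes(sequence):
    """Count bases per masking class; every base falls into exactly one class."""
    counts = Counter()
    for base in sequence:
        if base in "ACGT":
            counts["non_masked"] += 1
        elif base in "acgt":
            counts["soft_masked"] += 1
        elif base in "nN":
            counts["hard_masked"] += 1  # lowercase n is hard-masked, not soft-masked
        else:
            counts["other_iupac"] += 1  # ambiguous codes such as R or Y
    return counts


counts = count_classes("ACgtnNRY")
# The four categories always add up to the sequence length.
total = sum(counts.values())
```

Because the categories are mutually exclusive, the total always equals the sequence length, which is the property the first note describes.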