Crate faiquery

faiquery

faiquery is a library for querying FASTA files using the FAI index format. It is designed to be fast and memory efficient, and is suitable for use in high-throughput applications.

It memory-maps the FASTA file and uses the FAI index to answer interval queries against it on demand.

Mutability

The IndexedFasta has the option of keeping an internal buffer for reading. This buffer is reused for all queries, and is cleared after each query. This is the default behaviour.

The default query method removes all newlines from the result and returns the sequence as a &[u8]. Because it writes into the internal buffer, it requires a mutable IndexedFasta.

However, if you need to keep the newlines, or need to keep memory usage low, you can use the query_buffer method instead. It returns a &[u8] directly from the memory map, which avoids copying the sequence into a buffer. It does not remove newlines from the resulting sequence.
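To make the trade-off between the two methods concrete, here is a minimal sketch of stripping newlines into a reusable buffer. It is not the crate's implementation; the function name strip_newlines_into is hypothetical and only illustrates why the buffered path needs mutable access while the raw slice does not.

```rust
/// Copy `raw` into `buf`, skipping line terminators.
/// Hypothetical helper for illustration; not part of faiquery.
/// `buf` is cleared first so the same allocation can serve every query.
fn strip_newlines_into<'a>(raw: &[u8], buf: &'a mut Vec<u8>) -> &'a [u8] {
    buf.clear();
    buf.extend(
        raw.iter()
            .copied()
            .filter(|&b| b != b'\n' && b != b'\r'),
    );
    buf.as_slice()
}
```

The returned slice borrows from the buffer, so it has to be dropped before the next call can reuse the buffer. That is the same constraint a mutable query imposes on its caller.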

Example

Here is an example fasta file:

example.fa

>chr1
ACCTACGATCGACTGATCGTAGCTAGCT
CATCGATCGTACGGACGATCGATCGGTT
CACACCGGGCATGACTGATCGGGGGCCC
ACGTGTGTGCAGCGCGCGGCGCGCGCGG
>chr2
TTTTGATCGATCGGCGGGCGCGCGCGGC
CAGATTCGGGCGCGATTATATATTAGCT
CGACGGCGACTCGAGCTACACGTCGGGC
GCGAGCGGGACGCGCGGCGCGCGCGGCC
AAAAAAATTTTTATATATTATTACGCGC
CGACTCAGTCGACTGGGGGCGCGCGCGC
AAACCACA

and its corresponding index file:

example.fa.fai

chr1	112	6	28	29
chr2	176	128	28	29
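The five tab-separated columns follow the standard FAI layout: sequence name, sequence length in bases, byte offset of the first base, bases per line, and bytes per line (including the newline). The sketch below shows how those fields turn a base interval into a byte range. It assumes 0-based, half-open intervals, which is consistent with the queries shown on this page. FaiRecord and parse_fai_line are hypothetical names for illustration, not the crate's IndexEntry API.

```rust
/// One record of a .fai file (illustrative only; not faiquery's IndexEntry).
struct FaiRecord {
    name: String,
    length: u64,     // number of bases in the sequence
    offset: u64,     // byte offset of the first base in the FASTA file
    line_bases: u64, // bases per line
    line_width: u64, // bytes per line, newline included
}

/// Parse one tab-separated .fai line; returns None if a field is missing or not a number.
fn parse_fai_line(line: &str) -> Option<FaiRecord> {
    let mut fields = line.trim_end().split('\t');
    let name = fields.next()?.to_string();
    let length: u64 = fields.next()?.parse().ok()?;
    let offset: u64 = fields.next()?.parse().ok()?;
    let line_bases: u64 = fields.next()?.parse().ok()?;
    let line_width: u64 = fields.next()?.parse().ok()?;
    Some(FaiRecord { name, length, offset, line_bases, line_width })
}

impl FaiRecord {
    /// Byte position in the FASTA file of the 0-based base `pos`.
    fn byte_of(&self, pos: u64) -> u64 {
        self.offset + (pos / self.line_bases) * self.line_width + pos % self.line_bases
    }

    /// Half-open byte range covering bases [start, end), embedded newlines included.
    fn byte_range(&self, start: u64, end: u64) -> Option<(u64, u64)> {
        if start >= end || end > self.length {
            return None;
        }
        Some((self.byte_of(start), self.byte_of(end - 1) + 1))
    }
}
```

For the chr1 record above, bases 0..40 map to bytes 6..47. That is 41 bytes, because the interval crosses one line break.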

Querying the FASTA file

Let’s start with the default behavior, which keeps an internal buffer and therefore requires a mutable IndexedFasta object.

use faiquery::{FastaIndex, IndexedFasta};
use anyhow::Result;

let index = FastaIndex::from_filepath("example_data/example.fa.fai")
    .expect("Could not read index file");
let mut faidx = IndexedFasta::new(index, "example_data/example.fa")
    .expect("Could not read FASTA file");

// Query the first 10 bases of chr1
let seq = faidx.query("chr1", 0, 10).unwrap();
assert_eq!(seq, b"ACCTACGATC");

// Query the first 10 bases of chr2
let seq = faidx.query("chr2", 0, 10).unwrap();
assert_eq!(seq, b"TTTTGATCGA");

// Query the first 40 bases of chr1
let seq = faidx.query("chr1", 0, 40).unwrap();

// The resulting sequence is 40 bases long
assert_eq!(seq.len(), 40);

// The resulting sequence has no newlines
let num_newlines = seq.iter().filter(|&&b| b == b'\n').count();
assert_eq!(num_newlines, 0);

Querying the FASTA file immutably

Let’s now show the immutable behavior, which does not keep an internal buffer and does not require a mutable IndexedFasta object.

use faiquery::{FastaIndex, IndexedFasta};
use anyhow::Result;

let index = FastaIndex::from_filepath("example_data/example.fa.fai")
    .expect("Could not read index file");
let faidx = IndexedFasta::new(index, "example_data/example.fa")
    .expect("Could not read FASTA file");

// Query the first 10 bases of chr1
let seq = faidx.query_buffer("chr1", 0, 10).unwrap();
assert_eq!(seq, b"ACCTACGATC");

// Query the first 10 bases of chr2
let seq = faidx.query_buffer("chr2", 0, 10).unwrap();
assert_eq!(seq, b"TTTTGATCGA");

// Query the first 40 bases of chr1
let seq = faidx.query_buffer("chr1", 0, 40).unwrap();

// The resulting sequence is 41 characters long
// This is because 1 newline is included
assert_eq!(seq.len(), 41);

// The resulting sequence has 1 newline
let num_newlines = seq.iter().filter(|&&b| b == b'\n').count();
assert_eq!(num_newlines, 1);

Structs

  • FastaIndex: a FASTA index, representing a FAI index file.
  • IndexEntry: a FASTA index entry, representing a single entry in a FAI index file.
  • IndexedFasta: an indexed FASTA file, that is, a FASTA file that has been indexed using the FAI format.