Function split_fa 

pub fn split_fa(args: &Args) -> Result<()>

Splits a non-gzipped FASTA file into multiple smaller FASTA files.

This function memory-maps the input FASTA file and locates the start of every record by scanning for FA_NEEDLE (typically >) with memchr_iter. It then divides the mapped content into chunks according to the specified SplitMode (either ChunkSize or NumFiles) and writes each chunk to a new file in the designated output directory, using a Rayon thread pool to process the chunks in parallel.
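The record scan and chunking step can be illustrated with a small, dependency-free sketch. It is not the crate's implementation: it swaps memchr_iter for a plain iterator over the bytes, and the helper names record_starts and chunk_ranges are hypothetical. It also assumes that a > only opens a record at the start of the file or directly after a newline, a guard the original description does not mention.

```rust
/// Byte that opens every FASTA header line (the docs call this FA_NEEDLE).
const FA_NEEDLE: u8 = b'>';

/// Returns the byte offset of every record header in `data`.
/// Assumption: a `>` counts only at offset 0 or right after a newline.
fn record_starts(data: &[u8]) -> Vec<usize> {
    data.iter()
        .enumerate()
        .filter(|&(i, &b)| b == FA_NEEDLE && (i == 0 || data[i - 1] == b'\n'))
        .map(|(i, _)| i)
        .collect()
}

/// Groups consecutive records into half-open byte ranges `(begin, end)`,
/// each covering at most `records_per_chunk` records.
fn chunk_ranges(starts: &[usize], len: usize, records_per_chunk: usize) -> Vec<(usize, usize)> {
    assert!(records_per_chunk > 0, "records_per_chunk must be positive");
    starts
        .chunks(records_per_chunk)
        .enumerate()
        .map(|(i, group)| {
            let begin = group[0];
            // A chunk ends where the next chunk's first record begins,
            // or at the end of the data for the last chunk.
            let end = starts
                .get((i + 1) * records_per_chunk)
                .copied()
                .unwrap_or(len);
            (begin, end)
        })
        .collect()
}
```

Because each range is just a pair of offsets into the mapped bytes, a chunk can be written with a single slice such as &data[begin..end], without copying or re-parsing individual records.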

§Arguments

  • args - A reference to an Args struct containing the input file path, output directory, number of threads, splitting mode, and an optional suffix.

§Returns

  • Result<()> - Ok(()) on successful completion, or an anyhow::Error if any operation (opening the file, memory mapping, creating the output directory, writing output files, or building the thread pool) fails.

§Errors

  • Returns an error if the input FASTA file cannot be opened or memory-mapped.
  • Returns an error if no FASTA records are found in the input file.
  • Returns an error if the output directory cannot be created.
  • Returns an error if SplitMode::NumFiles is 0.
  • Returns any std::io::Error that occurs while writing the output files.
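As a rough illustration of the validation cases listed above, the sketch below turns a split mode into a records-per-chunk count. Everything here is an assumption made for the example: the real SplitMode may carry different payload types, the function records_per_chunk is hypothetical, the error type is a plain String rather than anyhow::Error, and rejecting a chunk size of 0 is an extra guard that the documentation does not list.

```rust
/// Stand-in for the crate's SplitMode; the `usize` payloads are assumed.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SplitMode {
    /// Fixed number of records per output file.
    ChunkSize(usize),
    /// Fixed number of output files.
    NumFiles(usize),
}

/// Resolves a split mode into the number of records each chunk should hold.
fn records_per_chunk(mode: SplitMode, total_records: usize) -> Result<usize, String> {
    if total_records == 0 {
        return Err("no FASTA records found in the input file".to_string());
    }
    match mode {
        // Hypothetical guard: not listed in the documented error cases.
        SplitMode::ChunkSize(0) => Err("chunk size must be greater than 0".to_string()),
        SplitMode::ChunkSize(n) => Ok(n),
        SplitMode::NumFiles(0) => Err("number of files must be greater than 0".to_string()),
        // Ceiling division so that no more than `n` files are produced.
        SplitMode::NumFiles(n) => Ok((total_records + n - 1) / n),
    }
}
```

With 10 records and NumFiles(3), this yields 4 records per chunk, so the last file holds the remaining 2 records.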

§Parallelism

This function uses rayon for parallel processing of chunks, improving performance for large files. The number of threads is configured via args.threads.
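The idea of writing chunks concurrently can be sketched without external crates. The function itself uses a Rayon thread pool; the version below substitutes std::thread::scope purely so the example stays self-contained. The function name write_chunks, the chunk_{i}.fa naming scheme, and the round-robin assignment of chunks to workers are all assumptions for illustration.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;

/// Writes each byte range of `data` to its own file in `outdir`,
/// spreading the work over `threads` scoped worker threads.
fn write_chunks(
    data: &[u8],
    ranges: &[(usize, usize)],
    outdir: &Path,
    threads: usize,
) -> io::Result<Vec<PathBuf>> {
    fs::create_dir_all(outdir)?;
    let workers = threads.max(1);

    // Hypothetical naming scheme; the real output names may differ.
    let paths: Vec<PathBuf> = (0..ranges.len())
        .map(|i| outdir.join(format!("chunk_{i}.fa")))
        .collect();

    thread::scope(|scope| {
        // Spawn all workers first, then join them.
        let handles: Vec<_> = (0..workers)
            .map(|w| {
                let paths = &paths;
                scope.spawn(move || -> io::Result<()> {
                    // Worker `w` handles chunks w, w + workers, w + 2 * workers, ...
                    for i in (w..ranges.len()).step_by(workers) {
                        let (begin, end) = ranges[i];
                        fs::write(&paths[i], &data[begin..end])?;
                    }
                    Ok(())
                })
            })
            .collect();

        for handle in handles {
            // Propagate the first I/O error reported by any worker.
            handle.join().expect("worker thread panicked")?;
        }
        Ok::<(), io::Error>(())
    })?;

    Ok(paths)
}
```

Each worker only reads from the shared byte slice and writes to distinct files, so no locking is needed. A work-stealing pool such as Rayon's would balance uneven chunk sizes better than this fixed round-robin split.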

§Example

use anyhow::Result;
use std::path::PathBuf;
// Assuming Args and SplitMode are defined as in lib_iso_split example

fn main() -> Result<()> {
    let args = cli::Args {
        file: PathBuf::from("input.fa"),
        outdir: PathBuf::from("fa_chunks"),
        threads: 4,
        suffix: Some("part".to_string()),
        mode_chunk_size: Some(100), // Split into chunks of 100 records
        mode_num_files: None,
    };
    // split_fa(&args)?;
    println!("Successfully split FASTA file.");
    Ok(())
}