SEGUL
SEGUL is an ultrafast and memory-efficient command-line (CLI) application for working with sequence alignments. It is a cross-platform, single-executable app with zero runtime dependencies on macOS and Windows. On Linux, it relies only on a library provided by the OS. In short, you don't need to worry about dependencies. It is designed to handle the computational burden of operations on large genomic datasets, but it also provides convenient features for working on smaller datasets (e.g., Sanger datasets). In our tests, it consistently offers a faster and more memory-efficient alternative to existing applications for a wide variety of sequence alignment manipulations (see benchmark).
Features:
- Converting alignments to different formats.
- Concatenating alignments with partition settings.
- Splitting alignments by partitions.
- Filtering alignments based on minimal taxon completeness, alignment length, or numbers of parsimony informative sites.
- Computing alignment summary statistics.
- Getting sample IDs from a collection of alignments.
- Mapping sample distribution in a collection of alignments.
- Extracting sequences from a collection of alignments based on user-defined IDs (includes regular expression support).
- Batch renaming sequence IDs.
- Converting DNA sequences to amino acid sequences.
Supported sequence formats:
- NEXUS
- Relaxed PHYLIP
- FASTA
All of the formats are supported in interleaved and sequential versions. The app supports both DNA and amino acid sequences.
Supported partition formats:
- RAxML
- NEXUS
The NEXUS partition can be written as a charset block embedded in NEXUS formatted sequences or a separate file.
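To illustrate what the two partition formats record, here is a minimal Python sketch of concatenation with partition bookkeeping. It is not SEGUL's implementation: the function names are hypothetical, padding missing samples with `?` is an assumption, and the output only follows the general shape of a RAxML-style partition file and a NEXUS sets block.

```python
def concat_with_partitions(loci):
    """Concatenate per-locus alignments and record each locus's column range.

    `loci` maps a locus name to {sample_id: aligned_sequence}.
    Samples missing from a locus are padded with '?' (an assumption).
    """
    samples = sorted({sid for aln in loci.values() for sid in aln})
    concatenated = {sid: "" for sid in samples}
    ranges = []  # (locus_name, start, end), 1-based and inclusive
    start = 1
    for name, aln in loci.items():
        length = len(next(iter(aln.values())))
        for sid in samples:
            concatenated[sid] += aln.get(sid, "?" * length)
        ranges.append((name, start, start + length - 1))
        start += length
    return concatenated, ranges


def raxml_partition(ranges, datatype="DNA"):
    """Format ranges as RAxML-style lines, e.g. 'DNA, locus1 = 1-4'."""
    return "\n".join(f"{datatype}, {n} = {a}-{b}" for n, a, b in ranges)


def nexus_partition(ranges):
    """Format ranges as a NEXUS sets block with one charset per locus."""
    lines = ["#nexus", "begin sets;"]
    lines += [f"  charset {n} = {a}-{b};" for n, a, b in ranges]
    lines.append("end;")
    return "\n".join(lines)
```

The same list of ranges feeds both writers, which is why a concatenated alignment can be described by either partition format.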
Documentation: GitHub Wiki
Citation:
Handika, H. and Esselstyn, J. A. In prep. SEGUL: An ultrafast, memory efficient, and cross-platform alignment manipulation tool for phylogenomics.
Quick Links
- SEGUL
- Quick Links
- Supported Platforms
- Quick Start
- Installation
- Command Structure
- Input Options
- Datatype
- Output
- Converting alignments
- Concatenating alignments
- Splitting alignments by partitions
- Computing sequence summary statistics
- Getting sample IDs from a collection of alignments
- Map sample distribution in a collection of alignments
- Filtering alignments
- Extracting sequences in alignments
- Batch renaming sequence IDs
- Translating DNA sequences
- Logging
- Contribution
Supported Platforms
The app may work on any platform supported by Rust. Below is the list of operating systems that we have tested and on which the app is guaranteed to work:
- Linux
- macOS
- Windows
- Windows Subsystem for Linux (WSL)
:warning: SEGUL's modern terminal output requires a terminal application that supports UTF-8 encoding. On macOS and native Linux, the default terminal should support UTF-8 out of the box. For Windows (including WSL) users, we recommend Windows Terminal to ensure consistent terminal output. Windows Terminal requires a separate installation on Windows 10. It should come pre-installed on Windows 11.
Quick Start
The instructions below assume familiarity with command-line applications and only highlight common features that users may need for alignment manipulation and for generating sequence statistics. We provide more detailed instructions in the documentation.
Installation
Using pre-compiled binary
For a quick installation, we provide pre-compiled binaries on the release page. For WSL, either the ManyLinux or Linux binary should work. On our test system, the ManyLinux binary is a little faster. For native Linux, first check your GLIBC version, for example by running ldd --version.
If your system GLIBC is >=2.18, use the Linux binary. If it is lower, use the ManyLinux binary. The installation is similar to that of any other single-executable command-line app, such as the phylogenetic programs IQ-TREE or RAxML. You only need to make sure the path to the app is registered in your PATH environment variable, so that the app can be called from anywhere on your system (see instructions). If you still have issues running the program, try installing it with the package manager. That installation method will optimize the compiled binary for your system (see below).
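If Python is available, the GLIBC threshold above can also be checked programmatically. The sketch below is only an illustration: `pick_linux_binary` is a hypothetical helper, and `platform.libc_ver()` reports a version only on systems where it can detect glibc.

```python
import platform


def pick_linux_binary(glibc_version):
    """Return which pre-compiled binary to try for a GLIBC version string."""
    major, minor = (int(part) for part in glibc_version.split(".")[:2])
    # 2.18 is the threshold stated in the installation notes.
    return "Linux" if (major, minor) >= (2, 18) else "ManyLinux"


if __name__ == "__main__":
    libc, version = platform.libc_ver()
    if libc == "glibc" and version:
        print(f"GLIBC {version}: try the {pick_linux_binary(version)} binary")
    else:
        print("GLIBC not detected; this check only applies to native Linux")
```

Comparing `(major, minor)` tuples avoids the string-comparison trap in which "2.9" would sort after "2.18".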
ATTENTION!: For macOS users, when you run segul for the first time, macOS Gatekeeper will prevent the program from running because segul is not signed by Apple. Go to the Security settings to allow segul to run. More details are available on the Apple website.
Install from the Rust Package Manager
The Rust package manager is called Cargo. Cargo is easy to install (and to uninstall) and will help you manage the app (see details in the installation instructions). Installing SEGUL through Cargo is similar to installing it from source code, except that it only uses the stable version of the code. The source code is distributed through crates.io. The badge at the top of this Readme shows the latest version of the app available on crates.io.
After Cargo is installed on your computer, on a Linux system (including WSL), first install the C development toolkit: build-essential for Debian-based distributions (Debian, Ubuntu, PopOS, Linux Mint, etc.) or its equivalent on other Linux distributions:
On Windows, you only need to install the GNU compiler toolchain available using Rustup. Rustup is installed automatically when you install Cargo. To install the toolchain:
Then, install SEGUL with cargo install segul.
You could also install SEGUL from the GitHub repository. Learn more about SEGUL installation here.
Command Structure
The app's command structure is similar to git, gh-cli, or any other app that uses subcommands. The app file name is segul on Linux/macOS/WSL and segul.exe on Windows.
To check the available subcommands:
To check the available options and flags for each subcommand:
Across the app functions, most generic arguments are also available in a short format to save typing. For example, below we use short arguments to concatenate alignments in a directory named nexus-alignments:
Learn more about SEGUL command structure and expected behaviors for each argument here.
Input Options
The app has two input options: the standard input, --input (or -i in short format), and the directory input, --dir (or -d in short format). If your input files are all in a single directory, use the --dir or -d option and specify the file format:
When dealing with a single file, a more complex folder structure, or unusual file extensions, use the --input or -i option.
For a single file:
Multiple files in a directory using a wildcard:
Multiple files in multiple directories:
For unusual file extensions, or if the app fails to detect the file format, specify the input format:
Both of the input options are available in all subcommands. To keep it simple, the command examples below use --dir as an input.
Datatype
The app supports both DNA and amino acid sequences. It checks whether the sequences contain only valid IUPAC characters for the datatype. By default, the datatype is DNA. Use the option --datatype aa if your input is amino acid sequences. For example:
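As a rough illustration of the character check described above, the Python sketch below validates a sequence against a per-datatype alphabet. The exact character sets SEGUL accepts are not stated here, so the sets in the code (IUPAC codes plus `-` and `?` for gaps and missing data) are assumptions, and `invalid_chars` is a hypothetical helper.

```python
# Assumed alphabets: IUPAC codes plus gap ('-') and missing ('?') symbols.
DNA_CHARS = set("ACGTRYSWKMBDHVN") | set("-?")
AA_CHARS = set("ACDEFGHIKLMNPQRSTVWY") | set("BZX*") | set("-?")


def invalid_chars(sequence, datatype="dna"):
    """Return the sorted characters that fall outside the datatype's alphabet."""
    allowed = DNA_CHARS if datatype == "dna" else AA_CHARS
    return sorted(set(sequence.upper()) - allowed)


if __name__ == "__main__":
    # An amino acid sequence checked as DNA reports the offending characters.
    print(invalid_chars("MKVLEF"))        # checked as DNA
    print(invalid_chars("MKVLEF", "aa"))  # checked as amino acids
```

This shows why the datatype has to be declared: many amino acid letters are not valid nucleotide codes, so a protein alignment checked under the DNA default would be rejected.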
Output
Most functions save their output into a default directory. For example, the concat function creates a SEGUL-concat directory by default and saves its output files there. To specify the output directory, use the --output or -o option. For example:
The app avoids overwriting files with the same name. It checks whether such a file or directory exists and asks whether you would like to remove it. The app exits if you decide not to remove it.
Converting alignments
SEGUL can convert a single sequence file or multiple sequence files in a directory:
Concatenating alignments
To concatenate all alignments in a directory:
Splitting alignments by partitions
To split an alignment by partitions, you need the alignment file and the alignment partition in a separate file:
If no output directory is provided, segul will use the alignment name as the output directory. To provide the output directory name, use the --output or -o option.
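Conceptually, splitting is the inverse of concatenation: each partition names a column range, and each range becomes its own alignment. The Python sketch below assumes RAxML-style partition lines such as `DNA, locus1 = 1-4`; the function names are hypothetical and this is not SEGUL's code.

```python
import re

# Matches lines like "DNA, locus1 = 1-4" (datatype, name, start-end).
PARTITION_LINE = re.compile(r"^\s*\w+\s*,\s*(\S+)\s*=\s*(\d+)\s*-\s*(\d+)\s*$")


def parse_raxml_partition(text):
    """Parse RAxML-style partition lines into (name, start, end) tuples."""
    parts = []
    for line in text.splitlines():
        match = PARTITION_LINE.match(line)
        if match:
            name, start, end = match.groups()
            parts.append((name, int(start), int(end)))
    return parts


def split_alignment(alignment, partitions):
    """Slice every sequence by each 1-based, inclusive partition range."""
    return {
        name: {sid: seq[start - 1:end] for sid, seq in alignment.items()}
        for name, start, end in partitions
    }
```

Partition coordinates are 1-based and inclusive, so the slice starts at `start - 1` and stops at `end`.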
Computing sequence summary statistics
To generate sequence summary statistics of alignments in a directory:
Getting sample IDs from a collection of alignments
Suppose you have multiple alignments and want to know which samples are present across all of them. You can easily find out using segul. The app can find all the unique IDs across thousands of alignments within seconds.
It will generate a text file that contains all the unique IDs across your alignments.
Map sample distribution in a collection of alignments
If you would like to know how the samples are distributed across your alignments, add the --map flag when searching for unique IDs. It will generate both the unique IDs (in a text file) and the sample distribution (in a csv file).
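The idea behind the unique ID list and the sample distribution can be sketched in a few lines of Python. The csv layout below (one row per alignment, one column per sample, `true`/`false` cells) is an assumption made for illustration and may differ from what SEGUL writes; the function names are hypothetical.

```python
import csv
import io


def unique_ids(alignments):
    """Collect the sorted set of sample IDs across all alignments."""
    return sorted({sid for aln in alignments.values() for sid in aln})


def sample_map_csv(alignments):
    """Build a presence/absence table: one row per alignment, one column per ID."""
    ids = unique_ids(alignments)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["locus"] + ids)
    for name, aln in alignments.items():
        writer.writerow([name] + [str(sid in aln).lower() for sid in ids])
    return buffer.getvalue()
```

A table like this makes it easy to spot samples that are missing from many loci before filtering or concatenating.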
Filtering alignments
SEGUL provides multiple filtering parameters.
For example, to filter based on taxon completeness:
Other available parameters are multiple minimal taxon completeness values --npercent, alignment length --len, minimal number of parsimony informative sites --pinf, and minimal percentage of parsimony informative sites --percent-inf.
By default, the app copies files that match the parameter to a new folder. If you would like to concatenate the results instead, pass the --concat flag. All the options available for the concat function above are also available for concatenating filtered alignments.
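To make the filtering criteria concrete, the Python sketch below computes them for a small alignment. It uses the common definitions (taxon completeness as the fraction of all taxa present in an alignment, and a parsimony informative site as a column with at least two states that each occur in at least two sequences). Ignoring `-`, `?`, and `N` when counting states is an assumption, and the function names are hypothetical rather than SEGUL's own.

```python
from collections import Counter


def taxon_completeness(alignment, total_taxa):
    """Fraction of the dataset's taxa that are present in this alignment."""
    return len(alignment) / total_taxa


def parsimony_informative_sites(alignment, ignored="-?N"):
    """Count columns with at least two states that each occur at least twice."""
    count = 0
    for column in zip(*alignment.values()):
        states = Counter(c.upper() for c in column if c.upper() not in ignored)
        if sum(1 for n in states.values() if n >= 2) >= 2:
            count += 1
    return count


def keep_alignment(alignment, total_taxa, min_completeness=0.0,
                   min_length=0, min_pinf=0):
    """Return True when the alignment passes every threshold."""
    length = len(next(iter(alignment.values())))
    return (
        taxon_completeness(alignment, total_taxa) >= min_completeness
        and length >= min_length
        and parsimony_informative_sites(alignment) >= min_pinf
    )
```

An alignment that passes the check would be copied (or concatenated) in the workflow described above; one that fails would be left out.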
Extracting sequences in alignments
You can also extract sequences from a collection of alignments. This can be done by supplying a list of IDs directly on the command line or in a text file. The app looks for exact matches. You can also use a regular expression to search for matching IDs.
To extract sequences by supplying the IDs on the command line:
You can specify as many IDs as you would like. However, for a long list of IDs, it may be better to supply them in a text file. The file should contain only the ID list, one ID per line:
Then the command will be:
To use a regular expression:
The app uses the Rust regex library to parse regular expressions. The syntax is similar to Perl regular expressions (find out more here).
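The difference between exact ID matching and regular expression matching can be sketched in Python. Note that Python's `re` module is used here only for illustration: the Rust regex library does not support some Perl features such as lookaround and backreferences, so the pattern below sticks to syntax common to both. Whether SEGUL matches a pattern anywhere in the ID (as `search` does here) is an assumption, and the function names are hypothetical.

```python
import re


def extract_by_ids(alignment, ids):
    """Keep only the sequences whose ID exactly matches one of `ids`."""
    wanted = set(ids)
    return {sid: seq for sid, seq in alignment.items() if sid in wanted}


def extract_by_regex(alignment, pattern):
    """Keep only the sequences whose ID matches the regular expression."""
    regex = re.compile(pattern)
    return {sid: seq for sid, seq in alignment.items() if regex.search(sid)}


if __name__ == "__main__":
    aln = {
        "Genus_species1_random": "ACGT",
        "Genus_species2_random": "ACGA",
        "Other_taxon": "ACGG",
    }
    print(extract_by_ids(aln, ["Other_taxon"]))
    print(extract_by_regex(aln, r"^Genus_species\d+"))
```

A single pattern can stand in for a long ID list when the IDs share a naming scheme.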
Batch renaming sequence IDs
To rename sequence IDs in multiple alignments, you need to input the sequence IDs in tsv or csv format with a header. For example:
| Original_names | New_names |
|---|---|
| Genus_species1_random | Genus_species1_voucherID |
| Genus_species2_random | Genus_species2_voucherID |
To simplify this process, you can generate unique IDs for all of your alignments using the id subcommand.
Copy the IDs to Excel, then add the new names and the header row as above. Save the file as csv or tsv. The program infers the file format from the file extension. Use the file as the --names or -n input to rename the sequence IDs with the rename subcommand:
Example:
You can also change the output format by using --output-format or -F option.
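The renaming logic can be pictured as a lookup table built from the two-column names file. The Python sketch below is an illustration under stated assumptions: the delimiter is inferred from a `.tsv` or `.csv` extension, the first row is treated as the header, IDs that are absent from the table are left unchanged, and the function names are hypothetical.

```python
import csv
import io


def load_names(text, filename):
    """Read an original-to-new name table, inferring the delimiter from the extension."""
    delimiter = "\t" if filename.lower().endswith(".tsv") else ","
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader, None)  # skip the header row
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def rename_ids(alignment, names):
    """Apply the mapping; IDs without an entry keep their original name."""
    return {names.get(sid, sid): seq for sid, seq in alignment.items()}


if __name__ == "__main__":
    table = (
        "Original_names,New_names\n"
        "Genus_species1_random,Genus_species1_voucherID\n"
    )
    names = load_names(table, "names.csv")
    print(rename_ids({"Genus_species1_random": "ACGT"}, names))
```

Because the mapping is keyed on the original ID, the same names file can be applied to every alignment in a collection.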
Translating DNA sequences
List of supported NCBI Genetic Code Tables is available here.
To translate a DNA alignment to amino acids:
By default, the app uses the standard code table (NCBI Table 1). To set the translation table, use the --table option. For example, to translate DNA sequences using NCBI Table 2 (vertebrate mtDNA):
You can also set the reading frame using the --rf option:
To show all the table options, use the --show-tables flag:
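To show what the table and reading frame settings change, here is a minimal Python sketch of codon translation using NCBI Table 1 and Table 2. It is not SEGUL's implementation: translating `---` to `-`, translating codons with ambiguous bases to `X`, and dropping an incomplete trailing codon are all assumptions, and the function names are hypothetical.

```python
BASES = "TCAG"  # NCBI tables list codons in T, C, A, G order

# Amino acids for the 64 codons, from NCBI genetic code tables 1 and 2.
TABLES = {
    1: "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR" "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG",
    2: "FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR" "IIMMTTTTNNKKSS**" "VVVVAAAADDEEGGGG",
}


def codon_table(table=1):
    """Map each codon to its amino acid for the chosen NCBI table."""
    codons = (a + b + c for a in BASES for b in BASES for c in BASES)
    return dict(zip(codons, TABLES[table]))


def translate(sequence, table=1, reading_frame=1):
    """Translate a DNA sequence starting at a 1-based reading frame."""
    lookup = codon_table(table)
    seq = sequence.upper()[reading_frame - 1:]
    protein = []
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon == "---":
            protein.append("-")  # keep alignment gaps (assumption)
        else:
            protein.append(lookup.get(codon, "X"))  # unknown codon (assumption)
    return "".join(protein)
```

With these tables, TGA is a stop codon under Table 1 but codes for tryptophan under Table 2, which is why choosing the right table matters for mitochondrial data.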
Logging
Most information printed to the terminal is also written to the log file (named segul.log) in the current working directory. Unlike the terminal output, which we try to keep clean and limited to the most important information, the log file also contains the dates, times, and log level status. Each time you run the app, if the log file already exists in the same directory, the app appends the log output to it. Rename this file or move it to a different directory if you would like to keep a separate log file for each task.
Learn more about using SEGUL here.
Contribution
We welcome any kind of contribution, from issue reports and ideas to improve the app to code contributions. For ideas and issue reports, please post on the GitHub issues page. For code contributions, please fork the repository and send pull requests to this repo.