galah 0.4.2

Microbial genome dereplicator
<img src="images/galah_logo.png" alt="Galah logo" width="600"/>

- [Galah](#galah)
  - [Installation](#installation)
    - [Install through the bioconda package](#install-through-the-bioconda-package)
    - [Pre-compiled binary](#pre-compiled-binary)
    - [Compiling from source](#compiling-from-source)
    - [Development](#development)
    - [Dependencies](#dependencies)
  - [Usage](#usage)
    - [Precluster ANI](#precluster-ani)
  - [License](#license)

# Galah

[![Anaconda-Server Badge](https://anaconda.org/bioconda/galah/badges/version.svg)](https://anaconda.org/bioconda/galah)

Galah aims to be a more scalable metagenome assembled genome (MAG) dereplication
method. That is, it clusters microbial genomes together based on their average
nucleotide identity (ANI), and chooses a single member of each cluster as the
representative.

Galah uses a greedy clustering approach to speed up genome dereplication,
relative to e.g. [dRep](https://drep.readthedocs.io/), particularly when there
are many closely related genomes (i.e. >95% ANI). Generated cluster
representatives have 2 properties. If the ANI threshold was set to 99%, then:

1. Each representative is <99% ANI to each other representative.
2. All members are >=99% ANI to the representative.
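
As a minimal sketch, the two properties can be checked for any proposed clustering. The function name `satisfies_properties`, the dictionary layout and the toy ANI values are illustrative assumptions and are not part of Galah.

```python
def satisfies_properties(clusters, ani, threshold=99.0):
    """Check properties 1 and 2 for a clustering.

    clusters: dict mapping each representative to a list of its members.
    ani: function (a, b) -> ANI in percent (hypothetical, supplied by the caller).
    """
    reps = list(clusters)
    # Property 1: every pair of representatives is below the threshold.
    for i, a in enumerate(reps):
        for b in reps[i + 1:]:
            if ani(a, b) >= threshold:
                return False
    # Property 2: every member is at or above the threshold to its representative.
    for rep, members in clusters.items():
        for member in members:
            if member != rep and ani(rep, member) < threshold:
                return False
    return True


# Toy ANI values, made up for illustration.
TOY_ANI = {
    frozenset(("g1", "g2")): 99.4,
    frozenset(("g1", "g3")): 92.0,
    frozenset(("g2", "g3")): 91.5,
}


def toy_ani(a, b):
    return TOY_ANI[frozenset((a, b))]


# g2 is folded into g1's cluster: both properties hold.
print(satisfies_properties({"g1": ["g1", "g2"], "g3": ["g3"]}, toy_ani))
# g1 and g2 are both representatives despite 99.4% ANI: property 1 fails.
print(satisfies_properties({"g1": ["g1"], "g2": ["g2"], "g3": ["g3"]}, toy_ani))
```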

If [CheckM](https://ecogenomics.github.io/CheckM/) genome qualities were
specified, then the clusters have an additional property:

3. Each representative genome has a better quality score than other members of
   the cluster. Each genome is assigned a quality score based on the formula
   `completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000`,
   which is reduced from a quality formula described in
   [Parks et al. 2020](https://doi.org/10.1038/s41587-020-0501-8).
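
The quality formula can be written out directly. This is a sketch only: the function name `quality_score` and the example genome values are hypothetical, and completeness and contamination are assumed to be percentages as reported by CheckM.

```python
def quality_score(completeness, contamination, num_contigs, num_ambiguous_bases):
    """Quality score following the formula quoted in the text.

    completeness and contamination are assumed to be percentages;
    num_contigs and num_ambiguous_bases are raw counts.
    """
    return (
        completeness
        - 5 * contamination
        - 5 * num_contigs / 100
        - 5 * num_ambiguous_bases / 100000
    )


# Two hypothetical genomes: the second is more complete but far more fragmented.
tidy = quality_score(95.0, 2.0, 120, 3000)
fragmented = quality_score(97.0, 2.0, 900, 3000)
print(tidy > fragmented)
```

With these made-up values the less complete genome scores higher, because each extra 100 contigs costs 5 points.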

If CheckM qualities were not provided, then the following holds instead:

3. Each representative genome was specified to galah before other members of the
   cluster.

The overall greedy clustering approach was largely inspired by the work of
Donovan Parks, as described in [Parks et al. 2020](https://doi.org/10.1038/s41587-020-0501-8). It
operates in two steps. In the first step, a genome is assigned as a representative
if no genome of higher quality has >99% ANI to it. In the second step, each
non-representative genome is assigned to the representative genome it has the
highest ANI with.
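
The two steps described above can be sketched as follows. This is an illustration under stated assumptions, not Galah's implementation: the names `greedy_cluster` and `toy_ani` are hypothetical, genomes are assumed to arrive sorted from best to worst quality, each genome is compared only against already-chosen representatives (one reading of the first step), and a genome joins a cluster when its ANI is at or above the threshold.

```python
def greedy_cluster(genomes, ani, threshold=99.0):
    """Greedy two-step clustering sketch.

    genomes: list ordered from highest to lowest quality (assumption).
    ani: function (a, b) -> ANI in percent (hypothetical, supplied by the caller).
    Returns a dict mapping each representative to its members.
    """
    # Step 1: a genome becomes a representative if no earlier
    # (higher-quality) representative is within the threshold.
    representatives = []
    for genome in genomes:
        if all(ani(genome, rep) < threshold for rep in representatives):
            representatives.append(genome)

    # Step 2: every other genome joins the representative it is most similar to.
    clusters = {rep: [rep] for rep in representatives}
    for genome in genomes:
        if genome in clusters:
            continue
        best = max(representatives, key=lambda rep: ani(genome, rep))
        clusters[best].append(genome)
    return clusters


# Toy ANI values, made up for illustration.
TOY_ANI = {
    frozenset(("A", "B")): 99.5,
    frozenset(("A", "C")): 90.0,
    frozenset(("A", "D")): 99.1,
    frozenset(("B", "C")): 99.2,
    frozenset(("B", "D")): 98.0,
    frozenset(("C", "D")): 99.6,
}


def toy_ani(a, b):
    return TOY_ANI[frozenset((a, b))]


print(greedy_cluster(["A", "B", "C", "D"], toy_ani))
```

In this toy example `D` is within the threshold of both `A` and `C`, and the second step places it with `C` because that ANI is higher.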

## Installation

### Install through the bioconda package

Galah can be installed through the [bioconda](https://bioconda.github.io/user/install.html) conda channel. After initial setup of conda and the bioconda channel, install it with mamba (or conda):

```
mamba install galah
```

See the [details of the galah recipe](https://bioconda.github.io/recipes/galah/README.html) for more information.

Galah can also be used indirectly through the `cluster` subcommand of
[CoverM](https://github.com/wwood/CoverM), which is also available on bioconda.

### Pre-compiled binary

Galah can be installed by downloading statically compiled binaries, available on
the [releases page](https://github.com/wwood/Galah/releases).

Third-party dependencies listed below are required for this method.

### Compiling from source

Galah can also be installed from source, using the cargo build system after
installing [Rust](https://www.rust-lang.org/).

```
cargo install galah
```

Third-party dependencies listed below are required for this method.

### Development

To run an unreleased version of Galah, after installing
[Rust](https://www.rust-lang.org/):

```
git clone https://github.com/wwood/galah
cd galah
cargo run -- cluster ...etc...
```

Third-party dependencies listed below are required for this method.

### Dependencies

Some advanced usage of Galah requires third-party tools, which must be installed separately:

* [Dashing](https://github.com/dnbaker/dashing) v0.4.0
* [FastANI](https://github.com/ParBLiSS/FastANI) v1.31

## Usage

To cluster a set of genomes at 99% ANI:
```
galah cluster --genome-fasta-files /path/to/genome1.fna /path/to/genome2.fna --output-cluster-definition clusters.tsv
```
There are several other options for specifying genomes, ANI cutoffs, and so on.

The full usage is described on the [manual page](https://wwood.github.io/galah/galah-cluster.html), which can also be accessed on the command line by running `galah cluster --full-help`.
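
For downstream scripting, the cluster definition file can be loaded into a dictionary. The sketch below assumes each line of `clusters.tsv` holds a representative and one member separated by a tab; that layout is an assumption to verify against the manual page, and the function name `read_clusters` and the example contents are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Group members by representative.

    Assumes two tab-separated columns per line: representative, member.
    """
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        representative, member = row
        clusters[representative].append(member)
    return dict(clusters)


# Made-up file contents for illustration; a real run would use
# open("clusters.tsv") instead of io.StringIO.
example = (
    "/path/to/genome1.fna\t/path/to/genome1.fna\n"
    "/path/to/genome1.fna\t/path/to/genome2.fna\n"
)
clusters = read_clusters(io.StringIO(example))
for representative, members in clusters.items():
    print(representative, len(members))
```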

### Precluster ANI

Similar to dRep, Galah operates in two stages. In the first, a fast
pre-clustering distance is calculated between each pair of genomes using
[Dashing](https://github.com/dnbaker/dashing). In the second,
[FastANI](https://github.com/ParBLiSS/FastANI) is run only on genome pairs whose
precluster ANI is greater than the specified value, and only those pairs are
considered as potentially belonging to the same cluster. By default, the
precluster ANI is set at 95% and the final ANI is set at 99%.
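
The benefit of the two-stage design is that the slower comparison runs on only a fraction of the pairs. A minimal sketch of that filtering follows. It does not call Dashing or FastANI: `fast_estimate` and `exact_ani` are hypothetical stand-ins supplied by the caller, and the ANI values are made up.

```python
from itertools import combinations


def two_stage_pairs(genomes, fast_estimate, exact_ani,
                    precluster_ani=95.0, final_ani=99.0):
    """Return (pairs passing both stages, number of exact comparisons run)."""
    passing = []
    exact_calls = 0
    for a, b in combinations(genomes, 2):
        # Stage 1: cheap estimate for every pair; drop pairs not above the prethreshold.
        if fast_estimate(a, b) <= precluster_ani:
            continue
        # Stage 2: slower, more accurate comparison only for surviving pairs.
        exact_calls += 1
        if exact_ani(a, b) >= final_ani:
            passing.append((a, b))
    return passing, exact_calls


# Toy values: cheap estimates exist for all six pairs,
# exact values only for the two pairs that survive stage 1.
ESTIMATES = {
    frozenset(("g1", "g2")): 99.0,
    frozenset(("g1", "g3")): 80.0,
    frozenset(("g1", "g4")): 78.0,
    frozenset(("g2", "g3")): 81.0,
    frozenset(("g2", "g4")): 79.0,
    frozenset(("g3", "g4")): 96.5,
}
EXACT = {
    frozenset(("g1", "g2")): 99.3,
    frozenset(("g3", "g4")): 97.0,
}

pairs, exact_calls = two_stage_pairs(
    ["g1", "g2", "g3", "g4"],
    lambda a, b: ESTIMATES[frozenset((a, b))],
    lambda a, b: EXACT[frozenset((a, b))],
)
print(pairs, exact_calls)
```

Here only two of the six pairs reach the second stage, and one of them passes the final threshold.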

## License

Galah is made available under GPL3+. See LICENSE.txt for details. Copyright Ben
Woodcroft.

Developed by Ben Woodcroft at the [Centre for Microbiome Research, Queensland University of Technology](https://www.qut.edu.au/health/schools/school-of-biomedical-sciences/centre-for-microbiome-research).

[galah]: Eolophus_roseicapilla_-Wamboin,_NSW,_Australia_-juvenile-8.smaller.jpg