tidk 0.2.1

A toolkit for finding telomeric repeats in a genome.
Documentation
tidk-0.2.1 has been yanked.

A Telomere Identification toolKit (tidk)

tidk is a toolkit to identify and visualise telomeric repeats for the Darwin Tree of Life genomes. tidk works especially well on chromosomal genomes, but can also work on PacBio HiFi reads well (see the telomeric repeat database for many examples). There are a few modules in the tool, which may be useful to anyone investigating telomeric repeat sequences in a genome.

  1. explore - tries to find the telomeric repeat unit in the genome.
  2. find and search are essentially the same. They identify a repeat sequence in windows across the genome. find uses an in-built table of telomeric repeats, in search you supply your own.
  3. plot does what is says on the tin, and plots the csv output of find or search as an SVG.
  4. min returns the lexicographically minimal string of an input DNA string(s) or file of DNA sequences.
  5. trim trims fasta sequences using a supplied base repeat string.

Install

The easiest way to install is through conda:

conda install -c bioconda tidk

Otherwise...

As with other Rust projects, you have to complile yourself. Download rust, clone this repo, and then run:

cargo build --release

Compiling takes anywhere from 1-6 minutes from fresh (tested on the farm). The executable will be at the location ./target/release/tidk.

Usage

Overall

TIDK 0.1.5
Max Brown <mb39@sanger.ac.uk>
A Telomere Identification Toolkit.

USAGE:
    tidk [SUBCOMMAND]

OPTIONS:
    -h, --help       Print help information
    -V, --version    Print version information

SUBCOMMANDS:
    explore    Use a search of all substrings of length k to query a genome for a telomere
               sequence.
    find       Supply the name of a clade your organsim belongs to, and this submodule will find
               all telomeric repeat matches for that clade.
    help       Print this message or the help of the given subcommand(s)
    min        Emit the canonical lexicographically minimal DNA string.
    plot       SVG plot of CSV generated from search or find.
    search     Search the input genome with a specific telomeric repeat search string.
    trim       Trim a specific telomeric repeat from the input reads and yield reads oriented at
               the telomere start.

Explore

tidk explore will identify all sequences of length k, which repeat at least twice throughout a genome. Repeats of high number toward the beginning or end of sequences are likely candidates for telomeric repeats. The reported repeats are the lexicographically most minimal of all possible string rotations of the telomeric repeat, in both forward and reverse complement forms.

It outputs either a csv or bedgraph of potential telomeric repeats and their locations, in addition to a text file of the potential telomeric repeat sequences, and how often they are found in the genome.

For example: tidk explore --fasta fastas/iyBomHort1_1.20210303.curated_primary.fa --minimum 5 --maximum 12 -o test_dist -t 500 searches the genome for repeats from length 5 to length 12 sequentially (definite potential to be made concurrent) on the freshly minted Bombus hortorum genome.

tidk-explore 0.1.5
Use a search of all substrings of length k to query a genome for a telomere sequence.

USAGE:
    tidk explore [OPTIONS] --fasta <fasta>

OPTIONS:
    -d, --distance <distance>      The distance in base pairs from the beginning or end of a
                                   chromosome, to report potential telomeric repeats in. [default:
                                   150000]
    -e, --extension <extension>    The extension, defining the output type of the file. [default:
                                   tsv] [possible values: tsv, bedgraph]
    -f, --fasta <fasta>            The input fasta file.
    -h, --help                     Print help information
    -l, --length <length>          Length of substring.
    -m, --minimum <minimum>        Minimum length of substring. [default: 5]
    -o, --output <output>          Output filename for the TSVs (without extension). [default: tidk-
                                   explore]
    -t, --threshold <threshold>    Positions of repeats are only reported if they occur sequentially
                                   in a greater number than the threshold. [default: 100]
    -v, --verbose                  Print verbose output.
    -V, --version                  Print version information
    -x, --maximum <maximum>        Maximum length of substring. [default: 12]

Find

tidk find will take an input clade, and match the known telomeric repeat for that clade (or repeats plural) and search the genome. Uses the telomeric repeat database. As more telomeric repeats are found and added, the dictionary of sequences used will increase (perhaps there is a more elegant way to parse the command line input?).

tidk-find 0.1.5
Supply the name of a clade your organsim belongs to, and this submodule will find all telomeric
repeat matches for that clade.

USAGE:
    tidk find [OPTIONS]

OPTIONS:
    -c, --clade <clade>      The clade of organism to identify telomeres in. [possible values:
                             vertebrate, ascidian, echinodermata, mollusca, coleoptera, hymenoptera,
                             lepidoptera, megaloptera, trichoptera, neuroptera, blattodea,
                             orthoptera, nematoda, amoeba, plants, ciliates]
    -f, --fasta <fasta>      The input fasta file.
    -h, --help               Print help information
    -o, --output <output>    Output filename for the CSVs (without extension). [default: tidk-find]
    -p, --print              Print a table of clades, along with their telomeric sequences.
    -V, --version            Print version information
    -w, --window <window>    Window size to calculate telomeric repeat counts in. [default: 10000]

Search

tidk search will search the genome for an input string. If you know the telomeric repeat of your sequenced organism, this will hopefully find it.

tidk-search 0.1.5
Search the input genome with a specific telomeric repeat search string.

USAGE:
    tidk search [OPTIONS] --fasta <fasta> --string <string>

OPTIONS:
    -e, --extension <extension>    The extension, defining the output type of the file. [default:
                                   csv] [possible values: csv, bedgraph]
    -f, --fasta <fasta>            The input fasta file.
    -h, --help                     Print help information
    -o, --output <output>          Output filename for the CSVs (without extension). [default: tidk-
                                   search]
    -s, --string <string>          Supply a DNA string to query the genome with.
    -V, --version                  Print version information
    -w, --window <window>          Window size to calculate telomeric repeat counts in. [default:
                                   10000]

Plot

tidk plot will plot a CSV from the output of tidk search. Working on plotting for tidk find (i.e. extending to multiple telomeric repeat sequences in same CSV).

tidk-plot 0.1.5
SVG plot of CSV generated from search or find.

USAGE:
    tidk plot [OPTIONS] --csv <csv>

OPTIONS:
    -c, --csv <csv>          The input CSV file.
    -h, --height <height>    The height of subplots (px). [default: 200]
    -h, --help               Print help information
    -o, --output <output>    Output filename for the SVG (without extension). [default: tidk-plot]
    -V, --version            Print version information
    -w, --width <width>      The width of plot (px). [default: 1000]

As an example on the ol' Square Spot Rustic Xestia xanthographa:

tidk find -f fastas/ilXesXant1_1.20201023.curated_primary.fa -c lepidoptera -o Xes

tidk plot -c finder/Xes_telomeric_repeat_windows.csv -o ilXes -h 120 -w 800

Min

tidk min returns the lexicographically minimal DNA string given an input. Useful for universally comparing repeats using this tool. Surprisingly, it works on even chromosomal length DNA strings quickly, not that you would want to do that...

Examples:

  • tidk min AATGCG or process multiple tidk min AATGCG AAGGTTC GGTTAAT
  • tidk min -f input.fa proccesses input fasta, and outputs fasta. Otherwise tidk min -f input.txt reads lines and outputs.
  • echo "AATTGC" | tidk min, pipes work.
  • cat input.fasta | tidk min -x outputs a fasta. Otherwise cat input.txt | tidk min reads lines and outputs.

Note cat here is redundant and creates extra work, but it just shows the piping in action.

tidk-min 0.1.5
Emit the canonical lexicographically minimal DNA string.

USAGE:
    tidk min [OPTIONS] [DNA string]...

ARGS:
    <DNA string>...    Input DNA string. Multiple inputs allowed.

OPTIONS:
    -f, --file <file>    The input file.
    -h, --help           Print help information
    -V, --version        Print version information
    -x, --fasta          STDIN is in fasta format.

Trim

tidk trim - a rust port of https://github.com/pgonzale60/telomeric-trim. Thanks Pablo!

tidk-trim 0.1.5
Trim a specific telomeric repeat from the input reads and yield reads oriented at the telomere
start.

USAGE:
    tidk trim [OPTIONS] --fasta <fasta> --string <string>

OPTIONS:
    -f, --fasta <fasta>            The input fasta file.
    -h, --help                     Print help information
    -l, --min_len <min_len>        Minimum length of trimmed reads. [default: 1000]
    -m, --min_occur <min_occur>    Number of contiguous occurrences of telomeric repeat to start
                                   trimming. [default: 3]
    -o, --output <output>          Output filename for the trimmed fasta output. [default: tidk-
                                   trim]
    -s, --string <string>          Supply a DNA string to trim the reads with.
    -V, --version                  Print version information