Crate kbo

kbo is an approximate local aligner based on converting k-bounded matching statistics into a character representation of the underlying alignment sequence.

Currently, kbo supports two main operations:

  • kbo find matches the k-mers in a query sequence with the reference and reports the local alignment segments found within the reference. Find is useful for problems that can be solved with BLAST.
  • kbo map maps the query sequence against a reference sequence, and reports the nucleotide sequence of the alignment relative to the reference. Map solves the same problem as snippy and ska map.

kbo uses the Spectral Burrows-Wheeler Transform (SBWT) data structure, which allows efficient k-mer matching between a target and a query sequence and fast retrieval of the k-bounded matching statistic for each k-mer match.
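
To make the notion of a k-bounded matching statistic concrete, here is a brute-force sketch in Rust. It assumes a simplified reading of the definition: for each query position, the length (capped at k) of the longest substring ending there that also occurs in the reference. It does not use the SBWT and is not how kbo computes the statistic; the function name is hypothetical.

```rust
/// Naive k-bounded matching statistics (illustration only, not kbo's method).
///
/// ASSUMPTION: the statistic at query position `i` is taken to be the length,
/// capped at `k`, of the longest substring of `query` ending at `i` that also
/// occurs somewhere in `reference`.
fn k_bounded_matching_statistics(reference: &[u8], query: &[u8], k: usize) -> Vec<usize> {
    // Brute-force substring check; the SBWT answers this far more efficiently.
    let occurs = |pattern: &[u8]| -> bool {
        pattern.is_empty() || reference.windows(pattern.len()).any(|w| w == pattern)
    };

    (0..query.len())
        .map(|i| {
            // A match ending at `i` can be at most `i + 1` long and at most `k`.
            let max_len = k.min(i + 1);
            // Try the longest candidate first and stop at the first hit.
            (1..=max_len)
                .rev()
                .find(|&len| occurs(&query[i + 1 - len..=i]))
                .unwrap_or(0)
        })
        .collect()
}
```

With reference ACGTACGT, query ACGAACGT and k = 4, the function returns [1, 2, 3, 1, 1, 2, 3, 4]: the statistic drops at the substituted base and climbs back to k once enough matching characters have been read again.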

§Installing the kbo executable

See installation instructions at GitHub.

§Usage

kbo can be run directly on fasta files without an initial indexing step. Prebuilt indexes are supported via kbo build but are only relevant in kbo find analyses where the reference k-mers can be concatenated into a single contig.

kbo can read inputs compressed in the DEFLATE format (gzip, zlib, etc.). bzip2 and xz support can be enabled by adding the “bzip2” and “xz” feature flags to needletail in the kbo Cargo.toml.

§kbo find

To set up the example, download the fasta sequence of the Escherichia coli Nissle 1917 genome from the NCBI and the pks island gene sequences from GitHub. Example output was generated with versions ASM71459v1 and rev 43bbd36.

§Find gene sequence locations

In the directory containing the input files, run

kbo find --reference db.fasta GCF_000714595.1_ASM71459v1_genomic.fna
This will produce the output:

| query | ref | q.start | q.end | strand | length | mismatches | query.contig | ref.contig |
|---|---|---|---|---|---|---|---|---|
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 226708 | 227226 | + | 519 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 2289596 | 2290543 | + | 949 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3152039 | 3161660 | + | 9623 | 1 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3161701 | 3164301 | + | 2601 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3164311 | 3165180 | + | 870 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3165210 | 3165458 | + | 249 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3165462 | 3167857 | + | 2397 | 1 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3167905 | 3172701 | + | 4797 | 2 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3172751 | 3175783 | + | 3033 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3175827 | 3182327 | + | 6501 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3182338 | 3190258 | + | 7922 | 1 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3190320 | 3196124 | + | 5807 | 1 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3196155 | 3198614 | + | 2460 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3198627 | 3200856 | + | 2231 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3200891 | 3201403 | + | 513 | 1 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 4502887 | 4503405 | + | 519 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5145962 | 5147210 | + | 1249 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5147272 | 5149449 | + | 2179 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5351015 | 5351533 | + | 519 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5352280 | 5352503 | + | 224 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5354674 | 5356713 | + | 2040 | 1 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5381795 | 5381945 | + | 151 | 0 | db.fasta | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
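
Rows like the ones above can be consumed programmatically. The Rust sketch below assumes that kbo find writes one hit per line with the nine listed columns separated by tabs; the separator is an assumption that should be checked against the actual output. FindHit and parse_find_line are hypothetical names and are not part of the kbo API.

```rust
/// One row of `kbo find` output (hypothetical helper type, not part of kbo).
#[derive(Debug, PartialEq)]
struct FindHit {
    query: String,
    reference: String, // the `ref` column; `ref` is a Rust keyword
    q_start: u64,
    q_end: u64,
    strand: char,
    length: u64,
    mismatches: u64,
    query_contig: String,
    ref_contig: String,
}

/// Parses a single output line.
///
/// ASSUMPTION: columns are tab-separated and appear in the order
/// query, ref, q.start, q.end, strand, length, mismatches, query.contig, ref.contig.
/// Returns `None` for the header line or any line that does not fit this layout.
fn parse_find_line(line: &str) -> Option<FindHit> {
    let fields: Vec<&str> = line
        .trim_end_matches(&['\r', '\n'][..])
        .split('\t')
        .collect();
    if fields.len() != 9 {
        return None;
    }
    Some(FindHit {
        query: fields[0].to_string(),
        reference: fields[1].to_string(),
        // Numeric columns fail to parse on the header line, which yields `None`.
        q_start: fields[2].parse().ok()?,
        q_end: fields[3].parse().ok()?,
        strand: fields[4].chars().next()?,
        length: fields[5].parse().ok()?,
        mismatches: fields[6].parse().ok()?,
        query_contig: fields[7].to_string(),
        ref_contig: fields[8].to_string(),
    })
}
```

Parsed hits can then be filtered, for example by keeping only rows whose mismatches field is zero.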

§Find gene sequence locations with names

If you need to know which gene in db.fasta the matches are for, add the --detailed toggle:

kbo find --detailed --reference db.fasta GCF_000714595.1_ASM71459v1_genomic.fna
This replaces the query.contig column with the name of the contig:

| query | ref | q.start | q.end | strand | length | mismatches | query.contig | ref.contig |
|---|---|---|---|---|---|---|---|---|
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 226708 | 227226 | + | 519 | 0 | clbS-like_4ce09a | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 2289596 | 2289808 | + | 213 | 0 | clbR locus_tag=ECOK1_RS11410 product=“colibactin biosynthesis LuxR family transcriptional regulator ClbR” protein_id=WP_000357141.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 2289809 | 2290543 | + | 735 | 0 | clbA locus_tag=ECOK1_RS11415 product=“colibactin biosynthesis phosphopantetheinyl transferase ClbA” protein_id=WP_001217110.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3152039 | 3161660 | + | 9623 | 1 | clbB locus_tag=ECOK1_RS11405 product=“colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbB” protein_id=WP_001518711.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3161701 | 3164301 | + | 2601 | 0 | clbC locus_tag=ECOK1_RS11400 product=“colibactin polyketide synthase ClbC” protein_id=WP_001297908.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3164311 | 3165180 | + | 870 | 0 | clbD locus_tag=ECOK1_RS11395 product=“colibactin biosynthesis dehydrogenase ClbD” protein_id=WP_000982270.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3165210 | 3165458 | + | 249 | 0 | clbE locus_tag=ECOK1_RS11390 product=“colibactin biosynthesis aminomalonyl-acyl carrier protein ClbE” protein_id=WP_001297917.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3165462 | 3166592 | + | 1131 | 0 | clbF locus_tag=ECOK1_RS11385 product=“colibactin biosynthesis dehydrogenase ClbF” protein_id=WP_000337350.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3166589 | 3167857 | + | 1269 | 1 | clbG locus_tag=ECOK1_RS11380 product=“colibactin biosynthesis acyltransferase ClbG” protein_id=WP_000159201.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3167905 | 3172701 | + | 4797 | 2 | clbH locus_tag=ECOK1_RS11375 product=“colibactin non-ribosomal peptide synthetase ClbH” protein_id=WP_001304254.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3172751 | 3175783 | + | 3033 | 0 | clbI locus_tag=ECOK1_RS11370 product=“colibactin polyketide synthase ClbI” protein_id=WP_000829570.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3175827 | 3182327 | + | 6501 | 0 | clbJ locus_tag=ECOK1_RS11365 product=“colibactin non-ribosomal peptide synthetase ClbJ” protein_id=WP_001468003.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3180417 | 3181703 | + | 1287 | 2 | clbK locus_tag=ECOK1_RS11360 product=“colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbK” protein_id=WP_000222467.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3182338 | 3188802 | + | 6465 | 2 | clbK locus_tag=ECOK1_RS11360 product=“colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbK” protein_id=WP_000222467.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3186070 | 3187356 | + | 1287 | 1 | clbJ locus_tag=ECOK1_RS11365 product=“colibactin non-ribosomal peptide synthetase ClbJ” protein_id=WP_001468003.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3188795 | 3190258 | + | 1464 | 0 | clbL locus_tag=ECOK1_RS11355 product=“colibactin biosynthesis amidase ClbL” protein_id=WP_001297937.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3190320 | 3191759 | + | 1440 | 0 | clbM locus_tag=ECOK1_RS11350 product=“precolibactin export MATE transporter ClbM” protein_id=WP_000217768.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3191756 | 3196124 | + | 4370 | 1 | clbN locus_tag=ECOK1_RS11345 product=“colibactin non-ribosomal peptide synthetase ClbN” protein_id=WP_001327259.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3196155 | 3198614 | + | 2460 | 0 | clbO locus_tag=ECOK1_RS11340 product=“colibactin polyketide synthase ClbO” protein_id=WP_001029878.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3198627 | 3200141 | + | 1515 | 0 | clbP locus_tag=ECOK1_RS11335 product=“precolibactin peptidase ClbP” protein_id=WP_002430641.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3200134 | 3200856 | + | 723 | 0 | clbQ locus_tag=ECOK1_RS11330 product=“colibactin biosynthesis thioesterase ClbQ” protein_id=WP_000065646.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 3200891 | 3201403 | + | 513 | 1 | clbS locus_tag=ECOK1_RS11325 product=“colibactin self-protection protein ClbS” protein_id=WP_000290498.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 4502887 | 4503405 | + | 519 | 0 | clbS-like_4ce09a | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5145962 | 5147210 | + | 1249 | 0 | clbL locus_tag=ECOK1_RS11355 product=“colibactin biosynthesis amidase ClbL” protein_id=WP_001297937.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5147272 | 5148479 | + | 1208 | 0 | clbM locus_tag=ECOK1_RS11350 product=“precolibactin export MATE transporter ClbM” protein_id=WP_000217768.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5148478 | 5149449 | + | 972 | 0 | clbN locus_tag=ECOK1_RS11345 product=“colibactin non-ribosomal peptide synthetase ClbN” protein_id=WP_001327259.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5351015 | 5351533 | + | 519 | 0 | clbS-like_4ce09a | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5352280 | 5352503 | + | 224 | 0 | clbS-like_4ce09a | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5354674 | 5356713 | + | 2040 | 1 | clbN locus_tag=ECOK1_RS11345 product=“colibactin non-ribosomal peptide synthetase ClbN” protein_id=WP_001327259.1 | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |
| GCF_000714595.1_ASM71459v1_genomic.fna | db.fasta | 5381795 | 5381945 | + | 151 | 0 | clbS-like_4ce09a | NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome |

Note that the current implementation of --detailed significantly slows down the algorithm. Future versions of kbo may address this by incorporating colors in the index structure.

§Find containment of gene sequences in assembly

Alternatively, if you are only interested in whether the contigs in db.fasta are present in the assembly, run

kbo find --reference GCF_000714595.1_ASM71459v1_genomic.fna db.fasta
which will return:

| query | ref | q.start | q.end | strand | length | mismatches | query.contig | ref.contig |
|---|---|---|---|---|---|---|---|---|
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 513 | + | 513 | 1 | GCF_000714595.1_ASM71459v1_genomic.fna | clbS\|locus_tag=ECOK1_RS11325\|product=“colibactin self-protection protein ClbS”\|protein_id=WP_000290498.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 723 | + | 723 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbQ\|locus_tag=ECOK1_RS11330\|product=“colibactin biosynthesis thioesterase ClbQ”\|protein_id=WP_000065646.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 1515 | + | 1515 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbP\|locus_tag=ECOK1_RS11335\|product=“precolibactin peptidase ClbP”\|protein_id=WP_002430641.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 2460 | + | 2460 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbO\|locus_tag=ECOK1_RS11340\|product=“colibactin polyketide synthase ClbO”\|protein_id=WP_001029878.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 4368 | + | 4369 | 1 | GCF_000714595.1_ASM71459v1_genomic.fna | clbN\|locus_tag=ECOK1_RS11345\|product=“colibactin non-ribosomal peptide synthetase ClbN”\|protein_id=WP_001327259.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 1208 | + | 1208 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbM\|locus_tag=ECOK1_RS11350\|product=“precolibactin export MATE transporter ClbM”\|protein_id=WP_000217768.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 1440 | + | 1440 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbM\|locus_tag=ECOK1_RS11350\|product=“precolibactin export MATE transporter ClbM”\|protein_id=WP_000217768.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 1464 | + | 1464 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbL\|locus_tag=ECOK1_RS11355\|product=“colibactin biosynthesis amidase ClbL”\|protein_id=WP_001297937.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 6465 | + | 6465 | 2 | GCF_000714595.1_ASM71459v1_genomic.fna | clbK\|locus_tag=ECOK1_RS11360\|product=“colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbK”\|protein_id=WP_000222467.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 6501 | + | 6501 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbJ\|locus_tag=ECOK1_RS11365\|product=“colibactin non-ribosomal peptide synthetase ClbJ”\|protein_id=WP_001468003.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 3033 | + | 3033 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbI\|locus_tag=ECOK1_RS11370\|product=“colibactin polyketide synthase ClbI”\|protein_id=WP_000829570.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 4797 | + | 4797 | 2 | GCF_000714595.1_ASM71459v1_genomic.fna | clbH\|locus_tag=ECOK1_RS11375\|product=“colibactin non-ribosomal peptide synthetase ClbH”\|protein_id=WP_001304254.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 1269 | + | 1269 | 1 | GCF_000714595.1_ASM71459v1_genomic.fna | clbG\|locus_tag=ECOK1_RS11380\|product=“colibactin biosynthesis acyltransferase ClbG”\|protein_id=WP_000159201.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 1131 | + | 1131 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbF\|locus_tag=ECOK1_RS11385\|product=“colibactin biosynthesis dehydrogenase ClbF”\|protein_id=WP_000337350.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 249 | + | 249 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbE\|locus_tag=ECOK1_RS11390\|product=“colibactin biosynthesis aminomalonyl-acyl carrier protein ClbE”\|protein_id=WP_001297917.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 870 | + | 870 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbD\|locus_tag=ECOK1_RS11395\|product=“colibactin biosynthesis dehydrogenase ClbD”\|protein_id=WP_000982270.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 2601 | + | 2601 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbC\|locus_tag=ECOK1_RS11400\|product=“colibactin polyketide synthase ClbC”\|protein_id=WP_001297908.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 9621 | + | 9622 | 1 | GCF_000714595.1_ASM71459v1_genomic.fna | clbB\|locus_tag=ECOK1_RS11405\|product=“colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbB”\|protein_id=WP_001518711.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 213 | + | 213 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbR\|locus_tag=ECOK1_RS11410\|product=“colibactin biosynthesis LuxR family transcriptional regulator ClbR”\|protein_id=WP_000357141.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 735 | + | 735 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbA\|locus_tag=ECOK1_RS11415\|product=“colibactin biosynthesis phosphopantetheinyl transferase ClbA”\|protein_id=WP_001217110.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 519 | + | 519 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbS-like_4ce09a |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1 | 519 | + | 519 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbS-like_4ce09a |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 216 | 1464 | + | 1249 | 0 | GCF_000714595.1_ASM71459v1_genomic.fna | clbL\|locus_tag=ECOK1_RS11355\|product=“colibactin biosynthesis amidase ClbL”\|protein_id=WP_001297937.1 |
| db.fasta | GCF_000714595.1_ASM71459v1_genomic.fna | 1156 | 4167 | + | 3013 | 1 | GCF_000714595.1_ASM71459v1_genomic.fna | clbN\|locus_tag=ECOK1_RS11345\|product=“colibactin non-ribosomal peptide synthetase ClbN”\|protein_id=WP_001327259.1 |

§kbo map

kbo map can be used to align a query sequence against a reference sequence. This is useful, for example, for generating a reference-based alignment of multiple related genomes against a good reference assembly.

To run this example, download the genome sequence of the E. coli UTI89 strain from the NCBI (ASM1326v1).

§Reference-based alignment

Run

kbo map --reference GCF_000714595.1_ASM71459v1_genomic.fna GCF_000013265.1_ASM1326v1_genomic.fna > result.aln

which will write the alignment sequence to result.aln. Note that kbo map always writes to stdout.

If you have multiple sequences to align, either supply them as arguments to kbo map or process them using GNU parallel:

parallel -j 4 'kbo map --reference GCF_000714595.1_ASM71459v1_genomic.fna {}' < query_paths.txt > result.aln

kbo map also accepts the --threads argument, which parallelises either the index construction (single query) or the processing of the input files (multiple queries).

Modules§

  • Derandomizing noisy k-bounded matching statistics.
  • Converting alignment representations into various output formats.
  • Wrapper for using the sbwt API to build and query SBWT indexes.
  • Translating deterministic k-bounded matching statistics into alignments.

Structs§

Functions§

  • Builds an SBWT index from some fasta or fastq files.
  • Finds the k-mers from an SBWT index in a query fasta or fastq file.
  • Maps a query sequence against a reference sequence.
  • Matches a query fasta or fastq file against an SBWT index.