Crate kbo

kbo is an approximate local aligner based on converting k-bounded matching statistics into a character representation of the underlying alignment sequence.

Currently, kbo supports three main operations:

  • kbo call calls single- and multi-base substitutions, insertions, and deletions in a query sequence against a reference and reports their positions and sequences. Call is useful for problems that require .vcf files.
  • kbo find matches the k-mers in a query sequence with the reference and reports the local alignment segments found within the reference. Find is useful for problems that can be solved with blast.
  • kbo map maps the query sequence against a reference sequence, and reports the nucleotide sequence of the alignment relative to the reference. Map solves the same problem as snippy and ska map.

kbo uses the Spectral Burrows-Wheeler Transform (SBWT) data structure, which allows efficient k-mer matching between a target and a query sequence and fast retrieval of the k-bounded matching statistic for each k-mer match.
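
To make the notion of k-bounded matching statistics concrete, here is a naive sketch. It assumes the common definition (for each query position, the length of the longest match ending at that position that also occurs in the reference, capped at k) and uses plain substring search instead of an SBWT. The function name is hypothetical and this is not kbo's implementation.

```python
def kbounded_matching_statistics(query, reference, k):
    """Naive k-bounded matching statistics (illustration only, not kbo's code).

    ms[i] is the length of the longest suffix of query[:i + 1], capped at k,
    that occurs somewhere in reference.
    """
    ms = []
    for i in range(len(query)):
        best = 0
        # Try suffix lengths from the longest allowed (at most k) down to 1.
        for length in range(min(k, i + 1), 0, -1):
            if query[i + 1 - length : i + 1] in reference:
                best = length
                break
        ms.append(best)
    return ms


reference = "ACGTACGT"
query = "ACGTTCGT"  # differs from the reference by one substitution at index 4
print(kbounded_matching_statistics(query, reference, k=4))
# → [1, 2, 3, 4, 1, 1, 2, 3]
```

The values climb to k while the query agrees with the reference, drop at the substitution, and climb again afterwards. Drops and recoveries of this kind are the signal that an aligner built on matching statistics can translate into an alignment.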

§Installing kbo

§Usage

kbo can be run directly on fasta files without an initial indexing step. Prebuilt indexes are supported via kbo build but are only relevant in kbo find analyses where the reference k-mers can be concatenated into a single contig.

kbo can read inputs compressed in the DEFLATE format (gzip, zlib, etc.). bzip2 and xz support can be enabled by adding the “bzip2” and “xz” feature flags to needletail in the kbo Cargo.toml.

§kbo call

Set up the example by downloading the fasta file for the Streptococcus pneumoniae Spn23F genome from the NCBI and the S. pneumoniae 6952_7#3 assembly from the ENA.

§Calling variants in a reference genome

In the directory with the downloaded files, run

kbo call --reference GCF_000026665.1_ASM2666v1_genomic.fna GCA_001156685.2.fasta.gz > variants.vcf

This will write the variants in the vcf v4.4 format. The first 20 lines of the output are:

##fileformat=VCFv4.4
##contig=<ID=NC_011900.1,length=2221315>
##fileDate=20250324
##source=kbo-cli v0.1.1
##reference=GCF_000026665.1_ASM2666v1_genomic.fna
##phasing=none
#CHROM          POS     ID  REF  ALT  QUAL  FILTER  INFO   FORMAT  unknown
NC_011900.1     83      .   G    A    .     .       .      GT      1
NC_011900.1     845     .   A    C    .     .       .      GT      1
NC_011900.1     1064    .   G    A    .     .       .      GT      1
NC_011900.1     1981    .   G    A    .     .       .      GT      1
NC_011900.1     2392    .   C    T    .     .       .      GT      1
NC_011900.1     2746    .   C    T    .     .       .      GT      1
NC_011900.1     3236    .   T    C    .     .       .      GT      1
NC_011900.1     3397    .   A    G    .     .       .      GT      1
NC_011900.1     3993    .   C    T    .     .       .      GT      1
NC_011900.1     4335    .   AA   A    .     .       INDEL  GT      1
NC_011900.1     4504    .   C    A    .     .       .      GT      1
NC_011900.1     4861    .   A    G    .     .       .      GT      1
NC_011900.1     5007    .   A    T    .     .       .      GT      1
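
As a rough sketch of how the records in variants.vcf could be post-processed, the following reads VCF text and classifies each record by comparing the lengths of REF and ALT. The helper names are hypothetical, the sample records are copied from the output above, and multi-allelic ALT fields (comma-separated) are not handled.

```python
from collections import Counter


def parse_vcf_records(lines):
    """Yield (chrom, pos, ref, alt) from VCF text lines, skipping header lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # VCF is tab-separated; split() also copes with space-aligned columns.
        chrom, pos, _id, ref, alt = line.split()[:5]
        yield chrom, int(pos), ref, alt


def classify(ref, alt):
    """Classify a single-ALT record by the lengths of REF and ALT."""
    if len(ref) == len(alt):
        return "substitution"
    return "deletion" if len(ref) > len(alt) else "insertion"


sample = """\
##fileformat=VCFv4.4
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tunknown
NC_011900.1\t83\t.\tG\tA\t.\t.\t.\tGT\t1
NC_011900.1\t4335\t.\tAA\tA\t.\t.\tINDEL\tGT\t1
NC_011900.1\t4504\t.\tC\tA\t.\t.\t.\tGT\t1
"""

counts = Counter(classify(ref, alt) for _, _, ref, alt in parse_vcf_records(sample.splitlines()))
print(dict(counts))
```

To run this on a real file, pass an open file handle for variants.vcf to parse_vcf_records instead of the sample lines.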

§kbo find

First download the fasta sequence of the Escherichia coli Nissle 1917 genome from the NCBI and the pks island gene sequences from GitHub. Example output was generated with versions ASM71459v1 and rev 021e09f.

§Find gene sequence locations

In the directory containing the input files, run

kbo find --max-gap-len 100 --reference IHE3034_pks_island_genes.fasta GCF_000714595.1_ASM71459v1_genomic.fna
This will produce the output

query                                   ref                             q.start  q.end    strand  length  mismatches  gap_bases  gap_opens  identity  coverage  query.contig                                                            ref.contig
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2289596  2290543  +       948     0           0          0          100.00    1.90      NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  IHE3034_pks_island_genes.fasta
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2239798  2289162  -       49365   7           367        12         99.24     98.06     NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  IHE3034_pks_island_genes.fasta
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  5145962  5149449  +       3488    0           61         1          98.25     6.86      NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  IHE3034_pks_island_genes.fasta
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  5354674  5356713  +       2040    1           0          0          99.95     4.08      NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  IHE3034_pks_island_genes.fasta

§Find gene sequence locations with names

If you need to know which gene in IHE3034_pks_island_genes.fasta the matches are for, add the --detailed toggle:

kbo find --detailed --reference IHE3034_pks_island_genes.fasta GCF_000714595.1_ASM71459v1_genomic.fna
This replaces the ref.contig column with the name of the matching reference contig

query                                   ref                             q.start  q.end    strand  length  mismatches  gap_bases  gap_opens  identity  coverage  query.contig                                                            ref.contig
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2289596  2289808  +       213     0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbR|locus_tag=ECOK1_RS11410|product=“colibactin biosynthesis LuxR family transcriptional regulator ClbR”|protein_id=WP_000357141.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2289809  2290543  +       735     0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbA|locus_tag=ECOK1_RS11415|product=“colibactin biosynthesis phosphopantetheinyl transferase ClbA”|protein_id=WP_001217110.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2279541  2289162  -       9622    1           0          0          99.99     100.01    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbB|locus_tag=ECOK1_RS11405|product=“colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbB”|protein_id=WP_001518711.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2276900  2279500  -       2601    0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbC|locus_tag=ECOK1_RS11400|product=“colibactin polyketide synthase ClbC”|protein_id=WP_001297908.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2276021  2276890  -       870     0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbD|locus_tag=ECOK1_RS11395|product=“colibactin biosynthesis dehydrogenase ClbD”|protein_id=WP_000982270.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2275743  2275991  -       249     0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbE|locus_tag=ECOK1_RS11390|product=“colibactin biosynthesis aminomalonyl-acyl carrier protein ClbE”|protein_id=WP_001297917.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2274609  2275739  -       1131    0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbF|locus_tag=ECOK1_RS11385|product=“colibactin biosynthesis dehydrogenase ClbF”|protein_id=WP_000337350.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2273344  2274612  -       1269    1           0          0          99.92     100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbG|locus_tag=ECOK1_RS11380|product=“colibactin biosynthesis acyltransferase ClbG”|protein_id=WP_000159201.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2268500  2273296  -       4797    2           0          0          99.96     100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbH|locus_tag=ECOK1_RS11375|product=“colibactin non-ribosomal peptide synthetase ClbH”|protein_id=WP_001304254.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2265418  2268450  -       3033    0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbI|locus_tag=ECOK1_RS11370|product=“colibactin polyketide synthase ClbI”|protein_id=WP_000829570.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2258874  2265374  -       6501    0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbJ|locus_tag=ECOK1_RS11365|product=“colibactin non-ribosomal peptide synthetase ClbJ”|protein_id=WP_001468003.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2259498  2260784  -       1287    2           0          0          99.84     19.91     NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbK|locus_tag=ECOK1_RS11360|product=“colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbK”|protein_id=WP_000222467.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2252399  2258863  -       6465    2           0          0          99.97     100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbK|locus_tag=ECOK1_RS11360|product=“colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbK”|protein_id=WP_000222467.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2253845  2255131  -       1287    1           0          0          99.92     19.80     NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbJ|locus_tag=ECOK1_RS11365|product=“colibactin non-ribosomal peptide synthetase ClbJ”|protein_id=WP_001468003.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2250943  2252406  -       1464    0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbL|locus_tag=ECOK1_RS11355|product=“colibactin biosynthesis amidase ClbL”|protein_id=WP_001297937.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2249442  2250881  -       1440    0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbM|locus_tag=ECOK1_RS11350|product=“precolibactin export MATE transporter ClbM”|protein_id=WP_000217768.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2245077  2249445  -       4369    1           0          0          99.98     100.02    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbN|locus_tag=ECOK1_RS11345|product=“colibactin non-ribosomal peptide synthetase ClbN”|protein_id=WP_001327259.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2242587  2245046  -       2460    0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbO|locus_tag=ECOK1_RS11340|product=“colibactin polyketide synthase ClbO”|protein_id=WP_001029878.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2241060  2242574  -       1515    0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbP|locus_tag=ECOK1_RS11335|product=“precolibactin peptidase ClbP”|protein_id=WP_002430641.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2240345  2241067  -       723     0           0          0          100.00    100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbQ|locus_tag=ECOK1_RS11330|product=“colibactin biosynthesis thioesterase ClbQ”|protein_id=WP_000065646.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  2239798  2240310  -       513     1           0          0          99.81     100.00    NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbS|locus_tag=ECOK1_RS11325|product=“colibactin self-protection protein ClbS”|protein_id=WP_000290498.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  5145962  5147210  +       1249    0           0          0          100.00    85.31     NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbL|locus_tag=ECOK1_RS11355|product=“colibactin biosynthesis amidase ClbL”|protein_id=WP_001297937.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  5147272  5148479  +       1208    0           0          0          100.00    83.89     NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbM|locus_tag=ECOK1_RS11350|product=“precolibactin export MATE transporter ClbM”|protein_id=WP_000217768.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  5148478  5149449  +       972     0           0          0          100.00    22.25     NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbN|locus_tag=ECOK1_RS11345|product=“colibactin non-ribosomal peptide synthetase ClbN”|protein_id=WP_001327259.1
GCF_000714595.1_ASM71459v1_genomic.fna  IHE3034_pks_island_genes.fasta  5354674  5356713  +       2040    1           0          0          99.95     46.70     NZ_CP007799.1 Escherichia coli Nissle 1917 chromosome, complete genome  clbN|locus_tag=ECOK1_RS11345|product=“colibactin non-ribosomal peptide synthetase ClbN”|protein_id=WP_001327259.1

Note that the current implementation of --detailed slows down the algorithm. Future versions of kbo may address this by incorporating colors in the index structure.
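
Because a gene can produce several hits with --detailed (clbN appears four times in the example output), it can help to keep only the best hit per gene. The sketch below assumes the real output is tab-separated with the header row shown above, and that the gene name is the part of ref.contig before the first "|", as in these example headers. The function name is hypothetical, and the sample rows reuse values from the example with shortened contig names.

```python
import csv
import io


def best_hit_per_gene(tsv_text, min_identity=99.0):
    """Return {gene: highest coverage} over hits with identity >= min_identity."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    best = {}
    for row in reader:
        if float(row["identity"]) < min_identity:
            continue
        gene = row["ref.contig"].split("|")[0]  # assumption: name precedes the first "|"
        coverage = float(row["coverage"])
        if coverage > best.get(gene, 0.0):
            best[gene] = coverage
    return best


header = ["query", "ref", "q.start", "q.end", "strand", "length", "mismatches",
          "gap_bases", "gap_opens", "identity", "coverage", "query.contig", "ref.contig"]
query, ref = "GCF_000714595.1_ASM71459v1_genomic.fna", "IHE3034_pks_island_genes.fasta"
rows = [
    [query, ref, "2245077", "2249445", "-", "4369", "1", "0", "0", "99.98", "100.02",
     "NZ_CP007799.1", "clbN|locus_tag=ECOK1_RS11345"],
    [query, ref, "5148478", "5149449", "+", "972", "0", "0", "0", "100.00", "22.25",
     "NZ_CP007799.1", "clbN|locus_tag=ECOK1_RS11345"],
    [query, ref, "5145962", "5147210", "+", "1249", "0", "0", "0", "100.00", "85.31",
     "NZ_CP007799.1", "clbL|locus_tag=ECOK1_RS11355"],
]
sample = "\n".join("\t".join(fields) for fields in [header] + rows) + "\n"

best = best_hit_per_gene(sample)
print(best)
```

The identity threshold of 99.0 is an arbitrary choice for illustration; a suitable value depends on how divergent the query and reference are expected to be.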

§Find containment of gene sequences in assembly

Alternatively, if you are only interested in whether the contigs in IHE3034_pks_island_genes.fasta are present in the assembly, swap the reference and query from the command above and run

kbo find --reference GCF_000714595.1_ASM71459v1_genomic.fna IHE3034_pks_island_genes.fasta
which will return

query                           ref                                     q.start  q.end  strand  length  mismatches  gap_bases  gap_opens  identity  coverage  query.contig  ref.contig
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        513    -       513     1           0          0          99.81     0.01      clbS|locus_tag=ECOK1_RS11325|product=“colibactin self-protection protein ClbS”|protein_id=WP_000290498.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        723    -       723     0           0          0          100.00    0.01      clbQ|locus_tag=ECOK1_RS11330|product=“colibactin biosynthesis thioesterase ClbQ”|protein_id=WP_000065646.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        1515   -       1515    0           0          0          100.00    0.03      clbP|locus_tag=ECOK1_RS11335|product=“precolibactin peptidase ClbP”|protein_id=WP_002430641.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        2460   -       2460    0           0          0          100.00    0.05      clbO|locus_tag=ECOK1_RS11340|product=“colibactin polyketide synthase ClbO”|protein_id=WP_001029878.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        4368   -       4368    1           0          0          99.98     0.08      clbN|locus_tag=ECOK1_RS11345|product=“colibactin non-ribosomal peptide synthetase ClbN”|protein_id=WP_001327259.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        1208   +       1208    0           0          0          100.00    0.02      clbM|locus_tag=ECOK1_RS11350|product=“precolibactin export MATE transporter ClbM”|protein_id=WP_000217768.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        1440   -       1440    0           0          0          100.00    0.03      clbM|locus_tag=ECOK1_RS11350|product=“precolibactin export MATE transporter ClbM”|protein_id=WP_000217768.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        1464   -       1464    0           0          0          100.00    0.03      clbL|locus_tag=ECOK1_RS11355|product=“colibactin biosynthesis amidase ClbL”|protein_id=WP_001297937.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        6465   -       6465    2           0          0          99.97     0.12      clbK|locus_tag=ECOK1_RS11360|product=“colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbK”|protein_id=WP_000222467.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        6501   -       6501    0           0          0          100.00    0.12      clbJ|locus_tag=ECOK1_RS11365|product=“colibactin non-ribosomal peptide synthetase ClbJ”|protein_id=WP_001468003.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        3033   -       3033    0           0          0          100.00    0.06      clbI|locus_tag=ECOK1_RS11370|product=“colibactin polyketide synthase ClbI”|protein_id=WP_000829570.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        4797   -       4797    2           0          0          99.96     0.09      clbH|locus_tag=ECOK1_RS11375|product=“colibactin non-ribosomal peptide synthetase ClbH”|protein_id=WP_001304254.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        1269   -       1269    1           0          0          99.92     0.02      clbG|locus_tag=ECOK1_RS11380|product=“colibactin biosynthesis acyltransferase ClbG”|protein_id=WP_000159201.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        1131   -       1131    0           0          0          100.00    0.02      clbF|locus_tag=ECOK1_RS11385|product=“colibactin biosynthesis dehydrogenase ClbF”|protein_id=WP_000337350.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        249    -       249     0           0          0          100.00    0.00      clbE|locus_tag=ECOK1_RS11390|product=“colibactin biosynthesis aminomalonyl-acyl carrier protein ClbE”|protein_id=WP_001297917.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        870    -       870     0           0          0          100.00    0.02      clbD|locus_tag=ECOK1_RS11395|product=“colibactin biosynthesis dehydrogenase ClbD”|protein_id=WP_000982270.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        2601   -       2601    0           0          0          100.00    0.05      clbC|locus_tag=ECOK1_RS11400|product=“colibactin polyketide synthase ClbC”|protein_id=WP_001297908.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        9621   -       9621    1           0          0          99.99     0.18      clbB|locus_tag=ECOK1_RS11405|product=“colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbB”|protein_id=WP_001518711.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        213    +       213     0           0          0          100.00    0.00      clbR|locus_tag=ECOK1_RS11410|product=“colibactin biosynthesis LuxR family transcriptional regulator ClbR”|protein_id=WP_000357141.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1        735    +       735     0           0          0          100.00    0.01      clbA|locus_tag=ECOK1_RS11415|product=“colibactin biosynthesis phosphopantetheinyl transferase ClbA”|protein_id=WP_001217110.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  216      1464   +       1249    0           0          0          100.00    0.02      clbL|locus_tag=ECOK1_RS11355|product=“colibactin biosynthesis amidase ClbL”|protein_id=WP_001297937.1  GCF_000714595.1_ASM71459v1_genomic.fna
IHE3034_pks_island_genes.fasta  GCF_000714595.1_ASM71459v1_genomic.fna  1156     4167   +       3012    1           0          0          99.97     0.06      clbN|locus_tag=ECOK1_RS11345|product=“colibactin non-ribosomal peptide synthetase ClbN”|protein_id=WP_001327259.1  GCF_000714595.1_ASM71459v1_genomic.fna

§kbo map

kbo map can be used to align a query sequence against a reference sequence. This is useful, for example, for generating a reference-based alignment of multiple related genomes against a good reference assembly.

To run this example, download the genome sequence of the E. coli UTI89 strain from the NCBI (ASM1326v1) and E. coli Nissle 1917 (ASM71459v1).

§Reference-based alignment

Run

kbo map --reference GCF_000714595.1_ASM71459v1_genomic.fna GCF_000013265.1_ASM1326v1_genomic.fna > result.aln

which will write the alignment sequence to result.aln.
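
A quick sanity check on result.aln is to measure how much of the reference received an aligned base. The sketch below assumes that the file is fasta-formatted and that "-" marks reference positions without an aligned query base. Both are assumptions to verify against the actual output, and the helper names are hypothetical.

```python
def read_fasta(text):
    """Parse fasta-formatted text into {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def aligned_fraction(sequence, gap="-"):
    """Fraction of positions in the alignment sequence that are not gaps."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base != gap) / len(sequence)


toy = ">toy\nACGT--GT\nAC\n"  # toy record, not real kbo output
for name, sequence in read_fasta(toy).items():
    print(name, aligned_fraction(sequence))
```

For the real file, read result.aln into a string (for example with pathlib.Path("result.aln").read_text()) and pass it to read_fasta.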

Modules§

derandomize
Derandomizing noisy k-bounded matching statistics.
format
Converting alignment representations into various output formats.
gap_filling
Gap filling using matching statistics and SBWT interval lookups.
index
Wrapper for using the sbwt API to build and query SBWT indexes.
translate
Translating deterministic k-bounded matching statistics into alignments.
variant_calling
Call all variants between a query and a reference.

Structs§

BuildOpts
Options and parameters for SBWT construction.
CallOpts
Options and parameters for call.
FindOpts
Options and parameters for find.
MapOpts
Options and parameters for map.
MatchOpts
Options and parameters for matches.

Functions§

build
Builds an SBWT index from some fasta or fastq files.
call
Calls variants between a query and a reference sequence.
find
Finds the k-mers from an SBWT index in a query fasta or fastq file.
map
Maps a query sequence against a reference sequence.
matches
Matches a query fasta or fastq file against an SBWT index.