rustkmer 0.5.2 - Docs.rs

# Command Line Interface

Complete guide to using RustKmer's command line interface for k-mer counting and database operations.

## Overview

RustKmer exposes its k-mer operations (counting, querying, fuzzy search, comparison, and merging) through a command line interface (CLI). The CLI is multi-threaded, supports batch processing, and is designed to slot into bioinformatics pipelines.

## Installation

```bash
# Install from crates.io
cargo install rustkmer

# Or build from source
git clone https://github.com/rustkmer/rustkmer.git
cd rustkmer
cargo build --release
```

## Basic Usage

```bash
# Count k-mers from a file
rustkmer count -i input.fa -o output.rkdb

# Query a database
rustkmer query -d database.rkdb -q "ATCGATCGATCGATCGATCG"

# Get help
rustkmer --help
rustkmer count --help
```

## Commands

### `count` - Count K-mers

Count k-mers from genomic sequence files and create databases.

#### Basic Counting
```bash
# Count k-mers with default settings (k=21)
rustkmer count -i genome.fa -o genome_k21.rkdb

# Specify k-mer size
rustkmer count -i genome.fa -k 31 -o genome_k31.rkdb

# Use canonical k-mers (recommended for genomes)
rustkmer count -i genome.fa --canonical -o genome_canonical.rkdb
```
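
For readers new to canonical k-mers, the sketch below illustrates the usual convention: a k-mer and its reverse complement are treated as one key, represented by whichever is lexicographically smaller. This is plain Python for illustration, not RustKmer's implementation. The function names are hypothetical, and it assumes `--canonical` follows this common convention and that windows containing non-ACGT symbols are skipped.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_kmers(seq: str, k: int, use_canonical: bool = False) -> Counter:
    """Count every length-k window of seq."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k].upper()
        if set(kmer) - set("ACGT"):
            continue  # assumption: windows with N or other symbols are skipped
        counts[canonical(kmer) if use_canonical else kmer] += 1
    return counts
```

With the sequence `AACGTT` and `k=3`, plain counting yields four distinct 3-mers, while canonical counting folds them into two keys (`AAC` and `ACG`), each with count 2.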

#### Input Formats
```bash
# FASTA files
rustkmer count -i genome.fa -o genome.rkdb
rustkmer count -i genome.fa.gz -o genome.rkdb

# FASTQ files
rustkmer count -i reads.fq -o reads.rkdb
rustkmer count -i reads.fq.gz -o reads.rkdb

# Multiple files
rustkmer count -i chr1.fa -i chr2.fa -i chr3.fa -o genome.rkdb
```
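
To inspect an input file before counting, a small reader like the following can help confirm that a plain or gzip-compressed FASTA file parses as expected. It is a minimal sketch using only the Python standard library; the helper names are hypothetical and it says nothing about how RustKmer itself reads files.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def open_text(path: str) -> TextIO:
    """Open a plain or gzip-compressed file as text, chosen by the .gz suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

For example, `sum(len(seq) for _, seq in read_fasta("genome.fa.gz"))` gives the total number of bases in the file.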

#### Advanced Counting Options
```bash
# Specify thread count
rustkmer count -i genome.fa -o genome.rkdb --threads 8

# Enable verbose output
rustkmer count -i genome.fa -o genome.rkdb --verbose

# Create sorted database for faster querying
rustkmer count -i genome.fa -o genome.rkdb --sorted

# Create indexed database for very fast querying
rustkmer count -i genome.fa -o genome.rkdb --indexed

# Compress database to save disk space
rustkmer count -i genome.fa -o genome.rkdb --compress
```

#### Counting Examples
```bash
# Example 1: Basic genome analysis
rustkmer count \
  -i human_genome.fa \
  -o human_genome_k21.rkdb \
  --canonical \
  --threads 16 \
  --verbose

# Example 2: Large dataset processing
rustkmer count \
  -i large_dataset.fa.gz \
  -o large_dataset.rkdb \
  -k 31 \
  --canonical \
  --sorted \
  --compress \
  --threads 32

# Example 3: Fast counting for testing
rustkmer count \
  -i test_data.fa \
  -o test_data.rkdb \
  -k 13 \
  --threads 4
```

### `query` - Query Databases

Query k-mer databases for exact matches and retrieve counts.

#### Basic Querying
```bash
# Single k-mer query
rustkmer query -d database.rkdb -q "ATCGATCGATCGATCGATCG"

# Query from file
rustkmer query -d database.rkdb -f queries.txt

# Batch query from text file (one k-mer per line)
rustkmer query database.rkdb --batch kmer_list.txt

# Multiple queries
rustkmer query -d database.rkdb -q "ATCGATCGATCGATCGATCG" -q "GCTAGCTAGCTAGCTAGCTA"
```

#### Query Formats
```bash
# Query file format (one k-mer per line)
cat > queries.txt << EOF
ATCGATCGATCGATCGATCG
GCTAGCTAGCTAGCTAGCTA
TTTTTTTTTTTTTTTTTTTT
CCCCCCCCCCCCCCCCCCCC
EOF

rustkmer query -d database.rkdb -f queries.txt

# Batch query format (one k-mer per line, supports comments)
cat > kmer_list.txt << EOF
# This is a comment and will be ignored
ATCGATCGATCGATCGATCG
GCTAGCTAGCTAGCTAGCTA

# Empty lines are also ignored
TTTTTTTTTTTTTTTTTTTT
CCCCCCCCCCCCCCCCCCCC
EOF

rustkmer query database.rkdb --batch kmer_list.txt
```
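
The batch list rules described above (one k-mer per line, `#` comments and empty lines ignored) can be sketched in a few lines of Python. This is useful for checking a list before handing it to `rustkmer query`. The function names are hypothetical, and the length check assumes that every query should have the same length as the database's k.

```python
from typing import Iterable, List


def parse_kmer_list(lines: Iterable[str]) -> List[str]:
    """Return k-mers from a batch list, dropping blank lines and # comments."""
    kmers = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        kmers.append(line)
    return kmers


def check_lengths(kmers: List[str], k: int) -> List[str]:
    """Return the entries whose length differs from k."""
    return [kmer for kmer in kmers if len(kmer) != k]
```

Running `check_lengths(parse_kmer_list(open("kmer_list.txt")), 21)` lists any entries that would not fit a database built with `-k 21`.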

#### Output Formats
```bash
# Simple output (default)
rustkmer query -d database.rkdb -q "ATCGATCGATCGATCGATCG"
# Output: ATCGATCGATCGATCGATCG: 42

# Tab-separated output
rustkmer query -d database.rkdb -f queries.txt --output-format tsv

# JSON output
rustkmer query -d database.rkdb -f queries.txt --output-format json

# CSV output with headers
rustkmer query -d database.rkdb -f queries.txt --output-format csv --header
```
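
When the default output is consumed by a script, the `KMER: COUNT` lines shown above can be turned into a dictionary. The sketch below assumes exactly that one-line-per-k-mer layout and nothing more; the function name is hypothetical, and the layouts of the TSV, JSON, and CSV formats are not covered here.

```python
from typing import Dict, Iterable


def parse_simple_output(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'KMER: COUNT' lines into a dict, ignoring blank lines.

    Raises ValueError if a non-blank line does not end in an integer count.
    """
    counts = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split on the last colon so the count is always the final field.
        kmer, _, count = line.rpartition(":")
        counts[kmer.strip()] = int(count)
    return counts
```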

#### Query Examples
```bash
# Example 1: Basic gene analysis
rustkmer query \
  -d genome_database.rkdb \
  -q "ATCGATCGATCGATCGATCG" \
  --output-format json

# Example 2: High-throughput querying
rustkmer query \
  -d genome_database.rkdb \
  -f gene_kmers.txt \
  --output-format tsv \
  --threads 8

# Example 3: Batch query with multiple databases
for db in chr*.rkdb; do
  echo "Querying $db..."
  rustkmer query -d "$db" -f queries.txt -o "${db%.rkdb}_results.tsv"
done
```

### `fuzzy-query` - Fuzzy Search

Perform fuzzy k-mer searches with wildcards and distance constraints.

#### Basic Fuzzy Search
```bash
# Search with wildcards (N = any base)
rustkmer fuzzy-query \
  -d database.rkdb \
  -p "ATCGATCGNATCGATCG" \
  --max-matches 100

# Search with distance constraint
rustkmer fuzzy-query \
  -d database.rkdb \
  -p "ATCGATCGATCGATCGATCG" \
  --max-distance 2 \
  --max-matches 50
```
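
The two search modes above can be pictured with a brute-force scan over a table of counts. The sketch below treats `N` in the pattern as a wildcard and measures distance as the number of mismatching positions (Hamming distance). This guide does not state which distance metric `--max-distance` uses, so that choice is an assumption, as are the function names; a real index would avoid the linear scan.

```python
from typing import Dict, List, Tuple


def mismatches(pattern: str, kmer: str) -> int:
    """Count positions where kmer differs from pattern; 'N' matches any base."""
    if len(pattern) != len(kmer):
        raise ValueError("pattern and k-mer must have the same length")
    return sum(p != "N" and p != b for p, b in zip(pattern, kmer))


def fuzzy_search(
    counts: Dict[str, int],
    pattern: str,
    max_distance: int = 0,
    max_matches: int = 100,
) -> List[Tuple[str, int, int]]:
    """Return up to max_matches (kmer, count, distance) hits, closest first."""
    hits = []
    for kmer, count in counts.items():
        if len(kmer) != len(pattern):
            continue
        distance = mismatches(pattern, kmer)
        if distance <= max_distance:
            hits.append((kmer, count, distance))
    hits.sort(key=lambda hit: (hit[2], hit[0]))  # by distance, then alphabetically
    return hits[:max_matches]
```

With `counts = {"ACGT": 5, "ACGA": 2, "TTTT": 1}`, the pattern `ACGN` matches both `ACGA` and `ACGT` at distance 0, while `ACGT` with `max_distance=1` returns `ACGT` (distance 0) followed by `ACGA` (distance 1).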

#### Fuzzy Search Options
```bash
# Search with pattern containing multiple wildcards
rustkmer fuzzy-query \
  -d database.rkdb \
  -p "ATNNGNCGATCG" \
  --max-matches 1000

# Exhaustive search (slower but complete)
rustkmer fuzzy-query \
  -d database.rkdb \
  -p "ATCGATCGATCGATCGATCG" \
  --max-distance 3 \
  --max-matches 1000 \
  --exhaustive

# Output in different formats
rustkmer fuzzy-query \
  -d database.rkdb \
  -p "ATCGATCGATCGATCGATCG" \
  --output-format json \
  --include-distance
```

#### Fuzzy Examples
```bash
# Example 1: Search within distance 1 of a k-mer
rustkmer fuzzy-query \
  -d genome.rkdb \
  -p "ATCGATCGATCGATCGATCG" \
  --max-distance 1 \
  --max-matches 20 \
  --verbose

# Example 2: Search for similar sequences
rustkmer fuzzy-query \
  -d genome.rkdb \
  -p "ATCGATCGATCGATCGATCG" \
  --max-distance 2 \
  --exhaustive \
  --output-format json

# Example 3: High-throughput fuzzy search
rustkmer fuzzy-query \
  -d metagenome.rkdb \
  -p "ATNNGATCGATCG" \
  --max-matches 5000 \
  --threads 16 \
  -o fuzzy_results.json
```

### `info` - Database Information

Display information about k-mer databases.

#### Basic Info
```bash
# Show database information
rustkmer info -d database.rkdb

# Detailed information
rustkmer info -d database.rkdb --detailed

# Information for multiple databases
rustkmer info -d db1.rkdb -d db2.rkdb -d db3.rkdb
```

#### Info Examples
```bash
# Example 1: Quick database check
rustkmer info -d genome_k21.rkdb

# Example 2: Detailed analysis
rustkmer info -d genome_k31.rkdb --detailed

# Example 3: Batch database analysis
for db in *.rkdb; do
  echo "=== $db ==="
  rustkmer info -d "$db"
  echo
done
```

### `compare` - Compare Databases

Compare two k-mer databases and find similarities/differences.

#### Basic Comparison
```bash
# Compare two databases
rustkmer compare -d1 database1.rkdb -d2 database2.rkdb

# Comparison with statistics
rustkmer compare \
  -d1 genome1.rkdb \
  -d2 genome2.rkdb \
  --statistics \
  --output comparison_results.txt
```

#### Advanced Comparison
```bash
# Detailed comparison with threshold
rustkmer compare \
  -d1 sample1.rkdb \
  -d2 sample2.rkdb \
  --min-count 10 \
  --similarity-threshold 0.8 \
  --output detailed_comparison.txt

# Export common k-mers
rustkmer compare \
  -d1 control.rkdb \
  -d2 treatment.rkdb \
  --export-common common_kmers.txt

# Export unique k-mers
rustkmer compare \
  -d1 sample1.rkdb \
  -d2 sample2.rkdb \
  --export-unique1 unique_to_sample1.txt \
  --export-unique2 unique_to_sample2.txt
```
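
Conceptually, a comparison reduces to set operations on the k-mers present in each database. The sketch below computes the common and unique sets after a minimum-count filter, plus a Jaccard index as the similarity score. The guide does not say which metric `--similarity-threshold` is applied to, so Jaccard is an assumption here, and the function names are hypothetical.

```python
from typing import Dict, Set


def present(counts: Dict[str, int], min_count: int = 1) -> Set[str]:
    """K-mers whose count is at least min_count."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


def compare(a: Dict[str, int], b: Dict[str, int], min_count: int = 1) -> dict:
    """Return common/unique k-mer sets and the Jaccard index of two count tables."""
    set_a, set_b = present(a, min_count), present(b, min_count)
    common = set_a & set_b
    union = set_a | set_b
    return {
        "common": common,
        "unique1": set_a - set_b,
        "unique2": set_b - set_a,
        # Defined as 0.0 in this sketch when both sets are empty.
        "jaccard": len(common) / len(union) if union else 0.0,
    }
```

Raising `min_count` removes low-abundance k-mers (often sequencing errors) from both sides before the sets are compared, which can change the similarity score noticeably.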

#### Compare Examples
```bash
# Example 1: Basic genome comparison
rustkmer compare \
  -d1 human_genome.rkdb \
  -d2 mouse_genome.rkdb \
  --statistics

# Example 2: Differential analysis
rustkmer compare \
  -d1 control_group.rkdb \
  -d2 treatment_group.rkdb \
  --min-count 50 \
  --export-common shared_kmers.txt \
  --export-unique2 treatment_specific.txt

# Example 3: Detailed similarity report for two samples
rustkmer compare \
  -d1 sampleA.rkdb \
  -d2 sampleB.rkdb \
  --similarity-threshold 0.9 \
  --output similarity_report.txt \
  --detailed
```

### `merge` - Merge Databases

Merge multiple k-mer databases into a single database.

#### Basic Merging
```bash
# Merge two databases
rustkmer merge -d1 db1.rkdb -d2 db2.rkdb -o merged.rkdb

# Merge multiple databases
rustkmer merge \
  -d1 chr1.rkdb \
  -d2 chr2.rkdb \
  -d3 chr3.rkdb \
  -o complete_genome.rkdb
```

#### Merge Options
```bash
# Merge with specific k-mer size (must match)
rustkmer merge \
  -d1 sample1.rkdb \
  -d2 sample2.rkdb \
  -o merged.rkdb \
  -k 21

# Merge and sort result
rustkmer merge \
  -d1 part1.rkdb \
  -d2 part2.rkdb \
  -o complete.rkdb \
  --sort

# Merge with compression
rustkmer merge \
  -d1 batch1.rkdb \
  -d2 batch2.rkdb \
  -o final.rkdb \
  --compress
```
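
A merge can be thought of as adding up the counts of each k-mer across inputs, which only makes sense when every input uses the same k. The sketch below shows that idea with in-memory tables. Summing counts is an assumption about what `merge` does, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict


def merge_counts(*tables: Dict[str, int]) -> Counter:
    """Sum per-k-mer counts across tables, rejecting mixed k-mer sizes."""
    merged = Counter()
    k = None
    for table in tables:
        for kmer, n in table.items():
            if k is None:
                k = len(kmer)
            elif len(kmer) != k:
                raise ValueError(f"k-mer size mismatch: expected {k}, got {len(kmer)}")
            merged[kmer] += n
    return merged
```

Calling `sorted(merge_counts(t1, t2).items())` gives the merged entries in lexicographic order, which is the in-memory analogue of a sorted database.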

#### Merge Examples
```bash
# Example 1: Combine chromosome databases
rustkmer merge \
  -d1 chr1.rkdb -d2 chr2.rkdb -d3 chr3.rkdb \
  -d4 chr4.rkdb -d5 chr5.rkdb \
  -o genome_complete.rkdb \
  --sort

# Example 2: Merge batch processing results
rustkmer merge \
  -d1 batch1.rkdb \
  -d2 batch2.rkdb \
  -d3 batch3.rkdb \
  -o all_batches.rkdb \
  --compress \
  --verbose

# Example 3: Create consensus database
rustkmer merge \
  -d1 sample1.rkdb \
  -d2 sample2.rkdb \
  -d3 sample3.rkdb \
  -o consensus.rkdb \
  --sort \
  --threads 8
```

## Global Options

These options are available for all commands:

### Verbosity and Output
```bash
# Verbose output
rustkmer count -i input.fa -o output.rkdb --verbose

# Quiet mode (minimal output)
rustkmer count -i input.fa -o output.rkdb --quiet

# Progress bar
rustkmer count -i input.fa -o output.rkdb --progress
```

### Threading
```bash
# Auto-detect threads (default)
rustkmer count -i input.fa -o output.rkdb

# Specify thread count
rustkmer count -i input.fa -o output.rkdb --threads 8

# Single-threaded
rustkmer count -i input.fa -o output.rkdb --threads 1
```

### Configuration
```bash
# Use configuration file
rustkmer --config config.toml count -i input.fa -o output.rkdb

# Set working directory
rustkmer --working-dir /path/to/work count -i input.fa -o output.rkdb
```

## Configuration Files

Create a TOML configuration file to store default settings:

```toml
# rustkmer.toml
[general]
default_threads = 8
default_k = 21
working_directory = "/data/rustkmer"

[counting]
canonical = true
sort = true
compress = false

[querying]
default_output_format = "tsv"
include_zero_counts = false

[fuzzy_search]
max_default_distance = 2
max_default_matches = 100
```

Pass the file to any command with `--config`:
```bash
rustkmer --config rustkmer.toml count -i input.fa -o output.rkdb
```
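
A configuration file can be sanity-checked before use with Python's TOML parser. The sketch below is a hypothetical pre-flight check, not part of RustKmer: it assumes the key names from the example file above and treats the three `--output-format` values used in this guide (`tsv`, `json`, `csv`) as the allowed set.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API

# Values this guide passes to --output-format; assumed to be the allowed set.
KNOWN_FORMATS = {"tsv", "json", "csv"}


def validate_config(text: str) -> list:
    """Parse TOML text and return a list of problems (empty if none were found)."""
    config = tomllib.loads(text)
    problems = []
    general = config.get("general", {})
    for key in ("default_threads", "default_k"):
        value = general.get(key, 1)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            problems.append(f"general.{key} must be a positive integer")
    fmt = config.get("querying", {}).get("default_output_format", "tsv")
    if fmt not in KNOWN_FORMATS:
        problems.append(
            f"querying.default_output_format is not one of {sorted(KNOWN_FORMATS)}"
        )
    return problems
```

Syntax errors in the file surface as `tomllib.TOMLDecodeError` from `tomllib.loads`, so malformed TOML is reported before any value checks run.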

## Performance Tips

### Counting Performance
```bash
# Use optimal thread count
THREADS=$(nproc)
rustkmer count -i large_file.fa -o output.rkdb --threads $THREADS

# Use sorted databases for better query performance
rustkmer count -i input.fa -o output.rkdb --sorted

# Use compression for storage efficiency
rustkmer count -i input.fa -o output.rkdb --compress

# Choose appropriate k-mer size
rustkmer count -i input.fa -o output.rkdb -k 21  # Balanced
rustkmer count -i input.fa -o output.rkdb -k 13  # Faster, less memory
rustkmer count -i input.fa -o output.rkdb -k 31  # Slower, more specific
```
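
The trade-off between k-mer sizes follows from how quickly the space of possible k-mers grows. The sketch below computes the number of possible k-mers over the DNA alphabet, the number of canonical k-mers, and the bits needed to pack one k-mer at 2 bits per base. These are upper bounds from combinatorics, not measurements of RustKmer's memory use, and 2-bit packing is an assumption about the encoding.

```python
def kmer_space(k: int) -> int:
    """Number of distinct k-mers over the alphabet {A, C, G, T}."""
    return 4 ** k


def canonical_kmer_space(k: int) -> int:
    """Number of distinct canonical k-mers.

    For odd k no k-mer equals its own reverse complement, so the space halves.
    For even k there are 4**(k/2) self-complementary k-mers that pair with themselves.
    """
    if k % 2 == 1:
        return 4 ** k // 2
    return (4 ** k + 4 ** (k // 2)) // 2


def packed_bits(k: int) -> int:
    """Bits needed to store one k-mer at 2 bits per base."""
    return 2 * k


for k in (13, 21, 31):
    print(k, kmer_space(k), canonical_kmer_space(k), packed_bits(k))
```

The number of distinct k-mers actually observed is also bounded by the input length, so small k saturates quickly (about 67 million possible 13-mers), while k=31 still fits in a single 64-bit word under 2-bit packing.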

### Query Performance
```bash
# Batch queries for better performance
rustkmer query -d database.rkdb -f large_query_file.txt --threads 8

# Use appropriate output format
rustkmer query -d database.rkdb -f queries.txt --output-format tsv  # Fast
rustkmer query -d database.rkdb -f queries.txt --output-format json  # Slower but more detailed

# Use sorted/indexed databases for frequent querying
rustkmer count -i input.fa -o indexed.rkdb --indexed
rustkmer query -d indexed.rkdb -f queries.txt
```

### Memory Usage
```bash
# Monitor memory usage with verbose output
rustkmer count -i large_file.fa -o output.rkdb --verbose

# Use smaller k-mer sizes for memory efficiency
rustkmer count -i input.fa -o output.rkdb -k 13

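# Process files in batches for very large datasets
for file in *.fa; do
  rustkmer count -i "$file" -o "${file%.fa}.rkdb"
done
# Merge the per-file databases with the numbered -dN flags shown under `merge`
args=()
n=1
for db in *.rkdb; do
  args+=("-d$n" "$db")
  n=$((n + 1))
done
rustkmer merge "${args[@]}" -o merged.rkdb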
```

## Pipeline Integration

### Bash Scripting
```bash
#!/bin/bash
# pipeline.sh - Complete k-mer analysis pipeline

INPUT_DIR="data"
OUTPUT_DIR="results"
THREADS=16

mkdir -p "$OUTPUT_DIR"

echo "Starting k-mer analysis..."

# Step 1: Count k-mers for all samples
for file in "$INPUT_DIR"/*.fa; do
  sample=$(basename "$file" .fa)
  echo "Processing $sample..."

  rustkmer count \
    -i "$file" \
    -o "$OUTPUT_DIR/${sample}.rkdb" \
    --canonical \
    --sorted \
    --threads "$THREADS" \
    --verbose
done

# Step 2: Create summary report
echo "Generating summary..."
{
  echo "Sample,K-mer Size,Total K-mers,Unique K-mers"
  for db in "$OUTPUT_DIR"/*.rkdb; do
    sample=$(basename "$db" .rkdb)
    info=$(rustkmer info -d "$db")
    # Extract key statistics from info output
    echo "$sample,21,$(echo "$info" | grep "Total" | cut -d: -f2 | tr -d ' '),$(echo "$info" | grep "Unique" | cut -d: -f2 | tr -d ' ')"
  done
} > "$OUTPUT_DIR/summary.csv"

echo "Pipeline complete! Results in $OUTPUT_DIR"
```
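
The `grep | cut | tr` chain in the script above assumes that `rustkmer info` prints `Label: value` lines. Under the same assumption, the parsing step can be sketched in Python, which makes it easier to detect a missing or ambiguous label. The sample text in the usage note is made up for illustration and is not real `rustkmer info` output; the function name is hypothetical.

```python
from typing import Dict


def parse_info(text: str) -> Dict[str, str]:
    """Collect 'Label: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields
```

For a hypothetical report such as `"Total k-mers: 1000\nUnique k-mers: 250"`, `parse_info` returns `{"Total k-mers": "1000", "Unique k-mers": "250"}`, and a missing label raises `KeyError` on lookup rather than silently producing an empty CSV field.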

### Snakemake Integration
```python
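# Snakefile
# Sample names are derived from the FASTA files present in data/
SAMPLES = glob_wildcards("data/{sample}.fa").sample
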
rule all:
    input:
        "results/summary.csv"

rule count_kmers:
    input:
        "data/{sample}.fa"
    output:
        "results/{sample}.rkdb"
    threads: 8
    shell:
        """
        rustkmer count \
            -i {input} \
            -o {output} \
            --canonical \
            --sorted \
            --threads {threads}
        """

rule generate_summary:
    input:
        expand("results/{sample}.rkdb", sample=SAMPLES)
    output:
        "results/summary.csv"
    shell:
        """
        # Generate summary using rustkmer info
        echo "Sample,Total K-mers,Unique K-mers" > {output}
        for db in {input}; do
            sample=$(basename $db .rkdb)
            info=$(rustkmer info -d $db)
            echo "$sample,$(echo "$info" | grep "Total" | cut -d: -f2 | tr -d ' '),$(echo "$info" | grep "Unique" | cut -d: -f2 | tr -d ' ')" >> {output}
        done
        """
```

### Nextflow Integration
```groovy
// main.nf (DSL2)
process count_kmers {
    cpus 8
    memory '16 GB'

    input:
    path fasta_file

    output:
    path "${fasta_file.baseName}.rkdb"

    script:
    """
    rustkmer count \
        -i ${fasta_file} \
        -o ${fasta_file.baseName}.rkdb \
        --canonical \
        --sorted \
        --threads ${task.cpus}
    """
}

process summarize_results {
    input:
    path db_files

    output:
    path "summary.csv"

    script:
    """
    echo "Sample,Total K-mers,Unique K-mers" > summary.csv
    for db in ${db_files}; do
        sample=\$(basename \$db .rkdb)
        info=\$(rustkmer info -d \$db)
        echo "\$sample,\$(echo "\$info" | grep "Total" | cut -d: -f2 | tr -d ' '),\$(echo "\$info" | grep "Unique" | cut -d: -f2 | tr -d ' ')" >> summary.csv
    done
    """
}

workflow {
    // One task per FASTA file in data/
    samples_ch = Channel.fromPath('data/*.fa')
    count_kmers(samples_ch)
    // Collect all databases so the summary runs once
    summarize_results(count_kmers.out.collect())
}
```

## Error Handling

### Common Errors and Solutions

#### File Not Found
```bash
# Error: Input file not found
rustkmer count -i missing.fa -o output.rkdb
# Solution: Check file path and permissions
ls -la missing.fa
```

#### Memory Issues
```bash
# Error: Out of memory
rustkmer count -i huge_file.fa -o output.rkdb -k 31
# Solution: Use a smaller k-mer size, or split the input into batches and merge
# the results (see Memory Usage). Adding threads does not lower memory use.
rustkmer count -i huge_file.fa -o output.rkdb -k 13
```

#### Database Format Errors
```bash
# Error: Invalid database format
rustkmer query -d corrupted.rkdb -q "ATCG"
# Solution: Recreate database or check integrity
rustkmer info -d corrupted.rkdb
```

#### Permission Errors
```bash
# Error: Permission denied
rustkmer count -i /protected/file.fa -o /protected/output.rkdb
# Solution: Check file permissions or use different directory
chmod 644 /protected/file.fa
rustkmer count -i /protected/file.fa -o ./output.rkdb
```
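
Path and permission problems like the ones above can be caught before a long run starts. The sketch below is a hypothetical pre-flight helper, not part of RustKmer: it checks that the input exists and is readable and that the output directory exists and is writable.

```python
import os
from typing import List


def preflight(input_path: str, output_path: str) -> List[str]:
    """Return human-readable problems; an empty list means the paths look usable."""
    problems = []
    if not os.path.isfile(input_path):
        problems.append(f"input not found: {input_path}")
    elif not os.access(input_path, os.R_OK):
        problems.append(f"input not readable: {input_path}")
    out_dir = os.path.dirname(os.path.abspath(output_path))
    if not os.path.isdir(out_dir):
        problems.append(f"output directory missing: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"output directory not writable: {out_dir}")
    return problems
```

A wrapper script can call `preflight("input.fa", "results/output.rkdb")` and exit early with the returned messages instead of waiting for the counting step to fail.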

### Debug Mode
```bash
# Enable debug output
rustkmer count -i input.fa -o output.rkdb --verbose --debug

# Check database integrity
rustkmer info -d database.rkdb --detailed

# Test with small sample
rustkmer count -i small_test.fa -o test.rkdb --verbose
```

---

## Need More Help?

- **[Getting Started](../getting-started/)** - Installation and basic usage
- **[Performance Tips](performance-tips.md)** - Optimization strategies
- **[User Guide](index.md)** - Complete user guide
- **[API Reference](../api-reference/)** - Python API documentation