rustkmer 0.5.2 - Docs.rs

# Basic Workflow Tutorial

This end-to-end tutorial shows how to use RustKmer for basic k-mer analysis, from data preparation to interpreting results.

## Tutorial Overview

In this tutorial, you'll learn to:
1. **Prepare sample data** for k-mer analysis
2. **Count k-mers** from genomic sequences
3. **Create and query** k-mer databases
4. **Perform basic analysis** on k-mer frequencies
5. **Save and export** results for further use

### Prerequisites
- Python 3.8+ with RustKmer installed (matplotlib is optional and only needed for the plot in Step 4)
- Basic command line knowledge
- 15 minutes to complete

---

## Step 1: Setup and Data Preparation

### Install RustKmer
```bash
# Install Python package
pip install rustkmer

# Verify installation
python -c "from pyrustkmer import KmerCounter, LoadMode; print('✅ RustKmer installed successfully!')"
```

### Create Sample Data
Let's create sample genomic data for our tutorial:

```python
# create_sample_data.py
def create_sample_files():
    """Create sample FASTA and query files for the tutorial."""

    # Sample FASTA file with some interesting sequences
    fasta_content = """>tutorial_sequence_1
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
>tutorial_sequence_2
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>tutorial_sequence_3
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
"""

    # Write FASTA file
    with open("tutorial_data.fa", "w") as f:
        f.write(fasta_content)

    # Create query file with k-mers we want to test.
    # Every query is exactly 21 bases long to match k=21 used in later steps.
    queries = [
        "ATCGATCGATCGATCGATCGA",  # Should be present (repeated)
        "GCTAGCTAGCTAGCTAGCTAG",  # Should be present (repeated)
        "T" * 21,                 # Should be present
        "C" * 21,                 # Should be present
        "A" * 21,                 # Should NOT be present as written (note: reverse complement of poly-T)
    ]

    with open("tutorial_queries.txt", "w") as f:
        for query in queries:
            f.write(f"{query}\n")

    print("✅ Sample data files created:")
    print("   - tutorial_data.fa (FASTA sequences)")
    print("   - tutorial_queries.txt (test k-mers)")

if __name__ == "__main__":
    create_sample_files()
```

```bash
# Run the script to create sample data
python3 create_sample_data.py
```
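
FASTA records may span several lines, so the number of k-mers a record contributes depends on its joined length rather than on the length of each line. The following is a minimal pure-Python sketch for sanity-checking the sample file. It does not use RustKmer; the helper names `read_fasta` and `kmers_per_record` are illustrative, and it assumes k-mers never span record boundaries.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def kmers_per_record(handle, k):
    """Map each record name to the number of k-mer windows it contains."""
    return {name: max(len(seq) - k + 1, 0) for name, seq in read_fasta(handle)}

# Small in-memory demo: r1 joins to 8 bases (6 windows), r2 is shorter than k
demo = StringIO(">r1\nACGT\nACGT\n>r2\nAC\n")
print(kmers_per_record(demo, k=3))  # {'r1': 6, 'r2': 0}
```

To apply it to the tutorial data, pass an open file handle for `tutorial_data.fa` with `k=21`.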

---

## Step 2: Basic K-mer Counting

### Count K-mers from File
```python
# step1_counting.py
from pyrustkmer import KmerCounter, LoadMode

def basic_kmer_counting(input_file, k=21, canonical=True):
    """Basic k-mer counting from a FASTA file."""

    print("🧬 Starting k-mer counting...")
    print(f"   Input file: {input_file}")
    print(f"   K-mer size: {k}")
    print(f"   Canonical mode: {canonical}")

    # Create k-mer counter
    counter = KmerCounter(k, canonical=canonical)

    # Count k-mers from file
    print("📊 Counting k-mers...")
    counter.add_from_fasta(input_file)

    # Get basic statistics
    total_kmers = counter.get_stats().total_kmers
    unique_kmers = counter.get_unique_count()
    uniqueness_ratio = unique_kmers / total_kmers if total_kmers > 0 else 0

    print(f"\n📈 Counting Results:")
    print(f"   Total k-mers processed: {total_kmers:,}")
    print(f"   Unique k-mers found: {unique_kmers:,}")
    print(f"   Uniqueness ratio: {uniqueness_ratio:.4f}")

    # Get top k-mers
    print(f"\n🔝 Top 10 Most Frequent K-mers:")
    top_kmers = counter.get_top_kmers(10)
    for i, (kmer, count) in enumerate(top_kmers, 1):
        print(f"   {i:2d}. {kmer}: {count:,}")

    return counter

if __name__ == "__main__":
    # Count k-mers from our tutorial data
    counter = basic_kmer_counting("tutorial_data.fa", k=21, canonical=True)
```

```bash
# Run the k-mer counting
python3 step1_counting.py
```

**Expected Output** (illustrative; exact counts may differ):
```
🧬 Starting k-mer counting...
   Input file: tutorial_data.fa
   K-mer size: 21
   Canonical mode: True
📊 Counting k-mers...

📈 Counting Results:
   Total k-mers processed: 296
   Unique k-mers found: 58
   Uniqueness ratio: 0.1959

🔝 Top 10 Most Frequent K-mers:
    1. ATCGATCGATCGATCGATCGA: 9
    2. GCTAGCTAGCTAGCTAGCTAG: 6
    3. TTTTTTTTTTTTTTTTTTTTT: 4
    4. CCCCCCCCCCCCCCCCCCCCC: 4
   ...
```
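
For small inputs, counts can be cross-checked against a plain Python reference. The sketch below is not RustKmer's implementation. It assumes the common convention that the canonical form of a k-mer is the lexicographically smaller of the k-mer and its reverse complement; RustKmer's convention may differ. The names `reverse_complement` and `count_kmers` are illustrative.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]

def count_kmers(seq, k, canonical=True):
    """Count every k-length window of seq, optionally in canonical form."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if canonical:
            # Assumed convention: keep the lexicographically smaller strand
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    return counts

counts = count_kmers("ACGTTT", k=3)
print(counts.most_common(1))  # [('ACG', 2)]
print(counts["AAA"])          # 1  (TTT is stored under its reverse complement)
```

Under this convention, poly-T and poly-A windows share a single key, which is worth keeping in mind when reading query results in canonical mode.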

---

## Step 3: Database Creation and Querying

### Create K-mer Database
```python
# step2_database.py
from pyrustkmer import KmerCounter, Database, LoadMode

def create_database(counter, output_file):
    """Save k-mer counting results to database."""

    print(f"\n💾 Creating k-mer database...")
    print(f"   Output file: {output_file}")

    # Save to database
    counter.save_database(output_file)

    print(f"✅ Database saved successfully!")

    # Verify the database by reloading the file we just wrote
    db = Database(output_file, LoadMode.Preload)
    stats = db.get_stats()

    print(f"\n📊 Database Statistics:")
    print(f"   K-mer size: {stats.kmer_size}")
    print(f"   Total k-mers: {stats.total_kmers:,}")
    print(f"   Unique k-mers: {stats.unique_kmers:,}")
    print(f"   Database file: {stats.filename}")

def query_database(db_file, queries):
    """Query specific k-mers from database."""

    print(f"\n🔍 Querying k-mers from database...")
    print(f"   Database: {db_file}")

    db = Database(db_file, LoadMode.Preload)

    results = []

    for i, query in enumerate(queries, 1):
        print(f"\n   Query {i}: {query}")

        result = db.query_exact(query)

        if result.exists:
            print(f"   ✅ Found: {result.count:,} occurrences")
            results.append((query, result.count, True))
        else:
            print(f"   ❌ Not found in database")
            results.append((query, 0, False))

    return results

if __name__ == "__main__":
    # Load the counter from previous step
    from step1_counting import basic_kmer_counting
    counter = basic_kmer_counting("tutorial_data.fa")

    # Create database
    create_database(counter, "tutorial_k21.rkdb")

    # Load queries from file
    with open("tutorial_queries.txt", "r") as f:
        queries = [line.strip() for line in f if line.strip()]

    # Query database
    results = query_database("tutorial_k21.rkdb", queries)
```

```bash
# Run database creation and querying
python3 step2_database.py
```
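
Before querying, it can help to check that each query matches the k-mer size of the database. The sketch below assumes exact queries should be exactly `k` bases of `A`, `C`, `G`, `T`; how `query_exact` treats other inputs is not covered in this tutorial. The helper `validate_queries` is hypothetical and not part of RustKmer.

```python
def validate_queries(queries, k, alphabet="ACGT"):
    """Split queries into (valid, rejected); rejected holds (query, reason)."""
    valid, rejected = [], []
    for q in queries:
        q = q.strip().upper()
        if len(q) != k:
            rejected.append((q, f"length {len(q)} != k={k}"))
        elif set(q) - set(alphabet):
            rejected.append((q, "non-ACGT character"))
        else:
            valid.append(q)
    return valid, rejected

valid, rejected = validate_queries(["ACGTA", "ACGT", "ACGNA"], k=5)
print(valid)     # ['ACGTA']
print(rejected)  # [('ACGT', 'length 4 != k=5'), ('ACGNA', 'non-ACGT character')]
```

For this tutorial, the lines of `tutorial_queries.txt` would be passed with `k=21`.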

---

## Step 4: Basic Analysis and Visualization

### Analyze K-mer Frequencies
```python
# step3_analysis.py
from pyrustkmer import Database, LoadMode

def analyze_kmer_frequencies(db_file):
    """Look up the tutorial queries in the database and report their counts."""

    print("📊 Analyzing k-mer frequencies...")

    db = Database(db_file, LoadMode.Preload)
    stats = db.get_stats()

    print(f"Database: {db_file}")
    print(f"K-mer size: {stats.kmer_size}")
    print(f"Unique k-mers: {stats.unique_kmers:,}")

    # This tutorial inspects only the known queries rather than every k-mer
    with open("tutorial_queries.txt", "r") as f:
        queries = [line.strip() for line in f if line.strip()]

    query_results = {}
    for query in queries:
        result = db.query_exact(query)
        query_results[query] = result.count if result.exists else 0

    print(f"\n🔍 Query Results Analysis:")
    for query, count in query_results.items():
        print(f"   {query}: {count}")

    return query_results

def create_frequency_visualization(query_results):
    """Create simple visualization of query results."""

    # Imported here so that a missing matplotlib raises ImportError inside the
    # try/except in the main block instead of failing at module import
    import matplotlib.pyplot as plt

    print("\n📈 Creating frequency visualization...")

    # Prepare data
    queries = list(query_results.keys())
    counts = list(query_results.values())

    # Create figure
    plt.figure(figsize=(12, 6))

    # Bar plot
    bars = plt.bar(range(len(queries)), counts, color=['#2E86AB', '#A23B72', '#F18F01', '#C73E1D', '#7209B7'])

    # Customize plot
    plt.xlabel('Query K-mers')
    plt.ylabel('Count')
    plt.title('K-mer Query Results from Tutorial Database')
    plt.xticks(range(len(queries)), [q[:8] + "..." for q in queries], rotation=45)

    # Add value labels on bars
    for bar, count in zip(bars, counts):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                str(count), ha='center', va='bottom')

    plt.tight_layout()
    plt.savefig('tutorial_query_results.png', dpi=300, bbox_inches='tight')
    plt.show()

    print("✅ Visualization saved as 'tutorial_query_results.png'")

def export_results(query_results, output_file="tutorial_results.csv"):
    """Export query results to CSV file."""

    print(f"\n💾 Exporting results to {output_file}...")

    import csv

    with open(output_file, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['kmer', 'count', 'found'])

        for kmer, count in query_results.items():
            writer.writerow([kmer, count, count > 0])

    print("✅ Results exported successfully!")

if __name__ == "__main__":
    # Analyze frequencies
    results = analyze_kmer_frequencies("tutorial_k21.rkdb")

    # Create visualization (requires matplotlib)
    try:
        create_frequency_visualization(results)
    except ImportError:
        print("⚠️  Matplotlib not installed. Skip visualization.")
        print("   Install with: pip install matplotlib")

    # Export results
    export_results(results)
```

```bash
# Run the analysis
python3 step3_analysis.py
```
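
The exported CSV can be read back with the standard library for further processing. This sketch assumes the `kmer,count,found` header written by `export_results` above; the helper name `load_results` is illustrative, and the demo uses an in-memory file so that it runs without the tutorial outputs.

```python
import csv
from io import StringIO

def load_results(handle):
    """Parse rows written by export_results into {kmer: count}."""
    reader = csv.DictReader(handle)
    return {row["kmer"]: int(row["count"]) for row in reader}

# In-memory stand-in for tutorial_results.csv
demo = StringIO("kmer,count,found\nACGT,3,True\nTTTT,0,False\n")
results = load_results(demo)
print(results)                                       # {'ACGT': 3, 'TTTT': 0}
print([k for k, c in results.items() if c == 0])     # ['TTTT']
```

With the real file, open it using `open("tutorial_results.csv", newline="")` and pass the handle to `load_results`.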

---

## Step 5: Complete Workflow Script

### End-to-End Automation
```python
# complete_workflow.py
"""
Complete RustKmer workflow tutorial script.
This script combines all steps into a single automated workflow.
"""

import os
import sys
from pyrustkmer import KmerCounter, Database, LoadMode

def run_complete_workflow():
    """Run the complete RustKmer workflow from start to finish."""

    print("🚀 RustKmer Complete Workflow Tutorial")
    print("=" * 50)

    # Step 1: Create sample data
    print("\n📝 Step 1: Creating sample data...")
    from create_sample_data import create_sample_files
    create_sample_files()

    # Step 2: Count k-mers
    print("\n🧬 Step 2: Counting k-mers...")
    counter = KmerCounter(21, canonical=True)
    counter.add_from_fasta("tutorial_data.fa")

    total_kmers = counter.get_stats().total_kmers
    unique_kmers = counter.get_unique_count()

    print(f"   Total k-mers: {total_kmers:,}")
    print(f"   Unique k-mers: {unique_kmers:,}")

    # Step 3: Create database
    print("\n💾 Step 3: Creating database...")
    db_file = "tutorial_complete.rkdb"
    counter.save_database(db_file)

    # Step 4: Query database
    print("\n🔍 Step 4: Querying database...")

    # Load queries
    with open("tutorial_queries.txt", "r") as f:
        queries = [line.strip() for line in f if line.strip()]

    # Perform queries against the database written in Step 3
    db = Database(db_file, LoadMode.Preload)

    query_results = {}
    found_count = 0

    for query in queries:
        result = db.query_exact(query)
        count = result.count if result.exists else 0
        query_results[query] = count

        if count > 0:
            found_count += 1
            print(f"   ✅ {query}: {count}")
        else:
            print(f"   ❌ {query}: not found")

    # Step 5: Generate report
    print("\n📊 Step 5: Generating report...")

    print(f"\n🎉 Workflow Complete!")
    print(f"   Processed {total_kmers:,} k-mers")
    print(f"   Found {unique_kmers:,} unique k-mers")
    print(f"   Queried {len(queries)} k-mers")
    print(f"   Found matches for {found_count} queries")

    # Save summary
    with open("workflow_summary.txt", "w") as f:
        f.write("RustKmer Tutorial Workflow Summary\n")
        f.write("=" * 40 + "\n\n")
        f.write(f"Database file: {db_file}\n")
        f.write(f"Total k-mers: {total_kmers:,}\n")
        f.write(f"Unique k-mers: {unique_kmers:,}\n")
        f.write(f"Queries performed: {len(queries)}\n")
        f.write(f"Queries with matches: {found_count}\n\n")
        f.write("Query Results:\n")
        for query, count in query_results.items():
            f.write(f"  {query}: {count}\n")

    print(f"   📄 Summary saved to 'workflow_summary.txt'")

    # Step 6: Cleanup
    print("\n🧹 Step 6: Cleanup options...")
    print("   Files created:")
    files_created = [
        "tutorial_data.fa",
        "tutorial_queries.txt",
        "tutorial_complete.rkdb",
        "workflow_summary.txt"
    ]

    for file in files_created:
        if os.path.exists(file):
            size = os.path.getsize(file)
            print(f"     {file} ({size:,} bytes)")

    print(f"\n✅ Tutorial completed successfully!")
    print(f"You can now explore the created files and modify the workflow for your own data.")

    return query_results

if __name__ == "__main__":
    try:
        results = run_complete_workflow()
    except Exception as e:
        print(f"❌ Error in workflow: {e}")
        sys.exit(1)
```

```bash
# Run the complete workflow
python3 complete_workflow.py
```

---

## Expected Results

After running the complete workflow, you should have:

### Created Files
1. **`tutorial_data.fa`** - Sample FASTA sequences
2. **`tutorial_queries.txt`** - Test k-mer queries
3. **`tutorial_complete.rkdb`** - K-mer database
4. **`workflow_summary.txt`** - Results summary

### Expected Query Results
```
✅ ATCGATCGATCGATCGATCGA: 9        (Most frequent)
✅ GCTAGCTAGCTAGCTAGCTAG: 6        (Frequent)
✅ TTTTTTTTTTTTTTTTTTTTT: 4        (Moderate)
✅ CCCCCCCCCCCCCCCCCCCCC: 4        (Moderate)
❌ AAAAAAAAAAAAAAAAAAAAA: 0        (Not present)
```

### Performance Metrics
```
Processing Speed: ~10,000 k-mers/second
Database Creation: ~1 second
Query Speed: ~100,000 queries/second
Memory Usage: <5MB for this tutorial
```
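
Throughput figures like these vary by machine, so it is worth measuring on your own system. Below is a minimal timing sketch using only the standard library. The helper `measure_rate` is hypothetical, and the workload shown is a pure-Python baseline rather than RustKmer itself; any callable, such as a function that runs the counting step, can be passed instead.

```python
import time

def measure_rate(func, n_items, repeats=3):
    """Return the best observed items-per-second rate over several runs of func()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return n_items / best if best > 0 else float("inf")

# Baseline workload: collect the distinct 21-mers of a 100,000-base sequence
seq = "ACGT" * 25_000
k = 21
n_kmers = len(seq) - k + 1
rate = measure_rate(lambda: {seq[i:i + k] for i in range(n_kmers)}, n_kmers)
print(f"{rate:,.0f} k-mers/second (pure-Python baseline)")
```

Taking the best of several runs reduces noise from other processes; the absolute number is only meaningful relative to other measurements on the same machine.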

---

## Next Steps

Congratulations! You've completed the RustKmer basic workflow tutorial. Here's what you can do next:

### 🎯 Explore Further
- **[Counting Guide](../user-guide/counting-kmers.md)** - Advanced counting techniques
- **[Querying Guide](../user-guide/querying.md)** - Database querying methods
- **[Fuzzy Search](../user-guide/fuzzy-search.md)** - Pattern matching

### 🧬 Real Data Applications
- **Use your own FASTA files** in the workflow
- **Try different k-mer sizes** (13, 21, 31)
- **Process larger datasets** with memory optimization
- **Integrate with bioinformatics pipelines**
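
To get a feel for how the choice of k changes the result, the number of distinct k-mers in a sequence can be compared across sizes. This is a pure-Python sketch on a made-up repetitive sequence, not RustKmer output; the helper `unique_kmers` is illustrative and counts non-canonical k-mers.

```python
def unique_kmers(seq, k):
    """Number of distinct k-length windows in seq (0 if seq is shorter than k)."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

# Made-up sequence with the same kind of repeats as the tutorial data
seq = "ATCG" * 9 + "GCTA" * 9 + "T" * 38

for k in (13, 21, 31):
    print(f"k={k}: {unique_kmers(seq, k)} distinct k-mers")
```

Highly repetitive input yields few distinct k-mers at any size, while larger k generally makes each k-mer more specific to its location.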

### ⚙️ Performance Optimization
- **Benchmark on your system** to compare performance
- **Try different parameters** for your specific use case
- **Explore parallel processing** for large datasets

### 🔧 Advanced Features
- **Fuzzy searching** with wildcards and distance constraints
- **Batch processing** of multiple files
- **Python API integration** with other tools
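
Batch processing can be sketched as summing per-file counts. The example below is an assumption-laden illustration, not RustKmer's batch API: `merge_counts` is a hypothetical helper that accepts any per-file counting function, and `count_lines_as_kmers` is a toy stand-in that treats each sequence line independently.

```python
import tempfile
from collections import Counter
from pathlib import Path

def merge_counts(paths, count_one):
    """Sum per-file k-mer counts; count_one(path) must return a mapping."""
    total = Counter()
    for path in sorted(paths):
        total.update(count_one(path))
    return total

def count_lines_as_kmers(path, k=3):
    """Toy counter: k-mers from each non-header line, lines handled separately."""
    counts = Counter()
    for line in Path(path).read_text().splitlines():
        if line and not line.startswith(">"):
            counts.update(line[i:i + k] for i in range(len(line) - k + 1))
    return counts

# Demo on two temporary FASTA files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.fa").write_text(">a\nACGT\n")
    (Path(tmp) / "b.fa").write_text(">b\nACGA\n")
    merged = merge_counts(Path(tmp).glob("*.fa"), count_lines_as_kmers)

print(merged["ACG"])  # 2
```

In a real pipeline, `count_one` would wrap whichever counting call you use for a single file and return its k-mer counts as a mapping.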

---

## Troubleshooting

### Common Issues

**Installation Problems:**
```bash
# If Python package installation fails
pip install --upgrade pip
pip install rustkmer

# If you need Rust installation
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

**Import Errors:**
```bash
# Verify RustKmer is properly installed
python -c "from pyrustkmer import KmerCounter, Database, LoadMode; print('✅ OK')"
```

**File Not Found:**
```bash
# Check if files were created
ls -la tutorial_*
```

**Memory Issues:**
```python
# Use smaller k-mer size for memory efficiency
from pyrustkmer import KmerCounter

counter = KmerCounter(13, canonical=True)  # Instead of k=21
```
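
One reason a smaller k can reduce memory is that the number of distinct k-mers is bounded both by the size of the k-mer space and by the number of windows in the input. The sketch below computes that upper bound. It says nothing about RustKmer's actual memory layout, and the helper `max_distinct_kmers` is illustrative.

```python
def max_distinct_kmers(k, total_bases, canonical=True):
    """Upper bound on distinct k-mers: min(k-mer space, number of windows)."""
    space = 4 ** k
    if canonical:
        if k % 2:
            # Odd k has no reverse-complement palindromes: exactly half the space
            space //= 2
        else:
            # Even k: the 4**(k/2) palindromes are their own reverse complement
            space = (space + 4 ** (k // 2)) // 2
    windows = max(total_bases - k + 1, 0)
    return min(space, windows)

# Bound for a 3-billion-base input at two k-mer sizes
print(max_distinct_kmers(13, 3_000_000_000))  # 33554432
print(max_distinct_kmers(21, 3_000_000_000))  # 2999999980
```

At k=13 the k-mer space caps the table at about 33.5 million canonical entries regardless of input size, whereas at k=21 the bound is set by the input itself. The trade-off is that shorter k-mers are less specific.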

---

## Need Help?

- **Documentation**: [User Guide](../user-guide/) for detailed usage
- **API Reference**: [Python API](../api-reference/python/) for complete reference
- **Examples**: [Python Examples](../api-reference/python/examples.md) for more code samples
- **Community**: [GitHub Discussions](https://github.com/rustkmer/rustkmer/discussions) for questions

---

## Tutorial Complete! 🎉

You've successfully:

- ✅ Created sample genomic data
- ✅ Counted k-mers with RustKmer
- ✅ Built and queried k-mer databases
- ✅ Analyzed and exported results
- ✅ Automated the complete workflow

You're now ready to use RustKmer for your own bioinformatics projects!