# Basic Workflow Tutorial
This end-to-end tutorial shows how to use RustKmer for basic k-mer analysis, from data preparation to interpreting the results.
## Tutorial Overview
In this tutorial, you'll learn to:
1. **Prepare sample data** for k-mer analysis
2. **Count k-mers** from genomic sequences
3. **Create and query** k-mer databases
4. **Perform basic analysis** on k-mer frequencies
5. **Save and export** results for further use
### Prerequisites
- Python 3.8+ with RustKmer installed
- Basic command line knowledge
- 15 minutes to complete
---
## Step 1: Setup and Data Preparation
### Install RustKmer
```bash
# Install Python package
pip install rustkmer
# Verify installation
python -c "from pyrustkmer import KmerCounter, LoadMode; print('✅ RustKmer installed successfully!')"
```
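If you would rather check from inside a script without triggering an `ImportError`, the sketch below uses the standard library's `importlib.util.find_spec`. The helper name `is_installed` is hypothetical; the module name `pyrustkmer` is the one used in the import above.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "pyrustkmer" is the import name used throughout this tutorial
    if is_installed("pyrustkmer"):
        print("pyrustkmer is importable")
    else:
        print("pyrustkmer not found; try: pip install rustkmer")
```

This only confirms that Python can locate the module; it does not check that the compiled extension loads correctly, so the one-line import test above is still worth running.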
### Create Sample Data
Let's create sample genomic data for our tutorial:
```python
# create_sample_data.py
def create_sample_files():
    """Create sample FASTA and query files for the tutorial."""
    # Sample FASTA file with some interesting sequences
    fasta_content = """>tutorial_sequence_1
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
>tutorial_sequence_2
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>tutorial_sequence_3
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
"""
    # Write FASTA file
    with open("tutorial_data.fa", "w") as f:
        f.write(fasta_content)
    # Create query file with k-mers we want to test.
    # Every query is exactly 21 bases long to match k=21 used later.
    queries = [
        "ATCGATCGATCGATCGATCGA",  # Should be present (repeated)
        "GCTAGCTAGCTAGCTAGCTAG",  # Should be present (repeated)
        "T" * 21,  # Should be present
        "C" * 21,  # Should be present
        "A" * 21,  # Absent from the forward strand; canonical mode may still match it via poly-T
    ]
    with open("tutorial_queries.txt", "w") as f:
        for query in queries:
            f.write(f"{query}\n")
    print("✅ Sample data files created:")
    print("   - tutorial_data.fa (FASTA sequences)")
    print("   - tutorial_queries.txt (test k-mers)")


if __name__ == "__main__":
    create_sample_files()
```
```bash
# Run the script to create sample data
python3 create_sample_data.py
```
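Before counting, it can help to sanity-check the FASTA file by hand. The sketch below is a plain-Python reader, independent of RustKmer, that joins wrapped sequence lines into one string per record and reports how many k-mer windows each record would yield. The helper names `read_fasta` and `kmer_windows` are hypothetical, and joining wrapped lines is an assumption based on common FASTA conventions rather than a statement about how RustKmer parses files.

```python
def read_fasta(lines):
    """Parse FASTA text lines into {record_name: sequence}, joining wrapped lines."""
    records = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name
            fields = line[1:].split()
            current = fields[0] if fields else f"record_{len(records) + 1}"
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def kmer_windows(seq_len, k):
    """Number of length-k windows in a sequence of length seq_len."""
    return max(seq_len - k + 1, 0)


# Small inline sample so the sketch runs without any files on disk
sample = [">demo first record", "ATCG", "ATCG", ">other", "TTTT"]
for name, seq in read_fasta(sample).items():
    print(name, len(seq), kmer_windows(len(seq), k=3))
```

To inspect the tutorial data, pass an open file handle instead of the inline list, for example `read_fasta(open("tutorial_data.fa"))`.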
---
## Step 2: Basic K-mer Counting
### Count K-mers from File
```python
# step1_counting.py
from pyrustkmer import KmerCounter


def basic_kmer_counting(input_file, k=21, canonical=True):
    """Basic k-mer counting from a FASTA file."""
    print("🧬 Starting k-mer counting...")
    print(f"   Input file: {input_file}")
    print(f"   K-mer size: {k}")
    print(f"   Canonical mode: {canonical}")
    # Create k-mer counter
    counter = KmerCounter(k, canonical=canonical)
    # Count k-mers from file
    print("Counting k-mers...")
    counter.add_from_fasta(input_file)
    # Get basic statistics
    total_kmers = counter.get_stats().total_kmers
    unique_kmers = counter.get_unique_count()
    uniqueness_ratio = unique_kmers / total_kmers if total_kmers > 0 else 0
    print("\nCounting Results:")
    print(f"   Total k-mers processed: {total_kmers:,}")
    print(f"   Unique k-mers found: {unique_kmers:,}")
    print(f"   Uniqueness ratio: {uniqueness_ratio:.4f}")
    # Get top k-mers
    print("\nTop 10 Most Frequent K-mers:")
    top_kmers = counter.get_top_kmers(10)
    for i, (kmer, count) in enumerate(top_kmers, 1):
        print(f"   {i:2d}. {kmer}: {count:,}")
    return counter


if __name__ == "__main__":
    # Count k-mers from our tutorial data
    counter = basic_kmer_counting("tutorial_data.fa", k=21, canonical=True)
```
```bash
# Run the k-mer counting
python3 step1_counting.py
```
**Expected Output** (illustrative; exact counts depend on how the library joins multi-line FASTA records and applies canonical mode):
```
🧬 Starting k-mer counting...
   Input file: tutorial_data.fa
   K-mer size: 21
   Canonical mode: True
Counting k-mers...

Counting Results:
   Total k-mers processed: 296
   Unique k-mers found: 58
   Uniqueness ratio: 0.1959

Top 10 Most Frequent K-mers:
    1. ATCGATCGATCGATCGATCGA: 9
    2. GCTAGCTAGCTAGCTAGCTAG: 6
    3. TTTTTTTTTTTTTTTTTTTTT: 4
    4. CCCCCCCCCCCCCCCCCCCCC: 4
   ...
```
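To make the `canonical` flag concrete, here is a pure-Python reference sketch of sliding-window k-mer counting. It does not use RustKmer and is far slower; it only illustrates the usual definition, in which each k-mer is replaced by the lexicographically smaller of itself and its reverse complement. Whether RustKmer uses exactly this ordering is an assumption, and the function names are hypothetical.

```python
from collections import Counter

# Translation table mapping each base to its complement
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(seq, k, canonical=True):
    """Count every length-k window of seq, optionally in canonical form."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if canonical:
            # Keep whichever strand's representation sorts first
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    return counts


print(count_kmers("ATCGAT", 3, canonical=False))
print(count_kmers("ATCGAT", 3, canonical=True))
```

With `canonical=True`, the four forward 3-mers of `ATCGAT` collapse into two keys, because `GAT` is the reverse complement of `ATC` and `TCG` is the reverse complement of `CGA`. The same effect means a poly-T k-mer and a poly-A k-mer share one entry, which matters when interpreting the poly-A query later in this tutorial.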
---
## Step 3: Database Creation and Querying
### Create K-mer Database
```python
# step2_database.py
from pyrustkmer import KmerCounter, Database, LoadMode


def create_database(counter, output_file):
    """Save k-mer counting results to database."""
    print("\n💾 Creating k-mer database...")
    print(f"   Output file: {output_file}")
    # Save to database
    counter.save_database(output_file)
    print("✅ Database saved successfully!")
    # Verify the database by loading the file that was just written
    db = Database(output_file, LoadMode.Preload)
    stats = db.get_stats()
    print("\nDatabase Statistics:")
    print(f"   K-mer size: {stats.kmer_size}")
    print(f"   Total k-mers: {stats.total_kmers:,}")
    print(f"   Unique k-mers: {stats.unique_kmers:,}")
    print(f"   Database file: {stats.filename}")


def query_database(db_file, queries):
    """Query specific k-mers from database."""
    print("\nQuerying k-mers from database...")
    print(f"   Database: {db_file}")
    db = Database(db_file, LoadMode.Preload)
    results = []
    for i, query in enumerate(queries, 1):
        print(f"\n   Query {i}: {query}")
        result = db.query_exact(query)
        if result.exists:
            print(f"   ✅ Found: {result.count:,} occurrences")
            results.append((query, result.count, True))
        else:
            print("   ❌ Not found in database")
            results.append((query, 0, False))
    return results


if __name__ == "__main__":
    # Re-run the counting from the previous step to get a counter
    from step1_counting import basic_kmer_counting
    counter = basic_kmer_counting("tutorial_data.fa")
    # Create database
    create_database(counter, "tutorial_k21.rkdb")
    # Load queries from file
    with open("tutorial_queries.txt", "r") as f:
        queries = [line.strip() for line in f if line.strip()]
    # Query database
    results = query_database("tutorial_k21.rkdb", queries)
```
```bash
# Run database creation and querying
python3 step2_database.py
```
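Exact queries only make sense when each query has the same length as the database's k. The sketch below screens a query list before it reaches the database. The helper name `validate_queries` is hypothetical, and the requirement that queries be exactly k bases of `ACGT` is an assumption; this tutorial does not state how `query_exact` treats other inputs.

```python
def validate_queries(queries, k, alphabet="ACGT"):
    """Split queries into (valid, rejected); rejected holds (query, reason) pairs."""
    allowed = set(alphabet)
    valid, rejected = [], []
    for query in queries:
        kmer = query.strip().upper()
        if len(kmer) != k:
            rejected.append((kmer, f"length {len(kmer)} != k={k}"))
        elif not set(kmer) <= allowed:
            rejected.append((kmer, f"contains characters outside {alphabet}"))
        else:
            valid.append(kmer)
    return valid, rejected


valid, rejected = validate_queries(["ATCGATCGATCGATCGATCGA", "ATCG", "N" * 21], k=21)
print("valid:", valid)
for kmer, reason in rejected:
    print("rejected:", kmer, "-", reason)
```

Only the `valid` list would then be passed on to the querying step, which avoids misreading a length mismatch as a genuine "not found".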
---
## Step 4: Basic Analysis and Visualization
### Analyze K-mer Frequencies
```python
# step3_analysis.py
import csv

from pyrustkmer import Database, LoadMode


def analyze_kmer_frequencies(db_file):
    """Look up the tutorial query k-mers and report their counts."""
    print("Analyzing k-mer frequencies...")
    db = Database(db_file, LoadMode.Preload)
    stats = db.get_stats()
    print(f"Database: {db_file}")
    print(f"K-mer size: {stats.kmer_size}")
    print(f"Unique k-mers: {stats.unique_kmers:,}")
    # For this tutorial we look up the known query k-mers
    # instead of iterating over every k-mer in the database
    with open("tutorial_queries.txt", "r") as f:
        queries = [line.strip() for line in f if line.strip()]
    query_results = {}
    for query in queries:
        result = db.query_exact(query)
        query_results[query] = result.count if result.exists else 0
    print("\nQuery Results Analysis:")
    for query, count in query_results.items():
        print(f"   {query}: {count}")
    return query_results


def create_frequency_visualization(query_results):
    """Create simple visualization of query results."""
    # Imported here so a missing matplotlib raises ImportError at call time
    import matplotlib.pyplot as plt

    print("\nCreating frequency visualization...")
    # Prepare data
    queries = list(query_results.keys())
    counts = list(query_results.values())
    # Create figure
    plt.figure(figsize=(12, 6))
    # Bar plot
    colors = ['#2E86AB', '#A23B72', '#F18F01', '#C73E1D', '#7209B7']
    bars = plt.bar(range(len(queries)), counts, color=colors)
    # Customize plot
    plt.xlabel('Query K-mers')
    plt.ylabel('Count')
    plt.title('K-mer Query Results from Tutorial Database')
    plt.xticks(range(len(queries)), [q[:8] + "..." for q in queries], rotation=45)
    # Add value labels on bars
    for bar, count in zip(bars, counts):
        plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.1,
                 str(count), ha='center', va='bottom')
    plt.tight_layout()
    plt.savefig('tutorial_query_results.png', dpi=300, bbox_inches='tight')
    plt.show()
    print("✅ Visualization saved as 'tutorial_query_results.png'")


def export_results(query_results, output_file="tutorial_results.csv"):
    """Export query results to CSV file."""
    print(f"\n💾 Exporting results to {output_file}...")
    with open(output_file, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['kmer', 'count', 'found'])
        for kmer, count in query_results.items():
            writer.writerow([kmer, count, count > 0])
    print("✅ Results exported successfully!")


if __name__ == "__main__":
    # Analyze frequencies
    results = analyze_kmer_frequencies("tutorial_k21.rkdb")
    # Create visualization (requires matplotlib)
    try:
        create_frequency_visualization(results)
    except ImportError:
        print("⚠️ Matplotlib not installed. Skipping visualization.")
        print("   Install with: pip install matplotlib")
    # Export results
    export_results(results)
```
```bash
# Run the analysis
python3 step3_analysis.py
```
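A common next step in frequency analysis is the k-mer spectrum: for each abundance value, how many distinct k-mers occur that many times. The sketch below computes it from any mapping of k-mer to count, such as the `query_results` dictionary returned above. The function name `frequency_spectrum` is hypothetical and the code uses only the standard library.

```python
from collections import Counter


def frequency_spectrum(kmer_counts):
    """Map each abundance to the number of distinct k-mers with that abundance."""
    spectrum = Counter(kmer_counts.values())
    # Sort by abundance so the result reads like a histogram
    return dict(sorted(spectrum.items()))


example_counts = {"AAA": 1, "ACG": 1, "CGT": 3, "GTA": 0}
for abundance, n_kmers in frequency_spectrum(example_counts).items():
    print(f"abundance {abundance}: {n_kmers} k-mer(s)")
```

K-mers that were queried but not found appear in the abundance-0 bin, so filter them out first if only observed k-mers are of interest.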
---
## Step 5: Complete Workflow Script
### End-to-End Automation
```python
# complete_workflow.py
"""
Complete RustKmer workflow tutorial script.
This script combines all steps into a single automated workflow.
"""
import os
import sys

from pyrustkmer import KmerCounter, Database, LoadMode


def run_complete_workflow():
    """Run the complete RustKmer workflow from start to finish."""
    print("RustKmer Complete Workflow Tutorial")
    print("=" * 50)
    # Step 1: Create sample data
    print("\nStep 1: Creating sample data...")
    from create_sample_data import create_sample_files
    create_sample_files()
    # Step 2: Count k-mers
    print("\n🧬 Step 2: Counting k-mers...")
    counter = KmerCounter(21, canonical=True)
    counter.add_from_fasta("tutorial_data.fa")
    total_kmers = counter.get_stats().total_kmers
    unique_kmers = counter.get_unique_count()
    print(f"   Total k-mers: {total_kmers:,}")
    print(f"   Unique k-mers: {unique_kmers:,}")
    # Step 3: Create database
    print("\n💾 Step 3: Creating database...")
    db_file = "tutorial_complete.rkdb"
    counter.save_database(db_file)
    # Step 4: Query database
    print("\nStep 4: Querying database...")
    # Load queries
    with open("tutorial_queries.txt", "r") as f:
        queries = [line.strip() for line in f if line.strip()]
    # Perform queries against the database written in Step 3
    db = Database(db_file, LoadMode.Preload)
    query_results = {}
    found_count = 0
    for query in queries:
        result = db.query_exact(query)
        count = result.count if result.exists else 0
        query_results[query] = count
        if count > 0:
            found_count += 1
            print(f"   ✅ {query}: {count}")
        else:
            print(f"   ❌ {query}: not found")
    # Step 5: Generate report
    print("\nStep 5: Generating report...")
    print("\nWorkflow Complete!")
    print(f"   Processed {total_kmers:,} k-mers")
    print(f"   Found {unique_kmers:,} unique k-mers")
    print(f"   Queried {len(queries)} k-mers")
    print(f"   Found matches for {found_count} queries")
    # Save summary
    with open("workflow_summary.txt", "w") as f:
        f.write("RustKmer Tutorial Workflow Summary\n")
        f.write("=" * 40 + "\n\n")
        f.write(f"Database file: {db_file}\n")
        f.write(f"Total k-mers: {total_kmers:,}\n")
        f.write(f"Unique k-mers: {unique_kmers:,}\n")
        f.write(f"Queries performed: {len(queries)}\n")
        f.write(f"Queries with matches: {found_count}\n\n")
        f.write("Query Results:\n")
        for query, count in query_results.items():
            f.write(f"   {query}: {count}\n")
    print("   Summary saved to 'workflow_summary.txt'")
    # Step 6: Cleanup
    print("\n🧹 Step 6: Cleanup options...")
    print("   Files created:")
    files_created = [
        "tutorial_data.fa",
        "tutorial_queries.txt",
        "tutorial_complete.rkdb",
        "workflow_summary.txt",
    ]
    for file in files_created:
        if os.path.exists(file):
            size = os.path.getsize(file)
            print(f"   {file} ({size:,} bytes)")
    print("\n✅ Tutorial completed successfully!")
    print("You can now explore the created files and modify the workflow for your own data.")
    return query_results


if __name__ == "__main__":
    try:
        results = run_complete_workflow()
    except Exception as e:
        print(f"❌ Error in workflow: {e}")
        sys.exit(1)
```
```bash
# Run the complete workflow
python3 complete_workflow.py
```
---
## Expected Results
After running the complete workflow, you should have:
### Created Files
1. **`tutorial_data.fa`** - Sample FASTA sequences
2. **`tutorial_queries.txt`** - Test k-mer queries
3. **`tutorial_complete.rkdb`** - K-mer database
4. **`workflow_summary.txt`** - Results summary
### Expected Query Results
```
✅ ATCGATCGATCGATCGATCGA: 9 (Most frequent)
✅ GCTAGCTAGCTAGCTAGCTAG: 6 (Frequent)
✅ TTTTTTTTTTTTTTTTTTTTT: 4 (Moderate)
✅ CCCCCCCCCCCCCCCCCCCCC: 4 (Moderate)
❌ AAAAAAAAAAAAAAAAAAAAA: 0 (Not present)
```
### Performance Metrics
```
Processing Speed: ~10,000 k-mers/second
Database Creation: ~1 second
Query Speed: ~100,000 queries/second
Memory Usage: <5MB for this tutorial
```
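The figures above will vary with hardware and data, so it is worth measuring on your own machine. The sketch below is a small timing helper built on `time.perf_counter`; the name `measure_throughput` is hypothetical, and it times any callable applied to each item of an iterable.

```python
import time


def measure_throughput(func, items):
    """Call func(item) for each item; return (elapsed_seconds, items_per_second)."""
    start = time.perf_counter()
    n = 0
    for item in items:
        func(item)
        n += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a very coarse timer
    rate = n / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate


# Stand-in workload so the sketch runs without a database
elapsed, rate = measure_throughput(len, ["ACGT"] * 100_000)
print(f"{elapsed:.4f} s, {rate:,.0f} calls/second")
```

To time real lookups, pass the database's query method and the query list instead, for example `measure_throughput(db.query_exact, queries)` with `db` and `queries` from Step 3. Very short runs are dominated by timer noise, so repeat the query list many times for a stable number.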
---
## Next Steps
Congratulations! You've completed the RustKmer basic workflow tutorial. Here's what you can do next:
### 🎯 Explore Further
- **[Counting Guide](../user-guide/counting-kmers.md)** - Advanced counting techniques
- **[Querying Guide](../user-guide/querying.md)** - Database querying methods
- **[Fuzzy Search](../user-guide/fuzzy-search.md)** - Pattern matching
### 🧬 Real Data Applications
- **Use your own FASTA files** in the workflow
- **Try different k-mer sizes** (13, 21, 31)
- **Process larger datasets** with memory optimization
- **Integrate with bioinformatics pipelines**
### ⚙️ Performance Optimization
- **Benchmark on your system** to compare performance
- **Try different parameters** for your specific use case
- **Explore parallel processing** for large datasets
### 🔧 Advanced Features
- **Fuzzy searching** with wildcards and distance constraints
- **Batch processing** of multiple files
- **Python API integration** with other tools
---
## Troubleshooting
### Common Issues
**Installation Problems:**
```bash
# If Python package installation fails
pip install --upgrade pip
pip install rustkmer
# If you need Rust installation
```
**Import Errors:**
```bash
# Verify RustKmer is properly installed
python -c "from pyrustkmer import KmerCounter, Database, LoadMode; print('✅ OK')"
```
**File Not Found:**
```bash
# Check if files were created
ls -la tutorial_*
```
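The same check can be done from Python, which is handy on systems without `ls`. This is a minimal sketch using `pathlib`; the helper name `missing_files` is hypothetical, and the file names are the ones this tutorial creates.

```python
from pathlib import Path


def missing_files(names, directory="."):
    """Return the subset of names that do not exist as files in directory."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


expected = ["tutorial_data.fa", "tutorial_queries.txt"]
absent = missing_files(expected)
if absent:
    print("Missing:", ", ".join(absent), "(re-run create_sample_data.py)")
else:
    print("All tutorial input files are present")
```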
**Memory Issues:**
```python
# Use smaller k-mer size for memory efficiency
from pyrustkmer import KmerCounter

counter = KmerCounter(13, canonical=True)  # Instead of k=21
```
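Some quick arithmetic shows why k affects memory. Over the alphabet A/C/G/T there are at most 4^k distinct k-mers, and a k-mer packed at 2 bits per base needs ceil(2k / 8) bytes. The sketch below works through the k values suggested in this tutorial. The 2-bit packing is a common technique assumed here for illustration; it is not a statement about RustKmer's internal layout, and the function names are hypothetical.

```python
def kmer_space(k):
    """Upper bound on the number of distinct k-mers over A/C/G/T."""
    return 4 ** k


def packed_bytes(k):
    """Bytes needed for one k-mer at 2 bits per base, rounded up."""
    return (2 * k + 7) // 8


for k in (13, 21, 31):
    print(f"k={k}: up to {kmer_space(k):,} distinct k-mers, "
          f"{packed_bytes(k)} bytes per packed k-mer")
```

The 4^k figure is only a ceiling: the number of distinct k-mers actually stored is also bounded by the amount of sequence processed, so a smaller k helps most on large inputs where that ceiling becomes the limiting factor.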
---
## Need Help?
- **Documentation**: [User Guide](../user-guide/) for detailed usage
- **API Reference**: [Python API](../api-reference/python/) for complete reference
- **Examples**: [Python Examples](../api-reference/python/examples.md) for more code samples
- **Community**: [GitHub Discussions](https://github.com/rustkmer/rustkmer/discussions) for questions
---
## Tutorial Complete!
You've successfully:
- ✅ Created sample genomic data
- ✅ Counted k-mers with RustKmer
- ✅ Built and queried k-mer databases
- ✅ Analyzed and exported results
- ✅ Automated the complete workflow

You're now ready to use RustKmer for your own bioinformatics projects!