# Performance Tips
Optimize RustKmer for maximum speed and efficiency with these performance tuning strategies.
## Overview
RustKmer is designed for high-performance k-mer analysis, but optimal performance requires proper configuration and understanding of the underlying algorithms. This guide covers performance optimization techniques for different use cases.
## Quick Performance Checklist
### Essential Settings for Maximum Speed
- **Use appropriate k-mer size** (13-31 for most applications)
- **Enable canonical k-mers** for genome analysis
- **Use sorted databases** for faster querying
- **Optimize thread count** for your system
- **Choose appropriate file formats** (compressed for storage, uncompressed for speed)
## Memory Optimization
### K-mer Size Selection
```bash
# Smaller k-mers = less memory, faster processing
rustkmer count -k 13 -i input.fa -o output.rkdb
# Larger k-mers = more specificity, more memory
rustkmer count -k 31 -i input.fa -o output.rkdb
```
**Memory Usage Guidelines:**
- **k=13**: ~1GB per 1M k-mers
- **k=21**: ~4GB per 1M k-mers
- **k=31**: ~8GB per 1M k-mers
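
To put these guidelines in context, the sketch below computes two quantities that follow directly from the four-letter DNA alphabet: the upper bound on distinct k-mers (4^k) and the minimum number of bytes needed to store one k-mer at 2 bits per base. The function names are hypothetical, and the numbers are a lower bound only: RustKmer's actual storage layout is not described here and will add counts, hashing overhead, and padding.

```python
def max_distinct_kmers(k: int) -> int:
    """Upper bound on distinct DNA k-mers: 4 choices per position."""
    return 4 ** k


def packed_bytes_per_kmer(k: int) -> int:
    """Minimum bytes for one k-mer at 2 bits per base, rounded up."""
    return (2 * k + 7) // 8


# Print the key-space size and packed width for the k values used in this guide
for k in (13, 21, 31):
    print(f"k={k}: up to {max_distinct_kmers(k):,} distinct k-mers, "
          f">= {packed_bytes_per_kmer(k)} bytes each when packed")
```

The key space grows by a factor of 4 per extra base, which is why large k values are dominated by the number of k-mers actually observed rather than by the per-k-mer width.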
### Database Format Options
```bash
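# Uncompressed database (default): larger on disk, faster to read
rustkmer count -i input.fa -o output.rkdb
# Compressed database: smaller on disk, slower to read
rustkmer count -i input.fa -o output.rkdb --compress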
```
## Speed Optimization
### Thread Configuration
```bash
# Auto-detect optimal threads (recommended)
rustkmer count -i input.fa -o output.rkdb
# Manual thread specification
rustkmer count -i input.fa -o output.rkdb --threads 8
# Disable threading for small files
rustkmer count -i input.fa -o output.rkdb --threads 1
```
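
When scripting runs from Python, the choice between a single thread for small inputs and all cores for large ones can be made programmatically. The sketch below is one way to do that; the 10 MB cutoff and both function names are assumptions for illustration, and only the `--threads` flag comes from the commands above.

```python
import os


def pick_threads(file_size_bytes: int, small_file_bytes: int = 10_000_000) -> int:
    """Return 1 for small inputs, otherwise the number of logical CPUs.

    The 10 MB default threshold is a hypothetical value; tune it by benchmarking.
    """
    if file_size_bytes < small_file_bytes:
        return 1
    # os.cpu_count() can return None on unusual platforms; fall back to 1
    return os.cpu_count() or 1


def build_count_command(input_path: str, output_path: str, threads: int) -> list:
    """Build the argument list for a `rustkmer count` run with explicit threads."""
    return ["rustkmer", "count", "-i", input_path, "-o", output_path,
            "--threads", str(threads)]


print(build_count_command("input.fa", "output.rkdb", pick_threads(500_000)))
```

The resulting list can be passed to `subprocess.run` once `rustkmer` is on the `PATH`.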
### Canonical vs Non-Canonical
```bash
# Canonical k-mers (slower counting, smaller database)
rustkmer count -i input.fa --canonical -o output.rkdb
# Non-canonical k-mers (faster counting, larger database)
rustkmer count -i input.fa -o output.rkdb
```
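
The size difference comes from merging each k-mer with its reverse complement. The pure-Python sketch below illustrates the idea using the common convention of keeping the lexicographically smaller of the two strands; whether RustKmer uses exactly this tie-breaking rule is an assumption.

```python
# Translation table for complementing DNA bases
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


sequence = "AAACGTTT"
k = 4
kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

distinct_plain = len(set(kmers))
distinct_canonical = len({canonical(kmer) for kmer in kmers})
print(distinct_plain, distinct_canonical)
```

In this toy sequence, five distinct k-mers collapse to three canonical ones, at the price of one extra reverse-complement computation per k-mer.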
### Database Sorting
```bash
# Create sorted database for faster querying
rustkmer count -i input.fa -o output.rkdb --sorted
# Query performance comparison
# Unsorted: ~50,000 queries/second
# Sorted: ~200,000 queries/second
```
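
One reason sorted data can be queried faster is that it permits binary search instead of a linear scan. The sketch below contrasts the two on an in-memory list; it is a generic illustration, not a description of how RustKmer lays out or searches `.rkdb` files, and the function names are hypothetical.

```python
from bisect import bisect_left


def lookup_unsorted(kmers, counts, query):
    """Linear scan: O(n) comparisons per query."""
    for kmer, count in zip(kmers, counts):
        if kmer == query:
            return count
    return 0


def lookup_sorted(sorted_kmers, counts, query):
    """Binary search: O(log n) comparisons per query; requires sorted input."""
    i = bisect_left(sorted_kmers, query)
    if i < len(sorted_kmers) and sorted_kmers[i] == query:
        return counts[i]
    return 0


# Toy data: sort k-mers once, keeping counts aligned with them
pairs = sorted({"GATTA": 3, "ACGTA": 7, "TTTTT": 1, "CCGGA": 2}.items())
sorted_kmers = [kmer for kmer, _ in pairs]
counts = [count for _, count in pairs]

print(lookup_sorted(sorted_kmers, counts, "GATTA"))
print(lookup_sorted(sorted_kmers, counts, "AAAAA"))
```

Sorting is paid once at build time, so it is most worthwhile when a database is queried many times.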
## Large Dataset Handling
### Batch Processing
```python
import os
from pyrustkmer import KmerCounter


def process_large_dataset(input_files, k=21, batch_size=1_000_000):
    """Process large genomic datasets in batches."""
    for i, file in enumerate(input_files):
        print(f"Processing file {i+1}/{len(input_files)}: {file}")
        # Create counter with memory-efficient settings
        counter = KmerCounter(k, canonical=True)
        # Process in chunks if very large
        if os.path.getsize(file) > 1_000_000_000:  # 1 GB
            counter.add_from_fasta(file, chunk_size=batch_size)
        else:
            counter.add_from_fasta(file)
        # Save intermediate results
        output_file = f"batch_{i+1:03d}.rkdb"
        counter.save_database(output_file)
        print(f"Saved: {output_file}")
```
### Streaming Processing
```python
from pyrustkmer import KmerCounter


def stream_process_fasta(fasta_file, k=21):
    """Stream large FASTA files without loading entirely into memory."""
    counter = KmerCounter(k, canonical=True)
    # Process file in streaming mode
    with open(fasta_file, 'r') as f:
        counter.count_stream(f)
    return counter
```
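
To show what streaming means in practice, here is a dependency-free sketch that reads a FASTA handle line by line and holds only one record in memory at a time. It uses `collections.Counter` as a stand-in and does not reflect how `count_stream` is implemented; the helper names are hypothetical.

```python
import io
from collections import Counter


def iter_fasta_sequences(handle):
    """Yield (header, sequence) pairs, buffering only the current record."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line.upper())
    if header is not None:
        yield header, "".join(parts)


def count_kmers_streaming(handle, k):
    """Count k-mers record by record; k-mers never span two records."""
    counts = Counter()
    for _, seq in iter_fasta_sequences(handle):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# In-memory demo; any text file handle works the same way
demo = io.StringIO(">r1\nACGT\nAC\n>r2\nACG\n")
demo_counts = count_kmers_streaming(demo, 3)
print(demo_counts.most_common())
```

Peak memory here is bounded by the longest single record plus the count table, not by the file size.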
## Query Performance
### Database Indexing
```bash
# Create indexed database for fast queries
rustkmer count -i input.fa -o output.rkdb --indexed
# Query performance comparison
# Non-indexed: ~10,000 queries/second
# Indexed: ~500,000 queries/second
```
### Batch Querying
```python
from pyrustkmer import Database
import time


def batch_query_examples():
    """Optimize querying with batch operations."""
    queries = [
        "ATCGATCGATCGATCGATCG",
        "GCTAGCTAGCTAGCTAGCTAG",
        # ... thousands more queries
    ]
    # Database is not a context manager: create it, then load explicitly
    # (assumes a no-argument constructor)
    db = Database()
    db.load("large_database.rkdb")
    # Batch query (much faster than one call per k-mer)
    start_time = time.perf_counter()
    results = db.query_exact_batch(queries)  # preferred: new method name
    batch_time = time.perf_counter() - start_time
    # Compatibility note: the old method name still works but is deprecated
    # results = db.query_batch(queries)  # deprecated; use query_exact_batch()
    print(f"Batch query: {len(queries)} queries in {batch_time:.2f}s")
    print(f"Rate: {len(queries)/batch_time:.0f} queries/second")
    return results
```
### Fuzzy Query Optimization
```python
from pyrustkmer import Database


def optimize_fuzzy_queries():
    """Optimize fuzzy search parameters for speed vs accuracy."""
    # Database is not a context manager: create it, then load explicitly
    # (assumes a no-argument constructor)
    db = Database()
    db.load("database.rkdb")
    # Fast fuzzy search (less accurate)
    fast_results = db.query_fuzzy(
        pattern="ATCGATCGATCGATCGATCG",
        max_distance=2,
        max_results=100
    )
    # Comprehensive fuzzy search (slower, more accurate)
    full_results = db.query_fuzzy(
        pattern="ATCGATCGATCGATCGATCG",
        max_distance=3,
        max_results=1000,
        exhaustive=True
    )
    return fast_results, full_results
```
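
The cost of raising `max_distance` can be estimated by counting how many sequences lie within a given distance of the pattern. The sketch below assumes the distance is Hamming distance (substitutions only), which this guide does not state explicitly, and it says nothing about whether RustKmer enumerates these candidates or prunes them; the function name is hypothetical.

```python
from math import comb


def hamming_neighborhood_size(k: int, max_distance: int) -> int:
    """Number of DNA strings of length k within max_distance substitutions.

    For each distance d, choose d positions and one of 3 other bases at each.
    """
    return sum(comb(k, d) * 3 ** d for d in range(max_distance + 1))


for distance in (1, 2, 3):
    print(distance, hamming_neighborhood_size(20, distance))
```

For a 20-base pattern, going from distance 2 to distance 3 grows the candidate space from 1,771 to 32,551 sequences, which is why small distances are usually the practical choice.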
## System-Level Optimization
### CPU Optimization
```bash
# Check CPU cores for optimal threading
nproc               # Linux
sysctl -n hw.ncpu   # macOS
# Set threads based on CPU cores (on macOS, use the sysctl form instead)
THREADS=$(nproc)
rustkmer count -i input.fa -o output.rkdb --threads "$THREADS"
```
### Disk I/O Optimization
```bash
# Use fast storage for temporary files
export TMPDIR="/fast/ssd/temp"
rustkmer count -i input.fa -o output.rkdb
# Prefer uncompressed files on fast storage
rustkmer count -i input.fa -o /fast/storage/output.rkdb
```
### Memory Mapping
```python
from pyrustkmer import Database


def use_memory_mapping():
    """Enable memory mapping for large databases."""
    # Database is not a context manager: create it, then load explicitly
    # (assumes a no-argument constructor)
    db = Database()
    # Use memory-mapped access for large files
    db.load("large_database.rkdb", memory_mapped=True)
    # Repeated queries benefit most, since pages stay in the OS cache
    result = db.query_exact("ATCGATCGATCGATCGATCG")
    return result
```
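
For readers unfamiliar with memory mapping, the standard-library sketch below shows the underlying mechanism: the file is mapped into the address space and only the pages that are touched get read from disk. The fixed-width record layout is invented for the demo and is not the `.rkdb` format.

```python
import mmap
import os
import tempfile

RECORD_BYTES = 8  # hypothetical layout: one little-endian 64-bit count per record


def read_count(path: str, index: int) -> int:
    """Read one record through a read-only memory map without loading the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = index * RECORD_BYTES
            return int.from_bytes(mm[start:start + RECORD_BYTES], "little")


# Build a small demo file holding the values 0..9, then fetch record 7
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "counts.bin")
    with open(path, "wb") as f:
        for value in range(10):
            f.write(value.to_bytes(RECORD_BYTES, "little"))
    seventh = read_count(path, 7)

print(seventh)
```

In a long-running process the map would be opened once and reused across queries rather than reopened per lookup as in this short demo.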
## Benchmarking Your System
### Performance Test Script
```python
import time
import psutil
from pyrustkmer import KmerCounter, Database


def benchmark_system():
    """Benchmark RustKmer performance on your system."""
    print("🚀 RustKmer Performance Benchmark")
    print("=" * 40)
    # System info
    print(f"CPU Cores: {psutil.cpu_count()}")
    print(f"Memory: {psutil.virtual_memory().total / (1024**3):.1f} GB")
    # Test k-mer counting performance
    test_file = "test_data.fa"
    k_sizes = [13, 21, 31]
    for k in k_sizes:
        print(f"\n🧬 Testing k={k}")
        start_time = time.perf_counter()
        counter = KmerCounter(k, canonical=True)
        counter.add_from_fasta(test_file)
        count_time = time.perf_counter() - start_time
        total_kmers = counter.get_stats().total_kmers
        unique_kmers = counter.get_unique_count()
        print(f"  Counting time: {count_time:.2f}s")
        print(f"  Processing rate: {total_kmers/count_time:.0f} k-mers/sec")
        print(f"  Memory usage: {psutil.Process().memory_info().rss / (1024**2):.1f} MB")
        print(f"  Total k-mers: {total_kmers:,}")
        print(f"  Unique k-mers: {unique_kmers:,}")
    # Test querying performance
    print("\n🔍 Testing query performance")
    db_file = "test_database.rkdb"
    queries = ["ATCGATCGATCGATCGATCG"] * 10000  # 10k queries
    # Database is not a context manager: create it, then load explicitly
    # (assumes a no-argument constructor)
    db = Database()
    db.load(db_file)
    start_time = time.perf_counter()
    for query in queries:
        db.query_exact(query)
    query_time = time.perf_counter() - start_time
    print(f"  Query time: {query_time:.2f}s")
    print(f"  Query rate: {len(queries)/query_time:.0f} queries/sec")


if __name__ == "__main__":
    benchmark_system()
```
## Performance Comparison Tables
### K-mer Counting Performance
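| K-mer Size | Memory Usage | Counting Rate (per sec) | Database Size |
|------------|--------------|-------------------------|---------------|
| k=13 | Low | ~50,000 | Small |
| k=21 | Medium | ~30,000 | Medium |
| k=31 | High | ~15,000 | Large |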
### Querying Performance
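| Database Type | Build Speed | Query Rate | Memory Usage |
|---------------|-------------|------------|--------------|
| Unsorted | Fast | ~50,000 q/s | Low |
| Sorted | Medium | ~200,000 q/s | Medium |
| Indexed | Slow | ~500,000 q/s | High |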
### File Format Performance
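| Format | Read Speed | File Size | Processing Speed |
|--------|------------|-----------|------------------|
| FASTA (uncompressed) | Fastest | Large | Fastest |
| FASTA (gzip) | Medium | Small | Medium |
| FASTQ (uncompressed) | Fast | Large | Fast |
| FASTQ (gzip) | Slow | Small | Slow |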
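
When preprocessing inputs in Python, both plain and gzip-compressed files can be read through one helper, which makes it easy to time the two variants on your own data. This is a standard-library sketch with hypothetical helper names; it does not describe how RustKmer itself detects or reads compressed input.

```python
import gzip
import os
import tempfile


def open_sequence_file(path: str):
    """Open a FASTA/FASTQ file as text, decompressing when the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_all(path: str) -> str:
    """Read the whole file through the format-aware opener."""
    with open_sequence_file(path) as handle:
        return handle.read()


# Write the same record in both formats and confirm they decode identically
record = ">seq1\nACGTACGT\n"
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "demo.fa")
    packed = os.path.join(tmp, "demo.fa.gz")
    with open(plain, "w") as f:
        f.write(record)
    with gzip.open(packed, "wt") as f:
        f.write(record)
    same = read_all(plain) == read_all(packed)

print(same)
```

Wrapping `read_all` in `time.perf_counter()` calls gives a quick measure of the decompression cost on a given machine.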
## Troubleshooting Performance Issues
### Slow Counting
**Problem**: K-mer counting is taking too long
**Solutions**:
- Reduce k-mer size (k=13 instead of k=31)
- Disable canonical k-mers if not needed
- Use more threads (check CPU cores)
- Use uncompressed input files
- Split large files into smaller chunks
### High Memory Usage
**Problem**: Running out of memory
**Solutions**:
- Use smaller k-mer size
- Enable database compression
- Process files in batches
- Use streaming mode for very large files
- Close databases when not in use
### Slow Querying
**Problem**: Database queries are slow
**Solutions**:
- Create sorted databases (`--sorted`)
- Use indexed databases for frequent queries
- Batch multiple queries together
- Enable memory mapping for large databases
- Consider using a smaller k-mer size
### Database Loading Issues
**Problem**: Databases take too long to load
**Solutions**:
- Use uncompressed databases on fast storage
- Enable memory mapping
- Load databases once and reuse
- Consider using multiple smaller databases
## Best Practices
### For High-Performance Computing
1. **Use SSD storage** for database files
2. **Allocate sufficient RAM** (2x database size for optimal performance)
3. **Use thread count equal to CPU cores**
4. **Prefer sorted databases** for frequent querying
5. **Batch operations** when possible
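
The RAM guideline in item 2 can be checked before a job starts. The sketch below applies the 2x factor to a database size and compares it with the memory available; the function names are hypothetical, and the available-memory figure is passed in so the example needs no third-party packages.

```python
GIB = 1024 ** 3


def recommended_ram_bytes(db_size_bytes: int, factor: float = 2.0) -> int:
    """RAM suggested by the rule of thumb: factor times the database size."""
    return int(db_size_bytes * factor)


def fits_in_ram(db_size_bytes: int, available_bytes: int, factor: float = 2.0) -> bool:
    """True when the recommended RAM does not exceed what is available."""
    return recommended_ram_bytes(db_size_bytes, factor) <= available_bytes


# A 6 GiB database fits the guideline on a 16 GiB machine; a 10 GiB one does not
print(fits_in_ram(6 * GIB, 16 * GIB))
print(fits_in_ram(10 * GIB, 16 * GIB))
```

In a real script, `db_size_bytes` could come from `os.path.getsize` and `available_bytes` from `psutil.virtual_memory().available`.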
### For Production Systems
1. **Monitor memory usage** and implement limits
2. **Use database compression** to save disk space
3. **Implement proper error handling** and retry logic
4. **Use connection pooling** for multi-user systems
5. **Monitor performance regularly** and optimize as needed
### For Development/Testing
1. **Start with small datasets** to test workflows
2. **Use canonical k-mers** for consistency
3. **Implement proper logging** for performance analysis
4. **Test different k-mer sizes** for your use case
5. **Benchmark on target hardware** before deployment
---
## Need More Help?
- **[API Reference](../api-reference/)** - Complete function documentation
- **[Troubleshooting](../getting-started/troubleshooting.md)** - Common issues and solutions
- **[GitHub Issues](https://github.com/rustkmer/rustkmer/issues)** - Report performance problems
- **[Discussions](https://github.com/rustkmer/rustkmer/discussions)** - Performance optimization tips