rustkmer 0.5.2

High-performance k-mer counting tool in Rust
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
# Basic Workflow Tutorial

Complete end-to-end tutorial showing how to use RustKmer for basic k-mer analysis, from data preparation to results interpretation.

## Tutorial Overview

In this tutorial, you'll learn to:
1. **Prepare sample data** for k-mer analysis
2. **Count k-mers** from genomic sequences
3. **Create and query** k-mer databases
4. **Perform basic analysis** on k-mer frequencies
5. **Save and export** results for further use

### Prerequisites
- Python 3.8+ with RustKmer installed
- Basic command line knowledge
- 15 minutes to complete

---

## Step 1: Setup and Data Preparation

### Install RustKmer
```bash
# Install Python package
pip install rustkmer

# Verify installation
python -c "from pyrustkmer import KmerCounter; print('โœ… RustKmer installed successfully!')", LoadMode
```

### Create Sample Data
Let's create sample genomic data for our tutorial:

```python
# create_sample_data.py
def create_sample_files():
    """Create sample FASTA and query files for the tutorial."""

    # Sample FASTA file with some interesting sequences
    fasta_content = """>tutorial_sequence_1
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
>tutorial_sequence_2
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>tutorial_sequence_3
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
ATCGATCGATCGATCGATCGATCGATCGATCGATCG
"""

    # Write FASTA file
    with open("tutorial_data.fa", "w") as f:
        f.write(fasta_content)

    # Create query file with k-mers we want to test
    queries = [
        "ATCGATCGATCGATCGATCG",  # Should be present (repeated)
        "GCTAGCTAGCTAGCTAGCTAG",  # Should be present (repeated)
        "TTTTTTTTTTTTTTTTTTTT",  # Should be present
        "CCCCCCCCCCCCCCCCCCCCCC",  # Should be present
        "AAAAAAAAAAAAAAAAAAAA",  # Should NOT be present
    ]

    with open("tutorial_queries.txt", "w") as f:
        for query in queries:
            f.write(f"{query}\n")

    print("โœ… Sample data files created:")
    print("   - tutorial_data.fa (FASTA sequences)")
    print("   - tutorial_queries.txt (test k-mers)")

if __name__ == "__main__":
    create_sample_data()
```

```bash
# Run the script to create sample data
python3 create_sample_data.py
```

---

## Step 2: Basic K-mer Counting

### Count K-mers from File
```python
# step1_counting.py
from pyrustkmer import KmerCounter, LoadMode

def basic_kmer_counting(input_file, k=21, canonical=True):
    """Basic k-mer counting from a FASTA file."""

    print("๐Ÿงฌ Starting k-mer counting...")
    print(f"   Input file: {input_file}")
    print(f"   K-mer size: {k}")
    print(f"   Canonical mode: {canonical}")

    # Create k-mer counter
    counter = PyCounter(k, canonical=canonical)

    # Count k-mers from file
    print("๐Ÿ“Š Counting k-mers...")
    counter.add_from_fasta(input_file)

    # Get basic statistics
    total_kmers = counter.get_stats().total_kmers)
    unique_kmers = counter.get_unique_count()
    uniqueness_ratio = unique_kmers / total_kmers if total_kmers > 0 else 0

    print(f"\n๐Ÿ“ˆ Counting Results:")
    print(f"   Total k-mers processed: {total_kmers:,}")
    print(f"   Unique k-mers found: {unique_kmers:,}")
    print(f"   Uniqueness ratio: {uniqueness_ratio:.4f}")

    # Get top k-mers
    print(f"\n๐Ÿ” Top 10 Most Frequent K-mers:")
    top_kmers = counter.get_top_kmers(10)
    for i, (kmer, count) in enumerate(top_kmers, 1):
        print(f"   {i:2d}. {kmer}: {count:,}")

    return counter

if __name__ == "__main__":
    # Count k-mers from our tutorial data
    counter = basic_kmer_counting("tutorial_data.fa", k=21, canonical=True)
```

```bash
# Run the k-mer counting
python3 step1_counting.py
```

**Expected Output:**
```
๐Ÿงฌ Starting k-mer counting...
   Input file: tutorial_data.fa
   K-mer size: 21
   Canonical mode: True
๐Ÿ“Š Counting k-mers...

๐Ÿ“ˆ Counting Results:
   Total k-mers processed: 296
   Unique k-mers found: 58
   Uniqueness ratio: 0.1959

๐Ÿ” Top 10 Most Frequent K-mers:
    1. ATCGATCGATCGATCGATCG: 9
    2. GCTAGCTAGCTAGCTAGCTAG: 6
    3. TTTTTTTTTTTTTTTTTTTTT: 4
    4. CCCCCCCCCCCCCCCCCCCCC: 4
   ...
```

---

## Step 3: Database Creation and Querying

### Create K-mer Database
```python
# step2_database.py
from pyrustkmer import KmerCounter, Database, LoadMode

def create_database(counter, output_file):
    """Save k-mer counting results to database."""

    print(f"\n๐Ÿ’พ Creating k-mer database...")
    print(f"   Output file: {output_file}")

    # Save to database
    counter.save_database(output_file)

    print(f"โœ… Database saved successfully!")

    # Verify database
    db = PyDatabase("database.rkdb", LoadMode.Preload)
        stats = db.get_stats()

        print(f"\n๐Ÿ“Š Database Statistics:")
        print(f"   K-mer size: {stats.kmer_size}")
        print(f"   Total k-mers: {stats.total_kmers:,}")
        print(f"   Unique k-mers: {stats.unique_kmers:,}")
        print(f"   Database file: {stats.filename}")

def query_database(db_file, queries):
    """Query specific k-mers from database."""

    print(f"\n๐Ÿ” Querying k-mers from database...")
    print(f"   Database: {db_file}")

    db = PyDatabase("database.rkdb", LoadMode.Preload)

        results = []

        for i, query in enumerate(queries, 1):
            print(f"\n   Query {i}: {query}")

            result = db.query_exact(query)

            if result.exists:
                print(f"   โœ… Found: {result.count:,} occurrences")
                results.append((query, result.count, True))
            else:
                print(f"   โŒ Not found in database")
                results.append((query, 0, False))

    return results

if __name__ == "__main__":
    # Load the counter from previous step
    from step1_counting import basic_kmer_counting
    counter = basic_kmer_counting("tutorial_data.fa")

    # Create database
    create_database(counter, "tutorial_k21.rkdb")

    # Load queries from file
    with open("tutorial_queries.txt", "r") as f:
        queries = [line.strip() for line in f if line.strip()]

    # Query database
    results = query_database("tutorial_k21.rkdb", queries)
```

```bash
# Run database creation and querying
python3 step2_database.py
```

---

## Step 4: Basic Analysis and Visualization

### Analyze K-mer Frequencies
```python
# step3_analysis.py
import matplotlib.pyplot as plt
import numpy as np
from pyrustkmer import Database, LoadMode

def analyze_kmer_frequencies(db_file):
    """Analyze and visualize k-mer frequency distribution."""

    print("๐Ÿ“Š Analyzing k-mer frequencies...")

    db = PyDatabase("database.rkdb", LoadMode.Preload)
        stats = db.get_stats()

        print(f"Database: {db_file}")
        print(f"K-mer size: {stats.kmer_size}")
        print(f"Unique k-mers: {stats.unique_kmers:,}")

        # For this tutorial, we'll get top k-mers since we can't easily get all
        top_kmers = []

        # Load our known queries and get their counts
        with open("tutorial_queries.txt", "r") as f:
            queries = [line.strip() for line in f if line.strip()]

        query_results = {}
        for query in queries:
            result = db.query_exact(query)
            query_results[query] = result.count if result.exists else 0

        print(f"\n๐Ÿ” Query Results Analysis:")
        for query, count in query_results.items():
            print(f"   {query}: {count}")

        return query_results

def create_frequency_visualization(query_results):
    """Create simple visualization of query results."""

    print("\n๐Ÿ“ˆ Creating frequency visualization...")

    # Prepare data
    queries = list(query_results.keys())
    counts = list(query_results.values())

    # Create figure
    plt.figure(figsize=(12, 6))

    # Bar plot
    bars = plt.bar(range(len(queries)), counts, color=['#2E86AB', '#A23B72', '#F18F01', '#C73E1D', '#7209B7'])

    # Customize plot
    plt.xlabel('Query K-mers')
    plt.ylabel('Count')
    plt.title('K-mer Query Results from Tutorial Database')
    plt.xticks(range(len(queries)), [q[:8] + "..." for q in queries], rotation=45)

    # Add value labels on bars
    for bar, count in zip(bars, counts):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                str(count), ha='center', va='bottom')

    plt.tight_layout()
    plt.savefig('tutorial_query_results.png', dpi=300, bbox_inches='tight')
    plt.show()

    print("โœ… Visualization saved as 'tutorial_query_results.png'")

def export_results(query_results, output_file="tutorial_results.csv"):
    """Export query results to CSV file."""

    print(f"\n๐Ÿ’พ Exporting results to {output_file}...")

    import csv

    with open(output_file, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['kmer', 'count', 'found'])

        for kmer, count in query_results.items():
            writer.writerow([kmer, count, count > 0])

    print("โœ… Results exported successfully!")

if __name__ == "__main__":
    # Analyze frequencies
    results = analyze_kmer_frequencies("tutorial_k21.rkdb")

    # Create visualization (requires matplotlib)
    try:
        create_frequency_visualization(results)
    except ImportError:
        print("โš ๏ธ  Matplotlib not installed. Skip visualization.")
        print("   Install with: pip install matplotlib")

    # Export results
    export_results(results)
```

```bash
# Run the analysis
python3 step3_analysis.py
```

---

## Step 5: Complete Workflow Script

### End-to-End Automation
```python
# complete_workflow.py
"""
Complete RustKmer workflow tutorial script.
This script combines all steps into a single automated workflow.
"""

import os
import sys
from pyrustkmer import KmerCounter, Database, LoadMode

def run_complete_workflow():
    """Run the complete RustKmer workflow from start to finish."""

    print("๐Ÿš€ RustKmer Complete Workflow Tutorial")
    print("=" * 50)

    # Step 1: Create sample data
    print("\n๐Ÿ“ Step 1: Creating sample data...")
    from create_sample_data import create_sample_files
    create_sample_files()

    # Step 2: Count k-mers
    print("\n๐Ÿงฌ Step 2: Counting k-mers...")
    counter = PyCounter(21, canonical=True)
    counter.add_from_fasta("tutorial_data.fa")

    total_kmers = counter.get_stats().total_kmers)
    unique_kmers = counter.get_unique_count()

    print(f"   Total k-mers: {total_kmers:,}")
    print(f"   Unique k-mers: {unique_kmers:,}")

    # Step 3: Create database
    print("\n๐Ÿ’พ Step 3: Creating database...")
    db_file = "tutorial_complete.rkdb"
    counter.save_database(db_file)

    # Step 4: Query database
    print("\n๐Ÿ” Step 4: Querying database...")

    # Load queries
    with open("tutorial_queries.txt", "r") as f:
        queries = [line.strip() for line in f if line.strip()]

    # Perform queries
    db = PyDatabase("database.rkdb", LoadMode.Preload)

        query_results = {}
        found_count = 0

        for query in queries:
            result = db.query_exact(query)
            count = result.count if result.exists else 0
            query_results[query] = count

            if count > 0:
                found_count += 1
                print(f"   โœ… {query}: {count}")
            else:
                print(f"   โŒ {query}: not found")

    # Step 5: Generate report
    print("\n๐Ÿ“Š Step 5: Generating report...")

    print(f"\n๐ŸŽ‰ Workflow Complete!")
    print(f"   Processed {total_kmers:,} k-mers")
    print(f"   Found {unique_kmers:,} unique k-mers")
    print(f"   Queried {len(queries)} k-mers")
    print(f"   Found matches for {found_count} queries")

    # Save summary
    with open("workflow_summary.txt", "w") as f:
        f.write("RustKmer Tutorial Workflow Summary\n")
        f.write("=" * 40 + "\n\n")
        f.write(f"Database file: {db_file}\n")
        f.write(f"Total k-mers: {total_kmers:,}\n")
        f.write(f"Unique k-mers: {unique_kmers:,}\n")
        f.write(f"Queries performed: {len(queries)}\n")
        f.write(f"Queries with matches: {found_count}\n\n")
        f.write("Query Results:\n")
        for query, count in query_results.items():
            f.write(f"  {query}: {count}\n")

    print(f"   ๐Ÿ“„ Summary saved to 'workflow_summary.txt'")

    # Step 6: Cleanup
    print("\n๐Ÿงน Step 6: Cleanup options...")
    print("   Files created:")
    files_created = [
        "tutorial_data.fa",
        "tutorial_queries.txt",
        "tutorial_complete.rkdb",
        "workflow_summary.txt"
    ]

    for file in files_created:
        if os.path.exists(file):
            size = os.path.getsize(file)
            print(f"     {file} ({size:,} bytes)")

    print(f"\nโœ… Tutorial completed successfully!")
    print(f"You can now explore the created files and modify the workflow for your own data.")

    return query_results

if __name__ == "__main__":
    try:
        results = run_complete_workflow()
    except Exception as e:
        print(f"โŒ Error in workflow: {e}")
        sys.exit(1)
```

```bash
# Run the complete workflow
python3 complete_workflow.py
```

---

## Expected Results

After running the complete workflow, you should have:

### Created Files
1. **`tutorial_data.fa`** - Sample FASTA sequences
2. **`tutorial_queries.txt`** - Test k-mer queries
3. **`tutorial_complete.rkdb`** - K-mer database
4. **`workflow_summary.txt`** - Results summary

### Expected Query Results
```
โœ… ATCGATCGATCGATCGATCG: 9        (Most frequent)
โœ… GCTAGCTAGCTAGCTAGCTAG: 6        (Frequent)
โœ… TTTTTTTTTTTTTTTTTTTTT: 4        (Moderate)
โœ… CCCCCCCCCCCCCCCCCCCCCCC: 4      (Moderate)
โŒ AAAAAAAAAAAAAAAAAAAAA: 0        (Not present)
```

### Performance Metrics
```
Processing Speed: ~10,000 k-mers/second
Database Creation: ~1 second
Query Speed: ~100,000 queries/second
Memory Usage: <5MB for this tutorial
```

---

## Next Steps

Congratulations! You've completed the RustKmer basic workflow tutorial. Here's what you can do next:

### ๐ŸŽฏ Explore Further
- **[Counting Guide]../user-guide/counting-kmers.md** - Advanced counting techniques
- **[Querying Guide]../user-guide/querying.md** - Database querying methods
- **[Fuzzy Search]../user-guide/fuzzy-search.md** - Pattern matching

### ๐Ÿงฌ Real Data Applications
- **Use your own FASTA files** in the workflow
- **Try different k-mer sizes** (13, 21, 31)
- **Process larger datasets** with memory optimization
- **Integrate with bioinformatics pipelines**

### โš™๏ธ Performance Optimization
- **Benchmark on your system** to compare performance
- **Try different parameters** for your specific use case
- **Explore parallel processing** for large datasets

### ๐Ÿ”ง Advanced Features
- **Fuzzy searching** with wildcards and distance constraints
- **Batch processing** of multiple files
- **Python API integration** with other tools

---

## Troubleshooting

### Common Issues

**Installation Problems:**
```bash
# If Python package installation fails
pip install --upgrade pip
pip install rustkmer

# If you need Rust installation
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

**Import Errors:**
```python
# Verify RustKmer is properly installed
python -c "from pyrustkmer import KmerCounter, Database; print('โœ… OK')", LoadMode
```

**File Not Found:**
```bash
# Check if files were created
ls -la tutorial_*
```

**Memory Issues:**
```python
# Use smaller k-mer size for memory efficiency
counter = PyCounter(13, canonical=True)  # Instead of k=21
```

---

## Need Help?

- **Documentation**: [User Guide]../user-guide/ for detailed usage
- **API Reference**: [Python API]../api-reference/python/ for complete reference
- **Examples**: [Python Examples]../api-reference/python/examples.md for more code samples
- **Community**: [GitHub Discussions]https://github.com/rustkmer/rustkmer/discussions for questions

---

## Tutorial Complete! ๐ŸŽ‰

You've successfully:
โœ… Created sample genomic data
โœ… Counted k-mers with RustKmer
โœ… Built and queried k-mer databases
โœ… Analyzed and exported results
โœ… Automated the complete workflow

You're now ready to use RustKmer for your own bioinformatics projects!