rustkmer 0.5.2

High-performance k-mer counting tool in Rust
Documentation
# API Reference Overview

This section provides comprehensive documentation for the RustKmer Python API. The API is designed to be intuitive, efficient, and fully compatible with the RustKmer CLI commands.

## Core Classes

The RustKmer Python API consists of three main classes and supporting data structures:

### [Database](database.md)
The primary class for interacting with RKDB k-mer database files.

```python
from pyrustkmer import PyDatabase, LoadMode, PyFuzzyQuery

# Initialize database connection
db = PyDatabase("database.rkdb", LoadMode.Preload)
fuzzy = PyFuzzyQuery(db)

# Query k-mers
result = db.query_exact("ATCGATCGATCGATCGATCGATCGATCGATCGATCG")
print(f"Count: {result.count}")

# Fuzzy queries with mutation tolerance
fuzzy_result = fuzzy.query_fuzzy("ATCGATCGATCGATCGATCGATCGATCGATCGATCG", mutations=2)
print(f"Found {fuzzy_result.total_matches} similar k-mers")

# Get database statistics
stats = db.get_stats()
print(f"K-mer size: {stats.kmer_size}")
print(f"Unique k-mers: {stats.unique_kmers}")
```

**Key Features:**
- Memory-mapped access for large databases
- Exact and fuzzy k-mer queries
- Batch query optimization
- Position-specific mutation constraints
- Database statistics and metadata
- Context manager support for resource management

### [QueryResult](query.md)
Represents the result of an exact k-mer query.

```python
from pyrustkmer import PyQueryResult, PyFuzzyQuery

result = PyQueryResult(
    kmer="ATCGATCGATCGATCGATCGATCGATCGATCGATCG",
    count=42,
    canonical="ATCGATCGATCGATCGATCGATCGATCGATCGATCG"
)

print(f"K-mer: {result.kmer}")
print(f"Count: {result.count}")
print(f"Present: {result.found}")
```

**Key Features:**
- Simple dataclass structure
- JSON serialization support
- Dictionary conversion
- Presence checking
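
This page does not define the `canonical` field. Many k-mer tools use the lexicographically smaller of a k-mer and its reverse complement, and the sketch below assumes that convention; it is not taken from RustKmer's source. `reverse_complement` and `canonical_kmer` are illustrative helper names, not part of `pyrustkmer`.

```python
# Assumption: "canonical" means the lexicographically smaller of a k-mer
# and its reverse complement (a common convention, not confirmed for RustKmer).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical_kmer(kmer: str) -> str:
    """Return the smaller of the k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


query = "ATCG" * 9
print(reverse_complement("ATCG"))      # CGAT
print(canonical_kmer(query) == query)  # True
```

Under this assumption the example k-mer above is its own canonical form, which is consistent with `kmer` and `canonical` being identical in the constructor call.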

### [DatabaseStats](stats.md)
Contains statistics and metadata about a k-mer database.

```python
from pyrustkmer import PyDatabaseStats, PyFuzzyQuery

stats = PyDatabaseStats(
    kmer_size=31,
    unique_kmers=1000000,
    total_counts=5000000,
    min_count=1,
    max_count=1000,
    file_size=25000000,
    format_version="2.0"
)

print(f"Database contains {stats.unique_kmers:,} unique k-mers")
print(f"Total occurrences: {stats.total_counts:,}")
```

**Key Features:**
- Complete database metadata
- JSON serialization support
- Statistical information for analysis
- File format version tracking
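
To show what JSON serialization of these fields could look like, here is a minimal sketch using a plain dataclass. `StatsSketch` and its `mean_count` property are hypothetical stand-ins that mirror the field names above; they do not describe how `PyDatabaseStats` is implemented.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StatsSketch:
    """Hypothetical stand-in mirroring the fields shown above."""
    kmer_size: int
    unique_kmers: int
    total_counts: int
    min_count: int
    max_count: int
    file_size: int
    format_version: str

    @property
    def mean_count(self) -> float:
        """Average occurrences per unique k-mer (0.0 for an empty database)."""
        return self.total_counts / self.unique_kmers if self.unique_kmers else 0.0


stats = StatsSketch(31, 1_000_000, 5_000_000, 1, 1000, 25_000_000, "2.0")
payload = json.dumps(asdict(stats), sort_keys=True)  # fields only, no properties
restored = StatsSketch(**json.loads(payload))
print(stats.mean_count)   # 5.0
print(restored == stats)  # True
```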

### [Fuzzy Query Classes](fuzzyquery.md)
Classes for representing fuzzy query results with mutation tolerance.

```python
from pyrustkmer import PyFuzzyResult, PyFuzzyMatch, PyPrefixQueryResult, PyFuzzyQuery

# Single fuzzy query result
result = PyFuzzyResult(
    query_kmer="ATCGATCGATCGATCGATCGATCGATCGATCGATCG",
    mutations_allowed=2,
    total_matches=15,
    matches=[...]  # List of PyFuzzyMatch objects
)

# Batch query results
batch_result = PyPrefixQueryResult(
    total_queries=10,
    successful_queries=9,
    successes={"ATCG...": result},
    errors={"INVALID": "Invalid k-mer format"}
)
```

**Key Features:**
- Hierarchical result organization
- Distance-based filtering
- Top match selection
- Batch processing support
- Comprehensive error handling
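
The batch result above separates successes from per-k-mer errors. One way such a structure could be filled is sketched below in pure Python. `BatchSketch`, `run_batch`, and `toy_lookup` are hypothetical names, and a dictionary stands in for the database; the real batch API may collect results differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable


@dataclass
class BatchSketch:
    """Hypothetical container with the same four fields as the example above."""
    total_queries: int = 0
    successful_queries: int = 0
    successes: Dict[str, int] = field(default_factory=dict)
    errors: Dict[str, str] = field(default_factory=dict)


def run_batch(kmers: Iterable[str], lookup: Callable[[str], int]) -> BatchSketch:
    """Run `lookup` on each k-mer, recording failures instead of raising."""
    batch = BatchSketch()
    for kmer in kmers:
        batch.total_queries += 1
        try:
            batch.successes[kmer] = lookup(kmer)
            batch.successful_queries += 1
        except ValueError as exc:
            batch.errors[kmer] = str(exc)
    return batch


def toy_lookup(kmer: str) -> int:
    """Stand-in for a database query: rejects non-ACGT input."""
    if not kmer or set(kmer) - set("ACGT"):
        raise ValueError("Invalid k-mer format")
    return {"ATCG": 42}.get(kmer, 0)


batch = run_batch(["ATCG", "GGGG", "INVALID"], toy_lookup)
print(batch.successful_queries, batch.errors)
```

Recording errors per input keeps one malformed k-mer from aborting the whole batch, which is the behavior the `errors` field implies.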

### [Exceptions](exceptions.md)
Comprehensive exception hierarchy for robust error management.

```python
from pyrustkmer import (
    PyDatabase, LoadMode, PyFuzzyQuery,
    DatabaseNotFoundError, InvalidKmerError, FuzzyQueryError,
)

# Opening a missing file raises DatabaseNotFoundError
try:
    db = PyDatabase("nonexistent.rkdb", LoadMode.Preload)
    fuzzy = PyFuzzyQuery(db)
except DatabaseNotFoundError as e:
    print(f"Database not found: {e.path}")

# A malformed k-mer raises InvalidKmerError
db = PyDatabase("database.rkdb", LoadMode.Preload)
try:
    result = db.query_exact("INVALID")
except InvalidKmerError as e:
    print(f"Invalid k-mer: {e.kmer}, reason: {e.reason}")
```

**Key Features:**
- Granular exception types
- Helpful error context
- Structured error information
- Inheritance hierarchy for easy catching
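
The value of an inheritance hierarchy is that callers can catch a base class and still read structured context from the subclass. The sketch below illustrates the idea with hypothetical classes (every `...Sketch` name and `check_kmer` are invented here). The actual parent/child relationships in `pyrustkmer` are not stated on this page, so the layout is an assumption.

```python
class KmerErrorSketch(Exception):
    """Hypothetical common base class for all errors in this sketch."""


class QueryErrorSketch(KmerErrorSketch):
    """Base class for failures raised while running a query."""


class InvalidKmerSketch(QueryErrorSketch):
    """Carries the offending k-mer and a human-readable reason."""

    def __init__(self, kmer: str, reason: str) -> None:
        super().__init__(f"invalid k-mer {kmer!r}: {reason}")
        self.kmer = kmer
        self.reason = reason


def check_kmer(kmer: str, k: int) -> str:
    """Return the k-mer unchanged, or raise with structured context."""
    if len(kmer) != k:
        raise InvalidKmerSketch(kmer, f"expected length {k}, got {len(kmer)}")
    if set(kmer) - set("ACGT"):
        raise InvalidKmerSketch(kmer, "contains characters other than A, C, G, T")
    return kmer


try:
    check_kmer("INVALID", k=31)
except QueryErrorSketch as exc:  # the base class also catches the subclass
    caught = exc
    print(type(exc).__name__, "-", exc.reason)
```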

## API Design Principles

### 1. **Performance First**
All operations are optimized for speed and memory efficiency:
- Subprocess calls to optimized Rust CLI
- Memory-mapped file access where possible
- Parallel processing for batch operations
- Lazy loading of metadata
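
As an illustration of parallel batch processing with a worker limit, the following sketch fans lookups out over a standard-library thread pool. It is not RustKmer's implementation: `parallel_counts` is a hypothetical helper, a dictionary stands in for the database, and threads only pay off when each lookup waits on I/O or releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def parallel_counts(kmers: List[str],
                    lookup: Callable[[str], int],
                    max_workers: int = 4) -> Dict[str, int]:
    """Apply `lookup` to every k-mer using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whatever the finish order.
        counts = list(pool.map(lookup, kmers))
    return dict(zip(kmers, counts))


toy_db = {"ATCG": 42, "GCTA": 7}  # stand-in for a real database
result = parallel_counts(["ATCG", "GCTA", "TTTT"],
                         lambda kmer: toy_db.get(kmer, 0),
                         max_workers=2)
print(result)  # {'ATCG': 42, 'GCTA': 7, 'TTTT': 0}
```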

### 2. **Pythonic Interface**
- Follows Python naming conventions
- Uses type hints throughout
- Implements context managers where appropriate
- Returns Python native types and dataclasses

### 3. **CLI Compatibility**
The Python API provides 100% functional parity with CLI commands:
- Same algorithms and parameters
- Identical output formats
- Consistent error handling
- Position-mutation feature support

### 4. **Error Handling**
- Comprehensive exception hierarchy
- Informative error messages with context
- Graceful degradation options
- Validation at multiple levels

## Common Patterns

### Database Operations

```python
from pyrustkmer import PyDatabase, LoadMode, PyFuzzyQuery

# Context manager usage (recommended)
with PyDatabase("database.rkdb", LoadMode.Preload) as db:
    fuzzy = PyFuzzyQuery(db)
    result = db.query_exact("ATCGATCGATCGATCGATCGATCGATCGATCGATCG")
    stats = db.get_stats()
# Database automatically closed

# Manual management
db = PyDatabase("database.rkdb", LoadMode.Preload)
fuzzy = PyFuzzyQuery(db)
try:
    result = db.query_exact("ATCGATCGATCGATCGATCGATCGATCGATCGATCG")
    process_result(result)
finally:
    db.close()  # assumed method name; the cleanup call is missing from the original snippet
```

### Fuzzy Querying

```python
from pyrustkmer import PyDatabase, LoadMode, PyFuzzyQuery

db = PyDatabase("database.rkdb", LoadMode.Preload)
fuzzy = PyFuzzyQuery(db)

# Basic fuzzy query
result = fuzzy.query_fuzzy("ATCGATCGATCGATCGATCGATCGATCGATCGATCG", mutations=2)

# Position-specific mutations
result = fuzzy.query_fuzzy(
    "ATCGATCGATCGATCGATCGATCGATCGATCGATCG",
    position_mutations="10,15:2"  # Allow 2 mutations at positions 10 and 15
)

# Batch fuzzy queries
kmers = ["ATCGATCG...", "GCTAGCTA...", "TTTTTTTT..."]
batch_result = db.fuzzy_query_batch(kmers, mutations=2, max_workers=8)
```
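
The `position_mutations` string above is shown only through the single example `"10,15:2"`. Assuming the layout is "comma-separated positions, a colon, then the mutation limit", a parser could look like the sketch below. `parse_position_mutations` is a hypothetical helper and the grammar is inferred from that one example; the real option may accept other forms.

```python
from typing import List, Tuple


def parse_position_mutations(spec: str) -> Tuple[List[int], int]:
    """Split 'p1,p2,...:n' into a sorted position list and a mutation limit."""
    positions_part, sep, limit_part = spec.partition(":")
    if not sep:
        raise ValueError(f"missing ':' in position spec {spec!r}")
    # int() raises ValueError on empty or non-numeric fields.
    positions = sorted({int(p) for p in positions_part.split(",")})
    limit = int(limit_part)
    if limit < 0:
        raise ValueError(f"mutation limit must be non-negative, got {limit}")
    return positions, limit


print(parse_position_mutations("10,15:2"))  # ([10, 15], 2)
```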

### Error Handling

```python
from pyrustkmer import (
    PyDatabase, LoadMode, PyFuzzyQuery,
    DatabaseNotFoundError, InvalidKmerError,
    FuzzyQueryError, QueryError
)

def safe_query(db_path, kmer):
    try:
        with PyDatabase(db_path, LoadMode.Preload) as db:
            fuzzy = PyFuzzyQuery(db)
            result = db.query_exact(kmer)
            return result

    except DatabaseNotFoundError:
        print(f"Database file not found: {db_path}")
        return None

    except InvalidKmerError as e:
        print(f"Invalid k-mer: {e.kmer} - {e.reason}")
        return None

    except QueryError as e:
        print(f"Query failed: {e}")
        return None
```

## Performance Considerations

### Database Initialization

```python
# Fast initialization (recommended): validation is skipped by default
db = PyDatabase("database.rkdb", LoadMode.Preload)
fuzzy = PyFuzzyQuery(db)

# Full validation (slower): enable the `validate` option listed in the
# Type Hints Reference only when the file's integrity must be checked first
```

### Fuzzy Query Optimization

```python
# Use position mutations to limit search space
result = fuzzy.query_fuzzy(
    kmer,
    mutations=2,
    position_mutations="10,15:2"  # Much faster than allowing mutations anywhere
)

# Limit variants to prevent combinatorial explosion
result = fuzzy.query_fuzzy(
    kmer,
    mutations=3,
    max_variants=1000  # Conservative limit
)

# Batch processing for multiple queries
batch_result = db.fuzzy_query_batch(kmers, mutations=2, max_workers=8)
```
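
To see why `max_variants` and position constraints matter, it helps to count how many candidate strings a fuzzy query might have to consider. The sketch below assumes "mutations" means base substitutions only (Hamming distance), which this page does not state explicitly; `variant_count` is a hypothetical helper.

```python
from math import comb
from typing import Optional


def variant_count(k: int, mutations: int, positions: Optional[int] = None) -> int:
    """Count DNA strings within `mutations` substitutions of a k-mer.

    `positions` limits how many sites may change; None means all k sites.
    The count includes the unmodified k-mer itself.
    """
    sites = k if positions is None else min(positions, k)
    # Choose i sites to change; each changed site has 3 alternative bases.
    return sum(comb(sites, i) * 3 ** i for i in range(min(mutations, sites) + 1))


for m in (1, 2, 3):
    print(m, variant_count(31, m))        # 94, 4279, 125644
print(variant_count(31, 2, positions=2))  # 16
```

For a 31-mer, going from two to three allowed substitutions grows the candidate set roughly thirtyfold, while restricting two mutations to two fixed positions leaves only 16 candidates.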

### Memory Management

```python
# Database automatically handles memory mapping
# Use context managers for proper resource cleanup
with PyDatabase("large_db.rkdb", LoadMode.Preload) as db:
    for result in db.dump(limit=100000):
        process_result(result)
# Resources automatically freed
```

## Integration with Bioinformatics Libraries

### Pandas Integration

```python
import pandas as pd
from pyrustkmer import PyDatabase, LoadMode, PyFuzzyQuery

def query_dataframe(db_path, df, sequence_col='sequence'):
    """Query k-mers from a pandas DataFrame."""
    db = PyDatabase(db_path)
    fuzzy = PyFuzzyQuery(db)

    # Add count column
    df['count'] = df[sequence_col].apply(
        lambda seq: db.query_exact(seq).count
    )

    return df

# Usage
df = pd.read_csv("sequences.csv")
df_with_counts = query_dataframe("database.rkdb", df)
```

### NumPy Integration

```python
import numpy as np
from pyrustkmer import PyDatabase, LoadMode, PyFuzzyQuery

def batch_query_numpy(db_path, sequences):
    """Vectorized batch querying."""
    db = PyDatabase(db_path)
    fuzzy = PyFuzzyQuery(db)

    # Convert to array for processing
    seq_array = np.array(sequences)
    counts = np.array([db.query_exact(seq).count for seq in seq_array])

    return counts

# Usage
sequences = ["ATCGATCG...", "GCTAGCTA...", "TTTTTTTT..."]
counts = batch_query_numpy("database.rkdb", sequences)
```

### Biopython Integration

```python
from Bio import SeqIO
from pyrustkmer import PyDatabase, LoadMode, PyFuzzyQuery

def extract_kmers_from_fasta(fasta_file, k=31):
    """Extract k-mers from FASTA sequences."""
    kmers = []
    for record in SeqIO.parse(fasta_file, "fasta"):
        seq = str(record.seq).upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i+k]
            if 'N' not in kmer:  # Skip ambiguous k-mers
                kmers.append(kmer)
    return kmers

def analyze_sequences(db_path, fasta_file):
    """Analyze sequences from FASTA against database."""
    db = PyDatabase(db_path)
    fuzzy = PyFuzzyQuery(db)
    kmers = extract_kmers_from_fasta(fasta_file)

    results = []
    for kmer in kmers[:1000]:  # Limit for demo
        result = db.query_exact(kmer)
        if result.found:
            results.append({
                'kmer': kmer,
                'count': result.count,
                'canonical': result.canonical
            })

    return results
```
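
For large FASTA files, building the full k-mer list in memory can be avoided with a generator. The sketch below is a pure-Python variant of the extraction loop above; `iter_kmers` is a hypothetical helper, and unlike the original it skips any k-mer containing a character outside A, C, G, T, not only `N`.

```python
from typing import Iterator


def iter_kmers(sequence: str, k: int = 31) -> Iterator[str]:
    """Yield each k-mer of `sequence` that contains only A, C, G, T."""
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


print(list(iter_kmers("acgtnacg", k=3)))  # ['ACG', 'CGT', 'ACG']
```

Combined with `itertools.islice(iter_kmers(seq), 1000)`, this gives the same "first 1000 k-mers" limit as the demo without materializing the rest.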

## Type Hints Reference

```python
from typing import Optional, List, Dict, Any, Union, Iterator
from pathlib import Path

# Database class type signatures
def __init__(self,
             path: Union[str, Path],
             validate: bool = False) -> None: ...

def query(self,
          kmer: str,
          validate_strict: bool = True) -> QueryResult: ...

def fuzzy_query(self,
                kmer: str,
                mutations: int = 1,
                max_variants: Optional[int] = None,
                output_format: str = 'auto',
                position_mutations: Optional[str] = None) -> FuzzyQueryResult: ...

def fuzzy_query_batch(self,
                      kmers: List[str],
                      mutations: int = 1,
                      max_variants: Optional[int] = None,
                      max_workers: int = 4,
                      output_format: str = 'auto',
                      position_mutations: Optional[str] = None) -> FuzzyBatchResult: ...

def stats(self) -> DatabaseStats: ...

def dump(self,
         limit: Optional[int] = None,
         canonical_only: bool = False,
         output_format: str = 'auto') -> Iterator[DumpResult]: ...

# Result classes
class QueryResult:
    kmer: str
    count: int
    canonical: str

class DatabaseStats:
    kmer_size: int
    unique_kmers: int
    total_counts: int
    min_count: int
    max_count: int
    file_size: int
    format_version: str
```

## Best Practices

1. **Always use context managers** (`with` statements) for Database objects
2. **Choose validate=False** for Database initialization unless validation is needed
3. **Use position mutations** in fuzzy queries to limit search space
4. **Batch queries** when possible to reduce overhead
5. **Handle specific exceptions** rather than generic catching
6. **Use max_variants** to prevent combinatorial explosion in fuzzy queries
7. **Close databases** manually when not using context managers

## Migration from CLI

If you're migrating from the CLI, here's a quick reference:

| CLI Command | Python Equivalent |
|-------------|-------------------|
| `rustkmer query database.rkdb ATCG` | `PyDatabase("database.rkdb").query_exact("ATCG")` |
| `rustkmer fuzzy-query database.rkdb ATCG --mutations 2` | `PyFuzzyQuery(PyDatabase("database.rkdb")).query_fuzzy("ATCG", mutations=2)` |
| `rustkmer stats database.rkdb` | `PyDatabase("database.rkdb").get_stats()` |
| `rustkmer dump database.rkdb --limit 1000` | `PyDatabase("database.rkdb").dump(limit=1000)` |
| `rustkmer fuzzy-query database.rkdb ATCG --position-mutations "10,15:2"` | `PyFuzzyQuery(PyDatabase("database.rkdb")).query_fuzzy("ATCG", position_mutations="10,15:2")` |

For detailed migration guides, see the [User Guide](../user-guide/).

## Version History

### Current Version (0.1.0)
- Complete database query API
- Fuzzy query with position-mutation support
- Comprehensive exception hierarchy
- Batch processing capabilities
- High-performance PyO3 native implementation

### Key Features Added
- Position-specific mutation constraints
- Parallel batch fuzzy queries
- Enhanced error handling with context
- Google-style docstring documentation
- Type hints throughout the API