# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
**Rype** is a high-performance genomic sequence classification library using minimizer-based k-mer sketching in RY (purine/pyrimidine) space. It's written in Rust and provides both a Rust library, CLI tool, and C API for FFI integration.
All indices are stored in Parquet format (`.ryxdi` directories).
## Build and Development Commands
### Building
```bash
# Build the project
cargo build --release
# Build for development with debug symbols
cargo build
```
### Testing
```bash
# Run all tests
cargo test
# Run tests with output
cargo test -- --nocapture
# Run specific test
cargo test test_name
```
### C API Development
```bash
# Build C example
gcc example.c -L target/debug -lrype -o c_example
# Set library path and run
LD_LIBRARY_PATH=target/debug ./c_example
```
### Python Examples
```bash
# ctypes extraction example (no extra dependencies)
python3 examples/ctypes_extraction_example.py
# PyArrow extraction example (requires pyarrow via conda env)
conda run -n rype-pyarrow python3 examples/pyarrow_extraction_example.py
```
### CLI Usage
```bash
# Create an index from reference sequences
cargo run --release -- index create -o index.ryxdi -r ref1.fasta -r ref2.fasta -k 64 -w 50
# Create index with one bucket per sequence
cargo run --release -- index create -o genes.ryxdi -r genes.fasta --separate-buckets
# Show index statistics
cargo run --release -- index stats -i index.ryxdi
# Show source details for a bucket
cargo run --release -- index bucket-source-detail -i index.ryxdi -b 1
# Build index from a TOML configuration file
cargo run --release -- index from-config -c config.toml
# Merge two indices into one
cargo run --release -- index merge --index-primary idx1.ryxdi --index-secondary idx2.ryxdi -o merged.ryxdi
# Merge with subtraction (remove secondary minimizers that exist in primary)
# Useful for creating non-host indices
cargo run --release -- index merge --index-primary host.ryxdi --index-secondary sample.ryxdi \
-o non_host.ryxdi --subtract-from-primary
# Classify sequences (single-end)
cargo run --release -- classify run -i index.ryxdi -1 reads.fastq -t 0.1
# Classify sequences (paired-end)
cargo run --release -- classify run -i index.ryxdi -1 reads_R1.fastq -2 reads_R2.fastq -t 0.1
# Classify with negative filtering (host depletion)
cargo run --release -- classify run -i target.ryxdi -N host.ryxdi -1 reads.fastq -t 0.1
# Aggregate classification (for higher sensitivity)
cargo run --release -- classify aggregate -i index.ryxdi -1 reads.fastq -t 0.05
# Best-hit-only classification
cargo run --release -- classify run -i index.ryxdi -1 reads.fastq --best-hit
# Classify with sequence trimming (use first N bases only)
# Useful when read starts are more reliable than ends
cargo run --release -- classify run -i index.ryxdi -1 reads.fastq -t 0.1 --trim-to 100
```
## Architecture Overview
### RY Encoding (Core Concept)
The library uses a reduced 2-bit alphabet that collapses purines and pyrimidines:
- **Purines** (A/G) → 1
- **Pyrimidines** (T/C) → 0
- **Other bases** (N, ambiguous) → invalid (resets k-mer extraction)
This enables purine/pyrimidine-aware matching where AG-purine mutations don't break matches, and allows 64bp k-mers to fit in a single u64.
### Minimizer Sketching Algorithm
The library reduces sequence representation using minimizers:
1. Sliding window of size `w` over k-mers
2. Select minimum hash value within each window as representative
3. Deduplicate consecutive identical minimizers
Implementation uses monotonic deque for O(n) time complexity (see `extract_into()` in src/core.rs).
### Key Data Structures
**InvertedIndex** (src/indices/):
- Minimizer → bucket ID mappings for fast classification
- Loaded from Parquet shards on-demand
**ShardedInvertedIndex** (src/indices/):
- Memory-efficient sharded inverted index
- Holds manifest; shards loaded on-demand during classification
**MinimizerWorkspace** (src/core.rs):
- Reusable workspace to avoid allocations in hot loops
- Contains deques for forward/reverse-complement k-mer tracking
- `buffer: Vec<u64>` - Output minimizers
**HitResult** (src/types.rs):
- Classification result: query_id, bucket_id, score
**BucketData** (src/indices/parquet/):
- Used during index creation: bucket_id, bucket_name, sources, minimizers
### Constants Module (src/constants.rs)
Centralized constants for consistency and maintainability:
**Safety Limits:**
- `MAX_INVERTED_MINIMIZERS` - Maximum minimizers in inverted index (1 trillion)
- `MAX_INVERTED_BUCKET_IDS` - Maximum bucket ID entries (4 billion)
- `MAX_SEQUENCE_LENGTH` - Max sequence size for C API (2GB)
- `MAX_READS` - Bit-packing limit (2^31 - 1)
**Performance Tuning:**
- `GALLOP_THRESHOLD` = 16 - Merge-join vs galloping switch point
- `QUERY_HASHSET_THRESHOLD` = 1000 - Linear vs HashSet lookup
- `PARQUET_BATCH_SIZE`, `DEFAULT_ROW_GROUP_SIZE` - Parquet I/O sizing
**Delimiters:**
- `BUCKET_SOURCE_DELIM` = "::" - Separates filename from sequence name in bucket sources
### Core Algorithms
**Minimizer Extraction** (src/core.rs):
- `extract_into()` - Single-strand minimizer extraction
- `extract_dual_strand_into()` - Forward + reverse-complement extraction
- `get_paired_minimizers_into()` - Paired-end read handling
**Classification** (src/classify.rs):
- `classify_batch_sharded_merge_join()` - Default classification using merge-join
- `classify_batch_sharded_parallel_rg()` - Classification with parallel row group processing
- `classify_with_sharded_negative()` - Classification with negative filtering
**Index Building** (src/indices/parquet/):
- `create_parquet_inverted_index()` - Create Parquet index from BucketData
### C API (src/c_api.rs)
FFI layer exposing core functionality to C:
**Thread Safety**:
- Index loading/freeing: NOT thread-safe
- Classification (`rype_classify`): Thread-safe (multiple threads can use same Index)
- Results: NOT thread-safe (each thread needs own RypeResultArray)
**Key Functions**:
- `rype_index_load(path)` - Load index from disk
- `rype_classify(index, queries, num_queries, threshold)` - Batch classify
- `rype_results_free(results)` - Free result array
- `rype_get_last_error()` - Get thread-local error message
**Safety**:
- Input validation for all C pointers and sizes
- Thread-local error reporting
- MAX_SEQUENCE_LENGTH limit (2GB)
- Panic catching in `rype_classify`
### CLI (src/main.rs)
Nested subcommands using clap:
**`rype index`** - Index operations:
- `create` - Build Parquet index from FASTA/FASTQ
- `stats` - Show index statistics
- `bucket-source-detail` - Show source details for a specific bucket
- `bucket-add` - Add sequences to existing index as new bucket (development pending)
- `from-config` - Build index from TOML configuration file
- `bucket-add-config` - Add files using TOML config (development pending)
- `merge` - Merge two indices into one (with optional subtraction)
- `summarize` - Show detailed minimizer statistics
**`rype classify`** - Classification operations:
- `run` - Per-read classification
- `aggregate` - Aggregated classification for paired-end (alias: `agg`)
**`rype inspect`** - Debugging operations:
- `matches` - Show matching minimizers between queries and buckets (not supported with Parquet)
## Important Constants
- `K ∈ {16, 32, 64}` - K-mer size (configurable per-index, always uses u64 representation)
- `MAX_SEQUENCE_LENGTH = 2_000_000_000` - Max sequence size for C API
## Critical Implementation Details
### K-mer Encoding
The `base_to_bit()` function uses unsafe lookup table for performance. Invalid bases return `u64::MAX` which triggers window reset.
### Canonical K-mers
K-mers and their reverse complements are treated as equivalent. Reverse complement calculated via bitwise NOT: `!kmer` in RY-space.
### Parallel Processing
Uses `rayon` for data parallelism:
- Classification parallelizes minimizer extraction AND bucket scoring
- `map_init()` pattern provides per-thread workspace to avoid allocations
### Parquet Index Format
All indices use Parquet format stored as `.ryxdi` directories:
```
index.ryxdi/
├── manifest.toml # TOML metadata (k, w, salt, bucket info)
├── buckets.parquet # (bucket_id, bucket_name, sources)
└── inverted/
├── shard.0.parquet # (minimizer: u64, bucket_id: u32) sorted pairs
└── ... # Additional shards for large indices
```
**Manifest Format (TOML)**:
```toml
magic = "RYPE_PARQUET_V1"
format_version = 1
k = 64
w = 50
salt = "0x5555555555555555" # Hex string for large values
source_hash = "0xDEADBEEF"
num_buckets = 10
total_minimizers = 1000000
[inverted]
num_shards = 2
total_entries = 5000000
has_overlapping_shards = true # Buckets may share minimizers across shards
```
**Benefits**:
- Parquet provides efficient columnar storage with DELTA_BINARY_PACKED encoding
- Streaming k-way merge enables building large indices with bounded memory
- Human-readable TOML manifest for easy inspection
- Shards loaded on-demand during classification
**Memory Benefits**:
- Manifest loads instantly (no minimizer data)
- Classification loads one shard at a time via `classify_batch_sharded_merge_join`
- Memory usage: O(batch_size × minimizers_per_read) + O(single_shard_size)
- Enables classification when total index exceeds available RAM
### Error Handling
- Rust API: Uses `anyhow::Result<T>` for all fallible operations
- C API: Returns NULL on error, call `rype_get_last_error()` for details
- Safe loading: Validates format, enforces size limits
## Memory Management Notes
### Rust Side
- Workspace reuse pattern minimizes allocations (pass `&mut MinimizerWorkspace`)
- Shards loaded on-demand, not all at once
### C API Side
- Index ownership transferred via `Box::into_raw()` / `Box::from_raw()`
- Results allocated in Rust, freed by caller with `rype_results_free()`
- Never free Index while classification is in progress (use-after-free)
- Never double-free RypeResultArray (undefined behavior)
## Testing Strategy
Existing tests cover:
- Minimizer extraction correctness
- Index creation and loading
- Classification accuracy
- C API validation logic
When adding features:
1. Add unit tests for core logic
2. Add error path tests
3. Test with C example if touching FFI
4. Consider edge cases: empty sequences, N-bases, very long sequences
## Performance Considerations
- **Hot path**: `extract_into()`, classification functions - avoid allocations
- **Deque capacity**: Pre-sized to avoid reallocation during sliding window
- **Parallelism**: Batch processing amortizes thread pool overhead
- **Inverted index**: Reduces per-bucket work from O(queries × minimizers) to O(unique_minimizers)
- **Row group filtering**: Bloom filters can reduce I/O by rejecting row groups early
## Common Pitfalls
1. **K-mer size**: K must be 16, 32, or 64. K is set at index creation and stored in the index.
2. **C API thread safety**: Don't share RypeResultArray across threads
3. **Index compatibility**: Indices with different k, w, or salt cannot be used together for negative filtering
4. **Short sequences**: Sequences < K bases produce no minimizers
## Performance Test Data (Local Only)
The `perf-assessment/` and `perf-data/` directories contain real-world test data for benchmarking (not checked into git):
- **Genomes**: `perf-data/wol2-genomes/` — ~16,000 compressed FASTA genomes (WoL2 database)
- **Query files** (symlinks in `perf-assessment/query-files/`):
- `short_read_R1.fastq.gz` / `short_read_R2.fastq.gz` — paired-end short reads (~108MB/113MB)
- `long_read.fastq.gz` — long reads (~2.2GB)
- `short_read.parquet` / `long_read.parquet` — Parquet-converted versions
- **Pre-built indices**:
- `perf-assessment/parquet-index/n100-w200.ryxdi/` — 160-bucket index (k=64, w=200, 8 shards, ~486M minimizers)
- `perf-assessment/config/numerator-w200.ryxdi/` — single-bucket index (8000 genomes from buckets 1-80, 267M minimizers)
- `perf-assessment/config/denominator-w200.ryxdi/` — single-bucket index (7952 genomes from buckets 81-160, 216M minimizers)
- **Index configs**: `perf-assessment/config/n100-w200.toml` etc. — TOML configs for rebuilding indices
Tests using this data should be `#[ignore]` since it's local-only.
### Building Single-Bucket Indices for Log-Ratio Testing
Log-ratio mode (`classify log-ratio`) requires single-bucket indices (exactly 1 bucket each). The multi-bucket `n100-w200.ryxdi` cannot be used. To build single-bucket indices:
1. **Config format** — each config has `[index]` section and one `[buckets.<name>]` section with a `files` array:
```toml
[index]
window = 200
salt = 6148914691236517205
output = "numerator-w200.ryxdi"
[buckets.numerator]
files = [ "../../perf-data/wol2-genomes/G001873845.fasta.gz", ... ]
```
2. **Build command** — output path is relative to the config file location:
```bash
target/release/rype index from-config -c perf-assessment/config/numerator-w200.toml
```
3. **Verify** — must show exactly 1 bucket:
```bash
target/release/rype index stats -i perf-assessment/config/numerator-w200.ryxdi
```
### Running Performance Tests
**Always measure both time and space.** Use `/usr/bin/time -v` to capture peak RSS (resident set size) alongside `--timing` for per-phase breakdowns. Record both in plan docs.
```bash
# Log-ratio with minimum-length filter (the OOM-prone scenario):
/usr/bin/time -v target/release/rype classify log-ratio \
-n perf-assessment/config/numerator-w200.ryxdi \
-d perf-assessment/config/denominator-w200.ryxdi \
-1 perf-assessment/query-files/long_read.parquet \
--minimum-length 100 --max-memory 4G --timing \
-o scratch/log-ratio-test.tsv
# Standard classification with timing (for build_query_index benchmarking):
/usr/bin/time -v target/release/rype classify run \
-i perf-assessment/parquet-index/n100-w200.ryxdi \
-1 perf-assessment/query-files/long_read.parquet \
-t 0.01 --timing \
-o scratch/classify-test.tsv
```
**Important**: Always use `target/release/rype` (absolute or relative path to the binary) rather than `cargo run --release --bin rype --` which can insert empty string arguments.
**CRITICAL**: NEVER run multiple performance tests in parallel. These benchmarks are I/O-bound (shard loading dominates), so concurrent tests produce misleading timings due to disk contention. Always run performance tests sequentially — one at a time.
## Development Environment Notes
- **Temporary files**: Do NOT use `/tmp` - it has insufficient space on this system. Use `scratch/` directory within the project for temporary files and test data. This directory is gitignored.