rype 1.0.0-rc.1

High-performance genomic sequence classification using minimizer-based k-mer sketching in RY space
Documentation
# Rype

[![CI](https://github.com/the-miint/rype/actions/workflows/ci.yml/badge.svg)](https://github.com/the-miint/rype/actions/workflows/ci.yml)
[![Crates.io](https://img.shields.io/crates/v/rype.svg)](https://crates.io/crates/rype)
[![Documentation](https://docs.rs/rype/badge.svg)](https://docs.rs/rype)
[![License](https://img.shields.io/badge/license-BSD--3--Clause-blue.svg)](LICENSE)

High-performance genomic sequence classification using minimizer-based k-mer sketching in RY (purine/pyrimidine) space.

## Overview

Rype is a Rust library and CLI tool for fast sequence classification. It uses a reduced 2-bit alphabet that collapses purines (A/G) and pyrimidines (T/C), enabling:

- **Mutation tolerance**: Purine-purine and pyrimidine-pyrimidine mutations don't break k-mer matches
- **Compact representation**: 64bp k-mers fit in a single `u64`
- **High performance**: Minimizer sketching with O(n) extraction using monotonic deques

## Installation

### From source

```bash
cargo install --path .
```

### Building

```bash
# Release build
cargo build --release

# Development build
cargo build
```

## Quick Start

### Create an index

```bash
# Create index from reference sequences
rype index create -o index.ryxdi -r ref1.fasta -r ref2.fasta -k 64 -w 50

# Create index with one bucket per sequence
rype index create -o genes.ryxdi -r genes.fasta --separate-buckets

# Build from a TOML configuration file
rype index from-config -c config.toml

# Merge two indices
rype index merge --index-primary idx1.ryxdi --index-secondary idx2.ryxdi -o merged.ryxdi
```

### Classify sequences

```bash
# Single-end reads
rype classify run -i index.ryxdi -1 reads.fastq -t 0.1

# Paired-end reads
rype classify run -i index.ryxdi -1 reads_R1.fastq -2 reads_R2.fastq -t 0.1

# Best-hit-only classification
rype classify run -i index.ryxdi -1 reads.fastq --best-hit

# Trim reads to first N bases before classification
rype classify run -i index.ryxdi -1 reads.fastq -t 0.1 --trim-to 100

# Host depletion with a negative index
rype classify run -i target.ryxdi -N host.ryxdi -1 reads.fastq -t 0.1

# Sample-level aggregated classification
rype classify aggregate -i index.ryxdi -1 reads.fastq -t 0.05

# Log-ratio scoring with two single-bucket indices
rype classify log-ratio -n numerator.ryxdi -d denominator.ryxdi -1 reads.fastq
```

### Index management

```bash
# Show index statistics
rype index stats -i index.ryxdi
```

## CLI Reference

### `rype index`

| Subcommand | Description |
|------------|-------------|
| `create` | Build index from FASTA/FASTQ files |
| `from-config` | Build index from TOML configuration file |
| `merge` | Merge two indices (with optional subtraction) |
| `stats` | Show index statistics |
| `bucket-source-detail` | Show source details for a specific bucket |
| `summarize` | Show detailed minimizer statistics (legacy non-Parquet indices only) |

### `rype classify`

| Subcommand | Description |
|------------|-------------|
| `run` | Per-read classification |
| `aggregate` | Sample-level aggregated classification (alias: `agg`) |
| `log-ratio` | Log-ratio scoring using two single-bucket indices |

## Configuration File

Build complex indices using a TOML configuration file:

```toml
[index]
output = "output.ryxdi"
window = 50
salt = 0x5555555555555555    # XOR salt for hashing
# k = 64                    # K-mer size (default: 64)

[buckets.species_a]
files = ["ref_a1.fasta", "ref_a2.fasta"]

[buckets.species_b]
files = ["ref_b.fasta"]
```

## Library Usage

```rust
use needletail::parse_fastx_file;
use rype::{
    classify_batch_sharded_merge_join, extract_into, BucketData, MinimizerWorkspace,
    ParquetWriteOptions, ShardedInvertedIndex,
};

fn main() -> anyhow::Result<()> {
    let mut workspace = MinimizerWorkspace::new();
    let k = 32;
    let w = 10;
    let salt = 0u64;

    // Read phiX174 genome and extract minimizers
    let mut phix_mins = Vec::new();
    let mut reader = parse_fastx_file("examples/phiX174.fasta")?;
    while let Some(record) = reader.next() {
        let record = record?;
        extract_into(&record.seq(), k, w, salt, &mut workspace);
        phix_mins.extend(workspace.buffer.drain(..));
    }
    phix_mins.sort();
    phix_mins.dedup();

    // Read pUC19 plasmid and extract minimizers
    let mut puc19_mins = Vec::new();
    let mut reader = parse_fastx_file("examples/pUC19.fasta")?;
    while let Some(record) = reader.next() {
        let record = record?;
        extract_into(&record.seq(), k, w, salt, &mut workspace);
        puc19_mins.extend(workspace.buffer.drain(..));
    }
    puc19_mins.sort();
    puc19_mins.dedup();

    // Build bucket data
    let buckets = vec![
        BucketData {
            bucket_id: 1,
            bucket_name: "phiX174".to_string(),
            sources: vec!["phiX174.fasta".to_string()],
            minimizers: phix_mins,
        },
        BucketData {
            bucket_id: 2,
            bucket_name: "pUC19".to_string(),
            sources: vec!["pUC19.fasta".to_string()],
            minimizers: puc19_mins,
        },
    ];

    // Create index in temporary directory
    let temp_dir = tempfile::tempdir()?;
    let index_path = temp_dir.path().join("example.ryxdi");
    let options = ParquetWriteOptions::default();
    rype::create_parquet_inverted_index(&index_path, buckets, k, w, salt, None, Some(&options), None)?;

    // Load and classify
    let index = ShardedInvertedIndex::open(&index_path)?;

    // Create query from first 100bp of phiX174 - should match bucket 1
    let query_seq = b"GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTT";
    let queries = vec![(1_i64, query_seq.as_slice(), None)];

    let results = classify_batch_sharded_merge_join(&index, None, &queries, 0.1, None)?;
    for hit in &results {
        let name = index.manifest().bucket_names.get(&hit.bucket_id).unwrap();
        println!("Query {} -> {} (score: {:.2})", hit.query_id, name, hit.score);
    }

    Ok(())
}
```

## C API

Rype provides a C API for FFI integration:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "rype.h"

int main(int argc, char** argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <index.ryxdi>\n", argv[0]);
        return 1;
    }

    // Load index
    RypeIndex* index = rype_index_load(argv[1]);
    if (!index) {
        fprintf(stderr, "Failed to load index: %s\n", rype_get_last_error());
        return 1;
    }

    // Create a query (70bp from phiX174)
    const char* seq = "GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT";
    RypeQuery queries[1];
    queries[0].seq = (const uint8_t*)seq;
    queries[0].seq_len = strlen(seq);
    queries[0].pair_seq = NULL;
    queries[0].pair_len = 0;

    // Classify sequences
    RypeResultArray* results = rype_classify(index, queries, 1, 0.1);
    if (!results) {
        fprintf(stderr, "Classification failed: %s\n", rype_get_last_error());
        rype_index_free(index);
        return 1;
    }

    // Process results
    printf("Found %zu hits:\n", results->len);
    for (size_t i = 0; i < results->len; i++) {
        const char* name = rype_bucket_name(index, results->data[i].bucket_id);
        printf("  Query %ld -> %s (bucket %u, score: %.2f)\n",
               results->data[i].query_id,
               name ? name : "unknown",
               results->data[i].bucket_id,
               results->data[i].score);
    }

    // Cleanup
    rype_results_free(results);
    rype_index_free(index);
    return 0;
}
```

Build and run:
```bash
gcc example.c -L target/release -lrype -o example
LD_LIBRARY_PATH=target/release ./example index.ryxdi
```

### Thread Safety

- Index loading/freeing: NOT thread-safe
- Classification (`rype_classify`): Thread-safe (multiple threads can share the same index)
- Results: NOT thread-safe (each thread needs its own `RypeResultArray`)

## Algorithm Details

### RY Encoding

The library uses a reduced 2-bit alphabet:

| Base | Category | Encoding |
|------|----------|----------|
| A, G | Purine | 1 |
| T, C | Pyrimidine | 0 |
| N, etc. | Invalid | Resets k-mer |

### Minimizer Extraction

1. Slide a window of size `w` over k-mers
2. Select the minimum hash value within each window
3. Deduplicate consecutive identical minimizers

The implementation uses a monotonic deque for O(n) time complexity.

### Supported K-mer Sizes

- K = 16, 32, or 64 (stored as `u64`)

## Index Format

Rype uses a Parquet-based inverted index format stored as a directory:

```text
index.ryxdi/
├── manifest.toml         # Index metadata (k, w, salt, bucket info)
├── buckets.parquet       # Bucket metadata (id, name, sources)
└── inverted/
    ├── shard.0.parquet   # Inverted index shard (minimizer -> bucket_id pairs)
    └── ...               # Additional shards for large indices
```

This format provides:
- Efficient columnar storage with compression
- Memory-efficient streaming classification
- Support for indices larger than available RAM

## Performance Tips

- Use `--release` builds for production
- Batch processing amortizes thread pool overhead
- Parquet format enables memory-efficient classification for large indices

## Testing

```bash
# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run specific test
cargo test test_name
```

## Development Setup

```bash
# Enable pre-commit hooks (runs fmt, clippy, tests, doc checks)
git config core.hooksPath .githooks

# Manually run checks
cargo fmt --check
cargo clippy --all-features -- -D warnings
cargo test --all-features
RUSTDOCFLAGS="-D warnings" cargo doc --no-deps --all-features
```

The test suite includes automated verification that README examples compile and run correctly (`readme_example_test.rs` and `cli_integration_tests.rs::test_readme_bash_examples`).

## License

Licensed under the [BSD 3-Clause License](LICENSE).

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.