Rype: High-performance genomic sequence classification using minimizer-based k-mer sketching.
This library provides efficient classification of DNA sequences against reference indices using RY-space (purine/pyrimidine) encoding and minimizer sketching.
§Core Concepts
- RY Encoding: Reduces the 4-base DNA alphabet to 2 bits (purines → 1, pyrimidines → 0)
- Minimizers: Representative k-mers selected from sliding windows for efficient sketching
- Inverted Index: Enables O(Q log U) lookups instead of O(B × Q × log M)
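The first two concepts can be illustrated with a small standalone sketch. It does not use rype's API: `ry_bit`, `ry_kmers`, and `minimizers` are hypothetical helper names, and the choices to reject non-ACGT bases and to pick the numerically smallest k-mer in each window are assumptions made for brevity (production minimizer schemes often hash k-mers and track positions).

```rust
/// Hypothetical helper: map a base to its RY bit (purine -> 1, pyrimidine -> 0).
fn ry_bit(base: u8) -> Option<u64> {
    match base.to_ascii_uppercase() {
        b'A' | b'G' => Some(1), // purines
        b'C' | b'T' => Some(0), // pyrimidines
        _ => None,              // ambiguous or invalid base (assumption: reject)
    }
}

/// Encode every k-mer of `seq` as a k-bit integer in RY space.
/// Returns None if any base is not A/C/G/T.
fn ry_kmers(seq: &[u8], k: usize) -> Option<Vec<u64>> {
    assert!(k >= 1 && k <= 64, "k must fit in a u64");
    let mask = if k == 64 { u64::MAX } else { (1u64 << k) - 1 };
    let mut out = Vec::new();
    let mut cur = 0u64;
    for (i, &b) in seq.iter().enumerate() {
        // Shift in one bit per base and keep only the last k bits.
        cur = ((cur << 1) | ry_bit(b)?) & mask;
        if i + 1 >= k {
            out.push(cur);
        }
    }
    Some(out)
}

/// Pick the smallest k-mer value in each window of `w` consecutive k-mers,
/// skipping a pick when it equals the previously emitted value.
fn minimizers(kmers: &[u64], w: usize) -> Vec<u64> {
    assert!(w >= 1, "window size must be positive");
    let mut out: Vec<u64> = Vec::new();
    for window in kmers.windows(w) {
        let m = *window.iter().min().unwrap(); // window is never empty
        if out.last() != Some(&m) {
            out.push(m);
        }
    }
    out
}
```

For the sequence `ACGT` with k = 2, the RY bits are 1, 0, 1, 0, so the three k-mers encode to 2, 1, 2, and a window of two k-mers selects the single minimizer 1.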
§Main Types
- InvertedIndex: Minimizer → bucket mappings for fast classification
- ShardedInvertedIndex: Memory-efficient sharded inverted index
- MinimizerWorkspace: Reusable workspace for minimizer extraction
§Classification Functions
- classify_batch_sharded_merge_join: Classify with sharded inverted index using merge-join (default)
- classify_batch_sharded_parallel_rg: Classify with parallel row group processing
Modules§
- c_api: C API for rype - FFI bindings for external language integration.
- config
- memory: Memory utilities for adaptive batch sizing.
- parquet_index: Parquet-based format implementation for inverted index.
Structs§
- BucketData: Bucket data with minimizers for building inverted index.
- BucketFileStats: Per-bucket statistics about total sequence lengths of source files.
- BucketMetadata: Bucket metadata for a single bucket.
- FirstErrorCapture: Thread-safe error capture that stores only the first error.
- HitResult: Query ID, Bucket ID, Score
- IndexMetadata: Lightweight metadata-only view of an Index (without minimizer data)
- InvertedIndex: CSR-format inverted index for fast minimizer → bucket lookups.
- InvertedManifest: Manifest section for inverted index shards.
- InvertedShardInfo: Per-shard metadata for inverted index.
- LogRatioResult: Result of log-ratio computation for a single query.
- MergeOptions: Options for index merging.
- MergeStats: Statistics from a merge operation.
- MinimizerWorkspace: Workspace for minimizer extraction algorithms.
- ParquetManifest: Manifest containing index metadata.
- ParquetReadOptions: Configuration options for reading Parquet inverted index files.
- ParquetWriteOptions: Configuration options for Parquet file writing.
- PartitionResult: Result of partitioning reads by numerator score into fast-path and needs-denominator groups.
- QueryInvertedIndex: Query inverted index for merge-join classification. Stores sorted COO (coordinate) entries: (minimizer, packed_read_id).
- ShardInfo: Information about a single shard in a sharded inverted index.
- ShardManifest: Manifest describing a sharded inverted index.
- ShardedInvertedIndex: Handle for a sharded inverted index.
- StrandMinimizers: Structure-of-arrays for minimizers on a single strand.
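To make the "CSR-format inverted index" idea concrete, here is a minimal standalone sketch of a compressed-sparse-row layout for minimizer → bucket lookups. It is not rype's `InvertedIndex`: the `CsrIndex` type, its fields, and its methods are hypothetical, and the query is assumed to be already deduplicated. Under this layout each query minimizer costs one binary search over the unique minimizers, which matches an O(Q log U) bound if Q is read as the number of query minimizers and U as the number of unique indexed minimizers (an interpretation, not something the documentation states).

```rust
use std::collections::BTreeMap;

/// Hypothetical CSR-style inverted index.
struct CsrIndex {
    minimizers: Vec<u64>, // sorted, unique minimizers (length U)
    offsets: Vec<usize>,  // row boundaries into `bucket_ids` (length U + 1)
    bucket_ids: Vec<u32>, // concatenated bucket lists, one run per minimizer
}

impl CsrIndex {
    /// Build from (bucket ID, minimizers) pairs.
    fn build(buckets: &[(u32, Vec<u64>)]) -> Self {
        let mut pairs: Vec<(u64, u32)> = buckets
            .iter()
            .flat_map(|(id, mins)| mins.iter().map(move |&m| (m, *id)))
            .collect();
        pairs.sort_unstable();
        pairs.dedup();

        let mut minimizers: Vec<u64> = Vec::new();
        let mut offsets: Vec<usize> = Vec::new();
        let mut bucket_ids: Vec<u32> = Vec::with_capacity(pairs.len());
        for (m, id) in pairs {
            if minimizers.last() != Some(&m) {
                // A new minimizer starts a new row.
                minimizers.push(m);
                offsets.push(bucket_ids.len());
            }
            bucket_ids.push(id);
        }
        offsets.push(bucket_ids.len()); // closing boundary
        CsrIndex { minimizers, offsets, bucket_ids }
    }

    /// Buckets containing `m`, found by binary search.
    fn lookup(&self, m: u64) -> &[u32] {
        match self.minimizers.binary_search(&m) {
            Ok(i) => &self.bucket_ids[self.offsets[i]..self.offsets[i + 1]],
            Err(_) => &[],
        }
    }

    /// Count, per bucket, how many query minimizers it contains.
    fn hits_per_bucket(&self, query: &[u64]) -> BTreeMap<u32, usize> {
        let mut counts = BTreeMap::new();
        for &m in query {
            for &bucket in self.lookup(m) {
                *counts.entry(bucket).or_insert(0) += 1;
            }
        }
        counts
    }
}
```

Building from bucket 1 with minimizers {5, 9} and bucket 2 with {9, 12} yields the rows 5 → [1], 9 → [1, 2], 12 → [2], stored as one flat `bucket_ids` array plus offsets.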
Enums§
- FastPath: Indicates whether a log-ratio result was determined via a fast path (skipping the denominator classification) or computed exactly.
- Orientation: Orientation of a sequence relative to the bucket baseline.
- ParquetCompression: Compression codec for Parquet files.
- RypeError: Unified error type for the rype library.
- Strand: Strand indicator for minimizer origin.
Constants§
- BUCKET_SOURCE_DELIM: Delimiter between filename and sequence name in bucket sources. Format: path/to/file.fa::sequence_name
- ORIENTATION_FIRST_N: Default number of minimizers to check for orientation. Checking the first N sorted minimizers provides a sample for orientation decisions without iterating the full bucket.
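As an illustration of the bucket source format, the sketch below splits a source string into its filename and sequence name. The delimiter value `"::"` is inferred from the documented format rather than read from the crate, `split_bucket_source` is a hypothetical helper, and splitting at the first occurrence is an assumption about how paths containing the delimiter should be treated.

```rust
/// Assumed delimiter, inferred from the format `path/to/file.fa::sequence_name`.
/// Check the crate's BUCKET_SOURCE_DELIM constant for the real value.
const SOURCE_DELIM: &str = "::";

/// Hypothetical helper: split a bucket source into (filename, sequence name).
/// Returns None when the delimiter is absent.
fn split_bucket_source(source: &str) -> Option<(&str, &str)> {
    source.split_once(SOURCE_DELIM)
}
```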
Statics§
- ENABLE_TIMING: Controls whether timing diagnostics are printed to stderr.
Functions§
- base_to_bit: Convert a nucleotide base to its RY-space bit representation.
- choose_orientation: Choose orientation with higher overlap. Ties favor Forward.
- choose_orientation_sampled: Choose orientation by checking first N minimizers of each strand.
- classify_batch_sharded_merge_join: Classify a batch of records against a sharded inverted index using merge-join.
- classify_batch_sharded_parallel_rg: Classify using parallel row group processing.
- classify_from_extracted_minimizers: Classify from pre-extracted minimizers against a sharded inverted index using merge-join.
- classify_from_extracted_minimizers_parallel_rg: Classify from pre-extracted minimizers using parallel row group processing.
- classify_from_query_index: Classify using a pre-built QueryInvertedIndex against a sharded inverted index.
- classify_from_query_index_parallel_rg: Classify using a pre-built QueryInvertedIndex with parallel row group processing.
- classify_log_ratio_batch: Classify a batch of reads using log-ratio (numerator vs denominator).
- classify_with_sharded_negative: Classify with memory-efficient negative filtering using sharded index.
- compute_log_ratio: Compute log10(numerator / denominator) with special handling for edge cases.
- compute_source_hash: Compute a hash from bucket minimizer counts for validation.
- count_hits: Count how many minimizers from a query match a bucket (binary search).
- create_parquet_inverted_index: Create a Parquet inverted index directly from bucket data.
- extract_batch_minimizers: Extract minimizers from a batch of query records in parallel.
- extract_dual_strand_into: Extract minimizers from both strands of a sequence.
- extract_into: Extract minimizers from a sequence (single strand).
- extract_minimizer_set: Extract sorted, deduplicated minimizer sets per strand.
- extract_strand_minimizers: Extract ordered minimizers with positions per strand (SoA layout).
- filter_best_hits: Filter a list of HitResults to keep only the best hit per query.
- get_paired_minimizers_into: Extract minimizers from paired-end reads.
- is_parquet_index: Check if a path is a Parquet-based index directory.
- kway_merge_dedup: Merge multiple sorted, deduplicated vectors into one sorted, deduplicated vector.
- load_all_minimizers: Load all unique minimizers from a sharded index into a HashSet.
- log_timing: Print timing info to stderr if ENABLE_TIMING is enabled.
- merge_indices_streaming: Memory-bounded merge with streaming per-shard subtraction.
- merge_sorted_into: Merge sorted source into sorted target in-place, deduplicating.
- partition_by_numerator_score: Partition reads by numerator classification results into fast-path and needs-denominator groups.
- validate_compatible_indices: Validate that two indices are compatible for log-ratio computation.
- validate_log_ratio_indices: Validate that two sharded indices are compatible for log-ratio classification.
- validate_single_bucket_index: Validate that the index has exactly one bucket and return its ID and name.
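Two of the utilities listed above, merging sorted deduplicated minimizer lists and keeping the best hit per query, follow well-known patterns that a short standalone sketch can illustrate. The code below does not reproduce rype's signatures: `merge_sorted_dedup`, `Hit`, and `best_hit_per_query` are hypothetical names, the field types are assumptions, and the tie rule (the first hit seen wins) is a choice made for this example.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// Merge two sorted, deduplicated slices into one sorted, deduplicated vector.
fn merge_sorted_dedup(a: &[u64], b: &[u64]) -> Vec<u64> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => {
                out.push(a[i]);
                i += 1;
            }
            Ordering::Greater => {
                out.push(b[j]);
                j += 1;
            }
            Ordering::Equal => {
                // Present in both inputs: emit once, advance both cursors.
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

/// Hypothetical hit record (query ID, bucket ID, score); field types are assumed.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    query_id: i64,
    bucket_id: u32,
    score: f64,
}

/// Keep the highest-scoring hit per query; on ties the first hit seen wins.
/// The result is sorted by query ID for deterministic output.
fn best_hit_per_query(hits: &[Hit]) -> Vec<Hit> {
    let mut best: HashMap<i64, Hit> = HashMap::new();
    for h in hits {
        match best.get(&h.query_id) {
            Some(cur) if cur.score >= h.score => {} // existing hit is at least as good
            _ => {
                best.insert(h.query_id, h.clone());
            }
        }
    }
    let mut out: Vec<Hit> = best.into_values().collect();
    out.sort_by_key(|h| h.query_id);
    out
}
```

A k-way merge can be built by folding this two-way merge over the inputs, although a heap-based merge avoids repeatedly copying the accumulated result when there are many inputs.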
Type Aliases§
- QueryRecord: ID (i64), Sequence Reference, Optional Pair Sequence Reference
- RypeResult: Convenience type alias for Results using RypeError.