Crate rype

Rype: High-performance genomic sequence classification using minimizer-based k-mer sketching.

This library provides efficient classification of DNA sequences against reference indices using RY-space (purine/pyrimidine) encoding and minimizer sketching.

§Core Concepts

RY Encoding: Reduces the 4-base DNA alphabet to a 2-letter alphabet stored as 1 bit per base (purines → 1, pyrimidines → 0)
Minimizers: Representative k-mers selected from sliding windows for efficient sketching
Inverted Index: Enables O(Q log U) lookups instead of O(B × Q × log M)
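
The first two concepts above can be illustrated with a minimal sketch. The helper names `ry_bit`, `ry_kmers`, and `minimizers` are hypothetical and are not the crate's API (the crate exposes `base_to_bit` and the `extract_*` functions, whose signatures are not shown on this page). The sketch assumes that minimizers are ordered by raw k-mer value and that a sequence containing an ambiguous base is rejected outright; the crate may order by a hash and handle ambiguous bases differently.

```rust
/// Map a nucleotide to one RY-space bit: purines (A, G) -> 1, pyrimidines (C, T) -> 0.
/// Returns None for any other byte (for example N).
/// Hypothetical helper for illustration, not the crate's `base_to_bit`.
fn ry_bit(base: u8) -> Option<u64> {
    match base.to_ascii_uppercase() {
        b'A' | b'G' => Some(1),
        b'C' | b'T' => Some(0),
        _ => None,
    }
}

/// Pack every k-mer of `seq` into a u64, one bit per base (1 <= k <= 64).
/// Assumption: a sequence with an ambiguous base yields None.
fn ry_kmers(seq: &[u8], k: usize) -> Option<Vec<u64>> {
    assert!(k >= 1 && k <= 64);
    if seq.len() < k {
        return Some(Vec::new());
    }
    let mask = if k == 64 { u64::MAX } else { (1u64 << k) - 1 };
    let mut out = Vec::with_capacity(seq.len() - k + 1);
    let mut cur = 0u64;
    for (i, &b) in seq.iter().enumerate() {
        // Shift in the newest base and drop the oldest one via the mask.
        cur = ((cur << 1) | ry_bit(b)?) & mask;
        if i + 1 >= k {
            out.push(cur);
        }
    }
    Some(out)
}

/// Select the smallest k-mer in each window of `w` consecutive k-mers,
/// collapsing consecutive repeats of the same minimizer.
fn minimizers(kmers: &[u64], w: usize) -> Vec<u64> {
    assert!(w >= 1);
    let mut out: Vec<u64> = Vec::new();
    for win in kmers.windows(w) {
        let m = *win.iter().min().unwrap();
        if out.last() != Some(&m) {
            out.push(m);
        }
    }
    out
}
```

For example, `ry_kmers(b"ACGT", 2)` packs the k-mers AC, CG, GT as binary 10, 01, 10, and `minimizers` with `w = 2` then keeps only the value 01.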

§Main Types

InvertedIndex: Minimizer → bucket mappings for fast classification
ShardedInvertedIndex: Memory-efficient sharded inverted index
MinimizerWorkspace: Reusable workspace for minimizer extraction

§Classification Functions

classify_batch_sharded_merge_join: Classify with sharded inverted index using merge-join (default)
classify_batch_sharded_parallel_rg: Classify with parallel row group processing

Modules§

c_api: C API for rype - FFI bindings for external language integration.
config
memory: Memory utilities for adaptive batch sizing.
parquet_index: Parquet-based format implementation for inverted index.

Structs§

BucketData: Bucket data with minimizers for building inverted index.
BucketFileStats: Per-bucket statistics about total sequence lengths of source files.
BucketMetadata: Bucket metadata for a single bucket.
FirstErrorCapture: Thread-safe error capture that stores only the first error.
HitResult: Query ID, Bucket ID, Score
IndexMetadata: Lightweight metadata-only view of an Index (without minimizer data)
InvertedIndex: CSR-format inverted index for fast minimizer → bucket lookups.
InvertedManifest: Manifest section for inverted index shards.
InvertedShardInfo: Per-shard metadata for inverted index.
LogRatioResult: Result of log-ratio computation for a single query.
MergeOptions: Options for index merging.
MergeStats: Statistics from a merge operation.
MinimizerWorkspace: Workspace for minimizer extraction algorithms.
ParquetManifest: Manifest containing index metadata.
ParquetReadOptions: Configuration options for reading Parquet inverted index files.
ParquetWriteOptions: Configuration options for Parquet file writing.
PartitionResult: Result of partitioning reads by numerator score into fast-path and needs-denominator groups.
QueryInvertedIndex: Query inverted index for merge-join classification. Stores sorted COO (coordinate) entries: (minimizer, packed_read_id).
ShardInfo: Information about a single shard in a sharded inverted index.
ShardManifest: Manifest describing a sharded inverted index.
ShardedInvertedIndex: Handle for a sharded inverted index.
StrandMinimizers: Structure-of-arrays for minimizers on a single strand.
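
The CSR (compressed sparse row) layout mentioned for InvertedIndex can be sketched as follows. This is an illustration of the general technique only: the type `CsrIndex`, its field names, and its methods are hypothetical, since this page does not show the fields of the crate's `InvertedIndex`.

```rust
/// Hypothetical CSR-style inverted index (not the crate's `InvertedIndex`).
/// `minimizers` is sorted and unique; the bucket IDs for `minimizers[i]`
/// live in `bucket_ids[offsets[i]..offsets[i + 1]]`.
struct CsrIndex {
    minimizers: Vec<u64>,
    offsets: Vec<usize>,
    bucket_ids: Vec<u32>,
}

impl CsrIndex {
    /// Build the index from unordered (minimizer, bucket_id) pairs.
    fn from_pairs(mut pairs: Vec<(u64, u32)>) -> Self {
        pairs.sort_unstable();
        pairs.dedup();
        let mut minimizers: Vec<u64> = Vec::new();
        let mut offsets: Vec<usize> = Vec::new();
        let mut bucket_ids: Vec<u32> = Vec::new();
        for (m, b) in pairs {
            // A new minimizer starts a new row at the current end of `bucket_ids`.
            if minimizers.last() != Some(&m) {
                minimizers.push(m);
                offsets.push(bucket_ids.len());
            }
            bucket_ids.push(b);
        }
        // Final sentinel so that row i always ends at offsets[i + 1].
        offsets.push(bucket_ids.len());
        CsrIndex { minimizers, offsets, bucket_ids }
    }

    /// Return the buckets containing minimizer `m` (empty slice if absent).
    fn buckets(&self, m: u64) -> &[u32] {
        match self.minimizers.binary_search(&m) {
            Ok(i) => &self.bucket_ids[self.offsets[i]..self.offsets[i + 1]],
            Err(_) => &[],
        }
    }
}
```

Storing one sorted key array plus offsets keeps each lookup to a single binary search over the unique minimizers, rather than one search per bucket.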

Enums§

FastPath: Indicates whether a log-ratio result was determined via a fast path (skipping the denominator classification) or computed exactly.
Orientation: Orientation of a sequence relative to the bucket baseline.
ParquetCompression: Compression codec for Parquet files.
RypeError: Unified error type for the rype library.
Strand: Strand indicator for minimizer origin.

Constants§

BUCKET_SOURCE_DELIM: Delimiter between filename and sequence name in bucket sources. Format: path/to/file.fa::sequence_name
ORIENTATION_FIRST_N: Default number of minimizers to check for orientation. Checking the first N sorted minimizers provides a sample for orientation decisions without iterating the full bucket.

Statics§

ENABLE_TIMING: Controls whether timing diagnostics are printed to stderr.

Functions§

base_to_bit: Convert a nucleotide base to its RY-space bit representation.
choose_orientation: Choose orientation with higher overlap. Ties favor Forward.
choose_orientation_sampled: Choose orientation by checking first N minimizers of each strand.
classify_batch_sharded_merge_join: Classify a batch of records against a sharded inverted index using merge-join.
classify_batch_sharded_parallel_rg: Classify using parallel row group processing.
classify_from_extracted_minimizers: Classify from pre-extracted minimizers against a sharded inverted index using merge-join.
classify_from_extracted_minimizers_parallel_rg: Classify from pre-extracted minimizers using parallel row group processing.
classify_from_query_index: Classify using a pre-built QueryInvertedIndex against a sharded inverted index.
classify_from_query_index_parallel_rg: Classify using a pre-built QueryInvertedIndex with parallel row group processing.
classify_log_ratio_batch: Classify a batch of reads using log-ratio (numerator vs denominator).
classify_with_sharded_negative: Classify with memory-efficient negative filtering using sharded index.
compute_log_ratio: Compute log10(numerator / denominator) with special handling for edge cases.
compute_source_hash: Compute a hash from bucket minimizer counts for validation.
count_hits: Count how many minimizers from a query match a bucket (binary search).
create_parquet_inverted_index: Create a Parquet inverted index directly from bucket data.
extract_batch_minimizers: Extract minimizers from a batch of query records in parallel.
extract_dual_strand_into: Extract minimizers from both strands of a sequence.
extract_into: Extract minimizers from a sequence (single strand).
extract_minimizer_set: Extract sorted, deduplicated minimizer sets per strand.
extract_strand_minimizers: Extract ordered minimizers with positions per strand (SoA layout).
filter_best_hits: Filter a list of HitResults to keep only the best hit per query.
get_paired_minimizers_into: Extract minimizers from paired-end reads.
is_parquet_index: Check if a path is a Parquet-based index directory.
kway_merge_dedup: Merge multiple sorted, deduplicated vectors into one sorted, deduplicated vector.
load_all_minimizers: Load all unique minimizers from a sharded index into a HashSet.
log_timing: Print timing info to stderr if ENABLE_TIMING is enabled.
merge_indices_streaming: Memory-bounded merge with streaming per-shard subtraction.
merge_sorted_into: Merge sorted source into sorted target in-place, deduplicating.
partition_by_numerator_score: Partition reads by numerator classification results into fast-path and needs-denominator groups.
validate_compatible_indices: Validate that two indices are compatible for log-ratio computation.
validate_log_ratio_indices: Validate that two sharded indices are compatible for log-ratio classification.
validate_single_bucket_index: Validate that the index has exactly one bucket and return its ID and name.
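
Two of the helpers listed above, binary-search hit counting (count_hits) and k-way merging with deduplication (kway_merge_dedup), describe standard techniques that can be sketched without the crate. The functions below carry a `_sketch` suffix to mark them as hypothetical; their signatures are assumptions and may differ from the crate's own. The merge uses a binary heap, which is one common choice and not necessarily what the crate does.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Count how many query minimizers occur in a sorted bucket, using binary search.
/// Hypothetical signature; `bucket_sorted` must be sorted ascending.
fn count_hits_sketch(query: &[u64], bucket_sorted: &[u64]) -> usize {
    query
        .iter()
        .filter(|&&m| bucket_sorted.binary_search(&m).is_ok())
        .count()
}

/// Merge several sorted, deduplicated vectors into one sorted, deduplicated vector.
/// Hypothetical signature; a min-heap holds one cursor per input.
fn kway_merge_dedup_sketch(inputs: &[Vec<u64>]) -> Vec<u64> {
    // Heap entries are (value, source index, position), wrapped in Reverse
    // so that the smallest value is popped first.
    let mut heap = BinaryHeap::new();
    for (src, v) in inputs.iter().enumerate() {
        if let Some(&first) = v.first() {
            heap.push(Reverse((first, src, 0usize)));
        }
    }
    let mut out: Vec<u64> = Vec::new();
    while let Some(Reverse((val, src, pos))) = heap.pop() {
        // Values arrive in ascending order, so duplicates are adjacent.
        if out.last() != Some(&val) {
            out.push(val);
        }
        if let Some(&next) = inputs[src].get(pos + 1) {
            heap.push(Reverse((next, src, pos + 1)));
        }
    }
    out
}
```

Merging `[1, 4, 9]`, `[2, 4]`, `[]`, and `[1, 9, 10]` this way gives `[1, 2, 4, 9, 10]`, and counting the query `[2, 4, 7]` against that result gives 2 hits.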

Type Aliases§

QueryRecord: ID (i64), Sequence Reference, Optional Pair Sequence Reference
RypeResult: Convenience type alias for Results using RypeError.
