GRIT: Genomic Range Interval Toolkit
A high-performance genomic interval toolkit written in Rust. Drop-in replacement for bedtools with 3-15x faster performance.
Table of Contents
- Why GRIT?
- Installation
- Documentation
- Quick Start
- Migrating from bedtools
- Commands
- intersect - Find overlapping intervals
- subtract - Remove overlapping regions
- merge - Combine overlapping intervals
- sort - Sort BED files
- closest - Find nearest intervals
- window - Find intervals within a window
- coverage - Calculate interval coverage
- slop - Extend intervals
- complement - Find gaps between intervals
- genomecov - Genome-wide coverage
- jaccard - Similarity coefficient
- multiinter - Multi-file intersection
- Utilities
- generate - Generate synthetic datasets
- Input Validation
- Streaming Mode
- Performance
- Testing
- Contributing
- License
Why GRIT?
| Feature | bedtools | GRIT |
|---|---|---|
| Speed | Baseline | 3-15x faster |
| Memory (streaming) | N/A | O(k) constant |
| Parallelization | Single-threaded | Multi-core |
| Large file support | Limited by RAM | Process 50GB+ on 4GB RAM |
GRIT is designed for:
- High-throughput genomics - Process millions of intervals efficiently
- Memory-constrained environments - Streaming mode uses minimal RAM
- Drop-in replacement - Same CLI syntax as bedtools
- Reproducibility - Deterministic output regardless of thread count
Installation
From crates.io (Recommended)
From Homebrew (macOS/Linux)
From Source
Verify Installation
Documentation
Full command documentation with examples: https://manish59.github.io/grit/
Quick Start
# Find overlapping intervals between two BED files
# Merge overlapping intervals
# Sort a BED file
# Use streaming mode for large files (minimal memory)
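As a sketch of the four steps listed above, using only flags documented in this README. The file names `a.bed` and `b.bed` and their coordinates are placeholders, and the commands run only when `grit` is on the PATH:

```shell
# Placeholder inputs: two small, pre-sorted, tab-separated BED files
printf 'chr1\t100\t200\nchr1\t150\t300\nchr2\t50\t80\n' > a.bed
printf 'chr1\t180\t250\nchr2\t60\t70\n' > b.bed

if command -v grit >/dev/null 2>&1; then
  grit intersect -a a.bed -b b.bed > overlaps.bed                  # overlapping intervals
  grit merge -i a.bed > merged.bed                                 # merge overlaps
  grit sort -i a.bed > sorted.bed                                  # sort a BED file
  grit intersect -a a.bed -b b.bed --streaming > overlaps_low_mem.bed  # streaming mode
fi
```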
Migrating from bedtools
GRIT is designed as a drop-in replacement for bedtools. Here's how common bedtools commands map to GRIT:
Command Comparison Table
| bedtools | GRIT (basic) | GRIT (optimized) |
|---|---|---|
| `bedtools intersect -a A.bed -b B.bed` | `grit intersect -a A.bed -b B.bed` | `grit intersect -a A.bed -b B.bed --streaming --assume-sorted` |
| `bedtools intersect -a A.bed -b B.bed -sorted` | `grit intersect -a A.bed -b B.bed` | `grit intersect -a A.bed -b B.bed --streaming --assume-sorted` |
| `bedtools subtract -a A.bed -b B.bed` | `grit subtract -a A.bed -b B.bed` | `grit subtract -a A.bed -b B.bed --streaming --assume-sorted` |
| `bedtools merge -i A.bed` | `grit merge -i A.bed` | `grit merge -i A.bed --assume-sorted` |
| `bedtools closest -a A.bed -b B.bed` | `grit closest -a A.bed -b B.bed` | `grit closest -a A.bed -b B.bed --streaming --assume-sorted` |
| `bedtools coverage -a A.bed -b B.bed -sorted` | `grit coverage -a A.bed -b B.bed` | `grit coverage -a A.bed -b B.bed --assume-sorted` |
| `bedtools window -a A.bed -b B.bed -w 1000` | `grit window -a A.bed -b B.bed -w 1000` | `grit window -a A.bed -b B.bed -w 1000 --assume-sorted` |
| `bedtools sort -i A.bed` | `grit sort -i A.bed` | `grit sort -i A.bed` |
| `bedtools slop -i A.bed -g genome.txt -b 100` | `grit slop -i A.bed -g genome.txt -b 100` | Same |
| `bedtools complement -i A.bed -g genome.txt` | `grit complement -i A.bed -g genome.txt` | `grit complement -i A.bed -g genome.txt --assume-sorted` |
| `bedtools jaccard -a A.bed -b B.bed` | `grit jaccard -a A.bed -b B.bed` | Same |
Key GRIT Flags
| Flag | Description | When to Use |
|---|---|---|
| `--streaming` | O(k) memory mode | Large files (>1GB), memory-constrained systems |
| `--assume-sorted` | Skip sort validation | Pre-sorted files for faster startup |
| `--allow-unsorted` | Auto-sort in memory | Unsorted input (uses more memory) |
| `-g, --genome` | Validate chromosome order | Ensure genome-specific ordering |
| `--bedtools-compatible` | Match bedtools behavior | Zero-length interval handling |
Performance Modes
# Basic (validates input, loads into memory)
# Streaming (constant memory, requires sorted input)
# Maximum performance (skip validation, streaming)
# Handle unsorted input (auto-sorts in memory)
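One command per mode might look like the sketch below, built only from the flags in the table above. `a.bed` and `b.bed` are placeholder names, and `intersect` is used merely as a representative command:

```shell
# Placeholder pre-sorted inputs
printf 'chr1\t100\t200\nchr1\t150\t300\n' > a.bed
printf 'chr1\t180\t250\n' > b.bed

if command -v grit >/dev/null 2>&1; then
  # Basic: validates sort order, loads input into memory
  grit intersect -a a.bed -b b.bed > basic.bed
  # Streaming: constant memory, sorted input required
  grit intersect -a a.bed -b b.bed --streaming > streaming.bed
  # Maximum performance: streaming with sort validation skipped
  grit intersect -a a.bed -b b.bed --streaming --assume-sorted > fast.bed
  # Unsorted input: grit re-sorts in memory
  grit intersect -a a.bed -b b.bed --allow-unsorted > resorted.bed
fi
```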
Common Workflow: bedtools to GRIT
# bedtools workflow
# GRIT equivalent (faster)
# GRIT pipeline (even faster - no intermediate files)
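One way such a workflow might look, as a sketch only: sort then merge, first through an intermediate file and then as a pipe. It assumes `merge` reads stdin via `-i -` as documented in its command section; `raw.bed` is a placeholder name, and the bedtools form is shown as a comment for comparison:

```shell
# Placeholder unsorted input
printf 'chr1\t150\t300\nchr1\t100\t200\nchr2\t50\t80\n' > raw.bed

if command -v grit >/dev/null 2>&1; then
  # bedtools workflow, for comparison:
  #   bedtools sort -i raw.bed > sorted.bed
  #   bedtools merge -i sorted.bed > merged.bed
  # GRIT equivalent with an intermediate file
  grit sort -i raw.bed > sorted.bed
  grit merge -i sorted.bed > merged.bed
  # GRIT pipeline without intermediate files
  grit sort -i raw.bed | grit merge -i - --assume-sorted > merged_piped.bed
fi
```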
Global Options
All commands support these options:
| Option | Description |
|---|---|
| `-t, --threads <N>` | Number of threads (default: all CPUs) |
| `--bedtools-compatible` | Normalize zero-length intervals to 1bp for bedtools parity |
| `-h, --help` | Show help for any command |
| `-V, --version` | Show version |
# Run with 4 threads
# Enable bedtools-compatible mode for zero-length intervals
# Get help for a specific command
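A sketch of these three invocations follows. Placing the global options before the subcommand is an assumption, not something this README states; `a.bed` and `b.bed` are placeholder names:

```shell
# Placeholder inputs
printf 'chr1\t100\t200\n' > a.bed
printf 'chr1\t150\t250\n' > b.bed

if command -v grit >/dev/null 2>&1; then
  # Run with 4 threads (assumed position: before the subcommand)
  grit --threads 4 intersect -a a.bed -b b.bed
  # Normalize zero-length intervals to 1bp for bedtools parity
  grit --bedtools-compatible intersect -a a.bed -b b.bed
  # Show help for a specific command
  grit intersect --help
fi
```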
Commands
intersect
Find overlapping intervals between two BED files.
When to Use
- Identify genomic regions that overlap between datasets (e.g., peaks vs. promoters)
- Filter intervals based on overlap with a reference set
- Find regions with NO overlap (exclusion analysis)
- Count how many times each region is covered
Why Use GRIT
- 4.4x faster than bedtools intersect
- O(k) memory in streaming mode (k = max concurrent overlaps)
- 19x less memory than bedtools
How to Use
grit intersect -a <FILE_A> -b <FILE_B> [OPTIONS]
Required:
- `-a, --file-a <FILE>` - Query intervals (file A)
- `-b, --file-b <FILE>` - Reference intervals (file B)
Output Modes:
| Option | Output |
|---|---|
| (default) | Overlap region only |
| `--wa` | Original A entry |
| `--wb` | Overlap region + B entry |
| `--wa --wb` | Both A and B entries |
| `-c, --count` | A entry + overlap count |
| `-u, --unique` | A entry once if ANY overlap |
| `-v, --no-overlap` | A entries with NO overlap |
Filtering:
| Option | Description |
|---|---|
| `-f, --fraction <FLOAT>` | Minimum overlap as fraction of A (0.0-1.0) |
| `-r, --reciprocal` | Require reciprocal fraction overlap |
Performance & Validation:
| Option | Description |
|---|---|
| `--streaming` | O(k) memory mode (requires sorted input) |
| `--assume-sorted` | Skip sort validation (faster for pre-sorted files) |
| `--allow-unsorted` | Allow unsorted input (loads and re-sorts in memory) |
| `-g, --genome <FILE>` | Validate chromosome order against genome file |
| `--stats` | Print statistics to stderr |
Examples
# Basic: find overlap regions
# Get original entries from both files
# Find peaks NOT in blacklist regions
# Require 50% overlap of query interval
# Require 50% reciprocal overlap (both directions)
# Count overlaps per interval
# Report each query interval once (if it has any overlap)
# Large files with minimal memory
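The examples above could be written as in the following sketch, which uses only the options from the tables in this section. The file names `peaks.bed`, `genes.bed` and `blacklist.bed` are placeholders:

```shell
# Placeholder pre-sorted inputs
printf 'chr1\t100\t200\nchr1\t500\t600\n' > peaks.bed
printf 'chr1\t150\t250\n' > genes.bed
printf 'chr1\t550\t650\n' > blacklist.bed

if command -v grit >/dev/null 2>&1; then
  grit intersect -a peaks.bed -b genes.bed              # overlap regions only
  grit intersect -a peaks.bed -b genes.bed --wa --wb    # original entries from both files
  grit intersect -a peaks.bed -b blacklist.bed -v       # peaks NOT in blacklist regions
  grit intersect -a peaks.bed -b genes.bed -f 0.5       # at least 50% of the query covered
  grit intersect -a peaks.bed -b genes.bed -f 0.5 -r    # 50% overlap in both directions
  grit intersect -a peaks.bed -b genes.bed -c           # overlap count per query interval
  grit intersect -a peaks.bed -b genes.bed -u           # each overlapping query once
  grit intersect -a peaks.bed -b genes.bed --streaming --assume-sorted  # minimal memory
fi
```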
subtract
Remove portions of A that overlap with B.
When to Use
- Remove blacklist regions from your intervals
- Exclude known features from analysis regions
- Clean up interval sets by removing specific regions
Why Use GRIT
- 6.5x faster than bedtools subtract
- 19x less memory in streaming mode
- Precise interval arithmetic
How to Use
grit subtract -a <FILE_A> -b <FILE_B> [OPTIONS]
Required:
- `-a, --file-a <FILE>` - Intervals to modify
- `-b, --file-b <FILE>` - Intervals to remove
Options:
| Option | Description |
|---|---|
| `-A, --remove-entire` | Remove entire A interval if ANY overlap |
| `-f, --fraction <FLOAT>` | Minimum overlap fraction required |
| `-r, --reciprocal` | Require reciprocal fraction |
| `--streaming` | O(k) memory mode (requires sorted input) |
| `--assume-sorted` | Skip sort validation (faster for pre-sorted files) |
| `--allow-unsorted` | Allow unsorted input (loads and re-sorts in memory) |
| `-g, --genome <FILE>` | Validate chromosome order against genome file |
| `--stats` | Print statistics to stderr |
Examples
# Remove blacklist regions (keeps non-overlapping portions)
# Remove entire interval if ANY overlap with blacklist
# Only subtract if >50% overlap
# Large files with streaming
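A sketch of these four cases, using only the options listed above; `regions.bed` and `blacklist.bed` are placeholder names:

```shell
# Placeholder pre-sorted inputs
printf 'chr1\t100\t400\nchr1\t500\t600\n' > regions.bed
printf 'chr1\t150\t250\n' > blacklist.bed

if command -v grit >/dev/null 2>&1; then
  grit subtract -a regions.bed -b blacklist.bed            # keep non-overlapping portions
  grit subtract -a regions.bed -b blacklist.bed -A         # drop A entirely on any overlap
  grit subtract -a regions.bed -b blacklist.bed -f 0.5     # subtract only above 50% overlap
  grit subtract -a regions.bed -b blacklist.bed --streaming --assume-sorted  # large files
fi
```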
merge
Combine overlapping and adjacent intervals into single intervals.
When to Use
- Collapse redundant overlapping intervals
- Create non-overlapping interval sets
- Simplify interval data before downstream analysis
- Combine intervals within a certain distance
Why Use GRIT
- 10.8x faster than bedtools merge
- ~3 MB memory regardless of file size
- Streaming by default (no `--streaming` flag needed)
How to Use
grit merge -i <INPUT> [OPTIONS]
Required:
- `-i, --input <FILE>` - Input BED file (use `-` for stdin)
Options:
| Option | Description |
|---|---|
| `-d, --distance <INT>` | Merge intervals within this distance (default: 0) |
| `-s, --strand` | Only merge intervals on same strand |
| `-c, --count` | Report count of merged intervals |
| `--in-memory` | Load all records (for unsorted input) |
| `--assume-sorted` | Skip sort validation (faster for pre-sorted files) |
| `-g, --genome <FILE>` | Validate chromosome order against genome file |
| `--stats` | Print statistics to stderr |
Examples
# Basic merge (overlapping and adjacent)
# Merge intervals within 100bp of each other
# Strand-specific merging
# Count how many intervals were merged
# Read from stdin (piping)
# Handle unsorted input
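The merge examples above might be written as follows. This is a sketch using only the options in the table; `a.bed` is a placeholder six-column file so that the strand option has something to act on:

```shell
# Placeholder pre-sorted BED6 input (name, score and strand columns included)
printf 'chr1\t100\t200\ta\t0\t+\nchr1\t180\t300\tb\t0\t+\nchr1\t350\t400\tc\t0\t-\n' > a.bed

if command -v grit >/dev/null 2>&1; then
  grit merge -i a.bed               # overlapping and adjacent intervals
  grit merge -i a.bed -d 100        # also merge intervals within 100bp
  grit merge -i a.bed -s            # strand-specific merging
  grit merge -i a.bed -c            # report how many intervals were merged
  cat a.bed | grit merge -i -       # read from stdin
  grit merge -i a.bed --in-memory   # handle unsorted input
fi
```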
sort
Sort BED files by chromosome and position.
When to Use
- Prepare files for streaming operations
- Ensure consistent ordering for reproducibility
- Sort by interval size for analysis
- Use custom chromosome ordering (genome file)
Why Use GRIT
- O(n) radix sort vs O(n log n) comparison sort
- Memory-mapped I/O for large files
- Stable sort preserves input order for ties
How to Use
grit sort -i <INPUT> [OPTIONS]
Required:
- `-i, --input <FILE>` - Input BED file (use `-` for stdin)
Options:
| Option | Description |
|---|---|
| `-g, --genome <FILE>` | Custom chromosome order from genome file |
| `--sizeA` | Sort by interval size (ascending) |
| `--sizeD` | Sort by interval size (descending) |
| `-r, --reverse` | Reverse final sort order |
| `--chrThenSizeA` | Sort by chromosome name only |
| `--stats` | Print statistics to stderr |
Examples
# Default sort (chromosome lexicographic, then start position)
# Custom chromosome order from genome file
# Sort by interval size (smallest first)
# Sort by interval size (largest first)
# Reverse sort order
# Read from stdin
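A sketch of the sort examples above, using only the listed options; `a.bed` and `genome.txt` are placeholder files:

```shell
# Placeholder unsorted input and a two-chromosome genome file (tab-separated)
printf 'chr2\t50\t80\nchr1\t150\t300\nchr1\t100\t120\n' > a.bed
printf 'chr1\t248956422\nchr2\t242193529\n' > genome.txt

if command -v grit >/dev/null 2>&1; then
  grit sort -i a.bed                  # chromosome (lexicographic), then start
  grit sort -i a.bed -g genome.txt    # chromosome order taken from the genome file
  grit sort -i a.bed --sizeA          # smallest intervals first
  grit sort -i a.bed --sizeD          # largest intervals first
  grit sort -i a.bed -r               # reversed order
  cat a.bed | grit sort -i -          # read from stdin
fi
```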
Genome File Format:
chr1 248956422
chr2 242193529
chr3 198295559
closest
Find the nearest interval in B for each interval in A.
When to Use
- Find nearest gene for each variant
- Identify closest regulatory element to each peak
- Distance-to-feature analysis
- Nearest neighbor genomic analysis
Why Use GRIT
- Efficient O(n log m) binary search algorithm
- Flexible tie-breaking options
- Direction-aware searching (upstream/downstream)
How to Use
grit closest -a <FILE_A> -b <FILE_B> [OPTIONS]
Required:
- `-a, --file-a <FILE>` - Query intervals
- `-b, --file-b <FILE>` - Reference intervals to search
Options:
| Option | Description |
|---|---|
| `-d, --distance` | Report distance in output |
| `-t, --tie <MODE>` | Handle ties: all, first, last |
| `--io` | Ignore overlapping intervals |
| `--iu` | Ignore upstream intervals |
| `--id` | Ignore downstream intervals |
| `-D, --max-distance <INT>` | Maximum search distance |
| `--streaming` | O(k) memory mode (requires sorted input) |
| `--assume-sorted` | Skip sort validation (faster for pre-sorted files) |
| `--allow-unsorted` | Allow unsorted input (loads and re-sorts in memory) |
| `-g, --genome <FILE>` | Validate chromosome order against genome file |
Examples
# Find closest gene for each variant
# Include distance in output
# Only report first tie
# Find nearest non-overlapping interval
# Only look downstream
# Only look upstream
# Limit search to 10kb
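These examples could be sketched as below, using only the options in the table. `variants.bed` and `genes.bed` are placeholder names; mapping "only look downstream" to `--iu` (and upstream to `--id`) follows from the option descriptions:

```shell
# Placeholder pre-sorted inputs: one variant between two genes
printf 'chr1\t1000\t1001\n' > variants.bed
printf 'chr1\t200\t800\nchr1\t1500\t2500\n' > genes.bed

if command -v grit >/dev/null 2>&1; then
  grit closest -a variants.bed -b genes.bed             # closest gene per variant
  grit closest -a variants.bed -b genes.bed -d          # include distance
  grit closest -a variants.bed -b genes.bed -t first    # report only the first tie
  grit closest -a variants.bed -b genes.bed --io        # nearest non-overlapping interval
  grit closest -a variants.bed -b genes.bed --iu        # ignore upstream: look downstream
  grit closest -a variants.bed -b genes.bed --id        # ignore downstream: look upstream
  grit closest -a variants.bed -b genes.bed -D 10000    # limit search to 10kb
fi
```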
window
Find intervals in B within a window around intervals in A.
When to Use
- Find features within a distance of query regions
- Identify nearby regulatory elements
- Proximity-based feature association
- Asymmetric distance searches (different upstream/downstream)
Why Use GRIT
- Flexible symmetric and asymmetric windows
- Count or report modes
- Efficient interval tree queries
How to Use
grit window -a <FILE_A> -b <FILE_B> [OPTIONS]
Required:
- `-a, --file-a <FILE>` - Query intervals
- `-b, --file-b <FILE>` - Reference intervals
Options:
| Option | Description |
|---|---|
| `-w, --window <INT>` | Window size both sides (default: 1000) |
| `-l, --left <INT>` | Left/upstream window size |
| `-r, --right <INT>` | Right/downstream window size |
| `-c, --count` | Report count of matches |
| `-v, --no-overlap` | Report A intervals with NO matches |
| `--assume-sorted` | Skip sort validation (faster for pre-sorted files) |
| `-g, --genome <FILE>` | Validate chromosome order against genome file |
Examples
# Find features within 1kb of query regions
# Asymmetric window: 5kb upstream, 1kb downstream
# Count features in window
# Find regions with no features nearby
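A sketch of the window examples, using only the listed options; `queries.bed` and `features.bed` are placeholder names:

```shell
# Placeholder pre-sorted inputs
printf 'chr1\t10000\t10100\n' > queries.bed
printf 'chr1\t6000\t6200\nchr1\t10500\t10600\n' > features.bed

if command -v grit >/dev/null 2>&1; then
  grit window -a queries.bed -b features.bed -w 1000           # within 1kb on both sides
  grit window -a queries.bed -b features.bed -l 5000 -r 1000   # 5kb upstream, 1kb downstream
  grit window -a queries.bed -b features.bed -w 1000 -c        # count features in window
  grit window -a queries.bed -b features.bed -w 1000 -v        # queries with nothing nearby
fi
```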
coverage
Calculate coverage depth of B intervals over A intervals.
When to Use
- Count reads overlapping genomic regions
- Calculate what fraction of each region is covered
- Generate coverage statistics for intervals
- Quality control of sequencing data
Why Use GRIT
- 9x faster than bedtools coverage
- 134x less memory than bedtools
- Multiple output formats (counts, histogram, per-base)
How to Use
grit coverage -a <FILE_A> -b <FILE_B> [OPTIONS]
Required:
- `-a, --file-a <FILE>` - Target regions
- `-b, --file-b <FILE>` - Features to count (reads, etc.)
Options:
| Option | Description |
|---|---|
| `--hist` | Report histogram of coverage depths |
| `-d, --per-base` | Report depth at each position |
| `--mean` | Report mean depth per region |
| `--assume-sorted` | Skip sort validation (faster for pre-sorted files) |
| `-g, --genome <FILE>` | Validate chromosome order against genome file |
Output Format (default):
chrom start end name score strand count bases_covered length fraction
Examples
# Basic coverage (count, covered bases, length, fraction)
# Mean depth per region
# Per-base depth
# Histogram of coverage depths
# Streaming mode for large files
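The coverage examples might look like this sketch. `regions.bed` and `reads.bed` are placeholder names. No `--streaming` option is listed for `coverage`, so the large-file case below uses `--assume-sorted`, which is an interpretation rather than a documented recipe:

```shell
# Placeholder pre-sorted inputs
printf 'chr1\t100\t200\n' > regions.bed
printf 'chr1\t90\t150\nchr1\t120\t180\n' > reads.bed

if command -v grit >/dev/null 2>&1; then
  grit coverage -a regions.bed -b reads.bed                  # count, covered bases, length, fraction
  grit coverage -a regions.bed -b reads.bed --mean           # mean depth per region
  grit coverage -a regions.bed -b reads.bed -d               # per-base depth
  grit coverage -a regions.bed -b reads.bed --hist           # histogram of depths
  grit coverage -a regions.bed -b reads.bed --assume-sorted  # large pre-sorted files
fi
```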
slop
Extend intervals by a specified number of bases.
When to Use
- Expand peaks to include flanking regions
- Create promoter regions from TSS coordinates
- Add padding around features
- Strand-aware extension (upstream/downstream)
Why Use GRIT
- Respects chromosome boundaries
- Strand-aware extension
- Percentage-based extension option
How to Use
grit slop -i <INPUT> -g <GENOME> [OPTIONS]
Required:
- `-i, --input <FILE>` - Input BED file
- `-g, --genome <FILE>` - Chromosome sizes file
Options:
| Option | Description |
|---|---|
| `-b, --both <INT>` | Extend both sides by N bases |
| `-l, --left <INT>` | Extend left/upstream |
| `-r, --right <INT>` | Extend right/downstream |
| `-s, --strand` | Use strand for upstream/downstream |
| `--pct` | Values are fractions of interval size |
Examples
# Extend 100bp on both sides
# Create 500bp upstream + 100bp downstream regions
# Strand-aware extension (upstream/downstream relative to strand)
# Extend by 10% of interval size on each side
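A sketch of the slop examples, using only the listed options. `tss.bed` and `genome.txt` are placeholder files; writing 10% as `-b 0.1 --pct` assumes that `--pct` accepts a fractional value for `-b`:

```shell
# Placeholder BED6 input (strand in column 6) and chromosome sizes file
printf 'chr1\t1000\t2000\tpeak1\t0\t+\nchr1\t5000\t5500\tpeak2\t0\t-\n' > tss.bed
printf 'chr1\t248956422\n' > genome.txt

if command -v grit >/dev/null 2>&1; then
  grit slop -i tss.bed -g genome.txt -b 100             # 100bp on both sides
  grit slop -i tss.bed -g genome.txt -l 500 -r 100      # 500bp upstream, 100bp downstream
  grit slop -i tss.bed -g genome.txt -l 500 -r 100 -s   # same, relative to strand
  grit slop -i tss.bed -g genome.txt -b 0.1 --pct       # 10% of interval size per side
fi
```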
Genome File Format:
chr1 248956422
chr2 242193529
chr3 198295559
complement
Find genomic regions NOT covered by input intervals.
When to Use
- Find gaps between features
- Identify intergenic regions
- Create inverse of an interval set
- Find uncovered portions of chromosomes
Why Use GRIT
- O(n) single-pass streaming algorithm
- Memory efficient
- Simple, focused operation
How to Use
grit complement -i <INPUT> -g <GENOME>
Required:
- `-i, --input <FILE>` - Input BED file
- `-g, --genome <FILE>` - Chromosome sizes file
Examples
# Find gaps between intervals
# Find uncovered regions
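Both examples reduce to the same invocation on different inputs. A sketch, with `genes.bed`, `genome.txt` and the output name as placeholders:

```shell
# Placeholder pre-sorted input and chromosome sizes file
printf 'chr1\t100\t200\nchr1\t500\t600\n' > genes.bed
printf 'chr1\t1000\n' > genome.txt

if command -v grit >/dev/null 2>&1; then
  # Regions of each chromosome NOT covered by genes.bed
  grit complement -i genes.bed -g genome.txt > gaps.bed
fi
```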
genomecov
Compute genome-wide coverage statistics.
When to Use
- Generate coverage tracks for visualization
- Compute depth distribution across genome
- Create BedGraph files for genome browsers
- Normalize coverage (scaling)
Why Use GRIT
- Multiple output formats (histogram, BedGraph)
- Coverage scaling for normalization
- Efficient whole-genome processing
How to Use
grit genomecov -i <INPUT> -g <GENOME> [OPTIONS]
Required:
- `-i, --input <FILE>` - Input BED file
- `-g, --genome <FILE>` - Chromosome sizes file
Options:
| Option | Description |
|---|---|
| `-d, --per-base` | Report depth at each position (1-based) |
| `--bg` | BedGraph format (non-zero regions only) |
| `--bga` | BedGraph format (including zero coverage) |
| `--scale <FLOAT>` | Scale depth by factor (default: 1.0) |
Examples
# Default histogram output
# BedGraph for visualization (non-zero only)
# BedGraph including zero coverage regions
# Per-base depth (large output)
# Scale coverage (e.g., RPM normalization)
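A sketch of the genomecov examples, using only the listed options. `reads.bed` and `genome.txt` are placeholders, and the scale factor `0.5` is an arbitrary illustrative value:

```shell
# Placeholder pre-sorted input and chromosome sizes file
printf 'chr1\t100\t200\nchr1\t150\t250\n' > reads.bed
printf 'chr1\t1000\n' > genome.txt

if command -v grit >/dev/null 2>&1; then
  grit genomecov -i reads.bed -g genome.txt                    # default histogram
  grit genomecov -i reads.bed -g genome.txt --bg               # BedGraph, non-zero only
  grit genomecov -i reads.bed -g genome.txt --bga              # BedGraph with zero coverage
  grit genomecov -i reads.bed -g genome.txt -d                 # per-base depth (large output)
  grit genomecov -i reads.bed -g genome.txt --bg --scale 0.5   # scaled coverage
fi
```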
Histogram Output Format:
chrom depth bases_at_depth chrom_size fraction
jaccard
Calculate Jaccard similarity coefficient between two interval sets.
When to Use
- Compare similarity of two interval sets
- Measure overlap between experiments
- Quality control: compare replicates
- Quantify agreement between methods
Why Use GRIT
- O(n + m) efficient sweep-line algorithm
- Single-pass computation
- Standard Jaccard metric
How to Use
grit jaccard -a <FILE_A> -b <FILE_B>
Required:
- `-a, --file-a <FILE>` - First BED file
- `-b, --file-b <FILE>` - Second BED file
Output Format:
intersection union jaccard n_intersections
15000 45000 0.333333 150
Examples
# Compare two peak sets
# Compare methods
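A sketch of the comparison, with `rep1.bed` and `rep2.bed` as placeholder names. The first command only re-derives the `jaccard` column of the sample output row above from its `intersection` and `union` columns; it is an arithmetic check, not a statement about GRIT's internals:

```shell
# Sample row above: intersection = 15000, union = 45000
LC_ALL=C awk 'BEGIN { printf "%.6f\n", 15000 / 45000 }' > ratio.txt
cat ratio.txt

# Placeholder pre-sorted inputs
printf 'chr1\t100\t200\n' > rep1.bed
printf 'chr1\t150\t250\n' > rep2.bed

if command -v grit >/dev/null 2>&1; then
  grit jaccard -a rep1.bed -b rep2.bed   # compare two peak sets
fi
```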
multiinter
Identify intervals and which files contain them across multiple BED files.
When to Use
- Find common intervals across multiple samples
- Identify sample-specific intervals
- Multi-way intersection analysis
- Consensus peak calling
Why Use GRIT
- Handles arbitrary number of files
- Reports which files contain each interval
- Cluster mode for strict consensus
How to Use
grit multiinter -i <FILE1> <FILE2> [FILE3...] [OPTIONS]
Required:
- `-i, --input <FILES>` - Two or more input BED files
Options:
| Option | Description |
|---|---|
| `--cluster` | Only output intervals in ALL files |
Examples
# Find intervals across 3 files (reports which files contain each)
# Find intervals present in ALL files (consensus)
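Following the usage line above, the two examples might be written as in this sketch; the three sample file names are placeholders:

```shell
# Placeholder pre-sorted inputs, one per sample
printf 'chr1\t100\t200\n' > s1.bed
printf 'chr1\t150\t250\n' > s2.bed
printf 'chr1\t180\t300\n' > s3.bed

if command -v grit >/dev/null 2>&1; then
  grit multiinter -i s1.bed s2.bed s3.bed             # report which files contain each interval
  grit multiinter -i s1.bed s2.bed s3.bed --cluster   # only intervals present in ALL files
fi
```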
Utilities
generate
Generate synthetic BED datasets for benchmarking and testing.
When to Use
- Create reproducible test data for benchmarking
- Generate datasets with specific characteristics (uniform, clustered)
- Test GRIT commands with controlled data sizes
- Compare performance across different data distributions
How to Use
grit generate [OPTIONS]
Options:
| Option | Description |
|---|---|
| `-o, --output <DIR>` | Output directory (default: ./grit_bench_data) |
| `--sizes <SIZES>` | Comma-separated sizes: 100K, 1M, 10M |
| `--mode <MODE>` | Distribution: balanced, clustered, identical, skewed-a-gt-b, skewed-b-gt-a, all |
| `--seed <INT>` | Random seed for reproducibility (default: 42) |
| `--a <SIZE>` | Custom A file size |
| `--b <SIZE>` | Custom B file size |
| `--sorted <yes\|no\|auto>` | Output sorting (default: auto) |
| `--hotspot-frac <FLOAT>` | Genome fraction for hotspots (default: 0.05) |
| `--hotspot-weight <FLOAT>` | Interval fraction in hotspots (default: 0.80) |
| `--force` | Overwrite existing files |
Size Notation:
| Format | Example | Value |
|---|---|---|
| Number | `1000` | 1,000 intervals |
| K suffix | `100K` | 100,000 intervals |
| M suffix | `10M` | 10,000,000 intervals |
Generation Modes:
| Mode | Description |
|---|---|
| `balanced` | Equal-sized A and B with uniform distribution |
| `clustered` | Intervals concentrated in hotspot regions |
| `identical` | A and B contain identical intervals |
| `skewed-a-gt-b` | A file 10x larger than B |
| `skewed-b-gt-a` | B file 10x larger than A |
| `all` | Generate all modes |
Examples
# Quick test data (100K intervals)
# Benchmark suite (multiple sizes)
# Custom asymmetric sizes
# Clustered data (simulates ChIP-seq peaks)
# Unsorted output for testing sort validation
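A sketch of the generate examples, combining only the options from the table above. The particular sizes and the use of `--force` on repeated runs are illustrative choices:

```shell
# Output directory; ./grit_bench_data is the documented default
OUT_DIR=./grit_bench_data

if command -v grit >/dev/null 2>&1; then
  grit generate -o "$OUT_DIR" --sizes 100K                            # quick test data
  grit generate -o "$OUT_DIR" --sizes 100K,1M,10M --force             # benchmark suite
  grit generate -o "$OUT_DIR" --a 1M --b 100K --force                 # asymmetric A and B
  grit generate -o "$OUT_DIR" --sizes 100K --mode clustered --force   # hotspot-heavy data
  grit generate -o "$OUT_DIR" --sizes 100K --sorted no --force        # unsorted output
fi
```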
Output Structure:
grit_bench_data/
├── balanced/
│ └── 1M/
│ ├── A.bed
│ └── B.bed
├── clustered/
│ └── ...
└── ...
Input Validation
GRIT validates input files to prevent silent failures from incorrectly sorted data. This section explains the validation behavior and how to control it.
Sort Order Validation
By default, GRIT validates that input files are sorted before processing. Most commands require sorted input (by chromosome, then by start position).
If files are unsorted, you'll see a helpful error:
Error: File A is not sorted: position 100 at line 5 comes after 200 on chr1
Fix: Run 'grit sort -i a.bed > sorted_a.bed' first.
Or use '--allow-unsorted' to load and re-sort in memory (uses O(n) memory).
How to sort files:
# Sort with GRIT (recommended)
# Or use standard Unix sort
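Both routes can be sketched as follows. The `grit sort` line mirrors the fix suggested in the error message above; the Unix `sort` keys (chromosome as text, then start as a number) are one common way to produce the same ordering, and `a.bed` is a placeholder file:

```shell
# Placeholder unsorted, tab-separated input
printf 'chr2\t50\t80\nchr1\t150\t300\nchr1\t100\t200\n' > a.bed

# Sort with GRIT when it is installed
if command -v grit >/dev/null 2>&1; then
  grit sort -i a.bed > sorted_a.bed
fi

# Or use standard Unix sort: column 1 lexicographically, column 2 numerically
LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n a.bed > sorted_unix.bed
cat sorted_unix.bed
```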
Validation Flags
| Flag | Description | Memory Impact |
|---|---|---|
| `--assume-sorted` | Skip validation entirely | No change |
| `--allow-unsorted` | Load and re-sort in memory | O(n) |
| `-g, --genome <FILE>` | Validate genome chromosome order | No change |
--assume-sorted
Skip validation when you know files are pre-sorted:
# Skip validation for faster startup
# Useful in pipelines where files are guaranteed sorted
Warning: Using --assume-sorted with unsorted files produces incorrect results silently.
--allow-unsorted
For non-streaming commands (intersect, subtract, closest), explicitly allow unsorted input:
# Load and re-sort in memory (uses O(n) memory)
# Without this flag, unsorted input fails with a clear error
This flag is not available for streaming commands, which require pre-sorted input.
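To illustrate, the sketch below builds a deliberately unsorted file, flags the offending line with a small awk pre-check, and then passes the file to `intersect` with `--allow-unsorted`. The awk check is an illustration of the sort rule only, not GRIT's validator, and the file names are placeholders:

```shell
# Placeholder inputs: unsorted.bed has a start that goes backwards on chr1
printf 'chr1\t200\t300\nchr1\t100\t150\n' > unsorted.bed
printf 'chr1\t120\t250\n' > b.bed

# Report the first record whose start is smaller than the previous start
# on the same chromosome
awk -F '\t' '$1 == chrom && $2 + 0 < prev { print "unsorted at line " NR; exit }
             { chrom = $1; prev = $2 + 0 }' unsorted.bed > check.txt
cat check.txt

if command -v grit >/dev/null 2>&1; then
  # Load and re-sort in memory (O(n) memory)
  grit intersect -a unsorted.bed -b b.bed --allow-unsorted
fi
```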
Genome Order Validation
Use -g, --genome to validate that chromosomes appear in a specific order (e.g., hg38, mm10):
# Validate chromosome order against genome file
# Merge with genome order validation
# Sort files to match genome order
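A sketch of the three steps above, using the `-g` option as documented for `intersect`, `merge` and `sort`; all file names are placeholders:

```shell
# Placeholder inputs whose chromosomes follow the genome file order
printf 'chr1\t100\t200\nchr2\t50\t80\n' > a.bed
printf 'chr1\t150\t250\n' > b.bed
printf 'chr1\t248956422\nchr2\t242193529\n' > genome.txt

if command -v grit >/dev/null 2>&1; then
  grit intersect -a a.bed -b b.bed -g genome.txt         # validate chromosome order
  grit merge -i a.bed -g genome.txt                      # merge with order validation
  grit sort -i a.bed -g genome.txt > a.genome_order.bed  # re-sort to match genome order
fi
```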
Genome file format (tab-separated: chromosome name and size):
chr1 248956422
chr2 242193529
chr3 198295559
chrX 156040895
chrY 57227415
When -g is provided:
- Chromosomes must appear in the genome file order
- Chromosomes not in the genome file cause an error
- Error messages suggest how to fix:
grit sort -i file.bed -g genome.txt
Without -g: Any contiguous chromosome order is accepted (lexicographic, natural, etc.)
stdin Validation
When reading from stdin, GRIT buffers the input to validate sort order:
# stdin is validated by default (buffers entire input)
# Skip stdin validation with --assume-sorted (no buffering)
Note: stdin validation uses O(n) memory to buffer input. For large piped inputs where data is guaranteed sorted, use --assume-sorted to skip buffering.
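As a sketch of the two stdin cases, using `merge` because its `-i -` stdin form is documented above; `a.bed` is a placeholder file:

```shell
# Placeholder pre-sorted input
printf 'chr1\t100\t200\nchr1\t150\t300\n' > a.bed

if command -v grit >/dev/null 2>&1; then
  # Validated by default: the whole stream is buffered first
  cat a.bed | grit merge -i -
  # No buffering: only safe when the stream is known to be sorted
  cat a.bed | grit merge -i - --assume-sorted
fi
```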
Validation Summary by Command
| Command | Requires Sorted | `--allow-unsorted` | `-g, --genome` |
|---|---|---|---|
| `intersect` | Yes (streaming) / Validates (default) | Yes | Yes |
| `subtract` | Yes (streaming) / Validates (default) | Yes | Yes |
| `closest` | Yes (streaming) / Validates (default) | Yes | Yes |
| `merge` | Yes | No (use `--in-memory`) | Yes |
| `window` | Yes | No | Yes |
| `coverage` | Yes | No | Yes |
| `sort` | No | N/A | Yes (for ordering) |
| `slop` | No | N/A | No |
| `complement` | Yes | No | No |
Streaming Mode
For very large files, streaming mode processes data with constant O(k) memory, where k is the maximum number of overlapping intervals at any position (typically < 100).
When to Use Streaming
- Files larger than available RAM
- Processing 50GB+ files on laptops
- Memory-constrained environments
- When files are already sorted
Streaming Commands
Commands that support --streaming mode:
# Intersect
# Subtract
# Closest
# Window (always uses streaming internally)
# Coverage (always uses streaming internally)
# Merge (streaming by default)
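One invocation per command listed above might look like this sketch. Only `intersect`, `subtract` and `closest` take `--streaming`; the remaining commands are shown without it, as described in the comments above. File names are placeholders:

```shell
# Placeholder pre-sorted inputs
printf 'chr1\t100\t200\nchr1\t150\t300\n' > a.bed
printf 'chr1\t180\t250\n' > b.bed

if command -v grit >/dev/null 2>&1; then
  grit intersect -a a.bed -b b.bed --streaming
  grit subtract -a a.bed -b b.bed --streaming
  grit closest -a a.bed -b b.bed --streaming
  grit window -a a.bed -b b.bed -w 1000   # streams internally
  grit coverage -a a.bed -b b.bed         # streams internally
  grit merge -i a.bed                     # streaming by default
fi
```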
Memory Comparison
| Mode | Memory Usage | Best For |
|---|---|---|
| Default (parallel) | O(n + m) | Maximum speed |
| Streaming | O(k) ≈ 2 MB | Large files, low RAM |
Zero-Length Interval Semantics
GRIT uses strict half-open interval semantics by default, which differs from bedtools in handling zero-length intervals.
What Are Zero-Length Intervals?
Zero-length intervals have start == end, such as:
chr1 100 100
These represent point positions (e.g., SNP locations from VCF-to-BED conversion) rather than regions.
Default Behavior (Strict Mode)
In strict half-open semantics, a zero-length interval [100, 100) contains no bases:
- It does not overlap with itself
- It does not overlap with adjacent intervals like `[100, 101)`
This follows the mathematical definition of half-open intervals.
Bedtools Behavior
Bedtools treats zero-length intervals as if they were 1bp intervals:
- `[100, 100)` overlaps with `[100, 101)`
- Self-intersection of zero-length intervals produces output
Enabling Bedtools Compatibility
Use --bedtools-compatible to match bedtools behavior:
# Default: strict semantics (zero-length intervals don't overlap)
# Bedtools-compatible: zero-length intervals normalized to 1bp
When enabled, zero-length intervals are normalized to 1bp during parsing:
chr1 100 100 → chr1 100 101
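The normalization rule can be reproduced outside GRIT for illustration. The awk command below is a sketch of the same transformation, not GRIT's parser. `points.bed` is a placeholder file, and placing `--bedtools-compatible` before the subcommand is an assumption:

```shell
# Placeholder input: one zero-length interval and one ordinary interval
printf 'chr1\t100\t100\nchr1\t200\t250\n' > points.bed

# Widen every zero-length record (start == end) to 1bp; leave the rest untouched
awk 'BEGIN { FS = OFS = "\t" } $2 == $3 { $3 = $2 + 1 } { print }' points.bed > points_1bp.bed
cat points_1bp.bed

if command -v grit >/dev/null 2>&1; then
  # Default: strict semantics, zero-length intervals do not overlap
  grit intersect -a points.bed -b points.bed
  # Bedtools-compatible: zero-length intervals normalized to 1bp
  grit --bedtools-compatible intersect -a points.bed -b points.bed
fi
```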
When to Use Each Mode
| Mode | Use Case |
|---|---|
| Strict (default) | Mathematical correctness, new projects |
| Bedtools-compatible | Reproducing bedtools results, dbSNP data |
Performance Impact
The --bedtools-compatible flag has negligible performance impact (<1%). Normalization occurs once during parsing, not in inner loops.
Performance
Benchmarks
Tested on 10M × 5M intervals (uniform distribution):
| Command | bedtools | GRIT | Speedup | Memory Reduction |
|---|---|---|---|---|
| window | 32.18s | 2.10s | 15.3x | 137x less |
| merge | 3.68s | 0.34s | 10.8x | ~same |
| coverage | 16.53s | 1.84s | 9.0x | 134x less |
| subtract | 9.49s | 1.47s | 6.5x | 19x less |
| closest | 9.70s | 1.95s | 5.0x | 59x less |
| intersect | 6.77s | 1.54s | 4.4x | 19x less |
| jaccard | 4.98s | 1.59s | 3.1x | 1230x less |
See full benchmark methodology for details.
Performance Tips
- Use streaming for large files - Constant memory, often faster
- Pre-sort your files - Use `--assume-sorted` to skip validation
- Adjust thread count - Default uses all CPUs, tune with `-t`
- Use merge first - Reduce interval count before expensive operations
Testing
Quick Start
# Run all tests
# Build release for testing
Unit Tests
# All unit tests (290+ tests)
# Specific module tests
# With verbose output
Integration Tests
# Fast sort integration tests (requires bedtools installed)
Test Sorted Input Validation
GRIT validates that input files are sorted for streaming operations:
# Create test files
# This should succeed (sorted input)
# This should fail with error (unsorted input)
# Error: File A is not sorted...
# Skip validation with --assume-sorted (faster for pre-sorted files)
Test Commands Individually
# Intersect
# Subtract
# Merge
# Closest
# Window
# Coverage
Verify Against bedtools
# Compare intersect output
# Compare sort output
# Compare merge output
# Compare subtract output
# SHA256 parity check (for large files)
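A parity check of this kind can be sketched as follows. `same_sha256` is a hypothetical helper introduced here, not part of GRIT; it assumes `sha256sum` from GNU coreutils, and the comparison runs only when both `grit` and `bedtools` are installed:

```shell
# Hypothetical helper: succeeds when two files have identical SHA-256 digests
same_sha256() {
  [ "$(sha256sum < "$1")" = "$(sha256sum < "$2")" ]
}

# Placeholder pre-sorted inputs
printf 'chr1\t100\t200\nchr1\t150\t300\n' > a.bed
printf 'chr1\t180\t250\n' > b.bed

if command -v grit >/dev/null 2>&1 && command -v bedtools >/dev/null 2>&1; then
  grit intersect -a a.bed -b b.bed > grit_out.bed
  bedtools intersect -a a.bed -b b.bed > bedtools_out.bed
  # Line-by-line comparison for small outputs
  diff grit_out.bed bedtools_out.bed && echo "intersect: identical"
  # Digest comparison for large outputs
  same_sha256 grit_out.bed bedtools_out.bed && echo "intersect: same digest"
fi
```

The same pattern applies to `sort`, `merge` and `subtract` by swapping the subcommand and its arguments on both sides.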
Run Benchmarks
# Run all benchmarks
# Specific benchmarks
# Run benchmark script (if available)
Performance Testing
# Generate large test files
for; do
done
# Sort the test file
# Time streaming vs parallel mode
# Memory usage (on Linux)
Test Coverage
# Install cargo-tarpaulin for coverage
# Run coverage report
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit changes (`git commit -m 'feat: add new feature'`)
- Push to branch (`git push origin feature/new-feature`)
- Open a Pull Request
License
MIT License - see LICENSE for details.