What Makes biometal Different?
Stream data directly from networks and analyze terabyte-scale datasets on consumer hardware without downloading.
- Constant ~5 MB memory regardless of dataset size (99.5% reduction)
- 16-25× speedup using ARM NEON SIMD on Apple Silicon
- Network streaming from HTTP/HTTPS sources (no download needed)
- Evidence-based design (1,357 experiments, 40,710 measurements)
🎉 NEW in v1.10.0: Extended Format Support
biometal now supports 12+ bioinformatics file formats with production-ready streaming parsers:
Sequences & Reads:
- FASTQ/FASTA: Read sequences (with per-base quality scores for FASTQ)
- BAM/SAM: Binary and text alignment formats, with BAI indexing for BAM
Annotations & Features:
- BED/narrowPeak: Genomic intervals and ChIP-seq peaks (ENCODE)
- GFF3: Hierarchical gene features (genes, mRNAs, exons, CDS)
- GTF: Gene annotations for RNA-seq (GENCODE, Ensembl) [NEW]
Variants & Alignments:
- VCF: Genetic variants (SNPs, indels, structural variants)
- PAF: minimap2 pairwise alignments (long-read analysis) [NEW]
Graphs & Assembly:
- GFA: Assembly graphs (pangenomes, read overlap graphs)
Indices:
- FAI: FASTA index for O(1) sequence lookup
- TBI: Tabix index for O(log n) region queries
All formats support:
- ✅ Streaming architecture (constant ~5 MB memory)
- ✅ Automatic gzip decompression (.gz files)
- ✅ Python bindings with optimized memory usage
- ✅ Real-world validation (ENCODE, UCSC, Ensembl, 1000 Genomes)
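To illustrate why a FASTA index allows O(1) sequence lookup, here is a minimal sketch in plain Python. It does not use biometal's API; `fai_byte_offset` is a hypothetical helper, and the column meanings (OFFSET, LINEBASES, LINEWIDTH) follow the `samtools faidx` index layout. The toy FASTA record is invented for the example.

```python
def fai_byte_offset(offset: int, linebases: int, linewidth: int, pos: int) -> int:
    """Return the file byte position of 0-based base `pos` within one sequence.

    offset    -- byte offset of the sequence's first base (FAI column 3)
    linebases -- number of bases on each full line (FAI column 4)
    linewidth -- bytes per line, including the line terminator (FAI column 5)
    """
    return offset + (pos // linebases) * linewidth + (pos % linebases)


# Toy FASTA (invented): the header ">chr1\n" occupies 6 bytes,
# lines hold 4 bases and end with a single "\n" (linewidth = 5).
fasta = b">chr1\nACGT\nTTGA\nCC\n"
for pos, expected in [(0, "A"), (3, "T"), (4, "T"), (7, "A"), (9, "C")]:
    byte_at = fai_byte_offset(offset=6, linebases=4, linewidth=5, pos=pos)
    assert chr(fasta[byte_at]) == expected
```

Because the offset is pure arithmetic, a reader can seek straight to any base without scanning the preceding sequence.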
Quick Start
Installation
Rust:
[dependencies]
biometal = "1.10"
Python:
Note: The package is named biometal-rs on PyPI, but it is imported as biometal in Python.
Basic Usage
Rust:
use FastqStream;
// Stream FASTQ with constant memory (~5 MB)
let stream = from_path?;
for record in stream
Python:
# Stream FASTQ with constant memory (~5 MB)
=
# ARM NEON accelerated (16-25× speedup)
=
=
=
📚 Documentation
Start Here
- 📘 User Guide - Comprehensive guide: installation, core concepts, common workflows, troubleshooting, and migration from pysam/samtools (NEW - v1.6.0)
In-Depth Resources
- 📖 DeepWiki AI Docs - AI-assisted documentation with Q&A
- 📓 Interactive Tutorials - Jupyter notebooks with real workflows
- 🦀 API Reference - Full Rust documentation
- 🐍 Python Guide - Python-specific documentation
- 🧬 BAM API Reference - Complete BAM/SAM parser API (v1.4.0)
- ⚡ BAM Performance Guide - Benchmarks and optimization (v1.4.0)
- 📐 Architecture - Technical design details
- ❓ FAQ - Frequently asked questions
📓 Interactive Tutorials
Learn biometal through hands-on Jupyter notebooks (6 notebooks, about 3-4 hours in total):
| Notebook | Duration | Topics |
|---|---|---|
| 01. Getting Started | 15-20 min | Streaming, GC content, quality analysis |
| 02. Quality Control | 30-40 min | Trimming, filtering, masking (v1.2.0) |
| 03. K-mer Analysis | 30-40 min | ML preprocessing, DNABert (v1.1.0) |
| 04. Network Streaming | 30-40 min | HTTP streaming, public data (v1.0.0) |
| 05. BAM Alignment Analysis | 30-40 min | BAM parsing, 4× speedup, filtering (v1.2.0+) |
| 06. BAM Production Workflows | 45-60 min | Tag parsing, QC statistics, production pipelines (v1.4.0) |
🚀 Key Features
Streaming Architecture
- Constant ~5 MB memory regardless of dataset size
- Analyze 5TB datasets on laptops without downloading
- 99.5% memory reduction vs. traditional approaches
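The idea behind constant-memory streaming can be sketched with a plain Python generator that holds only one record at a time. This is an illustration of the pattern, not biometal's implementation: `stream_fastq` and `gc_fraction` are hypothetical names, and the parser assumes simple four-line FASTQ records.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (name, sequence, quality)


def stream_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield one FASTQ record at a time (assumes 4 lines per record)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def gc_fraction(records: Iterable[Record]) -> float:
    """Running GC fraction; memory use does not grow with the input."""
    gc = total = 0
    for _, seq, _ in records:
        gc += sum(seq.count(base) for base in "GCgc")
        total += len(seq)
    return gc / total if total else 0.0


demo = io.StringIO("@r1\nGGCC\n+\nIIII\n@r2\nATAT\n+\nIIII\n")
print(gc_fraction(stream_fastq(demo)))  # 0.5
```

For a compressed file, the same generator can be fed with `gzip.open(path, "rt")` from the standard library.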
ARM-Native Performance
- 16-25× speedup using ARM NEON SIMD
- Optimized for Apple Silicon (M1/M2/M3/M4)
- Automatic scalar fallback on x86_64
Network Streaming
- Stream directly from HTTP/HTTPS (no download)
- Smart LRU caching + background prefetching
- Access public data (ENA, S3, GCS, Azure)
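As a rough illustration of LRU caching for remote blocks, the sketch below keeps the most recently used blocks in memory and evicts the oldest one when full. It is an assumption-laden toy, not biometal's cache: `LruBlockCache` and its `fetch` callback are hypothetical, and a real `fetch` would typically issue an HTTP Range request.

```python
from collections import OrderedDict
from typing import Callable


class LruBlockCache:
    """Hypothetical block cache: keeps at most `capacity` blocks in memory."""

    def __init__(self, fetch: Callable[[int], bytes], capacity: int = 4):
        self._fetch = fetch          # called on a miss, e.g. a ranged download
        self._capacity = capacity
        self._blocks: "OrderedDict[int, bytes]" = OrderedDict()
        self.misses = 0

    def get(self, block_id: int) -> bytes:
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)  # mark as most recently used
            return self._blocks[block_id]
        self.misses += 1
        data = self._fetch(block_id)
        self._blocks[block_id] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)    # evict least recently used
        return data


# Stand-in fetch function that fabricates 4 bytes per block.
cache = LruBlockCache(fetch=lambda i: bytes([i]) * 4, capacity=2)
for block in (0, 1, 0, 2, 1):
    cache.get(block)
print(cache.misses)  # 4: block 1 was evicted when block 2 arrived
```

Memory stays bounded by `capacity` times the block size, regardless of how much of the remote file is read.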
Operations Library
- Core operations: GC content, base counting, quality scores
- K-mer operations: Extraction, minimizers, spectrum (v1.1.0)
- QC operations: Trimming, filtering, masking (v1.2.0)
- BAM/SAM parser: Production-ready with 8.4× speedup via parallel BGZF + NEON + cloudflare_zlib
  - 5.82 million records/sec throughput
  - 92.0 MiB/s compressed file processing (+67% from cloudflare_zlib in v1.7.0)
  - Constant ~5 MB memory (streams terabyte-scale alignments)
  - Python bindings (v1.3.0): CIGAR operations, SAM writing, alignment metrics
  - Production polish (v1.4.0): Tag convenience methods, statistics functions
    - 6 tag accessors: edit_distance(), alignment_score(), read_group(), etc.
    - 4 statistics functions: insert_size_distribution(), edit_distance_stats(), strand_bias(), alignment_length_distribution()
  - NEON optimization (v1.5.0): ARM SIMD sequence decoding (4.62× faster)
  - BAI index (v1.6.0): Indexed region queries with 1.68-500× speedup
    - O(log n) random access to BAM files
    - Near-zero overhead (<1ms index loading)
    - Speedup scales with file size (10-500× for 1-10 GB files)
- Format Library (v1.8.0): Production-ready parsers for genomic annotation and assembly formats
  - BED (Browser Extensible Data): Genomic intervals with streaming architecture
    - BED3/6/12 format support
    - 0-based half-open coordinate system
    - Constant memory (~5 MB) for terabyte-scale peak files
  - GFA (Graphical Fragment Assembly): Assembly graph format
    - Segment, Link, Path record types
    - Graph connectivity validation
    - Streaming architecture for large assembly graphs
  - VCF (Variant Call Format): Genetic variant data
    - VCF 4.2 specification compliance
    - Header parsing with sample/contig/INFO extraction
    - SNP/indel classification
    - Multi-allelic variant support
  - GFF3 (General Feature Format): Hierarchical gene annotations
    - 1-based inclusive coordinate system
    - Parent-child relationship tracking (gene → mRNA → exon/CDS)
    - Attribute parsing with convenience methods
    - Coordinate conversion to BED (0-based)
  - Testing: 23 property-based tests + 6 real-world integration tests
  - Python bindings: Full streaming API for all formats
- 60+ Python functions for bioinformatics workflows
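The two coordinate systems mentioned above (BED is 0-based half-open, GFF3 is 1-based inclusive) differ only in the start value. The conversion can be sketched as follows; the function names are hypothetical and are not part of biometal's API.

```python
from typing import Tuple


def gff3_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based inclusive [start, end] to 0-based half-open [start, end)."""
    if start < 1 or end < start:
        raise ValueError(f"invalid GFF3 interval: {start}-{end}")
    return start - 1, end


def bed_to_gff3(start: int, end: int) -> Tuple[int, int]:
    """Convert 0-based half-open [start, end) to 1-based inclusive [start, end]."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid BED interval: {start}-{end}")
    return start + 1, end


# The first 100 bases of a chromosome in each convention.
print(gff3_to_bed(1, 100))  # (0, 100)
print(bed_to_gff3(0, 100))  # (1, 100)
```

In the half-open form the feature length is simply `end - start`, whereas the inclusive form needs `end - start + 1`.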
Performance Highlights
| Operation | Scalar | Optimized | Speedup |
|---|---|---|---|
| Base counting | 315 Kseq/s | 5,254 Kseq/s | 16.7× (NEON) |
| GC content | 294 Kseq/s | 5,954 Kseq/s | 20.3× (NEON) |
| Quality filter | 245 Kseq/s | 6,143 Kseq/s | 25.1× (NEON) |
| BAM parsing | ~11 MiB/s | 92.0 MiB/s | 8.4× (BGZF + NEON + cloudflare_zlib v1.7.0) |
| Dataset Size | Traditional | biometal | Reduction |
|---|---|---|---|
| 100K sequences | 134 MB | 5 MB | 96.3% |
| 1M sequences | 1,344 MB | 5 MB | 99.5% |
| 5TB dataset | 5,000 GB | 5 MB | 99.9999% |
📊 Comprehensive Benchmark Comparison vs samtools/pysam →
Platform Support
| Platform | Performance | Tests | Status |
|---|---|---|---|
| Mac ARM (M1-M4) | 16-25× speedup | ✅ 551/551 | Optimized |
| AWS Graviton | 6-10× speedup | ✅ 551/551 | Portable |
| Linux x86_64 | 1× (scalar) | ✅ 551/551 | Portable |
Test count: 551 library tests (including 65 new tests for GTF, PAF, narrowPeak) + 23 property-based tests
Evidence-Based Design
biometal's design is grounded in comprehensive experimental validation:
- 1,357 experiments (40,710 measurements, N=30)
- Statistical rigor (95% CI, Cohen's d effect sizes)
- Full methodology: apple-silicon-bio-bench
- 6 optimization rules documented in OPTIMIZATION_RULES.md
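For readers unfamiliar with the statistics named above, the following sketch shows how a 95% confidence interval and Cohen's d can be computed with the Python standard library. It is a generic illustration, not the project's analysis code; the function names are hypothetical, and the default critical value 2.045 is the two-sided t value for 29 degrees of freedom, so it applies only to samples of N=30.

```python
import math
import statistics
from typing import Sequence, Tuple


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Effect size: difference in means divided by the pooled sample SD."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def mean_ci95(xs: Sequence[float], t_crit: float = 2.045) -> Tuple[float, float]:
    """95% CI for the mean; t_crit=2.045 assumes N=30 (29 degrees of freedom)."""
    mean = statistics.mean(xs)
    half_width = t_crit * statistics.stdev(xs) / math.sqrt(len(xs))
    return mean - half_width, mean + half_width


# Invented toy samples: means 4 and 3, pooled SD 2.
print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```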
Roadmap
- v1.0.0 (Released Nov 5, 2025) ✅ - Core library + network streaming
- v1.1.0 (Released Nov 6, 2025) ✅ - K-mer operations
- v1.2.0 (Released Nov 6, 2025) ✅ - Python bindings for Phase 4 QC
- BAM/SAM (Integrated Nov 8, 2025) ✅ - Native streaming alignment parser with parallel BGZF (4× speedup)
- v1.3.0 (Released Nov 9, 2025) ✅ - Python BAM bindings with CIGAR operations and SAM writing
- v1.4.0 (Released Nov 9, 2025) ✅ - BAM tag convenience methods and statistics functions
- v1.5.0 (Released Nov 9, 2025) ✅ - ARM NEON sequence decoding (+27.5% BAM parsing speedup)
- v1.6.0 (Released Nov 10, 2025) ✅ - BAI index support (indexed region queries, 1.68-500× speedup)
- v1.7.0 (Released Nov 13, 2025) ✅ - cloudflare_zlib backend (1.67× decompression, 2.29× compression speedups)
- v1.8.0 (Released Nov 13, 2025) ✅ - Format library (BED, GFA, VCF, GFF3) with property-based testing
  - 4 production-ready format parsers with streaming architecture
  - 23 property-based tests + 6 real-world integration tests
  - Tested against ENCODE, UCSC, Ensembl, 1000 Genomes data
  - Full Python bindings for all formats
Next (Planned):
- CSI index support (for references >512 Mbp)
- Extended tag parsing (full type support)
- Additional alignment statistics
- Community feedback & benchmarking
Future (Community Driven):
- Extended operations (alignment, assembly)
- Additional formats (BCF, CRAM)
- Metal GPU acceleration (Mac-specific)
See CHANGELOG.md for detailed release notes.
Mission: Democratizing Bioinformatics
biometal addresses barriers that lock researchers out of genomics:
- Economic: Consumer ARM laptops ($1,400) deliver production performance
- Environmental: ARM efficiency reduces carbon footprint
- Portability: Works across ARM ecosystem (Mac, Graviton, Ampere, RPi)
- Data Access: Analyze 5TB datasets on 24GB laptops without downloading
Example Use Cases
Quality Control Pipeline
=
# Trim low-quality ends
=
# Length filter
# Mask remaining low-quality bases
=
# Check masking rate
= /
# Pass QC - process further
pass
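The steps named in the comments above (trim low-quality ends, length filter, mask, check masking rate) can be sketched in plain Python as follows. This is not biometal's API: every function name and threshold here is hypothetical, and qualities are assumed to be Phred+33 encoded.

```python
from typing import List, Optional, Tuple


def phred(qual: str) -> List[int]:
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(c) - 33 for c in qual]


def trim_ends(seq: str, qual: str, min_q: int = 20) -> Tuple[str, str]:
    """Remove bases below min_q from both ends of the read."""
    scores = phred(qual)
    start, end = 0, len(seq)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


def qc_record(seq: str, qual: str, min_q: int = 20, min_len: int = 4,
              max_mask_rate: float = 0.25) -> Optional[str]:
    """Return the masked read if it passes QC, otherwise None."""
    seq, qual = trim_ends(seq, qual, min_q)
    if len(seq) < min_len:                       # length filter
        return None
    scores = phred(qual)
    low = sum(q < min_q for q in scores)
    if low / len(seq) > max_mask_rate:           # too many bases would be masked
        return None
    return "".join(b if q >= min_q else "N" for b, q in zip(seq, scores))


# '#' is Phred 2 and 'I' is Phred 40 in Phred+33 encoding.
print(qc_record("AACGTACGTA", "#IIII#IIII"))  # ACGTNCGTA
```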
K-mer Extraction for ML
# Extract k-mers for DNABert preprocessing
=
# Extract overlapping k-mers (k=6 typical for DNABert)
=
# Format for transformer models
=
# Feed to DNABert - constant memory!
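A library-independent sketch of the same idea: overlapping k-mers are produced lazily and joined with spaces, which is the token format DNABERT-style models expect. The helper names are hypothetical and this is not biometal's API; k=6 follows the comment above.

```python
from typing import Iterator


def kmers(seq: str, k: int = 6) -> Iterator[str]:
    """Yield overlapping k-mers (stride 1); yields nothing if len(seq) < k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def to_kmer_sentence(seq: str, k: int = 6) -> str:
    """Space-separated k-mers, suitable as tokenizer input."""
    return " ".join(kmers(seq.upper(), k))


print(to_kmer_sentence("acgtacgt"))  # ACGTAC CGTACG GTACGT
```

A sequence of length n yields n - k + 1 k-mers, so an 8-base read gives three 6-mers.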
Network Streaming
# Stream from HTTP without downloading
# Works with ENA, S3, GCS, Azure public data
=
=
# Analyze directly - no download needed!
# Memory: constant ~5 MB
=
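To show how a compressed remote file can be processed without first saving it, here is a standard-library sketch that decompresses a gzip byte stream chunk by chunk. It is not biometal's implementation: `iter_gzip_lines` is a hypothetical helper, and it handles a single-member gzip stream only (multi-member files such as BGZF would need extra handling). Any object with a `read(n)` method works as the source, for example the response returned by `urllib.request.urlopen`.

```python
import gzip
import io
import zlib
from typing import BinaryIO, Iterator


def iter_gzip_lines(source: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[str]:
    """Yield text lines from a gzip stream, holding only one chunk in memory."""
    # Adding 16 to the window bits tells zlib to expect a gzip header.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    pending = b""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        pending += decomp.decompress(chunk)
        *lines, pending = pending.split(b"\n")  # keep the unfinished tail
        for line in lines:
            yield line.decode()
    pending += decomp.flush()
    if pending:
        yield pending.decode()


# In-memory stand-in for a network response.
payload = gzip.compress(b"@r1\nACGT\n+\nIIII\n")
print(list(iter_gzip_lines(io.BytesIO(payload), chunk_size=5)))
# ['@r1', 'ACGT', '+', 'IIII']
```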
BAM Alignment Analysis (v1.4.0)
# Stream BAM file with constant memory (~5 MB)
=
# Access alignment details
# NEW v1.4.0: Tag convenience methods
= # NM tag
= # AS tag
= # RG tag
# CIGAR operations (v1.3.0)
# NEW v1.4.0: Built-in statistics functions
# Insert size distribution (paired-end QC)
=
# Edit distance statistics (alignment quality)
=
# Strand bias (variant calling QC)
=
# Alignment length distribution (RNA-seq QC)
=
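As a rough, library-independent picture of what single-pass alignment statistics involve, the sketch below summarizes records given as (flag, tlen, nm) tuples. The tuple layout, the `summarize` name, and the exact definitions are assumptions for illustration and may differ from biometal's built-in functions; the flag bits come from the SAM specification.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

PROPER_PAIR = 0x2   # SAM flag: each segment properly aligned
REVERSE = 0x10      # SAM flag: read mapped to the reverse strand


def summarize(records: Iterable[Tuple[int, int, Optional[int]]]) -> dict:
    """One pass over (flag, tlen, nm) tuples; nm is the NM tag or None."""
    forward = reverse = 0
    insert_sizes: Counter = Counter()
    nm_sum = nm_count = 0
    for flag, tlen, nm in records:
        if flag & REVERSE:
            reverse += 1
        else:
            forward += 1
        # Positive TLEN belongs to the leftmost mate, so each pair counts once.
        if flag & PROPER_PAIR and tlen > 0:
            insert_sizes[tlen] += 1
        if nm is not None:
            nm_sum += nm
            nm_count += 1
    total = forward + reverse
    return {
        "forward_fraction": forward / total if total else 0.0,
        "insert_sizes": insert_sizes,
        "mean_edit_distance": nm_sum / nm_count if nm_count else None,
    }


# Invented records: a proper pair (flags 99 and 147) plus one unpaired read.
stats = summarize([(99, 300, 0), (147, -300, 2), (0, 0, 1)])
print(stats["mean_edit_distance"])  # 1.0
```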
BAI Indexed Region Queries (v1.6.0)
# Load BAI index for fast random access
=
# Query specific genomic region (1.68× faster than full scan for small files)
# Speedup increases dramatically with file size (10-500× for 1-10 GB files)
# Only reads overlapping the region are returned
# Reuse index for multiple queries (index loading: <1ms overhead)
=
=
# Full workflow: Coverage calculation for specific region
=
# Calculate coverage from CIGAR
=
+= 1
+= 1
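The coverage step can be sketched without any library: walk the CIGAR string and increment a per-base counter for operations that align a read base to the reference. This is an illustrative assumption, not biometal's code. `add_coverage` is a hypothetical helper, positions are 0-based, the region is half-open, and deletions are deliberately not counted as coverage (a convention that tools differ on).

```python
import re
from typing import List

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # operations that advance the reference position
COVERING = set("M=X")          # operations that place a read base on the reference


def add_coverage(coverage: List[int], pos: int, cigar: str,
                 region_start: int, region_end: int) -> None:
    """Add one alignment to `coverage`, clipped to [region_start, region_end)."""
    ref = pos
    for length_text, op in CIGAR_RE.findall(cigar):
        length = int(length_text)
        if op in COVERING:
            lo = max(ref, region_start)
            hi = min(ref + length, region_end)
            for p in range(lo, hi):
                coverage[p - region_start] += 1
        if op in REF_CONSUMING:
            ref += length


# Invented alignments over the region [100, 110).
depth = [0] * 10
add_coverage(depth, 98, "5M2D3M", 100, 110)
add_coverage(depth, 104, "2S4M", 100, 110)
print(depth)  # [1, 1, 1, 0, 1, 2, 2, 2, 0, 0]
```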
Performance Characteristics:
- Index loading: < 1ms (negligible overhead)
- Small region query (1 Kbp): ~11 ms vs 18 ms full scan (1.68× speedup)
- Speedup scales with file size:
- 1 MB file: 1.7× speedup
- 100 MB file: 10-20× speedup
- 1 GB file: 50-100× speedup
- 10 GB file: 200-500× speedup
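Part of the reason indexed queries skip most of the file is the BAI binning scheme, which assigns every alignment to a hierarchical bin. The function below is a Python transcription of the `reg2bin` routine given in the SAM/BAM specification; it is shown for background and is not taken from biometal's source.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest BAI bin that fully contains the 0-based half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return 4681 + (beg >> 14)   # 16 kbp bins
    if beg >> 17 == end >> 17:
        return 585 + (beg >> 17)    # 128 kbp bins
    if beg >> 20 == end >> 20:
        return 73 + (beg >> 20)     # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return 9 + (beg >> 23)      # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return 1 + (beg >> 26)      # 64 Mbp bins
    return 0                        # the single 512 Mbp root bin


print(reg2bin(0, 1))          # 4681
print(reg2bin(0, 16385))      # 585
print(reg2bin(0, 1 << 29))    # 0
```

The root bin spans 2^29 bases (512 Mbp), which is why references longer than that need a CSI index instead.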
Format Library: BED/GFA/VCF/GFF3 (v1.8.0)
# BED: Parse genomic intervals (ChIP-seq peaks, gene annotations)
=
=
# GFA: Parse assembly graphs (genome assembly, pangenomes)
=
=
# VCF: Parse genetic variants (SNPs, indels)
=
= # Note: header() not parse_header()
# GFF3: Parse hierarchical gene annotations (genes, mRNAs, exons, CDS)
=
=
= # 1-based inclusive coordinates
=
# Note: interval() method not available in Python bindings
# Use feature.start and feature.end directly (1-based inclusive)
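To make the SNP/indel classification and multi-allelic handling concrete, here is a simplified, library-independent sketch that classifies each ALT allele of a VCF data line. The function names and category labels are hypothetical, the rules are a common simplification rather than biometal's exact logic, and the sample line is invented.

```python
from typing import List, Tuple


def classify_allele(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair using allele lengths (simplified rules)."""
    if alt in (".", "*"):
        return "other"                       # missing or spanning-deletion allele
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic"                    # e.g. <DEL> or breakend notation
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    return "indel"


def classify_record(line: str) -> List[Tuple[str, str]]:
    """Split a multi-allelic VCF data line and classify every ALT allele."""
    fields = line.rstrip("\n").split("\t")
    ref, alts = fields[3], fields[4]
    return [(alt, classify_allele(ref, alt)) for alt in alts.split(",")]


print(classify_record("chr1\t100\t.\tA\tG,AT\t50\tPASS\t."))
# [('G', 'SNP'), ('AT', 'indel')]
```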
Format Library Features:
- Streaming architecture: Constant ~5 MB memory for all formats
- Production-ready: Tested against real ENCODE, UCSC, Ensembl, 1000 Genomes data
- Property-based testing: 23 tests validating format invariants (round-trip parsing, coordinate systems, specification compliance)
- Real-world validation: 6 integration tests with production files (61,547 GFF3 features, 1,000 UCSC genes, 10 VCF variants)
- Python bindings: Full streaming API with Pythonic interfaces
FAQ
Q: Why biometal-rs on PyPI but biometal everywhere else?
A: The biometal name was taken on PyPI, so we use biometal-rs for installation. You still import as import biometal.
Q: What platforms are supported?
A: Mac ARM (optimized), Linux ARM/x86_64 (portable). Pre-built wheels are available for common platforms. See docs/CROSS_PLATFORM_TESTING.md.
Q: Why ARM-native?
A: To democratize bioinformatics by enabling world-class performance on consumer hardware ($1,400 MacBooks vs. $50,000 servers).
More questions? See FAQ.md
Contributing
We welcome contributions! See CLAUDE.md for development guidelines.
biometal is built on evidence-based optimization - new features should:
- Have clear use cases
- Be validated experimentally (when adding optimizations)
- Maintain platform portability
- Follow OPTIMIZATION_RULES.md
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.
Citation
If you use biometal in your research:
For the experimental methodology: