biometal 1.6.0

ARM-native bioinformatics library with streaming architecture and evidence-based optimization
Documentation

What Makes biometal Different?

Stream data directly from networks and analyze terabyte-scale datasets on consumer hardware without downloading.

  • Constant ~5 MB memory regardless of dataset size (99.5% reduction)
  • 16-25× speedup using ARM NEON SIMD on Apple Silicon
  • Network streaming from HTTP/HTTPS sources (no download needed)
  • Evidence-based design (1,357 experiments, 40,710 measurements)

Quick Start

Installation

Rust:

[dependencies]
biometal = "1.5"

Python:

pip install biometal-rs  # Install
python -c "import biometal; print(biometal.__version__)"  # Test

Note: The package is published as biometal-rs on PyPI, but is imported as biometal in Python.

Basic Usage

Rust:

use biometal::FastqStream;

// Stream FASTQ with constant memory (~5 MB)
let stream = FastqStream::from_path("dataset.fq.gz")?;

for record in stream {
    let record = record?;
    // Process one record at a time
}

Python:

import biometal

# Stream FASTQ with constant memory (~5 MB)
stream = biometal.FastqStream.from_path("dataset.fq.gz")

for record in stream:
    # ARM NEON accelerated (16-25× speedup)
    gc = biometal.gc_content(record.sequence)
    counts = biometal.count_bases(record.sequence)
    mean_q = biometal.mean_quality(record.quality)

📚 Documentation

Start Here

  • 📘 User Guide - Comprehensive guide: installation, core concepts, common workflows, troubleshooting, and migration from pysam/samtools (NEW - v1.6.0)

In-Depth Resources


📓 Interactive Tutorials

Learn biometal through hands-on Jupyter notebooks (6 notebooks, ~3-4 hours total):

Notebook Duration Topics
01. Getting Started 15-20 min Streaming, GC content, quality analysis
02. Quality Control 30-40 min Trimming, filtering, masking (v1.2.0)
03. K-mer Analysis 30-40 min ML preprocessing, DNABert (v1.1.0)
04. Network Streaming 30-40 min HTTP streaming, public data (v1.0.0)
05. BAM Alignment Analysis 30-40 min BAM parsing, 4× speedup, filtering (v1.2.0+)
06. BAM Production Workflows 45-60 min Tag parsing, QC statistics, production pipelines (v1.4.0)

👉 Browse all tutorials →


🚀 Key Features

Streaming Architecture

  • Constant ~5 MB memory regardless of dataset size
  • Analyze 5TB datasets on laptops without downloading
  • 99.5% memory reduction vs. traditional approaches
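
Because records are processed one at a time and then discarded, resident memory stays flat no matter how large the input is. A minimal way to sanity-check this on your own machine (a sketch only; reads.fq.gz is a placeholder for any local FASTQ, and the resource module requires a Unix-like OS):

import resource
import biometal

stream = biometal.FastqStream.from_path("reads.fq.gz")  # placeholder input

count = 0
for record in stream:
    count += 1
    if count % 1_000_000 == 0:
        # ru_maxrss is reported in kilobytes on Linux and bytes on macOS
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"{count:,} records, peak RSS so far: {peak}")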

ARM-Native Performance

  • 16-25× speedup using ARM NEON SIMD
  • Optimized for Apple Silicon (M1/M2/M3/M4)
  • Automatic scalar fallback on x86_64
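
The SIMD path is selected automatically, so the same Python code runs on Apple Silicon (NEON) and on x86_64 (scalar fallback) without changes. A rough way to see the difference on your own hardware is a simple wall-clock measurement (a sketch; reads.fq.gz is a placeholder file, and throughput will vary with read length and platform):

import time
import biometal

stream = biometal.FastqStream.from_path("reads.fq.gz")  # placeholder input

start = time.perf_counter()
count = 0
for record in stream:
    biometal.gc_content(record.sequence)  # NEON-accelerated on ARM, scalar elsewhere
    count += 1
elapsed = time.perf_counter() - start

print(f"{count} records in {elapsed:.2f} s ({count / elapsed / 1000:.0f} Kseq/s)")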

Network Streaming

  • Stream directly from HTTP/HTTPS (no download)
  • Smart LRU caching + background prefetching
  • Access public data (ENA, S3, GCS, Azure)
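
Because the stream is pulled on demand, you can also stop early and inspect just the beginning of a remote dataset without ever holding the full file locally. A small sketch (the URL is a placeholder; any publicly accessible HTTPS FASTQ should work):

import biometal

url = "https://example.com/dataset.fq.gz"  # placeholder URL
stream = biometal.FastqStream.from_path(url)

# Look at only the first 10,000 reads, then stop
total_gc = 0.0
n = 0
for record in stream:
    total_gc += biometal.gc_content(record.sequence)
    n += 1
    if n >= 10_000:
        break

print(f"Mean GC over first {n} reads: {total_gc / n:.3f}")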

Operations Library

  • Core operations: GC content, base counting, quality scores
  • K-mer operations: Extraction, minimizers, spectrum (v1.1.0)
  • QC operations: Trimming, filtering, masking (v1.2.0)
  • BAM/SAM parser: Production-ready with 5× speedup via parallel BGZF + NEON
    • 5.82 million records/sec throughput
    • 55.1 MiB/s compressed file processing (+27.5% from NEON in v1.5.0)
    • Constant ~5 MB memory (streams terabyte-scale alignments)
    • Python bindings (v1.3.0): CIGAR operations, SAM writing, alignment metrics
    • Production polish (v1.4.0): Tag convenience methods, statistics functions
      • 6 tag accessors: edit_distance(), alignment_score(), read_group(), etc.
      • 4 statistics functions: insert_size_distribution(), edit_distance_stats(), strand_bias(), alignment_length_distribution()
    • NEON optimization (v1.5.0): ARM SIMD sequence decoding (4.62× faster)
    • BAI index (v1.6.0): Indexed region queries with 1.68-500× speedup
      • O(log n) random access to BAM files
      • Near-zero overhead (<1ms index loading)
      • Speedup scales with file size (10-500× for 1-10 GB files)
  • 50+ Python functions for bioinformatics workflows
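
Because the Python bindings expose these operations as plain module-level functions and classes, standard introspection is an easy way to see exactly what your installed version provides:

import biometal

# List the public names exported by the installed version
public_api = sorted(name for name in dir(biometal) if not name.startswith("_"))
print(f"biometal {biometal.__version__}: {len(public_api)} public names")
print("\n".join(public_api))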

Performance Highlights

Throughput:

Operation        Scalar       Optimized     Speedup
Base counting    315 Kseq/s   5,254 Kseq/s  16.7× (NEON)
GC content       294 Kseq/s   5,954 Kseq/s  20.3× (NEON)
Quality filter   245 Kseq/s   6,143 Kseq/s  25.1× (NEON)
BAM parsing      ~11 MiB/s    55.1 MiB/s    5.0× (BGZF + NEON, v1.5.0)

Memory usage:

Dataset Size     Traditional  biometal  Reduction
100K sequences   134 MB       5 MB      96.3%
1M sequences     1,344 MB     5 MB      99.5%
5TB dataset      5,000 GB     5 MB      99.9999%

📊 Comprehensive Benchmark Comparison vs samtools/pysam →


Platform Support

Platform Performance Tests Status
Mac ARM (M1-M4) 16-25× speedup ✅ 461/461 Optimized
AWS Graviton 6-10× speedup ✅ 461/461 Portable
Linux x86_64 1× (scalar) ✅ 461/461 Portable

Test count includes 354 core + 81 BAM + 26 BAI Python tests


Evidence-Based Design

biometal's design is grounded in comprehensive experimental validation: 1,357 experiments and 40,710 measurements from the Apple Silicon Bio Bench project (see Citation below).


Roadmap

  • v1.0.0 (Released Nov 5, 2025) ✅ - Core library + network streaming
  • v1.1.0 (Released Nov 6, 2025) ✅ - K-mer operations
  • v1.2.0 (Released Nov 6, 2025) ✅ - Python bindings for Phase 4 QC
  • BAM/SAM (Integrated Nov 8, 2025) ✅ - Native streaming alignment parser with parallel BGZF (4× speedup)
  • v1.3.0 (Released Nov 9, 2025) ✅ - Python BAM bindings with CIGAR operations and SAM writing
  • v1.4.0 (Released Nov 9, 2025) ✅ - BAM tag convenience methods and statistics functions
  • v1.5.0 (Released Nov 9, 2025) ✅ - ARM NEON sequence decoding (+27.5% BAM parsing speedup)
  • v1.6.0 (Released Nov 10, 2025) ✅ - BAI index support (indexed region queries, 1.68-500× speedup)

Next (Planned):

  • CSI index support (for references >512 Mbp)
  • Extended tag parsing (full type support)
  • Additional alignment statistics
  • Community feedback & benchmarking

Future (Community Driven):

  • Extended operations (alignment, assembly)
  • Additional formats (VCF, BCF, CRAM)
  • Metal GPU acceleration (Mac-specific)

See CHANGELOG.md for detailed release notes.


Mission: Democratizing Bioinformatics

biometal addresses barriers that lock researchers out of genomics:

  1. Economic: Consumer ARM laptops ($1,400) deliver production performance
  2. Environmental: ARM efficiency reduces carbon footprint
  3. Portability: Works across ARM ecosystem (Mac, Graviton, Ampere, RPi)
  4. Data Access: Analyze 5TB datasets on 24GB laptops without downloading

Example Use Cases

Quality Control Pipeline

import biometal

stream = biometal.FastqStream.from_path("raw_reads.fq.gz")

for record in stream:
    # Trim low-quality ends
    trimmed = biometal.trim_quality_window(record, min_quality=20, window_size=4)

    # Length filter
    if biometal.meets_length_requirement(trimmed, min_len=50, max_len=150):
        # Mask remaining low-quality bases
        masked = biometal.mask_low_quality(trimmed, min_quality=20)

        # Check masking rate
        mask_rate = biometal.count_masked_bases(masked) / len(masked.sequence)
        if mask_rate < 0.1:
            # Pass QC - process further
            pass

K-mer Extraction for ML

import biometal

# Extract k-mers for DNABert preprocessing
stream = biometal.FastqStream.from_path("dataset.fq.gz")

for record in stream:
    # Extract overlapping k-mers (k=6 typical for DNABert)
    kmers = biometal.extract_kmers(record.sequence, k=6)

    # Format for transformer models
    kmer_string = " ".join(kmer.decode() for kmer in kmers)

    # Feed to DNABert - constant memory!
    # (`model` is a placeholder for your own DNABert wrapper; it is not part of biometal)
    model.process(kmer_string)

Network Streaming

import biometal

# Stream from HTTP without downloading
# Works with ENA, S3, GCS, Azure public data
url = "https://example.com/dataset.fq.gz"
stream = biometal.FastqStream.from_path(url)

for record in stream:
    # Analyze directly - no download needed!
    # Memory: constant ~5 MB
    gc = biometal.gc_content(record.sequence)

BAM Alignment Analysis (v1.4.0)

import biometal

# Stream BAM file with constant memory (~5 MB)
reader = biometal.BamReader.from_path("alignments.bam")

for record in reader:
    # Access alignment details
    print(f"{record.name}: MAPQ={record.mapq}, pos={record.position}")

    # NEW v1.4.0: Tag convenience methods
    edit_dist = record.edit_distance()  # NM tag
    align_score = record.alignment_score()  # AS tag
    read_group = record.read_group()  # RG tag
    print(f"  Edit distance: {edit_dist}, Score: {align_score}, RG: {read_group}")

    # CIGAR operations (v1.3.0)
    for op in record.cigar:
        if op.is_insertion() and op.length >= 5:
            print(f"  Found {op.length}bp insertion")

# NEW v1.4.0: Built-in statistics functions
# Insert size distribution (paired-end QC)
dist = biometal.insert_size_distribution("alignments.bam")
print(f"Mean insert size: {sum(s*c for s,c in dist.items())/sum(dist.values()):.1f}bp")

# Edit distance statistics (alignment quality)
stats = biometal.edit_distance_stats("alignments.bam")
print(f"Mean edit distance: {stats['mean']:.2f} mismatches/read")

# Strand bias (variant calling QC)
bias = biometal.strand_bias("alignments.bam", reference_id=0, position=1000)
print(f"Strand bias at chr1:1000: {bias['ratio']:.2f}:1")

# Alignment length distribution (RNA-seq QC)
lengths = biometal.alignment_length_distribution("alignments.bam")
print(f"Intron-spanning reads: {sum(c for l,c in lengths.items() if l > 1000)}")

BAI Indexed Region Queries (v1.6.0)

import biometal

# Load BAI index for fast random access
index = biometal.BaiIndex.from_path("alignments.bam.bai")

# Query specific genomic region (1.68× faster than full scan for small files)
# Speedup increases dramatically with file size (10-500× for 1-10 GB files)
for record in biometal.BamReader.query_region(
    "alignments.bam",
    index,
    "chr1",
    1000000,  # start position
    2000000   # end position
):
    # Only reads overlapping region are returned
    if record.is_mapped and record.mapq >= 30:
        print(f"{record.name}: {record.position}-{record.reference_end()}")

# Reuse index for multiple queries (index loading: <1ms overhead)
regions = [
    ("chr1", 1000000, 2000000),
    ("chr1", 5000000, 6000000),
    ("chr2", 100000, 200000),
]

for chrom, start, end in regions:
    count = sum(1 for _ in biometal.BamReader.query_region(
        "alignments.bam", index, chrom, start, end
    ))
    print(f"{chrom}:{start}-{end}: {count} reads")

# Full workflow: Coverage calculation for specific region
from collections import defaultdict

coverage = defaultdict(int)
for record in biometal.BamReader.query_region(
    "alignments.bam", index, "chr1", 1000, 2000
):
    if record.is_mapped and record.position is not None:
        # Calculate coverage from CIGAR
        pos = record.position
        for op in record.cigar:
            if op.consumes_reference():
                for i in range(op.length):
                    coverage[pos] += 1
                    pos += 1

print(f"Mean coverage: {sum(coverage.values())/len(coverage):.1f}×")

Performance Characteristics:

  • Index loading: < 1ms (negligible overhead)
  • Small region query (1 Kbp): ~11 ms vs 18 ms full scan (1.68× speedup)
  • Speedup scales with file size:
    • 1 MB file: 1.7× speedup
    • 100 MB file: 10-20× speedup
    • 1 GB file: 50-100× speedup
    • 10 GB file: 200-500× speedup
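
These figures are hardware- and file-dependent, so it is worth timing the comparison on your own data. A minimal sketch using only the APIs shown above (alignments.bam, its .bai file, and the region are placeholders):

import time
import biometal

# Baseline: a full streaming pass over the whole file
start = time.perf_counter()
total = sum(1 for _ in biometal.BamReader.from_path("alignments.bam"))
full_time = time.perf_counter() - start

# Indexed query: jump directly to the region of interest via the BAI index
index = biometal.BaiIndex.from_path("alignments.bam.bai")
start = time.perf_counter()
in_region = sum(1 for _ in biometal.BamReader.query_region(
    "alignments.bam", index, "chr1", 1_000_000, 2_000_000
))
query_time = time.perf_counter() - start

print(f"Full scan:     {total} records in {full_time:.3f} s")
print(f"Indexed query: {in_region} records in {query_time:.3f} s "
      f"({full_time / query_time:.1f}x faster)")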

FAQ

Q: Why biometal-rs on PyPI but biometal everywhere else?
A: The biometal name was taken on PyPI, so we use biometal-rs for installation. You still import it as import biometal.

Q: What platforms are supported?
A: Mac ARM (optimized), Linux ARM/x86_64 (portable). Pre-built wheels are provided for common platforms. See docs/CROSS_PLATFORM_TESTING.md.

Q: Why ARM-native?
A: To democratize bioinformatics by enabling world-class performance on consumer hardware ($1,400 MacBooks vs. $50,000 servers).

More questions? See FAQ.md


Contributing

We welcome contributions! See CLAUDE.md for development guidelines.

biometal is built on evidence-based optimization - new features should:

  1. Have clear use cases
  2. Be validated experimentally (when adding optimizations)
  3. Maintain platform portability
  4. Follow OPTIMIZATION_RULES.md

License

Licensed under either of:

at your option.


Citation

If you use biometal in your research:

@software{biometal2025,
  author = {Handley, Scott},
  title = {biometal: ARM-native bioinformatics with streaming architecture},
  year = {2025},
  url = {https://github.com/shandley/biometal}
}

For the experimental methodology:

@misc{asbb2025,
  author = {Handley, Scott},
  title = {Apple Silicon Bio Bench: Systematic Hardware Characterization},
  year = {2025},
  url = {https://github.com/shandley/apple-silicon-bio-bench}
}