# convert_genome
convert_genome converts direct-to-consumer (DTC) genotype exports (23andMe, AncestryDNA, etc.) into standard VCF or BCF files. The converter understands remote references, streams compressed archives, and now includes high-performance, parallel processing suitable for multi-million record datasets.
## Features
- Parallel conversion pipeline powered by `rayon`, scaling across all cores.
- Thread-safe reference genome access with a shared LRU cache for rapid base lookups.
- Remote reference support for `http://` and `https://` URLs with transparent decompression of `.gz` and `.zip` archives.
- Robust parsing with property-based and integration tests covering malformed input, missing fields, and concurrent access.
- Benchmark suite (Criterion) to track performance of parsing, reference lookups, and pipeline throughput.
- Comprehensive CI/CD across Linux, macOS, and Windows with formatting, linting, testing, coverage, and benchmarks.
## Installation

The project targets Rust nightly (see `rust-toolchain.toml`). Install the converter directly from the repository:
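The repository's canonical URL is not stated in this README, so the location below is a placeholder; a typical `cargo install` invocation from git looks like:

```shell
# Hypothetical repository URL -- substitute the project's actual location.
cargo +nightly install --git https://github.com/<owner>/convert_genome --locked
```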
Alternatively, build the binary without installing:
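With the repository cloned, a standard release build produces the optimized binary:

```shell
cargo +nightly build --release
```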
The resulting executable lives at `target/release/convert_genome`.
## Usage
The CLI accepts both local files and remote resources. A minimal invocation converts a DTC file to VCF:
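The exact flag names are not documented in this excerpt, so the ones below are assumptions for illustration; a minimal run might look like:

```shell
# Flag names are hypothetical; check `convert_genome --help` for the real ones.
convert_genome \
  --input genome_export.txt \
  --reference GRCh37.fa \
  --output sample.vcf
```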
Generate a BCF with explicit assembly metadata and keep homozygous reference calls:
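Again with hypothetical flag names, such an invocation might be sketched as:

```shell
# All flags below are assumptions for illustration, not the documented CLI.
convert_genome \
  --input genome_export.txt \
  --reference GRCh38.fa \
  --assembly GRCh38 \
  --keep-hom-ref \
  --output-format bcf \
  --output sample.bcf
```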
If a `.fai` index is not provided, the converter will generate one next to the FASTA automatically.
## Performance
Reference lookups use a shared, thread-safe LRU cache sized for 128k entries, dramatically reducing random I/O. The conversion pipeline collects DTC records, sorts them for cache locality, and processes them in parallel; results are written sequentially to keep deterministic ordering.
The Criterion benchmarks can be executed with:
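Since the project uses Criterion, the standard invocation applies:

```shell
cargo +nightly bench
```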
Benchmarks cover:
- Cached vs. uncached reference lookups.
- DTC parsing throughput.
- Full conversion pipeline comparisons (parallel vs. single-threaded execution).
## Testing
Unit, integration, and property-based tests ensure correctness across a wide surface area:
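The CI summary above mentions "debug + release/property" suites, which suggests the following split (an assumption, not a documented convention):

```shell
cargo test              # debug: unit and integration tests
cargo test --release    # release: property-based suites
```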
Ignored integration tests in `tests/remote_download.rs` exercise real-world genome downloads; run them manually as needed.
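Cargo's standard mechanism for ignored tests applies here (these presumably require network access):

```shell
cargo test --test remote_download -- --ignored
```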
## Continuous Integration
See `.github/workflows/ci.yml`. The workflow performs:
- Formatting (`cargo fmt --check`)
- Linting (`cargo clippy --all-targets -- -D warnings`)
- Cross-platform builds
- Test suites (debug + release/property)
- Benchmarks (`cargo bench --no-fail-fast`)
- Coverage reporting via `cargo tarpaulin` on Linux
## Project Architecture
- `src/cli.rs` – Argument parsing and top-level command dispatch.
- `src/conversion.rs` – Conversion pipeline, header construction, and record translation.
- `src/dtc.rs` – Streaming parser for DTC genotype exports.
- `src/reference.rs` – Reference genome loader, contig metadata, and cached base access.
- `src/remote.rs` – Remote fetching with HTTP(S) support and archive extraction.
Additional resources:
- `tests/` – Integration and property-based test suites.
- `benches/` – Criterion benchmarks for core subsystems.
## Contributing
- Install the nightly toolchain (`rustup toolchain install nightly`).
- Run formatting and linting before submitting: `cargo fmt` and `cargo clippy --all-targets -- -D warnings`.
- Execute the full test suite (debug + release) and benchmarks.
- For large datasets or new reference assemblies, add integration tests with representative fixtures.
Issues and pull requests are welcome! Please include benchmark results when proposing performance-sensitive changes.