bitnuc
A library for efficient nucleotide sequence manipulation using 2-bit encoding.
Features
- 2-bit nucleotide encoding (A=00, C=01, G=10, T=11)
- Direct bit manipulation functions for custom implementations
- Higher-level sequence type with additional analysis features
Low-Level Packing Functions
For direct bit manipulation, use the as_2bit and from_2bit functions:
use ;
These functions are useful when you need to:
- Implement custom sequence storage
- Manipulate sequences at the bit level
- Integrate with other bioinformatics tools
- Copy sequences more efficiently
- Hash sequences more efficiently
For example, packing multiple short sequences:
use ;
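The idea behind packing several short sequences can be sketched without the crate itself. The block below is a minimal stand-alone illustration, not bitnuc's implementation: `pack_2bit`, `unpack_2bit`, and `pack_many` are hypothetical names, and placing the first base in the lowest two bits is an assumption made for this example.

```rust
/// Hypothetical stand-in for a 2-bit packer (not the bitnuc implementation).
/// Packs up to 32 bases into a u64 using A=00, C=01, G=10, T=11.
/// Assumption: the first base occupies the lowest two bits.
fn pack_2bit(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None; // a u64 holds at most 32 two-bit codes
    }
    let mut packed = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        let code: u64 = match base {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => return None, // anything else cannot be encoded in 2 bits
        };
        packed |= code << (2 * i);
    }
    Some(packed)
}

/// Reverses `pack_2bit`. The length must be supplied by the caller.
fn unpack_2bit(packed: u64, len: usize) -> Option<Vec<u8>> {
    if len > 32 {
        return None;
    }
    Some(
        (0..len)
            .map(|i| b"ACGT"[((packed >> (2 * i)) & 0b11) as usize])
            .collect(),
    )
}

/// Packs each short sequence into its own u64; fails if any one fails.
fn pack_many(seqs: &[&[u8]]) -> Option<Vec<u64>> {
    seqs.iter().map(|s| pack_2bit(s)).collect()
}
```

Because A encodes as 00, a packed value alone cannot distinguish trailing A bases from padding, so the original length has to be stored alongside it.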
High-Level Sequence Type
For more complex sequence manipulation, use the [PackedSequence] type:
use ;
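To show what a higher-level type can add on top of raw packing, here is a rough sketch of a sequence backed by a vector of 64-bit words. It is not the crate's [PackedSequence]: the `PackedSeq` type, its methods, and the choice of GC content as the example analysis are all assumptions made for illustration.

```rust
/// Hypothetical packed sequence of arbitrary length (not bitnuc's PackedSequence).
/// Each u64 word stores 32 bases, lowest bits first.
struct PackedSeq {
    words: Vec<u64>,
    len: usize,
}

impl PackedSeq {
    /// Packs an ASCII sequence; returns None on any base other than A, C, G, T.
    fn new(seq: &[u8]) -> Option<Self> {
        let mut words = vec![0u64; (seq.len() + 31) / 32];
        for (i, &base) in seq.iter().enumerate() {
            let code: u64 = match base {
                b'A' => 0b00,
                b'C' => 0b01,
                b'G' => 0b10,
                b'T' => 0b11,
                _ => return None,
            };
            words[i / 32] |= code << (2 * (i % 32));
        }
        Some(Self { words, len: seq.len() })
    }

    /// Number of bases stored.
    fn len(&self) -> usize {
        self.len
    }

    /// Returns the base at position `i`, or None if out of range.
    fn get(&self, i: usize) -> Option<u8> {
        if i >= self.len {
            return None;
        }
        let code = (self.words[i / 32] >> (2 * (i % 32))) & 0b11;
        Some(b"ACGT"[code as usize])
    }

    /// Fraction of bases that are G or C (0.0 for an empty sequence).
    fn gc_content(&self) -> f64 {
        if self.len == 0 {
            return 0.0;
        }
        let gc = (0..self.len)
            .filter(|&i| matches!(self.get(i), Some(b'C') | Some(b'G')))
            .count();
        gc as f64 / self.len as f64
    }
}
```

Unlike a single packed u64, a type like this tracks its own length and is not limited to 32 bases.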
Memory Usage
The 2-bit encoding provides significant memory savings:
Standard encoding: 1 byte per base
ACGT = 4 bytes = 32 bits
2-bit encoding: 2 bits per base
ACGT = 8 bits
This means the same amount of memory holds four times as many bases.
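The arithmetic can be checked with a couple of helper functions. These are illustrative only (the names `ascii_bytes` and `packed_bytes` are made up for this sketch) and assume the packed form is rounded up to whole bytes.

```rust
/// Bytes needed to store `n_bases` at one ASCII byte per base.
fn ascii_bytes(n_bases: usize) -> usize {
    n_bases
}

/// Bytes needed at 2 bits per base, rounded up to a whole byte.
fn packed_bytes(n_bases: usize) -> usize {
    (n_bases + 3) / 4
}
```

For 1000 bases this gives 1000 bytes versus 250 bytes. When each sequence is packed into its own 64-bit word, the fixed 8-byte word size sets the cost instead, so very short sequences save proportionally less.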
Error Handling
All fallible operations return a [Result] with a [NucleotideError] as the error type:
use ;
// Invalid nucleotide
let err = as_2bit.unwrap_err;
assert!;
// Sequence too long
let long_seq = vec!;
let err = as_2bit.unwrap_err;
assert!;
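The two failure cases above (an invalid nucleotide and a sequence too long for one word) can be modeled with a small error enum. The sketch below is a stand-alone illustration, not the crate's [NucleotideError]: `PackError`, its variants, and `try_pack` are hypothetical names, and the 32-base limit assumes packing into a single u64.

```rust
use std::fmt;

/// Hypothetical error type for this sketch (not bitnuc's NucleotideError).
#[derive(Debug, PartialEq, Eq)]
enum PackError {
    /// A byte other than A, C, G, or T was encountered.
    InvalidBase(u8),
    /// The sequence has more bases than fit in one u64 (32).
    SequenceTooLong(usize),
}

impl fmt::Display for PackError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PackError::InvalidBase(b) => write!(f, "invalid base: {}", *b as char),
            PackError::SequenceTooLong(n) => {
                write!(f, "sequence of {} bases exceeds the 32-base limit", n)
            }
        }
    }
}

impl std::error::Error for PackError {}

/// Packs a sequence into a u64, reporting why it failed instead of panicking.
fn try_pack(seq: &[u8]) -> Result<u64, PackError> {
    if seq.len() > 32 {
        return Err(PackError::SequenceTooLong(seq.len()));
    }
    let mut packed = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        let code: u64 = match base {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            other => return Err(PackError::InvalidBase(other)),
        };
        packed |= code << (2 * i);
    }
    Ok(packed)
}
```

Carrying the offending byte or length in the variant lets callers match on the exact cause rather than parse a message.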
Performance Considerations
When working with many short sequences (like k-mers), using as_2bit and from_2bit
directly can be more efficient than creating [PackedSequence] instances:
use ;
use HashMap;
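As a rough illustration of why packed integers suit k-mer workloads, the sketch below counts k-mers using u64 keys in a HashMap, which avoids allocating a byte vector per key. It does not use the crate: `pack_kmer` and `count_kmers` are hypothetical helpers, and skipping windows that contain non-ACGT bytes is a choice made for this example.

```rust
use std::collections::HashMap;

/// Hypothetical packer for this sketch: up to 32 bases into a u64,
/// A=00, C=01, G=10, T=11, first base in the lowest bits.
fn pack_kmer(kmer: &[u8]) -> Option<u64> {
    if kmer.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for (i, &base) in kmer.iter().enumerate() {
        let code: u64 = match base {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => return None,
        };
        packed |= code << (2 * i);
    }
    Some(packed)
}

/// Counts every k-mer in `seq`, keyed by its packed u64 form.
/// Windows containing a non-ACGT byte are skipped.
fn count_kmers(seq: &[u8], k: usize) -> HashMap<u64, usize> {
    let mut counts = HashMap::new();
    if k == 0 || k > 32 {
        return counts; // `windows(0)` would panic, and k > 32 cannot be packed
    }
    for window in seq.windows(k) {
        if let Some(packed) = pack_kmer(window) {
            *counts.entry(packed).or_insert(0) += 1;
        }
    }
    counts
}
```

Hashing and comparing a u64 key is a fixed-size operation, whereas a byte-string key costs time proportional to k.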
See the documentation for [as_2bit] and [from_2bit] for more details on
working with packed sequences directly.
SIMD Acceleration
as_2bit is optionally SIMD-accelerated, depending on your system's architecture.
SIMD instructions are used by default, but they can be disabled with the nosimd feature flag.
To take full advantage of SIMD, I recommend compiling with:
RUSTFLAGS="-C target-cpu=native"
or adding these flags to your project via the Cargo build config:
# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]
On my machine, the throughput increase ranges from 10% to 30%, depending on sequence size.