# bitnuc
[License](./LICENSE.md) | [crates.io](https://crates.io/crates/bitnuc) | [docs.rs](https://docs.rs/bitnuc/latest/bitnuc/)

A library for efficient nucleotide sequence manipulation using 2-bit encoding.
## Features
- 2-bit nucleotide encoding (A=00, C=01, G=10, T=11)
- Direct bit manipulation functions for custom implementations
- Higher-level sequence type with additional analysis features
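
To make the encoding concrete, here is a minimal, standard-library-only sketch of how four bases can be packed into a `u64`. The helpers `encode_base`, `pack_2bit`, and `unpack_2bit` are hypothetical names used only for illustration and are not part of the bitnuc API. The layout (first base in the lowest two bits) is an assumption chosen to match the `0b11100100` value that this README shows for `ACGT`.

```rust
/// Hypothetical helper (not part of bitnuc): map one base to its 2-bit code.
fn encode_base(base: u8) -> Option<u64> {
    match base {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None, // Anything else (e.g. N) has no 2-bit code
    }
}

/// Hypothetical helper: pack up to 32 bases into a u64.
/// Assumed layout: the first base occupies the lowest two bits.
fn pack_2bit(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None; // A u64 only has room for 32 two-bit codes
    }
    let mut packed = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        packed |= encode_base(base)? << (2 * i);
    }
    Some(packed)
}

/// Hypothetical helper: unpack the first `len` bases (len <= 32) from a u64.
fn unpack_2bit(packed: u64, len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| b"ACGT"[((packed >> (2 * i)) & 0b11) as usize])
        .collect()
}

fn main() {
    let packed = pack_2bit(b"ACGT").expect("only A, C, G, T are valid");
    // T (11), G (10), C (01), A (00), read from the highest bits down
    assert_eq!(packed, 0b11100100);
    assert_eq!(unpack_2bit(packed, 4), b"ACGT");
}
```

The crate's own functions, described in the following sections, additionally report errors and can use SIMD; this sketch only shows the bit layout.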
## Low-Level Packing Functions
For direct bit manipulation, use the `as_2bit` and `from_2bit` functions:
```rust
use bitnuc::{as_2bit, from_2bit};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Pack a sequence into a u64
    let packed = as_2bit(b"ACGT")?;
    assert_eq!(packed, 0b11100100);

    // Unpack back to a sequence
    let mut unpacked = Vec::new(); // Allocate a reusable buffer
    from_2bit(packed, 4, &mut unpacked)?;
    assert_eq!(&unpacked, b"ACGT");
    unpacked.clear(); // Reuse the buffer

    Ok(())
}
```
These functions are useful when you need to:
- Implement custom sequence storage
- Manipulate sequences at the bit level
- Integrate with other bioinformatics tools
- Copy sequences more efficiently
- Hash sequences more efficiently
For example, packing multiple short sequences:
```rust
use bitnuc::{as_2bit, from_2bit};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Pack multiple 4-mers into u64s
    let kmers = [b"ACGT", b"TGCA", b"GGCC"];
    let packed: Vec<u64> = kmers
        .into_iter()
        .map(|kmer| as_2bit(kmer))
        .collect::<Result<_, _>>()?;

    // Unpack when needed
    let mut kmers = Vec::new();
    from_2bit(packed[0], 4, &mut kmers)?;
    assert_eq!(&kmers, b"ACGT");

    Ok(())
}
```
## High-Level Sequence Type
For more complex sequence manipulation, use the [`PackedSequence`] type:
```rust
use bitnuc::{PackedSequence, GCContent, BaseCount};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let seq = PackedSequence::new(b"ACGTACGT")?;

    // Sequence analysis
    println!("GC Content: {}%", seq.gc_content());
    let [a_count, c_count, g_count, t_count] = seq.base_counts();

    // Slicing
    let subseq = seq.slice(1..5)?;
    assert_eq!(&subseq, b"CGTA");

    Ok(())
}
```
## Memory Usage
The 2-bit encoding provides significant memory savings:
```text
Standard encoding: 1 byte per base
ACGT = 4 bytes = 32 bits
2-bit encoding: 2 bits per base
ACGT = 8 bits
```
This means you can store up to 4 times as many bases in the same amount of memory.
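
As a rough worked example of this arithmetic, the sketch below computes the storage needed for a sequence of a given length. The helper names `packed_bytes` and `packed_words` are hypothetical and not part of bitnuc, and the 150-base length is an arbitrary illustration. It assumes packed data is stored either in whole bytes or in whole `u64` words, so the last unit may be partly unused.

```rust
/// Hypothetical helper: bytes needed at 2 bits per base (4 bases per byte), rounded up.
fn packed_bytes(n_bases: usize) -> usize {
    (n_bases + 3) / 4
}

/// Hypothetical helper: u64 words needed at 32 bases per word, rounded up.
fn packed_words(n_bases: usize) -> usize {
    (n_bases + 31) / 32
}

fn main() {
    // The ACGT example above: 4 bytes unpacked, 1 byte packed
    assert_eq!(packed_bytes(4), 1);

    // An illustrative 150-base sequence
    let n = 150;
    println!("unpacked:        {} bytes", n);
    println!("packed (bytes):  {} bytes", packed_bytes(n));
    println!("packed (u64s):   {} bytes", packed_words(n) * 8);
    assert_eq!(packed_bytes(n), 38);
    assert_eq!(packed_words(n) * 8, 40);
}
```

Because of this rounding, the saving for a single sequence is slightly below 4x unless its length is a multiple of the storage unit.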
## Error Handling
All operations that could fail return a [`Result`] with [`NucleotideError`]:
```rust
use bitnuc::{as_2bit, NucleotideError};
// Invalid nucleotide
let err = as_2bit(b"ACGN").unwrap_err();
assert!(matches!(err, NucleotideError::InvalidBase(b'N')));
// Sequence too long
let long_seq = vec![b'A'; 33];
let err = as_2bit(&long_seq).unwrap_err();
assert!(matches!(err, NucleotideError::SequenceTooLong(33)));
```
## Performance Considerations
When working with many short sequences (like k-mers), using `as_2bit` and `from_2bit`
directly can be more efficient than creating [`PackedSequence`] instances:
```rust
use bitnuc::{as_2bit, from_2bit};
use std::collections::HashMap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Efficient k-mer counting
    let mut kmer_counts = HashMap::new();

    // Pack k-mers directly into u64s
    let sequence = b"ACGTACGT";
    for window in sequence.windows(4) {
        let packed = as_2bit(window)?;
        *kmer_counts.entry(packed).or_insert(0) += 1;
    }

    // Count of "ACGT"
    let acgt_packed = as_2bit(b"ACGT")?;
    assert_eq!(kmer_counts.get(&acgt_packed), Some(&2));

    Ok(())
}
```
See the documentation for [`as_2bit`] and [`from_2bit`] for more details on
working with packed sequences directly.
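
The counting loop above packs every window from scratch. Since consecutive windows share all but one base, a packed k-mer can also be updated incrementally. The following is a minimal, standard-library-only sketch of that idea. The names `encode_base` and `count_kmers` are hypothetical and not part of bitnuc, and the layout (first base in the lowest two bits) is an assumption based on the `0b11100100` value shown earlier for `ACGT`. Whether this is faster than calling `as_2bit` per window is not something this README measures.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of bitnuc): map one base to its 2-bit code.
fn encode_base(base: u8) -> Option<u64> {
    match base {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Hypothetical helper: count packed k-mers (1 <= k <= 32) with a rolling update.
/// Returns None if k is out of range or the sequence contains an invalid base.
fn count_kmers(seq: &[u8], k: usize) -> Option<HashMap<u64, usize>> {
    if k == 0 || k > 32 {
        return None;
    }
    let mut counts = HashMap::new();
    let mut packed = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        // Shift out the oldest base (lowest two bits) and place the
        // newest base in the highest slot of the k-mer.
        packed = (packed >> 2) | (encode_base(base)? << (2 * (k - 1)));
        // Only count once a full window of k bases has been seen.
        if i + 1 >= k {
            *counts.entry(packed).or_insert(0) += 1;
        }
    }
    Some(counts)
}

fn main() {
    let counts = count_kmers(b"ACGTACGT", 4).expect("valid sequence and k");
    // "ACGT" packs to 0b11100100 under the assumed layout and occurs twice
    assert_eq!(counts.get(&0b11100100), Some(&2));
}
```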
## SIMD Acceleration
`as_2bit` is SIMD-accelerated where the architecture of your system supports it.
SIMD instructions are used by default, but they can be turned off with the `nosimd` feature flag.
To take full advantage of SIMD, I recommend compiling with:
```bash
RUSTFLAGS="-C target-cpu=native" cargo build --release
```
Alternatively, add these flags to your project via the Cargo build config:
```toml
# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]
```
On my machine, this increases throughput by 10% to 30%, depending on sequence size.