pub struct RefgetStore { /* private fields */ }
Global store for cross-collection sequence management. It holds a global sequence_store containing all sequences across collections, so that sequences are deduplicated. This allows lookup directly by sequence digest, bypassing collection information. The RefgetStore also holds a collections hashmap to provide lookup by collection and sequence name.
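The two lookup paths described above can be pictured with a small two-level index. This is a minimal sketch, not the crate's private layout: TwoLevelIndex, its fields, and its methods are hypothetical names, and raw byte vectors stand in for real sequence records.

```rust
use std::collections::HashMap;

/// Hypothetical digest type; the signatures on this page use `[u8; 32]` keys.
type Digest = [u8; 32];

/// Illustrative two-level index, not the crate's real private fields.
#[derive(Default)]
struct TwoLevelIndex {
    /// Every sequence is stored once, keyed by its own digest.
    sequence_store: HashMap<Digest, Vec<u8>>,
    /// collection digest -> (sequence name -> sequence digest)
    collections: HashMap<Digest, HashMap<String, Digest>>,
}

impl TwoLevelIndex {
    /// Register `name` in `collection`; the bytes are stored only if the digest is new.
    fn add(&mut self, collection: Digest, name: &str, seq_digest: Digest, data: &[u8]) {
        self.sequence_store
            .entry(seq_digest)
            .or_insert_with(|| data.to_vec());
        self.collections
            .entry(collection)
            .or_default()
            .insert(name.to_string(), seq_digest);
    }

    /// Direct lookup by sequence digest, bypassing collections.
    fn by_digest(&self, seq_digest: &Digest) -> Option<&[u8]> {
        self.sequence_store.get(seq_digest).map(|v| v.as_slice())
    }

    /// Lookup by collection digest plus sequence name.
    fn by_name(&self, collection: &Digest, name: &str) -> Option<&[u8]> {
        let seq_digest = self.collections.get(collection)?.get(name)?;
        self.by_digest(seq_digest)
    }
}
```

Two collections that contain the same sequence share a single entry in sequence_store, which is the deduplication the description refers to.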
Implementations
impl RefgetStore
pub fn set_quiet(&mut self, quiet: bool)
Set whether to suppress progress output.
When quiet is true, operations like add_sequence_collection_from_fasta will not print progress messages.
§Arguments
- quiet - Whether to suppress progress output
pub fn on_disk<P: AsRef<Path>>(cache_path: P) -> Result<Self>
Create a disk-backed RefgetStore
Sequences are written to disk immediately and loaded on-demand (lazy loading). Only metadata is kept in memory.
§Arguments
- cache_path - Directory for storing sequences and metadata
§Returns
Result with a configured disk-backed store
§Example
let mut store = RefgetStore::on_disk("/data/store")?;
store.add_sequence_collection_from_fasta("genome.fa")?;
pub fn in_memory() -> Self
Create an in-memory RefgetStore
All sequences are kept in RAM for fast access. The store defaults to Encoded storage mode (2-bit packing for space efficiency). Use set_encoding_mode() to change the storage mode after creation.
§Example
let mut store = RefgetStore::in_memory();
store.add_sequence_collection_from_fasta("genome.fa")?;
pub fn set_encoding_mode(&mut self, new_mode: StorageMode)
Change the storage mode, re-encoding/decoding existing sequences as needed.
When switching from Raw to Encoded:
- All Full sequences in memory are encoded (2-bit packed)
When switching from Encoded to Raw:
- All Full sequences in memory are decoded back to raw bytes
Note: Stub sequences (lazy-loaded from disk) are not affected. They will be loaded in the NEW mode when accessed.
§Arguments
- new_mode - The storage mode to switch to
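To make "2-bit packed" concrete, here is a minimal sketch of packing and unpacking unambiguous DNA. It is an illustration under assumptions, not the crate's encoder: the function names, the A/C/G/T code ordering, and the high-bits-first layout are all hypothetical.

```rust
/// Map an unambiguous DNA base to a 2-bit code (assumed ordering A, C, G, T).
fn base_to_code(base: u8) -> Option<u8> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack four bases per byte, first base in the two highest bits.
/// Returns None if the input contains a symbol outside A, C, G, T.
fn pack_2bit(seq: &[u8]) -> Option<Vec<u8>> {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    for (i, &base) in seq.iter().enumerate() {
        let shift = 6 - 2 * (i % 4);
        packed[i / 4] |= base_to_code(base)? << shift;
    }
    Some(packed)
}

/// Unpack `len` bases. The length must be tracked separately because
/// the last byte may contain padding bits.
fn unpack_2bit(packed: &[u8], len: usize) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    (0..len)
        .map(|i| {
            let shift = 6 - 2 * (i % 4);
            BASES[((packed[i / 4] >> shift) & 0b11) as usize]
        })
        .collect()
}
```

Switching from Raw to Encoded corresponds to a pack step over each in-memory sequence, and switching back corresponds to the unpack step.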
pub fn enable_encoding(&mut self)
Enable 2-bit encoding for space efficiency. Re-encodes any existing Raw sequences in memory.
pub fn disable_encoding(&mut self)
Disable encoding, use raw byte storage. Decodes any existing Encoded sequences in memory.
pub fn disable_persistence(&mut self)
Disable disk persistence for this store.
New sequences will be kept in memory only. Existing Stub sequences can still be loaded from disk if local_path is set.
pub fn is_persisting(&self) -> bool
Check if persistence to disk is enabled.
pub fn add_sequence<T: Into<Option<[u8; 32]>>>(
    &mut self,
    sequence_record: SequenceRecord,
    collection_digest: T,
    force: bool,
) -> Result<()>
Adds a sequence to the store and ensures that it is added to the appropriate collection. If no collection is specified, the sequence is added to the default collection.
§Arguments
- sequence_record - The sequence to add
- collection_digest - Collection to add to (or None for default)
- force - If true, overwrite existing sequences. If false, skip duplicates.
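The skip-versus-overwrite behaviour of the force flag can be sketched on a plain map. This is only an illustration of the policy as documented; insert_with_policy and the String-keyed map are hypothetical and do not reflect how the store is implemented.

```rust
use std::collections::HashMap;

/// Insert `value` under `key` and report whether the map was changed.
/// With `force == false` an existing entry is left untouched (skip);
/// with `force == true` it is replaced (overwrite).
fn insert_with_policy(
    map: &mut HashMap<String, Vec<u8>>,
    key: &str,
    value: Vec<u8>,
    force: bool,
) -> bool {
    if !force && map.contains_key(key) {
        return false; // duplicate: keep the existing data
    }
    map.insert(key.to_string(), value);
    true
}
```

The same distinction separates add_sequence_collection() from add_sequence_collection_force() and their FASTA counterparts.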
pub fn add_sequence_collection(
    &mut self,
    collection: SequenceCollection,
) -> Result<()>
Adds a collection, and all sequences in it, to the store.
Skips collections and sequences that already exist.
Use add_sequence_collection_force() to overwrite existing data.
§Arguments
- collection - The sequence collection to add
pub fn add_sequence_collection_force(
    &mut self,
    collection: SequenceCollection,
) -> Result<()>
Adds a collection, and all sequences in it, to the store, overwriting existing data.
Forces overwrite of collections and sequences that already exist.
Use add_sequence_collection() to skip duplicates (safer default).
§Arguments
- collection - The sequence collection to add
pub fn add_sequence_collection_from_fasta<P: AsRef<Path>>(
    &mut self,
    file_path: P,
) -> Result<(SequenceCollectionMetadata, bool)>
Add a sequence collection from a FASTA file.
Skips sequences and collections that already exist in the store.
Use add_sequence_collection_from_fasta_force() to overwrite existing data.
§Arguments
- file_path - Path to the FASTA file
§Returns
A tuple of (SequenceCollectionMetadata, was_new) where was_new indicates whether the collection was newly added (true) or already existed (false).
§Notes
Loading sequence data requires 2 passes through the FASTA file:
- First pass digests and guesses the alphabet to produce SequenceMetadata
- Second pass encodes the sequences based on the detected alphabet
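The first pass can be sketched as a single scan that records each sequence's name and length and guesses an alphabet. This is a simplified illustration: Alphabet, FirstPass, and first_pass are hypothetical names, only two alphabets are distinguished, and digest computation is omitted because it would need a hashing crate.

```rust
use std::io::{BufRead, BufReader, Read};

/// Hypothetical alphabet guess: plain ACGT fits in 2 bits, anything else stays ASCII.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Alphabet {
    Dna2Bit,
    Ascii,
}

/// What a first pass might collect per FASTA record (digest left out here).
struct FirstPass {
    name: String,
    length: usize,
    alphabet: Alphabet,
}

/// Scan FASTA text once, tracking the length and guessed alphabet of each record.
fn first_pass<R: Read>(reader: R) -> std::io::Result<Vec<FirstPass>> {
    let mut records: Vec<FirstPass> = Vec::new();
    for line in BufReader::new(reader).lines() {
        let line = line?;
        let line = line.trim_end();
        if let Some(header) = line.strip_prefix('>') {
            // The record name is the first whitespace-delimited token of the header.
            let name = header.split_whitespace().next().unwrap_or("").to_string();
            records.push(FirstPass { name, length: 0, alphabet: Alphabet::Dna2Bit });
        } else if let Some(rec) = records.last_mut() {
            rec.length += line.len();
            let only_acgt = line
                .bytes()
                .all(|b| matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T'));
            if !only_acgt {
                rec.alphabet = Alphabet::Ascii;
            }
        }
    }
    Ok(records)
}
```

A second pass would then re-read the file and encode each sequence according to the alphabet recorded here, which is why the alphabet has to be known before encoding starts.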
pub fn add_sequence_collection_from_fasta_force<P: AsRef<Path>>(
    &mut self,
    file_path: P,
) -> Result<(SequenceCollectionMetadata, bool)>
Add a sequence collection from a FASTA file, overwriting existing data.
Forces overwrite of collections and sequences that already exist in the store.
Use add_sequence_collection_from_fasta() to skip duplicates (safer default).
§Arguments
- file_path - Path to the FASTA file
§Returns
A tuple of (SequenceCollectionMetadata, was_new) where was_new is always true since force mode always overwrites.
pub fn sequence_digests(&self) -> impl Iterator<Item = [u8; 32]> + '_
Returns an iterator over all sequence digests in the store
pub fn sequence_metadata(&self) -> impl Iterator<Item = &SequenceMetadata> + '_
Returns an iterator over sequence metadata for all sequences in the store.
This is a lightweight operation that returns only metadata (name, length, digests) without loading sequence data.
§Returns
An iterator over SequenceMetadata references.
§Example
for metadata in store.sequence_metadata() {
    println!("{}: {} bp", metadata.name, metadata.length);
}
pub fn total_disk_size(&self) -> usize
Calculate the total disk size of all sequences in the store
This computes the disk space used by sequence data based on:
- Sequence length
- Alphabet type (bits per symbol)
- Storage mode (Raw or Encoded)
§Returns
Total bytes used for sequence data on disk
§Note
This only accounts for sequence data files (.seq), not metadata files like RGSI files, rgstore.json, or directory overhead.
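The arithmetic implied by the three inputs above can be sketched as follows. The formula is an assumption (raw storage at one byte per symbol, encoded storage rounded up to whole bytes), and Mode, estimated_seq_bytes, and estimated_total are hypothetical names rather than the crate's API.

```rust
/// Hypothetical stand-in for the store's storage mode.
#[derive(Clone, Copy)]
enum Mode {
    Raw,
    Encoded,
}

/// Estimated bytes for one sequence: one byte per symbol when raw,
/// `bits_per_symbol` bits per symbol rounded up to whole bytes when encoded.
fn estimated_seq_bytes(length: usize, bits_per_symbol: usize, mode: Mode) -> usize {
    match mode {
        Mode::Raw => length,
        Mode::Encoded => (length * bits_per_symbol + 7) / 8,
    }
}

/// Sum the estimate over `(length, bits_per_symbol)` pairs.
fn estimated_total(seqs: &[(usize, usize)], mode: Mode) -> usize {
    seqs.iter()
        .map(|&(len, bits)| estimated_seq_bytes(len, bits, mode))
        .sum()
}
```

Under these assumptions a 10-symbol sequence at 2 bits per symbol needs 3 bytes when encoded and 10 bytes when raw.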
§Examples
let mut store = RefgetStore::on_disk("store")?;
store.add_sequence_collection_from_fasta("genome.fa")?;
let disk_size = store.total_disk_size();
println!("Sequences use {} bytes on disk", disk_size);
pub fn actual_disk_usage(&self) -> usize
Returns the actual disk usage of the store directory.
Walks the local_path directory (if set) and sums all file sizes. For in-memory stores without a local_path, returns 0.
This is useful for stats reporting to show actual disk consumption regardless of whether sequences are loaded in memory.
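Walking a directory and summing file sizes can be done with the standard library alone. The sketch below is a generic illustration of that technique, not the crate's code: dir_size is a hypothetical helper, and it returns u64 with an io::Result rather than the usize reported by actual_disk_usage().

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively sum the sizes of all regular files under `path`.
fn dir_size(path: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(path)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_dir() {
            // Descend into subdirectories.
            total += dir_size(&entry.path())?;
        } else {
            total += meta.len();
        }
    }
    Ok(total)
}
```

Unlike the estimate from total_disk_size(), a walk like this counts every file present, including index and metadata files.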
pub fn list_collections(&self) -> Vec<SequenceCollectionMetadata>
pub fn get_collection_metadata<K: AsRef<[u8]>>(
    &self,
    collection_digest: K,
) -> Option<&SequenceCollectionMetadata>
Get metadata for a single collection by digest (no sequence data).
Use this for lightweight lookups when you don’t need sequence data.
pub fn get_collection(
    &mut self,
    collection_digest: &str,
) -> Result<SequenceCollection>
Get a collection with all its sequences loaded.
This loads the collection metadata and all sequence data, returning a complete SequenceCollection ready for use.
§Example
let collection = store.get_collection("abc123")?;
for seq in &collection.sequences {
    println!("{}: {}", seq.metadata().name, seq.decode()?);
}
pub fn list_sequences(&self) -> Vec<SequenceMetadata>
pub fn get_sequence_metadata<K: AsRef<[u8]>>(
    &self,
    seq_digest: K,
) -> Option<&SequenceMetadata>
Get metadata for a single sequence by digest (no sequence data).
Use this for lightweight lookups when you don’t need the actual sequence.
pub fn get_sequence<K: AsRef<[u8]>>(
    &mut self,
    seq_digest: K,
) -> Result<&SequenceRecord>
pub fn get_sequence_by_name<K: AsRef<[u8]>>(
    &mut self,
    collection_digest: K,
    sequence_name: &str,
) -> Result<&SequenceRecord>
pub fn iter_collections(&mut self) -> impl Iterator<Item = SequenceCollection>
Iterate over all collections with their sequences loaded.
This loads all collection data upfront and returns an iterator over SequenceCollection objects with full sequence data.
§Example
for collection in store.iter_collections() {
    println!("{}: {} sequences", collection.metadata.digest, collection.sequences.len());
}
Note: For browsing without loading data, use list_collections() instead.
pub fn iter_sequences(&mut self) -> impl Iterator<Item = SequenceRecord>
Iterate over all sequences with their data loaded.
This ensures all sequence data is loaded and returns an iterator over SequenceRecord objects with full sequence data.
§Example
for seq in store.iter_sequences() {
    println!("{}: {}", seq.metadata().name, seq.decode().unwrap_or_default());
}
Note: For browsing without loading data, use list_sequences() instead.
pub fn is_collection_loaded<K: AsRef<[u8]>>(&self, collection_digest: K) -> bool
Check if a collection is fully loaded (Full) or just metadata (Stub)
pub fn local_path(&self) -> Option<&PathBuf>
Returns the local path where the store is located (if any)
pub fn remote_source(&self) -> Option<&str>
Returns the remote source URL (if any)
pub fn storage_mode(&self) -> StorageMode
Returns the storage mode used by this store
pub fn substrings_from_regions<'a, K: AsRef<[u8]>>(
    &'a mut self,
    collection_digest: K,
    bed_file_path: &str,
) -> Result<SubstringsFromRegions<'a, K>, Box<dyn Error>>
Get an iterator over substrings defined by BED file regions.
Reads a BED file line-by-line and yields substrings for each region. This is memory-efficient for large BED files as it streams results.
§Arguments
- collection_digest - The collection digest containing the sequences
- bed_file_path - Path to the BED file defining regions
§Returns
Iterator yielding Result<RetrievedSequence> for each BED region
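The streaming behaviour described above, one result per BED line without reading the whole file first, can be sketched with a lazy iterator over a reader. Region, parse_bed_line, and bed_regions are hypothetical names, and only the first three BED columns (chrom, start, end) are parsed.

```rust
use std::io::{BufRead, BufReader, Read};

/// First three BED columns. BED coordinates are 0-based and half-open.
#[derive(Debug, PartialEq)]
struct Region {
    chrom: String,
    start: usize,
    end: usize,
}

/// Parse one tab-delimited BED line into a Region.
fn parse_bed_line(line: &str) -> Result<Region, String> {
    let mut fields = line.split('\t');
    let chrom = fields
        .next()
        .filter(|s| !s.is_empty())
        .ok_or("missing chrom")?
        .to_string();
    let start = fields
        .next()
        .ok_or("missing start")?
        .trim()
        .parse::<usize>()
        .map_err(|e| format!("bad start: {}", e))?;
    let end = fields
        .next()
        .ok_or("missing end")?
        .trim()
        .parse::<usize>()
        .map_err(|e| format!("bad end: {}", e))?;
    if start > end {
        return Err(format!("start {} is greater than end {}", start, end));
    }
    Ok(Region { chrom, start, end })
}

/// Lazily yield one parsed region per line, skipping blank lines and `#` comments.
fn bed_regions<R: Read>(reader: R) -> impl Iterator<Item = Result<Region, String>> {
    BufReader::new(reader).lines().filter_map(|line| match line {
        Err(e) => Some(Err(e.to_string())),
        Ok(l) if l.trim().is_empty() || l.starts_with('#') => None,
        Ok(l) => Some(parse_bed_line(&l)),
    })
}
```

Because each item is a Result, a malformed line surfaces as an error for that region only and the caller decides whether to stop or continue.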
§Example
let iter = store.substrings_from_regions(digest, "regions.bed")?;
for result in iter {
    let seq = result?;
    println!("{}:{}-{}: {}", seq.chrom_name, seq.start, seq.end, seq.sequence);
}
pub fn export_fasta_from_regions<K: AsRef<[u8]>>(
    &mut self,
    collection_digest: K,
    bed_file_path: &str,
    output_file_path: &str,
) -> Result<(), Box<dyn Error>>
Export sequences from BED file regions to a FASTA file.
Reads a BED file defining genomic regions and exports the sequences for those regions to a FASTA file. This is useful for extracting specific regions of interest from a genome.
§Arguments
- collection_digest - The collection digest containing the sequences
- bed_file_path - Path to the BED file defining regions
- output_file_path - Path to write the output FASTA file
§Returns
Result indicating success or error
§Example
store.export_fasta_from_regions(
    digest,
    "regions.bed",
    "output.fa"
)?;
pub fn get_substring<K: AsRef<[u8]>>(
    &mut self,
    sha512_digest: K,
    start: usize,
    end: usize,
) -> Result<String>
Retrieves a substring from an encoded sequence by its SHA512t24u digest.
§Arguments
- sha512_digest - The SHA512t24u digest of the sequence
- start - The start index of the substring (inclusive)
- end - The end index of the substring (exclusive)
§Returns
The substring if the sequence is found, or an error if not found or invalid range
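The inclusive-start, exclusive-end convention matches Rust's start..end ranges. The sketch below shows that convention and the invalid-range case on a plain string; substring is a hypothetical helper and the error text is illustrative.

```rust
/// Return `seq[start..end]` (start inclusive, end exclusive),
/// or an error if the range is reversed or runs past the end.
fn substring(seq: &str, start: usize, end: usize) -> Result<&str, String> {
    seq.get(start..end).ok_or_else(|| {
        format!(
            "invalid range {}..{} for sequence of length {}",
            start,
            end,
            seq.len()
        )
    })
}
```

With this convention the substring length is always end minus start, and an empty range such as 3..3 is valid.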
pub fn export_fasta<K: AsRef<[u8]>, P: AsRef<Path>>(
    &mut self,
    collection_digest: K,
    output_path: P,
    sequence_names: Option<Vec<&str>>,
    line_width: Option<usize>,
) -> Result<()>
Export sequences from a collection to a FASTA file
§Arguments
- collection_digest - The digest of the collection to export from
- output_path - Path to write the FASTA file
- sequence_names - Optional list of sequence names to export. If None, exports all sequences in the collection.
- line_width - Optional line width for wrapping sequences (default: 80)
§Returns
Result indicating success or error
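Line wrapping with an optional width that falls back to 80 can be sketched like this. write_fasta_record is a hypothetical helper writing to any std::io::Write target; it illustrates the wrapping rule only and is not the crate's exporter.

```rust
use std::io::{self, Write};

/// Write one FASTA record, wrapping the sequence at `line_width` symbols
/// per line (80 when `None`, mirroring the documented default).
fn write_fasta_record<W: Write>(
    out: &mut W,
    name: &str,
    seq: &str,
    line_width: Option<usize>,
) -> io::Result<()> {
    // Guard against a width of zero, which `chunks` would reject.
    let width = line_width.unwrap_or(80).max(1);
    writeln!(out, ">{}", name)?;
    for chunk in seq.as_bytes().chunks(width) {
        out.write_all(chunk)?;
        out.write_all(b"\n")?;
    }
    Ok(())
}
```

A 100-symbol sequence written with no explicit width therefore produces one 80-symbol line followed by one 20-symbol line.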
pub fn export_fasta_by_digests<P: AsRef<Path>>(
    &mut self,
    seq_digests: Vec<&str>,
    output_path: P,
    line_width: Option<usize>,
) -> Result<()>
Export sequences by their sequence digests to a FASTA file
Bypasses collection information and exports sequences directly via sequence digests.
§Arguments
- seq_digests - List of SHA512t24u sequence digests (not collection digests) to export
- output_path - Path to write the FASTA file
- line_width - Optional line width for wrapping sequences (default: 80)
§Returns
Result indicating success or error
pub fn write_sequences_rgsi<P: AsRef<Path>>(&self, file_path: P) -> Result<()>
Write all sequence metadata to an RGSI file.
Creates a global sequence index file containing metadata for all sequences in the store across all collections.
pub fn open_local<P: AsRef<Path>>(path: P) -> Result<Self>
Open a local RefgetStore from a directory.
This loads only lightweight metadata and stubs. Collections and sequences remain as stubs until explicitly loaded with load_collection()/load_sequence().
§Arguments
- path - Path to the store directory
Expects: rgstore.json, sequences.rgsi, collections.rgci, collections/*.rgsi
pub fn open_remote<P: AsRef<Path>, S: AsRef<str>>(
    cache_path: P,
    remote_url: S,
) -> Result<Self>
Open a remote RefgetStore with local caching.
This loads only lightweight metadata and stubs from the remote URL. Data is fetched on-demand when load_collection()/load_sequence() is called.
§Arguments
- cache_path - Local directory for caching fetched data
- remote_url - URL of the remote store
§Notes
By default, persistence is enabled (sequences are cached to disk).
Call disable_persistence() after loading to keep only in memory.
pub fn write(&self) -> Result<()>
Write the store using its configured paths.
This writes all sequence metadata to an RGSI file without collection headers: a global sequence index file containing metadata for all sequences in the store across all collections, with no collection-level digest headers.
For disk-backed stores (on_disk), this updates index files only, since sequences and collections are already written incrementally. For in-memory stores, this is not supported (use write_store_to_dir instead).
§Returns
Result indicating success or error
§Errors
Returns an error if local_path is not set.
§Example
let mut store = RefgetStore::on_disk("/data/store")?;
store.add_sequence_collection_from_fasta("genome.fa")?;
store.write()?; // Updates index files
pub fn write_store_to_dir<P: AsRef<Path>>(
    &self,
    root_path: P,
    seqdata_path_template: Option<&str>,
) -> Result<()>
Write a RefgetStore object to a directory
pub fn stats(&self) -> (usize, usize, &'static str)
Returns statistics about the store
§Returns
A tuple of (n_sequences, n_collections_loaded, storage_mode_str)
Note: n_collections_loaded only reflects collections currently loaded in memory. For remote stores, collections are loaded on-demand when accessed.
pub fn stats_extended(&self) -> StoreStats
Extended statistics including stub/loaded breakdown for collections