Struct RefgetStore 

Source
pub struct RefgetStore { /* private fields */ }

Global store handling cross-collection sequence management.

Holds a global sequence_store containing all sequences across collections, so that sequences are deduplicated. This allows lookup by sequence digest directly, bypassing collection information. The RefgetStore also holds a collections hashmap to provide lookup by collection and name.
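
The layout described above can be pictured with a small stand-alone model. This is only a sketch under stated assumptions: MiniStore, its fields, and its methods are hypothetical names invented for illustration, and the real RefgetStore keeps its fields private and may organize them differently.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the store; not the crate's actual layout.
#[derive(Default)]
struct MiniStore {
    /// sequence digest -> sequence bytes, shared by every collection
    sequences: HashMap<String, Vec<u8>>,
    /// collection digest -> (sequence name -> sequence digest)
    collections: HashMap<String, HashMap<String, String>>,
}

impl MiniStore {
    /// Register `name` in `collection`; the bytes are stored only if the digest is new.
    fn add(&mut self, collection: &str, name: &str, digest: &str, data: &[u8]) {
        self.sequences
            .entry(digest.to_string())
            .or_insert_with(|| data.to_vec());
        self.collections
            .entry(collection.to_string())
            .or_default()
            .insert(name.to_string(), digest.to_string());
    }

    /// Direct lookup by sequence digest, bypassing collections.
    fn by_digest(&self, digest: &str) -> Option<&[u8]> {
        self.sequences.get(digest).map(|v| v.as_slice())
    }

    /// Lookup by collection digest plus sequence name.
    fn by_name(&self, collection: &str, name: &str) -> Option<&[u8]> {
        let digest = self.collections.get(collection)?.get(name)?;
        self.by_digest(digest)
    }
}
```

Adding the same digest under two collections leaves a single entry in the sequence map, while both name-based lookups still resolve to it.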

Implementations§

Source§

impl RefgetStore

Source

pub fn set_quiet(&mut self, quiet: bool)

Set whether to suppress progress output.

When quiet is true, operations like add_sequence_collection_from_fasta will not print progress messages.

§Arguments
  • quiet - Whether to suppress progress output
Source

pub fn is_quiet(&self) -> bool

Returns whether the store is in quiet mode.

Source

pub fn on_disk<P: AsRef<Path>>(cache_path: P) -> Result<Self>

Create a disk-backed RefgetStore

Sequences are written to disk immediately and loaded on-demand (lazy loading). Only metadata is kept in memory.

§Arguments
  • cache_path - Directory for storing sequences and metadata
§Returns

Result with a configured disk-backed store

§Example
// `mut` is required: adding a collection takes `&mut self`.
let mut store = RefgetStore::on_disk("/data/store")?;
store.add_sequence_collection_from_fasta("genome.fa")?;
Source

pub fn in_memory() -> Self

Create an in-memory RefgetStore

All sequences kept in RAM for fast access. Defaults to Encoded storage mode (2-bit packing for space efficiency). Use set_encoding_mode() to change storage mode after creation.

§Example
// `mut` is required: adding a collection takes `&mut self`.
let mut store = RefgetStore::in_memory();
store.add_sequence_collection_from_fasta("genome.fa")?;
Source

pub fn set_encoding_mode(&mut self, new_mode: StorageMode)

Change the storage mode, re-encoding/decoding existing sequences as needed.

When switching from Raw to Encoded:

  • All Full sequences in memory are encoded (2-bit packed)

When switching from Encoded to Raw:

  • All Full sequences in memory are decoded back to raw bytes

Note: Stub sequences (lazy-loaded from disk) are not affected. They will be loaded in the NEW mode when accessed.

§Arguments
  • new_mode - The storage mode to switch to
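
To make the Raw versus Encoded distinction concrete, here is a minimal sketch of 2-bit packing for a plain ACGT alphabet. The function names and the bit layout (four bases per byte, first base in the high bits) are assumptions for illustration; the crate's actual encoder may use a different layout and supports other alphabets.

```rust
/// Map a nucleotide to its 2-bit code; returns None for anything outside ACGT.
fn code(base: u8) -> Option<u8> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a raw ACGT sequence into 2 bits per base, first base in the high bits.
/// Returns None if any symbol cannot be represented in 2 bits.
fn encode_2bit(raw: &[u8]) -> Option<Vec<u8>> {
    let mut packed = vec![0u8; (raw.len() + 3) / 4];
    for (i, &base) in raw.iter().enumerate() {
        let shift = 6 - 2 * (i % 4);
        packed[i / 4] |= code(base)? << shift;
    }
    Some(packed)
}

/// Unpack `len` bases from 2-bit packed data back to raw (uppercase) bytes.
fn decode_2bit(packed: &[u8], len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| {
            let shift = 6 - 2 * (i % 4);
            b"ACGT"[((packed[i / 4] >> shift) & 0b11) as usize]
        })
        .collect()
}
```

The original length has to be kept alongside the packed bytes, because the final byte may be only partly filled.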
Source

pub fn enable_encoding(&mut self)

Enable 2-bit encoding for space efficiency. Re-encodes any existing Raw sequences in memory.

Source

pub fn disable_encoding(&mut self)

Disable encoding, use raw byte storage. Decodes any existing Encoded sequences in memory.

Source

pub fn enable_persistence<P: AsRef<Path>>(&mut self, path: P) -> Result<()>

Enable disk persistence for this store.

Sets up the store to write sequences to disk. Any in-memory Full sequences are flushed to disk and converted to Stubs.

§Arguments
  • path - Directory for storing sequences and metadata
§Returns

Result indicating success or error
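
The Full-to-Stub conversion can be illustrated with a two-state record. This is a sketch only: the Record enum, the flush and load methods, and the "<name>.seq" file naming are hypothetical and are not taken from the crate.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical two-state record: data held in memory, or only a pointer to its file.
#[derive(Debug, PartialEq)]
enum Record {
    Full(Vec<u8>),
    Stub(PathBuf),
}

impl Record {
    /// Write in-memory data to `dir/<name>.seq` and keep only the path.
    fn flush(&mut self, dir: &Path, name: &str) -> io::Result<()> {
        if let Record::Full(data) = self {
            let path = dir.join(format!("{name}.seq"));
            fs::write(&path, data.as_slice())?;
            *self = Record::Stub(path);
        }
        Ok(())
    }

    /// Load the data back on demand, turning a stub into a full record again.
    fn load(&mut self) -> io::Result<&[u8]> {
        let loaded = match self {
            Record::Stub(path) => Some(fs::read(path.as_path())?),
            Record::Full(_) => None,
        };
        if let Some(data) = loaded {
            *self = Record::Full(data);
        }
        match &*self {
            Record::Full(data) => Ok(data.as_slice()),
            Record::Stub(_) => unreachable!("stub was replaced above"),
        }
    }
}
```

After flush the record holds only a path, so memory use drops to metadata; load reads the file back the first time the data is needed.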

Source

pub fn disable_persistence(&mut self)

Disable disk persistence for this store.

New sequences will be kept in memory only. Existing Stub sequences can still be loaded from disk if local_path is set.

Source

pub fn is_persisting(&self) -> bool

Check if persistence to disk is enabled.

Source

pub fn add_sequence<T: Into<Option<[u8; 32]>>>( &mut self, sequence_record: SequenceRecord, collection_digest: T, force: bool, ) -> Result<()>

Adds a sequence to the store and ensures that it is added to the appropriate collection. If no collection is specified, it is added to the default collection.

§Arguments
  • sequence_record - The sequence to add
  • collection_digest - Collection to add to (or None for default)
  • force - If true, overwrite existing sequences. If false, skip duplicates.
Source

pub fn add_sequence_collection( &mut self, collection: SequenceCollection, ) -> Result<()>

Adds a collection, and all sequences in it, to the store.

Skips collections and sequences that already exist. Use add_sequence_collection_force() to overwrite existing data.

§Arguments
  • collection - The sequence collection to add
Source

pub fn add_sequence_collection_force( &mut self, collection: SequenceCollection, ) -> Result<()>

Adds a collection, and all sequences in it, to the store, overwriting existing data.

Forces overwrite of collections and sequences that already exist. Use add_sequence_collection() to skip duplicates (safer default).

§Arguments
  • collection - The sequence collection to add
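
The difference between the skipping and forcing variants comes down to what happens when a key already exists. A minimal sketch of that policy over a plain HashMap, assuming a hypothetical helper named insert_with_policy (not part of the crate):

```rust
use std::collections::HashMap;

/// Insert `value` under `key`. Returns true when the map was changed.
/// With `force == false`, an existing entry is left untouched.
fn insert_with_policy(
    map: &mut HashMap<String, Vec<u8>>,
    key: &str,
    value: &[u8],
    force: bool,
) -> bool {
    if !force && map.contains_key(key) {
        return false; // duplicate skipped
    }
    map.insert(key.to_string(), value.to_vec());
    true
}
```

Skipping is the safer default because repeated imports of the same data become no-ops; forcing is useful when stored data is known to be stale or corrupt.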
Source

pub fn add_sequence_collection_from_fasta<P: AsRef<Path>>( &mut self, file_path: P, ) -> Result<(SequenceCollectionMetadata, bool)>

Add a sequence collection from a FASTA file.

Skips sequences and collections that already exist in the store. Use add_sequence_collection_from_fasta_force() to overwrite existing data.

§Arguments
  • file_path - Path to the FASTA file
§Returns

A tuple of (SequenceCollectionMetadata, was_new) where was_new indicates whether the collection was newly added (true) or already existed (false).

§Notes

Loading sequence data requires 2 passes through the FASTA file:

  1. First pass digests and guesses the alphabet to produce SequenceMetadata
  2. Second pass encodes the sequences based on the detected alphabet
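
The two passes above can be sketched on an in-memory FASTA string. Everything here is an illustrative assumption: the Meta struct, the two function names, and the simplified alphabet guess (ACGT-only versus everything else) are hypothetical, and digest computation is omitted because it would need a hashing dependency.

```rust
/// Hypothetical per-record metadata gathered in the first pass.
struct Meta {
    name: String,
    length: usize,
    is_acgt: bool, // simplified alphabet guess: every symbol is A, C, G or T
}

/// Pass 1: scan the FASTA text and collect per-record metadata only.
fn first_pass(fasta: &str) -> Vec<Meta> {
    let mut out: Vec<Meta> = Vec::new();
    for line in fasta.lines() {
        if let Some(name) = line.strip_prefix('>') {
            out.push(Meta { name: name.trim().to_string(), length: 0, is_acgt: true });
        } else if let Some(meta) = out.last_mut() {
            let bases = line.trim();
            meta.length += bases.len();
            meta.is_acgt &= bases.bytes().all(|b| b"ACGT".contains(&b.to_ascii_uppercase()));
        }
    }
    out
}

/// Pass 2: re-read the text and encode each record using the alphabet from pass 1.
/// ACGT-only records are packed at 2 bits per base; others are kept as raw bytes.
fn second_pass(fasta: &str, metas: &[Meta]) -> Vec<Vec<u8>> {
    let mut out: Vec<Vec<u8>> = Vec::new();
    let mut pos = 0usize; // bases written so far in the current record
    for line in fasta.lines() {
        if line.starts_with('>') {
            out.push(Vec::new());
            pos = 0;
            continue;
        }
        if out.is_empty() {
            continue; // text before the first header is ignored
        }
        let is_acgt = metas[out.len() - 1].is_acgt;
        let buf = out.last_mut().unwrap();
        for b in line.trim().bytes() {
            if is_acgt {
                if pos % 4 == 0 {
                    buf.push(0); // start a new packed byte every four bases
                }
                let code = b"ACGT".iter().position(|&x| x == b.to_ascii_uppercase()).unwrap() as u8;
                *buf.last_mut().unwrap() |= code << (6 - 2 * (pos % 4));
            } else {
                buf.push(b);
            }
            pos += 1;
        }
    }
    out
}
```

The second pass is needed because the encoding of a record's first base depends on the alphabet, which is only known once the whole record has been seen.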
Source

pub fn add_sequence_collection_from_fasta_force<P: AsRef<Path>>( &mut self, file_path: P, ) -> Result<(SequenceCollectionMetadata, bool)>

Add a sequence collection from a FASTA file, overwriting existing data.

Forces overwrite of collections and sequences that already exist in the store. Use add_sequence_collection_from_fasta() to skip duplicates (safer default).

§Arguments
  • file_path - Path to the FASTA file
§Returns

A tuple of (SequenceCollectionMetadata, was_new) where was_new is always true since force mode always overwrites.

Source

pub fn sequence_digests(&self) -> impl Iterator<Item = [u8; 32]> + '_

Returns an iterator over all sequence digests in the store

Source

pub fn sequence_metadata(&self) -> impl Iterator<Item = &SequenceMetadata> + '_

Returns an iterator over sequence metadata for all sequences in the store.

This is a lightweight operation that returns only metadata (name, length, digests) without loading sequence data.

§Returns

An iterator over SequenceMetadata references.

§Example
for metadata in store.sequence_metadata() {
    println!("{}: {} bp", metadata.name, metadata.length);
}
Source

pub fn total_disk_size(&self) -> usize

Calculate the total disk size of all sequences in the store

This computes the disk space used by sequence data based on:

  • Sequence length
  • Alphabet type (bits per symbol)
  • Storage mode (Raw or Encoded)
§Returns

Total bytes used for sequence data on disk

§Note

This only accounts for sequence data files (.seq), not metadata files like RGSI files, rgstore.json, or directory overhead.

§Examples
// `on_disk` returns a Result, and adding a collection needs a mutable store.
let mut store = RefgetStore::on_disk("store")?;
store.add_sequence_collection_from_fasta("genome.fa")?;
let disk_size = store.total_disk_size();
println!("Sequences use {} bytes on disk", disk_size);
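
As a rough illustration of how length, bits per symbol, and storage mode combine into a size, consider the following sketch. The Mode enum and both functions are hypothetical, and the formula (one byte per symbol when raw, bits rounded up to whole bytes when encoded) is an assumption rather than the crate's confirmed calculation.

```rust
/// Hypothetical storage modes mirroring the Raw/Encoded distinction in the text.
#[derive(Clone, Copy)]
enum Mode {
    Raw,
    Encoded,
}

/// Estimate the bytes needed for one sequence.
/// Raw: one byte per symbol. Encoded: `bits_per_symbol` bits each, rounded up to whole bytes.
fn estimated_bytes(length: usize, bits_per_symbol: usize, mode: Mode) -> usize {
    match mode {
        Mode::Raw => length,
        Mode::Encoded => (length * bits_per_symbol + 7) / 8,
    }
}

/// Sum the estimate over `(length, bits_per_symbol)` pairs.
fn estimated_total(seqs: &[(usize, usize)], mode: Mode) -> usize {
    seqs.iter()
        .map(|&(len, bits)| estimated_bytes(len, bits, mode))
        .sum()
}
```

Under these assumptions, a 10-symbol sequence at 2 bits per symbol takes 3 bytes encoded and 10 bytes raw.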
Source

pub fn actual_disk_usage(&self) -> usize

Returns the actual disk usage of the store directory.

Walks the local_path directory (if set) and sums all file sizes. For in-memory stores without a local_path, returns 0.

This is useful for stats reporting to show actual disk consumption regardless of whether sequences are loaded in memory.
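
Walking a directory and summing file sizes needs only the standard library. The sketch below shows one way to do it; dir_size is a hypothetical helper, not the crate's implementation, and it returns 0 for a missing directory to mirror the in-memory case described above.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively sum the sizes of all files under `dir`.
/// A path that is not a directory counts as 0 bytes.
fn dir_size(dir: &Path) -> io::Result<u64> {
    if !dir.is_dir() {
        return Ok(0);
    }
    let mut total = 0u64;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_dir() {
            total += dir_size(&entry.path())?;
        } else {
            total += meta.len();
        }
    }
    Ok(total)
}
```

Unlike an estimate derived from sequence lengths, this figure includes index and metadata files, which is why the two numbers can differ.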

Source

pub fn list_collections(&self) -> Vec<SequenceCollectionMetadata>

List all collections in the store (metadata only, no sequence data).

Returns metadata for all collections without loading sequence data. Use this for browsing/inventory operations.

§Example
for meta in store.list_collections() {
    println!("{}: {} sequences", meta.digest, meta.n_sequences);
}
Source

pub fn get_collection_metadata<K: AsRef<[u8]>>( &self, collection_digest: K, ) -> Option<&SequenceCollectionMetadata>

Get metadata for a single collection by digest (no sequence data).

Use this for lightweight lookups when you don’t need sequence data.

Source

pub fn get_collection( &mut self, collection_digest: &str, ) -> Result<SequenceCollection>

Get a collection with all its sequences loaded.

This loads the collection metadata and all sequence data, returning a complete SequenceCollection ready for use.

§Example
let collection = store.get_collection("abc123")?;
for seq in &collection.sequences {
    println!("{}: {}", seq.metadata().name, seq.decode()?);
}
Source

pub fn list_sequences(&self) -> Vec<SequenceMetadata>

List all sequences in the store (metadata only, no sequence data).

Returns metadata for all sequences without loading sequence data. Use this for browsing/inventory operations.

§Example
for meta in store.list_sequences() {
    println!("{}: {} bp", meta.name, meta.length);
}
Source

pub fn get_sequence_metadata<K: AsRef<[u8]>>( &self, seq_digest: K, ) -> Option<&SequenceMetadata>

Get metadata for a single sequence by digest (no sequence data).

Use this for lightweight lookups when you don’t need the actual sequence.

Source

pub fn get_sequence<K: AsRef<[u8]>>( &mut self, seq_digest: K, ) -> Result<&SequenceRecord>

Get a sequence by its SHA512t24u digest, loading data if needed.

§Example
let seq = store.get_sequence("abc123")?;
println!("{}: {}", seq.metadata().name, seq.decode()?);
Source

pub fn get_sequence_by_name<K: AsRef<[u8]>>( &mut self, collection_digest: K, sequence_name: &str, ) -> Result<&SequenceRecord>

Get a sequence by collection digest and name, loading data if needed.

§Example
let seq = store.get_sequence_by_name("collection123", "chr1")?;
println!("{}", seq.decode()?);
Source

pub fn iter_collections(&mut self) -> impl Iterator<Item = SequenceCollection>

Iterate over all collections with their sequences loaded.

This loads all collection data upfront and returns an iterator over SequenceCollection objects with full sequence data.

§Example
for collection in store.iter_collections() {
    println!("{}: {} sequences", collection.metadata.digest, collection.sequences.len());
}

Note: For browsing without loading data, use list_collections() instead.

Source

pub fn iter_sequences(&mut self) -> impl Iterator<Item = SequenceRecord>

Iterate over all sequences with their data loaded.

This ensures all sequence data is loaded and returns an iterator over SequenceRecord objects with full sequence data.

§Example
for seq in store.iter_sequences() {
    println!("{}: {}", seq.metadata().name, seq.decode().unwrap_or_default());
}

Note: For browsing without loading data, use list_sequences() instead.

Source

pub fn is_collection_loaded<K: AsRef<[u8]>>(&self, collection_digest: K) -> bool

Check if a collection is fully loaded (Full) or just metadata (Stub)

Source

pub fn local_path(&self) -> Option<&PathBuf>

Returns the local path where the store is located (if any)

Source

pub fn remote_source(&self) -> Option<&str>

Returns the remote source URL (if any)

Source

pub fn storage_mode(&self) -> StorageMode

Returns the storage mode used by this store

Source

pub fn substrings_from_regions<'a, K: AsRef<[u8]>>( &'a mut self, collection_digest: K, bed_file_path: &str, ) -> Result<SubstringsFromRegions<'a, K>, Box<dyn Error>>

Get an iterator over substrings defined by BED file regions.

Reads a BED file line-by-line and yields substrings for each region. This is memory-efficient for large BED files as it streams results.

§Arguments
  • collection_digest - The collection digest containing the sequences
  • bed_file_path - Path to the BED file defining regions
§Returns

Iterator yielding Result<RetrievedSequence> for each BED region

§Example
let iter = store.substrings_from_regions(digest, "regions.bed")?;
for result in iter {
    let seq = result?;
    println!("{}:{}-{}: {}", seq.chrom_name, seq.start, seq.end, seq.sequence);
}
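
The streaming behaviour can be sketched with a lazy iterator over BED lines. This is an illustration under assumptions: parse_bed_line and regions are hypothetical helpers, sequences are held as plain strings keyed by name, and invalid lines are silently skipped here, whereas the method above yields a Result for each region.

```rust
use std::collections::HashMap;

/// Parse one BED line into (chrom, start, end); BED coordinates are 0-based, half-open.
fn parse_bed_line(line: &str) -> Option<(&str, usize, usize)> {
    let mut fields = line.split_whitespace();
    let chrom = fields.next()?;
    let start: usize = fields.next()?.parse().ok()?;
    let end: usize = fields.next()?.parse().ok()?;
    Some((chrom, start, end))
}

/// Lazily yield `(chrom, start, end, substring)` for each valid region in `bed`.
/// Lines that fail to parse or fall outside the sequence are skipped in this sketch.
fn regions<'a>(
    seqs: &'a HashMap<String, String>,
    bed: &'a str,
) -> impl Iterator<Item = (&'a str, usize, usize, &'a str)> + 'a {
    bed.lines().filter_map(move |line| {
        let (chrom, start, end) = parse_bed_line(line)?;
        let sub = seqs.get(chrom)?.get(start..end)?;
        Some((chrom, start, end, sub))
    })
}
```

Because the iterator processes one line at a time, memory use stays flat regardless of how many regions the BED file lists.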
Source

pub fn export_fasta_from_regions<K: AsRef<[u8]>>( &mut self, collection_digest: K, bed_file_path: &str, output_file_path: &str, ) -> Result<(), Box<dyn Error>>

Export sequences from BED file regions to a FASTA file.

Reads a BED file defining genomic regions and exports the sequences for those regions to a FASTA file. This is useful for extracting specific regions of interest from a genome.

§Arguments
  • collection_digest - The collection digest containing the sequences
  • bed_file_path - Path to the BED file defining regions
  • output_file_path - Path to write the output FASTA file
§Returns

Result indicating success or error

§Example
store.export_fasta_from_regions(
    digest,
    "regions.bed",
    "output.fa"
)?;
Source

pub fn get_substring<K: AsRef<[u8]>>( &mut self, sha512_digest: K, start: usize, end: usize, ) -> Result<String>

Retrieves a substring from an encoded sequence by its SHA512t24u digest.

§Arguments
  • sha512_digest - The SHA512t24u digest of the sequence
  • start - The start index of the substring (inclusive)
  • end - The end index of the substring (exclusive)
§Returns

The substring if the sequence is found, or an error if not found or invalid range
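
A half-open range over encoded data can be extracted without decoding the whole sequence. The sketch below assumes a hypothetical 2-bit layout (four bases per byte, first base in the high bits) and a hypothetical function name; it is not the crate's decoder.

```rust
/// Decode bases `start..end` (start inclusive, end exclusive) from 2-bit packed
/// data holding `len` bases. Returns an error message for an invalid range.
fn substring_2bit(
    packed: &[u8],
    len: usize,
    start: usize,
    end: usize,
) -> Result<String, String> {
    if start > end || end > len || packed.len() * 4 < len {
        return Err(format!("invalid range {start}..{end} for length {len}"));
    }
    let bases = (start..end).map(|i| {
        // Select the byte holding base i, then shift its 2-bit code into place.
        let code = (packed[i / 4] >> (6 - 2 * (i % 4))) & 0b11;
        b"ACGT"[code as usize] as char
    });
    Ok(bases.collect())
}
```

With half-open ranges, end minus start is the substring length, and start equal to end yields an empty string rather than an error.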

Source

pub fn export_fasta<K: AsRef<[u8]>, P: AsRef<Path>>( &mut self, collection_digest: K, output_path: P, sequence_names: Option<Vec<&str>>, line_width: Option<usize>, ) -> Result<()>

Export sequences from a collection to a FASTA file

§Arguments
  • collection_digest - The digest of the collection to export from
  • output_path - Path to write the FASTA file
  • sequence_names - Optional list of sequence names to export. If None, exports all sequences in the collection.
  • line_width - Optional line width for wrapping sequences (default: 80)
§Returns

Result indicating success or error
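
The effect of line_width is easiest to see in a small writer. The following is a sketch, with write_fasta_record as a hypothetical helper; only the default of 80 is taken from the argument description above.

```rust
use std::io::{self, Write};

/// Write one FASTA record, wrapping sequence lines at `line_width` (80 when `None`).
fn write_fasta_record<W: Write>(
    out: &mut W,
    name: &str,
    seq: &str,
    line_width: Option<usize>,
) -> io::Result<()> {
    let width = line_width.unwrap_or(80).max(1); // guard against a zero width
    writeln!(out, ">{name}")?;
    for chunk in seq.as_bytes().chunks(width) {
        out.write_all(chunk)?;
        out.write_all(b"\n")?;
    }
    Ok(())
}
```

Writing through a generic Write target keeps the helper usable with files, buffers, or standard output.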

Source

pub fn export_fasta_by_digests<P: AsRef<Path>>( &mut self, seq_digests: Vec<&str>, output_path: P, line_width: Option<usize>, ) -> Result<()>

Export sequences by their sequence digests to a FASTA file

Bypasses collection information and exports sequences directly via sequence digests.

§Arguments
  • seq_digests - List of SHA512t24u sequence digests (not collection digests) to export
  • output_path - Path to write the FASTA file
  • line_width - Optional line width for wrapping sequences (default: 80)
§Returns

Result indicating success or error

Source

pub fn write_sequences_rgsi<P: AsRef<Path>>(&self, file_path: P) -> Result<()>

Write all sequence metadata to an RGSI file.

Creates a global sequence index file containing metadata for all sequences in the store across all collections.

Source

pub fn open_local<P: AsRef<Path>>(path: P) -> Result<Self>

Open a local RefgetStore from a directory.

This loads only lightweight metadata and stubs. Collections and sequences remain as stubs until explicitly loaded with load_collection()/load_sequence().

§Arguments
  • path - Path to the store directory

Expects the directory to contain: rgstore.json, sequences.rgsi, collections.rgci, collections/*.rgsi

Source

pub fn open_remote<P: AsRef<Path>, S: AsRef<str>>( cache_path: P, remote_url: S, ) -> Result<Self>

Open a remote RefgetStore with local caching.

This loads only lightweight metadata and stubs from the remote URL. Data is fetched on-demand when load_collection()/load_sequence() is called.

§Arguments
  • cache_path - Local directory for caching fetched data
  • remote_url - URL of the remote store
§Notes

By default, persistence is enabled (sequences are cached to disk). Call disable_persistence() after loading to keep only in memory.

Source

pub fn write(&self) -> Result<()>

Write the store using its configured paths.

For disk-backed stores (on_disk), this updates index files only since sequences/collections are already written incrementally. For in-memory stores, this is not supported (use write_store_to_dir instead).

§Returns

Result indicating success or error

§Errors

Returns an error if local_path is not set.

§Example
// `mut` is required: adding a collection takes `&mut self`.
let mut store = RefgetStore::on_disk("/data/store")?;
store.add_sequence_collection_from_fasta("genome.fa")?;
store.write()?;  // Updates index files
Source

pub fn write_store_to_dir<P: AsRef<Path>>( &self, root_path: P, seqdata_path_template: Option<&str>, ) -> Result<()>

Write a RefgetStore object to a directory

Source

pub fn stats(&self) -> (usize, usize, &'static str)

Returns statistics about the store

§Returns

A tuple of (n_sequences, n_collections_loaded, storage_mode_str)

Note: n_collections_loaded only reflects collections currently loaded in memory. For remote stores, collections are loaded on-demand when accessed.

Source

pub fn stats_extended(&self) -> StoreStats

Extended statistics including stub/loaded breakdown for collections

Trait Implementations§

Source§

impl Debug for RefgetStore

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.
Source§

impl Display for RefgetStore

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self.
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value.
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> Same for T

Source§

type Output = T

Should always be Self
Source§

impl<T> ToString for T
where T: Display + ?Sized,

Source§

fn to_string(&self) -> String

Converts the given value to a String.
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.