Trait DatasetStore

pub trait DatasetStore:
    Send
    + Sync
    + Clone {
    // Required methods
    fn get_by_id(
        &self,
        id: Uuid,
    ) -> impl Future<Output = Result<Option<Dataset>, AppError>> + Send;
    fn get_hashes_for_portal(
        &self,
        portal_url: &str,
    ) -> impl Future<Output = Result<HashMap<String, Option<String>>, AppError>> + Send;
    fn update_timestamp_only(
        &self,
        portal_url: &str,
        original_id: &str,
    ) -> impl Future<Output = Result<(), AppError>> + Send;
    fn batch_update_timestamps(
        &self,
        portal_url: &str,
        original_ids: &[String],
    ) -> impl Future<Output = Result<u64, AppError>> + Send;
    fn upsert(
        &self,
        dataset: &NewDataset,
    ) -> impl Future<Output = Result<Uuid, AppError>> + Send;
    fn search(
        &self,
        query_vector: Vector,
        limit: usize,
    ) -> impl Future<Output = Result<Vec<SearchResult>, AppError>> + Send;
    fn list_stream<'a>(
        &'a self,
        portal_filter: Option<&'a str>,
        limit: Option<usize>,
    ) -> BoxStream<'a, Result<Dataset, AppError>>;
    fn get_last_sync_time(
        &self,
        portal_url: &str,
    ) -> impl Future<Output = Result<Option<DateTime<Utc>>, AppError>> + Send;
    fn record_sync_status(
        &self,
        portal_url: &str,
        sync_time: DateTime<Utc>,
        sync_mode: &str,
        sync_status: &str,
        datasets_synced: i32,
    ) -> impl Future<Output = Result<(), AppError>> + Send;
    fn health_check(&self) -> impl Future<Output = Result<(), AppError>> + Send;
}

Store for dataset persistence and retrieval.

Implementations handle database operations for datasets.

Required Methods§

fn get_by_id( &self, id: Uuid, ) -> impl Future<Output = Result<Option<Dataset>, AppError>> + Send

Retrieves a dataset by its unique ID.

§Arguments
  • id - The dataset’s UUID
§Returns

The dataset if found, or None if no dataset with that ID exists.

fn get_hashes_for_portal( &self, portal_url: &str, ) -> impl Future<Output = Result<HashMap<String, Option<String>>, AppError>> + Send

Retrieves content hashes for all datasets from a specific portal.

Used for delta detection to determine which datasets need reprocessing.

§Arguments
  • portal_url - The source portal URL
§Returns

A map from original_id to optional content_hash.
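
The returned map lends itself to a caller-side delta check. A minimal sketch, assuming harvested records arrive as (original_id, content_hash) pairs; the `partition_by_hash` helper and its shapes are illustrative, not part of this trait:

```rust
use std::collections::HashMap;

/// Split freshly harvested (original_id, content_hash) pairs into
/// unchanged IDs (timestamp bump only) and IDs that need a full
/// upsert. `stored` mirrors the map from `get_hashes_for_portal`.
fn partition_by_hash(
    stored: &HashMap<String, Option<String>>,
    harvested: &[(String, String)],
) -> (Vec<String>, Vec<String>) {
    let mut unchanged = Vec::new();
    let mut changed = Vec::new();
    for (id, hash) in harvested {
        match stored.get(id) {
            // Known dataset with an identical hash: skip reprocessing.
            Some(Some(existing)) if existing == hash => unchanged.push(id.clone()),
            // New dataset, missing stored hash, or differing hash.
            _ => changed.push(id.clone()),
        }
    }
    (unchanged, changed)
}
```

Unchanged IDs would then go to batch_update_timestamps, changed ones through the full pipeline to upsert.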

fn update_timestamp_only( &self, portal_url: &str, original_id: &str, ) -> impl Future<Output = Result<(), AppError>> + Send

Updates only the timestamp for an unchanged dataset.

Used when content hash matches but we want to track “last seen” time.

§Arguments
  • portal_url - The source portal URL
  • original_id - The dataset’s original ID from the portal

fn batch_update_timestamps( &self, portal_url: &str, original_ids: &[String], ) -> impl Future<Output = Result<u64, AppError>> + Send

Batch updates timestamps for multiple unchanged datasets.

More efficient than calling update_timestamp_only for each dataset.

§Arguments
  • portal_url - The source portal URL
  • original_ids - Slice of dataset original IDs to update
§Returns

The number of rows actually updated.
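
On the caller side, large ID sets are typically sent in fixed-size chunks and the per-chunk row counts summed. A synchronous sketch where a closure stands in for the async store call; the helper name and chunking policy are illustrative:

```rust
/// Send IDs to the store in fixed-size chunks and sum the rows
/// updated. A closure stands in for the async
/// `batch_update_timestamps` call so the sketch stays synchronous.
fn update_in_chunks<F>(ids: &[String], chunk_size: usize, mut update: F) -> u64
where
    F: FnMut(&[String]) -> u64,
{
    ids.chunks(chunk_size).map(|chunk| update(chunk)).sum()
}
```

Summing the per-chunk counts preserves the "rows actually updated" semantics across the whole batch.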

fn upsert( &self, dataset: &NewDataset, ) -> impl Future<Output = Result<Uuid, AppError>> + Send

Inserts or updates a dataset.

§Arguments
  • dataset - The dataset to upsert
§Returns

The UUID of the affected row.

fn search( &self, query_vector: Vector, limit: usize, ) -> impl Future<Output = Result<Vec<SearchResult>, AppError>> + Send

Performs vector similarity search.

§Arguments
  • query_vector - The embedding vector to search for
  • limit - Maximum number of results
§Returns

Datasets ranked by similarity score (highest first).
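
The ordering contract can be sketched with a pared-down stand-in for `SearchResult` (assumed here to carry just an ID and a score; the real type differs):

```rust
use std::cmp::Ordering;

/// Pared-down stand-in for the crate's `SearchResult` (assumption:
/// the real type carries more fields).
struct SearchResult {
    id: u32,
    score: f32,
}

/// Order results highest similarity first and truncate to `limit`,
/// matching the documented contract of `search`.
fn rank(mut results: Vec<SearchResult>, limit: usize) -> Vec<SearchResult> {
    results.sort_by(|a, b| {
        // Descending by score; NaN scores compare as equal.
        b.score.partial_cmp(&a.score).unwrap_or(Ordering::Equal)
    });
    results.truncate(limit);
    results
}
```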

fn list_stream<'a>( &'a self, portal_filter: Option<&'a str>, limit: Option<usize>, ) -> BoxStream<'a, Result<Dataset, AppError>>

Lists datasets as a stream with optional filtering.

This method returns a stream of datasets for memory-efficient processing of large result sets. Unlike batch methods, it streams results directly from the database without loading everything into memory.

§Arguments
  • portal_filter - Optional portal URL to filter by
  • limit - Optional maximum number of records
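
A consumer typically drains the stream item by item, stopping at the first error. The same shape with a plain iterator standing in for `BoxStream` to keep the sketch synchronous; the helper is illustrative:

```rust
/// Count items while consuming results one at a time, propagating
/// the first error -- the same pattern a caller uses with
/// `while let Some(item) = stream.next().await` on `list_stream`.
fn count_ok<I, T, E>(items: I) -> Result<usize, E>
where
    I: IntoIterator<Item = Result<T, E>>,
{
    let mut n = 0;
    for item in items {
        item?; // stop at the first database error
        n += 1;
    }
    Ok(n)
}
```

Because each item is processed and dropped before the next is pulled, memory use stays constant regardless of result-set size.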

fn get_last_sync_time( &self, portal_url: &str, ) -> impl Future<Output = Result<Option<DateTime<Utc>>, AppError>> + Send

Retrieves the last successful sync timestamp for a portal.

Used for incremental harvesting to determine which datasets have been modified since the last sync.

§Arguments
  • portal_url - The source portal URL
§Returns

The timestamp of the last successful sync, or None if never synced.
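
A harvester can use this value to pick between the `sync_mode` values accepted by `record_sync_status`. A sketch with unix seconds standing in for `DateTime<Utc>`, plus an assumed staleness cutoff that forces a full sync; both simplifications are illustrative:

```rust
/// Pick a harvest mode from the last successful sync time. Unix
/// seconds stand in for `DateTime<Utc>`; the returned strings match
/// the `sync_mode` values documented for `record_sync_status`. The
/// staleness cutoff is an assumed policy, not part of the trait.
fn choose_sync_mode(last_sync: Option<u64>, now: u64, max_age_secs: u64) -> &'static str {
    match last_sync {
        // Never synced: everything must be harvested.
        None => "full",
        // Last sync too old to trust for incremental harvesting.
        Some(t) if now.saturating_sub(t) > max_age_secs => "full",
        // Recent enough: fetch only datasets modified since `t`.
        Some(_) => "incremental",
    }
}
```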

fn record_sync_status( &self, portal_url: &str, sync_time: DateTime<Utc>, sync_mode: &str, sync_status: &str, datasets_synced: i32, ) -> impl Future<Output = Result<(), AppError>> + Send

Records a sync status for a portal.

Called after a harvest operation to update the sync status. The sync_status parameter indicates the outcome: “completed” or “cancelled”.

§Arguments
  • portal_url - The source portal URL
  • sync_time - The timestamp of this sync
  • sync_mode - Either “full” or “incremental”
  • sync_status - The outcome: “completed” or “cancelled”
  • datasets_synced - Number of datasets processed

fn health_check(&self) -> impl Future<Output = Result<(), AppError>> + Send

Checks database connectivity.

Performs a simple query to verify the database is reachable and responsive. Used by health check endpoints.

Dyn Compatibility§

This trait is not dyn compatible.

In older versions of Rust, dyn compatibility was called "object safety", so this trait is not object safe.

Implementors§