oasysdb 0.7.3 - Docs.rs

# Database

The Database is the primary interface for interacting with OasysDB. It is
responsible for managing the connection to the SQL database and vector indices.

These are the most notable operations that can be performed with the Database:

- Creating a new index.
- Refreshing an existing index.
- Searching for vectors in an index.
- Deleting an index.

## Create Index

This method creates a new index in the database. The initial data for the index
will be loaded from the SQL table defined in the source configuration parameter.

### Parameters

- **name**: Name of the new index.
- **algorithm**: Vector indexing algorithm to use in the index.
- **source**: Source configuration for the index.

### Custom Index Parameters

When specifying the indexing algorithm, we can also pass custom parameters
specific to the algorithm. For example, if we are using the IVFPQ algorithm, we
can configure the number of centroids and the number of sub-quantizers like:

```json
{
  "centroids": 512,
  "max_iterations": 100,
  "sub_centroids": 256,
  ...
}
```

For more information about the available parameters for each algorithm, please
refer to each algorithm's documentation.

### Source Configuration

The source configuration defines how the data will be loaded from the SQL
database to create and refresh the index. For example, if we store our vectors
in a table called _embeddings_ and in a column called _vector_, we can define
the source configuration like:

```json
{
  "table": "embeddings",
  "primary_key": "id",
  "vector": "vector"
  ...
}
```

!!! danger "Primary Key Requirement"

    The primary key must be unique and not null with auto-incrementing integer
    as its type. This allows OasysDB to incrementally load the data from the
    table when refreshing the index.

!!! danger "Vector Column Requirement"

    The vector must be stored in either JSON (Recommended) or blob column data
    type. Without this, OasysDB won't be able to load the vectors from the
    source table.

### Source Metadata (Optional)

In OasysDB, we can also store metadata along with the vectors directly in the
index which is very useful to eliminate post-search queries to the SQL database.
For example, if we have the following table in SQLite:

```sql
CREATE TABLE articles (
  id INTEGER PRIMARY KEY,
  vector JSON NOT NULL,
  content TEXT NOT NULL
);
```

We can define the source configuration to store the content in the index:

```json
{
  "table": "articles",
  "primary_key": "id",
  "vector": "vector",
  "metadata": ["content"]
  ...
}
```

When we search the index later on, the metadata will be included in the search
results allowing us to use the data right away without querying our SQL database
for the metadata.

!!! info "Metadata Limitation"

    The metadata is limited to primitive data types like integer, float, string,
    and boolean. It's also worth noting that the number and size of the metadata
    will affect the overall memory usage of the index.

    Don't overuse it 😁

### Source Filter (Optional)

In the source configuration, we can also define an optional SQL filter to load
only a subset of our data for the index. This filtering will also apply when
refreshing the index incrementally.

Let's say that we have a SQLite table with the schema below:

```sql
CREATE TABLE articles (
  id INTEGER PRIMARY KEY,
  vector JSON NOT NULL,
  content TEXT,
  year INTEGER,
);
```

We can add a SQL filter to only load the articles from the year 2021:

```json
{
  "table": "articles",
  "primary_key": "id",
  "vector": "vector",
  "filter": "year = 2021" // Exclude WHERE keyword
  ...
}
```

!!! warning "SQL Injection Risk"

    Be careful not to use user input directly in the SQL filter as this can
    lead to SQL injection attacks. Always sanitize the input before using it
    in the filter.

## Refresh Index

This method updates an existing index with the latest data from the SQL table.
Under the hood, OasysDB will query the source table from the last primary key
inserted and insert the new data to the index incrementally.

Incremental insertion is very crucial here because it allows us to insert an
individual record to the index without rebuilding the entire index which can be
very slow.

### Parameters

- **name**: Name of the index to refresh.

!!! tip "Asynchronous Refresh"

    The refresh operation can be performed asynchronously. This allows us to
    refresh the index in the background and/or periodically without blocking
    the main thread.

## Search Index

This method performs a nearest neighbor search in the index and returns _K_
search results based on the query vector. The search results will include the
primary key, distance between the query vector and the result vector, and
optional metadata if defined in the source configuration.

In JSON format, the search results will look like:

```json
[
  {
    "id": 1,
    "distance": 0.123,
    "data": {
      "content": "OasysDB is awesome!"
    }
  },
  ...
]
```

### Parameters

- **name**: Name of the index to search.
- **query**: Query vector for the nearest neighbor search.
- **k**: Number of results to return.
- **filters**: Optional SQL-like filter to apply to the search results.

### Post-filtering (Optional)

When searching the index, we can additionally apply post-filtering to the search
operation against the metadata stored in the index.

Let's say that we have the following setup for our index:

=== "SQLite Table"

    ```sql
    CREATE TABLE articles (
      id INTEGER PRIMARY KEY,
      vector JSON NOT NULL,
      year INTEGER
    );
    ```

=== "Source Configuration"

    ```json
    {
      "table": "articles",
      "primary_key": "id",
      "vector": "vector",
      "metadata": ["year"]
    }
    ```

Since we have the year metadata stored in the index, we can apply post-filtering
to the search operation by adding a filter string to the filters parameter:

```json
{
  "name": "index",
  "query": [0.1, 0.2, 0.3, ...],
  "k": 10,
  "filters": "year = 2021" // SQL-like filtering
}
```

This operation will only return the search results where the year metadata is
equal to 2021. There are also other operators we can use for the filtering and
these are the supported operators with their compatible metadata types:

| Operator | Description           | Metadata Type  |
| -------- | --------------------- | -------------- |
| =        | Equal                 | All            |
| !=       | Not Equal             | All            |
| <        | Less Than             | Integer, Float |
| <=       | Less Than or Equal    | Integer, Float |
| >        | Greater Than          | Integer, Float |
| >=       | Greater Than or Equal | Integer, Float |
| CONTAINS | Contains              | String         |

These operators can also be combined with the **AND** or **OR** logical
operators to create more complex filtering conditions. However, we can only use
one type of join operator at a time. For example:

```json
{
  ...
  "filters": "year >= 2020 AND year <= 2022"
}
```

!!! note "Filtering Limitation"

    The filtering is limited to the metadata stored in the index. If we add a
    filter with a column that is not included in the metadata, the search
    operation will return an empty result since none of the metadata matches
    the filter.

## Delete Index

This method deletes an existing index from the database and automatically
releases the index from the indices pool if it's loaded.

Since by default, the index is persisted on disk, deleting the index will also
remove the index file from the disk. This operation is useful when we want to
free up the disk space by removing indices that are no longer needed.

**This operation is irreversible!**

### Parameters

- **name**: Name of the index to delete.

## Indices Pool

The Database also contains indices pool to manage multiple indices in-memory.
This is useful when we have multiple indices we frequently use allowing us to
avoid the overhead of loading the index from disk which can be slow.

By default, performing any operation related to an index like search or refresh
will load the index to the pool. If we want to release the index from the pool,
we can use the `release_indices` method.

## Advanced Data Operations

The Database also provides advanced operations to manage the indices directly
without the SQL layer. These operations are useful when we want to perform
low-level operations updating the records directly in the index data store.

These are the available advanced operations:

- Insert records into an index.
- Update records in an index.
- Delete records from an index by their IDs.

!!! warning "Low-level Operations"

    Please note that these operations are not recommended for general use since
    they bypass the SQL layer and can lead to inconsistencies between the SQL
    data source and the index which can lead to unexpected behavior.