Struct IndexWriter

pub struct IndexWriter<D: DirectoryWriter + 'static> { /* private fields */ }

Async IndexWriter for adding documents and committing segments

Features:

- Queue-based parallel indexing with worker tasks
- Streams documents to disk immediately (no in-memory document storage)
- Uses string interning for terms (reduced allocations)
- Uses hashbrown HashMap (faster than BTreeMap)

Architecture:

- add_document() sends to per-worker unbounded channels (non-blocking)
- Round-robin distribution across workers, so there is no mutex contention
- Each worker owns a SegmentBuilder and flushes when its memory threshold is reached
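
The distribution scheme above can be sketched with the standard library alone. This is a minimal model, not the crate's implementation: `RoundRobinQueue` and `distribute` are hypothetical names, documents are plain strings, and OS threads with `std::sync::mpsc` channels stand in for the async worker tasks.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for the writer: one unbounded channel per worker.
struct RoundRobinQueue {
    senders: Vec<Sender<String>>,
    next: AtomicUsize,
}

impl RoundRobinQueue {
    /// Spawns `workers` threads; each returns how many documents it received.
    fn spawn(workers: usize) -> (Self, Vec<JoinHandle<usize>>) {
        let mut senders = Vec::with_capacity(workers);
        let mut handles = Vec::with_capacity(workers);
        for _ in 0..workers {
            let (tx, rx) = channel::<String>();
            senders.push(tx);
            // Each worker drains its own channel until every sender is dropped.
            handles.push(thread::spawn(move || rx.iter().count()));
        }
        (Self { senders, next: AtomicUsize::new(0) }, handles)
    }

    /// Non-blocking enqueue: an atomic counter picks the worker, no mutex involved.
    fn add(&self, doc: String) -> usize {
        let slot = self.next.fetch_add(1, Ordering::Relaxed) % self.senders.len();
        self.senders[slot].send(doc).expect("worker hung up");
        slot
    }
}

/// Sends `docs` documents to `workers` workers and reports per-worker counts.
fn distribute(workers: usize, docs: usize) -> Vec<usize> {
    let (queue, handles) = RoundRobinQueue::spawn(workers);
    for i in 0..docs {
        queue.add(format!("doc-{}", i));
    }
    drop(queue); // closes the channels so the workers can finish
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

In this sketch the only state shared between producers is the atomic counter, which is why no lock is needed on the enqueue path.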

State management:

- Building segments: Managed here (pending_builds)
- Committed segments + metadata: Managed by SegmentManager (sole owner of metadata.json)

Implementations

impl<D: DirectoryWriter + 'static> IndexWriter<D>

pub async fn build_vector_index(&self) -> Result<()>

Build vector index from accumulated Flat vectors (trains ONCE)

This trains centroids/codebooks from ALL vectors across all segments. Training happens only ONCE - subsequent calls are no-ops if already built.

Note: This is auto-triggered by commit() when threshold is crossed. You typically don’t need to call this manually.

The process:

1. Check if already built (skip if so)
2. Collect all vectors from all segments
3. Train centroids/codebooks based on schema’s index_type
4. Update metadata to mark as built (prevents re-training)
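
The four steps can be sketched as a small state machine. This is an illustrative model only: `VectorIndexState`, `train_centroid`, and `build_once` are hypothetical names, and "training" is reduced to computing one mean vector rather than real centroids or codebooks.

```rust
/// Hypothetical per-field state; the documentation names the states Flat and Built.
#[derive(Debug, Clone, PartialEq)]
enum VectorIndexState {
    Flat,
    Built { centroid: Vec<f32> },
}

/// Toy "training": the mean of all vectors, standing in for centroid/codebook training.
fn train_centroid(vectors: &[Vec<f32>]) -> Vec<f32> {
    let dim = vectors.first().map_or(0, |v| v.len());
    let mut mean = vec![0.0f32; dim];
    for v in vectors {
        for (m, x) in mean.iter_mut().zip(v) {
            *m += *x;
        }
    }
    let n = vectors.len().max(1) as f32;
    mean.iter_mut().for_each(|m| *m /= n);
    mean
}

/// Follows the four steps; returns true only when training actually ran.
fn build_once(state: &mut VectorIndexState, segments: &[Vec<Vec<f32>>]) -> bool {
    // 1. Skip if already built.
    if matches!(state, VectorIndexState::Built { .. }) {
        return false;
    }
    // 2. Collect all vectors from all segments.
    let all: Vec<Vec<f32>> = segments.iter().flatten().cloned().collect();
    // 3. Train (here: a single mean vector).
    let centroid = train_centroid(&all);
    // 4. Record the built state so later calls are no-ops.
    *state = VectorIndexState::Built { centroid };
    true
}
```

A second call to `build_once` returns `false` without touching the stored centroid, which mirrors the "subsequent calls are no-ops" behaviour described above.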

pub async fn total_vector_count(&self) -> usize

Get total vector count across all segments (for threshold checking)

pub async fn is_vector_index_built(&self, field: Field) -> bool

Check if vector index has been built for a field

pub async fn rebuild_vector_index(&self) -> Result<()>

Rebuild vector index by retraining centroids/codebooks

Use this when:

- Significant new data has been added and you want better centroids
- You want to change the number of clusters
- The vector distribution has changed significantly

This resets the state from Built back to Flat, then triggers a fresh training run.

impl<D: DirectoryWriter + 'static> IndexWriter<D>

pub async fn create( directory: D, schema: Schema, config: IndexConfig, ) -> Result<Self>

Create a new index in the directory

pub async fn create_with_config( directory: D, schema: Schema, config: IndexConfig, builder_config: SegmentBuilderConfig, ) -> Result<Self>

Create a new index with custom builder config

pub async fn open(directory: D, config: IndexConfig) -> Result<Self>

Open an existing index for writing

pub async fn open_with_config( directory: D, config: IndexConfig, builder_config: SegmentBuilderConfig, ) -> Result<Self>

Open an existing index with custom builder config

pub fn from_index(index: &Index<D>) -> Self

Create an IndexWriter from an existing Index

This shares the SegmentManager with the Index, ensuring consistent segment lifecycle management.

pub fn schema(&self) -> &Schema

Get the schema

pub fn set_tokenizer<T: Tokenizer>(&mut self, field: Field, tokenizer: T)

Set tokenizer for a field

pub fn add_document(&self, doc: Document) -> Result<DocId>

Add a document to the indexing queue

Documents are sent to per-worker unbounded channels. Sending is O(1) and never blocks, so the call returns immediately while workers handle the actual indexing in parallel.

pub fn add_documents(&self, documents: Vec<Document>) -> Result<usize>

Add multiple documents to the indexing queue

Documents are distributed round-robin to workers. The call returns immediately and never blocks.

pub fn pending_build_count(&self) -> usize

Get the number of pending background builds

pub fn pending_merge_count(&self) -> usize

Get the number of pending background merges

pub async fn maybe_merge(&self)

Check merge policy and spawn background merges if needed

This is called automatically via SegmentManager after segment builds complete. It can also be called manually to trigger a merge check.

pub async fn wait_for_merges(&self)

Wait for all pending merges to complete

pub fn tracker(&self) -> Arc<SegmentTracker>

Get the segment tracker for sharing with readers

This allows readers to acquire snapshots that prevent segment deletion.

pub async fn acquire_snapshot(&self) -> SegmentSnapshot<D>

Acquire a snapshot of current segments for reading

The snapshot holds references: segments won’t be deleted while the snapshot exists.
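
One way to picture how a snapshot keeps segments alive is reference counting with deferred deletion. The sketch below is an assumption about the general technique, not the crate's `SegmentTracker`: `Tracker`, `Snapshot`, `acquire`, and `request_delete` are hypothetical names, and segments are identified by plain strings.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};

/// Hypothetical tracker: reference counts per segment plus deferred deletions.
#[derive(Default)]
struct Tracker {
    refs: HashMap<String, usize>,
    pending_delete: HashSet<String>,
    deleted: Vec<String>,
}

/// Hypothetical snapshot: holds its references until it is dropped.
struct Snapshot {
    tracker: Arc<Mutex<Tracker>>,
    segments: Vec<String>,
}

/// Takes one reference on every listed segment.
fn acquire(tracker: &Arc<Mutex<Tracker>>, segments: &[&str]) -> Snapshot {
    let mut t = tracker.lock().unwrap();
    for s in segments {
        *t.refs.entry(s.to_string()).or_insert(0) += 1;
    }
    Snapshot {
        tracker: Arc::clone(tracker),
        segments: segments.iter().map(|s| s.to_string()).collect(),
    }
}

/// Deletes at once if unreferenced, otherwise defers. Returns true if deleted now.
fn request_delete(tracker: &Arc<Mutex<Tracker>>, segment: &str) -> bool {
    let mut t = tracker.lock().unwrap();
    if t.refs.get(segment).copied().unwrap_or(0) == 0 {
        t.deleted.push(segment.to_string());
        true
    } else {
        t.pending_delete.insert(segment.to_string());
        false
    }
}

impl Drop for Snapshot {
    /// Releases the references and completes any deletion that was deferred.
    fn drop(&mut self) {
        let mut t = self.tracker.lock().unwrap();
        for s in &self.segments {
            let remaining = {
                let count = t.refs.get_mut(s).expect("segment was referenced");
                *count -= 1;
                *count
            };
            if remaining == 0 {
                t.refs.remove(s);
                if t.pending_delete.remove(s) {
                    t.deleted.push(s.clone());
                }
            }
        }
    }
}
```

Tying the release to `Drop` means a reader cannot forget to give its references back; the deferred deletion runs as soon as the last snapshot goes out of scope.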

pub async fn cleanup_orphan_segments(&self) -> Result<usize>

Clean up orphan segment files that are not registered

This can happen if the process halts after segment files are written but before they are registered in segments.json. Call this after opening an index to reclaim disk space from incomplete operations.

Returns the number of orphan segments deleted.
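
Conceptually, orphan detection is a set difference between the segments found on disk and the segments that are registered. The sketch below assumes a hypothetical naming scheme in which a segment ID is the file name up to the first `.`; `segment_id` and `orphan_segments` are illustrative helpers, not part of the crate.

```rust
use std::collections::{BTreeSet, HashSet};

/// Segment ID = file name up to the first '.', an assumed naming scheme.
fn segment_id(file_name: &str) -> &str {
    file_name.split('.').next().unwrap_or(file_name)
}

/// Returns the IDs of segments that have files on disk but are not registered.
/// Several files may belong to one segment, so the result is deduplicated.
fn orphan_segments(files_on_disk: &[&str], registered: &HashSet<&str>) -> BTreeSet<String> {
    files_on_disk
        .iter()
        .map(|f| segment_id(f))
        .filter(|id| !registered.contains(id))
        .map(str::to_string)
        .collect()
}
```

The size of the returned set corresponds to the count of orphan segments (not files) that a cleanup pass would report.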

pub async fn flush(&self) -> Result<()>

Flush all workers - signals them to build their current segments

Sends flush signals to all workers and waits for them to acknowledge. Workers continue running and can accept new documents after flush.
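
The signal-and-acknowledge pattern described here can be sketched with channels. This is a simplified model under stated assumptions: `Msg`, `spawn_worker`, and `flush_all` are hypothetical names, OS threads replace the async workers, and "building a segment" is reduced to counting buffered documents.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical worker messages.
enum Msg {
    Doc(String),
    /// Flush request carrying a channel for the acknowledgement.
    Flush(Sender<usize>),
}

/// Spawns a worker that buffers documents and "builds a segment" on flush.
/// The thread returns the number of segments it built.
fn spawn_worker() -> (Sender<Msg>, JoinHandle<usize>) {
    let (tx, rx) = channel::<Msg>();
    let handle = thread::spawn(move || {
        let mut buffered: Vec<String> = Vec::new();
        let mut segments_built = 0usize;
        for msg in rx {
            match msg {
                Msg::Doc(text) => buffered.push(text),
                Msg::Flush(ack) => {
                    if !buffered.is_empty() {
                        segments_built += 1;
                    }
                    // Acknowledge with the number of flushed documents, then keep running.
                    let _ = ack.send(buffered.len());
                    buffered.clear();
                }
            }
        }
        segments_built
    });
    (tx, handle)
}

/// Sends a flush signal to every worker, then waits for all acknowledgements.
fn flush_all(workers: &[Sender<Msg>]) -> Vec<usize> {
    let acks: Vec<_> = workers
        .iter()
        .map(|w| {
            let (ack_tx, ack_rx) = channel();
            w.send(Msg::Flush(ack_tx)).expect("worker hung up");
            ack_rx
        })
        .collect();
    acks.into_iter()
        .map(|rx| rx.recv().expect("worker dropped the ack channel"))
        .collect()
}
```

Because the signals are all sent before any acknowledgement is awaited, the workers in this sketch flush concurrently, and each one stays alive to accept further documents afterwards.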

pub async fn commit(&self) -> Result<()>

Commit all pending segments to disk and wait for completion

This flushes workers and waits for ALL background builds to complete. It provides a durability guarantee: once it returns successfully, all data is persisted.

Auto-triggers vector index build when threshold is crossed for any field.

pub async fn force_merge(&self) -> Result<()>

Force merge all segments into one

Auto Trait Implementations

impl<D> !UnwindSafe for IndexWriter<D>

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
