Crate qdrant_datafusion


§🛸 Qdrant DataFusion Integration

A high-performance Rust library that integrates the Qdrant vector database with Apache DataFusion. It enables SQL queries over vector data, with support for heterogeneous collections, complex projections, and mixed vector types.


§🎯 Features

§Complete Vector Support

  • Dense Vectors: Single embeddings as List<Float32>
  • Multi-Vectors: Multiple embeddings per point as List<List<Float32>>
  • Sparse Vectors: Efficient sparse representations with separate indices and values
  • Mixed Collections: Supports collections with different vector types

§Advanced Query Capabilities

  • SQL Interface: Query Qdrant collections using standard SQL syntax
  • Schema Projection: Optimized queries that only fetch requested fields
  • Heterogeneous Data: Handle points with different vector field subsets
  • Nullable Fields: Proper null handling for missing vector data
  • LIMIT Support: Efficient query limiting pushed to Qdrant

§High Performance Architecture

  • Schema-Driven: Clean, efficient deserialization with O(F) work per point, where F is the number of fields in the schema
  • Single-Pass Processing: Minimized memory allocations and data copying
  • Async Streaming: Non-blocking query execution with proper backpressure
  • Connection Pooling: Reusable client connections for optimal throughput

§Production Ready

  • > 90% Test Coverage: Comprehensive testing with real Qdrant instances
  • Memory Safe: Full Rust safety guarantees with zero unsafe code
  • Error Handling: Detailed error types with context for debugging
  • Extensible: Ready for custom UDFs and advanced query planning

§🚀 Quick Start

Add this to your Cargo.toml:

[dependencies]
qdrant-datafusion = "0.1"

§Basic Usage

use std::sync::Arc; // needed for Arc::new(table_provider) below

use qdrant_datafusion::prelude::*;
use qdrant_client::Qdrant;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Connect to Qdrant
    let client = Qdrant::from_url("http://localhost:6334").build()?;

    // Create DataFusion table provider
    let table_provider = QdrantTableProvider::try_new(client, "my_collection").await?;

    // Register with DataFusion context
    let ctx = SessionContext::new();
    ctx.register_table("vectors", Arc::new(table_provider))?;

    // Query with SQL!
    let df = ctx.sql("
        SELECT id, payload, embedding
        FROM vectors
        WHERE id IN ('doc1', 'doc2')
        LIMIT 10
    ").await?;

    let results = df.collect().await?;
    println!("{:?}", results);

    Ok(())
}

§Advanced Queries

// Complex projections with mixed vector types
let df = ctx.sql("
    SELECT
        id,
        text_embedding,
        image_embedding,
        multi_embeddings,
        keywords_indices,
        keywords_values
    FROM mixed_vectors
    WHERE payload IS NOT NULL
").await?;

// Efficient schema projection - only fetches requested vector fields
let df = ctx.sql("SELECT text_embedding FROM vectors").await?;
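
One way to picture the projection optimization: from the columns a query requests, work out which named vectors need to be fetched at all. The following is a minimal std-only sketch, not the crate's actual code. `FetchPlan`, `plan_fetch`, and the `_indices`/`_values` suffix convention for sparse columns are assumptions inferred from the column names used in this document.

```rust
use std::collections::BTreeSet;

/// Hypothetical description of what must be requested from Qdrant.
#[derive(Debug, PartialEq)]
pub struct FetchPlan {
    pub with_payload: bool,
    pub vector_names: BTreeSet<String>,
}

/// Derive a fetch plan from the projected column names.
/// Assumes `id` and `payload` are fixed columns and that a sparse vector `x`
/// is exposed as the two columns `x_indices` and `x_values`.
pub fn plan_fetch(projected: &[&str], sparse_vectors: &[&str]) -> FetchPlan {
    let mut plan = FetchPlan {
        with_payload: false,
        vector_names: BTreeSet::new(),
    };
    for col in projected {
        match *col {
            // The point id is always available; nothing extra to fetch.
            "id" => {}
            "payload" => plan.with_payload = true,
            other => {
                // Map a sparse component column back to its vector name.
                let base = other
                    .strip_suffix("_indices")
                    .or_else(|| other.strip_suffix("_values"))
                    .filter(|b| sparse_vectors.contains(b))
                    .unwrap_or(other);
                plan.vector_names.insert(base.to_string());
            }
        }
    }
    plan
}
```

With this sketch, a query that selects only `text_embedding` yields a plan with a single vector name and no payload, so nothing else would need to cross the wire.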

§📊 Vector Type Support

| Vector Type | Schema | Description | Example Query |
|---|---|---|---|
| Dense | List<Float32> | Single embedding per field | SELECT text_embedding FROM docs |
| Multi | List<List<Float32>> | Multiple embeddings per field | SELECT multi_embeddings FROM docs |
| Sparse | List<UInt32> + List<Float32> | Efficient sparse vectors | SELECT keywords_indices, keywords_values FROM docs |
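
The mapping from vector kind to columns can be sketched with plain Rust types. This is an illustrative std-only model (the crate itself produces Arrow arrays); `VectorValue`, `Cell`, and `to_columns` are hypothetical names, and the `_indices`/`_values` suffixes follow the example column names in this document.

```rust
/// Hypothetical model of the three vector kinds.
#[derive(Debug, Clone, PartialEq)]
pub enum VectorValue {
    Dense(Vec<f32>),
    Multi(Vec<Vec<f32>>),
    Sparse { indices: Vec<u32>, values: Vec<f32> },
}

/// Plain-Rust stand-ins for the column cell types.
#[derive(Debug, Clone, PartialEq)]
pub enum Cell {
    ListF32(Vec<f32>),
    ListListF32(Vec<Vec<f32>>),
    ListU32(Vec<u32>),
}

/// Expand one named vector into (column name, cell) pairs.
/// Dense and multi vectors map to one column; sparse vectors map to two.
pub fn to_columns(name: &str, v: VectorValue) -> Vec<(String, Cell)> {
    match v {
        VectorValue::Dense(d) => vec![(name.to_string(), Cell::ListF32(d))],
        VectorValue::Multi(m) => vec![(name.to_string(), Cell::ListListF32(m))],
        VectorValue::Sparse { indices, values } => vec![
            (format!("{}_indices", name), Cell::ListU32(indices)),
            (format!("{}_values", name), Cell::ListF32(values)),
        ],
    }
}
```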

§🔧 Collection Types

§Named Collections (Heterogeneous)

Collections with multiple named vector fields where different points can have different subsets:

-- Schema automatically includes all possible vector fields
SELECT
    id,
    text_embedding,      -- Some points have this
    image_embedding,     -- Some points have this
    audio_embedding      -- Some points have this
FROM heterogeneous_collection;
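
To illustrate how a schema can cover every possible vector field while individual points carry only some of them, here is a small std-only sketch. It is not the crate's implementation: `Point`, `infer_vector_fields`, and `to_nullable_columns` are hypothetical, and only dense vectors are modelled. The schema is the union of field names, and a missing vector becomes a null (`None`).

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A point carrying only a subset of the named dense vectors (hypothetical model).
pub struct Point {
    pub id: String,
    pub vectors: BTreeMap<String, Vec<f32>>,
}

/// Union of all vector names seen across the points.
pub fn infer_vector_fields(points: &[Point]) -> BTreeSet<String> {
    points
        .iter()
        .flat_map(|p| p.vectors.keys().cloned())
        .collect()
}

/// Build one nullable column per field; missing vectors become `None`.
pub fn to_nullable_columns(points: &[Point]) -> BTreeMap<String, Vec<Option<Vec<f32>>>> {
    infer_vector_fields(points)
        .into_iter()
        .map(|field| {
            let column: Vec<Option<Vec<f32>>> = points
                .iter()
                .map(|p| p.vectors.get(&field).cloned())
                .collect();
            (field, column)
        })
        .collect()
}
```

Every column has one entry per point, so rows stay aligned even when points carry different subsets of vectors.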

§Unnamed Collections (Homogeneous)

Collections with a single unnamed vector field:

-- Schema contains single 'vector' field
SELECT id, payload, vector
FROM homogeneous_collection;

§🎯 Current Capabilities

✅ Complete TableProvider Implementation

  • Full SQL querying via DataFusion
  • All Qdrant vector types supported
  • Schema projection optimization
  • Proper null handling for missing fields

✅ Production Ready

  • Over 90% test coverage with real Qdrant instances
  • Comprehensive error handling
  • Memory-safe Rust implementation
  • Async streaming execution

§🔮 Future Roadmap

🔄 In Development

  • Custom UDFs: Distance functions, similarity search, recommendations, and more
  • Query Planning: Qdrant-specific optimizations and filter pushdown
  • Advanced Filters: Native Qdrant filter integration with SQL WHERE clauses

🎯 Planned

  • Multi-Database Joins: Join Qdrant data with other DataFusion sources
  • Vector Search UDFs: Functions such as similarity(), recommend(), and discover()
  • Extension Nodes: Custom physical plan nodes for complex vector operations

§🧪 Testing

Run the test suite with a real Qdrant instance:

# Start Qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

# Run tests
cargo test --features test-utils

# Check coverage
just coverage

ยง๐Ÿ—๏ธ Architecture

ยงSchema-Driven Design

Built around a schema-driven architecture that reduces complex matching logic and leaves room for future expansion:

// Schema defines extractors upfront
enum FieldExtractor {
    Id(StringBuilder),
    Payload(StringBuilder),
    DenseVector { name: String, builder: ListBuilder<Float32Builder> },
    MultiVector { name: String, builder: ListBuilder<ListBuilder<Float32Builder>> },
    SparseIndices { name: String, builder: ListBuilder<UInt32Builder> },
    SparseValues { name: String, builder: ListBuilder<Float32Builder> },
}

// Single pass processing with owned iteration
pub fn append_point(&mut self, point: ScoredPoint) {
    let ScoredPoint { id, payload, vectors, .. } = point;
    let vector_lookup = build_vector_lookup(vectors);

    for extractor in &mut self.field_extractors {
        // All logic inline - no hidden abstractions
    }
}
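
To make the single-pass idea concrete, the following is a simplified std-only sketch of the same shape. It is an illustration under stated assumptions, not the crate's code: plain `Vec`s stand in for the Arrow builders, only dense vectors are handled, and `SimplePoint` and `Batch` are hypothetical types.

```rust
use std::collections::HashMap;

/// Simplified point: id, optional JSON payload text, named dense vectors.
pub struct SimplePoint {
    pub id: String,
    pub payload: Option<String>,
    pub vectors: Vec<(String, Vec<f32>)>,
}

/// One extractor per output column, decided once from the schema.
/// Plain `Vec`s stand in for the Arrow builders.
pub enum FieldExtractor {
    Id(Vec<String>),
    Payload(Vec<Option<String>>),
    DenseVector { name: String, values: Vec<Option<Vec<f32>>> },
}

pub struct Batch {
    pub field_extractors: Vec<FieldExtractor>,
}

impl Batch {
    /// Consume the point once and append exactly one value to every column.
    pub fn append_point(&mut self, point: SimplePoint) {
        let SimplePoint { id, mut payload, vectors } = point;
        // Owned lookup so each vector can be moved out without cloning.
        let mut lookup: HashMap<String, Vec<f32>> = vectors.into_iter().collect();
        for extractor in &mut self.field_extractors {
            match extractor {
                FieldExtractor::Id(col) => col.push(id.clone()),
                // Assumes at most one payload column in the schema.
                FieldExtractor::Payload(col) => col.push(payload.take()),
                FieldExtractor::DenseVector { name, values } => {
                    // Vectors absent from this point become nulls (`None`).
                    values.push(lookup.remove(name.as_str()));
                }
            }
        }
    }
}
```

Because the extractor list is fixed by the schema, the per-point cost is one loop over the fields, and columns that were not projected never get an extractor at all.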

§🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

§📝 License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

§Modules

  • arrow: Serialization and deserialization utilities that convert Qdrant data types to Arrow data types and Qdrant values to Arrow RecordBatches.
  • error
  • prelude: Convenience exports for working with the library.
  • stream: Generates a stream of Qdrant points as Arrow RecordBatches.
  • table: DataFusion TableProvider implementation for Qdrant vector database collections.
  • test_utils
  • udfs: datafusion-functions-json and other functions relevant for Qdrant.
  • utils: Various utility functions for working with schema and data.