```rust
pub async fn handle_streaming_request<B>(
    client: Client,
    model: &str,
    request: CreateChatCompletionRequest,
    _tenant_id_hash: u64,
    _context_hash: u64,
    _semantic_text: String,
) -> Result<Sse<impl Stream<Item = Result<Event, Infallible>> + Send + 'static>, GatewayError>
```
Handles streaming chat completion requests, bypassing the semantic cache.
§Cache Bypass Rationale
Streaming requests deliberately bypass the cache for several reasons:
- Incremental delivery: Streaming responses are delivered chunk-by-chunk via SSE, making traditional cache lookup/store semantics inefficient
- Latency sensitivity: Streaming is typically chosen for real-time feedback; cache overhead would negate this benefit
- Response variability: Partial responses and timing are inherently variable, making cache hit rates low and storage wasteful
§Type Parameter B
The generic parameter `B: BqSearchBackend` is retained for API consistency with
non-streaming handlers (e.g. `handle_chat_completion`) even though it is not used
in the function body. This allows callers to use a uniform handler signature and
simplifies router configuration where both streaming and non-streaming paths share
the same backend type.
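Rust permits type parameters on functions that the body never uses, which is what makes this zero-cost. A sketch of the pattern, with a hypothetical `BqSearchBackend` trait and `route` helper standing in for the real router wiring:

```rust
// Hypothetical stand-in for the gateway's BqSearchBackend trait.
trait BqSearchBackend {
    fn name() -> &'static str;
}

struct BigQueryBackend;
impl BqSearchBackend for BigQueryBackend {
    fn name() -> &'static str {
        "bigquery"
    }
}

// Non-streaming handler: actually uses B (here, just for its name).
fn handle_chat_completion<B: BqSearchBackend>(prompt: &str) -> String {
    format!("[{}] cached path: {prompt}", B::name())
}

// Streaming handler: B is intentionally unused. Rust allows unused type
// parameters on functions, so the signature stays uniform at no runtime cost.
fn handle_streaming_request<B: BqSearchBackend>(prompt: &str) -> String {
    format!("streamed: {prompt}")
}

// Both paths share one backend type, so router configuration needs only a
// single generic parameter.
fn route<B: BqSearchBackend>(stream: bool, prompt: &str) -> String {
    if stream {
        handle_streaming_request::<B>(prompt)
    } else {
        handle_chat_completion::<B>(prompt)
    }
}
```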
§Future Work
Potential enhancements for streaming cache support:
- Accumulated response caching: Store the fully accumulated response after stream completion (see the `_accumulated_content` placeholder) for subsequent non-streaming lookups
- Prefix caching: Cache partial responses to enable "continuation" semantics
- Semantic deduplication: Detect duplicate streaming requests in-flight and fan out a single upstream stream to multiple clients
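The accumulated-response idea above could be sketched as follows, with the stream simplified to an iterator of chunks and the cache to a plain map (`forward_and_accumulate` is a hypothetical helper, not part of the gateway's API):

```rust
use std::collections::HashMap;

/// Accumulate chunks as they are forwarded, then store the full response
/// once the stream completes for later non-streaming lookups.
fn forward_and_accumulate<I>(
    chunks: I,
    cache: &mut HashMap<String, String>,
    key: &str,
) -> String
where
    I: IntoIterator<Item = String>,
{
    let mut accumulated = String::new();
    for chunk in chunks {
        // The real handler would emit each chunk as an SSE event here
        // before appending it to the accumulator.
        accumulated.push_str(&chunk);
    }
    // Stream complete: persist the assembled response so an identical
    // non-streaming request can be served from cache.
    cache.insert(key.to_string(), accumulated.clone());
    accumulated
}
```

Note the trade-off: the store happens only after stream completion, so an aborted stream caches nothing, and the client's streamed latency is unaffected.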