Function handle_streaming_request
pub async fn handle_streaming_request<B>(
    client: Client,
    model: &str,
    request: CreateChatCompletionRequest,
    _tenant_id_hash: u64,
    _context_hash: u64,
    _semantic_text: String,
) -> Result<Sse<impl Stream<Item = Result<Event, Infallible>> + Send + 'static>, GatewayError>
where
    B: BqSearchBackend + Clone + Send + Sync + 'static,

Handles streaming chat completion requests, bypassing the semantic cache.

§Cache Bypass Rationale

Streaming requests deliberately bypass the cache for several reasons:

  • Incremental delivery: Streaming responses are delivered chunk-by-chunk via SSE; the complete response does not exist until the stream ends, so traditional cache lookup/store semantics do not apply cleanly
  • Latency sensitivity: Streaming is typically chosen for real-time feedback; cache overhead would negate this benefit
  • Response variability: Partial responses and timing are inherently variable, making cache hit rates low and storage wasteful

§Type Parameter B

The generic parameter B: BqSearchBackend is retained for API consistency with non-streaming handlers (e.g., handle_chat_completion) even though it is not used in the function body. This allows callers to use a uniform handler signature and simplifies router configuration where both streaming and non-streaming paths share the same backend type.

§Future Work

Potential enhancements for streaming cache support:

  • Accumulated response caching: Store the fully accumulated response after stream completion (see _accumulated_content placeholder) for subsequent non-streaming lookups
  • Prefix caching: Cache partial responses to enable “continuation” semantics
  • Semantic deduplication: Detect duplicate streaming requests in-flight and fan-out a single upstream stream to multiple clients