```rust
pub async fn handle_streaming_request<B>(
    client: Client,
    model: &str,
    request: CreateChatCompletionRequest,
    _tenant_id_hash: u64,
    _context_hash: u64,
    _semantic_text: String,
) -> Result<Sse<impl Stream<Item = Result<Event, Infallible>> + Send + 'static>, GatewayError>
```
Handles streaming chat completion requests, bypassing the semantic cache.
§Cache Bypass Rationale
Streaming requests deliberately bypass the cache for several reasons:
- Incremental delivery: Streaming responses are delivered chunk-by-chunk via SSE, making traditional cache lookup/store semantics inefficient
- Latency sensitivity: Streaming is typically chosen for real-time feedback; cache overhead would negate this benefit
- Response variability: Partial responses and timing are inherently variable, making cache hit rates low and storage wasteful
§Type Parameter B
The generic parameter `B: BqSearchBackend` is retained for API consistency with
non-streaming handlers (e.g. `handle_chat_completion`) even though it is not used
in the function body. This allows callers to use a uniform handler signature and
simplifies router configuration where both streaming and non-streaming paths share
the same backend type.
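Rust permits type parameters on functions that the body never uses, which is what makes this zero-cost. A sketch of the pattern, with a hypothetical `BqSearchBackend` trait and `route` helper standing in for the real router wiring:

```rust
// Hypothetical stand-in for the gateway's BqSearchBackend trait.
trait BqSearchBackend {
    fn name() -> &'static str;
}

struct BigQueryBackend;
impl BqSearchBackend for BigQueryBackend {
    fn name() -> &'static str {
        "bigquery"
    }
}

// Non-streaming handler: actually uses B (here, just for its name).
fn handle_chat_completion<B: BqSearchBackend>(prompt: &str) -> String {
    format!("[{}] cached path: {prompt}", B::name())
}

// Streaming handler: B is intentionally unused. Rust allows unused type
// parameters on functions, so the signature stays uniform at no runtime cost.
fn handle_streaming_request<B: BqSearchBackend>(prompt: &str) -> String {
    format!("streamed: {prompt}")
}

// Both paths share one backend type, so router configuration needs only a
// single generic parameter.
fn route<B: BqSearchBackend>(stream: bool, prompt: &str) -> String {
    if stream {
        handle_streaming_request::<B>(prompt)
    } else {
        handle_chat_completion::<B>(prompt)
    }
}
```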
§Future Work
Potential enhancements for streaming cache support:
- Accumulated response caching: Store the fully accumulated response after stream completion (see the `_accumulated_content` placeholder) for subsequent non-streaming lookups
- Prefix caching: Cache partial responses to enable "continuation" semantics
- Semantic deduplication: Detect duplicate streaming requests in-flight and fan out a single upstream stream to multiple clients
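The accumulated-response idea above could be sketched as follows, with the stream simplified to an iterator of chunks and the cache to a plain map (`forward_and_accumulate` is a hypothetical helper, not part of the gateway's API):

```rust
use std::collections::HashMap;

/// Accumulate chunks as they are forwarded, then store the full response
/// once the stream completes for later non-streaming lookups.
fn forward_and_accumulate<I>(
    chunks: I,
    cache: &mut HashMap<String, String>,
    key: &str,
) -> String
where
    I: IntoIterator<Item = String>,
{
    let mut accumulated = String::new();
    for chunk in chunks {
        // The real handler would emit each chunk as an SSE event here
        // before appending it to the accumulator.
        accumulated.push_str(&chunk);
    }
    // Stream complete: persist the assembled response so an identical
    // non-streaming request can be served from cache.
    cache.insert(key.to_string(), accumulated.clone());
    accumulated
}
```

Note the trade-off: the store happens only after stream completion, so an aborted stream caches nothing, and the client's streamed latency is unaffected.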