# ADR-001: HTTP Gateway Design

**Status:** Accepted
**Date:** 2026-03-20

## Context

Hoosh needs to route LLM requests across 14+ provider backends (local and remote).
The question is whether hoosh should be:

1. A library with in-process provider bindings (like litellm/CrewAI)
2. An HTTP gateway server with an OpenAI-compatible API

## Decision

Hoosh is an HTTP gateway server.
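
To make the interface concrete, here is a minimal consumer call in Rust. Only the OpenAI-style `/v1/chat/completions` path follows from the compatibility goal stated in the context above; the base URL, port, and model id are placeholder assumptions, not hoosh defaults.

```rust
// Sketch of a consumer calling the gateway's OpenAI-compatible API.
// Assumptions: localhost:8080 as the gateway address and "llama3" as a
// model id are hypothetical; only the endpoint shape is OpenAI-style.
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    let resp = client
        .post("http://localhost:8080/v1/chat/completions") // hypothetical address
        .json(&json!({
            "model": "llama3", // hypothetical model id
            "messages": [{ "role": "user", "content": "Hello" }]
        }))
        .send()
        .await?;
    println!("{}", resp.text().await?);
    Ok(())
}
```

Because the surface is plain HTTP plus JSON, the same request works from curl, Python, or any other language, which is what the rationale below relies on.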

## Rationale

- **Provider isolation**: Backend crashes don't take down the consumer process
- **Language agnostic**: Any language can call the HTTP API, not just Rust
- **Deployment flexibility**: Run hoosh on a GPU box, consumers on CPU-only machines
- **Observability**: Centralized logging, metrics, and token budget tracking
- **Configuration**: Hot-reloadable TOML config without recompiling consumers (sketched below)
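
As a rough illustration of that last point, the provider config might be a TOML file that the gateway re-parses whenever it changes. The schema below is a guess for illustration only, not hoosh's actual format:

```rust
// Hedged sketch of a hot-reloadable provider config. Every field name
// here is an assumption; hoosh's real schema may differ.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Config {
    providers: Vec<Provider>,
}

#[derive(Debug, Deserialize)]
struct Provider {
    name: String,
    base_url: String,
    token_budget: Option<u64>, // hypothetical per-provider budget cap
}

fn main() -> Result<(), toml::de::Error> {
    // In a running gateway this string would come from a watched file;
    // re-parsing on a change notification gives hot reload without
    // recompiling or restarting any consumer.
    let raw = r#"
        [[providers]]
        name = "ollama-local"
        base_url = "http://127.0.0.1:11434"

        [[providers]]
        name = "remote-api"
        base_url = "https://api.example.com"
        token_budget = 1000000
    "#;
    let cfg: Config = toml::from_str(raw)?;
    println!("{cfg:#?}");
    Ok(())
}
```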

## Consequences

- **Latency overhead**: Extra HTTP hop between consumer and backend. Mitigated with (see the client sketch after this list):
  - TCP_NODELAY (disables Nagle's algorithm, whose interaction with delayed ACKs can stall small writes by ~40ms)
  - Connection pooling (52µs warm vs 151µs cold)
  - HTTP/2 multiplexing
  - Measured result: net gateway overhead of ~0ms, within the noise of direct Ollama calls
- **Deployment complexity**: Consumers need a running hoosh instance. Mitigated by the `HooshClient` crate.
- **Not in-process**: Cannot avoid serialization overhead. Acceptable because inference time (200ms+) dominates serialization (820ns).
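
The client sketch referenced above, using `reqwest` as a stand-in HTTP client (hoosh's internal client setup may differ); each builder option maps to one of the listed mitigations:

```rust
// Sketch of the latency mitigations as reqwest builder options. These
// are real reqwest APIs; the specific pool sizes and timeouts are
// illustrative assumptions, not tuned hoosh values.
use std::time::Duration;

fn build_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .tcp_nodelay(true)                          // disable Nagle batching
        .pool_idle_timeout(Duration::from_secs(90)) // keep warm connections around
        .pool_max_idle_per_host(8)                  // reuse instead of re-handshaking
        .http2_prior_knowledge()                    // multiplex requests on one connection
        .build()
}
```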

## Alternatives Considered

- **In-process library**: Would eliminate HTTP overhead but couples consumers to Rust, prevents backend isolation, and complicates multi-language support.
- **Unix socket**: Faster than TCP for local communication but less portable. May be added later as an optimization.