suture-repair 0.1.2

Ultra-low latency reverse proxy that repairs truncated JSON in LLM streams.
Documentation

Suture

CI crates.io docs.rs License

Ultra-low-latency reverse proxy that repairs truncated and malformed JSON in LLM streaming responses, on the fly.

πŸ“ The story: Why your LLM tool calls silently break β€” and a ~10Β΅s fix

When an upstream LLM stream is cut off β€” by max_tokens, a context-window limit, or a dropped socket β€” the JSON it was emitting (a tool call's arguments, or structured-output content) is left unterminated, and your application throws JSONDecodeError / serde_json "EOF while parsing" errors. Suture sits between your app and the provider, watches the stream, and emits exactly the missing characters to make the reassembled JSON valid β€” without buffering the stream or adding meaningful latency.

The problem, concretely

A tool-call stream truncated at max_tokens leaves your client reassembling invalid JSON:

// what the client reassembles from the delta events:
{"city": "Par      // ← unterminated β†’ JSONDecodeError / serde_json: EOF while parsing a string

Suture closes it on the wire, so the client gets valid JSON instead β€” no SDK changes, no retry, no regenerated tokens:

{"city": "Par"}    // ← valid; the string and object are safely closed

Symptoms this fixes

You're in the right place if your LLM app has thrown any of these on a streaming response:

  • json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column …
  • json.decoder.JSONDecodeError: Expecting value: line 1 column … (char …)
  • serde_json::Error: EOF while parsing a string / EOF while parsing an object
  • pydantic_core.ValidationError on a truncated tool-call arguments
  • Tool / function-call arguments that won't parse when the model hits max_tokens
  • Truncated structured-output / JSON-mode content across streamed deltas

…on OpenAI, Anthropic, Google Vertex AI (Gemini / Claude), or AWS Bedrock.

Features

  • Repairs OpenAI (/v1/chat/completions), Anthropic (/v1/messages), GCP Vertex AI (Gemini + Claude-on-Vertex), and AWS Bedrock (ConverseStream) streaming responses.
  • SSE-aware β€” repairs the reassembled tool-call arguments / structured content accumulated across delta events, not just raw wire bytes.
  • Streaming + compressed β€” transparently decodes gzip/brotli/deflate, repairs, and re-encodes per the client's Accept-Encoding; never buffers the whole body. Added overhead is ~10 Β΅s per chunk.
  • Holds no credentials β€” your provider API key / bearer token is forwarded verbatim.
  • The byte-level repair engine is usable as a standalone library: cargo add suture-repair (then use suture::…), or just the engine via cargo add suture-repair-core (use suture_core::repair_str).

Quickstart

cargo install suture-repair    # installs the `suture` binary; or: docker build -t suture .
suture                         # listens on 127.0.0.1:8787

Point your SDK's base URL at Suture (your API key still flows through):

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1", api_key=os.environ["OPENAI_API_KEY"])

Routes: POST /v1/chat/completions β†’ OpenAI, POST /v1/messages β†’ Anthropic, POST /v1/projects/* β†’ Vertex, POST /model/* β†’ Bedrock (each when enabled), GET /health.

How it works

Three layers, each independently tested:

  • suture-core β€” a byte-level JSON repair state machine. Given any prefix of a valid JSON value, it computes the characters needed to close it (or reports that the input is inconsistent and should pass through untouched). No allocation beyond nesting depth.
  • suture-sse β€” an incremental SSE parser + per-provider extractors that reassemble the JSON-bearing field across delta events, drive the core engine, and synthesize a closing event at stream end (before the terminator).
  • suture β€” an axum/reqwest reverse proxy. Forwards your request verbatim, then on the response: text/event-stream is repaired via the SSE layer; a single application/json body is closed with the core engine; anything else streams through unchanged.

Configuration

Env var Default Purpose
SUTURE_LISTEN 127.0.0.1:8787 listen address
SUTURE_OPENAI_BASE https://api.openai.com OpenAI upstream
SUTURE_ANTHROPIC_BASE https://api.anthropic.com Anthropic upstream
SUTURE_VERTEX_ENABLED 0 enable the Vertex route (host derived from the path)
SUTURE_VERTEX_BASE β€” optional Vertex upstream override
SUTURE_BEDROCK_ENABLED 0 enable the Bedrock route (host from the validated Host header)
SUTURE_BEDROCK_BASE β€” optional Bedrock upstream override

Deployment

See deploy/ for a Dockerfile and Cloud Run, ECS/Fargate, and Kubernetes-sidecar manifests, plus operational notes (don't buffer the stream, TLS at the edge, health checks). The sidecar pattern (co-located, localhost) best matches the low-latency design.

Status

OpenAI, Anthropic, GCP Vertex AI, and AWS Bedrock (ConverseStream) are supported, with transparent compression handling. Bedrock uses credential-free SigV4 passthrough β€” the client signs for the real Bedrock host and Suture forwards verbatim, so Suture never sees a reusable AWS secret (the secret key never leaves the client; only a per-request signature transits).

License

Dual-licensed under either of MIT or Apache-2.0, at your option. Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you shall be dual-licensed as above, without any additional terms.