llmsim 0.2.3 - Docs.rs

# Implementation Plan

## Phase 1: Project Foundation

### 1.1 Project Setup
- [x] Initialize single Cargo crate with library + binary:
  - Library: `llmsim` (src/lib.rs)
  - Binary: `llmsim` with `serve` subcommand (src/main.rs)
- [x] Configure `Cargo.toml` with metadata (name, version, license = "MIT", authors)
- [x] Add initial dependencies:
  - `tokio` (async runtime)
  - `axum` (HTTP framework)
  - `serde` / `serde_json` (serialization)
  - `tiktoken-rs` (token counting)
  - `rand` (latency randomization)
  - `tracing` (logging)
  - `clap` (CLI argument parsing)
- [x] Create basic CI workflow (`.github/workflows/ci.yml`): format, lint, test

### 1.2 Core Types
- [x] Define OpenAI API types in `src/openai/types.rs`:
  - `ChatCompletionRequest`
  - `ChatCompletionResponse`
  - `ChatCompletionChunk` (for streaming)
  - `Message`, `Role`, `Usage`
  - `ToolCall`, `Function`
- [x] Add serde derive macros with proper field naming (`#[serde(rename_all = "snake_case")]`)
- [x] Write unit tests for serialization/deserialization against real API examples

---

## Phase 2: Core Library

### 2.1 Token Counter
- [x] Create `src/tokens.rs`
- [x] Implement `count_tokens(text: &str, model: &str) -> usize`
- [x] Support model-to-encoding mapping (gpt-4, gpt-5, claude, etc.)
- [x] Add fallback for unknown models
- [x] Write tests with known token counts

### 2.2 Latency Profiles
- [x] Create `src/latency.rs`
- [x] Define `LatencyProfile` struct:
  ```rust
  pub struct LatencyProfile {
      pub ttft_mean_ms: u64,      // Time to first token
      pub ttft_stddev_ms: u64,
      pub tbt_mean_ms: u64,       // Time between tokens
      pub tbt_stddev_ms: u64,
  }
  ```
- [x] Implement preset profiles:
  - `LatencyProfile::gpt5()` - flagship model
  - `LatencyProfile::gpt5_mini()` - faster
  - `LatencyProfile::o_series()` - reasoning models (o3, o4)
  - `LatencyProfile::gpt4()` - GPT-4 family
  - `LatencyProfile::claude_opus()` - Anthropic flagship
  - `LatencyProfile::claude_sonnet()` - balanced
  - `LatencyProfile::instant()` - no delay (for fast tests)
- [x] Implement `LatencyProfile::sample_ttft()` and `sample_tbt()` using normal distribution
- [x] Write tests for distribution sanity

### 2.3 Response Generator
- [x] Create `src/generator.rs`
- [x] Implement `ResponseGenerator` trait:
  ```rust
  pub trait ResponseGenerator: Send + Sync {
      fn generate(&self, request: &ChatCompletionRequest) -> String;
  }
  ```
- [x] Implement `LoremGenerator` - generates lorem ipsum text
- [x] Implement `EchoGenerator` - echoes back the user message
- [x] Implement `FixedGenerator` - returns configured fixed response
- [x] Implement `RandomWordGenerator` - random words to target token count
- [x] Add configurable response length (target token count)

### 2.4 Streaming Engine
- [x] Create `src/stream.rs`
- [x] Implement `TokenStream` that yields `ChatCompletionChunk` with delays
- [x] Support SSE format (`data: {...}\n\n`)
- [x] Handle `[DONE]` termination message
- [x] Integrate with latency profiles for inter-token delays
- [x] Write integration tests

### 2.5 Error Injection
- [x] Create `src/errors.rs`
- [x] Define `ErrorConfig`:
  ```rust
  pub struct ErrorConfig {
      pub rate_limit_rate: f64,     // 0.0-1.0
      pub server_error_rate: f64,
      pub timeout_rate: f64,
      pub timeout_after_ms: u64,
  }
  ```
- [x] Implement error decision logic
- [x] Create proper OpenAI-format error responses
- [x] Write tests for error rate distribution

### 2.6 Rate Limiter (Optional for Phase 2)
- [ ] Create `src/rate_limit.rs`
- [ ] Implement token bucket algorithm
- [ ] Support requests-per-minute and tokens-per-minute limits
- [ ] Return proper 429 responses with `Retry-After` header

---

## Phase 3: Server CLI (`llmsim serve`)

### 3.1 Basic Server Setup
- [x] Create `src/main.rs` with clap subcommand structure
- [x] Implement `llmsim serve` subcommand with CLI options:
  - `--port` (default: 8080)
  - `--host` (default: 0.0.0.0)
  - `--config` (optional config file path)
  - `--generator` (lorem, echo, random, fixed:text)
  - `--target-tokens` (default: 100)
  - Note: Latency is auto-derived from model in each request
- [x] Create `src/cli/` module for server functionality
- [x] Set up Axum router with graceful shutdown
- [x] Add health check endpoint (`GET /health`)
- [x] Add tracing/logging setup

### 3.2 OpenAI Chat Completions Endpoint
- [x] Implement `POST /v1/chat/completions`
- [x] Parse `ChatCompletionRequest`
- [x] Handle `stream: true` vs `stream: false`
- [x] Return proper `ChatCompletionResponse` with usage
- [x] Implement SSE streaming response
- [x] Add request validation
- [ ] Write integration tests with reqwest

### 3.3 OpenAI Models Endpoint
- [x] Implement `GET /v1/models`
- [x] Return list of "available" models with metadata (GPT-5, o-series, Claude, etc.)
- [x] Implement `GET /v1/models/{model_id}`

### 3.4 Configuration
- [x] Create `src/cli/config.rs`
- [x] Support YAML config file:
  ```yaml
  server:
    port: 8080
    host: "0.0.0.0"

  latency:
    profile: "gpt5"  # or custom values

  response:
    generator: "lorem"
    target_tokens: 100

  errors:
    rate_limit_rate: 0.01
    server_error_rate: 0.001
  ```
- [x] CLI arguments override config file values
- [x] Validate configuration on startup

### 3.5 Docker Support
- [x] Create `Dockerfile` (multi-stage build)
- [ ] Create `docker-compose.yml` for easy local testing
- [ ] Document Docker usage in README

---

## Phase 4: Tool Calling Support

### 4.1 Function/Tool Definitions
- [ ] Extend types with `Tool`, `ToolChoice`, `FunctionCall`
- [ ] Parse tool definitions from request
- [ ] Validate tool call format

### 4.2 Tool Call Response Generation
- [ ] Implement `ToolCallGenerator`:
  - Random tool selection from available tools
  - Generate plausible arguments based on parameter schema
- [ ] Support `tool_choice: "auto"`, `"none"`, `{"type": "function", "function": {"name": "..."}}`
- [ ] Return proper `tool_calls` array in response

### 4.3 Multi-turn Tool Conversations
- [ ] Handle `role: "tool"` messages in conversation
- [ ] Track tool call IDs
- [ ] Generate appropriate follow-up responses

---

## Phase 5: Additional API Support

### 5.1 Anthropic Messages API
- [ ] Create `llmsim/src/anthropic/types.rs`
- [ ] Implement Anthropic message format
- [ ] Add `/v1/messages` endpoint
- [ ] Support Anthropic streaming format (different from OpenAI)
- [ ] Handle Anthropic-specific headers (`x-api-key`, `anthropic-version`)

### 5.2 OpenAI Responses API
- [x] Create Responses API specification (`specs/responses-api.md`)
- [x] Define Responses API types (`src/openai/responses.rs`)
  - [x] `ResponsesRequest` with input, model, instructions, etc.
  - [x] `ResponsesResponse` with output items, usage, status
  - [x] `InputItem` and `OutputItem` types
  - [x] Streaming chunk types for SSE
- [x] Implement `POST /v1/responses` endpoint
  - [x] Parse string or array input
  - [x] Generate simulated response with output items
  - [x] Return token usage statistics
- [x] Implement streaming for Responses API
  - [x] SSE event types: response.created, response.output_text.delta, etc.
  - [x] Proper sequence numbering for deltas
- [x] Add examples for Responses API
  - [x] Python example (`examples/responses_client.py`)
  - [x] Rust example (`examples/responses_usage.rs`)

### 5.3 OpenAI Assistants API (Threads)
- [ ] Implement `/v1/threads` endpoints
- [ ] Implement `/v1/threads/{thread_id}/messages`
- [ ] Implement `/v1/threads/{thread_id}/runs`
- [ ] Support run streaming

### 5.4 Google Gemini API
- [ ] Create `llmsim/src/gemini/types.rs`
- [ ] Implement Gemini message format
- [ ] Add `/v1beta/models/{model}:generateContent` endpoint
- [ ] Add `/v1beta/models/{model}:streamGenerateContent` endpoint

---

## Phase 6: Advanced Features

### 6.1 Response Mocking
- [ ] Create mock configuration format:
  ```yaml
  mocks:
    - match:
        content_contains: "weather"
      response:
        content: "The weather is sunny and 72°F."
    - match:
        model: "gpt-4"
        system_contains: "json"
      response:
        content: '{"result": "mocked"}'
  ```
- [ ] Implement pattern matching engine
- [ ] Support regex patterns
- [ ] Add mock priority/ordering

### 6.2 Real-time Stats & TUI Dashboard
- [x] Create `src/stats.rs` with thread-safe atomic counters:
  - Request metrics: total, active, streaming, non-streaming
  - Token metrics: prompt, completion, total
  - Error tracking by status code (429, 5xx, 504)
  - Latency: average, min, max
  - Per-model request counts
  - RPS: rolling 60-second window calculation
- [x] Add stats endpoint (`GET /llmsim/stats`) returning JSON snapshot
- [x] Create `src/tui/` module with Ratatui dashboard:
  - `app.rs`: Event loop, state management, HTTP polling
  - `ui.rs`: Widget layout and rendering
- [x] TUI features:
  - Real-time updating (200ms refresh)
  - Request and token statistics panels
  - Latency and error metrics
  - RPS and token rate sparkline charts
  - Model distribution bar chart
  - Keyboard controls (q=quit, r=refresh)
- [x] Add `--tui` flag to `llmsim serve` command
- [x] Add `on_complete` callback to TokenStreamBuilder for streaming stats

### 6.3 Metrics & Observability (Prometheus)
- [ ] Add Prometheus metrics endpoint (`/metrics`)
- [ ] Track:
  - Request count by endpoint and model
  - Response latency histograms
  - Token counts (input/output)
  - Error rates
  - Active connections
- [ ] Add structured logging with request IDs

### 6.4 Record/Replay Mode
- [ ] Implement proxy mode to real APIs
- [ ] Record requests/responses to file
- [ ] Replay recorded sessions
- [ ] Anonymize sensitive data in recordings

---

## Phase 7: Polish & Release

### 7.1 Documentation
- [ ] Write comprehensive README.md:
  - Quick start
  - Installation (cargo, binary, Docker)
  - Configuration reference
  - API compatibility matrix
  - Examples
- [ ] Add `docs/` folder with detailed guides
- [ ] Generate API documentation with rustdoc

### 7.2 Testing & Quality
- [ ] Achieve 80%+ code coverage
- [ ] Add load tests using k6 or similar
- [ ] Test against real client libraries (openai-python, anthropic-sdk)
- [ ] Fuzz testing for parser robustness

### 7.3 Release
- [ ] Set up GitHub releases with binaries (Linux, macOS, Windows)
- [ ] Publish to crates.io
- [ ] Create Homebrew formula
- [ ] Announce on relevant communities

---

## Milestone Summary

| Milestone | Description | Target Deliverable |
|-----------|-------------|-------------------|
| M1 | Foundation | Compiling workspace with types |
| M2 | Core Library | Token counting, latency, generators work |
| M3 | Basic Server | OpenAI chat completions endpoint works |
| M4 | Tool Calling | Function calling support |
| M5 | Multi-API | Anthropic + Gemini support |
| M6 | Advanced | Mocking, metrics, record/replay |
| M7 | Release | Published, documented, tested |

---

## Technical Decisions Log

Document significant technical decisions here as implementation progresses:

1. **Axum over Actix-web**: Axum is simpler, well-integrated with tokio, and has good streaming support
2. **tiktoken-rs**: Direct port of OpenAI's tokenizer, ensures accurate token counts
3. **YAML for config**: More readable than JSON, better for complex configurations
4. **Single crate with lib + bin**: Simpler structure with `llmsim` as library and `llmsim serve` as CLI subcommand. Avoids workspace complexity while still exposing library for programmatic use
5. **Clap subcommands**: Using `llmsim serve` pattern allows future expansion with additional commands (e.g., `llmsim mock`, `llmsim record`)
6. **Model list from models.dev**: GPT-5 family, o-series reasoning models, and Claude models based on current production models
7. **Ratatui for TUI**: Leading Rust TUI library (fork of tui-rs), used by Codex CLI. Provides sub-millisecond rendering, rich widgets (sparklines, bar charts), and good documentation
8. **Atomic counters for stats**: Thread-safe statistics using `std::sync::atomic` with relaxed ordering for minimal contention
9. **Stats under /llmsim prefix**: Keep LLMSim-specific endpoints separate from OpenAI-compatible /v1 routes
10. **TUI as --tui flag**: Integrate dashboard into serve command rather than separate subcommand for simpler UX