codetether-agent 1.0.4

# Image Input Support Requirements

Based on analysis of the opencode TypeScript implementation, this document outlines requirements for adding image/multimodal input support to CodeTether Agent.

## Overview

Opencode handles images through a multimodal content system that supports:
- Base64-encoded inline images
- External image URLs
- File attachments with MIME type detection
- Provider-specific optimizations

## Current State (CodeTether)

The `ContentPart` enum in `src/provider/mod.rs` already has basic image support:

```rust
pub enum ContentPart {
    Text { text: String },
    Image {
        url: String,
        mime_type: Option<String>,
    },
    File {
        path: String,
        mime_type: Option<String>,
    },
    // ...
}
```

However, these variants are not yet handled end to end: provider conversion, CLI input, TUI input, and session persistence all still need work.

---

## Requirements

### 1. Data Model Enhancements

#### 1.1 Enhanced Image ContentPart
Extend the `Image` variant to support external URLs, inline base64 data, and provider file IDs:

```rust
pub enum ImageData {
    Url(String),           // External URL (https://...)
    Base64 {              // Inline base64-encoded image
        data: String,
        media_type: String, // e.g., "image/jpeg", "image/png"
    },
    FileId(String),       // Provider-specific file ID (OpenAI file API)
}

pub enum ContentPart {
    Text { text: String },
    Image {
        data: ImageData,
        detail: Option<ImageDetail>, // low, high, auto
        filename: Option<String>,
    },
    // ...
}

pub enum ImageDetail {
    Low,
    High,
    Auto,
}
```
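
The `detail` field is exchanged as the lowercase strings `low`, `high`, and `auto` at both the CLI and provider boundaries, so the enum needs a string mapping in each direction. The following is a minimal sketch, assuming `Display` and `FromStr` are the chosen conversion points (the source does not specify this). The enum is repeated so the block compiles on its own.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageDetail {
    Low,
    High,
    Auto,
}

impl fmt::Display for ImageDetail {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Lowercase names match the "low" / "high" / "auto" strings used on the wire.
        f.write_str(match self {
            ImageDetail::Low => "low",
            ImageDetail::High => "high",
            ImageDetail::Auto => "auto",
        })
    }
}

impl FromStr for ImageDetail {
    // A plain String error keeps the sketch small; a dedicated error type is an option.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "low" => Ok(ImageDetail::Low),
            "high" => Ok(ImageDetail::High),
            "auto" => Ok(ImageDetail::Auto),
            other => Err(format!("unknown image detail level: {other}")),
        }
    }
}
```

With this in place, a CLI string can be converted with `value.parse::<ImageDetail>()` and a payload value with `detail.to_string()`.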

#### 1.2 File Part Enhancement
Support generic file attachments (PDFs, etc.):

```rust
pub enum ContentPart {
    File {
        data: FileData,
        mime_type: String,
        filename: Option<String>,
    },
    // ...
}

pub enum FileData {
    Path(String),         // Local file path
    Url(String),          // External URL
    Base64(String),       // Base64-encoded content
    FileId(String),       // Provider file ID
}
```

### 2. File Input Handling

#### 2.1 MIME Type Detection
Implement automatic MIME type detection:

```rust
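use std::path::Path;

pub fn detect_mime_type(path: &Path) -> Option<String> {
    // Extension-based lookup only; magic bytes are not checked here.
    // Support: image/*, application/pdf, text/*
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    let mime = match ext.as_str() {
        "jpg" | "jpeg" => "image/jpeg",
        "png" => "image/png",
        "gif" => "image/gif",
        "webp" => "image/webp",
        "bmp" => "image/bmp",
        "pdf" => "application/pdf",
        "txt" => "text/plain",
        _ => return None,
    };
    Some(mime.to_string())
}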

pub fn is_image_mime(mime: &str) -> bool {
    mime.starts_with("image/")
}

pub fn is_supported_vision_mime(mime: &str) -> bool {
    matches!(mime, 
        "image/jpeg" | "image/png" | "image/gif" | 
        "image/webp" | "image/bmp" | "application/pdf"
    )
}
```
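
Detection by magic bytes can be sketched as a prefix check against well-known file signatures. The function name `sniff_mime` is hypothetical and not part of the existing codebase. The list covers only the formats named in `is_supported_vision_mime`.

```rust
/// Guess a MIME type from the leading bytes of a file.
/// Returns `None` when no known signature matches.
pub fn sniff_mime(bytes: &[u8]) -> Option<&'static str> {
    if bytes.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        Some("image/png")
    } else if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some("image/jpeg")
    } else if bytes.starts_with(b"GIF87a") || bytes.starts_with(b"GIF89a") {
        Some("image/gif")
    } else if bytes.len() >= 12 && &bytes[0..4] == b"RIFF" && &bytes[8..12] == b"WEBP" {
        // WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
        Some("image/webp")
    } else if bytes.starts_with(b"BM") {
        // Two-byte signature, so this match is weaker than the others.
        Some("image/bmp")
    } else if bytes.starts_with(b"%PDF-") {
        Some("application/pdf")
    } else {
        None
    }
}
```

Comparing this result with the extension-derived type is one way to catch a non-image file that carries an image extension.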

#### 2.2 Binary File Detection
Implement binary file detection (from opencode's `shouldEncode`):

```rust
pub fn should_encode_as_binary(mime_type: &str) -> bool {
    let top_level = mime_type.split('/').next().unwrap_or("");
    let sub_type = mime_type.split('/').nth(1).unwrap_or("");
    
    // Binary top-level types
    if matches!(top_level, "image" | "audio" | "video" | "font" | "model" | "multipart") {
        return true;
    }
    
    // Binary subtypes
    let binary_markers = ["zip", "gzip", "pdf", "msword", "binary", "ogg"];
    binary_markers.iter().any(|m| sub_type.contains(m))
}
```

#### 2.3 Image Processing
- **Base64 encoding**: Convert binary images to base64 strings
- **Size limits**: Enforce maximum file sizes (provider-specific)
- **Format conversion**: Optionally convert to supported formats
- **Detail levels**: Support "low", "high", "auto" detail settings
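
The base64 and size-limit items above can be sketched with the standard library alone. This is an illustration under assumptions: the real implementation would more likely depend on an existing base64 crate, and `encode_image` with its `max_bytes` parameter is a hypothetical helper.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encode bytes as standard (RFC 4648) base64 with `=` padding.
pub fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into the low 24 bits of a u32.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        // Missing input bytes become '=' padding.
        out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Encode an image for inline use, rejecting payloads above `max_bytes`.
pub fn encode_image(bytes: &[u8], max_bytes: usize) -> Result<String, String> {
    if bytes.len() > max_bytes {
        return Err(format!("image is {} bytes, limit is {}", bytes.len(), max_bytes));
    }
    Ok(base64_encode(bytes))
}
```

Checking the size before encoding avoids allocating the roughly 4/3 larger base64 string for a file that will be rejected anyway.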

### 3. CLI Interface

#### 3.1 Run Command Enhancement
Extend `RunArgs` to support image attachments:

```rust
use clap::Parser;
use std::path::PathBuf;

#[derive(Parser, Debug)]
pub struct RunArgs {
    /// Message to send
    pub message: String,
    
    /// Files to attach (images, PDFs, etc.)
    #[arg(short, long)]
    pub file: Vec<PathBuf>,
    
    /// Image detail level
    #[arg(long, value_parser = ["low", "high", "auto"])]
    pub image_detail: Option<String>,
    
    // ... existing fields
}
```

**Usage examples:**
```bash
# Single image
codetether run "Describe this image" --file screenshot.png

# Multiple images
codetether run "Compare these" --file image1.jpg --file image2.png

# With PDF
codetether run "Summarize this document" --file report.pdf

# With detail level
codetether run "Read this code" --file diagram.png --image-detail high
```

#### 3.2 TUI Integration
- Drag-and-drop support for images
- Paste image from clipboard
- Visual indicator for attached files
- Image preview in chat (if terminal supports it)

### 4. Provider Implementations

#### 4.1 OpenAI-Compatible (OpenAI, Copilot)

Convert `ContentPart::Image` to OpenAI's message format:

```jsonc
// For base64 images
{
    "type": "image_url",
    "image_url": {
        "url": "data:image/jpeg;base64,/9j/4AAQ...",
        "detail": "high"
    }
}

// For external URLs
{
    "type": "image_url", 
    "image_url": {
        "url": "https://example.com/image.jpg",
        "detail": "auto"
    }
}

// For PDFs (OpenAI Responses API)
{
    "type": "input_file",
    "file_url": "https://..."  // or file_id
}
```

Implementation in `src/provider/openai.rs`:

```rust
use serde_json::json;

fn convert_content_part(part: &ContentPart) -> serde_json::Value {
    match part {
        ContentPart::Text { text } => json!({
            "type": "text",
            "text": text
        }),
        ContentPart::Image { data, detail, .. } => {
            let url = match data {
                ImageData::Url(u) => u.clone(),
                ImageData::Base64 { data, media_type } => {
                    format!("data:{};base64,{}", media_type, data)
                }
                ImageData::FileId(id) => {
                    // Use file_id field instead
                    return json!({
                        "type": "image_file",
                        "file_id": id
                    });
                }
            };
            
            let mut obj = json!({
                "type": "image_url",
                "image_url": { "url": url }
            });
            
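            if let Some(d) = detail {
                // Map the enum to the lowercase strings the API expects.
                let level = match d {
                    ImageDetail::Low => "low",
                    ImageDetail::High => "high",
                    ImageDetail::Auto => "auto",
                };
                obj["image_url"]["detail"] = json!(level);
            }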
            
            obj
        }
        // ...
    }
}
```

#### 4.2 Anthropic (Claude)

Anthropic uses a different format:

```jsonc
// Base64 image
{
    "type": "image",
    "source": {
        "type": "base64",
        "media_type": "image/jpeg",
        "data": "/9j/4AAQ..."
    }
}

// External URL (if supported)
{
    "type": "image",
    "source": {
        "type": "url",
        "url": "https://example.com/image.jpg"
    }
}
```

#### 4.3 Google (Gemini)

Gemini format:

```jsonc
// Base64 inline
{
    "inlineData": {
        "mimeType": "image/jpeg",
        "data": "/9j/4AAQ..."
    }
}

// File URI
{
    "fileData": {
        "mimeType": "image/jpeg",
        "fileUri": "gs://bucket/image.jpg"
    }
}
```

#### 4.4 Provider Capability Detection

Add vision capability to `ModelInfo`:

```rust
pub struct ModelInfo {
    pub id: String,
    pub name: String,
    pub provider: String,
    pub supports_vision: bool,
    pub vision_mime_types: Vec<String>, // Supported MIME types
    pub max_image_size: Option<usize>,  // Max bytes per image
    pub max_images_per_request: Option<usize>,
    // ...
}
```
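
These fields allow a pre-flight check before a request is sent. The sketch below is one possible shape: `check_images` is a hypothetical function, the struct is trimmed to the fields the check uses, and images are represented as `(mime type, size in bytes)` pairs purely for illustration.

```rust
pub struct ModelInfo {
    pub id: String,
    pub supports_vision: bool,
    pub vision_mime_types: Vec<String>,
    pub max_image_size: Option<usize>,
    pub max_images_per_request: Option<usize>,
}

/// Hypothetical pre-flight check; `images` holds (mime type, size in bytes) pairs.
pub fn check_images(model: &ModelInfo, images: &[(&str, usize)]) -> Result<(), String> {
    // A request without images is always acceptable.
    if images.is_empty() {
        return Ok(());
    }
    if !model.supports_vision {
        return Err(format!("model \"{}\" doesn't support images", model.id));
    }
    if let Some(max) = model.max_images_per_request {
        if images.len() > max {
            return Err(format!("{} images attached, limit is {}", images.len(), max));
        }
    }
    for (mime, size) in images {
        if !model.vision_mime_types.iter().any(|m| m == mime) {
            return Err(format!("unsupported image type: {mime}"));
        }
        // `None` means the provider publishes no per-image limit.
        if let Some(max) = model.max_image_size {
            if *size > max {
                return Err(format!("image is {size} bytes, limit is {max}"));
            }
        }
    }
    Ok(())
}
```

Running the check on the client side turns a provider-side rejection into an immediate, specific message.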

### 5. Session Management

#### 5.1 Persisting Image Data
Images should be stored in sessions without bloating the session JSON. The `Attachment` struct below sketches Option 2:

```rust
use chrono::{DateTime, Utc};
use std::path::PathBuf;

// Option 1: Store base64 in session JSON (simple, but large)
// Option 2: Store images in separate directory, reference by ID

pub struct Session {
    // ... existing fields
    pub attachments: Vec<Attachment>,
}

pub struct Attachment {
    pub id: String,
    pub filename: String,
    pub mime_type: String,
    pub path: PathBuf,  // Local storage path
    pub created_at: DateTime<Utc>,
}
```

#### 5.2 Session Storage Layout

```
~/.local/share/codetether/
├── sessions/
│   ├── session-uuid.json
│   └── attachments/
│       └── session-uuid/
│           ├── img-1.png
│           └── doc.pdf
```
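
A path helper for this layout might look like the following sketch. The function name `attachment_path` is hypothetical, and it assumes `session_id` is an internally generated UUID that needs no sanitizing.

```rust
use std::path::{Path, PathBuf};

/// Build the storage path for an attachment under
/// `<data_dir>/sessions/attachments/<session_id>/`.
/// Only the final component of `filename` is used, so a name such as
/// "../../etc/passwd" cannot escape the session's attachment directory.
pub fn attachment_path(data_dir: &Path, session_id: &str, filename: &str) -> Option<PathBuf> {
    // `file_name` returns None for names like ".." or an empty string.
    let name = Path::new(filename).file_name()?;
    Some(
        data_dir
            .join("sessions")
            .join("attachments")
            .join(session_id)
            .join(name),
    )
}
```

Two attachments with the same filename would collide under this scheme, so a real implementation may prefer to name files by attachment ID.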

### 6. TUI Enhancements

#### 6.1 File Attachment UI
- Show attached files in input area
- Support keyboard shortcuts (Ctrl+A to attach)
- Visual indicators for file types

#### 6.2 Image Display
- Show image filenames in chat
- Optional: terminal image preview (using sixel, iTerm2 inline images, or kitty graphics protocol)
- Fallback: show file info (size, dimensions, type)

#### 6.3 Input Handling
```rust
pub enum InputEvent {
    Text(String),
    AttachFile(PathBuf),
    PasteImage(Vec<u8>),  // From clipboard
}
```

### 7. Error Handling

#### 7.1 Validation Errors
- Unsupported file type
- File too large
- Invalid image format
- Provider doesn't support vision

#### 7.2 User Feedback
```
Error: Cannot attach "video.mp4" - unsupported file type
Supported: .jpg, .jpeg, .png, .gif, .webp, .pdf

Error: Image "huge.png" (25MB) exceeds maximum size (20MB)

Warning: Model "gpt-3.5-turbo" doesn't support images.
Attached images will be ignored. Use a vision-capable model to include them.
```
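
One way to keep these messages consistent is a single error type whose `Display` output is the user-facing text. The enum name `AttachmentError` and its variants are assumptions for illustration, derived from the validation errors listed in 7.1.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum AttachmentError {
    UnsupportedType { filename: String },
    TooLarge { filename: String, size_mb: u64, max_mb: u64 },
    InvalidImage { filename: String },
    VisionUnsupported { model: String },
}

impl fmt::Display for AttachmentError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AttachmentError::UnsupportedType { filename } => {
                write!(f, "Cannot attach \"{filename}\" - unsupported file type")
            }
            AttachmentError::TooLarge { filename, size_mb, max_mb } => write!(
                f,
                "Image \"{filename}\" ({size_mb}MB) exceeds maximum size ({max_mb}MB)"
            ),
            AttachmentError::InvalidImage { filename } => {
                write!(f, "\"{filename}\" is not a valid image")
            }
            AttachmentError::VisionUnsupported { model } => {
                write!(f, "Model \"{model}\" doesn't support images")
            }
        }
    }
}

// Lets the type work with `?` and `Box<dyn Error>`.
impl std::error::Error for AttachmentError {}
```

The caller decides whether to print the text with an `Error:` or `Warning:` prefix, which keeps severity separate from the message itself.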

### 8. Security Considerations

#### 8.1 Path Validation
- Validate file paths are within allowed directories
- Prevent directory traversal attacks
- Sandbox file access
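
The traversal check above can be sketched by canonicalizing both paths and comparing prefixes. The function name `resolve_within` is hypothetical, and the sketch assumes a single allowed root directory and that the target file already exists (canonicalization fails otherwise).

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Resolve `candidate` and confirm it lives under `root`.
/// Canonicalizing both sides resolves `..` segments and symlinks before comparing.
pub fn resolve_within(root: &Path, candidate: &Path) -> io::Result<PathBuf> {
    let root = root.canonicalize()?;
    // `join` replaces the base when `candidate` is absolute, so absolute
    // paths are still checked against `root` below.
    let resolved = root.join(candidate).canonicalize()?;
    if resolved.starts_with(&root) {
        Ok(resolved)
    } else {
        Err(io::Error::new(
            io::ErrorKind::PermissionDenied,
            format!("{} is outside {}", resolved.display(), root.display()),
        ))
    }
}
```

`Path::starts_with` compares whole components, so `/data/project-other` is not treated as being inside `/data/project`.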

#### 8.2 Content Validation
- Verify file magic bytes match extension
- Scan for malicious content
- Limit total attachment size per request

#### 8.3 URL Validation
- Whitelist allowed URL schemes (https only)
- Validate URL format
- Implement timeouts for fetching
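
The scheme restriction can be sketched without any dependencies. This is deliberately minimal: `is_allowed_image_url` is a hypothetical name, and a real implementation would more likely parse the URL with a dedicated URL library to cover the full format validation.

```rust
/// Accept only absolute `https://` URLs with a non-empty host.
pub fn is_allowed_image_url(url: &str) -> bool {
    const PREFIX: &str = "https://";
    // `get` returns None if the string is too short or the cut is not on a
    // character boundary, so this never panics on arbitrary input.
    let Some(head) = url.get(..PREFIX.len()) else {
        return false;
    };
    // URL schemes are case-insensitive.
    if !head.eq_ignore_ascii_case(PREFIX) {
        return false;
    }
    let rest = &url[PREFIX.len()..];
    // The authority is everything up to the first '/', '?', or '#'.
    let host = rest
        .split(|c: char| matches!(c, '/' | '?' | '#'))
        .next()
        .unwrap_or("");
    !host.is_empty() && !host.chars().any(char::is_whitespace)
}
```

Rejecting every scheme other than `https` also rules out `file://` and `data:` URLs arriving through the URL code path.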

### 9. Implementation Phases

#### Phase 1: Core Infrastructure
1. Enhance `ContentPart` enum with `ImageData` and `FileData`
2. Implement MIME type detection
3. Add base64 encoding utilities
4. Update serialization/deserialization

#### Phase 2: Provider Support
1. Implement OpenAI image format conversion
2. Implement Anthropic image format conversion
3. Implement Google image format conversion
4. Add vision capability detection

#### Phase 3: CLI Integration
1. Update `RunArgs` with file attachment options
2. Implement file reading and validation in `run.rs`
3. Add image detail parameter
4. Error handling and user feedback

#### Phase 4: Session Persistence
1. Design attachment storage system
2. Implement attachment upload/download
3. Update session serialization
4. Cleanup orphaned attachments

#### Phase 5: TUI Integration
1. Add file attachment UI
2. Implement clipboard paste
3. Visual indicators for attachments
4. Optional image preview

### 10. Testing Requirements

#### 10.1 Unit Tests
- MIME type detection
- Base64 encoding/decoding
- ContentPart serialization
- Provider format conversion

#### 10.2 Integration Tests
- End-to-end image upload flow
- Multiple provider support
- Session persistence
- Error scenarios

#### 10.3 Test Images
- Various formats: JPG, PNG, GIF, WebP, BMP
- Various sizes: small (<1KB), medium (<1MB), large (>10MB)
- Edge cases: corrupted, zero-byte, non-image with image extension

### 11. Documentation

#### 11.1 User Documentation
- How to attach images via CLI
- Supported file formats
- Provider-specific limitations
- Image detail levels explained

#### 11.2 Developer Documentation
- ContentPart enum design
- Adding new providers
- Extending MIME type support
- TUI integration guide

---

## Open Questions

1. **Storage**: Should we store base64 in session JSON or use external files?
2. **Caching**: Should we cache fetched external URLs?
3. **Compression**: Should we compress/resize large images automatically?
4. **Privacy**: How to handle sensitive images in logs/telemetry?
5. **Rate Limiting**: How to handle provider image upload limits?

## References

- Opencode implementation: `packages/opencode/src/session/message-v2.ts`
- Opencode file handling: `packages/opencode/src/file/index.ts`
- OpenAI Vision API: https://platform.openai.com/docs/guides/vision
- Anthropic Vision: https://docs.anthropic.com/claude/docs/vision
- Google Gemini Vision: https://ai.google.dev/gemini-api/docs/vision