cooklang-import 0.9.9

A tool for importing recipes into Cooklang format
Documentation
# Architecture Overview

This document provides a high-level overview of the cooklang-import system architecture, showing the different input flows and their outcomes.

## System Architecture

```mermaid
flowchart TB
    %% Input Sources
    URL[URL Input]
    TEXT[Text Input]
    IMAGE[Image Input]

    %% Domain Routing
    DOMAIN_CHECK{Domain in<br/>page_scriber.domains?}

    %% Fetchers
    FETCH_REQ[HTTP Fetcher<br/>reqwest]
    FETCH_PS[Page Scriber Fetcher<br/>POST /api/fetch-source<br/>returns HTML]

    %% Processing Stages
    EXTRACT_HTML[HTML Extractors<br/>JSON-LD → MicroData →<br/>HTML Class]
    EXTRACT_PLAIN[Plain Text Extraction<br/>extract_text_from_html]
    EXTRACT_TEXT[Text Extractor<br/>LLM-based parsing]
    LANG_DETECT[Language Detection<br/>whatlang]
    OCR[OCR Processing<br/>Google Cloud Vision]

    %% Intermediate State
    RECIPE[Recipe Object<br/>name, ingredients,<br/>instructions, metadata]
    TEXT_FORMAT[Text Format<br/>YAML frontmatter +<br/>ingredients + instructions]

    %% Configuration
    CONFIG[Configuration<br/>config.toml + env vars<br/>+ fallback config]

    %% LLM Conversion
    CONVERTERS[Converters<br/>OpenAI, Anthropic, Google<br/>Azure OpenAI, Ollama]

    %% Output Modes
    RECIPE_OUT[Recipe Output<br/>extract_only mode]
    COOKLANG_OUT[Cooklang Output<br/>frontmatter + cooklang body<br/>+ ConversionMetadata]

    %% FFI Layer
    FFI[UniFFI Bindings<br/>iOS Swift · Android Kotlin]

    %% Flow connections — URL pipeline with domain-aware routing
    URL --> DOMAIN_CHECK
    DOMAIN_CHECK --> |yes| FETCH_PS
    DOMAIN_CHECK --> |no| FETCH_REQ
    FETCH_REQ --> EXTRACT_HTML
    FETCH_PS --> EXTRACT_HTML
    EXTRACT_HTML --> |success| RECIPE
    EXTRACT_HTML --> |all fail| EXTRACT_PLAIN
    FETCH_REQ --> |HTTP error,<br/>page scriber configured| FETCH_PS
    EXTRACT_PLAIN --> EXTRACT_TEXT
    EXTRACT_TEXT --> TEXT_FORMAT

    CONFIG -.-> DOMAIN_CHECK
    CONFIG -.-> FETCH_PS

    TEXT --> |pre-formatted| TEXT_FORMAT
    TEXT --> |needs extraction| EXTRACT_TEXT

    IMAGE --> OCR
    OCR --> |OPENAI_API_KEY set| EXTRACT_TEXT
    OCR --> |no API key| TEXT_FORMAT

    RECIPE --> |serialize| TEXT_FORMAT

    TEXT_FORMAT --> |extract_only| RECIPE_OUT
    TEXT_FORMAT --> LANG_DETECT
    LANG_DETECT --> CONVERTERS

    CONFIG -.-> CONVERTERS

    CONVERTERS --> COOKLANG_OUT

    COOKLANG_OUT --> FFI
    RECIPE_OUT --> FFI

    %% Styling
    classDef inputStyle fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    classDef fetchStyle fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    classDef processStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef dataStyle fill:#e0f2f1,stroke:#004d40,stroke-width:2px
    classDef outputStyle fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    classDef configStyle fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef ffiStyle fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef routingStyle fill:#fff9c4,stroke:#f57f17,stroke-width:2px

    class URL,TEXT,IMAGE inputStyle
    class FETCH_REQ,FETCH_PS fetchStyle
    class EXTRACT_HTML,EXTRACT_PLAIN,EXTRACT_TEXT,LANG_DETECT,OCR processStyle
    class RECIPE,TEXT_FORMAT dataStyle
    class RECIPE_OUT,COOKLANG_OUT outputStyle
    class CONFIG,CONVERTERS configStyle
    class FFI ffiStyle
    class DOMAIN_CHECK routingStyle
```

## Project Structure

```
src/
├── lib.rs                      # Public API exports
├── main.rs                     # CLI binary
├── model.rs                    # Recipe struct with serialization
├── builder.rs                  # Builder API + pipeline orchestration
├── config.rs                   # Configuration loading (+ FallbackConfig)
├── error.rs                    # Error types
├── uniffi_bindings.rs          # FFI bindings for iOS/Android (feature-gated)
│
├── pipelines/                  # Flow orchestration
│   ├── mod.rs
│   ├── url.rs                  # URL → text pipeline
│   ├── text.rs                 # Text → text pipeline
│   └── image.rs                # Image → text pipeline
│
├── url_to_text/                # URL input processing
│   ├── mod.rs
│   ├── fetchers/
│   │   ├── mod.rs
│   │   ├── request.rs          # HTTP fetch (reqwest)
│   │   └── page_scriber.rs     # Page scriber fetch (HTML source via /api/fetch-source)
│   ├── html/
│   │   ├── mod.rs
│   │   └── extractors/
│   │       ├── mod.rs          # Extractor trait + ParsingContext
│   │       ├── json_ld.rs      # JSON-LD schema extraction
│   │       ├── microdata.rs    # HTML5 microdata extraction
│   │       └── html_class.rs   # CSS class-based extraction
│   └── text/
│       ├── mod.rs
│       └── extractor.rs        # LLM-based plain text extraction
│
├── images_to_text/             # Image input processing
│   ├── mod.rs
│   └── ocr.rs                  # Google Vision OCR (path + base64)
│
└── converters/                 # Text → Cooklang conversion
    ├── mod.rs                  # Converter trait + factory + TokenUsage/ConversionMetadata
    ├── prompt.rs               # Cooklang conversion prompt + language detection (whatlang)
    ├── prompt.txt              # Prompt template ({{RECIPE}} + {{LANGUAGE}})
    ├── open_ai.rs
    ├── anthropic.rs
    ├── azure_openai.rs
    ├── google.rs
    └── ollama.rs

build.rs                        # Cargo build script (UniFFI scaffolding)
uniffi.toml                     # UniFFI binding generation config
uniffi-bindgen.rs               # UniFFI CLI entry point
Package.swift                   # Swift Package Manager manifest
scripts/
├── build-android.sh
├── build-ios.sh
├── generate-swift-package.sh
├── publish-android.sh
└── test-ios-release.sh
```

## Input Flows

### 1. URL → Recipe/Cooklang
The most common use case where a recipe URL is provided:
- **Step 1**: Check if domain is in `page_scriber.domains` list (from config.toml)
- **Step 2a**: If domain is listed, fetch HTML via Page Scriber (`/api/fetch-source`)
- **Step 2b**: Otherwise, fetch HTML via HTTP request (reqwest)
- **Step 3**: Try HTML extractors in order: JSON-LD → MicroData → HTML Class
- **Step 4**: If reqwest failed (e.g., HTTP 402/blocked) and page scriber is configured, auto-fallback to Page Scriber, then retry structured extractors
- **Step 5**: If all extractors fail, extract plain text from HTML (`extract_text_from_html`) then use LLM-based Text Extractor
- **Output**: Recipe struct (extract_only) or Cooklang format (default)

### 2. Text → Cooklang
For plain text or pre-formatted recipes:
- **Pre-formatted**: Assumes text is already in correct format (frontmatter + ingredients + instructions)
- **With extraction**: Uses LLM to parse unstructured text into structured format
- **Output**: Cooklang format via converter

### 3. Image → Cooklang
For recipe images (photos, screenshots):
- Uses Google Cloud Vision API for OCR
- Supports file paths or base64-encoded images
- Multiple images can be combined
- **Structured extraction**: If `OPENAI_API_KEY` is set, OCR text goes through TextExtractor to extract title, metadata (servings, prep_time, cook_time, total_time), and structured recipe text
- **Fallback**: If no API key, returns raw OCR text
- **Output**: Cooklang format via converter

## Data Flow

```
Input → Pipeline → Intermediate Format → Converter → Output

Intermediate Format (Text with YAML frontmatter):
---
title: Recipe Name
source: https://example.com
servings: 4
prep_time: 15 min
cook_time: 30 min
total_time: 45 min
---

ingredient 1
ingredient 2
ingredient 3

Instructions text here...
```

## Processing Components

### Fetchers (url_to_text/fetchers/)
- **RequestFetcher**: Standard HTTP fetch using reqwest with timeout and user agent
- **PageScriberFetcher**: Fetches HTML source via a page scriber service (`POST /api/fetch-source`). Used for sites that block bots (Cloudflare 402, CAPTCHA). Unlike the old ChromeFetcher which returned plain text, this returns raw HTML so structured extractors can still work. Configured via `page_scriber.url` in config.toml.

### HTML Extractors (url_to_text/html/extractors/)
Attempt extraction in order of reliability:
1. **JSON-LD**: Structured recipe data in `<script type="application/ld+json">`
2. **MicroData**: HTML5 microdata attributes (itemscope, itemprop)
3. **HTML Class**: Common CSS class patterns for recipe sites

### Text Extractor (url_to_text/text/)
LLM-based extraction that parses unstructured text into structured recipe components:
- Extracts title, servings, prep_time, cook_time, total_time
- Parses ingredients and instructions from messy text
- Used as fallback for URL processing when HTML extractors fail
- Used for image OCR output to extract structured data from raw OCR text
- Requires `OPENAI_API_KEY` environment variable

### Converters (converters/)
Transform intermediate text format to Cooklang:
- **Trait**: `Converter` with `convert(text) -> Result<String>`
- **Factory**: `create_converter(name, config)` for dynamic creation
- **Providers**: OpenAI, Anthropic, Google, Azure OpenAI, Ollama
- **Language detection**: Uses `whatlang` crate to auto-detect recipe language, injected into prompt template as `{{LANGUAGE}}`
- **Metadata**: Returns `ConversionMetadata` with `model_version`, `TokenUsage` (input/output tokens), and `latency_ms`
- **Fallback**: Configurable provider fallback with retry attempts and exponential backoff (`FallbackConfig`)

## Configuration

Configuration is loaded from multiple sources (in priority order):
1. Environment variables (e.g., `OPENAI_API_KEY`)
2. `config.toml` file
3. Default values

### Configurable Options
- **Page Scriber**: URL of the page scriber service and list of domains that should use it directly
- **Extractors**: Enable/disable and order of extraction strategies
- **Converters**: Enable/disable providers, set default, configure fallback order
- **Fallback**: Enable/disable automatic provider failover with retry attempts and delay
- **Timeouts**: HTTP request timeouts
- **Provider-specific**: API keys, base URLs, endpoints, model names, project IDs (Google), deployment names (Azure)

```toml
[page_scriber]
url = "http://localhost:4000"
domains = ["seriouseats.com", "allrecipes.com"]

[extractors]
enabled = ["json_ld", "microdata", "html_class"]
order = ["json_ld", "microdata", "html_class"]

[converters]
enabled = ["anthropic", "open_ai", "ollama"]
order = ["anthropic", "open_ai", "ollama"]
default = "anthropic"

[converters.fallback]
enabled = true
order = ["anthropic", "open_ai", "ollama"]
retry_attempts = 2
retry_delay_ms = 1000

[providers.anthropic]
enabled = true
model = "claude-3-5-sonnet-20241022"
```

## Output Modes

### Recipe Struct (extract_only)
Returns structured recipe data without LLM conversion:
- Title in YAML frontmatter
- Ingredients (one per line)
- Instructions (free text)
- Metadata (cook time, servings, source, etc.)

### Cooklang Format (default)
Converts recipe to Cooklang syntax:
- YAML frontmatter with metadata
- Ingredients marked with `@ingredient{quantity%unit}` syntax
- Cookware marked with `#cookware{}` syntax
- Timers marked with `~{time%unit}` syntax
- Includes `ConversionMetadata` (model version, token usage, latency)

## Mobile SDKs (UniFFI)

The library compiles as `lib`, `cdylib`, and `staticlib` crate types to support native Rust use and FFI consumption. Mobile bindings are feature-gated behind the `uniffi` feature flag.

### FFI Layer (src/uniffi_bindings.rs)
Provides FFI-safe mirrors of core types and synchronous wrappers around the async API:
- `FfiRecipeComponents`, `FfiLlmProvider`, `FfiImportResult`, `FfiImportError`, `FfiImportConfig`
- Exported functions: `import_from_url`, `convert_text_to_cooklang`, `convert_image_to_cooklang`, `extract_recipe_from_url`, `simple_import`, `get_version`, `is_provider_available`

### Platform Targets
- **iOS/macOS**: Swift bindings generated via UniFFI → `Sources/CooklangImport/CooklangImport.swift`, distributed as Swift Package (`Package.swift`)
- **Android**: Kotlin bindings generated via UniFFI, built with `scripts/build-android.sh`