file_to_json

file_to_json is a Rust library that converts arbitrary text-based files into JSON. It parses a set of common structured formats locally (CSV, JSON, YAML, TOML) and falls back to an OpenRouter-hosted LLM for any format it does not recognise.

Features

Local parsers for CSV, JSON, YAML, and TOML.
Automatic PDF text extraction before calling the LLM.
OpenRouter LLM fallback (default text model: openrouter/polaris-alpha).
Automatic chunking for large text payloads to stay within LLM limits.
Safeguards against sending large or non-UTF-8 payloads to the LLM.
Vision-aware fallback for common image formats (JPEG/PNG/GIF/WebP) that captions images via OpenRouter and emits structured metadata.
Simple API returning serde_json::Value.
Configurable fallback strategies for large files (chunking or code generation).

Installation

Add the crate to your project:

cargo add file_to_json --git https://github.com/your-org/file_to_json

(Replace the repository URL with where you host the crate.)

Usage

use file_to_json::{Converter, FallbackStrategy, OpenRouterConfig};

// Box<dyn Error> lets `?` propagate both conversion and serde_json errors.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Option 1: from environment variables
    let converter = Converter::from_env()?;

    // Option 2: custom configuration
    // let converter = Converter::new(OpenRouterConfig {
    //     api_key: "sk-or-...".into(),
    //     model: "openrouter/polaris-alpha".into(),
    //     timeout: std::time::Duration::from_secs(60),
    //     fallback_strategy: FallbackStrategy::CodeGeneration,
    // })?;

    let value = converter.convert_path("data/sample.csv")?;
    println!("{}", serde_json::to_string_pretty(&value)?);
    Ok(())
}

Environment variables

OPENROUTER_API_KEY – required. Your OpenRouter API key.
OPENROUTER_MODEL – optional. Defaults to openrouter/polaris-alpha.
OPENROUTER_FALLBACK_STRATEGY – optional. chunked (default) or code.
OPENROUTER_VISION_MODEL – optional. Defaults to anthropic/claude-3.5-sonnet. Must support image inputs and JSON output.
OPENROUTER_MAX_IMAGE_BYTES – optional. Maximum size (bytes) of image payloads; defaults to 5242880 (5 MiB).
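
The optional variables and their defaults can be illustrated with a small, dependency-free sketch. This is not the crate's own configuration code: Strategy, parse_strategy, parse_max_image_bytes, and settings_from_env are hypothetical names used only for illustration, and treating an empty value as "unset" is an assumption.

```rust
use std::env;

/// Hypothetical mirror of the two documented strategies (not the crate's type).
#[derive(Debug, PartialEq)]
enum Strategy {
    Chunked,
    Code,
}

/// Documented default for OPENROUTER_MAX_IMAGE_BYTES: 5 MiB.
const DEFAULT_MAX_IMAGE_BYTES: u64 = 5 * 1024 * 1024;

/// Map a raw OPENROUTER_FALLBACK_STRATEGY value to a strategy.
/// Unset (or empty, by assumption) yields the documented default, `chunked`.
fn parse_strategy(raw: Option<&str>) -> Result<Strategy, String> {
    match raw.map(str::trim) {
        None | Some("") => Ok(Strategy::Chunked),
        Some("chunked") => Ok(Strategy::Chunked),
        Some("code") => Ok(Strategy::Code),
        Some(other) => Err(format!("unknown fallback strategy: {other}")),
    }
}

/// Parse a raw OPENROUTER_MAX_IMAGE_BYTES value, falling back to the default.
fn parse_max_image_bytes(raw: Option<&str>) -> Result<u64, String> {
    match raw {
        None => Ok(DEFAULT_MAX_IMAGE_BYTES),
        Some(s) => s
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("invalid byte count {s:?}: {e}")),
    }
}

/// Read both settings from the process environment.
fn settings_from_env() -> Result<(Strategy, u64), String> {
    let strategy = env::var("OPENROUTER_FALLBACK_STRATEGY").ok();
    let max_bytes = env::var("OPENROUTER_MAX_IMAGE_BYTES").ok();
    Ok((
        parse_strategy(strategy.as_deref())?,
        parse_max_image_bytes(max_bytes.as_deref())?,
    ))
}
```

Keeping the parsing functions separate from the environment lookup makes the defaults easy to check without touching process state. In real code, prefer Converter::from_env().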

Behaviour

If the file extension is recognised, the crate parses it locally.
If the file looks like a supported image (JPEG/PNG/GIF/WebP) it is base64-encoded and sent to the configured vision model, which is prompted to return JSON metadata containing a summary, tags, objects, dominant_colors, and confidence.
Otherwise it sends the UTF-8 content (after extracting text for PDFs) to OpenRouter. For inputs that exceed 128 KiB the fallback strategy determines how to proceed:
- chunked (default): splits the input into ≤128 KiB segments, converts each chunk, and merges the returned JSON (arrays concatenated, objects shallow-merged, mixed types wrapped in an array). Works best when each chunk shares a compatible structure.
- code: sends a sample of the file (the first, middle, and last 10 lines) to the model, asks it for Python 3 code that can parse the full file, writes that code to a temporary script, and executes it locally to produce JSON (requires python3 on the PATH).
The result is returned as serde_json::Value.
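
The merge rule of the chunked strategy (arrays concatenated, objects shallow-merged, mixed types wrapped in an array) can be sketched as follows. This is one plausible reading of that rule, not the crate's implementation. Json is a minimal stand-in for serde_json::Value so the example needs no dependencies, merge_chunks is a hypothetical name, and "later chunk wins on duplicate keys" is an assumption.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for serde_json::Value (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

/// Merge per-chunk results:
/// - all arrays    -> one concatenated array
/// - all objects   -> one object; on duplicate keys the later chunk wins (assumption)
/// - anything else -> an array holding each chunk's value in order
fn merge_chunks(chunks: Vec<Json>) -> Json {
    if chunks.iter().all(|c| matches!(c, Json::Array(_))) {
        let mut out = Vec::new();
        for chunk in chunks {
            if let Json::Array(items) = chunk {
                out.extend(items);
            }
        }
        Json::Array(out)
    } else if chunks.iter().all(|c| matches!(c, Json::Object(_))) {
        let mut out = BTreeMap::new();
        for chunk in chunks {
            if let Json::Object(map) = chunk {
                // Shallow merge: a nested value is replaced, never merged.
                out.extend(map);
            }
        }
        Json::Object(out)
    } else {
        Json::Array(chunks)
    }
}
```

Because the object merge is shallow, two chunks that both emit a top-level key such as "rows" would keep only the later chunk's value under this reading. That is why chunking works best when each chunk yields an array of records.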

Binary files are rejected unless they are supported images (handled by the vision model), can be converted to UTF-8 text (e.g. PDFs via the built-in extractor), or can be handled by the code-generation fallback.
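
The routing described above (supported image, UTF-8 text, or rejected binary) can be approximated by inspecting the leading bytes of the file. The sketch below is an assumption about how such a check might look, not the crate's code: InputKind, sniff_image, and classify are hypothetical names, and PDF extraction and the code-generation path are left out.

```rust
/// Hypothetical classification of raw input bytes.
#[derive(Debug, PartialEq)]
enum InputKind {
    /// A supported image, tagged with its MIME type.
    Image(&'static str),
    /// Valid UTF-8, suitable for the text model.
    Text,
    /// Neither a supported image nor valid UTF-8.
    Binary,
}

/// Detect JPEG, PNG, GIF, and WebP by their well-known magic bytes.
fn sniff_image(bytes: &[u8]) -> Option<&'static str> {
    if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some("image/jpeg")
    } else if bytes.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        Some("image/png")
    } else if bytes.starts_with(b"GIF87a") || bytes.starts_with(b"GIF89a") {
        Some("image/gif")
    } else if bytes.len() >= 12 && &bytes[..4] == b"RIFF" && &bytes[8..12] == b"WEBP" {
        // WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
        Some("image/webp")
    } else {
        None
    }
}

/// Images are checked first, then UTF-8 validity; everything else is binary.
fn classify(bytes: &[u8]) -> InputKind {
    if let Some(mime) = sniff_image(bytes) {
        InputKind::Image(mime)
    } else if std::str::from_utf8(bytes).is_ok() {
        InputKind::Text
    } else {
        InputKind::Binary
    }
}
```

Checking content rather than the file extension means a mislabelled file (for example a PNG saved as .txt) would still reach the vision model in this sketch. Whether the crate sniffs content or trusts extensions is not stated here.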

Example: image captioning

With the required environment variables set:

export OPENROUTER_API_KEY="sk-or-..."
# optional overrides
export OPENROUTER_VISION_MODEL="anthropic/claude-3.5-sonnet"
export OPENROUTER_MAX_IMAGE_BYTES="5242880"

Running the bundled example on a JPEG:

cargo run --example convert -- ./examples/data/einstein.jpg

produces structured JSON similar to:

{
  "summary": "A black and white portrait of an elderly person with wild white hair.",
  "tags": ["portrait", "black and white", "historical"],
  "objects": ["face", "hair", "jacket"],
  "dominant_colors": ["black", "white", "grey"],
  "confidence": 0.98
}

Testing

cargo test

License

This project is distributed under the terms of the MIT license.

file_to_json 0.1.4