file_to_json

file_to_json is a Rust library that converts arbitrary text-based files into JSON. It parses a set of common structured formats locally (CSV, JSON, YAML, TOML) and falls back to an OpenRouter-hosted LLM for any format it does not recognise.

Features

Local parsers for CSV, JSON, YAML, and TOML.
Automatic PDF text extraction before calling the LLM.
OpenRouter LLM fallback (default text model: anthropic/claude-3.7-sonnet).
Automatic chunking for large text payloads to stay within LLM limits.
Safeguards against sending oversized or non-UTF-8 payloads to the LLM.
Vision-aware fallback for common image formats (JPEG/PNG/GIF/WebP) that captions images via OpenRouter and emits structured metadata.
Simple API returning serde_json::Value.
Configurable fallback strategies for large files (chunking or code generation).

Installation

Add the crate to your project:

cargo add file_to_json --git https://github.com/your-org/file_to_json

(Replace the repository URL with where you host the crate.)

For Contributors

This repository uses Git LFS to manage large example files. After cloning, you'll need to:

Install Git LFS: brew install git-lfs (macOS) or see git-lfs.github.com
Initialize: git lfs install
Pull large files: git lfs pull

See examples/README.md for more details.

Usage

use file_to_json::{Converter, FallbackStrategy, OpenRouterConfig};
use std::time::Duration;

// Box<dyn Error> lets `?` propagate both the crate's ConvertError and
// serde_json's serialisation error from the same function.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = OpenRouterConfig {
        api_key: "sk-or-...".to_string(),
        model: "anthropic/claude-3.7-sonnet".to_string(),
        timeout: Duration::from_secs(60),
        fallback_strategy: FallbackStrategy::Chunked,
        vision_model: Some("anthropic/claude-3.7-sonnet".to_string()),
        max_image_bytes: 5 * 1024 * 1024, // 5 MiB
    };

    // CSV is parsed locally, so this call does not contact OpenRouter.
    let converter = Converter::new(config)?;
    let value = converter.convert_path("data/sample.csv")?;
    println!("{}", serde_json::to_string_pretty(&value)?);
    Ok(())
}

Configuration

The OpenRouterConfig struct accepts the following fields:

api_key – required. Your OpenRouter API key.
model – optional. Defaults to anthropic/claude-3.7-sonnet.
timeout – optional. Request timeout duration. Defaults to 60 seconds.
fallback_strategy – optional. FallbackStrategy::Chunked (default) or FallbackStrategy::CodeGeneration.
vision_model – optional. Defaults to anthropic/claude-3.5-sonnet. Must support image inputs and JSON output.
max_image_bytes – optional. Maximum size (bytes) of image payloads; defaults to 5242880 (5 MiB).
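
The documented defaults can be collected in one place and overridden selectively with struct update syntax. The sketch below uses stand-in types (ConfigSketch and Strategy are hypothetical names, not part of the crate) because this README does not say whether OpenRouterConfig implements Default or offers a constructor; only the default values are taken from the list above.

```rust
use std::time::Duration;

/// Hypothetical mirror of the two documented fallback strategies.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum Strategy {
    #[default]
    Chunked,
    CodeGeneration,
}

/// Hypothetical stand-in with the same fields as the documented config.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct ConfigSketch {
    api_key: String,
    model: String,
    timeout: Duration,
    fallback_strategy: Strategy,
    vision_model: Option<String>,
    max_image_bytes: usize,
}

impl ConfigSketch {
    /// Fills every optional field with the default listed in this README.
    fn with_api_key(api_key: &str) -> Self {
        Self {
            api_key: api_key.to_string(),
            model: "anthropic/claude-3.7-sonnet".to_string(),
            timeout: Duration::from_secs(60),
            fallback_strategy: Strategy::default(),
            vision_model: Some("anthropic/claude-3.5-sonnet".to_string()),
            max_image_bytes: 5 * 1024 * 1024, // 5 MiB
        }
    }
}
```

With such a helper, a caller that only wants a longer timeout could write `ConfigSketch { timeout: Duration::from_secs(120), ..ConfigSketch::with_api_key("sk-or-...") }` and leave the remaining fields at their defaults.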

Behaviour

If the file extension is recognised, the crate parses it locally.
If the file looks like a supported image (JPEG/PNG/GIF/WebP) it is base64-encoded and sent to the configured vision model, which is prompted to return JSON metadata containing a summary, tags, objects, dominant_colors, and confidence.
Otherwise it sends the UTF-8 content (after extracting text for PDFs) to OpenRouter. For inputs that exceed 128 KiB the fallback strategy determines how to proceed:
- FallbackStrategy::Chunked (default): splits the input into segments of at most 128 KiB, converts each chunk, and merges the returned JSON (arrays are concatenated, objects are shallow-merged, and mixed types are wrapped in an array). This works best when every chunk shares a compatible structure.
- FallbackStrategy::CodeGeneration: sends the first, middle, and last 10 lines to the model, asks for Python 3 code that can parse the full file, writes that code to a temporary script, and executes it locally to produce JSON (requires python3 on the PATH).
The result is returned as serde_json::Value.
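
The two large-file strategies both start with a simple text-slicing step, which can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the crate's implementation: split_chunks and sample_lines are hypothetical names, breaking chunks at newlines is an assumed refinement, and the exact position of the "middle" lines is a guess at what the description means.

```rust
/// Chunk limit taken from the 128 KiB threshold described above.
#[allow(dead_code)]
const MAX_CHUNK_BYTES: usize = 128 * 1024;

/// Splits `input` into segments of at most `max_bytes` bytes without cutting
/// a UTF-8 character in half. When the window contains a newline, the split
/// happens just after the last one so that records stay intact (assumption).
fn split_chunks(input: &str, max_bytes: usize) -> Vec<&str> {
    assert!(max_bytes >= 4, "a chunk must be able to hold any UTF-8 character");
    let mut chunks = Vec::new();
    let mut rest = input;
    while rest.len() > max_bytes {
        // Back off to the nearest character boundary at or below the limit.
        let mut end = max_bytes;
        while !rest.is_char_boundary(end) {
            end -= 1;
        }
        // Prefer to end the chunk right after a line break.
        if let Some(pos) = rest[..end].rfind('\n') {
            end = pos + 1;
        }
        let (head, tail) = rest.split_at(end);
        chunks.push(head);
        rest = tail;
    }
    if !rest.is_empty() {
        chunks.push(rest);
    }
    chunks
}

/// Returns the first, middle, and last `n` lines of `input`, or every line
/// when the input has no more than `3 * n` lines.
fn sample_lines(input: &str, n: usize) -> Vec<&str> {
    let lines: Vec<&str> = input.lines().collect();
    if lines.len() <= 3 * n {
        return lines;
    }
    // Centre the middle window; it cannot overlap the head or tail windows
    // because the input has more than 3 * n lines.
    let mid_start = (lines.len() - n) / 2;
    let mut out = Vec::with_capacity(3 * n);
    out.extend_from_slice(&lines[..n]);
    out.extend_from_slice(&lines[mid_start..mid_start + n]);
    out.extend_from_slice(&lines[lines.len() - n..]);
    out
}
```

Calling `split_chunks(text, MAX_CHUNK_BYTES)` yields slices that concatenate back to the original text, and `sample_lines(text, 10)` yields at most 30 lines to show to the model.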

Binary files are rejected unless they are supported images (handled by the vision model), can be converted to UTF-8 text (e.g. PDFs via the built-in extractor), or can be handled by the code-generation fallback.
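
The routing decision for raw bytes can be illustrated with a short sketch. It assumes that "looks like a supported image" means checking the well-known file signatures of the four formats, and that images larger than max_image_bytes are rejected. Neither point is confirmed by this README, and the names ImageKind, Route, sniff_image, and route are hypothetical. PDF extraction and the code-generation fallback are left out for brevity.

```rust
#[derive(Debug, PartialEq, Eq)]
enum ImageKind {
    Jpeg,
    Png,
    Gif,
    WebP,
}

#[derive(Debug, PartialEq, Eq)]
enum Route {
    /// Send to the vision model.
    Vision(ImageKind),
    /// Valid UTF-8: eligible for the text model.
    Text,
    /// Neither a supported image nor UTF-8 text.
    Rejected,
}

/// Identifies an image by its leading signature bytes.
fn sniff_image(bytes: &[u8]) -> Option<ImageKind> {
    if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some(ImageKind::Jpeg)
    } else if bytes.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some(ImageKind::Png)
    } else if bytes.starts_with(b"GIF87a") || bytes.starts_with(b"GIF89a") {
        Some(ImageKind::Gif)
    } else if bytes.len() >= 12 && &bytes[..4] == b"RIFF" && &bytes[8..12] == b"WEBP" {
        // WebP is a RIFF container whose form type at offset 8 is "WEBP".
        Some(ImageKind::WebP)
    } else {
        None
    }
}

/// Decides where a payload should go, given the configured image size limit.
fn route(bytes: &[u8], max_image_bytes: usize) -> Route {
    match sniff_image(bytes) {
        Some(kind) if bytes.len() <= max_image_bytes => Route::Vision(kind),
        Some(_) => Route::Rejected, // image exceeds the limit (assumed behaviour)
        None if std::str::from_utf8(bytes).is_ok() => Route::Text,
        None => Route::Rejected,
    }
}
```

Checking signatures rather than file extensions means a mislabelled file is still routed by its actual content.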

Example: image captioning

Running the bundled example on a JPEG:

cargo run --example convert -- ./examples/data/einstein.jpg <API_KEY>

produces structured JSON similar to:

{
  "summary": "A black and white portrait of an elderly person with wild white hair.",
  "tags": ["portrait", "black and white", "historical"],
  "objects": ["face", "hair", "jacket"],
  "dominant_colors": ["black", "white", "grey"],
  "confidence": 0.98
}

Testing

cargo test

License

This project is distributed under the terms of the MIT license.

file_to_json 0.3.0