# file_to_json
`file_to_json` is a Rust library that converts arbitrary text-based files into JSON. It parses a set of common structured formats locally (CSV, JSON, YAML, TOML) and falls back to an OpenRouter-hosted LLM for any format it does not recognise.
## Features
- Local parsers for CSV, JSON, YAML, and TOML.
- Automatic PDF text extraction before calling the LLM.
- OpenRouter LLM fallback (default text model: `anthropic/claude-3.7-sonnet`).
- Automatic chunking for large text payloads to stay within LLM limits.
- Safeguards against sending large or non-UTF-8 payloads to the LLM.
- Vision-aware fallback for common image formats (JPEG/PNG/GIF/WebP) that captions images via OpenRouter and emits structured metadata.
- Simple API returning `serde_json::Value`.
- Configurable fallback strategies for large files (chunking or code generation).
## Installation
Add the crate to your project:
```bash
cargo add file_to_json --git https://github.com/your-org/file_to_json
```
*(Replace the repository URL with where you host the crate.)*
### For Contributors
This repository uses **Git LFS** to manage large example files. After cloning, you'll need to:
1. Install Git LFS: `brew install git-lfs` (macOS) or see [git-lfs.github.com](https://git-lfs.github.com/)
2. Initialize: `git lfs install`
3. Pull large files: `git lfs pull`
See `examples/README.md` for more details.
## Usage
```rust,no_run
use file_to_json::{Converter, FallbackStrategy, OpenRouterConfig};
use std::time::Duration;
fn main() -> Result<(), file_to_json::ConvertError> {
    let config = OpenRouterConfig {
        api_key: "sk-or-...".to_string(),
        model: "anthropic/claude-3.7-sonnet".to_string(),
        timeout: Duration::from_secs(60),
        fallback_strategy: FallbackStrategy::Chunked,
        vision_model: Some("anthropic/claude-3.7-sonnet".to_string()),
        max_image_bytes: 5 * 1024 * 1024, // 5 MiB
    };
    let converter = Converter::new(config)?;
    let value = converter.convert_path("data/sample.csv")?;
    println!("{}", serde_json::to_string_pretty(&value)?);
    Ok(())
}
```
### Configuration
The `OpenRouterConfig` struct accepts the following fields:
- `api_key` – **required**. Your OpenRouter API key.
- `model` – optional. Defaults to `anthropic/claude-3.7-sonnet`.
- `timeout` – optional. Request timeout duration. Defaults to 60 seconds.
- `fallback_strategy` – optional. `FallbackStrategy::Chunked` (default) or `FallbackStrategy::CodeGeneration`.
- `vision_model` – optional. Defaults to `anthropic/claude-3.5-sonnet`. Must support image inputs and JSON output.
- `max_image_bytes` – optional. Maximum size (bytes) of image payloads; defaults to `5242880` (5 MiB).
## Behaviour
1. If the file extension is recognised, the crate parses it locally.
2. If the file looks like a supported image (JPEG/PNG/GIF/WebP) it is base64-encoded and sent to the configured vision model, which is prompted to return JSON metadata containing a `summary`, `tags`, `objects`, `dominant_colors`, and `confidence`.
3. Otherwise, it sends the UTF-8 content (after extracting text for PDFs) to OpenRouter. For inputs larger than 128 KiB, the fallback strategy determines how to proceed:
   - `chunked` (`FallbackStrategy::Chunked`, the default): splits the input into ≤128 KiB segments, converts each chunk, and merges the returned JSON (arrays are concatenated, objects are shallow-merged, and mixed types are wrapped in an array). Works best when every chunk shares a compatible structure.
   - `code` (`FallbackStrategy::CodeGeneration`): sends the first, middle, and last 10 lines to the model, asks for Python 3 code that can parse the full file, writes that code to a temporary script, and executes it locally to produce JSON (requires `python3` on the `PATH`).
4. The result is returned as `serde_json::Value`.
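
The chunked strategy described above can be sketched with the standard library alone. This is an illustration, not the crate's actual implementation: `Json` is a minimal stand-in for `serde_json::Value`, and `split_chunks`, `merge`, `merge_all`, and `MAX_CHUNK_BYTES` are hypothetical names. It also assumes that, in a shallow merge, a key from a later chunk overwrites the same key from an earlier one, which the description above does not specify.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for `serde_json::Value` (hypothetical, illustration only).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Num(f64),
    Str(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

/// The 128 KiB threshold quoted in the text.
#[allow(dead_code)]
const MAX_CHUNK_BYTES: usize = 128 * 1024;

/// Split `text` into pieces of at most `max_bytes` bytes each,
/// never cutting through a multi-byte UTF-8 character.
fn split_chunks(text: &str, max_bytes: usize) -> Vec<&str> {
    assert!(max_bytes >= 4, "a chunk must be able to hold any UTF-8 character");
    let mut chunks = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        let mut end = rest.len().min(max_bytes);
        // Back up until `end` sits on a character boundary.
        while !rest.is_char_boundary(end) {
            end -= 1;
        }
        let (head, tail) = rest.split_at(end);
        chunks.push(head);
        rest = tail;
    }
    chunks
}

/// Merge two per-chunk results: arrays are concatenated, objects are
/// shallow-merged, and any other combination is wrapped in a two-element array.
fn merge(acc: Json, next: Json) -> Json {
    match (acc, next) {
        (Json::Array(mut a), Json::Array(b)) => {
            a.extend(b);
            Json::Array(a)
        }
        (Json::Object(mut a), Json::Object(b)) => {
            // Assumption: keys from the later chunk win on conflict.
            a.extend(b);
            Json::Object(a)
        }
        (a, b) => Json::Array(vec![a, b]),
    }
}

/// Fold all chunk results into one value; `None` if there were no chunks.
fn merge_all(parts: Vec<Json>) -> Option<Json> {
    parts.into_iter().reduce(merge)
}
```

Splitting on raw byte counts can still cut a record in half (for example, the middle of a CSV row), which is one reason the strategy works best when each chunk shares a compatible structure.
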
Binary files are rejected unless they are supported images (handled by the vision model), can be converted to UTF-8 text (e.g. PDFs via the built-in extractor), or can be handled by the code-generation fallback.
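
For the code-generation fallback, only a small sample of the file is sent to the model. A minimal sketch of taking the first, middle, and last 10 lines might look like the following. The function name `sample_lines` is hypothetical, and the handling of short files (returning every line when there are too few to sample without overlap) is an assumption rather than documented behaviour.

```rust
/// Return up to `n` lines each from the start, middle, and end of `text`.
/// Files with at most `3 * n` lines are returned whole (assumed behaviour).
fn sample_lines(text: &str, n: usize) -> Vec<&str> {
    let lines: Vec<&str> = text.lines().collect();
    if lines.len() <= 3 * n {
        return lines;
    }
    // Centre the middle window; with more than 3 * n lines it cannot
    // overlap the head or tail windows.
    let mid_start = (lines.len() - n) / 2;
    let mut sample = Vec::with_capacity(3 * n);
    sample.extend_from_slice(&lines[..n]);
    sample.extend_from_slice(&lines[mid_start..mid_start + n]);
    sample.extend_from_slice(&lines[lines.len() - n..]);
    sample
}
```

Calling `sample_lines(&contents, 10)` on a large file yields 30 lines regardless of the file's size, which keeps the prompt small while still showing the model any header, typical records, and trailer.
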
### Example: image captioning
Running the bundled example on a JPEG:
```bash
cargo run --example convert -- ./examples/data/einstein.jpg <API_KEY>
```
produces structured JSON similar to:
```json
{
  "summary": "A black and white portrait of an elderly person with wild white hair.",
  "tags": ["portrait", "black and white", "historical"],
  "objects": ["face", "hair", "jacket"],
  "dominant_colors": ["black", "white", "grey"],
  "confidence": 0.98
}
```
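
This README does not say how a file is judged to "look like" a supported image. One common approach, shown here purely as an assumption, is to check the file's leading magic bytes. `ImageKind` and `sniff_image` are hypothetical names and are not part of the crate's API.

```rust
/// The four image formats listed in this README.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ImageKind {
    Jpeg,
    Png,
    Gif,
    WebP,
}

/// Guess the image format from the file's leading bytes.
/// Returns `None` when no known signature matches.
fn sniff_image(bytes: &[u8]) -> Option<ImageKind> {
    if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        // JPEG: start-of-image marker followed by another marker byte.
        Some(ImageKind::Jpeg)
    } else if bytes.starts_with(b"\x89PNG\r\n\x1a\n") {
        // PNG: fixed 8-byte signature.
        Some(ImageKind::Png)
    } else if bytes.starts_with(b"GIF87a") || bytes.starts_with(b"GIF89a") {
        Some(ImageKind::Gif)
    } else if bytes.starts_with(b"RIFF") && bytes.get(8..12) == Some(&b"WEBP"[..]) {
        // WebP: RIFF container whose form type (bytes 8..12) is "WEBP".
        Some(ImageKind::WebP)
    } else {
        None
    }
}
```

A caller could read the file, pass the bytes to `sniff_image`, and then compare the length against `max_image_bytes` before base64-encoding the payload for the vision model.
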
## Testing
```bash
cargo test
```
## License
This project is distributed under the terms of the MIT license.