# tryparse
Multi-strategy parser for messy, real-world data. Built to handle LLM responses with broken JSON, markdown wrappers, type mismatches, and inconsistent formatting.
## Quick Start
```toml
[dependencies]
tryparse = "0.4"
serde = { version = "1.0", features = ["derive"] }
```
```rust
use tryparse::parse;
use serde::Deserialize;
#[derive(Deserialize, Debug)]
struct User {
name: String,
age: u32,
}
fn main() {
// Handles markdown wrappers, trailing commas, unquoted keys, type coercion
let messy_input = r#"
Here's your data:
```json
{
name: "Alice",
age: "30",
}
```
"#;
let user: User = parse(messy_input).unwrap();
println!("{:?}", user); // User { name: "Alice", age: 30 }
}
```
## Core Features
### With `serde::Deserialize`
Basic type coercion works out of the box:
```rust
use tryparse::parse;
use serde::Deserialize;
#[derive(Deserialize)]
struct Data {
count: i64, // "42" → 42
price: f64, // "3.14" → 3.14
active: bool, // "true" → true
tags: Vec<String>, // "tag" → ["tag"]
}
let data: Data = parse(r#"{"count": "42", "price": "3.14", "active": "true", "tags": "tag"}"#).unwrap();
```
### With `LlmDeserialize` (derive feature)
Advanced features require the `derive` feature:
```toml
[dependencies]
tryparse = { version = "0.4", features = ["derive"] }
tryparse-derive = "0.4"
```
**Fuzzy field matching** - Handles different naming conventions:
```rust
use tryparse::parse_llm;
use tryparse_derive::LlmDeserialize;
#[derive(Debug, LlmDeserialize)]
struct Config {
user_name: String, // Matches: userName, UserName, user-name, user.name, USER_NAME
max_count: i64,
}
let data: Config = parse_llm(r#"{"userName": "Alice", "maxCount": 30}"#).unwrap();
```
**Enum fuzzy matching** - Case-insensitive, partial matches:
```rust
#[derive(Debug, LlmDeserialize)]
enum Status {
InProgress, // Matches: "in_progress", "in-progress", "inprogress", "in progress"
Completed, // Matches: "complete", "COMPLETED", "done"
Cancelled,
}
```
**Internally-tagged enum fuzzy matching** - Tag values match fuzzily:
```rust
use serde::{Serialize, Deserialize};
#[derive(Debug, LlmDeserialize, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum Decision {
StartTask { skill_id: String, reasoning: String },
AskClarification { message: String },
}
// All of these work with fuzzy tag matching:
let d1: Decision = parse_llm(r#"{"type": "start_task", "skill_id": "test", "reasoning": "clear"}"#).unwrap();
let d2: Decision = parse_llm(r#"{"type": "StartTask", "skill_id": "test", "reasoning": "clear"}"#).unwrap();
let d3: Decision = parse_llm(r#"{"type": "startTask", "skill_id": "test", "reasoning": "clear"}"#).unwrap();
let d4: Decision = parse_llm(r#"{"type": "start-task", "skill_id": "test", "reasoning": "clear"}"#).unwrap();
let d5: Decision = parse_llm(r#"{"type": "STARTTASK", "skill_id": "test", "reasoning": "clear"}"#).unwrap();
```
**Union types** - Automatically picks the best variant:
```rust
#[derive(Debug, LlmDeserialize)]
#[llm(union)]
enum Value {
Number(i64),
Text(String),
List(Vec<String>),
}
// Parses as Number(42)
let v1: Value = parse_llm("42").unwrap();
// Parses as Text("hello")
let v2: Value = parse_llm(r#""hello""#).unwrap();
// Parses as List(...)
let v3: Value = parse_llm(r#"["a", "b"]"#).unwrap();
```
**Implied key** - Single-field structs unwrap values:
```rust
#[derive(Debug, LlmDeserialize)]
struct Wrapper {
data: String,
}
// Direct string wraps into the single field
let w: Wrapper = parse_llm(r#""hello world""#).unwrap();
assert_eq!(w.data, "hello world");
```
## API Reference
### Basic Parsing
```rust
// Parse with serde::Deserialize
fn parse<T: DeserializeOwned>(input: &str) -> Result<T>
// Parse with serde::Deserialize, get all candidates
fn parse_with_candidates<T: DeserializeOwned>(input: &str) -> Result<(T, Vec<FlexValue>)>
// Parse with custom parser configuration
fn parse_with_parser<T: DeserializeOwned>(input: &str, parser: &FlexibleParser) -> Result<T>
// Parse with metadata (strategy used, duration, candidates evaluated)
fn parse_with_metadata<T: DeserializeOwned>(input: &str) -> Result<(T, ParseMetadata)>
```
### Advanced Parsing (requires `derive` feature)
```rust
// Parse with LlmDeserialize trait (fuzzy matching, unions, etc.)
fn parse_llm<T: LlmDeserialize>(input: &str) -> Result<T>
// Parse with LlmDeserialize, get all candidates
fn parse_llm_with_candidates<T: LlmDeserialize>(input: &str) -> Result<(T, Vec<FlexValue>)>
```
### Utilities
```rust
// Score a candidate (lower is better)
fn score_candidate(candidate: &FlexValue) -> u32
// Rank candidates by score
fn rank_candidates(candidates: &mut [FlexValue])
// Get the best candidate
fn best_candidate(candidates: &[FlexValue]) -> Option<&FlexValue>
```
## How It Works
### 1. Multi-Stage Parsing Pipeline
```
Input String
↓
┌──────────────────────────────────┐
│ Pre-Processing │
│ • Remove BOM, zero-width chars │
│ • Fix excessive nesting (>50) │
│ • Normalize backslashes │
└────────────┬─────────────────────┘
↓
┌──────────────────────────────────┐
│ Strategy Execution (parallel) │
│ • DirectJson (priority 1) │
│ • Markdown (priority 2) │
│ • YAML (priority 15)│
│ • JsonFixer (priority 20)│
│ • Heuristic (priority 30)│
│ │
│ → Produces Vec<FlexValue> │
└────────────┬─────────────────────┘
↓
┌──────────────────────────────────┐
│ Scoring & Ranking │
│ • Base score by source │
│ • Transformation penalties │
│ • Confidence adjustment │
│ • Sort ascending (best first) │
└────────────┬─────────────────────┘
↓
┌──────────────────────────────────┐
│ Deserialization │
│ • Try candidates in order │
│ • Apply type coercion │
│ • Track transformations │
│ • Return first success │
└──────────────────────────────────┘
```
### 2. Parsing Strategies
| **DirectJson** | 1 | Direct `serde_json::from_str()`. Fastest path for valid JSON. |
| **Markdown** | 2 | Extracts from markdown code blocks. Scores by keywords, position, size. |
| **YAML** | 15 | Parses YAML, converts to JSON. Requires `yaml` feature. |
| **JsonFixer** | 20 | Fixes common JSON errors (see below). |
| **Heuristic** | 30 | Pattern-based extraction from prose. Last resort. |
### 3. JSON Fixes Applied
The `JsonFixer` strategy handles:
- **Trailing commas**: `{"a": 1,}` → `{"a": 1}`
- **Unquoted keys**: `{name: "x"}` → `{"name": "x"}`
- **Single quotes**: `{'a': 1}` → `{"a": 1}`
- **Missing commas**: `{"a":1 "b":2}` → `{"a":1,"b":2}`
- **Unclosed braces/brackets**: `{"a": 1` → `{"a": 1}`
- **Comments**: `{"a": 1 /* comment */}` → `{"a": 1}`
- **Smart quotes**: `{"a": "value"}` → `{"a": "value"}`
- **Double-escaped JSON**: `"{\"a\":1}"` → `{"a":1}`
- **Template literals**: `` {`key`: "value"} `` → `{"key": "value"}`
- **Hex numbers**: `{"a": 0xFF}` → `{"a": 255}`
- **Unescaped newlines** in strings
- **JavaScript functions**: Removed entirely
### 4. Type Coercion
Applied during deserialization (works with both `Deserialize` and `LlmDeserialize`):
| String | Number | `"42"` → `42` |
| String | Bool | `"true"` → `true` |
| Number | String | `42` → `"42"` |
| Float | Int | `42.0` → `42` |
| Single | Array | `"item"` → `["item"]` |
### 5. Field Matching (LlmDeserialize only)
Normalizes field names to snake_case and matches case-insensitively:
| `user_name` | `userName`, `UserName`, `user-name`, `user.name`, `USER_NAME`, `username` |
| `max_count` | `maxCount`, `MaxCount`, `max-count`, `max.count`, `MAX_COUNT` |
**Note**: Does not handle acronyms perfectly. `XMLParser` becomes `x_m_l_parser` not `xml_parser`.
### 6. Scoring System
**Base Scores** (by source):
- Direct JSON: 0
- Markdown: 10
- YAML: 15
- Fixed JSON: 20 + (5 × number of fixes)
- Heuristic: 50
**Transformation Penalties**:
- String→Number: +2
- Float→Int: +3
- Field rename: +4
- Single→Array: +5
- Default inserted: +50
**Confidence Modifier**:
- Each transformation reduces confidence by 5%
- Final score += `(1.0 - confidence) × 100`
**Lower scores win**. Direct JSON with no coercion scores 0 (best possible).
## Examples
### Handling Markdown Responses
```rust
let llm_output = r#"
Sure! Here's the user data:
```json
{
"name": "Alice",
"age": 30,
"email": "alice@example.com"
}
```
Let me know if you need anything else!
"#;
#[derive(Deserialize, Debug)]
struct User {
name: String,
age: i64,
email: String,
}
let user: User = parse(llm_output).unwrap();
```
### Inspecting Parse Candidates
```rust
use tryparse::{parse_with_candidates, scoring::score_candidate};
let (result, candidates) = parse_with_candidates::<User>(messy_input).unwrap();
println!("Best result: {:?}", result);
println!("\nAll candidates:");
for (i, candidate) in candidates.iter().enumerate() {
println!(" {}: {:?} (score: {})",
i,
candidate.source(),
score_candidate(candidate)
);
// Inspect transformations
for t in candidate.transformations() {
println!(" - {:?}", t);
}
}
```
### Custom Parser Configuration
```rust
use tryparse::parser::FlexibleParser;
use tryparse::parse_with_parser;
// Use builder pattern for custom configuration
let parser = FlexibleParser::builder()
.without_heuristic() // Disable heuristic extraction
.without_markdown() // Disable markdown extraction
.build();
let data: User = parse_with_parser(input, &parser).unwrap();
```
### Complex Nested Structures
```rust
use std::collections::HashMap;
#[derive(Debug, LlmDeserialize)]
struct Project {
name: String,
owner: User,
status: Status,
tags: Vec<String>,
metadata: HashMap<String, String>,
}
let project: Project = parse_llm(complex_json).unwrap();
```
### Union Types with Scoring
```rust
#[derive(Debug, LlmDeserialize)]
#[llm(union)]
enum Response {
Success { data: User },
Error { message: String, code: i64 },
Pending { estimated_time: i64 },
}
// Automatically picks the variant that best matches the structure
let response: Response = parse_llm(api_response).unwrap();
match response {
Response::Success { data } => println!("User: {:?}", data),
Response::Error { message, code } => eprintln!("Error {}: {}", code, message),
Response::Pending { estimated_time } => println!("Wait {}s", estimated_time),
}
```
## Feature Flags
```toml
# Default: includes markdown and yaml
[dependencies]
tryparse = "0.4"
# Minimal build (core JSON parsing only)
tryparse = { version = "0.4", default-features = false }
# With derive macros for LlmDeserialize
tryparse = { version = "0.4", features = ["derive"] }
# All features
tryparse = { version = "0.4", features = ["derive", "markdown", "yaml"] }
```
Available features:
- `markdown` (default) - Markdown code block extraction
- `yaml` (default) - YAML parsing support
- `derive` - Derive macro for `LlmDeserialize` (fuzzy field/enum matching, union types)
## Testing
```bash
# Unit tests (226 tests in lib)
cargo test --lib
# All tests (lib + integration + doc tests)
cargo test --all-features
# Minimal build tests
cargo test --no-default-features
# Run specific test
cargo test --test integration_test -- --nocapture
```
## Performance Considerations
- **Parsing is synchronous**: No async/await support
- **Memory overhead**: Tracks all parsing candidates and transformations
- **Strategy execution**: Some strategies run in parallel
- **Regex compilation**: Expensive regexes are compiled lazily and cached
- **Best for**: <1MB inputs, occasional parsing (not high-frequency loops)
Optimizations:
- Direct JSON (valid JSON) takes the fastest path
- Failed strategies short-circuit early
- Scoring is lazy (only computed when needed)
- Copy semantics for small types (enums, etc.)
## Debugging
Enable detailed logging:
```rust
env_logger::init();
std::env::set_var("RUST_LOG", "tryparse=debug");
let result = parse::<User>(input);
```
Inspect what went wrong:
```rust
match parse::<User>(input) {
Ok(user) => println!("Success: {:?}", user),
Err(e) => {
eprintln!("Parse failed: {:?}", e);
// Try getting candidates to see what was parsed
if let Ok((_, candidates)) = parse_with_candidates::<User>(input) {
for c in candidates {
eprintln!("Candidate: {:?}", c.value);
}
}
}
}
```
## Known Limitations
1. **Synchronous only** - No async parsing
2. **No streaming** - Requires complete input string
3. **Memory overhead** - Tracks all candidates and transformations
4. **Acronym handling** - `XMLParser` → `x_m_l_parser` (not `xml_parser`)
5. **Best-effort parsing** - May produce unexpected results on ambiguous input
6. **No custom deserializers** - Can't implement custom `Deserialize` logic for fields
7. **Internally-tagged enum fields** - Tag values support fuzzy matching, but variant fields use standard serde deserialization (exact match or serde's `rename_all`)
## Contributing
Requirements:
- Rust 1.70.0+
- Run `cargo fmt` before committing
- Pass `cargo clippy --all-targets --all-features`
- All tests must pass: `cargo test --all-features`
Pull request checklist:
1. Clear description of what and why
2. Tests for new functionality
3. Update README if API changes
4. No clippy warnings
5. All existing tests pass
## License
Apache-2.0
## Credits
Parsing algorithms inspired by [BAML's Schema-Aligned Parsing](https://github.com/BoundaryML/baml).