# hedl-json
**HEDL's integration with the JSON ecosystem: bidirectional conversion, JSONPath queries, schema generation, and streaming.**
JSON is the universal data interchange format. Your APIs speak it, your databases accept it, your monitoring tools consume it, your LLM providers require it. Every token in a JSON payload costs money. Every extra byte adds latency. Every API call compounds the inefficiency.
`hedl-json` bridges HEDL's efficiency with JSON's ubiquity. Use HEDL's compact matrix notation internally to save 46.7% on tokens and 57.7% on payload size. When you need JSON compatibility, `hedl-json` handles the conversion seamlessly. Query HEDL documents with JSONPath. Generate JSON Schema for validation. Stream large JSON files without loading everything into memory.
Part of the **HEDL format family** alongside `hedl-yaml`, `hedl-xml`, `hedl-csv`, and `hedl-parquet`, bringing HEDL's efficiency to every ecosystem you work in.
## What's Implemented
Based on 6,333 lines of Rust across 7 modules:
1. **Bidirectional Conversion**: HEDL ↔ JSON with configurable fidelity
2. **JSONPath Queries**: Query HEDL documents using standard JSONPath syntax
3. **JSON Schema Generation**: Generate JSON Schema Draft 7 from HEDL documents
4. **Streaming Parsers**: Process large JSON/JSONL files incrementally without full memory load
5. **Schema Caching**: LRU cache for repeated structure inference (30-50% speedup)
6. **Security Limits**: DoS protection with configurable resource limits
## Installation
```toml
[dependencies]
hedl-json = "2.0"
```
## Bidirectional Conversion
### HEDL → JSON: Export for APIs and LLMs
Convert HEDL's compact representation to JSON when you need API compatibility:
```rust
use hedl_json::{to_json, to_json_value, ToJsonConfig};
let doc = hedl_core::parse(br#"
%S:User:[id, name, email]
---
users: @User
| alice, Alice Smith, alice@example.com
| bob, Bob Jones, bob@example.com
"#)?;
// Configure JSON output
let config = ToJsonConfig {
include_metadata: false, // Don't add __type__, __schema__ fields
flatten_lists: false, // Keep matrix structure as object arrays
include_children: true, // Include nested entities
ascii_safe: false, // UTF-8 output (set true for ASCII-only)
};
// Convert to JSON string (for API responses)
let json_str = to_json(&doc, &config)?;
// {"users": [{"id": "alice", "name": "Alice Smith", "email": "alice@example.com"}, ...]}
// Or get serde_json::Value directly (for further processing)
let json_val = to_json_value(&doc, &config)?;
```
**Token Efficiency**: HEDL's matrix notation saves 46.7% tokens compared to verbose JSON arrays. Use HEDL internally, export to JSON only at system boundaries.
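The "compact internally, JSON at the boundary" idea can be illustrated with a std-only sketch. The two literals below are illustrative samples of the same records in both notations, not measured benchmark data, and `payload_savings` is a hypothetical helper, not part of the crate:

```rust
/// Fraction of bytes saved by the HEDL representation relative to JSON.
/// Illustrative only: real token savings depend on the tokenizer.
fn payload_savings(json: &str, hedl: &str) -> f64 {
    1.0 - (hedl.len() as f64 / json.len() as f64)
}

// Example inputs (same two records in both notations):
// let json = r#"[{"id":"alice","name":"Alice Smith"},{"id":"bob","name":"Bob Jones"}]"#;
// let hedl = "users: @User[id, name]\n| alice, Alice Smith\n| bob, Bob Jones\n";
// payload_savings(json, hedl) is positive: the matrix list repeats no keys.
```

The saving grows with row count, since HEDL states the column names once while JSON repeats every key in every object.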
### JSON → HEDL: Import from APIs and Files
Parse JSON from external APIs into HEDL's structured data model:
```rust
use hedl_json::{from_json, from_json_value, from_json_value_owned, FromJsonConfig};
// From JSON string (e.g., API response)
let json = r#"{"name": "Alice", "age": 30, "active": true}"#;
let config = FromJsonConfig::default();
let doc = from_json(json, &config)?;
// From serde_json::Value (existing parsed JSON)
let value: serde_json::Value = serde_json::from_str(json)?;
// Borrows the value (value remains usable after conversion)
let doc = from_json_value(&value, &config)?;
// Or takes ownership for zero-copy efficiency
let doc = from_json_value_owned(value, &config)?;
```
## Security Limits: DoS Protection
`FromJsonConfig` enforces resource limits to prevent denial-of-service attacks from malicious JSON. Defaults are intentionally **high** for legitimate ML and data processing workloads:
```rust
use hedl_json::{from_json, FromJsonConfig};
// Default configuration (for trusted internal data)
let default = FromJsonConfig::default();
// max_depth: Some(10,000) levels (deep hierarchies, nested JSON)
// max_array_size: Some(10,000,000) elements (large datasets, batch processing)
// max_string_length: Some(100 MB) (embeddings, base64-encoded data)
// max_object_size: Some(100,000) keys (rich metadata, complex objects)
let json = r#"{"name": "Alice", "age": 30}"#;
let doc = from_json(json, &default)?;
```
For untrusted input (user uploads, external APIs, public endpoints), use stricter limits:
```rust
use hedl_json::{from_json, FromJsonConfig};
// Strict configuration (for untrusted external sources)
let strict = FromJsonConfig::builder()
.max_depth(100) // 100 levels
.max_array_size(10_000) // 10K elements
.max_string_length(1_000_000) // 1 MB
.max_object_size(1_000) // 1K keys
.build();
let json = r#"{"name": "Bob", "age": 25}"#;
let doc = from_json(json, &strict)?;
```
Exceeding limits returns `JsonConversionError` variants: `MaxDepthExceeded`, `MaxArraySizeExceeded`, `MaxStringLengthExceeded`, `MaxObjectSizeExceeded`.
## Schema Caching: 30-50% Speedup
When converting JSON arrays with repeated structure (common in API responses), `hedl-json` caches inferred schemas automatically:
```rust
use hedl_json::schema_cache::{SchemaCache, SchemaCacheKey};
let cache = SchemaCache::new(100); // Capacity: 100 schemas
// Cache is used automatically during from_json() for uniform arrays
// Manual cache usage (for advanced control):
let key = SchemaCacheKey::new(vec!["id".to_string(), "name".to_string()]);
cache.insert(key.clone(), vec!["id".to_string(), "name".to_string()]);
if let Some(schema) = cache.get(&key) {
// Hit: 30-50% faster than re-inferring schema
}
// Monitor cache performance
let stats = cache.statistics();
println!("Hit rate: {:.2}%", stats.hit_rate() * 100.0);
println!("Hits: {}, Misses: {}, Evictions: {}",
stats.hits, stats.misses, stats.evictions);
```
For 1000-row JSON arrays with repeated structure, schema caching provides 30-50% speedup over naive inference.
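The payoff can be sketched with a std-only stand-in for the cache. `MiniSchemaCache` below is illustrative, not the crate's internal implementation: it memoizes the inferred column list per field-name set, so a uniform 1000-row array pays inference once and hits the cache 999 times:

```rust
use std::collections::HashMap;

/// Std-only stand-in for the schema-caching idea: memoize the inferred
/// column list for a given set of field names. Illustrative sketch only.
struct MiniSchemaCache {
    map: HashMap<Vec<String>, Vec<String>>,
    hits: u64,
    misses: u64,
}

impl MiniSchemaCache {
    fn new() -> Self {
        Self { map: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Return the cached schema, or "infer" and store it on a miss.
    fn get_or_infer(&mut self, fields: &[String]) -> Vec<String> {
        if let Some(schema) = self.map.get(fields) {
            self.hits += 1;
            return schema.clone();
        }
        self.misses += 1;
        let schema = fields.to_vec(); // stand-in for real schema inference
        self.map.insert(fields.to_vec(), schema.clone());
        schema
    }

    /// Fraction of lookups served from the cache.
    fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 { 0.0 } else { self.hits as f64 / total as f64 }
    }
}
```

For 1000 rows with identical fields, this yields one miss and 999 hits, which is where the repeated-structure speedup comes from.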
## JSONPath Queries
Query HEDL documents using standard JSONPath syntax (powered by `serde_json_path`):
```rust
use hedl_json::jsonpath::{query, query_first, query_single, query_exists, query_count, QueryConfig};
let doc = hedl_core::parse(br#"
users: @User[id, name, age]
| alice, Alice Smith, 30
| bob, Bob Jones, 25
| carol, Carol White, 35
"#)?;
let config = QueryConfig::default();
// Get all matches
let results = query(&doc, "$.users[?(@.age > 30)].name", &config)?;
// Returns: [serde_json::Value("Carol White")]
// Get first match (returns Option)
let first = query_first(&doc, "$.users[0].name", &config)?;
// Returns: Some(serde_json::Value("Alice Smith"))
// Get exactly one match (errors if 0 or multiple matches)
let single = query_single(&doc, "$.users[?(@.id == 'alice')].name", &config)?;
// Returns: serde_json::Value("Alice Smith")
// Check if any matches exist
let exists = query_exists(&doc, "$.users[?(@.age > 40)]", &config)?;
// Returns: false
// Count matches
let count = query_count(&doc, "$.users[*]", &config)?;
// Returns: 3
```
### QueryConfig Options
```rust
use hedl_json::jsonpath::{QueryConfig, QueryConfigBuilder};
let config = QueryConfig {
include_metadata: false, // Don't add __type__ fields in results
flatten_lists: false, // Keep matrix structure
include_children: true, // Include nested data
max_results: 100, // Limit results (0 = unlimited)
};
// Or use builder
let config = QueryConfigBuilder::new()
.include_metadata(false)
.max_results(50)
.build();
```
## JSON Schema Generation
Generate JSON Schema Draft 7 from HEDL documents for validation and documentation:
```rust
use hedl_json::schema_gen::{generate_schema, generate_schema_value, SchemaConfig};
let doc = hedl_core::parse(br#"
%S:User:[id, name, email, age]
---
users: @User
| u1, Alice, alice@example.com, 30
"#)?;
let config = SchemaConfig::builder()
.title("User API Schema")
.description("Schema for user data endpoint")
.schema_id("https://api.example.com/schemas/user.json")
.strict(true) // disallow additionalProperties
.include_examples(true) // add example values from data
.include_metadata(true) // include title/description/$id
.build();
// Generate as JSON string (for documentation)
let schema_json = generate_schema(&doc, &config)?;
// Or as serde_json::Value (for programmatic use)
let schema_value = generate_schema_value(&doc, &config)?;
```
### Smart Type Inference
The schema generator infers JSON Schema formats from actual data:
**Value-Based Inference** (analyzed during schema generation):
```text
// Field values → JSON Schema format annotation
"alice@example.com" → {"type": "string", "format": "email"}
"https://example.com" → {"type": "string", "format": "uri"}
"2024-01-15T10:30:00Z" → {"type": "string", "format": "date-time"}
"550e8400-e29b-41d4-a716-..." → {"type": "string", "format": "uuid"}
```
**Name-Based Inference** (fallback when values are ambiguous):
```text
// Field names → format hints
"email" field → format: "email"
"url" field → format: "uri"
"created_at" field → format: "date-time"
"uuid" field → format: "uuid"
```
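The name-based fallback amounts to a lookup table over well-known field names. The helper below is a hypothetical illustration mirroring the mappings above, not the crate's actual API:

```rust
/// Illustrative name-based format hint, mirroring the table above.
/// Hypothetical helper, not part of hedl-json's public API.
fn format_hint(field_name: &str) -> Option<&'static str> {
    match field_name {
        "email" => Some("email"),
        "url" => Some("uri"),
        "created_at" => Some("date-time"),
        "uuid" => Some("uuid"),
        _ => None, // ambiguous names get no format annotation
    }
}
```

Value-based inference takes precedence; the name-based hint only applies when the values themselves don't determine a format.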
### %NEST Relationships in Schemas
HEDL's `%NEST` declarations become nested object arrays in JSON Schema:
```rust
let doc = hedl_core::parse(br#"
%S:Team:[id, name]
%S:Member:[id, name, role]
%N:Team>Member
---
teams: @Team
| t1, Engineering
"#)?;
let schema = generate_schema_value(&doc, &SchemaConfig::default())?;
// Team schema includes:
// {
// "type": "object",
// "properties": {
// "id": {"type": "string"},
// "name": {"type": "string"},
// "members": {
// "type": "array",
// "items": {"$ref": "#/definitions/Member"}
// }
// }
// }
```
## Streaming: Process Large JSON Without Full Memory Load
### JSON Array Streaming
Stream elements from large JSON arrays incrementally:
```rust
use hedl_json::streaming::{JsonArrayStreamer, StreamConfig};
use std::fs::File;
// Open large JSON file: [{...}, {...}, {...}, ...]
let file = File::open("large_dataset.json")?;
let config = StreamConfig::default();
let streamer = JsonArrayStreamer::new(file, config)?;
let mut count = 0;
for result in streamer {
let doc = result?; // Each array element as HEDL document
count += 1;
// Process document: validate, transform, aggregate
}
println!("Processed {} documents", count);
```
**Performance**: Streaming is 1.2-2.1x faster than loading the full array and parsing.
### JSONL (JSON Lines) Streaming
Stream JSONL files line-by-line with robust error handling:
```rust
use hedl_json::streaming::{JsonLinesStreamer, StreamConfig};
use std::fs::File;
let file = File::open("logs.jsonl")?; // One JSON object per line
let config = StreamConfig::default();
let mut streamer = JsonLinesStreamer::new(file, config);
// Use `while let` rather than a `for` loop so `streamer` stays
// accessible for line_number() inside the body
while let Some(result) = streamer.next() {
    match result {
        Ok(doc) => {
            // Process valid log entry
        }
        Err(e) => {
            // Malformed line - log error and continue
            eprintln!("Skipping malformed line {}: {}",
                streamer.line_number(), e);
        }
    }
}
```
**JSONL Features**:
- Blank lines: automatically skipped
- Comments: lines starting with `#` are ignored
- Robust: continues processing on invalid lines (errors returned per line)
- Line tracking: `line_number()` method for debugging
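The blank-line and comment handling can be sketched with a std-only pre-filter. This is an illustration of the behavior described above, not the streamer's internal code (`JsonLinesStreamer` does the equivalent internally):

```rust
/// Return (1-based line number, payload) pairs for lines that should
/// reach the JSON parser: blank lines and `#` comment lines are dropped.
/// Illustrative sketch of the JSONL filtering rules described above.
fn jsonl_payload_lines(input: &str) -> Vec<(usize, &str)> {
    input
        .lines()
        .enumerate()
        .map(|(i, line)| (i + 1, line.trim()))
        .filter(|(_, line)| !line.is_empty() && !line.starts_with('#'))
        .collect()
}
```

Keeping the original line numbers alongside the payloads is what makes per-line error reports (like the `line_number()` call above) possible.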
### JSONL Writing
Write HEDL documents as JSONL for streaming output:
```rust
use hedl_json::streaming::JsonLinesWriter;
use std::fs::File;
let file = File::create("output.jsonl")?;
let mut writer = JsonLinesWriter::new(file);
for doc in documents {
writer.write_document(&doc)?; // One document per line
}
writer.flush()?; // Ensure all data written
```
### StreamConfig Options
```rust
use hedl_json::streaming::StreamConfig;
use hedl_json::FromJsonConfig;
let config = StreamConfig {
buffer_size: 64 * 1024, // 64 KB buffer (default)
max_object_bytes: Some(10 * 1024 * 1024), // 10 MB per object (default)
from_json: FromJsonConfig::default(), // Security limits per object
use_size_estimation: true, // Efficient size checks (default)
true_streaming: true, // Constant memory for arrays (default)
};
// Or use builder
let config = StreamConfig::builder()
.buffer_size(128 * 1024) // 128 KB buffer
.max_object_bytes(50 * 1024 * 1024) // 50 MB per object
.unlimited_object_size() // Disable limit (use with caution)
.from_json_config(FromJsonConfig::builder()
.max_depth(100)
.build())
.use_size_estimation(true) // Efficient size checks
.true_streaming(true) // Constant memory mode
.build();
```
## Format Mapping
### HEDL → JSON
| HEDL | JSON | Notes |
|------|------|-------|
| Scalars (null, bool, number, string) | Direct mapping | `null`, `true`, `42`, `"text"` |
| Objects | JSON objects | `{"key": "value"}` |
| Arrays (tensors) | JSON arrays | `[1, 2, 3]` |
| `@User:alice` (reference) | `{"@ref": "@User:alice"}` | Special object format |
| `$(x + 1)` (expression) | `"$(x + 1)"` | String with `$()` wrapper |
| Matrix lists | Arrays of objects | `[{"id": "a", "name": "Alice"}, ...]` |
Example matrix list conversion:
```hedl
users: @User[id, name]
| alice, Alice
| bob, Bob
```
Becomes:
```json
{
"users": [
{"id": "alice", "name": "Alice"},
{"id": "bob", "name": "Bob"}
]
}
```
### JSON → HEDL
| JSON | HEDL | Notes |
|------|------|-------|
| Objects | HEDL objects | Nested structures preserved |
| Arrays | HEDL arrays | Uniform objects become matrix lists |
| `{"@ref": "..."}` | HEDL reference | Special format recognized |
| `"$(...)"` strings | HEDL expression | Pattern triggers expression parsing |
| Primitives | Direct mapping | Null, bool, number, string |
**Schema Inference**: Uniform object arrays are automatically converted to matrix lists with inferred schemas. Fields are sorted alphabetically with `id` first if present.
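The column-ordering rule can be sketched in a few lines. `order_columns` below is a hypothetical helper illustrating the rule, not the crate's API:

```rust
/// Order inferred matrix-list columns: alphabetical, with `id`
/// promoted to the front when present. Illustrative sketch of the
/// ordering rule described above, not hedl-json's actual code.
fn order_columns(mut fields: Vec<String>) -> Vec<String> {
    fields.sort();
    if let Some(pos) = fields.iter().position(|f| f == "id") {
        let id = fields.remove(pos);
        fields.insert(0, id);
    }
    fields
}
```

Deterministic ordering matters here: it keeps round-tripped matrix lists stable across runs, so diffs and caches stay meaningful.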
## Use Cases
**API Integration**: Receive JSON from external APIs, convert to HEDL for structured processing, export back to JSON for responses. Save 46.7% on token costs for LLM API calls.
**Data Pipelines**: Read JSON logs/events, process with HEDL's structured model, export to CSV (`hedl-csv`) or Parquet (`hedl-parquet`) for analytics.
**Configuration Management**: Store configs in HEDL with schema validation (`hedl-lint`), export to JSON for runtime consumption by existing tools.
**LLM Context Optimization**: Convert verbose JSON prompt content to HEDL (46.7% token savings) and send the compact HEDL text in your prompts; the provider's request envelope itself remains JSON, so conversion happens only at that boundary.
**Schema Documentation**: Generate JSON Schema from HEDL documents for API documentation, OpenAPI specs, and validation tools.
**Log Processing**: Stream large JSONL log files, filter/transform with HEDL's query API, aggregate statistics without full memory load.
## What This Crate Doesn't Do
**Schema Preservation**: JSON has no schema concept. HEDL's `%STRUCT`, `%NEST`, `%ALIAS` declarations are lost in JSON conversion. If you need validation after round-tripping through JSON, redefine schemas explicitly in HEDL.
**Validation**: Converts formats faithfully, doesn't validate data against schemas. For schema validation, use `hedl-lint`.
**Optimization**: Converts structures as-is, not optimally. Verbose JSON becomes verbose HEDL. To leverage HEDL's matrix efficiency, restructure data into uniform arrays intentionally.
**True Array Streaming**: `JsonArrayStreamer` loads the entire JSON array into memory (limitation of `serde_json`). For true incremental processing, use `JsonLinesStreamer` with JSONL format.
## Dependencies
- `serde_json` 1.0 - JSON parsing and serialization
- `serde_json_path` 0.7 - JSONPath query engine
- `hedl-core` 2.0 - HEDL parsing and data model
- `thiserror` 1.0 - Error type definitions
## Performance Characteristics
**Conversion**: HEDL → JSON is serialization-bound. JSON → HEDL is parsing-bound.
**Caching**: Schema inference with caching provides 30-50% speedup for repeated structures in JSON arrays.
**Streaming**:
- JSONL processing is O(1) memory per object
- JSON array streaming loads full array (use JSONL for large files)
- Streaming is 1.2-2.1x faster than full parse for large datasets
**JSONPath**: Query performance depends on `serde_json_path` implementation. Queries execute on JSON representation (HEDL → JSON conversion happens first).
Detailed performance benchmarks are available in the HEDL repository benchmark suite.
## License
Apache-2.0