hedl-stream 2.0.0

Streaming parser for HEDL - memory-efficient processing of large files

hedl-stream

Streaming parser for HEDL documents. When you have a multi-gigabyte file and cannot load it into memory, this crate processes it incrementally with constant memory overhead.

hedl-core parses entire documents into memory before giving you access. That works for configuration files and small datasets, but fails for database exports, log archives, and data pipelines where files reach tens of gigabytes. This crate solves that by emitting events as it parses: you receive each node immediately after it's read from the input stream, and memory usage stays proportional to nesting depth rather than file size.

Getting Started

[dependencies]
hedl-stream = "1"

For async support with Tokio:

[dependencies]
hedl-stream = { version = "1", features = ["async"] }
tokio = { version = "1", features = ["io-util"] }

Basic Usage

Open a file, iterate through events, do something with each node:

use hedl_stream::{StreamingParser, NodeEvent};
use std::fs::File;

let file = File::open("massive_data.hedl")?;
let parser = StreamingParser::new(file)?;

let mut node_count = 0;
for event in parser {
    match event? {
        NodeEvent::Header(header) => {
            println!("Found schemas: {:?}", header.structs.keys());
        }
        NodeEvent::ListStart { key, type_name, .. } => {
            println!("Starting list '{}' of type {}", key, type_name);
        }
        NodeEvent::Node(node) => {
            node_count += 1;
            // Do something with each node
            // Memory stays constant no matter how many you process
        }
        NodeEvent::ListEnd { key, count, .. } => {
            println!("Finished '{}': {} nodes", key, count);
        }
        NodeEvent::EndOfDocument => break,
        _ => {}
    }
}
println!("Processed {} nodes", node_count);

Working with Untrusted Input

When parsing files you don't control, explicit size limits and a timeout prevent someone from feeding you a file designed to exhaust memory or hang your server:

use hedl_stream::{StreamingParser, StreamingParserConfig, MemoryLimits};
use std::time::Duration;
use std::fs::File;

let config = StreamingParserConfig {
    max_line_length: 500_000,                      // 500 KB lines max
    max_indent_depth: 50,                          // 50 levels of nesting
    buffer_size: 128 * 1024,                       // 128 KB I/O buffer
    timeout: Some(Duration::from_secs(30)),        // Give up after 30 seconds
    memory_limits: MemoryLimits::untrusted(),      // Stricter limits
    enable_pooling: false,
};

let file = File::open("user_upload.hedl")?;
let parser = StreamingParser::with_config(file, config)?;
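
Once the limits are in place, a common pattern is to stop at the first error and reject the upload outright. Here's a minimal sketch; validate_upload is a hypothetical helper written for this example, not part of the crate:

use hedl_stream::{NodeEvent, StreamingParser, StreamingParserConfig};
use std::fs::File;
use std::path::Path;

// Hypothetical helper: returns the node count if the upload parses cleanly,
// or the first streaming error (syntax, shape, limit, timeout) otherwise.
fn validate_upload(
    path: &Path,
    config: StreamingParserConfig,
) -> Result<u64, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let parser = StreamingParser::with_config(file, config)?;

    let mut nodes = 0u64;
    for event in parser {
        match event? {
            NodeEvent::Node(_) => nodes += 1,
            NodeEvent::EndOfDocument => break,
            _ => {}
        }
    }
    Ok(nodes)
}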

Async Parsing

The same events, delivered through an async pull-based API, for when you're juggling thousands of concurrent streams:

use hedl_stream::{AsyncStreamingParser, NodeEvent};
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("large_data.hedl").await?;
    let mut parser = AsyncStreamingParser::new(file).await?;

    loop {
        match parser.next_event().await? {
            Some(NodeEvent::Node(node)) => {
                // process node
            }
            Some(NodeEvent::EndOfDocument) => break,
            Some(_) => {}
            None => break,
        }
    }

    Ok(())
}

Events You'll See

The parser emits events as it encounters them:

Header comes first, containing the document's version, schemas, aliases, and nesting rules.

ListStart marks the beginning of an entity list (like users: @User[id, name, age]), telling you the field name, type, and column schema.

Node is an individual row from a list. Each node carries its type, ID, field values, nesting depth, and parent information if nested.

ListEnd fires when a list finishes, including a count of how many nodes it contained.

Scalar appears for simple key-value pairs outside of lists.

ObjectStart and ObjectEnd bracket nested objects that aren't entity lists.

EndOfDocument means you've reached the end successfully.
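
For orientation, a small document like this:

users: @User[id, name]
 | alice, Alice Smith
 | bob, Bob Jones

produces roughly this sequence (payloads elided; exact fields depend on your schema):

Header
ListStart { key: "users", type_name: "User", .. }
Node (alice)
Node (bob)
ListEnd { key: "users", count: 2, .. }
EndOfDocument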

HEDL Features the Parser Handles

Matrix rows with CSV-like syntax:

users: @User[id, name, age]
 | alice, Alice Smith, 30
 | bob, "Jones, Bob", 25
 | carol, "Said \"hello\"", 35

Entity references are detected automatically:

customer: @User:alice     # Qualified reference to User alice
parent: @previous_item    # Local reference

Alias substitution:

%A:api_url: https://api.example.com
---
config:
  endpoint: $api_url      # Becomes https://api.example.com

Comments anywhere:

# Full line comment
users: @User[id, name]
 | alice, Alice    # Inline comment
 | bob, "Bob # This is not a comment (inside quotes)"

Error Handling

Errors tell you what went wrong and where:

use hedl_stream::StreamError;

for event in parser {
    match event {
        Ok(event) => { /* process */ }
        Err(StreamError::Syntax { line, message }) => {
            eprintln!("Line {}: {}", line, message);
        }
        Err(StreamError::ShapeMismatch { line, expected, got }) => {
            eprintln!("Line {}: expected {} columns, got {}", line, expected, got);
        }
        Err(StreamError::Timeout { elapsed, limit }) => {
            eprintln!("Timed out after {:?} (limit: {:?})", elapsed, limit);
        }
        Err(e) => eprintln!("Error: {}", e),
    }
}

Performance Notes

Memory usage depends on nesting depth, not file size. A 50 GB file with 10 levels of nesting uses the same memory as a 50 KB file with 10 levels of nesting.

The default 64 KB I/O buffer works well for most cases. Bump it to 128-256 KB for maximum throughput on large files.
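
For example, a throughput-oriented configuration might look like this (a sketch; every value other than buffer_size is illustrative, not a crate default):

use hedl_stream::{StreamingParser, StreamingParserConfig, MemoryLimits};
use std::fs::File;

let config = StreamingParserConfig {
    max_line_length: 1_000_000,               // generous 1 MB lines
    max_indent_depth: 100,                    // allow deep nesting
    buffer_size: 256 * 1024,                  // 256 KB I/O buffer for throughput
    timeout: None,                            // trusted input, no timeout
    memory_limits: MemoryLimits::untrusted(), // or a looser preset if one fits your data
    enable_pooling: true,
};

let file = File::open("large_export.hedl")?;
let parser = StreamingParser::with_config(file, config)?;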

On x86_64 with AVX2, comment detection uses SIMD for faster scanning of comment-heavy files.

Timeout checks happen every 100 operations, adding about 0.1% overhead.

Dependencies

  • hedl-core for shared types and lexer utilities
  • thiserror for error definitions
  • tokio (optional, with the "async" feature) for async I/O

License

Apache-2.0