# hedl-stream
Memory-efficient streaming parser for HEDL documents—process multi-gigabyte files with constant memory usage.
Large HEDL files don't fit in RAM. Database exports, log archives, data pipelines—gigabytes of structured data that need processing without loading everything into memory. Traditional parsing loads the entire document first, then gives you access. That doesn't scale.
hedl-stream provides event-driven streaming parsing with O(1) memory regardless of file size. Process 10 GB files with 100 MB RAM. Iterate through nodes as they're parsed. Build custom processing pipelines with standard Rust iterators. Add timeout protection for untrusted input. Optional async support for high-concurrency scenarios.
## What's Implemented
Production-grade streaming with comprehensive features:
- Event-Driven Streaming: SAX-style iterator API yielding nodes as they're parsed
- Constant Memory: O(nesting_depth) memory usage regardless of file size
- Full HEDL Support: Headers, schemas, matrix rows, nested structures, references, aliases
- Timeout Protection: Configurable timeout for untrusted input (DoS protection)
- SIMD Optimization: AVX2-accelerated comment detection (optional, x86_64 only)
- Async API: Full tokio-based async implementation (optional feature)
- Comprehensive Parsing: CSV-like rows, ditto operator, quoted strings, escape sequences
- Security Hardening: Max line length (1MB), max indent depth (100), timeout enforcement
- Type Inference: Automatic detection of null, bool, int, float, string, reference values
- Error Recovery: Line numbers on all errors, specific error types, clear messages
## Installation

```toml
[dependencies]
hedl-stream = "1.2"

# For async support:
hedl-stream = { version = "1.2", features = ["async"] }
tokio = { version = "1", features = ["io-util"] }
```
## Streaming API

### Basic Usage

Process large HEDL files with constant memory:
```rust
use hedl_stream::StreamingParser;
use std::fs::File;

// Open a large HEDL file (e.g., a 5 GB database export).
// The file name here is a placeholder.
let file = File::open("export.hedl")?;
let parser = StreamingParser::new(file)?;

let mut node_count = 0;
for event in parser {
    let event = event?;
    node_count += 1;
}
println!("Parsed {} events", node_count);
```
Memory Usage: Only the current line and context stack are in memory. A 5 GB file uses the same memory as a 5 MB file.
### Custom Configuration

Fine-tune parsing with `StreamingParserConfig`:

```rust
use hedl_stream::{StreamingParser, StreamingParserConfig};
use std::fs::File;
use std::time::Duration;

let config = StreamingParserConfig {
    timeout: Some(Duration::from_secs(30)),
    ..Default::default()
};
let file = File::open("untrusted.hedl")?;
let parser = StreamingParser::with_config(file, config)?;
// Parsing will error if it exceeds 30 seconds (DoS protection)
```
### Configuration Options

`max_line_length` (default: 1,000,000 bytes)
- Protection against malformed input with extremely long lines
- Raises a `Syntax` error if exceeded
- Recommended: 1 MB for normal data, lower for untrusted input

`max_indent_depth` (default: 100 levels)
- Limits nesting depth to prevent stack overflow
- Raises a `Syntax` error if exceeded
- Recommended: 100 for normal data, 20-50 for untrusted input

`buffer_size` (default: 65,536 bytes)
- I/O buffer size for reading input
- Larger buffers improve performance for large files
- Trade-off: memory vs. syscall frequency
- Recommended: 64 KB in general, 128-256 KB for high-throughput workloads

`timeout` (default: None)
- Optional duration limit for parsing operations
- Checked every 100 operations (low overhead, ~0.1%)
- Raises a `Timeout` error if exceeded
- Critical for untrusted input: prevents CPU DoS attacks
- Recommended: None for trusted data, 10-60 s for untrusted input

`memory_limits` (default: `MemoryLimits::default()`)
- Controls buffer pooling behavior and memory constraints
- Use `MemoryLimits::default()` for normal operation
- Use `MemoryLimits::untrusted()` for stricter limits on untrusted input
- See the `MemoryLimits` documentation for detailed configuration

`enable_pooling` (default: false)
- Enables buffer pooling to reduce allocations
- Useful for high-throughput scenarios with many concurrent parsers
- Requires `memory_limits.enable_buffer_pooling` to also be true
- Recommended: false for single-parser use, true for concurrent scenarios
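For reference, the defaults above can be collected into a self-contained sketch. The struct below is a hypothetical mirror of `StreamingParserConfig` (field names assumed from the option names), not the crate's actual definition:

```rust
use std::time::Duration;

// Hypothetical mirror of StreamingParserConfig, built from the
// documented defaults. Field names are assumptions for illustration.
#[derive(Debug, Clone)]
pub struct Config {
    pub max_line_length: usize,   // bytes
    pub max_indent_depth: usize,  // nesting levels
    pub buffer_size: usize,       // I/O buffer, bytes
    pub timeout: Option<Duration>,
    pub enable_pooling: bool,
}

impl Default for Config {
    fn default() -> Self {
        Config {
            max_line_length: 1_000_000,
            max_indent_depth: 100,
            buffer_size: 65_536,
            timeout: None,
            enable_pooling: false,
        }
    }
}

// A stricter profile for untrusted input, following the
// recommendations above (depth 20, 30 s timeout).
pub fn untrusted() -> Config {
    Config {
        max_indent_depth: 20,
        timeout: Some(Duration::from_secs(30)),
        ..Config::default()
    }
}

fn main() {
    println!("{:?}", untrusted());
}
```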
## Event Types

### NodeEvent Enum

`Header(HeaderInfo)`
- Emitted first with document metadata
- Contains: version, schemas (%STRUCT), aliases (%ALIAS), nesting rules (%NEST)

`ListStart { key, type_name, schema, line }`
- Marks the beginning of an entity list
- `key`: Field name (e.g., "users")
- `type_name`: Entity type (e.g., "User")
- `schema`: Column names for the matrix
- `line`: Source line number

`Node(NodeInfo)`
- Individual entity row from a matrix list
- Contains: type_name, id, fields (`Vec<Value>`), depth, parent info, line
- Emitted immediately after parsing (no buffering)

`ListEnd { key, type_name, count }`
- Marks the end of an entity list
- `count`: Total nodes in the list
- Emitted when indentation decreases or at EOF

`Scalar { key, value, line }`
- Key-value pair (not part of a matrix list)
- Example: `name: "My App"`

`ObjectStart { key, line }` / `ObjectEnd { key }`
- Nested object boundaries
- Used for non-list hierarchical data

`EndOfDocument`
- Marks successful parse completion
- Always emitted at EOF
### NodeInfo Structure

Methods:
- `get_field(index) -> Option<&Value>` - Get a field by column index
- `is_nested() -> bool` - Check whether the node has a parent
### HeaderInfo Structure

Methods:
- `get_schema(type_name) -> Option<&Vec<String>>` - Look up a type's schema
- `get_child_type(parent_type) -> Option<&String>` - Get the child type for a parent
## Async Support

For high-concurrency scenarios with thousands of concurrent streams:

```rust
use hedl_stream::AsyncStreamingParser;
use tokio::fs::File;

// Method names past the constructor are assumptions; see the crate docs.
async fn process() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("large.hedl").await?;
    let mut parser = AsyncStreamingParser::new(file).await?;
    while let Some(event) = parser.next_event().await {
        let _event = event?;
        // handle event...
    }
    Ok(())
}
```
Performance: Async version has identical memory profile to sync version. Suitable for processing thousands of files concurrently without blocking.
## Parsing Features

### Matrix Row Parsing

CSV-like comma-separated rows prefixed with `|`:

```hedl
users: @User[id, name, age, active]
| alice, Alice Smith, 30, true
| bob, Bob Jones, 25, false
| carol, Carol White, 35, true
```
Features:
- Quoted string handling: `| id1, "Smith, John", 30`
- Escape sequences: `| id2, "Quote: \"value\"", 25`
- Type inference: null, bool, int, float, string, reference
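The inference step can be illustrated with a small stand-alone sketch. The `Val` enum and the exact precedence below are assumptions for illustration; the real parser produces hedl-core `Value`s:

```rust
// Illustrative value model; the real crate uses hedl-core's Value type.
#[derive(Debug, PartialEq)]
enum Val {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    Reference(String),
    Str(String),
}

// Sketch of the documented inference for an unquoted cell: null, bool,
// int, float, reference ("@..."), then fall back to string.
fn infer(cell: &str) -> Val {
    let s = cell.trim();
    if s == "null" {
        Val::Null
    } else if s == "true" {
        Val::Bool(true)
    } else if s == "false" {
        Val::Bool(false)
    } else if let Ok(i) = s.parse::<i64>() {
        Val::Int(i)
    } else if let Ok(f) = s.parse::<f64>() {
        Val::Float(f)
    } else if let Some(rest) = s.strip_prefix('@') {
        Val::Reference(rest.to_string())
    } else {
        Val::Str(s.to_string())
    }
}

fn main() {
    println!("{:?}", infer(" 30 "));
    println!("{:?}", infer("@User:alice"));
}
```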
### Ditto Operator

Repeat the previous row's value with `^`:

```hedl
orders: @Order[id, customer, status]
| ord1, @User:alice, pending
| ord2, ^, shipped      # customer = @User:alice (from previous row)
| ord3, @User:bob, pending
| ord4, ^, ^            # customer = @User:bob, status = pending
```
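The ditto rule above (each `^` cell copies the same column from the previous row) can be sketched in isolation; behavior for a `^` in the first row is an assumption here:

```rust
// Resolve "^" cells by copying the same column from the previous
// resolved row. A "^" with no previous row is left as-is (assumption).
fn resolve_ditto(rows: &[Vec<&str>]) -> Vec<Vec<String>> {
    let mut out: Vec<Vec<String>> = Vec::new();
    for row in rows {
        let resolved: Vec<String> = row
            .iter()
            .enumerate()
            .map(|(col, cell)| {
                if *cell == "^" {
                    out.last()
                        .and_then(|prev| prev.get(col))
                        .cloned()
                        .unwrap_or_else(|| "^".to_string())
                } else {
                    cell.to_string()
                }
            })
            .collect();
        out.push(resolved);
    }
    out
}

fn main() {
    let rows = vec![
        vec!["ord1", "@User:alice", "pending"],
        vec!["ord2", "^", "shipped"],
        vec!["ord3", "@User:bob", "pending"],
        vec!["ord4", "^", "^"],
    ];
    for r in resolve_ditto(&rows) {
        println!("{:?}", r);
    }
}
```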
### Reference Parsing

Automatic detection of entity references:

```hedl
# Qualified reference
customer: @User:alice     # Reference(qualified("User", "alice"))

# Local reference
parent: @previous_item    # Reference(local("previous_item"))
```
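The two documented forms can be illustrated with a small stand-alone parser sketch; the `Ref` enum is hypothetical, and the real crate uses hedl-core's `Reference` type:

```rust
// Hypothetical stand-in for hedl-core's Reference type.
#[derive(Debug, PartialEq)]
enum Ref {
    Qualified { type_name: String, id: String },
    Local(String),
}

// "@Type:id" becomes a qualified reference; "@name" alone is local.
// Tokens without a leading '@' are not references.
fn parse_ref(token: &str) -> Option<Ref> {
    let body = token.strip_prefix('@')?;
    match body.split_once(':') {
        Some((ty, id)) => Some(Ref::Qualified {
            type_name: ty.to_string(),
            id: id.to_string(),
        }),
        None => Some(Ref::Local(body.to_string())),
    }
}

fn main() {
    println!("{:?}", parse_ref("@User:alice"));
    println!("{:?}", parse_ref("@previous_item"));
}
```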
### Alias Substitution

Variable substitution with `$`:

```hedl
%ALIAS: api_url: https://api.example.com
---
config:
  endpoint: $api_url   # Substituted to "https://api.example.com"
```
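A minimal sketch of whole-token `$name` substitution against an alias table; partial or nested substitution rules are not shown here and may differ in the real parser:

```rust
use std::collections::HashMap;

// Replace a "$name" value with its %ALIAS entry, if one exists.
// Unknown aliases and plain values pass through unchanged (assumption).
fn substitute(value: &str, aliases: &HashMap<&str, &str>) -> String {
    match value.strip_prefix('$') {
        Some(name) => aliases
            .get(name)
            .map(|v| v.to_string())
            .unwrap_or_else(|| value.to_string()),
        None => value.to_string(),
    }
}

fn main() {
    let mut aliases = HashMap::new();
    aliases.insert("api_url", "https://api.example.com");
    println!("{}", substitute("$api_url", &aliases));
}
```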
### Comment Handling

Full-line and inline comments:

```hedl
# This is a full-line comment
users: @User[id, name]
| alice, Alice   # This is an inline comment
| bob, "Bob # Not a comment (inside quotes)"
```
SIMD Optimization: When compiled with AVX2 support (x86_64), comment detection uses 32-byte SIMD scanning for 2-3x speedup on comment-heavy files.
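The quote-aware rule (a `#` starts a comment only outside double quotes) can be sketched without SIMD as a scalar scan:

```rust
// Truncate a line at the first '#' that is outside double quotes.
// Backslash escapes inside quotes (e.g. \") are skipped over.
fn strip_comment(line: &str) -> &str {
    let bytes = line.as_bytes();
    let mut in_quotes = false;
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'\\' if in_quotes => i += 1, // skip the escaped byte
            b'"' => in_quotes = !in_quotes,
            b'#' if !in_quotes => return &line[..i],
            _ => {}
        }
        i += 1;
    }
    line
}

fn main() {
    println!("{}", strip_comment("| alice, Alice # inline comment"));
    println!("{}", strip_comment("| bob, \"Bob # not a comment\""));
}
```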
## Error Handling

Comprehensive error types with line numbers:

```rust
use hedl_stream::StreamError;

// The error enum name `StreamError` is an assumption; the variants
// matched below are listed under Error Types.
for event in parser {
    match event {
        Ok(node_event) => { /* process the event */ }
        Err(StreamError::Timeout { elapsed, limit }) => {
            eprintln!("timed out after {:?} (limit {:?})", elapsed, limit);
        }
        Err(e) => eprintln!("parse error: {}", e),
    }
}
```
### Error Types

- `Io(std::io::Error)` - I/O read failures
- `Utf8 { line, message }` - Invalid UTF-8 encoding
- `Syntax { line, message }` - Parse syntax error
- `Schema { line, message }` - Schema/type mismatch
- `Header(String)` - Invalid header format
- `MissingVersion` - No %VERSION directive
- `InvalidVersion(String)` - Malformed version string
- `OrphanRow { line, message }` - Child row without a parent entity
- `ShapeMismatch { line, expected, got }` - Column count doesn't match the schema
- `Timeout { elapsed, limit }` - Parsing exceeded the timeout duration
- `LineTooLong { line, length, limit }` - Line exceeds the max_line_length configuration
- `InvalidUtf8 { line, error }` - Invalid UTF-8 with detailed error information
## Use Cases
Database Export Processing: Stream multi-GB database exports, transform data row-by-row, write to different format without loading entire export into memory.
Log File Analysis: Parse massive HEDL log archives, filter events, aggregate statistics, generate reports—all with constant memory usage.
Data Pipeline Integration: Read HEDL from network streams, process incrementally, forward to downstream systems without buffering.
ETL Workflows: Extract from large HEDL files, transform with custom logic, load to database with batch inserts—process millions of rows efficiently.
Real-Time Processing: Parse HEDL data as it arrives (stdin, network socket), emit events immediately, support backpressure naturally.
Untrusted Input Validation: Parse user-uploaded HEDL with timeout protection, validate structure, reject malicious input before full processing.
## What This Crate Doesn't Do
Full Document Construction: Doesn't build complete Document objects—that's hedl-core's job. For full document parsing, use hedl_core::parse(). Use streaming when you need memory efficiency.
Random Access: Sequential-only parser. Can't jump to arbitrary positions. For random access, load full document with hedl-core.
Modification: Read-only parser. Can't modify nodes during parsing. For transformations, consume events and write new HEDL output.
Validation: Parses structure, doesn't validate business rules. For schema validation, use hedl-lint on parsed documents.
## Performance Characteristics
Memory: O(nesting_depth) regardless of file size. Typically <1 MB for files of any size with reasonable nesting.
I/O: Configurable buffer size (default 64 KB) minimizes syscalls. Batched reads for optimal throughput.
Parsing: Linear pass through input. SIMD-accelerated comment detection (AVX2, ~2-3x faster for comment-heavy files).
Timeout Checks: Every 100 operations (~0.1% overhead). Negligible impact on normal workloads.
Async: Same memory profile as sync. Non-blocking I/O yields to runtime during reads. Suitable for thousands of concurrent streams.
## Dependencies

- `hedl-core` (workspace) - Core types (Value, Reference), lexer utilities
- `thiserror` 1.0 - Error type definitions
- `tokio` 1.35 (optional, "async" feature) - Async I/O runtime
## License
Apache-2.0