Bidirectional conversion between Parquet and HEDL formats.
This crate provides functionality to convert HEDL documents to Parquet files and vice versa. Parquet is a columnar storage format that is well-suited for HEDL’s matrix list structures.
§Features
- HEDL → Parquet: Convert HEDL documents to Parquet files with configurable compression
- Parquet → HEDL: Read Parquet files and convert them back to HEDL documents
- Matrix Lists: Natural mapping of HEDL matrix lists to Parquet tables
- Metadata: Support for key-value pairs as single-row tables
- Security: Built-in protections against decompression bombs and memory exhaustion
§Security Considerations
When reading untrusted Parquet files, this crate implements several security protections:
§Decompression Bomb Protection
Parquet files can use compression algorithms such as GZIP or ZSTD, which could be exploited to create “decompression bombs”: small compressed files that expand to enormous sizes.
Protection: Total decompressed data is limited to 100 MB (MAX_DECOMPRESSED_SIZE).
Files exceeding this limit are rejected with a clear security error.
§Large Schema Attacks
Malicious files could contain thousands of columns, causing memory exhaustion during schema processing.
Protection: Schemas are limited to 1,000 columns (MAX_COLUMNS). Files with more
columns are rejected immediately.
§Memory Exhaustion Prevention
Large dimensions (many rows × many columns) could exhaust available memory.
Protection: The decompressed size limit (100 MB) effectively prevents row × column multiplication attacks. Memory usage is tracked cumulatively across all record batches.
§Metadata Validation
Parquet metadata could contain malicious identifiers (SQL injection, XSS, etc.).
Protection: All metadata values are validated as HEDL identifiers (alphanumeric + underscore, max 100 chars). Invalid characters are sanitized or rejected.
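The validation rule above can be sketched as follows. This is an illustrative stand-in, not the crate’s actual implementation; the function name `is_valid_hedl_identifier` and the constant `MAX_IDENT_LEN` are hypothetical:

```rust
/// Hypothetical sketch of the identifier rule described above:
/// ASCII alphanumerics plus underscore, at most 100 characters.
const MAX_IDENT_LEN: usize = 100;

fn is_valid_hedl_identifier(s: &str) -> bool {
    !s.is_empty()
        && s.len() <= MAX_IDENT_LEN
        && s.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

fn main() {
    assert!(is_valid_hedl_identifier("alice_01"));
    // Injection-style payloads fail the character check.
    assert!(!is_valid_hedl_identifier("alice; DROP TABLE users"));
    // Over-long identifiers are rejected.
    assert!(!is_valid_hedl_identifier(&"x".repeat(101)));
}
```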
§Integer Overflow Protection
Size calculations could overflow when processing malicious files.
Protection: All size calculations use checked arithmetic (checked_add, etc.).
Overflows produce clear security errors.
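The checked-arithmetic pattern can be sketched like this. It is a simplified stand-in, not the crate’s internals; `track_size` is a hypothetical helper, and the constant mirrors the 100 MB limit described above:

```rust
/// Illustrative sketch: accumulate decompressed batch sizes with
/// checked arithmetic so a malicious file can neither overflow the
/// counter nor exceed the decompression cap.
const MAX_DECOMPRESSED_SIZE: usize = 100 * 1024 * 1024;

fn track_size(total: usize, batch: usize) -> Result<usize, String> {
    let new_total = total
        .checked_add(batch)
        .ok_or_else(|| "security: size calculation overflowed".to_string())?;
    if new_total > MAX_DECOMPRESSED_SIZE {
        return Err("security: decompressed data exceeds 100 MB limit".to_string());
    }
    Ok(new_total)
}

fn main() {
    // Normal batches accumulate.
    assert!(track_size(0, 1024).is_ok());
    // Exceeding the cap is a security error.
    assert!(track_size(MAX_DECOMPRESSED_SIZE, 1).is_err());
    // Overflow is caught by checked_add rather than wrapping.
    assert!(track_size(usize::MAX, 1).is_err());
}
```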
For complete security documentation, see the SECURITY.md file in the workspace root.
§Examples
§Writing HEDL to Parquet
```rust
use hedl_core::{Document, MatrixList, Node, Value, Item};
use hedl_parquet::to_parquet;
use std::path::Path;

let mut doc = Document::new((2, 0));
let mut matrix_list = MatrixList::new(
    "User",
    vec!["id".to_string(), "name".to_string(), "age".to_string()],
);
let node = Node::new(
    "User",
    "alice",
    vec![Value::String("Alice".to_string().into()), Value::Int(30)],
);
matrix_list.add_row(node);
doc.root.insert("users".to_string(), Item::List(matrix_list));

to_parquet(&doc, Path::new("output.parquet")).unwrap();
```
§Reading Parquet to HEDL
```rust
use hedl_parquet::from_parquet;
use std::path::Path;

let doc = from_parquet(Path::new("input.parquet")).unwrap();
println!("Version: {:?}", doc.version);
```
§Round-trip Conversion
```rust
use hedl_core::{Document, MatrixList, Node, Value, Item};
use hedl_parquet::{to_parquet_bytes, from_parquet_bytes};

let mut doc = Document::new((2, 0));
// ... populate document ...

// Convert to Parquet bytes
let bytes = to_parquet_bytes(&doc).unwrap();

// Convert back to HEDL
let doc2 = from_parquet_bytes(&bytes).unwrap();
```
§Mapping Strategy
§Matrix Lists → Parquet Tables
HEDL matrix lists map naturally to Parquet tables:
- Each column in the HEDL schema becomes a Parquet column
- The first column (ID) is always a string column
- Data types are inferred from values: Int, Float, Bool, String
- References are stored as strings (e.g., “@User:alice”)
- Tensors are serialized as strings
§Key-Value Pairs → Metadata Table
When a HEDL document contains only scalar key-value pairs (no matrix lists), they are stored as a two-column table with “key” and “value” columns.
§Type Inference
When reading Parquet files, HEDL types are inferred from Arrow types:
- Arrow Boolean → HEDL Bool
- Arrow Int8/16/32/64, UInt8/16/32/64 → HEDL Int
- Arrow Float32/64 → HEDL Float
- Arrow Utf8 → HEDL String (or Reference if starts with ‘@’)
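The inference rules above can be sketched with stand-in enums (the real crate works with Arrow’s `DataType` and HEDL’s value types; `ArrowKind`, `HedlKind`, and `infer` here are hypothetical):

```rust
/// Illustrative sketch of the Arrow → HEDL type-inference table,
/// using simplified stand-in enums rather than the real types.
#[derive(Debug, PartialEq)]
enum ArrowKind { Boolean, Int64, Float64, Utf8 }

#[derive(Debug, PartialEq)]
enum HedlKind { Bool, Int, Float, String, Reference }

fn infer(kind: &ArrowKind, sample: &str) -> HedlKind {
    match kind {
        ArrowKind::Boolean => HedlKind::Bool,
        ArrowKind::Int64 => HedlKind::Int,
        ArrowKind::Float64 => HedlKind::Float,
        // Strings starting with '@' are treated as references.
        ArrowKind::Utf8 if sample.starts_with('@') => HedlKind::Reference,
        ArrowKind::Utf8 => HedlKind::String,
    }
}

fn main() {
    assert_eq!(infer(&ArrowKind::Utf8, "@User:alice"), HedlKind::Reference);
    assert_eq!(infer(&ArrowKind::Utf8, "Alice"), HedlKind::String);
    assert_eq!(infer(&ArrowKind::Int64, ""), HedlKind::Int);
}
```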
§Compression
Parquet files can be compressed using various algorithms. The default is SNAPPY,
but you can customize this using ToParquetConfig:
```rust
use hedl_core::Document;
use hedl_parquet::{to_parquet_with_config, ToParquetConfig};
use parquet::basic::Compression;
use std::path::Path;

let doc = Document::new((2, 0));
let config = ToParquetConfig {
    compression: Compression::GZIP(Default::default()),
    ..Default::default()
};
to_parquet_with_config(&doc, Path::new("output.parquet"), &config).unwrap();
```
§Null ID Handling
HEDL requires all entities to have non-null IDs. By default, Parquet files with null IDs are rejected:
```rust
use hedl_parquet::from_parquet_bytes;

// This will return an error if any ID is null
let result = from_parquet_bytes(&parquet_bytes);
assert!(result.is_err());
```
For lenient parsing (not recommended), use:
```rust
use hedl_parquet::{from_parquet_bytes_with_config, FromParquetConfig};

let config = FromParquetConfig::lenient();
let doc = from_parquet_bytes_with_config(&parquet_bytes, &config).unwrap();
// Null IDs become "__generated_row_N"
```
§Async I/O Support
When the async-io feature is enabled, async variants of the read/write operations are available in the async_io module.
Modules§
- async_io - Async I/O support for hedl-parquet.
- predicate - Predicate pushdown support for efficient Parquet filtering.
Structs§
- FromParquetConfig - Configuration for converting Parquet files to HEDL documents.
- ToParquetConfig - Configuration for Parquet writing.
Enums§
- BatchSize - Batch size strategy for reading Parquet files.
- CompressionStrategy - Compression strategy for Parquet files.
- EnabledStatistics - Controls the level of statistics computed by the writer and stored in the Parquet file.
- NullIdHandling - Strategy for handling null or missing ID values.
- StatisticsLevel - Statistics level for Parquet columns.
Functions§
- from_parquet - Read a HEDL document from a Parquet file with default configuration (strict mode).
- from_parquet_bytes - Read a HEDL document from Parquet bytes with default configuration (strict mode).
- from_parquet_bytes_select - Read Parquet bytes, selecting only specific columns.
- from_parquet_bytes_with_config - Read a HEDL document from Parquet bytes with custom configuration.
- from_parquet_select - Read a Parquet file, selecting only specific columns.
- from_parquet_with_config - Read a HEDL document from a Parquet file with custom configuration.
- get_parquet_columns - Get column names from a Parquet file without reading data.
- to_parquet - Write a HEDL document to a Parquet file.
- to_parquet_bytes - Convert a HEDL document to Parquet bytes.
- to_parquet_bytes_with_config - Convert a HEDL document to Parquet bytes with custom configuration.
- to_parquet_with_config - Write a HEDL document to a Parquet file with custom configuration.