
Crate hedl_parquet


Bidirectional conversion between Parquet and HEDL formats.

This crate provides functionality to convert HEDL documents to Parquet files and vice versa. Parquet is a columnar storage format that is well-suited for HEDL’s matrix list structures.

§Features

  • HEDL → Parquet: Convert HEDL documents to Parquet files with configurable compression
  • Parquet → HEDL: Read Parquet files and convert them back to HEDL documents
  • Matrix Lists: Natural mapping of HEDL matrix lists to Parquet tables
  • Metadata: Support for key-value pairs as single-row tables
  • Security: Built-in protections against decompression bombs and memory exhaustion

§Security Considerations

When reading untrusted Parquet files, this crate implements several security protections:

§Decompression Bomb Protection

Parquet files can use compression algorithms like GZIP or ZSTD, which could be exploited to create “decompression bombs”: small compressed files that expand to enormous sizes.

Protection: Total decompressed data is limited to 100 MB (MAX_DECOMPRESSED_SIZE). Files exceeding this limit are rejected with a clear security error.
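The cumulative budget described above can be sketched as follows. This is an illustrative stand-in, not the crate's actual implementation; `MAX_DECOMPRESSED_SIZE` is taken from the text, while `check_budget` and the error representation are assumptions.

```rust
// Cumulative decompressed-size guard: every record batch adds its byte
// count to a running total, and the total may never exceed the limit.
const MAX_DECOMPRESSED_SIZE: usize = 100 * 1024 * 1024; // 100 MB

fn check_budget(total: &mut usize, batch_bytes: usize) -> Result<(), String> {
    // checked_add guards against integer overflow on malicious inputs.
    *total = total
        .checked_add(batch_bytes)
        .ok_or_else(|| "size calculation overflowed".to_string())?;
    if *total > MAX_DECOMPRESSED_SIZE {
        return Err(format!(
            "decompressed data exceeds {} bytes",
            MAX_DECOMPRESSED_SIZE
        ));
    }
    Ok(())
}

fn main() {
    let mut total = 0usize;
    // A 50 MB batch fits within the budget...
    assert!(check_budget(&mut total, 50 * 1024 * 1024).is_ok());
    // ...but a further 60 MB batch pushes the total to 110 MB and is rejected.
    assert!(check_budget(&mut total, 60 * 1024 * 1024).is_err());
}
```

Tracking the total across batches (rather than per batch) is what makes the guard effective against files that expand gradually.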

§Large Schema Attacks

Malicious files could contain thousands of columns, causing memory exhaustion during schema processing.

Protection: Schemas are limited to 1,000 columns (MAX_COLUMNS). Files with more columns are rejected immediately.

§Memory Exhaustion Prevention

Large dimensions (many rows × many columns) could exhaust available memory.

Protection: The decompressed size limit (100 MB) effectively prevents row × column multiplication attacks. Memory usage is tracked cumulatively across all record batches.

§Metadata Validation

Parquet metadata could contain malicious identifiers (SQL injection, XSS, etc.).

Protection: All metadata values are validated as HEDL identifiers (alphanumeric + underscore, max 100 chars). Invalid characters are sanitized or rejected.
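The identifier rule above (alphanumeric plus underscore, at most 100 characters) can be expressed as a small predicate. The function name is an assumption for illustration; only the rule itself comes from the text.

```rust
// Validate a metadata value against the HEDL identifier rule:
// non-empty, ASCII alphanumeric or underscore, length <= 100.
fn is_valid_hedl_identifier(s: &str) -> bool {
    !s.is_empty()
        && s.len() <= 100
        && s.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

fn main() {
    assert!(is_valid_hedl_identifier("user_42"));
    // Spaces, punctuation, and injection payloads are rejected outright.
    assert!(!is_valid_hedl_identifier("drop table; --"));
    // Over-long identifiers are rejected even if every character is valid.
    assert!(!is_valid_hedl_identifier(&"x".repeat(101)));
}
```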

§Integer Overflow Protection

Size calculations could overflow when processing malicious files.

Protection: All size calculations use checked arithmetic (checked_add, etc.). Overflows produce clear security errors.
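As a minimal sketch of the checked-arithmetic approach, a size calculation such as rows × columns can use `checked_mul` so that overflow surfaces as `None` instead of wrapping silently (the helper name here is hypothetical):

```rust
// Compute the total cell count, propagating overflow as None.
fn total_cells(rows: usize, cols: usize) -> Option<usize> {
    rows.checked_mul(cols)
}

fn main() {
    // A normal file: 1,000 rows x 10 columns.
    assert_eq!(total_cells(1_000, 10), Some(10_000));
    // A malicious dimension pair overflows and is detected.
    assert_eq!(total_cells(usize::MAX, 2), None);
}
```

Turning the `None` case into an explicit security error is what produces the clear diagnostics mentioned above.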

For complete security documentation, see the SECURITY.md file in the workspace root.

§Examples

§Writing HEDL to Parquet

use hedl_core::{Document, MatrixList, Node, Value, Item};
use hedl_parquet::to_parquet;
use std::path::Path;

let mut doc = Document::new((2, 0));

// Declare a matrix list with an ID column plus two data columns.
let mut matrix_list = MatrixList::new(
    "User",
    vec!["id".to_string(), "name".to_string(), "age".to_string()]
);

// The node ID ("alice") fills the ID column; the values fill the rest.
let node = Node::new(
    "User",
    "alice",
    vec![Value::String("Alice".to_string().into()), Value::Int(30)],
);
matrix_list.add_row(node);
doc.root.insert("users".to_string(), Item::List(matrix_list));

to_parquet(&doc, Path::new("output.parquet")).unwrap();

§Reading Parquet to HEDL

use hedl_parquet::from_parquet;
use std::path::Path;

let doc = from_parquet(Path::new("input.parquet")).unwrap();
println!("Version: {:?}", doc.version);

§Round-trip Conversion

use hedl_core::{Document, MatrixList, Node, Value, Item};
use hedl_parquet::{to_parquet_bytes, from_parquet_bytes};

let mut doc = Document::new((2, 0));
// ... populate document ...

// Convert to Parquet bytes
let bytes = to_parquet_bytes(&doc).unwrap();

// Convert back to HEDL
let doc2 = from_parquet_bytes(&bytes).unwrap();

§Mapping Strategy

§Matrix Lists → Parquet Tables

HEDL matrix lists map naturally to Parquet tables:

  • Each column in the HEDL schema becomes a Parquet column
  • The first column (ID) is always a string column
  • Data types are inferred from values: Int, Float, Bool, String
  • References are stored as strings (e.g., “@User:alice”)
  • Tensors are serialized as strings

§Key-Value Pairs → Metadata Table

When a HEDL document contains only scalar key-value pairs (no matrix lists), they are stored as a two-column table with “key” and “value” columns.

§Type Inference

When reading Parquet files, HEDL types are inferred from Arrow types:

  • Arrow Boolean → HEDL Bool
  • Arrow Int8/16/32/64, UInt8/16/32/64 → HEDL Int
  • Arrow Float32/64 → HEDL Float
  • Arrow Utf8 → HEDL String (or Reference if starts with ‘@’)
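The table above can be sketched as a match over type names. The real crate maps `arrow` `DataType` values; the string-keyed stand-in below, the enum, and the `infer` helper are illustrative assumptions.

```rust
// Stand-in for the HEDL value types named in the inference table.
#[derive(Debug, PartialEq)]
enum HedlType { Bool, Int, Float, Str, Reference }

// Infer a HEDL type from an Arrow type name; Utf8 values beginning
// with '@' are treated as references per the rule above.
fn infer(arrow_type: &str, sample: Option<&str>) -> Option<HedlType> {
    match arrow_type {
        "Boolean" => Some(HedlType::Bool),
        "Int8" | "Int16" | "Int32" | "Int64"
        | "UInt8" | "UInt16" | "UInt32" | "UInt64" => Some(HedlType::Int),
        "Float32" | "Float64" => Some(HedlType::Float),
        "Utf8" => match sample {
            Some(s) if s.starts_with('@') => Some(HedlType::Reference),
            _ => Some(HedlType::Str),
        },
        _ => None, // unsupported Arrow type
    }
}

fn main() {
    assert_eq!(infer("Int32", None), Some(HedlType::Int));
    assert_eq!(infer("Utf8", Some("@User:alice")), Some(HedlType::Reference));
    assert_eq!(infer("Utf8", Some("Alice")), Some(HedlType::Str));
}
```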

§Compression

Parquet files can be compressed using various algorithms. The default is SNAPPY, but you can customize this using ToParquetConfig:

use hedl_core::Document;
use hedl_parquet::{to_parquet_with_config, ToParquetConfig};
use parquet::basic::Compression;
use std::path::Path;

let doc = Document::new((2, 0));
let config = ToParquetConfig {
    compression: Compression::GZIP(Default::default()),
    ..Default::default()
};

to_parquet_with_config(&doc, Path::new("output.parquet"), &config).unwrap();

§Null ID Handling

HEDL requires all entities to have non-null IDs. By default, Parquet files with null IDs are rejected:

use hedl_parquet::from_parquet_bytes;

// This will return an error if any ID is null
let result = from_parquet_bytes(&parquet_bytes);
assert!(result.is_err());

For lenient parsing (not recommended), use:

use hedl_parquet::{from_parquet_bytes_with_config, FromParquetConfig};

let config = FromParquetConfig::lenient();
let doc = from_parquet_bytes_with_config(&parquet_bytes, &config).unwrap();
// Null IDs become "__generated_row_N"
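The fallback naming shown in the comment above can be sketched as a formatting helper. The `"__generated_row_N"` pattern comes from the text; the helper name is an assumption.

```rust
// Produce a synthetic ID for a row whose ID column was null,
// following the "__generated_row_N" pattern used in lenient mode.
fn fallback_id(row_index: usize) -> String {
    format!("__generated_row_{row_index}")
}

fn main() {
    assert_eq!(fallback_id(0), "__generated_row_0");
    assert_eq!(fallback_id(42), "__generated_row_42");
}
```

Because these IDs are generated per row index, round-tripping a lenient read back to Parquet will not restore the original nulls.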

§Async I/O Support

When the async-io feature is enabled, async variants of read/write operations are available in the async_io module.

Modules§

async_io
Async I/O support for hedl-parquet.
predicate
Predicate pushdown support for efficient Parquet filtering.

Structs§

FromParquetConfig
Configuration for converting Parquet files to HEDL documents.
ToParquetConfig
Configuration for Parquet writing.

Enums§

BatchSize
Batch size strategy for reading Parquet files.
CompressionStrategy
Compression strategy for Parquet files.
EnabledStatistics
Controls the level of statistics to be computed by the writer and stored in the parquet file.
NullIdHandling
Strategy for handling null or missing ID values.
StatisticsLevel
Statistics level for Parquet columns.

Functions§

from_parquet
Read a HEDL document from a Parquet file with default configuration (strict mode).
from_parquet_bytes
Read a HEDL document from Parquet bytes with default configuration (strict mode).
from_parquet_bytes_select
Read Parquet bytes selecting only specific columns.
from_parquet_bytes_with_config
Read a HEDL document from Parquet bytes with custom configuration.
from_parquet_select
Read Parquet file selecting only specific columns.
from_parquet_with_config
Read a HEDL document from a Parquet file with custom configuration.
get_parquet_columns
Get column names from a Parquet file without reading data.
to_parquet
Write a HEDL document to a Parquet file.
to_parquet_bytes
Convert a HEDL document to Parquet bytes.
to_parquet_bytes_with_config
Convert a HEDL document to Parquet bytes with custom configuration.
to_parquet_with_config
Write a HEDL document to a Parquet file with custom configuration.