Skip to main content

Module format

Module format 

Source
Expand description

Format detection and preprocessing pipeline.

Phase 0: heuristic detection. Phase 1: format-aware preprocessing (JSON key interning) + detection.

Modules§

json
JSON key interning — replace repeated keys with short references.
json_array
Nested JSON array columnar reorg — lossless transform for JSON files containing arrays of objects with consistent schema.
ndjson
NDJSON columnar reorg — lossless transform that reorders row-oriented NDJSON data into column-oriented layout.
schema
Schema inference engine for columnar JSON/NDJSON data.
transform
Transform framework — chain of reversible preprocessing transforms.
typed_encoding
Typed encoding — type-specific binary encoding for columnar data.
value_dict
Per-column value dictionary transform — replaces repeated multi-byte values with single-byte dictionary codes.

Functions§

detect_format
Detect file format from content bytes.
detect_from_extension
Detect format from file extension (fallback).
preprocess
Apply format-aware preprocessing transforms. Returns (preprocessed_data, transform_chain).
reverse_preprocess
Reverse preprocessing transforms (applied in reverse order).