Expand description
Format detection and preprocessing pipeline.
Phase 0: heuristic detection. Phase 1: format-aware preprocessing (JSON key interning) + detection.
Modules§
- json
- JSON key interning — replace repeated keys with short references.
- json_
array - Nested JSON array columnar reorg — lossless transform for JSON files containing arrays of objects with consistent schema.
- ndjson
- NDJSON columnar reorg — lossless transform that reorders row-oriented NDJSON data into column-oriented layout.
- schema
- Schema inference engine for columnar JSON/NDJSON data.
- transform
- Transform framework — chain of reversible preprocessing transforms.
- typed_
encoding - Typed encoding — type-specific binary encoding for columnar data.
- value_
dict - Per-column value dictionary transform — replaces repeated multi-byte values with single-byte dictionary codes.
Functions§
- detect_
format - Detect file format from content bytes.
- detect_
from_ extension - Detect format from file extension (fallback).
- preprocess
- Apply format-aware preprocessing transforms. Returns (preprocessed_data, transform_chain).
- reverse_
preprocess - Reverse preprocessing transforms (applied in reverse order).