Skip to main content

Module profile

Module profile 

Source
Expand description

Streaming dataset profiler — one pass over a feature stream, producing null counts, min/max, distinct counts, top-N values, and a small sample per field. Plus a computed bbox extent and per-geometry-type histogram.

§Design

  • One passprofile(schema, features, opts) consumes any iterator, so it works with Layer::read(), GeoParquetReader::into_features(), GeoJsonReader::into_features(), etc. — anything yielding Feature.
  • Bounded memory — distinct/top-N tracking caps at opts.distinct_limit per field. Past that, we stop counting individual values and just report distinct_count = None (meaning “more than distinct_limit”).
  • First-N sampling — v0.1 takes the first opts.sample_n features verbatim. Deterministic, reproducible, but biased toward the head of the file. Reservoir sampling is a future improvement.
  • Float-awareFloat32 / Float64 columns get min/max but no top-N or distinct count (NaN-safe hashing isn’t worth the complication for v0.1). Min/max uses partial_cmp and skips NaN.

Structs§

FieldStats
GeometryStats
ProfileOptions
ProfileReport
SerdeFeature
Serializable feature payload (Value → JSON). Geometry is rendered as GeoJSON-shaped JSON to keep the report self-contained without pulling in geonative-geojson as a dep (which would create a cycle with the CLI).
TopValue

Enums§

JsonValue
Lightweight JSON-equivalent used in serialized output. We avoid pulling serde_json::Value into the public API so this crate doesn’t force a serde_json dependency on downstream library consumers.

Functions§

profile
value_to_json_repr
Lossy Value → JsonValue. Binary/Guid become hex strings, DateTime stays numeric (days since 1899-12-30), Xml stays as a string.