Expand description
Streaming dataset profiler — one pass over a feature stream, producing null counts, min/max, distinct counts, top-N values, and a small sample per field. Plus a computed bbox extent and per-geometry-type histogram.
§Design
- One pass —
profile(schema, features, opts)consumes any iterator, so it works withLayer::read(),GeoParquetReader::into_features(),GeoJsonReader::into_features(), etc. — anything yieldingFeature. - Bounded memory — distinct/top-N tracking caps at
opts.distinct_limitper field. Past that, we stop counting individual values and just reportdistinct_count = None(meaning “more thandistinct_limit”). - First-N sampling — v0.1 takes the first
opts.sample_nfeatures verbatim. Deterministic, reproducible, but biased toward the head of the file. Reservoir sampling is a future improvement. - Float-aware —
Float32/Float64columns get min/max but no top-N or distinct count (NaN-safe hashing isn’t worth the complication for v0.1). Min/max usespartial_cmpand skips NaN.
Structs§
- Field
Stats - Geometry
Stats - Profile
Options - Profile
Report - Serde
Feature - Serializable feature payload (Value → JSON). Geometry is rendered as
GeoJSON-shaped JSON to keep the report self-contained without pulling
in
geonative-geojsonas a dep (which would create a cycle with the CLI). - TopValue
Enums§
- Json
Value - Lightweight JSON-equivalent used in serialized output. We avoid pulling
serde_json::Valueinto the public API so this crate doesn’t force aserde_jsondependency on downstream library consumers.
Functions§
- profile
- value_
to_ json_ repr - Lossy
Value → JsonValue. Binary/Guid become hex strings, DateTime stays numeric (days since 1899-12-30), Xml stays as a string.