Schemata
Core Functionality
- Structure Inference: Analyzes newline-delimited JSON (JSONL) to map dot-notation paths and observed JSON types.
- Deep Metadata: Records field statistics, including min/max values, null rates, cardinality (distinct values), and array lengths.
- Flexible Output: Generates formal Avro schemas or comprehensive JSON reports.
Key Benefits
- Prevent Pipeline Breaks: Spot type collisions between documents before they reach production.
- Smart Modeling: Identify unbounded keys (e.g., ID-based keys) to correctly model them as Maps instead of Records.
- Schema Automation: Instantly bootstrap schemas for new data sources.
Building
Requires Rust 1.70 or later.
# development build
# install the binary to ~/.cargo/bin
Quick Start
# Infer an Avro schema from a local file
# Get a full JSON report with statistics
# Pipe from stdin
|
Usage
schemata [OPTIONS] [FILES]...
If no files are provided, schemata reads from stdin. Running it with no
arguments and no piped input prints the help text.
Options
| Flag | Default | Description |
|---|---|---|
| `-n, --limit <N>` | — | Stop after N records |
| `--max-keys <N>` | `1000` | Distinct key threshold before a field is flagged as an unbounded map |
| `--distinct-cap <N>` | `1000` | Exact distinct value cap before switching to HyperLogLog++ |
| `-o, --output <FORMAT>` | `avro` | Output format: `avro` or `json` |
| `-p, --pretty` | — | Pretty-print the output |
| `-h, --help` | — | Print help |
| `-V, --version` | — | Print version |
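The `--max-keys` threshold can be pictured with a short Python sketch. This is an illustration only, not schemata's actual implementation: the function name `find_unbounded` and the way paths are tracked are assumptions, and the only behavior taken from the table above is that a path is flagged once its distinct key count exceeds the threshold.

```python
from collections import defaultdict


def find_unbounded(records, max_keys=1000):
    """Return the object paths whose distinct key count exceeds max_keys.

    Hypothetical helper: it mirrors the documented --max-keys rule but
    is not taken from the schemata source code.
    """
    keys_seen = defaultdict(set)  # object path -> distinct child keys

    def visit(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                keys_seen[path].add(key)
                visit(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            # Array elements share the `$` placeholder path.
            for child in value:
                visit(child, f"{path}.$" if path else "$")

    for record in records:
        visit(record, "")

    return {path for path, keys in keys_seen.items() if len(keys) > max_keys}
```

With five records that each put a different ID under `users`, `find_unbounded(records, max_keys=3)` flags `users`. A real tool would presumably stop descending into a path once it is flagged; this sketch does not.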
Path Notation
Fields are identified by dot-notation paths. Array elements use $ as a
placeholder so the element schema is captured independently of array length.
| JSON | Path | Type |
|---|---|---|
| `{"a": 1}` | `a` | integer |
| `{"a": {"b": 1}}` | `a.b` | integer |
| `{"a": [1, 2]}` | `a` | array · `a.$` → integer |
| `{"a": [{"b": 1}]}` | `a.$.b` | integer |
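The path rules above can be reproduced in a few lines of Python. This is a minimal sketch rather than schemata's own code: `json_type` and `walk` are hypothetical names, and every type label other than `integer` and `array` (which appear in the table) is an assumption.

```python
import json


def json_type(value):
    """Return a JSON type label for a decoded Python value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def walk(value, path=""):
    """Yield (path, type) pairs, using `$` as the array element placeholder."""
    if path:
        yield path, json_type(value)
    if isinstance(value, dict):
        for key, child in value.items():
            yield from walk(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for child in value:
            yield from walk(child, f"{path}.$" if path else "$")
```

For example, `list(walk(json.loads('{"a": [{"b": 1}]}')))` yields `a` as an array, `a.$` as an object, and `a.$.b` as an integer. Whether schemata reports the intermediate `a.$` object entry is not stated in this document.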
Output Formats
Avro (default)
Emits a valid Avro schema in JSON encoding. The top-level schema is always a record named Root.
- Fields seen as both `null` and a concrete type become nullable unions: `["null", "long"]`
- Fields with a type collision (multiple non-null types) emit a union and a `doc` annotation so the problem is visible in the schema itself.
- Fields whose distinct key count exceeds `--max-keys` become Avro `map` types with a `doc` annotation.
- Arrays become `{"type": "array", "items": <element_type>}`.
- Nested objects become named `record` types (PascalCase of the field name).
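The nullable-union and collision rules can be sketched in Python as follows. This is an illustration under stated assumptions: `avro_field` is a hypothetical helper, the `doc` wording is invented, and the mapping of JSON types to Avro primitives is assumed apart from integer to `long`, which the list above implies.

```python
# Assumed mapping from observed JSON types to Avro primitive types.
AVRO_PRIMITIVES = {
    "integer": "long",
    "number": "double",
    "string": "string",
    "boolean": "boolean",
}


def avro_field(name, observed_types):
    """Build an Avro field definition from a non-empty set of observed JSON types."""
    non_null = sorted(t for t in observed_types if t != "null")
    branches = [AVRO_PRIMITIVES[t] for t in non_null]
    if "null" in observed_types:
        branches.insert(0, "null")  # put "null" first in the union

    # A single branch is written as a plain type, several as a union.
    field = {"name": name, "type": branches[0] if len(branches) == 1 else branches}

    # More than one non-null type is a collision; surface it in `doc`.
    if len(non_null) > 1:
        field["doc"] = "type collision: " + ", ".join(non_null)  # hypothetical text
    return field
```

Calling `avro_field("a", {"null", "integer"})` gives a field whose type is `["null", "long"]`, matching the first rule in the list. Nested records, arrays, and maps are left out to keep the sketch short.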
JSON Report
Emits a flat JSON document keyed by path. Each entry contains the type counts, field statistics, and any collision or unbounded-key annotations.
Fields with type collisions include `"type_collision": true` and a
`"collision_types"` array. Unbounded-key objects include `"unbounded_keys": true`.
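To make the shape of a report entry concrete, here is a rough Python sketch that summarizes the values observed at one path. Only `type_collision` and `collision_types` are key names taken from this document; `types`, `null_rate`, `min`, and `max` are hypothetical, as are the function names and the choice to treat integer plus number as a collision.

```python
from collections import Counter


def type_name(value):
    """Return a JSON type label for a decoded Python value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    return "array" if isinstance(value, list) else "object"


def summarize(values):
    """Summarize a non-empty list of values seen at a single path."""
    types = Counter(type_name(v) for v in values)
    non_null = sorted(t for t in types if t != "null")
    numbers = [v for v in values if type_name(v) in ("integer", "number")]

    entry = {
        "types": dict(types),                       # hypothetical key
        "null_rate": types["null"] / len(values),   # hypothetical key
        "type_collision": len(non_null) > 1,
    }
    if entry["type_collision"]:
        entry["collision_types"] = non_null
    if numbers:
        entry["min"], entry["max"] = min(numbers), max(numbers)  # hypothetical keys
    return entry
```

For the values `[1, None, "x", 3]`, the entry reports a null rate of 0.25, a collision between `integer` and `string`, and a numeric range of 1 to 3.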
Cloud Data Lakes
schemata reads plain JSONL from stdin. Decompression and cloud storage access
are handled by standard shell tools: pipe the data in and schemata does the
rest.
Google Cloud Storage (GCS)
# Uncompressed file
|
# Gzip-compressed file
| |
# Multiple objects via wildcard
|
# Compressed wildcard
| |
# Save schema to a file (diagnostics go to stderr, schema to stdout)
|
Amazon S3
# Uncompressed file
|
# Gzip-compressed file
| |
# Multiple objects (list then stream each)
| | |
# Save schema to a file
|
Diagnostics
All diagnostic output goes to stderr so stdout always contains only the schema. Redirect stderr to its own file if you need to capture both streams.
| Message | Meaning |
|---|---|
| `[WARN] Type collision at path "x.y"` | Field observed with multiple non-null types; manual transformation likely needed |
| `[WARN] Path "x.y" has exceeded N distinct keys` | Field has too many distinct keys to be a fixed record and is treated as a map |
| `[WARN] Skipping invalid JSON on record N` | A line could not be parsed as JSON and was skipped |
| `[INFO] Processed N records ...` | Summary printed after analysis completes |
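If the diagnostics are captured to a file or a string, they can be grouped by level with a few lines of Python. This is a sketch that assumes every diagnostic line starts with `[WARN]` or `[INFO]`, as in the table above; `split_diagnostics` is a hypothetical helper and not part of schemata.

```python
import re

# Matches lines such as: [WARN] Type collision at path "x.y"
DIAGNOSTIC = re.compile(r"^\[(WARN|INFO)\]\s+(.*)$")


def split_diagnostics(stderr_text):
    """Group captured stderr lines by level, dropping the level prefix."""
    grouped = {"WARN": [], "INFO": []}
    for line in stderr_text.splitlines():
        match = DIAGNOSTIC.match(line.strip())
        if match:
            grouped[match.group(1)].append(match.group(2))
    return grouped
```

The input text could come from a redirected stderr file or from the `stderr` attribute of `subprocess.run(..., capture_output=True, text=True)`. Lines that do not match the assumed prefix are ignored.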