Genson CLI
A command-line tool for JSON schema inference with support for both regular and NDJSON.
Built on top of genson-core, this CLI tool provides a simple yet powerful interface for generating JSON schemas from JSON data files or standard input.
It was mainly for testing but might be useful in its own right as a standalone binary for schema inference.
Installation
or regular cargo install
if you like building from source.
Usage
Basic Examples
# From a JSON file
# From standard input
|
# From stdin with multiple JSON objects
|
NDJSON Support
# Process newline-delimited JSON
# From stdin
|
Array Handling
# Treat top-level arrays as object streams (default)
# Preserve array structure
Command Line Options
genson-cli - JSON schema inference tool
USAGE:
genson-cli [OPTIONS] [FILE]
ARGS:
<FILE> Input JSON file (reads from stdin if not provided)
OPTIONS:
-h, --help Print this help message
--no-ignore-array Don't treat top-level arrays as object streams
--ndjson Treat input as newline-delimited JSON
--avro Output Avro schema instead of JSON Schema
--normalise Normalise the input data against the inferred schema
--coerce-strings Coerce numeric/boolean strings to schema type during normalisation
--keep-empty Keep empty arrays/maps instead of turning them into nulls
--map-threshold <N> Treat objects with >N keys as map candidates (default 20)
--map-max-rk <N> Maximum required keys for Map inference (default: no limit)
--map-max-required-keys <N>
--unify-maps Enable unification of compatible record schemas into maps
Same as --map-max-rk
--no-unify <fields> Exclude fields from record unification (comma-separated)
Example: --no-unify qualifiers,references
--force-type k:v,... Force field(s) to 'map' or 'record'
Example: --force-type labels:map,claims:record
--force-parent-type k:v,... Force parent objects containing field(s) to 'map' or 'record'
Example: --force-parent-type mainsnak:record
--force-scalar-promotion <fields>
Always promote these fields to wrapped scalars (comma-separated)
Example: --force-scalar-promotion precision,datavalue
--map-encoding <mode> Choose map encoding (mapping|entries|kv)
mapping = Avro/JSON object (shared dict)
entries = list of single-entry objects (individual dicts)
kv = list of {key,value} objects
--no-wrap-scalars Disable scalar promotion (keep raw scalar types)
--wrap-root <field> Wrap top-level schema under this required field
--root-map Allow document root to become a map
--max-builders <N> Maximum schema builders to create in parallel at once
Lower values reduce peak memory (default: unlimited)
--debug Enable debug output during schema inference
--profile Enable profiling output during schema inference
EXAMPLES:
genson-cli data.json
echo '{"name": "test"}' | genson-cli
genson-cli --ndjson multi-line.jsonl
Normalisation
Normalisation rewrites raw JSON data so that every record conforms to a single inferred Avro schema. This is especially useful when input data is jagged, inconsistent, or comes from semi-structured sources.
Features:
- Converts empty arrays/maps to
null
(default), or preserves them with--keep-empty
. - Ensures missing keys are present with
null
values. - Handles unions (e.g.
["null", "string"]
where values may be either). - Optionally coerces numeric/boolean strings into real types (
--coerce-strings
).
Examples
Simple Object Schema
Input:
Command:
|
Output:
Avro Schema
|
Output:
{
"type": "record",
"name": "document",
"namespace": "genson",
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "age",
"type": "int"
},
{
"name": "active",
"type": "boolean"
}
]
}
Multiple Objects Schema
Input file (users.json
):
Command:
Output:
NDJSON Processing
Input file (events.ndjson
):
{"event": "login", "user": "alice", "timestamp": "2024-01-01T10:00:00Z"}
{"event": "logout", "user": "alice", "timestamp": "2024-01-01T11:00:00Z", "duration": 3600}
{"event": "login", "user": "bob", "timestamp": "2024-01-01T10:30:00Z", "ip": "192.168.1.100"}
Command:
Output:
Array Schema
Input file (array.json
):
Command (treat as object stream - default):
Output:
Command (preserve array structure):
Output:
Empty Values
Input (empty.json
):
Command:
Output:
String Coercion
Input (stringy.json
):
Command (default):
Output (no coercion, strings remain strings):
Command (with coercion):
Output:
Error Handling
The CLI provides clear error messages for common issues:
Invalid JSON
|
File Not Found
)
Empty Input
|
Performance
- Parallel Processing: Automatically uses multiple cores for large datasets
- Memory Efficient: Streams large files without loading everything into memory
- Fast Parsing: Uses SIMD-accelerated JSON parsing where available
For a 100MB NDJSON file with 1M records:
- Processing time: ~5-10 seconds (depending on CPU cores)
- Memory usage: <100MB (constant regardless of file size)
- Schema accuracy: 100% type detection
Integration
The CLI tool is part of the larger polars-genson ecosystem:
- genson-core: Core Rust library
- polars-genson: Python plugin for Polars
- polars-jsonschema-bridge: Type conversion utilities
Use Cases
Data Analysis Pipeline
# Extract schema from API responses
|
# Process log files
# Validate data structure
| |
Schema-Driven Development
# Generate schema for documentation
# Validate API responses match expected schema
# (combine with tools like ajv-cli for validation)
Data Migration
# Understand structure of legacy data
# Compare schemas between different data sources
Advanced Usage
Processing Large Files
For very large JSON files, consider using streaming tools:
# Process large file in chunks
for; do
done
# Merge resulting schemas (requires additional tooling)
Custom Schema URIs
The tool supports different schema versions:
# Default: http://json-schema.org/schema#
# The schema URI is automatically included in output
Contributing
This crate is part of the polars-genson project. See the main repository for the contribution and development docs.
License
Licensed under the MIT License. See LICENSE](https://github.com/lmmx/polars-genson/blob/master/LICENSE) for details.