genson-cli 0.7.4

Command-line interface for JSON schema inference
genson-cli-0.7.4 is not a library.

Genson CLI

crates.io MIT/Apache-2.0 licensed Documentation

A command-line tool for JSON schema inference with support for both regular and NDJSON.

Built on top of genson-core, this CLI tool provides a simple yet powerful interface for generating JSON schemas from JSON data files or standard input.

It was mainly for testing but might be useful in its own right as a standalone binary for schema inference.

Installation

cargo binstall genson-cli

or regular cargo install if you like building from source.

Usage

Basic Examples

# From a JSON file
genson-cli data.json

# From standard input
echo '{"name": "Alice", "age": 30}' | genson-cli

# From stdin with multiple JSON objects
cat multiple-objects.json | genson-cli

NDJSON Support

# Process newline-delimited JSON
genson-cli --ndjson data.jsonl

# From stdin
cat events.ndjson | genson-cli --ndjson

Array Handling

# Treat top-level arrays as object streams (default)
genson-cli data.json

# Preserve array structure
genson-cli --no-ignore-array array-data.json

Command Line Options

genson-cli - JSON schema inference tool

USAGE:
    genson-cli [OPTIONS] [FILE]

ARGS:
    <FILE>    Input JSON file (reads from stdin if not provided)

OPTIONS:
    -h, --help            Print this help message
    --no-ignore-array     Don't treat top-level arrays as object streams
    --ndjson              Treat input as newline-delimited JSON
    --avro                Output Avro schema instead of JSON Schema
    --normalise           Normalise the input data against the inferred schema
    --coerce-strings      Coerce numeric/boolean strings to schema type during normalisation
    --keep-empty          Keep empty arrays/maps instead of turning them into nulls
    --map-threshold <N>   Treat objects with >N keys as map candidates (default 20)
    --map-max-rk <N>      Maximum required keys for Map inference (default: no limit)
    --map-max-required-keys <N>
    --unify-maps          Enable unification of compatible record schemas into maps
                          Same as --map-max-rk
    --no-unify <fields>   Exclude fields from record unification (comma-separated)
                          Example: --no-unify qualifiers,references
    --force-type k:v,...  Force field(s) to 'map' or 'record'
                          Example: --force-type labels:map,claims:record
    --force-parent-type k:v,...  Force parent objects containing field(s) to 'map' or 'record'
                                 Example: --force-parent-type mainsnak:record
    --force-scalar-promotion <fields>
                          Always promote these fields to wrapped scalars (comma-separated)
                          Example: --force-scalar-promotion precision,datavalue
    --map-encoding <mode> Choose map encoding (mapping|entries|kv)
                          mapping = Avro/JSON object (shared dict)
                          entries = list of single-entry objects (individual dicts)
                          kv      = list of {key,value} objects
    --no-wrap-scalars     Disable scalar promotion (keep raw scalar types)
    --wrap-root <field>   Wrap top-level schema under this required field
    --root-map            Allow document root to become a map
    --max-builders <N>    Maximum schema builders to create in parallel at once
                          Lower values reduce peak memory (default: unlimited)
    --debug               Enable debug output during schema inference
    --profile             Enable profiling output during schema inference

EXAMPLES:
    genson-cli data.json
    echo '{"name": "test"}' | genson-cli
    genson-cli --ndjson multi-line.jsonl

Normalisation

Normalisation rewrites raw JSON data so that every record conforms to a single inferred Avro schema. This is especially useful when input data is jagged, inconsistent, or comes from semi-structured sources.

Features:

  • Converts empty arrays/maps to null (default), or preserves them with --keep-empty.
  • Ensures missing keys are present with null values.
  • Handles unions (e.g. ["null", "string"] where values may be either).
  • Optionally coerces numeric/boolean strings into real types (--coerce-strings).

Examples

Simple Object Schema

Input:

{"name": "Alice", "age": 30, "active": true}

Command:

echo '{"name": "Alice", "age": 30, "active": true}' | genson-cli

Output:

{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "age": {
      "type": "integer"
    },
    "active": {
      "type": "boolean"
    }
  },
  "required": [
    "age",
    "active", 
    "name"
  ]
}

Avro Schema

echo '{"name": "Alice", "age": 30, "active": true}' | genson-cli --avro

Output:

{
  "type": "record",
  "name": "document",
  "namespace": "genson",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "age",
      "type": "int"
    },
    {
      "name": "active",
      "type": "boolean"
    }
  ]
}

Multiple Objects Schema

Input file (users.json):

{"name": "Alice", "age": 30, "scores": [95, 87]}
{"name": "Bob", "age": 25, "city": "NYC", "active": true}
{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}

Command:

genson-cli users.json

Output:

{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "age": {
      "type": "integer"
    },
    "scores": {
      "type": "array",
      "items": {
        "type": "integer"
      }
    },
    "city": {
      "type": "string"
    },
    "active": {
      "type": "boolean"
    },
    "metadata": {
      "type": "object",
      "properties": {
        "role": {
          "type": "string"
        }
      },
      "required": ["role"]
    }
  },
  "required": ["age", "name"]
}

NDJSON Processing

Input file (events.ndjson):

{"event": "login", "user": "alice", "timestamp": "2024-01-01T10:00:00Z"}
{"event": "logout", "user": "alice", "timestamp": "2024-01-01T11:00:00Z", "duration": 3600}
{"event": "login", "user": "bob", "timestamp": "2024-01-01T10:30:00Z", "ip": "192.168.1.100"}

Command:

genson-cli --ndjson events.ndjson

Output:

{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "event": {
      "type": "string"
    },
    "user": {
      "type": "string"
    },
    "timestamp": {
      "type": "string"
    },
    "duration": {
      "type": "integer"
    },
    "ip": {
      "type": "string"
    }
  },
  "required": ["event", "timestamp", "user"]
}

Array Schema

Input file (array.json):

[
  {"id": 1, "name": "Product A"},
  {"id": 2, "name": "Product B", "category": "electronics"}
]

Command (treat as object stream - default):

genson-cli array.json

Output:

{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "id": {
      "type": "integer"
    },
    "name": {
      "type": "string"
    },
    "category": {
      "type": "string"
    }
  },
  "required": ["id", "name"]
}

Command (preserve array structure):

genson-cli --no-ignore-array array.json

Output:

{
  "$schema": "http://json-schema.org/schema#",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "id": {
        "type": "integer"
      },
      "name": {
        "type": "string"
      },
      "category": {
        "type": "string"
      }
    },
    "required": ["id", "name"]
  }
}

Empty Values

Input (empty.json):

{"id": "Q1", "labels": {}}
{"id": "Q2", "labels": {"en": "Hello"}}

Command:

genson-cli --ndjson --normalise empty.json

Output:

{"id": "Q1", "labels": null}
{"id": "Q2", "labels": {"en": "Hello"}}

String Coercion

Input (stringy.json):

{"id": "42", "active": "true"}
{"id": 7, "active": false}

Command (default):

genson-cli --ndjson --normalise stringy.json

Output (no coercion, strings remain strings):

{"id": null, "active": null}
{"id": 7, "active": false}

Command (with coercion):

genson-cli --ndjson --normalise --coerce-strings data.json

Output:

{"id": 42, "active": true}
{"id": 7, "active": false}

Error Handling

The CLI provides clear error messages for common issues:

Invalid JSON

$ echo '{"invalid": json}' | genson-cli
Error: Invalid JSON input at index 1: expected value at line 1 column 13 - JSON: {"invalid": json}

File Not Found

$ genson-cli nonexistent.json
Error: No such file or directory (os error 2)

Empty Input

$ echo '' | genson-cli  
Error: No JSON strings provided

Performance

  • Parallel Processing: Automatically uses multiple cores for large datasets
  • Memory Efficient: Streams large files without loading everything into memory
  • Fast Parsing: Uses SIMD-accelerated JSON parsing where available

For a 100MB NDJSON file with 1M records:

  • Processing time: ~5-10 seconds (depending on CPU cores)
  • Memory usage: <100MB (constant regardless of file size)
  • Schema accuracy: 100% type detection

Integration

The CLI tool is part of the larger polars-genson ecosystem:

Use Cases

Data Analysis Pipeline

# Extract schema from API responses
curl https://api.example.com/users | genson-cli > users-schema.json

# Process log files
genson-cli --ndjson application.log > log-schema.json

# Validate data structure
cat data.json | genson-cli | jq '.properties | keys'

Schema-Driven Development

# Generate schema for documentation
genson-cli sample-data.json > api-schema.json

# Validate API responses match expected schema
# (combine with tools like ajv-cli for validation)

Data Migration

# Understand structure of legacy data
genson-cli legacy-export.json > legacy-schema.json

# Compare schemas between different data sources
diff <(genson-cli source1.json) <(genson-cli source2.json)

Advanced Usage

Processing Large Files

For very large JSON files, consider using streaming tools:

# Process large file in chunks
split -l 10000 large-file.ndjson chunk_
for chunk in chunk_*; do
    genson-cli --ndjson "$chunk" > "schema_${chunk}.json"
done

# Merge resulting schemas (requires additional tooling)

Custom Schema URIs

The tool supports different schema versions:

# Default: http://json-schema.org/schema#
genson-cli data.json

# The schema URI is automatically included in output

Contributing

This crate is part of the polars-genson project. See the main repository for the contribution and development docs.

License

Licensed under the MIT License. See LICENSE](https://github.com/lmmx/polars-genson/blob/master/LICENSE) for details.