dump-json-refs 0.2.0

Extract canonical JSON/JSONL schemas and SQLite metadata for database reverse engineering.
dump-json-refs-0.2.0 is not a library.

dump-json-refs

dump-json-refs is a Rust CLI for extracting canonical metadata from JSON and JSONL datasets. It walks object structures, writes deduplicated schema reference files, indexes schema/object/field metadata into SQLite, and generates relation seeds that help reverse engineer a relational database model from semi-structured data.

The tool is intended for database reverse engineering workflows where raw JSON exports need to be turned into inspectable schema candidates, field frequency statistics, and foreign-key-like relationship graphs.

install

Install from crates.io:

cargo install dump-json-refs

Or install directly from Git:

cargo install --git https://github.com/widehyo1/dump-json-refs.git

Usage

dump-json-refs --outdir refs input.json
dump-json-refs --jsonl --outdir refs input.jsonl
cat input.json | dump-json-refs --outdir refs
dump-json-refs --outdir refs --graph input.json
dump-json-refs --from-sqlite refs/schemas.sqlite --graph-format mermaid-md

Large JSONL inputs

For JSONL input, dump-json-refs streams records instead of loading the whole file into memory. This keeps large JSONL extraction practical while preserving the same schema, field count, object path, alias, and array index reference semantics as normal JSONL processing.

Output

The command removes and recreates the output directory, then creates:

  • canonical schema JSON files under refs/
  • relative symlinks for duplicate canonical schemas on Unix, or copied schema files on platforms without symlink support
  • refs/schemas.sqlite
  • an optional schema relation graph, such as refs/schema.mmd, refs/schema.md, or refs/schema.dot

Stdout prints only the [summary] section by default.

Use --output with no file to print the standard report to stdout. The standard report includes schemas, field counts, aliases, and summary metadata.

Use --output <file> to write the standard report to a file while keeping stdout to [summary].

Use --full-report <file> to write the full report, including object path, array index, site occurrence, site field count, and presence-shape mappings, while keeping stdout to [summary].

Presence-shape rows are enabled by default and capped per site:

  • --no-presence-shapes: skip schema_site_presence_shapes rows while still writing exact site object and field counts.
  • --max-presence-shapes-per-site <N>: store at most N distinct shapes for a single (schema_path, site_path, site_kind). The default is 1024; 0 is rejected.

Graphs

Graph generation reads schema_relations from SQLite. It can run immediately after extraction or replay from an existing index:

dump-json-refs --graph input.json
dump-json-refs --graph refs/schema.dot --graph-format dot input.json
dump-json-refs --from-sqlite refs/schemas.sqlite --graph refs/schema.md --graph-format mermaid-md

Graph options:

  • --graph [FILE]: writes a graph projection. If FILE is omitted, the default path depends on the format.
  • --graph-format mermaid|mermaid-md|dot: selects Mermaid, fenced Mermaid Markdown, or Graphviz DOT output. Mermaid is the default.
  • --graph-rankdir LR|TB|RL|BT: selects graph direction for Mermaid and DOT.
  • --graph-include-marked: includes structural marker relations such as nested array items; otherwise graphs include foreign-key candidates only.
  • --from-sqlite [FILE]: loads report and graph data from an existing SQLite index without regenerating refs. If the flag is used without a value, refs/schemas.sqlite is used.

SQLite schema:

CREATE TABLE schema_paths (
  schema_path TEXT NOT NULL,
  object_path TEXT PRIMARY KEY
);

CREATE TABLE array_index_refs (
  array_path TEXT NOT NULL,
  array_index_path TEXT NOT NULL,
  schema_path TEXT NOT NULL,
  PRIMARY KEY (array_path, array_index_path)
);

CREATE TABLE schema_definitions (
  schema_path TEXT PRIMARY KEY,
  schema_kind TEXT NOT NULL,
  schema_json TEXT NOT NULL
);

CREATE TABLE schema_object_counts (
  schema_path TEXT PRIMARY KEY,
  object_count INTEGER NOT NULL CHECK (object_count > 0)
);

CREATE TABLE schema_field_counts (
  schema_path TEXT NOT NULL,
  field_name TEXT NOT NULL,
  field_type TEXT NOT NULL,
  field_count INTEGER NOT NULL CHECK (field_count > 0),
  PRIMARY KEY (schema_path, field_name, field_type)
);

CREATE TABLE schema_site_counts (
  schema_path TEXT NOT NULL,
  site_path TEXT NOT NULL,
  site_kind TEXT NOT NULL CHECK (site_kind IN ('object', 'array_item', 'root_collection')),
  object_count INTEGER NOT NULL CHECK (object_count > 0),
  PRIMARY KEY (schema_path, site_path, site_kind)
);

CREATE TABLE schema_site_field_counts (
  schema_path TEXT NOT NULL,
  site_path TEXT NOT NULL,
  site_kind TEXT NOT NULL CHECK (site_kind IN ('object', 'array_item', 'root_collection')),
  field_name TEXT NOT NULL,
  schema_field_type TEXT NOT NULL,
  present_count INTEGER NOT NULL CHECK (present_count >= 0),
  missing_count INTEGER NOT NULL CHECK (missing_count >= 0),
  PRIMARY KEY (schema_path, site_path, site_kind, field_name)
);

CREATE TABLE schema_site_presence_shapes (
  schema_path TEXT NOT NULL,
  site_path TEXT NOT NULL,
  site_kind TEXT NOT NULL CHECK (site_kind IN ('object', 'array_item', 'root_collection')),
  present_fields_hash TEXT NOT NULL,
  present_fields_json TEXT NOT NULL,
  missing_fields_json TEXT NOT NULL,
  object_count INTEGER NOT NULL CHECK (object_count > 0),
  first_array_index_path TEXT,
  PRIMARY KEY (schema_path, site_path, site_kind, present_fields_hash)
);

CREATE TABLE schema_site_presence_shape_limits (
  schema_path TEXT NOT NULL,
  site_path TEXT NOT NULL,
  site_kind TEXT NOT NULL CHECK (site_kind IN ('object', 'array_item', 'root_collection')),
  observed_shape_count INTEGER NOT NULL CHECK (observed_shape_count >= 0),
  stored_shape_count INTEGER NOT NULL CHECK (stored_shape_count >= 0),
  truncated INTEGER NOT NULL CHECK (truncated IN (0, 1)),
  PRIMARY KEY (schema_path, site_path, site_kind)
);

CREATE TABLE schema_relations (
  relation_id INTEGER PRIMARY KEY AUTOINCREMENT,
  from_schema_path TEXT NOT NULL,
  to_schema_path TEXT NOT NULL,
  relation_kind TEXT NOT NULL,
  fk_owner TEXT NOT NULL,
  fk_candidate INTEGER NOT NULL CHECK (fk_candidate IN (0, 1)),
  field_name TEXT NOT NULL,
  field_type TEXT NOT NULL,
  cardinality TEXT NOT NULL,
  required INTEGER NOT NULL CHECK (required IN (0, 1)),
  mixed INTEGER NOT NULL CHECK (mixed IN (0, 1)),
  nested_array_depth INTEGER NOT NULL DEFAULT 0 CHECK (nested_array_depth >= 0),
  via_schema_path TEXT,
  via_array_path TEXT,
  parent_schema_path TEXT NOT NULL,
  child_schema_path TEXT NOT NULL,
  parent_object_count INTEGER NOT NULL DEFAULT 0 CHECK (parent_object_count >= 0),
  child_object_count INTEGER NOT NULL DEFAULT 0 CHECK (child_object_count >= 0),
  field_count INTEGER NOT NULL DEFAULT 0 CHECK (field_count >= 0),
  UNIQUE (
    from_schema_path,
    to_schema_path,
    relation_kind,
    field_name,
    field_type,
    via_schema_path,
    via_array_path
  )
);

Semantics

  • JSON object input: root path is <filename-without-ext> or root for stdin.
  • JSON top-level array of objects: root collection path is <filename-without-ext>_ref or root_item for stdin.
  • JSONL input: every JSON value must be an object; root collection naming is the same as top-level array.
  • Primitive labels: string, number, boolean, null, optional suffix ?, union via |.
  • Object fields use a plain reference such as <path>.json; nullable or missing object fields do not add ? or |null to that reference.
  • Every array field uses array(...): primitive arrays use labels such as array(string), object arrays use array(<path>.json), nested arrays add another wrapper, and mixed object/string items use array(<path>.json|string).
  • A field observed both as an array and as a scalar preserves both shapes, for example array(<path>.json)|string.
  • Object arrays use one <array_path>.json when homogeneous. Heterogeneous object arrays use <array_path>.json containing { "$refs_mut": "<array_path>/" }, with distinct schemas stored as <array_path>/<index_path>.json.
  • array_index_refs.array_index_path is a JSON array of every enclosing array position; [1,0] identifies the first nested item inside the second outer item.
  • JSONL input ignores raw NUL padding only at the start of a physical record.
  • An incomplete trailing JSONL record may be ignored to tolerate partially written files.
  • Other malformed JSON remains an error that identifies its line.
  • Top-level collection:
    • homogeneous: one <root_collection>.json
    • heterogeneous: distinct <root_collection>/<first_index>.json files only

Each walked JSON object increments the object_count of the canonical schema selected by schema_paths or array_index_refs. Each key present in that object increments its field_count; a key with a null value is present, while an omitted key is not. $refs_mut array containers are structural schemas, not objects, and are not counted.

The schema_site_* tables preserve the concrete occurrence site before canonical aliasing. schema_site_field_counts records structural key presence and missingness by site, using canonical schema fields as the missingness source of truth. schema_site_presence_shapes records sorted field combinations that appeared together at a site, with missing fields computed from the same canonical field set. If a site's shape cap is reached, schema_site_presence_shape_limits.truncated is 1; site object and field counts remain exact.

schema_relations stores inferred links between canonical schemas. Relation rows describe direct object references, array item references, heterogeneous array variants, and nested array markers. Rows marked with fk_candidate = 1 are the default graph source for relational reverse engineering; marked structural rows can be included with --graph-include-marked.