dump-json-refs
dump-json-refs is a Rust CLI for extracting canonical metadata from JSON and
JSONL datasets. It walks object structures, writes deduplicated schema reference
files, indexes schema/object/field metadata into SQLite, and generates relation
seeds that help reverse engineer a relational database model from semi-structured
data.
The tool is intended for database reverse engineering workflows where raw JSON exports need to be turned into inspectable schema candidates, field frequency statistics, and foreign-key-like relationship graphs.
install
Install from crates.io:
Or install directly from Git:
Usage
|
Large JSONL inputs
For JSONL input, dump-json-refs streams records instead of loading the whole
file into memory. This keeps large JSONL extraction practical while preserving
the same schema, field count, object path, alias, and array index reference
semantics as normal JSONL processing.
Output
The command removes and recreates the output directory, then creates:
- canonical schema JSON files under
refs/ - relative symlinks for duplicate canonical schemas on Unix, or copied schema files on platforms without symlink support
refs/schemas.sqlite- an optional schema relation graph, such as
refs/schema.mmd,refs/schema.md, orrefs/schema.dot
Stdout prints only the [summary] section by default.
Use --output with no file to print the standard report to stdout. The standard
report includes schemas, field counts, aliases, and summary metadata.
Use --output <file> to write the standard report to a file while keeping
stdout to [summary].
Use --full-report <file> to write the full report, including object path,
array index, site occurrence, site field count, and presence-shape mappings,
while keeping stdout to [summary].
Presence-shape rows are enabled by default and capped per site:
--no-presence-shapes: skipschema_site_presence_shapesrows while still writing exact site object and field counts.--max-presence-shapes-per-site <N>: store at mostNdistinct shapes for a single(schema_path, site_path, site_kind). The default is1024;0is rejected.
Graphs
Graph generation reads schema_relations from SQLite. It can run immediately
after extraction or replay from an existing index:
Graph options:
--graph [FILE]: writes a graph projection. IfFILEis omitted, the default path depends on the format.--graph-format mermaid|mermaid-md|dot: selects Mermaid, fenced Mermaid Markdown, or Graphviz DOT output. Mermaid is the default.--graph-rankdir LR|TB|RL|BT: selects graph direction for Mermaid and DOT.--graph-include-marked: includes structural marker relations such as nested array items; otherwise graphs include foreign-key candidates only.--from-sqlite [FILE]: loads report and graph data from an existing SQLite index without regenerating refs. If the flag is used without a value,refs/schemas.sqliteis used.
SQLite schema:
(
schema_path TEXT NOT NULL,
object_path TEXT PRIMARY KEY
);
(
array_path TEXT NOT NULL,
array_index_path TEXT NOT NULL,
schema_path TEXT NOT NULL,
PRIMARY KEY (array_path, array_index_path)
);
(
schema_path TEXT PRIMARY KEY,
schema_kind TEXT NOT NULL,
schema_json TEXT NOT NULL
);
(
schema_path TEXT PRIMARY KEY,
object_count INTEGER NOT NULL CHECK (object_count > 0)
);
(
schema_path TEXT NOT NULL,
field_name TEXT NOT NULL,
field_type TEXT NOT NULL,
field_count INTEGER NOT NULL CHECK (field_count > 0),
PRIMARY KEY (schema_path, field_name, field_type)
);
(
schema_path TEXT NOT NULL,
site_path TEXT NOT NULL,
site_kind TEXT NOT NULL CHECK (site_kind IN ('object', 'array_item', 'root_collection')),
object_count INTEGER NOT NULL CHECK (object_count > 0),
PRIMARY KEY (schema_path, site_path, site_kind)
);
(
schema_path TEXT NOT NULL,
site_path TEXT NOT NULL,
site_kind TEXT NOT NULL CHECK (site_kind IN ('object', 'array_item', 'root_collection')),
field_name TEXT NOT NULL,
schema_field_type TEXT NOT NULL,
present_count INTEGER NOT NULL CHECK (present_count >= 0),
missing_count INTEGER NOT NULL CHECK (missing_count >= 0),
PRIMARY KEY (schema_path, site_path, site_kind, field_name)
);
(
schema_path TEXT NOT NULL,
site_path TEXT NOT NULL,
site_kind TEXT NOT NULL CHECK (site_kind IN ('object', 'array_item', 'root_collection')),
present_fields_hash TEXT NOT NULL,
present_fields_json TEXT NOT NULL,
missing_fields_json TEXT NOT NULL,
object_count INTEGER NOT NULL CHECK (object_count > 0),
first_array_index_path TEXT,
PRIMARY KEY (schema_path, site_path, site_kind, present_fields_hash)
);
(
schema_path TEXT NOT NULL,
site_path TEXT NOT NULL,
site_kind TEXT NOT NULL CHECK (site_kind IN ('object', 'array_item', 'root_collection')),
observed_shape_count INTEGER NOT NULL CHECK (observed_shape_count >= 0),
stored_shape_count INTEGER NOT NULL CHECK (stored_shape_count >= 0),
truncated INTEGER NOT NULL CHECK (truncated IN (0, 1)),
PRIMARY KEY (schema_path, site_path, site_kind)
);
(
relation_id INTEGER PRIMARY KEY AUTOINCREMENT,
from_schema_path TEXT NOT NULL,
to_schema_path TEXT NOT NULL,
relation_kind TEXT NOT NULL,
fk_owner TEXT NOT NULL,
fk_candidate INTEGER NOT NULL CHECK (fk_candidate IN (0, 1)),
field_name TEXT NOT NULL,
field_type TEXT NOT NULL,
cardinality TEXT NOT NULL,
required INTEGER NOT NULL CHECK (required IN (0, 1)),
mixed INTEGER NOT NULL CHECK (mixed IN (0, 1)),
nested_array_depth INTEGER NOT NULL DEFAULT 0 CHECK (nested_array_depth >= 0),
via_schema_path TEXT,
via_array_path TEXT,
parent_schema_path TEXT NOT NULL,
child_schema_path TEXT NOT NULL,
parent_object_count INTEGER NOT NULL DEFAULT 0 CHECK (parent_object_count >= 0),
child_object_count INTEGER NOT NULL DEFAULT 0 CHECK (child_object_count >= 0),
field_count INTEGER NOT NULL DEFAULT 0 CHECK (field_count >= 0),
UNIQUE (
from_schema_path,
to_schema_path,
relation_kind,
field_name,
field_type,
via_schema_path,
via_array_path
)
);
Semantics
- JSON object input: root path is
<filename-without-ext>orrootfor stdin. - JSON top-level array of objects: root collection path is
<filename-without-ext>_reforroot_itemfor stdin. - JSONL input: every JSON value must be an object; root collection naming is the same as top-level array.
- Primitive labels:
string,number,boolean,null, optional suffix?, union via|. - Object fields use a plain reference such as
<path>.json; nullable or missing object fields do not add?or|nullto that reference. - Every array field uses
array(...): primitive arrays use labels such asarray(string), object arrays usearray(<path>.json), nested arrays add another wrapper, and mixed object/string items usearray(<path>.json|string). - A field observed both as an array and as a scalar preserves both shapes, for example
array(<path>.json)|string. - Object arrays use one
<array_path>.jsonwhen homogeneous. Heterogeneous object arrays use<array_path>.jsoncontaining{ "$refs_mut": "<array_path>/" }, with distinct schemas stored as<array_path>/<index_path>.json. array_index_refs.array_index_pathis a JSON array of every enclosing array position;[1,0]identifies the first nested item inside the second outer item.- JSONL input ignores raw NUL padding only at the start of a physical record.
- An incomplete trailing JSONL record may be ignored to tolerate partially written files.
- Other malformed JSON remains an error that identifies its line.
- Top-level collection:
- homogeneous: one
<root_collection>.json - heterogeneous: distinct
<root_collection>/<first_index>.jsonfiles only
- homogeneous: one
Each walked JSON object increments the object_count of the canonical schema
selected by schema_paths or array_index_refs. Each key present in that
object increments its field_count; a key with a null value is present, while
an omitted key is not. $refs_mut array containers are structural schemas, not
objects, and are not counted.
The schema_site_* tables preserve the concrete occurrence site before
canonical aliasing. schema_site_field_counts records structural key presence
and missingness by site, using canonical schema fields as the missingness source
of truth. schema_site_presence_shapes records sorted field combinations that
appeared together at a site, with missing fields computed from the same
canonical field set. If a site's shape cap is reached,
schema_site_presence_shape_limits.truncated is 1; site object and field
counts remain exact.
schema_relations stores inferred links between canonical schemas. Relation
rows describe direct object references, array item references, heterogeneous
array variants, and nested array markers. Rows marked with fk_candidate = 1
are the default graph source for relational reverse engineering; marked
structural rows can be included with --graph-include-marked.