dump-json-refs
dump-json-refs is a Rust CLI for extracting canonical metadata from JSON and
JSONL datasets. It walks object structures, writes deduplicated schema reference
files, indexes schema/object/field metadata into SQLite, and generates relation
seeds that help reverse engineer a relational database model from semi-structured
data.
The tool is intended for database reverse engineering workflows where raw JSON exports need to be turned into inspectable schema candidates, field frequency statistics, and foreign-key-like relationship graphs.
install
Install from crates.io:
Or install directly from Git:
Usage
|
Output
The command removes and recreates the output directory, then creates:
- canonical schema JSON files under
refs/ - relative symlinks for duplicate canonical schemas on Unix, or copied schema files on platforms without symlink support
refs/schemas.sqlite- an optional schema relation graph, such as
refs/schema.mmd,refs/schema.md, orrefs/schema.dot
Stdout prints a compact report with schemas, field counts, aliases, and summary
metadata. Use --output <file> to write the full report, including object path
and array index mappings, while keeping stdout to the summary.
Graphs
Graph generation reads schema_relations from SQLite. It can run immediately
after extraction or replay from an existing index:
Graph options:
--graph [FILE]: writes a graph projection. IfFILEis omitted, the default path depends on the format.--graph-format mermaid|mermaid-md|dot: selects Mermaid, fenced Mermaid Markdown, or Graphviz DOT output. Mermaid is the default.--graph-rankdir LR|TB|RL|BT: selects graph direction for Mermaid and DOT.--graph-include-marked: includes structural marker relations such as nested array items; otherwise graphs include foreign-key candidates only.--from-sqlite [FILE]: loads report and graph data from an existing SQLite index without regenerating refs. If the flag is used without a value,refs/schemas.sqliteis used.
SQLite schema:
(
schema_path TEXT NOT NULL,
object_path TEXT PRIMARY KEY
);
(
array_path TEXT NOT NULL,
array_index_path TEXT NOT NULL,
schema_path TEXT NOT NULL,
PRIMARY KEY (array_path, array_index_path)
);
(
schema_path TEXT PRIMARY KEY,
schema_kind TEXT NOT NULL,
schema_json TEXT NOT NULL
);
(
schema_path TEXT PRIMARY KEY,
object_count INTEGER NOT NULL CHECK (object_count > 0)
);
(
schema_path TEXT NOT NULL,
field_name TEXT NOT NULL,
field_type TEXT NOT NULL,
field_count INTEGER NOT NULL CHECK (field_count > 0),
PRIMARY KEY (schema_path, field_name)
);
(
relation_id INTEGER PRIMARY KEY AUTOINCREMENT,
from_schema_path TEXT NOT NULL,
to_schema_path TEXT NOT NULL,
relation_kind TEXT NOT NULL,
fk_owner TEXT NOT NULL,
fk_candidate INTEGER NOT NULL CHECK (fk_candidate IN (0, 1)),
field_name TEXT NOT NULL,
field_type TEXT NOT NULL,
cardinality TEXT NOT NULL,
required INTEGER NOT NULL CHECK (required IN (0, 1)),
mixed INTEGER NOT NULL CHECK (mixed IN (0, 1)),
nested_array_depth INTEGER NOT NULL DEFAULT 0 CHECK (nested_array_depth >= 0),
via_schema_path TEXT,
via_array_path TEXT,
parent_schema_path TEXT NOT NULL,
child_schema_path TEXT NOT NULL,
parent_object_count INTEGER NOT NULL DEFAULT 0 CHECK (parent_object_count >= 0),
child_object_count INTEGER NOT NULL DEFAULT 0 CHECK (child_object_count >= 0),
field_count INTEGER NOT NULL DEFAULT 0 CHECK (field_count >= 0),
UNIQUE (
from_schema_path,
to_schema_path,
relation_kind,
field_name,
field_type,
via_schema_path,
via_array_path
)
);
Semantics
- JSON object input: root path is
<filename-without-ext>orrootfor stdin. - JSON top-level array of objects: root collection path is
<filename-without-ext>_reforroot_itemfor stdin. - JSONL input: every JSON value must be an object; root collection naming is the same as top-level array.
- Primitive labels:
string,number,boolean,null, optional suffix?, union via|. - Object fields use a plain reference such as
<path>.json; nullable or missing object fields do not add?or|nullto that reference. - Every array field uses
array(...): primitive arrays use labels such asarray(string), object arrays usearray(<path>.json), nested arrays add another wrapper, and mixed object/string items usearray(<path>.json|string). - A field observed both as an array and as a scalar preserves both shapes, for example
array(<path>.json)|string. - Object arrays use one
<array_path>.jsonwhen homogeneous. Heterogeneous object arrays use<array_path>.jsoncontaining{ "$refs_mut": "<array_path>/" }, with distinct schemas stored as<array_path>/<index_path>.json. array_index_refs.array_index_pathis a JSON array of every enclosing array position;[1,0]identifies the first nested item inside the second outer item.- JSONL input ignores raw NUL padding only at the start of a physical record; other malformed JSON remains an error that identifies its line.
- Top-level collection:
- homogeneous: one
<root_collection>.json - heterogeneous: distinct
<root_collection>/<first_index>.jsonfiles only
- homogeneous: one
Each walked JSON object increments the object_count of the canonical schema selected by schema_paths or array_index_refs. Each key present in that object increments its field_count; a key with a null value is present, while an omitted key is not. $refs_mut array containers are structural schemas, not objects, and are not counted.
schema_relations stores inferred links between canonical schemas. Relation
rows describe direct object references, array item references, heterogeneous
array variants, and nested array markers. Rows marked with fk_candidate = 1
are the default graph source for relational reverse engineering; marked
structural rows can be included with --graph-include-marked.