bq-schema-gen
Generate BigQuery schemas from JSON or CSV data. Unlike BigQuery's built-in auto-detect, which examines only the first 500 records, this tool processes every record to generate a complete and accurate schema.
Quick Start
# Install
# Generate a schema
Features
- Schema Generation - Infer BigQuery schemas from JSON or CSV files
- Schema Diff - Compare schemas and detect breaking changes
- Data Validation - Validate data against existing schemas
- Watch Mode - Auto-regenerate schemas when files change
- Parallel Processing - Fast processing of large datasets
- Multiple Output Formats - JSON, DDL, JSON Schema
Installation
From crates.io
Using Homebrew
From GitHub Releases
Download pre-built packages from GitHub Releases.
Each release includes:
- Pre-compiled binary
- Shell completions (bash, zsh, fish, PowerShell)
- Man pages
# Example: Extract and install on macOS/Linux
# Optionally install completions (e.g., for zsh)
From Source
Usage
Generate Schema
From stdin:
From a file:
Multiple files with glob patterns:
Output separate schemas per file:
CSV input:
Compare Schemas (diff)
Compare two schemas to identify changes:
Example output:
Schema Diff Report
==================
Summary: 1 added, 1 removed, 1 modified (2 breaking)
Added Fields:
+ email (STRING, NULLABLE)
Removed Fields:
- legacy_id (INTEGER, NULLABLE) [BREAKING]
Modified Fields:
~ name: Mode changed: NULLABLE -> REQUIRED [BREAKING]
Output formats: text (default), json, json-patch, sql
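The classification in the report above can be sketched in a few lines of Rust. This is a minimal illustration, not the tool's implementation: the `Field`, `Change`, and `diff_schemas` names are hypothetical, and the breaking rules (removals, type changes, and any mode change other than REQUIRED -> NULLABLE are breaking; a new REQUIRED field is breaking; `strict` marks everything breaking) are assumptions chosen to match the example output.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Nullable,
    Required,
    Repeated,
}

impl Mode {
    fn as_str(self) -> &'static str {
        match self {
            Mode::Nullable => "NULLABLE",
            Mode::Required => "REQUIRED",
            Mode::Repeated => "REPEATED",
        }
    }
}

/// Hypothetical, simplified schema field; the field name is the map key.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Field {
    field_type: String,
    mode: Mode,
}

#[derive(Debug, PartialEq, Eq)]
enum Kind {
    Added,
    Removed,
    Modified,
}

#[derive(Debug)]
struct Change {
    kind: Kind,
    name: String,
    detail: String,
    breaking: bool,
}

/// Compares two flat schemas and classifies each difference.
fn diff_schemas(
    old: &BTreeMap<String, Field>,
    new: &BTreeMap<String, Field>,
    strict: bool,
) -> Vec<Change> {
    let mut changes = Vec::new();
    for (name, new_field) in new {
        match old.get(name) {
            None => changes.push(Change {
                kind: Kind::Added,
                name: name.clone(),
                detail: format!("{}, {}", new_field.field_type, new_field.mode.as_str()),
                // Assumed rule: a new REQUIRED field breaks existing writers.
                breaking: strict || new_field.mode == Mode::Required,
            }),
            Some(old_field) if old_field != new_field => {
                let type_changed = old_field.field_type != new_field.field_type;
                let mode_changed = old_field.mode != new_field.mode;
                // Relaxing REQUIRED to NULLABLE is the one mode change treated as safe.
                let relaxed =
                    old_field.mode == Mode::Required && new_field.mode == Mode::Nullable;
                let mut parts = Vec::new();
                if type_changed {
                    parts.push(format!(
                        "Type changed: {} -> {}",
                        old_field.field_type, new_field.field_type
                    ));
                }
                if mode_changed {
                    parts.push(format!(
                        "Mode changed: {} -> {}",
                        old_field.mode.as_str(),
                        new_field.mode.as_str()
                    ));
                }
                changes.push(Change {
                    kind: Kind::Modified,
                    name: name.clone(),
                    detail: parts.join("; "),
                    breaking: strict || type_changed || (mode_changed && !relaxed),
                });
            }
            Some(_) => {} // unchanged field
        }
    }
    for (name, old_field) in old {
        if !new.contains_key(name) {
            changes.push(Change {
                kind: Kind::Removed,
                name: name.clone(),
                detail: format!("{}, {}", old_field.field_type, old_field.mode.as_str()),
                breaking: true, // removals always break readers of that column
            });
        }
    }
    changes
}
```

Fed the two schemas behind the example report (`legacy_id` removed, `email` added, `name` tightened to REQUIRED), this sketch yields three changes, two of them breaking. Note that it treats every type change as breaking, including INTEGER -> FLOAT; the real tool may be more lenient.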
Validate Data
Validate data against an existing schema:
Watch Mode
Auto-regenerate schemas when files change:
CLI Reference
| Flag | Description |
|---|---|
| --input-format <FORMAT> | Input format: json (default) or csv |
| --output-format <FORMAT> | Output format: json, ddl, debug-map, or json-schema |
| --table-name <NAME> | Table name for DDL output |
| -o, --output <FILE> | Output file (stdout if not provided) |
| -q, --quiet | Suppress progress messages |
| --per-file | Output separate schema for each input file |
| --output-dir <DIR> | Output directory for per-file schemas |
| --keep-nulls | Include null values and empty containers in schema |
| --quoted-values-are-strings | Treat quoted values as strings |
| --infer-mode | Infer REQUIRED mode for CSV fields |
| --sanitize-names | Replace invalid characters in field names |
| --preserve-input-sort-order | Preserve field order from input |
| --existing-schema-path <FILE> | Merge with an existing schema |
| --ignore-invalid-lines | Skip unparseable lines |
All flags support both kebab-case (--keep-nulls) and underscore (--keep_nulls) syntax.
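One way to accept both spellings is to normalize each argument before it reaches the argument parser. The sketch below is an assumption about how this could be done, not a description of the tool's internals; `normalize_flag` is a hypothetical helper name.

```rust
/// Rewrites underscores in a long flag's name to hyphens.
/// Positional arguments, short flags, and inline `=value` parts are left untouched.
fn normalize_flag(arg: &str) -> String {
    // Only long flags (`--name`) are rewritten.
    let Some(rest) = arg.strip_prefix("--") else {
        return arg.to_string();
    };
    // Split off an inline value so underscores inside it are preserved.
    let (name, value) = match rest.split_once('=') {
        Some((name, value)) => (name, Some(value)),
        None => (rest, None),
    };
    let name = name.replace('_', "-");
    match value {
        Some(value) => format!("--{name}={value}"),
        None => format!("--{name}"),
    }
}
```

With this in place, `--keep_nulls` and `--keep-nulls` reach the parser as the same flag, while a file name such as `my_file.json` passes through unchanged.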
Diff Options
| Flag | Description |
|---|---|
| --format <FORMAT> | Output: text, json, json-patch, sql |
| --color <WHEN> | Color output: auto, always, never |
| --strict | Flag ALL changes as breaking |
| -o, --output <FILE> | Output file |
Output Formats
JSON (default)
Standard BigQuery schema format:
DDL
BigQuery CREATE TABLE statement:
(
age INT64,
name STRING
);
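To show how a flattened schema could turn into a statement like the one above, here is a minimal sketch. The `ddl_type` and `render_ddl` helpers and the two-space indentation are assumptions for illustration. The type mapping uses BigQuery's standard SQL names (INTEGER becomes INT64, FLOAT becomes FLOAT64, BOOLEAN becomes BOOL), and nested RECORD fields are not handled.

```rust
/// Maps a BigQuery schema type name to its standard SQL spelling.
/// Types without a different SQL name (STRING, DATE, ...) pass through.
fn ddl_type(schema_type: &str) -> &str {
    match schema_type {
        "INTEGER" => "INT64",
        "FLOAT" => "FLOAT64",
        "BOOLEAN" => "BOOL",
        other => other,
    }
}

/// Renders a CREATE TABLE statement from `(column name, schema type)` pairs.
/// `table_name` plays the role of the value passed to `--table-name`.
fn render_ddl(table_name: &str, fields: &[(&str, &str)]) -> String {
    let columns: Vec<String> = fields
        .iter()
        .map(|(name, ty)| format!("  {} {}", name, ddl_type(ty)))
        .collect();
    format!("CREATE TABLE `{}`\n(\n{}\n);", table_name, columns.join(",\n"))
}
```

Calling `render_ddl` with a hypothetical table name such as `my_dataset.people` and the fields `age` (INTEGER) and `name` (STRING) produces the column list shown in the example.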
JSON Schema
JSON Schema draft-07 format:
Type Inference
The tool automatically infers BigQuery types:
| JSON Type | BigQuery Type |
|---|---|
| string | STRING, DATE, TIME, or TIMESTAMP (auto-detected) |
| number (integer) | INTEGER |
| number (float) | FLOAT |
| boolean | BOOLEAN |
| object | RECORD |
| array | Type inferred from the elements, with mode REPEATED |
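The mapping in the table can be sketched as a single match over a JSON value. This is an illustration under stated assumptions, not the crate's code: `Json` is a minimal stand-in for a parsed value (a real program would more likely use `serde_json::Value`), the date and time checks accept only the plain `YYYY-MM-DD`, `HH:MM:SS[.fraction]`, and `YYYY-MM-DD[T| ]HH:MM:SS[.fraction][Z]` shapes, and only the first array element is inspected.

```rust
/// Minimal stand-in for a parsed JSON value.
enum Json {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// True for exactly `YYYY-MM-DD`.
fn is_date(b: &[u8]) -> bool {
    b.len() == 10
        && b.iter().enumerate().all(|(i, c)| match i {
            4 | 7 => *c == b'-',
            _ => c.is_ascii_digit(),
        })
}

/// True for `HH:MM:SS` with optional fractional seconds.
fn is_time(b: &[u8]) -> bool {
    if b.len() < 8 {
        return false;
    }
    let head_ok = b[..8].iter().enumerate().all(|(i, c)| match i {
        2 | 5 => *c == b':',
        _ => c.is_ascii_digit(),
    });
    let tail = &b[8..];
    let tail_ok = tail.is_empty()
        || (tail.len() > 1 && tail[0] == b'.' && tail[1..].iter().all(|c| c.is_ascii_digit()));
    head_ok && tail_ok
}

/// True for a date, then `T` or a space, then a time, with an optional trailing `Z`.
fn is_timestamp(b: &[u8]) -> bool {
    if b.len() < 19 {
        return false;
    }
    let rest = &b[11..];
    let rest = rest.strip_suffix(b"Z").unwrap_or(rest);
    is_date(&b[..10]) && (b[10] == b'T' || b[10] == b' ') && is_time(rest)
}

/// Picks the most specific type a string value qualifies for.
fn infer_string(s: &str) -> &'static str {
    let b = s.as_bytes();
    if is_timestamp(b) {
        "TIMESTAMP"
    } else if is_date(b) {
        "DATE"
    } else if is_time(b) {
        "TIME"
    } else {
        "STRING"
    }
}

/// Returns `(type, mode)`, or `None` when the value carries no type information.
fn infer(value: &Json) -> Option<(&'static str, &'static str)> {
    match value {
        Json::Null => None,
        Json::Bool(_) => Some(("BOOLEAN", "NULLABLE")),
        Json::Int(_) => Some(("INTEGER", "NULLABLE")),
        Json::Float(_) => Some(("FLOAT", "NULLABLE")),
        Json::Str(s) => Some((infer_string(s), "NULLABLE")),
        Json::Object(_) => Some(("RECORD", "NULLABLE")),
        // An array keeps its element type and switches the mode to REPEATED.
        Json::Array(items) => {
            let (ty, _) = items.first().and_then(infer)?;
            Some((ty, "REPEATED"))
        }
    }
}
```

Null values and empty arrays yield `None` here because they carry no type on their own; how the tool records them depends on flags such as `--keep-nulls`.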
Type Evolution
Types evolve as more data is processed:
- INTEGER + FLOAT = FLOAT
- DATE/TIME/TIMESTAMP combinations = STRING
- Type widening is automatic (INTEGER -> FLOAT, anything -> STRING)
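The rules above amount to a merge function applied to each new observation of a field. The sketch below takes the list literally, so any pair of differing types other than INTEGER and FLOAT falls back to STRING. The `BqType`, `merge`, and `evolve` names are hypothetical, and the actual tool may reject some combinations instead of widening them.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BqType {
    Boolean,
    Integer,
    Float,
    Date,
    Time,
    Timestamp,
    String,
}

/// Combines the type seen so far with the type of a new value.
fn merge(a: BqType, b: BqType) -> BqType {
    match (a, b) {
        // Same type on both sides: nothing changes.
        (x, y) if x == y => x,
        // INTEGER + FLOAT = FLOAT
        (BqType::Integer, BqType::Float) | (BqType::Float, BqType::Integer) => BqType::Float,
        // Mixed DATE/TIME/TIMESTAMP, and every other mismatch, widens to STRING.
        _ => BqType::String,
    }
}

/// Folds every observed type for one field into its final type.
/// Returns `None` when the field was never observed.
fn evolve(observed: &[BqType]) -> Option<BqType> {
    observed.iter().copied().reduce(merge)
}
```

This is also why processing every record matters: a field that holds integers for the first 500 rows and a float on row 501 only resolves to FLOAT if row 501 is actually seen.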
Shell Completions
Shell completions for bash, zsh, fish, and PowerShell are included in GitHub releases and are installed automatically when you install via Homebrew.
Library Usage
The crate can be used as a Rust library:
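// Assumed names: GeneratorConfig, SchemaGenerator, SchemaMap, and the call
// arguments below; confirm the exact API on docs.rs.
use bq_schema_gen::{GeneratorConfig, SchemaGenerator, SchemaMap};
use serde_json::json;

let config = GeneratorConfig::default();
let mut generator = SchemaGenerator::new(config);
let mut schema_map = SchemaMap::new();

// Illustrative record; any JSON object works here.
let record = json!({"name": "Alice", "age": 30});
generator.process_record(&record, &mut schema_map).unwrap();

let schema = generator.flatten_schema(&schema_map);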
See docs.rs for the full API documentation.
License
Apache-2.0 (same as the original Python project)
Credits
- Original Python implementation by Brian T. Park
- Rust port maintains compatibility with the original tool's behavior