# Data Modelling SDK

Shared SDK for model operations across platforms (API, WASM, Native).

Copyright (c) 2025 Mark Olliver - Licensed under MIT

## CLI Tool

The SDK includes a command-line interface (CLI) for importing and exporting schemas. See [CLI.md](docs/CLI.md) for detailed usage instructions.

**Quick Start:**
```bash
# Build the CLI (with OpenAPI and ODPS validation support)
cargo build --release --bin data-modelling-cli --features cli,openapi,odps-validation

# Run it
./target/release/data-modelling-cli --help
```

**Note:** GitHub release binaries include OpenAPI support by default. For local builds, enable the `openapi` feature for OpenAPI import/export and the `odps-validation` feature for ODPS schema validation.

**ODPS Import/Export Examples:**
```bash
# Import ODPS YAML file
data-modelling-cli import odps product.odps.yaml

# Export ODCS to ODPS format
data-modelling-cli export odps input.odcs.yaml output.odps.yaml

# Test ODPS round-trip (requires odps-validation feature)
cargo run --bin test-odps --features odps-validation,cli -- product.odps.yaml --verbose
```

## Features

- **Storage Backends**: File system, browser storage (IndexedDB/localStorage), and HTTP API
- **Database Backends**: DuckDB (embedded) and PostgreSQL for high-performance queries
- **Model Loading/Saving**: Load and save models from various storage backends
- **Import/Export**: Import from SQL (PostgreSQL, MySQL, SQLite, Generic, Databricks), ODCS, ODCL, JSON Schema, AVRO, Protobuf (proto2/proto3), CADS, ODPS, BPMN, DMN, and OpenAPI; export to the same formats
- **Decision Records (DDL)**: MADR-compliant Architecture Decision Records with full lifecycle management
- **Knowledge Base (KB)**: Domain-partitioned knowledge articles with Markdown content support
- **Business Domain Schema**: Organize systems, CADS nodes, and ODCS nodes within business domains
- **Universal Converter**: Convert any format to ODCS v3.1.0 format
- **OpenAPI to ODCS Converter**: Convert OpenAPI schema components to ODCS table definitions
- **Validation**: Table and relationship validation (naming conflicts, circular dependencies)
- **Relationship Modeling**: Crow's feet notation cardinality (zeroOrOne, exactlyOne, zeroOrMany, oneOrMany) and data flow directions
- **Schema Reference**: JSON Schema definitions for all supported formats in the `schemas/` directory
- **Database Sync**: Bidirectional sync between YAML files and database with change detection
- **Git Hooks**: Automatic pre-commit and post-checkout hooks for database synchronization

## Decision Records (DDL)

The SDK includes full support for **Architecture Decision Records** following the MADR (Markdown Any Decision Records) format. Decisions are stored as YAML files and can be exported to Markdown for documentation.

### Decision File Structure

```
workspace/
├── decisions/
│   ├── index.yaml                              # Decision index with metadata
│   ├── 0001-use-postgresql-database.yaml       # Individual decision records
│   ├── 0002-adopt-microservices.yaml
│   └── ...
└── decisions-md/                               # Markdown exports (auto-generated)
    ├── 0001-use-postgresql-database.md
    └── 0002-adopt-microservices.md
```

### Decision Lifecycle

Decisions follow a defined lifecycle with these statuses:
- **Draft**: Initial proposal, open for discussion
- **Proposed**: Formal proposal awaiting decision
- **Accepted**: Approved and in effect
- **Deprecated**: No longer recommended but still valid
- **Superseded**: Replaced by a newer decision
- **Rejected**: Not approved

### Decision Categories

- **Architecture**: System design and structure decisions
- **Technology**: Technology stack and tool choices
- **Process**: Development workflow decisions
- **Security**: Security-related decisions
- **Data**: Data modeling and storage decisions
- **Integration**: External system integration decisions
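
Taken together, a decision file plausibly looks like the sketch below. This is illustrative only: the field names are assumptions drawn from the statuses and categories above, not the SDK's exact schema.

```yaml
# Hypothetical decision record -- field names are illustrative, not the SDK's exact schema
id: 1
title: Use PostgreSQL
status: accepted       # draft | proposed | accepted | deprecated | superseded | rejected
category: technology   # architecture | technology | process | security | data | integration
domain: platform
context: We need a durable relational store for the platform domain.
decision: Adopt PostgreSQL as the primary operational database.
consequences: The team maintains PostgreSQL expertise and migration tooling.
```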

### CLI Commands

```bash
# Create a new decision
data-modelling-cli decision new --title "Use PostgreSQL" --domain platform

# List all decisions
data-modelling-cli decision list --workspace ./my-workspace

# Show a specific decision
data-modelling-cli decision show 1 --workspace ./my-workspace

# Filter by status or category
data-modelling-cli decision list --status accepted --category architecture

# Export decisions to Markdown
data-modelling-cli decision export --workspace ./my-workspace
```

## Knowledge Base (KB)

The SDK provides a **Knowledge Base** system for storing domain knowledge, guides, and documentation as structured articles.

### Knowledge Base File Structure

```
workspace/
├── knowledge/
│   ├── index.yaml                              # Knowledge index with metadata
│   ├── 0001-api-authentication-guide.yaml      # Individual knowledge articles
│   ├── 0002-deployment-procedures.yaml
│   └── ...
└── knowledge-md/                               # Markdown exports (auto-generated)
    ├── 0001-api-authentication-guide.md
    └── 0002-deployment-procedures.md
```

### Article Types

- **Guide**: Step-by-step instructions and tutorials
- **Reference**: API documentation and technical references
- **Concept**: Explanations of concepts and principles
- **Tutorial**: Learning-focused content with examples
- **Troubleshooting**: Problem-solving guides
- **Runbook**: Operational procedures

### Article Status

- **Draft**: Work in progress
- **Review**: Ready for peer review
- **Published**: Approved and available
- **Archived**: No longer actively maintained
- **Deprecated**: Outdated, pending replacement
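
An article file plausibly mirrors the same YAML shape. Again, this is a sketch: the field names are assumptions based on the types and statuses above, not the SDK's exact schema.

```yaml
# Hypothetical knowledge article -- field names are illustrative
id: 1
title: API Authentication Guide
type: guide         # guide | reference | concept | tutorial | troubleshooting | runbook
status: published   # draft | review | published | archived | deprecated
domain: platform
content: |
  ## Obtaining a token
  Request a token from the auth endpoint, then refresh before expiry...
```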

### CLI Commands

```bash
# Create a new knowledge article
data-modelling-cli knowledge new --title "API Authentication Guide" --domain platform --type guide

# List all articles
data-modelling-cli knowledge list --workspace ./my-workspace

# Show a specific article
data-modelling-cli knowledge show 1 --workspace ./my-workspace

# Filter by type, status, or domain
data-modelling-cli knowledge list --type guide --status published

# Search article content
data-modelling-cli knowledge search "authentication" --workspace ./my-workspace

# Export articles to Markdown
data-modelling-cli knowledge export --workspace ./my-workspace
```

## File Structure

The SDK organizes files using a flat file naming convention within a workspace:

```
workspace/
├── .git/                                        # Git folder (if present)
├── README.md                                    # Repository files
├── workspace.yaml                               # Workspace metadata with assets and relationships
├── myworkspace_sales_customers.odcs.yaml        # ODCS table: workspace_domain_resource.type.yaml
├── myworkspace_sales_orders.odcs.yaml           # Another ODCS table in sales domain
├── myworkspace_sales_crm_leads.odcs.yaml        # ODCS table with system: workspace_domain_system_resource.type.yaml
├── myworkspace_analytics_metrics.odps.yaml      # ODPS product file
├── myworkspace_platform_api.cads.yaml           # CADS asset file
├── myworkspace_platform_api.openapi.yaml        # OpenAPI specification file
├── myworkspace_ops_approval.bpmn.xml            # BPMN process model file
└── myworkspace_ops_routing.dmn.xml              # DMN decision model file
```

### File Naming Convention

Files follow the pattern `{workspace}_{domain}_{system}_{resource}.{type}.{ext}` (see the sketch after this list):

- **workspace**: The workspace name (required)
- **domain**: The business domain (required)
- **system**: The system within the domain (optional)
- **resource**: The resource/asset name (required)
- **type**: The asset type (`odcs`, `odps`, `cads`, `openapi`, `bpmn`, `dmn`)
- **ext**: File extension (`yaml`, `xml`, `json`)
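
As a sketch of how the convention composes (plain Rust, not an SDK API), a filename can be assembled from its parts, omitting the optional system segment when absent:

```rust
/// Build a filename per `{workspace}_{domain}_{system}_{resource}.{type}.{ext}`.
/// Illustrative helper only -- not part of the SDK's public API.
fn asset_filename(
    workspace: &str,
    domain: &str,
    system: Option<&str>,
    resource: &str,
    asset_type: &str, // "odcs" | "odps" | "cads" | "openapi" | "bpmn" | "dmn"
    ext: &str,        // "yaml" | "xml" | "json"
) -> String {
    match system {
        // With a system: workspace_domain_system_resource.type.ext
        Some(sys) => format!("{workspace}_{domain}_{sys}_{resource}.{asset_type}.{ext}"),
        // Without one: workspace_domain_resource.type.ext
        None => format!("{workspace}_{domain}_{resource}.{asset_type}.{ext}"),
    }
}

// asset_filename("myworkspace", "sales", Some("crm"), "leads", "odcs", "yaml")
//   => "myworkspace_sales_crm_leads.odcs.yaml"
```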

### Workspace-Level Files

- `workspace.yaml`: Workspace metadata including domains, systems, asset references, and relationships

### Asset Types

- `*.odcs.yaml`: ODCS table/schema definitions (Open Data Contract Standard)
- `*.odps.yaml`: ODPS data product definitions (Open Data Product Standard)
- `*.cads.yaml`: CADS asset definitions (architecture assets)
- `*.openapi.yaml` / `*.openapi.json`: OpenAPI specification files
- `*.bpmn.xml`: BPMN 2.0 process model files
- `*.dmn.xml`: DMN 1.3 decision model files

## Usage

### File System Backend (Native Apps)

```rust
use data_modelling_sdk::storage::filesystem::FileSystemStorageBackend;
use data_modelling_sdk::model::ModelLoader;

let storage = FileSystemStorageBackend::new("/path/to/workspace");
let loader = ModelLoader::new(storage);
let result = loader.load_model("workspace_path").await?;
```

### Browser Storage Backend (WASM Apps)

```rust
use data_modelling_sdk::storage::browser::BrowserStorageBackend;
use data_modelling_sdk::model::ModelLoader;

let storage = BrowserStorageBackend::new("db_name", "store_name");
let loader = ModelLoader::new(storage);
let result = loader.load_model("workspace_path").await?;
```

### API Backend (Online Mode)

```rust
use data_modelling_sdk::storage::api::ApiStorageBackend;
use data_modelling_sdk::model::ModelLoader;

let storage = ApiStorageBackend::new("http://localhost:8081/api/v1", Some("session_id"));
let loader = ModelLoader::new(storage);
let result = loader.load_model("workspace_path").await?;
```
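
Saving goes through the same backend abstraction. The snippet below is a hypothetical sketch that assumes a `ModelSaver` mirroring `ModelLoader` (names unverified; check the crate docs for the actual saving API):

```rust
use data_modelling_sdk::storage::filesystem::FileSystemStorageBackend;
// Hypothetical import: assumes a ModelSaver counterpart to ModelLoader.
use data_modelling_sdk::model::ModelSaver;

// `model` is a workspace previously loaded via ModelLoader (see above).
let storage = FileSystemStorageBackend::new("/path/to/workspace");
let saver = ModelSaver::new(storage);
saver.save_model("workspace_path", &model).await?;
```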

### WASM Bindings (Browser/Offline Mode)

The SDK exposes WASM bindings for parsing and export operations, enabling offline functionality in web applications.

**Build the WASM module**:
```bash
wasm-pack build --target web --out-dir pkg --features wasm
```

**Use in JavaScript/TypeScript**:
```javascript
import init, { parseOdcsYaml, exportToOdcsYaml } from './pkg/data_modelling_sdk.js';

// Initialize the module
await init();

// Parse ODCS YAML
const yaml = `apiVersion: v3.1.0
kind: DataContract
name: users
schema:
  fields:
    - name: id
      type: bigint`;

const resultJson = parseOdcsYaml(yaml);
const result = JSON.parse(resultJson);
console.log('Parsed tables:', result.tables);

// Export to ODCS YAML
const workspace = {
  tables: [{
    id: "550e8400-e29b-41d4-a716-446655440000",
    name: "users",
    columns: [{ name: "id", data_type: "bigint", nullable: false, primary_key: true }]
  }],
  relationships: []
};

const exportedYaml = exportToOdcsYaml(JSON.stringify(workspace));
console.log('Exported YAML:', exportedYaml);
```

**Available WASM Functions**:

**Import/Export**:
- `parseOdcsYaml(yamlContent: string): string` - Parse ODCS YAML to workspace structure
- `exportToOdcsYaml(workspaceJson: string): string` - Export workspace to ODCS YAML
- `importFromSql(sqlContent: string, dialect: string): string` - Import from SQL (supported dialects: "postgres"/"postgresql", "mysql", "sqlite", "generic", "databricks")
- `importFromAvro(avroContent: string): string` - Import from AVRO schema
- `importFromJsonSchema(jsonSchemaContent: string): string` - Import from JSON Schema
- `importFromProtobuf(protobufContent: string): string` - Import from Protobuf
- `importFromCads(yamlContent: string): string` - Import CADS (Compute Asset Description Specification) YAML
- `importFromOdps(yamlContent: string): string` - Import ODPS (Open Data Product Standard) YAML
- `exportToOdps(productJson: string): string` - Export ODPS data product to YAML format
- `validateOdps(yamlContent: string): void` - Validate ODPS YAML content against ODPS JSON Schema (requires `odps-validation` feature)
- `importBpmnModel(domainId: string, xmlContent: string, modelName?: string): string` - Import BPMN 2.0 XML model
- `importDmnModel(domainId: string, xmlContent: string, modelName?: string): string` - Import DMN 1.3 XML model
- `importOpenapiSpec(domainId: string, content: string, apiName?: string): string` - Import OpenAPI 3.1.1 specification
- `exportToSql(workspaceJson: string, dialect: string): string` - Export to SQL (supported dialects: "postgres"/"postgresql", "mysql", "sqlite", "generic", "databricks")
- `exportToAvro(workspaceJson: string): string` - Export to AVRO schema
- `exportToJsonSchema(workspaceJson: string): string` - Export to JSON Schema
- `exportToProtobuf(workspaceJson: string): string` - Export to Protobuf
- `exportToCads(workspaceJson: string): string` - Export to CADS YAML
- `exportBpmnModel(xmlContent: string): string` - Export BPMN model to XML
- `exportDmnModel(xmlContent: string): string` - Export DMN model to XML
- `exportOpenapiSpec(content: string, sourceFormat: string, targetFormat?: string): string` - Export OpenAPI spec with optional format conversion
- `convertToOdcs(input: string, format?: string): string` - Universal converter: convert any format to ODCS v3.1.0
- `convertOpenapiToOdcs(openapiContent: string, componentName: string, tableName?: string): string` - Convert OpenAPI schema component to ODCS table
- `analyzeOpenapiConversion(openapiContent: string, componentName: string): string` - Analyze OpenAPI component conversion feasibility
- `migrateDataflowToDomain(dataflowYaml: string, domainName?: string): string` - Migrate DataFlow YAML to Domain schema format

**Domain Operations**:
- `createDomain(name: string): string` - Create a new business domain
- `addSystemToDomain(workspaceJson: string, domainId: string, systemJson: string): string` - Add a system to a domain
- `addCadsNodeToDomain(workspaceJson: string, domainId: string, nodeJson: string): string` - Add a CADS node to a domain
- `addOdcsNodeToDomain(workspaceJson: string, domainId: string, nodeJson: string): string` - Add an ODCS node to a domain

**Filtering**:
- `filterNodesByOwner(workspaceJson: string, owner: string): string` - Filter tables by owner
- `filterRelationshipsByOwner(workspaceJson: string, owner: string): string` - Filter relationships by owner
- `filterNodesByInfrastructureType(workspaceJson: string, infrastructureType: string): string` - Filter tables by infrastructure type
- `filterRelationshipsByInfrastructureType(workspaceJson: string, infrastructureType: string): string` - Filter relationships by infrastructure type
- `filterByTags(workspaceJson: string, tag: string): string` - Filter nodes and relationships by tag (supports Simple, Pair, and List tag formats)

## Database Support

The SDK includes an optional database layer for high-performance queries on large workspaces (10-100x faster than file-based operations).

### Database Backends

- **DuckDB**: Embedded analytical database, ideal for CLI tools and local development
- **PostgreSQL**: Server-based database for team environments and shared access

### Quick Start

```bash
# Build CLI with database support
cargo build --release --bin data-modelling-cli --features cli-full

# Initialize database for a workspace
./target/release/data-modelling-cli db init --workspace ./my-workspace

# Sync YAML files to database
./target/release/data-modelling-cli db sync --workspace ./my-workspace

# Query the database
./target/release/data-modelling-cli query "SELECT name FROM tables" --workspace ./my-workspace
```

### Configuration

Database settings are stored in `.data-model.toml`:

```toml
[database]
backend = "duckdb"
path = ".data-model.duckdb"

[sync]
auto_sync = true

[git]
hooks_enabled = true
```

### Git Hooks Integration

When initializing a database in a Git repository, the CLI automatically installs:

- **Pre-commit hook**: Exports database changes to YAML before commit
- **Post-checkout hook**: Syncs YAML files to database after checkout

This ensures YAML files and database stay in sync across branches and collaborators.

See [CLI.md](docs/CLI.md) for detailed database command documentation.

## Development

### Pre-commit Hooks

This project uses pre-commit hooks to ensure code quality. Install them with:

```bash
# Install pre-commit (if not already installed)
pip install pre-commit

# Install the git hooks
pre-commit install

# Run hooks manually on all files
pre-commit run --all-files
```

The hooks will automatically run on `git commit` and check:
- Rust formatting (`cargo fmt`)
- Rust linting (`cargo clippy`)
- Security audit (`cargo audit`)
- File formatting (trailing whitespace, end of file, etc.)
- YAML/TOML/JSON syntax

### CI/CD

GitHub Actions workflows automatically run on push and pull requests:
- **Lint**: Format check, clippy, and security audit
- **Test**: Unit and integration tests on Linux, macOS, and Windows
- **Build**: Release build verification
- **Publish**: Automatic publishing to crates.io on main branch (after all checks pass)

## Documentation

- **[Architecture Guide](docs/ARCHITECTURE.md)**: Comprehensive guide to project architecture, design decisions, and use cases
- **[Schema Overview Guide](docs/SCHEMA_OVERVIEW.md)**: Detailed documentation of all supported schemas

The SDK supports:
- **ODCS v3.1.0**: Primary format for data contracts (tables)
- **ODCL v1.2.1**: Legacy data contract format (backward compatibility)
- **ODPS**: Data products linking to ODCS Tables
- **CADS v1.0**: Compute assets (AI/ML models, applications, pipelines)
- **BPMN 2.0**: Business Process Model and Notation (process models stored in native XML)
- **DMN 1.3**: Decision Model and Notation (decision models stored in native XML)
- **OpenAPI 3.1.1**: API specifications (stored in native YAML or JSON)
- **Business Domain Schema**: Organize systems, CADS nodes, and ODCS nodes
- **Universal Converter**: Convert any format to ODCS v3.1.0
- **OpenAPI to ODCS Converter**: Convert OpenAPI schema components to ODCS table definitions

### Schema Reference Directory

The SDK maintains JSON Schema definitions for all supported formats in the `schemas/` directory:

- **ODCS v3.1.0**: `schemas/odcs-json-schema-v3.1.0.json` - Primary format for data contracts
- **ODCL v1.2.1**: `schemas/odcl-json-schema-1.2.1.json` - Legacy data contract format
- **ODPS**: `schemas/odps-json-schema-latest.json` - Data products linking to ODCS tables
- **CADS v1.0**: `schemas/cads.schema.json` - Compute assets (AI/ML models, applications, pipelines)

These schemas serve as authoritative references for validation, documentation, and compliance. See [schemas/README.md](schemas/README.md) for detailed information about each schema.

## Data Pipeline

The SDK includes a complete data pipeline for ingesting JSON data, inferring schemas, and mapping to target formats.

### Pipeline Features

- **JSON Ingestion**: Ingest JSON/JSONL files into a staging database with deduplication
- **S3 Ingestion**: Ingest directly from AWS S3 buckets with streaming downloads (feature: `s3`)
- **Databricks Volumes**: Ingest from Databricks Unity Catalog Volumes (feature: `databricks`)
- **Progress Reporting**: Real-time progress bars with throughput metrics
- **Schema Inference**: Automatically infer types, formats, and nullability from data
- **LLM Refinement**: Optionally enhance schemas using Ollama or local LLM models
- **Schema Mapping**: Map inferred schemas to target schemas with transformation generation
- **Checkpointing**: Resume pipelines from the last successful stage
- **Secure Credentials**: Credential wrapper types that prevent accidental logging (see the sketch below)
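
The redaction pattern behind that last bullet is straightforward to picture: a newtype whose `Debug` output hides the secret, so the raw value only escapes through an explicit call. This is a generic illustration, not the SDK's actual type:

```rust
use std::fmt;

/// Illustrative credential wrapper -- not the SDK's actual type.
pub struct Secret(String);

impl Secret {
    pub fn new(value: impl Into<String>) -> Self {
        Secret(value.into())
    }

    /// Deliberate, greppable access point for the raw value.
    pub fn expose(&self) -> &str {
        &self.0
    }
}

impl fmt::Debug for Secret {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("Secret(***redacted***)")
    }
}

fn main() {
    let key = Secret::new("s3-access-key");
    println!("{key:?}");     // prints: Secret(***redacted***)
    let _raw = key.expose(); // explicit opt-in to the raw value
}
```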

### Quick Start

```bash
# Build with pipeline support
cargo build --release -p odm --features pipeline

# Initialize staging database
odm staging init staging.duckdb

# Run full pipeline
odm pipeline run \
  --database staging.duckdb \
  --source ./json-data \
  --output-dir ./output \
  --verbose

# Check pipeline status
odm pipeline status --database staging.duckdb
```

### Schema Mapping

Map source schemas to target schemas with fuzzy matching and transformation script generation:

```bash
# Map schemas with fuzzy matching
odm map source.json target.json --fuzzy --min-similarity 0.7

# Generate SQL transformation
odm map source.json target.json \
  --transform-format sql \
  --transform-output transform.sql

# Generate Python transformation
odm map source.json target.json \
  --transform-format python \
  --transform-output transform.py
```

See [CLI.md](docs/CLI.md) for detailed pipeline and mapping documentation.

## Status

The SDK provides comprehensive support for multiple data modeling formats:

- ✅ Storage backend abstraction and implementations
- ✅ Database backend abstraction (DuckDB, PostgreSQL)
- ✅ Model loader/saver structure
- ✅ Full import/export implementation for all supported formats
- ✅ Validation module structure
- ✅ Business Domain schema support
- ✅ Universal format converter
- ✅ Enhanced tag support (Simple, Pair, List)
- ✅ Full ODCS/ODCL field preservation
- ✅ Schema reference directory (`schemas/`) with JSON Schema definitions for all supported formats
- ✅ Bidirectional YAML ↔ Database sync with change detection
- ✅ Git hooks for automatic synchronization
- ✅ Decision Records (DDL) with MADR format support
- ✅ Knowledge Base (KB) with domain partitioning
- ✅ Data Pipeline with staging, inference, and mapping
- ✅ Schema Mapping with fuzzy matching and transformation generation
- ✅ LLM-enhanced schema refinement (Ollama and local models)
- ✅ S3 ingestion with AWS SDK for Rust
- ✅ Databricks Unity Catalog Volumes ingestion
- ✅ Real-time progress reporting with indicatif
- ✅ Secure credential handling with automatic redaction