# Data Modelling SDK

Shared SDK for model operations across platforms (API, WASM, Native).

Copyright (c) 2025 Mark Olliver - Licensed under MIT

## CLI Tool

The SDK includes a command-line interface (CLI) for importing and exporting schemas. See [CLI.md](docs/CLI.md) for detailed usage instructions.

**Quick Start:**
```bash
# Build the CLI (with OpenAPI and ODPS validation support)
cargo build --release --bin data-modelling-cli --features cli,openapi,odps-validation

# Run it
./target/release/data-modelling-cli --help
```

**Note:** GitHub release binaries include OpenAPI support by default. For local builds, enable the `openapi` feature for OpenAPI import/export and the `odps-validation` feature for ODPS schema validation.

**ODPS Import/Export Examples:**
```bash
# Import ODPS YAML file
data-modelling-cli import odps product.odps.yaml

# Export ODCS to ODPS format
data-modelling-cli export odps input.odcs.yaml output.odps.yaml

# Test ODPS round-trip (requires odps-validation feature)
cargo run --bin test-odps --features odps-validation,cli -- product.odps.yaml --verbose
```

## Features

- **Storage Backends**: File system, browser storage (IndexedDB/localStorage), and HTTP API
- **Database Backends**: DuckDB (embedded) and PostgreSQL for high-performance queries
- **Model Loading/Saving**: Load and save models from various storage backends
- **Import/Export**: Import from SQL (PostgreSQL, MySQL, SQLite, Generic, Databricks), ODCS, ODCL, JSON Schema, AVRO, Protobuf (proto2/proto3), CADS, ODPS, BPMN, DMN, and OpenAPI; export to the same formats
- **Decision Records (DDL)**: MADR-compliant Architecture Decision Records with full lifecycle management
- **Knowledge Base (KB)**: Domain-partitioned knowledge articles with Markdown content support
- **Business Domain Schema**: Organize systems, CADS nodes, and ODCS nodes within business domains
- **Universal Converter**: Convert any format to ODCS v3.1.0 format
- **OpenAPI to ODCS Converter**: Convert OpenAPI schema components to ODCS table definitions
- **Validation**: Table and relationship validation (naming conflicts, circular dependencies)
- **Relationship Modeling**: Crow's feet notation cardinality (zeroOrOne, exactlyOne, zeroOrMany, oneOrMany) and data flow directions
- **Schema Reference**: JSON Schema definitions for all supported formats in the `schemas/` directory
- **Database Sync**: Bidirectional sync between YAML files and database with change detection
- **Git Hooks**: Automatic pre-commit and post-checkout hooks for database synchronization

## Decision Records (DDL)

The SDK includes full support for **Architecture Decision Records** following the MADR (Markdown Any Decision Records) format. Decisions are stored as YAML files and can be exported to Markdown for documentation.

### Decision File Structure

```
workspace/
├── decisions/
│   ├── index.yaml                              # Decision index with metadata
│   ├── 0001-use-postgresql-database.yaml       # Individual decision records
│   ├── 0002-adopt-microservices.yaml
│   └── ...
└── decisions-md/                               # Markdown exports (auto-generated)
    ├── 0001-use-postgresql-database.md
    └── 0002-adopt-microservices.md
```

### Decision Lifecycle

Decisions follow a defined lifecycle with these statuses:
- **Draft**: Initial proposal, open for discussion
- **Proposed**: Formal proposal awaiting decision
- **Accepted**: Approved and in effect
- **Deprecated**: No longer recommended but still valid
- **Superseded**: Replaced by a newer decision
- **Rejected**: Not approved

### Decision Categories

- **Architecture**: System design and structure decisions
- **Technology**: Technology stack and tool choices
- **Process**: Development workflow decisions
- **Security**: Security-related decisions
- **Data**: Data modeling and storage decisions
- **Integration**: External system integration decisions
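
Taken together, a decision file plausibly looks like the sketch below. This is illustrative only: the field names are assumptions drawn from the statuses and categories above, not the SDK's exact schema.

```yaml
# Hypothetical decision record -- field names are illustrative, not the SDK's exact schema
id: 1
title: Use PostgreSQL
status: accepted       # draft | proposed | accepted | deprecated | superseded | rejected
category: technology   # architecture | technology | process | security | data | integration
domain: platform
context: We need a durable relational store for the platform domain.
decision: Adopt PostgreSQL as the primary operational database.
consequences: The team maintains PostgreSQL expertise and migration tooling.
```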

### CLI Commands

```bash
# Create a new decision
data-modelling-cli decision new --title "Use PostgreSQL" --domain platform

# List all decisions
data-modelling-cli decision list --workspace ./my-workspace

# Show a specific decision
data-modelling-cli decision show 1 --workspace ./my-workspace

# Filter by status or category
data-modelling-cli decision list --status accepted --category architecture

# Export decisions to Markdown
data-modelling-cli decision export --workspace ./my-workspace
```

## Knowledge Base (KB)

The SDK provides a **Knowledge Base** system for storing domain knowledge, guides, and documentation as structured articles.

### Knowledge Base File Structure

```
workspace/
├── knowledge/
│   ├── index.yaml                              # Knowledge index with metadata
│   ├── 0001-api-authentication-guide.yaml      # Individual knowledge articles
│   ├── 0002-deployment-procedures.yaml
│   └── ...
└── knowledge-md/                               # Markdown exports (auto-generated)
    ├── 0001-api-authentication-guide.md
    └── 0002-deployment-procedures.md
```

### Article Types

- **Guide**: Step-by-step instructions and tutorials
- **Reference**: API documentation and technical references
- **Concept**: Explanations of concepts and principles
- **Tutorial**: Learning-focused content with examples
- **Troubleshooting**: Problem-solving guides
- **Runbook**: Operational procedures

### Article Status

- **Draft**: Work in progress
- **Review**: Ready for peer review
- **Published**: Approved and available
- **Archived**: No longer actively maintained
- **Deprecated**: Outdated, pending replacement
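
An article file plausibly mirrors the same YAML shape. Again, this is a sketch: the field names are assumptions based on the types and statuses above, not the SDK's exact schema.

```yaml
# Hypothetical knowledge article -- field names are illustrative
id: 1
title: API Authentication Guide
type: guide         # guide | reference | concept | tutorial | troubleshooting | runbook
status: published   # draft | review | published | archived | deprecated
domain: platform
content: |
  ## Obtaining a token
  Request a token from the auth endpoint, then refresh before expiry...
```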

### CLI Commands

```bash
# Create a new knowledge article
data-modelling-cli knowledge new --title "API Authentication Guide" --domain platform --type guide

# List all articles
data-modelling-cli knowledge list --workspace ./my-workspace

# Show a specific article
data-modelling-cli knowledge show 1 --workspace ./my-workspace

# Filter by type, status, or domain
data-modelling-cli knowledge list --type guide --status published

# Search article content
data-modelling-cli knowledge search "authentication" --workspace ./my-workspace

# Export articles to Markdown
data-modelling-cli knowledge export --workspace ./my-workspace
```

## File Structure

The SDK organizes files using a flat file naming convention within a workspace:

```
workspace/
├── .git/                                        # Git folder (if present)
├── README.md                                    # Repository files
├── workspace.yaml                               # Workspace metadata with assets and relationships
├── myworkspace_sales_customers.odcs.yaml        # ODCS table: workspace_domain_resource.type.yaml
├── myworkspace_sales_orders.odcs.yaml           # Another ODCS table in sales domain
├── myworkspace_sales_crm_leads.odcs.yaml        # ODCS table with system: workspace_domain_system_resource.type.yaml
├── myworkspace_analytics_metrics.odps.yaml      # ODPS product file
├── myworkspace_platform_api.cads.yaml           # CADS asset file
├── myworkspace_platform_api.openapi.yaml        # OpenAPI specification file
├── myworkspace_ops_approval.bpmn.xml            # BPMN process model file
└── myworkspace_ops_routing.dmn.xml              # DMN decision model file
```

### File Naming Convention

Files follow the pattern `{workspace}_{domain}_{system}_{resource}.{type}.{ext}` (see the sketch after this list):

- **workspace**: The workspace name (required)
- **domain**: The business domain (required)
- **system**: The system within the domain (optional)
- **resource**: The resource/asset name (required)
- **type**: The asset type (`odcs`, `odps`, `cads`, `openapi`, `bpmn`, `dmn`)
- **ext**: File extension (`yaml`, `xml`, `json`)
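
As a sketch of how the convention composes (plain Rust, not an SDK API), a filename can be assembled from its parts, omitting the optional system segment when absent:

```rust
/// Build a filename per `{workspace}_{domain}_{system}_{resource}.{type}.{ext}`.
/// Illustrative helper only -- not part of the SDK's public API.
fn asset_filename(
    workspace: &str,
    domain: &str,
    system: Option<&str>,
    resource: &str,
    asset_type: &str, // "odcs" | "odps" | "cads" | "openapi" | "bpmn" | "dmn"
    ext: &str,        // "yaml" | "xml" | "json"
) -> String {
    match system {
        // With a system: workspace_domain_system_resource.type.ext
        Some(sys) => format!("{workspace}_{domain}_{sys}_{resource}.{asset_type}.{ext}"),
        // Without one: workspace_domain_resource.type.ext
        None => format!("{workspace}_{domain}_{resource}.{asset_type}.{ext}"),
    }
}

// asset_filename("myworkspace", "sales", Some("crm"), "leads", "odcs", "yaml")
//   => "myworkspace_sales_crm_leads.odcs.yaml"
```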

### Workspace-Level Files

- `workspace.yaml`: Workspace metadata including domains, systems, asset references, and relationships

### Asset Types

- `*.odcs.yaml`: ODCS table/schema definitions (Open Data Contract Standard)
- `*.odps.yaml`: ODPS data product definitions (Open Data Product Standard)
- `*.cads.yaml`: CADS asset definitions (architecture assets)
- `*.openapi.yaml` / `*.openapi.json`: OpenAPI specification files
- `*.bpmn.xml`: BPMN 2.0 process model files
- `*.dmn.xml`: DMN 1.3 decision model files

## Usage

### File System Backend (Native Apps)

```rust
use data_modelling_sdk::storage::filesystem::FileSystemStorageBackend;
use data_modelling_sdk::model::ModelLoader;

let storage = FileSystemStorageBackend::new("/path/to/workspace");
let loader = ModelLoader::new(storage);
let result = loader.load_model("workspace_path").await?;
```

### Browser Storage Backend (WASM Apps)

```rust
use data_modelling_sdk::storage::browser::BrowserStorageBackend;
use data_modelling_sdk::model::ModelLoader;

let storage = BrowserStorageBackend::new("db_name", "store_name");
let loader = ModelLoader::new(storage);
let result = loader.load_model("workspace_path").await?;
```

### API Backend (Online Mode)

```rust
use data_modelling_sdk::storage::api::ApiStorageBackend;
use data_modelling_sdk::model::ModelLoader;

let storage = ApiStorageBackend::new("http://localhost:8081/api/v1", Some("session_id"));
let loader = ModelLoader::new(storage);
let result = loader.load_model("workspace_path").await?;
```
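
Saving goes through the same backend abstraction. The snippet below is a hypothetical sketch that assumes a `ModelSaver` mirroring `ModelLoader` (names unverified; check the crate docs for the actual saving API):

```rust
use data_modelling_sdk::storage::filesystem::FileSystemStorageBackend;
// Hypothetical import: assumes a ModelSaver counterpart to ModelLoader.
use data_modelling_sdk::model::ModelSaver;

// `model` is a workspace previously loaded via ModelLoader (see above).
let storage = FileSystemStorageBackend::new("/path/to/workspace");
let saver = ModelSaver::new(storage);
saver.save_model("workspace_path", &model).await?;
```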

### WASM Bindings (Browser/Offline Mode)

The SDK exposes WASM bindings for parsing and export operations, enabling offline functionality in web applications.

**Build the WASM module**:
```bash
wasm-pack build --target web --out-dir pkg --features wasm
```

**Use in JavaScript/TypeScript**:
```javascript
import init, { parseOdcsYaml, exportToOdcsYaml } from './pkg/data_modelling_sdk.js';

// Initialize the module
await init();

// Parse ODCS YAML
const yaml = `apiVersion: v3.1.0
kind: DataContract
name: users
schema:
  fields:
    - name: id
      type: bigint`;

const resultJson = parseOdcsYaml(yaml);
const result = JSON.parse(resultJson);
console.log('Parsed tables:', result.tables);

// Export to ODCS YAML
const workspace = {
  tables: [{
    id: "550e8400-e29b-41d4-a716-446655440000",
    name: "users",
    columns: [{ name: "id", data_type: "bigint", nullable: false, primary_key: true }]
  }],
  relationships: []
};

const exportedYaml = exportToOdcsYaml(JSON.stringify(workspace));
console.log('Exported YAML:', exportedYaml);
```

**Available WASM Functions**:

**Import/Export**:
- `parseOdcsYaml(yamlContent: string): string` - Parse ODCS YAML to workspace structure
- `exportToOdcsYaml(workspaceJson: string): string` - Export workspace to ODCS YAML
- `importFromSql(sqlContent: string, dialect: string): string` - Import from SQL (supported dialects: "postgres"/"postgresql", "mysql", "sqlite", "generic", "databricks")
- `importFromAvro(avroContent: string): string` - Import from AVRO schema
- `importFromJsonSchema(jsonSchemaContent: string): string` - Import from JSON Schema
- `importFromProtobuf(protobufContent: string): string` - Import from Protobuf
- `importFromCads(yamlContent: string): string` - Import CADS (Compute Asset Description Specification) YAML
- `importFromOdps(yamlContent: string): string` - Import ODPS (Open Data Product Standard) YAML
- `exportToOdps(productJson: string): string` - Export ODPS data product to YAML format
- `validateOdps(yamlContent: string): void` - Validate ODPS YAML content against ODPS JSON Schema (requires `odps-validation` feature)
- `importBpmnModel(domainId: string, xmlContent: string, modelName?: string): string` - Import BPMN 2.0 XML model
- `importDmnModel(domainId: string, xmlContent: string, modelName?: string): string` - Import DMN 1.3 XML model
- `importOpenapiSpec(domainId: string, content: string, apiName?: string): string` - Import OpenAPI 3.1.1 specification
- `exportToSql(workspaceJson: string, dialect: string): string` - Export to SQL (supported dialects: "postgres"/"postgresql", "mysql", "sqlite", "generic", "databricks")
- `exportToAvro(workspaceJson: string): string` - Export to AVRO schema
- `exportToJsonSchema(workspaceJson: string): string` - Export to JSON Schema
- `exportToProtobuf(workspaceJson: string): string` - Export to Protobuf
- `exportToCads(workspaceJson: string): string` - Export to CADS YAML
- `exportBpmnModel(xmlContent: string): string` - Export BPMN model to XML
- `exportDmnModel(xmlContent: string): string` - Export DMN model to XML
- `exportOpenapiSpec(content: string, sourceFormat: string, targetFormat?: string): string` - Export OpenAPI spec with optional format conversion
- `convertToOdcs(input: string, format?: string): string` - Universal converter: convert any format to ODCS v3.1.0
- `convertOpenapiToOdcs(openapiContent: string, componentName: string, tableName?: string): string` - Convert OpenAPI schema component to ODCS table
- `analyzeOpenapiConversion(openapiContent: string, componentName: string): string` - Analyze OpenAPI component conversion feasibility
- `migrateDataflowToDomain(dataflowYaml: string, domainName?: string): string` - Migrate DataFlow YAML to Domain schema format

**Domain Operations**:
- `createDomain(name: string): string` - Create a new business domain
- `addSystemToDomain(workspaceJson: string, domainId: string, systemJson: string): string` - Add a system to a domain
- `addCadsNodeToDomain(workspaceJson: string, domainId: string, nodeJson: string): string` - Add a CADS node to a domain
- `addOdcsNodeToDomain(workspaceJson: string, domainId: string, nodeJson: string): string` - Add an ODCS node to a domain

**Filtering**:
- `filterNodesByOwner(workspaceJson: string, owner: string): string` - Filter tables by owner
- `filterRelationshipsByOwner(workspaceJson: string, owner: string): string` - Filter relationships by owner
- `filterNodesByInfrastructureType(workspaceJson: string, infrastructureType: string): string` - Filter tables by infrastructure type
- `filterRelationshipsByInfrastructureType(workspaceJson: string, infrastructureType: string): string` - Filter relationships by infrastructure type
- `filterByTags(workspaceJson: string, tag: string): string` - Filter nodes and relationships by tag (supports Simple, Pair, and List tag formats)

## Database Support

The SDK includes an optional database layer for high-performance queries on large workspaces (10-100x faster than file-based operations).

### Database Backends

- **DuckDB**: Embedded analytical database, ideal for CLI tools and local development
- **PostgreSQL**: Server-based database for team environments and shared access

### Quick Start

```bash
# Build CLI with database support
cargo build --release --bin data-modelling-cli --features cli-full

# Initialize database for a workspace
./target/release/data-modelling-cli db init --workspace ./my-workspace

# Sync YAML files to database
./target/release/data-modelling-cli db sync --workspace ./my-workspace

# Query the database
./target/release/data-modelling-cli query "SELECT name FROM tables" --workspace ./my-workspace
```

### Configuration

Database settings are stored in `.data-model.toml`:

```toml
[database]
backend = "duckdb"
path = ".data-model.duckdb"

[sync]
auto_sync = true

[git]
hooks_enabled = true
```

### Git Hooks Integration

When initializing a database in a Git repository, the CLI automatically installs:

- **Pre-commit hook**: Exports database changes to YAML before commit
- **Post-checkout hook**: Syncs YAML files to database after checkout

This ensures YAML files and database stay in sync across branches and collaborators.

See [CLI.md](docs/CLI.md) for detailed database command documentation.

## Development

### Pre-commit Hooks

This project uses pre-commit hooks to ensure code quality. Install them with:

```bash
# Install pre-commit (if not already installed)
pip install pre-commit

# Install the git hooks
pre-commit install

# Run hooks manually on all files
pre-commit run --all-files
```

The hooks will automatically run on `git commit` and check:
- Rust formatting (`cargo fmt`)
- Rust linting (`cargo clippy`)
- Security audit (`cargo audit`)
- File formatting (trailing whitespace, end of file, etc.)
- YAML/TOML/JSON syntax

### CI/CD

GitHub Actions workflows automatically run on push and pull requests:
- **Lint**: Format check, clippy, and security audit
- **Test**: Unit and integration tests on Linux, macOS, and Windows
- **Build**: Release build verification
- **Publish**: Automatic publishing to crates.io on main branch (after all checks pass)

## Documentation

- **[Architecture Guide](docs/ARCHITECTURE.md)**: Comprehensive guide to project architecture, design decisions, and use cases
- **[Schema Overview Guide](docs/SCHEMA_OVERVIEW.md)**: Detailed documentation of all supported schemas

The SDK supports:
- **ODCS v3.1.0**: Primary format for data contracts (tables)
- **ODCL v1.2.1**: Legacy data contract format (backward compatibility)
- **ODPS**: Data products linking to ODCS Tables
- **CADS v1.0**: Compute assets (AI/ML models, applications, pipelines)
- **BPMN 2.0**: Business Process Model and Notation (process models stored in native XML)
- **DMN 1.3**: Decision Model and Notation (decision models stored in native XML)
- **OpenAPI 3.1.1**: API specifications (stored in native YAML or JSON)
- **Business Domain Schema**: Organize systems, CADS nodes, and ODCS nodes
- **Universal Converter**: Convert any format to ODCS v3.1.0
- **OpenAPI to ODCS Converter**: Convert OpenAPI schema components to ODCS table definitions

### Schema Reference Directory

The SDK maintains JSON Schema definitions for all supported formats in the `schemas/` directory:

- **ODCS v3.1.0**: `schemas/odcs-json-schema-v3.1.0.json` - Primary format for data contracts
- **ODCL v1.2.1**: `schemas/odcl-json-schema-1.2.1.json` - Legacy data contract format
- **ODPS**: `schemas/odps-json-schema-latest.json` - Data products linking to ODCS tables
- **CADS v1.0**: `schemas/cads.schema.json` - Compute assets (AI/ML models, applications, pipelines)

These schemas serve as authoritative references for validation, documentation, and compliance. See [schemas/README.md](schemas/README.md) for detailed information about each schema.

## Data Pipeline

The SDK includes a complete data pipeline for ingesting JSON data, inferring schemas, and mapping to target formats.

### Pipeline Features

- **JSON Ingestion**: Ingest JSON/JSONL files into a staging database with deduplication
- **S3 Ingestion**: Ingest directly from AWS S3 buckets with streaming downloads (feature: `s3`)
- **Databricks Volumes**: Ingest from Databricks Unity Catalog Volumes (feature: `databricks`)
- **Progress Reporting**: Real-time progress bars with throughput metrics
- **Schema Inference**: Automatically infer types, formats, and nullability from data
- **LLM Refinement**: Optionally enhance schemas using Ollama or local LLM models
- **Schema Mapping**: Map inferred schemas to target schemas with transformation generation
- **Checkpointing**: Resume pipelines from the last successful stage
- **Secure Credentials**: Credential wrapper types that prevent accidental logging (see the sketch below)
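
The redaction pattern behind that last bullet is straightforward to picture: a newtype whose `Debug` output hides the secret, so the raw value only escapes through an explicit call. This is a generic illustration, not the SDK's actual type:

```rust
use std::fmt;

/// Illustrative credential wrapper -- not the SDK's actual type.
pub struct Secret(String);

impl Secret {
    pub fn new(value: impl Into<String>) -> Self {
        Secret(value.into())
    }

    /// Deliberate, greppable access point for the raw value.
    pub fn expose(&self) -> &str {
        &self.0
    }
}

impl fmt::Debug for Secret {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("Secret(***redacted***)")
    }
}

fn main() {
    let key = Secret::new("s3-access-key");
    println!("{key:?}");     // prints: Secret(***redacted***)
    let _raw = key.expose(); // explicit opt-in to the raw value
}
```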

### Quick Start

```bash
# Build with pipeline support
cargo build --release -p odm --features pipeline

# Initialize staging database
odm staging init staging.duckdb

# Run full pipeline
odm pipeline run \
  --database staging.duckdb \
  --source ./json-data \
  --output-dir ./output \
  --verbose

# Check pipeline status
odm pipeline status --database staging.duckdb
```

### Schema Mapping

Map source schemas to target schemas with fuzzy matching and transformation script generation:

```bash
# Map schemas with fuzzy matching
odm map source.json target.json --fuzzy --min-similarity 0.7

# Generate SQL transformation
odm map source.json target.json \
  --transform-format sql \
  --transform-output transform.sql

# Generate Python transformation
odm map source.json target.json \
  --transform-format python \
  --transform-output transform.py
```

See [CLI.md](docs/CLI.md) for detailed pipeline and mapping documentation.

## Status

The SDK provides comprehensive support for multiple data modeling formats:

- ✅ Storage backend abstraction and implementations
- ✅ Database backend abstraction (DuckDB, PostgreSQL)
- ✅ Model loader/saver structure
- ✅ Full import/export implementation for all supported formats
- ✅ Validation module structure
- ✅ Business Domain schema support
- ✅ Universal format converter
- ✅ Enhanced tag support (Simple, Pair, List)
- ✅ Full ODCS/ODCL field preservation
- ✅ Schema reference directory (`schemas/`) with JSON Schema definitions for all supported formats
- ✅ Bidirectional YAML ↔ Database sync with change detection
- ✅ Git hooks for automatic synchronization
- ✅ Decision Records (DDL) with MADR format support
- ✅ Knowledge Base (KB) with domain partitioning
- ✅ Data Pipeline with staging, inference, and mapping
- ✅ Schema Mapping with fuzzy matching and transformation generation
- ✅ LLM-enhanced schema refinement (Ollama and local models)
- ✅ S3 ingestion with AWS SDK for Rust
- ✅ Databricks Unity Catalog Volumes ingestion
- ✅ Real-time progress reporting with indicatif
- ✅ Secure credential handling with automatic redaction