# Data Pipeline Tutorial

This tutorial walks you through using the data pipeline to ingest JSON data, infer a schema, and map it to a target schema.

## Overview

The data pipeline consists of five stages:

1. **Ingest** - Load JSON/JSONL files into a staging database
2. **Infer** - Automatically infer schema from staged data
3. **Refine** - (Optional) Enhance schema using LLM
4. **Map** - (Optional) Map inferred schema to a target schema
5. **Export** - Export results (schemas, mappings, transformed data)

## Prerequisites

Build the CLI with pipeline support:

```bash
cargo build --release -p odm --features pipeline
```

For LLM refinement, also include:

```bash
cargo build --release -p odm --features pipeline,llm-online
```

## Quick Start

### 1. Prepare Your Data

Create a directory with JSON files:

```bash
mkdir -p data/raw
echo '{"id": 1, "name": "Alice", "email": "alice@example.com", "created_at": "2025-01-10T10:30:00Z"}' > data/raw/user1.json
echo '{"id": 2, "name": "Bob", "email": "bob@example.com", "created_at": "2025-01-10T11:00:00Z"}' > data/raw/user2.json
```

Or use JSONL format (one JSON object per line):

```bash
cat > data/raw/users.jsonl << 'EOF'
{"id": 1, "name": "Alice", "email": "alice@example.com", "created_at": "2025-01-10T10:30:00Z"}
{"id": 2, "name": "Bob", "email": "bob@example.com", "created_at": "2025-01-10T11:00:00Z"}
{"id": 3, "name": "Charlie", "email": "charlie@example.com", "created_at": "2025-01-10T11:30:00Z"}
EOF
```
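
If you want a few more rows to experiment with, a small shell loop can append synthetic records to the JSONL file (the values below are made up purely for illustration):

```bash
# Append a few synthetic users (illustrative data only)
for i in 4 5 6; do
  echo "{\"id\": $i, \"name\": \"User $i\", \"email\": \"user$i@example.com\", \"created_at\": \"2025-01-10T12:0$i:00Z\"}" >> data/raw/users.jsonl
done
```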

### 2. Initialize Staging Database

```bash
odm staging init staging.duckdb
```

### 3. Run the Pipeline

Basic pipeline run:

```bash
odm pipeline run \
  --database staging.duckdb \
  --source ./data/raw \
  --output-dir ./output \
  --verbose
```

This will:
- Ingest all JSON files from `./data/raw`
- Infer schema from the staged data
- Export the inferred schema to `./output/`

### 4. Check Results

```bash
# View inferred schema
cat output/inferred_schema.json

# Query staged data
odm staging query "SELECT * FROM staged_json LIMIT 5" --database staging.duckdb
```
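
If you have `jq` installed, it is handy for a quick look at the inferred schema; the staging query below reuses the `staged_json` table shown above:

```bash
# List the top-level properties that inference discovered (requires jq)
jq '.properties | keys' output/inferred_schema.json

# Count how many records were staged
odm staging query "SELECT COUNT(*) FROM staged_json" --database staging.duckdb
```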

## Step-by-Step Guide

### Stage 1: Data Ingestion

The ingest stage loads JSON files into a DuckDB staging database.

```bash
# Ingest with specific file pattern
odm staging ingest \
  --database staging.duckdb \
  --source ./data/raw \
  --pattern "**/*.jsonl"

# Ingest with partition key (for organizing data)
odm staging ingest \
  --database staging.duckdb \
  --source ./data/raw \
  --partition users-v1

# Ingest with deduplication
odm staging ingest \
  --database staging.duckdb \
  --source ./data/raw \
  --dedup content  # Skip files with duplicate content hashes
```

#### Deduplication Strategies

| Strategy | Description |
|----------|-------------|
| `none` | No deduplication (default) |
| `path` | Skip files with same path |
| `content` | Skip files with same content hash |
| `both` | Skip if path OR content matches |
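
For incremental loads, re-running ingestion over the same directory with `--dedup content` should only pick up files that were not staged before, for example:

```bash
# A new file arrives after the first ingestion run
echo '{"id": 4, "name": "Dana", "email": "dana@example.com", "created_at": "2025-01-11T09:00:00Z"}' > data/raw/user4.json

# Re-run ingestion; files already staged are skipped by the content hash check
odm staging ingest \
  --database staging.duckdb \
  --source ./data/raw \
  --dedup content
```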

### Stage 2: Schema Inference

The infer stage analyzes staged data to determine field types, formats, and nullability.

```bash
odm inference infer \
  --database staging.duckdb \
  --output schema.json \
  --format json-schema
```

#### Inference Options

```bash
# Limit sample size for faster inference
odm inference infer \
  --database staging.duckdb \
  --sample-size 10000 \
  --output schema.json

# Set minimum field frequency (fields appearing in <10% of records are excluded)
odm inference infer \
  --database staging.duckdb \
  --min-frequency 0.1 \
  --output schema.json

# Disable format detection (faster, but no email/date/uuid detection)
odm inference infer \
  --database staging.duckdb \
  --no-formats \
  --output schema.json
```

#### Example Output

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "id": {
      "type": "integer"
    },
    "name": {
      "type": "string"
    },
    "email": {
      "type": "string",
      "format": "email"
    },
    "created_at": {
      "type": "string",
      "format": "date-time"
    }
  },
  "required": ["id", "name", "email", "created_at"]
}
```
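
To sanity-check the inferred schema against your raw records, any JSON Schema validator will do; this sketch uses the third-party `check-jsonschema` tool (not part of odm, installable via `pip install check-jsonschema`):

```bash
# Validate the original records against the inferred schema
check-jsonschema --schemafile schema.json data/raw/user1.json data/raw/user2.json
```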

### Stage 3: LLM Refinement (Optional)

Enhance the inferred schema using a language model to add descriptions and improve type specificity.

#### Using Ollama

Start Ollama:

```bash
ollama serve
ollama pull llama3.2
```
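
Before running refinement, you can confirm Ollama is reachable and the model is available; Ollama's `/api/tags` endpoint lists locally pulled models:

```bash
# Should return a JSON list that includes llama3.2
curl http://localhost:11434/api/tags
```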

Run with LLM refinement:

```bash
odm pipeline run \
  --database staging.duckdb \
  --source ./data/raw \
  --output-dir ./output \
  --llm-mode online \
  --ollama-url http://localhost:11434 \
  --model llama3.2
```

#### Providing Context

Add domain documentation for better refinement:

```bash
odm pipeline run \
  --database staging.duckdb \
  --source ./data/raw \
  --output-dir ./output \
  --llm-mode online \
  --model llama3.2 \
  --doc-path ./docs/data-dictionary.md
```
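
The documentation file is just free-form text about your domain. Even a short sketch like the following (contents are illustrative) gives the model useful context:

```bash
mkdir -p docs
cat > docs/data-dictionary.md << 'EOF'
# User data dictionary (illustrative)
- id: internal numeric user identifier
- name: display name entered at signup
- email: primary contact address, unique per user
- created_at: UTC timestamp of account creation
EOF
```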

#### LLM Enhancements

The LLM refiner can:
- Add meaningful field descriptions
- Improve type specificity (e.g., `string` → `string` with `format: email`)
- Suggest constraints (min/max values, patterns)
- Identify semantic relationships between fields
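
For example, a refined property might look something like this (illustrative only; the actual output depends on the model, prompt, and any documentation context you provide):

```json
{
  "email": {
    "type": "string",
    "format": "email",
    "description": "Primary contact address for the user"
  }
}
```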

### Stage 4: Schema Mapping (Optional)

Map the inferred schema to a target schema for data transformation.

#### Create Target Schema

```json
{
  "type": "object",
  "properties": {
    "user_id": {"type": "integer", "description": "Unique user identifier"},
    "full_name": {"type": "string", "description": "User's full name"},
    "email_address": {"type": "string", "format": "email"},
    "registration_date": {"type": "string", "format": "date-time"}
  },
  "required": ["user_id", "full_name", "email_address"]
}
```

Save as `target-schema.json`.

#### Run with Mapping

```bash
odm pipeline run \
  --database staging.duckdb \
  --source ./data/raw \
  --output-dir ./output \
  --target-schema target-schema.json \
  --stages ingest,infer,map,export
```

#### Standalone Mapping

```bash
# Map schemas with fuzzy matching
odm map inferred-schema.json target-schema.json \
  --fuzzy \
  --min-similarity 0.7 \
  --output mapping-result.json

# Generate SQL transformation
odm map inferred-schema.json target-schema.json \
  --transform-format sql \
  --transform-output transform.sql
```

#### Generated SQL Example

```sql
-- Transformation from source to target schema
-- Generated by ODM Schema Mapper

INSERT INTO target_table (user_id, full_name, email_address, registration_date)
SELECT
  id AS user_id,                    -- Direct mapping (confidence: 0.85)
  name AS full_name,                -- Fuzzy match (confidence: 0.72)
  email AS email_address,           -- Fuzzy match (confidence: 0.78)
  created_at AS registration_date   -- Fuzzy match (confidence: 0.65)
FROM source_table;
```
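
One way to apply the generated script is with the DuckDB CLI, assuming it is installed and that the tables referenced in `transform.sql` (here `source_table` and `target_table`) exist in the database:

```bash
# Review the SQL first, then run it against the staging database
duckdb staging.duckdb < transform.sql
```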

### Stage 5: Export

The export stage writes results to the output directory:

```
output/
├── inferred_schema.json       # Inferred schema
├── refined_schema.json        # LLM-refined schema (if enabled)
├── mapping_result.json        # Schema mapping (if target provided)
├── transform.sql              # Generated transformation script
└── pipeline_report.json       # Execution summary
```
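
A quick way to eyeball the results after a run (the exact report structure depends on the SDK version, so this only pretty-prints it):

```bash
ls output/
jq . output/pipeline_report.json   # requires jq
```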

## Checkpointing and Resume

The pipeline automatically saves checkpoints after each stage.

### Resume After Failure

If the pipeline fails mid-execution:

```bash
# Check status
odm pipeline status --database staging.duckdb

# Resume from last checkpoint
odm pipeline run --database staging.duckdb --resume
```

### Checkpoint Contents

Checkpoints track:
- Completed stages
- Current stage
- Stage timing and results
- Configuration hash (detects changes)

## Advanced Usage

### Running Specific Stages

```bash
# Only run ingest and infer
odm pipeline run \
  --database staging.duckdb \
  --source ./data/raw \
  --stages ingest,infer \
  --output-dir ./output

# Skip ingest (data already staged)
odm pipeline run \
  --database staging.duckdb \
  --stages infer,refine,export \
  --output-dir ./output
```

### Dry Run

Validate configuration without executing:

```bash
odm pipeline run \
  --database staging.duckdb \
  --source ./data/raw \
  --output-dir ./output \
  --dry-run
```

### Apache Iceberg Backend

For production workloads, use Iceberg for time travel and catalog integration:

```bash
# Initialize with REST catalog (Lakekeeper)
odm staging init staging.duckdb \
  --catalog rest \
  --endpoint http://localhost:8181 \
  --warehouse ./warehouse

# Query with time travel
odm staging query "SELECT * FROM staged" --version 5
odm staging query "SELECT * FROM staged" --timestamp "2025-01-10T00:00:00Z"

# Export to Unity Catalog
odm staging export \
  --database staging.duckdb \
  --target unity \
  --endpoint https://workspace.cloud.databricks.com \
  --catalog main \
  --schema staging \
  --table users
```

## Troubleshooting

### Out of Memory

For large datasets, limit the sample size:

```bash
odm inference infer \
  --database staging.duckdb \
  --sample-size 10000 \
  --output schema.json
```

### LLM Timeout

Increase timeout or use a smaller model:

```bash
# llama3.2:1b is a smaller, faster model variant
odm pipeline run \
  --database staging.duckdb \
  --llm-mode online \
  --model llama3.2:1b \
  --verbose
```

### Schema Mismatch

If mapping produces poor results:

1. Check inferred schema accuracy
2. Increase the fuzzy matching threshold (see the example after this list)
3. Add documentation context for LLM refinement
4. Manually adjust the mapping result
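
For example, raising the similarity threshold used by the standalone mapper filters out weak fuzzy matches:

```bash
# Accept only stronger name matches
odm map inferred-schema.json target-schema.json \
  --fuzzy \
  --min-similarity 0.85 \
  --output mapping-result.json
```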

### Database Locked

Ensure no other process is using the database:

```bash
# Identify any processes that still have the database open
lsof staging.duckdb

# Or use a fresh database
rm staging.duckdb
odm staging init staging.duckdb
```

## Best Practices

1. **Use partitions** to organize data by source or version
2. **Enable deduplication** for incremental ingestion
3. **Start with small samples** to validate schema inference
4. **Provide documentation context** for better LLM refinement
5. **Review mapping results** before generating transformations
6. **Use checkpointing** for long-running pipelines
7. **Monitor with `--verbose`** during development

## Next Steps

- [CLI Reference](./CLI.md) - Complete command documentation
- [Schema Overview](./SCHEMA_OVERVIEW.md) - Supported schema formats
- [Architecture Guide](./ARCHITECTURE.md) - System design details