data-modelling-sdk 2.4.0

# Data Model: Enhanced Databricks SQL Syntax Support

**Feature**: Enhanced Databricks SQL Syntax Support
**Date**: 2026-01-04
**Phase**: Phase 1 - Design

## Overview

This feature extends the existing SQL import data model to support Databricks-specific syntax patterns. No new entities are introduced - the feature enhances the parsing logic for existing `ImportResult`, `TableData`, and `ColumnData` entities.

## Existing Entities (Enhanced)

### ImportResult

**Purpose**: Represents the outcome of importing SQL, containing successfully parsed tables and any parse errors.

**Fields** (unchanged):
- `tables: Vec<TableData>` - Tables extracted from the import
- `tables_requiring_name: Vec<TableRequiringName>` - Tables that require name input (for SQL imports with unnamed tables)
- `errors: Vec<ImportError>` - Parse errors/warnings
- `ai_suggestions: Option<Vec<serde_json::Value>>` - AI suggestions, if any (`None` when unavailable)

**Enhancements**:
- When Databricks SQL contains `IDENTIFIER(:variable)` with only variables (no literals), tables may be added to `tables_requiring_name` with placeholder names
- Error messages in `errors` will include context about Databricks-specific syntax when parsing fails

### TableData

**Purpose**: Represents a parsed table structure with name, columns, and metadata extracted from CREATE TABLE statements.

**Fields** (unchanged):
- `table_index: usize` - Index of table in import
- `name: Option<String>` - Table name (may be None if extracted from IDENTIFIER() with only variables)
- `columns: Vec<ColumnData>` - Column definitions

**Enhancements**:
- Table names extracted from `IDENTIFIER()` expressions will be normalized (e.g., `IDENTIFIER(:catalog || '.schema.table')` → `schema.table`)
- If `IDENTIFIER()` contains only variables, the table name is set to a placeholder such as `__databricks_table_0__` and the table is added to `tables_requiring_name`
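
The normalization above can be sketched as follows. This is a minimal illustration, not the SDK's implementation: the helper name `table_name_from_identifier_arg` is hypothetical, and it assumes the argument is a `||` concatenation of single-quoted literals and `:variable` references, with no `||` inside a literal.

```rust
/// Hypothetical helper: derive a table name from the argument of an
/// `IDENTIFIER(...)` expression by keeping only the string literals.
///
/// Returns `None` when the expression contains no literals (variables only),
/// in which case the caller would fall back to a placeholder name.
fn table_name_from_identifier_arg(arg: &str) -> Option<String> {
    let mut name = String::new();
    // Walk the operands of the `||` concatenation operator.
    for operand in arg.split("||") {
        let operand = operand.trim();
        // Keep single-quoted literals; `:variable` references are skipped.
        if operand.len() >= 2 && operand.starts_with('\'') && operand.ends_with('\'') {
            name.push_str(&operand[1..operand.len() - 1]);
        }
    }
    // A literal such as '.schema.table' leaves a leading dot once the
    // variable in front of it has been dropped.
    let name = name.trim_matches('.');
    if name.is_empty() {
        None
    } else {
        Some(name.to_string())
    }
}
```

With this sketch, `:catalog || '.schema.table'` yields `schema.table`, while `:table_name` alone yields `None`, which is the case where a placeholder name would be assigned.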

### ColumnData

**Purpose**: Represents a column definition extracted from SQL.

**Fields** (unchanged):
- `name: String` - Column name
- `data_type: String` - Column data type
- `nullable: bool` - Whether column allows NULL
- `primary_key: bool` - Whether column is part of primary key
- `description: Option<String>` - Column description/documentation
- `quality: Option<Vec<HashMap<String, serde_json::Value>>>` - Quality rules
- `ref_path: Option<String>` - JSON Schema $ref reference path

**Enhancements**:
- When variable references in type definitions are replaced (e.g., `STRUCT<field: :variable_type>` → `STRUCT<field: STRING>`), the `data_type` field will contain the fallback type
- Original variable references are not preserved (per the spec's assumption that variable substitution is not required)
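
A minimal sketch of the type-variable replacement, under stated assumptions: the function name `replace_type_variables` is hypothetical, and a `:` is treated as the start of a variable only when it does not directly follow a field name and is directly followed by an identifier character. The actual preprocessing may use different rules.

```rust
/// Hypothetical helper: replace `:variable` references inside a type
/// definition with a fallback type name such as `STRING`.
fn replace_type_variables(data_type: &str, fallback: &str) -> String {
    let chars: Vec<char> = data_type.chars().collect();
    let mut out = String::with_capacity(data_type.len());
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        // Assumption: a `:` right after a name is the field/type separator
        // (as in `field: INT`), not the start of a variable.
        let after_name = i > 0
            && (chars[i - 1].is_alphanumeric() || chars[i - 1] == '_' || chars[i - 1] == '`');
        let starts_ident = chars
            .get(i + 1)
            .map_or(false, |n| n.is_alphabetic() || *n == '_');
        if c == ':' && !after_name && starts_ident {
            // Skip the variable name and emit the fallback type instead.
            i += 1;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            out.push_str(fallback);
        } else {
            out.push(c);
            i += 1;
        }
    }
    out
}
```

Calling `replace_type_variables("STRUCT<field: :variable_type>", "STRING")` gives the `STRUCT<field: STRING>` form shown above; types without variable references pass through unchanged.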

## New Internal Entities

### DatabricksDialect

**Purpose**: Custom SQL dialect implementation for Databricks-specific syntax.

**Type**: Struct implementing `sqlparser::dialect::Dialect` trait

**Responsibilities**:
- Recognize `:` as valid in identifiers (for variable references)
- Handle backtick-quoted identifiers
- Customize identifier parsing behavior for Databricks SQL

**Location**: `src/import/sql.rs` (internal to module)
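
The identifier rules could look roughly like the sketch below. To keep it compilable without the `sqlparser` crate, it uses a local stand-in trait, `IdentifierRules` (a hypothetical name), whose three methods mirror the identifier-related methods of `sqlparser::dialect::Dialect`; in the real module the `impl` would target that trait instead. Accepting `:` only as an identifier start is an assumption, not something the spec fixes.

```rust
/// Local stand-in for the identifier-related methods of
/// `sqlparser::dialect::Dialect`, so this sketch compiles on its own.
trait IdentifierRules {
    fn is_delimited_identifier_start(&self, ch: char) -> bool;
    fn is_identifier_start(&self, ch: char) -> bool;
    fn is_identifier_part(&self, ch: char) -> bool;
}

#[derive(Debug)]
struct DatabricksDialect;

impl IdentifierRules for DatabricksDialect {
    /// Backticks open a quoted identifier, e.g. `my table`.
    fn is_delimited_identifier_start(&self, ch: char) -> bool {
        ch == '`'
    }

    /// Assumption: `:` is accepted as an identifier start so that a
    /// variable reference like `:name` is read as a single identifier.
    fn is_identifier_start(&self, ch: char) -> bool {
        ch.is_alphabetic() || ch == '_' || ch == ':'
    }

    /// Later characters follow the usual rules: letters, digits, underscore.
    fn is_identifier_part(&self, ch: char) -> bool {
        ch.is_alphanumeric() || ch == '_'
    }
}
```

Restricting `:` to the start position keeps `field:TYPE` separators from being absorbed into a field name; whether that matches the intended tokenizer behavior would need to be confirmed against the parser.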

### PreprocessingState

**Purpose**: Tracks preprocessing transformations applied to SQL for later reference.

**Type**: Internal struct (not exposed in public API)

**Fields**:
- `identifier_replacements: Vec<(String, String)>` - Maps placeholder table names to original IDENTIFIER() expressions
- `variable_replacements: Vec<(String, String)>` - Tracks variable references replaced in type definitions

**Location**: `src/import/sql.rs` (internal to module)
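
A sketch of how this state might be shaped and used. The two field names and types come from the list above; the helper methods (`record_identifier`, `record_variable`, `original_identifier`) are hypothetical, and the tuple order for `variable_replacements` (variable, replacement) is an assumption.

```rust
/// Sketch of the internal preprocessing state; field names follow the spec.
#[derive(Debug, Default)]
struct PreprocessingState {
    /// (placeholder table name, original IDENTIFIER() expression)
    identifier_replacements: Vec<(String, String)>,
    /// (original variable reference, replacement text) - order is assumed
    variable_replacements: Vec<(String, String)>,
}

impl PreprocessingState {
    /// Hypothetical helper: record an IDENTIFIER() expression and return
    /// the numbered placeholder name that stands in for it.
    fn record_identifier(&mut self, original_expr: &str) -> String {
        let placeholder = format!(
            "__databricks_table_{}__",
            self.identifier_replacements.len()
        );
        self.identifier_replacements
            .push((placeholder.clone(), original_expr.to_string()));
        placeholder
    }

    /// Hypothetical helper: record a variable replaced in a type definition.
    fn record_variable(&mut self, variable: &str, replacement: &str) {
        self.variable_replacements
            .push((variable.to_string(), replacement.to_string()));
    }

    /// Look up the original expression behind a placeholder name.
    fn original_identifier(&self, placeholder: &str) -> Option<&str> {
        self.identifier_replacements
            .iter()
            .find(|(p, _)| p.as_str() == placeholder)
            .map(|(_, expr)| expr.as_str())
    }
}
```

Numbering placeholders by the current length of `identifier_replacements` keeps them unique within one import, which is what allows a placeholder to be mapped back to its original expression later.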

## Relationships

```
SQLImporter
  ├── Uses DatabricksDialect (when dialect="databricks")
  ├── Produces ImportResult
  │     ├── Contains Vec<TableData>
  │     │     └── Contains Vec<ColumnData>
  │     └── Contains Vec<ImportError>
  └── Uses PreprocessingState (internal, during parsing)
```

## Validation Rules

### Table Name Validation

- Table names extracted from `IDENTIFIER()` expressions must pass existing `validate_table_name()` checks
- Placeholder table names (for variables-only IDENTIFIER()) are exempt from validation but marked in `tables_requiring_name`

### Column Type Validation

- Data types with variable references replaced (e.g., `STRING` fallback) must pass existing `validate_data_type()` checks
- Nested STRUCT/ARRAY types with replaced variables must be valid SQL type syntax

### Error Handling

- Parse errors must include context about which Databricks pattern caused the failure
- Errors should suggest using the "databricks" dialect if Databricks syntax is detected while a generic dialect is in use
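
The dialect suggestion could be implemented along these lines. Both function names are hypothetical, and the detection is a deliberately crude heuristic (a substring check for `IDENTIFIER(` or `TBLPROPERTIES`) chosen for illustration; `TBLPROPERTIES` in particular also appears in other Spark-family dialects.

```rust
/// Hypothetical heuristic: does the SQL contain patterns that this feature
/// treats as Databricks-specific?
fn looks_like_databricks(sql: &str) -> bool {
    let upper = sql.to_uppercase();
    upper.contains("IDENTIFIER(") || upper.contains("TBLPROPERTIES")
}

/// Hypothetical helper: append a dialect hint to a parse error message when
/// Databricks syntax is detected but another dialect was requested.
fn with_dialect_hint(message: &str, sql: &str, dialect: &str) -> String {
    if !dialect.eq_ignore_ascii_case("databricks") && looks_like_databricks(sql) {
        format!(
            "{} (hint: Databricks syntax detected; try dialect \"databricks\")",
            message
        )
    } else {
        message.to_string()
    }
}
```

The hint is only appended, never substituted, so the original parser message stays available for debugging.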

## State Transitions

### SQL Import Flow

```
1. User provides SQL + dialect="databricks"
   ↓
2. SQLImporter::parse() called
   ↓
3. Preprocessing applied (if Databricks dialect):
   - Replace IDENTIFIER() expressions
   - Replace variable references in types
   - Replace variables in COMMENT/TBLPROPERTIES
   ↓
4. sqlparser::Parser::parse_sql() called with DatabricksDialect
   ↓
5. AST parsed into Statement::CreateTable nodes
   ↓
6. parse_create_table() extracts TableData
   ↓
7. ImportResult returned with tables and errors
```

## Notes

- No database schema changes required (parsing-only feature)
- No new public API types introduced
- All enhancements are internal to SQL import logic
- Backward compatibility maintained - existing imports unchanged