data-modelling-sdk 2.4.0

Shared SDK for model operations across platforms (API, WASM, Native)
Documentation
# API Contracts: Enhanced Databricks SQL Syntax Support

**Feature**: Enhanced Databricks SQL Syntax Support
**Date**: 2026-01-04
**Phase**: Phase 1 - Design

## Overview

This feature extends the existing SQL import API to support Databricks dialect. No new public API functions are introduced - the feature enhances existing `SQLImporter` functionality.

## Public API (No Changes)

### SQLImporter

**Module**: `data_modelling_sdk::import::sql`

**Existing API** (unchanged):

```rust
pub struct SQLImporter {
    pub dialect: String,
}

impl SQLImporter {
    pub fn new(dialect: &str) -> Self;
    pub fn parse(&self, sql: &str) -> Result<ImportResult>;
    pub fn parse_liquibase(&self, sql: &str) -> Result<ImportResult>;
}
```

**Enhancement**: The `dialect` parameter now accepts `"databricks"` as a valid value.

**Dialect Values**:
- `"postgres"` or `"postgresql"` - PostgreSQL dialect
- `"mysql"` - MySQL dialect
- `"sqlite"` - SQLite dialect
- `"generic"` - Generic SQL dialect (default)
- `"databricks"` - **NEW** Databricks SQL dialect with IDENTIFIER() and variable reference support

## Internal API (New)

### DatabricksDialect

**Module**: `src/import/sql.rs` (internal, not exported)

**Type**: Struct implementing `sqlparser::dialect::Dialect`

```rust
struct DatabricksDialect;

impl Dialect for DatabricksDialect {
    // Override trait methods for Databricks-specific behavior
}
```

**Responsibilities**:
- Recognize `:` as valid in identifiers (for variable references)
- Handle Databricks-specific identifier quoting
- Customize parser behavior for Databricks SQL patterns

## Function Contracts

### SQLImporter::new()

**Signature**: `pub fn new(dialect: &str) -> Self`

**Preconditions**:
- `dialect` is a non-empty string

**Postconditions**:
- Returns `SQLImporter` with specified dialect
- If `dialect == "databricks"`, Databricks-specific parsing will be enabled

**Side Effects**: None

**Errors**: None (invalid dialects default to "generic")

### SQLImporter::parse()

**Signature**: `pub fn parse(&self, sql: &str) -> Result<ImportResult>`

**Preconditions**:
- `sql` is a valid UTF-8 string
- If `self.dialect == "databricks"`, SQL may contain Databricks-specific syntax

**Postconditions**:
- Returns `ImportResult` containing:
  - Successfully parsed tables in `tables` field
  - Tables requiring name resolution in `tables_requiring_name` (if IDENTIFIER() contains only variables)
  - Parse errors in `errors` field (if parsing fails)
- Databricks-specific syntax patterns are handled:
  - `IDENTIFIER()` function calls are recognized and table names extracted
  - Variable references in type definitions are replaced with fallback types
  - Variable references in COMMENT/TBLPROPERTIES are handled gracefully

**Side Effects**: None

**Errors**:
- Returns `Ok(ImportResult)` with errors in `errors` field if parsing fails
- Error messages include context about Databricks-specific syntax when applicable

### Preprocessing Functions (Internal)

**Module**: `src/import/sql.rs` (internal, not exported)

**Functions**:
- `preprocess_databricks_sql(sql: &str) -> (String, PreprocessingState)` - Preprocesses SQL to handle Databricks syntax
- `extract_identifier_table_name(expr: &str) -> Option<String>` - Extracts table name from IDENTIFIER() expression
- `replace_variables_in_types(sql: &str) -> String` - Replaces variable references in type definitions

**Contracts**: Internal implementation details, not part of public API

## Error Contracts

### ImportError::ParseError

**When**: SQL parsing fails due to syntax errors

**Message Format**: `"Parse error: {details}"`

**Enhancements for Databricks**:
- When Databricks syntax is detected but generic dialect is used, error suggests using `dialect="databricks"`
- When IDENTIFIER() parsing fails, error includes the problematic expression
- When variable references cause parse errors, error indicates the location and suggests preprocessing

**Example Messages**:
- `"Parse error: Expected: column name or constraint definition, found: : at Line: 1, Column: 39. This appears to be Databricks SQL syntax. Try using dialect='databricks'."`
- `"Parse error: IDENTIFIER() expression contains only variables. Table name cannot be determined: IDENTIFIER(:catalog)"`

## Test Contracts

### Unit Tests

**Location**: `tests/import_tests.rs` or `src/import/sql.rs` (module tests)

**Test Cases**:
1. `test_databricks_identifier_with_literal()` - IDENTIFIER('table') parses correctly
2. `test_databricks_identifier_with_variable()` - IDENTIFIER(:var) creates placeholder table
3. `test_databricks_identifier_with_concatenation()` - IDENTIFIER(:var || '.schema.table') extracts table name
4. `test_databricks_variable_in_struct()` - STRUCT<field: :type> replaces variable with STRING
5. `test_databricks_variable_in_array()` - ARRAY<:type> replaces variable with STRING
6. `test_databricks_nested_variables()` - ARRAY<STRUCT<field: :type>> handles nested patterns
7. `test_databricks_comment_variable()` - COMMENT ':var' handles variable gracefully
8. `test_databricks_tblproperties_variable()` - TBLPROPERTIES ('key' = ':var') handles variable
9. `test_databricks_backward_compatibility()` - Existing dialects still work
10. `test_databricks_error_messages()` - Error messages include helpful context

### Integration Tests

**Location**: `tests/import_tests.rs`

**Test Cases**:
1. `test_databricks_full_example()` - Complete Databricks SQL DDL from GitHub issue #13
2. `test_databricks_mixed_sql()` - Databricks SQL mixed with standard SQL
3. `test_databricks_performance()` - Performance is within 10% of standard SQL parsing

## Backward Compatibility

**Guarantee**: All existing API contracts remain unchanged. Adding Databricks support does not modify behavior for existing dialects.

**Verification**:
- All existing tests for PostgreSQL, MySQL, SQLite, Generic dialects must pass
- No changes to `ImportResult`, `TableData`, `ColumnData` structures
- No changes to error types or error handling patterns

## Feature Flag

**Cargo Feature**: `databricks-dialect` (already defined in Cargo.toml)

**Behavior**:
- When feature is enabled: Databricks dialect is available
- When feature is disabled: Attempting to use `dialect="databricks"` may default to generic dialect or return an error (implementation detail)

**Note**: Feature flag may not be strictly necessary if Databricks support adds no new dependencies, but it's already defined in Cargo.toml and should be respected.