xdl-dataframe 0.1.1

DataFrame module for XDL - pandas/Spark-style data manipulation with support for CSV, TSV, Parquet, Avro
# XDL DataFrame Examples

This directory contains comprehensive examples demonstrating the XDL DataFrame module's capabilities for data manipulation, analysis, and integration with XDL's machine learning and visualization features.

## Examples Overview

### 1. **csv_analysis.xdl** - CSV Data Analysis
Demonstrates basic DataFrame operations with CSV data:
- Reading CSV files
- Data exploration (head, tail, describe)
- Column selection and filtering
- Grouping and aggregation
- Sorting and value counts
- Integration with XDL plotting
- Exporting to TSV format

**Key Features:**
- Employee data analysis
- Statistical summaries
- Data filtering and sorting
- Scatter plot visualization

### 2. **parquet_example.xdl** - Parquet Format Support
Shows how to work with Parquet files for big data:
- Reading Parquet files (columnar format)
- Efficient data processing
- Advanced filtering and grouping
- Feature extraction for ML
- Data normalization

**Key Features:**
- Big data handling
- ML feature preparation
- Multi-column grouping
- Complex filtering operations

### 3. **database_integration.xdl** - Database Integration
Demonstrates seamless integration between SQL databases and DataFrames:
- Connecting to SQLite databases
- Converting Recordsets to DataFrames
- DataFrame analytics on query results
- Joining DataFrames
- Writing results back to the database
- Bar chart visualization

**Key Features:**
- SQL to DataFrame conversion
- Data joining operations
- Calculated columns
- Database round-trip (query → DataFrame → database)

### 4. **ml_and_viz_integration.xdl** - ML and Visualization
Comprehensive example showing how DataFrames integrate with XDL's ML and visualization features:
- Loading scientific datasets
- Feature engineering
- Data normalization
- Training classification models
- Model evaluation (accuracy, confusion matrix)
- 2D scatter plots and correlation heatmaps
- 3D surface plots
- 3D scatter plots with classification results
- Time series visualization

**Key Features:**
- Sensor data analysis
- Random Forest classification
- Feature importance analysis
- Correlation matrix visualization
- 3D surface and scatter plots
- Time series trends

## Running the Examples

### Prerequisites

1. Build XDL with DataFrame support:
```bash
cargo build --features "all"
```

2. For Parquet support:
```bash
cargo build --features "parquet-support"
```

3. For Avro support:
```bash
cargo build --features "avro-support"
```

### Execution

Run any example using the XDL interpreter:
```bash
xdl csv_analysis.xdl
xdl parquet_example.xdl   # needs a build with parquet-support
xdl database_integration.xdl
xdl ml_and_viz_integration.xdl
```

## Data Formats Supported

- **CSV** - Comma-separated values (default)
- **TSV** - Tab-separated values
- **Parquet** - Columnar storage format (requires feature flag)
- **Avro** - Binary serialization format (requires feature flag)
- **Database** - Integration with SQL databases via Recordset

## DataFrame Operations

### Basic Operations
- `Head(n)` - First n rows
- `Tail(n)` - Last n rows
- `Shape()` - Get dimensions
- `ColumnNames()` - List column names
- `Describe()` - Statistical summary

### Selection and Filtering
- `Select(columns)` - Select specific columns
- `Filter(predicate)` - Filter rows based on condition
- `Column(name)` - Get specific column

### Aggregation
- `GroupBy(columns)` - Group data
- `Sum()` - Sum of groups
- `Mean()` - Mean of groups
- `Count()` - Count of groups
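
To make the grouping semantics concrete, here is a minimal plain-Python sketch of what a group-by followed by `Sum()`, `Mean()`, or `Count()` computes. It illustrates the idea only and is not XDL's implementation; the `group_by` helper and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_by(rows, keys, value):
    """Collect the `value` field of each row under its tuple of `keys` fields."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value])
    return groups


# Hypothetical sample data, not taken from the XDL examples.
rows = [
    {"dept": "eng", "salary": 100.0},
    {"dept": "eng", "salary": 120.0},
    {"dept": "ops", "salary": 90.0},
]

groups = group_by(rows, ["dept"], "salary")

# One aggregate value per group, keyed by the group's key tuple.
sums = {k: sum(v) for k, v in groups.items()}
means = {k: sum(v) / len(v) for k, v in groups.items()}
counts = {k: len(v) for k, v in groups.items()}
```

Passing several names in `keys` gives multi-column grouping, since each distinct combination of key values becomes its own group.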

### Sorting
- `SortBy(columns, ascending)` - Sort by columns

### Data Export
- `WriteCSV(path)` - Export to CSV
- `WriteTSV(path)` - Export to TSV
- `ToJSON()` - Convert to JSON
- `ToXDLValue()` - Convert to XDL arrays

## Integration Points

### With XDL ML Functions
```xdl
; Extract features
X = df->Select(['feature1', 'feature2', 'feature3'])->ToXDLValue()

; Normalize
X_norm = XDLML_NORMALIZE(X, METHOD='minmax')

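; Train model (X_train, y_train are assumed to be a training split of X_norm and its labels)
model = XDLML_TRAIN_CLASSIFIER(X_train, y_train, ALGORITHM='random_forest')

; Predict on the held-out rows (X_test)
y_pred = XDLML_PREDICT(model, X_test)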
```
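
For readers unfamiliar with the `'minmax'` method named above, the following plain-Python sketch shows conventional min-max scaling, which rescales each feature column to the range [0, 1]. This is an assumption about what `XDLML_NORMALIZE` does with `METHOD='minmax'`, not a description of its implementation; the `minmax_normalize` helper and the handling of constant columns are choices made here for illustration.

```python
def minmax_normalize(matrix):
    """Rescale each column of a row-major matrix to the range [0, 1]."""
    columns = list(zip(*matrix))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    normalized = []
    for row in matrix:
        normalized.append([
            # A constant column has zero range; map it to 0.0 (a choice made here).
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Hypothetical feature matrix: three samples, two features.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]]
X_norm = minmax_normalize(X)
```

Scaling each column independently keeps features with large numeric ranges from dominating those with small ranges.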

### With XDL Charts
```xdl
; Extract data for plotting
x_data = df->Column('x')->Data()
y_data = df->Column('y')->Data()

; Create plot
PLOT, x_data, y_data, PSYM=4, XTITLE='X', YTITLE='Y'

; Create bar chart (labels and values are arrays prepared beforehand)
XDLCHART_BAR, labels, values, TITLE='My Chart', OUTPUT='chart.html'
```

### With XDL 3D Visualization
```xdl
; 3D scatter plot
x = df->Column('x')->Data()
y = df->Column('y')->Data()
z = df->Column('z')->Data()

XDLVIZ3D_SCATTER, x, y, z, TITLE='3D Data', OUTPUT='scatter3d.html'

; 3D surface plot (x_grid, y_grid, z_surface are gridded arrays prepared beforehand)
XDLVIZ3D_SURFACE, x_grid, y_grid, z_surface, COLORMAP='jet'
```

### With Databases
```xdl
; Query to DataFrame
recordset = objdb->ExecuteSQL('SELECT * FROM table')
df = XDLDATAFRAME_FROM_RECORDSET(recordset)

; DataFrame to CSV for database import
df->WriteCSV, 'for_import.csv'
```

## Performance Tips

1. **Use Parquet for large datasets** - Columnar storage lets a reader load only the columns it needs, which is typically much faster than parsing CSV
2. **Filter early** - Reduce the number of rows before grouping, joining, or sorting
3. **Type inference** - Let the CSV reader infer column types automatically
4. **Column selection** - Select only the columns you need
5. **Batch operations** - Use GroupBy instead of row-by-row processing

## Common Patterns

### ETL Pipeline
```xdl
; Extract
df = XDLDATAFRAME_READ_CSV('raw_data.csv')

; Transform
df_clean = df->Filter(COLUMN='value', CONDITION='>0')
df_grouped = df_clean->GroupBy(['category'])->Mean()

; Load
df_grouped->WriteCSV, 'summary.csv'
```

### Data Quality Check
```xdl
; Statistical summary per column, a first check for missing or implausible values
stats = df->Describe()

; Check value distributions
counts = df->Column('status')->ValueCounts()

; Detect outliers: values above mean + 3 standard deviations
threshold = stats.value.mean + 3 * stats.value.std
outliers = df->Filter(COLUMN='value', CONDITION='>' + STRING(threshold))
```
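
The outlier rule above can be checked by hand with a small plain-Python sketch. It uses the sample standard deviation from the standard library; whether XDL's `Describe()` reports the sample or the population standard deviation is not stated here, so treat that as an assumption. The `upper_outliers` helper and the data are hypothetical.

```python
import statistics


def upper_outliers(values, k=3.0):
    """Return (threshold, outliers) where threshold = mean + k * sample std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    threshold = mean + k * std
    return threshold, [v for v in values if v > threshold]


# Hypothetical data: twenty ordinary readings and one extreme value.
values = [10.0] * 20 + [100.0]
threshold, outliers = upper_outliers(values)
```

Like the XDL snippet, this flags only unusually high values. A single extreme value also inflates the standard deviation, so with very small samples nothing can exceed the threshold.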

### Feature Engineering
```xdl
; Normalize
X_norm = XDLML_NORMALIZE(X_raw)

; Create interaction features
df->AddColumn('feature_product', col1 * col2)

; Binning
df->AddColumn('age_group', BIN(ages, [0, 18, 35, 50, 100]))
```
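
The interaction and binning steps can be illustrated in plain Python. The boundary behaviour of XDL's `BIN` is not documented here, so this sketch assumes half-open bins `[lo, hi)` and returns the bin index; the `bin_index` helper and the sample values are hypothetical.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Return i such that edges[i] <= value < edges[i + 1], or None if out of range."""
    if value < edges[0] or value >= edges[-1]:
        return None
    return bisect_right(edges, value) - 1


# Interaction feature: element-wise product of two columns.
col1 = [1.0, 2.0, 3.0]
col2 = [4.0, 5.0, 6.0]
feature_product = [a * b for a, b in zip(col1, col2)]

# Binning with the same edges as the XDL snippet above.
edges = [0, 18, 35, 50, 100]
ages = [5, 18, 34, 35, 72, 100]
age_group = [bin_index(a, edges) for a in ages]
```

Under these assumptions an age of 18 falls in the second bin (index 1), and 100 lies outside the last bin because its upper edge is exclusive.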

## Additional Resources

- [XDL Documentation](https://xdl-lang.org/docs)
- [DataFrame API Reference](https://xdl-lang.org/docs/dataframe)
- [ML Functions Reference](https://xdl-lang.org/docs/ml)
- [Visualization Guide](https://xdl-lang.org/docs/visualization)

## Support

For questions and issues:
- GitHub Issues: https://github.com/xdl-lang/xdl/issues
- Documentation: https://xdl-lang.org/docs