# Elusion 🦀 DataFrame Library for Everybody!


Elusion is a high-performance, flexible DataFrame library built on top of the DataFusion SQL query engine, for managing and querying data through a DataFrame-like interface. Designed for developers who need a powerful abstraction over data transformations, Elusion simplifies complex operations such as filtering, joining, aggregating, and more with an intuitive, chainable API.

# Motivation


I believe that DataFusion has great potential in the Data Engineering / Data Analytics world, but I also believe that its SQL and DataFrame API design choices do not resemble the popular DataFrame solutions out there. I am here to narrow this gap by creating easily chainable constructs that anybody can use and understand.

## Key Features


### 🚀 High-Performance DataFrame Operations

- Load and process data from CSV files with ease.
- Perform SQL-like transformations such as `SELECT`, `WHERE`, `GROUP BY`, and `JOIN`.

### 📊 Aggregations and Analytics

- Built-in support for functions like `SUM`, `AVG`, `MIN`, `MAX`, `COUNT`, and more.
- Advanced statistical functions like `CORR`, `STDDEV`, `VAR_POP`, and `PERCENTILE`.

### 🔗 Flexible Joins

- Join tables with various join types (`INNER`, `LEFT`, `RIGHT`, `FULL`, etc.).
- Intuitive syntax for specifying join conditions and aliases.

### 🪟 Window Functions

- Add analytical window functions like `RANK`, `DENSE_RANK`, `ROW_NUMBER`, and custom partition-based calculations.
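Window functions can also be written as plain SQL and run through `raw_sql` (shown in the Raw SQL Querying section below). As a small sketch (the table and column names here are assumptions for illustration), the query text is just an ordinary Rust string:

```rust
// Illustrative only: a ROW_NUMBER() window query that could be handed to
// raw_sql; the `sales` table and its columns are assumptions for this sketch.
fn ranked_sales_query() -> &'static str {
    "SELECT
         CustomerKey,
         OrderQuantity,
         ROW_NUMBER() OVER (PARTITION BY CustomerKey ORDER BY OrderQuantity DESC) AS rn
     FROM sales"
}

fn main() {
    let q = ranked_sales_query();
    // The window clause partitions per customer and ranks rows by quantity.
    assert!(q.contains("ROW_NUMBER() OVER"));
    println!("{q}");
}
```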

### 🧹 Clean Query Construction

- Construct readable and reusable SQL-like queries.
- Support for Common Table Expressions (CTEs), subqueries, and set operations (`UNION`, `INTERSECT`, `EXCEPT`).

### 🛠️ Easy-to-Use API

- Chainable and intuitive API for building queries.
- Readable debug output of generated SQL for verification.

- **Data Preview**: Preview your data easily by displaying a subset of rows in the terminal.
- **Composable Queries**: Chain transformations seamlessly to build reusable and testable workflows.

---

## Installation


To add **Elusion** to your Rust project, include the following line in your `Cargo.toml` under `[dependencies]`:

```toml
elusion = "0.2.0"
```
---
## Dependencies you need in `Cargo.toml` to use Elusion:


```toml
[dependencies]
elusion = "0.2.0"
tokio = { version = "1.42.0", features = ["rt-multi-thread"] }

```

---
## Usage examples:


### MAIN function


```rust
#[tokio::main]
async fn main() -> ElusionResult<()> {
    Ok(())
}
```

### MAIN function with a small example


```rust
use elusion::prelude::*; // Import everything needed

#[tokio::main]
async fn main() -> ElusionResult<()> {
    let sales_columns = vec![
        ("OrderDate", "DATE", false),
        ("StockDate", "DATE", false),
        ("OrderNumber", "VARCHAR", false),
    ];

    let sales_data = "path\\to\\sales_data.csv";
    let df_sales = CustomDataFrame::new(sales_data, sales_columns, "sales").await?;

    let result = df_sales
        .select(vec!["OrderDate", "OrderNumber"])
        .limit(10);

    result.display().await?;

    Ok(())
}
```

### Schema establishing

#### **Column Name**, **SQL DataType**, and whether the column is **Null**-able (true, false) need to be provided


```rust
let sales_columns = vec![
    ("OrderDate", "DATE", false),
    ("OrderNumber", "VARCHAR", false),
    ("ProductKey", "INT", false),
    ("CustomerKey", "INT", true),
    ("OrderQuantity", "INT", false)
];

let customers_columns = vec![
    ("CustomerKey", "INT", true),
    ("FirstName", "VARCHAR", true),
    ("LastName", "VARCHAR", true),
    ("EmailAddress", "VARCHAR", true),
    ("AnnualIncome", "INT", true)
];
```
### Currently supported SQL Data Types

```rust
"CHAR" => SQLDataType::Char,
"VARCHAR" => SQLDataType::Varchar,
"TEXT" | "STRING" => SQLDataType::Text,
"TINYINT" => SQLDataType::TinyInt,
"SMALLINT" => SQLDataType::SmallInt,
"INT" | "INTEGER" => SQLDataType::Int,
"BIGINT" => SQLDataType::BigInt,
"FLOAT" => SQLDataType::Float,
"DOUBLE" => SQLDataType::Double,
"DECIMAL" => SQLDataType::Decimal(20, 4), 
"NUMERIC" | "NUMBER" => SQLDataType::Decimal(20,4),
"DATE" => SQLDataType::Date,
"TIME" => SQLDataType::Time,
"TIMESTAMP" => SQLDataType::Timestamp,
"BOOLEAN" => SQLDataType::Boolean,
"BYTEA" => SQLDataType::ByteA
```
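Conceptually, the table above is a match from an (uppercased) type-name string to an enum variant. The sketch below is self-contained and uses a local stand-in enum, not Elusion's actual `SQLDataType`:

```rust
// Stand-in enum for illustration only; the real SQLDataType lives inside Elusion.
#[derive(Debug, PartialEq)]
enum SqlType {
    Varchar,
    Int,
    Decimal(u8, u8),
    Date,
}

// Hedged sketch of the lookup: uppercase the user-supplied name, then match,
// mirroring the aliases in the table above (INT | INTEGER, NUMERIC | NUMBER).
fn parse_sql_type(name: &str) -> Option<SqlType> {
    match name.to_uppercase().as_str() {
        "VARCHAR" => Some(SqlType::Varchar),
        "INT" | "INTEGER" => Some(SqlType::Int),
        "DECIMAL" | "NUMERIC" | "NUMBER" => Some(SqlType::Decimal(20, 4)),
        "DATE" => Some(SqlType::Date),
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_sql_type("integer"), Some(SqlType::Int));
    assert_eq!(parse_sql_type("NUMBER"), Some(SqlType::Decimal(20, 4)));
    assert_eq!(parse_sql_type("UUID"), None);
    println!("type lookup sketch OK");
}
```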

### CSV file paths


```rust
let sales_data = "C:\\Path\\To\\Your\\File.csv";
let customers_data = "C:\\Path\\To\\Your\\File.csv";
```
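The double backslashes above are ordinary Rust string escaping for Windows paths; a raw string literal expresses the same path without escapes:

```rust
fn main() {
    // An escaped literal and a raw string literal denote the same path.
    let escaped = "C:\\Path\\To\\Your\\File.csv";
    let raw = r"C:\Path\To\Your\File.csv";
    assert_eq!(escaped, raw);
    println!("{raw}");
}
```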
### Creating a Custom DataFrame

#### 3 arguments are needed: **Path**, **Schema**, **Table Alias**


```rust
let df_sales = CustomDataFrame::new(sales_data, sales_columns, "sales").await?;
let df_customers = CustomDataFrame::new(customers_data, customers_columns, "customers").await?;
```
### RULE of thumb: ALL column names and DataFrame alias names will be LOWERCASED, regardless of how you write them or how they are written in the CSV file.
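In practice this normalization behaves like Rust's `to_lowercase`: differently cased spellings of a column collapse to a single name, so the following all refer to the same column:

```rust
fn main() {
    // Every spelling normalizes to the same lowercase column name.
    let spellings = ["CustomerKey", "CUSTOMERKEY", "customerkey"];
    for s in spellings {
        assert_eq!(s.to_lowercase(), "customerkey");
    }
    println!("all spellings normalize to `customerkey`");
}
```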


### ALIAS column names in the SELECT() function (AS is case-insensitive)

```rust
let customers_alias = df_customers
    .select(vec!["CustomerKey AS customerkey_alias", "FirstName as first_name", "LastName", "EmailAddress"]);
```
### JOIN

```rust
let join_df = df_sales
    .join(
        df_customers,
        "sales.CustomerKey == customers.CustomerKey",
        "INNER",
    )
    .select(vec![
        "sales.OrderDate",
        "sales.OrderQuantity",
        "customers.FirstName",
        "customers.LastName",
    ])
    .limit(10);
        
    join_df.display_query(); // if you want to see generated sql query
    join_df.display().await?;
```

### SELECT without Aggregation

```rust
let result_sales = sales_order_data.clone()
    .select(vec!["customer_name", "order_date", "billable_value"])
    .filter("billable_value > 100.0")
    .order_by(vec!["order_date"], vec![true])
    .limit(10);

    result_sales.display_query(); // if you want to see generated sql query
    result_sales.display().await?;
```

### SELECT with Aggregation

```rust
let result_df = sales_order_data
    .aggregation(vec![
        AggregationBuilder::new("billable_value").sum().alias("total_sales"),
        AggregationBuilder::new("billable_value").avg().alias("avg_sales")
    ])
    .group_by(vec!["customer_name", "order_date"])
    .having("total_sales > 1000")
    .select(vec!["customer_name", "order_date", "total_sales", "avg_sales"]) // SELECT is used with Final columns after aggregation
    .order_by(vec!["total_sales"], vec![false])
    .limit(10);

    result_df.display_query(); // if you want to see generated sql query
    result_df.display().await?;
```

### FILTER 

```rust
 let result_sales = sales_order_data.clone()
    .select(vec!["customer_name", "order_date", "billable_value"])
    .filter("billable_value > 100.0")
    .order_by(vec!["order_date"], vec![true])
    .limit(10);

    result_sales.display_query();   
    result_sales.display().await?;
```

### Raw SQL Querying

#### FULL SQL SUPPORT is available

```rust
let sales_columns = vec![
    ("OrderDate", "DATE", false),
    ("StockDate", "DATE", false),
    ("OrderNumber", "VARCHAR", false),
    ("ProductKey", "INT", false),
    ("CustomerKey", "INT", true),
    ("TerritoryKey", "INT", false),
    ("OrderLineItem", "INT", false),
    ("OrderQuantity", "INT", false)
];

let customers_columns = vec![
    ("CustomerKey", "INT", true),
    ("Prefix", "VARCHAR", true),
    ("FirstName", "VARCHAR", true),
    ("LastName", "VARCHAR", true),
    ("BirthDate", "DATE", true),
    ("MaritialStatus", "CHAR", true),
    ("Gender", "VARCHAR", true),
    ("EmailAddress", "VARCHAR", true),
    ("AnnualIncome", "INT", true),
    ("TotalChildren", "INT", true),
    ("EducationLevel", "VARCHAR", true),
    ("Occupation", "VARCHAR", true),
    ("HomeOwner","CHAR", true)
];

let products_columns = vec![
    ("ProductKey", "INT", false),
    ("ProductSubcategoryKey", "INT", false),
    ("ProductSKU", "VARCHAR", false),
    ("ProductName", "VARCHAR", false),
    ("ModelName", "VARCHAR", false),
    ("ProductDescription", "VARCHAR", false),
    ("ProductColor", "VARCHAR", false),
    ("ProductSize", "VARCHAR", false),
    ("ProductStyle", "VARCHAR", false),
    ("ProductCost", "DOUBLE", false),
    ("ProductPrice", "DOUBLE", false),
];

let sales_data = "C:\\Borivoj\\RUST\\Elusion\\SalesData2022.csv";
let customers_data = "C:\\Borivoj\\RUST\\Elusion\\Customers.csv";
let products_data = "C:\\Borivoj\\RUST\\Elusion\\Products.csv";

let df_sales = CustomDataFrame::new(sales_data, sales_columns, "sales").await?;
let df_customers = CustomDataFrame::new(customers_data, customers_columns, "customers").await?;
let df_products = CustomDataFrame::new(products_data, products_columns, "products").await?;

// Query on 1 DataFrame
let sql_one = "
        SELECT
            CAST(BirthDate AS DATE) as date_of_birth,
            CONCAT(firstname, ' ',lastname) as full_name
            FROM CUSTOMERS
        LIMIT 10;
    ";

let result_one = df_customers.raw_sql(sql_one, "customers_data", &[]).await?;
result_one.display().await?;

// Query on 2 DataFrames
let sql_two = "
    WITH agg_sales AS (
        SELECT
            CustomerKey,
            SUM(OrderQuantity) AS total_order_quantity,
            COUNT(OrderLineItem) AS total_orders
        FROM sales
        GROUP BY CustomerKey
    ),
    customer_details AS (
        SELECT
            *
        FROM customers
    )
    SELECT
        cd.*,
        asales.total_order_quantity,
        asales.total_orders
    FROM agg_sales asales
    INNER JOIN customer_details cd ON asales.CustomerKey = cd.CustomerKey
    ORDER BY asales.total_order_quantity DESC
    LIMIT 100;
";

let result_two = df_sales.raw_sql(sql_two, "top_customers", &[&df_customers]).await?;
result_two.display().await?;

// Query on 3 DataFrames (same approach is used on any number of DataFrames)
let sql_three = "
    SELECT
        c.CustomerKey,
        c.FirstName,
        c.LastName,
        p.ProductName,
        SUM(s.OrderQuantity) AS TotalQuantity
    FROM
        sales s
    INNER JOIN
        customers c
    ON
        s.CustomerKey = c.CustomerKey
    INNER JOIN
        products p
    ON
        s.ProductKey = p.ProductKey
    GROUP BY
        c.CustomerKey,
        c.FirstName,
        c.LastName,
        p.ProductName
    ORDER BY
        TotalQuantity DESC
    LIMIT 100;
    ";

    let result_three = df_sales.raw_sql(sql_three, "customer_product_sales_summary", &[&df_customers, &df_products]).await?;
    result_three.display().await?;

```
### Writing to Parquet File

#### We have 2 writing modes: Overwrite and Append

```rust
// overwrite existing file
result_df
    .write_to_parquet("overwrite","C:\\Path\\To\\Your\\test.parquet",None)
    .await
    .expect("Failed to write to Parquet");

// append to existing file
result_df
    .write_to_parquet("append","C:\\Path\\To\\Your\\test.parquet",None)
    .await
    .expect("Failed to append to Parquet");
```
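Because the write mode is passed as a string, a typo surfaces only at runtime. A small guard in your own code (this helper is an assumption, not part of Elusion's API) can validate the mode up front:

```rust
// Local helper, not Elusion API: validate the mode string before handing it
// to write_to_parquet, so a typo fails with a descriptive error.
fn validate_mode(mode: &str) -> Result<&str, String> {
    match mode {
        "overwrite" | "append" => Ok(mode),
        other => Err(format!("unsupported write mode: {other}")),
    }
}

fn main() {
    assert!(validate_mode("overwrite").is_ok());
    assert!(validate_mode("append").is_ok());
    assert!(validate_mode("upsert").is_err());
    println!("mode validation sketch OK");
}
```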

---
### Current Clause functions (some still under development)


```rust
load(...)
select(...)
group_by(...)
order_by(...)
limit(...)
filter(...)
having(...)
join(...)
window(...)
aggregation(...)
from_subquery(...)
with_cte(...)
union(...)
intersect(...)
except(...)
display(...)
display_query(...)
display_query_plan(...)
```
### Current Aggregation functions (more coming soon)


```rust
sum(mut self)
avg(mut self)
min(mut self)
max(mut self)
stddev(mut self)
count(mut self)
count_distinct(mut self)
corr(mut self, other_column: &str)
grouping(mut self)
var_pop(mut self)
stddev_pop(mut self)
array_agg(mut self)
approx_percentile(mut self, percentile: f64)
first_value(mut self) 
nth_value(mut self, n: i64)

```

### License

Elusion is distributed under the [MIT License](https://opensource.org/licenses/MIT). 
However, since it builds upon [DataFusion](https://datafusion.apache.org/), which is distributed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0), some parts of this project are subject to the terms of the Apache License 2.0.
For full details, see the [LICENSE.txt file](LICENSE.txt).

### Acknowledgments

This library leverages the power of Rust's type system and libraries like [DataFusion](https://datafusion.apache.org/) and Arrow for efficient query processing. Special thanks to the open-source community for making this project possible.


## 🚧 Disclaimer: Under Development 🚧


This crate is currently **under active development and testing**. It is not considered stable or ready for production use.

We are actively working to improve the features, performance, and reliability of this library. Breaking changes might occur between versions as we continue to refine the API and functionality.

If you want to contribute or experiment with the crate, feel free to do so, but please be aware of the current limitations and evolving nature of the project.

Thank you for your understanding and support!


## Where you can find me:


- [LinkedIn](https://www.linkedin.com/in/borivojgrujicic/)
- [YouTube channel](https://www.youtube.com/@RustyBiz)
- [Udemy Instructor](https://www.udemy.com/user/borivoj-grujicic/)