timeseries-table-datafusion 0.1.2

DataFusion SQL integration for timeseries-table-core tables (TableProvider + segment pruning).
Documentation
# timeseries-table-datafusion

This crate lets you query a `timeseries-table-core` table using DataFusion SQL.
It focuses on time-series workloads and includes **segment-level pruning**:
when a query has a time predicate, we determine which data segments cannot
possibly match (based on each segment's `ts_min`/`ts_max`), and skip them
before execution.

The goal is simple: keep SQL queries fast without changing your data.

## Getting Started

### Installation

```toml
[dependencies]
timeseries-table-datafusion = "0.1"
datafusion = "51"
tokio = { version = "1", features = ["rt-multi-thread"] }
```

### Rust API Example

```rust
use datafusion::prelude::*;
use timeseries_table_core::TimeSeriesTable;
use timeseries_table_datafusion::TsTableProvider;
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Open an existing table
    let table = TimeSeriesTable::open("./my_table")?;

    // 2. Create a TsTableProvider (implements DataFusion's TableProvider)
    let provider = TsTableProvider::try_new(Arc::new(table)).await?;

    // 3. Register it in a DataFusion SessionContext
    let ctx = SessionContext::new();
    ctx.register_table("my_table", Arc::new(provider))?;

    // 4. Run a query with time filters + projection
    //    Time filters enable segment pruning automatically
    let df = ctx.sql("
        SELECT ts, symbol, close
        FROM my_table
        WHERE ts >= '2024-01-01T00:00:00Z'
          AND ts <  '2024-02-01T00:00:00Z'
        ORDER BY ts
        LIMIT 100
    ").await?;

    // 5. Collect and print results
    let batches = df.collect().await?;
    for batch in &batches {
        println!("{:?}", batch);
    }

    // Or use show() for a formatted table
    // df.show().await?;

    Ok(())
}
```

### How Pruning Works

When you include time predicates in your query (e.g., `WHERE ts >= '...' AND ts < '...'`),
the `TsTableProvider` extracts those filters and compares them against each segment's
metadata (`ts_min`, `ts_max`). Segments that fall entirely outside the query range
are skipped—no I/O is performed for them.

This happens automatically; you don't need to do anything special.

### Current Limitations

- **Read-only**: This integration is for querying only. Use `timeseries-table-core` or the CLI for writes.
- **Best-effort filter extraction**: Complex predicates may not be fully recognized (see below). Unrecognized predicates fall back to scanning all segments—correctness is preserved.
- **No custom execution plan**: We use DataFusion's default `ParquetExec` after pruning. Future versions may add custom operators.

---

## Pruning Reference

Below is a detailed reference of which SQL predicates enable segment pruning.

### What works well today

### Basic time comparisons
We recognize direct comparisons between the timestamp column and a literal:

- `ts < '2024-01-01T00:00:00Z'`
- `ts >= '2024-01-01T00:00:00Z'`
- `ts = '2024-01-01T00:00:00Z'`
- `ts != '2024-01-01T00:00:00Z'`

The literal can be an RFC3339 string or a timestamp literal. Date-only strings
like `'2024-01-01'` are treated as midnight UTC.

### BETWEEN / IN / NOT / AND / OR
These are supported as long as they ultimately boil down to time comparisons:

- `ts BETWEEN '...' AND '...'`
- `ts IN ('...', '...')`
- `ts NOT IN ('...')`
- `NOT (ts < '...')`
- Combined with `AND` / `OR`

When a predicate can’t be understood, we **do not prune** (safe fallback).

### Interval arithmetic on `ts`
We support SQL interval arithmetic as long as it’s based on **interval literals**:

- `ts + INTERVAL '1 day' < '...'`
- `ts - INTERVAL '2 hours' <= '...'`
- `ts + INTERVAL '1 day' + INTERVAL '1 hour' < '...'`
- `INTERVAL '1 day' + ts < '...'`

Intervals can include common units (seconds, millis, micros, nanos, minutes,
hours, days, months, years). Mixed intervals are handled by combining months,
days, and sub-day nanos.

### `to_timestamp*` scalar functions
We recognize DataFusion’s timestamp helper functions when they use **literals**:

- `to_timestamp(...)`
- `to_timestamp_seconds(...)`
- `to_timestamp_millis(...)`
- `to_timestamp_micros(...)`
- `to_timestamp_nanos(...)`

Both string literals and numeric epoch literals are supported.

Example:

```
WHERE ts < to_timestamp_millis(1704672000123)
```

### `to_unixtime(ts)`
We also recognize `to_unixtime(ts)` when it is compared to a **numeric** literal:

- `to_unixtime(ts) >= 1704672000`
- `1704672000 < to_unixtime(ts)`

This uses seconds since epoch. String literals are not supported here.

### `to_date(ts)`
We recognize `to_date(ts)` when it is compared to a **date literal**:

- `to_date(ts) = '2024-01-01'`
- `to_date(ts) < '2024-01-01'`
- `to_date(ts) >= '2024-01-01'`

The date literal should be `YYYY-MM-DD`. The comparison is expanded into a
timestamp range for that whole day. If the timestamp column has a timezone,
day boundaries are computed in that timezone.

### `date_trunc(precision, ts)`
We recognize `date_trunc` when it is compared to a **timestamp literal**:

- `date_trunc('hour', ts) = '2024-01-01T10:00:00Z'`
- `date_trunc('minute', ts) > '2024-01-01T10:30:00Z'`
- `date_trunc('day', ts) <= '2024-01-01T00:00:00Z'`

Supported precisions: `second`, `minute`, `hour`, `day`.

Behavior notes:
- Non-aligned literals (e.g. `10:30` for hour) are handled by moving the
  comparison to the next bucket boundary.
- If the timestamp column has an Olson timezone, hour/day boundaries are
  computed in that timezone (DST-aware). Fine granularity (second/minute) uses
  UTC arithmetic, matching DataFusion’s fast path.

### `date_bin(interval, ts[, origin])`
We recognize `date_bin` when all arguments are **literals**:

- `date_bin(interval '15 minutes', ts) = '2024-01-01T10:30:00Z'`
- `date_bin(interval '1 day', ts, '2024-01-01T03:00:00Z') > '2024-01-02T03:00:00Z'`

Supported interval forms:
- Day/time intervals (days, hours, minutes, seconds, millis, micros, nanos)
- Month intervals (months only; no mixed month+day+nanos)

Behavior notes:
- Default origin is the Unix epoch (`1970-01-01T00:00:00Z`) when omitted.
- Binning uses UTC arithmetic (timezone is not applied to bin boundaries).

## What does NOT prune (yet)

These are intentionally treated as “unknown” to avoid incorrect pruning:

- `ts + 1 < ...` (numeric, not interval)
- `ts + other_column < ...`
- `to_timestamp(...) + ts < ...`
- `interval - ts` (non-commutative, ambiguous for our matcher)
- `to_unixtime(ts) < '1704672000'` (string literal)
- `to_date(ts) = 'not-a-date'` (invalid date literal)

Queries still run correctly; they just won’t be pruned.

---

## Related

- `timeseries-table-cli` — Command-line tool for managing and querying tables (no Rust code required)
- `timeseries-table-core` — Core library for table creation, appends, and coverage tracking