# atelier-data
Market data infrastructure for the **atelier-rs** trading engine.
This crate provides everything needed to connect to cryptocurrency exchanges,
normalise their heterogeneous WebSocket feeds into a common data model,
synchronise events onto a uniform time grid, and persist the result to
Apache Parquet files.
## Core Data Types
**Off-chain activity** (market microstructure):
| Type | Description |
|---|---|
| `Orderbook` | Full-depth limit order book snapshot (bid/ask levels) |
| `OrderbookDelta` | Incremental order book maintained via `NormalizedDelta` updates |
| `Trade` | Public trade execution (price, size, side, timestamp) |
| `Liquidation` | Forced liquidation event |
| `FundingRate` | Perpetual futures funding rate observation |
| `OpenInterest` | Aggregate open interest snapshot |
**Composed types:**
| Type | Description |
|---|---|
| `MarketSnapshot` | Time-aligned bundle of all market data for one grid period |
| `MarketAggregate` | 15-scalar feature vector derived from a `MarketSnapshot` |
## Exchange Sources
| Exchange | Type | Transport | Orderbooks | Trades | Liquidations | Funding Rates | Open Interest |
|---|---|---|---|---|---|---|---|
| Bybit | CEX | WSS | YES / YES | YES / YES | YES / YES | YES / YES | YES / YES |
| Coinbase | CEX | WSS | YES / YES | YES / YES | — | — | — |
| Kraken | CEX | WSS | YES / YES | YES / YES | — | — | — |
*Format: Implemented / Tested. Dashes indicate the exchange does not expose
the data type on its spot/linear WebSocket API.*
## Workers
Two worker types handle end-to-end data collection:
**DataWorker** — raw event ingestion without synchronisation. Connects to a
live exchange WebSocket feed, decodes events, and delivers them through a
pluggable `OutputSink` pipeline. Configuration is driven by a TOML manifest
(`DataWorkerManifest`). Handles reconnection, backoff, health monitoring,
and gap detection automatically.
**MarketWorker** — synchronised market snapshots. Extends `DataWorker`'s
ingestion with a `MarketSynchronizer` that bins heterogeneous events onto
a uniform nanosecond grid, producing `MarketSnapshot` objects at each tick.
Multiple `ClockMode` strategies are supported: `OrderbookDriven`,
`TradeDriven`, `LiquidationDriven`, and `ExternalClock`. Snapshots are
delivered through the same `OutputSink` pipeline and can be flushed to
Parquet automatically.
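The grid alignment can be pictured as flooring each event timestamp to its grid period. A minimal sketch of that binning idea (not the crate's actual `MarketSynchronizer` implementation):

```rust
// Illustrative only: shows the uniform nanosecond-grid binning described
// above, not atelier-data's actual MarketSynchronizer code.

/// Floor a nanosecond timestamp onto a uniform grid of `step_ns`.
fn grid_bin(ts_ns: u64, step_ns: u64) -> u64 {
    ts_ns - ts_ns % step_ns
}

fn main() {
    let step = 1_000_000_000; // 1-second grid
    // Two events 300 ms apart land in the same snapshot period...
    assert_eq!(grid_bin(1_700_000_000_150_000_000, step), 1_700_000_000_000_000_000);
    assert_eq!(grid_bin(1_700_000_000_450_000_000, step), 1_700_000_000_000_000_000);
    // ...while an event in the next second starts a new period.
    assert_eq!(grid_bin(1_700_000_001_050_000_000, step), 1_700_000_001_000_000_000);
}
```

Events sharing a bin are bundled into the same `MarketSnapshot`; which event stream advances the clock depends on the chosen `ClockMode`.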
## Output Sinks
The `OutputSink` trait defines where worker output goes. Multiple sinks
can run simultaneously via `OutputSinkSet` (fan-out):
| Sink | Status | Description |
|---|---|---|
| `ChannelSink` | Working | Wraps `TopicRegistry` broadcast channels for pub/sub |
| `TerminalSink` | Working | Debug/tracing terminal output |
| `ParquetSink` | Working | Buffers `MarketSnapshot`s, decomposes and flushes to per-datatype Parquet files |
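The fan-out pattern behind `OutputSinkSet` can be sketched as follows. The trait and type names mirror the ones above for readability, but this is an illustrative sketch, not atelier-data's actual trait definition:

```rust
// Illustrative fan-out sketch; not atelier-data's actual OutputSink API.

trait OutputSink {
    fn emit(&mut self, event: &str);
}

struct TerminalSink;
impl OutputSink for TerminalSink {
    fn emit(&mut self, event: &str) {
        println!("[terminal] {event}");
    }
}

struct BufferSink {
    buffer: Vec<String>, // stands in for ParquetSink's snapshot buffer
}
impl OutputSink for BufferSink {
    fn emit(&mut self, event: &str) {
        self.buffer.push(event.to_string());
    }
}

/// Fans each event out to every registered sink.
struct OutputSinkSet {
    sinks: Vec<Box<dyn OutputSink>>,
}
impl OutputSinkSet {
    fn emit(&mut self, event: &str) {
        for sink in &mut self.sinks {
            sink.emit(event);
        }
    }
}

fn main() {
    let mut set = OutputSinkSet {
        sinks: vec![
            Box::new(TerminalSink),
            Box::new(BufferSink { buffer: Vec::new() }),
        ],
    };
    set.emit("MarketSnapshot { .. }");
    set.emit("Trade { .. }");
}
```

Trait objects (`Box<dyn OutputSink>`) let heterogeneous sinks share one delivery loop, which is the essence of the fan-out design.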
## Parquet Persistence
Requires `--features parquet`. All five data types support read and write:
| Data Type | Writer | Reader |
|---|---|---|
| Orderbooks | `write_ob_parquet` | `read_ob_parquet` |
| Trades | `write_trades_parquet_timestamped` | `read_trades_parquet` |
| Liquidations | `write_liquidations_parquet_timestamped` | `read_liquidations_parquet` |
| Funding Rates | `write_funding_parquet_timestamped` | `read_funding_parquet` |
| Open Interest | `write_oi_parquet_timestamped` | `read_oi_parquet` |
### Filename Convention
All timestamped writers produce files following this pattern:
```
{SYMBOL}_{DATATYPE}_{MODE}_{TIMESTAMP}.parquet
```
Where `MODE` is `"sync"` for grid-aligned data or `"raw"` for unprocessed
captures. Symbols containing `/` (e.g. Kraken's `BTC/USDT`) are sanitised
to `-` in the filename (`BTC-USDT`) while the Parquet data retains the
original symbol string. Examples:
```
BTCUSDT_ob_sync_20260226_153000.123.parquet
ETHUSDT_trades_raw_20260226_160000.456.parquet
BTC-USDT_ob_sync_20260226_153000.123.parquet
```
Files are organised into subdirectories per data type: `orderbooks/`,
`trades/`, `liquidations/`, `fundings/`, `open_interests/`.
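The naming rule above can be sketched as a small helper. The function itself is an assumption for illustration, not part of the crate's API:

```rust
// Illustrative helper matching the documented filename convention;
// this function is a sketch, not atelier-data's actual code.

/// Build `{SYMBOL}_{DATATYPE}_{MODE}_{TIMESTAMP}.parquet`, sanitising `/`
/// (e.g. Kraken's `BTC/USDT`) to `-` so the name is filesystem-safe.
fn parquet_filename(symbol: &str, datatype: &str, mode: &str, timestamp: &str) -> String {
    let safe_symbol = symbol.replace('/', "-");
    format!("{safe_symbol}_{datatype}_{mode}_{timestamp}.parquet")
}

fn main() {
    assert_eq!(
        parquet_filename("BTC/USDT", "ob", "sync", "20260226_153000.123"),
        "BTC-USDT_ob_sync_20260226_153000.123.parquet"
    );
    println!(
        "{}",
        parquet_filename("ETHUSDT", "trades", "raw", "20260226_160000.456")
    );
}
```

Note that only the filename is sanitised; as stated above, the symbol column inside the Parquet data keeps the original `BTC/USDT` form.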
## Feature Flags
| Flag | Description |
|---|---|
| `parquet` | Enables Apache Parquet I/O (adds `arrow` + `parquet` deps) |
| `torch` | Enables `tch`-based tensor conversion in the `datasets` module |
## Examples
| Example | Description | Command |
|---|---|---|
| `run_data_worker` | Raw event ingestion via DataWorker | `cargo run -p atelier_data --example run_data_worker -- --config <path>` |
| `run_market_worker` | Synchronised snapshots to Parquet via MarketWorker | `cargo run -p atelier_data --example run_market_worker --features parquet -- --config <path>` |
| `read_market_worker` | Read Parquet files and print per-symbol stats | `cargo run -p atelier_data --example read_market_worker --features parquet -- --dir <path>` |
| `bybit_markets` | Bybit market snapshot collection (standalone) | `cargo run -p atelier_data --example bybit_markets --features parquet -- --config <path>` |
| `coinbase_markets` | Coinbase market snapshot collection | `cargo run -p atelier_data --example coinbase_markets --features parquet -- --config <path>` |
| `kraken_markets` | Kraken market snapshot collection | `cargo run -p atelier_data --example kraken_markets --features parquet -- --config <path>` |
| `market_load` | Load and verify most recent Parquet files | `cargo run -p atelier_data --example market_load --features parquet -- --config <path>` |
| `market_fetch` | Multi-exchange raw stream collector (Bybit/Coinbase/Kraken) | `cargo run -p atelier_data --example market_fetch --features parquet` |
| `multi_sync_workers` | Multi-worker manifest parser (stub) | `cargo run -p atelier_data --example multi_sync_workers -- --config <path>` |
---
**`atelier-data`** is a member of the [atelier-rs](https://github.com/iteralabs/atelier-rs) workspace:
- [atelier-engine](https://crates.io/crates/atelier-engine)
- [atelier-quant](https://crates.io/crates/atelier-quant)
- [atelier-retro](https://crates.io/crates/atelier-retro)
- [atelier-rs](https://crates.io/crates/atelier-rs)
Development resources:
- [examples](https://github.com/IteraLabs/atelier-rs/tree/main/atelier-data/examples)
- [tests](https://github.com/IteraLabs/atelier-rs/tree/main/atelier-data/tests)
- [benches](https://github.com/IteraLabs/atelier-rs/tree/main/benches)
- [datasets](https://github.com/IteraLabs/atelier-rs/tree/main/datasets)