data-preprocess
Historical market data storage and preprocessing CLI. Imports tick and OHLCV bar data from CSV files into Parquet (default) or DuckDB storage, with support for multiple exchanges, deduplication, querying, and management.
Storage Backends
| Backend | Feature Flag | Default | Build Time | Description |
|---|---|---|---|---|
| Parquet + Polars | parquet |
✅ Yes | ~30s | Hive-partitioned Parquet files. No C++ compilation. zstd compressed. |
| DuckDB | duckdb-backend |
No | ~150s | Embedded columnar database. Opt-in for SQL exploration. |
Parquet Directory Layout (Hive-Style Partitioning)
{data_dir}/
├── ticks/
│ └── exchange={exchange}/
│ └── symbol={symbol}/
│ ├── 2026-01-15.parquet
│ └── 2026-01-16.parquet
└── bars/
└── exchange={exchange}/
└── symbol={symbol}/
└── timeframe={timeframe}/
├── 2026-01-15.parquet
└── 2026-01-16.parquet
Each file covers one date for one exchange+symbol (or exchange+symbol+timeframe for bars). Files are sorted by timestamp ascending and compressed with zstd.
Quick Start
# Build (default: parquet backend)
# Import tick data (symbol extracted from filename, UTC+2 default)
# Import bar data (timeframe required)
# View statistics
# Query ticks
# Query bars
# Remove data
# Run tests (parquet only, default)
# Run tests (both backends)
# Use DuckDB backend at runtime
CLI Reference
data-preprocess [OPTIONS] <COMMAND>
Global options:
--backend <parquet|duckdb> Storage backend [default: parquet]
--data-dir <PATH> Root directory for Parquet files [default: market_data]
Also reads DATA_PREPROCESS_DIR env var
--db <PATH> Path to DuckDB file (duckdb backend only) [default: market_data.duckdb]
Also reads DATA_PREPROCESS_DB env var
Commands:
input Import market data from CSV file(s)
remove Remove data by exchange / symbol / type / date range
stats Show summary statistics
view Query and display stored data
input tick
data-preprocess input tick [OPTIONS] <FILES>...
-e, --exchange <EX> Exchange name (REQUIRED)
--symbol <SYM> Override symbol (default: from filename)
--tz-offset <TZ> Source timezone offset [default: +02:00]
input bar
data-preprocess input bar [OPTIONS] <FILES>...
-e, --exchange <EX> Exchange name (REQUIRED)
-t, --timeframe <TF> Timeframe: 1m, 5m, 15m, 30m, 1h, 4h, 1d, 1w, 1M (REQUIRED)
--symbol <SYM> Override symbol (default: from filename)
--tz-offset <TZ> Source timezone offset [default: +02:00]
stats
data-preprocess stats [--exchange <EX>] [--symbol <SYM>]
view tick / view bar
data-preprocess view tick -e <EX> --symbol <SYM> [--from <DT>] [--to <DT>] [--limit N] [--tail] [--desc]
data-preprocess view bar -e <EX> --symbol <SYM> -t <TF> [--from <DT>] [--to <DT>] [--limit N] [--tail] [--desc]
remove
data-preprocess remove tick -e <EX> --symbol <SYM> [--from <DT>] [--to <DT>]
data-preprocess remove bar -e <EX> --symbol <SYM> -t <TF> [--from <DT>] [--to <DT>]
data-preprocess remove symbol -e <EX> <SYMBOL>
data-preprocess remove exchange <EXCHANGE>
Input CSV Formats
Tick CSV
Tab-delimited, with header. Filename convention: {SYMBOL}_*.csv
<DATE> <TIME> <BID> <ASK> <LAST> <VOLUME> <FLAGS>
2026.02.16 19:00:00.083 67849.69 67861.69 6
Bar CSV
Tab-delimited, with header. Filename convention: {SYMBOL}_*.csv
<DATE> <TIME> <OPEN> <HIGH> <LOW> <CLOSE> <TICKVOL> <VOL> <SPREAD>
2026.02.21 00:45:00 67932.44 67934.19 67888.89 67910.24 184 0 1200
Data Conventions
- Exchanges are always stored lowercase (
ctrader,binance) - Symbols are always stored uppercase (
BTCUSD,EURUSD) - Timestamps are stored in UTC — source timezone is converted on import
- Deduplication uses
(exchange, symbol, ts)for ticks and(exchange, symbol, timeframe, ts)for bars- Parquet: read-merge-write per date partition file (bounded to one file per dedup operation)
- DuckDB:
INSERT OR IGNOREwith UNIQUE constraints
Library Usage
Parquet backend (default)
use ;
let store = open?;
// Import ticks
let inserted = store.insert_ticks?;
// Query ticks
let = store.query_ticks?;
// Stats
let stats = store.stats?;
// Delete
let deleted = store.delete_ticks?;
let = store.delete_symbol?;
DuckDB backend (opt-in)
Requires features = ["duckdb-backend"] in your Cargo.toml.
use ;
let db = open?;
let = db.query_ticks?;
Consumer crates (models only, no backend)
For crates that only need the type definitions (Tick, Bar, Timeframe, QueryOpts), disable default features to avoid pulling in Polars:
[]
= { = "../data-preprocess", = false }
Feature Flags
| Feature | Dependencies | Use Case |
|---|---|---|
parquet (default) |
polars |
Parquet read/write with Polars |
duckdb-backend |
duckdb |
DuckDB embedded database |
| (none) | — | Model types + parsers only (fastest build) |
API Parity
Both backends provide the same logical operations with identical return types:
| Operation | ParquetStore |
Database |
|---|---|---|
| Open | open(root_dir) |
open(file_path) |
| Insert ticks | insert_ticks(&[Tick]) -> usize |
insert_ticks(&[Tick]) -> usize |
| Insert bars | insert_bars(&[Bar]) -> usize |
insert_bars(&[Bar]) -> usize |
| Query ticks | query_ticks(&QueryOpts) -> (Vec<Tick>, u64) |
query_ticks(&QueryOpts) -> (Vec<Tick>, u64) |
| Query bars | query_bars(&BarQueryOpts) -> (Vec<Bar>, u64) |
query_bars(&BarQueryOpts) -> (Vec<Bar>, u64) |
| Delete ticks | delete_ticks(ex, sym, from, to) -> usize |
delete_ticks(ex, sym, from, to) -> usize |
| Delete bars | delete_bars(ex, sym, tf, from, to) -> usize |
delete_bars(ex, sym, tf, from, to) -> usize |
| Delete symbol | delete_symbol(ex, sym) -> (usize, usize) |
delete_symbol(ex, sym) -> (usize, usize) |
| Delete exchange | delete_exchange(ex) -> (usize, usize) |
delete_exchange(ex) -> (usize, usize) |
| Stats | stats(ex?, sym?) -> Vec<StatRow> |
stats(ex?, sym?) -> Vec<StatRow> |
| Size | total_size() -> Option<u64> |
file_size() -> Option<u64> |
Used By
| Crate | How |
|---|---|
qs-backtest |
default-features = false — imports Tick, Bar, Timeframe, QueryOpts, BarQueryOpts model types only (no Polars/DuckDB). Uses ticks_to_feed() / bars_to_feed() converters. |
qs-backtest-server |
Full parquet feature — opens ParquetStore at runtime to query ticks/bars for backtest requests. Uses stats() for the list_symbols RPC. |
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)