data-preprocess
Historical market data storage and preprocessing CLI. Imports tick and OHLCV bar data from CSV files into a local DuckDB database, with support for multiple exchanges, deduplication, querying, and management.
Features
- Import tick (bid/ask/last) and bar (OHLCV) data from tab-delimited CSV files
- DuckDB embedded storage — single-file, no server required
- Exchange-partitioned data — same symbol on different exchanges stored independently
- Automatic deduplication on import (idempotent re-imports)
- Timezone conversion from source offset to UTC
- Symbol auto-extraction from filenames
- Query and display stored data with filtering, pagination, and sort options
- Delete by exchange, symbol, timeframe, or date range
Quick Start
# Build
# Import tick data (symbol extracted from filename, UTC+2 default)
# Import bar data (timeframe required)
# View statistics
# Query ticks
# Query bars
# Remove data
# Run tests
CLI Reference
data-preprocess [--db <PATH>] <COMMAND>
Global options:
--db <PATH> Path to DuckDB file [default: market_data.duckdb]
Also reads DATA_PREPROCESS_DB env var
Commands:
input Import market data from CSV file(s)
remove Remove data by exchange / symbol / type / date range
stats Show summary statistics
view Query and display stored data
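The database location comes from `--db`, the `DATA_PREPROCESS_DB` environment variable, or the default file name. The sketch below shows one plausible way to resolve it. The precedence (flag first, then environment, then default) is an assumption and not confirmed by this README, and `resolve_db_path` is a hypothetical helper, not part of the crate.

```rust
use std::path::PathBuf;

/// Default database file name, as documented for `--db`.
const DEFAULT_DB: &str = "market_data.duckdb";

/// Hypothetical helper: pick the database path from an optional `--db` value
/// and an optional `DATA_PREPROCESS_DB` value.
/// ASSUMPTION: the flag wins over the environment variable, which wins over the default.
fn resolve_db_path(flag: Option<&str>, env_value: Option<&str>) -> PathBuf {
    let chosen = flag.or(env_value).unwrap_or(DEFAULT_DB);
    PathBuf::from(chosen)
}
```

A caller would typically pass `std::env::var("DATA_PREPROCESS_DB").ok().as_deref()` as the second argument. Taking the environment value as a parameter keeps the function easy to test.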
input tick
data-preprocess input tick [OPTIONS] <FILES>...
-e, --exchange <EX> Exchange name (REQUIRED)
--symbol <SYM> Override symbol (default: from filename)
--tz-offset <TZ> Source timezone offset [default: +02:00]
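To illustrate what `--tz-offset` implies, here is a minimal sketch of parsing an offset such as `+02:00` and shifting a source-local timestamp to UTC. It assumes the offset is always written as a sign followed by `HH:MM`; the formats the CLI actually accepts are not specified here. `parse_tz_offset` and `local_to_utc` are hypothetical names.

```rust
/// Hypothetical helper: parse an offset such as "+02:00" or "-05:30"
/// into seconds east of UTC. Returns None for any other shape.
fn parse_tz_offset(s: &str) -> Option<i32> {
    let (sign, rest) = match s.as_bytes().first()? {
        b'+' => (1, &s[1..]),
        b'-' => (-1, &s[1..]),
        _ => return None,
    };
    let (hh, mm) = rest.split_once(':')?;
    let two_digits = |p: &str| p.len() == 2 && p.bytes().all(|b| b.is_ascii_digit());
    if !two_digits(hh) || !two_digits(mm) {
        return None;
    }
    let hours: i32 = hh.parse().ok()?;
    let minutes: i32 = mm.parse().ok()?;
    if hours > 23 || minutes > 59 {
        return None;
    }
    Some(sign * (hours * 3600 + minutes * 60))
}

/// Convert a source-local timestamp (seconds since the epoch, read as if
/// it were UTC) to real UTC by subtracting the source offset.
fn local_to_utc(local_epoch_secs: i64, offset_secs: i32) -> i64 {
    local_epoch_secs - i64::from(offset_secs)
}
```

With the default `+02:00`, a source time of 02:00 maps to 00:00 UTC on the same day.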
input bar
data-preprocess input bar [OPTIONS] <FILES>...
-e, --exchange <EX> Exchange name (REQUIRED)
-t, --timeframe <TF> Timeframe: 1m, 5m, 15m, 30m, 1h, 4h, 1d, 1w, 1M (REQUIRED)
--symbol <SYM> Override symbol (default: from filename)
--tz-offset <TZ> Source timezone offset [default: +02:00]
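The timeframe list is case-sensitive: `1m` is one minute and `1M` is one month. A sketch of how such values could be modeled and validated is shown below. The `Timeframe` enum and its variant names are illustrative and not taken from the crate.

```rust
use std::str::FromStr;

/// Illustrative model of the timeframes accepted by `-t/--timeframe`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Timeframe {
    M1,
    M5,
    M15,
    M30,
    H1,
    H4,
    D1,
    W1,
    Mo1,
}

impl FromStr for Timeframe {
    type Err = String;

    /// Matching is case-sensitive so that "1m" (minute) and "1M" (month) stay distinct.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "1m" => Ok(Timeframe::M1),
            "5m" => Ok(Timeframe::M5),
            "15m" => Ok(Timeframe::M15),
            "30m" => Ok(Timeframe::M30),
            "1h" => Ok(Timeframe::H1),
            "4h" => Ok(Timeframe::H4),
            "1d" => Ok(Timeframe::D1),
            "1w" => Ok(Timeframe::W1),
            "1M" => Ok(Timeframe::Mo1),
            other => Err(format!("unsupported timeframe: {other}")),
        }
    }
}
```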
stats
data-preprocess stats [--exchange <EX>] [--symbol <SYM>]
view tick / view bar
data-preprocess view tick -e <EX> --symbol <SYM> [--from <DT>] [--to <DT>] [--limit N] [--tail] [--desc]
data-preprocess view bar -e <EX> --symbol <SYM> -t <TF> [--from <DT>] [--to <DT>] [--limit N] [--tail] [--desc]
remove
data-preprocess remove tick -e <EX> --symbol <SYM> [--from <DT>] [--to <DT>]
data-preprocess remove bar -e <EX> --symbol <SYM> -t <TF> [--from <DT>] [--to <DT>]
data-preprocess remove symbol -e <EX> <SYMBOL>
data-preprocess remove exchange <EXCHANGE>
Input CSV Formats
Tick CSV
Tab-delimited, with header. Filename convention: {SYMBOL}_*.csv
<DATE> <TIME> <BID> <ASK> <LAST> <VOLUME> <FLAGS>
2026.02.16 19:00:00.083 67849.69 67861.69 6
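The `{SYMBOL}_*.csv` convention suggests that the symbol is everything before the first underscore in the file name. A minimal sketch of that extraction follows. `symbol_from_filename` is a hypothetical helper, the uppercase normalization follows the conventions listed in this README, and the crate's real rules may differ.

```rust
use std::path::Path;

/// Hypothetical helper: take the part of the file stem before the first
/// underscore and uppercase it. Returns None when there is no underscore
/// or the symbol part is empty.
fn symbol_from_filename(path: &str) -> Option<String> {
    let stem = Path::new(path).file_stem()?.to_str()?;
    let (symbol, _rest) = stem.split_once('_')?;
    if symbol.is_empty() {
        return None;
    }
    Some(symbol.to_uppercase())
}
```

Files that do not follow the convention return `None` here, which is the case where `--symbol` would be supplied explicitly.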
Bar CSV
Tab-delimited, with header. Filename convention: {SYMBOL}_*.csv
<DATE> <TIME> <OPEN> <HIGH> <LOW> <CLOSE> <TICKVOL> <VOL> <SPREAD>
2026.02.21 00:45:00 67932.44 67934.19 67888.89 67910.24 184 0 1200
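As an illustration of the bar layout above, the following sketch parses one tab-delimited data row into a struct. The `BarRow` type, its field names, and the chosen numeric types are assumptions for this example and do not reflect the crate's internal schema.

```rust
/// Illustrative representation of one bar row; date and time are kept
/// as raw strings because timestamp handling is out of scope here.
#[derive(Debug, PartialEq)]
struct BarRow {
    date: String,
    time: String,
    open: f64,
    high: f64,
    low: f64,
    close: f64,
    tick_volume: u64,
    volume: u64,
    spread: u32,
}

/// Parse a single tab-delimited data row. Returns None if the row does not
/// have exactly nine columns or a numeric column fails to parse
/// (which also rejects the header row).
fn parse_bar_row(line: &str) -> Option<BarRow> {
    let cols: Vec<&str> = line.trim_end_matches(['\r', '\n']).split('\t').collect();
    if cols.len() != 9 {
        return None;
    }
    Some(BarRow {
        date: cols[0].to_string(),
        time: cols[1].to_string(),
        open: cols[2].parse().ok()?,
        high: cols[3].parse().ok()?,
        low: cols[4].parse().ok()?,
        close: cols[5].parse().ok()?,
        tick_volume: cols[6].parse().ok()?,
        volume: cols[7].parse().ok()?,
        spread: cols[8].parse().ok()?,
    })
}
```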
Data Conventions
- Exchanges are always stored lowercase (ctrader, binance)
- Symbols are always stored uppercase (BTCUSD, EURUSD)
- Timestamps are stored in UTC; the source timezone is converted on import
- Deduplication uses INSERT OR IGNORE on (exchange, symbol, ts) for ticks and (exchange, symbol, timeframe, ts) for bars
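The effect of keying ticks on (exchange, symbol, ts) with insert-or-ignore semantics can be sketched in memory. This is only a model of the behavior: the real tool relies on DuckDB, and `TickStore`, `Tick`, and the use of an integer timestamp are assumptions made for the example.

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

/// Key mirroring the tick deduplication key: (exchange, symbol, ts).
type TickKey = (String, String, i64);

#[derive(Debug, Clone, PartialEq)]
struct Tick {
    bid: f64,
    ask: f64,
}

/// In-memory stand-in for the tick table (illustrative only).
#[derive(Default)]
struct TickStore {
    rows: BTreeMap<TickKey, Tick>,
}

impl TickStore {
    /// Insert a tick unless a row with the same key already exists.
    /// Exchange is lowercased and symbol uppercased before keying,
    /// following the conventions above. Returns true if the row was inserted.
    fn insert_or_ignore(&mut self, exchange: &str, symbol: &str, ts: i64, tick: Tick) -> bool {
        let key = (exchange.to_lowercase(), symbol.to_uppercase(), ts);
        match self.rows.entry(key) {
            Entry::Vacant(slot) => {
                slot.insert(tick);
                true
            }
            Entry::Occupied(_) => false,
        }
    }
}
```

Importing the same row twice leaves the first copy untouched, which is what makes re-imports idempotent.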
For full schema details, query examples, and client integration guides (Python, Rust), see db-details.md.
Library Usage
The crate also exposes a library for programmatic access:
use ;
let db = open?;
let = db.query_ticks?;
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)