pulsedb 0.1.0 - Docs.rs

                ██████╗ ██╗   ██╗██╗     ███████╗███████╗██████╗ ██████╗
                ██╔══██╗██║   ██║██║     ██╔════╝██╔════╝██╔══██╗██╔══██╗
                ██████╔╝██║   ██║██║     ███████╗█████╗  ██║  ██║██████╔╝
                ██╔═══╝ ██║   ██║██║     ╚════██║██╔══╝  ██║  ██║██╔══██╗
                ██║     ╚██████╔╝███████╗███████║███████╗██████╔╝██████╔╝
                ╚═╝      ╚═════╝ ╚══════╝╚══════╝╚══════╝╚═════╝ ╚═════╝
                     High-Performance Time-Series Database for Rust

✨ Feature Highlights

Feature	Description
Pure Rust	Zero C dependencies. Single static binary. Cross-compiles anywhere Rust does.
Columnar Storage	Fields stored column-by-column in immutable segments for cache-friendly scans and dramatic compression.
Gorilla Float Compression	Facebook's XOR-based float encoding — 8–15× compression on metric data.
Delta-of-Delta Timestamps	Regular-interval timestamps compress to ~1 byte per point (10–50×).
Write-Ahead Log	Append-only WAL with CRC32 checksums. Crash recovery replays unflushed data on startup.
Inverted Tag Index	Tag key-value pairs map to sorted posting lists. O(n+m) intersection for compound predicates.
Time-Based Partitioning	Hourly partition directories for fast time-range pruning. Drop old data by deleting directories.
PulseQL	SQL-like query language: `SELECT mean(cpu) FROM metrics WHERE host='a' GROUP BY time(5m)`.
InfluxDB Line Protocol	Compatible ingestion format — existing Telegraf, Prometheus, and IoT collectors work out of the box.
LZ4 Compression	Outer compression layer on encoded columns. ~4GB/s decompression speed.
Concurrent Reads	`parking_lot::RwLock` for minimal contention between writers and readers.
Background Compactor	Merges small segments within partitions for fewer files and faster scans.
Retention Policies	Auto-drop data older than a configurable duration. Delete a directory, reclaim space.
Regex Tag Matching	`=~` and `!~` operators in PulseQL WHERE clauses for flexible tag filtering.
Schema Enforcement	Schema-on-write prevents type conflicts — first write sets the type, mismatches are rejected.

🏗️ Architecture

  TCP :8086                                                    HTTP :8087
  (line protocol)                                              (PulseQL)
       │                                                           │
       ▼                                                           ▼
  ┌─────────┐    ┌─────────────────────────────────────────┐   ┌────────┐
  │  Parser  │───►│              Database Engine            │◄──│ Parser │
  └─────────┘    │                                         │   └────────┘
                 │  WAL ──► MemTable ──► Flush ──► Segment  │
                 │                        │      (columnar) │
                 │                        ▼         ▲       │
                 │                   Compactor ──────┘       │
                 │                                          │
                 │       SeriesIndex ◄── InvertedIndex       │
                 └─────────────────────────────────────────┘

Write Path

Line Protocol Parser — Parse incoming InfluxDB-compatible text
WAL — Append-only log with CRC32 for durability
MemTable — In-memory sorted buffer (BTreeMap per series)
Flush — When memtable exceeds 64MB, freeze and write columnar segments

Read Path

PulseQL Parser — Parse SQL-like query into an AST
Planner — Resolve series via tag index, prune segments by time range
Executor — Decompress and scan only needed columns
Aggregator — Compute mean, sum, min, max, count with GROUP BY time(interval)

📊 Compression

PulseDB uses type-aware encodings tuned for time-series patterns, then wraps each column in LZ4:

Data Type	Encoding	Algorithm	Typical Ratio
Timestamps	Delta-of-delta	`delta[i] - delta[i-1]` → zigzag → varint	10–50×
Floats	Gorilla XOR	XOR consecutive values → leading zeros + meaningful bits	8–15×
Integers	Delta + zigzag	Delta encode → zigzag → varint	5–20×
Booleans	Bit-packing	8 values per byte	8×

Combined: For typical metric workloads (regular timestamps, slowly changing floats), expect 12–25× total compression over raw storage.

📐 Data Model

cpu,host=server01,region=us-east usage_idle=98.2,usage_system=1.3 1672531200000000000
│    │                            │                                 │
│    └─ tags (indexed)            └─ fields (values)                └─ timestamp (ns)
measurement

Measurement — Logical grouping (like a table)
Tags — Indexed string key-value pairs for filtering and grouping
Fields — The actual data: f64, i64, u64, bool
Timestamp — Nanosecond Unix epoch

📦 Installation

From Source

git clone https://github.com/matthart1983/pulsedb.git
cd pulsedb
cargo build --release
# Binary is at ./target/release/pulsedb

From crates.io

cargo install pulsedb

🚀 Quick Start

Start the Server

# Start with defaults (data in ./pulsedb_data, TCP :8086, HTTP :8087)
pulsedb server

# Custom configuration
pulsedb server \
  --data-dir /var/lib/pulsedb \
  --tcp-port 8086 \
  --http-port 8087 \
  --wal-fsync batch \
  --memtable-size 67108864

Write Data (Line Protocol)

Send data over TCP using InfluxDB line protocol:

# Single point
echo 'cpu,host=server01,region=us-east usage_idle=98.2,usage_system=1.3' | nc localhost 8086

# Batch write
cat <<EOF | nc localhost 8086
cpu,host=server01 usage_idle=98.2,usage_system=1.3 1672531200000000000
cpu,host=server02 usage_idle=95.1,usage_system=3.7 1672531200000000000
mem,host=server01 available=8589934592i,total=17179869184i 1672531200000000000
sensor,device=D-42 temperature=23.5,healthy=t
EOF

Or via HTTP:

curl -X POST http://localhost:8087/write \
  -H 'Content-Type: text/plain' \
  -d 'cpu,host=server01 usage_idle=98.2 1672531200000000000'

Query Data (PulseQL)

# Interactive REPL
pulsedb query

# HTTP API
curl -X POST http://localhost:8087/query \
  -H 'Content-Type: application/json' \
  -d '{"q": "SELECT mean(usage_idle) FROM cpu WHERE host='\''server01'\'' GROUP BY time(5m)"}'

📝 Query Language — PulseQL

SQL-like, purpose-built for time-series:

-- Aggregation with time bucketing
SELECT mean(usage_idle), max(usage_system)
FROM cpu
WHERE host = 'server01' AND time > now() - 1h
GROUP BY time(5m)

-- Multi-tag filter with regex
SELECT sum(bytes_in)
FROM network
WHERE region = 'us-east' AND host =~ /web-\d+/
GROUP BY time(1m), host

-- Raw data retrieval
SELECT *
FROM temperature
WHERE sensor_id = 'T-42'
  AND time BETWEEN '2024-01-01' AND '2024-01-02'
ORDER BY time DESC
LIMIT 1000

-- Downsampling with fill
SELECT mean(value) AS avg_temp, min(value), max(value)
FROM temperature
GROUP BY time(1h), location
FILL(linear)

Aggregation Functions

count · sum · mean / avg · min · max · first · last · stddev · percentile(field, N)

Operators

= · != · > · < · >= · <= · =~ (regex) · !~ · IN · AND · OR · BETWEEN

Duration Syntax

1ns · 100us · 5ms · 10s · 5m · 1h · 7d · 2w

🔌 Wire Protocol

Ingestion — TCP :8086

InfluxDB-compatible line protocol. Works with Telegraf, Prometheus remote_write adapters, and any tool that speaks line protocol.

<measurement>,<tag1>=<val1> <field1>=<fval1>,<field2>=<fval2> <timestamp_ns>

Field type suffixes: 1.0 (float), 1i (integer), 1u (unsigned), t/f (boolean), "hello" (string).

Query — HTTP :8087

Endpoint	Method	Description
`/query`	POST	Execute PulseQL query, return JSON
`/write`	POST	Ingest line protocol over HTTP
`/health`	GET	Liveness check
`/status`	GET	Engine statistics (series count, throughput, disk usage)

⚙️ Configuration

PulseDB is configured via CLI flags (config file support coming):

Flag	Default	Description
`--data-dir`	`./pulsedb_data`	Root directory for all data
`--tcp-port`	`8086`	Line protocol ingestion port
`--http-port`	`8087`	HTTP query API port
`--wal-fsync`	`batch`	WAL fsync policy: `every` / `batch` / `none`
`--memtable-size`	`64MB`	Flush threshold for in-memory buffer
`--segment-duration`	`3600`	Partition duration in seconds (1 hour)
`--retention`	∞	Auto-drop data older than duration (e.g., `30d`)
`--log-level`	`info`	Logging: `trace` / `debug` / `info` / `warn` / `error`

Data Directory Layout

pulsedb_data/
├── wal/
│   └── wal.log                    # Write-ahead log
├── partitions/
│   ├── 2024-01-15T14/             # Hourly partition
│   │   ├── cpu_host=server01.seg  # Compressed columnar segment
│   │   └── mem_host=server01.seg
│   └── 2024-01-15T15/
│       └── ...
├── index/
│   ├── series.idx                 # Series key → ID mapping
│   └── tags.idx                   # Tag inverted index
└── meta/
    └── measurements.json          # Schema (field names + types)

🎯 Performance Targets

Metric	Target
Write throughput	≥ 1M points/sec (batch)
Single-point write latency	< 10μs (WAL + memtable)
Time-range query (1h, 1 series)	< 1ms
Time-range query (1h, 1K series)	< 50ms
Aggregation (24h, GROUP BY 5m)	< 10ms
Compression ratio (float metrics)	≥ 10×
Memory (1M active series)	< 2GB
Segment flush (1M points)	< 100ms

🏛️ Tech Stack

Layer	Crate	Purpose
Async Runtime	`tokio`	TCP/HTTP server, background tasks
Compression	`lz4_flex`	Fast outer compression layer
Checksums	`crc32fast`	WAL and segment integrity
Concurrency	`parking_lot`	Low-overhead RwLock
CLI	`clap` (derive)	Command-line argument parsing
Serialization	`serde`, `serde_json`	Config, WAL payload, HTTP responses
Time	`chrono`	Partition key formatting
Hashing	`xxhash-rust` (xxh3)	Fast non-crypto hashing
Memory Mapping	`memmap2`	Zero-copy segment reads
Logging	`tracing`, `tracing-subscriber`	Structured logging
Errors	`thiserror`, `anyhow`	Error handling

Module Structure

src/
├── main.rs              # CLI entry point, server bootstrap
├── model/               # DataPoint, FieldValue, Tags, SeriesKey, SeriesId
├── encoding/            # Compression codecs
│   ├── timestamp.rs     # Delta-of-delta + zigzag + varint
│   ├── float.rs         # Gorilla XOR (Facebook paper)
│   ├── integer.rs       # Delta + zigzag + varint
│   └── boolean.rs       # Bit-packing
├── engine/              # Core database engine
│   ├── database.rs      # Write path coordinator
│   ├── wal.rs           # Write-ahead log
│   ├── memtable.rs      # In-memory sorted buffer
│   └── config.rs        # Engine configuration
├── storage/             # On-disk storage
│   ├── segment.rs       # Columnar segment reader/writer
│   ├── partition.rs     # Hourly time partitions
│   ├── cache.rs         # Segment metadata cache
│   └── compactor.rs     # Background segment merging
├── index/               # Series & tag indexing
│   ├── series.rs        # Key → ID mapping
│   └── inverted.rs      # Tag inverted index (posting lists)
├── query/               # Query engine (PulseQL parser, planner, executor)
├── server/              # TCP + HTTP network layer
└── cli/                 # CLI commands (server, query, import, status)

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/my-feature)
Make your changes with tests (cargo test)
Ensure formatting (cargo fmt) and lints pass (cargo clippy)
Open a pull request

Building & Testing

cargo build              # Debug build
cargo build --release    # Optimized release build
cargo test               # Run all tests (198 tests)
cargo clippy             # Lint checks
cargo fmt --check        # Format check
cargo bench              # Run benchmarks

🗺️ Roadmap

Core data model (DataPoint, FieldValue, Tags, SeriesKey)
Compression codecs (delta-of-delta, Gorilla XOR, delta+zigzag, bit-pack)
Write-ahead log with CRC32 crash recovery
MemTable with freeze/rotate
Columnar segment writer/reader with LZ4
Time-based partitioning
Series index + tag inverted index
Segment flush integration (memtable → disk)
Line protocol parser
PulseQL query engine (lexer, parser, planner, executor)
Aggregation functions (count, sum, mean, min, max, GROUP BY)
TCP ingestion server
HTTP query API
CLI (server, query, import, status)
Background compactor
Retention policies
Regex tag matching (=~ and !~ operators)
Schema enforcement (type-mismatch rejection)
Criterion benchmarks (ingestion, query, compression)
Flamegraph profiling + hot-path optimization
GitHub Actions CI

📄 License

MIT — see LICENSE for details.