pulsedb 0.1.0

A high-performance time-series database written in Rust
Documentation
                ██████╗ ██╗   ██╗██╗     ███████╗███████╗██████╗ ██████╗
                ██╔══██╗██║   ██║██║     ██╔════╝██╔════╝██╔══██╗██╔══██╗
                ██████╔╝██║   ██║██║     ███████╗█████╗  ██║  ██║██████╔╝
                ██╔═══╝ ██║   ██║██║     ╚════██║██╔══╝  ██║  ██║██╔══██╗
                ██║     ╚██████╔╝███████╗███████║███████╗██████╔╝██████╔╝
                ╚═╝      ╚═════╝ ╚══════╝╚══════╝╚══════╝╚═════╝ ╚═════╝
                     High-Performance Time-Series Database for Rust

✨ Feature Highlights

Feature Description
Pure Rust Zero C dependencies. Single static binary. Cross-compiles anywhere Rust does.
Columnar Storage Fields stored column-by-column in immutable segments for cache-friendly scans and dramatic compression.
Gorilla Float Compression Facebook's XOR-based float encoding — 8–15× compression on metric data.
Delta-of-Delta Timestamps Regular-interval timestamps compress to ~1 byte per point (10–50×).
Write-Ahead Log Append-only WAL with CRC32 checksums. Crash recovery replays unflushed data on startup.
Inverted Tag Index Tag key-value pairs map to sorted posting lists. O(n+m) intersection for compound predicates.
Time-Based Partitioning Hourly partition directories for fast time-range pruning. Drop old data by deleting directories.
PulseQL SQL-like query language: SELECT mean(cpu) FROM metrics WHERE host='a' GROUP BY time(5m).
InfluxDB Line Protocol Compatible ingestion format — existing Telegraf, Prometheus, and IoT collectors work out of the box.
LZ4 Compression Outer compression layer on encoded columns. ~4GB/s decompression speed.
Concurrent Reads parking_lot::RwLock for minimal contention between writers and readers.
Background Compactor Merges small segments within partitions for fewer files and faster scans.
Retention Policies Auto-drop data older than a configurable duration. Delete a directory, reclaim space.
Regex Tag Matching =~ and !~ operators in PulseQL WHERE clauses for flexible tag filtering.
Schema Enforcement Schema-on-write prevents type conflicts — first write sets the type, mismatches are rejected.

🏗️ Architecture

  TCP :8086                                                    HTTP :8087
  (line protocol)                                              (PulseQL)
       │                                                           │
       ▼                                                           ▼
  ┌─────────┐    ┌─────────────────────────────────────────┐   ┌────────┐
  │  Parser  │───►│              Database Engine            │◄──│ Parser │
  └─────────┘    │                                         │   └────────┘
                 │  WAL ──► MemTable ──► Flush ──► Segment  │
                 │                        │      (columnar) │
                 │                        ▼         ▲       │
                 │                   Compactor ──────┘       │
                 │                                          │
                 │       SeriesIndex ◄── InvertedIndex       │
                 └─────────────────────────────────────────┘

Write Path

  1. Line Protocol Parser — Parse incoming InfluxDB-compatible text
  2. WAL — Append-only log with CRC32 for durability
  3. MemTable — In-memory sorted buffer (BTreeMap per series)
  4. Flush — When memtable exceeds 64MB, freeze and write columnar segments

Read Path

  1. PulseQL Parser — Parse SQL-like query into an AST
  2. Planner — Resolve series via tag index, prune segments by time range
  3. Executor — Decompress and scan only needed columns
  4. Aggregator — Compute mean, sum, min, max, count with GROUP BY time(interval)

📊 Compression

PulseDB uses type-aware encodings tuned for time-series patterns, then wraps each column in LZ4:

Data Type Encoding Algorithm Typical Ratio
Timestamps Delta-of-delta delta[i] - delta[i-1] → zigzag → varint 10–50×
Floats Gorilla XOR XOR consecutive values → leading zeros + meaningful bits 8–15×
Integers Delta + zigzag Delta encode → zigzag → varint 5–20×
Booleans Bit-packing 8 values per byte

Combined: For typical metric workloads (regular timestamps, slowly changing floats), expect 12–25× total compression over raw storage.


📐 Data Model

cpu,host=server01,region=us-east usage_idle=98.2,usage_system=1.3 1672531200000000000
│    │                            │                                 │
│    └─ tags (indexed)            └─ fields (values)                └─ timestamp (ns)
measurement
  • Measurement — Logical grouping (like a table)
  • Tags — Indexed string key-value pairs for filtering and grouping
  • Fields — The actual data: f64, i64, u64, bool
  • Timestamp — Nanosecond Unix epoch

📦 Installation

From Source

git clone https://github.com/matthart1983/pulsedb.git
cd pulsedb
cargo build --release
# Binary is at ./target/release/pulsedb

From crates.io

cargo install pulsedb

🚀 Quick Start

Start the Server

# Start with defaults (data in ./pulsedb_data, TCP :8086, HTTP :8087)
pulsedb server

# Custom configuration
pulsedb server \
  --data-dir /var/lib/pulsedb \
  --tcp-port 8086 \
  --http-port 8087 \
  --wal-fsync batch \
  --memtable-size 67108864

Write Data (Line Protocol)

Send data over TCP using InfluxDB line protocol:

# Single point
echo 'cpu,host=server01,region=us-east usage_idle=98.2,usage_system=1.3' | nc localhost 8086

# Batch write
cat <<EOF | nc localhost 8086
cpu,host=server01 usage_idle=98.2,usage_system=1.3 1672531200000000000
cpu,host=server02 usage_idle=95.1,usage_system=3.7 1672531200000000000
mem,host=server01 available=8589934592i,total=17179869184i 1672531200000000000
sensor,device=D-42 temperature=23.5,healthy=t
EOF

Or via HTTP:

curl -X POST http://localhost:8087/write \
  -H 'Content-Type: text/plain' \
  -d 'cpu,host=server01 usage_idle=98.2 1672531200000000000'

Query Data (PulseQL)

# Interactive REPL
pulsedb query

# HTTP API
curl -X POST http://localhost:8087/query \
  -H 'Content-Type: application/json' \
  -d '{"q": "SELECT mean(usage_idle) FROM cpu WHERE host='\''server01'\'' GROUP BY time(5m)"}'

📝 Query Language — PulseQL

SQL-like, purpose-built for time-series:

-- Aggregation with time bucketing
SELECT mean(usage_idle), max(usage_system)
FROM cpu
WHERE host = 'server01' AND time > now() - 1h
GROUP BY time(5m)

-- Multi-tag filter with regex
SELECT sum(bytes_in)
FROM network
WHERE region = 'us-east' AND host =~ /web-\d+/
GROUP BY time(1m), host

-- Raw data retrieval
SELECT *
FROM temperature
WHERE sensor_id = 'T-42'
  AND time BETWEEN '2024-01-01' AND '2024-01-02'
ORDER BY time DESC
LIMIT 1000

-- Downsampling with fill
SELECT mean(value) AS avg_temp, min(value), max(value)
FROM temperature
GROUP BY time(1h), location
FILL(linear)

Aggregation Functions

count · sum · mean / avg · min · max · first · last · stddev · percentile(field, N)

Operators

= · != · > · < · >= · <= · =~ (regex) · !~ · IN · AND · OR · BETWEEN

Duration Syntax

1ns · 100us · 5ms · 10s · 5m · 1h · 7d · 2w


🔌 Wire Protocol

Ingestion — TCP :8086

InfluxDB-compatible line protocol. Works with Telegraf, Prometheus remote_write adapters, and any tool that speaks line protocol.

<measurement>,<tag1>=<val1> <field1>=<fval1>,<field2>=<fval2> <timestamp_ns>

Field type suffixes: 1.0 (float), 1i (integer), 1u (unsigned), t/f (boolean), "hello" (string).

Query — HTTP :8087

Endpoint Method Description
/query POST Execute PulseQL query, return JSON
/write POST Ingest line protocol over HTTP
/health GET Liveness check
/status GET Engine statistics (series count, throughput, disk usage)

⚙️ Configuration

PulseDB is configured via CLI flags (config file support coming):

Flag Default Description
--data-dir ./pulsedb_data Root directory for all data
--tcp-port 8086 Line protocol ingestion port
--http-port 8087 HTTP query API port
--wal-fsync batch WAL fsync policy: every / batch / none
--memtable-size 64MB Flush threshold for in-memory buffer
--segment-duration 3600 Partition duration in seconds (1 hour)
--retention Auto-drop data older than duration (e.g., 30d)
--log-level info Logging: trace / debug / info / warn / error

Data Directory Layout

pulsedb_data/
├── wal/
│   └── wal.log                    # Write-ahead log
├── partitions/
│   ├── 2024-01-15T14/             # Hourly partition
│   │   ├── cpu_host=server01.seg  # Compressed columnar segment
│   │   └── mem_host=server01.seg
│   └── 2024-01-15T15/
│       └── ...
├── index/
│   ├── series.idx                 # Series key → ID mapping
│   └── tags.idx                   # Tag inverted index
└── meta/
    └── measurements.json          # Schema (field names + types)

🎯 Performance Targets

Metric Target
Write throughput ≥ 1M points/sec (batch)
Single-point write latency < 10μs (WAL + memtable)
Time-range query (1h, 1 series) < 1ms
Time-range query (1h, 1K series) < 50ms
Aggregation (24h, GROUP BY 5m) < 10ms
Compression ratio (float metrics) ≥ 10×
Memory (1M active series) < 2GB
Segment flush (1M points) < 100ms

🏛️ Tech Stack

Layer Crate Purpose
Async Runtime tokio TCP/HTTP server, background tasks
Compression lz4_flex Fast outer compression layer
Checksums crc32fast WAL and segment integrity
Concurrency parking_lot Low-overhead RwLock
CLI clap (derive) Command-line argument parsing
Serialization serde, serde_json Config, WAL payload, HTTP responses
Time chrono Partition key formatting
Hashing xxhash-rust (xxh3) Fast non-crypto hashing
Memory Mapping memmap2 Zero-copy segment reads
Logging tracing, tracing-subscriber Structured logging
Errors thiserror, anyhow Error handling

Module Structure

src/
├── main.rs              # CLI entry point, server bootstrap
├── model/               # DataPoint, FieldValue, Tags, SeriesKey, SeriesId
├── encoding/            # Compression codecs
│   ├── timestamp.rs     # Delta-of-delta + zigzag + varint
│   ├── float.rs         # Gorilla XOR (Facebook paper)
│   ├── integer.rs       # Delta + zigzag + varint
│   └── boolean.rs       # Bit-packing
├── engine/              # Core database engine
│   ├── database.rs      # Write path coordinator
│   ├── wal.rs           # Write-ahead log
│   ├── memtable.rs      # In-memory sorted buffer
│   └── config.rs        # Engine configuration
├── storage/             # On-disk storage
│   ├── segment.rs       # Columnar segment reader/writer
│   ├── partition.rs     # Hourly time partitions
│   ├── cache.rs         # Segment metadata cache
│   └── compactor.rs     # Background segment merging
├── index/               # Series & tag indexing
│   ├── series.rs        # Key → ID mapping
│   └── inverted.rs      # Tag inverted index (posting lists)
├── query/               # Query engine (PulseQL parser, planner, executor)
├── server/              # TCP + HTTP network layer
└── cli/                 # CLI commands (server, query, import, status)

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Make your changes with tests (cargo test)
  4. Ensure formatting (cargo fmt) and lints pass (cargo clippy)
  5. Open a pull request

Building & Testing

cargo build              # Debug build
cargo build --release    # Optimized release build
cargo test               # Run all tests (198 tests)
cargo clippy             # Lint checks
cargo fmt --check        # Format check
cargo bench              # Run benchmarks

🗺️ Roadmap

  • Core data model (DataPoint, FieldValue, Tags, SeriesKey)
  • Compression codecs (delta-of-delta, Gorilla XOR, delta+zigzag, bit-pack)
  • Write-ahead log with CRC32 crash recovery
  • MemTable with freeze/rotate
  • Columnar segment writer/reader with LZ4
  • Time-based partitioning
  • Series index + tag inverted index
  • Segment flush integration (memtable → disk)
  • Line protocol parser
  • PulseQL query engine (lexer, parser, planner, executor)
  • Aggregation functions (count, sum, mean, min, max, GROUP BY)
  • TCP ingestion server
  • HTTP query API
  • CLI (server, query, import, status)
  • Background compactor
  • Retention policies
  • Regex tag matching (=~ and !~ operators)
  • Schema enforcement (type-mismatch rejection)
  • Criterion benchmarks (ingestion, query, compression)
  • Flamegraph profiling + hot-path optimization
  • GitHub Actions CI

📄 License

MIT — see LICENSE for details.