pulsedb 0.1.0

A high-performance time-series database written in Rust

<p align="center">

```
                ██████╗ ██╗   ██╗██╗     ███████╗███████╗██████╗ ██████╗
                ██╔══██╗██║   ██║██║     ██╔════╝██╔════╝██╔══██╗██╔══██╗
                ██████╔╝██║   ██║██║     ███████╗█████╗  ██║  ██║██████╔╝
                ██╔═══╝ ██║   ██║██║     ╚════██║██╔══╝  ██║  ██║██╔══██╗
                ██║     ╚██████╔╝███████╗███████║███████╗██████╔╝██████╔╝
                ╚═╝      ╚═════╝ ╚══════╝╚══════╝╚══════╝╚═════╝ ╚═════╝
                     High-Performance Time-Series Database for Rust
```

</p>

<p align="center">
  <a href="https://crates.io/crates/pulsedb"><img src="https://img.shields.io/crates/v/pulsedb.svg" alt="crates.io"></a>
  <a href="https://github.com/matthart1983/pulsedb/blob/main/LICENSE"><img src="https://img.shields.io/crates/l/pulsedb.svg" alt="License: MIT"></a>
</p>

<p align="center">
  <b>A purpose-built time-series database written in pure Rust — columnar storage, type-aware compression, and a SQL-like query language. All from a single binary.</b>
</p>

---

## ✨ Feature Highlights

| Feature | Description |
|---|---|
| **Pure Rust** | Zero C dependencies. Single static binary. Cross-compiles anywhere Rust does. |
| **Columnar Storage** | Fields stored column-by-column in immutable segments for cache-friendly scans and dramatic compression. |
| **Gorilla Float Compression** | Facebook's XOR-based float encoding — 8–15× compression on metric data. |
| **Delta-of-Delta Timestamps** | Regular-interval timestamps compress to ~1 byte per point (10–50×). |
| **Write-Ahead Log** | Append-only WAL with CRC32 checksums. Crash recovery replays unflushed data on startup. |
| **Inverted Tag Index** | Tag key-value pairs map to sorted posting lists. O(n+m) intersection for compound predicates. |
| **Time-Based Partitioning** | Hourly partition directories for fast time-range pruning. Drop old data by deleting directories. |
| **PulseQL** | SQL-like query language: `SELECT mean(cpu) FROM metrics WHERE host='a' GROUP BY time(5m)`. |
| **InfluxDB Line Protocol** | Compatible ingestion format — existing Telegraf, Prometheus, and IoT collectors work out of the box. |
| **LZ4 Compression** | Outer compression layer on encoded columns. ~4GB/s decompression speed. |
| **Concurrent Reads** | `parking_lot::RwLock` for minimal contention between writers and readers. |
| **Background Compactor** | Merges small segments within partitions for fewer files and faster scans. |
| **Retention Policies** | Auto-drop data older than a configurable duration. Delete a directory, reclaim space. |
| **Regex Tag Matching** | `=~` and `!~` operators in PulseQL WHERE clauses for flexible tag filtering. |
| **Schema Enforcement** | Schema-on-write prevents type conflicts — first write sets the type, mismatches are rejected. |

---

## 🏗️ Architecture

```
  TCP :8086                                                    HTTP :8087
  (line protocol)                                              (PulseQL)
       │                                                           │
       ▼                                                           ▼
  ┌──────────┐   ┌─────────────────────────────────────────┐   ┌────────┐
  │  Parser  │──►│             Database Engine             │◄──│ Parser │
  └──────────┘   │                                         │   └────────┘
                 │  WAL ──► MemTable ──► Flush ──► Segment │
                 │                         │     (columnar)│
                 │                         ▼        ▲      │
                 │                    Compactor ────┘      │
                 │                                         │
                 │      SeriesIndex ◄── InvertedIndex      │
                 └─────────────────────────────────────────┘
```

### Write Path

1. **Line Protocol Parser** — Parse incoming InfluxDB-compatible text
2. **WAL** — Append-only log with CRC32 for durability
3. **MemTable** — In-memory sorted buffer (BTreeMap per series)
4. **Flush** — When memtable exceeds 64MB, freeze and write columnar segments
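
The WAL append in step 2 can be sketched as length-prefixed, checksummed records that recovery replays until it hits a torn write. This is a minimal illustration, not the engine's actual format: the record layout and function names are assumptions, and a toy checksum stands in for the real CRC32 so the snippet needs no dependencies.

```rust
/// Toy checksum standing in for CRC32 (the real engine uses `crc32fast`).
fn checksum(data: &[u8]) -> u32 {
    data.iter().fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u32))
}

/// Append one record as [len: u32 LE][checksum: u32 LE][payload].
/// `log` is an in-memory buffer here; the real WAL is an append-only file.
fn wal_append(log: &mut Vec<u8>, payload: &[u8]) {
    log.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    log.extend_from_slice(&checksum(payload).to_le_bytes());
    log.extend_from_slice(payload);
}

/// Replay records on startup, stopping at the first truncated or corrupt
/// entry — a torn write at the tail is simply discarded.
fn wal_replay(log: &[u8]) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos + 8 <= log.len() {
        let len = u32::from_le_bytes([log[pos], log[pos + 1], log[pos + 2], log[pos + 3]]) as usize;
        let sum = u32::from_le_bytes([log[pos + 4], log[pos + 5], log[pos + 6], log[pos + 7]]);
        if pos + 8 + len > log.len() {
            break; // truncated record
        }
        let payload = &log[pos + 8..pos + 8 + len];
        if checksum(payload) != sum {
            break; // corrupt record
        }
        out.push(payload.to_vec());
        pos += 8 + len;
    }
    out
}
```

Stopping at the first bad record is the key durability property: everything before the tear was fsynced and survives, everything after was never acknowledged.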

### Read Path

1. **PulseQL Parser** — Parse SQL-like query into an AST
2. **Planner** — Resolve series via tag index, prune segments by time range
3. **Executor** — Decompress and scan only needed columns
4. **Aggregator** — Compute `mean`, `sum`, `min`, `max`, `count` with `GROUP BY time(interval)`
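
The planner step resolves compound tag predicates by intersecting posting lists from the inverted index. A minimal sketch of the O(n+m) merge intersection (the function name and `u64` series IDs are illustrative, not the engine's actual API):

```rust
/// O(n+m) intersection of two sorted posting lists, as used when a WHERE
/// clause ANDs two tag predicates, e.g. host='a' AND region='us-east'.
fn intersect(a: &[u64], b: &[u64]) -> Vec<u64> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,      // advance the smaller side
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {           // ID in both lists: keep it
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```

With k ANDed predicates, the planner folds the lists pairwise; `OR` would use the analogous sorted-merge union.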

---

## 📊 Compression

PulseDB uses type-aware encodings tuned for time-series patterns, then wraps each column in LZ4:

| Data Type | Encoding | Algorithm | Typical Ratio |
|---|---|---|---|
| Timestamps | Delta-of-delta | `delta[i] - delta[i-1]` → zigzag → varint | 10–50× |
| Floats | Gorilla XOR | XOR consecutive values → leading zeros + meaningful bits | 8–15× |
| Integers | Delta + zigzag | Delta encode → zigzag → varint | 5–20× |
| Booleans | Bit-packing | 8 values per byte | 8× |

**Combined**: For typical metric workloads (regular timestamps, slowly changing floats), expect **12–25× total compression** over raw storage.
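
The timestamp pipeline from the table can be sketched end to end. This is an illustrative implementation of delta-of-delta → zigzag → varint, not the crate's actual codec: with a regular cadence, every second-order delta is zero and varint-encodes to a single `0x00` byte per point.

```rust
/// Zigzag-map a signed value so small magnitudes (of either sign) become
/// small unsigned values: 0→0, -1→1, 1→2, -2→3, ...
fn zigzag(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

/// LEB128-style varint: 7 payload bits per byte, high bit = continuation.
fn varint(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Encode timestamps as: full first value, then first delta, then
/// delta-of-delta for everything after — each zigzag+varint encoded.
fn encode_timestamps(ts: &[i64]) -> Vec<u8> {
    let mut out = Vec::new();
    let (mut prev, mut prev_delta) = (0i64, 0i64);
    for (i, &t) in ts.iter().enumerate() {
        let v = if i == 0 { t } else { (t - prev) - prev_delta };
        if i > 0 {
            prev_delta = t - prev;
        }
        prev = t;
        varint(zigzag(v), &mut out);
    }
    out
}
```

Ten points at a fixed 1000-unit cadence encode to 13 bytes here versus 80 bytes of raw `i64`s; only the first timestamp and the first delta cost more than one byte.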

---

## 📐 Data Model

```
cpu,host=server01,region=us-east usage_idle=98.2,usage_system=1.3 1672531200000000000
│   │                            │                                │
│   └─ tags (indexed)            └─ fields (values)               └─ timestamp (ns)
```

- **Measurement** — Logical grouping (like a table)
- **Tags** — Indexed string key-value pairs for filtering and grouping
- **Fields** — The actual data: `f64`, `i64`, `u64`, `bool`
- **Timestamp** — Nanosecond Unix epoch
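
In Rust terms, the model above might look like the following. This is an illustrative sketch — the type and field names are assumptions, not the crate's exact API — but it shows why tags live in a sorted map: iterating them in key order yields a canonical series key.

```rust
use std::collections::BTreeMap;

/// One field value; the first write fixes the type for that field name.
#[derive(Debug, Clone, PartialEq)]
enum FieldValue {
    Float(f64),
    Integer(i64),
    Unsigned(u64),
    Boolean(bool),
}

/// One parsed line-protocol point.
#[derive(Debug, Clone)]
struct DataPoint {
    measurement: String,
    tags: BTreeMap<String, String>, // sorted → canonical series key
    fields: BTreeMap<String, FieldValue>,
    timestamp: i64, // nanoseconds since the Unix epoch
}

impl DataPoint {
    /// Canonical series key: measurement plus sorted tag pairs.
    /// Each unique key identifies one time-ordered column set on disk.
    fn series_key(&self) -> String {
        let mut key = self.measurement.clone();
        for (k, v) in &self.tags {
            key.push(',');
            key.push_str(k);
            key.push('=');
            key.push_str(v);
        }
        key
    }
}
```

Because the tag map is sorted, `host=a,region=b` and `region=b,host=a` in the input produce the same series key, so the same series.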

---

## 📦 Installation

### From Source

```bash
git clone https://github.com/matthart1983/pulsedb.git
cd pulsedb
cargo build --release
# Binary is at ./target/release/pulsedb
```

### From crates.io

```bash
cargo install pulsedb
```

---

## 🚀 Quick Start

### Start the Server

```bash
# Start with defaults (data in ./pulsedb_data, TCP :8086, HTTP :8087)
pulsedb server

# Custom configuration
pulsedb server \
  --data-dir /var/lib/pulsedb \
  --tcp-port 8086 \
  --http-port 8087 \
  --wal-fsync batch \
  --memtable-size 67108864
```

### Write Data (Line Protocol)

Send data over TCP using InfluxDB line protocol:

```bash
# Single point
echo 'cpu,host=server01,region=us-east usage_idle=98.2,usage_system=1.3' | nc localhost 8086

# Batch write
cat <<EOF | nc localhost 8086
cpu,host=server01 usage_idle=98.2,usage_system=1.3 1672531200000000000
cpu,host=server02 usage_idle=95.1,usage_system=3.7 1672531200000000000
mem,host=server01 available=8589934592i,total=17179869184i 1672531200000000000
sensor,device=D-42 temperature=23.5,healthy=t
EOF
```

Or via HTTP:

```bash
curl -X POST http://localhost:8087/write \
  -H 'Content-Type: text/plain' \
  -d 'cpu,host=server01 usage_idle=98.2 1672531200000000000'
```

### Query Data (PulseQL)

```bash
# Interactive REPL
pulsedb query

# HTTP API
curl -X POST http://localhost:8087/query \
  -H 'Content-Type: application/json' \
  -d '{"q": "SELECT mean(usage_idle) FROM cpu WHERE host='\''server01'\'' GROUP BY time(5m)"}'
```

---

## 📝 Query Language — PulseQL

SQL-like, purpose-built for time-series:

```sql
-- Aggregation with time bucketing
SELECT mean(usage_idle), max(usage_system)
FROM cpu
WHERE host = 'server01' AND time > now() - 1h
GROUP BY time(5m)

-- Multi-tag filter with regex
SELECT sum(bytes_in)
FROM network
WHERE region = 'us-east' AND host =~ /web-\d+/
GROUP BY time(1m), host

-- Raw data retrieval
SELECT *
FROM temperature
WHERE sensor_id = 'T-42'
  AND time BETWEEN '2024-01-01' AND '2024-01-02'
ORDER BY time DESC
LIMIT 1000

-- Downsampling with fill
SELECT mean(value) AS avg_temp, min(value), max(value)
FROM temperature
GROUP BY time(1h), location
FILL(linear)
```

### Aggregation Functions

`count` · `sum` · `mean` / `avg` · `min` · `max` · `first` · `last` · `stddev` · `percentile(field, N)`

### Operators

`=` · `!=` · `>` · `<` · `>=` · `<=` · `=~` (regex) · `!~` · `IN` · `AND` · `OR` · `BETWEEN`

### Duration Syntax

`1ns` · `100us` · `5ms` · `10s` · `5m` · `1h` · `7d` · `2w`
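
Under the hood, `GROUP BY time(interval)` assigns every timestamp to the start of its bucket and aggregates per bucket. A minimal sketch of the bucketing and a per-bucket `mean` (function names are illustrative):

```rust
use std::collections::BTreeMap;

/// Floor a nanosecond timestamp to the start of its bucket.
/// `rem_euclid` keeps pre-epoch (negative) timestamps in the right bucket.
fn bucket_start(ts_ns: i64, interval_ns: i64) -> i64 {
    ts_ns - ts_ns.rem_euclid(interval_ns)
}

/// mean(value) GROUP BY time(interval): accumulate (sum, count) per bucket,
/// then divide. BTreeMap keeps buckets in time order for the result set.
fn mean_by_bucket(points: &[(i64, f64)], interval_ns: i64) -> Vec<(i64, f64)> {
    let mut acc: BTreeMap<i64, (f64, u64)> = BTreeMap::new();
    for &(ts, v) in points {
        let e = acc.entry(bucket_start(ts, interval_ns)).or_insert((0.0, 0));
        e.0 += v;
        e.1 += 1;
    }
    acc.into_iter().map(|(b, (sum, n))| (b, sum / n as f64)).collect()
}
```

The same accumulator shape covers `sum`, `count`, `min`, and `max`; only the fold and the final projection change.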

---

## 🔌 Wire Protocol

### Ingestion — TCP :8086

InfluxDB-compatible line protocol. Works with Telegraf, Prometheus remote_write adapters, and any tool that speaks line protocol.

```
<measurement>,<tag1>=<val1> <field1>=<fval1>,<field2>=<fval2> <timestamp_ns>
```

Field type suffixes: `1.0` (float), `1i` (integer), `1u` (unsigned), `t`/`f` (boolean), `"hello"` (string).
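
The suffix rules can be sketched as a small typed parser. This is an illustrative stand-in (the enum and function names are assumptions, and string escaping is ignored), not the crate's actual parser:

```rust
/// A parsed field value, typed by its line-protocol suffix.
#[derive(Debug, PartialEq)]
enum Value {
    Float(f64),
    Int(i64),
    Uint(u64),
    Bool(bool),
    Str(String),
}

/// Type a single field value: `t`/`f` → bool, quoted → string,
/// `…i` → integer, `…u` → unsigned, anything else numeric → float.
fn parse_field_value(s: &str) -> Option<Value> {
    match s {
        "t" | "T" | "true" | "True" => return Some(Value::Bool(true)),
        "f" | "F" | "false" | "False" => return Some(Value::Bool(false)),
        _ => {}
    }
    if s.len() >= 2 && s.starts_with('"') && s.ends_with('"') {
        return Some(Value::Str(s[1..s.len() - 1].to_string()));
    }
    if let Some(num) = s.strip_suffix('i') {
        return num.parse().ok().map(Value::Int);
    }
    if let Some(num) = s.strip_suffix('u') {
        return num.parse().ok().map(Value::Uint);
    }
    s.parse().ok().map(Value::Float)
}
```

Returning `None` for anything unparseable is what lets the server reject a malformed line without typing it incorrectly.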

### Query — HTTP :8087

| Endpoint | Method | Description |
|---|---|---|
| `/query` | POST | Execute PulseQL query, return JSON |
| `/write` | POST | Ingest line protocol over HTTP |
| `/health` | GET | Liveness check |
| `/status` | GET | Engine statistics (series count, throughput, disk usage) |

---

## ⚙️ Configuration

PulseDB is configured via CLI flags (config file support coming):

| Flag | Default | Description |
|---|---|---|
| `--data-dir` | `./pulsedb_data` | Root directory for all data |
| `--tcp-port` | `8086` | Line protocol ingestion port |
| `--http-port` | `8087` | HTTP query API port |
| `--wal-fsync` | `batch` | WAL fsync policy: `every` / `batch` / `none` |
| `--memtable-size` | `64MB` | Flush threshold for in-memory buffer |
| `--segment-duration` | `3600` | Partition duration in seconds (1 hour) |
| `--retention` | _(none)_ | Auto-drop data older than duration (e.g., `30d`); unset means keep forever |
| `--log-level` | `info` | Logging: `trace` / `debug` / `info` / `warn` / `error` |

### Data Directory Layout

```
pulsedb_data/
├── wal/
│   └── wal.log                    # Write-ahead log
├── partitions/
│   ├── 2024-01-15T14/             # Hourly partition
│   │   ├── cpu_host=server01.seg  # Compressed columnar segment
│   │   └── mem_host=server01.seg
│   └── 2024-01-15T15/
│       └── ...
├── index/
│   ├── series.idx                 # Series key → ID mapping
│   └── tags.idx                   # Tag inverted index
└── meta/
    └── measurements.json          # Schema (field names + types)
```

---

## 🎯 Performance Targets

| Metric | Target |
|---|---|
| Write throughput | ≥ 1M points/sec (batch) |
| Single-point write latency | < 10μs (WAL + memtable) |
| Time-range query (1h, 1 series) | < 1ms |
| Time-range query (1h, 1K series) | < 50ms |
| Aggregation (24h, GROUP BY 5m) | < 10ms |
| Compression ratio (float metrics) | ≥ 10× |
| Memory (1M active series) | < 2GB |
| Segment flush (1M points) | < 100ms |

---

## 🏛️ Tech Stack

| Layer | Crate | Purpose |
|---|---|---|
| Async Runtime | `tokio` | TCP/HTTP server, background tasks |
| Compression | `lz4_flex` | Fast outer compression layer |
| Checksums | `crc32fast` | WAL and segment integrity |
| Concurrency | `parking_lot` | Low-overhead RwLock |
| CLI | `clap` (derive) | Command-line argument parsing |
| Serialization | `serde`, `serde_json` | Config, WAL payload, HTTP responses |
| Time | `chrono` | Partition key formatting |
| Hashing | `xxhash-rust` (xxh3) | Fast non-crypto hashing |
| Memory Mapping | `memmap2` | Zero-copy segment reads |
| Logging | `tracing`, `tracing-subscriber` | Structured logging |
| Errors | `thiserror`, `anyhow` | Error handling |

### Module Structure

```
src/
├── main.rs              # CLI entry point, server bootstrap
├── model/               # DataPoint, FieldValue, Tags, SeriesKey, SeriesId
├── encoding/            # Compression codecs
│   ├── timestamp.rs     # Delta-of-delta + zigzag + varint
│   ├── float.rs         # Gorilla XOR (Facebook paper)
│   ├── integer.rs       # Delta + zigzag + varint
│   └── boolean.rs       # Bit-packing
├── engine/              # Core database engine
│   ├── database.rs      # Write path coordinator
│   ├── wal.rs           # Write-ahead log
│   ├── memtable.rs      # In-memory sorted buffer
│   └── config.rs        # Engine configuration
├── storage/             # On-disk storage
│   ├── segment.rs       # Columnar segment reader/writer
│   ├── partition.rs     # Hourly time partitions
│   ├── cache.rs         # Segment metadata cache
│   └── compactor.rs     # Background segment merging
├── index/               # Series & tag indexing
│   ├── series.rs        # Key → ID mapping
│   └── inverted.rs      # Tag inverted index (posting lists)
├── query/               # Query engine (PulseQL parser, planner, executor)
├── server/              # TCP + HTTP network layer
└── cli/                 # CLI commands (server, query, import, status)
```

---

## 🤝 Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/my-feature`)
3. Make your changes with tests (`cargo test`)
4. Ensure formatting (`cargo fmt`) and lints pass (`cargo clippy`)
5. Open a pull request

### Building & Testing

```bash
cargo build              # Debug build
cargo build --release    # Optimized release build
cargo test               # Run all tests (198 tests)
cargo clippy             # Lint checks
cargo fmt --check        # Format check
cargo bench              # Run benchmarks
```

---

## 🗺️ Roadmap

- [x] Core data model (DataPoint, FieldValue, Tags, SeriesKey)
- [x] Compression codecs (delta-of-delta, Gorilla XOR, delta+zigzag, bit-pack)
- [x] Write-ahead log with CRC32 crash recovery
- [x] MemTable with freeze/rotate
- [x] Columnar segment writer/reader with LZ4
- [x] Time-based partitioning
- [x] Series index + tag inverted index
- [x] Segment flush integration (memtable → disk)
- [x] Line protocol parser
- [x] PulseQL query engine (lexer, parser, planner, executor)
- [x] Aggregation functions (count, sum, mean, min, max, GROUP BY)
- [x] TCP ingestion server
- [x] HTTP query API
- [x] CLI (server, query, import, status)
- [x] Background compactor
- [x] Retention policies
- [x] Regex tag matching (=~ and !~ operators)
- [x] Schema enforcement (type-mismatch rejection)
- [x] Criterion benchmarks (ingestion, query, compression)
- [ ] Flamegraph profiling + hot-path optimization
- [ ] GitHub Actions CI

---

## 📄 License

MIT — see [LICENSE](LICENSE) for details.

---

<p align="center">
  <sub>Built with 🦀 Rust — designed for speed, compressed for efficiency</sub>
</p>