# xgrep
Search compressed production logs in milliseconds instead of seconds or minutes — by touching only the blocks that might match.
xgrep is an indexed search accelerator for compressed log files. After a one-time index build, repeated searches over large `.gz` corpora ran **100x to 1,200x faster than zgrep** in the benchmarks below, because xgrep reads only candidate blocks instead of decompressing and scanning the entire corpus.
## How it works
```bash
# First time: build an index (one-time cost)
xgrep --build-index logs/*.gz
# Every search after that is near-instant
xgrep "blk_1073741825" logs/*.gz # specific ID: ~25ms
xgrep -F "ERROR" logs/*.gz # broad query: ~25ms
xgrep "timeout.*connection" logs/*.gz # regex: ~25ms
```
Traditional tools like `zgrep` decompress and scan every byte on every query. xgrep builds a sidecar index of per-block bloom filters, then uses memory-mapped I/O to touch only the blocks that might contain your search term. The OS never pages in the rest.
**Architecture:**
1. **Index build** — Decompress each `.gz`, split into 64KB blocks, build token-level bloom filters, write consolidated index
2. **Cached search** — Load bloom index (small), check each block's bloom (nanoseconds), mmap the cache, access only candidate blocks (~1% of data)
3. **Output** — Grep-compatible: line numbers, filenames, context lines, color, count mode
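The block-skipping idea in steps 1 and 2 can be sketched in a few lines of Rust. This is an illustrative sketch only, not xgrep's actual code: the `Bloom` type, the `tokens` splitter (splitting on non-alphanumeric characters), the hash count of 5, and the use of `DefaultHasher` are all assumptions made for the example. Only the 4 KB filter per 64 KB block comes from this README.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BLOOM_BITS: u64 = 4 * 1024 * 8; // 4 KB of filter bits per 64 KB block
const NUM_HASHES: u64 = 5; // assumed; the real hash count is not documented here

/// Bloom filter covering the tokens of one block (hypothetical type).
struct Bloom {
    bits: Vec<u64>,
}

impl Bloom {
    fn new() -> Self {
        Bloom { bits: vec![0; (BLOOM_BITS / 64) as usize] }
    }

    /// Bit positions for a token, derived by double hashing.
    fn positions(token: &str) -> impl Iterator<Item = u64> {
        let hash_with = |seed: u64| {
            let mut h = DefaultHasher::new();
            (seed, token).hash(&mut h);
            h.finish()
        };
        let (a, b) = (hash_with(0), hash_with(1) | 1);
        (0..NUM_HASHES).map(move |i| a.wrapping_add(i.wrapping_mul(b)) % BLOOM_BITS)
    }

    fn insert(&mut self, token: &str) {
        for p in Self::positions(token) {
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    /// False means "definitely absent"; true means "possibly present".
    fn may_contain(&self, token: &str) -> bool {
        Self::positions(token).all(|p| (self.bits[(p / 64) as usize] & (1u64 << (p % 64))) != 0)
    }
}

/// Assumed tokenizer: split on anything that is not an ASCII letter or digit.
fn tokens(text: &str) -> impl Iterator<Item = &str> {
    text.split(|c: char| !c.is_ascii_alphanumeric()).filter(|t| !t.is_empty())
}

/// Index build: one bloom filter per (already split) block.
fn build_index(blocks: &[&str]) -> Vec<Bloom> {
    blocks
        .iter()
        .map(|block| {
            let mut bloom = Bloom::new();
            for t in tokens(block) {
                bloom.insert(t);
            }
            bloom
        })
        .collect()
}

/// Search: keep only blocks whose filter may contain every query token.
fn candidate_blocks(index: &[Bloom], query: &str) -> Vec<usize> {
    let wanted: Vec<&str> = tokens(query).collect();
    index
        .iter()
        .enumerate()
        .filter(|(_, bloom)| wanted.iter().all(|t| bloom.may_contain(t)))
        .map(|(i, _)| i)
        .collect()
}
```

A bloom filter never reports a false negative, so skipping a block is always safe. A false positive only costs one extra block scan, which is why the real search still has to run over each candidate.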
## Benchmark: Real Production Logs
Tested on three datasets from [LogHub](https://github.com/logpai/loghub) — real production logs from Hadoop, a supercomputer, and Spark.
### HDFS (Hadoop Distributed File System)
| Query | xgrep (cached) | zgrep | Speedup |
|---|---|---|---|
| `blk_-1608999687919862906` (specific block ID) | 30ms | 27.4s | **913x** |
| `WARN` | 28ms | 25.4s | **907x** |
| `INFO` (broad — most lines) | 23ms | 28.0s | **1,217x** |
### BGL (Blue Gene/L Supercomputer)
| Query | xgrep (cached) | zgrep | Speedup |
|---|---|---|---|
| `R02-M1-N0` (specific node) | 26ms | 17.4s | **655x** |
| `FATAL` | 25ms | 17.4s | **708x** |
| `INFO` (broad) | 27ms | 17.7s | **655x** |
### Spark (Distributed Compute)
| Query | xgrep (cached) | zgrep | Speedup |
|---|---|---|---|
| `Executor` | 5.1s | 10m 2s | **118x** |
| `ERROR` | 2.7s | ~10m | **220x** |
### First query vs repeated query
| Scenario | Time | Notes |
|---|---|---|
| `xgrep --build-index` | ~2 min | One-time cost |
| xgrep first query (no cache) | 2.4s | Parallel decompress + scan. Already 18x faster than zgrep. |
| xgrep repeated query (cached) | 11-30ms | Bloom skip + mmap. Only candidate blocks paged in. |
| zgrep | 44s | Full decompress + scan. Every time. |
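To see when the one-time build pays for itself, the figures in the table above can be plugged into a small break-even calculation. The function name `break_even_queries` is hypothetical, and the sketch assumes the build time and the zgrep time were measured on the same corpus, which the table implies but does not state.

```rust
/// Number of queries after which "build once, then search cached" costs less
/// total time than running the baseline tool on every query.
/// Returns None if the cached search is not faster than the baseline.
fn break_even_queries(build_secs: f64, baseline_secs: f64, cached_secs: f64) -> Option<u64> {
    let saved_per_query = baseline_secs - cached_secs;
    if saved_per_query <= 0.0 {
        return None;
    }
    Some((build_secs / saved_per_query).ceil() as u64)
}
```

With a ~120 s build, 44 s per zgrep query, and about 0.03 s per cached query, `break_even_queries(120.0, 44.0, 0.03)` gives 3: the index is ahead from the third query onward.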
### Key metric
**Bytes touched per query: ~0.1-1% of corpus vs 100% for zgrep.**
xgrep's cost scales with candidate data, not corpus size. As your logs grow, the advantage widens.
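The "bytes touched" idea can be illustrated with a short sketch that reads only the candidate blocks of a decompressed cache file and reports how many bytes it actually read. This is an assumption-laden stand-in: it uses plain `seek` + `read` from the standard library rather than memory-mapped I/O, the function names are hypothetical, and it ignores lines that straddle a block boundary.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

const BLOCK_SIZE: u64 = 64 * 1024; // 64 KB blocks, as in the index description

/// Fill `buf` as far as the file allows; returns the number of bytes read.
fn read_up_to(file: &mut File, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match file.read(&mut buf[filled..])? {
            0 => break, // end of file
            n => filled += n,
        }
    }
    Ok(filled)
}

/// Scan only the listed blocks for `needle`.
/// Returns (matching line count, bytes actually read).
fn scan_candidates(path: &Path, candidates: &[u64], needle: &str) -> io::Result<(usize, u64)> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; BLOCK_SIZE as usize];
    let mut hits = 0;
    let mut touched = 0u64;
    for &block in candidates {
        // Jump straight to the block; everything in between is never read.
        file.seek(SeekFrom::Start(block * BLOCK_SIZE))?;
        let n = read_up_to(&mut file, &mut buf)?;
        touched += n as u64;
        let text = String::from_utf8_lossy(&buf[..n]);
        hits += text.lines().filter(|line| line.contains(needle)).count();
    }
    Ok((hits, touched))
}
```

The work done depends on `candidates.len()`, not on the file size: doubling the corpus while keeping the same candidate list leaves `touched` unchanged.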
## JSON Mode (`-j`)
Field-aware search for NDJSON/JSONL logs. Skips blocks using bloom-indexed field-value pairs.
```bash
xgrep -j 'level=error' app.jsonl.gz
xgrep -j 'level=error status=500' logs/*.json.gz # AND logic
xgrep -j 'http.method=post' access.jsonl.gz # nested fields
```
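As a rough illustration of how field-aware pruning with AND logic could work, the sketch below parses a query such as `level=error status=500` into clauses and tests them against a per-block bloom filter keyed on `(field, value)` pairs. The names `parse_query`, `PairBloom`, and `block_may_match` are hypothetical, as are the hash count and the choice of `DefaultHasher`. The sketch also assumes JSON values have already been flattened to strings with dotted paths (for example `http.method`), which is a step it does not implement.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BITS: u64 = 4 * 1024 * 8; // assumed: same 4 KB filter size as the text index
const NUM_HASHES: u64 = 5; // assumed

/// Split `field=value field=value` into clauses; None if any clause lacks `=`.
fn parse_query(query: &str) -> Option<Vec<(&str, &str)>> {
    query
        .split_whitespace()
        .map(|clause| clause.split_once('='))
        .collect()
}

/// Per-block bloom filter over (field, value) pairs (hypothetical type).
struct PairBloom {
    bits: Vec<u64>,
}

impl PairBloom {
    fn new() -> Self {
        PairBloom { bits: vec![0; (BITS / 64) as usize] }
    }

    fn position(field: &str, value: &str, seed: u64) -> u64 {
        let mut h = DefaultHasher::new();
        (seed, field, value).hash(&mut h);
        h.finish() % BITS
    }

    fn insert(&mut self, field: &str, value: &str) {
        for seed in 0..NUM_HASHES {
            let p = Self::position(field, value, seed);
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    fn may_contain(&self, field: &str, value: &str) -> bool {
        (0..NUM_HASHES).all(|seed| {
            let p = Self::position(field, value, seed);
            (self.bits[(p / 64) as usize] & (1u64 << (p % 64))) != 0
        })
    }
}

/// AND logic: a block is a candidate only if every clause may be present.
fn block_may_match(bloom: &PairBloom, clauses: &[(&str, &str)]) -> bool {
    clauses.iter().all(|(field, value)| bloom.may_contain(field, value))
}
```

Hashing the pair rather than the bare value means a block containing `status=200` and `code=500` is not a candidate for `status=500`, which is what makes the pruning field-aware.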
### JSON Benchmark
**1M NDJSON lines, 244MB uncompressed, 22MB gzip**
| Query | Baseline | | xgrep `-j` | Speedup |
|---|---|---|---|---|
| `user_id=42042` (9 matches) | 40.5s | 0.82s | **0.13s** | **319x** |
| `status=503` (111K matches) | 40.5s | 0.66s | **1.97s** | **20x** |
| `level=error status=503` (multi-clause) | 40.6s | — | **1.83s** | **22x** |
## Who this is for
Engineers who repeatedly search compressed log archives:
- **Incident investigation** — dozens of queries against the same log corpus
- **Log forensics** — searching rotated/archived `.gz` files
- **Debugging** — narrowing down by user ID, request ID, error code
- **Any workflow** where you've waited on `zgrep`
xgrep is not a universal grep replacement. It is an indexed search accelerator for cold compressed data.
## Usage
```bash
# Build index
xgrep --build-index logs/*.gz
xgrep --build-index logs/ # or pass directory
# Text search (uses cache automatically if available)
xgrep "ERROR" logs/*.gz
xgrep -F "user_id=12345" logs/*.gz # fixed string
xgrep -i "timeout" logs/*.gz # case insensitive
xgrep -n -C 3 "Exception" logs/*.gz # line numbers + context
xgrep -c "WARN" logs/*.gz # count only
xgrep -l "FATAL" logs/*.gz # filenames only
xgrep --stats "ERROR" logs/ # show skip statistics
# JSON field search
xgrep -j 'level=error' logs/*.jsonl.gz
xgrep -j 'user_id=12345 status=500' logs/
# Plain text (no index needed — parallel SIMD search)
xgrep "TODO" src/**/*.rs
```
### Flags
| Flag | Description |
|---|---|
| `-F` | Fixed string (no regex) |
| `-E` | Extended regex |
| `-i` | Case insensitive |
| `-n` | Line numbers |
| `-c` | Count matches |
| `-l` | List matching files |
| `-q` | Quiet (exit 0 on match) |
| `-o` | Print only matched part |
| `-m N` | Stop after N matches |
| `-A/-B/-C N` | Context lines |
| `-j` | JSON field filter mode |
| `--build-index` | Build sidecar cache |
| `--stats` | Show block skip stats |
| `--include` | Glob filter |
## Install
```bash
cargo install xgrep-cli
# Or from source
cargo install --path .
```
The crate is published as `xgrep-cli` but the binary is `xgrep`.
## How the index works
```
logs/
  app-001.log.gz
  app-002.log.gz
  .xgrep/
    index.xgi    # bloom filters (~6% of decompressed size)
    index.xgd    # decompressed content (memory-mapped)
```
- **Bloom filters**: 4KB per 64KB block, token-level, ~3% false positive rate
- **Staleness**: File size and mtime are checked on every query. A stale cache is ignored automatically.
- **Cache size**: ~5-6x compressed size (decompressed data + bloom index)
- **JSON mode**: Hashes `(field, value)` pairs into blooms for field-aware pruning
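The staleness rule above (file size plus mtime) can be sketched with the standard library alone. The `Fingerprint` type and function names are hypothetical, and how xgrep actually stores or compares these values is not described here. The sketch only shows the comparison itself.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Size and modification time recorded for a source file at index-build time.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Fingerprint {
    size: u64,
    mtime: SystemTime,
}

/// Read the current size and mtime of a file.
fn fingerprint(path: &Path) -> io::Result<Fingerprint> {
    let meta = fs::metadata(path)?;
    Ok(Fingerprint { size: meta.len(), mtime: meta.modified()? })
}

/// The cache is usable only if the file still has the recorded size and mtime.
/// A missing or unreadable file counts as stale.
fn cache_is_fresh(path: &Path, recorded: &Fingerprint) -> bool {
    match fingerprint(path) {
        Ok(current) => current == *recorded,
        Err(_) => false,
    }
}
```

The check costs one `stat` call per file and no reads of file content. A rewrite that preserves both size and mtime would not be detected, which is the usual trade-off of this kind of fingerprint.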
## License
Apache-2.0