# xgrep

Search compressed production logs in milliseconds instead of seconds or minutes — by touching only the blocks that might match.

xgrep is an indexed search accelerator for compressed log files. After a one-time index build, repeated searches over large `.gz` corpora are 100x to 1,200x faster than zgrep, because xgrep reads only candidate blocks instead of decompressing and scanning the entire corpus.
## How it works

```sh
# First time: build an index (one-time cost)
xgrep --build-index logs/

# Every search after that is near-instant
xgrep "ERROR" logs/
```

Traditional tools like zgrep decompress and scan every byte on every query. xgrep builds a sidecar index of per-block bloom filters, then uses memory-mapped I/O to touch only the blocks that might contain your search term. The OS never pages in the rest.
Architecture:

- Index build — decompress each `.gz`, split into 64KB blocks, build token-level bloom filters, write a consolidated index
- Cached search — load the bloom index (small), check each block's bloom (nanoseconds), mmap the cache, access only candidate blocks (~1% of data)
- Output — Grep-compatible: line numbers, filenames, context lines, color, count mode
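The two phases above can be sketched in miniature. This is a hypothetical Python sketch: the 64KB block size and 4KB bloom width match the numbers in this README, but the hashing scheme is invented for illustration and is not xgrep's actual format.

```python
import hashlib

BLOCK_SIZE = 64 * 1024      # 64KB blocks, as in the index build step
BLOOM_BITS = 4 * 1024 * 8   # 4KB bloom filter per block
NUM_HASHES = 7

def bit_positions(token: str):
    # Derive NUM_HASHES bit positions from one digest (illustrative scheme)
    digest = hashlib.sha256(token.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % BLOOM_BITS
            for i in range(NUM_HASHES)]

def build_index(data: bytes):
    """Index build: one token-level bloom per 64KB block."""
    blooms = []
    for off in range(0, len(data), BLOCK_SIZE):
        bloom = 0
        for token in data[off:off + BLOCK_SIZE].split():
            for pos in bit_positions(token.decode(errors="replace")):
                bloom |= 1 << pos
        blooms.append(bloom)
    return blooms

def candidate_blocks(blooms, term: str):
    """Cached search: keep only blocks whose bloom might contain the term."""
    want = bit_positions(term)
    return [i for i, bloom in enumerate(blooms)
            if all(bloom >> pos & 1 for pos in want)]

# Two synthetic 64KB blocks: the first all INFO lines, the second all WARN
data = b"INFO startup ok\n" * 4096 + b"WARN disk full\n" * 4096
blooms = build_index(data)
cands = candidate_blocks(blooms, "WARN")
print(f"scan {len(cands)} of {len(blooms)} blocks")
```

Only the blocks surviving the bloom check are ever decompressed and scanned; everything else is skipped without touching its bytes.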
## Benchmark: Real Production Logs

Tested on three datasets from LogHub — real production logs from Hadoop, a supercomputer, and Spark.

### HDFS (Hadoop Distributed File System)

76 files | 743MB compressed | 7.5GB decompressed | 55.8M lines

| Query | xgrep (cached) | zgrep | Speedup |
|---|---|---|---|
| `blk_-1608999687919862906` (specific block ID) | 30ms | 27.4s | 913x |
| `WARN` | 28ms | 25.4s | 907x |
| `INFO` (broad — most lines) | 23ms | 28.0s | 1,217x |
### BGL (Blue Gene/L Supercomputer)

50 files | 421MB compressed | 5.0GB decompressed | 33.2M lines

| Query | xgrep (cached) | zgrep | Speedup |
|---|---|---|---|
| `R02-M1-N0` (specific node) | 26ms | 17.4s | 655x |
| `FATAL` | 25ms | 17.4s | 708x |
| `INFO` (broad) | 27ms | 17.7s | 655x |
### Spark (Distributed Compute)

3,852 files | 189MB compressed | 2.8GB decompressed | 33.2M lines

| Query | xgrep (cached) | zgrep | Speedup |
|---|---|---|---|
| `Executor` | 5.1s | 10m 2s | 118x |
| `ERROR` | 2.7s | ~10m | 220x |
### First query vs repeated query

| Mode | Time | Notes |
|---|---|---|
| `xgrep --build-index` | ~2 min | One-time cost |
| xgrep first query (no cache) | 2.4s | Parallel decompress + scan. Already 18x faster than zgrep. |
| xgrep repeated query (cached) | 11-30ms | Bloom skip + mmap. Only candidate blocks paged in. |
| zgrep | 44s | Full decompress + scan. Every time. |
### Key metric
Bytes touched per query: ~0.1-1% of corpus vs 100% for zgrep.
xgrep's cost scales with candidate data, not corpus size. As your logs grow, the advantage widens.
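That "bytes touched" figure is a consequence of demand paging: a memory-mapped cache costs nothing for pages that are never accessed. A minimal, self-contained illustration — the 256-block temp file here is just a stand-in for the decompressed cache, not xgrep's real file:

```python
import mmap
import os
import tempfile

BLOCK = 64 * 1024

# Stand-in for the decompressed cache: a 16MB file of 256 blocks
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(256 * BLOCK))
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Suppose the bloom check flagged blocks 3 and 200 as candidates.
    # Only the pages behind these two slices are faulted in; the other
    # 254 blocks are never read from disk.
    hits = [mm[i * BLOCK:(i + 1) * BLOCK] for i in (3, 200)]
    mm.close()
os.unlink(path)
print(f"paged in {len(hits)} of 256 blocks")
```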
## JSON Mode (-j)
Field-aware search for NDJSON/JSONL logs. Skips blocks using bloom-indexed field-value pairs.
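A hypothetical sketch of that field-aware pruning follows; the pair encoding and hash function are invented for illustration and differ from xgrep's on-disk format:

```python
import hashlib
import json

BLOOM_BITS = 32768

def pair_key(field: str, value) -> int:
    # Hash one (field, value) pair to a single bloom bit (illustrative)
    h = hashlib.sha256(f"{field}={value}".encode()).digest()
    return int.from_bytes(h[:8], "big") % BLOOM_BITS

def block_bloom(lines):
    """Index time: fold every (field, value) pair in the block into one bloom."""
    bloom = 0
    for line in lines:
        for field, value in json.loads(line).items():
            bloom |= 1 << pair_key(field, value)
    return bloom

def might_match(bloom: int, field: str, value) -> bool:
    """Query time: a clause like status=503 skips blocks that fail this test."""
    return bool(bloom >> pair_key(field, value) & 1)

block = ['{"level": "error", "status": 503}',
         '{"level": "info", "status": 200}']
bloom = block_bloom(block)
hit = might_match(bloom, "status", 503)    # block must be scanned
miss = might_match(bloom, "status", 404)   # block can be skipped
print(hit, miss)
```

Because the bloom stores pairs rather than bare tokens, a block full of `status=200` lines is skipped for a `status=503` query even though the token `status` appears everywhere.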
### JSON Benchmark

1M NDJSON lines, 244MB uncompressed, 22MB gzip

| Query | zcat \| jq | zgrep | xgrep -j | Speedup vs jq |
|---|---|---|---|---|
| `user_id=42042` (9 matches) | 40.5s | 0.82s | 0.13s | 319x |
| `status=503` (111K matches) | 40.5s | 0.66s | 1.97s | 20x |
| `level=error status=503` (multi-clause) | 40.6s | — | 1.83s | 22x |
## Who this is for

Engineers who repeatedly search compressed log archives:

- Incident investigation — dozens of queries against the same log corpus
- Log forensics — searching rotated/archived `.gz` files
- Debugging — narrowing down by user ID, request ID, error code
- Any workflow where you've waited on zgrep

xgrep is not a universal grep replacement. It is an indexed search accelerator for cold compressed data.
## Usage

```sh
# Build index
xgrep --build-index logs/

# Text search (uses cache automatically if available)
xgrep "Lost connection" logs/

# JSON field search
xgrep -j status=503 logs/

# Plain text (no index needed — parallel SIMD search)
xgrep "timeout" app.log
```
### Flags

| Flag | Description |
|---|---|
| `-F` | Fixed string (no regex) |
| `-E` | Extended regex |
| `-i` | Case insensitive |
| `-n` | Line numbers |
| `-c` | Count matches |
| `-l` | List matching files |
| `-q` | Quiet (exit 0 on match) |
| `-o` | Print only matched part |
| `-m N` | Stop after N matches |
| `-A/-B/-C N` | Context lines |
| `-j` | JSON field filter mode |
| `--build-index` | Build sidecar cache |
| `--stats` | Show block skip stats |
| `--include` | Glob filter |
## Install

```sh
cargo install xgrep-cli

# Or from source
cargo install --path .
```

The crate is published as `xgrep-cli` but the binary is `xgrep`.
## How the index works

```
logs/
  app-001.log.gz
  app-002.log.gz
  .xgrep/
    index.xgi   # bloom filters (~6% of decompressed size)
    index.xgd   # decompressed content (memory-mapped)
```

- Bloom filters: 4KB per 64KB block, token-level, ~3% false positive rate
- Staleness: file size + mtime checked every query; a stale cache is ignored automatically
- Cache size: ~5-6x compressed size (decompressed data + bloom index)
- JSON mode: hashes `(field, value)` pairs into blooms for field-aware pruning
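The staleness rule amounts to comparing a recorded (size, mtime) pair against the file's current stat; a sketch of the equivalent logic, with illustrative file contents:

```python
import os
import tempfile

def is_stale(log_path: str, recorded_size: int, recorded_mtime: float) -> bool:
    """Per-query check: the cache is stale if the log's size or mtime moved."""
    st = os.stat(log_path)
    return st.st_size != recorded_size or st.st_mtime != recorded_mtime

# Simulate index build: record size + mtime of a log file
with tempfile.NamedTemporaryFile(delete=False, suffix=".log.gz") as f:
    f.write(b"compressed log contents")
    path = f.name
st = os.stat(path)
size, mtime = st.st_size, st.st_mtime

fresh = is_stale(path, size, mtime)   # False: cache still trusted

# The log is rotated or appended to after the index was built
with open(path, "ab") as f:
    f.write(b"more data")
stale = is_stale(path, size, mtime)   # True: cache ignored

os.unlink(path)
print(fresh, stale)
```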
## License
Apache-2.0