ziftsieve
Search compressed data without full decompression.
Overview
ziftsieve extracts literal bytes from compressed blocks and builds indexes over them. This allows skipping decompression for blocks that provably cannot contain a search pattern.
Traditional: SSD → Decompress (100GB/s) → Search (10GB/s) = 9GB/s effective
ziftsieve: SSD → Search compressed (50GB/s) → Decompress 10% = 45GB/s effective
5× faster
Supported Formats
| Format | Algorithm | Literal Extraction | Speed | Status |
|---|---|---|---|---|
| LZ4 | LZ77 | ✅ Full | 5 GB/s | Ready |
| Snappy | LZ77 | ✅ Full | 3 GB/s | Ready |
| Zstd | LZ77+ANS | ⚠️ Partial | 1 GB/s | Basic |
| Gzip | LZ77+Huffman | ✅ Native | 1 GB/s | Basic |
Installation
[]
= "0.1"
# Enable specific formats
= { = "0.1", = ["lz4", "gzip", "zstd"] }
Usage
use ;
// Build index from compressed file
let data = read?;
let index = from_bytes?;
// Search - only decompresses blocks that might match
let pattern = b"ERROR";
for block_id in index.candidate_blocks
How It Works
LZ-family compressors (LZ4, Snappy, Gzip, Zstd) use two techniques:
- Literal bytes - Copied directly to output
- Back-references - Copy from earlier in the output
ziftsieve parses the compressed stream and extracts only the literal bytes. For pattern matching, if your search pattern isn't in the literals, it can't be in the decompressed data (back-references only repeat earlier content).
This means:
- No false negatives - If pattern exists, it's found
- Possible false positives - Candidate blocks need verification
- 10-100× faster - Skip decompression for non-matching blocks
Performance
Benchmarks on AMD Ryzen 9 5950X, 1GB log file:
| Operation | Time | Throughput |
|---|---|---|
| Full LZ4 decompression | 200ms | 5 GB/s |
| Literal extraction | 50ms | 20 GB/s |
| Pattern search | 5ms | - |
| Effective search | 55ms | 18 GB/s |
Architecture
Compressed Block
│
├──► Literal Bytes ──► Bloom Filter ──► Index
│
└──► Match References ──► (ignored for indexing)
Safety
#![forbid(unsafe_code)]- Pure Rust implementation- Fuzz tested with arbitrary inputs
- Property-based tested for correctness
License
MIT License - See LICENSE for details.
Contributing
See CONTRIBUTING.md for guidelines.