indexed_deflate
Gzip/Zlib/DEFLATE decoder with efficient random access.
DEFLATE does not normally support random access, so we build an index while decompressing the
entire input. The index contains a set of access points, typically one per 1 MB of input.
We can restart decompression from any access point, letting us seek to any byte for the
cost of decompressing at most 1 MB of discarded data (a few milliseconds on a desktop CPU).
The index is saved to disk and can be reused for any subsequent processing of the same file.
Decompression is implemented with the pure-Rust miniz_oxide.
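The access-point lookup described above can be sketched roughly as follows. This is a minimal illustration, not the crate's actual implementation: the `AccessPoint` struct, its field names and the `locate` function are all hypothetical. A real DEFLATE index would also need to store decoder state for each point (for example the preceding window of decompressed data), which is omitted here.

```rust
/// One restart point in the index.
/// Hypothetical type for illustration; not part of the indexed_deflate API.
#[derive(Debug, Clone, Copy, PartialEq)]
struct AccessPoint {
    /// Offset in the decompressed stream where this point begins.
    uncompressed: u64,
    /// Offset in the compressed file where decoding can resume.
    compressed: u64,
}

/// Finds the last access point at or before `target` (a decompressed offset)
/// and the number of decompressed bytes that must be discarded to reach it.
/// `points` must be sorted by `uncompressed`.
fn locate(points: &[AccessPoint], target: u64) -> Option<(AccessPoint, u64)> {
    // Number of points whose decompressed offset is <= target.
    let n = points.partition_point(|p| p.uncompressed <= target);
    // If no point precedes the target, there is nothing to restart from.
    let point = *points.get(n.checked_sub(1)?)?;
    Some((point, target - point.uncompressed))
}
```

Given the result, a decoder would seek the compressed file to `point.compressed`, resume decompression there, and throw away the returned number of bytes before handing data to the caller.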
Performance
With the default configuration, the index file stored on disk will be up to 3% of the size of the input file.
Only a small map of file offsets is stored in RAM, roughly 0.003% of the size of the input.
This minimises the startup cost when a process only wants to use a small part of the index:
the total time to open, seek and start reading is only a few milliseconds, even if the input
file is many GBs.
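To make the percentages above concrete, here is a small arithmetic sketch. The function name is hypothetical, and the results are only the rough upper bounds quoted above for the default configuration, not measurements.

```rust
/// Rough size estimates from the figures quoted above:
/// up to 3% of the input on disk, roughly 0.003% of the input in RAM.
/// Returns (index_bytes_on_disk, offset_map_bytes_in_ram).
fn estimate_index_sizes(input_bytes: u64) -> (u64, u64) {
    let disk = input_bytes * 3 / 100; // 3%
    let ram = input_bytes * 3 / 100_000; // 0.003%
    (disk, ram)
}
```

For a 10 GB input (10,000,000,000 bytes) this gives an index of up to about 300 MB on disk and an in-memory offset map of about 300 KB.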
Usage
An example implementing random access to .tar.gz files:
use std::{collections::HashMap, fs::File, io::{Read, Seek, SeekFrom, Write}, str};
use indexed_deflate::{AccessPointSpan, GzDecoder, GzIndexBuilder, Result};

fn build_tar_index() -> Result<()> {
    let gz = File::open("example.tar.gz")?;
    let mut index = File::create("example.tar.gz.index")?;
    let mut builder = GzIndexBuilder::new(gz, &index, AccessPointSpan::default())?;

    // Map each tar entry's path to its (offset, size) in the decompressed stream
    let mut archive = tar::Archive::new(&mut builder);
    let files: HashMap<String, (u64, u64)> = archive
        .entries_with_seek()?
        .map(|file| {
            let file = file.unwrap();
            let path = str::from_utf8(&file.path_bytes()).unwrap().to_owned();
            (path, (file.raw_file_position(), file.size()))
        })
        .collect();

    builder.finish()?;

    // Store the serialized file map in the index file once the builder has finished
    index.write_all(&postcard::to_stdvec(&files).unwrap())?;
    Ok(())
}

fn use_tar_index() -> Result<()> {
    let gz = File::open("example.tar.gz")?;
    let index = File::open("example.tar.gz.index")?;
    let mut stream = GzDecoder::new(gz, index)?;

    // Read back the file map that build_tar_index() wrote to the index file
    let files: HashMap<String, (u64, u64)> = stream.with_index(|index| {
        let mut buf = Vec::new();
        index.read_to_end(&mut buf)?;
        Ok(postcard::from_bytes(&buf).unwrap())
    })?;

    // Seek to the entry's data and read exactly its contents
    let (file_pos, file_size) = files.get("example.txt").unwrap();
    stream.seek(SeekFrom::Start(*file_pos))?;
    let mut buf = vec![0; *file_size as usize];
    stream.read_exact(&mut buf)?;
    println!("{}", str::from_utf8(&buf).unwrap());
    Ok(())
}
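The lookup at the end of the example can be factored into a small helper that works with any `Read + Seek` stream. This is a sketch only: `read_entry` is a hypothetical name and not part of indexed_deflate. Unlike the example above, it returns `None` for a missing path instead of panicking. It uses only the standard library, so it can be tried out with an in-memory `std::io::Cursor` in place of a real decoder.

```rust
use std::{
    collections::HashMap,
    io::{self, Read, Seek, SeekFrom},
};

/// Reads one archive member from a seekable stream, given a map of
/// path -> (offset, size) like the one built in the example above.
/// Hypothetical helper for illustration; returns Ok(None) if the path is unknown.
fn read_entry<R: Read + Seek>(
    stream: &mut R,
    files: &HashMap<String, (u64, u64)>,
    path: &str,
) -> io::Result<Option<Vec<u8>>> {
    let (pos, size) = match files.get(path) {
        Some(&entry) => entry,
        None => return Ok(None),
    };
    // Jump to the start of the member's data, then read exactly `size` bytes.
    stream.seek(SeekFrom::Start(pos))?;
    let mut buf = vec![0; size as usize];
    stream.read_exact(&mut buf)?;
    Ok(Some(buf))
}
```

Assuming `GzDecoder` implements `Read` and `Seek`, as its use in the example suggests, the same helper could be called with the decoder as `stream`.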