Expand description
Gzip/Zlib/DEFLATE decoder with efficient random access.
As DEFLATE does not normally support random access, we build an index while decompressing the entire input. This contains a set of access points, typically one per 1MB of input. We can restart decompression from any access point, letting us seek to any byte for the cost of decompressing at most 1MB of discarded data (a few milliseconds on a desktop CPU).
The index is saved to disk and can be reused for any subsequent processing of the same file.
Decompression is implemented with the pure-Rust miniz_oxide.
§Memory
Each access point takes up to 32KB of storage (less if the compression ratio is good). If we have one access point for every 1MB of input, the index will be up to 3% of the size of the input file.
For both building and using an index file, only a small map of file offsets (about 32 bytes per access point, or 0.003% the size of the input file) is stored in RAM. The bulky decompressor state will be kept on disk. This allows indexes to be larger than RAM, and minimises the startup cost when a process only wants to use a small part of the index.
This also allows a multi-threaded application to create multiple readers over the same file, allowing parallel decompression of different sections of the file, with only a small RAM cost.
§Examples
§Basic usage
fn build_index() -> Result<()> {
let gz = File::open("example.gz")?;
// If you seek backwards while building, the index must be opened in read-write mode.
// Otherwise you can use the write-only `File::create()` instead.
let index = File::options()
.create(true)
.truncate(true)
.read(true)
.write(true)
.open("example.gz.index")?;
let mut builder = GzIndexBuilder::new(gz, index, AccessPointSpan::default())?;
// Trigger decompression of the entire file. This may take a long time
builder.seek(SeekFrom::End(0))?;
// Finish writing the index to disk
builder.finish()?;
Ok(())
}
fn use_index() -> Result<()> {
let gz = File::open("example.gz")?;
let index = File::open("example.gz.index")?;
let mut decoder = GzDecoder::new(gz, index)?;
// This seek should only take a few milliseconds
decoder.seek(SeekFrom::Start(100_000_000))?;
let mut buf = vec![0u8; 1024];
decoder.read_exact(&mut buf)?;
Ok(())
}§.tar.gz random access
fn build_tar_index() -> Result<()> {
let gz = File::open("example.tar.gz")?;
let mut index = File::create("example.tar.gz.index")?;
// GzIndexBuilder supports Read and Seek
let mut builder = GzIndexBuilder::new(gz, &index, AccessPointSpan::default())?;
// Extract the tar file listing, while decompressing
let mut archive = tar::Archive::new(&mut builder);
let files: HashMap<String, (u64, u64)> = archive
.entries_with_seek()?
.map(|file| {
let file = file.unwrap();
let path = str::from_utf8(&file.path_bytes()).unwrap().to_owned();
(path, (file.raw_file_position(), file.size()))
})
.collect();
// Finish writing the index to disk
builder.finish()?;
// Append our serialized file listing to the index file
index.write_all(&postcard::to_stdvec(&files).unwrap())?;
Ok(())
}
fn use_tar_index() -> Result<()> {
let gz = File::open("example.tar.gz")?;
let index = File::open("example.tar.gz.index")?;
// GzDecoder supports Read and Seek
let mut stream = GzDecoder::new(gz, index)?;
// Load the tar file listing from the end of the index file
let files: HashMap<String, (u64, u64)> = stream.with_index(|index| {
let mut buf = Vec::new();
index.read_to_end(&mut buf)?;
Ok(postcard::from_bytes(&buf).unwrap())
})?;
let (file_pos, file_size) = files.get("example.txt").unwrap();
// Seek in the decompressed stream to read the file
stream.seek(SeekFrom::Start(*file_pos))?;
let mut buf = vec![0; *file_size as usize];
stream.read_exact(&mut buf)?;
println!("{}", str::from_utf8(&buf).unwrap());
Ok(())
}§Other considerations
§Backward compatibility
Currently there are no compatibility guarantees for the index file format.
If the format changes incompatibly, the decoder will reject it with
Error::IndexIncompatibleVersion and you must rebuild the index.
§DEFLATE block sizes
To avoid having to store the entire decompressor state in the index file (including Huffman
trees etc), we only create access points at the boundaries between DEFLATE blocks.
If the blocks are large relative to the requested AccessPointSpan (default 1MB), this may
significantly increase the spacing of access points.
Experiments indicate that GNU Gzip (the standard command-line gzip) and
miniz_oxide have a maximum block size of roughly 64KB compressed, so they should not
be a problem.
(Uncompressed blocks may be several MB, but access points are based on compressed size.)
libdeflate has a maximum block size of roughly 300KB uncompressed
(under its default configuration).
zopfli has a maximum block size of roughly 1MB uncompressed (default).
That will make our index less efficient; but these implementations are explicitly not designed for
compressing very large files, so you are less likely to encounter them in this context.
§Errors
If any API returns a std::io::Error, it is likely that the internal state will be corrupted
(losing track of its position in the input files, etc). Do not try to recover and reuse the
object – drop it and start again.
When decoding, there is no mechanism to detect whether the index file corresponds to the compressed input file. Loading a mismatched index file may result in errors or corrupted output.
Structs§
- Access
Point Span - Approximate number of bytes of compressed input between access points. Larger spans will result in smaller indexes, but slower seek times.
- Deflate
Decoder - Decompresses the input file, using the previously-built index to allow fast seeking.
- Deflate
Index Builder - Decompresses the input file and builds an index to allow fast seeking with the
corresponding
Decodertype. - GzDecoder
- Decompresses the input file, using the previously-built index to allow fast seeking.
- GzIndex
Builder - Decompresses the input file and builds an index to allow fast seeking with the
corresponding
Decodertype. - Zlib
Decoder - Decompresses the input file, using the previously-built index to allow fast seeking.
- Zlib
Index Builder - Decompresses the input file and builds an index to allow fast seeking with the
corresponding
Decodertype.