# safe_unzip

Secure archive extraction. Supports ZIP (core), TAR, and 7z (optional features).
## The Problem

Zip files can contain malicious paths that escape the extraction directory: a crafted entry named `../../etc/cron.d/pwned` extracts to `/etc/cron.d/pwned`. This is Zip Slip, and the default behavior of Python's `tarfile` (before 3.14) and of many extraction libraries in other languages is still vulnerable.
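For illustration, here is how a naive extractor goes wrong: joining an attacker-controlled entry name onto the destination produces a path outside it. (The archive below is constructed in memory purely for the demo.)

```python
import io
import os
import tempfile
import zipfile

# Build an in-memory archive with a path-traversal entry name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("../../etc/cron.d/pwned", "* * * * * root /bin/evil\n")

dest = tempfile.mkdtemp()
with zipfile.ZipFile(buf) as zf:
    name = zf.namelist()[0]
    # A naive extractor joins the entry name onto the destination...
    target = os.path.normpath(os.path.join(dest, name))
    # ...and the normalized result lands outside the extraction directory.
    print(target.startswith(dest + os.sep))  # False
```

Nothing in the zip format stops an archive from carrying such names; the extractor has to validate them.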
## "But didn't Python fix this?"

Sort of. Python's `zipfile` has sanitized extraction paths for years, and PEP 706 added an extraction `filter` parameter to `tarfile` in Python 3.12 (backported to security releases of 3.8+):

```python
import tarfile

# The "safe" way – but who knows this exists?
with tarfile.open("archive.tar.gz") as tf:
    tf.extractall("dest/", filter="data")
```

The problem: until Python 3.14, the safe option is opt-in. The default is still vulnerable, and most developers don't read the docs carefully enough to discover `filter="data"`.
safe_unzip makes security the default, not an afterthought.
## The Solution

```rust
use safe_unzip::extract_file;

extract_file("malicious.zip", "dest/")?;
// Err(PathEscape { entry: "../../etc/cron.d/pwned", ... })
```

```python
# Python bindings – same safety
import safe_unzip

safe_unzip.extract("malicious.zip", "dest/")
# Raises: PathEscapeError
```

Security is the default. No special flags, no opt-in safety. Every path is validated. Malicious archives are rejected, not extracted.
## Why Not Just Use zip / tar / zipfile?

Because archive extraction is a security boundary, and most libraries treat it as a convenience function.

| Library | Default Behavior | Safe Option |
|---|---|---|
| Python `zipfile` | Sanitizes paths, but no size/count limits | Manual limit enforcement |
| Python `tarfile` | Vulnerable | `filter="data"` (opt-in, Python 3.12+) |
| Rust `zip` | Vulnerable | Manual path validation |
| Rust `tar` | Vulnerable | Manual path validation |
| `safe_unzip` | Safe by default | N/A – always safe |

If you're extracting untrusted archives, you need a library designed for that threat model.
## Who Should Use This
- Backend services handling user-uploaded zip files
- CI/CD systems unpacking third-party artifacts
- SaaS platforms with file import features
- Forensics / malware analysis pipelines
- Anything running as a privileged user
If your zip files only come from trusted sources you control, the standard zip crate is fine. If users can upload archives, use safe_unzip.
## Features

- CLI Tool – `safe_unzip` command with `--list`, `--verify`, limits, and filtering
- Archive Verification – Check CRC32 integrity without extracting
- Multi-Format Support – ZIP (core), TAR, and 7z (feature flags)
- Partial Extraction – Extract specific files with `only()` or glob patterns
- Progress Callbacks – Monitor extraction progress (Rust API)
- Async API – Optional tokio-based async extraction (feature flag)
- Zip Slip Protection – Path traversal attacks blocked via path_jail
- Zip Bomb Protection – Configurable limits on size, file count, and path depth
- Strict Size Enforcement – Catches files that decompress larger than declared
- Filename Sanitization – Blocks control characters and Windows reserved names
- Symlink Handling – Skip or reject symlinks (no symlink-based escapes)
- Secure Overwrite – Removes symlinks before overwriting to prevent symlink attacks
- Atomic File Creation – TOCTOU-safe file creation using `O_EXCL`
- Overwrite Policies – Error, skip, or overwrite existing files
- Filter Callback – Extract only the files you want
- Two-Pass Mode – Validate everything before writing anything
- Permission Stripping – Removes setuid/setgid bits on Unix
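The "Strict Size Enforcement" idea can be sketched in a few lines: decompress at most one byte more than the declared size, and fail on overflow. This is an illustrative re-implementation using `zlib` (the helper name is mine, not the library's API):

```python
import zlib

def decompress_capped(compressed: bytes, declared_size: int) -> bytes:
    """Decompress, failing if output exceeds the size the archive declared."""
    d = zlib.decompressobj()
    # Ask for at most declared_size + 1 bytes: one extra byte is enough
    # to prove the entry lied about its uncompressed size.
    out = d.decompress(compressed, declared_size + 1)
    if len(out) > declared_size:
        raise ValueError("entry decompressed larger than declared size")
    return out

payload = b"A" * 100_000
blob = zlib.compress(payload)
print(len(decompress_capped(blob, len(payload))))  # 100000
# decompress_capped(blob, 1024) raises ValueError
```

The key property: the cost of rejecting a lying entry is bounded by the declared size, not by the (possibly enormous) real expansion.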
## Installation

CLI:

```sh
cargo install safe_unzip
```

Rust:

```toml
[dependencies]
safe_unzip = "0.1"
```

Python:

```sh
pip install safe_unzip
```
## Feature Flags

| Feature | Default | Description |
|---|---|---|
| `tar` | ❌ | TAR/TAR.GZ extraction |
| `async` | ❌ | Tokio-based async API |
| `sevenz` | ❌ | 7z extraction (heavier deps) |

```toml
# ZIP only (smallest, ~30 deps)
safe_unzip = "0.1"

# With TAR support (~40 deps)
safe_unzip = { version = "0.1", features = ["tar"] }

# With async API
safe_unzip = { version = "0.1", features = ["async"] }

# Kitchen sink (~85 deps)
safe_unzip = { version = "0.1", features = ["tar", "async", "sevenz"] }
```

Note: Python bindings always include TAR support.
## Python Bindings

The Python bindings are thin wrappers over the Rust implementation via PyO3. This means:

- ✅ Identical security guarantees – same code path, same validation
- ✅ Identical limits – same defaults (1 GB total, 10K files, 100 MB per file)
- ✅ Identical semantics – same error conditions, same behavior
- ✅ No re-implementation – Python calls Rust directly, no logic duplication

Security reviewers: the Python API is a direct binding, not a port.
## CLI Usage

```sh
# Extract archive to destination
safe_unzip archive.zip dest/

# List contents without extracting
safe_unzip --list archive.zip

# Verify integrity (CRC32 check)
safe_unzip --verify archive.zip

# Limits, glob filtering, partial extraction, and verbose output
# are controlled by additional flags; see `safe_unzip --help`.
```
## Quick Start

```rust
use safe_unzip::extract_file;

// Extract with safe defaults
let report = extract_file("archive.zip", "dest/")?;
println!("{report:?}");
```
## Usage Examples

### Basic Extraction

```rust
use safe_unzip::Extractor;

let report = Extractor::new("dest/")?
    .extract_file("archive.zip")?;
```

### Create Destination if Missing

```rust
use safe_unzip::{extract_file, Extractor};

// Extractor::new() errors if destination doesn't exist (catches typos);
// Extractor::new_or_create() creates it automatically.
let report = Extractor::new_or_create("dest/")?
    .extract_file("archive.zip")?;

// The convenience functions (extract_file, extract) also create automatically.
extract_file("archive.zip", "dest/")?;
```
### Custom Limits (Prevent Zip Bombs)

```rust
use safe_unzip::{Extractor, Limits};

let report = Extractor::new("dest/")?
    .limits(Limits {
        max_total_bytes: 100 * 1024 * 1024, // 100 MB total
        max_file_count: 1_000,
        ..Limits::default()
    })
    .extract_file("archive.zip")?;
```

### Filter by Extension

```rust
use safe_unzip::Extractor;

// Only extract images
let report = Extractor::new("dest/")?
    .filter(|entry| entry.name().ends_with(".png") || entry.name().ends_with(".jpg"))
    .extract_file("photos.zip")?;

println!("{report:?}");
```
### Partial Extraction (New in v0.1.5)

Extract specific files by name or glob pattern:

```rust
use safe_unzip::Extractor;

// Extract only specific files
let report = Extractor::new("dest/")?
    .only(["docs/readme.md", "config.toml"])
    .extract_file("archive.zip")?;

// Include by glob pattern
let report = Extractor::new("dest/")?
    .include_glob("*.png")?
    .extract_file("archive.zip")?;

// Exclude by glob pattern
let report = Extractor::new("dest/")?
    .exclude_glob("*.log")?
    .extract_file("archive.zip")?;
```

Python:

```python
import safe_unzip

# Extract only specific files
report = safe_unzip.extract("archive.zip", "dest/", only=["docs/readme.md"])

# Include by pattern
report = safe_unzip.extract("archive.zip", "dest/", include_glob="*.png")

# Exclude by pattern
report = safe_unzip.extract("archive.zip", "dest/", exclude_glob="*.log")
```
### Progress Callbacks

Monitor extraction progress:

```rust
use safe_unzip::Extractor;

let report = Extractor::new("dest/")?
    .on_progress(|done, total| println!("{done}/{total}"))
    .extract_file("archive.zip")?;
```

Python:

```python
import safe_unzip

report = safe_unzip.extract(
    "archive.zip", "dest/",
    on_progress=lambda done, total: print(f"{done}/{total}"),
)

# Or with tqdm for a progress bar
from tqdm import tqdm

bar = tqdm(unit="file")
report = safe_unzip.extract(
    "archive.zip", "dest/",
    on_progress=lambda done, total: bar.update(1),
)
```
### Archive Verification

Check archive integrity without extracting:

```rust
use safe_unzip::verify_file;

// Verify CRC32 for all entries
let report = verify_file("archive.zip")?;
println!("{report:?}");
```

Or with the Extractor (useful if you want to verify then extract):

```rust
use safe_unzip::Extractor;

let extractor = Extractor::new("dest/")?;

// First verify
extractor.verify_file("archive.zip")?;

// Then extract (re-reads archive, but guarantees integrity)
extractor.extract_file("archive.zip")?;
```
### Overwrite Policies

```rust
use safe_unzip::{Extractor, OverwritePolicy};

// Skip files that already exist
let report = Extractor::new("dest/")?
    .overwrite(OverwritePolicy::Skip)
    .extract_file("archive.zip")?;

// Or overwrite them
let report = Extractor::new("dest/")?
    .overwrite(OverwritePolicy::Overwrite)
    .extract_file("archive.zip")?;

// Default: Error if file exists
let report = Extractor::new("dest/")?
    .overwrite(OverwritePolicy::Error) // This is the default
    .extract_file("archive.zip")?;
```

### Symlink Policies

```rust
use safe_unzip::{Extractor, SymlinkPolicy};

// Default: silently skip symlinks
let report = Extractor::new("dest/")?
    .symlinks(SymlinkPolicy::Skip)
    .extract_file("archive.zip")?;

// Or reject archives containing symlinks
let report = Extractor::new("dest/")?
    .symlinks(SymlinkPolicy::Reject)
    .extract_file("archive.zip")?;
```
## Extraction Modes

| Mode | Speed | On Failure | Use When |
|---|---|---|---|
| Streaming (default) | Fast (1 pass) | Partial files remain | Speed matters; you'll clean up on error |
| ValidateFirst | Slower (2 passes) | No files if validation fails | Can't tolerate partial state |

⚠️ Neither mode is truly atomic. If extraction fails mid-write (e.g., disk full), partial files remain regardless of mode. ValidateFirst only prevents writes when validation fails (bad paths, limits exceeded), not when I/O fails during extraction.
```rust
use safe_unzip::{Extractor, ExtractionMode};

// Two-pass extraction:
// 1. Validate ALL entries (no disk writes)
// 2. Extract (only if validation passed)
let report = Extractor::new("dest/")?
    .mode(ExtractionMode::ValidateFirst)
    .extract_file("archive.zip")?;
```

Use ValidateFirst when you can't tolerate partial state from malicious archives. Use Streaming (default) when speed matters and you can clean up on error.
## Extracting from Memory

```rust
use safe_unzip::Extractor;
use std::io::Cursor;

let zip_bytes: Vec<u8> = download_zip_somehow();
let cursor = Cursor::new(zip_bytes);

let report = Extractor::new("dest/")?
    .extract(cursor)?;
```
## TAR Extraction (New in v0.1.2)

```rust
use safe_unzip::{Driver, ExtractionMode};

// Extract a .tar file
let report = Driver::new("dest/")?
    .extract_tar_file("archive.tar")?;

// Extract a .tar.gz file
let report = Driver::new("dest/")?
    .extract_tar_gz_file("archive.tar.gz")?;

// With options
let report = Driver::new("dest/")?
    .filter(|entry| entry.name().ends_with(".txt"))
    .validation(ExtractionMode::ValidateFirst)
    .extract_tar_file("archive.tar")?;
```

The new Driver API provides a unified interface for all archive formats with the same security guarantees.
## 7z Extraction (Requires sevenz Feature)

Enable the sevenz feature:

```toml
[dependencies]
safe_unzip = { version = "0.1", features = ["sevenz"] }
```

```rust
use safe_unzip::Driver;

// Extract a .7z file
let report = Driver::new("dest/")?
    .extract_7z_file("archive.7z")?;

// Or from bytes
let report = Driver::new("dest/")?
    .extract_7z_bytes(&bytes)?;
```

Note: 7z archives are fully decompressed into memory before extraction, so large archives may use significant RAM.

Python:

```python
import safe_unzip

# Simple extraction
report = safe_unzip.extract_7z("archive.7z", "dest/")

# With options
report = safe_unzip.extract_7z("archive.7z", "dest/", only=["docs/readme.md"])
```
## Async Extraction (New)

Enable the async feature for tokio-based async extraction:

```toml
[dependencies]
safe_unzip = { version = "0.1", features = ["async"] }
```

```rust
use safe_unzip::r#async::extract_file;

async fn unpack() -> Result<(), safe_unzip::Error> {
    let report = extract_file("archive.zip", "dest/").await?;
    Ok(())
}
```

Concurrent extraction of multiple archives:

```rust
use safe_unzip::r#async::extract_file;

let (a, b) = tokio::join!(
    extract_file("a.zip", "out_a/"),
    extract_file("b.zip", "out_b/"),
);
```

The async API uses spawn_blocking internally, so extraction runs in a thread pool without blocking the async runtime.
## Python Async API

Python async support uses asyncio.to_thread() to run extraction in a thread pool:

```python
import safe_unzip

# Simple async extraction
report = await safe_unzip.extract_async("archive.zip", "dest/")

# TAR extraction
report = await safe_unzip.extract_tar_async("archive.tar", "dest/")

# With options
report = await safe_unzip.extract_async("archive.zip", "dest/", only=["docs/readme.md"])
```

Concurrent extraction:

```python
import asyncio

async def unpack_all(archives, dest):
    tasks = [safe_unzip.extract_async(a, dest) for a in archives]
    return await asyncio.gather(*tasks)
```
## Security Model

| Threat | Attack Vector | Defense |
|---|---|---|
| Zip Slip | Entry named `../../etc/cron.d/pwned` | path_jail validates every path |
| Zip Bomb (size) | 42 KB → 4 PB expansion | `max_total_bytes` limit + streaming enforcement |
| Zip Bomb (count) | 1 million empty files | `max_file_count` limit |
| Zip Bomb (lying) | Declared 1 KB, decompresses to 1 GB | Strict size reader detects mismatch |
| Symlink Escape | Symlink to `/etc/passwd` | Skip or reject symlinks |
| Symlink Overwrite | Create symlink, then overwrite target | Symlinks removed before overwrite |
| Path Depth | `a/b/c/.../1000levels` | `max_path_depth` limit |
| Invalid Filename | Control chars, `CON`, NUL | Filename sanitization |
| Overwrite | Replace sensitive files | `OverwritePolicy::Error` default |
| Setuid | Create setuid executables | Permission bits stripped |
| Encrypted Archives | Password handling complexity | Rejected (see Encrypted Archives) |
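The path_jail idea from the first row can be sketched as a containment check: resolve the joined path and require the destination to remain its prefix. This is an illustrative Python sketch (the helper name is mine; the real validation lives in the Rust crate):

```python
import os
import tempfile

def jail_join(dest: str, entry_name: str) -> str:
    """Join an entry name onto dest, rejecting anything that escapes it."""
    dest_real = os.path.realpath(dest)
    target = os.path.realpath(os.path.join(dest_real, entry_name))
    # commonpath collapses to a shorter prefix whenever target leaves dest
    if os.path.commonpath([dest_real, target]) != dest_real:
        raise ValueError(f"path escape: {entry_name!r}")
    return target

d = tempfile.mkdtemp()
print(jail_join(d, "docs/readme.md").endswith("readme.md"))  # True
# jail_join(d, "../../etc/cron.d/pwned") raises ValueError
```

Resolving with `realpath` before the prefix check also defeats escapes that route through a symlinked intermediate directory.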
## Default Limits

| Limit | Default | Description |
|---|---|---|
| `max_total_bytes` | 1 GB | Total uncompressed size |
| `max_file_count` | 10,000 | Number of files |
| `max_single_file` | 100 MB | Largest single file |
| `max_path_depth` | 50 | Directory nesting depth |
## Error Handling

```rust
use safe_unzip::{extract_file, Error};

match extract_file("archive.zip", "dest/") {
    Ok(report) => println!("{report:?}"),
    Err(Error::PathEscape { entry, .. }) => eprintln!("blocked traversal: {entry}"),
    Err(Error::EncryptedEntry { .. }) => eprintln!("encrypted archives are rejected"),
    Err(e) => eprintln!("extraction failed: {e}"),
}
```
## Limitations

### Format Limitations

- ZIP, TAR, 7z only – RAR not supported
- Requires seekable input for ZIP – the ZIP format requires reading the central directory at the end
- TAR is sequential – TAR files are read in order; `ValidateFirst` mode caches entries in memory
- No encrypted archives – see below
### Encrypted Archives

safe_unzip does not support password-protected zip files. Encrypted entries are rejected with `Error::EncryptedEntry`.

If you need to extract encrypted archives:

- Decrypt first using the `zip` crate directly
- Then extract with safe_unzip

This is intentional: encryption handling is outside our security scope. Password management, key derivation, and cryptographic validation are complex domains that deserve dedicated tooling.
### Extraction Behavior

- Partial state in Streaming mode – if extraction fails mid-way, already-extracted files remain on disk. Use `ExtractionMode::ValidateFirst` to validate before writing.
- Filters not applied during validation – in `ValidateFirst` mode, limits are checked against ALL entries. Filtered entries still count toward limits. This is conservative: validation may reject archives that would succeed with filtering.
### Security Scope

These threats are not fully addressed (by design or complexity):

| Limitation | Reason |
|---|---|
| Case-insensitive collisions | On Windows/macOS, `File.txt` and `file.txt` map to the same file. We don't track extracted names to detect this. |
| Unicode normalization | `café` (NFC) vs `café` (NFD) appear identical but are different bytes. Full normalization requires ICU. |
| Concurrent extraction | If multiple threads/processes extract to the same destination, race conditions can occur. Use file locking or separate destinations. |
| Sparse file attacks | Not applicable to zip format. |
| Hard links | Zip format doesn't support hard links. |
| Device files | Zip format doesn't support special device files. |
### TOCTOU Mitigations

For `OverwritePolicy::Error` and `OverwritePolicy::Skip`, we use atomic file creation (`O_CREAT | O_EXCL`) instead of check-then-create. This eliminates the race condition between checking whether a file exists and creating it.

For `OverwritePolicy::Overwrite`, symlinks are removed before writing to prevent symlink-following attacks, but there's a brief window between removal and creation.
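The atomic-create behavior is easy to see from Python with the same flags (an illustrative sketch, not the library's code):

```python
import os
import tempfile

def create_exclusive(path: str) -> int:
    # O_CREAT | O_EXCL fails atomically if the path already exists
    # (including when it is a symlink): there is no separate "check"
    # step for an attacker to race against.
    return os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)

path = os.path.join(tempfile.mkdtemp(), "out.txt")
os.close(create_exclusive(path))  # first creation succeeds

try:
    create_exclusive(path)
except FileExistsError:
    print("second creation rejected")
```

The kernel performs the existence check and the creation as one operation, which is what makes this TOCTOU-safe.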
### Filename Restrictions

These filenames are rejected for security:

- Control characters (including null bytes)
- Backslashes (`\`) – prevents Windows path separator confusion
- Paths longer than 1024 bytes
- Path components longer than 255 bytes
- Windows reserved names: `CON`, `PRN`, `AUX`, `NUL`, `COM1`-`COM9`, `LPT1`-`LPT9`
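The rules above can be sketched as a single predicate (a hypothetical helper for illustration; the real checks live in the Rust crate):

```python
WINDOWS_RESERVED = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}

def filename_allowed(name: str) -> bool:
    """Apply the restrictions listed above to one archive entry name."""
    if len(name.encode()) > 1024:
        return False  # path too long
    if "\\" in name or any(ord(c) < 0x20 or ord(c) == 0x7F for c in name):
        return False  # backslash or control character (incl. NUL byte)
    for component in name.split("/"):
        if len(component.encode()) > 255:
            return False  # component too long
        if component.split(".")[0].upper() in WINDOWS_RESERVED:
            return False  # e.g. CON, COM1, LPT9 (with or without extension)
    return True

print(filename_allowed("docs/readme.md"))  # True
print(filename_allowed("logs/CON.txt"))    # False
```

Note that reserved names are rejected even with an extension: Windows treats `CON.txt` the same as `CON`.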
## Development

### Fuzzing

We use cargo-fuzz with two targets:

```sh
# Install cargo-fuzz (requires nightly)
cargo install cargo-fuzz

# List the available targets, then run one
# (one target exercises extraction end-to-end, the other the parsing layer)
cargo +nightly fuzz list
cargo +nightly fuzz run <target>
```

Fuzzing targets are in `fuzz/fuzz_targets/`. Run fuzzing before releases to catch parsing edge cases.
## License

MIT OR Apache-2.0