# safe_unzip
Archive extraction that won't ruin your day. Supports **ZIP** and **TAR** formats.
## The Problem
Zip files can contain malicious paths that escape the extraction directory:
```python
import zipfile
zipfile.ZipFile("evil.zip").extractall("/var/uploads")
# Extracts ../../etc/cron.d/pwned → /etc/cron.d/pwned
```
This is [Zip Slip](https://snyk.io/research/zip-slip-vulnerability), and Python's default behavior is still vulnerable.
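The mechanics are plain path arithmetic: join an attacker-controlled entry name onto the destination, normalize, and you land outside the destination. No archive needed to see it:

```python
import os.path

dest = "/var/uploads"
entry = "../../etc/cron.d/pwned"  # attacker-controlled entry name

# A naive extractor joins the two and writes there:
target = os.path.normpath(os.path.join(dest, entry))
print(target)  # /etc/cron.d/pwned – outside dest entirely
```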
### "But didn't Python fix this?"
Sort of. Python added warnings and `ZipInfo.filename` sanitization in 2014, and Python 3.12 gave `tarfile` a `filter` parameter:
```python
# The "safe" way – but who knows this exists?
import tarfile
tarfile.open("evil.tar").extractall("/var/uploads", filter="data")
```
The problem: **the safe option is opt-in**. The default is still vulnerable. Most developers don't read the docs carefully enough to discover `filter="data"`.
`safe_unzip` makes security the default, not an afterthought.
## The Solution
```rust
use safe_unzip::extract_file;
extract_file("/var/uploads", "evil.zip")?;
// Err(PathEscape { entry: "../../etc/cron.d/pwned", ... })
```
```python
# Python bindings – same safety
from safe_unzip import extract_file
extract_file("/var/uploads", "evil.zip")
# Raises: PathEscapeError
```
**Security is the default.** No special flags, no opt-in safety. Every path is validated. Malicious archives are rejected, not extracted.
## Why Not Just Use `zip` / `tar` / `zipfile`?
Because **archive extraction is a security boundary**, and most libraries treat it as a convenience function.
| Library | Default | Safe option |
|---|---|---|
| Python `zipfile` | Vulnerable | Manual path validation |
| Python `tarfile` | Vulnerable | `filter="data"` (opt-in, Python 3.12+) |
| Rust `zip` | Vulnerable | Manual path validation |
| Rust `tar` | Vulnerable | Manual path validation |
| `safe_unzip` | **Safe by default** | N/A – always safe |
If you're extracting untrusted archives, you need a library designed for that threat model.
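The defense is simple in principle: resolve every candidate output path and refuse anything that lands outside the destination. A rough Python sketch of that check (illustrative only, not safe_unzip's actual implementation, which delegates to the `path_jail` crate):

```python
import os

def stays_inside(dest: str, entry_name: str) -> bool:
    """Reject any entry whose resolved output path escapes dest."""
    dest = os.path.realpath(dest)
    target = os.path.realpath(os.path.join(dest, entry_name))
    return os.path.commonpath([dest, target]) == dest

# Every entry is checked before a single byte is written:
print(stays_inside("/tmp", "images/logo.png"))         # True
print(stays_inside("/tmp", "../../etc/cron.d/pwned"))  # False
```

Resolving with `realpath` (rather than string comparison alone) also catches escapes that route through symlinks already present in the destination.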
## Who Should Use This
- **Backend services** handling user-uploaded zip files
- **CI/CD systems** unpacking third-party artifacts
- **SaaS platforms** with file import features
- **Forensics / malware analysis** pipelines
- **Anything running as a privileged user**
If your zip files only come from trusted sources you control, the standard `zip` crate is fine. If users can upload archives, use `safe_unzip`.
## Features
- **Multi-Format Support** – ZIP and TAR (`.tar`, `.tar.gz`) archives
- **Async API** – Optional tokio-based async extraction (feature flag)
- **Zip Slip Protection** – Path traversal attacks blocked via [path_jail](https://crates.io/crates/path_jail)
- **Zip Bomb Protection** – Configurable limits on size, file count, and path depth
- **Strict Size Enforcement** – Catches files that decompress larger than declared
- **Filename Sanitization** – Blocks control characters and Windows reserved names
- **Symlink Handling** – Skip or reject symlinks (no symlink-based escapes)
- **Secure Overwrite** – Removes symlinks before overwriting to prevent symlink attacks
- **Atomic File Creation** – TOCTOU-safe file creation using `O_EXCL`
- **Overwrite Policies** – Error, skip, or overwrite existing files
- **Filter Callback** – Extract only the files you want
- **Two-Pass Mode** – Validate everything before writing anything
- **Permission Stripping** – Removes setuid/setgid bits on Unix
## Installation
**Rust:**
```toml
[dependencies]
safe_unzip = "0.1"
```
**Python:**
```bash
pip install safe-unzip
```
### Python Bindings
The Python bindings are **thin wrappers** over the Rust implementation via PyO3. This means:
- ✅ **Identical security guarantees** – same code path, same validation
- ✅ **Identical limits** – same defaults (1GB total, 10K files, 100MB per file)
- ✅ **Identical semantics** – same error conditions, same behavior
- ✅ **No re-implementation** – Python calls Rust directly, no logic duplication
Security reviewers: the Python API is a direct binding, not a port.
## Quick Start
```rust
use safe_unzip::extract_file;
// Extract with safe defaults
let report = extract_file("/var/uploads", "archive.zip")?;
println!("Extracted {} files ({} bytes)",
report.files_extracted,
report.bytes_written
);
```
## Usage Examples
### Basic Extraction
```rust
use safe_unzip::Extractor;
let report = Extractor::new("/var/uploads")?
.extract_file("archive.zip")?;
```
### Create Destination if Missing
```rust
use safe_unzip::Extractor;
// Extractor::new() errors if destination doesn't exist (catches typos)
// Extractor::new_or_create() creates it automatically
let report = Extractor::new_or_create("/var/uploads/new_folder")?
.extract_file("archive.zip")?;
// The convenience functions (extract_file, extract) also create automatically
use safe_unzip::extract_file;
extract_file("/var/uploads/new_folder", "archive.zip")?;
```
### Custom Limits (Prevent Zip Bombs)
```rust
use safe_unzip::{Extractor, Limits};
let report = Extractor::new("/var/uploads")?
.limits(Limits {
max_total_bytes: 500 * 1024 * 1024, // 500 MB total
max_file_count: 1_000, // Max 1000 files
max_single_file: 50 * 1024 * 1024, // 50 MB per file
max_path_depth: 10, // No deeper than 10 levels
})
.extract_file("archive.zip")?;
```
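Streaming enforcement matters because zip headers can lie about sizes; the limit has to apply to bytes actually produced, not bytes declared. A hypothetical sketch of that pattern using stdlib `zlib` (raw deflate, the compression zip entries use) rather than safe_unzip's internals:

```python
import zlib

def bounded_inflate(compressed: bytes, max_bytes: int) -> bytes:
    """Inflate raw-deflate data, aborting once output exceeds max_bytes."""
    d = zlib.decompressobj(-15)  # -15 = raw deflate, no zlib header
    # Ask for at most budget + 1 bytes: producing the extra byte proves
    # the stream is larger than allowed, so stop before writing anything.
    out = d.decompress(compressed, max_bytes + 1)
    if len(out) > max_bytes:
        raise ValueError("decompressed size exceeds limit")
    out += d.flush()
    if len(out) > max_bytes:
        raise ValueError("decompressed size exceeds limit")
    return out
```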
### Filter by Extension
```rust
use safe_unzip::Extractor;
// Only extract images
let report = Extractor::new("/var/uploads")?
.filter(|entry| {
entry.name.ends_with(".png") ||
entry.name.ends_with(".jpg") ||
entry.name.ends_with(".gif")
})
.extract_file("archive.zip")?;
println!("Extracted {} images, skipped {} other files",
report.files_extracted,
report.entries_skipped
);
```
### Overwrite Policies
```rust
use safe_unzip::{Extractor, OverwritePolicy};
// Skip files that already exist
let report = Extractor::new("/var/uploads")?
.overwrite(OverwritePolicy::Skip)
.extract_file("archive.zip")?;
// Or overwrite them
let report = Extractor::new("/var/uploads")?
.overwrite(OverwritePolicy::Overwrite)
.extract_file("archive.zip")?;
// Default: Error if file exists
let report = Extractor::new("/var/uploads")?
.overwrite(OverwritePolicy::Error) // This is the default
.extract_file("archive.zip")?;
```
### Symlink Policies
```rust
use safe_unzip::{Extractor, SymlinkPolicy};
// Default: silently skip symlinks
let report = Extractor::new("/var/uploads")?
.symlinks(SymlinkPolicy::Skip)
.extract_file("archive.zip")?;
// Or reject archives containing symlinks
let report = Extractor::new("/var/uploads")?
.symlinks(SymlinkPolicy::Error)
.extract_file("archive.zip")?;
```
### Extraction Modes
| Mode | Cost | On failure | Use when |
|---|---|---|---|
| `Streaming` (default) | Fast (1 pass) | Partial files remain | Speed matters; you'll clean up on error |
| `ValidateFirst` | Slower (2 passes) | No files if validation fails | Can't tolerate partial state |
**⚠️ Neither mode is truly atomic.** If extraction fails mid-write (e.g., disk full), partial files remain regardless of mode. `ValidateFirst` only prevents writes when *validation* fails (bad paths, limits exceeded), not when I/O fails during extraction.
```rust
use safe_unzip::{Extractor, ExtractionMode};
// Two-pass extraction:
// 1. Validate ALL entries (no disk writes)
// 2. Extract (only if validation passed)
let report = Extractor::new("/var/uploads")?
.mode(ExtractionMode::ValidateFirst)
.extract_file("untrusted.zip")?;
```
Use `ValidateFirst` when you can't tolerate partial state from malicious archives. Use `Streaming` (default) when speed matters and you can clean up on error.
### Extracting from Memory
```rust
use safe_unzip::Extractor;
use std::io::Cursor;
let zip_bytes: Vec<u8> = download_zip_somehow();
let cursor = Cursor::new(zip_bytes);
let report = Extractor::new("/var/uploads")?
.extract(cursor)?;
```
### TAR Extraction (New in v0.1.2)
```rust
use safe_unzip::{Driver, TarAdapter};
// Extract a .tar file
let report = Driver::new("/var/uploads")?
.extract_tar_file("archive.tar")?;
// Extract a .tar.gz file
let report = Driver::new("/var/uploads")?
.extract_tar_gz_file("archive.tar.gz")?;
// With options
let report = Driver::new("/var/uploads")?
.filter(|entry| entry.name.ends_with(".txt"))
.validation(safe_unzip::ValidationMode::ValidateFirst)
.extract_tar_file("archive.tar")?;
```
The new `Driver` API provides a unified interface for all archive formats with the same security guarantees.
### Async Extraction (New)
Enable the `async` feature for tokio-based async extraction:
```toml
[dependencies]
safe_unzip = { version = "0.1", features = ["async"] }
```
```rust
use safe_unzip::r#async::{extract_file, extract_tar_file, AsyncExtractor};
#[tokio::main]
async fn main() -> Result<(), safe_unzip::Error> {
// Simple async extraction
let report = extract_file("/var/uploads", "archive.zip").await?;
// TAR extraction
let report = extract_tar_file("/var/uploads", "archive.tar").await?;
// With options
let report = AsyncExtractor::new("/var/uploads")?
.max_total_bytes(500 * 1024 * 1024)
.max_file_count(1000)
.extract_file("archive.zip")
.await?;
Ok(())
}
```
Concurrent extraction of multiple archives:
```rust
use safe_unzip::r#async::{extract_file, extract_tar_bytes};
let (zip_result, tar_result) = tokio::join!(
extract_file("/uploads/a", "first.zip"),
extract_tar_bytes("/uploads/b", tar_data),
);
```
The async API uses `spawn_blocking` internally, so extraction runs in a thread pool without blocking the async runtime.
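The same pattern has a direct analogue in Python's asyncio, which may help readers of the Python bindings picture what `spawn_blocking` does; `do_blocking_extract` below is a stand-in, not a real safe_unzip function:

```python
import asyncio

def do_blocking_extract(dest: str, archive: str) -> int:
    # Stand-in for a synchronous extraction call; returns a file count.
    return 42

async def extract_async(dest: str, archive: str) -> int:
    # asyncio's analogue of tokio's spawn_blocking: run blocking work
    # on a thread pool so the event loop keeps making progress.
    return await asyncio.to_thread(do_blocking_extract, dest, archive)

print(asyncio.run(extract_async("/var/uploads", "archive.zip")))  # 42
```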
## Security Model
| Threat | Example | Mitigation |
|---|---|---|
| **Zip Slip** | Entry named `../../etc/cron.d/pwned` | `path_jail` validates every path |
| **Zip Bomb (size)** | 42 KB → 4 PB expansion | `max_total_bytes` limit + streaming enforcement |
| **Zip Bomb (count)** | 1 million empty files | `max_file_count` limit |
| **Zip Bomb (lying)** | Declared 1KB, decompresses to 1GB | Strict size reader detects mismatch |
| **Symlink Escape** | Symlink to `/etc/passwd` | Skip or reject symlinks |
| **Symlink Overwrite** | Create symlink, then overwrite target | Symlinks removed before overwrite |
| **Path Depth** | `a/b/c/.../1000levels` | `max_path_depth` limit |
| **Invalid Filename** | Control chars, `CON`, `NUL` | Filename sanitization |
| **Overwrite** | Replace sensitive files | `OverwritePolicy::Error` default |
| **Setuid** | Create setuid executables | Permission bits stripped |
| **Encrypted Archives** | Password handling complexity | Rejected (see [Encrypted Archives](#encrypted-archives)) |
## Default Limits
| Limit | Default | Meaning |
|---|---|---|
| `max_total_bytes` | 1 GB | Total uncompressed size |
| `max_file_count` | 10,000 | Number of files |
| `max_single_file` | 100 MB | Largest single file |
| `max_path_depth` | 50 | Directory nesting depth |
## Error Handling
```rust
use safe_unzip::{extract_file, Error};
match extract_file("/var/uploads", "archive.zip") {
Ok(report) => {
println!("Success: {} files", report.files_extracted);
}
Err(Error::PathEscape { entry, detail }) => {
eprintln!("Blocked path traversal in '{}': {}", entry, detail);
}
Err(Error::TotalSizeExceeded { limit, would_be }) => {
eprintln!("Archive too large: {} bytes (limit: {})", would_be, limit);
}
Err(Error::FileTooLarge { entry, size, limit }) => {
eprintln!("File '{}' too large: {} bytes (limit: {})", entry, size, limit);
}
Err(Error::FileCountExceeded { limit }) => {
eprintln!("Too many files (limit: {})", limit);
}
Err(Error::AlreadyExists { path }) => {
eprintln!("File already exists: {}", path);
}
Err(Error::InvalidFilename { entry, .. }) => {
eprintln!("Invalid filename: {}", entry);
}
Err(Error::EncryptedEntry { entry }) => {
eprintln!("Encrypted entry not supported: {}", entry);
}
Err(e) => {
eprintln!("Extraction failed: {}", e);
}
}
```
## Limitations
### Format Limitations
- **ZIP and TAR only** – Other formats (7z, rar) not supported
- **Requires seekable input for ZIP** – ZIP format requires reading the central directory at the end
- **TAR is sequential** – TAR files are read in order; `ValidateFirst` mode caches entries in memory
- **No encrypted archives** – See below
### Encrypted Archives
`safe_unzip` does not support password-protected zip files. Encrypted entries are rejected with `Error::EncryptedEntry`.
If you need to extract encrypted archives:
1. Decrypt first using the `zip` crate directly
2. Then extract with `safe_unzip`
This is intentional: encryption handling is outside our security scope. Password management, key derivation, and cryptographic validation are complex domains that deserve dedicated tooling.
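From Python, one way to do the two-step flow is to read entries with a password and repack them as a plain in-memory zip before handing the bytes to a safe extractor. This is a sketch with stdlib `zipfile` (which can only read legacy ZipCrypto encryption), not part of safe_unzip's API:

```python
import io
import zipfile

def repack_decrypted(src: bytes, password: bytes) -> bytes:
    """Re-pack a (possibly encrypted) zip as an unencrypted in-memory zip."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src)) as zin, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            # pwd is only consulted for entries that are actually encrypted
            zout.writestr(info.filename, zin.read(info, pwd=password))
    return out.getvalue()

# The resulting plain bytes can then go through validated extraction.
```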
### Extraction Behavior
- **Partial state in Streaming mode** – If extraction fails mid-way, already-extracted files remain on disk. Use `ExtractionMode::ValidateFirst` to validate before writing.
- **Filters not applied during validation** – In `ValidateFirst` mode, limits are checked against ALL entries. Filtered entries still count toward limits. This is conservative: validation may reject archives that would succeed with filtering.
### Security Scope
These threats are **not fully addressed** (by design or complexity):
| Threat | Notes |
|---|---|
| **Case-insensitive collisions** | On Windows/macOS, `File.txt` and `file.txt` map to the same file. We don't track extracted names to detect this. |
| **Unicode normalization** | `café` (NFC) vs `café` (NFD) appear identical but are different bytes. Full normalization requires ICU. |
| **Concurrent extraction** | If multiple threads/processes extract to the same destination, race conditions can occur. Use file locking or separate destinations. |
| **Sparse file attacks** | Not applicable to zip format. |
| **Hard links** | Zip format doesn't support hard links. |
| **Device files** | Zip format doesn't support special device files. |
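The normalization pitfall noted above is easy to demonstrate with the stdlib: two visually identical names can be different byte sequences, and case-preserving filesystems like APFS will silently conflate them.

```python
import unicodedata

nfc = "caf\u00e9"                        # 'café' as one precomposed é
nfd = unicodedata.normalize("NFD", nfc)  # 'café' as e + combining accent

print(nfc == nfd)                        # False – different code points
print(len(nfc), len(nfd))                # 4 5
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```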
### TOCTOU Mitigations
For `OverwritePolicy::Error` and `OverwritePolicy::Skip`, we use **atomic file creation** (`O_CREAT | O_EXCL`) instead of check-then-create. This eliminates race conditions between checking whether a file exists and creating it.
For `OverwritePolicy::Overwrite`, symlinks are removed before writing to prevent symlink-following attacks, but there's a brief window between removal and creation.
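The check-free creation pattern described above, sketched in Python (`os.open` maps to the same POSIX flags):

```python
import os

def create_new(path: str, data: bytes) -> bool:
    """Atomically create path; return False if it already exists.

    O_EXCL makes creation fail if the path exists (even as a dangling
    symlink), so there is no check-then-create window to race against.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return True
```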
### Filename Restrictions
These filenames are **rejected** for security:
- Control characters (including null bytes)
- Backslashes (`\`) – prevents Windows path separator confusion
- Paths longer than 1024 bytes
- Path components longer than 255 bytes
- Windows reserved names: `CON`, `PRN`, `AUX`, `NUL`, `COM1-9`, `LPT1-9`
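As a rough illustration (not safe_unzip's actual code), a single path component could be vetted against these rules like this:

```python
WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL",
                    *{f"COM{i}" for i in range(1, 10)},
                    *{f"LPT{i}" for i in range(1, 10)}}

def component_ok(name: str) -> bool:
    """Check one path component against the restrictions above."""
    if any(ord(c) < 0x20 for c in name):  # control chars, incl. NUL
        return False
    if "\\" in name:                      # no Windows path separators
        return False
    if len(name.encode("utf-8")) > 255:   # component length cap
        return False
    stem = name.split(".", 1)[0].upper()  # 'con.txt' still resolves to CON
    return stem not in WINDOWS_RESERVED

print(component_ok("report.txt"))  # True
print(component_ok("CON.txt"))     # False
```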
## Development
### Fuzzing
We use [cargo-fuzz](https://github.com/rust-fuzz/cargo-fuzz) with two targets:
```bash
# Install cargo-fuzz (requires nightly)
cargo install cargo-fuzz
# Run the main extraction fuzzer
cargo +nightly fuzz run fuzz_extract
# Run the adapter fuzzer (tests parsing layer)
cargo +nightly fuzz run fuzz_zip_adapter
```
Fuzzing targets are in `fuzz/fuzz_targets/`. Run fuzzing before releases to catch parsing edge cases.
## License
MIT OR Apache-2.0